Previously we described how Naive Bayes classification works using the classic golf example. There we mentioned that the algorithm is designed to work primarily with categorical attributes, such as Outlook = sunny, overcast or rain, or Humidity = low, medium or high. In many practical situations, however, attributes are numeric rather than categorical. For example, humidity and temperature can both be measured and expressed as continuous variables.
So how does Naive Bayes handle numeric attributes? After all, data mining programs like RapidMiner apply the technique to datasets that mix attribute types without skipping a beat!
Solution 1: Discretize numeric variables
A simple solution is to discretize the continuous variables into a few categories. However, choosing the cutoffs is somewhat subjective. For instance, in categorizing temperature, one person may select 80 degrees as the cutoff above which temperature is considered "High", whereas another person (from the tropics!) may choose 90 degrees as the border between "Medium" and "High". Binning also loses information, because all values that fall into the same category are treated as identical. Still, it can be used as a quick way to get going before applying Naive Bayes classification.
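To make the subjectivity concrete, here is a minimal sketch of discretization in Python. The `discretize` helper, the lower cutoff of 65 and the two "analyst" cutoff lists are assumptions made up for illustration; only the 80 and 90 degree borders come from the text above.

```python
from bisect import bisect_right

def discretize(value, cutoffs, labels):
    """Map a numeric value to a category using ascending cutoffs.

    A value equal to a cutoff falls into the higher category.
    """
    if len(labels) != len(cutoffs) + 1:
        raise ValueError("need exactly one more label than cutoffs")
    return labels[bisect_right(cutoffs, value)]

labels = ["Low", "Medium", "High"]

# Hypothetical cutoffs: the two analysts disagree on where "High" starts.
analyst_a = [65, 80]
analyst_b = [65, 90]

temperature = 85
print(discretize(temperature, analyst_a, labels))  # High
print(discretize(temperature, analyst_b, labels))  # Medium
```

The same reading of 85 degrees ends up in two different categories, so the downstream Naive Bayes model would be trained on different data depending on who chose the bins.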
Solution 2: Use probability density functions
A more rigorous method is to keep the continuous values as they are and use probability density function (pdf) values. We start by assuming that, within each class, the attribute follows a normal (Gaussian) distribution. If it is known to follow some other distribution, the corresponding density function can be used instead (or the probability mass function, for a discrete distribution such as Poisson).
Using the attribute's mean and standard deviation, which are easily computed from the dataset separately for each of the two class outcomes (Play=Yes and Play=No), we evaluate the pdf at any given decision point. For example, suppose we need to determine whether play will occur at a humidity level of 85. We plug the value 85, together with the mean and standard deviation of humidity over all instances where Play=Yes, into the pdf equation to obtain a pdf value. We then repeat the computation for Play=No, using the mean and standard deviation of humidity over all instances where Play=No.
The two pdf values then take the place of the conditional probabilities in the Naive Bayes formula: each is multiplied by the prior probability of its class, and the products are normalized to give the probabilities of Play=Yes and Play=No.
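The calculation can be sketched in a few lines of Python for a single numeric attribute. This is only an illustration under stated assumptions: the humidity readings below are made-up values, not the actual Golf data; the helper names `gaussian_pdf` and `posterior` are hypothetical; the class priors are taken from class frequencies; and the sample standard deviation is used, which may differ slightly from what a given tool computes.

```python
import math
from statistics import mean, stdev

def gaussian_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma, at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (
        sigma * math.sqrt(2 * math.pi)
    )

def posterior(x, values_by_class):
    """Class probabilities for one numeric attribute value x.

    values_by_class maps each class label to the attribute values
    observed for that class in the training data.
    """
    n_total = sum(len(values) for values in values_by_class.values())
    scores = {}
    for label, values in values_by_class.items():
        prior = len(values) / n_total  # class frequency as the prior
        likelihood = gaussian_pdf(x, mean(values), stdev(values))
        scores[label] = prior * likelihood
    total = sum(scores.values())
    # Normalize so the class probabilities sum to 1.
    return {label: score / total for label, score in scores.items()}

# Illustrative humidity readings (made up, NOT the RapidMiner Golf data).
humidity = {
    "yes": [65, 70, 70, 75, 80, 80],
    "no": [85, 90, 95, 91],
}

result = posterior(85, humidity)
print(result)
```

With these illustrative numbers, a humidity of 85 comes out in favor of Play=No, while a lower reading such as 70 favors Play=Yes. With several numeric attributes, the naive independence assumption means the per-attribute pdf values for a class are simply multiplied together before normalizing.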
RapidMiner includes a sample dataset that can be used to illustrate how this works. Use the graphic below to set up the model and apply the following steps:
Step 1: Retrieve the Golf data set
Step 2: Using Select Attributes, choose only the label attribute (Play) and the two numeric attributes (Temperature and Humidity)
Step 3: Connect the Naive Bayes operator
Step 4: Apply Model to the training dataset
Step 5: Retrieve the Golf Test dataset and connect it to the "unl" port of the Apply Model operator
Step 6: Run the model and check the Plot View in the Results perspective. Select Humidity in the drop-down on the left and you will see the following graph.
As you can see, at a humidity level of 84 Play=No is the more probable outcome, whereas at a humidity level of 77 Play=Yes is the more probable outcome. By checking the "Data View" under the "Example Set (Retrieve)" tab in the results, you can see Naive Bayes applied to all the test records and observe the predicted values.