Email spam is something that everyone has encountered, and today's email tools all have built-in features to filter it. The technology behind most spam filters is the Naive Bayes classifier, and email spam filtering is a great example of its application. However, Naive Bayes is a good tool for a wide variety of classification problems: it is fast, robust, and relatively insensitive to missing values and even to class imbalance.
Despite its many strengths and robust performance, this classification technique has three well-known weaknesses.
Issue 1: Incomplete training data
Recall that in order to implement the classifier, we need to compute several conditional probabilities. Specifically, the class-conditional probability: the probability that an attribute assumes a particular value, given the outcome or response class. In the classic Naive Bayes golf example, there are no instances of "Play = No" when the attribute "Outlook" is "overcast". The class-conditional probability would therefore be zero, and because the outcome probability is a product of such terms, the whole construction collapses.
To overcome this problem, most implementations use a Laplace correction, which assigns a small nonzero probability in such cases so that the product never becomes zero.
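A minimal sketch of the Laplace correction, using a hypothetical subset of the golf data (the counts below are illustrative, not the canonical table). Adding a pseudo-count `alpha` to every value keeps the class-conditional probability away from zero:

```python
# Hypothetical (outlook, play) pairs; "overcast" never occurs with play = "no".
data = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
    ("rainy", "yes"), ("rainy", "yes"), ("rainy", "no"),
    ("overcast", "yes"), ("sunny", "no"), ("sunny", "yes"),
    ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rainy", "no"),
]

def class_conditional(value, cls, alpha=0.0):
    """P(outlook = value | play = cls), with optional Laplace correction alpha."""
    values = {outlook for outlook, _ in data}            # distinct attribute values
    in_class = [outlook for outlook, play in data if play == cls]
    count = sum(1 for v in in_class if v == value)
    # alpha pseudo-counts are added to every value, so the denominator
    # grows by alpha * (number of distinct values).
    return (count + alpha) / (len(in_class) + alpha * len(values))

# Without the correction the probability is exactly zero...
print(class_conditional("overcast", "no"))              # 0.0
# ...with alpha = 1 every value gets a small nonzero probability.
print(class_conditional("overcast", "no", alpha=1))     # 1/8 = 0.125
```

Setting `alpha = 1` is the classic Laplace correction; fractional values (sometimes called Lidstone smoothing) work the same way.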
Issue 2: Continuous variables
When an attribute is continuous, computing probabilities by the traditional method of frequency counts is not possible. In this case we either need to convert the attribute to a discrete variable or use a probability density function to compute probability densities (not actual probabilities!). Most standard implementations automatically handle both nominal and continuous attributes, so the user does not need to perform these transformations manually. As a data scientist, however, it is important to be aware of these subtleties when applying the tool.
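The density-function approach can be sketched as follows, assuming (as many implementations do) that the continuous attribute is roughly normal within each class. The temperature values below are hypothetical:

```python
import math

def gaussian_density(x, mean, std):
    """Probability density (not a probability!) under a normal distribution."""
    coeff = 1.0 / (std * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mean) ** 2) / (2 * std ** 2))

# Hypothetical temperatures observed when Play = yes.
temps_yes = [69, 70, 72, 75, 68, 71, 73, 74, 70]

# Estimate the class-conditional mean and (sample) standard deviation.
mean = sum(temps_yes) / len(temps_yes)
std = (sum((t - mean) ** 2 for t in temps_yes) / (len(temps_yes) - 1)) ** 0.5

# Density of temperature = 72 given Play = yes; this value is plugged into
# the Naive Bayes product in place of a frequency-based probability.
print(gaussian_density(72, mean, std))
```

Because densities can exceed 1, they are only meaningful as relative weights inside the product, not as standalone probabilities.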
Issue 3: Attribute independence
This is by far the most important weakness and the one that requires a little extra effort. In the calculation of outcome probabilities using the classical Bayes theorem, the implicit assumption is that all the attributes are mutually independent. This is what allows us to multiply the class-conditional probabilities in order to compute the outcome probability.
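The multiplication step looks like this in miniature. The class-conditional probabilities below are taken from the standard golf table for "Outlook = sunny" and "Windy = true", but treat them as illustrative inputs:

```python
# Priors and class-conditional probabilities (illustrative golf-data values).
prior = {"yes": 9 / 14, "no": 5 / 14}
p_outlook = {"yes": 2 / 9, "no": 3 / 5}   # P(outlook = sunny | class)
p_windy = {"yes": 3 / 9, "no": 3 / 5}     # P(windy = true  | class)

# Independence assumption: the joint likelihood is just the product of the
# individual attribute likelihoods, times the class prior.
scores = {c: prior[c] * p_outlook[c] * p_windy[c] for c in prior}

# Normalize so the scores sum to 1 and become posterior probabilities.
total = sum(scores.values())
posterior = {c: s / total for c, s in scores.items()}
print(posterior)
```

If the two attributes were actually correlated, multiplying their probabilities would double-count the shared evidence, which is exactly why the independence assumption matters.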
When it is known beforehand that some attributes are correlated (for example, overcast conditions may be correlated with medium temperatures), it is easy to drop one of the correlated attributes. But what do you do when you do not know which attributes depend on one another? For continuous variables, we can run a Pearson correlation test; for nominal or categorical attributes, we can perform a chi-square test of independence. Many tools can do this, including a spreadsheet program, and there are also interactive online tools available for the check.
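The chi-square statistic is simple enough to sketch by hand. A pure-Python version on a hypothetical 2x2 contingency table (outlook vs. temperature band; the counts are made up for illustration):

```python
# Hypothetical contingency table:
#                medium temp   other temp
table = [[20, 10],   # overcast
         [ 5, 25]]   # not overcast

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand = sum(row_totals)

# Expected count under independence for each cell, then the chi-square
# statistic: sum of (observed - expected)^2 / expected.
chi2 = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chi2 += (observed - expected) ** 2 / expected

# For a 2x2 table there is 1 degree of freedom; the 5% critical value is
# about 3.84, so a larger statistic suggests the attributes are dependent.
print(chi2)
```

A spreadsheet's CHISQ.TEST function or a statistics library performs the same computation, including the p-value lookup.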
Get our book on Predictive Analytics and Data Mining for a full discussion of this topic, along with a couple of Naive Bayes classifier examples that drive home these points.