When to use mutual information correlation based feature selection
Posted by Bala Deshpande on Thu, Apr 12, 2012 @ 08:00 AM
Two main types of feature selection methods
There are two main types of feature selection (or dimensionality reduction) algorithms: filter type and wrapper type. The filter model does not require any learning algorithm, whereas the wrapper model is optimized for a particular learning algorithm. Examples of wrapper models include Forward Selection and Backward Elimination in multiple regression. In other words, the filter model can be treated as an unsupervised feature selection method, whereas the wrapper model is a supervised one.
The filter model works best when the following conditions are met:
- When the number of features or attributes is really large
- When computational expense is a criterion
But there are also instances when filter-type, mutual-information-based feature selection methods must be used with caution. This article highlights two such scenarios for KeyConnect (to be launched soon), a mutual-information-based unsupervised feature selection tool.
Scenario 1: When there are outliers in the dataset
Outliers result in an artificially high value of mutual information. This causes KeyConnect to spuriously select the variables involved as important features. The fix is very simple: use a program like RapidMiner to detect and eliminate the offending samples.
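For readers who want to see what "detect and eliminate the offending samples" can look like outside RapidMiner, here is a minimal sketch in plain Python. It assumes the common interquartile-range (IQR) rule with a multiplier of 1.5. The function name `iqr_outlier_rows`, the threshold, and the sample data are all hypothetical choices for illustration, not what RapidMiner or KeyConnect actually do.

```python
import statistics

def iqr_outlier_rows(rows, k=1.5):
    """Return indices of rows with any value outside [Q1 - k*IQR, Q3 + k*IQR].

    `rows` is a list of equal-length lists of numbers (one list per sample).
    The IQR rule and k=1.5 are assumptions made for this sketch.
    """
    n_cols = len(rows[0])
    flagged = set()
    for j in range(n_cols):
        column = [row[j] for row in rows]
        # quantiles(n=4) returns the three quartile cut points
        q1, _, q3 = statistics.quantiles(column, n=4)
        iqr = q3 - q1
        low, high = q1 - k * iqr, q3 + k * iqr
        for i, value in enumerate(column):
            if value < low or value > high:
                flagged.add(i)
    return sorted(flagged)

# Hypothetical data: two attributes, eight samples
rows = [
    [1.0, 10.2],
    [1.2, 9.8],
    [0.9, 10.1],
    [1.1, 10.0],
    [1.0, 9.9],
    [1.3, 10.3],
    [0.8, 9.7],
    [25.0, 10.0],  # suspicious value in the first attribute
]

outliers = iqr_outlier_rows(rows)
flagged = set(outliers)
clean_rows = [row for i, row in enumerate(rows) if i not in flagged]
print("flagged sample indices:", outliers)
```

Whether a flagged sample should really be dropped is a judgment call that depends on the data; the rule only points at candidates.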
Scenario 2: When attributes contain known strong correlations
This can happen, for example, when one column (or attribute) in a dataset is derived from another, such as Gross Profit and %Gross Profit. This is of course a very simplistic example, and such a column can be removed manually before applying feature selection. However, in cases where such correlations are not known beforehand, we can once again use RapidMiner to detect and remove correlated features.
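As a rough illustration of removing correlated features, the sketch below keeps an attribute only if its absolute Pearson correlation with every attribute kept so far stays under a threshold. The 0.95 cutoff, the helper names `pearson` and `drop_correlated`, and the column names and values are assumptions made for this example. It is not a description of RapidMiner's operator.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)

def drop_correlated(columns, threshold=0.95):
    """Greedily keep columns whose |r| with every kept column is below threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical dataset: the second column is derived from the first
columns = {
    "gross_profit": [20.0, 45.0, 50.0, 90.0, 75.0],
    "gross_profit_thousands": [0.020, 0.045, 0.050, 0.090, 0.075],
    "headcount": [12.0, 9.0, 15.0, 10.0, 13.0],
}

kept = drop_correlated(columns)
print("attributes kept:", kept)
```

The greedy pass keeps whichever of two correlated columns appears first, so column order matters. With domain knowledge, it is usually better to decide by hand which of the pair to retain.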

The reason we need to eliminate highly correlated features before using mutual information is that such attributes will dominate the overall information exchange computed in the analysis. The strength of a program such as KeyConnect lies in detecting weak interactions that may be missed by linear correlations but still account for valuable information within the dataset. If two variables are collinear, keeping both adds no new information, and hence one of them may be removed.
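To make the contrast between linear correlation and mutual information concrete, here is a small sketch using a simple equal-width histogram estimate of mutual information, I(X;Y) = sum over bins of p(x,y) * log2( p(x,y) / (p(x) p(y)) ). The binning scheme, the bin count, and the toy data (y = x squared on a symmetric range) are assumptions chosen for illustration. KeyConnect's actual estimator is not described in this article.

```python
import math
from collections import Counter

def binned(values, bins):
    """Map each value to an equal-width bin index in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against a constant column
    return [min(int((v - lo) / width), bins - 1) for v in values]

def mutual_information(x, y, bins=4):
    """Histogram estimate of mutual information in bits."""
    bx, by = binned(x, bins), binned(y, bins)
    n = len(x)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    mi = 0.0
    for (i, j), count in joint.items():
        # p(x,y) / (p(x) p(y)) simplifies to count * n / (px * py)
        mi += (count / n) * math.log2(count * n / (px[i] * py[j]))
    return mi

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy data: y depends entirely on x, but not linearly
x = [i / 10 for i in range(-20, 21)]
y = [v * v for v in x]

r = pearson(x, y)
mi = mutual_information(x, y)
print(f"Pearson r = {r:.3f}, mutual information = {mi:.3f} bits")
```

On this toy data the linear correlation comes out close to zero while the mutual information is clearly positive, which is the kind of relationship a correlation-only screen would miss. Histogram estimates are sensitive to the number of bins and the sample size, so the value should be read as indicative only.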
We have used this process to reduce a dataset with 300+ attributes to a more manageable dozen or so, which can then be used, for example, to build usable predictive models.
