When to use mutual information (correlation) based feature selection

Posted by Bala Deshpande on Thu, Apr 12, 2012 @ 09:00 AM

Two main types of feature selection methods

There are two main types of feature selection or dimensionality reduction algorithms: filter type and wrapper type. The filter model does not require any learning algorithm, whereas the wrapper model is optimized for a particular learning algorithm. Examples of wrapper models include Forward Selection or Backward Elimination in multiple regression. In other words, the filter model is an unsupervised feature selection method, while the wrapper model is a supervised one.
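To make the distinction concrete, here is a minimal sketch in Python with scikit-learn (an illustrative assumption; this post itself works with RapidMiner and KeyConnect). The filter side drops low-variance attributes without ever consulting a learner, while the wrapper side runs Forward Selection around one specific regression model.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold, SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
X[:, 0] *= 0.01                                    # a near-constant column
y = X[:, 1] + 2 * X[:, 2] + rng.normal(size=200)   # target depends on cols 1 and 2

# Filter: no learner involved; simply drop low-variance attributes.
filt = VarianceThreshold(threshold=0.05).fit(X)
print("Filter keeps: ", np.flatnonzero(filt.get_support()))

# Wrapper: Forward Selection tuned to one specific learner.
wrap = SequentialFeatureSelector(LinearRegression(), n_features_to_select=2,
                                 direction="forward").fit(X, y)
print("Wrapper keeps:", np.flatnonzero(wrap.get_support()))
```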

The filter model works best when the following conditions are met:

  1. When the number of features or attributes is really large
  2. When computational expense is a concern

But there are also instances where filter-type, mutual-information-based feature selection methods must be used with caution. This article highlights two such scenarios for KeyConnect (to be launched soon), a mutual-information-based unsupervised feature selection tool.

Scenario 1: When there are outliers in the dataset

Outliers result in an artificially high value of mutual information, which causes KeyConnect to spuriously select the variables involved as important features. The fix is simple: use a program like RapidMiner to detect and eliminate the offending samples, as illustrated in the sketch below.
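A hypothetical demonstration of the effect, using scikit-learn's kNN-based estimator as a stand-in (KeyConnect's internals are not described here): two independent variables show near-zero mutual information until a few co-occurring extreme points are injected, and a crude distance-from-median cut restores the clean estimate.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = rng.normal(size=500)   # independent of x: true mutual information is 0

print("MI, clean data:          %.3f" %
      mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0])

# Inject 10 extreme points where both variables spike together.
x[:10] = 100 + rng.normal(scale=0.5, size=10)
y[:10] = 100 + rng.normal(scale=0.5, size=10)

print("MI, with outliers:       %.3f" %
      mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0])

# The fix sketched in the post: drop samples far from the bulk of the data
# (a crude rule here; RapidMiner offers dedicated outlier-detection operators).
mask = (np.abs(x - np.median(x)) < 5) & (np.abs(y - np.median(y)) < 5)
print("MI, outliers removed:    %.3f" %
      mutual_info_regression(x[mask].reshape(-1, 1), y[mask], random_state=0)[0])
```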

[Figure: using RapidMiner to detect and eliminate outliers]

Scenario 2: When attributes contain known strong correlations

This can happen, for example, when one column (or attribute) in a data set is derived from another column, such as Gross Profit and % Gross Profit. This is of course a very simplistic example, and the redundant column can be eliminated manually before applying feature selection. In cases where such correlations are not known beforehand, however, we can once again use RapidMiner to detect and remove correlated features; a rough sketch of the idea follows.
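Here is a small sketch of what such a correlation filter does (the column names and the 0.95 cutoff are illustrative assumptions, not RapidMiner's defaults): compute the absolute correlation matrix, scan its upper triangle so each pair is checked once, and drop one column from each highly correlated pair.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"revenue": rng.normal(1000, 100, 200),
                   "cost":    rng.normal(700, 80, 200)})
df["gross_profit"] = df["revenue"] - df["cost"]
df["pct_gross_profit"] = df["gross_profit"] / df["revenue"]   # derived column

corr = df.corr().abs()
# Keep only the upper triangle so each pair appears exactly once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print("Dropping:", to_drop)

reduced = df.drop(columns=to_drop)
print("Remaining attributes:", reduced.columns.tolist())
```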

[Figure: using RapidMiner to remove correlated attributes]

The reason we need to eliminate highly correlated features before using mutual information is that such attributes will dominate the overall information exchange computed in the analysis. The strength of a program such as KeyConnect is its ability to detect weak interactions that linear correlation measures may miss but that still account for valuable information within the dataset. If two variables are collinear, keeping both adds no new information, so one of them may be removed.
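A small numerical illustration of this point, again using scikit-learn's estimator as a stand-in for the MI computation: the mutual information between a column and a near-collinear copy of itself dwarfs the weak nonlinear interaction the analysis is actually meant to surface.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(2)
a = rng.normal(size=1000)
b = 2.0 * a + 0.01 * rng.normal(size=1000)          # near-collinear copy of a
c = np.tanh(a) + rng.normal(scale=1.0, size=1000)   # weak nonlinear relation

print("MI(a, b): %.2f" % mutual_info_regression(a.reshape(-1, 1), b, random_state=0)[0])
print("MI(a, c): %.2f" % mutual_info_regression(a.reshape(-1, 1), c, random_state=0)[0])
```

Dropping the collinear column first lets the weaker, more interesting dependency stand out in the analysis.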

We have used this process to reduce a dataset with 300+ attributes to a more manageable dozen or so, which may then be used, for example, for building usable predictive models.

[Figure: using KeyConnect for key driver detection and analysis]

Sign up to become a beta tester and win a chance to use KeyConnect free!

Tags: keyconnect, mutual information, feature selection