Feature Selection for predictive analytics using mutual information

Posted by Bala Deshpande on Tue, Dec 06, 2011 @ 10:01 AM

There are several classes of data that predictive analytics must potentially deal with: categorical, numerical, textual, image, video (which is also a type of image data), and so on. At the core, however, there are only two data types from an algorithmic point of view: categorical and numeric. (We consider binary data a special case of categorical data; image, voice, and video data, for example, are essentially binary.)

Within a data set we can of course have a mix of these two types, and it is rare for all of the data to be of only one of them. Specific methodologies for feature selection or dimension reduction work best with a given data type. Here we briefly describe a feature selection technique, based on the concept of mutual information, that works robustly across multiple data types. Recall that feature selection essentially reduces the number of attributes (or variables) in a data set to the so-called "essential" few.

All Categorical data: The chi-squared test of independence works best in this case: assuming a target variable is selected, every parameter can be checked in turn to see whether the chi-squared test detects a dependency (or relationship) between the parameter and the target. If the target variable is continuous, it can first be converted into a categorical variable by a simple "binning" process. A minimal sketch of this screening is shown below.
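The following is a minimal sketch (not the author's tool) of such a chi-squared screen, assuming pandas/scipy and a data frame `df` with a named target column; the significance level `alpha` and the bin count are illustrative choices:

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_screen(df, target, n_bins=5, alpha=0.05):
    """Return the predictors whose chi-squared test rejects independence from the target."""
    y = df[target]
    # If the target is continuous, convert it to categories by simple binning.
    if pd.api.types.is_numeric_dtype(y):
        y = pd.cut(y, bins=n_bins)
    selected = []
    for col in df.columns.drop(target):
        table = pd.crosstab(df[col], y)                            # contingency table of counts
        table = table.loc[table.sum(axis=1) > 0, table.sum(axis=0) > 0]  # drop empty bins
        _, p_value, _, _ = chi2_contingency(table)
        if p_value < alpha:                                        # dependency detected
            selected.append(col)
    return selected
```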

All Numeric data: There are many possible feature selection techniques when the data is purely numeric. Principal component analysis (PCA), for example, captures those attributes which explain the majority of the variance within a data set and short-lists them as the most important variables. Problems can crop up, however, if PCA is chosen indiscriminately for all cases. Other modeling techniques, such as multiple linear regression, have built-in mechanisms for variable reduction that can also be handy. A minimal sketch of the PCA approach follows.
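As an illustration, here is a minimal sketch, assuming scikit-learn and purely numeric data in an array `X`, that keeps just enough principal components to explain roughly 80% of the variance (the 80% figure is an illustrative choice):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_reduce(X, variance_target=0.80):
    """Project X onto the fewest components that explain `variance_target` of the variance."""
    X_std = StandardScaler().fit_transform(X)   # PCA is scale-sensitive, so standardize first
    pca = PCA(n_components=variance_target)     # a float here makes sklearn pick the component count
    X_reduced = pca.fit_transform(X_std)
    print("components kept:", pca.n_components_)
    print("variance explained:", pca.explained_variance_ratio_.sum())
    return X_reduced
```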

A mix of categorical and numeric data: You may ask, "why not simply split the mixed data set into two pure sets and run a feature selection on each?" The trouble is that this ignores any interaction effects between the two types of data, and we may end up deleting an attribute from one class that has an impact on an attribute from the other.

We have found that using bivariate analysis, one can employ mutual information to reduce a mixed dataset. Mutual information has two unique advantages:

1. It works similarly to the chi-squared test of independence, counting the number of samples in each "bin" to form a joint probability, and thus can be used with categorical data (a sketch of this binned estimate appears after this list).

[Figure: mutual information computed from the joint probability of binned values]

2. It can efficiently handle both linear and highly nonlinear relationships between variables because it does not depend upon fitting a function (linear or otherwise). 

[Figure: mutual information captures nonlinear relationships]
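As a concrete illustration of point 1, here is a minimal sketch (not the author's tool) of a binned mutual information estimate, assuming pandas/numpy; numeric columns are discretized by simple binning and categorical columns are counted as-is:

```python
import numpy as np
import pandas as pd

def mutual_information(x, y, n_bins=10):
    """Estimate I(X;Y) in nats from bin counts: sum over p(x,y) * log( p(x,y) / (p(x) * p(y)) )."""
    x, y = pd.Series(x), pd.Series(y)
    if pd.api.types.is_numeric_dtype(x):
        x = pd.cut(x, n_bins)                     # discretize numeric values into bins
    if pd.api.types.is_numeric_dtype(y):
        y = pd.cut(y, n_bins)
    joint = pd.crosstab(x, y).to_numpy().astype(float)
    p_xy = joint / joint.sum()                    # joint probabilities from bin counts
    p_x = p_xy.sum(axis=1, keepdims=True)         # marginal of x
    p_y = p_xy.sum(axis=0, keepdims=True)         # marginal of y
    nonzero = p_xy > 0                            # 0 * log 0 is taken as 0
    return float((p_xy[nonzero] * np.log(p_xy[nonzero] / (p_x @ p_y)[nonzero])).sum())
```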

The bivariate analysis aggregates all pairwise relationships that show a high degree of information exchange as measured by mutual information. We can then run a Pareto analysis to retain, for example, those variables which account for 80% of the information exchanged. This is in contrast to PCA, which selects those variables which account for 80% of the variance (or noise). By selecting only those variables which exhibit a high degree of information content (and not variance or uncertainty), we can build models which have low uncertainty or high predictability. A sketch of such a Pareto filter follows.
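Here is a hypothetical Pareto-style filter, building on the mutual_information() sketch above; the 80% threshold and the helper names are illustrative assumptions, not the actual tool described below:

```python
def pareto_select(df, target, threshold=0.80, n_bins=10):
    """Keep the variables that together account for `threshold` of the total information exchanged with the target."""
    scores = {col: mutual_information(df[col], df[target], n_bins)
              for col in df.columns.drop(target)}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(score for _, score in ranked)
    selected, cumulative = [], 0.0
    for col, score in ranked:
        selected.append(col)
        cumulative += score
        if cumulative >= threshold * total:       # e.g. 80% of the information exchanged
            break
    return selected
```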

If you are interested in beta testing a mutual information based feature selector on your data, simply sign up below. You will be among the first to try it out when the tool is ready.

[Figure: taxonomy of feature selection techniques in the analytics pathway]

Tags: data mining tools, mutual information, feature selection