One of the first steps in solving a data mining or business analytics problem is eliminating variables that are not significant. There are two reasons for taking this step. The most obvious is that going from a few hundred variables to a handful makes the results much easier to interpret. The second, and probably more critical, reason is that many modeling techniques degrade or become impractical as the number of variables increases. This is known as the curse of dimensionality.
Probably the simplest way of determining significant variables is to compute the correlation coefficient (r) between all pairs of variables and select only those pairs whose absolute value of r exceeds a certain cut-off (say 0.6). However, there are two problems with this method:
- As the number of variables increases, the storage needed for these coefficients grows as (nearly) the square of the number of variables: p variables give p*(p-1)/2 pairs.
- More importantly, for relationships that are non-linear, r is not a very good indicator of correlation.
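Both problems show up even in a toy example. The sketch below is only an illustration under assumed data: the function names (`pearson_r`, `correlated_pairs`) and the three columns are hypothetical, and the 0.6 cut-off is the one suggested above. It computes r for every pair and keeps the pairs whose absolute value exceeds the cut-off.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no linear relationship to report
    return cov / math.sqrt(var_x * var_y)

def correlated_pairs(columns, cutoff=0.6):
    """Return the pairs of column names whose |r| exceeds the cutoff.

    With p columns this evaluates p*(p-1)/2 pairs, which is the
    near-quadratic growth mentioned in the first bullet.
    """
    return [(a, b) for a, b in combinations(columns, 2)
            if abs(pearson_r(columns[a], columns[b])) > cutoff]

# Hypothetical data: "quadratic" depends entirely on "x", but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
columns = {
    "x": x,
    "linear": [2 * v + 1 for v in x],
    "quadratic": [v * v for v in x],
}

pairs = correlated_pairs(columns)
print(pairs)
print(pearson_r(columns["x"], columns["quadratic"]))
```

Here the linear pair is kept, while the quadratic column is missed even though it is fully determined by `x`. That is the second problem in the list: r measures only linear association.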
To overcome these issues, the chi-square test of independence can be used. It works as follows: once a target variable is selected, every other variable is checked in turn to see whether the chi-square test detects a relationship between that variable and the target. Because the test operates on categorical data, a continuous target (or a continuous candidate variable) is first converted into a categorical one by a simple "binning" process.
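A minimal sketch of that procedure, under stated assumptions: the helper names (`equal_width_bins`, `chi_square`), the equal-width binning rule, and the sample data are all hypothetical choices made for illustration. The value 3.841 is the standard chi-square critical value for 1 degree of freedom at the 5% level; a real analysis would normally compute a p-value with a statistics library instead.

```python
from collections import Counter

def equal_width_bins(values, n_bins=3):
    """Map each continuous value to a bin index in 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]  # constant column: everything in one bin
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def chi_square(feature, target):
    """Return (statistic, degrees_of_freedom) for two categorical sequences."""
    n = len(feature)
    observed = Counter(zip(feature, target))  # contingency table counts
    row_totals = Counter(feature)
    col_totals = Counter(target)
    stat = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n  # count expected under independence
            stat += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

# Hypothetical data: a continuous target and two categorical candidates.
target = [float(v) for v in range(20)]
target_bins = equal_width_bins(target, n_bins=2)
related = ["a"] * 10 + ["b"] * 10   # tracks the target closely
unrelated = ["x", "y"] * 10         # alternates regardless of the target

CRITICAL_1DOF_5PCT = 3.841  # chi-square critical value, 1 dof, alpha = 0.05

for name, feature in (("related", related), ("unrelated", unrelated)):
    stat, dof = chi_square(feature, target_bins)
    print(name, round(stat, 3), dof, stat > CRITICAL_1DOF_5PCT)
```

The number of bins is a judgment call. Too few bins can hide a relationship, while too many leave cells with very small expected counts, where the chi-square approximation is unreliable.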
If all the variables are continuous, the binning process can still be applied and the chi-square test then used. However, entropy-based methods can be applied here much more easily. The advantage of entropy-based methods is that they work even if there is no target variable. The process involves computing the Shannon entropy of every variable and then the mutual information of every pair of variables, for a total of p*(p-1)/2 pairs when there are p variables. Finally, the variables that contribute more than a given fraction of the overall information exchanged within the data set are selected as the key variables. This method is somewhat similar to the more traditional F-value technique, which ensures that the key variables account for a significant amount of the total variance of the target variable.
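The entropy-based steps can be sketched as follows. This is one possible reading rather than a definitive implementation: the text does not say exactly how a variable's "contribution" is measured, so the sketch assumes it is the variable's summed mutual information over all pairs it takes part in, divided by the sum over all variables. The function names, the `min_share` threshold, and the data are hypothetical, and the columns are assumed to be already discrete (continuous columns would be binned first).

```python
import math
from collections import Counter
from itertools import combinations

def entropy(labels):
    """Shannon entropy, in bits, of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """I(A;B) = H(A) + H(B) - H(A,B), in bits."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

def select_key_variables(columns, min_share=0.2):
    """Keep variables whose share of the total pairwise information exceeds min_share.

    Assumption: a variable's contribution is the sum of its mutual information
    with every other variable. Each pair is credited to both of its members,
    so the shares add up to 1.
    """
    totals = {name: 0.0 for name in columns}
    for a, b in combinations(columns, 2):  # p*(p-1)/2 pairs
        mi = mutual_information(columns[a], columns[b])
        totals[a] += mi
        totals[b] += mi
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []  # no variable shares information with any other
    return [name for name, t in totals.items() if t / grand_total > min_share]

# Hypothetical data: "b" duplicates "a"; "c" is independent of both.
data = {
    "a": [0, 0, 1, 1, 0, 0, 1, 1],
    "b": [0, 0, 1, 1, 0, 0, 1, 1],
    "c": [0, 1, 0, 1, 0, 1, 0, 1],
}

print(mutual_information(data["a"], data["b"]))
print(mutual_information(data["a"], data["c"]))
print(select_key_variables(data))
```

Note that no target variable appears anywhere in this sketch; the selection relies only on how much information the variables share with one another.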
No matter which technique is used, the first step of any data mining process remains the same: to identify a subset of parameters which capture the essential features of the dataset.