Feature selection with mutual information, Part 2: PCA disadvantages
This is the second and concluding part of an article that shows how one of the disadvantages of principal component analysis (PCA) for feature selection or dimension reduction can be addressed using mutual information based tools.
Just to recap, one disadvantage of PCA lies in interpreting the results of dimension reduction analysis. This challenge becomes particularly acute when the data needs to be normalized. Part 1 of this series explains this in detail.
One reason we need to normalize before applying PCA is to mitigate the effects of scale. For example, if one of the attributes is orders of magnitude larger than the others, PCA tends to ascribe the highest amount of variance to this attribute, which skews the results of the analysis. By normalizing, we can get rid of this effect. However, normalizing spreads the influence across many more principal components. In other words, more PCs are required to explain the same amount of variance in the data, and the interpretation of the analysis gets muddied.
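The scale effect described above can be reproduced on synthetic data. The following is a minimal sketch, not the cereal analysis from Part 1: the three attributes, their scales, the 90% variance target and the helper names `explained_variance_ratio` and `components_needed` are all assumptions made for illustration. It computes PCA directly from the eigenvalues of the covariance matrix with NumPy.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    cov = np.cov(X - X.mean(axis=0), rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # sort eigenvalues in descending order
    return eigvals / eigvals.sum()

def components_needed(ratio, target=0.9):
    """Smallest number of leading PCs whose cumulative variance reaches target."""
    return int(np.searchsorted(np.cumsum(ratio), target) + 1)

rng = np.random.default_rng(0)
n = 500
# Hypothetical data: three independent attributes, one on a far larger scale.
X = np.column_stack([
    rng.normal(0, 1000.0, n),
    rng.normal(0, 1.0, n),
    rng.normal(0, 1.0, n),
])

raw = explained_variance_ratio(X)

# Standardize each attribute to zero mean and unit variance, then repeat.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = explained_variance_ratio(Z)

print("raw:   ", np.round(raw, 3), "PCs for 90%:", components_needed(raw))
print("scaled:", np.round(scaled, 3), "PCs for 90%:", components_needed(scaled))
```

On the raw data the large-scale attribute absorbs almost all of the variance, so a single PC appears sufficient. After standardizing, the variance is spread roughly evenly and all three PCs are needed to reach the same target.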
Mutual information based feature selection overcomes these challenges. The advantages it offers for dimension reduction or feature selection are:
- It is easy to interpret
- It is not sensitive to scale effects
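The second point can be checked with a small experiment. This is a sketch under stated assumptions and does not represent how KeyConnect computes mutual information: it uses a simple plug-in estimate over equal-frequency bins, and the function names `quantile_bins` and `mutual_information`, the bin count and the synthetic data are all hypothetical. Because the bins are built from ranks, rescaling an attribute leaves the estimate unchanged.

```python
import numpy as np

def quantile_bins(x, n_bins=8):
    """Assign each value to an equal-frequency bin based on its rank."""
    ranks = np.argsort(np.argsort(x, kind="stable"), kind="stable")
    return ranks * n_bins // len(x)

def mutual_information(x, y, n_bins=8):
    """Plug-in mutual information estimate, in bits, from binned counts."""
    bx, by = quantile_bins(x, n_bins), quantile_bins(y, n_bins)
    joint = np.zeros((n_bins, n_bins))
    np.add.at(joint, (bx, by), 1)            # count co-occurrences per cell
    joint /= joint.sum()                     # convert counts to probabilities
    px = joint.sum(axis=1, keepdims=True)    # marginal of x, shape (n_bins, 1)
    py = joint.sum(axis=0, keepdims=True)    # marginal of y, shape (1, n_bins)
    nz = joint > 0                           # skip empty cells to avoid log(0)
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)                 # hypothetical attribute
y = x + 0.5 * rng.normal(size=n)       # hypothetical target that depends on x

mi_raw = mutual_information(x, y)
mi_rescaled = mutual_information(1000.0 * x, y)  # same attribute, different units

print(f"MI (original units): {mi_raw:.4f} bits")
print(f"MI (rescaled x1000): {mi_rescaled:.4f} bits")
```

The two estimates agree, so no normalization step is needed before ranking attributes this way, and the score for each attribute can be read directly as a measure of how much it tells us about the target.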
The data from the cereal example is analyzed in just a couple of simple process steps using KeyConnect, a mutual information based tool as explained in the series of graphics below.
Instead of applying the cut-off after the analysis, one could have changed the filter threshold in Step 2 and achieved similar results. However, several runs of the tool may help in gaining some understanding of the result.
This was the result of the first PCA run (without normalization): Potass, Sodium, Vitamins, Calories and Rating. Mutual information ranks variables by the amount of useful information they contain. PCA, however, simply ranks attributes by the total amount of variance that each variable contributes, so a very noisy attribute could overshadow more useful, better structured data. For building predictive models, you want to gather variables that contain more information, not more noise. This article explains why PCA may not always be the best technique for predictive analytics.
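The difference between ranking by variance and ranking by information can be illustrated with two synthetic attributes. This is only a sketch: the attribute names, the linear target, and the histogram-based estimator `mi_bits` (equal-width bins via `np.histogram2d`) are assumptions chosen for brevity, not the method used in the cereal analysis.

```python
import numpy as np

def mi_bits(x, y, bins=10):
    """Plug-in mutual information estimate, in bits, from a 2-D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0                           # skip empty cells to avoid log(0)
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

rng = np.random.default_rng(1)
n = 2000
signal = rng.normal(0, 1, n)                     # low variance, drives the target
noise = rng.normal(0, 50, n)                     # high variance, unrelated to the target
target = 2.0 * signal + rng.normal(0, 0.1, n)    # hypothetical quantity to predict

features = {"signal": signal, "noise": noise}

# Rank attributes by variance (what an unnormalized PCA effectively rewards)
# and by their estimated mutual information with the target.
by_variance = sorted(features, key=lambda k: features[k].var(), reverse=True)
by_mi = sorted(features, key=lambda k: mi_bits(features[k], target), reverse=True)

print("ranked by variance:", by_variance)
print("ranked by MI:      ", by_mi)
```

The noisy attribute comes first when ranking by variance, while the attribute that actually determines the target comes first when ranking by mutual information.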
Sign up to become our beta tester for KeyConnect, a mutual information based feature selection web-application.