
In a previous article we discussed using principal component analysis (PCA) in RapidMiner to reduce the dimension of a dataset. One point made there was that raw data is often not the best input for PCA, and that a normalization (based on z-scores or ranges) needs to be applied to the data first. In this article we describe the challenges this creates and suggest an alternative: feature selection using mutual information. The RapidMiner process below illustrates the PCA setup first.

Figure: RapidMiner PCA setup

While PCA is a very useful tool for identifying the attributes of a dataset that are responsible for the most variance, interpreting PCA can get tricky. Furthermore, the results of a PCA with and without normalization may be significantly different. For example, in the above setup, which uses this dataset, the objective is to reduce the initial dimension of 13 variables. As the results show, the outcomes with and without normalization are quite different. There are two interpretation issues in this case.

Figure: PCA disadvantages if not normalized

  1. After normalizing, we end up needing more principal components to explain the same amount of variance (95%).
  2. Each principal component (eigenvector) now has many more attributes that are more or less equally weighted.

Figure: PCA disadvantages after normalization

The result is that when we build predictive models we will need to use the principal components and the associated weights to generate a new set of independent variables from the original data. Unfortunately, this new set may not be significantly smaller than the original set (in this example, 7 new ones in place of 13 original attributes), in addition to being hard to interpret.
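To make the "principal components and associated weights" step concrete, here is a minimal sketch of how one row of original data could be turned into the new variables. The means, standard deviations and component weights below are hypothetical values chosen for illustration (three attributes, two components); they are not taken from the RapidMiner results.

```python
def project(row, means, stds, components):
    """Return one score per principal component for a single data row."""
    # Normalize the row with z-scores, using statistics from the training data.
    z = [(x - m) / s for x, m, s in zip(row, means, stds)]
    # Each new variable is a weighted sum of ALL normalized attributes.
    return [sum(w * v for w, v in zip(weights, z)) for weights in components]

# Hypothetical statistics and weights (an orthonormal pair of vectors).
means = [100.0, 2.0, 50.0]
stds = [20.0, 1.0, 10.0]
components = [
    [2 / 3, 2 / 3, 1 / 3],
    [1 / 3, -2 / 3, 2 / 3],
]

scores = project([120.0, 3.0, 40.0], means, stds, components)
print(scores)
```

Every new variable mixes all of the original attributes, which is why the resulting model is harder to explain in terms of any single attribute.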

The value of predictive analytics lies in being able to explain, to an often non-mathematically inclined audience, the impact of changing different measures or attributes on a business objective. PCA makes this harder.

Next we will see how mutual information based dimension reduction overcomes this challenge and makes it easy and intuitive to explain to anyone the impact of different attributes on a business objective.

This section describes how feature selection using mutual information can be set up.

Just to recap, one disadvantage of PCA lies in interpreting the results of the dimension reduction analysis. This challenge becomes particularly acute when the data needs to be normalized.

One reason we need to normalize before applying PCA is to mitigate the effects of scale. For example, if one of the attributes is orders of magnitude larger than the others, PCA tends to ascribe the highest amount of variance to this attribute, which skews the results of the analysis. By normalizing, we can get rid of this effect. However, normalizing spreads the influence across many more principal components. In other words, more PCs are required to explain the same amount of variance in the data, and the interpretation of the analysis gets muddied.
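The scale effect can be seen in a small sketch with only two attributes, where the eigenvalues of the covariance matrix have a closed form. The data and the names `big` and `small` are made up for illustration; this is not the cereal dataset.

```python
import math
import statistics

def first_pc_share(xs, ys):
    """Fraction of total variance captured by the first principal component
    of a two-attribute dataset."""
    n = len(xs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Largest eigenvalue of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]].
    gap = math.sqrt((sxx - syy) ** 2 + 4 * sxy ** 2)
    largest = (sxx + syy + gap) / 2
    return largest / (sxx + syy)

def zscore(values):
    """Rescale values to zero mean and unit standard deviation."""
    m, s = statistics.mean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]

# Made-up attributes on very different scales.
big = [100, 300, 200, 400, 250, 150]
small = [3, 1, 4, 1, 5, 2]

raw_share = first_pc_share(big, small)
normalized_share = first_pc_share(zscore(big), zscore(small))
print(f"first PC, raw data:        {raw_share:.4f}")
print(f"first PC, z-scored data:   {normalized_share:.4f}")
```

On the raw data the first component is almost entirely the large-scale attribute. After z-scoring, the variance is shared between the components, so more of them are needed to reach the same cumulative threshold.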

Mutual information based feature selection overcomes these challenges. The advantages it offers for dimension reduction or feature selection are:

  1. It is easy to interpret
  2. It is not sensitive to scale effects
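Both points can be illustrated with a generic mutual information estimate. The sketch below bins a numeric attribute by rank (equal-frequency bins) and computes mutual information with a target in bits. It is an assumption-laden illustration of the general technique, not a description of how KeyConnect computes its scores, and the data is made up.

```python
import math
from collections import Counter

def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index based on its rank, so bins hold
    roughly equal counts. Only the ordering of the values matters."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

def mutual_information(xs, ys):
    """Mutual information, in bits, between two sequences of discrete labels."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

# Made-up attribute and target.
feature = [1.2, 3.4, 0.5, 2.8, 4.1, 0.9, 3.9, 1.7]
target = ["low", "high", "low", "high", "high", "low", "high", "low"]

mi = mutual_information(equal_frequency_bins(feature, 2), target)
# Changing the unit of measurement does not change the ranks, so the score is unchanged.
rescaled = [v * 1000 for v in feature]
mi_rescaled = mutual_information(equal_frequency_bins(rescaled, 2), target)
print(mi, mi_rescaled)
```

The score is a single number per attribute that reads as "how much knowing this attribute tells us about the target", and multiplying the attribute by a constant leaves it unchanged.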

The data from the cereal example is analyzed in just a couple of simple process steps using KeyConnect, a mutual information based tool, as explained in the series of graphics below.


Step 1: Using KeyConnect for mutual information based feature selection (loading data)

Step 2: Analysis using mutual information (setting up KeyConnect)

Step 3: Pareto chart based on mutual information (results from dimension reduction)

Instead of applying the cut-off after the analysis, one could have changed the filter threshold in Step 2 and achieved similar results. However, several runs using the tool may be helpful in gaining some understanding of the result.
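The two ways of selecting attributes, a cumulative cut-off on the Pareto chart versus a per-attribute filter threshold, can be sketched as follows. The attribute names and scores are hypothetical, and the function names are illustrative rather than part of any tool.

```python
def pareto(scores):
    """Sort attributes by score (descending) and attach the cumulative
    share of the total score, as on a Pareto chart."""
    total = sum(scores.values())
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    rows, running = [], 0.0
    for name, score in ranked:
        running += score
        rows.append((name, score, running / total))
    return rows

def keep_until(rows, share):
    """Cut-off after the analysis: keep top attributes until the
    cumulative share reaches the requested level."""
    kept = []
    for name, _, cumulative in rows:
        kept.append(name)
        if cumulative >= share:
            break
    return kept

def keep_above(scores, threshold):
    """Filter threshold: keep every attribute whose own score is high enough."""
    return [name for name, score in scores.items() if score >= threshold]

# Hypothetical mutual information scores.
scores = {"attr_a": 0.40, "attr_b": 0.05, "attr_c": 0.30,
          "attr_d": 0.20, "attr_e": 0.05}

by_cutoff = keep_until(pareto(scores), 0.85)
by_threshold = keep_above(scores, 0.10)
print(by_cutoff, by_threshold)
```

With these made-up scores both approaches keep the same three attributes, which mirrors the observation that either route can lead to similar results.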

For comparison, the result of the first PCA analysis (without normalization) was: Potass, Sodium, Vitamins, Calories and Rating. Mutual information ranks variables by the amount of useful information they contain, whereas PCA simply ranks attributes by the total amount of variance that each variable contributes. So a very noisy attribute could overshadow more useful, better structured data. For building predictive models, you want to gather variables that contain more information, not more noise.
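A toy comparison of the two rankings is sketched below. It uses raw per-attribute variance as a simple stand-in for "variance contributed", and a median split plus a generic mutual information estimate as a stand-in for the information score. The data and attribute names are made up, and neither calculation is claimed to match RapidMiner or KeyConnect exactly.

```python
import math
import statistics
from collections import Counter

def median_split(values):
    """Discretize a numeric attribute into below/above its median."""
    cut = statistics.median(values)
    return [v > cut for v in values]

def mutual_information(xs, ys):
    """Mutual information, in bits, between two sequences of discrete labels."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

target = [0, 0, 0, 0, 1, 1, 1, 1]
attributes = {
    # Large swings that are unrelated to the target.
    "noisy": [500, -300, 900, -700, -300, 900, 500, -700],
    # Small values that separate the two target classes cleanly.
    "informative": [0.1, 0.2, 0.1, 0.2, 0.8, 0.9, 0.8, 0.9],
}

by_variance = sorted(
    attributes, key=lambda a: statistics.pvariance(attributes[a]), reverse=True
)
by_information = sorted(
    attributes,
    key=lambda a: mutual_information(median_split(attributes[a]), target),
    reverse=True,
)
print("ranked by variance:   ", by_variance)
print("ranked by information:", by_information)
```

The noisy attribute wins on variance while carrying no information about the target, and the low-variance attribute wins on mutual information.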

Originally posted on Thu, May 17, 2012 @ 09:05 AM
