Mutual information is an intuitive measure. It lets you check quickly whether two quantities trend together. In effect, it does for numeric variables roughly what the chi-square test of independence does for categorical ones. In this article we show three ways to use it to identify the variables that carry the most information in your dataset, and to eliminate the variables that are just random noise.
Consider the charts shown below. In the first chart, we can see that as one variable increases, the second also increases, except in the region around the center of the chart. A standard statistical correlation cannot easily capture this unless we know the nature of the relationship beforehand. Furthermore, as the data becomes noisier (see the middle chart), the strong trend may disappear even though the two variables are still moving together somewhat. A standard correlation analysis may miss this as well, depending on your “cutoff”. Mutual information not only captures fuzzy, non-linear or non-monotonic relationships, it also lets you grade the strength of the connection. In fact, two variables can form a fuzzy circle, where standard correlation is zero, and mutual information will still register the relationship because the two variables share information.
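The fuzzy circle case can be checked with a few lines of Python. This is a minimal sketch, not the method used by any particular tool: it assumes a simple histogram estimate of mutual information with 8 equal-width bins, and the function names (binned, mutual_information, pearson) and the synthetic data are made up for illustration.

```python
import math
import random
from collections import Counter

def binned(values, bins=8):
    """Map each value to an equal-width bin index (0 .. bins-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant data
    return [min(int((v - lo) / width), bins - 1) for v in values]

def mutual_information(x, y, bins=8):
    """Histogram estimate of I(X;Y) in bits (assumed estimator)."""
    bx, by = binned(x, bins), binned(y, bins)
    n = len(x)
    pxy, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    # Sum of p(a,b) * log2( p(a,b) / (p(a) * p(b)) ) over occupied cells.
    return sum(c / n * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def pearson(x, y):
    """Standard Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Synthetic data: a noisy circle, and two unrelated noise channels.
rng = random.Random(0)
angles = [rng.uniform(0, 2 * math.pi) for _ in range(5000)]
circle_x = [math.cos(t) + rng.gauss(0, 0.05) for t in angles]
circle_y = [math.sin(t) + rng.gauss(0, 0.05) for t in angles]
noise_x = [rng.gauss(0, 1) for _ in range(5000)]
noise_y = [rng.gauss(0, 1) for _ in range(5000)]

r_circle = pearson(circle_x, circle_y)
mi_circle = mutual_information(circle_x, circle_y)
mi_noise = mutual_information(noise_x, noise_y)
print(f"circle: correlation={r_circle:.3f}, MI={mi_circle:.3f} bits")
print(f"noise:  MI={mi_noise:.3f} bits")
```

On the circle the correlation should come out close to zero while the mutual information is clearly positive. For the unrelated pair the mutual information should sit near zero. The exact numbers depend on the bin count and the sample size.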
Now that we have a simple primer on mutual information, we will see how to leverage its analytic power.
Analysis 1: Full Sensitivity
Suppose you have a dataset that is not very large, but you still want to capture all the major channels and rank them. You may also want to explore how the individual attributes influence one another. For such a scenario, a Full Sensitivity analysis is the best fit.
The process is simple. For every pair of variables, you compute a normalized mutual information, and then you add up these normalized values. For example, a 10 variable dataset has 10 × 9 / 2 = 45 pairs, so you add up 45 normalized mutual information values. Then you remove one of the 10 variables from the dataset and recompute the sum, which now covers the 9 × 8 / 2 = 36 remaining pairs. If the new sum is significantly lower than the previous one, you have eliminated one of the variables responsible for a major portion of the information. You then repeat this with replacement: put back the variable you removed and remove a different one. By running this sensitivity study over every variable, you will be able to rank all variables from highest to lowest information content.
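The leave-one-out loop described above can be sketched as follows. This is an illustration under stated assumptions, not KeyConnect's implementation: it assumes the pairwise normalized mutual information values have already been computed, and the variable names, the values in the nmi table and the function names are hypothetical.

```python
from itertools import combinations

def total_information(nmi, variables):
    """Sum the normalized MI over every unordered pair of `variables`."""
    return sum(nmi[frozenset(pair)] for pair in combinations(variables, 2))

def full_sensitivity(nmi, variables):
    """Rank variables by how far the pairwise sum drops when each is left out."""
    baseline = total_information(nmi, variables)
    drops = {}
    for removed in variables:
        # "With replacement": every pass starts again from the full variable list.
        remaining = [v for v in variables if v != removed]
        drops[removed] = baseline - total_information(nmi, remaining)
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)

# Hypothetical normalized MI values for four made-up variables.
variables = ["price", "volume", "region", "noise"]
nmi = {
    frozenset(("price", "volume")): 0.62,
    frozenset(("price", "region")): 0.35,
    frozenset(("price", "noise")): 0.02,
    frozenset(("volume", "region")): 0.28,
    frozenset(("volume", "noise")): 0.03,
    frozenset(("region", "noise")): 0.01,
}

ranking = full_sensitivity(nmi, variables)
for name, drop in ranking:
    print(f"{name}: sum drops by {drop:.2f} when removed")

# Pair counts for the 10 variable example: all pairs, then pairs after one removal.
pair_counts = (len(list(combinations(range(10), 2))),
               len(list(combinations(range(9), 2))))
```

Note that the drop for a variable is simply the sum of the normalized mutual information over the pairs that include it, so the ranking could also be read directly off the pairwise table.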
With Full Sensitivity analysis, you have all the key performance indicators ranked.
Fortunately, you don't have to do this by hand! KeyConnect is a web-based app that will do this automatically. See the video below to understand how this works.
Analysis 2: Basic Pareto
When you have a “lot” of variables, running a full sensitivity analysis can take time and may not be the most efficient method. In this case, you want to quickly identify the handful of variables that account for a major chunk of the total information. So you apply the Pareto 80-20 rule as described in this article. This is useful for the many datasets where a large majority of the data channels (or attributes) contribute very little to the total information. Often, these channels are random noise factors.
With Basic Pareto you only rank the key performance indicators which account for 80% or more of the information.
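One way to read the 80% cut is sketched below. It assumes each variable already has a single information score, that variables are taken in descending order of score, and that the cut is the smallest set whose cumulative share reaches the threshold. The function name pareto_cut, the channel names and the scores are hypothetical.

```python
def pareto_cut(scores, threshold=0.80):
    """Return the top-ranked variables whose scores reach `threshold` of the total."""
    total = sum(scores.values())
    kept, running = [], 0.0
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for name, score in ranked:
        if running >= threshold * total:
            break  # the variables kept so far already cover the threshold
        kept.append(name)
        running += score
    return kept

# Hypothetical information scores for five made-up channels.
scores = {
    "temperature": 0.50,
    "pressure": 0.35,
    "humidity": 0.06,
    "sensor_id": 0.05,
    "timestamp": 0.04,
}

key_variables = pareto_cut(scores)
print(key_variables)
```

With these made-up scores, the first two channels together hold 85% of the total, so the remaining three are left unranked.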
This video explains how KeyConnect can be used for a Basic Pareto analysis for ranking attributes.
Analysis 3: Target Analysis
Sometimes you want to take a focused approach for identifying all data channels which impact a specific variable – called the target variable. This analysis is very similar to the full sensitivity analysis, except that instead of computing mutual information for every pair of variables, you run the computation for all attributes vis-a-vis the target. This is simply a process of identifying the key performance indicators for a specific business objective.
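A target analysis can be sketched in the same spirit. The article does not say which normalization is used, so this sketch assumes one common choice: mutual information divided by the geometric mean of the two entropies, estimated from 8 equal-width bins. The column names, the synthetic data and the function names are hypothetical.

```python
import math
import random
from collections import Counter

def to_bins(values, bins=8):
    """Map each value to an equal-width bin index (0 .. bins-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant data
    return [min(int((v - lo) / width), bins - 1) for v in values]

def entropy(labels):
    """Shannon entropy in bits of a list of discrete labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def normalized_mi(x, y, bins=8):
    """I(X;Y) / sqrt(H(X) * H(Y)); the normalization is an assumption."""
    bx, by = to_bins(x, bins), to_bins(y, bins)
    hx, hy = entropy(bx), entropy(by)
    if hx == 0 or hy == 0:
        return 0.0  # a constant column carries no information
    mi = hx + hy - entropy(list(zip(bx, by)))  # I = H(X) + H(Y) - H(X,Y)
    return mi / math.sqrt(hx * hy)

def rank_against_target(columns, target, bins=8):
    """Score every attribute against the target only, highest first."""
    scores = {name: normalized_mi(values, columns[target], bins)
              for name, values in columns.items() if name != target}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Synthetic data: the target depends non-monotonically on "driver",
# slightly on "weak", and not at all on "noise".
rng = random.Random(1)
n = 4000
columns = {
    "driver": [rng.uniform(-1, 1) for _ in range(n)],
    "weak": [rng.gauss(0, 1) for _ in range(n)],
    "noise": [rng.gauss(0, 1) for _ in range(n)],
}
columns["target"] = [d * d + 0.15 * w
                     for d, w in zip(columns["driver"], columns["weak"])]

ranking = rank_against_target(columns, "target")
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because only the attribute-to-target pairs are scored, the work grows linearly with the number of attributes rather than with the number of pairs.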
Originally posted on Fri, Jul 20, 2012 @ 10:30 AM