A well-known thumb rule in data mining and predictive analytics is that 80% of an analyst's time is spent on cleaning and preparing the data for analysis. A well prepared dataset is indeed more than half the work done. Recently, one of our regular blog readers wrote back saying that some tips and tools to help in this context would be very helpful. Of course, we agree whole heartedly.
There are several facets of data preparation that are important, ranging from missing value handling to outlier detection. This article is the first in a series dedicated to the "heavy lifting" portion of predictive analytics. Future articles will discuss selecting the best techniques based on variable types, stratifying data so that there is no bias in the analysis, and of course, missing values and outliers.
One of the easier ones to take on is the task of identifying which variables to actually include in the analysis. More specifically, in this article we will demonstrate the task of weeding out variables which carry no information. Take for example the case of a constant: if one or more of the attributes are constant, then clearly they carry no predictive information and can be discarded. A mind numbing way to identify these constants would of course be plotting each of the attributes and then eyeballing the ones which are flat-liners or show very little trend information.
If you have a small enough dataset, you can use the "Sparklines" feature in the latest Microsoft Excel to identify such attributes. The following graphic illustrates this.
However if the dataset is "large" and if the attributes are not strictly constant, as in the example above, sparklines cannot help much. Additionally, by simply looking at the sparklines above, we cannot really say much about any of the other non-constant attributes.
A faster, automated and more robust way to weed out such attributes is by using mutual information. We have several articles in this blog on how mutual information works and how KeyConnect leverages this technique. Here we show a 2-step process to accomplish the task of selecting the variables which are valuable.
Step 1: Run a Full Sensitivity analysis using KeyConnect
This procedure will rank all the variables in the dataset by the order of total information they carry. When the above dataset is analyzed, the bar chart in the results page shows this ranking. As seen in the chart, the attribute "const" has zero information content!
Step 2: Run a Basic Pareto analysis
With the same data set, you can run a Pareto analysis which will simply identify the variables which are responsible for 80% or more of the total information within the dataset. Compare the bar charts from the two analyses: the ranking is more or less identical, except that the Pareto analysis only shows 5 (out of 9) attributes where as the full sensitivity analysis showed all the attributes ranked by their information content.
Sign up for KeyConnect today!