In this article, we briefly describe a 5-step process that allows anyone to extract key performance indicators (KPIs) from diverse datasets using mutual information. The process employs open source and software-as-a-service tools that are affordable and easy to deploy.

The objective is to demonstrate how a large, typical socio-economic statistics dataset can be explored, reduced in dimension and prepared for further processing (such as predictive modeling) or for extracting insights, such as identifying key drivers of economic growth.

The dataset used is “Health, Nutrition and Population Statistics” from the World Bank, which consists of 46,000 rows covering 240 countries from 1960 to 2010. The flow chart below illustrates the process and the corresponding results and benefits of each step.

We use RapidMiner for steps 1 through 4 and KeyConnect for step 5. Here is a short description of the steps; full details are described in the accompanying free white paper at the end of this article.

Step 1: Building the data

General data must first be organized into meaningful tables. The dataset we considered had nearly 50 years of data, but in this case we are interested in identifying KPIs for a particular year and not for the entire time history. In other words, we are interested in a cross-sectional analysis and not a time series analysis. So the first step is simply to extract all data for the year of interest (in this case 2006). A standard way of doing this is with pivot tables. Here we make use of RapidMiner’s pivot table operator.
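For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch of the same cross-sectional extraction. It is not the RapidMiner pivot operator. The record layout `(country, indicator, year, value)`, the country names and the numbers are all made up for illustration, and `pivot_year` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical long-format records: (country, indicator, year, value).
# Names and numbers are invented for illustration only.
records = [
    ("Aland", "Life expectancy", 2005, 70.1),
    ("Aland", "Life expectancy", 2006, 70.4),
    ("Aland", "Fertility rate", 2006, 2.1),
    ("Borduria", "Life expectancy", 2006, 64.9),
    ("Borduria", "Fertility rate", 2006, 3.4),
]


def pivot_year(rows, year):
    """Return {country: {indicator: value}} for a single year."""
    table = defaultdict(dict)
    for country, indicator, row_year, value in rows:
        if row_year == year:
            # One row per country, one column per indicator.
            table[country][indicator] = value
    return dict(table)


wide_2006 = pivot_year(records, 2006)
for country, indicators in wide_2006.items():
    print(country, indicators)
```

The result has one row per country and one column per indicator, which is the shape the later steps assume.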

Steps 2 and 3: Preparing the data

These steps involve three sub-processes: handling missing values, identifying outliers and removing outliers. The key question we have to answer is how to treat missing values. The answer depends on the nature of the data and on the objectives of the analysis, so it varies from case to case.
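As an illustration only, the sketch below shows one common combination of choices: filling missing values with the column median and flagging outliers with the 1.5 × IQR rule. Both choices are assumptions made for this example, not the settings used in the white paper, and the helper names and sample values are hypothetical.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]


def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v < low or v > high for v in values]


# Hypothetical indicator column with one gap and one extreme value.
column = [2.1, 2.4, None, 2.2, 2.6, 2.3, 9.8, 2.5]

filled = impute_median(column)      # sub-process 1: handle missing values
mask = iqr_outlier_mask(filled)     # sub-process 2: identify outliers
cleaned = [v for v, is_out in zip(filled, mask) if not is_out]  # 3: remove
print(cleaned)
```

In a full table, an outlier would normally remove the whole row (the country), not just one cell, so that the columns stay aligned.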

Step 4: Removing collinear variables

In socio-economic datasets, there is usually a lot of redundancy. For example, there will naturally be a strong correlation between measures such as percentage of male population and percentage of female population. When the goal of the analysis is to identify key performance indicators for a specific objective, such as GDP growth, it is useful to remove such redundancies. This is accomplished with the “Remove Correlated Attributes” operator.
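The idea behind this kind of filter can be sketched in a few lines of Python. This is a simplified greedy version written for illustration; it does not reproduce the RapidMiner operator, and the 0.95 threshold, the column names and the values are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumes neither column is constant


def drop_correlated(columns, threshold=0.95):
    """Keep a column only if |r| with every already-kept column is <= threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical columns: pct_female is exactly 100 - pct_male.
columns = {
    "pct_male": [49.0, 50.5, 48.2, 51.1, 49.7],
    "pct_female": [51.0, 49.5, 51.8, 48.9, 50.3],
    "fertility_rate": [2.1, 3.4, 1.8, 2.2, 4.0],
}

kept = drop_correlated(columns)
print(kept)
```

Because the filter is greedy, which member of a correlated pair survives depends on the order of the columns. Here `pct_male` comes first, so `pct_female` is the one dropped.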

Step 5: Identify the key performance indicators or most important variables

KeyConnect is used for this last step to rank the remaining variables according to their relative importance and then extract the parameters which most strongly influence a specific objective. This is somewhat equivalent to a dimension reduction or feature selection exercise using tools such as principal component analysis (PCA). However, PCA has certain weaknesses, such as scale dependence, which makes normalization necessary. After normalization, it is difficult to rank the relative importance of different variables, even with an intuitive tool such as the RapidMiner PCA tool. KeyConnect is a mutual information based process and is scale independent. It does not require normalization, and the resulting ranking is easy to interpret. Finding and ranking KPIs using mutual information is a very straightforward process.
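To make the mutual information idea concrete, here is a small Python sketch that ranks candidate variables by their estimated mutual information with a target. It is not KeyConnect's algorithm, whose internals are not described here. The estimator (rank-based binning into three equal-frequency bins), the variable names and the toy values are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def rank_bins(values, bins=3):
    """Assign each value to an equal-frequency bin based on its rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * bins // len(values)
    return labels


def mutual_information(x, y, bins=3):
    """Estimate I(X;Y) in bits from rank-binned samples."""
    bx, by = rank_bins(x, bins), rank_bins(y, bins)
    n = len(x)
    pxy, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    # Sum of p(a,b) * log2( p(a,b) / (p(a) * p(b)) ) over observed cells.
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


# Toy data, invented for illustration.
target = [1.2, 2.5, 3.1, 4.0, 4.8, 5.5, 6.3, 7.1, 8.0]  # e.g. GDP growth
candidates = {
    "indicator_a": [10, 14, 19, 25, 31, 36, 44, 52, 60],  # tracks the target
    "indicator_b": [1, 5, 9, 2, 6, 8, 3, 4, 7],           # unrelated pattern
}

scores = {name: mutual_information(col, target) for name, col in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)

# Rescaling a variable does not change its rank-based bins, so its score is unchanged.
rescaled = [v * 1000 for v in candidates["indicator_a"]]
print(mutual_information(rescaled, target) == scores["indicator_a"])
```

Because the bins depend only on the ordering of the values, multiplying a column by a positive constant leaves its score untouched. That is the scale independence referred to above, and it is why no normalization step is needed before ranking.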

Originally posted on Thu, Sep 20, 2012 @ 08:16 AM
