One major challenge facing government agencies is making publicly available data easy to consume and utilize. Tax dollars typically finance the accumulation of vast quantities of structured data, and one of the requirements for analyzing this data, or making it consumable by the public, is the ability to shortlist the critical variables or attributes within a dataset for a given policy or business objective. Such an exercise is a necessary first step in any data analysis process that requires building predictive analytics models. Or it may be the ultimate goal in itself: to identify and extract the key factors that influence a policy’s effectiveness, for example.
Both of these tasks are frequently executed by analysts within government agencies. However, with easy access to data via the internet, the general public’s appetite for using data for personal and business needs is also increasing. The government’s role thus may not be limited to making data available; providing tools (via apps) to make consuming this data easy is something agencies must also consider.
A typical data analysis task for government agencies could be to identify a short list of factors which are critical toward meeting policy objectives. For example, an agency dealing with labor may want to use the data to identify the key reasons for high unemployment in a particular economic region so that it can focus on those reasons with a view toward reducing unemployment. Similarly, an agency tasked with public health may want to identify a short list of critical parameters that would reduce the incidence of communicable diseases. In either case, the process required to reach the analysis objective is fairly standard.
In many cases, the general public also needs to consume much of this data. For example, health care providers may access public databases to download health data for business decisions. Working professionals may want to access and understand economic data for employment-related reasons. In either case, providing tools to pare down and dissect data would be an added benefit the government agency could offer the tax-paying public, in addition to the raw data itself.
According to the President’s Council of Advisors on Science and Technology, the US is underinvesting in tools for “analysis and effective utilization” of this massive resource. Looking forward, the ability of a government to effectively utilize its data for decision making is going to be a critical factor. Many analysts and experts believe that “data is the next crude oil”. If that is indeed a correct assessment, then the tools that refine the data and make it consumable are akin to oil refineries and pipelines.
Using KeyConnect, a mutual-information-based key driver analysis tool, to identify the key variables in a fairly large dataset is quite simple. After following standard data preparation steps and removing collinear variables, the original dataset has been reduced to about 38 attributes.
If the objective is to further reduce the dataset from 38 attributes to a much shorter list, and to rank that list of variables in order of relative importance within the dataset, mutual-information-based analysis provides an easier-to-deploy and easier-to-interpret option.
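KeyConnect’s internal algorithm is not published here, but the core metric, mutual information, is easy to illustrate. The sketch below (Python with scikit-learn; the data, the `binned_mi` helper, and its bin count are illustrative assumptions, not KeyConnect’s implementation) estimates the mutual information between pairs of continuous variables by discretizing them first:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Synthetic data for illustration: y depends strongly on x, z does not.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2 * x + rng.normal(scale=0.1, size=500)   # strongly dependent on x
z = rng.normal(size=500)                      # independent of x

def binned_mi(a, b, bins=10):
    """Discretize two continuous variables, then compute their mutual
    information (in nats): how much knowing one reduces uncertainty
    about the other."""
    a_d = np.digitize(a, np.histogram_bin_edges(a, bins))
    b_d = np.digitize(b, np.histogram_bin_edges(b, bins))
    return mutual_info_score(a_d, b_d)

# The dependent pair should carry far more information than the independent one.
print(binned_mi(x, y), binned_mi(x, z))
```

A dependent pair yields a markedly higher score than an independent pair, which is exactly the property a key driver analysis exploits when ranking attributes.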
The reduced dataset of 220 samples and 38 attributes is now run through KeyConnect. Loading the data (a CSV file) is fairly easy, as shown below.
A Basic Pareto analysis was run, so KeyConnect identifies the variables responsible for 80% of the total information content within the 38 variables. The output shows that 10 variables account for more than 80% of the total information, and it gives their relative ranking. KeyConnect also shows how these variables interact with each other via a circle chart.
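KeyConnect’s exact Pareto computation is not documented here; one plausible way to reproduce the idea (an assumption, not the tool’s actual method) is to score each variable by the sum of its mutual information with every other variable, sort the scores, and keep the smallest set that covers 80% of the total:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Small synthetic stand-in for the 220-sample, 38-attribute dataset:
# 8 columns with some shared structure so the ranking is non-trivial.
rng = np.random.default_rng(1)
n, p = 220, 8
base = rng.normal(size=(n, 2))
data = rng.normal(size=(n, p)) + base @ rng.normal(size=(2, p))

def binned(col, bins=10):
    return np.digitize(col, np.histogram_bin_edges(col, bins))

disc = np.column_stack([binned(data[:, j]) for j in range(p)])

# Score each variable by its total mutual information with all others.
scores = np.array([
    sum(mutual_info_score(disc[:, j], disc[:, k]) for k in range(p) if k != j)
    for j in range(p)
])

# Pareto cutoff: rank descending, keep the variables that cumulatively
# account for 80% of the information content.
order = np.argsort(scores)[::-1]
cum = np.cumsum(scores[order]) / scores.sum()
pareto_set = order[: int(np.searchsorted(cum, 0.80)) + 1]
print("variables covering 80% of information:", pareto_set)
```

On the real dataset this style of cutoff is what yields the “10 of 38 variables carry over 80% of the information” result the tool reports.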
KeyConnect also generates a simple tabular report that shows the ranking of these 10 important variables among the 38 and their relative contribution to the overall information content (see Table 1).
Table 1. Ranking the reduced attribute dataset by information content
With KeyConnect, it is also possible to run a targeted ranking analysis. For example, to measure the influence of the various factors on an economic growth indicator such as GNP or Labor Force, one can run a “Target Analysis” by selecting the desired target variable. KeyConnect then ranks the relative influence of each of the remaining attributes in the dataset on the target variable. The result of running a target analysis on the 2006 GNI per capita is shown in Table 2. As seen there, per capita health expenditure has a significantly high influence on GNI, followed by the life expectancy of the male population.
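The target-analysis idea can also be sketched outside the tool. The snippet below (an illustration on synthetic data, not KeyConnect’s code; the variable names and the use of scikit-learn’s `mutual_info_regression` are assumptions) ranks each attribute by its estimated mutual information with a chosen target:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Synthetic stand-in: a "GNI"-like target driven mostly by health
# expenditure, partly by life expectancy, and not at all by noise.
rng = np.random.default_rng(2)
n = 220
health_exp = rng.normal(size=n)
life_exp = 0.5 * health_exp + rng.normal(size=n)
unrelated = rng.normal(size=n)
target = 3 * health_exp + life_exp + rng.normal(scale=0.5, size=n)

X = np.column_stack([health_exp, life_exp, unrelated])
names = ["health_expenditure", "life_expectancy", "unrelated"]

# Estimate MI of each attribute with the target, then rank descending.
mi = mutual_info_regression(X, target, random_state=0)
ranking = sorted(zip(names, mi), key=lambda t: -t[1])
for name, score in ranking:
    print(f"{name:20s} {score:.3f}")
```

The strongest driver lands at the top of the list, mirroring the way the tool surfaced health expenditure as the leading influence on GNI per capita in Table 2.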
Table 2. Target analysis of 2006 GNI per capita
Get a free basic account on KeyConnect and try it out on your datasets!