When the data.gov website launched in 2009, it had a measly 47 datasets. Four years later it has exploded to nearly 100,000 datasets in more than 50 formats. This is merely the public-facing data which the government makes available to the taxpaying citizenry. The "other" government data (still funded by taxes), which is not openly available to all for security and other reasons, is clearly far larger. EMC Corporation recently released a report indicating that only about a quarter of this data is currently tagged and analyzed by the government. Officials have been quoted as saying that over the next five years, the feds will spend about $13 billion (16% of the total IT budget) to improve big data infrastructure and develop data mining best practices for this data. The report also summarized the top three areas where large government agencies can best leverage big data and analytics: improving process and efficiency, enhancing security, and predicting trends.
In any business, strategic decisions can be made only after we know which factors have the greatest impact on the bottom line. "Key performance indicator," or KPI, is a term generally reserved for those factors that signal the health of a business. The business in question could be the overall functioning of an entity, or it could be a specific division within a larger operation. For example, for the CEO of a large corporation the "business" interest is the overall profitability of the company. But for a production manager in a smaller organization, the business interests would be the number of defective units, the number of days of accident-free production, average throughput volume, average time for repairs, and so on. The quantities of interest would then be metrics, or KPIs, that succinctly summarize the division's performance in those terms.
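To make the production manager's case concrete, here is a minimal sketch of how such KPIs might be computed from raw records. The record layout and field names are illustrative assumptions, not from any real system.

```python
# Hypothetical daily production records; field names are illustrative.
records = [
    {"units": 1200, "defects": 14, "repair_minutes": 35},
    {"units": 1150, "defects": 9,  "repair_minutes": 20},
    {"units": 1320, "defects": 22, "repair_minutes": 50},
]

def production_kpis(records):
    """Summarize a production manager's KPIs from raw daily records."""
    days = len(records)
    total_units = sum(r["units"] for r in records)
    total_defects = sum(r["defects"] for r in records)
    return {
        "avg_daily_throughput": total_units / days,
        "defect_rate": total_defects / total_units,
        "avg_repair_minutes": sum(r["repair_minutes"] for r in records) / days,
    }

kpis = production_kpis(records)
```

The point is that each KPI collapses many raw observations into one number that a manager can track over time.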
The purpose of using analytics is to help make sound decisions using data, as opposed to making shoot-from-the-hip decisions using instinct or gut feel. This is all well and good; however, there is one pitfall to watch out for before starting on the analytics journey: wrong data or wrong questions will derail the best efforts. What do we mean by this?
Imagine you are the owner of a drugstore and want to optimize your shelf space to improve cross-selling. Suppose you have customer transaction data from the sale of cosmetics: typically such data contains the items purchased together by different customers. Specifically, let us suppose we have a dataset that records what each cosmetics customer purchases during one transaction. (Such a dataset is described in this data mining book.) There are seven different cosmetics-related items sold in the drugstore, and the manager wants to use the transaction data described above to make the best use of the shelf space.
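A first pass at such transaction data is simply counting which item pairs are bought together most often. This is a minimal sketch of that co-occurrence count; the transactions and item names below are made up for illustration, not taken from the dataset mentioned above.

```python
from itertools import combinations
from collections import Counter

# Hypothetical cosmetics transactions; each set is one customer's basket.
transactions = [
    {"lipstick", "mascara", "eyeliner"},
    {"lipstick", "nail polish"},
    {"mascara", "eyeliner", "foundation"},
    {"lipstick", "mascara"},
]

def pair_counts(transactions):
    """Count how often each pair of items appears in the same basket."""
    counts = Counter()
    for basket in transactions:
        # sorted() makes ("a", "b") and ("b", "a") the same key
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(transactions)
top_pair, top_count = counts.most_common(1)[0]
```

Pairs with the highest counts are natural candidates for adjacent shelf placement; full association-rule mining refines this with support and confidence thresholds.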
Feature selection, or dimension reduction, is a data preparation activity for most predictive analytics and data mining work. One can argue that feature selection is one side of a coin and extracting key performance indicators (KPIs) is the other. Both activities require us to parse through the available data and identify the big hitters, or key players, within the dataset. Where they differ is the final objective: in data mining, the objective is simply to reduce the dimension of the data to optimize model building; in KPI analysis, the objective is to finalize which metrics to track.
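One simple way to identify the "big hitters" is to rank features by the strength of their correlation with the outcome of interest and keep only the top few. This is a bare-bones sketch of that filter approach; the feature values and target are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_top_features(features, target, k=2):
    """Rank features by |correlation| with the target; keep the top k."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    return ranked[:k]

features = {
    "f1": [1, 2, 3, 4, 5],   # strongly related to the target
    "f2": [2, 1, 4, 3, 5],   # moderately related
    "f3": [5, 1, 4, 2, 2],   # weakly related (noise)
}
target = [2, 4, 6, 8, 10]

top = select_top_features(features, target, k=2)
```

Real feature selection must also handle redundant features that correlate with each other, but the filtering idea is the same.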
In this article, we briefly describe a 5-step process that will allow anyone to extract a key performance indicator from diverse datasets. The process will employ open source and software-as-a-service tools that are affordable and easy to deploy.
While key performance indicators (KPIs) offer a rational basis for judging performance, there are two main challenges. The first challenge is selecting the right KPIs for your business: how do you know which are the best for your needs? The standard solution is to monitor several KPIs simultaneously. This can lead to data overload and become a barrier to effective communication of business performance to executives.
One scenario that market researchers try to understand well is the influence of gender on big-ticket decision making. For example, in our family my wife has a strong say in all the high-value decisions, from buying or remodeling our house to buying a new car! A convertible is ruled out in favor of a minivan, a man cave in the basement is strongly contested by an extra guest bedroom, and so on.
A well-known rule of thumb in data mining and predictive analytics is that 80% of an analyst's time is spent on cleaning and preparing the data for analysis. A well-prepared dataset is indeed more than half the work done. Recently, one of our regular blog readers wrote back saying that some tips and tools to help in this context would be very helpful. Of course, we agree wholeheartedly.
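As a small taste of what that cleaning work looks like, here is a sketch that trims whitespace, normalizes case, coerces types, and makes missing values explicit. The rows and field names are invented for illustration.

```python
# Hypothetical raw rows, with the usual messiness: stray whitespace,
# inconsistent case, and missing values encoded as "" or None.
raw_rows = [
    {"name": "  Alice ", "age": "34", "city": "boston"},
    {"name": "BOB", "age": "", "city": " Chicago"},
    {"name": "carol", "age": "29", "city": None},
]

def clean_row(row):
    """Trim whitespace, normalize case, and coerce types.

    Missing values become None so downstream code can handle them uniformly.
    """
    name = row["name"].strip().title() if row["name"] else None
    age = int(row["age"]) if row["age"] else None
    city = row["city"].strip().title() if row["city"] else None
    return {"name": name, "age": age, "city": city}

cleaned = [clean_row(r) for r in raw_rows]
```

Even this toy example shows why cleaning dominates the effort: every field needs its own rules for whitespace, case, types, and missing data.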
The chi-squared test of independence is usually used to check whether two categorical (or nominal) variables are independent. However, when you have a dataset that consists of several nominal variables, you can still apply the test to answer the more general question: "Are any of these variables related to one another?"
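For two nominal variables, the test statistic comes straight from the contingency table: compare each observed cell count against the count expected under independence. Here is a minimal sketch; the counts in the table are hypothetical, chosen only to illustrate the computation.

```python
def chi_squared_statistic(table):
    """Chi-squared statistic for a two-way contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: (row total * column total) / grand total
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows are one variable's levels, columns the other's.
table = [
    [30, 10],
    [20, 40],
]
stat = chi_squared_statistic(table)
# Degrees of freedom = (rows - 1) * (cols - 1) = 1 here; compare stat
# against the chi-squared critical value (3.841 at the 5% level for df = 1).
```

A statistic above the critical value suggests the two variables are not independent. Repeating this pairwise across several nominal variables answers the broader "are any of these related?" question, though the multiple comparisons should then be accounted for.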