Twice-weekly articles to help SMB companies optimize business performance with data analytics and improve their analytics expertise.
When executives do not clearly understand how a forecast or prediction works, they naturally tend to be suspicious of it. This suspicion only grows stronger if the forecasts or predictions are very good! A common challenge when using "black box" techniques, artificial neural networks among others, is that they are difficult to explain to non-analytics people, and that opacity spreads doubt and confusion about the real benefits businesses can derive from analytics.
Tableau, the prom queen of data, is finally going out with R, the alpha-geek of analytics. This is a moment a lot of us have been waiting for. Tableau will soon release version 8.1, which allows super easy integration with R. I had the opportunity to test drive the beta version of 8.1, with really cool results. Below are a few initial impressions along with a simple workbook you can download and play with (if you have the beta version).
When the data.gov website was launched in 2009, it had a measly 47 datasets. Four years later it has exploded to nearly 100,000 datasets in more than 50 formats. This is merely the public-facing data which the government makes available to the tax-paying citizenry. The "other" government data (still funded by taxes), which is not openly available to all due to security and other reasons, is clearly significantly larger. EMC Corporation recently released a report indicating that only about a quarter of this data is currently tagged and analyzed by the government. Officials have been quoted as saying that in the next 5 years, the feds will spend about $13 billion (16% of the total IT budget) to improve big data infrastructure and develop data mining best practices for this data. The report also summarized the top three areas where large government agencies can best leverage big data and analytics: improving processes and efficiency, enhancing security, and predicting trends.
We indicated that there are two main types of feature selection algorithms: wrapper type and filter type. A wrapper algorithm works within another machine learning program, such as multiple linear regression. Good examples are backward elimination and forward selection. Each iteration of the regression model either removes or introduces a variable to improve model performance. The iterations stop when a preset performance criterion (such as adjusted r-squared or RMS error) is reached or exceeded. The inherent advantage of wrapper-type methods is that multi-collinearity issues are automatically handled. However, they give you no insight into the actual relationships between the variables, either before or after modeling.
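The backward elimination procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the data, variable names, and the stopping rule (keep dropping variables as long as adjusted r-squared improves) are all made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_r2(model, X, y):
    # Adjusted r-squared penalizes the plain score for the number of predictors
    n, p = X.shape
    r2 = model.score(X, y)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def backward_eliminate(X, y, names):
    keep = list(range(X.shape[1]))
    best = adjusted_r2(LinearRegression().fit(X[:, keep], y), X[:, keep], y)
    improved = True
    while improved and len(keep) > 1:
        improved = False
        for j in list(keep):
            trial = [k for k in keep if k != j]
            m = LinearRegression().fit(X[:, trial], y)
            score = adjusted_r2(m, X[:, trial], y)
            if score > best:          # dropping variable j improved the criterion
                best, keep, improved = score, trial, True
                break
    return [names[k] for k in keep], best

# Synthetic data: two informative predictors plus two pure-noise columns
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

selected, best = backward_eliminate(X, y, ["x0", "x1", "noise1", "noise2"])
print(selected, round(best, 3))
```

With this kind of data the procedure retains the informative predictors while the noise columns tend to be eliminated, since dropping them raises the adjusted r-squared.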
We have described how cost models can be developed using regression. We have also described how to use these models for forecasting. Thus the model is used not only to develop a better understanding of the data (explanatory statistics) but also to make forecasts (predictive analytics).
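The explanatory-then-predictive workflow described here can be shown with a tiny regression cost model. The monthly units and cost figures below are made up for illustration; the fitted line separates fixed from variable cost (explanatory), and the same line then forecasts cost at a planned volume (predictive).

```python
import numpy as np

# Hypothetical monthly data: units produced and total cost in dollars
units = np.array([100, 150, 200, 250, 300, 350])
cost = np.array([2100, 2550, 3050, 3500, 4000, 4450])

# Explanatory step: fit cost = fixed + variable * units by least squares
variable, fixed = np.polyfit(units, cost, 1)
print(f"fixed cost ~ {fixed:.0f}, variable cost per unit ~ {variable:.2f}")

# Predictive step: forecast total cost for a planned volume of 400 units
forecast = fixed + variable * 400
print(f"forecast for 400 units ~ {forecast:.0f}")
```

The same fitted coefficients serve both purposes: reading them tells you the cost structure, and plugging in a new volume turns the model into a forecast.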
Suppose you are building a multiple linear regression model using many factors; a first step is reducing the number of factors or predictors. This process is known, among other names, as feature selection or dimension reduction. Two key points need to be kept in mind:
In part 1, we described a basic three-workstation assembly line, which is the "business end" of a supply chain. The key management challenges in this supply chain are that product assembly lead times are increasing and revenues are dropping (see chart below). Much of this is driven by variability in the inventory levels of raw material parts coming into the assembly process.
Feature selection, also called data dimension reduction or variable screening, in predictive analytics refers to the process of identifying the few most important variables or parameters which help in predicting the outcome. In today's charged-up world of high-speed computing, one might be forgiven for asking: why bother? The most important reasons all come down to practicality.
When Dirk Nannes was forced out by injury, Royal Challengers Bangalore struck a gold mine by signing the ever-explosive Chris Gayle to replace him. Gayle's base price was $400,000, and according to IPL rules, Chris could not be paid more than $650,000 because that was the value of the player he was replacing. Let us assume that is what he was finally signed for.
Principal component analysis (PCA) is, according to Wikipedia, a technique that "uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables called principal components."
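That definition is easier to digest with a small sketch. Below, two strongly correlated made-up variables are transformed with scikit-learn's PCA; the resulting principal components are uncorrelated, and nearly all the variance lands in the first component.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: the second variable is roughly twice the first, so the
# two columns are strongly correlated
rng = np.random.default_rng(42)
x = rng.normal(size=300)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.2, size=300)])

pca = PCA(n_components=2)
scores = pca.fit_transform(data)      # the uncorrelated principal components
ratios = pca.explained_variance_ratio_
print("variance explained:", np.round(ratios, 3))

# The orthogonal transformation leaves the components uncorrelated
corr = np.corrcoef(scores[:, 0], scores[:, 1])[0, 1]
print("correlation between components:", round(corr, 6))
```

Because the original variables move together, the first component captures almost all of the variation, which is exactly why PCA is useful for dimension reduction.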