Chris Anderson of “The Long Tail” fame ignited this controversy in 2008. He argued that in the coming age of petabytes and exabytes, trying to “fit” a model to data is redundant, which would make all predictive models futile. He favors a fast, brute-force approach: sift through big data to identify meaningful correlations and base decisions on them.
Practitioners of advanced business analytics are accustomed to building surrogate “models” from data. A lot of importance is placed on making the models accurate, at the risk of making them so specific to the data used to develop them that they fail on new data. In fact, there is even a technical term for this: “overfitting”.
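To make the idea of overfitting concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the toy data (a true relationship of y = 2x with alternating +1/-1 noise), the helper names `fit_line`, `mse`, `line_model`, and `memorizer`. It compares a simple straight-line fit with a "model" that just memorizes the training data.

```python
# Toy illustration of overfitting using only the standard library.
# All data and function names are hypothetical, chosen for this example.

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Training data: true relationship y = 2x plus alternating +1 / -1 noise.
train_x = [float(i) for i in range(10)]
train_y = [2 * x + (1 if i % 2 == 0 else -1) for i, x in enumerate(train_x)]

# New, unseen points that follow the same underlying relationship.
test_x = [i + 0.4 for i in range(9)]
test_y = [2 * x for x in test_x]

# Simple model: a straight line fitted to the training data.
slope, intercept = fit_line(train_x, train_y)

def line_model(x):
    return slope * x + intercept

# Overly specific "model": memorize the training set and answer with the
# value of the nearest training point, noise included.
def memorizer(x):
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

line_train_mse = mse(line_model, train_x, train_y)
line_test_mse = mse(line_model, test_x, test_y)
memo_train_mse = mse(memorizer, train_x, train_y)
memo_test_mse = mse(memorizer, test_x, test_y)

print("line:      train MSE", round(line_train_mse, 3), "test MSE", round(line_test_mse, 3))
print("memorizer: train MSE", round(memo_train_mse, 3), "test MSE", round(memo_test_mse, 3))
```

The memorizer reproduces its training data perfectly but does worse than the plain line on the new points, because it has learned the noise along with the signal. That is the trade-off the term describes.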
On the other hand, there are many valid reasons for continuing to build models. Probably the best-articulated one comes from John Sterman: “…when experimentation in real systems becomes infeasible, [modeling and] simulation becomes the main, perhaps the only way […] to discover how complex systems work”.
So here are the two sides of the debate and the accompanying arguments:
1. Model building is futile in the age of big data
a. Model assumptions are inherently untestable until it is too late. A small outlier can wreak havoc with a model and result in chaos. Witness the sub-prime catastrophe: risk models were built on an assumption of extremely low (less than 2%) defaults in the sub-prime market. When this assumption was violated, the house of cards that was the lending industry built on these models collapsed.
b. Nature is dominated by power laws (long tails), especially in socioeconomic fields. For example, we could build a model to predict an individual’s net financial worth as a function of years of college for a given city. This model might be very accurate for many populations, until a Bill Gates, Mark Zuckerberg, or Steve Jobs relocates to that neighborhood one fine day!
2. Model building is simply a valuable tool for predictive analytics problems
c. Models improve decision making. For example, a predictive model of household electricity consumption based on season and occupancy can help a utility company forecast its revenue for the next quarter.
d. Models help advance science and knowledge. Theory is about predicting what has not yet been seen or observed. Without models, one cannot ask questions or run “what-if” scenarios on data. Furthermore, relying on correlations alone to predict future states also has flaws. The now-classic “My TiVo thinks I’m gay” example illustrated how the TiVo recommendation engine, which works on large data and correlations, produced a humorous mismatch for a customer.
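The net-worth example from the first side of the debate is easy to try out. The sketch below is purely illustrative: the neighborhood data, the dollar figures, and the names `fit_line` and `predict` are all invented for this example. It fits a straight line of net worth against years of college, then refits after a single long-tail resident moves in.

```python
# Toy illustration of how one long-tail outlier distorts a fitted model.
# All figures are hypothetical; net worth is in thousands of dollars.

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(model, years):
    slope, intercept = model
    return slope * years + intercept

# A neighborhood where net worth rises steadily with years of education.
years = [12, 12, 14, 16, 16, 18, 20]
worth = [100, 100, 150, 200, 200, 250, 300]

before = fit_line(years, worth)
before_pred = predict(before, 16)

# One fine day a billionaire with 14 years of education moves in
# (assumed net worth: 100 billion dollars, i.e. 100,000,000 thousand).
years_after = years + [14]
worth_after = worth + [100_000_000]

after = fit_line(years_after, worth_after)
after_pred = predict(after, 16)

print("predicted net worth at 16 years, before:", round(before_pred, 1))
print("predicted net worth at 16 years, after: ", round(after_pred, 1))
print("slope before:", round(before[0], 1), "slope after:", round(after[0], 1))
```

With these made-up numbers, one new resident is enough to push the prediction for an ordinary 16-year graduate into the billions of dollars and to flip the sign of the slope, so the refitted model claims that more education lowers net worth.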
George Box made this comment about models several decades ago, but it appears to be truly timeless: “All models are wrong, but some are useful”.
Which side are you on in this debate? Do you think no amount of data can replace good predictive models?
Originally posted on Wed, Mar 02, 2011 @ 12:46 PM