When Principal Component Analysis makes sense in business analytics
According to Wikipedia, principal component analysis (PCA) is a technique that "uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables called principal components."
What does that mean from a business analytics point of view? Let us assume that we have a data set with M parameters (or variables). These could be, for example, commodity prices, weekly sales figures, or the number of hours spent by assembly line workers; in short, any business parameter that can have an impact on performance. The question that PCA fundamentally helps us answer is this: which of these M parameters explain a significant amount of the variation contained in the data set? Strictly speaking, each principal component is a weighted combination of the original parameters, so the answer comes from seeing which parameters carry the most weight in the leading components. PCA essentially helps us apply an 80-20 rule: can a small subset of parameters (say 20%) explain 80% or more of the variation in the data?
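To make the 80-20 question concrete, here is a minimal sketch using NumPy. It assumes a made-up data set of five parameters that are all driven by one hidden factor plus a little noise; the function names `explained_variance_ratio` and `components_needed`, the loadings, and the 0.8 threshold are illustrative choices, not part of any standard API.

```python
import numpy as np


def explained_variance_ratio(data):
    """Share of total variance captured by each principal component.

    `data` has one row per sample and one column per parameter.
    """
    centered = data - data.mean(axis=0)
    # The squared singular values of the centered data are proportional
    # to the variances along the principal components (largest first).
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2 / (data.shape[0] - 1)
    return variances / variances.sum()


def components_needed(ratios, threshold=0.8):
    """Smallest number of leading components whose cumulative share
    of the variance reaches `threshold`."""
    cumulative = np.cumsum(ratios)
    count = int(np.searchsorted(cumulative, threshold)) + 1
    return min(count, len(ratios))


# Hypothetical data: 200 samples of 5 parameters sharing one hidden driver.
rng = np.random.default_rng(0)
hidden_factor = rng.normal(size=(200, 1))
loadings = np.array([[1.0, 0.8, -0.6, 0.5, 0.9]])
data = hidden_factor @ loadings + 0.1 * rng.normal(size=(200, 5))

ratios = explained_variance_ratio(data)
print("variance share per component:", np.round(ratios, 3))
print("components needed for 80%:", components_needed(ratios))
```

In this constructed case a single component is enough, because the five parameters move together by design. Real business data will rarely be this clean.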
Wikipedia provides enough mathematical detail on how PCA accomplishes this, so we don't need to repeat it here. But we do have to point out a couple of key issues to bear in mind when applying PCA. Let us assume that we have a sufficient number of samples, coming from a historical series or some random experiment, so that our data set looks like an N x M matrix, where the M columns are the different parameters, the N rows are the samples, and N > M.
1. The results of a PCA must be evaluated in the context of the data.
If the data is extremely noisy, then PCA may end up suggesting that the noisiest variables are the most significant because they account for most of the variation!
An analogy would be the total sound energy at a rock concert. If the crowd noise drowns out some of the high-frequency vocals or string notes, PCA might suggest that the most significant contribution to the total energy comes from the crowd, and it would be right! But this adds no clear value if one is trying to work out, for example, which musical instruments are influencing the harmonics.
Measuring the entropy of the variables is a good way to identify whether some parameters are too noisy to be included in the analysis.
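The rock concert effect is easy to reproduce. The sketch below assumes three made-up columns: two related business variables (`sales` and `costs`, both hypothetical) and one pure-noise column with a much larger spread (`crowd_noise`). PCA is run on the covariance matrix of the raw, unscaled data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 500

# Two hypothetical business variables that share a common driver.
driver = rng.normal(size=n_samples)
sales = driver + 0.1 * rng.normal(size=n_samples)
costs = 0.8 * driver + 0.1 * rng.normal(size=n_samples)

# A pure-noise column with a far larger variance (the "crowd").
crowd_noise = 10.0 * rng.normal(size=n_samples)

data = np.column_stack([sales, costs, crowd_noise])
column_names = ["sales", "costs", "crowd_noise"]

# PCA via the eigen-decomposition of the covariance matrix.
covariance = np.cov(data, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(covariance)

# eigh returns eigenvalues in ascending order, so the last column
# of `eigenvectors` is the first principal component.
first_component = eigenvectors[:, -1]
first_share = eigenvalues[-1] / eigenvalues.sum()
dominant = int(np.argmax(np.abs(first_component)))

print("share of variance in first component:", round(float(first_share), 3))
print("variable with the largest weight:", column_names[dominant])
```

The first component is almost entirely the noise column, even though it carries no business information. Nothing is wrong with the mathematics; the result simply reflects where the variation is.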
2. Adding uncorrelated data does not always help. Neither does adding data that may be correlated, but irrelevant.
If the parameters we add to our data set happen to be random noise, we are effectively back in the situation described in the first point above. On the other hand, as business analysts we also have to exercise caution and watch out for spurious correlations.
As an extreme example, there may be a correlation between the number of hours worked in a garment factory, say, and pork prices (an unrelated commodity) within a certain period of time. This correlation is almost certainly pure coincidence. Such correlations, again, can muddy the results of a PCA.
Before applying a technique like PCA, care must be taken to winnow the data set down to variables that make business sense and are not subject to many random fluctuations.
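How often does a coincidental correlation like the garment factory and pork price example turn up? The sketch below makes one assumption that is not in the text above: that both series drift over time like random walks, which is a common simplification for prices and cumulative activity. It generates many pairs of completely independent series and counts how often their correlation looks strong. The helper name `fraction_strongly_correlated` and the 0.5 cutoff are illustrative.

```python
import numpy as np


def fraction_strongly_correlated(series_a, series_b, cutoff=0.5):
    """Share of trials in which |Pearson correlation| exceeds `cutoff`.

    Both inputs have shape (n_trials, n_periods); row i of `series_a`
    is compared with row i of `series_b`.
    """
    a = series_a - series_a.mean(axis=1, keepdims=True)
    b = series_b - series_b.mean(axis=1, keepdims=True)
    correlation = (a * b).sum(axis=1) / np.sqrt(
        (a ** 2).sum(axis=1) * (b ** 2).sum(axis=1)
    )
    return float(np.mean(np.abs(correlation) > cutoff))


rng = np.random.default_rng(2)
n_trials, n_periods = 1000, 50

# Independent period-to-period changes for two unrelated quantities.
changes_a = rng.normal(size=(n_trials, n_periods))
changes_b = rng.normal(size=(n_trials, n_periods))

# Accumulating the changes turns each row into a drifting series.
walk_fraction = fraction_strongly_correlated(
    np.cumsum(changes_a, axis=1), np.cumsum(changes_b, axis=1)
)
# For comparison: the raw changes themselves, with no drift.
change_fraction = fraction_strongly_correlated(changes_a, changes_b)

print("drifting series with |r| > 0.5:", walk_fraction)
print("raw changes with |r| > 0.5:", change_fraction)
```

Under this assumption, unrelated drifting series look strongly correlated far more often than their period-to-period changes do, which is one reason to check business sense before feeding a variable into PCA.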
Image courtesy: Wikipedia
If you like tips like these, we invite you to sign up for visTASC, "a visual thesaurus of analytics, statistics and complex systems," which provides a portal for aggregating analytics wisdom and current applications.