In this three-part series, we explore how to use RapidMiner 5.0, the open source analytics package, to run a Principal Component Analysis (PCA). In part 1 we quickly review the background for PCA and explain the application logic. In part 2 we run a PCA on non-standardized data, and in part 3 we show how to standardize data before running a PCA (and why one should standardize).
Background - Why do a PCA?
In a previous article we discussed how PCA can add value in business analytics and also pointed out a couple of cautionary issues. To recap, PCA is a technique for reducing the dimension of a dataset by identifying the few most influential parameters (if they exist). This sort of variable screening or feature selection makes it easier to apply other predictive modeling techniques and to interpret their results.
PCA captures the parameters that explain the greatest amount of variation in the dataset. It does this by transforming the existing variables into a set of "principal components", new variables that have the following properties:
- They are uncorrelated with each other
- The first few of them cumulatively explain a large share of the variance within the data
- They can be related back to the original variables via weighting factors (often called loadings). Original variables with very low weighting factors in the principal components can be removed from the dataset.
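The three properties above can be checked numerically on a small example. The following is only a minimal sketch in Python with NumPy, not the RapidMiner process used in this series: the dataset is synthetic and entirely hypothetical (four made-up variables, two of which are driven by the same hidden factor), and the variable names are illustrative assumptions. It computes the principal components from the covariance matrix of the centered, non-standardized data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 200 rows, m = 4 variables (not from the article).
n = 200
latent = rng.normal(size=n)
X = np.column_stack([
    latent + 0.1 * rng.normal(size=n),        # variable 1: driven by the hidden factor
    2.0 * latent + 0.1 * rng.normal(size=n),  # variable 2: same factor, larger scale
    rng.normal(size=n),                       # variable 3: independent
    0.05 * rng.normal(size=n),                # variable 4: almost constant
])

# Center the data (no standardization) and compute the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigen-decomposition of the symmetric covariance matrix.
# eigh returns eigenvalues in ascending order, so sort them descending.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]  # column j holds the weights of component j

# Project the data onto the principal components.
scores = Xc @ eigvecs

# Property 1: the components are uncorrelated (off-diagonal covariances ~ 0).
score_cov = np.cov(scores, rowvar=False)

# Property 2: cumulative share of variance explained by the first k components.
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)

# Property 3: weights relating each component back to the original variables.
print("cumulative variance explained:", np.round(cumulative, 3))
print("weights of component 1:", np.round(eigvecs[:, 0], 3))
```

In this made-up example, a variable whose weights are close to zero in the leading components (variable 4 here) is a candidate for removal. Because the data is not standardized, variables measured on a larger scale tend to dominate the leading components.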
The following schematic illustrates how PCA can potentially help in reducing data dimensions with a hypothetical dataset of m variables.
In part 2 we will apply this logic to a real dataset that can be downloaded. Using RapidMiner we will explain how to set up the main process and interpret the results.