In part 1 we started off with a brief introduction to principal component analysis (PCA) and its application logic for a business analytics project. In this part, we will start with a real data set and use RapidMiner 5.0 to perform the PCA. Furthermore, for illustrative reasons, we will work with non-standardized (non-normalized) data. In the next part we will standardize the data and explain why it may sometimes be important to do so.
The dataset includes ratings and nutritional information on 77 breakfast cereals. There are a total of 15 variables, including 13 numerical parameters. The objective is to reduce this set of 13 numerical predictors to a much smaller list using PCA. The data comes from a publicly available statistical database and can also be downloaded at the end of this article.
Step 1 - Data prep: Remove non-numeric parameters "Cereal name", "Manufacturer" and "Type (hot or cold)". These are columns A, B and C.
Step 2 - Read Excel file into RapidMiner: This can be done using the standard "Read Excel" operator as illustrated below. Start by typing "excel" into the operator search field.
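For readers who prefer to do this data prep outside the spreadsheet, here is a minimal sketch in Python with pandas. The tiny table and its column names (`name`, `mfr`, `type` and so on) are made up for illustration and are not taken from the actual file; only the idea of dropping the three descriptive columns comes from the step above.

```python
import pandas as pd

# Tiny made-up stand-in for the cereal sheet; column names are assumptions.
cereals = pd.DataFrame({
    "name": ["Cereal A", "Cereal B", "Cereal C"],
    "mfr": ["X", "Y", "X"],
    "type": ["C", "C", "H"],
    "calories": [70, 120, 100],
    "sodium": [130, 15, 0],
    "rating": [68.4, 34.0, 54.9],
})

# Drop the three descriptive columns, keeping only the numeric predictors.
numeric = cereals.drop(columns=["name", "mfr", "type"])

# Sanity check: every remaining column should be numeric before running PCA.
assert all(pd.api.types.is_numeric_dtype(t) for t in numeric.dtypes)
print(list(numeric.columns))
```

With the real file, the hand-built table would be replaced by whatever the spreadsheet loader returns; the rest stays the same.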
Step 3 - PCA Operator: Type in the keyword "pca" in the operator search field and drag and drop the PCA operator into the main process window. Connect the output of Read Excel to the "Example set input" or exa port of the PCA operator.
The three available parameter settings for dimensionality reduction are: none, keep variance and fixed number. Here we use keep variance and leave the variance threshold at its default value of 0.95 or 95%.
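To make the "keep variance" idea concrete, the sketch below shows one plausible reading of it: keep the smallest number of leading components whose cumulative share of the total variance reaches the threshold. The function name `components_to_keep` and the eigenvalues are invented for illustration, and RapidMiner's exact cut-off rule may differ in edge cases.

```python
import numpy as np

def components_to_keep(eigenvalues, threshold=0.95):
    """Smallest number of leading components whose cumulative share
    of the total variance reaches the threshold."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # largest first
    cumulative = np.cumsum(ev) / ev.sum()
    # Index of the first cumulative share that is >= threshold, as a count.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(ev))

# Made-up eigenvalues, not the ones from the cereal data.
example_eigenvalues = [60.0, 25.0, 12.0, 2.0, 1.0]
k = components_to_keep(example_eigenvalues)
print(k)
```

With these made-up numbers the cumulative shares are 60%, 85%, 97%, 99% and 100%, so three components are kept at a 95% threshold.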
Step 4 - Build a PCA "model": Type the keywords "apply model" into the operator search field. Drag and drop the "Apply Model" operator into the main process window. Connect the following ports:
- from PCA operator, "ori"ginal port TO Apply Model "unl"abeled data port
- from PCA operator, "pre"processing port TO Apply Model "mod"el port
- from Apply Model operator, output "lab" port to "res" port and output "mod" to "res" port
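The wiring above separates two things: the PCA operator learns a transformation, and Apply Model uses it to transform a data set. A rough NumPy analogue of that split is sketched below. It assumes PCA on the covariance matrix of the raw, non-standardized data, uses synthetic numbers rather than the cereal file, and the function names `fit_pca` and `apply_pca` are illustrative, not RapidMiner's.

```python
import numpy as np

def fit_pca(X):
    """Learn the PCA 'model': column means, eigenvalues and eigenvectors
    of the covariance matrix of the raw (non-standardized) data."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # returned in ascending order
    order = np.argsort(eigenvalues)[::-1]            # re-sort, largest first
    return mean, eigenvalues[order], eigenvectors[:, order]

def apply_pca(X, mean, eigenvectors, k):
    """Apply the model: project the data onto the first k components."""
    return (X - mean) @ eigenvectors[:, :k]

rng = np.random.default_rng(0)
# Synthetic stand-in: 77 rows, 4 columns on very different scales.
X = rng.normal(size=(77, 4)) * np.array([100.0, 10.0, 1.0, 0.1])

mean, eigenvalues, eigenvectors = fit_pca(X)
scores = apply_pca(X, mean, eigenvectors, k=2)
print(scores.shape)
```

Because the columns are left on their original scales, the first component in this sketch is dominated by the column with the largest spread, which is the effect that standardization is meant to address.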
Step 5 - Interpret the results: By running the analysis as configured above, RapidMiner will output several tabs in the results panel. By clicking on the "PCA" tab, we will see three PCA related radio-buttons - Eigenvalues, Eigenvectors and Cumulative Variance Plot.
Using Eigenvalues, we can obtain information about the contribution to the data variance coming from each principal component individually and cumulatively.
If, for example, our variance threshold is 95%, then PC 1, PC 2 and PC 3 are all the components that we need to consider because they are sufficient to explain nearly 97% of the variance.
We can then "deep dive" into these three components and identify how they are linearly related to the actual or real parameters from the dataset. At this point we consider only those real parameters that contribute significant weight to each of the first 3 PCs. These will ultimately form the subset of reduced parameters for further predictive modeling.
As seen from the above graphic, we have (rather arbitrarily!) chosen the highlighted real parameters - calories, sodium, potassium, vitamins, and rating to form the reduced dataset.
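One way to make this selection less arbitrary is to rank the real parameters by the absolute size of their weights in each leading component and keep the top few. The sketch below does that with a made-up weight matrix; the numbers, the helper name `top_attributes`, and the choice of two parameters per component are all assumptions for illustration, not the values from the graphic.

```python
import numpy as np

def top_attributes(eigenvectors, names, n_components=3, n_top=2):
    """For each leading component, list the attributes with the
    largest absolute weights."""
    picks = {}
    for j in range(n_components):
        weights = np.abs(eigenvectors[:, j])
        idx = np.argsort(weights)[::-1][:n_top]  # largest weights first
        picks[f"PC {j + 1}"] = [names[i] for i in idx]
    return picks

names = ["calories", "sodium", "potassium", "vitamins"]
# Made-up weights: rows are attributes, columns are components.
eigenvectors = np.array([
    [0.10, 0.90, 0.05],
    [0.95, 0.10, 0.10],
    [0.20, 0.30, 0.90],
    [0.05, 0.20, 0.40],
])

picks = top_attributes(eigenvectors, names)
for pc, attrs in picks.items():
    print(pc, attrs)
```

The union of the picked names across the components would then form the reduced parameter list.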
In the final part, we will see why we need to standardize the original dataset before doing the PCA and how this affects the results.
visTASC is "a visual thesaurus of analytics, statistics and complex systems" which provides a portal for aggregating analytics wisdom and current applications, in addition to tutorials, example models and datasets. Download the dataset used in this example and (optionally) sign up for visTASC.