The Analytics Compass Blog

Twice weekly articles to help SMB companies optimize business performance with data analytics and to improve their analytics expertise.

How to run Principal Component Analysis with RapidMiner - Part 2

  
  
  

In part 1 we started off with a brief introduction to principal component analysis (PCA) and its application logic for a business analytics project. In this part, we will start with a real data set and use RapidMiner 5.0 to perform the PCA. For illustrative reasons, we will work with non-standardized (non-normalized) data. In the next part we will standardize the data and explain why it is sometimes important to do so.

The dataset includes ratings and nutritional information on 77 breakfast cereals. There are a total of 15 variables, including 13 numerical parameters. The objective is to reduce this set of 13 numerical predictors to a much smaller list using PCA. The data comes from a publicly available statistical database and can also be downloaded at the end of this article.

Step 1 - Data prep: Remove the non-numeric parameters "Cereal name", "Manufacturer" and "Type (hot or cold)". These are columns A, B and C.
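If you prefer to do this clean-up in a script rather than by hand in the spreadsheet, here is a minimal sketch. The two rows and their values are made up for illustration, and the function name `drop_columns` is our own; only the three column names come from the step above, and the exact header text in your file may differ.

```python
# Hypothetical rows standing in for the cereal spreadsheet; the values are made up.
rows = [
    {"Cereal name": "Cereal A", "Manufacturer": "X", "Type (hot or cold)": "C",
     "calories": 70, "sodium": 130},
    {"Cereal name": "Cereal B", "Manufacturer": "Y", "Type (hot or cold)": "H",
     "calories": 120, "sodium": 15},
]

# Column names as listed in Step 1 (assumed to match the file's header row).
NON_NUMERIC = {"Cereal name", "Manufacturer", "Type (hot or cold)"}


def drop_columns(records, to_drop):
    """Return new records that omit the named columns."""
    return [{key: value for key, value in record.items() if key not in to_drop}
            for record in records]


numeric_rows = drop_columns(rows, NON_NUMERIC)
print(numeric_rows[0])
```

Only the numeric columns remain, which is the form the PCA operator needs.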

Step 2 - Read Excel file into RapidMiner: This can be done using the standard "Read Excel" operator. Start by typing "excel" into the operator search field.

 

Step 3 - PCA Operator: Type the keyword "pca" into the operator search field and drag and drop the PCA operator into the main process window. Connect the output of Read Excel to the "Example set input" (exa) port of the PCA operator.

The three available parameter settings for dimensionality reduction are: none, keep variance and fixed number. Here we use keep variance and leave the variance threshold at its default value of 0.95, or 95%.
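To make the three settings concrete, here is a rough sketch of how each one decides the number of components to keep. This is our own illustration, not RapidMiner's code: the function name, its arguments and the eigenvalues are all hypothetical, and we assume the threshold counts as met once the cumulative proportion of variance reaches or exceeds it.

```python
def components_to_keep(eigenvalues, mode="keep variance",
                       variance_threshold=0.95, number_of_components=1):
    """Return how many leading components survive under each reduction mode."""
    # Sort descending so that PC 1 carries the largest variance.
    ordered = sorted(eigenvalues, reverse=True)
    if mode == "none":
        return len(ordered)  # keep every component
    if mode == "fixed number":
        return min(number_of_components, len(ordered))
    if mode == "keep variance":
        total = sum(ordered)
        cumulative = 0.0
        for count, value in enumerate(ordered, start=1):
            cumulative += value / total  # proportion of variance for this PC
            if cumulative >= variance_threshold:
                return count
        return len(ordered)
    raise ValueError(f"unknown mode: {mode}")


# Made-up eigenvalues for illustration only (not the cereal results).
example = [5.0, 3.0, 1.6, 0.3, 0.1]
print(components_to_keep(example, mode="keep variance"))
print(components_to_keep(example, mode="fixed number", number_of_components=2))
print(components_to_keep(example, mode="none"))
```

With these made-up numbers the first three components carry 96% of the total variance, so a 95% threshold keeps three of the five.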


Step 4 - Build a PCA "model": Type the keywords "apply model" into the operator search field. Drag and drop the "Apply Model" operator into the main process window. Connect the following ports:
- from the PCA operator, "ori"ginal port to the Apply Model "unl"abelled data port
- from the PCA operator, "pre"processing port to the Apply Model "mod"el port
- from the Apply Model operator, output "lab" port to a "res" port and output "mod" port to a "res" port
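Conceptually, applying a PCA model to a dataset means centering each row on the column means and projecting it onto the retained eigenvectors. The sketch below assumes that usual textbook convention; it is not taken from RapidMiner's source. The function name `apply_pca_model` and the tiny two-column "model" are hypothetical.

```python
def apply_pca_model(rows, means, components):
    """Project each numeric row onto the given principal components.

    rows: list of numeric rows
    means: per-column means learned when the model was built
    components: list of eigenvectors, one per retained PC
    """
    scores = []
    for row in rows:
        centered = [x - m for x, m in zip(row, means)]
        # Each score is the dot product of the centered row with one eigenvector.
        scores.append([sum(c * w for c, w in zip(centered, vector))
                       for vector in components])
    return scores


# Hypothetical two-column model, not derived from the cereal data.
means = [100.0, 150.0]
components = [[0.6, 0.8], [-0.8, 0.6]]  # two orthonormal directions
rows = [[110.0, 150.0], [100.0, 160.0]]

scores = apply_pca_model(rows, means, components)
for row_scores in scores:
    print([round(s, 3) for s in row_scores])
```

The output has one column per retained component, which is how 13 predictors can shrink to just a few PC columns.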

Step 5 - Interpret the results: By running the analysis as configured above, RapidMiner will output several tabs in the results panel. By clicking on the "PCA" tab, we will see three PCA-related radio buttons: Eigenvalues, Eigenvectors and Cumulative Variance Plot.

Using Eigenvalues, we can obtain information about the contribution to the data variance coming from each principal component individually and cumulatively.

If, for example, our variance threshold is 95%, then PC 1, PC 2 and PC 3 are all the components that we need to consider, because together they are sufficient to explain nearly 97% of the variance.

We can then "deep dive" into these three components and identify how they are linearly related to the actual or real parameters from the dataset. At this point we consider only those real parameters that contribute a significant weight to each of the first 3 PCs. These will ultimately form the subset of reduced parameters for further predictive modeling.

As seen from the above graphic, we have (rather arbitrarily!) chosen the highlighted real parameters - calories, sodium, potassium, vitamins, and rating - to form the reduced dataset.
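One way to make this choice less arbitrary is to set a cutoff on the absolute eigenvector weight and keep every real parameter that meets it for at least one retained PC. The sketch below is only an illustration of that idea: the cutoff of 0.3, the function name `top_loadings` and the weights are all made up and are not the values from the results above.

```python
def top_loadings(eigenvectors, attribute_names, min_abs_weight=0.3):
    """For each PC, list (attribute, weight) pairs whose |weight| meets the cutoff."""
    selected = {}
    for pc_name, weights in eigenvectors.items():
        pairs = [(name, weight)
                 for name, weight in zip(attribute_names, weights)
                 if abs(weight) >= min_abs_weight]
        # Strongest contributors first.
        pairs.sort(key=lambda pair: abs(pair[1]), reverse=True)
        selected[pc_name] = pairs
    return selected


attributes = ["calories", "sodium", "potassium", "vitamins"]

# Made-up eigenvector weights for illustration only.
eigenvectors = {
    "PC 1": [0.10, 0.95, -0.05, 0.02],
    "PC 2": [0.20, 0.05, 0.97, 0.10],
}

picked = top_loadings(eigenvectors, attributes)
reduced = sorted({name for pairs in picked.values() for name, _ in pairs})
print(picked)
print(reduced)
```

The cutoff itself is still a judgment call, but stating it explicitly makes the selection repeatable.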

In the final part, we will see why we need to standardize the original dataset before doing the PCA and how this affects the results.

visTASC is "a visual thesaurus of analytics, statistics and complex systems" which provides a portal for aggregating analytics wisdom and current applications, in addition to tutorials, example models and datasets. Download the dataset used in this example and (optionally) sign up for visTASC.


Comments

Hi, 
This is a good method to reduce the number of input variables. 
But, can you please suggest a good method to select the input variables from PC scores instead of selecting them arbitrarily, according to the example stated above. 
 
Thanks
Posted @ Saturday, March 02, 2013 7:41 PM by Shailesh Patil
i need uml diagrams and algorithm for principal components analysis.. 
its pretty urgents i would be thankful if anyone responds back.. plz help me out m running out of time!!!
Posted @ Wednesday, July 03, 2013 1:41 PM by sara
Link to Part 3 goes to Part 2!
Posted @ Thursday, August 15, 2013 2:52 PM by V
Thanks! 
This post help me to clean a little bit my data base.
Posted @ Friday, September 13, 2013 9:36 AM by Matias
anyone tell me about the method on which the eigenvector values against the attributes are selected. 
e.g. how 0.62 value is selected for calories.
Posted @ Friday, January 17, 2014 1:27 AM by wajeeha