The Analytics Compass Blog

Twice weekly articles to help SMB companies optimize business performance with data analytics and to improve their analytics expertise.

How to run Principal Component Analysis with RapidMiner - Part 1

  
  
  

In this three-part series, we explore how to use RapidMiner 5.0, the open source analytics package, to run a Principal Component Analysis (PCA). In part 1 we quickly review the background for PCA and explain the application logic. In part 2 we will run a PCA on non-standardized data, and in part 3 we will show how to standardize data before running a PCA (and why one should standardize).

Background - Why do a PCA?

In a previous article we discussed how PCA can add value in business analytics and also pointed out a couple of cautionary issues. To recap, PCA is a technique for reducing the dimension of a dataset by identifying the few most influential parameters (if they exist). This sort of variable screening or feature selection makes it easier to apply other predictive modeling techniques and to interpret their results.

PCA captures the directions that explain the greatest amount of variation in the dataset. It does this by transforming the existing variables into a set of new variables, called "principal components", which have the following properties:

  1. They are uncorrelated with each other.
  2. The first few cumulatively explain a large share of the variance within the data.
  3. They can be related back to the original variables via weighting factors (often called loadings). Original variables with very low weights in the retained principal components can be removed from the dataset.
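
To make these three properties concrete, here is a minimal sketch in Python with NumPy rather than RapidMiner. It is not the RapidMiner process covered in this series: the synthetic dataset, the variable names (x1 to x4), and the `pca` helper are all assumptions made up for illustration. It runs PCA on non-standardized data via the covariance matrix and then checks each property in turn.

```python
import numpy as np

def pca(X):
    """Illustrative PCA via eigendecomposition of the covariance matrix.

    No standardization is applied; variables are only centered.
    """
    Xc = X - X.mean(axis=0)                   # center each variable
    cov = np.cov(Xc, rowvar=False)            # m x m covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder: largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                     # data expressed as principal components
    explained = eigvals / eigvals.sum()       # share of variance per component
    return eigvecs, scores, explained

# Hypothetical dataset: x2 nearly duplicates x1, x4 is low-variance noise.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
x4 = 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3, x4])

weights, scores, explained = pca(X)

# Property 1: the components are uncorrelated (off-diagonals close to zero).
score_corr = np.corrcoef(scores, rowvar=False)
print(np.round(score_corr, 6))

# Property 2: the first few components carry most of the variance.
cumulative = np.cumsum(explained)
print(np.round(cumulative, 4))

# Property 3: each column of `weights` relates one component
# back to the original variables x1..x4.
print(np.round(weights, 3))
```

In this made-up example the first two components account for nearly all of the variance, even though the dataset has four variables.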

The following schematic illustrates how PCA can potentially help in reducing data dimensions with a hypothetical dataset of m variables.

[Figure: principal component analysis logic flow]
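
As a rough sketch of this kind of logic flow (start with m variables, run a PCA, keep the components that explain most of the variance, then drop original variables with very low weights), the Python below shows one possible reading. It is an assumption, not a transcription of the schematic: the `screen_variables` helper, the 95% variance cutoff, and the 0.3 weight cutoff are hypothetical choices, and the data is synthetic.

```python
import numpy as np

def screen_variables(X, names, var_threshold=0.95, weight_threshold=0.3):
    """Hypothetical variable screening based on PCA weights.

    Both thresholds are illustrative defaults, not recommended values.
    """
    Xc = X - X.mean(axis=0)                                  # center, no standardization
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]                        # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Keep the smallest number of components reaching the variance threshold.
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cumulative, var_threshold)) + 1
    k = min(k, X.shape[1])

    # A variable is kept if it has a sizeable weight in any retained component.
    max_weight = np.abs(eigvecs[:, :k]).max(axis=1)
    kept = [name for name, w in zip(names, max_weight) if w >= weight_threshold]
    return k, kept

# Hypothetical dataset with m = 4 variables.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly duplicates x1
x3 = rng.normal(size=n)
x4 = 0.05 * rng.normal(size=n)       # tiny scale compared with the others
X = np.column_stack([x1, x2, x3, x4])

k, kept = screen_variables(X, ["x1", "x2", "x3", "x4"])
print(k, kept)
```

Note that x4 is screened out here largely because it is measured on a much smaller scale than the other variables. On non-standardized data, scale alone can decide which variables look influential, which is the motivation for standardizing in part 3.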

In part 2 we will apply this logic to a real dataset that can be downloaded. Using RapidMiner we will explain how to set up the main process and interpret the results.


Comments

I love this site, I find everything on rapidminer here!
Posted @ Wednesday, May 28, 2014 10:18 AM by Alina GHERMAN