In this three-part series, we explore how to use RapidMiner 5.0, the open source analytics package, to run a Principal Component Analysis (PCA). In part 1 we quickly review the background for PCA and explain the application logic. In part 2 we run a PCA on non-standardized data, and in part 3 we show how to standardize data before running a PCA (and why one should standardize).
Background - Why do a PCA?
In a previous article we discussed how PCA can add value in business analytics and pointed out a couple of cautionary issues. To recap, PCA is a technique that reduces the dimensionality of a dataset by identifying the few most influential parameters, if they exist. This sort of variable screening or feature selection makes it easier to apply other predictive modeling techniques and to interpret their results.
PCA captures the parameters that explain the greatest amount of variation in the dataset. It does this by transforming the existing variables into a set of "principal components": new variables that are linear combinations of the original ones, are uncorrelated with one another, and are ordered by the amount of variance each one explains.
The following schematic illustrates how PCA can potentially help in reducing data dimensions with a hypothetical dataset of m variables.
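The same idea can be sketched in a few lines of Python with NumPy. This is not the RapidMiner process used in this series, only an illustration under stated assumptions: the dataset is synthetic, with m = 5 observed variables generated from 2 hidden factors plus a little noise, and the function name `pca` is our own. The sketch computes the covariance matrix, takes its eigenvectors as the principal components, and reports how much of the total variance each component explains.

```python
import numpy as np

def pca(data):
    """Return (eigenvalues, eigenvectors, scores) for a rows-by-variables array."""
    centered = data - data.mean(axis=0)        # center each variable on its mean
    cov = np.cov(centered, rowvar=False)       # m x m covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder by descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs                # data expressed in component coordinates
    return eigvals, eigvecs, scores

# Hypothetical data (an assumption for illustration, not the series' dataset):
# 5 observed variables driven by 2 hidden factors plus small noise.
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 5))
data = factors @ loadings + 0.05 * rng.normal(size=(200, 5))

eigvals, eigvecs, scores = pca(data)
explained = eigvals / eigvals.sum()            # share of variance per component
cumulative = np.cumsum(explained)              # running total of explained variance

print("share of variance per component:", np.round(explained, 3))
print("cumulative share:", np.round(cumulative, 3))
```

Because the synthetic data has only two underlying factors, the first two components should account for nearly all of the variance, which is the situation in which replacing m variables with a handful of components loses little information. The data here is not standardized; variables on very different scales would need that step first.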
In part 2 we will apply this logic to a real dataset that can be downloaded. Using RapidMiner we will explain how to set up the main process and interpret the results.