This article introduces a very useful feature of RapidMiner for data aggregation: the "Pivot" operator. This works very similar to the standard XL pivot table and therefore it is helpful to know how to set the parameters of the operator to get similar results.
An important fact to keep in mind is that a pivot operation simply transforms a data table's structure and should not reduce the number of entries (or cells in a spreadsheet context). However, the pivot operator may increase the number of cells. As explained in the graphic below, the raw data has only 18 entries, however the pivot table will have 20 cells.
A pivot table created using XL is shown below. Customer ID is selected as "row label" and Product ID is selected as "column label". CPI makes up "Values" of each cell. Observe that there are now 20 total data cells (ignoring the Grand Total column).
Comparing the XL Pivot with RapidMiner Pivot, we will notice 2 main differences.
First difference is that non-numeric terms (called "missing values" in the example) are not replaced by 0's in RapidMiner as they are by default in XL. Second difference: column label has a prefix which corresponds to the header of the data ("Values") column. Finally, just like in XL, missing entries also become missing values or blank cells.
Setting up the RapidMiner Pivot operator process
We illustrate in detail below the process for setting up the pivot operator using a real world example. There may be additional steps needed for other cases. First figure below is a snapshot of the Health, Nutrition and Population (HNP) statistics data set which includes data from 240 countries for the last 50 years. This data is downloadable freely from this site.
Our objective is to set up a subset of data for one year – in this case we selected 2006. Each row of this new reduced dataset will hold statistics for one of the 240 countries and each column holds one of the 253 HNP attributes for the year 2006. Obviously we will need a pivot table operation and the figures below show the sequence of steps followed to get to the final outcome.
STEP 1: Import the raw data from above and select attributes to pivot on
This is similar to choosing the fields in XL. If this process was done using XL, Country_Name would go into the "Row Labels" field and Indicator_Name into the "Column Labels". The 2006 column would go into the "Values" field.
STEP 2: Set up the Pivot operator parameters in RapidMiner
Therefore, the "group attribute" corresponds to "Row Label" in XL and "index attribute" corresponds to "Column Label" in XL. Since we already selected the 2006 column using the "Select Attributes" operator, we are ready to run the process. The graphic below shows the output from the above RapidMiner process.
As seen above, each attribute in the resultant pivot table is prefixed by "2006_". We have 240 rows and 253 attributes - Rapidminer also considers the Country_Name as an attribute here, but for the purposes of further analysis, this column could be ignored.
Sign up for our blog to get regular "How to" articles such as these.