How to use k-means clustering to simplify cost modeling: part 2 of 2
In part 1 of this series, we described how a small manufacturer of automotive parts is trying to develop cost models for 22 different product lines. Rather than develop an individual cost model for each of the products, we suggested the use of a cluster analysis technique to bunch several similar products together and work on a reduced set.
In this part, we will demonstrate how RapidMiner can be used to run a k-means cluster analysis with the data shown in the last article. At the end of this article, you will have an opportunity to download the dataset and try it yourself.
Step 1: Indicate ID column in data
When you read the spreadsheet into RapidMiner, make sure that the first column is selected to be of type "ID" as this refers to the product line. This will be helpful while reviewing the results. This is done in Step 4 of the data import process.
Step 2: Normalize the data
You may have noticed that the Revenue column in the spreadsheet has a range in the thousands (of dollars), while some of the other columns range from 0 to 1. Normalizing data of this kind ensures that the distances computed when separating products into clusters are not dominated by scale factors. Among the many handy tools that RapidMiner offers for data transformation is the Normalize operator. Simply hook it up to the output of your 'Read Excel' operator.
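To make the scale issue concrete, here is a minimal Python sketch of one common normalization, the z-score transformation. This article does not say which method was selected in the Normalize operator, so treat the choice of z-scores (and the use of the population standard deviation) as an assumption. The function name and the sample values are hypothetical and are not taken from the dataset.

```python
from statistics import mean, pstdev

def z_score_normalize(rows):
    """Rescale every column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        # A constant column has no spread, so map it to 0 instead of dividing by 0
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical values: revenue in dollars and a ratio between 0 and 1
rows = [
    [12000.0, 0.10],
    [45000.0, 0.55],
    [30000.0, 0.90],
]
normalized = z_score_normalize(rows)
for row in normalized:
    print([round(v, 3) for v in row])
```

After this step both columns are on a comparable scale, so a difference in revenue no longer swamps a difference in the ratio when distances are computed.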
Step 3: Select k-means clustering operator
The main decision here is how many clusters to separate the data into. After you select the Cluster operator in the main window, the parameters panel on the right shows a field for "k". We selected 5, mainly because this was a reasonable number of groups to target for the business problem.
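For readers who want to see what a k-means operator does conceptually, here is a minimal Python sketch of the standard k-means procedure (Lloyd's algorithm): assign each point to its nearest centroid, recompute the centroids, and repeat until the assignments stop changing. This is not RapidMiner's implementation. The function names, the seed, and the sample points are hypothetical, and the sketch uses k = 3 on six made-up points rather than k = 5 on the 22 product lines.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=100, seed=0):
    """Return (labels, centroids) after running Lloyd's algorithm."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical, already-normalized points with two attributes each
points = [
    [0.0, 0.0], [0.0, 1.0],
    [10.0, 10.0], [10.0, 11.0],
    [20.0, 0.0], [20.0, 1.0],
]
labels, centroids = k_means(points, k=3)
print(labels)
print(centroids)
```

Because the starting centroids are random, different seeds can produce different groupings, which is one reason the choice of k and the stability of the result deserve a second look.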
Step 4: Set up evaluation operator
This is a separate step only because the Cluster operator generates two outputs and the Performance operator needs both of them. Note that the upper "clu" output on the Cluster operator should connect to the "clu" input on the Performance operator, which sits below the "exa" input. As a result, the two lines coming into Performance have a "crossed" look.
Connect the outputs from the Performance operator to the main window "res" sockets and you are all set to run this model.
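To see why an evaluation step needs both the clustered examples and the cluster model, consider one widely used measure for centroid-based clustering: the average squared distance from each point to the centroid of its own cluster. This article does not name the measure reported by the Performance operator, so the choice here is an assumption, and the function name and data in this Python sketch are hypothetical.

```python
def average_within_centroid_distance(points, labels, centroids):
    """Mean squared distance from each point to its assigned centroid.

    points and labels play the role of the clustered example set;
    centroids play the role of the cluster model.
    """
    total = sum(
        sum((x - c) ** 2 for x, c in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )
    return total / len(points)

# Hypothetical data: two tight clusters of two points each
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
labels = [0, 0, 1, 1]
centroids = [[0.0, 1.0], [10.0, 11.0]]

score = average_within_centroid_distance(points, labels, centroids)
print(score)  # 1.0, since every point sits exactly 1 unit from its centroid
```

Lower values mean tighter clusters, but the number always shrinks as k grows, so it is most useful for comparing runs with the same k.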
Step 5: Results Display (and later Interpretation)
One of the best things about RapidMiner is that it provides so many options for slicing and dicing the results. For an analysis of this sort, we recommend the following process for viewing and understanding the results:
1. Check the "Folder View" under cluster model. This is the fastest way to see which products have been grouped together.
2. Check the "Centroid Table" and "Centroid Plot View" under the cluster model. This helps you see how the clusters separate on the different attributes.
3. Do a visual check: This is only practical if we have very few attributes (or variables). Check the "data view" option under "Example Set(Normalize)" and run a "Scatter Matrix" with plots separated into color coded clusters.
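The first two views in the list above boil down to two simple summaries: which IDs fall into each cluster, and the mean of each attribute within each cluster. The Python sketch below reproduces both outside RapidMiner under stated assumptions: the function names `folder_view` and `centroid_table` are hypothetical labels borrowed from the view names, and the IDs, attribute values, and cluster labels are made up for illustration.

```python
from collections import defaultdict

def folder_view(ids, labels):
    """Group item IDs by the cluster they were assigned to."""
    folders = defaultdict(list)
    for item_id, label in zip(ids, labels):
        folders[label].append(item_id)
    return dict(folders)

def centroid_table(rows, labels):
    """Per-cluster mean of every attribute."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in grouped.items()
    }

# Hypothetical product IDs, normalized attributes, and cluster labels
ids = ["P01", "P02", "P03", "P04"]
rows = [[-1.0, 0.5], [-0.8, 0.7], [1.2, -0.6], [0.6, -0.6]]
labels = [0, 0, 1, 1]

folders = folder_view(ids, labels)
table = centroid_table(rows, labels)
print(folders)
print(table)
```

Reading the two summaries side by side shows both which products were grouped together and which attributes pull the groups apart.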
A very important step, of course, is to test the validity of the resulting clusters. This requires a separate discussion of its own and is not covered here.
Download the dataset used in this example and (optionally) sign up for visTASC. If you like tutorials like these, we invite you to sign up for visTASC, "a visual thesaurus of analytics, statistics and complex systems" which provides a portal for aggregating analytics wisdom and current applications, in addition to tutorials, example models and datasets.
NOTE: DATASET IS ALREADY NORMALIZED.