Blog - The Analytics Compass


How to use k-means clustering to simplify cost modeling: part 2 of 2

  
  
  

In part 1 of this series, we described how a small manufacturer of automotive parts is trying to develop cost models for 22 different product lines. Rather than develop an individual cost model for each of the products, we suggested the use of a cluster analysis technique to bunch several similar products together and work on a reduced set.

In this part, we will demonstrate how RapidMiner can be used to run a K-means cluster analysis with the data shown in the last article. At the end of this article, you will have an opportunity to download the dataset yourself and try it on your own.

Step 1: Indicate ID column in data

When you read the spreadsheet into RapidMiner, make sure that the first column is selected to be of type "ID" as this refers to the product line. This will be helpful while reviewing the results. This is done in Step 4 of the data import process.

[Figure: RapidMiner GUI, selecting the ID column for clustering]
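For readers who want to follow along outside the GUI, here is a minimal Python sketch of the same idea: keep the product-line identifier out of the numeric attributes so it labels the results without taking part in the distance calculations. The rows and the column names `product_line`, `attr_a` and `attr_b` are made up for illustration; only "Revenue" is a column named in this article.

```python
import csv
import io

# Hypothetical rows standing in for the spreadsheet (not the real dataset).
RAW = """product_line,Revenue,attr_a,attr_b
P01,5200,0.10,0.80
P02,880,0.90,0.15
P03,4900,0.20,0.75
"""


def split_id_and_attributes(text, id_column):
    """Return (ids, attribute_names, rows), keeping the ID column out of the numeric data."""
    reader = csv.DictReader(io.StringIO(text))
    attribute_names = [name for name in reader.fieldnames if name != id_column]
    ids, rows = [], []
    for record in reader:
        ids.append(record[id_column])
        rows.append([float(record[name]) for name in attribute_names])
    return ids, attribute_names, rows


ids, attribute_names, rows = split_id_and_attributes(RAW, "product_line")
```

The `ids` list plays the role of the "ID" column: it travels alongside the data but is never fed to the clustering step.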

Step 2: Normalize the data

If you noticed, the Revenue column in the spreadsheet above has a range in the thousands (of dollars), while some of the other columns range from 0 to 1. Normalizing data of this kind ensures that the distances computed to separate the products into clusters are not dominated by differences in scale. Among the many handy data transformation tools that RapidMiner offers is the Normalize operator. Simply hook it up to the output of your 'Read Excel' operator.
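To see what normalization does to the numbers, here is a small Python sketch using z-score scaling (subtract the column mean, divide by the column standard deviation). This is one common choice and an assumption on our part: the steps above do not depend on a particular method. The sample values are made up, and the population standard deviation is used purely for illustration.

```python
from statistics import mean, pstdev


def z_score_columns(rows):
    """Rescale every column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    stds = [pstdev(column) for column in columns]
    return [
        # A constant column has zero spread; map it to 0.0 instead of dividing by zero.
        [(value - m) / s if s > 0 else 0.0 for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical data: a Revenue-like column in the thousands next to a 0-to-1 column.
rows = [[5200.0, 0.10], [880.0, 0.90], [4900.0, 0.20]]
scaled = z_score_columns(rows)
```

After scaling, both columns vary over a comparable range, so neither one dominates a Euclidean distance simply because of its units.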

[Figure: RapidMiner GUI, normalizing the data]

Step 3: Select k-means clustering operator

The main decision here is how many clusters to separate the data into. The parameters dialog box on the right (shown after you select the Cluster operator in the main window) has a field for "k". We selected 5, mainly because this was a reasonable target number of groups for the business problem.
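If you are curious what happens behind the "k" field, the following is a bare-bones Python sketch of the standard k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its members). It is not RapidMiner's implementation; the random initialization, the `seed` parameter and the toy points are our own assumptions for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100, seed=0):
    """Return (assignments, centroids) after running plain k-means."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Toy data (hypothetical): two loose groups of points, clustered with k = 2.
points = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05], [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]]
assignments, centroids = k_means(points, k=2)
```

With the 22 product lines from this series you would pass `k=5` instead. Because the starting centroids are random, different seeds can give different groupings, which is one reason the choice of k deserves a sanity check against the business problem.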

Step 4: Set up evaluation operator

This is a separate step only because the Cluster operator generates two outputs and the Performance operator needs both of them. Note that the upper "clu" output on the Cluster operator should connect to the "clu" input on the Performance operator, which sits below the "exa" input. As a result, the two lines coming into Performance have a "crossed" look. See below.

[Figure: RapidMiner GUI, connecting the Cluster operator to the Performance operator]

Connect the outputs from the Performance operator to the main window "res" sockets and you are all set to run this model.
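The Performance operator needs both the cluster model and the example set because a cluster quality measure compares each example with the centroid it was assigned to. As a rough illustration, here is a Python sketch of one such measure, the average squared distance from each point to its own centroid. Treat the choice of measure as our assumption; the numbers below are made up so the result is easy to check by hand.

```python
def average_within_centroid_distance(points, assignments, centroids):
    """Mean squared Euclidean distance from each point to its assigned centroid."""
    total = sum(
        sum((x - c) ** 2 for x, c in zip(point, centroids[cluster]))
        for point, cluster in zip(points, assignments)
    )
    return total / len(points)


# Hypothetical example: two clusters of two points, each point 1 unit from its centroid.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
assignments = [0, 0, 1, 1]
centroids = [[0.0, 1.0], [10.0, 1.0]]
score = average_within_centroid_distance(points, assignments, centroids)
```

Lower values mean tighter clusters. Keep in mind that this kind of measure always improves as k grows, so it is most useful for comparing runs with the same k.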

Step 5: Results Display (and later Interpretation)

One of the best things about RapidMiner is that it provides so many options for slicing and dicing the results. For an analysis of this sort, we recommend the following process for viewing and understanding the results:

1. Check the "Folder View" under cluster model. This is the fastest way to see which products have been grouped together.

2. Check the "Centroid Table" and "Centroid Plot View" under the cluster model. This helps you see how the clusters separate on the different attributes.

3. Do a visual check: This is only practical if we have very few attributes (or variables). Check the "data view" option under "Example Set(Normalize)" and run a "Scatter Matrix" with plots separated into color coded clusters.
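The first two views in the list above are easy to reproduce by hand, which can help in understanding what they show. The Python sketch below groups IDs by cluster (the idea behind the "Folder View") and averages each attribute per cluster (the idea behind the "Centroid Table"). The function names, IDs and values are hypothetical and are not RapidMiner output.

```python
from collections import defaultdict


def folder_view(ids, assignments):
    """Group item IDs by the cluster they were assigned to."""
    folders = defaultdict(list)
    for item_id, cluster in zip(ids, assignments):
        folders[cluster].append(item_id)
    return dict(sorted(folders.items()))


def centroid_table(rows, assignments):
    """Average every attribute within each cluster."""
    members = defaultdict(list)
    for row, cluster in zip(rows, assignments):
        members[cluster].append(row)
    return {
        cluster: [sum(column) / len(group) for column in zip(*group)]
        for cluster, group in sorted(members.items())
    }


# Hypothetical product lines, cluster labels and (normalized) attribute rows.
ids = ["P01", "P02", "P03", "P04"]
assignments = [0, 1, 0, 1]
rows = [[1.0, 0.0], [0.0, 4.0], [3.0, 0.0], [0.0, 6.0]]

folders = folder_view(ids, assignments)
centroid_rows = centroid_table(rows, assignments)
```

Reading the two results side by side shows which products landed together and which attributes pull the clusters apart.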

A very important step, of course, is to test the validity of the resulting clusters. This requires a separate discussion of its own and is not dealt with here.

Download the dataset used in this example below. If you like tutorials like these, we invite you to (optionally) sign up for visTASC, "a visual thesaurus of analytics, statistics and complex systems", which provides a portal for aggregating analytics wisdom and current applications, in addition to tutorials, example models and datasets.

NOTE: DATASET IS ALREADY NORMALIZED.


Comments

what is the step 5?!
Posted @ Wednesday, March 14, 2012 2:33 AM by m
sorry about the typo - step 5 was incorrectly labeled as step 6
Posted @ Wednesday, March 14, 2012 8:14 PM by Bala Deshpande
I got an email with a wrong dataset, I did not receive the same one as in this tutorial.
Posted @ Sunday, April 28, 2013 1:19 PM by Alex
Where can I download the correct dataset?
Posted @ Sunday, April 28, 2013 1:21 PM by Alex