The Analytics Compass Blog

Twice weekly articles to help SMB companies optimize business performance with data analytics and to improve their analytics expertise.


How to best use RapidMiner Plot functions before complex modeling

  
  
  

Experienced data miners often say that 80-90% of the effort in an analytics project goes into data preparation and exploration, and only a small part goes toward building advanced models. To support this way of working, RapidMiner offers a wide variety of data visualization tools for exploration. In this brief article we show how a few of these visual tools can give a good idea of the behavior of the data before any modeling technique is applied.

The data visualization tools are called "plotters" in RapidMiner terminology and can be accessed in the "Results" perspective. An easy way to get to them is to load a data file, connect it to an output port, and run the analysis. For example, you can load the Iris dataset from the "Samples" folder that ships with your RapidMiner installation.

[Figure: loading the Iris sample data in RapidMiner]

When you run this analysis, you are taken to the "Results" perspective, where you can quickly select any plotting function you want. A couple of these basic plot functions are enough to get a good idea of the behavior of the data before doing any serious modeling.

For example, if we plot the data using the "Parallel" plotter and set the "Color Column" to "Label", each line is colored by its class, so we can see which variables separate the classes. As you can see, the classes separate nicely along variables "a3" and "a4", but not so well along "a1" and "a2".

[Figure: RapidMiner parallel plot of the Iris data, colored by label]
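For readers who like to double-check a plot with numbers, here is a minimal Python sketch (outside RapidMiner) of what the parallel plot is showing: for each variable, do the value ranges of the classes overlap? It uses only nine rows of the classic Iris data, the first three of each class, so it is an illustration and not a substitute for the full 150-row sample. The mapping of a1 to a4 onto sepal length, sepal width, petal length and petal width is an assumption about how the sample file is laid out, and the helper names `class_ranges` and `overlapping_pairs` are made up for this example.

```python
# Nine rows of the classic Iris data (first three of each class).
# Assumption: a1..a4 = sepal length, sepal width, petal length, petal width.
ROWS = [
    (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
    (4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
    (4.7, 3.2, 1.3, 0.2, "Iris-setosa"),
    (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
    (6.4, 3.2, 4.5, 1.5, "Iris-versicolor"),
    (6.9, 3.1, 4.9, 1.5, "Iris-versicolor"),
    (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),
    (5.8, 2.7, 5.1, 1.9, "Iris-virginica"),
    (7.1, 3.0, 5.9, 2.1, "Iris-virginica"),
]
ATTRS = ["a1", "a2", "a3", "a4"]


def class_ranges(rows, col):
    """Return {label: (min, max)} for one attribute column."""
    ranges = {}
    for row in rows:
        value, label = row[col], row[-1]
        lo, hi = ranges.get(label, (value, value))
        ranges[label] = (min(lo, value), max(hi, value))
    return ranges


def overlapping_pairs(ranges):
    """List the class pairs whose [min, max] intervals intersect."""
    labels = sorted(ranges)
    pairs = []
    for i, first in enumerate(labels):
        for second in labels[i + 1:]:
            lo1, hi1 = ranges[first]
            lo2, hi2 = ranges[second]
            if lo1 <= hi2 and lo2 <= hi1:
                pairs.append((first, second))
    return pairs


overlaps = {
    name: overlapping_pairs(class_ranges(ROWS, col))
    for col, name in enumerate(ATTRS)
}
for name in ATTRS:
    print(name, "overlapping class pairs:", overlaps[name])
```

On this small sample, a3 and a4 show no overlapping class pairs while a1 and a2 do, which matches the picture in the parallel plot. On the full dataset the separation along a3 and a4 is less perfect than nine rows suggest, so treat the sketch as a sanity check on the plot, not a replacement for it.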

The "Scatter" plot option tells a similar story.

[Figure: RapidMiner scatter plot of the Iris data]

A decision tree is probably a good modeling technique for understanding this data. Sure enough, when we build a decision tree model on this data, we see that a3 and a4 are the attributes chosen for the top splits, starting at the root. The point of this simple article is that before investing a lot of time in a specific modeling technique, we can learn quite a bit about the data by using basic visualization tools such as the "Plotters" in RapidMiner.
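The same hint can be put into a single number per variable without building any model. The Python sketch below is not a decision tree and not something RapidMiner runs for you; it swaps in a simple separation score, the between-class sum of squares divided by the within-class sum of squares, and ranks the variables by it. It uses nine rows of the classic Iris data (the first three of each class), again assumes that a1 to a4 are sepal length, sepal width, petal length and petal width, and the function name `separation_score` is made up for this example.

```python
from statistics import mean

# Nine rows of the classic Iris data (first three of each class).
# Assumption: a1..a4 = sepal length, sepal width, petal length, petal width.
ROWS = [
    (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
    (4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
    (4.7, 3.2, 1.3, 0.2, "Iris-setosa"),
    (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
    (6.4, 3.2, 4.5, 1.5, "Iris-versicolor"),
    (6.9, 3.1, 4.9, 1.5, "Iris-versicolor"),
    (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),
    (5.8, 2.7, 5.1, 1.9, "Iris-virginica"),
    (7.1, 3.0, 5.9, 2.1, "Iris-virginica"),
]
ATTRS = ["a1", "a2", "a3", "a4"]


def separation_score(rows, col):
    """Between-class over within-class sum of squares for one attribute."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[-1], []).append(row[col])
    overall = mean([row[col] for row in rows])
    # How far the class means sit from the overall mean.
    between = sum(
        len(values) * (mean(values) - overall) ** 2
        for values in by_class.values()
    )
    # How much the values scatter around their own class mean.
    within = sum(
        (x - mean(values)) ** 2
        for values in by_class.values()
        for x in values
    )
    return between / within if within > 0 else float("inf")


scores = {name: separation_score(ROWS, col) for col, name in enumerate(ATTRS)}
ranked = sorted(ATTRS, key=scores.get, reverse=True)
for name in ranked:
    print(name, round(scores[name], 2))
```

A higher score means the classes sit further apart relative to their own spread. On this sample a3 and a4 come out on top and a2 comes last, which agrees with what the plotters suggest. A decision tree chooses its splits with a different criterion, so this ranking only hints at which variables a tree is likely to favor.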

If you want to see how to build a decision tree, download our free ebook here.

Comments

Dear Sir,  
really useful information to me. 
Thank you
Posted @ Wednesday, June 13, 2012 12:49 PM by Shashikumartechniu