The Analytics Compass Blog

Twice weekly articles to help SMB companies optimize business performance with data analytics and to improve their analytics expertise.

How to use decision trees for credit scoring using RapidMiner: Part 1

  
  
  

Credit scoring is a fairly common business analytics application. Some types of problems where credit scoring could be applied are:

  1. Prospect filtering: Identify which prospects to extend credit to and how much credit would be an acceptable risk
  2. Default Risk detection: Decide if a particular customer is likely to default  
  3. Bad debt collection: Identify those debtors for whom the benefit (of receiving payment) best justifies the cost (of collection).
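To make the cost-to-benefit idea in the third item concrete, here is a minimal Python sketch. The debtors, amounts, payment probabilities and collection costs are all made-up illustration values, and ranking by expected payment per unit of collection cost is just one simple assumption about how such a sort could be done, not a method taken from the book or from RapidMiner.

```python
# Hypothetical debtors: (name, amount owed, estimated probability of payment,
# estimated cost of collection). All values are invented for illustration.
debtors = [
    ("A", 1000.0, 0.50, 100.0),
    ("B", 200.0, 0.90, 150.0),
    ("C", 5000.0, 0.10, 250.0),
]


def benefit_to_cost(amount, p_pay, cost):
    """Expected payment received per unit of collection cost."""
    return (amount * p_pay) / cost


# Rank debtors so that the most worthwhile collection efforts come first.
ranked = sorted(
    debtors,
    key=lambda d: benefit_to_cost(d[1], d[2], d[3]),
    reverse=True,
)

for name, amount, p_pay, cost in ranked:
    print(name, round(benefit_to_cost(amount, p_pay, cost), 2))
```

In practice the payment probability would itself come from a scoring model such as the decision tree built in this series.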

We use a data set from here and describe how to use the open source software RapidMiner to build a decision tree for addressing the prospect filtering problem. If you have this book, you can compare the solution shown here to the one given in the book, which uses STATISTICA, a commercial analytics tool.

Setting up a decision tree analysis in RapidMiner is presented very nicely in a video by Thomas Ott. However, our article lays out the steps for applying decision trees to credit scoring and covers some common problems encountered by non-expert users of these immensely valuable open source analysis tools.

There are four main steps in setting it up:

1. Read in the data (from a spreadsheet)

2. Split data into training and testing samples

3. Train the decision tree

4. Apply the model and evaluate the performance
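For readers who think in code, the four steps above can be sketched outside RapidMiner in plain Python. This is only an illustration under stated assumptions: the data is a tiny invented table (the column names credit_rating and duration_months are hypothetical), and the "tree" is a single-split decision stump standing in for a real decision tree learner. It is not what RapidMiner's Decision Tree operator does internally.

```python
import csv
import io
import random

# Hypothetical stand-in for the spreadsheet: the label comes first,
# followed by one numeric attribute. All values are invented.
RAW = """credit_rating,duration_months
good,6
good,12
good,10
good,8
bad,36
bad,48
bad,30
bad,42
good,9
bad,40
"""

# Step 1: read in the data.
rows = [
    (r["credit_rating"], float(r["duration_months"]))
    for r in csv.DictReader(io.StringIO(RAW))
]

# Step 2: split into training and testing samples (70/30, fixed seed).
rng = random.Random(0)
rng.shuffle(rows)
cut = int(0.7 * len(rows))
train, test = rows[:cut], rows[cut:]


def predict(threshold, low_label, high_label, value):
    """Apply a single split: one label at or below the threshold, another above."""
    return low_label if value <= threshold else high_label


# Step 3: train a one-split "tree" (decision stump) by trying every midpoint
# between neighbouring attribute values and keeping the most accurate split.
# Assumes the training data has at least two distinct attribute values.
def train_stump(data):
    labels = sorted({label for label, _ in data})
    values = sorted({v for _, v in data})
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for threshold in thresholds:
        for low in labels:
            for high in labels:
                correct = sum(
                    predict(threshold, low, high, v) == label for label, v in data
                )
                if best is None or correct > best[0]:
                    best = (correct, threshold, low, high)
    return best[1:]


model = train_stump(train)

# Step 4: apply the model to the held-out rows and evaluate the performance.
accuracy = sum(predict(*model, v) == label for label, v in test) / len(test)
print("split:", model, "test accuracy:", accuracy)
```

The point of the sketch is the order of operations: the model only ever sees the training rows, and performance is measured on rows it has not seen.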


This first part of the series focuses on step 1, which may seem rather elementary, but can consume a lot of time if not done properly. The next few parts will describe other steps in detail.

Step 1: Read in the data

RapidMiner's easy interface allows quick importing of spreadsheets. The best part of the interface is the panel on the left, called "Operators". Simply typing text in the box provided automatically pulls up all available RapidMiner operators that match the text - pretty handy! In this case, we need an operator to read an Excel spreadsheet, so we simply type "excel" in the box. As you can see, the two Excel operators are immediately shown below: one for reading and one for exporting data.

[Figure: RapidMiner GUI, reading Excel]

Either double click on the "Read Excel" operator or drag and drop it into the "Main Process" panel - the effect is the same. Once the Read Excel operator appears in the main process window, we need to configure the data import process. This means telling RapidMiner which columns to import, what the columns contain, and whether any of the columns need special treatment.

[Figure: RapidMiner GUI, Read Excel configuration]

This is probably the most "cumbersome" part of this step. RapidMiner has a feature to automatically detect value types ("Guess Value types"), but it is a good exercise for the analyst to make sure that the right columns are picked (or excluded). Also, the tool is sometimes unable to treat the first row of the spreadsheet as column names, in which case the user has to manually enter the names for all the columns in the attribute field as shown here.
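The two chores described above, guessing value types and entering column names by hand when the first row is not a header, can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration: the sample rows, the column names, and the type labels ("integer", "real", "nominal"). It does not reproduce how RapidMiner's own type guessing works.

```python
import csv
import io

# Hypothetical export with NO header row; values are invented.
RAW = "good,12,1500.5,yes\nbad,48,7200,no\ngood,9,800,yes\n"

# Column names entered by hand, as one would in the attribute field.
NAMES = ["credit_rating", "duration_months", "amount", "has_phone"]


def guess_type(values):
    """Return 'integer', 'real' or 'nominal' for a column of strings."""
    for kind, cast in (("integer", int), ("real", float)):
        try:
            for v in values:
                cast(v)
            return kind
        except ValueError:
            continue
    return "nominal"


rows = list(csv.reader(io.StringIO(RAW)))
columns = list(zip(*rows))  # turn the list of rows into a list of columns

guessed = {name: guess_type(col) for name, col in zip(NAMES, columns)}
for name, kind in guessed.items():
    print(f"{name}: {kind}")
```

A guess like this is worth reviewing by eye, for the same reason the article recommends checking RapidMiner's guess: a numeric-looking ID or code column would be typed as a number even though it should be excluded or treated as a category.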

[Figure: RapidMiner, problems loading Excel data]

Once the data is imported, we must assign the target variable for analysis, also known as a "Label". In this case, it is the Credit Rating (column A) - see sub-step 2 shown in the figure above. Finally, it is a good idea to "run" RapidMiner and generate results to ensure that all columns are read correctly, as demonstrated in Tom Ott's video mentioned above.
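As a rough analogue of assigning the label and then checking that the columns were read correctly, the following Python sketch separates a label column from the remaining attributes and runs two basic sanity checks. The header names and rows are hypothetical; only the idea that the credit rating is the first column comes from the article.

```python
from collections import Counter

# Hypothetical rows as read from the spreadsheet; the first column
# (column A) holds the credit rating. Values are invented.
header = ["credit_rating", "duration_months", "amount"]
rows = [
    ["good", 12, 1500.0],
    ["bad", 48, 7200.0],
    ["good", 9, 800.0],
]

LABEL = "credit_rating"
label_idx = header.index(LABEL)

# Separate the target variable (label) from the predictor attributes.
labels = [row[label_idx] for row in rows]
attributes = [[v for i, v in enumerate(row) if i != label_idx] for row in rows]
attribute_names = [name for name in header if name != LABEL]

# Sanity checks: every row has the expected number of columns,
# and the label takes only the values we expect to see.
assert all(len(row) == len(header) for row in rows)
class_counts = Counter(labels)
print(attribute_names, class_counts)
```

Looking at the class counts at this stage is a cheap way to spot a mis-assigned label column, for example one with hundreds of distinct values.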

In the next part, we will split the available data into training and testing components and set up the decision tree analysis. In the last part, we will validate our model and evaluate its performance to draw some conclusions. We will also compare our results with the solution provided in the book.


Comments

Great post. Hope you guys finish the series.
Posted @ Tuesday, April 26, 2011 4:07 PM by cb
@cb - Thanks much! We will finish it soon and make available an ebook which you can download that summarizes all parts.
Posted @ Wednesday, April 27, 2011 10:53 AM by Bala Deshpande
Morning 
willing to get a step by step guideline on using RapidMiner with data sample application. Also would like to know how to retrieve Java program or pseudo program or using Matlab extension in RapidMiner
Posted @ Saturday, May 21, 2011 11:23 AM by Solidarite Oblige
Hi 
 
I've just spent 2 hours trying to find the credit scoring dataset online, to be able to follow this exercise, to no avail.  
 
For some reason Elsevier put out only selected datasets on book companion site 
http://www.elsevierdirect.com/companion.jsp?ISBN=9780123747655 
Could you kindly make this xls available ? 
 
I tried to use the German original which is referred to in the book (http://www.stat.uni-muenchen.de/service/datenarchiv/kredit/kredit_e.html) but it seems very different from the columns that are seen in your screenshots. 
 
Thanks a lot for your Rapidminer tutorials anyhow.
Posted @ Thursday, July 14, 2011 1:35 PM by ad
nice
Posted @ Saturday, August 24, 2013 9:13 PM by olu