Credit scoring is a fairly common business analytics application, and decision trees are among the most intuitive algorithms for understanding how it is done. Credit scoring can be applied to problems such as:
- Prospect filtering: Identify which prospects to extend credit to and how much credit would be an acceptable risk
- Default risk detection: Decide if a particular customer is likely to default
- Bad debt collection: Sort out those debtors who will yield a good ratio of cost (of collection) to benefit (of receiving payment)
We use a data set from here and describe how to use the open source software RapidMiner to build a decision tree for addressing the prospect filtering problem. If you have this book, you can compare the solution shown here to the one given in the book, which uses STATISTICA, a commercial analytics tool.
Setting up a decision tree analysis in RapidMiner is presented very nicely here in a video by Thomas Ott. This article, however, lays out the steps for applying decision trees to credit scoring and points out some common problems that non-expert users encounter when applying these immensely valuable open source tools.
There are four main steps in setting up the analysis:
1. Read in the data (from a spreadsheet)
2. Split data into training and testing samples
3. Train the decision tree
4. Apply the model and evaluate the performance
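For readers who think more easily in code, the four steps above can be sketched in plain Python. This is only an illustration of the workflow, not what RapidMiner does internally: the toy data, the column names, and the helper functions are all made up for this example, and the "tree" is a one-level stump that predicts the majority label for each value of a single attribute.

```python
import csv
import io
import random
from collections import Counter

# Made-up toy data standing in for the spreadsheet (not the article's data set).
RAW = """credit_rating,owns_home,has_savings
good,yes,yes
good,yes,no
bad,no,no
bad,no,yes
good,yes,yes
bad,no,no
good,no,yes
bad,no,no
"""

def read_rows(text):
    """Step 1: read the table into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

def split_rows(rows, train_fraction=0.75, seed=0):
    """Step 2: shuffle a copy of the rows and split it into train and test."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def train_stump(rows, label, attribute):
    """Step 3: one-level tree, the majority label for each attribute value."""
    votes = {}
    for row in rows:
        votes.setdefault(row[attribute], Counter())[row[label]] += 1
    # Fallback for attribute values never seen during training.
    default = Counter(r[label] for r in rows).most_common(1)[0][0]
    rules = {value: c.most_common(1)[0][0] for value, c in votes.items()}
    return lambda row: rules.get(row[attribute], default)

def accuracy(model, rows, label):
    """Step 4: fraction of rows whose predicted label matches the true one."""
    return sum(model(r) == r[label] for r in rows) / len(rows)

rows = read_rows(RAW)
train, test = split_rows(rows)
model = train_stump(train, label="credit_rating", attribute="owns_home")
score = accuracy(model, test, "credit_rating")
print(f"train={len(train)} test={len(test)} accuracy={score:.2f}")
```

A real decision tree chooses which attribute to split on and can split repeatedly; the stump above fixes the attribute by hand to keep the sketch short.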
This first part of the series focuses on step 1, which may seem rather elementary but can consume a lot of time if not done properly. The next few parts will describe the other steps in detail.
Step 1: Read in the data
RapidMiner’s easy interface allows quick importing of spreadsheets. The best part about the interface is the panel on the left, called “Operators”. Typing text in the box provided automatically pulls up all available RapidMiner operators that match the text, which is pretty handy. In this case, we need an operator to read an Excel spreadsheet, so we simply type “excel” in the box. The two Excel operators are immediately shown below the box: one for reading and one for exporting data.
Either double-click on the “Read Excel” operator or drag and drop it into the “Main Process” panel; the effect is the same. Once the Read Excel operator appears in the main process window, we need to configure the data import process. This means telling RapidMiner which columns to import, what each column contains, and whether any of the columns need special treatment.
This is probably the most cumbersome part of this step. RapidMiner has a feature to automatically detect (or guess) the value types. Even so, it is a good exercise for the analyst to make sure that the right columns are picked (or excluded). Also, the tool is sometimes unable to treat the first row of the spreadsheet as column names, in which case the user has to manually enter the names for all the columns in the attribute field.
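To make the two ideas in this paragraph concrete, here is a rough Python sketch of what guessing value types and falling back to manually supplied column names might look like. It is not RapidMiner's actual logic: the function names, the type names, the placeholder names `col1`, `col2`, and so on, and the sample rows are all assumptions made for illustration, and cells are assumed to arrive as strings.

```python
def guess_type(values):
    """Guess a column type from its cell values (assumed to be strings)."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return "nominal"
    # Try the strictest type first, then relax.
    for name, cast in (("integer", int), ("real", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "nominal"

def load_table(rows, first_row_is_header=True, names=None):
    """Return (column_names, data_rows).

    When the first row does not hold the names, use the names supplied
    by the analyst, or generic placeholders if none are given.
    """
    if first_row_is_header:
        return list(rows[0]), rows[1:]
    if names is None:
        names = [f"col{i + 1}" for i in range(len(rows[0]))]
    return list(names), rows

# Hypothetical sample rows; the real spreadsheet has different columns.
rows = [
    ["Credit Rating", "Age", "Income"],
    ["good", "34", "52000.5"],
    ["bad", "51", "18000"],
]

names, data = load_table(rows)
types = {n: guess_type([r[i] for r in data]) for i, n in enumerate(names)}
print(types)
```

Printing the guessed types and reading them over is the code equivalent of the manual check recommended above: an identifier column guessed as "integer", for instance, is one the analyst would probably want to exclude or retype.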
Once the data is imported, we must assign the target variable for the analysis, also known as the “Label”. In this case, it is the Credit Rating (column A). Finally, it is a good idea to “run” RapidMiner and generate results to ensure that all columns are read correctly, as demonstrated in Tom Ott’s video above.
In the next part, we will split the available data into training and testing components and set up the decision tree. In the last part, we will validate our model and evaluate its performance to draw some conclusions. We will also compare our results with the solution provided in the book.
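In code terms, assigning a label amounts to separating one column (the target) from the rest (the predictors), and the final "run and check" corresponds to a quick sanity check of the result. The sketch below assumes a table already loaded as a list of names and a list of rows; apart from "Credit Rating", the column names, the values, and the `assign_label` helper are hypothetical.

```python
from collections import Counter

def assign_label(names, data, label):
    """Split each row into (features, target) using the named label column."""
    if label not in names:
        raise ValueError(f"label column {label!r} not found in {names}")
    idx = names.index(label)
    feature_names = names[:idx] + names[idx + 1:]
    features = [row[:idx] + row[idx + 1:] for row in data]
    targets = [row[idx] for row in data]
    return feature_names, features, targets

# Hypothetical rows; only the "Credit Rating" label comes from the article.
names = ["Credit Rating", "Age", "Income"]
data = [
    ["good", "34", "52000"],
    ["bad", "51", "18000"],
    ["good", "45", "61000"],
]

feature_names, features, targets = assign_label(names, data, "Credit Rating")

# Sanity checks: every row kept its remaining columns, and the label
# values look like credit ratings rather than, say, numbers.
assert all(len(row) == len(feature_names) for row in features)
print(feature_names)
print(Counter(targets))
```

Looking at the counts of each label value is a cheap way to catch a wrongly chosen label column before any model is trained.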
Originally posted on Thu, Apr 07, 2011 @ 07:54 AM