This is the concluding part of the credit scoring with RapidMiner series. Here we discuss the results of the decision tree analysis and suggest some additional techniques that could have been used to address this business analytics problem.
In part 1 we showed how to import the data set and discussed some common issues users may have in dealing with spreadsheets within RapidMiner. In part 2 we showed the Split Data Validation operator. In part 3 we discussed the settings for the decision tree learner or solver.
Recapping the four steps we mentioned in the previous posts:
1. Read in the data (from a spreadsheet) ---> Completed here
2. Split data into training and testing samples ---> Completed here
3. Train the decision tree and apply the model ---> Completed here
4. Evaluate the performance ---> Follows below
Click on the "Performance" tab to see the table below.
The RapidMiner Performance operator provides several options for checking model validity: accuracy, precision, recall, and ROC charts with the associated AUC. A discussion of common methods to assess classification model quality is available here.
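To make the first three measures concrete, here is a minimal sketch of how accuracy, precision and recall follow from the four cells of a two-class confusion matrix. The function name and the counts are hypothetical and chosen only for illustration; they are not the results of this tutorial.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts.

    tp/fn: actual positives predicted as positive/negative
    fp/tn: actual negatives predicted as positive/negative
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (no predicted or no actual positives)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts, NOT taken from the credit scoring data set
acc, prec, rec = classification_metrics(tp=50, fp=10, fn=20, tn=20)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

Which class counts as "positive" is a choice: for credit scoring, precision and recall for the "bad" class are often of more interest than overall accuracy.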
Finally click on the "Tree" tab to see the Decision Tree itself. Note the instructions in the graphic below.
Several important points must be highlighted:
1. The root node, Balance of Current Account, manages to classify nearly 94% of the data set on its own. This can be verified by hovering the mouse over each of the 3 terminal leaves for this node: the total occurrences (of good and bad) across these leaves come to 937. Specifically, if someone has a Balance of Current Account >= $300, the chance of them having a "good" score is 88% (= 348/394, see graphic below).
2. However, the tree is unable to clearly pick out good or bad scores if there is "no running account" (only a 51% chance). A similar conclusion results if someone has "no balance".
3. If the Balance of Current Account is less than $300, then the other parameters come into effect and play an increasingly important role in deciding if someone is likely to have "good" or "bad" credit.
4. However, numerous terminal leaves have frequencies of occurrence as low as 2 (for example, under "Duration of credit"), which suggests that the tree suffers from overfitting. One way we could have avoided this is by changing the Decision Tree setting "Minimal leaf size" to something like 10 (instead of the default, 2). But in doing so, we would also lose the classification influence of all the other parameters except the root node [try it!].
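The arithmetic behind points 1 and 4 can be sketched in a few lines. The first leaf uses the counts quoted above (348 "good" out of 394); the second leaf and the helper name `summarize` are hypothetical and stand in for one of the very small leaves. The sketch only flags leaves that fall below a chosen minimal leaf size; it does not rebuild or prune the tree.

```python
# Each leaf: (description, good_count, bad_count)
leaves = [
    # Counts from the post: 348 good out of 394, so 46 bad
    ("Balance of Current Account >= $300", 348, 46),
    # Hypothetical tiny leaf, illustrative only
    ("example leaf with 2 records", 2, 0),
]


def summarize(leaves, min_leaf_size=10):
    """Report leaf size, share of 'good' records, and whether the leaf
    holds fewer records than min_leaf_size."""
    report = []
    for description, good, bad in leaves:
        total = good + bad
        report.append({
            "leaf": description,
            "total": total,
            "good_share": good / total,
            "too_small": total < min_leaf_size,
        })
    return report


report = summarize(leaves, min_leaf_size=10)
for row in report:
    print(f"{row['leaf']}: n={row['total']}, "
          f"good={row['good_share']:.0%}, too_small={row['too_small']}")
```

A leaf that is 100% "good" but contains only 2 records looks pure, yet it says very little about new applicants, which is why such leaves are a warning sign for overfitting.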
How does this solution compare to the book solution (link to Amazon)? The book follows a slightly different route: it first eliminates some of the variables that a feature selection method deems unimportant. With the resulting 10 (out of 17) predictors, it obtains an accuracy of about 63% using a CHAID model. With a boosting tree model, it achieves an accuracy of 66%.
In addition to assessing the model's performance by static measures such as accuracy, we can also use Gain/Lift charts, Receiver Operating Characteristic (ROC) charts, and the Area Under the ROC Curve (AUC). An explanation of how these charts are constructed and interpreted is available here.
RapidMiner provides AUC chart comparisons: typically, when a classifier is unable to distinguish between the two classes, the AUC will be close to 0.5. In this exercise, the AUC ranged from a pessimistic estimate of 0.559 to an optimistic estimate of 0.81, with an average of 0.684.
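One way to see why optimistic and pessimistic estimates can differ so much is to compute the AUC by comparing every (positive, negative) pair of examples. The sketch below assumes, and this is an assumption about the tool rather than something stated in this series, that the estimates differ only in how tied confidence scores are credited: fully (optimistic), half (average), or not at all (pessimistic). The scores and labels are made up for illustration.

```python
def auc_variants(scores, labels):
    """Pairwise AUC for binary labels (1 = positive, 0 = negative).

    A positive/negative pair counts as a win when the positive has the
    higher score. Tied pairs are credited 1, 0.5 or 0 depending on the
    variant (assumed treatment of ties, see the note above).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = ties = 0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1
            elif p == n:
                ties += 1
    pairs = len(pos) * len(neg)
    return {
        "optimistic": (wins + ties) / pairs,
        "average": (wins + 0.5 * ties) / pairs,
        "pessimistic": wins / pairs,
    }


# Made-up confidences: only three distinct values, as a small tree
# with three leaves would produce
scores = [0.9, 0.9, 0.9, 0.5, 0.5, 0.5, 0.1, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
result = auc_variants(scores, labels)
print(result)
```

Because a decision tree assigns the same confidence to every example in a leaf, ties are common, and the gap between the optimistic and pessimistic values grows with the number of tied pairs.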
If you like tutorials like these, we invite you to sign up for visTASC, "a visual thesaurus of analytics, statistics and complex systems", which provides a portal for aggregating analytics wisdom and current applications.