In part 1 we showed how to import the data set and discussed some common issues users may have in dealing with spreadsheets within RapidMiner. In part 2 we showed the Split Data Validation operator. Continuing in this part, we will begin the mechanics of constructing the decision tree for the analysis. In the concluding part, we will discuss the accuracy of the analysis, compare it with the book solution and identify other data mining techniques that could be used for this analysis.
Recall the four steps we mentioned in the previous post:
1. Read in the data (from a spreadsheet): completed in part 1
2. Split data into training and testing samples: completed in part 2
3. Train the decision tree and apply the model
4. Evaluate the performance
We will show how to implement step 3 in this article. We will also discuss the parameters to pay attention to while using the Decision Tree operator and what they mean. After following steps 3.1 to 3.3 below, we can modify the tree parameters.
The main parameters to pay attention to are the "Criterion" pull-down menu and the "minimal gain" box. The criterion is essentially the measure used to decide how, and on which attribute, a node should be split.
As discussed earlier, decision trees are built up in a simple five-step process by increasing the information contained in the reduced data set following each split. If that sounds a bit topsy-turvy, think about it like this. Data by its nature contains uncertainties. We may be able to systematically reduce those uncertainties, and thus increase information, through activities like sorting or classifying. When we have sorted or classified to achieve the greatest reduction in uncertainty, we have achieved the greatest increase in information.
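One common way to put a number on this uncertainty is Shannon entropy. The snippet below is a minimal sketch in plain Python, not RapidMiner code; the function name `entropy` and the toy label lists are assumptions made purely for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Sum of p * log2(1/p) over the classes that actually occur.
    return sum((n / total) * math.log2(total / n) for n in counts.values())

# Hypothetical toy data: an evenly mixed set and a perfectly sorted set.
mixed = ["yes", "no", "yes", "no"]
pure = ["yes", "yes", "yes", "yes"]

print(entropy(mixed))  # 1.0 bit: maximum uncertainty for two classes
print(entropy(pure))   # 0.0 bits: no uncertainty left
```

A split that turns a mixed set into pure subsets removes all of the uncertainty, which is the sense in which sorting or classifying "increases information".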
An earlier article explained why entropy is a good measure of uncertainty and how keeping track of it allows us to quantify information. This brings us back to the three options available within RapidMiner for splitting decision trees.
- Information Gain: Simply put, this is the reduction in uncertainty achieved by a split, computed as the entropy before the split minus the weighted entropy after the split. It works fine for most cases, unless a few variables have a large number of distinct values (or classes). Such variables tend to end up as root nodes. This is not a problem, except in extreme cases. For example, each customer ID is unique, so the variable has as many classes as there are customers (each ID is a class). A tree that is split along these lines has no predictive value.
- Gain Ratio (default): usually a good option. Gain ratio overcomes the problem with information gain by taking into account the number of branches that a split would produce before making it.
- Gini Index: also used sometimes, but it does not have many advantages over gain ratio.
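The customer ID problem can be reproduced on a small made-up data set. The sketch below uses textbook definitions of information gain, gain ratio and Gini impurity reduction in plain Python; it is not RapidMiner's implementation, and the attribute names (`customer_id`, `income`) and the 16 rows are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    """Shannon entropy, in bits, of a list of values."""
    total = len(values)
    return sum((n / total) * math.log2(total / n) for n in Counter(values).values())

def gini(values):
    """Gini impurity of a list of values."""
    total = len(values)
    return 1.0 - sum((n / total) ** 2 for n in Counter(values).values())

def label_groups(rows, attr):
    """Group the class labels by the value of the given attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr]].append(row["label"])
    return list(groups.values())

def gain(rows, attr, impurity):
    """Impurity before the split minus the weighted impurity after it."""
    before = impurity([row["label"] for row in rows])
    after = sum(len(g) / len(rows) * impurity(g) for g in label_groups(rows, attr))
    return before - after

def gain_ratio(rows, attr):
    """Information gain divided by the entropy of the attribute's own values."""
    split_info = entropy([row[attr] for row in rows])
    return gain(rows, attr, entropy) / split_info if split_info > 0 else 0.0

# Hypothetical data: 16 customers, each with a unique ID.
# 7 of the 8 high-income customers say "yes"; 7 of the 8 low-income ones say "no".
rows = []
for i in range(16):
    income = "high" if i < 8 else "low"
    if income == "high":
        label = "no" if i == 0 else "yes"
    else:
        label = "yes" if i == 8 else "no"
    rows.append({"customer_id": i, "income": income, "label": label})

for attr in ("customer_id", "income"):
    print(attr,
          "info gain:", round(gain(rows, attr, entropy), 3),
          "gain ratio:", round(gain_ratio(rows, attr), 3),
          "gini gain:", round(gain(rows, attr, gini), 3))
```

On this toy data, information gain ranks `customer_id` above `income`, because splitting into 16 single-row branches leaves no uncertainty at all. Gain ratio divides by the split information, which is large for a 16-way split, and so ranks `income` first.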
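The other important parameter is the "minimal gain" value. A node is split only if the gain from the split exceeds this value, so higher values produce smaller trees. Theoretically this can take any value from 0 upwards. In practice, a minimal gain of 0.2-0.3 is considered usable. The default is 0.1.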
The other parameters (minimal size for a split, minimal leaf size, maximal depth) are determined by the size of the data set. In this case, we proceed with default values.
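To make the role of these parameters concrete, here is a rough sketch of how such thresholds typically act as stopping rules while a tree is being grown. The function `should_split` is hypothetical and is not RapidMiner's code; apart from the minimal gain of 0.1 mentioned above, the default values shown are illustrative assumptions, and minimal leaf size is left out for brevity.

```python
def should_split(best_gain, n_examples, depth,
                 minimal_gain=0.1,          # default quoted in the article
                 minimal_size_for_split=4,  # illustrative assumption
                 maximal_depth=20):         # illustrative assumption
    """Return True if a node may be split under these pre-pruning thresholds."""
    if depth >= maximal_depth:
        return False  # tree is already as deep as allowed
    if n_examples < minimal_size_for_split:
        return False  # too few examples in the node to justify a split
    # Assumed rule: split only when the best available gain exceeds the threshold.
    return best_gain > minimal_gain

print(should_split(best_gain=0.25, n_examples=100, depth=3))  # True
print(should_split(best_gain=0.05, n_examples=100, depth=3))  # False
```

Raising `minimal_gain` makes the second kind of rejection more frequent, which is why larger values lead to shallower trees.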
The last step in training the decision tree is to connect the input ports ("tra"ining) and output ports ("mod"el) as shown.
The model is ready for training. Next, add two more operators, Apply Model and Performance, and we are ready to run the analysis.
Remember to connect the ports correctly, as this is another common source of confusion for new users:
- "mod"el of the Training window to "mod" on Apply Model
- "tes"ting of the Training window to "unl"abeled on Apply Model
- "lab"eled of Apply Model to "lab"eled on Performance
- "per"formance on the Performance operator to "ave"rageable on the output port
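The data flow implied by these connections can be sketched with stand-in Python functions. This is only an analogy for the wiring, not RapidMiner code: the function names are hypothetical, and the "learner" is a trivial majority-class predictor rather than a decision tree, so that the focus stays on what flows between the ports.

```python
from collections import Counter

def train(training_rows):
    """Stand-in learner: the 'model' always predicts the majority training label."""
    majority = Counter(r["label"] for r in training_rows).most_common(1)[0][0]
    return lambda row: majority

def apply_model(model, unlabeled_rows):
    """Attach a prediction to every row, like the 'lab' output of Apply Model."""
    return [dict(row, prediction=model(row)) for row in unlabeled_rows]

def performance(labeled_rows):
    """Fraction of rows whose prediction matches the true label."""
    correct = sum(r["label"] == r["prediction"] for r in labeled_rows)
    return correct / len(labeled_rows)

# Hypothetical training and testing samples.
training = [{"label": "yes"}, {"label": "yes"}, {"label": "no"}]
testing = [{"label": "yes"}, {"label": "no"}]

model = train(training)                # "tra" in, "mod" out
labeled = apply_model(model, testing)  # "mod" and "tes" in, "lab" out
accuracy = performance(labeled)        # "lab" in, "per" out
print(accuracy)
```

Each variable corresponds to one wire in the process: if `model` or `testing` is not passed along, the next step has nothing to work with, which is what a missing port connection amounts to.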
The final step before running the model is to go back to the main perspective by clicking on the blue up arrow (see step 3.5) and connect the output ports "mod"el and "ave" of the Validation operator to the main outputs.
In the concluding article, we will show the tree, discuss the results and compare them to the textbook solution.