One of the first questions to answer when building a decision tree in RapidMiner is which parameter settings will yield the "best" possible model. RapidMiner offers at least seven different parameter options, and selecting proper values for these to produce a good decision tree model can be difficult.
Some parameters, such as the type of decision criterion, are fairly easy to choose. For example, the default "gain ratio" criterion is usually a safe choice over the other three (information gain, gini index, and accuracy). However, the remaining six parameters accept a range of numerical values, and one "size" does not necessarily fit all data.
The solution in such a situation is to optimize the parameter selection using one of the optimization operators within RapidMiner. This article discusses the setup and analysis of this process.
Objective: Identify the best settings of parameters for a decision tree analysis.
We will use the liver disease patient dataset from the UCI Machine Learning Repository to illustrate this process. (Both the dataset and the RapidMiner process can be downloaded at the end of this article.)
This data set contains 416 liver patient records and 167 non-liver patient records. The label or target variable divides the data set into two classes or groups (liver disease: Yes or No); the labeling is based on expert decision. The objective of classification is to predict the class of an example based on the underlying predictors.
There are 10 attributes or predictors, including the patient's age, gender, and several blood and serum laboratory measurements.
After the data set is loaded into the RapidMiner repository, there are three operations necessary before running the parameter optimization: Set Role to identify the label variable, Discretize to convert the label from a numeric to a nominal attribute (the Decision Tree operator works only on nominal target variables), and finally Split Data to create a hold-out or unseen dataset to verify whether there is any overfitting.
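The discretization step amounts to mapping numeric label codes to nominal class names. As a minimal sketch outside RapidMiner (the specific numeric codes used here are an assumption for illustration; check the actual coding in the downloaded dataset):

```python
# Sketch: convert a numeric label column to a nominal Yes/No class.
# The code assignment (1 = liver patient, 2 = non-liver patient) is an
# illustrative assumption, not taken from the article.

def discretize_label(raw_labels):
    """Map numeric label codes to nominal 'Yes'/'No' classes."""
    mapping = {1: "Yes", 2: "No"}
    return [mapping[v] for v in raw_labels]

labels = discretize_label([1, 2, 1, 1, 2])
print(labels)  # ['Yes', 'No', 'Yes', 'Yes', 'No']
```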
A typical predictive analytics modeling effort requires dividing the dataset into a training and a validation set. The decision tree model is built on the training set and its classification accuracy is then tested on the validation set. In RapidMiner this is accomplished using one of the validation operators (such as Split Validation or Cross Validation). However, splitting the data into three parts (training, validation, and hold-out) further reduces the chance of overfitting the data.
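The three-way split itself is simple to picture. A minimal sketch, assuming illustrative 60/20/20 proportions (the article does not prescribe specific fractions):

```python
import random

def three_way_split(examples, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle and split examples into training, validation, and hold-out sets.

    The 60/20/20 proportions are an illustrative assumption; the remainder
    after the training and validation slices becomes the hold-out set.
    """
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    holdout = shuffled[n_train + n_valid:]
    return train, valid, holdout

# 583 records = 416 liver + 167 non-liver patients
train, valid, holdout = three_way_split(list(range(583)))
print(len(train), len(valid), len(holdout))  # 349 116 118
```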
Using the Optimization Operator:
RapidMiner offers three choices for optimizing parameters: a simple grid search, a quadratic-interaction search, and an evolutionary (genetic-algorithm-based) search. All three take an example set as input and output the final optimized parameter settings, the performance of the optimization, and result metadata. Optimize Parameters is a nested operator, which means that we must insert additional operators within it. Each operation nested inside is performed and the results are compared against a stopping criterion. The optimizer requires a Performance evaluation operator nested inside; the main criterion selected for the Performance operator decides whether optimization has been achieved.
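The grid-search variant can be pictured as a plain loop: enumerate every parameter combination, run the nested validation for each, and keep the combination whose main criterion is best. A minimal sketch, where `evaluate` is a stand-in for the nested validation subprocess (the toy scoring function below is purely illustrative):

```python
from itertools import product

def optimize_parameters(param_grid, evaluate):
    """Exhaustive grid search: try every combination, keep the best score.

    `evaluate` stands in for the nested validation subprocess; it takes a
    dict of parameter settings and returns the main criterion (accuracy).
    """
    best_params, best_score = None, float("-inf")
    names = sorted(param_grid)
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy criterion: pretend accuracy peaks at depth 10 with minimal gain 0.01.
toy = lambda p: 0.9 - abs(p["maximal_depth"] - 10) * 0.01 - p["minimal_gain"]
grid = {"maximal_depth": [5, 10, 20], "minimal_gain": [0.01, 0.1]}
params, score = optimize_parameters(grid, toy)
print(params, round(score, 3))  # {'maximal_depth': 10, 'minimal_gain': 0.01} 0.89
```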
What does the Optimization operator optimize?
Using Edit Parameter Settings, we specify all the parameters to be optimized. These may be any of the parameter settings of the operators nested inside that can be varied. In this example, the operator nested inside is Split Validation, which in turn nests Decision Tree, Apply Model, and Performance. Note that only the Split Validation and Decision Tree operators have parameters that can be changed dynamically from iteration to iteration. This is shown in the graphic below.
In this example, we are interested in finding the best combination of parameters for the decision tree model. These include the following:
- Minimal size for split
- Minimal leaf size
- Minimal gain
- Maximal depth of tree
- Decision criterion
The Optimize Parameters operator allows us to choose a range of values to cycle through for each of the above five choices, and each run is evaluated on the criterion selected in the Performance operator. In this case, we choose classification accuracy as the optimization criterion.
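It helps to get a concrete feel for the size of the search space: every extra parameter, and every extra value per parameter, multiplies the number of validation runs. The ranges below are illustrative assumptions, not recommended values:

```python
from itertools import product

# Hypothetical value ranges for the five decision-tree parameters; each
# combination costs one full nested validation run inside the optimizer.
param_grid = {
    "criterion": ["gain_ratio", "information_gain", "gini_index", "accuracy"],
    "minimal_size_for_split": [2, 4, 8],
    "minimal_leaf_size": [1, 2, 4],
    "minimal_gain": [0.01, 0.05, 0.1],
    "maximal_depth": [5, 10, 20],
}

combinations = list(product(*param_grid.values()))
print(len(combinations))  # 324 validation runs even for this modest grid
```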
Analyzing and applying results from Optimization
The optimization yields the values of the five parameters that give the best classification accuracy. These parameters may then be used in the final decision tree model. Finally, the best-performing tree must be checked against the unseen or hold-out dataset to verify whether there is any overfitting.
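The hold-out check boils down to comparing validation accuracy against hold-out accuracy: a large gap suggests the tuned tree has overfit. A minimal sketch (the 5-percentage-point tolerance and the accuracy figures are arbitrary illustrative choices):

```python
def overfitting_gap(validation_accuracy, holdout_accuracy, tolerance=0.05):
    """Flag the model when hold-out accuracy falls well below validation
    accuracy; the 5-point `tolerance` is an illustrative assumption."""
    gap = validation_accuracy - holdout_accuracy
    return gap > tolerance, gap

flagged, gap = overfitting_gap(0.74, 0.66)
print(flagged, round(gap, 2))  # True 0.08
```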
Your analysis time will depend on which optimizer you use, the number of parameters you select, and the operators nested inside. Be careful not to go overboard with too many selections here!
Download the RapidMiner process and dataset below to try it out on your own. Any questions? Feel free to comment below.