Parameter optimization in predictive analytics: getting it right
There are two very basic questions in predictive analytics and data mining which only experience can reliably answer. The first is selecting the best technique for a given problem: given the plethora of methods for classification, clustering and prediction, how can you tell whether you have the best one? Current thinking is that aggregate or ensemble models (such as random forests or AdaBoost, to name two) are the way to go, since they sidestep this question altogether!
The other critical decision point is this: have I chosen the best parameter settings for my analysis technique? Every learner (analytics technique) can be optimized for a given training data set. But when you apply this optimized set of parameters to unseen data, you can safely expect near-optimal predictions only if the new dataset shares the basic statistical properties of the training set on which the learner was optimized.
But for many analysts, getting to the set of optimized parameters can itself be quite challenging. For example, one of our readers wrote this ...
"Let's say that in an ID3 tree I want to optimize the gain ratio attribute. In my first try at parameter optimization I use the range [1, 100] and 10 linear steps for the gain ratio value. The model will give me the best performing tree, let's call it Tree1, and assume that its accuracy is 85%. In my second try I use the range [-0.0001, 1] and 100 linear steps for the gain ratio value. The model will give me the best performing tree, let's call it Tree2, and assume that its accuracy is 95%. So in order to find the best tree in a single pass, how should I choose the value range for the attributes? As you know, an infinite number of values can be tried. That is, how do I decide whether to use range 1 or range 2?"
Great question! Let us try to answer this using the classic Iris dataset. Let us assume that we will use a decision tree as a learner and for simplicity in following the example, we want to optimize only two parameters (among the many which are available). In this case, we are trying to determine the best combination of "Minimum leaf size" and "Minimum size for a split". Both are critical input parameters which will affect the tree and its performance.
The best way to solve this problem using a tool like RapidMiner is to use the built in operators for Optimization. But even with this there are choices to be made: Grid or Quadratic or Evolutionary optimization?
Grid optimization is another name for a brute-force, exhaustive search where every possible combination of values for the min leaf size and min split is explored systematically. So the natural question is: what should be the range of values over which we explore? This was the question raised by our reader above.
For the Iris dataset, we have run a grid search with the following ranges: min split = [1, 100] and min leaf size = [1, 100]. The accuracy over the various combinations of these two parameters is shown as a surface plot below. Such plots are called fitness landscapes.
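If you prefer to follow along in code rather than in RapidMiner, here is a minimal sketch of the same grid search using scikit-learn. The parameters min_samples_leaf and min_samples_split are the scikit-learn analogues of "minimum leaf size" and "minimum size for a split"; a step of 10 is used here to keep the example quick, whereas a step of 1 would reproduce the full 100 x 100 grid.

```python
# Grid search over the two tree parameters on the Iris dataset,
# mirroring the RapidMiner grid optimization described above.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Coarse grid (step of 10) over the same ranges as in the text;
# min_samples_split must be at least 2 in scikit-learn.
param_grid = {
    "min_samples_leaf": range(1, 101, 10),
    "min_samples_split": range(2, 101, 10),
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)   # the winning combination of the two parameters
print(round(search.best_score_, 3))
```

Plotting search.cv_results_ as a surface over the two parameters gives exactly the kind of fitness landscape shown in the figure.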
As you can see, accuracy is quite high as long as we keep the minimum leaf size below 30. To come to this understanding, however, we had to run 10,000 analyses, which is fine for a small dataset like Iris, but you can see this becoming impractical for larger and more complex data. The initial choice of the search range will determine whether you hit the optimal point or not. If we had started with a "min leaf size" above 30, we would probably never have reached the "true" optimum.
This is where evolutionary optimization outperforms a grid search. With evolutionary optimization you can start with a very wide search range, and the algorithm quickly homes in on the optimum. Additionally, we can set up proper escape clauses such as early stopping, where the algorithm halts if there is no improvement in performance after a set number of trials, so there is no danger of getting stuck in an endless search.
When we run the same problem with an evolutionary search (parameter ranges exactly the same as in the grid), we get the following fitness landscape, where we have hit a couple of optimal points with fewer than 50 analyses.
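To make the idea concrete in code, here is a toy evolutionary search over the same two parameters. This is a deliberately simple (1+1)-style mutation loop, not RapidMiner's evolutionary operator; the mutation step of +/-20 and the early-stopping threshold of 10 stalled trials are illustrative choices, not tuned values.

```python
# Toy evolutionary search with early stopping on the Iris dataset.
import random

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
random.seed(0)

def fitness(leaf, split):
    """5-fold cross-validated accuracy for one parameter combination."""
    model = DecisionTreeClassifier(
        min_samples_leaf=leaf, min_samples_split=split, random_state=0
    )
    return cross_val_score(model, X, y, cv=5).mean()

# Start anywhere in the wide range [1, 100] x [2, 100].
best = (random.randint(1, 100), random.randint(2, 100))
best_fit = fitness(*best)
evaluations, stall = 1, 0

# Stop early after 10 trials without improvement, or after 50 analyses.
while stall < 10 and evaluations < 50:
    # Mutate the current best: a small random step on each parameter.
    cand = (
        max(1, best[0] + random.randint(-20, 20)),
        max(2, best[1] + random.randint(-20, 20)),
    )
    cand_fit = fitness(*cand)
    evaluations += 1
    if cand_fit > best_fit:
        best, best_fit, stall = cand, cand_fit, 0
    else:
        stall += 1

print(best, round(best_fit, 3), "after", evaluations, "analyses")
```

Even this crude version typically needs far fewer evaluations than the exhaustive grid, which is the whole point: the search budget is spent near promising regions rather than spread uniformly over the range.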
The optimal parameters found in both cases are quite similar, except that grid optimization required a more exhaustive search and carried the risk of not knowing whether we had "missed a spot"!
In conclusion: to tell whether you have set up your model parameters the right way, you need to optimize. Use a grid search for a small dataset with only a few parameters, but use an evolutionary search for anything else.
Download the RapidMiner process and another dataset to try it out on your own. Any questions? Feel free to comment below.