A simple explanation of how entropy fuels a decision tree model
Decision tree models can be used effectively to determine the most important attributes in a dataset. The figure below shows an example of using a decision tree (in this case, the C4.5 algorithm) to identify the most important attributes influencing whether a banking customer would accept a personal loan offer during a campaign. The dataset contains 12 customer attributes, including income, education (in years), mortgage, average credit card balance, family size, and geographic data. The response (or target) variable is the binary outcome of whether the customer accepts the loan offer: a "Yes" or a "No".
A frequent question beginners have when trying to understand the workings of a decision tree model is "how does the model know at what value to split the top node?" Another common question is "how does the model determine the order of the nodes from top to bottom?"
Let us explore these questions in simple language and without too much math!
The core idea behind decision tree algorithms is dataset homogeneity. In the series of plots below, the red points indicate loan acceptance ("Yes") and the blue points indicate non-acceptance ("No"). From a visual inspection, we can see that if the data is split along the "Income" axis at approximately 166.5 (and education at 2), we can separate the dataset into two largely homogeneous groups: one mostly red and one mostly blue. But one could perhaps make a similar (subjective) argument for Mortgage at 400.0 and Family at 3, or even CCAvg at 3.00.
Clearly we need a way to quantify dataset homogeneity. This is where the concept of entropy comes in. For a data subset that is 100% homogeneous (all Yes or all No), the entropy of the target variable is zero; for a subset that is a perfect 50-50 mixture, the entropy of the target is 1.0. In our dataset, the target variable entropy will lie somewhere in this range (0.0 to 1.0). If we compute the entropy for each of the three subsets within the green rectangles above, we will find that the subset with the lowest entropy is the green rectangle in the Income-Education plot. A program such as RapidMiner uses Information Gain, which is the entropy before the split minus the weighted average entropy of the subsets after the split, to quantify the improvement in homogeneity. It may also use Gain Ratio, which normalizes the information gain by dividing it by the intrinsic information of the split (the entropy of the subset sizes themselves), penalizing splits that fragment the data into many small pieces.
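To make this concrete, here is a minimal sketch of the entropy calculation in Python. The data values are hypothetical and stand in for any subset of the target variable:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = Counter(labels).values()
    if len(counts) <= 1:   # pure subset: entropy is zero by definition
        return 0.0
    return -sum((c / n) * log2(c / n) for c in counts)

# A perfect 50-50 mixture has the maximum entropy for two classes:
print(entropy(["Yes", "No", "Yes", "No"]))   # -> 1.0
# A fully homogeneous subset has zero entropy:
print(entropy(["No", "No", "No", "No"]))     # -> 0.0
```

Any subset of the campaign data, such as the points inside one of the green rectangles, could be passed to this function to score its homogeneity.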
The attribute that yields the largest gain ratio or information gain is selected as the root (top) node, and the attribute value at which this maximum occurs becomes the split point. Rank-ordering the gains yields the remaining nodes: the other attributes where splitting the data produced an increase in information gain.
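A short sketch of how a split point could be chosen by maximizing information gain. The income figures and responses here are toy values for illustration, not the article's actual campaign data:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy before a binary split minus the weighted entropy after it."""
    left  = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = sum(len(part) / n * entropy(part)
                   for part in (left, right) if part)
    return entropy(labels) - weighted

# Toy incomes (in thousands) and loan responses:
income = [60, 80, 100, 150, 170, 180, 200]
accept = ["No", "No", "No", "No", "Yes", "Yes", "Yes"]

# Try every candidate threshold and keep the one with the highest gain:
best = max(income[:-1], key=lambda t: information_gain(income, accept, t))
print(best)   # -> 150, which splits the toy data into two pure subsets
```

Repeating this search over every attribute, and then picking the attribute whose best split has the highest gain overall, is what determines the root node.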
This partitioning may continue along the same attribute to yield smaller and smaller subsets of higher "purity". As we see in the tree above, the dataset is split along the Income attribute at several points.
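The recursive nature of this partitioning can be sketched as follows: split the data at the best threshold, then repeat on each resulting subset until it is pure. The income values and responses below are hypothetical, and the sketch considers only a single attribute for simplicity:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) of the best binary split on one attribute."""
    base, n = entropy(labels), len(labels)
    best = (None, 0.0)
    for t in sorted(set(values))[:-1]:
        left  = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        gain = base - (len(left) / n * entropy(left)
                       + len(right) / n * entropy(right))
        if gain > best[1]:
            best = (t, gain)
    return best

def grow(values, labels, depth=0):
    """Recursively split along a single attribute until subsets are pure."""
    if len(set(labels)) == 1:          # pure subset: stop splitting
        print("  " * depth + f"leaf -> {labels[0]}")
        return
    t, gain = best_threshold(values, labels)
    if t is None or gain == 0.0:       # no split improves purity
        print("  " * depth + f"leaf -> {Counter(labels).most_common(1)[0][0]}")
        return
    print("  " * depth + f"Income <= {t}?")
    for keep in (lambda v: v <= t, lambda v: v > t):
        sub_v = [v for v in values if keep(v)]
        sub_l = [lab for v, lab in zip(values, labels) if keep(v)]
        grow(sub_v, sub_l, depth + 1)

# Hypothetical incomes and responses; prints a nested series of splits
# on the same attribute, mirroring the repeated Income splits in the tree.
grow([40, 60, 90, 120, 170, 190], ["No", "No", "Yes", "No", "Yes", "Yes"])
```

Real implementations such as C4.5 also consider the other attributes at each level and apply stopping rules and pruning, but the repeated drive toward lower-entropy subsets is the same.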
If you liked this, download our decision tree ebook below to read all our other articles on this topic.