The Analytics Compass Blog

Twice-weekly articles to help SMB companies optimize business performance with data analytics and improve their analytics expertise.

Decision tree accuracy: effect of unbalanced data


A frequent situation encountered in classification problems is that of unbalanced data. A training data set containing a disproportionately high number of examples from one class will produce a classifier biased towards that majority class. When a classifier trained on such data is applied to a test data set that is similarly unbalanced, it will yield a very optimistic accuracy estimate. This phenomenon is especially common in binary classification. The plot below shows an example of imbalanced data: the number of "positive" examples (blue) is disproportionately higher than the number of "negative" examples (red).
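A quick way to see why imbalance inflates accuracy: a trivial "classifier" that always predicts the majority class scores very well on a skewed test set while having learned nothing. A minimal Python sketch (the 92/8 split is illustrative, chosen to match the skew introduced later in this article):

```python
# A "classifier" that ignores its input and always predicts the majority class.
def majority_classifier(_example):
    return "positive"

# Skewed test set: 92 positive examples, 8 negative (a 92%/8% split).
test_labels = ["positive"] * 92 + ["negative"] * 8

predictions = [majority_classifier(y) for y in test_labels]
correct = sum(p == y for p, y in zip(predictions, test_labels))
accuracy = correct / len(test_labels)

print(accuracy)  # 0.92 -- looks impressive, yet every negative example is misclassified
```

The 92% figure reflects nothing but the class proportions of the test set, which is exactly why accuracy alone is misleading here.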

[Figure: classification with unbalanced data]

What is the effect of data imbalance?

Let us explore this using a simple example. The data set used in the process below, called "Weighting", is available in RapidMiner's Samples repository. It is a balanced data set of about 500 examples, with the label split roughly 50/50 between the "positive" and "negative" classes. When we train a decision tree on this data, we get an overall accuracy of 84%. The main thing to note is that the decision tree's accuracy on the two classes is roughly the same, around 80%. See the graphic below.

[Figure: decision tree accuracy on balanced data]

As stated earlier, if we use unbalanced data to train a classifier, its predictions will be biased towards the more frequent class. We now introduce a sub-process called "Unbalance", which resamples the original data to introduce a skew: the resulting data set has more "positive" than "negative" examples. Specifically, 92% of the examples now belong to the positive class and 8% to the negative class. The process and the results are shown below.
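The bias shows up clearly when accuracy is broken down into per-class recalls. The confusion counts below are made up for the sketch (they are not taken from the RapidMiner run), but they show the typical pattern for a tree trained on a 92%/8% split:

```python
# Illustrative confusion counts on a 500-example, 92%/8% test set.
tp, fn = 455, 5    # positive examples: predicted positive / predicted negative
tn, fp = 10, 30    # negative examples: predicted negative / predicted positive

accuracy = (tp + tn) / (tp + fn + tn + fp)
recall_positive = tp / (tp + fn)
recall_negative = tn / (tn + fp)

print(round(accuracy, 3))         # 0.93  -- overall accuracy still looks good
print(round(recall_positive, 3))  # 0.989
print(round(recall_negative, 3))  # 0.25  -- the minority class is mostly missed
```

The overall accuracy is dominated by the majority class, so it hides the near-total failure on the minority class.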

[Figure: decision tree accuracy on unbalanced data]

How to address data imbalance?

There are several ways to fix this situation. The most commonly used method is to resample the data to restore the balance: undersample the more frequent class - in our case, the "positive" class - and oversample the less frequent "negative" class.
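The under/oversampling idea can be sketched in plain Python. The article's process does this with a RapidMiner sub-process; the stand-alone `rebalance` helper below, with its 50-per-class target, is purely illustrative:

```python
import random

random.seed(0)

# Toy unbalanced data: 92 "positive" and 8 "negative" labeled examples.
data = [("x%d" % i, "positive") for i in range(92)] + \
       [("y%d" % i, "negative") for i in range(8)]

def rebalance(examples, target_per_class=50):
    """Undersample classes above the target and oversample (with
    replacement) classes below it, so every class ends up equal."""
    by_class = {}
    for example in examples:
        by_class.setdefault(example[1], []).append(example)
    balanced = []
    for label, members in by_class.items():
        if len(members) >= target_per_class:
            balanced += random.sample(members, target_per_class)     # undersample
        else:
            balanced += random.choices(members, k=target_per_class)  # oversample
    return balanced

balanced = rebalance(data)
counts = {label: sum(1 for _, l in balanced if l == label)
          for label in ("positive", "negative")}
print(counts)  # {'positive': 50, 'negative': 50}
```

Note that oversampling duplicates minority examples, so the tree sees the same negative cases several times; more sophisticated variants (e.g. synthetic oversampling) exist, but the duplicate-and-discard version above captures the core idea.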

The "Rebalance" sub-process achieves this in our final RapidMiner process. As seen below, the overall accuracy is now back to the level obtained on the original balanced data. The decision tree also looks similar to the original one, whereas on the unbalanced data set it was reduced to a stub.

[Figure: decision tree accuracy on rebalanced data]

An additional check to ensure that accuracy is not compromised by unbalanced data is to replace the accuracy with what is called "balanced accuracy". It is defined as the arithmetic mean of the class recalls - the accuracies obtained on the positive and negative examples, respectively. If the decision tree performs equally well on both classes, this measure reduces to the standard accuracy (the number of correct predictions divided by the total number of predictions).
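The definition above is a one-liner; the recall values passed in below are illustrative, matching the kind of biased result seen on the unbalanced run:

```python
def balanced_accuracy(recall_positive, recall_negative):
    """Arithmetic mean of the per-class recalls."""
    return (recall_positive + recall_negative) / 2.0

# Biased classifier: high positive recall, poor negative recall
# (illustrative numbers, not from the RapidMiner run).
print(round(balanced_accuracy(0.989, 0.25), 4))  # 0.6195 -- far below a ~0.93 plain accuracy

# When the tree does equally well on both classes, balanced accuracy
# coincides with the standard accuracy.
print(balanced_accuracy(0.80, 0.80))  # 0.8
```

Because each class contributes equally regardless of how many examples it has, balanced accuracy cannot be inflated simply by favoring the majority class.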

Download our free ebook on setting up decision trees using RapidMiner to refresh the basics.



Very good presentation. Surely most people will benefit; I will recommend it to others, but the site address is not so attractive to machine learners.
Posted @ Saturday, February 09, 2013 5:27 AM by David
Thanks for this post. This is a common problem in data mining. 
But where is the "rebalance" operator? 
Is this a core RapidMiner operator, or is it user-created? 
I can't find this operator in version 5.3. If you created this sub-process manually, can you share your process file?  
Posted @ Tuesday, April 30, 2013 2:54 AM by e
It is not a core operator. I will share this in a future article. Thanks.
Posted @ Wednesday, May 01, 2013 7:52 AM by Bala Deshpande
Hi, thanks a lot for the article (and for the quality website). I am stuck with unbalanced data in a classification problem I am currently working on. Could you please share how you did the balancing and unbalancing of the data? I would really appreciate your help!
Posted @ Friday, May 31, 2013 11:55 AM by DDR