Customer segmentation

Two of the many ways descriptive and predictive analytics add value to businesses are customer segmentation and customer classification (or profiling), both of which help a business understand its customers better. Segmentation and profiling have long been widely employed by big businesses in industries such as finance, insurance and database marketing. Only now, with the availability of cheap computing coupled with low-cost or no-cost (as in open source) software, can businesses of all shapes and sizes afford to take advantage of these processes. Here we describe one simple process for conducting a fundamental segmentation analysis called Recency-Frequency-Monetary Value (RFM) analysis.

It is easy to grasp that customers who engage more frequently and at higher monetary levels are more valuable to a business. The value of a customer who has engaged recently is less obvious. It turns out that in many instances, the potential value of someone who has recently conducted transactions with a business is significantly higher than that of someone who has not. RFM analysis is therefore a critical tool for segmenting customers by their potential value so that they can be reached with targeted messages. The RFM process can be a purely descriptive analytics exercise, or it may be combined with a simple predictive model to connect the descriptors to future customer behavior.

RFM has been applied to a variety of scenarios where businesses need to build marketing campaigns. It can be applied to both commercial and non-profit situations. In a non-profit scenario, our objective could be to segment people by their potential to donate. The example illustrated here (the data comes from the UCI Machine Learning Repository) pertains to a blood drive. The data includes five attributes: R (Recency – months since last donation), F (Frequency – total number of donations), M (Monetary – total blood donated in c.c.), T (Time – months since first donation), and a binary variable representing whether someone donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood). This last variable will allow us to build a machine learning model to predict or classify new cases. There are 748 samples in this dataset.

4 ways to segment data

Given that there are four descriptor attributes (and one target attribute), we can segment our “customer” database along these four dimensions. The simplest way to do this in a tool such as RapidMiner is to connect the “Discretize by Binning” operator to the dataset. This operator converts the four selected numerical attributes into nominal attributes. For example, we can create bins such as the following for the Recency attribute:

  • 0-9 months since last donation
  • 9-18 months since last donation
  • 18-27 months since last donation
  • more than 27 months since last donation

If we set the number of bins parameter to 8, the data is segmented along these 9-month ranges. If we set the number of bins to 5, we end up segmenting along 15-month ranges. A bit of trial and error will help us select the right number of bins. With 8 bins, we see that 53.6% of the data is in the 0-9 month range and 85% of the data is within the 0-18 month range.
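The equal-width binning described above can be sketched in a few lines of Python. This is only an illustration of the idea, not a reproduction of RapidMiner's operator: the function name `equal_width_bins` and the recency values are made up for the example and are not taken from the blood drive dataset.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (0-based index)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls in the first bin.
        return [0] * len(values)
    # The maximum value would compute to index n_bins, so clamp it
    # into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical recency values in months (illustrative only).
recency = [0, 2, 4, 11, 14, 16, 21, 23, 40, 74]
bins = equal_width_bins(recency, 8)
print(bins)
```

Changing the second argument from 8 to 5 widens each range, which is the trial-and-error step mentioned above.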

However, if you want to specify the ranges manually, some amount of programming is required using a tool such as R.
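As a rough idea of what that programming looks like, here is a minimal sketch in Python rather than R. The edges, labels and function name are assumptions chosen to mirror the example Recency ranges; each range includes its lower edge and excludes its upper edge.

```python
from bisect import bisect_right

# Hypothetical hand-picked edges in months; each edge starts a new range.
EDGES = [9, 18, 27]
LABELS = ["0-9", "9-18", "18-27", "27+"]


def recency_label(months):
    """Return the label of the hand-specified range that contains `months`."""
    # bisect_right counts how many edges are <= months, which is
    # exactly the index of the matching label.
    return LABELS[bisect_right(EDGES, months)]


print([recency_label(m) for m in (0, 9, 26, 27, 74)])
```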

If we stick to automatically generating the bins, we see that 8 bins are sufficient to segment the data meaningfully along the four dimensions. We find the following insights:

  1. Frequency (total number of donations): 78% of the data is within the 0-7 donations range, 92% of the data is within the 0-13 donations range
  2. Monetary (total blood donated in c.c.): 78% of the data is within the 0-1781 cc range, 92% of the data is within the 0-3312 cc range
  3. Time (months since first donation): the data is spread very uniformly across all the bins, hinting that this attribute is perhaps not an important one for segmenting.
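Percentages like the ones in this list are cumulative shares: the fraction of records that fall in a bin or in any lower bin. A minimal sketch of that calculation follows, assuming each record has already been assigned a 0-based bin index; the function name and the toy input are hypothetical.

```python
from collections import Counter


def cumulative_shares(bin_indices, n_bins):
    """Return, for each bin, the fraction of records in that bin or below."""
    counts = Counter(bin_indices)
    total = len(bin_indices)
    shares = []
    running = 0
    for b in range(n_bins):
        running += counts.get(b, 0)  # empty bins add nothing
        shares.append(running / total)
    return shares


# Toy bin assignments (illustrative only, not the blood drive data).
print(cumulative_shares([0, 0, 1, 3], 4))
```

A distribution where the first one or two bins already hold most of the records, as with Frequency and Monetary, shows up as shares that jump close to 1.0 early in the list.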

Using RFM attributes for predictive analytics

We can now use this information about the attributes to predict (actually, classify) whether a person is likely to make a blood donation. Before we build this model, we can run a basic feature weighting analysis to rank the RFM attributes. We can do this by connecting the segmented data to the “Weight by Chi-square” and “Weight by Information Gain” operators. In either case, Recency emerges as the most important factor, followed by Frequency and Monetary value at equal levels of weighting.
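To make the idea of weighting by information gain concrete, here is a minimal sketch of the underlying calculation for one binned attribute against the binary donation label. It shows the textbook formula only and does not claim to match the numbers RapidMiner reports, which may be normalized differently; the function names and toy inputs are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(bins, labels):
    """Entropy of the labels minus the weighted entropy within each bin."""
    total = len(labels)
    groups = {}
    for b, y in zip(bins, labels):
        groups.setdefault(b, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy labels: 1 = donated, 0 = did not donate (illustrative only).
donated = [1, 1, 0, 0]
print(information_gain([0, 0, 1, 1], donated))  # bins separate the classes
print(information_gain([0, 1, 0, 1], donated))  # bins carry no information
```

An attribute whose bins separate donors from non-donors well gets a high gain, while an attribute whose bins look the same for both groups gets a gain near zero.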

Finally, we can connect the weighted results to a basic classifier, such as a logistic regression model, to build a machine learning model that classifies future cases according to their tendency to donate.
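For readers who want to see what such a classifier does under the hood, below is a minimal logistic regression sketch trained by plain batch gradient descent. Everything in it is an assumption made for illustration: the toy rows, the use of bin indices as numeric inputs, the learning rate and the number of epochs. A real analysis would use the full dataset and a tested library or tool.

```python
from math import exp


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def predict_proba(w, x):
    """Probability of class 1 for one row; w[0] is the bias term."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log-loss."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        # Step against the average gradient.
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


# Hypothetical toy rows: [recency bin, frequency bin]; label 1 = donated.
X = [[0, 3], [0, 2], [1, 2], [1, 1], [3, 0], [4, 1], [5, 0], [6, 0]]
y = [1, 1, 1, 0, 0, 0, 0, 0]
w = fit_logistic(X, y)

p_recent = predict_proba(w, [0, 3])  # recent, frequent donor
p_lapsed = predict_proba(w, [6, 0])  # lapsed, infrequent donor
print(p_recent > p_lapsed)
```

The fitted model assigns each new case a probability of donating, which can then be thresholded (for example at 0.5) to classify it.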

RFM is only one way to perform customer segmentation and classification. Another useful technique is classification based on the well-known Pareto, or 80/20, rule.

Originally posted on Tue, Dec 03, 2013 @ 09:03 AM
