
How to improve customer segmentation with text mining in 5 steps

Posted by Bala Deshpande on Wed, Feb 19, 2014 @ 07:03 AM

In a previous article we discussed how text mining can help with customer segmentation when used in conjunction with traditional survey analysis data. In this article we will demonstrate that doing so can significantly improve the predictive accuracy of models.

The goal of the predictive model in this context is simple: use the responses to a standard survey to quickly and automatically categorize the respondents into different classes. One of our customers ran a survey which asked questions about how the respondents used technology. The objective was to rank a respondent as "innovative" or "conservative" or "average" when it comes to adopting new technologies. When you collect hundreds or thousands of responses, it becomes tedious to pore over every response in order to do this categorization. You need a tool that would do this quickly.

The survey contains questions which require numeric responses, such as grading on a scale of 1 to 5. However, numeric questions tend to pigeonhole respondents. This is where free-form text responses to open-ended questions can add some value. However, analyzing text responses manually is even more tedious. The solution is to combine the two analyses. The good thing about this merger is that any predictive model built upon this data improves in accuracy with the addition of the text mining dimension.

There are 5 steps to building predictive models on survey data. As always, data cleanup and preparation are the pre-steps, and we will not focus on them here. Further, we will also assume that we have training and testing data separated and that the training data has been properly labeled. By this we mean that one must manually examine the responses in the training data and assign the respective labels to the respondents (such as "high", "medium" or "low" in terms of the respondents' innovation potential in this example). Steps 1 to 4 deal only with training data, and step 5 applies the model built in steps 1 through 4 to the testing data, which was unseen by the model:

Step 1: Separate numerical responses from text responses. The chart below shows the numerical portion of the survey responses. Note that the ranking for the training data has been created under the column "Rank". This needs to be done only once, and the model trained on this data can be used to categorize all future surveys.

[Figure: numerical responses from the survey data]
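To make Step 1 concrete, here is a minimal sketch of the split in plain Python. The rows, the column names (`id`, `q1`, `q2`, `comment`, `rank`) and the helper `split_responses` are all hypothetical and chosen only for illustration; the actual survey layout is not shown in this article.

```python
# Hypothetical survey rows: column names and values are made up for illustration.
responses = [
    {"id": 1, "q1": 5, "q2": 4, "comment": "I buy new gadgets early", "rank": "high"},
    {"id": 2, "q1": 2, "q2": 1, "comment": "I wait until prices drop", "rank": "low"},
]

def split_responses(rows, id_key="id", label_key="rank"):
    """Split each row into numeric answers, free-text answers, and the label."""
    numeric, text, labels = {}, {}, {}
    for row in rows:
        rid = row[id_key]
        numeric[rid], text[rid] = {}, {}
        labels[rid] = row[label_key]
        for key, value in row.items():
            if key in (id_key, label_key):
                continue  # the id and the label are not survey answers
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                numeric[rid][key] = value
            elif isinstance(value, str):
                text[rid][key] = value
    return numeric, text, labels

numeric, text, labels = split_responses(responses)
```

Keeping both parts keyed by the same respondent id is what makes it possible to join them again later.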

Step 2: Convert text data into a Term Document Matrix (TDM). A TDM is nothing but a sparse matrix whose columns are the key text terms gathered from the responses. Each row of the TDM represents one respondent. 

[Figure: term document matrix for the text mining survey data]
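As a rough illustration of Step 2, the sketch below builds a small TDM from scratch using only the Python standard library. The comments, the tiny stop-word list and the function names `tokenize` and `build_tdm` are assumptions made for this example; it is not the tool used to produce the matrix shown above. For readability the rows are stored as dense lists, whereas a real TDM with thousands of terms would normally use a sparse representation.

```python
import re
from collections import Counter

# Hypothetical free-text answers keyed by respondent id.
comments = {
    1: "I buy new gadgets early and love new apps",
    2: "I wait until prices drop",
    3: "New apps are fun but I wait for reviews",
}

# A tiny, assumed stop-word list; a real project would use a fuller one.
STOP_WORDS = {"i", "and", "are", "but", "for", "the", "a", "until"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def build_tdm(docs):
    """Return (terms, rows): the sorted vocabulary and one count row per respondent."""
    counts = {rid: Counter(tokenize(text)) for rid, text in docs.items()}
    terms = sorted(set().union(*counts.values()))
    # Counter returns 0 for a missing term, so each row lines up with `terms`.
    rows = {rid: [c[t] for t in terms] for rid, c in counts.items()}
    return terms, rows

terms, tdm = build_tdm(comments)
```

Each entry of `tdm` is one respondent's row, and `terms` gives the column headings in the same order.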

Step 3: Merge the numerical data with TDM. This involves simply combining the two tables shown above.
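The merge in Step 3 is essentially a join on the respondent. The sketch below assumes both tables are keyed by a respondent id; the data, the `term_` column prefix and the helper `merge_features` are hypothetical. Filling a missing numeric answer with 0 is also an assumption made here for simplicity and may not suit every survey.

```python
# Hypothetical inputs: numeric answers and TDM rows, both keyed by respondent id.
numeric = {1: {"q1": 5, "q2": 4}, 2: {"q1": 2, "q2": 1}}
terms = ["early", "gadgets", "wait"]
tdm = {1: [1, 1, 0], 2: [0, 0, 1]}

def merge_features(numeric, terms, tdm):
    """Join numeric columns and term columns on respondent id (inner join)."""
    numeric_cols = sorted({col for row in numeric.values() for col in row})
    # Prefix term columns so a term can never collide with a question name.
    columns = numeric_cols + ["term_" + t for t in terms]
    merged = {}
    for rid in sorted(numeric.keys() & tdm.keys()):
        # Assumption: a missing numeric answer is filled with 0.
        numeric_part = [numeric[rid].get(col, 0) for col in numeric_cols]
        merged[rid] = numeric_part + list(tdm[rid])
    return columns, merged

columns, merged = merge_features(numeric, terms, tdm)
```

Only respondents present in both tables survive the join, so every merged row has a value in every column.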

Step 4: Build a predictive model using the joined text and numeric data. The model may be validated using a cross validation process to record the prediction accuracies. The table below shows the accuracy of a neural network model when trained using both the text and numeric data.

[Table: neural network model accuracy with text mining data]

The second table below shows the confusion matrix for another neural network model trained without the text data, in other words, using only the first data table shown earlier and ignoring the text data.

[Table: neural network model accuracy without text mining]

As you can see, simply using the TDM to provide additional information can improve the model's accuracy by nearly 20%. The tables above clearly demonstrate the advantage of using text mining in building better predictive models for customer segmentation or customer profiling, which was the ultimate objective of this survey.
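The cross validation and confusion matrix described in Step 4 can be sketched in a few lines of plain Python. To keep the example short and dependency-free, a simple nearest-centroid classifier is swapped in for the neural network; it is a stand-in, not the model used for the tables above. The feature rows, the labels and the function names are hypothetical, and the toy data is deliberately easy to separate, so its accuracy says nothing about real survey data.

```python
from collections import Counter, defaultdict

# Hypothetical merged rows (two numeric answers followed by two term counts).
X = [
    [5, 4, 1, 0], [5, 5, 1, 0], [4, 5, 1, 0],
    [3, 3, 0, 0], [3, 2, 0, 0], [2, 3, 0, 0],
    [1, 1, 0, 1], [1, 2, 0, 1], [2, 1, 0, 1],
]
y = ["high", "high", "high", "medium", "medium", "medium", "low", "low", "low"]

def fit_centroids(X, y):
    """Stand-in model: the mean feature vector of each class."""
    grouped = defaultdict(list)
    for row, label in zip(X, y):
        grouped[label].append(row)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in grouped.items()}

def predict(centroids, row):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(row, centroids[label]))
    return min(sorted(centroids), key=dist)  # sorted() makes ties deterministic

def cross_validate(X, y, k=3):
    """k-fold cross validation; confusion is keyed by (actual, predicted)."""
    confusion = Counter()
    for fold in range(k):
        test_idx = set(range(fold, len(X), k))  # every k-th row is held out
        train = [(X[i], y[i]) for i in range(len(X)) if i not in test_idx]
        centroids = fit_centroids([r for r, _ in train], [l for _, l in train])
        for i in test_idx:
            confusion[(y[i], predict(centroids, X[i]))] += 1
    correct = sum(n for (actual, pred), n in confusion.items() if actual == pred)
    return correct / len(X), confusion

accuracy, confusion = cross_validate(X, y)
```

Running `cross_validate` once on the merged columns and once on the numeric columns alone gives the kind of side-by-side comparison shown in the two tables.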

Step 5: Apply the model to the unseen test data to predict the categories. This is where all the hard work in preparing the data and building the model will pay off.

One aspect which we did not elaborate on is data reduction, which may sometimes be necessary. Most surveys contain dozens of numerical responses; however, not all of these questions may influence the segmentation equally. Some responses tend to have a much stronger influence on customer profiles than others. How do you identify these and remove the unimportant ones? One way to accomplish this is by identifying the key performance indicators using a tool such as KeyConnect.
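One common way to score how strongly a survey question is associated with the segment label is a chi-squared test of independence. The sketch below computes the Pearson chi-squared statistic by hand for two made-up questions. The data and the function name `chi_squared` are hypothetical, and this is only an illustration of the general idea, not a description of how KeyConnect works internally.

```python
from collections import Counter

def chi_squared(feature_values, labels):
    """Pearson chi-squared statistic for the feature-value vs. label table."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    label_totals = Counter(labels)
    stat = 0.0
    for f, f_total in feature_totals.items():
        for l, l_total in label_totals.items():
            expected = f_total * l_total / n  # count expected under independence
            stat += (observed[(f, l)] - expected) ** 2 / expected
    return stat

# Hypothetical answers on a 1-to-5 scale for two questions, plus made-up labels.
labels = ["high", "high", "high", "low", "low", "low"]
answers = {
    "q1": [5, 5, 5, 1, 1, 1],  # tracks the label closely
    "q2": [3, 3, 3, 3, 3, 3],  # everyone gave the same answer
}

scores = {q: chi_squared(vals, labels) for q, vals in answers.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Questions with a statistic near zero carry little information about the label and are candidates for removal. With small samples the statistic is unreliable, so a ranking like this is best treated as a rough guide.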

Sign up for a free 30-day trial for KeyConnect. 


Topics: text mining, text analytics, artificial neural networks
