As data science extends into more and more types of businesses, it is inevitable that we will need to work with unstructured text data. This means dirty data takes on many different avatars. Cleaning structured data usually means accounting for "incorrectly" entered numbers (which show up as outliers), missing values, and so on. With unstructured data, there are significantly more opportunities for creating dirty data, especially when the data is entered by humans!
For example, data entry folks, in an attempt to be more descriptive, tend to combine written words with numbers. We were working with some claims data where price data was generously mixed with comments. For instance, here was one entry under the Price column: "$44,381?? could not read clearly". Here was another entry under Price: "This was a lease". Instead of a number, we find ourselves interpreting something else.
Another attribute in this same data set was supposed to capture the name of the law firm used in the claim. Here are some of the data samples:
Adam K****, K*** & M***, LTD
K*** Et M***, Ltd
K*** & M***, Ltd
Eric K***, K*** ft M***,LTD.
A human reader immediately recognizes that these are all the same (or equivalent) values for the attribute "Law firm", but if you want to build a dashboard or perhaps some machine learning models, the raw values will not work because each spelling is treated as a different firm. We need a way to quickly identify and group such similar entries so that they map to a single value.
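To make "similar" concrete, here is a minimal Python sketch that scores two entries by the share of tokens they have in common (Jaccard similarity). This is only an illustration, not the measure used in our process; the helper names `tokens` and `jaccard` are made up for this example.

```python
import re

def tokens(name):
    """Lowercase a name and return its set of word tokens."""
    # '*' is allowed in tokens only because the sample names are masked.
    return set(re.findall(r"[a-z*]+", name.lower()))

def jaccard(a, b):
    """Share of tokens two sets have in common (1.0 means identical sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

samples = [
    "Adam K****, K*** & M***, LTD",
    "K*** Et M***, Ltd",
    "K*** & M***, Ltd",
    "Eric K***, K*** ft M***,LTD.",
]

# Compare every entry with the cleanest-looking one.
reference = tokens("K*** & M***, Ltd")
for s in samples:
    print(f"{jaccard(reference, tokens(s)):.2f}  {s}")
```

Every sample shares most of its tokens with the reference entry, which is the signal a clustering step can pick up on.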
The simplest way to do this is to use clustering and text mining. We first select the attribute in question, parse it and break it down into tokens, and then group the entries into natural clusters based on those tokens. Using a tool like RapidMiner makes this very straightforward and intuitive. The image below shows the process we used. We need to pay attention to a couple of items here:
- Use the Nominal to Text operator to convert the raw data into the right format for text mining.
- Use the Filter Stopwords (Dictionary) operator to remove certain common words. This is important: otherwise clusters will form around common words such as "group", "associates", "esquire", or "LLC", which does not help in identifying the unique features of each law firm name. (This operator is nested inside the Process Documents from Data operator above.)
- Use the DBSCAN clustering technique because you do not want to predefine the number of clusters. Also try to keep the minimum cluster size as low as 2 to prevent distinct names from being lumped together. In this instance, overfitting is actually a benefit!
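For readers who prefer code to a visual workflow, the steps above can be sketched in plain Python. This is a rough approximation under several assumptions, not a reproduction of the RapidMiner process: the stopword dictionary is hypothetical, Jaccard distance on token sets stands in for whatever measure the operators use, the last two firm names are invented to act as non-repeating entries, and `eps=0.55` was picked by hand for this tiny sample. The sketch labels entries that join no cluster as 0.

```python
import re

# Hypothetical stopword dictionary; in practice, build it from your own data.
STOPWORDS = {"ltd", "llc", "group", "associates", "esquire"}

def tokenize(name):
    """Lowercase, split into tokens, and drop dictionary stopwords."""
    # '*' is allowed in tokens only because the sample names are masked.
    return {t for t in re.findall(r"[a-z*]+", name.lower()) if t not in STOPWORDS}

def jaccard_distance(a, b):
    """0.0 for identical token sets, 1.0 for sets with nothing in common."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 1.0

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: clusters are numbered from 1, noise gets label 0."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        return [j for j in range(len(points))
                if jaccard_distance(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = 0  # noise for now; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == 0:
                labels[j] = cluster  # noise reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # j is a core point, keep expanding
                queue.extend(j_nbrs)
    return labels

names = [
    "Adam K****, K*** & M***, LTD",
    "K*** Et M***, Ltd",
    "K*** & M***, Ltd",
    "Eric K***, K*** ft M***,LTD.",
    "Smith & Jones, LLC",   # invented singleton entry
    "Acme Legal Group",     # invented singleton entry
]

labels = dbscan([tokenize(n) for n in names], eps=0.55, min_pts=2)
for label, name in zip(labels, names):
    print(f"cluster_{label}: {name}")
```

Because no cluster count is supplied, the number of groups falls out of `eps` and `min_pts` alone; with `min_pts=2`, even two matching entries are enough to form their own cluster.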
On a side note, most clustering algorithms (in RapidMiner) tend to put all singleton entries into cluster_0. That is, if you have many instances of non-repeating law firm names, all of these tend to end up in cluster_0.
Here is the result of running the above process. As you can see, all the "K*** & M***" law firm entries are automatically grouped into cluster_1.
For a detailed introduction to text mining, refer to Chapter 9 of our book.