text mining job search

Inspired by the really cool video series on text mining by Vancouver Data Blog, we are kicking off our own article series on text mining, also using RapidMiner. Neil McGuigan does a great job covering this topic in those compact 10-minute videos. In our series we will try to get a little deeper into the nitty-gritty details. We will also take a bottom-up approach: we will first describe the text mining toolkit that you need to be familiar with before running a meaningful analysis. But before jumping into the details, let us make the topic a bit more relevant by describing how anyone could use it in a personal application.

One example of a personal “application” is a job search. We will explore how text mining can help make your own job search a bit more streamlined and purposeful. For instance there are at least three ways in which you can use text mining in your next job search:

1. Separate job postings into general groupings using clustering

The one thing that most recruiters will likely tell you is to focus on your niche to generate the most success. That advice holds for any search, not just for jobs. In order to focus, you will need to segment properly, and clustering is one good way to do this. For example, the same job title may require somewhat different skills. One Data Scientist job asked for CRM, Applied Operations Research and Database Marketing skills with no emphasis on programming, while another Data Scientist job required programming expertise in Java, C++, Python and so on! On the other hand, there was a job posting for a Predictive Analytics Manager which almost exactly matched the first Data Scientist job description. Thus two different job titles may require closely matched skills. Clustering will help you narrow down and group such similar postings irrespective of the title.

2. Short list the best fit jobs for your interest by using similarity matching

Assuming you have separated the postings into various clusters, you may want to take the next step: go into each cluster and find out which two or more postings are most similar to each other. Possibly they are the same job sent out by different recruiters! On the bright side, there may be multiple jobs which require almost the same skill set, in which case your chances have improved.

3. Identify the skill keywords which are strongest indicators for a given job

Finally, if you have decided that a given career path is the best fit for you, it is a good idea to check whether there is a core set of skills that is critical for jobs with that description, and to enumerate them. If you are missing one or more of those skills, you have an opportunity to close that gap.
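To make the second idea above a little more concrete, here is a minimal sketch of similarity matching. The series itself uses RapidMiner; the Python below is only our own illustration. The postings are made up, and cosine similarity on raw word counts is just one common choice, not necessarily the measure your tool uses.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    # Represent each text as a bag of lowercase word counts.
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over shared words (missing words count as 0).
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical, heavily shortened postings (illustrative only).
postings = {
    "Data Scientist A": "crm database marketing operations research",
    "Predictive Analytics Manager": "crm database marketing predictive analytics",
    "Data Scientist B": "java python programming",
}

names = list(postings)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        score = cosine_similarity(postings[first], postings[second])
        print(first, "vs", second, "->", round(score, 2))
```

A score close to 1 means two postings use nearly the same words; a score of 0 means they share none.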

These three tasks will serve as examples over this series of three or four articles to explore how you can use text mining and text analytics. However, we will start by looking at the basic tools and terminology that are important for text mining.

Transformation of text into data

The fundamental process in most text mining is converting text into data. Once you convert your text into data, there is nothing to stop you from applying all your analytics techniques to classify, cluster and predict. The unstructured text has been converted into a semi-structured dataset, so that you can find patterns and, even better, train models to detect patterns in new and unseen text.
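As a rough illustration of what "text into data" can mean, the sketch below turns two made-up snippets into a small table of word counts, with one row per document and one column per word. This is our own simplified Python example, not output from RapidMiner, and it skips all the preprocessing discussed next.

```python
from collections import Counter

# Two made-up snippets standing in for job postings (illustrative only).
documents = [
    "data mining and data analysis",
    "java programming and data structures",
]

# Count how often each word occurs in each document.
counts = [Counter(doc.split()) for doc in documents]

# The vocabulary is the sorted set of all words seen in any document.
vocabulary = sorted(set(word for c in counts for word in c))

# One row per document, one column per vocabulary word.
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Once the text is in this numeric form, each row can be fed to a clustering or classification technique like any other dataset.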

Preprocessing tools

But before converting a random text document or web document into data, there are additional preprocessing steps required which will make our analysis easier. Below are the main tools which will be used in subsequent tasks and a graphical explanation of what they do. All of these steps come after you have successfully “read” your documents into the tool of your choice.


Tokenize

Usually the first step in imposing some structure on your document data is to break up the document into discrete bits, or tokens. The simplest token is a character; however, the simplest token that is meaningful to a human is a word. The graphic below shows the effect of tokenizing on a single paragraph of a job description text. You can see that punctuation marks (characters) are removed and tokens are separated by alternating colors.
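A minimal sketch of word tokenization in Python follows. It assumes that a token is simply a run of letters, so punctuation and digits are discarded; the sample sentence is made up, and a real tool may offer other splitting rules.

```python
import re

def tokenize(text):
    # Lowercase the text, then keep every run of letters as one token.
    # Punctuation, digits and whitespace act as separators and are dropped.
    return re.findall(r"[a-z]+", text.lower())

sample = "Responsibilities: build models, mine large data sets."
tokens = tokenize(sample)
print(tokens)
```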

Filter Stop words

Probably the next most commonly used preprocessing step is to remove what are called “stop words”. These are common words such as prepositions, conjunctions, articles, adverbs and so on, which appear in almost every document and therefore tell us little about any particular one.
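Stop word filtering can be sketched as a simple set lookup. The stop word list below is a tiny hand-picked assumption for illustration; real tools ship much longer, language-specific lists.

```python
# A tiny, hand-picked stop word list (assumption, for illustration only).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "with", "on"}

def filter_stop_words(tokens):
    # Keep only the tokens that are not in the stop word list.
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["experience", "with", "the", "analysis", "of", "large", "data", "sets"]
filtered = filter_stop_words(tokens)
print(filtered)
```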


Stem

Stemming reduces words to their barest minimum, so to speak. For example, in a job posting context the words “responsibilities” and “responsible” indicate the same thing. When you stem these words, you keep only the core characters, which convey effectively the same meaning.
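To show the idea, here is a deliberately naive suffix stripper in Python. It is not the Porter algorithm or any stemmer that a real tool ships; the suffix list and the minimum stem length are assumptions chosen only so that the example words from the text collapse to a common root.

```python
# Suffixes to strip, longest first so "ibilities" is tried before "s".
# This list is a toy assumption, not a real stemming rule set.
SUFFIXES = ["ibilities", "ibility", "ible", "ing", "es", "s"]

def naive_stem(word, min_stem=3):
    # Strip the first matching suffix, but never leave a stem
    # shorter than min_stem characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["responsibilities", "responsible", "mining", "models"]:
    print(w, "->", naive_stem(w))
```

Note that a stem does not have to be a real word; it only has to be the same for all the variants you want to treat as one term.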


Generate n-grams

In documents there may be pairs of words which always go together. Going back to our example, the pair “data” and “mining” may appear together several times in the same posting. A sequence of n consecutive words treated as a single term is called an n-gram, so such a pair is a 2-gram. Identifying such word (or word root) pairs will allow us to parse through the document in a more intelligent way. Of course n-gramming is not limited to 2-word pairs. You can have 3-word terms such as “large data sets” or “interpret data models” and so on.
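Generating n-grams amounts to sliding a window of n tokens across the document. In the sketch below the words of each n-gram are joined with an underscore, which is a common convention but an assumption on our part; the token list is made up.

```python
def ngrams(tokens, n):
    # Slide a window of width n across the token list and join
    # the words in each window into a single term.
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["interpret", "large", "data", "sets"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```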

One choice you may have to make is regarding the sequence of these operations. Sometimes it would be more meaningful to generate the n-grams before stemming. Decisions like these require some amount of iterative analysis. You can see that with each operation, we are reducing the free flowing nature of documents into a more consistent and easily identifiable set of minimum patterns.
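The effect of the ordering can be seen in a small sketch. The stemmer here is a toy that only strips a trailing "ing" or "s" (an assumption for illustration), and the tokens are made up; the point is only that the two orders produce different terms.

```python
def stem(word):
    # Toy stemmer (assumption): strips a trailing "ing" or "s" only,
    # and never leaves fewer than three characters.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bigrams(tokens):
    # Pair each token with its right-hand neighbor.
    return [a + "_" + b for a, b in zip(tokens, tokens[1:])]

tokens = ["mining", "data", "sets"]

# Order 1: stem every token, then build bigrams from the stems.
stem_first = bigrams([stem(t) for t in tokens])

# Order 2: build bigrams from the original words, with no stemming.
ngram_first = bigrams(tokens)

print(stem_first)
print(ngram_first)
```

Stemming first merges more variants into the same term, while generating n-grams first keeps the terms readable. Which one works better depends on your documents.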

In the next article we will dive into the first of the three tasks mentioned above and also introduce some quantitative tools which will impose more structure on the analysis. We hope this one gave you a jump start on using text mining in your next job search!

Originally posted on Fri, Dec 14, 2012 @ 09:05 AM
