This is the second in our article series on text mining using RapidMiner. In the first article we discussed general features of text mining and how it relates to the overall area of predictive analytics. We used job searching as a practical example of using text mining in your own personal application.
To recap, a text mining project typically has three component processes: gather unstructured information, convert it into a structured or semi-structured format and finally apply any of the standard predictive analytics or data mining techniques to extract insights.
In this article we will focus on the first of the three processes and use a web crawling tool to build up our database. Our database will consist of all job postings which contain the words "Data Scientist". Then we can apply the standard preprocessing toolkit to generate a structured data format on which we can apply our data mining techniques.
RapidMiner offers a number of tools for extracting information from the web, and their reach and power are quite impressive. The basic idea is to initiate a web crawl on a specified URL, but you can also drill down into the web page by following its links. We look at two different operators here. For both operators, you need to specify crawling rules, which determine which of the links embedded in the page will be followed.
In our case, we will provide http://www.indeed.com/jobs?q="data+scientist" as the url and specify that we need to follow every link in that main url which has the keywords "Data Scientist" in the anchor text.
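To make the anchor-text rule concrete, here is a minimal Python sketch of the same idea outside RapidMiner. It is not how the RapidMiner operators are implemented; the class and function names (AnchorCollector, links_matching_anchor) are hypothetical and the sample HTML is made up for illustration. It collects every link on a page and keeps only those whose anchor text contains the keywords.

```python
from html.parser import HTMLParser
import re


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a link that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def links_matching_anchor(html, pattern):
    """Return the hrefs whose anchor text contains a match for pattern."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    rule = re.compile(pattern, re.IGNORECASE)
    return [href for href, text in parser.links if rule.search(text)]


# Made-up sample page, for illustration only.
sample = """
<a href="/job/1">Senior <b>Data Scientist</b></a>
<a href="/job/2">Java Developer</a>
<a href="/job/3">data scientist, marketing</a>
"""
print(links_matching_anchor(sample, r"data scientist"))  # ['/job/1', '/job/3']
```

The match here is case-insensitive and accepts the keywords anywhere in the anchor text; both choices are assumptions of this sketch rather than documented RapidMiner behavior.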
Operator 1. Process Documents from Web: This is the slightly faster of the two because it stores only the followed URL information unless "add pages as attributes" is checked.
Operator 2. Crawl Web: This allows you to store each followed link as a text file. You will need to specify the directory into which these files are written (create the directory first if it does not exist). This operator is slightly slower, but it is better suited to what we are doing.
Word of caution: the RapidMiner wiki asks you to "please be friendly to the web site owners and avoid causing high traffic on their sites". Be warned, too, that if you try to crawl too many pages and links, the process will be slow. Another point to note concerns the crawling rules. RapidMiner allows you to apply any regular expression in the "rule value" - see image above. This means that you could use a wildcard expression like .*Data Scientist.* (in regular-expression syntax the wildcard is written .* rather than a bare *), which will pull up any link whose text contains those words. This can also slow things down. For our application we will not use wildcards but just stick to the two main keywords: data and scientist.
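The behavior of wildcards in a rule value can be checked outside RapidMiner. The sketch below uses Python's re module as a stand-in; RapidMiner runs on Java, whose regular-expression syntax agrees with Python's for these simple patterns, but whether a rule must match the whole anchor text or only part of it is an assumption you should verify in your own process. The helper names (is_valid_rule, anchor_matches) and the sample anchor text are hypothetical.

```python
import re


def is_valid_rule(rule_value):
    """Return True if rule_value compiles as a regular expression."""
    try:
        re.compile(rule_value)
        return True
    except re.error:
        return False


def anchor_matches(rule_value, anchor_text, whole_text=True):
    """Test a rule against anchor text.

    whole_text=True requires the rule to cover the entire anchor text;
    whole_text=False accepts a match anywhere inside it.
    """
    pattern = re.compile(rule_value)
    if whole_text:
        return pattern.fullmatch(anchor_text) is not None
    return pattern.search(anchor_text) is not None


anchor = "Senior Data Scientist, Analytics"  # made-up anchor text

print(is_valid_rule("*Data Scientist*"))                # False: a bare * has nothing to repeat
print(anchor_matches(".*Data Scientist.*", anchor))     # True
print(anchor_matches("Data Scientist", anchor))         # False: whole text is longer than the rule
print(anchor_matches("Data Scientist", anchor, False))  # True: found inside the text
```

If a rule turns out to be applied to the whole anchor text, the plain keywords will only match links whose text is exactly those words, so it is worth testing a rule on a handful of real anchors before starting a long crawl.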
Our objective is to crawl all the job postings which have the words "Data Scientist" in their anchor text. A simple search on indeed.com returns more than 400 jobs posted for this sort of position. These are listed over several pages, and we need to traverse the pages one by one. This is where we will need to use the Loop operator.
The 400+ jobs are spread over 38 pages and each page is accessed by a slightly different url as seen below:
We will need to embed the Crawl Web operator inside the Loop operator and specify 37 iterations for the loop counter.
As we go through each iteration, we will need to update a variable called "pagePos". This is accomplished by using a "Generate Macro" operator. After each iteration, we will add "10" to this variable as shown below. You may also want a Log operator to keep track.
The pagePos value will then be appended to two fields within the Crawl Web operator:
1. the url field so that we advance to the next page for crawling
2. the directory to store the data. Each time we finish crawling through one of the job posting pages (of the 37 in this case), the Crawl Web operator will write the output from each followed link (which is an HTML page) into a text file. By default these files are named 0.txt, 1.txt, and so on. We therefore have to store each page's files in their own directory; otherwise RapidMiner will overwrite the pages saved in earlier iterations.
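The loop logic described above can be sketched in Python as follows. This is only an illustration of the bookkeeping, not the RapidMiner process itself. Several details are assumptions: the name of the URL parameter that carries pagePos ("start" here), the starting value of pagePos (0 here), and the function names plan_crawl and save_pages, which are hypothetical. No pages are downloaded; the sketch only shows how each iteration gets its own URL and its own directory.

```python
from pathlib import Path
from urllib.parse import urlencode

BASE_URL = "http://www.indeed.com/jobs"
QUERY = '"data scientist"'
STEP = 10        # the article adds 10 to pagePos after each iteration
ITERATIONS = 37  # loop count used in the article


def plan_crawl(output_root, iterations=ITERATIONS, step=STEP, offset_param="start"):
    """Return one (url, directory) pair per loop iteration.

    offset_param is an ASSUMED parameter name; check the real page URLs.
    """
    plan = []
    page_pos = 0  # ASSUMED starting value of pagePos
    for _ in range(iterations):
        params = {"q": QUERY, offset_param: page_pos}
        url = f"{BASE_URL}?{urlencode(params)}"
        # A separate directory per page keeps 0.txt, 1.txt, ... from colliding.
        directory = Path(output_root) / str(page_pos)
        plan.append((url, directory))
        page_pos += step
    return plan


def save_pages(directory, pages):
    """Write each page's HTML to 0.txt, 1.txt, ... inside directory."""
    directory.mkdir(parents=True, exist_ok=True)  # create it if missing
    for index, html in enumerate(pages):
        (directory / f"{index}.txt").write_text(html, encoding="utf-8")


for url, directory in plan_crawl("crawl_output")[:3]:
    print(url, "->", directory)
```

Because the directory name is derived from pagePos, the files written in one iteration can never overwrite those of another, which is the same reason the pagePos value is appended to the directory field in the Crawl Web operator.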
In the next part we will examine the output of this process and run the second main step in text mining: converting the raw data into a semi-structured format using Process Documents from Files. The final article will show how to run a variety of data mining algorithms on this digested information.
All articles and accompanying processes will be available for a consolidated download as an ebook by the end of the series. Stay tuned!
In the meantime, take our survey and give us feedback on our upcoming RapidMiner text book, if you haven't already.