The Analytics Compass Blog

Twice-weekly articles to help SMB companies optimize business performance with data analytics and improve their analytics expertise.

Text mining: How to fine tune job searches using web crawling: 2 of 4

This is the second in our article series on text mining using RapidMiner. In the first article we discussed general features of text mining and how it fits into the overall area of predictive analytics. We used job searching as a practical example of applying text mining to a personal project.

To recap, a text mining project typically has three component processes: gather unstructured information, convert it into a structured or semi-structured format, and finally apply any of the standard predictive analytics or data mining techniques to extract insights.

[Figure: high-level text mining process]

In this article we will focus on the first of the three processes and use a web crawling tool to build up our database, which will consist of all job postings that contain the phrase "Data Scientist". Then we can apply the standard preprocessing toolkit to generate a structured data format on which we can apply our data mining techniques.

Web Crawling 

RapidMiner offers a number of tools for extracting information from the web, and their reach and power are quite impressive. The basic idea is to initiate a web crawl on a specified URL, but you can also drill down into the web page by following links. We look at two different operators here. For both operators, you need to specify crawling rules that determine which embedded links within the page are followed.

In our case, we will provide http://www.indeed.com/jobs?q="data+scientist" as the URL and specify that the crawler should follow every link on that main page whose anchor text contains the keywords "Data Scientist".

[Image: web crawling rules to build the dataset for text mining]
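RapidMiner applies this rule internally, but the underlying logic is easy to sketch in plain Python: scan a page for anchor tags and keep only the links whose anchor text contains both keywords. The function, regex, and HTML snippet below are illustrative assumptions, not part of the RapidMiner process itself.

```python
import re

# Naive anchor-tag pattern: captures href and anchor text.
# Good enough for a sketch; real crawlers use a proper HTML parser.
ANCHOR_RE = re.compile(r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>',
                       re.IGNORECASE | re.DOTALL)

def links_to_follow(html, keywords=("data", "scientist")):
    """Return hrefs whose anchor text contains all keywords (case-insensitive)."""
    matches = []
    for href, text in ANCHOR_RE.findall(html):
        text_lower = text.lower()
        if all(kw in text_lower for kw in keywords):
            matches.append(href)
    return matches

page = '<a href="/job/1">Senior Data Scientist</a> <a href="/job/2">Analyst</a>'
print(links_to_follow(page))  # ['/job/1']
```

Only the first link is followed, because only its anchor text contains both keywords.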

Operator 1. Process Documents from Web: This is the slightly faster of the two because it only stores the followed URL information, unless "add pages as attributes" is checked. 

Operator 2. Crawl Web: This allows you to store each followed link as a text file. You will need to specify the directory into which you wish to write these files (and create the directory if it does not exist). This is slightly slower, but it is better suited for what we are doing.

A word of caution: the RapidMiner wiki asks you to "please be friendly to the web site owners and avoid causing high traffic on their sites"; if you try to crawl too many pages and links, be warned that the process will be slow. Another point to note concerns the crawling rules. RapidMiner allows you to apply any regular expression in the "rule value" field (see image above). This means you could use a wildcard expression like ".*Data Scientist.*", which will pull up any link containing that phrase. This can also slow things down. For our application we will not use wildcards but just stick to the two main keywords: data and scientist.
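As a rough illustration of how such rule values behave (outside of RapidMiner), here is the same matching done with Python's re module. The patterns and anchor strings are invented for the example:

```python
import re

# In regex syntax the "match anything" wildcard is ".*", not "*".
phrase_rule  = re.compile(r".*Data Scientist.*")                  # exact phrase, case-sensitive
keyword_rule = re.compile(r".*data.*scientist.*", re.IGNORECASE)  # our two keywords, any case

anchors = ["Senior Data Scientist - Boston",
           "Data Analyst",
           "Lead data scientist (remote)"]

print([a for a in anchors if phrase_rule.match(a)])
# ['Senior Data Scientist - Boston']
print([a for a in anchors if keyword_rule.match(a)])
# ['Senior Data Scientist - Boston', 'Lead data scientist (remote)']
```

The looser the rule, the more links get followed, which is exactly why broad wildcards slow the crawl down.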

Looping

Our objective is to crawl all the job postings which have the words "Data Scientist" in their anchor text. A simple search on indeed.com returns 400+ jobs posted for this sort of position. Clearly these are listed over several pages, and we need to traverse the pages one by one. This is where we will need the Loop operator. 

The 400+ jobs are spread over 38 pages, and each page is accessed by a slightly different URL, as seen below: 

http://www.indeed.com/jobs?q=%22data+scientist%22&start=10

http://www.indeed.com/jobs?q=%22data+scientist%22&start=20

...

http://www.indeed.com/jobs?q=%22data+scientist%22&start=370

We will need to embed the Crawl Web operator inside the Loop operator and specify 37 iterations for the loop counter. 
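The URL pattern above makes the loop easy to sketch in plain Python: with pagePos advancing by 10 per iteration, the 37 URLs the loop visits can be generated like this (a sketch of what the Loop operator and macro do together, not RapidMiner code):

```python
# pagePos starts at 10 and advances by 10 on each of the 37 iterations,
# mirroring the Generate Macro step inside the Loop operator.
BASE_URL = 'http://www.indeed.com/jobs?q=%22data+scientist%22&start={}'

urls = [BASE_URL.format(page_pos) for page_pos in range(10, 380, 10)]

print(len(urls))   # 37
print(urls[0])     # http://www.indeed.com/jobs?q=%22data+scientist%22&start=10
print(urls[-1])    # http://www.indeed.com/jobs?q=%22data+scientist%22&start=370
```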

[Image: looping the web crawl for text mining]

As we go through each iteration, we will need to update a variable called "pagePos". This is accomplished using a Generate Macro operator: after each iteration, we add 10 to this variable, as shown below. You may also want a Log operator to keep track.

[Image: loop counter for advancing pages to crawl for text mining]

The pagePos value will then be appended to two fields within the Crawl Web operator:

1. the URL field, so that we advance to the next page for crawling

2. the directory in which to store the data. Each time we finish crawling one of the job posting pages (of the 37 in this case), the Crawl Web operator writes the output of each followed link (an HTML page) into a text file. These files are named 0.txt, 1.txt, ... by default, so we will have to store each page's output in its own directory; otherwise RapidMiner will overwrite the saved files.

[Image: storing crawled pages for text mining]
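The overwrite problem is easy to see in a small sketch: the file names restart at 0.txt on every iteration, so each iteration needs its own output directory. The directory and file names below are hypothetical examples, not names RapidMiner requires.

```python
import os

def output_dir(base, page_pos):
    """One subdirectory per loop iteration, e.g. crawl_output/page_10/."""
    d = os.path.join(base, "page_{}".format(page_pos))
    os.makedirs(d, exist_ok=True)
    return d

def save_page(directory, index, html):
    """Mimic Crawl Web's default naming: 0.txt, 1.txt, ... per directory."""
    path = os.path.join(directory, "{}.txt".format(index))
    with open(path, "w") as f:
        f.write(html)
    return path

# Two iterations both produce a 0.txt, but in different directories,
# so neither overwrites the other.
p1 = save_page(output_dir("crawl_output", 10), 0, "<html>posting A</html>")
p2 = save_page(output_dir("crawl_output", 20), 0, "<html>posting B</html>")
print(p1, p2)
```

Appending pagePos to the directory name gives every iteration a distinct target, which is exactly what the macro-in-directory trick achieves in RapidMiner.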

In the next part we will examine the output of this process and run the second main step in text mining: converting the raw data into a semi-structured format using Process Documents from Files. The final article will show how to run a variety of data mining algorithms on this digested information.

All articles and accompanying processes will be available as a consolidated ebook download by the end of the series. Stay tuned!

In the meantime, take our survey and give us feedback on our upcoming RapidMiner textbook, if you haven't already.

Comments

Dear SimaFore, 
I think this article series on text mining by Bala Deshpande is indispensable for anyone who wants to start learning this amazing data mining tool. 

I can say: "Well done!" 

I succeeded in populating a database for subsequent analysis. 

Now I'd like to get earthquake-related Italian article data from 

http://sitesearch.corriere.it/forward.jsp?q= . 

Searching for these words: "terremoto scosse" you will find 670 articles. 

The pagination system uses a JavaScript script to generate the pageNumber variable. 

The form uses the POST method and hidden input variables, unlike the GET method used in Bala Deshpande's web crawling articles. 

Maybe for you this is a simple question; I am a newbie in the data mining field, so please explain to me how I can proceed. 

Which RapidMiner operators do I have to use? 
How can I set the JavaScript pageNumber variable to loop over the article extraction? 

You could also write a new article about web crawling from online data archive search engines that use POST method forms and JavaScript, because it seems a non-trivial topic. 

I await your kind answer and wish SimaFore logarithmic success! 

Have a good day, 
Alex 
Posted @ Friday, May 03, 2013 3:18 AM by Alex
Alex 
 
Thanks for your kind words. I will take a look at this site and probably post an article in a few days describing how we could do this using RapidMiner.
Posted @ Friday, May 03, 2013 12:44 PM by Bala Deshpande
Hello, 

Really great post, but when I try to start the process it fails, saying "Generation exception: 'Unrecognized symbol "pagePos"'". 

What does it mean? 

Thanks for the help
Posted @ Wednesday, January 22, 2014 5:08 AM by Daniel
Daniel 
In the "Context" tab in RapidMiner, a variable called pagePos needs to be defined and initialized. Please check for that and run the model again.
Posted @ Wednesday, January 22, 2014 11:14 AM by Bala Deshpande
Hello again, can you help me with defining and initializing "pagePos" in the Context tab? In the Context tab, under Macros, I have put pagePos as the macro and %{pagePos}+10 as the value. 
The same error shows :( 
I really need your help. Thanks in advance for any suggestions.
Posted @ Wednesday, January 22, 2014 11:28 AM by Daniel
All, 
Don't forget to add a macro by going to the "Context" tab in RapidMiner. This macro must be titled "pagePos" and initialized to some number (for example, 10).
Posted @ Thursday, January 23, 2014 1:00 PM by Bala Deshpande