Text Mining: How to store/access results from web crawling

Posted by Bala Deshpande on Thu, Jan 10, 2013 @ 07:28 AM

This is the third in our article series on text mining using RapidMiner. In the first article we discussed general features of text mining and how it relates to the overall area of predictive analytics. We used job searching as a practical example of using text mining in your own personal application. 

In the second article we discussed how to set up a simple web crawling process to automatically read through hundreds of documents from a jobs website.

In this article we will complete the pre-processing step by gathering all the crawled data into a MySQL database table and then reading the collected data from the table into a document processor to convert the job descriptions into important keywords. 

Here again is the main theme: a text mining project typically has three component processes - gather unstructured information (part 2), convert it into a structured or semi-structured format (part 3, this article) and finally apply any of the standard predictive analytics or data mining techniques to extract insights (part 4).

[Image: text mining - high-level process]

Before we convert the crawled data into a semi structured format, we will need to address how to store it locally. What do the results from the web crawling process (described in part 2) look like?

Web crawling process revisited

Recall that our process was as follows:

Set pagePos = 0

For iteration = 1 to n:

-- Crawl the top-level page given by {pagePos}

-- Follow links to all subpages within this top-level page whose link anchor text contains "Data Scientist"

-- Save each resulting web page as a document in a unique local directory

-- Update pagePos by 10
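For readers who prefer code to pseudocode, the loop above can be sketched in plain Python. This is a hedged illustration, not RapidMiner's implementation: the `fetch` and `links_for` callables are hypothetical stand-ins for what the Crawl Web operator does internally, and the URL pattern is assumed.

```python
import os

def crawl(fetch, links_for, base_url, n_pages, out_root, step=10):
    """Sketch of the crawl loop above.

    fetch(url)            -> page text (stand-in for the actual HTTP fetch)
    links_for(page, text) -> subpage links whose anchor text matches `text`
    """
    page_pos = 0
    for _ in range(n_pages):
        # One directory per top-level page position, e.g. crawl0, crawl10, ...
        # (created here for the sketch; RapidMiner requires them to pre-exist)
        out_dir = os.path.join(out_root, "crawl%d" % page_pos)
        os.makedirs(out_dir, exist_ok=True)
        top_page = fetch("%s&start=%d" % (base_url, page_pos))
        # Subpages are saved as 0.txt, 1.txt, 2.txt, ... within the directory
        for i, link in enumerate(links_for(top_page, "Data Scientist")):
            with open(os.path.join(out_dir, "%d.txt" % i), "w", encoding="utf-8") as f:
                f.write(fetch(link))
        page_pos += step
```

The per-position directory is what prevents the 0.txt, 1.txt, ... names from colliding across iterations, as discussed next.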


We made sure that each resulting top level page goes into a separate directory by specifying output dir = C:\Users\...\crawl%{pagePos} in the Crawl Web operator. Note that RapidMiner requires these directories to exist beforehand - it will not create them on the fly. Each subpage is a text file within this directory, named 0.txt, 1.txt, 2.txt and so on. If you don't write each top level page into a unique directory, RapidMiner overwrites the files: for example, pagePos=10 may have 4 subpages stored as 0.txt, 1.txt, 2.txt and 3.txt. When we go to the next iteration, pagePos=20, its subpages will once again be stored with names starting from 0.txt. To avoid this, we write each pagePos into a new directory.
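Since RapidMiner will not create the output directories itself, it helps to pre-create one per page position before running the process. A minimal sketch (the temp directory stands in for a real base path such as the C:\Users\... location used above):

```python
import os
import tempfile

# Pre-create the per-page output directories, since the Crawl Web operator
# will not create them on the fly. A temp dir stands in for the real base path.
base = tempfile.mkdtemp()
for page_pos in range(0, 100, 10):  # top-level page positions 0, 10, ..., 90
    os.makedirs(os.path.join(base, "crawl%d" % page_pos))
```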

The total number of iterations, n, is equal to the total number of top level pages, which we determined manually. All of the above "code" is generated automatically by RapidMiner as we drag and drop the various operators in the main window and connect them.

Here is the output from one top level page crawl, for pagePos=30 (now stored locally in one directory):

[Image: results of crawled pages for text mining]

Processing crawl results for text mining in a database

The above process works for storing results from a small number of pages. But, as mentioned above, because RapidMiner requires that the directory you are writing pages into exists beforehand, this setup becomes impractical when crawling a large number of pages. This is where creating and writing the output to a database is of great value.

In the Crawl Web operator, simply uncheck the "write pages into files" option and connect the output of Crawl Web to a Write Database operator. You will need a running MySQL database and a connection to it from RapidMiner.

[Image: writing web crawling results to a database for text mining]

Selecting the database connection wizard opens a dialog box in which you provide the information RapidMiner needs to access your database. When you execute the above process, all the examples shown in the output image above are stored in a table called "textmine" in a database called "simafore".
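For readers outside RapidMiner, the effect of the Write Database step can be sketched with Python's built-in sqlite3 module standing in for MySQL. The column names here are illustrative assumptions (RapidMiner generates its own), but they mirror our dataset, where one attribute holds the actual crawled page:

```python
import sqlite3

# sqlite3 stands in for MySQL here; the table name "textmine" matches the
# process above, while the column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE textmine (
                    id    INTEGER PRIMARY KEY,
                    url   TEXT,
                    page  TEXT)""")
rows = [(0, "http://example.com/job/0", "<html>Data Scientist role ...</html>"),
        (1, "http://example.com/job/1", "<html>Data Scientist role ...</html>")]
conn.executemany("INSERT INTO textmine VALUES (?, ?, ?)", rows)
conn.commit()
```

Each crawled subpage becomes one row, so the table can grow to hundreds of pages without any directories to pre-create.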

All you need to do now is create a new process that uses Read Database, converts the data in the database into documents using Data to Documents - remember that the third attribute in our dataset is the actual web page - and finally uses the Process Documents operator to convert the information into word lists. The process below does exactly that.
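Conceptually, the Read Database, Data to Documents and Process Documents chain amounts to reading the page text out of the table and reducing each document to word counts. A minimal sketch using simple lowercase tokenization (RapidMiner's document-processing operators do considerably more, as the next article will show):

```python
import re
from collections import Counter

def to_word_list(documents):
    """Lowercase, tokenize, and count words across a collection of documents."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts
```

For example, `to_word_list(["Data Scientist wanted", "data mining and data science"])` counts "data" three times across the two documents.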

[Image: reading web crawling results from the database for text mining]

In the next article we will describe the operators inside the Process Documents operator, which will produce an output that looks like this.

[Image: text mining word list output from web crawling]

All articles and accompanying process will be available for a consolidated download as an ebook by the end of the series. Stay tuned by subscribing to the blog!

In the meantime, take our survey and give us feedback on our upcoming RapidMiner textbook, if you haven't already.

Topics: data mining with rapidminer, text mining, text analytics