

Text Mining: How to store/access results from web crawling

  
  
  

This is the third article in our series on text mining using RapidMiner. In the first article we discussed general features of text mining and how it relates to the overall area of predictive analytics, using job searching as a practical example of how you could apply text mining to a personal project.

In the second article we discussed how to set up a simple web crawling process to automatically read through hundreds of documents from a jobs website.

In this article we will complete the pre-processing step by gathering all the crawled data into a MySQL database table and then reading the collected data from the table into a document processor to convert the job descriptions into important keywords. 

Here again is the main theme: a text mining project typically has three component processes - gather unstructured information (part 2), convert it into a structured or semi-structured format (part 3, this article) and finally apply any of the standard predictive analytics or data mining techniques to extract insights (part 4).

[Image: text mining - high level process]

Before we convert the crawled data into a semi-structured format, we will need to address how to store it locally. What do the results from the web crawling process (described in part 2) look like?

Web crawling process revisited

Recall that our process was as follows:

Set pagePos = 0

For iteration = 1 to n {

    Crawl the top-level page given by http://www.indeed.com/jobs?q=%22data+scientist%22&start=%{pagePos}

    Follow links to all subpages within this top-level page whose link anchor text contains the words "Data Scientist"

    Save each resulting web page as a document in a unique directory locally

    Update pagePos by 10

}

We made sure that each resulting top-level page goes into a separate directory by specifying output dir = C:\Users\...\crawl%{pagePos} in the Crawl Web operator. Note that RapidMiner requires these directories to exist - it will not create them on the fly. Each subpage is a text file within this directory, named 0.txt, 1.txt, 2.txt and so on. If you don't write each top-level page into a unique directory, RapidMiner overwrites the files - for example, pagePos=10 may have 4 subpages stored as 0.txt, 1.txt, 2.txt and 3.txt. When we move to the next iteration, pagePos=20, the subpages for pagePos=20 will once again be stored with names starting from 0.txt, and so on. To avoid this, we write each pagePos into a new directory.

The total number of iterations, n, was equal to the total number of top-level pages, which we determined manually. All of the above "code" is generated automatically by RapidMiner when we drag and drop the various operators in the main window and connect them.
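For readers who prefer to see the equivalent logic outside of RapidMiner, here is a minimal Python sketch of the same loop. It is only an illustration under assumptions: the output directory root, the AnchorCollector helper and the link-matching details are ours, and the live markup on indeed.com may well differ.

    import os
    import urllib.request
    from urllib.parse import urljoin
    from html.parser import HTMLParser

    class AnchorCollector(HTMLParser):
        """Collect hrefs of <a> tags whose visible anchor text contains a phrase."""
        def __init__(self, phrase):
            super().__init__()
            self.phrase = phrase.lower()
            self.links = []
            self._href = None
            self._text = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self._href = dict(attrs).get("href")
                self._text = []

        def handle_data(self, data):
            if self._href is not None:
                self._text.append(data)

        def handle_endtag(self, tag):
            if tag == "a" and self._href:
                if self.phrase in "".join(self._text).lower():
                    self.links.append(self._href)
                self._href = None

    def fetch(url):
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("utf-8", errors="ignore")

    def crawl(n_top_pages, out_root="crawl_output"):  # out_root is a hypothetical local path
        page_pos = 0
        for _ in range(n_top_pages):
            top_url = ("http://www.indeed.com/jobs?q=%22data+scientist%22"
                       "&start=" + str(page_pos))
            out_dir = os.path.join(out_root, "crawl%d" % page_pos)
            os.makedirs(out_dir, exist_ok=True)  # unlike RapidMiner, create the directory on the fly

            # Follow only links whose anchor text contains "Data Scientist"
            collector = AnchorCollector("Data Scientist")
            collector.feed(fetch(top_url))

            # Save each subpage as 0.txt, 1.txt, ... inside this page's own directory
            for i, href in enumerate(collector.links):
                page = fetch(urljoin(top_url, href))
                with open(os.path.join(out_dir, "%d.txt" % i), "w", encoding="utf-8") as f:
                    f.write(page)

            page_pos += 10  # Indeed lists 10 results per top-level page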

Here is the output from one top-level page crawl, for pagePos=30, now stored in one local directory:

[Image: results of crawled pages for text mining]

Processing crawl results for text mining in a database

The above process is useful for storing results from a small number of pages. As mentioned above, because RapidMiner requires that the directory into which you are writing the pages exists beforehand, this setup becomes impractical for crawling a large number of pages. This is where creating and writing the output to a database is of great value.

In the Crawl Web operator, simply uncheck the "write pages into files" option and connect the output of Crawl Web to a Write Database operator. You will need to start a MySQL database and establish a connection to it from RapidMiner.

[Image: writing web crawling results to a database for text mining]

Selecting the database connection wizard opens a dialog box in which you provide the information RapidMiner needs to access your database. When you execute the above process, all the examples shown in the output image above are stored in a table called "textmine" in a database called "simafore".
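If you want to verify what the Write Database operator actually stored, a short script outside RapidMiner can read the table back. This is a sketch under assumptions: a local MySQL server, the mysql-connector-python package, and placeholder credentials; since the column names depend on what Crawl Web produced, we simply print the schema rather than assume it.

    import mysql.connector  # third-party package mysql-connector-python (assumed installed)

    # Placeholder credentials - replace with your own MySQL host, user and password
    conn = mysql.connector.connect(host="localhost", user="root",
                                   password="secret", database="simafore")
    cur = conn.cursor()

    # Peek at the table RapidMiner wrote, without assuming its column names
    cur.execute("SELECT * FROM textmine LIMIT 5")
    rows = cur.fetchall()
    print("columns:", [col[0] for col in cur.description])
    print("sample rows:", len(rows))

    cur.close()
    conn.close()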

All you need to do now is create a new process: read the table back with Read Database, convert the stored data into documents - remember that the third attribute in our dataset is the actual web page - using Data to Documents, and finally use the Process Documents operator to convert the information into word lists. The process below does exactly that.

[Image: reading web crawling results from a database for text mining]

In the next article we will describe the operators inside the Process Documents operator, which will produce an output that looks like this.

[Image: text mining word list output from web crawling]
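As a rough, outside-RapidMiner preview of how a stored page becomes such a word list, here is a small Python sketch. The tag-stripping regex, the tiny stopword list and the file path are simplifications and assumptions, not what the Process Documents operator does internally.

    import re
    from collections import Counter

    # A handful of stopwords - RapidMiner's filtering is far more thorough than this
    STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for", "with", "is", "are"}

    def page_to_wordlist(html_text):
        """Strip tags, tokenize, lowercase, drop stopwords and count the remaining terms."""
        text = re.sub(r"<[^>]+>", " ", html_text)     # crude tag removal
        tokens = re.findall(r"[a-z]+", text.lower())  # keep alphabetic tokens only
        return Counter(t for t in tokens if t not in STOPWORDS)

    # Example: run it on one of the crawled files (path is hypothetical)
    with open("crawl_output/crawl30/0.txt", encoding="utf-8") as f:
        print(page_to_wordlist(f.read()).most_common(10))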

All articles and the accompanying processes will be available as a consolidated ebook download at the end of the series. Stay tuned by subscribing to the blog!

In the meantime, take our survey and give us feedback on our upcoming RapidMiner textbook, if you haven't already.

Comments

It seems like RapidMiner enqueues all of the scraped page content before writing it to the database. This is a problem if there is an error recovering one of the pages, because it drops the entire operation, leaving no results. I've tried to fix this by putting my results into an example loop for each recovered URL, but it seems that RapidMiner still enqueues all data before writing it. Is there any fix or option for this?
Posted @ Sunday, July 21, 2013 4:49 AM by TW
I am new to RapidMiner. I installed RapidMiner 5 and updated the Text Processing and Web Mining extensions; it says they are up to date, but Text Processing is not seen.
Posted @ Thursday, November 21, 2013 1:35 AM by prabha
Prabha - have you tried to search for any of the text mining operators in the Operators view panel? Try something like "Crawl Web" or "Process Document ..."
Posted @ Thursday, November 21, 2013 8:48 AM by Bala Deshpande