We have previously covered the basic process flow for text mining and how RapidMiner makes it intuitive to set up. The figure below shows the high-level process flow for most text mining activities. In this article, which contains some excerpts from our upcoming book on Predictive Analytics using RapidMiner, we will discuss a simple process to perform keyword clustering. The main goal is to crawl a website, identify keywords that characterize some of the most important pages of the site, and then rank those words for each page of interest using clustering tools such as k-means or k-medoids.
The site we explored is hosted by a Public Television station and serves as a platform for reaching members of the local community interested in the Arts and Culture. The site has pages for several related categories: Music, Dance, Theatre, Film, and so on. Each of these pages contains articles and events related to that category. Our goal is to characterize each page in the site and identify the top keywords that appear in it.
RapidMiner provides three different ways to crawl and retrieve content from websites. The Crawl Web operator lets you set up simple crawling rules and, based on those rules, stores the crawled pages in a directory for further processing. The Get Page operator retrieves a single page and stores its content as an example set. The Get Pages operator works similarly, but can access multiple pages identified by URLs listed in an input file.
Step 1 – Gather unstructured data: The first step in this process is to create an input text file containing the list of URLs to be scanned by the Get Pages operator. The first URL is the “Dance” category page and the second is the “Film” category page of the website. The output of the Get Pages operator is an example set with two main attributes: the URL and the extracted HTML content.
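Outside RapidMiner, the Get Pages pattern amounts to fetching each URL from a list and keeping the URL alongside its raw content. A minimal Python sketch is below; the function name `get_pages` and the injected `fetch` callable are our own illustrative choices, not part of any RapidMiner API:

```python
def get_pages(urls, fetch):
    """Fetch each URL and return rows of {url, content}, loosely
    analogous to the example set produced by Get Pages."""
    rows = []
    for url in urls:
        # Each row pairs the source URL with the retrieved HTML.
        rows.append({"url": url, "content": fetch(url)})
    return rows
```

In a real run, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read().decode()`; injecting it keeps the sketch testable without network access.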
Step 2 – Preprocessing: Next, connect this output to a Process Documents from Data operator. This is a nested operator inside which all the preprocessing takes place. The first preprocessing step is to remove all HTML tags and keep only the actual content, which is done by the Extract Content operator. Other text mining operators may be applied as needed inside the nested box.
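The tag-stripping step can be approximated in plain Python with the standard library's `html.parser`. This is only a rough stand-in for what the Extract Content operator does, which uses more sophisticated heuristics to isolate the main content of a page:

```python
from html.parser import HTMLParser

class ContentExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0          # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def extract_content(html):
    """Return the visible text of an HTML page as one string."""
    parser = ContentExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```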
Step 3 – Apply a clustering (descriptive analytics) technique: The output of Process Documents from Data consists of 1) a word list and 2) a document vector. The word list is not needed for clustering, but the document vector is. This output is filtered further to prune rare attributes, removing all words that occur fewer than 5 times across the two documents.
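Conceptually, the document vector is a term-frequency matrix, and the pruning step drops columns whose total count falls below a threshold. A small sketch of that idea, assuming simple whitespace tokenization (RapidMiner's tokenizer and vector schemes are more configurable):

```python
from collections import Counter

def document_vectors(docs, min_total=5):
    """Build term-frequency vectors and prune terms whose total
    count across all documents is below min_total."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    total = Counter()
    for c in counts:
        total.update(c)
    # Keep only terms that clear the absolute pruning threshold.
    vocab = sorted(t for t, n in total.items() if n >= min_total)
    return vocab, [[c[t] for t in vocab] for c in counts]
```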
Finally, this cleaned output is fed into a k-medoids clustering operator, as shown in the figure of the entire process. The clustering output clearly shows the top keywords for each of the two crawled pages.
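To make the k-medoids idea concrete, here is a deliberately tiny version that exhaustively tries every k-subset of points as medoids and keeps the lowest-cost choice. This is practical only for a handful of documents; RapidMiner and production libraries use the iterative PAM algorithm instead:

```python
from itertools import combinations
import math

def k_medoids(points, k):
    """Exhaustive k-medoids: pick the k data points that minimize
    the total distance from every point to its nearest medoid."""
    best, best_cost = None, float("inf")
    for medoids in combinations(range(len(points)), k):
        cost = sum(min(math.dist(points[i], points[m]) for m in medoids)
                   for i in range(len(points)))
        if cost < best_cost:
            best, best_cost = medoids, cost
    # Assign each point to its closest medoid.
    assignment = [min(best, key=lambda m: math.dist(points[i], points[m]))
                  for i in range(len(points))]
    return list(best), assignment
```

Unlike k-means, the cluster centers here are actual documents, so each cluster can be summarized directly by the top keywords of its medoid page.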
The data, the process, and a more detailed description are available in the Text Mining chapter of our book. Please take the quick survey (1 required question, 2 optional) below to send us your feedback. Thank you.