In the first article of this series we discussed how regular expressions, when appropriately set up, can increase the power of RapidMiner's operators. We also compared the two most popular open source data mining tools on their preprocessing abilities for text mining. We found that R, with its very solid data framing capabilities, can excel when it comes to exploring the contents of a data set, from counting lines to removing selected lines. However, RapidMiner, with its specialized operators such as "Remove Document Parts" or "Split File by Content", can simplify an analyst's job in much the same way. So the score is tied at 1-1!
In this second article, we will jump into the heart of text mining: the creation of document vectors or term document matrices (TDM) using RapidMiner. Recall that at the end of step 3, we had 37 separate documents, each representing a complete work of Shakespeare (see this article for steps 1 to 3). In step 4, we remove some additional material from each of the works, namely the list of main characters or DRAMATIS PERSONAE, and then recombine all 37 documents into a single corpus (to match the process used in the R example). Strictly speaking, this recombination is not necessary; it depends on the final objective of the text mining. For example, if the objective were simply to study one or two complete works, the recombination would not be needed.
Step 4: Creation of the Term Document Matrix (TDM) or Document Vector
In R, "the main structure for managing documents is a so-called Corpus, representing a collection of text documents" according to the CRAN manual for the text mining library, library(tm). This abstract corpus is created using the Corpus(x, readerControl) function, where x is the document source, which can be a directory on the file system, a character vector or a data frame. The readerControl parameter indicates which type of document is to be read (pdf, txt, doc, etc.) along with the language of the text, given as an ISO language code.
Once we create this corpus, or collection of documents, we can apply all the text mining library functions to reduce it to a term document matrix or, in RapidMiner lingo, the Document Vector. The document vector or TDM is a data frame with one row per document (identified by document name, id or number) and one column for each individual word (or token or term) found in the documents. Each entry in this matrix can be a simple term count, a term frequency or a tf-idf score. The output from RapidMiner for the TDM is shown below.
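To make the structure of the matrix concrete, here is a minimal Python sketch that builds a TDM from two toy documents and then reweights it with tf-idf. It is not what RapidMiner or R run internally: the function names build_tdm and tf_idf, the whitespace tokenization and the sample texts are all assumptions made for illustration, and the tf-idf formula is just one common definition, so the exact numbers may differ from what either tool reports.

```python
import math
from collections import Counter

def build_tdm(docs):
    """Return (vocab, rows): rows maps each doc id to its term counts, ordered like vocab."""
    counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    vocab = sorted(set().union(*counts.values()))
    rows = {doc_id: [c[term] for term in vocab] for doc_id, c in counts.items()}
    return vocab, rows

def tf_idf(vocab, rows):
    """Reweight counts as (term frequency) * log(N / document frequency)."""
    n_docs = len(rows)
    # Document frequency: in how many documents does each term appear?
    df = [sum(1 for r in rows.values() if r[j] > 0) for j in range(len(vocab))]
    weighted = {}
    for doc_id, r in rows.items():
        total = sum(r) or 1  # guard against an empty document
        weighted[doc_id] = [
            (r[j] / total) * math.log(n_docs / df[j]) for j in range(len(vocab))
        ]
    return weighted

# Toy stand-ins for the 37 works (illustrative text only).
docs = {
    "hamlet": "to be or not to be",
    "macbeth": "out out brief candle",
}
vocab, rows = build_tdm(docs)
weights = tf_idf(vocab, rows)
```

Each row of rows is one document and each position in the row lines up with one term in vocab, which is the same layout as the document vector described above.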
So how do we generate this matrix from our 37 independent documents?
The trick is to use a "Loop Files" operator and "iterate over files" (a check box in the corresponding parameter setting) by pointing it to the directory where our 37 files reside. Inside this nested loop operator we insert a "Read Document" operator, which pipes its output to our old friend "Remove Document Parts". This operator, with the necessary regular expression, removes the DRAMATIS PERSONAE from each of the 37 documents. The output of "Loop Files" is then connected to a standard "Process Documents" operator. This nested operator contains all the other text mining preprocessing steps: creating tokens, transforming to lower case, removing stopwords, and stemming.
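For readers who think more easily in code, the operator chain above can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the regular expression assumes the cast list runs from "DRAMATIS PERSONAE" up to the first "ACT I" heading (the pattern used in the actual process is not shown here), the stopword list is a tiny placeholder, and crude_stem is a hypothetical stand-in for a real stemmer such as Porter's.

```python
import re
from pathlib import Path

# Assumed pattern: the cast list starts at "DRAMATIS PERSONAE" and ends
# just before the first "ACT I" heading. Adjust to the real file layout.
CAST_LIST = re.compile(r"DRAMATIS PERSONAE.*?(?=ACT I\b)", re.DOTALL)

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

def crude_stem(token):
    """Strip a few common suffixes; a placeholder for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    text = CAST_LIST.sub("", text)                      # like Remove Document Parts
    tokens = re.findall(r"[A-Za-z]+", text)             # create tokens
    tokens = [t.lower() for t in tokens]                # transform to lower case
    tokens = [t for t in tokens if t not in STOPWORDS]  # remove stopwords
    return [crude_stem(t) for t in tokens]              # stem

def process_directory(directory):
    """Like Loop Files: preprocess every .txt file and return {file name: tokens}."""
    return {
        path.name: preprocess(path.read_text(encoding="utf-8"))
        for path in sorted(Path(directory).glob("*.txt"))
    }

sample = (
    "HAMLET\nDRAMATIS PERSONAE\nCLAUDIUS king of Denmark\n"
    "ACT I\nThe guards are watching the walls"
)
tokens = preprocess(sample)
```

Calling process_directory on the folder holding the 37 files would give one token list per work, ready to be counted into a TDM.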
The comparators below show the equivalence between R and RapidMiner for accomplishing the above steps.
Now let us look at some of the basic things we can do with our TDM. Let us try to replicate the tasks which the R code has been shown to do here.
The first task was finding the most frequent words in the corpus. In RapidMiner this is automatic with the Process Documents operator, provided we use Term Occurrences as the vector creation method (the other options are TF-IDF, term frequency and binary occurrences). A successful execution of this operator produces two outputs: the document vector itself and a word list. The word list, called WordList, contains the word (or term), its total occurrences in the entire corpus, and the number of distinct documents the term occurs in. Without any pruning (which is similar to not removing sparse terms in R, i.e. not running removeSparseTerms() on the TDM), we end up with 14208 distinct terms across the 37 documents. Sort this list by clicking on the Total Occurrences column header to see the top words, as shown below.
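The word list is easy to reproduce by hand, which helps clarify what its columns mean. The Python sketch below computes, for each term, the total occurrences and the number of documents containing it, then sorts by total occurrences. The function names word_list and prune and the toy token lists are assumptions for illustration; prune filters by an absolute document count, which is only loosely analogous to pruning in RapidMiner or removeSparseTerms() in R.

```python
from collections import Counter

def word_list(tokenized_docs):
    """Return (term, total occurrences, document occurrences), most frequent first."""
    total = Counter()
    doc_occurrences = Counter()
    for tokens in tokenized_docs.values():
        total.update(tokens)                 # every occurrence counts
        doc_occurrences.update(set(tokens))  # each document counts once per term
    rows = [(term, total[term], doc_occurrences[term]) for term in total]
    # Sort by total occurrences (descending), then alphabetically for stable ties.
    return sorted(rows, key=lambda row: (-row[1], row[0]))

def prune(rows, min_docs):
    """Keep only terms that appear in at least min_docs documents."""
    return [row for row in rows if row[2] >= min_docs]

# Toy token lists standing in for the preprocessed works.
tokenized = {
    "play_a": ["love", "king", "love"],
    "play_b": ["king", "crown"],
    "play_c": ["love"],
}
top_words = word_list(tokenized)
frequent_only = prune(top_words, min_docs=2)
```

Here "love" comes first with three occurrences in two documents, and pruning with min_docs=2 drops "crown", which appears in only one document.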
One of the other tasks demonstrated in the R blog, identifying word associations, appears to be very simple in R but has no equivalent operator in RapidMiner. This is again an area where R's longer tenure as an open source project has produced more built-in functions.
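To give a feel for what a word association score measures, here is a small Python sketch that ranks terms by how strongly their per-document counts correlate with those of a chosen term. It assumes the association is a plain Pearson correlation over the columns of the TDM, in the spirit of findAssocs() in the tm library; the function names and the toy counts are made up for illustration, and this is not a RapidMiner operator.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length count vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)

def find_associations(term_columns, term, min_corr):
    """Return (other term, correlation) pairs at or above min_corr, strongest first."""
    target = term_columns[term]
    scores = [
        (other, pearson(target, column))
        for other, column in term_columns.items()
        if other != term
    ]
    return sorted((s for s in scores if s[1] >= min_corr), key=lambda s: -s[1])

# Toy TDM stored by column: term -> counts in four hypothetical documents.
term_columns = {
    "king": [3, 0, 2, 0],
    "crown": [2, 0, 1, 0],
    "love": [0, 4, 0, 3],
}
associations = find_associations(term_columns, "king", min_corr=0.8)
```

In this toy data "crown" rises and falls with "king" and is returned, while "love" appears in the other documents and is filtered out.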