R-bloggers recently posted an interesting text mining article that attempts to mine the entire collected works of William Shakespeare. R really shines at some aspects of data mining, particularly preprocessing: the large number of functional shortcuts available for working with data frames virtually spoils the analyst!
RapidMiner, being a new kid on the block, does not have as rich a toolkit as R. While it excels at helping build solid, highly portable processes, you will need to apply some serious elbow grease to make the individual operators crank out what you want. But therein lies its unique power: many of the operators provide a range of options, and your ability to build specialized data mining processes is limited only by your ability to leverage them. For RapidMiner, this means developing a high level of comfort with regular expressions, or regex. (BTW, here is a great resource for those who are not very familiar with regex: Regular Expressions in 10 Minutes. You will need to spend a bit more than 10 minutes with this book to get up to speed.)
Let us make this argument a lot clearer with an example. Specifically, we will take the same text mining example of the collected works of Bill Shakespeare and compare the two top open source data mining tools side by side.
Before we get started here is a high level description of the process (a more general structure for any text mining process) broken into steps:
- Import the raw data which is made freely available by Project Gutenberg
- Preprocess the raw text file by stripping out the non-literary portions such as headers, footers, and copyright statements
- Convert the single text file into 37 separate documents, each document representing one complete work of Shakespeare
- Create a document vector from the 37 documents
- Do cool text mining stuff such as clustering or classification on the document vector
Use the Get Page operator in RapidMiner to download the file from the Project Gutenberg URL. Getting metadata such as the number of lines is not as easy as in R, which has a dedicated idiom: length(readLines()). The table below shows how the two tools compare. RapidMiner imports the data, and its meta data view shows the number of tokens (not lines).
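To make that distinction concrete, here is a minimal Python sketch (the three-line sample text is invented) contrasting a line count, which is what R's length(readLines()) reports, with a whitespace token count, which is closer to what RapidMiner's meta data view shows:

```python
# Invented three-line sample standing in for the downloaded Gutenberg file
sample = ("THE SONNETS\n"
          "by William Shakespeare\n"
          "From fairest creatures we desire increase")

n_lines = len(sample.splitlines())   # what length(readLines()) would report in R
n_tokens = len(sample.split())       # a rough whitespace tokenization, as RapidMiner counts
```

The two numbers measure very different things, which is why the meta tables of the two tools are not directly comparable.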
The next step is to identify the non-literary material that Project Gutenberg has inserted for legal reasons. Because all 37 works are combined into one single text document, there is only one header and one footer to contend with. In the R blog, they counted the number of lines in each and used R's data frame functions to remove those lines. Because we do not know line numbers inside RapidMiner, we will need another method for removing this extraneous text. The copyright statement is a more interesting challenge because a copy of it appears between every one of the 37 works. This is where regular expressions come to help, and we need regex in either tool: the pattern used in the R process can be used unchanged in RapidMiner.
|NOTE on regex above: The full pattern is <<[^>]*>>. The first two << are literal. The character class [^>] matches any character except ">", and the * indicates zero or more of those characters. The last two >> are also literal. So everything between "<<" and ">>", including the delimiters themselves, is selected for removal.|
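As a quick illustration of that pattern in action, here is a Python sketch (the sample string is invented; Python's re syntax for this pattern matches what both tools accept):

```python
import re

# Invented stand-in for a line containing a Project Gutenberg copyright block
text = 'Enter HAMLET. <<THIS ELECTRONIC VERSION IS COPYRIGHT 1990-1993>> To be, or not to be'

# Literal "<<", zero or more non-">" characters, literal ">>"
cleaned = re.sub(r'<<[^>]*>>', '', text)
```

Everything between the angle-bracket delimiters is stripped, while the literary text on either side is left intact.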
In R, they use the strsplit function and supply the regex as the argument. Where do we put the regex in RapidMiner? The operator we need is called, quite appropriately, "Remove Document Parts". It takes only one parameter, which is the regex. Whereas strsplit divides the document into many parts based on the regex, Remove Document Parts only removes the tokens matched by the regex. For dividing the document, we will need another operator. In our case, we want to split the document by each individual work of Shakespeare; using strsplit with this regex divides the document at every copyright statement, yielding far more pieces than the 37 complete works. More on this in step 3.
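The difference between the two behaviors can be sketched in Python (invented sample; re.split stands in for R's strsplit, and re.sub mimics what Remove Document Parts does):

```python
import re

pattern = r'<<[^>]*>>'
text = 'Work one. <<copyright>> Work two. <<copyright>> Work three.'

# strsplit-style: the document is divided into pieces at every match
parts = re.split(pattern, text)

# Remove Document Parts-style: matches are deleted, one document remains
stripped = re.sub(pattern, '', text)
```

Splitting produces one piece per gap between matches, while removal leaves a single cleaned document, which is exactly why we reserve splitting for step 3.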
We can apply similar logic to the header and footer in our RapidMiner process. Using a text editor, we can identify the first and last words (or characters) that demarcate the start of the header and the end of the footer, and adapt our regex accordingly, just as we did to capture and remove the copyright statements between the books.
An important note: use the (?m) flag at the start of the regular expression to enable multiline mode. Without it, ^ and $ anchor only to the start and end of the entire document rather than to each line, and the expression will not work.
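A small Python check (invented sample) shows why the flag matters; the inline (?m) flag works the same way in the Java regex engine that RapidMiner uses:

```python
import re

text = "HAMLET\nby William Shakespeare\nTHE END"

# Without (?m), ^ and $ anchor only to the start and end of the whole string
without_flag = re.findall(r'^THE END$', text)

# With (?m), ^ and $ also match at every line break
with_flag = re.findall(r'(?m)^THE END$', text)
```

The first search finds nothing because "THE END" is not at the start of the string; the second finds the line as intended.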
The final step in this article covers how to split the monolithic document appropriately. In the R blog, they used the copyright statement as the marker to separate the text into individual documents (with the help of strsplit) and as a result ended up with 182 documents. Obviously each complete work is thereby split into several pieces. We will avoid doing this.
The operator we need in RapidMiner is called "Split File by Content". It takes in a single file and subdivides it into as many individual files as needed based on the content. In our case, we see that every complete work starts with a title and the phrase "by William Shakespeare" and ends with "THE END". We can use these markers in a regex to split the single file into its 37 constituent parts.
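The splitting logic can be sketched in Python (a toy two-work corpus, invented for illustration; the real regex operates the same way inside Split File by Content):

```python
import re

# Toy stand-in for the monolithic Gutenberg file: each work ends with "THE END"
corpus = ("THE SONNETS\nby William Shakespeare\nFrom fairest creatures...\nTHE END\n"
          "HAMLET\nby William Shakespeare\nTo be, or not to be...\nTHE END\n")

# (?m) so ^ and $ anchor at line boundaries, as noted in step 2
works = [w.strip() for w in re.split(r'(?m)^THE END$', corpus) if w.strip()]
```

Splitting on the end marker of each work, rather than on the copyright statement, yields one document per complete work instead of 182 fragments.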
We need to provide the input file directory location in the "texts" field and the output file directory location in the "output" field. The "segment expression" value of $0 will number the first split file as seg0.txt, the second one as seg1.txt and so on until seg36.txt.
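The seg0.txt through seg36.txt naming produced by the $0 segment expression can be mimicked in a couple of lines of Python (the documents list is a placeholder):

```python
# Placeholder list standing in for the split documents
documents = ['sonnets ...', 'hamlet ...', 'macbeth ...']

# Mimic RapidMiner's "$0" segment expression: seg0.txt, seg1.txt, ...
filenames = ['seg{}.txt'.format(i) for i in range(len(documents))]
```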
Now we have 37 individual documents which can be processed and converted into document vectors. This will allow us to work with each complete work by itself, or with all of them together if we choose. In the next article we will continue our comparison between the two tools as we get into the heart of text mining.