Text mining involves converting unstructured data into a semi-structured format before applying any standard machine learning algorithm. Several intermediate steps are necessary to get to this point. With so many steps needed to deliver the final result, the process design can become quite complicated, which also makes debugging and experimenting tricky. The simplest way to reduce this complexity is to break your process into two or more completely self-contained intermediate processes.
To streamline this, here are three tips that will ensure you do not spend too much time debugging or experimenting with different algorithms.
Tip 1: Once you create intermediate processes, store the intermediate results. This ensures that if there is a bug and you fix it, you will not have to re-run your entire process; you simply start with the results of the earlier completed process and proceed. Take a simple example: if you are reading and mining blogs or RSS feeds, a thousand 500-word blogs amount to half a million words or tokens. You do not want to repeat this step every time you try a different algorithm (say, Naive Bayes versus Support Vector Machine) or different parameters with the same algorithm. The "Store" operator comes in very handy for all of these purposes. To access data stored this way, simply use the "Retrieve" operator.
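Outside of RapidMiner, the same Store/Retrieve pattern can be sketched in a few lines of Python: run the expensive tokenization step once, save its result to disk, and reuse the saved file on later runs. The cache file name and the toy tokenizer below are illustrative assumptions, not part of any RapidMiner process.

```python
import pickle
from pathlib import Path

CACHE = Path("tokens.pkl")  # illustrative cache file name

def tokenize(docs):
    # the expensive step you only want to run once (toy tokenizer)
    return [d.lower().split() for d in docs]

def get_tokens(docs):
    if CACHE.exists():
        # "Retrieve": reuse the stored intermediate result
        return pickle.loads(CACHE.read_bytes())
    tokens = tokenize(docs)
    # "Store": save the result so later experiments skip this step
    CACHE.write_bytes(pickle.dumps(tokens))
    return tokens
```

On the first call the tokens are computed and stored; every later call (for example, when switching from Naive Bayes to an SVM) loads them from disk instead of reprocessing the raw text.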
Tip 2: Pay attention to the number of attributes that result from converting text data into a document vector. As we have described earlier, converting text into document vectors is the key step in text mining. Every word or token in a text can theoretically become an attribute in the document vector. Recall that each row or example of a document vector is a unique text data file, so a 500-word article can essentially be transformed into a matrix of several hundred attribute columns and one row. When you mine a thousand or more articles, you can end up with a document vector that has tens or hundreds of thousands of attribute columns and thousands of rows. This makes the final rendering of the document vector output very slow. To sidestep this issue, you may want to remove certain very common words that do not add substantial meaning, in addition to the standard English stopwords. For example, if you are mining articles on automobiles, you may be able to remove some very common words that convey no special meaning in this context, such as "car", "vehicle", "automobile", etc. This can easily be done by pruning tokens (the "prune above absolute" option in the Process Documents operator, see image above) or by using a special stopword dictionary.
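To make the pruning idea concrete, here is a minimal Python sketch of building a term-count document vector while dropping both custom stopwords and tokens whose document frequency exceeds an absolute threshold, in the spirit of "prune above absolute". This is an assumed, simplified re-implementation for illustration, not the Process Documents operator itself.

```python
from collections import Counter

def doc_vectors(docs, stopwords=(), prune_above=None):
    # tokenize each document, dropping the custom stopwords
    token_lists = [[t for t in d.lower().split() if t not in stopwords]
                   for d in docs]
    # document frequency: in how many documents each token appears
    df = Counter(t for toks in token_lists for t in set(toks))
    # prune tokens appearing in more than `prune_above` documents
    limit = prune_above if prune_above is not None else len(docs)
    vocab = sorted(t for t, n in df.items() if n <= limit)
    # one row per document, one attribute column per surviving token
    rows = [[Counter(toks)[t] for t in vocab] for toks in token_lists]
    return vocab, rows
```

For an automobile corpus, a word like "car" that occurs in nearly every document would be pruned, shrinking the attribute count before any learner sees the data.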
Tip 3: Weight attributes before applying learning algorithms. As a follow-up to tip 2, feature selection is the next important intermediate step for making the algorithms efficient. We have discussed various feature selection methods, such as information gain and mutual information, before. Using the document vector created earlier, you may next want to apply one or more feature selection algorithms to rank the attributes. After doing this, once again store the intermediate result, which can later be used to drive the final algorithm of choice.
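As a sketch of one such method, the snippet below ranks document-vector attributes by information gain, splitting examples on whether each token is present. The binary present/absent split and the helper names are illustrative assumptions; it is not RapidMiner's weighting operator.

```python
from math import log2
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_col, labels):
    # split examples by whether the token is present (count > 0)
    groups = {}
    for v, y in zip(feature_col, labels):
        groups.setdefault(v > 0, []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g)
                    for g in groups.values())
    return entropy(labels) - remainder

def rank_attributes(vocab, rows, labels):
    # one column per attribute; rank tokens by descending gain
    cols = list(zip(*rows))
    gains = {t: info_gain(col, labels) for t, col in zip(vocab, cols)}
    return sorted(gains, key=gains.get, reverse=True)
```

The ranked list (or the top-k slice of it) is exactly the kind of intermediate result worth storing, so the final learner can be swapped out without recomputing the weights.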
By following these simple tips, you can greatly reduce the effort of text mining and have fun using RapidMiner in the process.
Please take a quick survey to give us feedback on our upcoming book on data mining.