Combining the power of R and RapidMiner for time series forecasting
Handling time series analysis in a tool like RapidMiner requires advanced skills. One has to become very conversant with the Windowing operator and the other tools in the "Series" extension, of which there are 80 or more. While basic time series forecasting tools, such as exponential smoothing, are available as built-in operators, advanced techniques like ARIMA require extensive workarounds. Certain aspects of RapidMiner are "non-conventional", particularly for time series. People who do not want to give up the traditional way of doing time series need not worry: RapidMiner lets you keep your conventional methods by integrating fully with the standard tools.
This is done with the help of RapidMiner's truly flexible integration with the other most popular open source data mining tool, R. There are many packages and libraries in R, specifically tailored to handle time series forecasting in the "traditional" manner. RapidMiner integrates really well with R by providing two mechanisms:
- an interactive console, similar to the native R console and somewhat less sophisticated than RStudio
- and a more powerful full integration of R capabilities within the RapidMiner process design perspective.
The first option is fairly easy to put to work, assuming you have successfully added the R extension to RapidMiner. The second option requires some initial planning. The key is to understand how to pass data from RapidMiner to R and back. Once you understand this simple but important aspect, R essentially becomes another powerful "operator" within the vast library of existing RapidMiner operators. In this article, we will explore this second mechanism in a little more detail using the example of a time series problem.
Our simple time series data consists of 4 columns: a date and 3 numerical quantities which represent monthly sales volumes of three different products. There are 77 samples which include data up to November 2013 and we want to forecast these numbers for the next 12-24 months.
Passing data to R:
Once this data is read into RapidMiner using any of the available tools, we need to pass it to R for analysis. We can send either the entire dataset or only some of its attributes. Sending the entire dataset into R is very easy: simply connect the output of the data retrieval to the "inp" port of the "Execute Script (R)" operator, and the entire dataset is sent to R as a data frame. This video provides more details on this step (see Part 4: Accessing Data). If, on the other hand, you want to send only a few attributes to R, select them first with "Select Attributes". In both cases, the Execute Script (R) operator has to be configured correctly.
Configuring the Execute Script (R) operator:
There are 3 steps here. First, provide the names of the input variables being sent to R. If you are sending the entire dataset (as a data frame), then type the name of this data frame in the second box of the parameters tab (inputs: Edit Enumeration). If you are sending only a few attributes from your data set, each attribute name has to be entered separately as shown below.
The second step is to write the R script in the "script: Edit text" which is the first box of the parameters tab. Make sure that you reference the names of the variables selected above exactly (or rename them within R, as done below). In this case we are only using the attribute WT1 for forecasting along with the Date. Note that Date is being renamed as "Months" inside R.
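To make the script step more concrete, here is a minimal R sketch of the first few lines such a script might contain. It is not the article's original script. The data frame name `sales` is hypothetical (inside RapidMiner it would be whatever name was entered in the inputs box), and the values are synthetic stand-ins for the real WT1 series, generated only so the snippet runs on its own.

```r
# Synthetic stand-in for the data frame RapidMiner would pass in.
# The name `sales` and all values are assumptions for illustration only.
set.seed(1)
m <- seq_len(77)
sales <- data.frame(
  Date = seq(as.Date("2007-07-01"), by = "month", length.out = 77),
  WT1  = 100 + 0.5 * m + 10 * sin(2 * pi * m / 12) + rnorm(77, sd = 3)
)

# Rename the Date column to "Months" inside R.
names(sales)[names(sales) == "Date"] <- "Months"

# Turn WT1 into a monthly time series: 77 observations ending November 2013
# means the series starts in July 2007.
wt1 <- ts(sales$WT1, start = c(2007, 7), frequency = 12)
```

Declaring the series with `frequency = 12` is what lets seasonal methods in R recognise the monthly cycle.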
Extracting data from R:
The final part of configuring the Execute Script (R) operator is to indicate which variables must be sent back to RapidMiner. The R script calls the necessary R libraries and generates several outputs. As seen above, we are running a Holt Winters exponential smoothing forecast and an ARIMA forecast on the attribute WT1. We are extracting both these outputs: xx is the HoltWinters forecast and yy is the ARIMA forecast. These defined output variables within R are extracted in this final step as shown below.
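The two forecasts can be sketched with base R's `stats` package as follows. This is an illustration under stated assumptions rather than the article's script: the series is synthetic, the Holt-Winters smoothing constants are fixed illustrative values (omitting them lets `HoltWinters()` estimate them), and the seasonal ARIMA order is simply a common starting point, not a tuned choice.

```r
# Synthetic monthly series standing in for WT1 (assumption, not real data).
set.seed(1)
m <- seq_len(77)
wt1 <- ts(100 + 0.5 * m + 10 * sin(2 * pi * m / 12) + rnorm(77, sd = 3),
          start = c(2007, 7), frequency = 12)

# Holt-Winters exponential smoothing with trend and additive seasonality.
# alpha, beta and gamma are fixed here purely for illustration.
hw_fit <- HoltWinters(wt1, alpha = 0.3, beta = 0.1, gamma = 0.2)
xx <- predict(hw_fit, n.ahead = 12, prediction.interval = TRUE, level = 0.95)

# Seasonal ARIMA; the (0,1,1)(0,1,1)[12] order is an assumed example.
ar_fit <- arima(wt1, order = c(0, 1, 1),
                seasonal = list(order = c(0, 1, 1), period = 12))
yy <- predict(ar_fit, n.ahead = 12)

# xx is a time series matrix (fit, upr, lwr);
# yy is a list holding the point forecasts ($pred) and standard errors ($se).
```

Both objects cover the 12 months following the last observation. Changing `n.ahead` to 24 would extend the horizon to two years.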
By default, the "type" option in the above step is "Generic R Result". However, to be able to use the generated forecasts within RapidMiner for other data manipulation or analysis purposes, we need to send the R results back as data frames or data tables. This is only possible when the R script converts the standard outputs to a data frame, which is done using the as.data.frame() function within R.
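A hedged sketch of that conversion for an ARIMA forecast is given below. It assumes a synthetic series and an illustrative model order, and it builds the 80% and 95% bands by hand from the forecast standard errors using a normal approximation. The column names (`forecast`, `lo80`, and so on) are made up for this example.

```r
# Synthetic monthly series standing in for WT1 (assumption, not real data).
set.seed(1)
m <- seq_len(77)
wt1 <- ts(100 + 0.5 * m + 10 * sin(2 * pi * m / 12) + rnorm(77, sd = 3),
          start = c(2007, 7), frequency = 12)

# Illustrative seasonal ARIMA fit and 12-month forecast.
fit <- arima(wt1, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
fc <- predict(fit, n.ahead = 12)   # a list, not a data frame

# Normal-approximation multipliers for two-sided 80% and 95% bands.
z80 <- qnorm(0.90)
z95 <- qnorm(0.975)

# Collect forecast and bands in a matrix, then convert to a data frame.
yy <- as.data.frame(cbind(
  forecast = as.numeric(fc$pred),
  lo80     = as.numeric(fc$pred - z80 * fc$se),
  hi80     = as.numeric(fc$pred + z80 * fc$se),
  lo95     = as.numeric(fc$pred - z95 * fc$se),
  hi95     = as.numeric(fc$pred + z95 * fc$se)
))

# Label each row with its forecast month (December 2013 onward).
yy$Months <- format(seq(as.Date("2013-12-01"), by = "month", length.out = 12),
                    "%Y-%m")
```

A flat table like this, with one row per forecast month and one column per band, is the kind of structure that can be handed back as a data table instead of a generic R result.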
This will allow us, for example, to use RapidMiner's charting functions to plot the output. Shown below is the ARIMA forecast (the "yy" output variable) with the 80% and 95% confidence bands, which are automatically produced by R.