Time Series Forecasting: comparing RapidMiner and R for analysis
Time series forecasting is one of the oldest known predictive analytics techniques. The idea is simple: use historical data to make point forecasts about future data. This is what most predictive models are designed to do. However, there is one important difference. In time series analysis we are concerned with forecasting a specific quantity given that we know how this quantity has varied over time. Predictive analytics models that do not involve time series use what is known as "cross-sectional" data, which, strictly speaking, have no time-varying component.
For some people, particularly engineers like me, model building is about collecting data on several different attributes of a system and using them to fit a "function" that predicts the desired quantity. For example, engineers may use the laws of physics to model and predict how the body of a car will buckle during a crash. Purchasing managers may use data on several commodity prices that influence the final price of a product to "model" the cost of the product. Economists may use unemployment, the consumer price index, factory orders, etc. to model GDP growth. The common thread among all these predictive models is that several different attributes, called predictors or independent variables, which potentially influence a target (forces, product costs or GDP), are used to predict that target variable.
In conventional time series forecasting, there is no difference between a predictor and a target. The predictor is the target! This long preamble helps you to understand one of the advantages of using RapidMiner for time series forecasting. With that, here are the essential differences between two popular open source data mining tools: R and RapidMiner.
Advantages of RapidMiner over R
1. RapidMiner's unique way of converting a time series into cross-sectional data allows the analyst to deploy virtually all available algorithms for forecasting time series. This was briefly described in an earlier article on using RapidMiner for time series forecasting. A very handy video guide posted by Tom Ott on this subject can be found here.
2. Time series forecasting treats prediction as essentially a single-variable problem. If you have a series of points spaced over time, conventional forecasting uses smoothing and averaging to "predict" where the next few points will likely be. However, for complex systems such as the economy or the stock market, point forecasts are virtually useless because these systems are functions of hundreds if not thousands of variables. What is more useful is the ability to predict trends rather than exact values. We can predict the direction of a quantity (will it trend up or down?) with greater confidence and reliability than its exact value. For this reason, different modeling schemes such as artificial neural networks, support vector machines or even polynomial regression can sometimes give highly accurate trend forecasts. With conventional forecasting in R, we are limited to using a variety of smoothing functions.
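The conversion mentioned in the first point can be sketched in a few lines of Python. This is only an illustration of the general windowing idea, not RapidMiner's own implementation: the helper names `window_series` and `direction_labels`, the window width and the sample values are all assumptions made for this example. Each row holds the last few observations as predictors and the next observation as the target, so any standard learner can be trained on it. The second helper turns the target into an up/down label, which matches the trend-oriented framing of the second point.

```python
def window_series(series, window, horizon=1):
    """Turn a univariate series into cross-sectional (features, target) rows.

    Each row uses `window` consecutive values as predictors and the value
    `horizon` steps after the end of the window as the target.
    """
    rows = []
    for start in range(len(series) - window - horizon + 1):
        features = series[start:start + window]
        target = series[start + window + horizon - 1]
        rows.append((features, target))
    return rows


def direction_labels(rows):
    """Replace each numeric target with 'up' or 'down' relative to the
    most recent value in the window (ties count as 'down')."""
    return ["up" if target > features[-1] else "down"
            for features, target in rows]


# Hypothetical sample data, for illustration only.
series = [10, 12, 11, 13, 15, 14, 16]
rows = window_series(series, window=3)
labels = direction_labels(rows)

for (features, target), label in zip(rows, labels):
    print(features, "->", target, label)
```

Once the data are in this form, the time ordering lives in the columns, and the table can be handed to a regression or classification algorithm like any other cross-sectional dataset.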
Limitations of RapidMiner compared to R
1. Any time series can be decomposed into four parts: trend, cyclicality, seasonality and randomness. If you have a time series that is not highly volatile (and is therefore more predictable), conventional time series forecasting can help you understand the underlying structure of the variability better. In such cases, the trend or seasonal components have a stronger signature than the random component. R does a very good job of taking any time series and breaking it up into these components, whereas RapidMiner has no straightforward way of doing this.
2. RapidMiner also has no simple means of adding and displaying forecast confidence bands around the point forecasts. In R, as shown below, a basic plot of the forecast will include the 80% (yellow) and 95% (orange) bands around the forecast values (blue line).
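To make the decomposition in the first limitation concrete, here is a minimal Python sketch of a classical additive decomposition. It is not the code R runs internally. It assumes an additive model with a fixed, known seasonal period, and it folds any cyclical movement into the trend, so it returns three components rather than four. The function names and the synthetic series are made up for this example.

```python
from statistics import mean


def centered_moving_average(series, period):
    """Estimate the trend with a centered moving average.

    For an even period the two end points of the window get half weight,
    so the average stays centered on an actual observation.
    """
    n = len(series)
    half = period // 2
    trend = [None] * n  # undefined at both ends of the series
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        total = sum(window)
        if period % 2 == 0:
            total -= 0.5 * (window[0] + window[-1])
        trend[t] = total / period
    return trend


def decompose_additive(series, period):
    """Split a series into trend, seasonal and remainder components,
    assuming series = trend + seasonal + remainder."""
    trend = centered_moving_average(series, period)
    detrended = [x - tr if tr is not None else None
                 for x, tr in zip(series, trend)]
    # Average the detrended values for each position in the cycle.
    raw = [mean(d for i, d in enumerate(detrended)
                if d is not None and i % period == k)
           for k in range(period)]
    offset = mean(raw)  # center the seasonal pattern on zero
    pattern = [s - offset for s in raw]
    seasonal = [pattern[i % period] for i in range(len(series))]
    remainder = [x - tr - s if tr is not None else None
                 for x, tr, s in zip(series, trend, seasonal)]
    return trend, seasonal, remainder


# Synthetic example: a linear trend plus a repeating pattern of period 4.
true_pattern = [2.0, -1.0, -2.0, 1.0]
series = [float(t) + true_pattern[t % 4] for t in range(16)]
trend, seasonal, remainder = decompose_additive(series, period=4)

print("trend:    ", trend)
print("seasonal: ", seasonal[:4])
print("remainder:", remainder)
```

On a clean synthetic series like this one the remainder is essentially zero. On real data, the relative size of the remainder gives a rough sense of how strong the trend and seasonal signatures are compared with the random component.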
So the bottom line is simple:
If you have a "predictable" time series with strong seasonal and trend signatures, using R and conventional time series forecasting is easy. However, if you have a complex, multivariate dataset and you are more interested in picking up and predicting trends than in making point forecasts, use RapidMiner.
Top Image courtesy: Steve Snodgrass, Creative Commons