Time Series Forecasting: Data Partitioning and Visualizing Results

Posted by Bala Deshpande on Tue, Nov 20, 2012 @ 07:09 AM

Data partitioning is a necessary step in any predictive analytics exercise. The basic idea is to separate the available data into a training set and a testing (or validation) set. Why do we need to partition? Primarily because a model that fits the "seen" training data well tells us little by itself; by holding out a testing set, we can check how the model performs on data it has not seen and gain some confidence in its predictive power. For cross-sectional data, we usually take great care to ensure that the training and testing samples are randomly chosen, or, in the case of unbalanced data, carefully chosen.

We need to apply a similar procedure for time series modeling. However, we cannot randomly choose samples for training and testing: time series data by definition occur in a temporal sequence, and this order must be preserved to keep the structure of the series intact.

An easy solution for partitioning a time series is to choose the first 80% of the samples as the training set and the remaining, more recent 20% as the validation set. There is, of course, no hard and fast requirement for the 80-20 mix; it could equally well be 70-30 or even 90-10.
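In code, the split is just a slice at the chosen cut-off rather than a random sample. Here is a minimal sketch in Python, using a synthetic array as a stand-in for a real series:

```python
import numpy as np

# Synthetic monthly series -- a placeholder for real data such as commodity prices.
rng = np.random.default_rng(0)
prices = 20000 + rng.normal(0, 1500, size=60).cumsum() / 4

# Chronological split: first 80% for training, most recent 20% for validation.
# Note there is no shuffling -- the temporal order must be preserved.
split_point = int(len(prices) * 0.8)
train, validation = prices[:split_point], prices[split_point:]

print(f"Training samples: {len(train)}, validation samples: {len(validation)}")
```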

Selecting a good model

The example below shows how partitioned data can be useful in selecting the right time series modeling technique for future predictions. 

[Figure: data partitioning for time series forecasting]

As seen in the chart above, this simple partitioning lets us test the quality of each fitted model. The thick blue line represents the price of a metal commodity (nickel) from October 2007 to October 2012; the samples from November 2011 to October 2012 are used as the validation set. Clearly the multilayer perceptron model (green) is significantly biased by the trend in the most recent samples of the training set. The Holt-Winters 3-parameter model (purple) does a good job of capturing the immediate seasonal spike but overpredicts the more recent prices. The best performer appears to be the linear regression model (red), which has the lowest forecast errors of the three.
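The post does not say which tool produced the chart, but the comparison itself is straightforward to reproduce. The sketch below, which assumes a synthetic series in place of the actual nickel prices, fits statsmodels' ExponentialSmoothing (a standard implementation of the Holt-Winters 3-parameter model) and a simple least-squares trend line, then scores both on the held-out year by RMSE:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic monthly series -- substitute the actual price data here.
rng = np.random.default_rng(1)
idx = pd.date_range("2007-10-01", periods=60, freq="MS")
series = pd.Series(20000 + rng.normal(0, 1500, 60).cumsum() / 4, index=idx)

# Chronological split: last 12 months held out for validation.
train, validation = series.iloc[:48], series.iloc[48:]

# Holt-Winters 3-parameter (triple exponential smoothing) model.
hw = ExponentialSmoothing(
    train, trend="add", seasonal="add", seasonal_periods=12
).fit()
hw_forecast = hw.forecast(len(validation))

# Simple linear trend fit on the training data, extrapolated forward.
t = np.arange(len(series))
slope, intercept = np.polyfit(t[:48], train.values, 1)
lr_forecast = intercept + slope * t[48:]

def rmse(actual, predicted):
    return np.sqrt(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2))

print(f"Holt-Winters RMSE:      {rmse(validation, hw_forecast):.1f}")
print(f"Linear regression RMSE: {rmse(validation, lr_forecast):.1f}")
```

Whichever model posts the lowest validation error is the natural candidate for producing the forward-looking forecast.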

Presenting forecast results

Once a good model is established, the success of the analysis is determined by presentation: the results must be relayed meaningfully so that managers can use them to guide their decision making. This is a critical step and, surprisingly, often requires many iterations of back-and-forth review with the end user.

A line chart is typically the simplest way to do this. We can simply extend a chart such as the one shown above to include future dates (for example, December 2012 and beyond) and the corresponding forecasted values. The challenge with this visualization is that once forecasts are made, it has no efficient way of capturing newly emerging data. For example, when we make demand forecasts in early October for October, November, and December, those three values can be shown on the chart. But by early November we will have actual sales data for October, and for management that tracks quarterly or annual sales, it is very useful to fold this newly emerged data into the quarterly forecast. The updated quarterly forecast will then include actual data for October and new forecasted values for November and December (produced by a model that has absorbed the October actuals into its training set).

This type of rolling update to aggregate forecasts is very useful for production planning and budgeting, and a simple line chart will not sufficiently guide such decision making. The chart below shows one way to present this information to management. The orange bars are quarterly "hybrid" forecasts, which combine actual volumes with forecasted volumes. For example, the Q4 hybrid forecast made in November includes the actual volume from October plus the forecasted values for November and December.
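As a concrete sketch (with hypothetical volumes, and the model refit step elided), the hybrid figure is just the sum of actuals for elapsed months and re-forecasted values for the months of the quarter that have not yet closed:

```python
def hybrid_quarterly_forecast(actuals, forecasts, months_in_quarter):
    """Sum actual volumes for elapsed months and forecasted volumes
    for the remaining months of the quarter."""
    return sum(
        actuals[month] if month in actuals else forecasts[month]
        for month in months_in_quarter
    )

# Hypothetical Q4 volumes -- placeholders, not real data.
actuals = {"Oct": 960.0}                    # October has closed by early November
forecasts = {"Nov": 1080.0, "Dec": 1230.0}  # re-forecast after absorbing October actuals

q4 = ["Oct", "Nov", "Dec"]
print(f"Q4 hybrid forecast: {hybrid_quarterly_forecast(actuals, forecasts, q4):,.0f}")
```

Each month, the actuals dictionary grows by one entry and the remaining months are re-forecast, so the hybrid bar steadily converges to the true quarterly total.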

[Figure: visualizing time series forecast data for decision making]

The key point of this discussion is that good predictive modeling does not end with quantifying forecast accuracy and producing a simple chart. To meaningfully integrate the analysis into the decision-making process, dashboards have to be easily understood and aid a manager's thought process. For analytics to become ingrained in the culture of a business, simplifying the process of consuming this knowledge is extremely important.

Download our free whitepaper on forecasting applied to product cost management.



Tags: predictive analytics, time series analysis