Your Business, Your Model: A Predictive Analytics Model How-To Guide
10/07/2019 by Chris St. Jeor Modernization - Analytics
There is a way to predict the future with great accuracy: predictive analytics. Unlike fortune cookies or daily horoscopes, predictive analytics leverages rich historical data to uncover hidden patterns, patterns that a person cannot see by simply looking at a graph or pivot table.
A common problem we are helping our customers solve is twofold: define what predictive analytics means to their company, and leverage predictive analytics in their daily decision-making processes.
Predictive analytics may not be able to tell you how the location of Saturn, or the frenetic movements of Mars, are going to impact your day. Predictive analytics can, however, use historical data to make accurate predictions about the future.
In its simplest form, predictive analytics, or advanced data analytics, is the process of coupling historical data with statistical algorithms to make predictions about future events. Predictive analytics is heavily used in a variety of ways by our largest corporations and industries.
Some predictive analytics examples include:
Depending on where your company lies on the analytics maturity spectrum, there is a wide variety of statistical models your business can use to gain valuable insights into your customers and industry.
The remainder of this blog will focus on a particular breed of predictive analytics known as time series forecasting. Time series forecasting is a low-cost, easy-to-build solution that can provide powerful insights across a variety of industries.
Let’s walk through the three fundamental steps of building a quality time series model: making the data collected stationary, selecting the right model, and evaluating model accuracy. The examples in this post use historical page views data for a major automotive marketing company.
Time series forecasting involves the use of data that are indexed by equally spaced intervals of time (minutes, hours, days, etc.). The discrete nature of time series data leads to many time series data sets having a seasonal and/or trend element built into the data.
The first step in time series modeling is to account for these existing seasons (a recurring pattern over a fixed period of time) and/or trends (upward or downward movement in the data).
Capturing and accounting for these embedded patterns is what we call making the data stationary. Examples of trending and seasonal data can be seen in figures 1 and 2 below:
As previously mentioned, the first step in time series forecasting is to remove the effects of the trend or season that exists within the data to make it stationary. We keep throwing around the term stationarity, but what exactly does it mean?
A stationary series is one where the mean of the series is no longer a function of time. With trending data, the mean of the series increases or decreases as time passes (for example, population growth over time). For seasonal data, the mean of the series fluctuates in accordance with the season (think of the increase and decrease in temperature every 24 hours).
There are two methods that can be applied to achieve stationarity: differencing the data or using linear regression. To take a difference, you subtract each observation from the one that follows it, so the series of levels becomes a series of period-to-period changes.
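As a minimal sketch of differencing, assuming made-up numbers rather than the page views data, the helper below (the name `difference` is ours) subtracts from each value the value `lag` periods earlier. A lag of 1 removes a steady trend; a lag equal to the season length removes a repeating pattern.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Hypothetical trending series: it grows by 5 each period.
trend = [100, 105, 110, 115, 120, 125]
print(difference(trend))            # [5, 5, 5, 5, 5]: the trend is gone

# Hypothetical series with a repeating 3-period pattern.
seasonal = [10, 20, 30, 10, 20, 30, 10, 20, 30]
print(difference(seasonal, lag=3))  # [0, 0, 0, 0, 0, 0]: the season is gone
```

Note that each difference costs you `lag` observations at the start of the series.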
To use linear regression analysis, you include binary indicator variables for your seasonal component in the model. Before we decide which of these methods to apply, let’s explore our data. We plotted the historical daily page views using predictive analytics software SAS Visual Analytics.
The initial pattern seems to repeat itself every seven days indicating a weekly season. The prolonged increase in the number of page views over time indicates that there is a slight upward trend. With a general idea of the data, we then applied a statistical test of stationarity – the Augmented Dickey-Fuller (ADF) test. The ADF test is a unit-root test of stationarity.
We won’t get into the details here, but the presence of a unit root indicates that the series is nonstationary. You use this test to determine the appropriate method to handle the trend or season (differencing or regression).
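For readers who want a feel for the mechanics, here is a rough sketch of the simple (non-augmented) Dickey-Fuller regression on simulated data. It is not the ADF procedure used for the page views data: the augmented version adds lagged differences to the regression, and statistical packages compute proper p-values. The function name, the simulated series, and the use of -2.86 as an approximate 5% critical value for the constant-only case are our assumptions for illustration.

```python
import math
import random

def dickey_fuller_stat(y):
    """t-statistic on rho in: (y[t] - y[t-1]) = c + rho * y[t-1] + error."""
    x = y[:-1]                                        # lagged levels y[t-1]
    d = [y[t] - y[t - 1] for t in range(1, len(y))]   # first differences
    n = len(d)
    x_mean = sum(x) / n
    d_mean = sum(d) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxd = sum((xi - x_mean) * (di - d_mean) for xi, di in zip(x, d))
    rho = sxd / sxx                                   # OLS slope
    c = d_mean - rho * x_mean                         # OLS intercept
    sse = sum((di - c - rho * xi) ** 2 for xi, di in zip(x, d))
    se_rho = math.sqrt(sse / (n - 2) / sxx)           # standard error of slope
    return rho / se_rho

random.seed(0)
noise = [random.gauss(0, 1) for _ in range(300)]

stationary = noise          # white noise: no unit root
random_walk = []            # running sum of the noise: has a unit root
level = 0.0
for e in noise:
    level += e
    random_walk.append(level)

CRITICAL_5PCT = -2.86       # assumed approximate critical value (constant, no trend)
for name, series in [("white noise", stationary), ("random walk", random_walk)]:
    stat = dickey_fuller_stat(series)
    verdict = "reject unit root" if stat < CRITICAL_5PCT else "cannot reject unit root"
    print(f"{name}: DF statistic = {stat:.2f} -> {verdict}")
```

A strongly negative statistic is evidence against a unit root, which is why the white noise series should come out far more negative than the random walk.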
Based on the ADF test for the data above, we removed the seven-day season by regressing on dummy variables for day of the week and removed the trend by differencing the data. The resulting stationary data can be seen in the figure below.
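The two-step idea can be sketched on synthetic data. Everything here is an assumption for illustration: the numbers are invented, the helper names are ours, and we difference first and then fit the day-of-week indicators, an ordering the analysis above does not spell out. When the only regressors are a full set of day-of-week indicators, each least-squares coefficient is simply the mean of the observations on that day, so no matrix algebra is needed.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

def day_of_week_effects(values, days):
    """OLS on a full set of day indicators (no intercept).

    With only indicator regressors, each coefficient equals the mean
    of the values observed on that day.
    """
    sums, counts = {}, {}
    for v, d in zip(values, days):
        sums[d] = sums.get(d, 0.0) + v
        counts[d] = counts.get(d, 0) + 1
    return {d: sums[d] / counts[d] for d in sums}

# Hypothetical data: 4 weeks of daily page views with an upward trend of
# 2 views per day plus a fixed weekly pattern (index 0 = Monday).
weekly_pattern = [30, 20, 10, 0, -10, -20, -30]
views = [500 + 2 * t + weekly_pattern[t % 7] for t in range(28)]
days = [t % 7 for t in range(28)]

diffed = difference(views)                 # step 1: remove the trend
diffed_days = days[1:]                     # day label of each difference
effects = day_of_week_effects(diffed, diffed_days)   # step 2: fit the season
stationary = [v - effects[d] for v, d in zip(diffed, diffed_days)]

print(stationary)
```

Because this toy series contains no noise, the result is all zeros. Real data would leave behind random variation and, often, autocorrelation.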
Now that the data is stationary, the second step in time series forecasting is to establish a base level forecast. We should also note that most base level forecasts do not require the first step of making your data stationary. This is only required for more advanced models such as ARIMA modeling, which we will discuss momentarily.
There are several types of time series models. To build a model that can accurately forecast future page views (or whatever you are interested in forecasting), it is necessary to decide on the type of model that is appropriate for your data.
The simplest option is to assume that future values of y (the variable you are interested in forecasting) are equal to the most current value of y. This is considered the most basic, or “naïve model,” where the most recent observation is the most likely outcome for tomorrow.
The second model is the average model. In this model, all observations in the data set are given equal weight. Future forecasts of y are calculated as the average of the observed data.
The forecast generated could be quite accurate if the data is level but would provide a very poor forecast if the data is trending or has a seasonal component. The forecasted values for the page views data using the average model can be seen below.
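Separately from the page views data, the naïve and average models can be sketched on a few hypothetical numbers. The function names are ours.

```python
def naive_forecast(history, horizon):
    """Every future value equals the most recent observation."""
    return [history[-1]] * horizon

def average_forecast(history, horizon):
    """Every future value equals the mean of all observations."""
    mean = sum(history) / len(history)
    return [mean] * horizon

# Hypothetical daily page views with an upward drift.
history = [100, 104, 108, 112, 116]
print(naive_forecast(history, 3))    # [116, 116, 116]
print(average_forecast(history, 3))  # [108.0, 108.0, 108.0]
```

On this trending series the average model forecasts 108, well below the latest value of 116, which illustrates why it performs poorly when the data is not level.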
If the data has either a seasonal or trend element, then a better option for a base level model is to implement an exponential smoothing model (ESM). ESMs strike a happy medium between the naïve and average models mentioned above. The most recent observation is given the greatest weight, and the weights of all previous observations decrease exponentially into the past. ESMs also allow for a seasonal and/or trending component to be incorporated into the model. The following table provides an example of an initial weight of 0.7 decreasing exponentially at a rate of 0.3.
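One reading of those numbers, assuming the smoothing parameter is 0.7 so that each older weight is 0.3 times the next newer one, can be computed as follows. The function names and the choice to start the level at the first observation are our assumptions.

```python
def smoothing_weights(alpha, n):
    """Weight on the observation k periods back: alpha * (1 - alpha) ** k."""
    return [alpha * (1 - alpha) ** k for k in range(n)]

def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after the last observation.

    The level starts at the first observation (one common choice).
    """
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

weights = smoothing_weights(alpha=0.7, n=5)
print([round(w, 4) for w in weights])   # [0.7, 0.21, 0.063, 0.0189, 0.0057]

# The one-step-ahead forecast is the final smoothed level.
print(simple_exponential_smoothing([10, 20], alpha=0.7))
```

With a weight of 0.7 on the newest point, the three most recent observations already account for more than 97% of the forecast.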
There are various types of ESMs that can be implemented in time series forecasting. The ideal model to use will depend on the type of data you have. The table below provides a quick guide as to what type of ESM to use depending on the combination of trend and season in the data.
Because of the strong seven-day season and upward trend in the data, we selected an additive Winters ESM as the new base level model. The forecast generated does a decent job of continuing the slight upward trend and captures the seven-day season. However, there is still more pattern in the data that can be modeled.
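To make the additive Winters idea concrete, here is a compact sketch that tracks a level, a trend, and one seasonal effect per day of the week. It is not the SAS implementation: the smoothing parameters are fixed by hand rather than estimated, the initialization is one simple choice among several, and the data is synthetic.

```python
def holt_winters_additive(series, period, alpha, beta, gamma, horizon):
    """Additive Holt-Winters forecast: level + trend + seasonal effect."""
    first, second = series[:period], series[period:2 * period]
    # Initial trend: average per-period change between the first two seasons.
    trend = (sum(second) - sum(first)) / period ** 2
    # Initial level: first-season mean, moved forward to that season's last point.
    mid_level = sum(first) / period
    level = mid_level + trend * (period - 1) / 2
    # Initial seasonal effects: first season minus its fitted trend line.
    season = [first[j] - (mid_level + trend * (j - (period - 1) / 2))
              for j in range(period)]

    for t in range(period, len(series)):
        y, s = series[t], season[t % period]
        prev_level = level
        level = alpha * (y - s) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        season[t % period] = gamma * (y - level) + (1 - gamma) * s

    n = len(series)
    return [level + (h + 1) * trend + season[(n + h) % period]
            for h in range(horizon)]

# Hypothetical data: 4 weeks with a trend of 2 per day and a weekly pattern.
pattern = [30, 20, 10, 0, -10, -20, -30]
views = [500 + 2 * t + pattern[t % 7] for t in range(28)]

forecast = holt_winters_additive(views, period=7, alpha=0.3, beta=0.1,
                                 gamma=0.2, horizon=7)
expected = [500 + 2 * t + pattern[t % 7] for t in range(28, 35)]
print([round(f, 2) for f in forecast])
```

On this noise-free series the forecast reproduces the next week of the pattern; on real data the three smoothing parameters control how quickly each component adapts.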
After identifying the model that best accounts for the trend and season in the data, you ultimately have enough information to generate a decent forecast, as we see in Figure 2 above. However, these models are still limited in that they do not account for the correlation that the variable of interest has with itself over previous periods of time.
We refer to this correlation as autocorrelation, which is commonly found in time series data. If the data has autocorrelation, as ours does, then there may be additional modeling that can be done to further improve upon the baseline forecast.
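A minimal sketch of how autocorrelation at a given lag can be measured, using a made-up series and a function name of our choosing:

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series with itself `lag` periods back."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((y - mean) ** 2 for y in series)
    covariance = sum((series[t] - mean) * (series[t - lag] - mean)
                     for t in range(lag, n))
    return covariance / variance

# Hypothetical series that alternates between high and low values.
alternating = [1, -1, 1, -1, 1, -1, 1, -1]
print(autocorrelation(alternating, 1))  # -0.875
print(autocorrelation(alternating, 2))  # 0.75
```

Values near zero at every lag suggest there is little left to model; large values, as here, mean past observations still carry information about future ones.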
To capture the effects of autocorrelation in a time series model, it is necessary to implement an Autoregressive Integrated Moving Average (ARIMA) model. ARIMA models include parameters to account for season and trend (like using dummy variables for days of the week and differencing).
ARIMA models also allow for the inclusion of autoregressive and/or moving average terms to deal with the autocorrelation embedded in the data. By using the appropriate ARIMA model, we can further increase the accuracy of the page views forecast as seen in Figure 3 below.
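As a heavily simplified illustration, the sketch below fits a single autoregressive term to first differences, which corresponds to an ARIMA(1,1,0) model with no constant. It is not the seasonal ARIMA model used for the page views forecast: there are no moving average or seasonal terms, the coefficient is estimated by plain least squares rather than the likelihood-based methods statistical packages typically use, and the data is invented.

```python
def fit_ar1(series):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] + error."""
    num = sum(series[t] * series[t - 1] for t in range(1, len(series)))
    den = sum(series[t - 1] ** 2 for t in range(1, len(series)))
    return num / den

def arima_110_forecast(y, horizon):
    """Forecast y by fitting an AR(1) term to its first differences."""
    diffs = [y[t] - y[t - 1] for t in range(1, len(y))]
    phi = fit_ar1(diffs)
    forecasts, last_value, last_diff = [], y[-1], diffs[-1]
    for _ in range(horizon):
        last_diff = phi * last_diff           # predicted next change
        last_value = last_value + last_diff   # undo the differencing
        forecasts.append(last_value)
    return phi, forecasts

# Hypothetical series whose period-to-period changes halve each step.
y = [0, 16, 24, 28, 30, 31]
phi, forecasts = arima_110_forecast(y, horizon=2)
print(phi)        # 0.5
print(forecasts)  # [31.5, 31.75]
```

The "integrated" part is the differencing step, and the autoregressive part is the dependence of each change on the previous one.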
While you can see the improved accuracy of each of the models presented, visually identifying which model has the best accuracy is not always reliable.
Calculating the Mean Absolute Percent Error (MAPE) is a quick and easy way to compare the overall forecast accuracy of a proposed model. The lower the MAPE, the better the forecast accuracy.
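MAPE is the average of the absolute errors, each expressed as a percentage of the actual value. A short sketch with hypothetical actuals and two hypothetical sets of forecasts:

```python
def mape(actual, forecast):
    """Mean absolute percent error, in percent. Actual values must be nonzero."""
    errors = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100 * sum(errors) / len(errors)

actual = [100, 200, 400]
model_a = [110, 180, 400]   # misses by 10%, 10%, 0%
model_b = [100, 150, 300]   # misses by 0%, 25%, 25%

print(round(mape(actual, model_a), 2))  # 6.67
print(round(mape(actual, model_b), 2))  # 16.67
```

Because each error is divided by the actual value, MAPE is undefined when an actual value is zero and can be inflated by values close to zero.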
Comparing the MAPE of each of the models previously discussed, it is easy to see that the seasonal ARIMA model provides the best forecast accuracy. Note that there are several other types of comparison statistics that can be used for model comparison.
Once you have asked the right question, the trick to building a powerful time series forecasting model is to account for as much of the structure in the data (trend, season, and autocorrelation) as possible, so that the only remaining movement unaccounted for is pure randomness.
For our data, we found that a seasonal ARIMA model with regression variables for the day of the week provided the most accurate forecast. The ARIMA model forecast was more accurate when compared to the naïve, average, and ESM models.
While no time series model will be able to help you read the stars, there are many types of forecasting methods at your disposal to help predict anything from page views to energy sales. The key to accurately predicting your variable of interest is first to understand your data and second to apply the model that best meets the needs of your data.
Time series forecasting may not be the solution for your company. Time series, though easy to implement, is typically the desired solution for companies that are further along the analytics maturity model. Quality data structure and integrity are key ingredients for a reliable time series solution. However, if you are already using reporting, time series forecasting can be a good cross-sell to management.
For information about how your company can better position itself to take advantage of time series forecasting and other advanced analytics solutions, make sure to watch our Analytics Maturity Model webinar. You’ll learn how to move your company forward along the analytics maturity curve.