Getting the right information out of time series data requires skill and experience, and perhaps inspiration and intuition, too. This article discusses how to analyze time series data using some more sophisticated tools which are often not covered in basic statistical training programs.
By Michel Thirion, Master Black Belt at Honeywell, and Robert Collis, Minitab technical training specialist
Michel Thirion, Master Black Belt at Honeywell, and Robert Collis, Minitab technical training specialist
Time series data is plentiful in both manufacturing and service industries. Examples include:
Analysing such data seems straightforward on the surface—but is it?
Doing data analysis as if you’re following a recipe is never enough. Getting the right information out of the data certainly requires skill and experience, but to monitor any process correctly so that adequate corrective actions can be made may require inspiration and intuition, too.
Thanks to my experiences in Honeywell Master Black Belt training and in statistical consulting sessions provided by Minitab, I’ve been able to analyze time series data using some lesser-known, more sophisticated tools which are often left to one side in many basic statistical training programs. The aim of this article is to share with you how to apply some of these techniques in Minitab Statistical Software.
The visible upward trend up is confirmed by the very low p value for Trends. There is also more clustering as the red line crosses the median than we would expect if the data were random.
At first, it appears we have out-of-control points at the beginning and the end of the I-MR chart, with relative stability in between. From this, we might infer that this process has three parts or phases: lowish values at the outset, a stable middle period, and highish values towards the end.
However, standard assumptions have to be met in order to justify the use of control charts such as this one: specifically, the data must be normally and independently distributed, with mean μ and standard deviation
The control limits on the I chart are based on the mean of the moving range--the absolute difference between each consecutive pair of points. If there is no independence between the points, we have a condition called autocorrelation, in which there is little difference in each consecutive pair of points. That means the moving range will be artificially low, and this will manifest itself in an increase false alarm rate in the I chart.
Clearly, using an I-MR chart with autocorrelated data can lead to problems, so it would be helpful to know if the data are autocorrelated. Fortunately, the Assistant in Minitab Statistical Software will check this for us without even needing to use sophisticated options deep down in software.
The Assistant also informs us about the autocorrelation and its consequences:
We should have used the Assistant to create the I-MR chart in the first place. Without it, many users would make false conclusions.
Does the autocorrelation mean we can’t use this data? No—we just need to go ahead with something more sophisticated.
To understand the correlations between data at different points in time lagged by one or more periods, we can conduct an autocorrelation and a partial autocorrelation analysis. This enables us to understand if there are correlations between the data at time t and time t-1, t-2 to t-k.
After we click OK, Minitab provides the following output:
The partial autocorrelation function can be obtained similarly by selecting:
Minitab provides the following output:
Minitab creates the following chart:
The process appears to be in control and the underlying assumptions have been met—however, we have left out two-thirds of the data. Given the small number of values in the time series column, we also could have grouped the data using relatively large subgroups to finish up with relatively few points on the Xbar chart. Neither of these options is ideal.
The best solution is to use time series modelling, and in particular the ARIMA (Autoregressive integrated moving average) approach. This approach is not well known in industrial and business circles, but it can be relatively easily applied if a few basic steps are understood. However, some intuition may be required to come up with the best solution. This is where the person doing the analysis may need to use a little more art and a little less science!
ARIMA modelling very simply makes use of data from either the recent or more distant past to model the existing data as well as to make good predictions of future behaviour. The aim is to identify an underlying model that explains the change in the process. Any point that deviates from this predicted behaviour could be considered a special cause, since it is not following the general movements in the data.
To construct an ARIMA model, the data needs to be made stationary: that is, there should be no trend in the process either upwards or downwards. In other words, certain stability in the process mean needs to be achieved. Our original data was not stationary—there seemed to be an upward trend over time. We can make that data stationary using the Difference too
We fill out the dialog box is filled out as follows:
When we click OK, Minitab generates a column of the specified differences. If the data at time t is Xt and the data at the time period before t is called Xt-1, the difference is Xt -Xt-1.
Clicking OK produces the following graph:
The data, to which we applied a difference of order 1, seems to have made stationary, with no clear upward or downward trend. That means we can use this data for determining an ARIMA model.
First, we’ll perform autocorrelation and partial autocorrelation analysis on the differenced data. Minitab gives us the following results:
Now we need to understand what the patterns in the autocorrelation and partial functions mean. One text that I’ve found useful is Forecasting, Methods and Applications by Makridakis, Wheelwright, and Hyndman (John Wiley and Sons, Third Edition, 1998).
We have an autocorrelation function with a sinusoidal (sine-wave-like) pattern and spikes for lags 1 to 3, which suggests an Autoregressive Model of order 3, or AR(3). Sinusoidal behaviour on the partial autocorrelation function and spikes up to lag 3 suggests a moving average model of order 3 or MA(3). Time series modelling can be a bit of an iterative, or even hit-or-miss, process—but these graphs suggest the ARIMA (3,1,3) model is a good place to start.
Each part of the ARIMA model has a role in the predictions it makes. The autoregressive part of the model predicts the value at time t by considering previous values in the series at time t-1, t-2, etc. The moving average uses past residual values—the differences between the actual value and the predicted value based on the model at time t.
Minitab produces this output:
The p values are only significant at the 10% level for the first-order coefficient of the autoregressive part of the model and the 3rd order coefficient of the moving average part of the model. Furthermore, the Ljung-Box Chi-Square statistics, which tests the overall randomness of the model, suggests that there may be a seasonal effect of at least order 1.
Therefore, we’ll refine our attempt to understand this data by constructing an ARIMA(1,1,3)(1,0,0)12 model. This notation is not as complicated as it might look. The first set of parentheses tells us that the lags for the autoregressive (AR) and integrated (I) parts of the model will be 1, while the moving average (MA) will be based on lag 3. The second set of parentheses indicates the seasonal affect, which we’re supposing follows a 12-period, i.e., annual, cycle around AR(1).
But before hitting OK, click on Storage and select Residuals and Fits:
After pressing OK in both dialogs, Minitab outputs the following:
The first-order autoregressive coefficient, the seasonal coefficient and the third-order moving average coefficient are all significant at the 10% alpha level, indicating that this might be an efficient model. The sum of squares that measures the sum of the squared differences between each original data point and its estimated value using this ARIMA model are quite small. What’s more, the Ljung-Box Chi-Square statistics show no correlations between points with a difference of 12 or 24 lags—including the seasonal coefficient has factored these out.
Minitab produces this graph:
We can see that the fitted values (in red) closely follow the original data values over time.
Select Forecasts and then fill out the dialog box in the following way before clicking OK in each dialog box:
Minitab produces this graph of predictions:
The future predicted behaviour of this process makes sense given past data. With more data, the 95% confidence interval could be reduced even more.
It is critical for ARIMA—just like in regression or ANOVA modelling—to examine the behaviour of residual values to see whether they are normal, random and have constant variation.
The residual values here are the differences between the observed value at time t and the predicted value based on the ARIMA model. These differences can be either negative or positive, and occasionally 0 when the fit is perfect.
The assumptions are met quite well, except there is some nonconstant variation in the Versus Fits pot. This stems from the fact that the quality of the model’s fit is better for early data points than for more recent ones.
These residual values do not have autocorrelation, so it would make sense to plot them on an I-MR chart using the Assistant to ascertain which points deviate from the expected behaviour—in other words, which points do not follow the model.
Minitab produces the following output:
One data point—26—was outside the control limits. This could be explained by an unexpected and dramatic change between points 25 and 26. The ARIMA model suggests we can expect some evolution in this process, but our current understanding of the data series indicates there is just one special cause for this particular change.
Imagine that you are trying to understand the movements in the value of a certain stock and the recent trend has been downwards, sincethe company is facing fierce competition. You could model this phenomenon using ARIMA. If the company’s CEO—who is not popular on Wall Street—says he will leave his post in one year, this could cause the stock to leap in value and jump out of the expected result for that day, based on the current underlying model. This would be seen in the residuals and would be a definite special cause.
This following discussion will be of most interest to more advanced users, as some of the derivations are quite complex.
The predicted value of the response at time t depends on the past values in the series, but equally on past residual values.
First it would be useful to understand basic notation, in particular backshift notation that is commonly used in Time Series Analysis.
There are two sides to the model: the autoregressive (AR) and the moving average (MA). In addition, differencing of order one has to be included (Yt-Yt-1).
Let’s start with the autoregressive side of the equation, which depends on past values in the series. There are three distinct parts in this part of the model:
Autoregressive term of order 1: (1-φ1B)Yt=Yt-φ1Yt-1
Seasonal AR(1): (1-θB12)Yt=Yt-θYt-12
Non-seasonal difference : (1-B)Yt=Yt-Yt-1
The autoregressive term of order 1, seasonal AR(1) and non-seasonal difference are multiplied together and then worked through
(1-φ1B) (1-θB12) (1-B)Yt = (1-φ1B-θB12+φ1B-θB13)(1-B)Yt =
(1-B)Yt - φ1B(1-B)Yt -θB12(1-B)Yt +φ1θB13(1-B)Yt=Yt-Yt-1-φ1(Yt-1-Yt-2)-θ(Yt-12-Yt-13)+φ1θYt-13-Yt-14)
It is interesting to note that the model contains both recent data values Yt-1, Yt-2 and much older ones such as Yt-13 and Yt-14.
The moving average side of the equation is much easier to construct. The moving average side of the equation rests on the residuals of the previous periods with respect to time t, the time at which we wish to make a prediction using the model. This “moving average” should not be confused with its classic definition. Traditionally a moving average of order 3 would be to take the average of each set of 3 consecutive data points and track these means on a chart, but this is not at all what we are doing here.
The moving average part of the model is:
The autoregressive side of the equation and the moving average side are made equal to each other, adding the constant term on the right hand side and this is what results:
Putting Yt on the left side of the equation and all the other terms on the right, this is the outcome, expressed as the predicted value at t:
Yt= β+Yt-1 +φ1(Yt-1-Yt-2) +θ(Yt-12-Yt-13) +φ1θ(Yt-13-Yt-14)-ψ1et-1-ψ2et-2-ψ3et-3+ et
We are able to derive the predicted value at time t in the same way as in a classic regression model; that is,
if Yt= model+ et, then Yhatt = model
Therefore, Yhatt = β+Yt-1 +φ1(Yt-1-Yt-2) +θ(Yt-12-Yt-13)-φ1θ(Yt-13-Yt-14)-ψ1et-1-ψ2et-2-ψ3et-3
Yhatt = 0.00066 + Yt-1 + 0.4139(Yt-1-Yt-2) + 0.9817(Yt-12-Yt-13) – 0.4139*0.9817(Yt-13-Yt-14) – (-0.1549)*et-1 - 0.1507*et-2 - 0.8431*et-3
The predicted value at the 16th time period is expressed as follows:
Yhat16 = 0.00066 + Y15 + 0.4139(Y15 – Y14) + 0.9817(Y4 – Y3) – 0.4139*0.9817(Y3 – Y2) – (-0.1549)*e15 - 0.1507*e14 - 0.8431*e13
Let us demonstrate how this equation can be used.
The table below shows the original data for periods t=-4 to t=+11 where t=1 is the first period for which we have data. The original data for periods t=1 to 11 are found in cells B8 through to B17.
The first step of the analysis process takes place in the background in Minitab when the ARIMA model is fitted is to calculate Y0, Y-1,Y-2..etc., in a sense creating predicted data values previous to t=1.
It does not seem straightforward to calculate Y0 as this is before the study period. Back-forecasts are conducted and can be observed in the session window:
The back forecast for time 0 is 1.623. So Y1-Y0=1.623 and therefore Y0=59.7-1.6=58.1, which corresponds to cell B6 as required. The same process can be applied for Y-1,Y-2, etc., going upwards in column B of the spreadsheet above. Y0-Y-1=-2.121 Y-1=Y0+2.121=58.1+2.1=60.2, which corresponds to cell B5.
The final equation for the tenth period is:
Yhatt = 0.00066 + Yt-1 + 0.4139(Yt-1 – Yt-2) + 0.9817(Yt-12 – Yt-13) – 0.4139*0.9817(Yt-13 – Yt-14) – (-0.1549)*et-1 - 0.1507*et-2 - 0.8431*et-3
Yhat10 = 0.00066 + Y9 + 0.4139(Y9 – Y8) + 0.9817(Y-2 – Y-3) – 0.4139*0.9817(Y-3 – Y-4) – (-0.1549)*e9 - 0.1507*e8 - 0.8431*e7
Y9 is cell B15 in the table above = 63.6
Y8 is cell B14 = 61.9
Y-2 is cell B4 = 60.444
Y-3 is cell B3 = 60.535
Y-4 is cell B2 = 58.809
e9 is cell D15 = -0.000376466
e8 is cell D14 = 0.012395082
e7 is cell D13 = 0.039578177
This is the same as E16.
The same can be applied to find the predicted values of any time period based on this model.
Most practioners analyse time series or process data in a relatively simplistic way, and rely mainly on run charts or simple Shewhart control charts such as I-MR or Xbar-R or Xbar-S charts.
But any autocorrelation in the data can boost the false alarm rate. It can be appropriate to try to model the data using a sophisticated time series modelling technique such as ARIMA. Used correctly, ARIMA can provide a very good fit to existing data and offer good predictions of future behaviour, which is important in an uncertain world. ARIMA techniques, however, are quite complex and are not as well known or understood as more basic analyses. However, once the basic tenets are understood, successful time series models in can be constructed relatively easily using Minitab. Furthermore, once the ARIMA model is built, it is appropriate to evaluate the residual values to see whether there are any special causes.
1 Montgomery, Douglas (2005). Introduction to Statistical Quality Control. John Wiley and Sons, 5th Edition. Page 438.
Get our free monthly e-newsletter for the latest Minitab news, tutorials, case studies, statistics tips and other helpful information.
Data is the new gold: 5 ways to make sure your data is reliable
Advancing the Power of Analytics
A Statistical Analysis of Boston’s 2015 Record Snowfall