What does this mean?
We know that the observed residuals are slightly correlated by construction (a consequence of least-squares fitting), but we still expect them to look random (no systematic patterns) if the regression model is adequate.
Sketch some examples.
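Here is a small simulated sketch of my own (not textbook data): independent errors versus positively autocorrelated errors in an index plot. The autocorrelated series shows long runs of same-sign values.
## simulated sketch: independent errors vs. first-order autocorrelated errors
library(ggformula)   ## gf_point(), gf_line(), gf_labs()
library(gridExtra)   ## grid.arrange()
set.seed(1)
n.sim   <- 40
e.indep <- rnorm(n.sim)                                   ## independent errors
e.ar1   <- as.numeric(arima.sim(list(ar = 0.8), n.sim))   ## AR(1) errors, rho = 0.8
p.indep <- gf_point(e.indep ~ 1:n.sim, size = 2) %>% gf_line() %>%
  gf_labs(x = "index", y = "error", title = "independent errors")
p.ar1 <- gf_point(e.ar1 ~ 1:n.sim, size = 2) %>% gf_line() %>%
  gf_labs(x = "index", y = "error", title = "positively autocorrelated errors")
grid.arrange(p.indep, p.ar1, ncol = 2)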
Autocorrelation most often occurs when
experiments are run on “batches” of animals, cells, etc., at specific times
Autocorrelation occurs since information/phenomena change slowly through time/space,
so factors influencing a variable are likely to be similar in adjacent periods/places.
This is usually a rectifiable problem
(like we solved non-linearity and non-constant spread via transformations)
For example, if experiments are run in batches, include the batch number as a predictor.
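A minimal sketch of that idea, using hypothetical placeholder names (`y`, `x`, `batch`, and `mydata` are not variables from the textbook data):
## sketch: include the batch identifier as a categorical predictor
## y, x, batch, and mydata are hypothetical placeholder names
fit.batch <- lm(y ~ x + factor(batch), data = mydata)
tidy(fit.batch)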
Autocorrelation can also be negative:
Tall trees are next to short trees, which are next to tall trees, which are next to short trees,…
Autocorrelation can affect OLS analysis in several ways:
Underestimation of variability, which leads to more false positive decisions:
rejecting the null hypothesis \(\beta = 0\) when one should not
Whether the correlation is positive or negative,
confidence intervals for \(\beta\)s may no longer be valid.
Plots: index plots of the residuals in time order
Statistical tests: the runs test and the Durbin-Watson test
Textbook Table 8.1 shows the data
Quarterly data from 1952 to 1956 on
\(y\) = U.S. consumer expenditure
\(x\) = U.S. money stock
Units: billions of current dollars
We regress the expenditure on money stock and plot vs. time
(these data are naturally collected in time order)
library(dplyr)   ## arrange(), mutate(), lag()
library(broom)   ## tidy(), glance()
## (ggformula, mosaic, and gridExtra are assumed already loaded for the gf_* plots;
##  car, lmtest, tseries, and orcutt are loaded later as needed)
# expendData <- read.delim("http://statistics.uchicago.edu/~collins/data/RABE5/P211.txt")
expendData <- read.delim("./data/P211.txt")
# order data by time (year and quarter)
expendData <- arrange(expendData, Year, Quarter)
head(expendData, 6)
Year | Quarter | Expenditure | Stock |
---|---|---|---|
1952 | 1 | 214.6 | 159.3 |
1952 | 2 | 217.7 | 161.2 |
1952 | 3 | 219.6 | 162.8 |
1952 | 4 | 227.2 | 164.6 |
1953 | 1 | 230.9 | 165.9 |
1953 | 2 | 233.3 | 167.9 |
## fit linear regression: expenditures on money stock
lmfit <- lm(Expenditure ~ Stock, data = expendData)
tidy(lmfit)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -154.7 | 19.8500 | -7.794 | 0 |
Stock | 2.3 | 0.1146 | 20.080 | 0 |
glance(lmfit)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.9573 | 0.9549 | 3.983 | 403.2 | 0 | 2 | 18 |
For time series data (data collected where time order is relevant),
a useful plot for analysis is the index plot
If errors are correlated,
residuals of the same sign (\(+\) or \(-\)) will occur in clusters or bunches called runs.
## add residuals, standardized residuals, and an index variable to dataset
n <- nrow(expendData) ## sample size
expendData <- mutate(expendData, e = residuals(lmfit), r = rstandard(lmfit), index = 1:n)
## index plot
gf_point(r ~ index, data = expendData, size = 2) %>%
gf_line() %>%
gf_labs(x = "Quarter", y = "standardized residuals") %>%
gf_lims(y = c(min(min(~r, data = expendData), -2.1), max(max(~r, data = expendData), 2.1))) %>%
gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
We connected the points from left to right to make it easier to see autocorrelation.
Here, the sequential residuals seem to follow an ordered pattern
(except for a few points) with respect to magnitude and sign
Number of runs:
The visual impression can be formally confirmed by counting
the number of sequences, batches, clusters, (runs) of either positive or negative residuals
Statistic = number of runs
Sequence Plot:
One-dimensional plot of the signs of the residuals, taken in the order of the observations.
For the expenditure vs. money stock example, the sequence plot is
expendData <- mutate(expendData, positive = 1 * (r > 0), sign = factor(positive, labels = c("-", "+")))
\(++-++++-------++++++\)
If you are curious, here is what I have stored in the dataset up to this point:
head(expendData, 6)
Year | Quarter | Expenditure | Stock | e | r | index | positive | sign |
---|---|---|---|---|---|---|---|---|
1952 | 1 | 214.6 | 159.3 | 2.8701 | 0.8095 | 1 | 1 | + |
1952 | 2 | 217.7 | 161.2 | 1.5994 | 0.4401 | 2 | 1 | + |
1952 | 3 | 219.6 | 162.8 | -0.1812 | -0.0490 | 3 | 0 | - |
1952 | 4 | 227.2 | 164.6 | 3.2782 | 0.8724 | 4 | 1 | + |
1953 | 1 | 230.9 | 165.9 | 3.9877 | 1.0513 | 5 | 1 | + |
1953 | 2 | 233.3 | 167.9 | 1.7869 | 0.4658 | 6 | 1 | + |
Now, back to looking at the sequence plot:
\(++-++++-------++++++\)
## rle = run length encoding (in base R)
rle(as.vector(expendData$positive)) ## runs and their lengths
Run Length Encoding
lengths: int [1:5] 2 1 4 7 6
values : num [1:5] 1 0 1 0 1
nruns <- length(rle(as.vector(expendData$sign))$lengths) ## number of runs
n <- nrow(expendData) ## sample size
n1 <- sum(expendData$positive) ## number of positive residuals
n2 <- n - n1 ## number of negative residuals
The sequence plot indicates that there are \(5\) runs.
Under the null hypothesis that the arrangement of signs is random (no autocorrelation), with \(n_1\) positive and \(n_2\) negative residuals, the number of runs has mean
\(\displaystyle \mu=\frac{2n_1n_2}{n_1+n_2}+1\)
and variance
\(\displaystyle \sigma^2=\frac{2n_1n_2(2n_1n_2-n_1-n_2)}{(n_1+n_2)^2(n_1+n_2-1)}\)
mu <- 1 + 2*n1*n2 / (n1 + n2)
sigmasqr <- 2*n1*n2 * (2*n1*n2 - n1 - n2) / ( (n1 + n2)^2 * (n1 + n2 -1) )
In our example, \(n_1=12\) and \(n_2=8\), so
the expected number of runs is \(\mu=10.6\) and
the standard deviation is \(\sigma = 2.085\)
We observed only \(5\) runs, which is about 2.7 SDs below \(\mu\).
Is this unusual under the null hypothesis of no autocorrelation?
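A quick sketch of that calculation, using the quantities already computed above (`nruns`, `mu`, `sigmasqr`):
## how many SDs below the expected number of runs?
z.runs <- (nruns - mu) / sqrt(sigmasqr)
z.runs   ## about -2.7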
A well-known nonparametric test is called a “runs test”.
In R, load the tseries (time series) package, then use the runs.test function.
The runs.test function expects a factor with two (and only two) levels.
For example,
library(tseries)
runs.test( as.factor( rstandard(lmfit) > 0) )
library(tseries, warn.conflicts = FALSE)
data.frame(sample.size=n, number.of.runs=nruns, positive=n1, negative=n2)
sample.size | number.of.runs | positive | negative |
---|---|---|---|
20 | 5 | 12 | 8 |
runs.fit <- runs.test(expendData$sign)
runs.test.statistic <- runs.fit$statistic
runs.test.pvalue <- runs.fit$p.value
When I ran this code, I got a p-value of \(0.0072\).
This indicates we should reject the null hypothesis of no autocorrelation.
We kind of expected this given the appearance of the index plot.
A common approach to detecting serial correlation is known as the Durbin-Watson test.
The test is based on a model in which successive errors may be correlated.
The model is written this way:
\[ Y_t = \beta_0 + \beta_1 x_{1t} + \cdots + \beta_p x_{pt} + \epsilon_t\]
where \(t\) is the time index and \(\epsilon_t\) is the \(t^{th}\) error in time.
If the errors are correlated, perhaps each error (\(\epsilon_t\)) depends linearly on the previous error (\(\epsilon_{t-1}\)).
If this model is reasonable, then \[
\epsilon_t=\gamma_0 + \gamma_1\epsilon_{t-1}+\omega_t = \rho\epsilon_{t-1}+\omega_t, \qquad |\rho|<1,
\] where \(\rho\) is the correlation coefficient between successive errors
and \(\omega_t\) are independent, normally distributed, with constant variance across the errors (\(\epsilon\)s).
Why is the intercept (\(\gamma_0\)) equal to zero?
Why is the slope (\(\gamma_1\)) equal to \(\rho\)?
Thus, if \(\epsilon_t\) depends linearly on \(\epsilon_{t-1}\) but not on earlier errors, a reasonable model is
\[ \epsilon_t=\rho\epsilon_{t-1}+\omega_t, \qquad |\rho|<1, \]
The relationship is likely more complex than this, but it’s a nice, simple starting point.
We call this a “first-order autoregressive model”
The DW statistic (often labeled \(d\)) is defined as: \[ d\;=\; \frac{\sum_{t=2}^n (e_t-e_{t-1})^2}{\sum_{t=1}^n e_t^2} \;=\; \frac{\sum_{t=2}^n (e_t-e_{t-1})^2}{\mathrm{SSE}}, \] where \(e_t\) is the \(t^{th}\) observed residual with the data arranged in time order.
NOTE: Yes, the denominator of \(d\) sums from \(t=1\), but the numerator cannot.
The numerator has \((n-1)\) terms: \(\;e_t\; (e_2, e_3, \ldots, e_n)\;\) and \(\;e_{t-1}\; (e_1, e_2, \ldots, e_{n-1})\)
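As a sketch, \(d\) can be computed directly from the time-ordered OLS residuals; the value should match the Durbin-Watson output shown later.
## compute the DW statistic "by hand" from the time-ordered residuals
e <- residuals(lmfit)
d <- sum(diff(e)^2) / sum(e^2)
d   ## about 0.33 for the expenditure regression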
The DW statistic is used to test \(H_0:\rho=0\) vs \(H_1:\rho >0\).
When \(\rho=0\) cannot be rejected, we conclude that the errors are uncorrelated.
NOTE: \(H_1: \rho < 0\) is not common to test or observe,
but we can by using the statistic (\(4-d\)) instead of \(d\) and proceeding as below.
We can show that the \(d\) statistic is approximately a simple function of \(\widehat{\rho}\):
\[d \;\approx\; 2(1-\widehat{\rho}), \qquad\textrm{ where }\qquad \widehat{\rho} \;=\; \frac{\sum_{t=2}^n e_te_{t-1}}{\sum_{t=1}^n e_t^2} \;=\; \frac{\sum_{t=2}^n e_te_{t-1}}{\mathrm{SSE}}. \]
NOTE: Yes, the denominator of \(\widehat{\rho}\) sums from \(t=1\), but the numerator cannot.
The numerator has \((n-1)\) terms: \(\;e_t\; (e_2, e_3, \ldots, e_n)\;\) and \(\;e_{t-1}\; (e_1, e_2, \ldots, e_{n-1})\)
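A companion sketch of \(\widehat{\rho}\) and the approximation: here \(2(1-\widehat{\rho})\) and \(d\) differ somewhat because the end terms \(e_1^2\) and \(e_n^2\) are missing from the numerator of \(d\), which is noticeable in a short series.
## estimate rho from lagged residual products and compare 2(1 - rhohat) with d
e <- residuals(lmfit)
n <- length(e)
rhohat <- sum(e[-1] * e[-n]) / sum(e^2)
c(rhohat = rhohat, approx.d = 2 * (1 - rhohat), d = sum(diff(e)^2) / sum(e^2))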
When \(\rho = 0\) (no autocorrelation), \(d \approx 2\).
When \(\rho = 1\) (perfect positive correlation), \(d \approx 0\).
When \(\rho = -1\) (perfect negative correlation), \(d \approx 4\).
The range of \(d\) is approximately \(0 \le d \le 4\).
So, when \(d\) is close to 2, there is not much autocorrelation.
When \(d\) is small, it suggests evidence of positive autocorrelation (\(\rho > 0\)).
When \((4-d)\) is small, it suggests evidence of negative autocorrelation (\(\rho < 0\)).
If your alternative hypothesis is \(H_1: \rho < 0\), use \((4-d)\) as your statistic.
Usually, the one-sided hypothesis is proposed prior to looking at the data (based on context).
How close is considered “close to 2”?
For testing positive autocorrelation (where a \(d\) well below 2 leads us to reject the null), use the following process, where \(d_L\) and \(d_U\) are tabled lower and upper critical values:
If \(d<d_L\), we reject the null hypothesis that there is no positive autocorrelation.
If \(d>d_U\), we fail to reject the null hypothesis. There is not much evidence of positive autocorrelation.
If \(d_L < d < d_U\), the test is inconclusive.
As with many classical tests, the critical values \(d_L\) and \(d_U\) are tabled for reference.
See the textbook Appendix: Tables A.6 and A.7.
In R, there are a couple choices.
lmfit <- lm(y ~ x, data = mydata)
library(car)
durbinWatsonTest(lmfit)
library(lmtest)
dwtest(lmfit)
library(car, warn.conflicts = FALSE)
durbinWatsonTest(lmfit, alternative = "positive")
lag Autocorrelation D-W Statistic p-value
1 0.7506 0.3282 0
Alternative hypothesis: rho > 0
library(lmtest, warn.conflicts = FALSE)
dwtest(lmfit, alternative = "greater")
Durbin-Watson test
data: lmfit
DW = 0.33, p-value = 0.00000002
alternative hypothesis: true autocorrelation is greater than 0
Not surprisingly (given our earlier look at the residuals by time and the runs test),
the DW statistic also indicates strong autocorrelation.
When there is autocorrelation present in the errors,
one way to remove it is to transform the variables.
The Cochrane-Orcutt transformation (textbook Section 8.4) is one method.
To see how it works, we write the errors for two adjacent observations:
\[\begin{align} \epsilon_t & \;=\; y_t-\beta_0-\beta_1 x_t \\ \epsilon_{t-1} & \;=\; y_{t-1}-\beta_0-\beta_1 x_{t-1} \end{align}\]
When errors are first-order autocorrelated with correlation coefficient \(\rho\), we have
\(\epsilon_t=\rho\epsilon_{t-1}+\omega_t\)
where the errors \(\omega_t\) are independent, normally distributed, and have constant variance.
Plugging in the above expressions for two adjacent residuals gives
\[\begin{align} \epsilon_t & \;=\; \rho\epsilon_{t-1}+\omega_t\\ (y_t-\beta_0-\beta_1 x_t) & \;=\; \rho \left(y_{t-1}-\beta_0-\beta_1 x_{t-1} \right)+\omega_t. \end{align}\]
Rearranging the above equation we have \[ (y_t-\rho y_{t-1}) \;=\; \beta_0(1-\rho) + \beta_1( x_t-\rho x_{t-1}) + \omega_t. \]
This is equivalent to the linear model \[ y_t^\star=\beta_0^\star+\beta_1^\star x_t^\star +\omega_t \]
with
\(y_t^\star \;=\; (y_t-\rho y_{t-1})\),
\(x_t^\star \;=\; (x_t-\rho x_{t-1})\),
\(\beta_0^\star \;=\; \beta_0(1-\rho)\), and
\(\beta_1^\star \;=\; \beta_1\)
This linear model (whose error terms \(\omega_t\) are uncorrelated by definition) is then fit to the data.
Step 1: Compute \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) by OLS as usual: \(Y_t = \beta_0 + \beta_1 x_t + \epsilon_t\)
Step 2: Calculate the residuals from Step 1 and estimate \(\rho\): \(\displaystyle \widehat{\rho} \;=\; \frac{\sum_{t=2}^n e_te_{t-1}}{\sum_{t=1}^n e_t^2}\)
Step 3: Fit OLS with response \((y_t - \widehat{\rho} y_{t-1})\) and predictor \((x_t-\widehat{\rho} x_{t-1})\), then back-transform:
\(\widehat{\beta}_{0} = \displaystyle{\frac{\widehat{\beta}_{0}^*}{1-\widehat{\rho}}}\) and \(\widehat{\beta}_{1} = \widehat{\beta}_{1}^*\).
\(\mathrm{se}(\widehat{\beta}_{1})\) is simply the OLS estimate \(\mathrm{se}(\widehat{\beta}_{1}^*)\), and
\(\mathrm{se}(\widehat{\beta}_{0})\) can be estimated from \(\widehat{\rho}\) and the OLS estimate \(\mathrm{se}(\widehat{\beta}_{0}^*)\):
\(\displaystyle \mathrm{se}(\widehat{\beta}_{0}) \approx \frac{\mathrm{se}(\widehat{\beta}_{0}^*)}{1-\widehat{\rho}}\)
Step 4: Check the residuals from Step 3 for autocorrelation; if evidence remains, return to Step 2.
Repeat until the residuals show no evidence of autocorrelation.
(A compact R sketch of one iteration appears right after this list.)
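Here is a minimal sketch of one Cochrane-Orcutt iteration, written as a helper function of my own (not from the textbook or the orcutt package, and assuming a simple regression with a single predictor). The step-by-step application to the expenditure data follows.
## one Cochrane-Orcutt iteration for a simple regression y ~ x (sketch)
## y, x: response and predictor in time order; b0, b1: current estimates on the original scale
co_step <- function(y, x, b0, b1) {
  e      <- y - (b0 + b1 * x)                   ## residuals on the original scale
  n      <- length(e)
  rhohat <- sum(e[-1] * e[-n]) / sum(e^2)       ## Step 2: estimate rho
  ystar  <- y[-1] - rhohat * y[-n]              ## Step 3: transform response ...
  xstar  <- x[-1] - rhohat * x[-n]              ## ... and predictor, then refit
  fitstar <- lm(ystar ~ xstar)
  list(rhohat = rhohat,
       fit    = fitstar,
       b0hat  = unname(coef(fitstar)[1]) / (1 - rhohat),   ## back-transformed intercept
       b1hat  = unname(coef(fitstar)[2]))                  ## slope is unchanged
}
## usage sketch (first pass uses the OLS estimates from Step 1):
## s1 <- with(expendData, co_step(Expenditure, Stock, coef(lmfit)[1], coef(lmfit)[2]))
## s1$rhohat   ## about 0.75, as in Step 2 below
## s2 <- with(expendData, co_step(Expenditure, Stock, s1$b0hat, s1$b1hat))   ## second pass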
Step 1:
lmfit <- lm(Expenditure ~ Stock, data = expendData)
head(expendData, 6)
Year | Quarter | Expenditure | Stock | e | r | index | positive | sign |
---|---|---|---|---|---|---|---|---|
1952 | 1 | 214.6 | 159.3 | 2.8701 | 0.8095 | 1 | 1 | + |
1952 | 2 | 217.7 | 161.2 | 1.5994 | 0.4401 | 2 | 1 | + |
1952 | 3 | 219.6 | 162.8 | -0.1812 | -0.0490 | 3 | 0 | - |
1952 | 4 | 227.2 | 164.6 | 3.2782 | 0.8724 | 4 | 1 | + |
1953 | 1 | 230.9 | 165.9 | 3.9877 | 1.0513 | 5 | 1 | + |
1953 | 2 | 233.3 | 167.9 | 1.7869 | 0.4658 | 6 | 1 | + |
Step 2:
# estimate rho (correlation coefficient)
# "by hand"
e <- residuals(lmfit)
et <- e[2:n]
et1 <- e[1:(n-1)]
rhohat <- sum(et * et1) / sum(e^2)
rhohat
[1] 0.7506
## or grab rhohat from the Durbin-Watson test
durbinWatsonTest(lmfit, alternative = "positive")
lag Autocorrelation D-W Statistic p-value
1 0.7506 0.3282 0
Alternative hypothesis: rho > 0
objects(durbinWatsonTest(lmfit, alternative = "positive"))
[1] "alternative" "dw" "p" "r"
rhohat <- durbinWatsonTest(lmfit, alternative = "positive")$r
rhohat
[1] 0.7506
\(\widehat{\rho} = 0.7506\)
Step 3:
y <- with(expendData, Expenditure - rhohat * lag(Expenditure))
x <- with(expendData, Stock - rhohat * lag(Stock))
lmfit.CO1 <- lm(y ~ x)
tidy(lmfit.CO1)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -53.696 | 13.6164 | -3.943 | 0.001 |
x | 2.643 | 0.3069 | 8.614 | 0.000 |
b0star <- tidy(lmfit.CO1)$estimate[1]
b1star <- tidy(lmfit.CO1)$estimate[2]
\(\widehat{\beta}_{0}^* = -53.7\)
\(\widehat{\beta}_{1}^* = 2.643\)
b0hat <- b0star / (1 - rhohat)
b1hat <- b1star
data.frame(b0hat, b0star, b1hat, b1star)
b0hat | b0star | b1hat | b1star |
---|---|---|---|
-215.3 | -53.7 | 2.643 | 2.643 |
Step 4:
dwtest(lmfit.CO1, alternative = "greater")
Durbin-Watson test
data: lmfit.CO1
DW = 1.4, p-value = 0.06
alternative hypothesis: true autocorrelation is greater than 0
## grab the p-value from the Durbin-Watson test
objects(dwtest(lmfit.CO1, alternative = "greater"))
[1] "alternative" "data.name" "method" "p.value" "statistic"
dw.CO1.stat <- dwtest(lmfit.CO1, alternative = "greater")$statistic
dw.CO1.pvalue <- dwtest(lmfit.CO1, alternative = "greater")$p.value
The Durbin-Watson statistic is now closer to 2: \(\; d = 1.426\).
But the p-value is still fairly small: p-value = \(0.05786\).
I would like to make one more iteration.
So, we go back to Step 2 with \(\widehat{\beta}_{0} = -215.3\) and \(\widehat{\beta}_{1} = 2.643\) calculated in Step 3 above.
Step 2: Estimate \(\rho\)
The residuals are now \(\quad e_t = y_t - (\widehat{\beta}_{0} + \widehat{\beta}_{1} x_t) = y_t - (-215.3 + 2.643 x_t)\)
## Calculate rhohat "manually"
e <- with(expendData, Expenditure - (b0hat + b1hat * Stock))
et <- e[2:n]
et1 <- e[1:(n-1)]
rhohat1 <- sum(et * et1) / sum(e^2)
rhohat1
[1] 0.79
I found \(\widehat{\rho} = 0.79\).
Step 3: Cochrane-Orcutt-transform \(y\) and \(x\) and re-estimate \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)
y <- with(expendData, Expenditure - rhohat1 * lag(Expenditure))
x <- with(expendData, Stock - rhohat1 * lag(Stock))
lmfit.CO2 <- lm(y ~ x)
tidy(lmfit.CO2)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -47.38 | 13.6885 | -3.462 | 0.003 |
x | 2.70 | 0.3643 | 7.411 | 0.000 |
b0star <- tidy(lmfit.CO2)$estimate[1]
b1star <- tidy(lmfit.CO2)$estimate[2]
\(\widehat{\beta}_{0}^* = -47.38\)
\(\widehat{\beta}_{1}^* = 2.7\)
b0hat <- b0star / (1 - rhohat1)
b1hat <- b1star
data.frame(b0hat, b0star, b1hat, b1star)
b0hat | b0star | b1hat | b1star |
---|---|---|---|
-225.6 | -47.38 | 2.7 | 2.7 |
Step 4:
dwtest(lmfit.CO2, alternative = "greater")
Durbin-Watson test
data: lmfit.CO2
DW = 1.5, p-value = 0.09
alternative hypothesis: true autocorrelation is greater than 0
## grab the p-value from the Durbin-Watson test
objects(dwtest(lmfit.CO2, alternative = "greater"))
[1] "alternative" "data.name" "method" "p.value" "statistic"
dw.CO2.stat <- dwtest(lmfit.CO2, alternative = "greater")$statistic
dw.CO2.pvalue <- dwtest(lmfit.CO2, alternative = "greater")$p.value
The Durbin-Watson statistic is even closer to 2: \(\; d = 1.525\).
But the p-value is still fairly small: p-value = \(0.09133\).
I would like to make one more iteration.
So, we go back to Step 2 with \(\widehat{\beta}_{0} = -225.6\) and \(\widehat{\beta}_{1} = 2.7\) calculated in Step 3 above.
Step 2: Estimate \(\rho\)
The residuals are now \(\quad e_t = y_t - (\widehat{\beta}_{0} + \widehat{\beta}_{1} x_t) = y_t - (-225.6 + 2.7 x_t)\)
## Calculate rhohat "manually"
e <- with(expendData, Expenditure - (b0hat + b1hat * Stock))
et <- e[2:n]
et1 <- e[1:(n-1)]
rhohat2 <- sum(et * et1) / sum(e^2)
rhohat2
[1] 0.7977
I found \(\widehat{\rho} = 0.7977\).
Step 3: Cochrane-Orcutt-transform \(y\) and \(x\) and re-estimate \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)
y <- with(expendData, Expenditure - rhohat2 * lag(Expenditure))
x <- with(expendData, Stock - rhohat2 * lag(Stock))
lmfit.CO3 <- lm(y ~ x)
tidy(lmfit.CO3)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -46.081 | 13.7151 | -3.360 | 0.0037 |
x | 2.712 | 0.3785 | 7.165 | 0.0000 |
b0star <- tidy(lmfit.CO3)$estimate[1]
b1star <- tidy(lmfit.CO3)$estimate[2]
\(\widehat{\beta}_{0}^* = -46.08\)
\(\widehat{\beta}_{1}^* = 2.712\)
b0hat <- b0star / (1 - rhohat2)
b1hat <- b1star
data.frame(b0hat, b0star, b1hat, b1star)
b0hat | b0star | b1hat | b1star |
---|---|---|---|
-227.8 | -46.08 | 2.712 | 2.712 |
Step 4:
dwtest(lmfit.CO3, alternative = "greater")
Durbin-Watson test
data: lmfit.CO3
DW = 1.5, p-value = 0.1
alternative hypothesis: true autocorrelation is greater than 0
## grab the p-value from the Durbin-Watson test
objects(dwtest(lmfit.CO3, alternative = "greater"))
[1] "alternative" "data.name" "method" "p.value" "statistic"
dw.CO3.stat <- dwtest(lmfit.CO3, alternative = "greater")$statistic
dw.CO3.pvalue <- dwtest(lmfit.CO3, alternative = "greater")$p.value
The Durbin-Watson statistic is even closer to 2: \(\; d = 1.543\),
and the p-value is now about \(0.1\).
We fail to reject the null of no autocorrelation, so we stop here after three iterations.
NOTE: I’ll show you below how we can let the computer do iterations automatically.
\(\widehat{\beta}_{0} = \displaystyle{\frac{\widehat{\beta}_{0}^*}{1-\widehat{\rho}}} = -227.8\)
\(\widehat{\beta}_{1} = \widehat{\beta}_{1}^* = 2.712\)
Thus, our estimated regression equation is
\(y_t = -227.8 + 2.712 x_t\)
Regression coefficient estimates
tidy(lmfit)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -154.7 | 19.8500 | -7.794 | 0 |
Stock | 2.3 | 0.1146 | 20.080 | 0 |
tidy(lmfit.CO3)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -46.081 | 13.7151 | -3.360 | 0.0037 |
x | 2.712 | 0.3785 | 7.165 | 0.0000 |
The slopes are similar and \(\mathrm{se}(\widehat{\beta}_{1})\) is larger after transformation.
The intercept and its SE do not get appropriately transformed in the standard regression output above.
We can transform the output ourselves:
se.b0hat <- tidy(lmfit.CO3)$std.error[1] / (1 - rhohat2)
b0hat <- b0star / (1 - rhohat2)
data.frame(b0hat, se.b0hat)
b0hat | se.b0hat |
---|---|
-227.8 | 67.81 |
\(\widehat{\beta}_{0} = \displaystyle{\frac{\widehat{\beta}_{0}^*}{1-\widehat{\rho}}} = -227.8\)
\(\mathrm{se}(\widehat{\beta}_{0}) \approx \displaystyle{\frac{\mathrm{se}(\widehat{\beta}_{0}^*)}{1-\widehat{\rho}}} = 67.8\)
t-statistic \(\; = \displaystyle{\frac{\widehat{\beta}_{0} - 0}{\mathrm{se}(\widehat{\beta}_{0})}} = -3.36\) as above.
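A quick arithmetic check using the objects computed just above:
## t-statistic for the back-transformed intercept
b0hat / se.b0hat   ## about -3.36, matching the intercept t-statistic in tidy(lmfit.CO3)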
Regression goodness-of-fit statistics
glance(lmfit)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.9573 | 0.9549 | 3.983 | 403.2 | 0 | 2 | 18 |
glance(lmfit.CO3)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.7512 | 0.7366 | 2.238 | 51.34 | 0 | 2 | 17 |
Durbin-Watson statistic
durbinWatsonTest(lmfit, alternative = "positive")
lag Autocorrelation D-W Statistic p-value
1 0.7506 0.3282 0
Alternative hypothesis: rho > 0
durbinWatsonTest(lmfit.CO3, alternative = "positive")
lag Autocorrelation D-W Statistic p-value
1 0.1852 1.543 0.094
Alternative hypothesis: rho > 0
After the model is corrected for autocorrelation by the Cochrane-Orcutt transformation:
DW statistic much closer to 2 (weaker evidence of autocorrelation)
Autocorrelation is much smaller.
Slope coefficients are similar, but \(\mathrm{se}(\widehat{\beta}_{1})\) is more than 3 times larger!
\(R^2\) is smaller.
WARNING:
Ignoring positive autocorrelation often leads to underestimated standard errors, overly narrow confidence intervals, inflated false-positive rates, and an exaggerated impression of fit (here the OLS \(\mathrm{se}(\widehat{\beta}_{1})\) was far too small).
Index plots of residuals:
## index plot of regular OLS residuals
p1 <- gf_point(r ~ index, data = expendData, size = 2) %>%
gf_line() %>%
gf_lims(y = c(-2.1, 2.1)) %>%
gf_labs(x = "Quarter", y = "OLS residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
## index plot of Cochrane-Orcutt residuals after three iterations
p2 <- gf_point(rstandard(lmfit.CO3) ~ 2:n, size = 2) %>%
gf_line() %>%
gf_lims(y = c(-2.1, 2.1)) %>%
gf_labs(x = "Quarter", y = "Cochrane-Orcutt residuals (3 iterations)") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
grid.arrange(p1, p2, ncol=2)
library(orcutt)
cochrane.orcutt(lmfit)
library(orcutt, warn.conflicts = FALSE)
cochrane.orcutt.fit <- cochrane.orcutt(lmfit)
OLS
tidy(lmfit)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -154.7 | 19.8500 | -7.794 | 0 |
Stock | 2.3 | 0.1146 | 20.080 | 0 |
Several iterations of Cochrane-Orcutt
tidy(cochrane.orcutt.fit)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -235.488 | 78.6122 | -2.996 | 0.0081 |
Stock | 2.753 | 0.4365 | 6.308 | 0.0000 |
The slope coefficients are similar, but \(\mathrm{se}(\widehat{\beta}_{1})\) is almost 4 times larger!
Comparison of goodness-of-fit statistics:
glance(lmfit)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.9573 | 0.9549 | 3.983 | 403.2 | 0 | 2 | 18 |
glance(cochrane.orcutt.fit)
r.squared | adj.r.squared | rho | number.interaction | dw.original | p.value.original | dw.transformed | p.value.transformed |
---|---|---|---|---|---|---|---|
0.7006 | 0.683 | 0.8241 | 13 | 0.3282 | 0 | 1.601 | 0.1261 |
Index plots of residuals:
## index plot of regular OLS residuals
p1 <- gf_point(r ~ index, data = expendData, size = 2) %>%
gf_line() %>%
gf_lims(y = c(-2.1, 2.1)) %>%
gf_labs(x = "Quarter", y = "OLS residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
## index plot of Cochrane-Orcutt residuals after three iterations
p2 <- gf_point(rstandard(lmfit.CO3) ~ 2:n, size = 2) %>%
gf_line() %>%
gf_lims(y = c(-2.1, 2.1)) %>%
gf_labs(x = "Quarter", y = "Cochrane-Orcutt residuals (3 iterations)") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
## index plot of Cochrane-Orcutt "standardized" residuals after several iterations
rhohat.orcutt <- cochrane.orcutt.fit$rho
y <- with(expendData, Expenditure - rhohat.orcutt * lag(Expenditure))[-1]
x <- with(expendData, Stock - rhohat.orcutt * lag(Stock))[-1]
b0hat <- cochrane.orcutt.fit$coeff[1] * (1 - rhohat.orcutt)
b1hat <- cochrane.orcutt.fit$coeff[2]
e <- y - b0hat - b1hat * x
## simple standardization of residuals (z-score)
z <- (e - mean(e)) / sd(e)
p3 <- gf_point(z ~ 2:n, size = 2) %>%
gf_line() %>%
gf_lims(y = c(-2.1, 2.1)) %>%
gf_labs(x = "Quarter", y = paste("Cochrane-Orcutt residuals (", cochrane.orcutt.fit$number.interaction, " iterations)", sep="")) %>%
gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
grid.arrange(p1, p2, p3, ncol=3)
Correlated errors can occur when a variable not in the model is related
to the response and related to the time or order in which the data were collected
From textbook Table 8.4:
A midwestern construction industry association wants to know the relationship between housing starts and population growth. They obtain annual data on regional housing starts and try to relate it to number of potential home buyers in the region.
# houseData <- read.delim("http://statistics.uchicago.edu/~collins/data/RABE5/P219.txt")
houseData <- read.delim("./data/P219.txt")
names(houseData) <- c("houseStarts", "popnSize", "mortgageMoney")
head(houseData)
houseStarts | popnSize | mortgageMoney |
---|---|---|
0.0909 | 2.200 | 0.0364 |
0.0894 | 2.222 | 0.0334 |
0.0976 | 2.244 | 0.0387 |
0.0955 | 2.267 | 0.0374 |
0.0968 | 2.280 | 0.0406 |
0.1033 | 2.289 | 0.0424 |
gf_point(houseStarts ~ popnSize, data = houseData, size = 2) %>%
gf_labs(y = "Housing Starts") %>%
gf_labs(x="Population")
lmfit <- lm(houseStarts ~ popnSize, data=houseData)
#plot(fit$residuals)
tidy(lmfit)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.0609 | 0.0104 | -5.845 | 0 |
popnSize | 0.0714 | 0.0042 | 16.867 | 0 |
glance(lmfit)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.9252 | 0.922 | 0.0041 | 284.5 | 0 | 2 | 23 |
## add residuals, standardized residuals, and an index variable to dataset
n <- nrow(houseData) ## sample size
houseData <- mutate(houseData, e = residuals(lmfit), r = rstandard(lmfit), index = 1:n)
## index plot
p1 <- gf_point(r ~ index, data = houseData, size = 2) %>%
gf_line() %>%
gf_labs(title='Fit with population only') %>%
gf_labs(y = "standardized residuals") %>%
gf_lims(y = c(min(min(~r, data = houseData), -2.1), max(max(~r, data = houseData), 2.1))) %>%
gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
p1
This residual plot shows clear evidence of autocorrelation!
We can confirm this using a runs test:
houseData <- mutate(houseData, positive = 1 * (r > 0), sign = factor(positive, labels = c("-", "+")))
Signs of the residuals in time order:
\(-----+++++++++++-----+-++\)
runs.test(factor(sign(houseData$r)))
Runs Test
data: factor(sign(houseData$r))
Standard Normal = -3, p-value = 0.002
alternative hypothesis: two.sided
runs.fit <- runs.test(factor(sign(houseData$r)))
runs.test.statistic <- runs.fit$statistic
runs.test.pvalue <- runs.fit$p.value
p-value = \(0.0024\).
This indicates we should reject the null hypothesis of no autocorrelation.
We can also calculate the Durbin-Watson statistic:
durbinWatsonTest(lmfit)
lag Autocorrelation D-W Statistic p-value
1 0.6511 0.6208 0
Alternative hypothesis: rho != 0
The residual plot, runs test, and Durbin-Watson statistic
all indicate the presence of autocorrelation.
The researchers further conjectured that the availability of mortgage money for the region might also be a predictor for housing starts, and decided to include it in the model.
Available mortgage money also varies by time.
fit.mortgage <- lm(houseStarts ~ popnSize + mortgageMoney, data = houseData)
tidy(fit.mortgage)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.0104 | 0.0103 | -1.013 | 0.322 |
popnSize | 0.0347 | 0.0064 | 5.394 | 0.000 |
mortgageMoney | 0.7605 | 0.1216 | 6.255 | 0.000 |
glance(fit.mortgage)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.9731 | 0.9706 | 0.0025 | 397.6 | 0 | 3 | 22 |
# add residuals and standardized residuals to dataset
n <- nrow(houseData) # sample size
houseData <- mutate(houseData, e.mortgage = residuals(fit.mortgage), r.mortgage = rstandard(fit.mortgage))
## index plot
p2 <- gf_point(r.mortgage ~ index, data = houseData, size = 2) %>%
gf_line() %>%
gf_labs(title='Fit with population and mortgage money') %>%
gf_labs(y = "standardized residuals") %>%
gf_lims(y = c(min(min(~ r.mortgage, data = houseData), -2.1), max(max(~ r.mortgage, data = houseData), 2.1))) %>%
gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
p2
This looks as though the autocorrelation has been fixed.
Let’s check first with a runs test:
houseData <- mutate(houseData, positive.mortgage = 1 * (r.mortgage > 0), sign.mortgage = factor(positive.mortgage, labels = c("-", "+")))
Signs of the residuals in time order:
\(--+--++++-++++-+--++-+++-\)
runs.fit <- runs.test(factor(sign(houseData$r.mortgage)))
runs.test.statistic <- runs.fit$statistic
runs.test.pvalue <- runs.fit$p.value
p-value = \(1\).
This indicates we should not reject the null hypothesis of no autocorrelation.
We can also calculate the Durbin-Watson statistic:
durbinWatsonTest(fit.mortgage)
lag Autocorrelation D-W Statistic p-value
1 0.03957 1.852 0.412
Alternative hypothesis: rho != 0
grid.arrange(p1, p2, ncol=2)
Can we understand what happened?
First, make an added-variable plot
to see how mortgage money affects housing starts
after adjusting both mortgage money and housing starts for population:
avPlots(fit.mortgage, terms = ~ mortgageMoney, pch=16, xlab="Residuals of mortgage money after regressing on population", ylab="Resid. of housing after regressing on population")
We see that there is a positive relationship:
more mortgage money means more housing, all else being equal.
We can look at how the residuals of mortgage money (after regressing it on population) change over time:
fit.m <- lm(mortgageMoney ~ popnSize, data=houseData) # regress mortgage money on population; keep the residuals
houseData <- mutate(houseData, m.on.h.resids = fit.m$residuals)
p3 <- gf_point(m.on.h.resids ~ index, data = houseData, size = 2) %>%
gf_line() %>%
gf_labs(y = "Residuals of mortgage money after regressing on population")%>%
# gf_lims(y = c(min(min(~r.mortgage, data = house), -2.1), max(max(~r.mortgage, data = house), 2.1))) %>%
gf_hline(yintercept = c(0), linetype = "dashed")
grid.arrange(p3 , p1, ncol=2)
It seems that in the original fit, with mortgage money omitted,
the model underestimated housing starts at those times when
mortgage money was higher than would have been predicted from population alone.
The analytic methods examined thus far (DW statistic, runs test) deal with
first order autocorrelation,
meaning serial correlation of adjacent values.
There are cases where autocorrelation may indeed be present,
but with other, perhaps more complex, patterns of time-dependence.
From textbook Table 8.8,
we consider the exports of a company that produces and markets ski equipment in the US.
We hope to summarize a relationship of quarterly sales to some leading economic indicators,
like personal disposable income (PDI) in billions of current dollars.
The initial model is
\[ S_t = \beta_{0} + \beta_{1} PDI_t + \epsilon_t \]
where \(S_t\) is the ski Sales in period \(t\).
The sales period naturally follows time order, and cycles through seasons.
# skiData <- read.delim("http://statistics.uchicago.edu/~collins/data/RABE5/P224.txt")
skiData <- read.delim("./data/P224.txt")
head(select(skiData, Sales, PDI), 8)
Sales | PDI |
---|---|
37.0 | 109 |
33.5 | 115 |
30.8 | 113 |
37.9 | 116 |
37.4 | 118 |
31.6 | 120 |
34.0 | 122 |
38.1 | 124 |
Let’s look at the scatterplot of Sales vs PDI:
gf_point(Sales ~ PDI, data = skiData) %>%
gf_labs(title = 'Ski Sales') %>%
gf_labs(y = "Sales", x = "PDI")
This looks pretty good. A clear linear relationship.
Since the data were collected over time,
we should check for autocorrelation.
Start with an index plot of the Sales variable:
n <- nrow(skiData)
skiData <- mutate(skiData, index = 1:n)
gf_point(Sales ~ index, data = skiData, size = 2) %>%
gf_line(color='black') %>%
gf_labs(title='Ski Sales') %>%
gf_labs(y = "Sales", x="Index")
The Sales values go up-and-down and up-and-down over time.
We go ahead and fit the simple linear regression model for now to check further for autocorrelation.
ski.fit <- lm(Sales ~ PDI, data = skiData)
skiData <- mutate(skiData, r = rstandard(ski.fit))
tidy(ski.fit)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 12.3921 | 2.539 | 4.88 | 0 |
PDI | 0.1979 | 0.016 | 12.35 | 0 |
glance(ski.fit)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.8006 | 0.7953 | 3.019 | 152.5 | 0 | 2 | 38 |
Look at an index plot of the residuals:
gf_point(r ~ index , data = skiData) %>%
gf_line(color = "black") %>%
gf_labs(title = 'Ski Sales') %>%
gf_labs(y = "r", x = "index") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
This makes clear the up-and-down and up-and-down pattern.
What do we see from the Durbin-Watson statistic?
durbinWatsonTest(ski.fit)
lag Autocorrelation D-W Statistic p-value
1 -0.0008867 1.968 0.79
Alternative hypothesis: rho != 0
What?
The autocorrelation is basically zero.
The Durbin-Watson statistic is near 2.
There is no evidence of first-order autocorrelation. There must be something else that explains the pattern.
Were any other variables collected over time?
skiData <- mutate(skiData, theSeason = factor(Season, labels = c('Spring/Summer', 'Fall/Winter')))
head(skiData, 8)
Quarter | Sales | PDI | Season | index | r | theSeason |
---|---|---|---|---|---|---|
Q1/64 | 37.0 | 109 | 1 | 1 | 1.0516 | Fall/Winter |
Q2/64 | 33.5 | 115 | 0 | 2 | -0.5679 | Spring/Summer |
Q3/64 | 30.8 | 113 | 0 | 3 | -1.3633 | Spring/Summer |
Q4/64 | 37.9 | 116 | 1 | 4 | 0.8753 | Fall/Winter |
Q1/65 | 37.4 | 118 | 1 | 5 | 0.5665 | Fall/Winter |
Q2/65 | 31.6 | 120 | 0 | 6 | -1.5522 | Spring/Summer |
Q3/65 | 34.0 | 122 | 0 | 7 | -0.8654 | Spring/Summer |
Q4/65 | 38.1 | 124 | 1 | 8 | 0.3970 | Fall/Winter |
What is the pattern in the data by season?
gf_point(Sales ~ PDI, color = ~ theSeason, data = skiData) %>%
gf_labs(title = 'Ski Sales') %>%
gf_labs(y = "Sales", x = "PDI") %>%
gf_theme(legend.position = "bottom")
Perhaps two parallel lines are needed (same slope, different intercept by season).
n <- nrow(skiData)
skiData <- mutate(skiData, index = 1:n)
gf_point(Sales ~ index, color = ~ theSeason, data = skiData, size = 2) %>%
gf_line(color='black') %>%
gf_labs(title='Ski Sales') %>%
gf_labs(y = "Sales", x="Index")
It’s clear that Sales are higher in Fall/Winter and lower in Spring/Summer,
and this information is not accounted for in our model.
Here is the residual plot with seasons indicated:
gf_point(r ~ index , color = ~ theSeason, data = skiData) %>%
gf_line(color = "black") %>%
gf_labs(title = 'Ski Sales') %>%
gf_labs(y = "r", x = "index") %>%
gf_theme(legend.position = "bottom")%>%
# gf_labs(legend=c('Fall and Winter','Spring and summer')) %>%
gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
The higher Sales in Fall/Winter lead to consistently higher residuals in these seasons.
This is a form of autocorrelation:
high Sales in Fall are paired with high Sales in the next Winter,
while low Sales in Spring are adjacent to low Sales in Summer.
Reminder: this autocorrelation was NOT detected by the Durbin-Watson test:
durbinWatsonTest(ski.fit)
lag Autocorrelation D-W Statistic p-value
1 -0.0008867 1.968 0.818
Alternative hypothesis: rho != 0
The positive correlation within adjacent Fall/Winter and Spring/Summer quarters
is balanced by a negative correlation between adjacent Winter/Spring and Summer/Fall seasons,
leading to a \(\widehat\rho\) that is close to zero.
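If you want to probe beyond lag 1, one option (a hedged suggestion, not from the textbook) is to ask car::durbinWatsonTest() for several lags; with quarterly, seasonal data the lag-2 and lag-4 autocorrelations can tell a different story than lag 1.
## check residual autocorrelation at lags 1 through 4 (output not shown here)
durbinWatsonTest(ski.fit, max.lag = 4)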
Also, the linear relationship between Sales
and PDI
is different by season (same slope, different intercept).
Let’s try a different approach.
We’ll include Season
as an indicator variable, with value 1 in the Fall/Winter.
ski.fit2 <- lm(Sales ~ PDI + Season, data = skiData)
skiData <- mutate(skiData, r2 = rstandard(ski.fit2))
tidy(ski.fit2)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 9.5402 | 0.9748 | 9.787 | 0 |
PDI | 0.1987 | 0.0060 | 32.915 | 0 |
Season | 5.4643 | 0.3597 | 15.192 | 0 |
glance(ski.fit2)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.9724 | 0.971 | 1.137 | 652.9 | 0 | 3 | 37 |
p1 <- gf_point(r2 ~ index, color = ~theSeason, data = skiData) %>%
gf_labs(title = 'Ski Sales') %>%
gf_labs(y = "r", x="index") %>%
gf_theme(legend.position="bottom") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
p2 <- gf_point(r2 ~ index, color = ~theSeason, data = skiData) %>%
gf_line(color = "black") %>%
gf_labs(title = 'Ski Sales') %>%
gf_labs(y = "r", x="index") %>%
gf_theme(legend.position="bottom") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
grid.arrange(p1, p2, ncol=2)
This looks much better.
I’m a little concerned that there may be some other autocorrelation going on:
that residuals are consistently higher in observations 15-30 or so.
But this autocorrelation, if it exists, would be first-order,
and I should be able to capture it with the Durbin-Watson test:
durbinWatsonTest(ski.fit2)
lag Autocorrelation D-W Statistic p-value
1 0.09692 1.772 0.326
Alternative hypothesis: rho != 0
OK, this looks good! No evidence of autocorrelation.
Last, I will consider including an interaction term,
in case the slope differs for Fall/Winter versus Spring/Summer:
ski.fit3 <- lm(Sales ~ PDI * Season, data=skiData)
tidy(ski.fit3)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 9.5300 | 1.3787 | 6.9122 | 0.0000 |
PDI | 0.1987 | 0.0087 | 22.8773 | 0.0000 |
Season | 5.4846 | 1.9397 | 2.8276 | 0.0076 |
PDI:Season | -0.0001 | 0.0122 | -0.0106 | 0.9916 |
The interaction term isn’t significant, so I can stick with the simpler model.