We know some transformations to remedy non-linearity
are also variance-stabilizing transformations.
We can also adjust for non-constant variance by transforming the data, or by weighted least squares (introduced later in this section).
Textbook Table 6.9 data:
A study of 27 industrial plants of varying sizes
\(y\) = number of supervisors
\(x\) = number of supervised workers
supervisorData <- read.csv("http://statistics.uchicago.edu/~collins/data/s224/supervisor_supervised.csv")
head(supervisorData)
workers | supervisors |
---|---|
294 | 30 |
247 | 32 |
267 | 37 |
358 | 44 |
423 | 47 |
311 | 49 |
gf_point(supervisors ~ workers, data = supervisorData)
The relationship looks roughly linear, but there is a clear problem with non-constant variance.
lmfit <- lm(supervisors ~ workers, data = supervisorData)
tidy(lmfit)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 14.4481 | 9.5620 | 1.511 | 0.1433 |
workers | 0.1054 | 0.0113 | 9.303 | 0.0000 |
glance(lmfit)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.7759 | 0.7669 | 21.73 | 86.54 | 0 | 2 | 25 |
p1 <- gf_point(supervisors ~ workers, data = supervisorData) %>%
gf_coefline(model = lmfit)
p2 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
gf_labs(x = "fitted values (supervisors)", y = "std residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
# p3 <- gf_point(rstandard(lmfit) ~ workers, data = supervisorData) %>%
# gf_labs(x = "predictor (workers)", y = "std residuals") %>%
# gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
p3 <- gf_histogram(~ rstandard(lmfit), bins = 6, color = "white") %>%
gf_labs(x = "standardized residuals")
p4 <- gf_qq(~ rstandard(lmfit)) %>%
gf_labs(x = "standard normal", y = "standardized residuals") %>%
gf_qqline(~ rstandard(lmfit))
grid.arrange(p1, p2, p3, p4, ncol=2)
\[y_i = \beta_{0} + \beta_{1} x_i + \epsilon_i\]
We need a transformation that will stabilize the variance while still leaving us with a linear model to fit.
In many industrial, economic, and biological applications, the standard deviation of the errors is roughly proportional to the size of the predictor:
\[ \mathrm{var}(\epsilon_i) = k^2x_i^2,\quad k>0\] If the errors have this variance structure, we could calculate new errors with constant variance by dividing them all by \(x_i\):
\[\mathrm{var}\left[\frac{\epsilon_i}{x_i}\right]=\frac{\mathrm{var}(\epsilon_i)}{x_i^2}=\frac{ k^2 x_i^2}{x_i^2}=k^2 = \text{constant variance}\]
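A small simulation sketch (the values of k and x below are made up for illustration, not taken from the textbook data) confirms this calculation: errors whose standard deviation grows proportionally with \(x\) have constant standard deviation after dividing by \(x\).

set.seed(224)                        # arbitrary seed for reproducibility
k <- 2                               # hypothetical proportionality constant
x <- c(100, 300, 500)                # hypothetical predictor values
eps <- sapply(x, function(xi) rnorm(10000, mean = 0, sd = k * xi))
apply(eps, 2, sd)                    # sds grow with x: about 200, 600, 1000
apply(sweep(eps, 2, x, "/"), 2, sd)  # after dividing by x: all approximately k = 2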
Dividing the entire model equation by \(x_i\) gives \[\frac{y_i}{x_i} = \beta_{0} \frac{1}{x_i} + \beta_{1} \frac{x_i}{x_i} + \frac{\epsilon_i}{x_i}\] We can see that this is a new linear model by renaming the transformed variables and coefficients:
\[Y^\prime = \frac{Y}{X} \qquad X^\prime = \frac{1}{X}\qquad \beta^\prime_0 = \beta_1 \qquad \beta^\prime_1 = \beta_0\qquad \epsilon^\prime=\frac{\epsilon}{X}\] So we are left with a new model:
\[y^\prime_i = \beta_{0}^\prime + \beta_{1}^\prime x^\prime_i + \epsilon^\prime_i\] NOTE: \(\beta_0\) and \(\beta_1\) have switched places in this new model!
This can be a little confusing.
supervisorData <- mutate(supervisorData,
workers.reciprocal = 1 / workers,
supervisors.over.x = supervisors / workers
)
lmfit.reciprocal <- lm(supervisors.over.x ~ workers.reciprocal, data = supervisorData)
p1 <- gf_point(supervisors.over.x ~ workers.reciprocal, data = supervisorData) %>%
gf_coefline(model = lmfit.reciprocal)
p2 <- gf_point(rstandard(lmfit.reciprocal) ~ fitted(lmfit.reciprocal)) %>%
gf_labs(x = "fitted values (supervisors over x)", y = "std residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
# p3 <- gf_histogram(~ rstandard(lmfit.reciprocal), bins=6, color = "white") %>%
# gf_labs(x = "standardized residuals")
# p4 <- gf_qq(~ rstandard(lmfit.reciprocal)) %>%
# gf_labs(x = "standard normal", y = "standardized residuals") %>%
# gf_qqline(~ rstandard(lmfit.reciprocal))
grid.arrange(p1, p2, ncol=2)
Variance is stabilized.
But, model-level inference from this model is NOT a simple matter
of grabbing \(R^2\) or the F-statistic from the regression output.
This is a difficult statistical problem …still being investigated.
The F-test and \(R^2\) from the reciprocal model output should be ignored.
However, the t-tests for the coefficients are still meaningful.
Just be careful that the intercept is the slope and the slope is the intercept. Ugh!
Comparison of the regression output:
tidy(lmfit)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 14.4481 | 9.5620 | 1.511 | 0.1433 |
workers | 0.1054 | 0.0113 | 9.303 | 0.0000 |
b0 <- tidy(lmfit)$estimate[1]  # OLS estimate of the intercept, beta0
b1 <- tidy(lmfit)$estimate[2]  # OLS estimate of the slope, beta1
tidy(lmfit.reciprocal)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.121 | 0.009 | 13.4454 | 0.0000 |
workers.reciprocal | 3.803 | 4.570 | 0.8323 | 0.4131 |
# In the reciprocal model the roles are swapped: its slope estimates beta0
# and its intercept estimates beta1.
b0prime <- tidy(lmfit.reciprocal)$estimate[2]  # estimate of beta0 (the model's slope)
b1prime <- tidy(lmfit.reciprocal)$estimate[1]  # estimate of beta1 (the model's intercept)
glance(lmfit)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.7759 | 0.7669 | 21.73 | 86.54 | 0 | 2 | 25 |
glance(lmfit.reciprocal)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.027 | -0.012 | 0.0227 | 0.6927 | 0.4131 | 2 | 25 |
Remember the relationships:
\(\beta^{\prime}_0 = \beta_1, \quad \beta^{\prime}_1 = \beta_0\).
Without any transformation (ordinary least squares):
\(\widehat{y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 x_i = 14.45 + 0.1054\, x_i\)
Using transformed data:
\(\widehat{y}_i = \widehat{\beta}^\prime_1 + \widehat{\beta}^\prime_0 x_i = 3.803 + 0.121\, x_i\) (recall that \(\beta^\prime_1\) plays the role of the intercept \(\beta_0\), and \(\beta^\prime_0\) plays the role of the slope \(\beta_1\))
In the original model, the intercept \(\beta_0\) was not significantly different from zero (p = 0.14)
…and in the new model, its counterpart \(\beta_1^\prime\) (the slope of the transformed model) is likewise not significantly different from zero (p = 0.41).
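As a quick sketch (using the b0, b1, b0prime, and b1prime objects extracted above), we can compare the fitted values for the original response implied by the two sets of estimates:

fits <- data.frame(
  workers    = supervisorData$workers,
  ols        = b0 + b1 * supervisorData$workers,           # 14.45 + 0.1054 x
  reciprocal = b0prime + b1prime * supervisorData$workers  # 3.803 + 0.121 x
)
head(fits)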
Weighted least squares is another approach to correcting non-constant variance problems
…one that does not involve transforming the data.
To see how it works, we first need to remind ourselves why non-constant variance is a problem in the first place. The main idea:
If the distribution of the response has larger error variance at some values of the predictor(s) than at others, then the observations from the high-variance regions are less reliable, yet ordinary least squares treats every observation as equally informative.
lmfit <- lm(supervisors ~ workers, data = supervisorData)
gf_point(supervisors ~ workers, data = supervisorData) %>%
gf_coefline(model = lmfit)
What if we could simply tell the computer to give these high-variance observations less weight,
to better equalize the influence of all the observations?
We can do this if we know the variance of each observation (at least up to a constant of proportionality),
either for theoretical reasons or because we've got repeated measurements at each \(x\) value.
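For example, here is a hypothetical sketch (not part of the supervisor analysis, which has no replicate measurements) of how replicates at each \(x\) could supply estimated weights:

library(dplyr)
set.seed(1)                                                # arbitrary seed
repeated <- data.frame(x = rep(c(10, 20, 40), each = 5))   # 5 replicates per x (made up)
repeated$y <- 2 + 0.5 * repeated$x +
  rnorm(nrow(repeated), sd = 0.1 * repeated$x)             # error sd grows with x
weightedData <- repeated %>%
  group_by(x) %>%
  mutate(wt = 1 / var(y)) %>%                              # weight = 1 / estimated variance at each x
  ungroup()
lm(y ~ x, data = weightedData, weights = wt)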
OLS finds \(\beta\)s which minimize
\[\sum{(y_i - \widehat{y}_i)^2}=\sum{e_i^2}\]
For weighted least squares, we want to minimize
\[\sum{w_i\,(y_i - \widehat{y}_i)^2}\;=\;\sum{w_i\, e_i^2},\qquad \text{where }\;w_i\propto\frac{1}{var(\epsilon_i)}\]
For the supervisor data, the “fan-shaped” residual plot indicates
that the standard deviation of the errors is proportional to the size of the predictor variable:
p1 <- gf_point(supervisors ~ workers, data = supervisorData) %>%
gf_coefline(model = lmfit)
p2 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
gf_labs(x = "fitted values (supervisors)", y = "std residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
grid.arrange(p1, p2, ncol=2)
\[ var(\epsilon_i) = k^2x_i^2,\quad k>0\]
so we use weights \(\displaystyle w_i=\frac{1}{x_i^2}\) and minimize the function
\[\sum{\frac{1}{x_i^2}\,(y_i - \widehat{y}_i)^2}\;=\;\sum\frac{e_i^2}{x_i^2}\]
supervisorData <- mutate(supervisorData, wt = 1 / workers^2)
lmfit.wt <- lm(supervisors ~ workers, data = supervisorData, weights = wt)
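As a quick check (a sketch using the lmfit.wt and lmfit objects above), the weighted residual sum of squares is indeed smaller at the WLS estimates than at the OLS estimates:

# weighted SSE as a function of candidate coefficients (intercept, slope)
wss <- function(beta) {
  with(supervisorData, sum(wt * (supervisors - beta[1] - beta[2] * workers)^2))
}
wss(coef(lmfit.wt))   # weighted SSE at the WLS estimates
wss(coef(lmfit))      # larger: the OLS estimates do not minimize the weighted SSE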
Comparison of transformation method vs. weighted least squares
tidy(lmfit.reciprocal)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.121 | 0.009 | 13.4454 | 0.0000 |
workers.reciprocal | 3.803 | 4.570 | 0.8323 | 0.4131 |
tidy(lmfit.wt)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 3.803 | 4.570 | 0.8323 | 0.4131 |
workers | 0.121 | 0.009 | 13.4454 | 0.0000 |
glance(lmfit.reciprocal)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.027 | -0.012 | 0.0227 | 0.6927 | 0.4131 | 2 | 25 |
glance(lmfit.wt)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.8785 | 0.8737 | 0.0227 | 180.8 | 0 | 2 | 25 |
With weighted least squares, it is OK to use model-level statistics like \(R^2\) and the F-test.
Residual plots:
p1 <- gf_point(rstandard(lmfit.reciprocal) ~ fitted(lmfit.reciprocal)) %>%
gf_labs(x = "fitted values (supervisors/workers)", y = "std residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
p2 <- gf_point(rstandard(lmfit.wt) ~ fitted(lmfit.wt), data = supervisorData) %>%
gf_labs(x = "fitted values (supervisors, weighted)", y = "std residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
grid.arrange(p1, p2, ncol=2)
For reciprocal transformation:
\[\begin{align*} SSE & = \sum_{i=1}^n (y_i^\prime - \beta_0^\prime - \beta_1^\prime x_i^\prime)^2\\ & = \sum_{i=1}^n \left[\frac{y_i}{x_i} - \beta_1 - \beta_0\frac{1}{x_i}\right]^2\\ & = \sum_{i=1}^n \frac{1}{x_i^2} (y_i - \beta_1 x_i - \beta_0)^2\\ & = \sum_{i=1}^n \frac{1}{x_i^2} (y_i - \beta_0 - \beta_1 x_i)^2 \end{align*}\]
…which is the SSE for weighted least squares with weights \(\displaystyle w_i = \frac{1}{x_i^2}\)
So both methods give the same estimates for the slope and intercept
(with the roles of intercept and slope reversed in the reciprocal-transformation output).
The WLS method also yields a valid \(R^2\) and F-test.
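A quick numeric confirmation of this equivalence, using the two model objects fit above:

coef(lmfit.wt)                # WLS: intercept and slope on the original scale
rev(coef(lmfit.reciprocal))   # reciprocal model, roles reversed: the same two numbers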