What do we do when the regression assumptions are violated?
Transforming (essentially rescaling) variables is a useful tool.
The linear regression model may become (more) appropriate.
We might transform the response variable:
\(\sqrt{Y} = \beta_0 + \beta_1 x + \epsilon\)
\(\displaystyle{\frac{1}{Y}} = \beta_0 + \beta_1 x + \epsilon\)
\(\log(Y) = \beta_0 + \beta_1 x + \epsilon\)
\(\displaystyle{\log\left(\frac{Y}{1-Y}\right) = \beta_0 + \beta_1 x + \epsilon}\quad\) (really only makes sense when \(Y\) is bounded: \(\;0 < Y < 1\))
We might transform a predictor variable:
\(Y = \beta_0 + \beta_1 \log(x) + \epsilon\)
\(Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon\)
…or transform both variables:
\(\displaystyle{\frac{1}{Y}} = \beta_0 + \beta_1 \displaystyle{\frac{1}{x}} + \epsilon\)
\(\log(Y) = \beta_0 + \beta_1 \log(x) + \epsilon\)
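In R, each of these transformed models can be fit by writing the transformation directly in the lm() formula. A minimal sketch (mydata, y, and x are hypothetical placeholders, not data used later in these notes):
lm(sqrt(y) ~ x, data = mydata)        ## sqrt(Y) = b0 + b1 x + e
lm(log(y) ~ x, data = mydata)         ## log(Y)  = b0 + b1 x + e
lm(y ~ log(x), data = mydata)         ## transform the predictor
lm(y ~ x + I(x^2), data = mydata)     ## a quadratic term needs I() inside a formula
lm(log(y) ~ log(x), data = mydata)    ## transform both variables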
But transformations cannot fix all problems.
Sometimes a genuinely non-linear model is needed instead.
Look at lots of scatterplots to help decide
Consider the context of the data
Is exponential growth/decay expected?
\(\displaystyle Y = \beta_0\; e^{\beta_1 x}\; e^{\epsilon}\)
Non-linear model and non-constant spread (multiplicative errors)
Suggested transformation: \(\log(Y)\)
\(\log(Y) = \beta_0 + \beta_1 x + \epsilon\)
Here are simulated data from an exponential growth model (\(\beta_1 > 0\) for growth, otherwise decay)
Model: \(\displaystyle Y = \beta_0\; e^{\beta_1 x}\; e^{\epsilon}\)
b0 <- 2
b1 <- .3
sigma <- 0.3
x <- seq(0, 10, 0.1)
n <- length(x)
set.seed(12345)
y <- b0 * exp( b1*x + rnorm(n, m=0, s=sigma) )
p1 <- gf_point(y ~ x)
p2 <- gf_point(y ~ x) %>%
gf_coefline(model = lm(y ~ x))
lmfit <- lm(y ~ x)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_qq(~ rstandard(lmfit)) %>%
gf_labs(x = "standard normal", y = "standardized residuals") %>%
gf_qqline(~ rstandard(lmfit))
grid.arrange(p1, p2, p3, p4, ncol=2)
Transformed data: Response is \(\log(y)\)
plota <- gf_histogram(~ y, color = "white", bins=15)
plotb <- gf_histogram(~ log(y), color = "white", bins=12)
grid.arrange(plota, plotb, ncol=1)
Model: \(\log(Y) = \beta_0 + \beta_1 x + \epsilon\)
p5 <- gf_point(log(y) ~ x)
p6 <- gf_point(log(y) ~ x) %>%
gf_coefline(model = lm(log(y) ~ x))
lmfit <- lm(log(y) ~ x)
p7 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p8 <- gf_qq(~ rstandard(lmfit)) %>%
gf_labs(x = "standard normal", y = "standardized residuals") %>%
gf_qqline(~ rstandard(lmfit))
grid.arrange(p5, p6, p7, p8, ncol=2)
\(\log(Y)\) is a variance-stabilizing transformation
that will often “fix” non-normality at the same time!
Consider the context of the data
Is multiplicative growth/shrinkage expected?
\(\displaystyle Y = \beta_0 x^{\beta_1}e^{\epsilon}\)
Non-linear model and non-constant spread (multiplicative errors)
Suggested transformation: \(\log(Y)\) and \(\log(x)\)
\(\log(Y) = \beta_0 + \beta_1\log(x) + \epsilon\)
Here are simulated data from a multiplicative growth model (\(\beta_1 > 0\) for growth, otherwise decay)
Model: \(\displaystyle Y = \beta_0 x^{\beta_1}e^{\epsilon}\)
b0 <- 2
b1 <- 0.5
sigma <- 0.5
x <- seq(1, 10, 0.1)
n <- length(x)
set.seed(1234567)
y <- b0 * x^b1 * exp(rnorm(n, m=0, s=sigma) )
p1 <- gf_point(y ~ x)
p2 <- gf_point(y ~ x) %>%
gf_coefline(model = lm(y ~ x))
lmfit <- lm(y ~ x)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
gf_labs(x = "fitted values (yhat)", y = "std residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_qq(~ rstandard(lmfit)) %>%
gf_labs(x = "standard normal", y = "standardized residuals") %>%
gf_qqline(~ rstandard(lmfit))
grid.arrange(p1, p2, p3, p4, ncol=2)
Transformed data: Response is \(\log(y)\), predictor is \(\log(x)\)
plota <- gf_histogram(~ y, color = "white", bins=15)
plotb <- gf_histogram(~ log(y), color = "white", bins=12)
grid.arrange(plota, plotb, ncol=1)
Model: \(\log(Y) = \beta_0 + \beta_1\log(x) + \epsilon\)
p5 <- gf_point(log(y) ~ log(x))
p6 <- gf_point(log(y) ~ log(x)) %>%
gf_coefline(model = lm(log(y) ~ log(x)))
lmfit <- lm(log(y) ~ log(x))
p7 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p8 <- gf_qq(~ rstandard(lmfit)) %>%
gf_labs(x = "standard normal", y = "standardized residuals") %>%
gf_qqline(~ rstandard(lmfit))
grid.arrange(p5, p6, p7, p8, ncol=2)
Consider the context of the data
Is the response a proportion (binomial model)?
\(E(Y\, |\, x) = \pi(x)\) = some function of \(x\) with range between 0 and 1
\(\mathrm{var}(Y\, |\, x) = \displaystyle{\frac{\pi(x)[1-\pi(x)]}{n}}\) …which depends on \(x\) and the mean \(E(Y\, |\, x)\)
restricted, non-normal response, non-constant spread (varies with the mean)
Suggested transformation: \(\;\displaystyle \log\left(\frac{Y}{1-Y}\right)\)
After transformation the model is \(\;\displaystyle{ \log\left(\frac{Y}{1-Y}\right) = \beta_0 + \beta_1 x + \epsilon }\)
So, the suggested model is actually: \(\displaystyle{Y = \frac{e^{\beta_0 + \beta_1 x + \epsilon}}{1 + e^{\beta_0 + \beta_1 x + \epsilon}}}\quad\) (after solving for \(Y\))
Here are data simulated from the model.
Model: \(\displaystyle{Y = \frac{e^{\beta_0 + \beta_1 x + \epsilon}}{1 + e^{\beta_0 + \beta_1 x + \epsilon}}}\)
b0 <- 2
b1 <- 0.5
sigma <- 0.4
x <- seq(-13, 1, 0.2)
n <- length(x)
set.seed(123452)
e <- rnorm(n, m=0, s=sigma)  ## additive normal errors inside the logistic, matching the stated model
y <- exp(b0 + b1*x + e) / (1 + exp(b0 + b1*x + e))
p1 <- gf_point(y ~ x)
p2 <- gf_point(y ~ x) %>%
gf_coefline(model = lm(y ~ x))
lmfit <- lm(y ~ x)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
gf_labs(x = "fitted values (yhat)", y = "std residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_qq(~ rstandard(lmfit)) %>%
gf_labs(x = "standard normal", y = "standardized residuals") %>%
gf_qqline(~ rstandard(lmfit))
grid.arrange(p1, p2, p3, p4, ncol=2)
Transformed data: Response is \(\log(y / (1-y))\)
plota <- gf_histogram(~ y, color = "white", bins=15)
plotb <- gf_histogram(~ log(y / (1-y)), color = "white", bins=12)
grid.arrange(plota, plotb, ncol=1)
Model: \(\;\displaystyle{ \log\left(\frac{Y}{1-Y}\right) = \beta_0 + \beta_1 x + \epsilon }\)
p5 <- gf_point(log(y / (1-y)) ~ x)
p6 <- gf_point(log(y / (1-y)) ~ x) %>%
gf_coefline(model = lm(log(y / (1-y)) ~ x))
lmfit <- lm(log(y / (1-y)) ~ x)
p7 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p8 <- gf_qq(~ rstandard(lmfit)) %>%
gf_labs(x = "standard normal", y = "standardized residuals") %>%
gf_qqline(~ rstandard(lmfit))
grid.arrange(p5, p6, p7, p8, ncol=2)
NOTE: This does not change the fact that the residuals come from
a short-tailed (non-normal) distribution (because the response is bounded between 0 and 1).
There are better methods of modeling binomial responses (logistic regression: Chapter 12 and STAT 226)
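For reference only, a logistic regression would be fit with glm() rather than lm(). A minimal sketch, assuming binomial data stored as hypothetical columns successes and failures in a data frame mydata:
glm(cbind(successes, failures) ~ x, family = binomial, data = mydata)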
Consider the context of the data
Is the response a count from a fairly rare event (Poisson model)?
\(E(Y\, |\, x) = \lambda(x)\) = some function of \(x\) (\(\lambda(x) > 0\))
\(\mathrm{var}(Y\, |\, x) = \lambda(x)\) …which depends on \(x\) and the mean \(E(Y\, |\, x)\)
right-skewed response and non-constant spread (increases as the mean increases)
Suggested model: \(\sqrt{Y} = \beta_0 + \beta_1 x + \epsilon\)
Here are data simulated from a Poisson event model.
b0 <- 2
b1 <- 3
sigma <- 2
x <- seq(1, 25, 0.25)
n <- length(x)
set.seed(1122333)
e <- exp(rnorm(n, m=0, s=sigma))
y <- rpois(n, lambda=b0 + b1*x + e)
x <- x[-c(1,2,4,5,7)]
y <- y[-c(1,2,4,5,7)]
p1 <- gf_point(y ~ x)
p2 <- gf_point(y ~ x) %>%
gf_coefline(model = lm(y ~ x))
lmfit <- lm(y ~ x)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
gf_labs(x = "fitted values (yhat)", y = "std residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_qq(~ rstandard(lmfit)) %>%
gf_labs(x = "standard normal", y = "standardized residuals") %>%
gf_qqline(~ rstandard(lmfit))
grid.arrange(p1, p2, p3, p4, ncol=2)
Transformed data: Response is \(\sqrt{y}\)
plota <- gf_histogram(~ y, color = "white", bins=15)
plotb <- gf_histogram(~ sqrt(y), color = "white", bins=12)
grid.arrange(plota, plotb, ncol=1)
Model: \(\sqrt{Y} = \beta_0 + \beta_1 x + \epsilon\)
p5 <- gf_point(sqrt(y) ~ x)
p6 <- gf_point(sqrt(y) ~ x) %>%
gf_coefline(model = lm(sqrt(y) ~ x))
lmfit <- lm(sqrt(y) ~ x)
p7 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p8 <- gf_qq(~ rstandard(lmfit)) %>%
gf_labs(x = "standard normal", y = "standardized residuals") %>%
gf_qqline(~ rstandard(lmfit))
grid.arrange(p5, p6, p7, p8, ncol=2)
NOTE: This does not change the fact that the residuals come from
a right-skewed (non-normal) distribution.
There are better methods of modeling Poisson responses (Poisson regression: Chapter 13 and STAT 226)
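For reference only, a Poisson regression models the counts directly with glm(). A minimal sketch using the simulated counts y and x from above:
glm(y ~ x, family = poisson)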
\(-2\) | \(-1\) | \(-1/2\) | \(0\) (\(\log\)) | \(1/2\) | \(1\) | \(2\) |
---|---|---|---|---|---|---|
\(\frac{1}{Y^2}\) | \(\frac{1}{Y}\) | \(\frac{1}{\sqrt{Y}}\) | \(\log(Y)\) | \(\sqrt{Y}\) | \(Y\) | \(Y^2\) |
\(\frac{1}{x^2}\) | \(\frac{1}{x}\) | \(\frac{1}{\sqrt{x}}\) | \(\log(x)\) | \(\sqrt{x}\) | \(x\) | \(x^2\) |
The Box-Cox method is an algorithm that tries to find an “optimal” transformation of \(Y\) of the form \(Y^{\lambda}\).
This method does not suggest transformations for the predictors, only the response.
If the method suggests \(\lambda = -1.78\), either use \(Y^{-1.78}\)
or a close, but easier-to-interpret transformation, like \(Y^{-2} = \displaystyle \frac{1}{Y^2}\)
If \(\lambda=0\) is suggested by the Box-Cox method,
then use the \(\log(Y)\) transformation.
Reminder: We used the square root transformation for Poisson responses: \(\sqrt{y}\).
In R:
library(MASS)
boxcox( lm(y ~ x, data = mydata) )
boxcox(lm(y ~ x))
boxcox(lm(y ~ x), lambda = seq(-0, 1, 0.1))
Indeed, \(\lambda = 0.5\) is within the interval of transformations suggested by the Box-Cox method.
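The boxcox() function also returns the grid of \(\lambda\) values and their profile log-likelihoods, so the maximizing \(\lambda\) can be read off programmatically (a sketch):
bc <- boxcox(lm(y ~ x), plotit = FALSE)  ## list with components x (lambda grid) and y (log-likelihood)
bc$x[which.max(bc$y)]                    ## lambda with the largest log-likelihood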
library(Sleuth3) ## datasets from the statistical sleuth textbook
voltageData <- case0802
n <- dim(voltageData)[1]
In a quality control experiment, an electronic component was tested
for resilience against different voltage inputs.
\(n = 76\) components were tested
response = time until component breakdown
predictor = voltage level supplied to the component
head(select(voltageData, -Group), 10)
Time | Voltage |
---|---|
5.79 | 26 |
1579.52 | 26 |
2323.70 | 26 |
68.85 | 28 |
108.29 | 28 |
110.29 | 28 |
426.07 | 28 |
1067.60 | 28 |
7.74 | 30 |
17.05 | 30 |
lmfit <- lm(Time ~ Voltage, data = voltageData)
p1 <- gf_histogram(~ Time, data = voltageData, color = "white")
p2 <- gf_point(Time ~ Voltage, data = voltageData) %>%
gf_labs(x = "Voltage (kV)", y = "breakdown time (min)") %>%
gf_coefline(model = lmfit)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_qq(~ rstandard(lmfit)) %>%
gf_labs(x = "standard normal", y = "standardized residuals") %>%
gf_qqline(~ rstandard(lmfit))
grid.arrange(p1, p2, p3, p4, ncol=2)
Non-normal: The response variable is highly skewed (and residuals too).
Non-linear: The linear model is clearly not appropriate.
So, what transformation should we use?
Let’s try the “ladder of transformations” and the Box-Cox method.
Ladder of transformations: \(Y^{\lambda}\)
Try \(\lambda = -2, -1.5, -1, -0.5, 0\;(=\log), 0.5, 1, 1.5, 2\).
Well, we clearly don’t want to spread out the \(y\)-values even more than they already are.
So, let’s not even try \(\lambda = 1.5, 2\)
Let’s try \(\lambda = -1, 0.5\) and \(\log(Y)\) to start.
First, try \(\lambda = -1\): the transformation is \(1/Y\)
voltageData <- mutate(voltageData,
reciprocalTime = 1 / Time,
sqrtTime = sqrt(Time),
logTime = log(Time))
lmfit <- lm(reciprocalTime ~ Voltage, data = voltageData)
p1 <- gf_histogram(~ reciprocalTime, data = voltageData, color = "white")
p2 <- gf_point(reciprocalTime ~ Voltage, data = voltageData) %>%
gf_labs(x = "Voltage (kV)", y = "1 / (breakdown time)") %>%
gf_coefline(model = lmfit)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_qq(~ rstandard(lmfit)) %>%
gf_labs(x = "standard normal", y = "standardized residuals") %>%
gf_qqline(~ rstandard(lmfit))
grid.arrange(p1, p2, p3, p4, ncol=2)
Now, \(\lambda = 0.5\): the transformation is \(\sqrt{Y}\)
lmfit <- lm(sqrtTime ~ Voltage, data = voltageData)
p1 <- gf_histogram(~ sqrtTime, data = voltageData, color = "white")
p2 <- gf_point(sqrtTime ~ Voltage, data = voltageData) %>%
gf_labs(x = "Voltage (kV)", y = "sqrt(breakdown time)") %>%
gf_coefline(model = lmfit)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_qq(~ rstandard(lmfit)) %>%
gf_labs(x = "standard normal", y = "standardized residuals") %>%
gf_qqline(~ rstandard(lmfit))
grid.arrange(p1, p2, p3, p4, ncol=2)
Finally, \(\lambda = 0\): the transformation is \(\log(Y)\)
lmfit.log <- lm(logTime ~ Voltage, data = voltageData)
p1 <- gf_histogram(~ logTime, data = voltageData, color = "white")
p2 <- gf_point(logTime ~ Voltage, data = voltageData) %>%
gf_labs(x = "Voltage (kV)", y = "log(breakdown time)") %>%
gf_coefline(model = lmfit.log)
p3 <- gf_point(rstandard(lmfit.log) ~ fitted(lmfit.log)) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_qq(~ rstandard(lmfit.log)) %>%
gf_labs(x = "standard normal", y = "standardized residuals") %>%
gf_qqline(~ rstandard(lmfit.log))
grid.arrange(p1, p2, p3, p4, ncol=2)
The logarithm is by far the best transformation thus far,
although the residuals have some left skew.
A side-by-side comparison:
lmfit <- lm(Time ~ Voltage, data = voltageData)  ## refit the raw-scale model (lmfit was overwritten above)
p1 <- gf_point(Time ~ Voltage, data = voltageData) %>%
gf_labs(x = "Voltage (kV)", y = "breakdown time (min)") %>%
gf_coefline(model = lmfit)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
gf_labs(x = "fitted values (yhat): raw data", y = "standardized residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p2 <- gf_point(logTime ~ Voltage, data = voltageData) %>%
gf_labs(x = "Voltage (kV)", y = "log(breakdown time)") %>%
gf_coefline(model = lmfit.log)
p4 <- gf_point(rstandard(lmfit.log) ~ fitted(lmfit.log)) %>%
gf_labs(x = "fitted values (yhat): transformed data", y = "standardized residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
grid.arrange(p1, p2, p3, p4, ncol=2)
In R:
library(MASS)
boxcox(lm(Time ~ Voltage, data = voltageData))
boxcox(lm(Time ~ Voltage, data = voltageData), lambda = seq(-0.25, 0.25, 0.1))
boxcox(lm(Time ~ Voltage, data = voltageData), lambda = seq(-0.25, 0.25, 0.1), plotit = FALSE)
$x
[1] -0.25 -0.15 -0.05 0.05 0.15 0.25
$y
[1] -210.7 -202.1 -197.7 -198.5 -205.4 -218.8
Since 0 is “not rejected”, the log transformation is also a suggestion from the Box-Cox method.
lmfit <- lm(logTime ~ Voltage, data = voltageData)
tidy(lmfit)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 18.9555 | 1.9100 | 9.924 | 0 |
Voltage | -0.5074 | 0.0574 | -8.840 | 0 |
glance(lmfit)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.5136 | 0.507 | 1.56 | 78.14 | 0 | 2 | 74 |
lmfit.coef <- tidy(lmfit)
b0 <- lmfit.coef$estimate[1]
b1 <- lmfit.coef$estimate[2]
\[\mathrm{E}(\log({\rm breakdown\ time})) = 18.96 + (-0.5074)\, {\rm voltage}\]
So, a one kV increase in voltage is associated with a \(0.5074\) decrease in mean log(breakdown time).
What can we say about the relationship between voltage and breakdown time
back in the original units (minutes)?
b0 <- 2
b1 <- .3
sigma <- 0.3
x <- seq(0, 10, 0.1)
n <- length(x)
set.seed(12345)
y <- b0 * exp( b1*x + rnorm(n, m=0, s=sigma) )
p1 <- gf_point(y ~ x)
p2 <- gf_point(log(y) ~ x) %>%
gf_coefline(model = lm(log(y) ~ x))
lmfit <- lm(log(y) ~ x)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_lm(y ~ x, formula = log(y) ~ x, backtrans = exp) %>%
gf_point(y ~ x)
grid.arrange(p1, p2, p3, p4, ncol=2)
The mean of the log is less than the log of the mean
(the log is a non-linear transformation).
Try this out with some numbers:
a <- 2
b <- 3
Let \(a=2\) and \(b = 3\),
then the average of the logs \(= \displaystyle \frac{\log(a) + \log(b)}{2} = \frac{\log(2) + \log(3)}{2} = 0.8959\)
but the log of the average \(= \displaystyle \log\left[\frac{a + b}{2}\right] = \log\left[\frac{2 + 3}{2}\right] = 0.9163\)
So, while it is true that \(\log(Y) = \beta_0 + \beta_1 x + \epsilon\)
implies \(Y = \exp(\log(Y)) = \exp(\beta_0 + \beta_1 x + \epsilon)\)
it is NOT TRUE that \(\mathrm{E}(\log(Y)) = \beta_0 + \beta_1 x\)
implies \(\mathrm{E}(Y) = \exp(\beta_0 + \beta_1 x)\)
because \(\exp[\mathrm{E}(\log(Y))] < \exp[\log(\mathrm{E}(Y))] = \mathrm{E}(Y)\).
Thus, \(\widehat{y}= \exp(\widehat{\beta}_{0} + \widehat{\beta}_{1} x)\) tends to underestimate \(\mathrm{E}(Y)\).
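A quick numerical illustration of this underestimation, using a simulated log-normal sample (the numbers here are illustrative assumptions, unrelated to the voltage data):
set.seed(1)
yy <- exp(rnorm(10000, mean = 1, sd = 0.5))  ## so log(yy) ~ N(1, 0.5^2)
exp(mean(log(yy)))   ## roughly exp(1) = 2.72 (back-transformed mean of the logs)
mean(yy)             ## roughly exp(1 + 0.5^2/2) = 3.08 (the actual mean is larger)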
Good news: \({\rm Median}(\log(Y)) = \log\left({\rm Median}(Y)\right)\)
The log function is monotone increasing,
so if half the original distribution of \(Y\) values is below \(M\),
then half of the distribution of \(\log(Y)\) values is below \(\log(M)\).
So, \(\widehat{y}= \exp(\widehat{\beta}_{0} + \widehat{\beta}_{1} x)\) is a good estimate for the median of \(y\)
…if the errors for the original model have at least a symmetric distribution: \(\log(Y) = \beta_0 + \beta_1 x + \epsilon\).
Then, \(E(\log(Y)) = \textrm{ Median}(\log(Y))\;\) by symmetry.
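Applying this to the voltage data: back-transforming the fitted slope gives a multiplicative effect on the median breakdown time (a sketch; the log-scale model is refit here because lmfit and b1 were reused for the simulation above):
exp(coef(lm(logTime ~ Voltage, data = voltageData))["Voltage"])
## about 0.60: each additional kV multiplies the estimated median breakdown time by roughly 0.60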
A 95% confidence interval for the median breakdown time (minutes)
can come from a 95% confidence interval for the mean of log(breakdown time).
range(~ Voltage, data=voltageData)
[1] 26 38
newVoltage <- data.frame(Voltage = 30)
Let’s find a confidence interval for the median breakdown time when voltage is \(30\) kV.
First, find a confidence interval for the mean log(breakdown) time
when voltage is \(30\) kV.
lmfit <- lm(logTime ~ Voltage, data = voltageData)
predict(lmfit, newVoltage, interval = "confidence")
fit lwr upr
1 3.735 3.229 4.24
kable(predict(lmfit, newVoltage, interval = "confidence"))
fit | lwr | upr |
---|---|---|
3.735 | 3.229 | 4.24 |
Then, convert the estimate and interval
from units of log(breakdown time) to breakdown time.
\[e^{\log(y)} = y\]
kable(exp(predict(lmfit, newVoltage, interval = "confidence")))
fit | lwr | upr |
---|---|---|
41.87 | 25.26 | 69.4 |
elephantData <- case2201 ## Dataset from the Statistical Sleuth textbook
glimpse(elephantData)
Observations: 41
Variables: 2
$ Age <int> 27, 28, 28, 28, 28, 29, 29, 29, 29, 29, 29, 30, 32, 33, 33, 3…
$ Matings <int> 0, 1, 1, 1, 3, 0, 0, 0, 2, 2, 2, 1, 2, 4, 3, 3, 3, 2, 1, 1, 2…
n <- dim(elephantData)[1]
In an 8-year study of the mating habits of male elephants,
data were collected on age and number of successful matings.
Larger elephants are typically more successful competing for female attention,
and elephants continue to grow throughout their lives.
“Thus, males most likely to pass their genes to future generations are those whose characteristics enable them to live long lives.” -J. H. Poole, “Mate Guarding, Reproductive Success, and Female Choice in African Elephants”, Animal Behaviour, 37 (1989): 842-49.
\(n = 41\) male elephants were in the study at the start
response = number of successful matings
predictor = age
First 10 elephants in the sample:
head(elephantData, 10)
Age | Matings |
---|---|
27 | 0 |
28 | 1 |
28 | 1 |
28 | 1 |
28 | 3 |
29 | 0 |
29 | 0 |
29 | 0 |
29 | 2 |
29 | 2 |
lmfit <- lm(Matings ~ Age, data = elephantData)
p1 <- gf_histogram(~ Matings, data = elephantData, color = "white")
p2 <- gf_point(Matings ~ Age, data = elephantData) %>%
gf_labs(x = "Age of Male Elephants", y = "## Successful Matings") %>%
gf_coefline(model = lmfit)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
gf_lims(y = c( min(-2.2, min(rstandard(lmfit))), max(2.2, max(rstandard(lmfit))) )) %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_qq(~ rstandard(lmfit)) %>%
gf_labs(x = "standard normal", y = "standardized residuals") %>%
gf_qqline(~ rstandard(lmfit))
grid.arrange(p1, p2, p3, p4, ncol=2)
Do you see problems with any of the regression specifications?
STOP
Variance non-constant: Increasing with \(x\) = Age.
The response variable is a count of a rare event (mostly 0, 1, 2, 3).
Perhaps a Poisson model for the response makes sense.
This implies the response transformation \(\sqrt{Y}\) may be beneficial.
The first time I ran boxcox( lm(Matings ~ Age, data = elephantData) )
R complained that “response variable must be positive”
So, I ran this instead to handle the zeros: boxcox( lm( (Matings + 1) ~ Age, data = elephantData) )
lmfit <- lm(Matings ~ Age, data = elephantData)
boxcox( lm( (Matings + 1) ~ Age, data = elephantData) )
boxcox( lm( (Matings + 1) ~ Age, data = elephantData) ,lambda = seq(-0.25, 1, 0.1))
elephantData <- mutate(elephantData, sqrtMatings = sqrt(Matings))
lmfit.sqrt <- lm(sqrtMatings ~ Age, data = elephantData)
p1 <- gf_histogram(~ sqrtMatings, data = elephantData, color = "white")
p2 <- gf_point(sqrtMatings ~ Age, data = elephantData) %>%
gf_labs(x = "Age of Male Elephants", y = "sqrt(#) Successful Matings") %>%
gf_coefline(model = lmfit.sqrt)
p3 <- gf_point(rstandard(lmfit.sqrt) ~ fitted(lmfit.sqrt)) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
gf_lims(y = c( min(-2.2, min(rstandard(lmfit.sqrt))), max(2.2, max(rstandard(lmfit.sqrt))) )) %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_qq(~ rstandard(lmfit.sqrt)) %>%
gf_labs(x = "standard normal", y = "standardized residuals") %>%
gf_qqline(~ rstandard(lmfit.sqrt))
grid.arrange(p1, p2, p3, p4, ncol=2)
A side-by-side comparison:
p1 <- gf_point(Matings ~ Age, data = elephantData) %>%
gf_labs(x = "Age of Male Elephants", y = "## Successful Matings") %>%
gf_coefline(model = lmfit)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
gf_labs(x = "fitted values (yhat): raw data", y = "standardized residuals") %>%
gf_lims(y = c( min(-2.2, min(rstandard(lmfit))), max(2.2, max(rstandard(lmfit))) )) %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p2 <- gf_point(sqrtMatings ~ Age, data = elephantData) %>%
gf_labs(x = "Age of Male Elephants", y = "sqrt(#) Successful Matings") %>%
gf_coefline(model = lmfit.sqrt)
p4 <- gf_point(rstandard(lmfit.sqrt) ~ fitted(lmfit.sqrt)) %>%
gf_labs(x = "fitted values (yhat): transformed data", y = "standardized residuals") %>%
gf_lims(y = c( min(-2.2, min(rstandard(lmfit.sqrt))), max(2.2, max(rstandard(lmfit.sqrt))) )) %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
grid.arrange(p1, p2, p3, p4, ncol=2)
bacteriaData <- read.delim("http://statistics.uchicago.edu/~collins/data/RABE5/P168.txt")
n <- dim(bacteriaData)[1]
From the textbook, page 167:
The data are the number of surviving bacteria (in hundreds) after exposure to 200kV X-rays for intervals of 6 minutes
(1 interval, 2 intervals, …, up to 15 intervals).
\(t\) = number of 6-minute intervals (\(t=1, 2, \ldots, 15\))
\(N_t\) = number of surviving bacteria after exposure \(t\)
Let’s start by looking at the data (\(n = 15\)).
kable(bacteriaData)
t | N_t |
---|---|
1 | 355 |
2 | 211 |
3 | 197 |
4 | 166 |
5 | 142 |
6 | 106 |
7 | 104 |
8 | 60 |
9 | 56 |
10 | 38 |
11 | 36 |
12 | 32 |
13 | 21 |
14 | 19 |
15 | 15 |
First, consider a simple linear regression model
\[N_t = \beta_0 + \beta_1 t + \epsilon\]
lmfit <- lm(N_t ~ t, data = bacteriaData)
p1 <- gf_histogram(~ N_t, data = bacteriaData, color = "white")
p2 <- gf_point(N_t ~ t, data = bacteriaData) %>%
gf_labs(x = "## of 6-minute X-rays", y = "## of bacteria remaining") %>%
gf_coefline(model = lmfit)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
gf_lims(y = c( min(-2.2, min(rstandard(lmfit))), max(2.2, max(rstandard(lmfit))) )) %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_qq(~ rstandard(lmfit)) %>%
gf_labs(x = "standard normal", y = "standardized residuals") %>%
gf_qqline(~ rstandard(lmfit))
grid.arrange(p1, p2, p3, p4, ncol=2)
The data do not exhibit a linear relationship.
According to the scientific theory, “there is a single, vital center in each bacterium,
and this must be hit by a ray before the bacteria is inactivated or killed.”
Further, it is thought that “if the theory is applicable, then \(N_t\) and \(t\) should be related by”
\[N_t = N_0\, e^{\beta_1 t},\quad t \ge 0\qquad \text{(deterministic, scientific model)}\]
\(N_0 =\) the number of bacteria at the start of the experiment
\(\beta_1 =\) the decay rate per 6-minute exposure
Note: It’s not clear to me why we were not provided \(N_0\) with the data.
Presumably, they counted the number of bacteria before starting the X-rays.
Oh well. We can estimate it from the data.
Taking the logarithm of both sides of the theoretical relationship,
\[\log(N_t) \;=\; \log(N_0) + \beta_1 t \;=\; \beta_0 + \beta_1 t\]
Of course, we expect variability around this theory, so we’ll try the model
\[\log(N_t) \;=\; \beta_0 + \beta_1 t + \epsilon_t\]
where \(\beta_0 = \log(N_0)\).
On the original scale, this means
\[N_t = e^{\beta_0 + \beta_1 t + \epsilon_t} \;=\; e^{\beta_0}\, e^{\beta_1 t}\, e^{\epsilon_t} \;=\; N_0\, e^{\beta_1 t}\,\epsilon^\prime_t ,\quad t \ge 0\]
This is not a linear model. Why not?
The coefficient \(\beta_1\) does not enter the model linearly.
What about the errors?
\(\epsilon_t = \log(\epsilon^\prime_t)\)
and if \(\epsilon_t \sim N(0, \sigma^2)\), then \(\epsilon^\prime_t = e^{\epsilon_t}\) must have a “log-normal” distribution.
In practice, we just check as usual that the residuals
from the linear model could plausibly have come from a normal distribution
(rather than checking the residuals from a non-linear model are “log-normal”)
bacteriaData <- mutate(bacteriaData, logN_t = log(N_t))
lmfit.log <- lm(logN_t ~ t, data = bacteriaData)
p1 <- gf_histogram(~ logN_t, data = bacteriaData, color = "white")
p2 <- gf_point(logN_t ~ t, data = bacteriaData) %>%
gf_labs(x = "# of 6-minute X-rays", y = "log(#) of bacteria remaining") %>%
gf_coefline(model = lmfit.log)
p3 <- gf_point(rstandard(lmfit.log) ~ fitted(lmfit.log)) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
gf_lims(y = c( min(-2.2, min(rstandard(lmfit.log))), max(2.2, max(rstandard(lmfit.log))) )) %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_qq(~ rstandard(lmfit.log)) %>%
gf_labs(x = "standard normal", y = "standardized residuals") %>%
gf_qqline(~ rstandard(lmfit.log))
grid.arrange(p1, p2, p3, p4, ncol=2)
A side-by-side comparison:
p1 <- gf_point(N_t ~ t, data = bacteriaData) %>%
gf_labs(x = "# of 6-minute X-rays", y = "# of bacteria remaining") %>%
gf_coefline(model = lmfit)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
gf_lims(y = c( min(-2.2, min(rstandard(lmfit))), max(2.2, max(rstandard(lmfit))) )) %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p2 <- gf_point(logN_t ~ t, data = bacteriaData) %>%
gf_labs(x = "# of 6-minute X-rays", y = "log(#) of bacteria remaining") %>%
gf_coefline(model = lmfit.log)
p4 <- gf_point(rstandard(lmfit.log) ~ fitted(lmfit.log)) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
gf_lims(y = c( min(-2.2, min(rstandard(lmfit.log))), max(2.2, max(rstandard(lmfit.log))) )) %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
grid.arrange(p1, p2, p3, p4, ncol=2)
And what is our estimate for \(N_0\)?
Whatever it is, it had better be larger than \(355\). Why?
kable(head(bacteriaData, 4))
t | N_t | logN_t |
---|---|---|
1 | 355 | 5.872 |
2 | 211 | 5.352 |
3 | 197 | 5.283 |
4 | 166 | 5.112 |
Here are the results of fitting the log-linear regression model to the data: \[\log(N_t) \;=\; \beta_0 + \beta_1 t + \epsilon_t\]
tidy(lmfit.log)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 5.9732 | 0.0598 | 99.92 | 0 |
t | -0.2184 | 0.0066 | -33.22 | 0 |
glance(lmfit.log)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.9884 | 0.9875 | 0.11 | 1104 | 0 | 2 | 13 |
b0 <- tidy(lmfit.log)$estimate[1]
b1 <- tidy(lmfit.log)$estimate[2]
So, \(\widehat{\beta}_{0} = 5.973\) and this estimates \(\log(N_0)\).
Thus, \(\widehat{N}_0 = \exp(\widehat{\beta}_{0}) = \exp(5.973) = 392.7\)
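In R, this is just exp() applied to the fitted intercept (using the b0 extracted above):
exp(b0)   ## estimated N_0, about 392.7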
Reminder: \(\log()\) and \(\exp()\) are not linear functions.
In fact, \(\exp[\mathrm{E}(\widehat{\beta}_{0})] \le \mathrm{E}[\exp(\widehat{\beta}_{0})]\)
and \(\exp[\mathrm{E}(\widehat{\beta}_{0})] = \exp(\beta_0) = N_0\).
Thus, \(\mathrm{E}[\exp(\widehat{\beta}_{0})] \ge N_0\).
So even though \(\mathrm{E}(\widehat{\beta}_{0}) = \beta_0\) (unbiased), \(\mathrm{E}(\widehat{N}_0) = \mathrm{E}[\exp(\widehat{\beta}_{0})] \ge N_0\).
That is, the estimate \(\widehat{N}_0 = \exp(\widehat{\beta}_{0})\) is biased
(tends to overestimate \(N_0\)).
This bias does not affect estimating or testing hypotheses about the decay rate \(\beta_1\).
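The fitted decay rate itself can also be back-transformed for interpretation (using the b1 extracted above); under the scientific model, this is the per-interval survival factor:
exp(b1)   ## about 0.80: each additional 6-minute exposure multiplies the surviving count by roughly 0.80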
brainData <- read.csv("http://statistics.uchicago.edu/~collins/data/s224/brain.csv")
## remove the three dinosaur observations before analyzing
brainData <- filter(brainData, bodywt <= 9000)
glimpse(brainData)
Observations: 25
Variables: 3
$ name <fct> Mountain beaver, Cow, Graywolf, Goat, Guineapig, Asian elepha…
$ brainwt <dbl> 8.1, 423.0, 119.5, 115.0, 5.5, 4603.0, 419.0, 655.0, 115.0, 2…
$ bodywt <dbl> 1.350, 465.000, 36.330, 27.660, 1.040, 2547.000, 187.100, 521…
arrange(brainData, bodywt)
name | brainwt | bodywt |
---|---|---|
Mouse | 0.4 | 0.023 |
Hamster | 1.0 | 0.120 |
Mole | 3.0 | 0.122 |
Rat | 1.9 | 0.280 |
Guineapig | 5.5 | 1.040 |
Mountain beaver | 8.1 | 1.350 |
Rabbit | 12.1 | 2.500 |
Cat | 25.6 | 3.300 |
Rhesus monkey | 179.0 | 6.800 |
Potar monkey | 115.0 | 10.000 |
Goat | 115.0 | 27.660 |
Kangaroo | 56.0 | 35.000 |
Graywolf | 119.5 | 36.330 |
Chimpanzee | 440.0 | 52.160 |
Sheep | 175.0 | 55.500 |
Human | 1320.0 | 62.000 |
Jaguar | 157.0 | 100.000 |
Donkey | 419.0 | 187.100 |
Pig | 180.0 | 192.000 |
Gorilla | 406.0 | 207.000 |
Cow | 423.0 | 465.000 |
Horse | 655.0 | 521.000 |
Giraffe | 680.0 | 529.000 |
Asian elephant | 4603.0 | 2547.000 |
African elephant | 5712.0 | 6654.000 |
Look at a scatterplot of the data.
gf_point(brainwt ~ bodywt, data = brainData)
What does the Box-Cox method suggest?
boxcox( lm(brainwt ~ bodywt, data = brainData) )
The transformation \(Y^{1/4}\) is suggested. Let’s try that.
gf_point(brainwt^(1/4) ~ bodywt, data = brainData)
Yikes! It seems the Box-Cox method is no help.
It only looks for optimal transformations of the response (not predictors).
Both body weight (\(x\)) and brain weight (\(y\)) have very right-skewed distributions.
I would probably try to symmetrize BOTH with a log transformation.
But, we will investigate other transformations for both variables too.
p1 <- gf_point(1/brainwt ~ 1/bodywt, data = brainData)
p2 <- gf_point(1/sqrt(brainwt) ~1/sqrt(bodywt), data = brainData)
p3 <- gf_point(log(brainwt, 10) ~ log(bodywt, 10), data = brainData)
p4 <- gf_point(sqrt(brainwt) ~ sqrt(bodywt), data = brainData)
grid.arrange(p1, p2, p3, p4, ncol=2)
Indeed, the \(\log(x)\) and \(\log(y)\) transformations linearize the relationship.
Do these transformations also symmetrize the \(x\) and \(y\) distributions?
p1 <- gf_histogram(~ bodywt, data = brainData, bins=10, color = "white")
p2 <- gf_histogram(~ log(bodywt, 10), data = brainData, bins=10, color = "white")
p3 <- gf_histogram(~ brainwt, data = brainData, bins=10, color = "white")
p4 <- gf_histogram(~ log(brainwt, 10), data = brainData, bins=10, color = "white")
grid.arrange(p1, p2, p3, p4, ncol=2)
That’s better. Much less skew.
Here is the model fit and residual diagnostics.
brainData <- mutate(brainData,
logbodywt = log(bodywt, 10),
logbrainwt = log(brainwt, 10)
)
lmfit <- lm(logbrainwt ~ logbodywt, data = brainData)
brainData <- mutate(brainData,
yhat = fitted(lmfit),
r = rstandard(lmfit)
)
tidy(lmfit)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.9339 | 0.0871 | 10.72 | 0 |
logbodywt | 0.7523 | 0.0457 | 16.45 | 0 |
glance(lmfit)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.9217 | 0.9183 | 0.3152 | 270.7 | 0 | 2 | 23 |
p1 <- gf_point(logbrainwt ~ logbodywt, data = brainData) %>%
gf_labs(x = "log(body weight)", y = "log(brain weight)") %>%
gf_coefline(model = lmfit)
p2 <- gf_histogram(~ r, data = brainData, bins=10, color = "white") %>%
gf_labs(x = "standardized residuals")
p3 <- gf_point(r ~ yhat, data = brainData) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2) %>%
gf_lims(y = c(-3, 3))
p4 <- gf_qq(~ r, data = brainData) %>%
gf_labs(x = "standard normal", y = "standardized residuals") %>%
gf_qqline(~ r, data = brainData)
grid.arrange(p1, p2, p3, p4, ncol=2)
Two animals are outliers. Out of curiosity, which animals?
tail(arrange(brainData, r), 7)
name | brainwt | bodywt | logbodywt | logbrainwt | yhat | r | |
---|---|---|---|---|---|---|---|
19 | Cat | 25.6 | 3.300 | 0.5185 | 1.4082 | 1.3240 | 0.2748 |
20 | Asian elephant | 4603.0 | 2547.000 | 3.4060 | 3.6630 | 3.4961 | 0.5683 |
21 | Mole | 3.0 | 0.122 | -0.9136 | 0.4771 | 0.2466 | 0.7906 |
22 | Potar monkey | 115.0 | 10.000 | 1.0000 | 2.0607 | 1.6862 | 1.2139 |
23 | Chimpanzee | 440.0 | 52.160 | 1.7173 | 2.6435 | 2.2258 | 1.3547 |
24 | Rhesus monkey | 179.0 | 6.800 | 0.8325 | 2.2529 | 1.5602 | 2.2484 |
25 | Human | 1320.0 | 62.000 | 1.7924 | 3.1206 | 2.2823 | 2.7210 |
Look at Cook’s distance and hat values
brainData <- mutate(brainData,
h = hatvalues(lmfit),
C = cooks.distance(lmfit)
)
p <- 1 ## one predictor
n <- dim(brainData)[1] ## sample size
p1 <- gf_point(h ~ r, data = brainData) %>%
gf_lims(x = c( min(-2.5, min(~r, data = brainData)), max(2.5, max(~r, data = brainData)))) %>%
gf_labs(x = "standardized residuals", y = "hat value") %>%
gf_hline(yintercept = 2 * (p+1) / n, linetype = "dashed") %>%
gf_vline(xintercept = c(-2,0,2), linetype = "dashed")
p2 <- gf_dotplot(~ C, data=brainData) %>%
gf_lims(y = c(0, n)) %>%
gf_labs(x = "Cook's distance")
grid.arrange(p1, p2, ncol=2)
How much influence might the various “interesting” observations have?
p1 <- gf_point(logbrainwt ~ logbodywt, data = brainData, size = ~ h) %>%
gf_labs(x = "log(body weight)", y = "log(brain weight)") %>%
gf_coefline(model = lmfit)
p2 <- gf_point(logbrainwt ~ logbodywt, data = brainData, size = ~ C) %>%
gf_labs(x = "log(body weight)", y = "log(brain weight)") %>%
gf_coefline(model = lmfit)
grid.arrange(p1, p2, ncol=2)
See Table 6.1 and Table 6.5 in the textbook for ideas.