All lectures

1 Chapter 6: Transformations

1.0.1 Intro: Transformations of variables

What do we do when the regression assumptions are violated?

  • non-linearity
  • non-normality
  • non-constant variance (heteroscedasticity)

Transforming (essentially rescaling) variables is a useful tool.
The linear regression model may become (more) appropriate.

We might transform the response variable:

  • \(\sqrt{Y} = \beta_0 + \beta_1 x + \epsilon\)

  • \(\displaystyle{\frac{1}{Y}} = \beta_0 + \beta_1 x + \epsilon\)

  • \(\log(Y) = \beta_0 + \beta_1 x + \epsilon\)

  • \(\displaystyle{\log\left(\frac{Y}{1-Y}\right) = \beta_0 + \beta_1 x + \epsilon}\quad\) (really only makes sense when \(Y\) is bounded: \(\;0 < Y < 1\))

We might transform a predictor variable:

  • \(Y = \beta_0 + \beta_1 \log(x) + \epsilon\)

  • \(Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon\)

…or transform both variables:

  • \(\displaystyle{\frac{1}{Y}} = \beta_0 + \beta_1 \displaystyle{\frac{1}{x}} + \epsilon\)

  • \(\log(Y) = \beta_0 + \beta_1 \log(x) + \epsilon\)
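
In R, each of these models can be fit with lm() by writing the transformation directly in the formula. A minimal sketch, assuming a data frame dat with columns y and x (placeholder names, not from these notes) and that each transformation is defined (e.g., \(y > 0\) for logs):

## Fitting transformed models with lm(); `dat`, `y`, `x` are placeholder names.
fit_sqrt   <- lm(sqrt(y) ~ x, data = dat)            ## sqrt(Y) = b0 + b1 x
fit_recip  <- lm(I(1/y) ~ I(1/x), data = dat)        ## 1/Y = b0 + b1 (1/x)
fit_log    <- lm(log(y) ~ x, data = dat)             ## log(Y) = b0 + b1 x
fit_loglog <- lm(log(y) ~ log(x), data = dat)        ## log(Y) = b0 + b1 log(x)
fit_quad   <- lm(y ~ x + I(x^2), data = dat)         ## Y = b0 + b1 x + b2 x^2
fit_logit  <- lm(log(y / (1 - y)) ~ x, data = dat)   ## logit(Y); needs 0 < y < 1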

1.0.1.1 Why transform variables?

  • The linear regression model may become (more) appropriate.
  • Non-linear models are more difficult to estimate and interpret
  • Non-constant variance is difficult to summarize succinctly

But transformations cannot fix every problem.
Sometimes a genuinely non-linear model is needed.

1.0.1.2 Which transformation? Scatterplots

Look at lots of scatterplots to help decide

  • Response (\(Y\)) vs. predictor(s), for example, a scatterplot matrix
  • Standardized residuals vs. fitted values (\(\widehat{y}\)’s)
  • Standardized residuals vs. predictors (\(x\)-variables)
  • Normal quantile plot of standardized residuals

1.0.1.3 Which transformation? Exponential growth/decay

Consider the context of the data
Is exponential growth/decay expected?

  • \(\displaystyle Y = \beta_0\; e^{\beta_1 x}\; e^{\epsilon}\)

  • Non-linear model and non-constant spread (multiplicative errors)

  • Suggested transformation: \(\log(Y)\)

    • Take the logarithm of both sides of the equation to linearize
  • \(\log(Y) = \beta_0 + \beta_1 x + \epsilon\)
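
Writing out the log step makes the linearization explicit; the intercept of the linearized model is really \(\log(\beta_0)\), which we simply relabel as \(\beta_0\):

\[\log(Y) \;=\; \log\left(\beta_0\, e^{\beta_1 x}\, e^{\epsilon}\right) \;=\; \log(\beta_0) + \beta_1 x + \epsilon\]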


Here are simulated data from an exponential growth model (\(\beta_1 > 0\) for growth, otherwise decay)

Model: \(\displaystyle Y = \beta_0\; e^{\beta_1 x}\; e^{\epsilon}\)

b0 <- 2
b1 <- .3
sigma <- 0.3
x <- seq(0, 10, 0.1)
n <- length(x)
set.seed(12345)
y <- b0 * exp( b1*x + rnorm(n, m=0, s=sigma) )

p1 <- gf_point(y ~ x)
p2 <- gf_point(y ~ x) %>%
  gf_coefline(model = lm(y ~ x))

lmfit <- lm(y ~ x)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_qq(~ rstandard(lmfit)) %>%
  gf_labs(x = "standard normal", y = "standardized residuals") %>%
  gf_qqline(~ rstandard(lmfit))
grid.arrange(p1, p2, p3, p4, ncol=2)

Transformed data: Response is \(\log(y)\)

plota <- gf_histogram(~ y, color = "white", bins=15)
plotb <- gf_histogram(~ log(y), color = "white", bins=12)
grid.arrange(plota, plotb, ncol=1)

Model: \(\log(Y) = \beta_0 + \beta_1 x + \epsilon\)

p5 <- gf_point(log(y) ~ x)
p6 <- gf_point(log(y) ~ x) %>%
  gf_coefline(model = lm(log(y) ~ x))

lmfit <- lm(log(y) ~ x)
p7 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p8 <- gf_qq(~ rstandard(lmfit)) %>%
  gf_labs(x = "standard normal", y = "standardized residuals") %>%
  gf_qqline(~ rstandard(lmfit))
grid.arrange(p5, p6, p7, p8, ncol=2)

\(\log(Y)\) is a variance-stabilizing transformation
that will often “fix” non-normality at the same time!

1.0.1.4 Which transformation? Multiplicative effect of x (predictor)

Consider the context of the data
Is multiplicative growth/shrinkage expected?

  • \(\displaystyle Y = \beta_0 x^{\beta_1}e^{\epsilon}\)

  • Non-linear model and non-constant spread (multiplicative errors)

  • Suggested transformation: \(\log(Y)\) and \(\log(x)\)

    • Take the logarithm of both sides of the equation to linearize
  • \(\log(Y) = \beta_0 + \beta_1\log(x) + \epsilon\)


Here are simulated data from a multiplicative growth model (\(\beta_1 > 0\) for growth, otherwise decay)

Model: \(\displaystyle Y = \beta_0 x^{\beta_1}e^{\epsilon}\)

b0 <- 2
b1 <- 0.5
sigma <- 0.5
x <- seq(1, 10, 0.1)
n <- length(x)
set.seed(1234567)
y <- b0 * x^b1 * exp(rnorm(n, m=0, s=sigma) )
p1 <- gf_point(y ~ x)
p2 <- gf_point(y ~ x) %>%
  gf_coefline(model = lm(y ~ x))

lmfit <- lm(y ~ x)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (yhat)", y = "std residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_qq(~ rstandard(lmfit)) %>%
  gf_labs(x = "standard normal", y = "standardized residuals") %>%
  gf_qqline(~ rstandard(lmfit))
grid.arrange(p1, p2, p3, p4, ncol=2)

Transformed data: Response is \(\log(y)\), predictor is \(\log(x)\)

plota <- gf_histogram(~ y, color = "white", bins=15)
plotb <- gf_histogram(~ log(y), color = "white", bins=12)
grid.arrange(plota, plotb, ncol=1)

Model: \(\log(Y) = \beta_0 + \beta_1\log(x) + \epsilon\)

p5 <- gf_point(log(y) ~ log(x))
p6 <- gf_point(log(y) ~ log(x)) %>%
  gf_coefline(model = lm(log(y) ~ log(x)))

lmfit <- lm(log(y) ~ log(x))
p7 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p8 <- gf_qq(~ rstandard(lmfit)) %>%
  gf_labs(x = "standard normal", y = "standardized residuals") %>%
  gf_qqline(~ rstandard(lmfit))
grid.arrange(p5, p6, p7, p8, ncol=2)

1.0.1.5 Which transformation? Responses are proportions.

Consider the context of the data
Is the response a proportion (binomial model)?

  • \(E(Y\, |\, x) = \pi(x)\) = some function of \(x\) with range between 0 and 1

  • \(\mathrm{var}(Y\, |\, x) = \displaystyle{\frac{\pi(x)[1-\pi(x)]}{n}}\) …which depends on \(x\) and the mean \(E(Y\, |\, x)\)

  • restricted, non-normal response, non-constant spread (varies with the mean)

  • Suggested transformation: \(\;\displaystyle \log\left(\frac{Y}{1-Y}\right)\)


After transformation the model is \(\;\displaystyle{ \log\left(\frac{Y}{1-Y}\right) = \beta_0 + \beta_1 x + \epsilon }\)

So, the suggested model is actually: \(\displaystyle{Y = \frac{e^{\beta_0 + \beta_1 x + \epsilon}}{1 + e^{\beta_0 + \beta_1 x + \epsilon}}}\quad\) (after solving for \(Y\))
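
In base R, the logit and its inverse are available as qlogis() and plogis(), which is convenient both for transforming a proportion and for back-transforming fitted values. A quick check:

## qlogis(p) = log(p / (1 - p));  plogis(z) = exp(z) / (1 + exp(z))
p <- c(0.1, 0.5, 0.9)
z <- qlogis(p)            ## the logit transformation
plogis(z)                 ## the inverse logit recovers 0.1, 0.5, 0.9
all.equal(plogis(z), p)   ## TRUE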


Here are data simulated from the model.

Model: \(\displaystyle{Y = \frac{e^{\beta_0 + \beta_1 x + \epsilon}}{1 + e^{\beta_0 + \beta_1 x + \epsilon}}}\)

b0 <- 2
b1 <- 0.5
sigma <- 0.4
x <- seq(-13, 1, 0.2)
n <- length(x)
set.seed(123452)
e <- rnorm(n, m=0, s=sigma)   ## epsilon enters on the logit scale, matching the model above
y <- exp(b0 + b1*x + e) / (1 + exp(b0 + b1*x + e))

p1 <- gf_point(y ~ x)
p2 <- gf_point(y ~ x) %>%
  gf_coefline(model = lm(y ~ x))

lmfit <- lm(y ~ x)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (yhat)", y = "std residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_qq(~ rstandard(lmfit)) %>%
  gf_labs(x = "standard normal", y = "standardized residuals") %>%
  gf_qqline(~ rstandard(lmfit))
grid.arrange(p1, p2, p3, p4, ncol=2)

Transformed data: Response is \(\log(y / (1-y))\)

plota <- gf_histogram(~ y, color = "white", bins=15)
plotb <- gf_histogram(~ log(y / (1-y)), color = "white", bins=12)
grid.arrange(plota, plotb, ncol=1)

Model: \(\;\displaystyle{ \log\left(\frac{Y}{1-Y}\right) = \beta_0 + \beta_1 x + \epsilon }\)

p5 <- gf_point(log(y / (1-y)) ~ x)
p6 <- gf_point(log(y / (1-y)) ~ x) %>%
  gf_coefline(model = lm(log(y / (1-y)) ~ x))

lmfit <- lm(log(y / (1-y)) ~ x)
p7 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p8 <- gf_qq(~ rstandard(lmfit)) %>%
  gf_labs(x = "standard normal", y = "standardized residuals") %>%
  gf_qqline(~ rstandard(lmfit))
grid.arrange(p5, p6, p7, p8, ncol=2)

NOTE: This does not solve the fact that the residuals come from
a short-tailed (non-normal) distribution (because the response is bounded by 0 and 1).

There are better methods of modeling binomial responses (logistic regression: Chapter 12 and STAT 226)

1.0.1.6 Which transformation? Responses are counts of a rare event

Consider the context of the data
Is the response a count from a fairly rare event (Poisson model)?

  • \(E(Y\, |\, x) = \lambda(x)\) = some function of \(x\) (\(\lambda(x) > 0\))

  • \(\mathrm{var}(Y\, |\, x) = \lambda(x)\) …which depends on \(x\) and the mean \(E(Y\, |\, x)\)

  • right-skewed response and non-constant spread (increases as the mean increases)

  • Suggested model: \(\sqrt{Y} = \beta_0 + \beta_1 x + \epsilon\)

Here are data simulated from a Poisson event model.

b0 <- 2
b1 <- 3
sigma <- 2
x <- seq(1, 25, 0.25)
n <- length(x)
set.seed(1122333)
e <- exp(rnorm(n, m=0, s=sigma))
y <- rpois(n, lambda=b0 + b1*x + e)
x <- x[-c(1,2,4,5,7)]
y <- y[-c(1,2,4,5,7)]
p1 <- gf_point(y ~ x)
p2 <- gf_point(y ~ x) %>%
  gf_coefline(model = lm(y ~ x))

lmfit <- lm(y ~ x)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (yhat)", y = "std residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_qq(~ rstandard(lmfit)) %>%
  gf_labs(x = "standard normal", y = "standardized residuals") %>%
  gf_qqline(~ rstandard(lmfit))
grid.arrange(p1, p2, p3, p4, ncol=2)

Transformed data: Response is \(\sqrt{y}\)

plota <- gf_histogram(~ y, color = "white", bins=15)
plotb <- gf_histogram(~ sqrt(y), color = "white", bins=12)
grid.arrange(plota, plotb, ncol=1)

Model: \(\sqrt{Y} = \beta_0 + \beta_1 x + \epsilon\)

p5 <- gf_point(sqrt(y) ~ x)
p6 <- gf_point(sqrt(y) ~ x) %>%
  gf_coefline(model = lm(sqrt(y) ~ x))

lmfit <- lm(sqrt(y) ~ x)
p7 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p8 <- gf_qq(~ rstandard(lmfit)) %>%
  gf_labs(x = "standard normal", y = "standardized residuals") %>%
  gf_qqline(~ rstandard(lmfit))
grid.arrange(p5, p6, p7, p8, ncol=2)

NOTE: This does not solve the fact that the residuals come from
a right-skewed (non-normal) distribution.

There are better methods of modeling Poisson responses (Poisson regression: Chapter 13 and STAT 226)

1.0.1.7 Which transformation? Try “ladder of transformation”

\(\lambda\): \(\quad -2\) \(\quad -1\) \(\quad -1/2\) \(\quad 0\;(\log)\) \(\quad 1/2\) \(\quad 1\) \(\quad 2\)
\(Y\): \(\quad \frac{1}{Y^2}\) \(\quad \frac{1}{Y}\) \(\quad \frac{1}{\sqrt{Y}}\) \(\quad \log(Y)\) \(\quad \sqrt{Y}\) \(\quad Y\) \(\quad Y^2\)
\(x\): \(\quad \frac{1}{x^2}\) \(\quad \frac{1}{x}\) \(\quad \frac{1}{\sqrt{x}}\) \(\quad \log(x)\) \(\quad \sqrt{x}\) \(\quad x\) \(\quad x^2\)
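
A quick way to walk the ladder for the response is to loop over the candidate exponents and compare the fits. A minimal sketch, assuming a data frame dat with a positive response y and a predictor x (placeholder names). Keep in mind that \(R^2\) values computed on different response scales are not directly comparable, so the residual plots are the better guide:

## Try each rung of the ladder for the response (lambda = 0 means log).
lambdas <- c(-2, -1, -0.5, 0, 0.5, 1, 2)
for (lam in lambdas) {
  dat$y_lam <- if (lam == 0) log(dat$y) else dat$y^lam
  fit <- lm(y_lam ~ x, data = dat)
  cat("lambda =", lam, "  R-squared =", round(summary(fit)$r.squared, 3), "\n")
  ## also inspect: gf_point(rstandard(fit) ~ fitted(fit))
}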

1.0.1.8 Which transformation? Let “boxcox” suggest

An algorithm that suggests the “boxcox transformation”
tries to find an “optimal” transformation of \(Y\) of the form \(Y^{\lambda}\)

This method does not suggest transformations for the predictors, only the response.

If the method suggests \(\lambda = -1.78\), either use \(Y^{-1.78}\)
or a close, but easier-to-interpret transformation, like \(Y^{-2} = \displaystyle \frac{1}{Y^2}\)

If \(\lambda=0\) is suggested by the boxcox method,
then use the \(\log(Y)\) transformation.

1.0.1.9 What would boxcox method suggest for Poisson response?

Reminder: We used the square root transformation for Poisson responses: \(\sqrt{y}\).

In R:
library(MASS)
boxcox( lm(y ~ x, data = mydata) )

boxcox(lm(y ~ x))

boxcox(lm(y ~ x), lambda = seq(0, 1, 0.1))

Indeed \(\lambda = 0.5\) is within the interval of \(\lambda\) values suggested by the boxcox method.

1.0.2 Example 1: Breakdown vs. voltage level

library(Sleuth3) ## datasets from the statistical sleuth textbook
voltageData <- case0802
n <- dim(voltageData)[1]

In a quality control experiment, an electronic component was tested
for resilience against different voltage inputs.

\(n = 76\) components were tested

response = time until component breakdown
predictor = voltage level supplied to the component

1.0.2.1 Regression: raw (voltage) data

head(select(voltageData, -Group), 10)
Time Voltage
5.79 26
1579.52 26
2323.70 26
68.85 28
108.29 28
110.29 28
426.07 28
1067.60 28
7.74 30
17.05 30
lmfit <- lm(Time ~ Voltage, data = voltageData)
p1 <- gf_histogram(~ Time, data = voltageData, color = "white")
p2 <- gf_point(Time ~ Voltage, data = voltageData) %>%
  gf_labs(x = "Voltage (kV)", y = "breakdown time (min)") %>%
  gf_coefline(model = lmfit)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_qq(~ rstandard(lmfit)) %>%
  gf_labs(x = "standard normal", y = "standardized residuals") %>%
  gf_qqline(~ rstandard(lmfit))
grid.arrange(p1, p2, p3, p4, ncol=2)

Non-normal: The response variable is highly skewed (and residuals too).
Non-linear: The linear model is clearly not appropriate.

1.0.2.2 Ladder of transformation

So, what transformation should we use?
Let’s try the “ladder of transformation” and the boxcox method.

Ladder of transformation: \(Y^{\lambda}\)
Try \(\lambda = -2, -1.5, -1, -0.5, 0=\log, 0.5, 1, 1.5, 2\).

Well, we clearly don’t want to spread out the \(y\)-values even more than they already are.
So, let’s not even try \(\lambda = 1.5, 2\)

Let’s try \(\lambda = -1, 0.5\) and \(\log(Y)\) to start.

1.0.2.3 (1) transformation = 1 / Y

First, try \(\lambda = -1\): transformation is \(1/Y\)

voltageData <- mutate(voltageData,
                      reciprocalTime = 1 / Time,
                      sqrtTime = sqrt(Time),
                      logTime = log(Time))
lmfit <- lm(reciprocalTime ~ Voltage, data = voltageData)
p1 <- gf_histogram(~ reciprocalTime, data = voltageData, color = "white")
p2 <- gf_point(reciprocalTime ~ Voltage, data = voltageData) %>%
  gf_labs(x = "Voltage (kV)", y = "1 / (breakdown time)") %>%
  gf_coefline(model = lmfit)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_qq(~ rstandard(lmfit)) %>%
  gf_labs(x = "standard normal", y = "standardized residuals") %>%
  gf_qqline(~ rstandard(lmfit))
grid.arrange(p1, p2, p3, p4, ncol=2)

1.0.2.4 (2) transformation = sqrt(Y)

Now, \(\lambda = 0.5\): transformation is \(\sqrt{Y}\)

lmfit <- lm(sqrtTime ~ Voltage, data = voltageData)
p1 <- gf_histogram(~ sqrtTime, data = voltageData, color = "white")
p2 <- gf_point(sqrtTime ~ Voltage, data = voltageData) %>%
  gf_labs(x = "Voltage (kV)", y = "sqrt(breakdown time)") %>%
  gf_coefline(model = lmfit)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_qq(~ rstandard(lmfit)) %>%
  gf_labs(x = "standard normal", y = "standardized residuals") %>%
  gf_qqline(~ rstandard(lmfit))
grid.arrange(p1, p2, p3, p4, ncol=2)

1.0.2.5 (3) transformation = log(Y)

Finally, \(\lambda = 0\): transformation is \(\log(Y)\)

lmfit.log <- lm(logTime ~ Voltage, data = voltageData)
p1 <- gf_histogram(~ logTime, data = voltageData, color = "white")
p2 <- gf_point(logTime ~ Voltage, data = voltageData) %>%
  gf_labs(x = "Voltage (kV)", y = "log(breakdown time)") %>%
  gf_coefline(model = lmfit.log)
p3 <- gf_point(rstandard(lmfit.log) ~ fitted(lmfit.log)) %>%
  gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_qq(~ rstandard(lmfit.log)) %>%
  gf_labs(x = "standard normal", y = "standardized residuals") %>%
  gf_qqline(~ rstandard(lmfit.log))
grid.arrange(p1, p2, p3, p4, ncol=2)

The logarithm is by far the best transformation thus far,
although the residuals have some left skew.

A side-by-side comparison:

lmfit <- lm(Time ~ Voltage, data = voltageData) ## refit the raw-data model (lmfit currently holds the sqrt fit)
p1 <- gf_point(Time ~ Voltage, data = voltageData) %>%
  gf_labs(x = "Voltage (kV)", y = "breakdown time (min)") %>%
  gf_coefline(model = lmfit)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (yhat): raw data", y = "standardized residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2)

p2 <- gf_point(logTime ~ Voltage, data = voltageData) %>%
  gf_labs(x = "Voltage (kV)", y = "log(breakdown time)") %>%
  gf_coefline(model = lmfit.log)
p4 <- gf_point(rstandard(lmfit.log) ~ fitted(lmfit.log)) %>%
  gf_labs(x = "fitted values (yhat): transformed data", y = "standardized residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2)

grid.arrange(p1, p2, p3, p4, ncol=2)

1.0.2.6 What does boxcox method suggest?

In R:
library(MASS)
boxcox(lm(Time ~ Voltage, data = voltageData))

boxcox(lm(Time ~ Voltage, data = voltageData))

boxcox(lm(Time ~ Voltage, data = voltageData), lambda = seq(-0.25, 0.25, 0.1))

boxcox(lm(Time ~ Voltage, data = voltageData), lambda = seq(-0.25, 0.25, 0.1), plotit = FALSE)
$x
[1] -0.25 -0.15 -0.05  0.05  0.15  0.25

$y
[1] -210.7 -202.1 -197.7 -198.5 -205.4 -218.8

Since 0 is “not rejected”, the log transformation is also a suggestion from the boxcox method.

1.0.2.7 The fitted model with log(Y)

lmfit <- lm(logTime ~ Voltage, data = voltageData)
tidy(lmfit)
term estimate std.error statistic p.value
(Intercept) 18.9555 1.9100 9.924 0
Voltage -0.5074 0.0574 -8.840 0
glance(lmfit)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.5136 0.507 1.56 78.14 0 2 74
lmfit.coef <- tidy(lmfit)
b0 <- lmfit.coef$estimate[1]
b1 <- lmfit.coef$estimate[2]

\[\mathrm{E}(\log({\rm breakdown\ time})) = 18.96 + (-0.5074)\, {\rm voltage}\]

So, a one kV increase in voltage is associated with a \(-0.5074\) change in mean log(breakdown time) (units of log-minutes).

What can we say about the relationship between voltage and breakdown time
back in the original units (minutes)?

1.0.3 Units

1.0.4 Caution: E[log(Y)] < log[E(Y)]

b0 <- 2
b1 <- .3
sigma <- 0.3
x <- seq(0, 10, 0.1)
n <- length(x)
set.seed(12345)
y <- b0 * exp( b1*x + rnorm(n, m=0, s=sigma) )

p1 <- gf_point(y ~ x)
p2 <- gf_point(log(y) ~ x) %>%
  gf_coefline(model = lm(log(y) ~ x))
lmfit <- lm(log(y) ~ x)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_lm(y ~ x,
      formula = log(y) ~ x, backtrans = exp) %>%
  gf_point(y ~ x)

grid.arrange(p1, p2, p3, p4, ncol=2)

The mean of the log is less than the log of the mean
(the log is a concave, non-linear transformation, so this is an instance of Jensen’s inequality).

Try this out with some numbers:

a <- 2
b <- 3

Let \(a=2\) and \(b = 3\),

then the average of the logs \(= \displaystyle \frac{\log(a) + \log(b)}{2} = \frac{\log(2) + \log(3)}{2} = 0.8959\)

but the log of the average \(= \displaystyle \log\left[\frac{a + b}{2}\right] = \log\left[\frac{2 + 3}{2}\right] = 0.9163\)
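
The same check in R (using the a and b defined above):

mean(log(c(a, b)))   ## average of the logs: 0.8959
log(mean(c(a, b)))   ## log of the average:  0.9163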

So, while it is true that \(\log(Y) = \beta_0 + \beta_1 x + \epsilon\)
implies \(Y = \exp(\log(Y)) = \exp(\beta_0 + \beta_1 x + \epsilon)\)

it is NOT TRUE that \(\mathrm{E}(\log(Y)) = \beta_0 + \beta_1 x\)
implies \(\mathrm{E}(Y) = \exp(\beta_0 + \beta_1 x)\)

because \(\exp[\mathrm{E}(\log(Y))] < \exp[\log(\mathrm{E}(Y))] = \mathrm{E}(Y)\).

Thus, \(\widehat{y}= \exp(\widehat{\beta}_{0} + \widehat{\beta}_{1} x)\) tends to underestimate \(\mathrm{E}(Y)\).

1.0.4.1 Estimation in the original units: The median

Good news: \({\rm Median}(\log(Y)) = \log\left({\rm Median}(Y)\right)\)

The log function is monotone increasing,
so if half the original distribution of \(Y\) values is below \(M\),
then half of the distribution of \(\log(Y)\) values is below \(\log(M)\).

So, \(\widehat{y}= \exp(\widehat{\beta}_{0} + \widehat{\beta}_{1} x)\) is a good estimate for the median of \(y\)

…if the errors for the original model have at least a symmetric distribution: \(\log(Y) = \beta_0 + \beta_1 x + \epsilon\).

Then, \(E(\log(Y)) = \textrm{ Median}(\log(Y))\;\) by symmetry.

A 95% confidence interval for the median breakdown time (minutes)
can come from a 95% confidence interval for the mean of log(breakdown time).

range(~ Voltage, data=voltageData)
[1] 26 38
newVoltage <- data.frame(Voltage = 30)

Let’s find a confidence interval for the median breakdown time when voltage is \(30\) kV.

First, find a confidence interval for the mean log(breakdown) time
when voltage is \(30\) kV.

lmfit <- lm(logTime ~ Voltage, data = voltageData)
predict(lmfit, newVoltage, interval = "confidence")
    fit   lwr  upr
1 3.735 3.229 4.24

kable(predict(lmfit, newVoltage, interval = "confidence"))
fit lwr upr
3.735 3.229 4.24

Then, convert the estimate and interval
from units of log(breakdown time) to breakdown time.

\[e^{\log(y)} = y\]

kable(exp(predict(lmfit, newVoltage, interval = "confidence")))
fit lwr upr
41.87 25.26 69.4

1.0.5 Example 2: Elephant matings

elephantData <- case2201 ## Dataset from the Statistical Sleuth textbook
glimpse(elephantData)
Observations: 41
Variables: 2
$ Age     <int> 27, 28, 28, 28, 28, 29, 29, 29, 29, 29, 29, 30, 32, 33, 33, 3…
$ Matings <int> 0, 1, 1, 1, 3, 0, 0, 0, 2, 2, 2, 1, 2, 4, 3, 3, 3, 2, 1, 1, 2…
n <- dim(elephantData)[1]

In an 8-year study of the mating habits of male elephants,
data were collected on age and number of successful matings.

Larger elephants are typically more successful competing for female attention,
and elephants continue to grow throughout their lives.

“Thus, males most likely to pass their genes to future generations are those whose characteristics enable them to live long lives.” -J. H. Poole, “Mate Guarding, Reproductive Success, and Female Choice in African Elephants”, Animal Behaviour, 37 (1989): 842-49.

\(n = 41\) male elephants were in the study at the start

response = number of successful matings
predictor = age

1.0.5.1 Regression: raw (elephant mating) data

First 10 elephants in the sample:

head(elephantData, 10)
Age Matings
27 0
28 1
28 1
28 1
28 3
29 0
29 0
29 0
29 2
29 2
lmfit <- lm(Matings ~ Age, data = elephantData)
p1 <- gf_histogram(~ Matings, data = elephantData, color = "white")
p2 <- gf_point(Matings ~ Age, data = elephantData) %>%
  gf_labs(x = "Age of Male Elephants", y = "## Successful Matings") %>%
  gf_coefline(model = lmfit)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
  gf_lims(y = c( min(-2.2, min(rstandard(lmfit))), max(2.2, max(rstandard(lmfit))) )) %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_qq(~ rstandard(lmfit)) %>%
  gf_labs(x = "standard normal", y = "standardized residuals") %>%
  gf_qqline(~ rstandard(lmfit))
grid.arrange(p1, p2, p3, p4, ncol=2)

Do you see problems with any of the regression assumptions?

STOP








Variance non-constant: Increasing with \(x\) = Age.

The response variable is a count of a rare event (mostly 0, 1, 2, 3).
Perhaps a Poisson model for the response makes sense.

This implies the response transformation \(\sqrt{y}\) may be beneficial.

1.0.5.2 What transformation does the boxcox method suggest?

The first time I ran boxcox( lm(Matings ~ Age, data = elephantData) )
R complained that “response variable must be positive”

So, I ran this instead to handle the zeros: boxcox( lm( (Matings + 1) ~ Age, data = elephantData) )

lmfit <- lm(Matings ~ Age, data = elephantData)
boxcox( lm( (Matings + 1) ~ Age, data = elephantData) )

boxcox( lm( (Matings + 1) ~ Age, data = elephantData) ,lambda = seq(-0.25, 1, 0.1))

1.0.5.3 Transformation: \(\sqrt{Y}\)

elephantData <- mutate(elephantData, sqrtMatings = sqrt(Matings))
lmfit.sqrt <- lm(sqrtMatings ~ Age, data = elephantData)
p1 <- gf_histogram(~ sqrtMatings, data = elephantData, color = "white")
p2 <- gf_point(sqrtMatings ~ Age, data = elephantData) %>%
  gf_labs(x = "Age of Male Elephants", y = "sqrt(#) Successful Matings") %>%
  gf_coefline(model = lmfit.sqrt)
p3 <- gf_point(rstandard(lmfit.sqrt) ~ fitted(lmfit.sqrt)) %>%
  gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
   gf_lims(y = c( min(-2.2, min(rstandard(lmfit.sqrt))), max(2.2, max(rstandard(lmfit.sqrt))) )) %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_qq(~ rstandard(lmfit.sqrt)) %>%
  gf_labs(x = "standard normal", y = "standardized residuals") %>%
  gf_qqline(~ rstandard(lmfit.sqrt))
grid.arrange(p1, p2, p3, p4, ncol=2)

A side-by-side comparison:

p1 <- gf_point(Matings ~ Age, data = elephantData) %>%
  gf_labs(x = "Age of Male Elephants", y = "## Successful Matings") %>%
  gf_coefline(model = lmfit)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (yhat): raw data", y = "standardized residuals") %>%
  gf_lims(y = c( min(-2.2, min(rstandard(lmfit))), max(2.2, max(rstandard(lmfit))) )) %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2)

p2 <- gf_point(sqrtMatings ~ Age, data = elephantData) %>%
  gf_labs(x = "Age of Male Elephants", y = "sqrt(#) Successful Matings") %>%
  gf_coefline(model = lmfit.sqrt)
p4 <- gf_point(rstandard(lmfit.sqrt) ~ fitted(lmfit.sqrt)) %>%
  gf_labs(x = "fitted values (yhat): transformed data", y = "standardized residuals") %>%
   gf_lims(y = c( min(-2.2, min(rstandard(lmfit.sqrt))), max(2.2, max(rstandard(lmfit.sqrt))) )) %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2)

grid.arrange(p1, p2, p3, p4, ncol=2)

1.0.6 Example 3: Bacteria Decay

bacteriaData <- read.delim("http://statistics.uchicago.edu/~collins/data/RABE5/P168.txt")
n <- dim(bacteriaData)[1]

From the textbook, page 167:

The data are the number of surviving bacteria (in hundreds) after exposure to 200 kV X-rays for intervals of 6 minutes
(1 interval, 2 intervals, …, up to 15 intervals).

\(t\) = number of 6-minute intervals (\(t=1, 2, \ldots, 15\))

\(N_t\) = number of surviving bacteria after exposure \(t\)

Let’s start by looking at the data (\(n = 15\)).

kable(bacteriaData)
t N_t
1 355
2 211
3 197
4 166
5 142
6 106
7 104
8 60
9 56
10 38
11 36
12 32
13 21
14 19
15 15

1.0.6.1 Regression: raw data

First, consider a simple linear regression model

\[N_t = \beta_0 + \beta_1 t + \epsilon\]

lmfit <- lm(N_t ~ t, data = bacteriaData)
p1 <- gf_histogram(~ N_t, data = bacteriaData, color = "white")
p2 <- gf_point(N_t ~ t, data = bacteriaData) %>%
  gf_labs(x = "## of 6-minute X-rays", y = "## of bacteria remaining") %>%
  gf_coefline(model = lmfit)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
  gf_lims(y = c( min(-2.2, min(rstandard(lmfit))), max(2.2, max(rstandard(lmfit))) )) %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_qq(~ rstandard(lmfit)) %>%
  gf_labs(x = "standard normal", y = "standardized residuals") %>%
  gf_qqline(~ rstandard(lmfit))
grid.arrange(p1, p2, p3, p4, ncol=2)

The data do not exhibit a linear relationship.

According to the scientific theory, “there is a single, vital center in each bacterium,
and this must be hit by a ray before the bacteria is inactivated or killed.”

Further, it is thought that “if the theory is applicable, then \(N_t\) and \(t\) should be related by”

\[N_t = N_0\, e^{\beta_1 t},\quad t \ge 0\qquad \text{(deterministic, scientific model)}\]

\(N_0 =\) the number of bacteria at the start of the experiment
\(\beta_1 =\) the decay rate per 6-minute exposure

Note: It’s not clear to me why we were not provided \(N_0\) with the data.
Presumably, they counted the number of bacteria before starting the X-rays.

Oh well. We can estimate it from the data.

Taking the logarithm of both sides of the theoretical relationship,

\[\log(N_t) \;=\; \log(N_0) + \beta_1 t \;=\; \beta_0 + \beta_1 t\]

Of course, we expect variability around this theory, so we’ll try the model

\[\log(N_t) \;=\; \beta_0 + \beta_1 t + \epsilon_t\]

where \(\beta_0 = \log(N_0)\).

On the original scale, this means

\[N_t = e^{\beta_0 + \beta_1 t + \epsilon_t} \;=\; e^{\beta_0}\, e^{\beta_1 t}\, e^{\epsilon_t} \;=\; N_0\, e^{\beta_1 t}\,\epsilon^\prime_t ,\quad t \ge 0\]

This is not a linear model. Why not?






The coefficient \(\beta_1\) does not enter the model linearly.

What about the errors?

\(\epsilon_t = \log(\epsilon^\prime_t)\),
and if \(\epsilon_t \sim N(0, \sigma^2)\), then \(\epsilon^\prime_t = e^{\epsilon_t}\) must have a “log-normal” distribution.

In practice, we just check as usual that the residuals
from the linear model could plausibly have come from a normal distribution
(rather than checking the residuals from a non-linear model are “log-normal”)

1.0.6.2 Transformation: log(Y)

bacteriaData <- mutate(bacteriaData, logN_t = log(N_t))

lmfit.log <- lm(logN_t ~ t, data = bacteriaData)

p1 <- gf_histogram(~ logN_t, data = bacteriaData, color = "white")
p2 <- gf_point(logN_t ~ t, data = bacteriaData) %>%
  gf_labs(x = "# of 6-minute X-rays", y = "log(#) of bacteria remaining") %>%
  gf_coefline(model = lmfit.log)
p3 <- gf_point(rstandard(lmfit.log) ~ fitted(lmfit.log)) %>%
  gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
  gf_lims(y = c( min(-2.2, min(rstandard(lmfit.log))), max(2.2, max(rstandard(lmfit.log))) )) %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p4 <- gf_qq(~ rstandard(lmfit.log)) %>%
  gf_labs(x = "standard normal", y = "standardized residuals") %>%
  gf_qqline(~ rstandard(lmfit.log))
grid.arrange(p1, p2, p3, p4, ncol=2)

A side-by-side comparison:

p1 <- gf_point(N_t ~ t, data = bacteriaData) %>%
  gf_labs(x = "# of 6-minute X-rays", y = "# of bacteria remaining") %>%
  gf_coefline(model = lmfit)
p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
  gf_lims(y = c( min(-2.2, min(rstandard(lmfit))), max(2.2, max(rstandard(lmfit))) )) %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2)

p2 <- gf_point(logN_t ~ t, data = bacteriaData) %>%
  gf_labs(x = "# of 6-minute X-rays", y = "log(#) of bacteria remaining") %>%
  gf_coefline(model = lmfit.log)
p4 <- gf_point(rstandard(lmfit.log) ~ fitted(lmfit.log)) %>%
  gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
  gf_lims(y = c( min(-2.2, min(rstandard(lmfit.log))), max(2.2, max(rstandard(lmfit.log))) )) %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2)

grid.arrange(p1, p2, p3, p4, ncol=2)

And what is our estimate for \(N_0\)?
Whatever it is, it had better be larger than \(355\). Why?







kable(head(bacteriaData, 4))
t N_t logN_t
1 355 5.872
2 211 5.352
3 197 5.283
4 166 5.112

Here are the results of fitting the log-linear regression model to the data: \[\log(N_t) \;=\; \beta_0 + \beta_1 t + \epsilon_t\]

tidy(lmfit.log)
term estimate std.error statistic p.value
(Intercept) 5.9732 0.0598 99.92 0
t -0.2184 0.0066 -33.22 0
glance(lmfit.log)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.9884 0.9875 0.11 1104 0 2 13
b0 <- tidy(lmfit.log)$estimate[1]
b1 <- tidy(lmfit.log)$estimate[2]

So, \(\widehat{\beta}_{0} = 5.973\) and this estimates \(\log(N_0)\).

Thus, \(\widehat{N}_0 = \exp(\widehat{\beta}_{0}) = \exp(5.973) = 392.7\)

Reminder: \(\log()\) and \(\exp()\) are not linear functions.

In fact, \(\exp[\mathrm{E}(\widehat{\beta}_{0})] \le \mathrm{E}[\exp(\widehat{\beta}_{0})]\)

and \(\exp[\mathrm{E}(\widehat{\beta}_{0})] = \exp(\beta_0) = N_0\).

Thus, \(\mathrm{E}[\exp(\widehat{\beta}_{0})] \ge N_0\).

So even though \(\mathrm{E}(\widehat{\beta}_{0}) = \beta_0\) (unbiased), \(\mathrm{E}(\widehat{N}_0) = \mathrm{E}[\exp(\widehat{\beta}_{0})] \ge N_0\).

That is, the estimate \(\widehat{N}_0 = \exp(\widehat{\beta}_{0})\) is biased
(tends to overestimate \(N_0\)).

This bias does not affect estimating or testing hypotheses about the decay rate \(\beta_1\).
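
A small simulation can illustrate the overestimation. This is only a sketch under the fitted model’s assumptions, reusing the estimates above as if they were the true values (the object names here are mine, not from the textbook):

## Simulate many datasets from log(N_t) = b0 + b1*t + eps and average exp(b0hat).
set.seed(1)
t     <- 1:15
b0    <- 5.973      ## log(N_0), taken from the fit above
b1    <- -0.2184
sigma <- 0.11       ## residual SD from the fit above
b0hat <- replicate(5000, {
  logN <- b0 + b1 * t + rnorm(length(t), mean = 0, sd = sigma)
  coef(lm(logN ~ t))[1]
})
exp(b0)             ## the "true" N_0 used in the simulation (about 392.7)
mean(exp(b0hat))    ## slightly larger on average: exp(b0hat) tends to overestimate N_0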

1.0.7 Example 4: Brain Weight

brainData <- read.csv("http://statistics.uchicago.edu/~collins/data/s224/brain.csv")

## I'm going to remove the three observations from dinosaurs before analyzing
brainData <- filter(brainData, bodywt <= 9000)
glimpse(brainData)
Observations: 25
Variables: 3
$ name    <fct> Mountain beaver, Cow, Graywolf, Goat, Guineapig, Asian elepha…
$ brainwt <dbl> 8.1, 423.0, 119.5, 115.0, 5.5, 4603.0, 419.0, 655.0, 115.0, 2…
$ bodywt  <dbl> 1.350, 465.000, 36.330, 27.660, 1.040, 2547.000, 187.100, 521…
arrange(brainData, bodywt)
name brainwt bodywt
Mouse 0.4 0.023
Hamster 1.0 0.120
Mole 3.0 0.122
Rat 1.9 0.280
Guineapig 5.5 1.040
Mountain beaver 8.1 1.350
Rabbit 12.1 2.500
Cat 25.6 3.300
Rhesus monkey 179.0 6.800
Potar monkey 115.0 10.000
Goat 115.0 27.660
Kangaroo 56.0 35.000
Graywolf 119.5 36.330
Chimpanzee 440.0 52.160
Sheep 175.0 55.500
Human 1320.0 62.000
Jaguar 157.0 100.000
Donkey 419.0 187.100
Pig 180.0 192.000
Gorilla 406.0 207.000
Cow 423.0 465.000
Horse 655.0 521.000
Giraffe 680.0 529.000
Asian elephant 4603.0 2547.000
African elephant 5712.0 6654.000

Look at a scatterplot of the data.

gf_point(brainwt ~ bodywt, data = brainData)

What does the Box-Cox method suggest?

boxcox( lm(brainwt ~ bodywt, data = brainData) )

The transformation \(Y^{1/4}\) is suggested. Let’s try that.

gf_point(brainwt^(1/4) ~ bodywt, data = brainData)

Yikes! It seems the Box-Cox method is no help.
It only looks for optimal transformations of the response (not predictors).

Both body weight (\(x\)) and brain weight (\(y\)) have very right-skewed distributions.
I would probably try to symmetrize BOTH with a log transformation.

But, we will investigate other transformations for both variables too.

p1 <-  gf_point(1/brainwt ~ 1/bodywt, data = brainData)
p2 <-  gf_point(1/sqrt(brainwt) ~1/sqrt(bodywt), data = brainData)
p3 <-  gf_point(log(brainwt, 10) ~ log(bodywt, 10), data = brainData)
p4 <-  gf_point(sqrt(brainwt) ~ sqrt(bodywt), data = brainData)
grid.arrange(p1, p2, p3, p4, ncol=2)

Indeed, the \(\log(x)\) and \(\log(y)\) transformations linearize the relationship.

Do these transformations also symmetrize the \(x\) and \(y\) distributions?

p1 <-  gf_histogram(~ bodywt, data = brainData, bins=10, color = "white")
p2 <-  gf_histogram(~ log(bodywt, 10), data = brainData, bins=10, color = "white")
p3 <-  gf_histogram(~ brainwt, data = brainData, bins=10, color = "white")
p4 <-  gf_histogram(~ log(brainwt, 10), data = brainData, bins=10, color = "white")
grid.arrange(p1, p2, p3, p4, ncol=2)

That’s better. Much less skew.

Here is the model fit and residual diagnostics.

brainData <- mutate(brainData, 
                    logbodywt = log(bodywt, 10),
                    logbrainwt = log(brainwt, 10)
)
lmfit <- lm(logbrainwt ~ logbodywt, data = brainData)
brainData <- mutate(brainData,
                    yhat = fitted(lmfit),
                    r = rstandard(lmfit)
)

tidy(lmfit)
term estimate std.error statistic p.value
(Intercept) 0.9339 0.0871 10.72 0
logbodywt 0.7523 0.0457 16.45 0
glance(lmfit)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.9217 0.9183 0.3152 270.7 0 2 23
p1 <- gf_point(logbrainwt ~ logbodywt, data = brainData) %>%
  gf_labs(x = "log(body weight)", y = "log(brain weight)") %>%
  gf_coefline(model = lmfit)
p2 <- gf_histogram(~ r, data = brainData, bins=10, color = "white") %>%
  gf_labs(x = "standardized residuals")
p3 <- gf_point(r ~ yhat, data = brainData) %>%
  gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2) %>%
  gf_lims(y = c(-3, 3))
p4 <- gf_qq(~ r, data = brainData) %>%
  gf_labs(x = "standard normal", y = "standardized residuals") %>%
  gf_qqline(~ r, data = brainData)
grid.arrange(p1, p2, p3, p4, ncol=2)

Two animals are outliers. Out of curiosity, which animals?

tail(arrange(brainData, r), 7)
name brainwt bodywt logbodywt logbrainwt yhat r
19 Cat 25.6 3.300 0.5185 1.4082 1.3240 0.2748
20 Asian elephant 4603.0 2547.000 3.4060 3.6630 3.4961 0.5683
21 Mole 3.0 0.122 -0.9136 0.4771 0.2466 0.7906
22 Potar monkey 115.0 10.000 1.0000 2.0607 1.6862 1.2139
23 Chimpanzee 440.0 52.160 1.7173 2.6435 2.2258 1.3547
24 Rhesus monkey 179.0 6.800 0.8325 2.2529 1.5602 2.2484
25 Human 1320.0 62.000 1.7924 3.1206 2.2823 2.7210

1.0.7.1 Are outliers influential?

Look at Cook’s distance and hat values

brainData <- mutate(brainData,
                    h = hatvalues(lmfit),
                    C = cooks.distance(lmfit)
)

p <- 1 ## one predictor
n <- dim(brainData)[1] ## sample size

p1 <- gf_point(h ~ r, data = brainData) %>%
  gf_lims(x = c( min(-2.5, min(~r, data = brainData)), max(2.5, max(~r, data = brainData)))) %>%
  gf_labs(x = "standardized residuals", y = "hat value") %>%
  gf_hline(yintercept = 2 * (p+1) / n, linetype = "dashed") %>%
  gf_vline(xintercept = c(-2,0,2), linetype = "dashed")

p2 <-  gf_dotplot(~ C, data=brainData) %>%
  gf_lims(y = c(0, n)) %>%
  gf_labs(x = "Cook's distance")

grid.arrange(p1, p2, ncol=2)

How much influence might the various “interesting” observations have?

  • observations with large hat values = leverage (far from middle in x-direction)
  • observations with large standardized residuals (far from middle in y-direction)
  • observations with large Cook’s distance (a measure mixing both x- and y-directions)

p1 <- gf_point(logbrainwt ~ logbodywt, data = brainData, size = ~ h) %>%
  gf_labs(x = "log(body weight)", y = "log(brain weight)") %>%
  gf_coefline(model = lmfit)

p2 <- gf_point(logbrainwt ~ logbodywt, data = brainData, size = ~ C) %>%
  gf_labs(x = "log(body weight)", y = "log(brain weight)") %>%
  gf_coefline(model = lmfit)

grid.arrange(p1, p2, ncol=2)

1.0.8 Summary

1.0.9 When to do transformations?

  • If the predictor variables exhibit a non-linear relationship to the response variable
  • When non-constant variance is observed

1.0.9.1 Transformations for non-linearity

  • If you know (or can guess) the form of the true relationship between response and predictor,
    • let that guide your transformation.
  • Write down the equation for what you believe may be the true relationship,
    • then do algebra (on both sides of the equation) until you reach a linear form (see the worked example after this list).
    • Then define new variables \(Y^\prime\), \(X^\prime\) as needed.
  • Or, try different transformations of the predictor and/or response variables and see what works.

See Table 6.1 and Table 6.5 in the textbook for ideas.
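
For example, suppose theory suggests the reciprocal relationship \(Y = \displaystyle\frac{1}{\beta_0 + \beta_1 x}\) (plus noise). Taking reciprocals of both sides gives a linear form,

\[\frac{1}{Y} \;=\; \beta_0 + \beta_1 x,\]

so define \(Y^\prime = 1/Y\) and fit a simple linear regression of \(Y^\prime\) on \(x\). In R (with placeholder names dat, y, x), a minimal sketch is:

lm(I(1/y) ~ x, data = dat)   ## regress Y' = 1/Y on x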

1.0.9.2 All together now…

  • transformation may simplify analyses …if it can linearize the relationship
    • using linear regression is much simpler than non-linear regression
    • but transformations cannot always fix the problem (a non-linear model may be needed)
  • Common scenarios and their useful transformations
    • exponential growth/decay: \(\log(y)\)
    • multiplicative growth/decay: \(\log(x)\) and \(\log(y)\)
    • fan-shaped or horn-shaped residuals vs. fitted: try \(\log(y)\)
    • quadratic-shaped residuals vs. predictor: try adding \(x^2\) predictor
    • responses are from a binomial model: try \(\log[y\;/\;(1-y)]\)
    • responses are from a “counting rare events” (Poisson) model: try \(\sqrt{y}\)
  • trial-and-error
    • ladder of transformation \(y^\lambda\) for \(\lambda=-2,-1,-0.5,0.5,1,2\)
    • \(\lambda=0\) corresponds to \(\log(y)\)
    • can also try transforming \(x\), with the same or different values of \(\lambda\)
  • algorithmic optimization
    • Box-Cox transformation to find \(\lambda\)
    • try transformation \(y^{\lambda}\)
    • perhaps use a simpler-to-interpret value close to \(\lambda\)
      • (e.g. \(-2, -1, -0.5, -0.25, 0.25, 0.5, 2\))
    • if method suggests \(\lambda=0\), use \(\log(y)\) transformation
  • Beware of using expected values when transforming back to original units
    • If \(y^\prime=\log(y)\), then \(\widehat{y^\prime}\) estimates \(\mathrm{E}(\log(y))\), but \(e^{\widehat{y^\prime}}\) estimates the median of \(y\), not its expected value
    • This is true whenever the transformation of \(y\) is nonlinear but monotonic and the errors are symmetric (so that median = mean on the transformed scale)