--- title: "STAT 224 Chapter 6" output: html_document: theme: cerulean css: styles.css code_folding: hide df_print: kable number_sections: yes self_contained: yes toc_depth: 3 toc_float: collapsed: false pdf_document: toc: yes params: dotabs: TRUE --- ```{r test-main, child = 'preamble.Rmd'} ``` # Chapter 6: Transformations `r if(params$dotabs) paste('{.tabset}')` ### Intro: Transformations of variables What do we do when the regression assumptions are violated? + non-linearity + non-normality + non-constant variance (heteroscedasticity) Transformation (essentially rescaling) variables is a useful tool. The linear regression model may become (more) appropriate. We might transform the response variable: + $\sqrt{Y} = \beta_0 + \beta_1 x + \epsilon$ + $\displaystyle{\frac{1}{Y}} = \beta_0 + \beta_1 x + \epsilon$ + $\log(Y) = \beta_0 + \beta_1 x + \epsilon$ + $\displaystyle{\log\left(\frac{Y}{1-Y}\right) = \beta_0 + \beta_1 x + \epsilon}\quad$ (really only makes sense when $Y$ is bounded: $\;0 < Y < 1$) We might transform a predictor variable: + $Y = \beta_0 + \beta_1 \log(x) + \epsilon$ + $Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$ ...or transform both variables: + $\displaystyle{\frac{1}{Y}} = \beta_0 + \beta_1 \displaystyle{\frac{1}{x}} + \epsilon$ + $\log(Y) = \beta_0 + \beta_1 \log(x) + \epsilon$ #### Why transform variables? + The linear regression model may become (more) appropriate. + Non-linear models more difficult to estimate and interpret + Non-constant variance difficult to summarize succinctly But cannot fix all problems. That is, a non-linear model may be needed. #### Which transformation? Scatterplots Look at lots of scatterplots to help decide + Response ($Y$) vs. predictor(s).$\quad$ For example, a scatterplot matrix, + Standarized residuals vs. fitted values ($\yhat$'s) + Standarized residuals vs. predictors ($x$-variables) + Normal quantile plot of standardized residuals #### Which transformation? Exponential growth/decay Consider the context of the data Is exponential growth/decay expected? + $\displaystyle Y = \beta_0\; e^{\beta_1 x}\; e^{\epsilon}$ + Non-linear model and non-constant spread (multiplicative errors) + Suggested transformation: $\log(Y)$ + Take the logarithm of both sides of the equation to linearize + $\log(Y) = \beta_0 + \beta_1 x + \epsilon$
Here are simulated data from an exponential growth model ($\beta_1 > 0$ for growth, otherwise decay) Model: $\displaystyle Y = \beta_0\; e^{\beta_1 x}\; e^{\epsilon}$ ```{r} b0 <- 2 b1 <- .3 sigma <- 0.3 x <- seq(0, 10, 0.1) n <- length(x) set.seed(12345) y <- b0 * exp( b1*x + rnorm(n, m=0, s=sigma) ) p1 <- gf_point(y ~ x) p2 <- gf_point(y ~ x) %>% gf_coefline(model = lm(y ~ x)) lmfit <- lm(y ~ x) p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>% gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>% gf_hline(yintercept = c(-2, 0, 2), linetype=2) p4 <- gf_qq(~ rstandard(lmfit)) %>% gf_labs(x = "standard normal", y = "standardized residuals") %>% gf_qqline(~ rstandard(lmfit)) grid.arrange(p1, p2, p3, p4, ncol=2) ``` Transformed data: Response is $\log(y)$ ```{r} plota <- gf_histogram(~ y, color = "white", bins=15) plotb <- gf_histogram(~ log(y), color = "white", bins=12) grid.arrange(plota, plotb, ncol=1) ``` Model: $\log(Y) = \beta_0 + \beta_1 x + \epsilon$ ```{r} p5 <- gf_point(log(y) ~ x) p6 <- gf_point(log(y) ~ x) %>% gf_coefline(model = lm(log(y) ~ x)) lmfit <- lm(log(y) ~ x) p7 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>% gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>% gf_hline(yintercept = c(-2, 0, 2), linetype=2) p8 <- gf_qq(~ rstandard(lmfit)) %>% gf_labs(x = "standard normal", y = "standardized residuals") %>% gf_qqline(~ rstandard(lmfit)) grid.arrange(p5, p6, p7, p8, ncol=2) ``` $\log(Y)$ is a **variance-stabilizing transformation** that will often "fix" non-normality at the same time! #### Which transformation? Multiplicative effect of x (predictor) Consider the context of the data Is multiplicative growth/shrinkage expected? + $\displaystyle Y = \beta_0 x^{\beta_1}e^{\epsilon}$ + Non-linear model and non-constant spread (multiplicative errors) + Suggested transformation: $\log(Y)$ and $\log(x)$ + Take the logarithm of both sides of the equation to linearize + $\log(Y) = \beta_0 + \beta_1\log(x) + \epsilon$
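The same bookkeeping applies to the multiplicative (power-law) model above: taking the logarithm of both sides gives

$$\log(Y) \;=\; \log\!\left(\beta_0\, x^{\beta_1}\, e^{\epsilon}\right) \;=\; \log(\beta_0) + \beta_1 \log(x) + \epsilon$$

with $\log(\beta_0)$ again relabeled as the intercept. On the log-log scale the slope acts roughly like an elasticity: a 1% increase in $x$ is associated with approximately a $\beta_1$% change in (the median of) $Y$.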
Here are simulated data from a multiplicative growth model ($\beta_1 > 0$ for growth, otherwise decay) Model: $\displaystyle Y = \beta_0 x^{\beta_1}e^{\epsilon}$ ```{r} b0 <- 2 b1 <- 0.5 sigma <- 0.5 x <- seq(1, 10, 0.1) n <- length(x) set.seed(1234567) y <- b0 * x^b1 * exp(rnorm(n, m=0, s=sigma) ) ``` ```{r} p1 <- gf_point(y ~ x) p2 <- gf_point(y ~ x) %>% gf_coefline(model = lm(y ~ x)) lmfit <- lm(y ~ x) p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>% gf_labs(x = "fitted values (yhat)", y = "std residuals") %>% gf_hline(yintercept = c(-2, 0, 2), linetype=2) p4 <- gf_qq(~ rstandard(lmfit)) %>% gf_labs(x = "standard normal", y = "standardized residuals") %>% gf_qqline(~ rstandard(lmfit)) grid.arrange(p1, p2, p3, p4, ncol=2) ``` Transformed data: Response is $\log(y)$, predictor is $\log(x)$ ```{r} plota <- gf_histogram(~ y, color = "white", bins=15) plotb <- gf_histogram(~ log(y), color = "white", bins=12) grid.arrange(plota, plotb, ncol=1) ``` Model: $\log(Y) = \beta_0 + \beta_1\log(x) + \epsilon$ ```{r} p5 <- gf_point(log(y) ~ log(x)) p6 <- gf_point(log(y) ~ log(x)) %>% gf_coefline(model = lm(log(y) ~ log(x))) lmfit <- lm(log(y) ~ log(x)) p7 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>% gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>% gf_hline(yintercept = c(-2, 0, 2), linetype=2) p8 <- gf_qq(~ rstandard(lmfit)) %>% gf_labs(x = "standard normal", y = "standardized residuals") %>% gf_qqline(~ rstandard(lmfit)) grid.arrange(p5, p6, p7, p8, ncol=2) ``` #### Which transformation? Responses are proportions. Consider the context of the data Is the response a proportion (binomial model)? + $E(Y\, |\, x) = \pi(x)$ = some function of $x$ with range between 0 and 1 + $\var(Y\, |\, x) = \displaystyle{\frac{\pi(x)[1-\pi(x)]}{n}}$ ...which depends on $x$ and the mean $E(Y\, |\, x)$ + restricted, non-normal response, non-constant spread (varies with the mean) + Suggested transformation: $\;\displaystyle \log\left(\frac{Y}{1-Y}\right)$
After transformation the model is $\;\displaystyle{ \log\left(\frac{Y}{1-Y}\right) = \beta_0 + \beta_1 x + \epsilon }$ So, the suggested model is actually: $\displaystyle{Y = \frac{e^{\beta_0 + \beta_1 x + \epsilon}}{1 + e^{\beta_0 + \beta_1 x + \epsilon}}}\quad$ (after solving for $Y$)
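For completeness, here is the "solving for $Y$" algebra. Writing $\eta = \beta_0 + \beta_1 x + \epsilon$ as shorthand (a symbol used only for this step),

$$\log\left(\frac{Y}{1-Y}\right) = \eta \quad\Longleftrightarrow\quad \frac{Y}{1-Y} = e^{\eta} \quad\Longleftrightarrow\quad Y = \frac{e^{\eta}}{1 + e^{\eta}}$$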
Here are data simulated from the model. Model: $\displaystyle{Y = \frac{e^{\beta_0 + \beta_1 x + \epsilon}}{1 + e^{\beta_0 + \beta_1 x + \epsilon}}}$ ```{r} b0 <- 2 b1 <- 0.5 sigma <- 0.4 x <- seq(-13, 1, 0.2) n <- length(x) set.seed(123452) e <- rnorm(n, m=0, s=sigma) ## epsilon enters the exponent directly, matching the model y <- exp(b0 + b1*x + e) / (1 + exp(b0 + b1*x + e)) p1 <- gf_point(y ~ x) p2 <- gf_point(y ~ x) %>% gf_coefline(model = lm(y ~ x)) lmfit <- lm(y ~ x) p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>% gf_labs(x = "fitted values (yhat)", y = "std residuals") %>% gf_hline(yintercept = c(-2, 0, 2), linetype=2) p4 <- gf_qq(~ rstandard(lmfit)) %>% gf_labs(x = "standard normal", y = "standardized residuals") %>% gf_qqline(~ rstandard(lmfit)) grid.arrange(p1, p2, p3, p4, ncol=2) ``` Transformed data: Response is $\log(y / (1-y))$ ```{r} plota <- gf_histogram(~ y, color = "white", bins=15) plotb <- gf_histogram(~ log(y / (1-y)), color = "white", bins=12) grid.arrange(plota, plotb, ncol=1) ``` Model: $\;\displaystyle{ \log\left(\frac{Y}{1-Y}\right) = \beta_0 + \beta_1 x + \epsilon }$ ```{r} p5 <- gf_point(log(y / (1-y)) ~ x) p6 <- gf_point(log(y / (1-y)) ~ x) %>% gf_coefline(model = lm(log(y / (1-y)) ~ x)) lmfit <- lm(log(y / (1-y)) ~ x) p7 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>% gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>% gf_hline(yintercept = c(-2, 0, 2), linetype=2) p8 <- gf_qq(~ rstandard(lmfit)) %>% gf_labs(x = "standard normal", y = "standardized residuals") %>% gf_qqline(~ rstandard(lmfit)) grid.arrange(p5, p6, p7, p8, ncol=2) ``` **NOTE:** This does not change the fact that the residuals come from a short-tailed (non-normal) distribution (because the response is bounded by 0 and 1). There are better methods of modeling binomial responses (logistic regression: Chapter 12 and STAT 226) #### Which transformation? Responses are counts of a rare event Consider the context of the data Is the response a count from a fairly rare event (Poisson model)? + $E(Y\, |\, x) = \lambda(x)$ = some function of $x$ ($\lambda(x) > 0$) + $\var(Y\, |\, x) = \lambda(x)$ ...which depends on $x$ and the mean $E(Y\, |\, x)$ + right-skewed response and non-constant spread (increases as the mean increases) + Suggested model: $\sqrt{Y} = \beta_0 + \beta_1 x + \epsilon$ Here are data simulated from a Poisson event model.
```{r} b0 <- 2 b1 <- 3 sigma <- 2 x <- seq(1, 25, 0.25) n <- length(x) #set.seed(1234555) set.seed(1122333) e <- exp(rnorm(n, m=0, s=sigma)) y <- rpois(n, lambda=b0 + b1*x + e) x <- x[-c(1,2,4,5,7)] y <- y[-c(1,2,4,5,7)] ``` ```{r} p1 <- gf_point(y ~ x) p2 <- gf_point(y ~ x) %>% gf_coefline(model = lm(y ~ x)) lmfit <- lm(y ~ x) p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>% gf_labs(x = "fitted values (yhat)", y = "std residuals") %>% gf_hline(yintercept = c(-2, 0, 2), linetype=2) p4 <- gf_qq(~ rstandard(lmfit)) %>% gf_labs(x = "standard normal", y = "standardized residuals") %>% gf_qqline(~ rstandard(lmfit)) grid.arrange(p1, p2, p3, p4, ncol=2) ``` Transformed data: Response is $\sqrt{y}$ ```{r} plota <- gf_histogram(~ y, color = "white", bins=15) plotb <- gf_histogram(~ sqrt(y), color = "white", bins=12) grid.arrange(plota, plotb, ncol=1) ``` Model: $\sqrt{Y} = \beta_0 + \beta_1 x + \epsilon$ ```{r} p5 <- gf_point(sqrt(y) ~ x) p6 <- gf_point(sqrt(y) ~ x) %>% gf_coefline(model = lm(sqrt(y) ~ x)) lmfit <- lm(sqrt(y) ~ x) p7 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>% gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>% gf_hline(yintercept = c(-2, 0, 2), linetype=2) p8 <- gf_qq(~ rstandard(lmfit)) %>% gf_labs(x = "standard normal", y = "standardized residuals") %>% gf_qqline(~ rstandard(lmfit)) grid.arrange(p5, p6, p7, p8, ncol=2) ``` **NOTE:** This does not change the fact that the residuals come from a right-skewed (non-normal) distribution. There are better methods of modeling Poisson responses (Poisson regression: Chapter 13 and STAT 226) #### Which transformation? Try "ladder of transformation" | $-2$ | $-1$ | $-1/2$ | $0$ ($\log$) | $1/2$ | $1$ | $2$ | |------|------|--------|--------|-------|-----|-----| | $\frac{1}{Y^2}$ | $\frac{1}{Y}$ | $\frac{1}{\sqrt{Y}}$ | $\log(Y)$ | $\sqrt{Y}$ | $Y$ | $Y^2$ | | $\frac{1}{x^2}$ | $\frac{1}{x}$ | $\frac{1}{\sqrt{x}}$ | $\log(x)$ | $\sqrt{x}$ | $x$ | $x^2$ | #### Which transformation? Let "boxcox" suggest The "boxcox" algorithm tries to find an "optimal" transformation of $Y$ of the form $Y^{\lambda}$. This method does not suggest transformations for the predictors, only the response. If the method suggests $\lambda = -1.78$, either use $Y^{-1.78}$ or a close, but easier-to-interpret transformation, like $Y^{-2} = \displaystyle \frac{1}{Y^2}$ If $\lambda=0$ is suggested by the boxcox method, then use the $\log(Y)$ transformation. #### What would the boxcox method suggest for a Poisson response? Reminder: We used the square root transformation for Poisson responses: $\sqrt{y}$. In R: `library(MASS)` `boxcox( lm(y ~ x, data = mydata) )` ```{r fig.height=3} boxcox(lm(y ~ x)) boxcox(lm(y ~ x), lambda = seq(0, 1, 0.1)) ``` Indeed, $\lambda = 0.5$ is within the interval of transformations suggested by the boxcox method. ### Example 1: Breakdown vs. voltage level ```{r} library(Sleuth3) ## datasets from the statistical sleuth textbook voltageData <- case0802 n <- dim(voltageData)[1] ``` In a quality control experiment, an electronic component was tested for resilience against different voltage inputs.
$n = `r n`$ components were tested response = time until component breakdown predictor = voltage level supplied to the component #### Regression: raw (voltage) data ```{r} head(select(voltageData, -Group), 10) lmfit <- lm(Time ~ Voltage, data = voltageData) p1 <- gf_histogram(~ Time, data = voltageData, color = "white") p2 <- gf_point(Time ~ Voltage, data = voltageData) %>% gf_labs(x = "Voltage (kV)", y = "breakdown time (min)") %>% gf_coefline(model = lmfit) p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>% gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>% gf_hline(yintercept = c(-2, 0, 2), linetype=2) p4 <- gf_qq(~ rstandard(lmfit)) %>% gf_labs(x = "standard normal", y = "standardized residuals") %>% gf_qqline(~ rstandard(lmfit)) grid.arrange(p1, p2, p3, p4, ncol=2) ``` Non-normal: The response variable is highly skewed (and residuals too). Non-linear: The linear model is clearly not appropriate. #### Ladder of transformation So, what transformation should we use? Let's try the "ladder of transformation" and the boxcox method. Ladder of transformation: $Y^{\lambda}$ Try $\lambda = -2, -1.5, -1, -0.5, 0=\log, 0.5, 1, 1.5, 2$. Well, we clearly don't want to spread out the $y$-values even more than they already are. So, let's not even try $\lambda = 1.5, 2$. Let's try $\lambda = -1$, $0.5$, and $\log(Y)$ to start. #### (1) transformation = 1 / Y First, try $\lambda = -1$: transformation is $1/Y$ ```{r} voltageData <- mutate(voltageData, reciprocalTime = 1 / Time, sqrtTime = sqrt(Time), logTime = log(Time)) lmfit <- lm(reciprocalTime ~ Voltage, data = voltageData) p1 <- gf_histogram(~ reciprocalTime, data = voltageData, color = "white") p2 <- gf_point(reciprocalTime ~ Voltage, data = voltageData) %>% gf_labs(x = "Voltage (kV)", y = "1 / (breakdown time)") %>% gf_coefline(model = lmfit) p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>% gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>% gf_hline(yintercept = c(-2, 0, 2), linetype=2) p4 <- gf_qq(~ rstandard(lmfit)) %>% gf_labs(x = "standard normal", y = "standardized residuals") %>% gf_qqline(~ rstandard(lmfit)) grid.arrange(p1, p2, p3, p4, ncol=2) ``` #### (2) transformation = sqrt(Y) Now, $\lambda = 0.5$: transformation is $\sqrt{Y}$ ```{r} lmfit <- lm(sqrtTime ~ Voltage, data = voltageData) p1 <- gf_histogram(~ sqrtTime, data = voltageData, color = "white") p2 <- gf_point(sqrtTime ~ Voltage, data = voltageData) %>% gf_labs(x = "Voltage (kV)", y = "sqrt(breakdown time)") %>% gf_coefline(model = lmfit) p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>% gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>% gf_hline(yintercept = c(-2, 0, 2), linetype=2) p4 <- gf_qq(~ rstandard(lmfit)) %>% gf_labs(x = "standard normal", y = "standardized residuals") %>% gf_qqline(~ rstandard(lmfit)) grid.arrange(p1, p2, p3, p4, ncol=2) ``` #### (3) transformation = log(Y) Finally, $\lambda = 0$: transformation is $\log(Y)$ ```{r} lmfit.log <- lm(logTime ~ Voltage, data = voltageData) p1 <- gf_histogram(~ logTime, data = voltageData, color = "white") p2 <- gf_point(logTime ~ Voltage, data = voltageData) %>% gf_labs(x = "Voltage (kV)", y = "log(breakdown time)") %>% gf_coefline(model = lmfit.log) p3 <- gf_point(rstandard(lmfit.log) ~ fitted(lmfit.log)) %>% gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>% gf_hline(yintercept = c(-2, 0, 2), linetype=2) p4 <- gf_qq(~ rstandard(lmfit.log)) %>% gf_labs(x = "standard normal", y = "standardized residuals") %>% gf_qqline(~ rstandard(lmfit.log))
grid.arrange(p1, p2, p3, p4, ncol=2) ``` The logarithm is by far the best transformation thus far, although the residuals have some left skew. A side-by-side comparison: ```{r} lmfit <- lm(Time ~ Voltage, data = voltageData) ## re-fit the raw-data model (lmfit last held the sqrt fit) p1 <- gf_point(Time ~ Voltage, data = voltageData) %>% gf_labs(x = "Voltage (kV)", y = "breakdown time (min)") %>% gf_coefline(model = lmfit) p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>% gf_labs(x = "fitted values (yhat): raw data", y = "standardized residuals") %>% gf_hline(yintercept = c(-2, 0, 2), linetype=2) p2 <- gf_point(logTime ~ Voltage, data = voltageData) %>% gf_labs(x = "Voltage (kV)", y = "log(breakdown time)") %>% gf_coefline(model = lmfit.log) p4 <- gf_point(rstandard(lmfit.log) ~ fitted(lmfit.log)) %>% gf_labs(x = "fitted values (yhat): transformed data", y = "standardized residuals") %>% gf_hline(yintercept = c(-2, 0, 2), linetype=2) grid.arrange(p1, p2, p3, p4, ncol=2) ``` #### What does boxcox method suggest? In R: `library(MASS)` `boxcox(lm(Time ~ Voltage, data = voltageData))` ```{r fig.height=3} boxcox(lm(Time ~ Voltage, data = voltageData)) boxcox(lm(Time ~ Voltage, data = voltageData), lambda = seq(-0.25, 0.25, 0.1)) boxcox(lm(Time ~ Voltage, data = voltageData), lambda = seq(-0.25, 0.25, 0.1), plotit = FALSE) ``` Since 0 is "not rejected", the log transformation is also a suggestion from the boxcox method. #### The fitted model with log(Y) ```{r} lmfit <- lm(logTime ~ Voltage, data = voltageData) tidy(lmfit) glance(lmfit)[-c(7:10)] ``` ```{r} lmfit.coef <- tidy(lmfit) b0 <- lmfit.coef$estimate[1] b1 <- lmfit.coef$estimate[2] ``` $$\E(\log({\rm breakdown\ time})) = `r b0` + (`r b1`)\, {\rm voltage}$$ So, a one kV increase in voltage is associated with a $`r b1`$ change in average log(breakdown time). What can we say about the relationship between voltage and breakdown time back in the original units (minutes)? ### Units ### Caution: E[log(Y)] < log[E(Y)] ```{r} b0 <- 2 b1 <- .3 sigma <- 0.3 x <- seq(0, 10, 0.1) n <- length(x) set.seed(12345) y <- b0 * exp( b1*x + rnorm(n, m=0, s=sigma) ) p1 <- gf_point(y ~ x) p2 <- gf_point(log(y) ~ x) %>% gf_coefline(model = lm(log(y) ~ x)) lmfit <- lm(log(y) ~ x) p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>% gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>% gf_hline(yintercept = c(-2, 0, 2), linetype=2) p4 <- gf_lm(y ~ x, formula = log(y) ~ x, backtrans = exp) %>% gf_point(y ~ x) grid.arrange(p1, p2, p3, p4, ncol=2) ``` The mean of the log is less than the log of the mean (the log is a non-linear transformation). Try this out with some numbers: ```{r} a <- 2 b <- 3 ``` Let $a=`r a`$ and $b = `r b`$, then the average of the logs $= \displaystyle \frac{\log(a) + \log(b)}{2} = \frac{\log(`r a`) + \log(`r b`)}{2} = `r (log(a) + log(b))/2`$ but the log of the average $= \displaystyle \log\left[\frac{a + b}{2}\right] = \log\left[\frac{`r a` + `r b`}{2}\right] = `r log((a+b)/2)`$ So, while it is true that $\log(Y) = \beta_0 + \beta_1 x + \epsilon$ implies $Y = \exp(\log(Y)) = \exp(\beta_0 + \beta_1 x + \epsilon)$ it is **NOT TRUE** that $\E(\log(Y)) = \beta_0 + \beta_1 x$ implies $\E(Y) = \exp(\beta_0 + \beta_1 x)$ because $\exp[\E(\log(Y))] < \exp[\log(\E(Y))] = \E(Y)$. Thus, $\yhat = \exp(\b0 + \b1 x)$ tends to underestimate $\E(Y)$. #### Estimation in the original units: The median Good news: ${\rm Median}(\log(Y)) = \log\left({\rm Median}(Y)\right)$ The log function is monotone increasing, so if half the original distribution of $Y$ values is below $M$, then half of the distribution of $\log(Y)$ values is below $\log(M)$.
So, $\yhat = \exp(\b0 + \b1 x)$ is a good estimate for the **median** of $y$ ...provided the errors in the original model, $\log(Y) = \beta_0 + \beta_1 x + \epsilon$, have a symmetric distribution. Then $E(\log(Y)) = \textrm{Median}(\log(Y))\;$ by symmetry. A 95% confidence interval for the median breakdown time (minutes) can come from a 95% confidence interval for the mean of log(breakdown time). ```{r} range(~ Voltage, data=voltageData) newVoltage <- data.frame(Voltage = 30) ``` Let's find a confidence interval for the median breakdown time when voltage is $`r as.numeric(newVoltage)`$ kV. First, find a confidence interval for the mean log(breakdown) time when voltage is $`r as.numeric(newVoltage)`$ kV. ```{r} lmfit <- lm(logTime ~ Voltage, data = voltageData) predict(lmfit, newVoltage, interval = "confidence") kable(predict(lmfit, newVoltage, interval = "confidence")) ``` Then, convert the estimate and interval from units of log(breakdown time) to breakdown time. $$e^{\log(y)} = y$$ ```{r} kable(exp(predict(lmfit, newVoltage, interval = "confidence"))) ``` ### Example 2: Elephant matings ```{r eval=FALSE} elephantData <- case2201 ## Dataset from the Statistical Sleuth textbook ``` ```{r} elephantData <- case2201 glimpse(elephantData) n <- dim(elephantData)[1] ``` In an 8-year study of the mating habits of male elephants, data were collected on age and number of successful matings. Larger elephants are typically more successful at competing for female attention, and elephants continue to grow throughout their lives. "Thus, males most likely to pass their genes to future generations are those whose characteristics enable them to live long lives." -J. H. Poole, "Mate Guarding, Reproductive Success, and Female Choice in African Elephants", __Animal Behaviour__, 37 (1989): 842-49. $n = `r n`$ male elephants were in the study at the start response = number of successful matings predictor = age #### Regression: raw (elephant mating) data First 10 elephants in the sample: ```{r} head(elephantData, 10) lmfit <- lm(Matings ~ Age, data = elephantData) p1 <- gf_histogram(~ Matings, data = elephantData, color = "white") p2 <- gf_point(Matings ~ Age, data = elephantData) %>% gf_labs(x = "Age of Male Elephants", y = "# Successful Matings") %>% gf_coefline(model = lmfit) p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>% gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>% gf_lims(y = c( min(-2.2, min(rstandard(lmfit))), max(2.2, max(rstandard(lmfit))) )) %>% gf_hline(yintercept = c(-2, 0, 2), linetype=2) p4 <- gf_qq(~ rstandard(lmfit)) %>% gf_labs(x = "standard normal", y = "standardized residuals") %>% gf_qqline(~ rstandard(lmfit)) grid.arrange(p1, p2, p3, p4, ncol=2) ``` Do you see problems with any of the regression specifications? **STOP**






Variance non-constant: increasing with $x$ = Age. The response variable is a count of a rare event (mostly 0, 1, 2, 3). Perhaps a Poisson model for the response makes sense, which implies the response transformation $\sqrt{y}$ may be beneficial. #### What transformation does the boxcox method suggest? The first time I ran `boxcox( lm(Matings ~ Age, data = elephantData) )` R complained that "response variable must be positive" So, I ran this instead to handle the zeros: `boxcox( lm( (Matings + 1) ~ Age, data = elephantData) )` ```{r, fig.height=3} lmfit <- lm(Matings ~ Age, data = elephantData) boxcox( lm( (Matings + 1) ~ Age, data = elephantData) ) boxcox( lm( (Matings + 1) ~ Age, data = elephantData), lambda = seq(-0.25, 1, 0.1)) ``` #### Transformation: $\sqrt{Y}$ ```{r} elephantData <- mutate(elephantData, sqrtMatings = sqrt(Matings)) lmfit.sqrt <- lm(sqrtMatings ~ Age, data = elephantData) p1 <- gf_histogram(~ sqrtMatings, data = elephantData, color = "white") p2 <- gf_point(sqrtMatings ~ Age, data = elephantData) %>% gf_labs(x = "Age of Male Elephants", y = "sqrt(#) Successful Matings") %>% gf_coefline(model = lmfit.sqrt) p3 <- gf_point(rstandard(lmfit.sqrt) ~ fitted(lmfit.sqrt)) %>% gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>% gf_lims(y = c( min(-2.2, min(rstandard(lmfit.sqrt))), max(2.2, max(rstandard(lmfit.sqrt))) )) %>% gf_hline(yintercept = c(-2, 0, 2), linetype=2) p4 <- gf_qq(~ rstandard(lmfit.sqrt)) %>% gf_labs(x = "standard normal", y = "standardized residuals") %>% gf_qqline(~ rstandard(lmfit.sqrt)) grid.arrange(p1, p2, p3, p4, ncol=2) ``` A side-by-side comparison: ```{r} p1 <- gf_point(Matings ~ Age, data = elephantData) %>% gf_labs(x = "Age of Male Elephants", y = "# Successful Matings") %>% gf_coefline(model = lmfit) p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>% gf_labs(x = "fitted values (yhat): raw data", y = "standardized residuals") %>% gf_lims(y = c( min(-2.2, min(rstandard(lmfit))), max(2.2, max(rstandard(lmfit))) )) %>% gf_hline(yintercept = c(-2, 0, 2), linetype=2) p2 <- gf_point(sqrtMatings ~ Age, data = elephantData) %>% gf_labs(x = "Age of Male Elephants", y = "sqrt(#) Successful Matings") %>% gf_coefline(model = lmfit.sqrt) p4 <- gf_point(rstandard(lmfit.sqrt) ~ fitted(lmfit.sqrt)) %>% gf_labs(x = "fitted values (yhat): transformed data", y = "standardized residuals") %>% gf_lims(y = c( min(-2.2, min(rstandard(lmfit.sqrt))), max(2.2, max(rstandard(lmfit.sqrt))) )) %>% gf_hline(yintercept = c(-2, 0, 2), linetype=2) grid.arrange(p1, p2, p3, p4, ncol=2) ``` ### Example 3: Bacteria Decay ```{r eval=FALSE} bacteriaData <- read.delim("http://statistics.uchicago.edu/~collins/data/RABE5/P168.txt") ``` ```{r} bacteriaData <- read.delim("http://statistics.uchicago.edu/~collins/data/RABE5/P168.txt") n <- dim(bacteriaData)[1] ``` From the textbook, page 167: The data are the number of surviving bacteria (in hundreds) after exposure to 200 kV X-rays for intervals of 6 minutes (1 interval, 2 intervals, ..., up to 15 intervals). $t$ = number of 6-minute intervals ($t=1, 2, \ldots, 15$) $N_t$ = number of surviving bacteria after exposure $t$ Let's start by looking at the data ($n = `r n`$).
```{r} kable(bacteriaData) ``` #### Regression: raw data First, consider a simple linear regression model $$N_t = \beta_0 + \beta_1 t + \epsilon$$ ```{r} lmfit <- lm(N_t ~ t, data = bacteriaData) p1 <- gf_histogram(~ N_t, data = bacteriaData, color = "white") p2 <- gf_point(N_t ~ t, data = bacteriaData) %>% gf_labs(x = "## of 6-minute X-rays", y = "## of bacteria remaining") %>% gf_coefline(model = lmfit) p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>% gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>% gf_lims(y = c( min(-2.2, min(rstandard(lmfit))), max(2.2, max(rstandard(lmfit))) )) %>% gf_hline(yintercept = c(-2, 0, 2), linetype=2) p4 <- gf_qq(~ rstandard(lmfit)) %>% gf_labs(x = "standard normal", y = "standardized residuals") %>% gf_qqline(~ rstandard(lmfit)) grid.arrange(p1, p2, p3, p4, ncol=2) ``` The data do not exhibit a linear relationship. According to the scientific theory, "there is a single, vital center in each bacterium, and this must be hit by a ray before the bacteria is inactivated or killed." Further, it is thought that "if the theory is applicable, then $N_t$ and $t$ should be related by" $$N_t = N_0\, e^{\beta_1 t},\quad t \ge 0\qquad \text{(deterministic, scientific model)}$$ $N_0 =$ the number of bacteria at the start of the experiment $\beta_1 =$ the decay rate per 6-minute exposure Note: It's not clear to me why we were not provided $N_0$ with the data. Presumably, they counted the number of bacteria before starting the X-rays. Oh well. We can estimate it from the data. Taking the logarithm of both sides of the theoretical relationship, $$\log(N_t) \;=\; \log(N_0) + \beta_1 t \;=\; \beta_0 + \beta_1 t$$ Of course, we expect variability around this theory, so we'll try the model $$\log(N_t) \;=\; \beta_0 + \beta_1 t + \epsilon_t$$ where $\beta_0 = \log(N_0)$. On the original scale, this means $$N_t = e^{\beta_0 + \beta_1 t + \epsilon_t} \;=\; e^{\beta_0}\, e^{\beta_1 t}\, e^{\epsilon_t} \;=\; N_0\, e^{\beta_1 t}\,\epsilon^\prime_t ,\quad t \ge 0$$ **This is not a linear model.** **Why not?**




The coefficient $\beta_1$ does not enter the model linearly. What about the errors? $\epsilon_t = \log(\epsilon^\prime_t)$ and if $\epsilon \sim N(0, \sigma^2)$, then $\epsilon^\prime_t$ must have a "log-normal" distribution. In practice, we just check as usual that the residuals from the linear model could plausibly have come from a normal distribution (rather than checking the residuals from a non-linear model are "log-normal") #### Transformation: log(Y) ```{r} bacteriaData <- mutate(bacteriaData, logN_t = log(N_t)) lmfit.log <- lm(logN_t ~ t, data = bacteriaData) p1 <- gf_histogram(~ logN_t, data = bacteriaData, color = "white") p2 <- gf_point(logN_t ~ t, data = bacteriaData) %>% gf_labs(x = "# of 6-minute X-rays", y = "log(#) of bacteria remaining") %>% gf_coefline(model = lmfit.log) p3 <- gf_point(rstandard(lmfit.log) ~ fitted(lmfit.log)) %>% gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>% gf_lims(y = c( min(-2.2, min(rstandard(lmfit.log))), max(2.2, max(rstandard(lmfit.log))) )) %>% gf_hline(yintercept = c(-2, 0, 2), linetype=2) p4 <- gf_qq(~ rstandard(lmfit.log)) %>% gf_labs(x = "standard normal", y = "standardized residuals") %>% gf_qqline(~ rstandard(lmfit.log)) grid.arrange(p1, p2, p3, p4, ncol=2) ``` A side-by-side comparison: ```{r} p1 <- gf_point(N_t ~ t, data = bacteriaData) %>% gf_labs(x = "# of 6-minute X-rays", y = "# of bacteria remaining") %>% gf_coefline(model = lmfit) p3 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>% gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>% gf_lims(y = c( min(-2.2, min(rstandard(lmfit))), max(2.2, max(rstandard(lmfit))) )) %>% gf_hline(yintercept = c(-2, 0, 2), linetype=2) p2 <- gf_point(logN_t ~ t, data = bacteriaData) %>% gf_labs(x = "# of 6-minute X-rays", y = "log(#) of bacteria remaining") %>% gf_coefline(model = lmfit.log) p4 <- gf_point(rstandard(lmfit.log) ~ fitted(lmfit.log)) %>% gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>% gf_lims(y = c( min(-2.2, min(rstandard(lmfit.log))), max(2.2, max(rstandard(lmfit.log))) )) %>% gf_hline(yintercept = c(-2, 0, 2), linetype=2) grid.arrange(p1, p2, p3, p4, ncol=2) ``` And what is our estimate for $N_0$? Whatever it is, it had better be larger than $`r max(~ N_t, data = bacteriaData)`$. **Why?**





```{r} kable(head(bacteriaData, 4)) ``` Here are the results of fitting the log-linear regression model to the data: $$\log(N_t) \;=\; \beta_0 + \beta_1 t + \epsilon_t$$ ```{r} tidy(lmfit.log) glance(lmfit.log)[-c(7:10)] ``` ```{r} b0 <- tidy(lmfit.log)$estimate[1] b1 <- tidy(lmfit.log)$estimate[2] ``` So, $\b0 = `r b0`$ and this estimates $\log(N_0)$. Thus, $\widehat{N}_0 = \exp(\b0) = \exp(`r b0`) = `r exp(b0)`$ Reminder: $\log()$ and $\exp()$ are not linear functions. In fact, $\exp[\E(\b0)] \le \E[\exp(\b0)]$ and $\exp[\E(\b0)] = \exp(\beta_0) = N_0$. Thus, $\E[\exp(\b0)] \ge N_0$. So even though $\E(\b0) = \beta_0$ (unbiased), $\E(\widehat{N}_0) = \E[\exp(\b0)] \ge N_0$. That is, the estimate $\widehat{N}_0 = \exp(\b0)$ is biased (tends to overestimate $N_0$). This bias does not affect estimating or testing claims about the decay rate: $\beta_1$. ### Example 4: Brain Weight ```{r eval=FALSE} brainData <- read.csv("http://statistics.uchicago.edu/~collins/data/s224/brain.csv") ``` ```{r} brainData <- read.csv("http://statistics.uchicago.edu/~collins/data/s224/brain.csv") ## I'm going to remove the three observations from dinosaurs before analyzing brainData <- filter(brainData, bodywt <= 9000) glimpse(brainData) arrange(brainData, bodywt) ``` Look at a scatterplot of the data. ```{r} gf_point(brainwt ~ bodywt, data = brainData) ``` What does the Box-Cox method suggest? ```{r} boxcox( lm(brainwt ~ bodywt, data = brainData) ) ``` The transformation $Y^{1/4}$ is suggested. Let's try that. ```{r} gf_point(brainwt^(1/4) ~ bodywt, data = brainData) ``` Yikes! It seems the Box-Cox method is no help. It only looks for optimal transformations of the response (not predictors). Both body weight ($x$) and brain weight ($y$) have very right-skewed distributions. I would probably try to symmetrize **BOTH** with a log transformation. But, we will investigate other transformations for both variables too. ```{r} p1 <- gf_point(1/brainwt ~ 1/bodywt, data = brainData) p2 <- gf_point(1/sqrt(brainwt) ~ 1/sqrt(bodywt), data = brainData) p3 <- gf_point(log(brainwt, 10) ~ log(bodywt, 10), data = brainData) p4 <- gf_point(sqrt(brainwt) ~ sqrt(bodywt), data = brainData) grid.arrange(p1, p2, p3, p4, ncol=2) ``` Indeed, the $\log(x)$ and $\log(y)$ transformations linearize the relationship. Do these transformations also symmetrize the $x$ and $y$ distributions? ```{r} p1 <- gf_histogram(~ bodywt, data = brainData, bins=10, color = "white") p2 <- gf_histogram(~ log(bodywt, 10), data = brainData, bins=10, color = "white") p3 <- gf_histogram(~ brainwt, data = brainData, bins=10, color = "white") p4 <- gf_histogram(~ log(brainwt, 10), data = brainData, bins=10, color = "white") grid.arrange(p1, p2, p3, p4, ncol=2) ``` That's better. Much less skew. Here is the model fit and residual diagnostics.
```{r} brainData <- mutate(brainData, logbodywt = log(bodywt, 10), logbrainwt = log(brainwt, 10) ) lmfit <- lm(logbrainwt ~ logbodywt, data = brainData) brainData <- mutate(brainData, yhat = fitted(lmfit), r = rstandard(lmfit) ) tidy(lmfit) glance(lmfit)[-c(7:10)] ``` ```{r} p1 <- gf_point(logbrainwt ~ logbodywt, data = brainData) %>% gf_labs(x = "log(body weight)", y = "log(brain weight)") %>% gf_coefline(model = lmfit) p2 <- gf_histogram(~ r, data = brainData, bins=10, color = "white") %>% gf_labs(x = "standardized residuals") p3 <- gf_point(r ~ yhat, data = brainData) %>% gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>% gf_hline(yintercept = c(-2, 0, 2), linetype=2) %>% gf_lims(y = c(-3, 3)) p4 <- gf_qq(~ r, data = brainData) %>% gf_labs(x = "standard normal", y = "standardized residuals") %>% gf_qqline(~ r, data = brainData) grid.arrange(p1, p2, p3, p4, ncol=2) ``` Two animals are outliers. Out of curiosity, which animals? ```{r} tail(arrange(brainData, r), 7) ``` #### Are outliers influential? Look at Cook's distance and hat values ```{r fig.height=3} brainData <- mutate(brainData, h = hatvalues(lmfit), C = cooks.distance(lmfit) ) p <- 1 ## one predictor n <- dim(brainData)[1] ## sample size p1 <- gf_point(h ~ r, data = brainData) %>% gf_lims(x = c( min(-2.5, min(~r, data = brainData)), max(2.5, max(~r, data = brainData)))) %>% gf_labs(x = "standardized residuals", y = "hat value") %>% gf_hline(yintercept = 2 * (p+1) / n, linetype = "dashed") %>% gf_vline(xintercept = c(-2,0,2), linetype = "dashed") p2 <- gf_dotplot(~ C, data=brainData) %>% gf_lims(y = c(0, n)) %>% gf_labs(x = "Cook's distance") grid.arrange(p1, p2, ncol=2) ``` How much influence might the various "interesting" observations have? + observations with large hat values = leverage (far from middle in x-direction) + observations with large standardized residuals (far from middle in y-direction) + observations with large Cook's distance (a measure mixing both x- and y-directions) ```{r fig.height=3} p1 <- gf_point(logbrainwt ~ logbodywt, data = brainData, size = ~ h) %>% gf_labs(x = "log(body weight)", y = "log(brain weight)") %>% gf_coefline(model = lmfit) p2 <- gf_point(logbrainwt ~ logbodywt, data = brainData, size = ~ C) %>% gf_labs(x = "log(body weight)", y = "log(brain weight)") %>% gf_coefline(model = lmfit) grid.arrange(p1, p2, ncol=2) ``` ### Summary ### When to do transformations? + If the predictor variables exhibit a non-linear relationship to the response variable + When non-constant variance is observed #### Transformations for non-linearity + If you know (or can guess) the form of the true relationship between response and predictor, + let that guide your transformation. + Write down the equation for what you believe may be the true relationship, + then do algebra (on both sides of the equation) until you reach a linear form. + Then define new variables $Y^\prime$, $X^\prime$ as needed. + or can try different transformations of the predictor and/or response variables and see what works. See Table 6.1 and Table 6.5 in the textbook for ideas. #### All together now... + transformation may simplify analyses ...if we can linearize + using linear regression is much simpler than non-linear regression + but, cannot always fix the problem (may need non-linear model) + Common scenarios and their useful transformations + exponential growth/decay: $\log(y)$ + multiplicative growth/decay: $\log(x)$ and $\log(y)$ + fan-shaped or horn-shaped residuals vs. fitted: try $\log(y)$ + quadratic-shaped residuals vs. predictor: try adding $x^2$ predictor + responses are from a binomial model: try $\log[y\;/\;(1-y)]$ + responses are from a "counting rare events" (Poisson) model: try $\sqrt{y}$ + trial-and-error + ladder of transformation $y^\lambda$ for $\lambda=-2,-1,-0.5,0.5,1,2$ + $\lambda=0$ corresponds to $\log(y)$ + can also try transforming $x$, with the same or different values of $\lambda$ + algorithmic optimization + Box-Cox transformation to find $\lambda$ + try transformation $y^{\lambda}$ + perhaps use a simpler-to-interpret value close to $\lambda$ + (e.g. $-2, -1, -0.5, -0.25, 0.25, 0.5, 2$) + if method suggests $\lambda=0$, use $\log(y)$ transformation + Beware of using expected values when transforming back to original units + If $y^\prime=\log(y)$, then $\widehat{y^\prime}$ estimates $E(\log(y))$, but $e^{\widehat{y^\prime}}$ estimates the *median*, not the expected value, of $y$ + This is true whenever the transformation of y is nonlinear but monotonic and when the errors are symmetric (median = mean) The short sketch below pulls these steps together for a generic model.
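Here is a minimal sketch of the chapter's workflow, assuming a data frame `mydata` with a positive response `y` and a predictor `x` (placeholder names, not one of the datasets above):

```{r eval=FALSE}
library(MASS)       ## boxcox()
library(ggformula)  ## gf_point(), gf_qq(), ...

## 1. Fit the untransformed model and inspect residual diagnostics
fit.raw <- lm(y ~ x, data = mydata)
gf_point(rstandard(fit.raw) ~ fitted(fit.raw))  ## fan/horn shape? curvature?
gf_qq(~ rstandard(fit.raw))                     ## non-normal residuals?

## 2. Let Box-Cox suggest a power transformation of the response
##    (the response must be positive; add a constant first if needed)
boxcox(fit.raw)

## 3. Re-fit with a nearby, easy-to-interpret transformation
##    (log(y) if lambda is near 0; consider transforming x as well)
fit.log <- lm(log(y) ~ x, data = mydata)
gf_point(rstandard(fit.log) ~ fitted(fit.log))
gf_qq(~ rstandard(fit.log))

## 4. Back-transform estimates to the original units;
##    with a log response, exp() of the interval estimates the MEDIAN of y
exp(predict(fit.log, newdata = data.frame(x = 10), interval = "confidence"))
```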