\documentclass[landscape]{seminar} \input epsf.sty \usepackage{epsfig,graphicx,epstopdf,amsmath} \usepackage{fancyhdr} \usepackage{subfigure} \def\hb{\mbox{$\hat{\beta}$}} \def\P{\mbox{\textrm{P}}} \pagestyle{fancy} \lhead{}\chead{}\rhead{} \lfoot{Note: See Canvas for the examples completed with R} \cfoot{p.\ \thepage} \rfoot{} \renewcommand{\headrulewidth}{0pt} \slideframe{none} \slidesmag{4} % this will make the slides bigger or smaller. % For use with pdflatex -- as in TeXShop % Correct for \slidesmag{4} \AtBeginDocument{% \pdfpageheight=5in% \pdfpagewidth=6.5in% } % \slidesmag{2} % Correct for \slidesmag{2} % % A FIRST ATTEMPT -- NOT YET PERFECT! % \AtBeginDocument{% % \pdfpageheight=7.07in % \pdfpagewidth=9.19in % } % } \renewcommand{\familydefault}{\sfdefault} % Commands for slides \newcommand{\checkoff}{\hspace*{-.2in} {\color{red}$\surd$}\ } \newcommand{\brkslide}[1] { \end{slide} \begin{slide} \begin{center} {\bf #1} \end{center} } \newcommand{\brkslideitem}[1] { \end{itemize} \end{slide} \begin{slide} \begin{center} {\bf #1} \end{center} \begin{itemize} } \newcommand{\brkslidedescr}[1] { \end{description} \end{slide} \begin{slide} \begin{center} {\bf #1} \end{center} \begin{description} } %\input{preamb2ld} \input{preamb1} %\textheight 8.0in \textwidth 6.0in %\oddsidemargin 0.25in %\newcommand{\examp}{{\bf Example:} \ } \newcommand{\boxit}[1] { \mbox{ } \\*[-.5\baselineskip] \centerline{\fbox{ \parbox{4.25in}{#1} }} } \begin{document} \begin{slide} %\vspace{-1in} %\texttt{ } %\vspace{-1in} %\centerline{\includegraphics[height=4.1in]{graphs/regAll.pdf}} %\newpage \flushright{ PBHS 32400 / STAT 22400} % \medskip \begin{center} {\bf Regression Models for a Probability of Response Outcome } \end{center} \begin{itemize} \item So far, response variable $Y$ continuous and (approximately) normally distributed. \item What if $Y$ is a binary discrete variable, taking on values $0$ or $1$?\\ \vspace{0.25cm} For example, \begin{itemize} \item bank failure:\\ $Y = 1$ if failed (bankcruptcy), $Y=0$ if not\\ \vspace{0.2cm} \item disease event:\\ $Y = 1$ if with disease (case), 0 otherwise (unaffected/control). \end{itemize} \vspace{0.5cm} \item Each individual observation is like a ``trial" in a binary outcome experiment where $P(Y=1)$ is some value $p$. \newpage \begin{center} {\bf Regression Models for Probability of Response } \end{center} \medskip \item $Y$ is a Bernoulli random variable, with probability of ``success" $p$. \begin{center} \begin{tabular}{l|l} $Y$ & $P(Y)$ \\ \hline $0$ (failure) & $P(Y=0) = 1-p$ \\ $1$ (success) & $P(Y = 1) = p$ \end{tabular} \end{center} \item For each observtion, $Y_i \sim {\rm Bernoulli}(p)$, for $i=1, 2, \ldots, n$ \item $E(Y_i) = p = P(Y_i=1)$ \item $\var(Y_i) = p(1-p) = P(Y_i=1)P(Y_i=0)$ \item So, the variance of $Y$ depends on it's mean! \medskip \item Can we model this $\E(Y)$ as a linear function of predictors $X$? \\ Want to identify factors related to the probability of an event.\\ Want to predict the probability of an event. \newpage \begin{center} {\bf Regression Models for Probability of Response } \end{center} If we try to model $\E(Y)$ as a linear function of predictors $X$:\\ \[ \E(Y | X) = \P(Y = 1 | X) = ``p" = \beta_0 + \beta_1X_1 + \ldots + \beta_pX_p \] Several problems arise:\\ \begin{itemize} \item Model can readily estimate a probability below $0$ or above $1$ \item $Y$ not even approx.\ normally distributed (not even continuous) \item The variance of $Y$ (thus, the error term) is not constant over $X$ \\ since $\E(Y)=p$ and $\var(Y) = p(1-p)$. \\ So if $p$ varies conditional on $X$, so does the variance. \end{itemize} %\item Even though the non-constant variance problem could be addressed via weighted least squares, other problems cannot be addressed\\ \newpage \begin{center} {\bf Regression Models for Probability of Response } \end{center} \item The following transform does permit us to formulate $\P(Y=1)$ (actually, a function of it) as a linear function of predictor variables $X$ \[ \log_e \left[\frac{p}{1-p}\right] = \log_e \left[\frac{\P(Y=1|X)}{1-\P(Y=1|X)} \right]= \beta_0 + \beta_1X_1 + \ldots + \beta_pX_p . \] From here, after some algebra \[ \P(Y=1|X)= \frac{\exp(\beta_0 + \beta_1X_1 + \ldots + \beta_pX_p)}{1+\exp(\beta_0 + \beta_1X_1 + \ldots + \beta_pX_p)} \] For this model, observations with same values of the $X$s \\ have the same probability of ``success" ($\P(Y=1)$). \newpage \begin{center} {\bf Regression Models for Probability of Response } \end{center} \medskip The {\bf logit transform} of a Bernoulli($p$) response variable $Y$ is \[ \logit(p) =\log_e \left(\frac{p}{1-p}\right). \] {\bf Logistic regression} models predict the logit as a linear function of $X$s \item The name derives from the fact that we can write \[ p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} \] and the right-hand side is called a {\bf logistic function.} \newpage \item A logistic function is the inverse of a logit function. \item Logit function ``links" the mean of $Y$ to a linear function of $X$s. \item Logit function is also called the {\bf link function} for logistic regression. \item This method produces valid probability values for any values of $X$.\\ The estimates of the response are always values between $0$ and $1$. \newpage \begin{center} {\bf Analysis of Probability of Response Data } \end{center} \item Taking a step back, how do we typically summarize and analyze data with binary response? We should have analogous basic methods in logistic regression, just as in linear regression, where we can reproduce the two-sample t-test for comparison of means via SLR. \textbf{Stop and review this?} (Chap 1-2 notes: Section 3.3)\\ \medskip If the questions is whether the probability of a binary event differs between, say two groups, we have \begin{itemize} \item Test for difference of proportions between two independent groups - normal-based test, exact binomial test, etc \item Test for independence in a 2x2 contingency table - $\chi^2$ test \end{itemize} \item We also have associated effect measures - difference of proportions, ratio of proportions, and another summary that we will use here, the odds ratio \newpage \begin{center} {\bf Analysis of Probability of Response Data } \end{center} %{\bf Can we model this association via the odds ratio in a regression model?} \item Recall the basic set-up for the {\bf 2 $\times$ 2 Table}: This is for a retrospective, case-control epidemiological study of some exposure \begin{center} \begin{tabular}{r|c|c|c} & \multicolumn{2}{c|}{Exposure} & Total\\ Disease Status & yes & no \\ \hline case & a & b & a+b \\ non-case & c & d & c+d\\ \hline Total & a+c & b+d & a+b+c+d=N\\ \end{tabular} \end{center} \newpage \item To study the association between binary exposure and outcome based on a 2x2 table, generally we may consider the following measures: \begin{itemize} \item {\bf Difference of Proportions, Risk Difference} or {\bf Absolute Risk Difference}: \[ \mbox{risk difference} = \frac{a}{a + c} - \frac{b}{b + d} \] However, in a case-control study, the probability of being a case {\bf is not} the probability (risk) of disease in the population. The samples are collected based on outcome (diseased versus non-diseased). It is generally not possible to calculate and compare the absolute risk of disease (incidence) in a case-control study. Although, the probabilities would be valid in a cohort study. \newpage Why do a retrospective, case-control study vs. a cohort study? \vspace{1cm} In a cohort study, \item determine exposed or not status at the start. \item follow participants to see if get the disease (case) or not \item Could take years and years to get the data. \item ...and if the disease is rare, tough to get enough data on cases %\item {\bf Relative Risk}: %\[ %\mbox{relative risk} = \frac{\frac{a}{a + c}}{\frac{b}{b + d}} %\] %\[ \mbox{relative risk} = { \frac{a/ (a + c)}{b/(b + d)} } \] \newpage \item {\bf The Odds Ratio}: Ratio of odds of event ($Y=1$) (alternative way of expressing probabilities) in each exposure group \medskip Note that in the {\sl exposed} group \[ \mbox{odds} = \frac{p}{1-p } = \frac{a/(a+c)}{c/(a+c) } = a/c \] and in the {\sl unexposed} group \[ \mbox{odds} = \frac{p}{1-p } = \frac{b/(b+d)}{d/(b+d) } = b/d \] Then \[ \mbox{ ratio} = { \frac{a/ c}{b/d} } = \frac{ad}{bc} \] \medskip Is the {\bf odds ratio} or relative odds %\item and the {\bf Odds Ratio}: %Note that %\[ odds_e = \frac{p_e}{1-p_e } = \frac{a/(a+c)}{c/(a+c)} = a/c \] % and %\[ odds_u = \frac{p_u}{1-p_u } = \frac{b/(b+d)}{d/(b+d)} = b/d\] %then %\[\mbox{odds ratio} = \frac{a/ c}{b/d} = \frac{ad}{bc} \] \end{itemize} \newpage \begin{center} {\bf Odds, Odds Ratio} \end{center} \item What are odds? Just another metric for expressing probabilities, expressed as $$ \displaystyle O = \frac{\mbox{proportion of successes}}{\mbox{proportion of failures}}=\frac{p}{1-p} $$ \item Why Odds Ratio? As stated before, in some study designs (e.g., case-control studies), the proportion of $y=1$ in the data is not reflecting the real probability in the population. However, OR, as a relative measure, is still valid. %And for rare outcome, OR $\approx$ Relative Risk. {\bf Show that OR the same from either perspective.} \newpage \begin{center} \begin{tabular}{r|c|c|c} & \multicolumn{2}{c|}{Exposure} & Total\\ Disease Status & yes & no \\ \hline case & a & b & a+b \\ non-case & c & d & c+d\\ \hline Total & a+c & b+d & a+b+c+d=N\\ \end{tabular} \end{center} The odds of exposure among cases: \[ \mbox{odds(exposure)} = \frac{a/(a+b)}{b/(a+b) } = a/b \] The odds of exposure among non-cases: \[ \mbox{odds(exposure)} = \frac{c/(c+d)}{d/(c+d) } = c/d \] $\displaystyle{ \mbox{ ratio} = { \frac{a/ b}{c/d} } = \frac{ad}{bc} }$ = same as OR of case for exposed vs.\ not \newpage \begin{center} {\bf Logistic Regression \\ The Log Odds} \end{center} \item {\bf Example} We are interested in whether there is any difference in ICU mortality rates (outcome) for different initial hospital admission types (exposure, X variable). \vspace{0.5cm} \begin{center} \begin{tabular}{r|c|c|c} & \multicolumn{2}{c|}{Initial Hospital Admission} & \\ Status & Emergency & Elective & Total \\ \hline Death in ICU & a & b & a+b \\ Survived after ICU & c & d & c+d\\ \hline Total & a+c & b+d & a+b+c+d=N\\ \end{tabular} \end{center} \newpage \begin{center} {\bf Logistic Regression \\ The Log Odds} \end{center} \item {\bf Define:} \\ $p_0=$ risk of death in he population of elective admission patients \\ $p_1=$ risk of death in population of emergency admission patients \item The study considers a window of time and classifies ICU stays by response (death or not) and predictor (elective or emergency hospital admission). \item For the elective admission population, the ``odds'' of death is: \[ \frac{p_0}{1-p_0} = \frac{b/(b+d)}{d/(b+d)} = \frac{b}{d} \] $\displaystyle{ \log \left[ \frac{p_0}{1-p_0} \right] }$ is the ``log odds'' of death, or the ``logit'' of $p_0$. \newpage \begin{center} {\bf Logistic Regression \\ The Log Odds} \end{center} \smallskip \item For the emergency admission population, the ``odds'' of death is: \[ \frac{p_1}{1-p_1} = \frac{a/(a+c)}{c/(a+c)} = \frac{a}{c}\] and \[ \log \left\{ \frac{p_1}{1-p_1} \right\}\] \medskip is the ``log odds'' of death, or the ``logit'' of $p_1$. \newpage \begin{center} {\bf Logistic Regression \\ The Log Odds Ratio} \end{center} \vspace{1ex} \item Why log odds (logit)? Rather than model on either the probability or odds ratio scale (both of which have a restricted valid range), we need another metric that is equivalent but not constrained. $0 < p < 1\qquad 0 < OR < \infty \qquad -\infty < \log(\textrm{odds}) < \infty$ \item It turns out that we can take the difference of log odds (in the same way that we work with a difference of means in linear regression as a basic effect measure) \[ \log(p_1 / (1 - p_1) ) - \log( p_0 / (1-p_0)) \] but note that \[ \log(p_1 / (1 - p_1) ) - \log( p_0 / (1-p_0)) = \log \left\{ \frac{ p_1 / (1 - p_1) }{ p_0 / (1-p_0) } \right\} \] from basic rules of logarithms. \newpage \begin{center} {\bf Logistic Regression \\ The Log Odds} \end{center} Then \vspace{1ex} \[ \log \left\{ \frac{ p_1 / (1 - p_1) }{ p_0 / (1-p_0) } \right\} = \log(OR) \] \vspace{2ex} is the population ``log odds ratio'' of death, comparing the emergency admissions to the elective admissions. We label it $\beta_1$ and want our model to estimate it. \newpage \begin{center} {\bf Logistic Regression Model} \end{center} \item Define an indicator variable for admission type: \[ x = \mbox{\tt typ } = \left\{ \begin{array}{l} 1 \mbox{ if emergency} \\ 0 \mbox{ if elective} \end{array} \right. \] %\item We need a model that predicts probability. Note that we cannot simply use $ p = \beta_0 + %\beta_1 x$, since there is nothing in the linear expression that would constrain the model to predict %a valid probability value, that is, a value between $0$ and $1$ \item Algebraically, the log odds can be expressed as a linear function of covariates (log odds has range $(-\infty, \infty)$). Since the log odds is simply a function of the probability of interest (\pr(Y=1)), a model that predicts the log odds is equivalent to a model that predicts the probability. This is the logit transform introduced earlier. \item Here is the logistic regression model for the risk of death in ICU as a function of admission type: \[ \log \left\{ \frac{p}{1-p} \right\} = \beta_0 + \beta_1 x \] \newpage \begin{center} {\bf Logistic Regression Model (cont.)} \end{center} When $x=0$, \[ \log \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 (0) = \beta_0 = \log \left( \frac{p_0}{1-p_0} \right)\] When $x=1$, \[ \log \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 (1) = \beta_0 + \beta_1 = \log \left( \frac{p_1}{1-p_1} \right)\] Then, \begin{align*} \log(OR) &= \log \left( \frac{p_1/(1-p_1)}{p_0(1-p_0)} \right)\\ &= \log \left( \frac{p_1}{1-p_1} \right) - log \left( \frac{p_0}{1-p_0} \right) \\ &= \beta_1 \end{align*} $OR = e^{\beta_1}$ \newpage \begin{center} {\bf Logistic Regression Model (cont.)} \end{center} \item Recall that a linear regression model describes the mean of a random variable ($Y$) as a linear combination of some other variables ($X$'s) \vspace{1ex} \item A logistic regression model describes the log odds of a probability (that a binary variable $Y$ takes on the value $1$) as a linear combination of some other variables ($X$'s --covariates or predictors) \item Because error structure is different (Bernoulli vs. assumed normal), a different model estimation method from least squares is used - maximum likelihood estimation \newpage \begin{center} {\bf Logistic Regression - Estimation and Model Interpretation} \end{center} \item To fit this model in {\tt stata}: {\scriptsize \begin{verbatim} .* the data: (admission type: 0=elective, 1 = emergency .* vital status: 0 = alive, 1 = died) . tab sta typ Vital | Admission Type Status | 0 1 | Total -----------+----------------------+---------- 0 | 51 109 | 160 1 | 2 38 | 40 -----------+----------------------+---------- Total | 53 147 | 200 . ** Fit logistic regression model . logit sta typ Iteration 0: log likelihood = -100.08048 Iteration 1: log likelihood = -93.425171 . . . Iteration 4: log likelihood = -92.524467 Iteration 5: log likelihood = -92.524467 Logit estimates Number of obs = 200 LR chi2(1) = 15.11 Prob > chi2 = 0.0001 Log likelihood = -92.524467 Pseudo R2 = 0.0755 --------------------------------------------------------------- sta | Coef. Std.Err. z P>|z| [95% Conf. Int.] -------+------------------------------------------------------- typ | 2.1849 .74504 2.93 0.003 .72464 3.64519 _cons | -3.2386 .72083 -4.49 0.000 -4.6514 -1.82586 --------------------------------------------------------------- \end{verbatim}} \newpage \begin{center} {\bf Logistic Regression Model (cont.)} \end{center} \item In R, we use the GLM module (GLM stands for Generalized Linear Models). For the logit model, we specify the distribution family as binomial {\scriptsize \begin{verbatim} > logitm = glm(sta ~ typ, data=icu,family="binomial") > summary(logitm) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.2387 0.7206 -4.495 6.97e-06 *** typ 2.1849 0.7448 2.934 0.00335 ** --- Number of Fisher Scoring iterations: 5 \end{verbatim}} \medskip \item {\bf Note:} results ($\beta$'s) in Stata and R are given in log odds ratio metric \newpage \begin{center} {\bf Logistic Regression Model Interpretation (cont.)} \end{center} \item Interpretation of regression coefficients: Here, \[ \hat{\beta}_0 = -3.24 \mbox{\ \ and \ \ } \hat{\beta}_1 = 2.18\] The model $\displaystyle{\log \left\{ \frac{p}{1-p} \right\} = -3.24 + 2.18 (typ)}$ predicts the log odds (and odds) for each exposure group:\\ \medskip when X=0 (elective group): log odds or logit = $ -3.24$, then $$ \mbox{odds}_0 = e^{-3.24} = 0.039 $$ % $$ = \frac{2}{51}$$ %\vspace{3ex} when X=1 (emergency group): $$ \mbox{odds}_1 = e^{-3.24 + 2.18} = 0.346 $$ \vspace{0.1cm} $\displaystyle{ \textrm{OR} = \textrm{odds ratio} = \frac{\mbox{odds}_1}{\mbox{odds}_0} = e^{2.18} = 8.89 = e^{\widehat{\beta}_1} }$ \newpage \begin{center} {\bf Logistic Regression Model Interpretation (cont.)} \end{center} \item[-] Back to the data for a minute {\scriptsize \begin{verbatim} . tab sta typ Vital | Admission Type Status | 0 1 | Total -----------+----------------------+---------- 0 | 51 109 | 160 1 | 2 38 | 40 -----------+----------------------+---------- Total | 53 147 | 200 \end{verbatim}} \item[-] {\bf From this table, note that odds are} elective: 2/51 = 0.039 emergency 38/109 = 0.346 \newpage \begin{center} {\bf Logistic Regression - Estimation and Model Interpretation} \end{center} \item Then, from the table $ \mbox{OR} = \frac{51 \times 38}{2 \times 109} = 8.89 $ \item From the model, the ratio of predicted odds (i.e odds ratio) is $$ \frac{0.346}{0.039} = 8.89 = \mbox{OR}$$ \item But also note that the difference of log odds $\equiv$ log odds ratio is {\bf directly given} by the model ($\beta_1$). An estimate of the odds ratio is therefore: \[\widehat{OR} = e^{\hat{\beta}_1} = e^{2.18} = 8.89 \] {\bf So, the $\beta$ coefficient for {\tt typ} gives the log odds ratio for type of admission.} $e^{\hat \beta}$ gives the OR estimate. \newpage \begin{center} {\bf Logistic Regression - Estimation and Model Interpretation} \end{center} \item The model was intended to predict probabilities. Does it do this? Use the logistic function to obtain $$ \Pr(Y = 1 | X=0) = \frac{e^{-3.24}}{1 + e^{-3.24}} = 0.0377 = 2/53$$ $$ \Pr(Y = 1 | X=1) = \frac{e^{-3.24+2.18}}{1 + e^{-3.24+2.18}} = 0.2573 = 38/147$$ \bigskip check these probabilities against the 2x2 table above. The model predicts $\P(Y=1)$ by admission type, equaling the proportion of deaths in each group. \item If both OR and probabilities are valid, why would we still care about the probability while we have odds ratio?\\ \item OR is on a relative scale while the probability tells you how large the actual risk is. There are certain cases when the probability is very very low but the OR is high. Although it is concerning that OR is high, but since the probability is so small. People may still be willing to take the risk. \newpage \begin{center} {\bf Model Interpretation (cont.)} \end{center} \item The {\bf odds ratio} calculation is directly given in {\tt stata} using the {\tt , or} option: {\scriptsize \begin{verbatim} . logit sta typ , or . . . Logit estimates Number of obs = 200 LR chi2(1) = 15.11 Prob > chi2 = 0.0001 Log likelihood = -92.524467 Pseudo R2 = 0.0755 ---------------------------------------------------- sta | Odds Ratio Std.Err. z P>|z| [95% Conf.Int.] ----+----------------------------------------------- typ | 8.8899 6.6234 2.93 0.003 2.0640 38.2899 ---------------------------------------------------- \end{verbatim}} *note: unless predicting probabilities, we don't need $\beta_0$. \\ It might not be interpretable in some study designs. \newpage \begin{center} {\bf Logistic Regression Model Interpretation (cont.)} \end{center} \item Test of the hypothesis that ICU death is associated with type of hospital admission: \[ H_0: OR = e^{\beta_1} = 1 \] is same as \[ H_0: \log( OR ) = \beta_1 = 0 \] The test statistic reported in the {\tt stata} output is : \[ Z = \frac{\hat{\beta}_1}{\hat{\se}( \hat{\beta}_1 )} = \frac{2.185}{ .745 } = 2.93 \] and the $\pval$ is \[ \pval = \pr \{ |Z| > 2.93 \} = 0.003 \] \newpage \begin{center} {\bf Model Interpretation (cont.)} \end{center} \medskip \item From the model estimation run, a confidence interval for $\beta_1$ is constructed as: \[ \left( \hat\beta_1- \textrm{CriticalValue}\times s.e.(\hat\beta_1) , \hat\beta_1 + \textrm{CriticalValue}\times s.e.(\hat\beta_1) \right) \] Exponentiating the boundaries of the CI gives a CI for the $OR$: \[ \left(\exp\left( \hat\beta_1- \textrm{CriticalValue}\times s.e.(\hat\beta_1) \right) , \exp\left(\hat\beta_1 + \textrm{CriticalValue}\times s.e.(\hat\beta_1) \right)\right) \] This is what {\tt stata} does via the {\tt logit . . . , or} command \newpage \begin{center} {\bf Logistic Regression Model Interpretation (cont.)} \end{center} \vspace{.5ex} \item Overall conclusion: ICU death is significantly and positively associated with emergency (versus elective) hospital admission ($\pval =0.003$). $\widehat{OR} = 8.89$, 95\% CI: $[2.06, 38.3]$ \item Note: test on $\beta_1$ in this model is equivalent to this test: \[ H_0: \mbox{ no association of row and column frequencies in 2x2 table} \] vs \[ H_A: \mbox{ association of row and column frequencies in 2x2 table} \] via the $\chi^2$ test. Advantage of logistic regression is that we have an effect measure estimate with associated CI \item Another advantage: Logistic regression also can be readily used with predictors that are not binary or categorical, where tables cannot be readily formed or would result in loss of information. \newpage \begin{center} {\bf Logistic Regression \\ Multiple and Continuous Covariates} \end{center} \begin{itemize} \item We can easily expand the analysis to also accommodate multiple predictors. Suppose that we were interested in the association between both age and admission type, and ICU mortality \item Then as before we have a (0,1) indicator for admission type \[ X_1 = \mbox{\tt typ } = \left\{ \begin{array}{l} 1 \mbox{ if emergency} \\ 0 \mbox{ if elective} \end{array} \right. \] and define \[ X_2 = \mbox{age (yrs)} \] and propose a logistic regression model for death as a function of these two covariates: \[ \log \left\{ \frac{p}{1-p} \right\} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \] \item What is the interpretation of $\beta_2$? \item Let's compare the log odds of death of a 50 year old to a 49 year old subject, both admitted on an emergency basis {\small \[ \log\left(\frac{P(Y=1|X_1=1,X_2=50)}{1-P(Y=1|X_1=1,X_2=50)}\right)=\beta_0+\beta_1+\beta_2\times 50\] \[ \log\left(\frac{P(Y=1|X_1=1,X_2=49)}{1-P(Y=1|X_1=1,X_2=49)}\right)=\beta_0+\beta_1+\beta_2\times 49\] \begin{eqnarray*} &&\log\left(\frac{P(Y=1|X_1=1,X_2=50)}{1-P(Y=1|X_1=1,X_2=50)}\right) - \log\left(\frac{P(Y=1|X_1=1,X_2=49)}{1-P(Y=1|X_1=1,X_2=49)}\right)\\ &&\log\left(\frac{P(Y=1|X_1=1,X_2=50)}{1-P(Y=1|X_1=1,X_2=50)}\big\slash \frac{P(Y=1|X_1=1,X_2=49)}{1-P(Y=1|X_1=1,X_2=49)}\right)\\ &=& \beta_2 \end{eqnarray*}} Therefore: $\beta_2$ is the log odds ratio for death for a 50 year old compared to a 49 year old, both of whom had emergency admissions to the hospital \newpage \begin{center} {\bf Logistic Regression \\ Multiple and Continuous Covariates} \end{center} \item Some important notes about $\beta_2$: \begin{itemize} \item As in linear regression, $\beta_2$ represents the increment in %expected log odds for a one unit change in $X$, holding other predictors constant. Here, $\beta_2$ is the log odds ratio for death, and $\exp(\beta_2)$ is the odds ratio for death comparing two subjects who differ in age at admission by one year {\bf and} who are of the same admission type (be it emergency or elective). \item This is what we mean by an {\bf adjusted effect}: $\beta_2$ is the log odds ratio and $\exp(\beta_2)$ is the odds ratio comparing two subjects who differ in age at admission by one year, {\bf adjusted for admission type}. \end{itemize} \item What is the interpretation of $\beta_1$? A: The log odds ratio (emergency vs. elective) for two subjects of a given age, and $e^{\beta_1}$ is the odds ratio. \newpage \begin{center} {\bf Logistic Regression \\ Multiple and Continuous Covariates} \end{center} \item This model in {\tt Stata} {\scriptsize \begin{verbatim} . ** Fit multiple logistic regression model . logit sta age typ Logit estimates Number of obs = 200 LR chi2(2) = 27.09 Prob > chi2 = 0.0000 Log likelihood = -86.537821 Pseudo R2 = 0.1353 --------------------------------------------------------------- sta | Coef. Std.Err. z P>|z| [95% Conf.Int.] ------+-------------------------------------------------------- age | .034016 .01069 3.18 0.001 .013055 .05497 typ | 2.45354 .75257 3.26 0.001 .978525 3.92854 _cons |-5.50876 1.03351 -5.33 0.000 -7.53440 -3.48311 --------------------------------------------------------------- \end{verbatim}} \newpage \begin{center} {\bf Logistic Regression \\ Multiple and Continuous Covariates} \end{center} Or, on the odds ratio scale \ldots {\scriptsize \begin{verbatim} . logit sta age typ , or Logit estimates Number of obs = 200 LR chi2(2) = 27.09 Prob > chi2 = 0.0000 Log likelihood = -86.53782 Pseudo R2 = 0.1353 ---------------------------------------------------------------- sta | Odds Ratio Std.Err. z P>|z| [95% Conf.Int.] ------+-------------------------------------------------------- age | 1.03460 .0110644 3.18 0.001 1.0131 1.0565 typ | 11.62939 8.751927 3.26 0.001 2.6605 50.8329 --------------------------------------------------------------- \end{verbatim}} \newpage %\begin{center} \bf The likelihood ratio (LR) test \end{center} %\item What is a likelihood? %\item We are familiar with the concept of probability density functions $f(y|,\beta_1,\ldots,\beta_p)$. The likelihood is a function of the parameters of a model given the data. %\[ %L(\beta_1,\ldots,\beta_p |y) = f(y|,\beta_1,\ldots,\beta_p) %\] %\item To compare two statistical models, %$H_0: \beta_1=\ldots=\beta_p=0$ versus $H_A:$ at least one $\beta_j$ is non-zero. %The likelihood ratio test %$\Lambda(x)= -2 \ln \left( \frac{L(\beta_1,\ldots,\beta_p \textrm{ under } H_0 |y)}%{L(\beta_1,\ldots,\beta_p \textrm{ under }H_A |y)} \right)$.\\ %Under the null hypothesis, the LR statistic follows a $\chi_{df}^2$ distribution with degree of freedom equalling the number of parameters ($p$). %\newpage \begin{center} {\bf Logistic Regression \\ Multiple and Continuous Covariates} \end{center} \medskip \item Note that admission type effect is even larger taking age into account (OR = 11.6). And age is a potential {\bf confounder} here. \item Age can also {\bf modify} the effect of admission type on risk of death. Older individuals are more likely to die under either admission type (since they are older), whereas among younger individuals, emergency admissions may be more strongly associated with death than elective admissions. At a given age, emergency admissions have much higher (11.6-fold) greater risk of death. \medskip \item The age effect is to increase odds of death by 1.035, or about 3.5\%, per year of age increase. \end{itemize} \newpage \begin{center} {\bf Logistic Regression - Model Significance and Goodness of Fit Measures} \end{center} \item For the model overall, a likelihood ratio test is performed, contrasting the model in question with a null model containing no predictors. This is analogous to the F-test in SLR/MLR. \item The test statistic $$ \Lambda = -2 (\mbox{log-likelihood}_{null} - \mbox{log-likelihood}_{full})$$ is $\chi^2_{df}$ where degrees of freedom df = number of predictors. \item This same type of test is used for contrasting two nested models, say dropping out 3 predictors out of 7 $$ LR\; Test = -2 (\mbox{log-likelihood}_{smaller \;model} - \mbox{log-likelihood}_{bigger \;model})$$ is $\chi^2_3$ (3 degrees of freedom) and tests whether the three parameters considered for dropping simultaneously have $\beta_j = 0$. \newpage \begin{center} {\bf Multiple Logistic Regression \\ Inference, uses, etc} \end{center} \vspace{1ex} From a multiple logistic regression analysis, we obtain several quantities: \begin{itemize} \item likelihood ratio (LR) test - overall test for any `significant' predictors $X$ of log odds ratio (analogous to the $F$-test in the linear model). \medskip \item We have $Z$--tests for individual $\beta$ coefficients ($H_0: \beta =0$) for the $X's$, which are in fact tests for corresponding ORs ($ H_0: \mbox{OR} = 1.0$). %\item No completely satisfactory $R^2$ type of statistic, but some similar statistics that measure the overall model fit have been derived, for example, the pseudo-$R^2$. Pseudo $R^2$ can be used to compare models, but its absolute scale is not very meaningful. It does not represent the proportion of explained variation by the model. %\medskip \item Pseudo-$R^2$ measures proportion reduction in log-likelihood over null model. This is a useful measure, but more like another F-test than a measure of model explanatory power. \newpage \begin{center} {\bf Multiple Logistic Regression \\ Inference, uses, etc} \end{center} \vspace{1ex} \item Predicted probabilities for individuals.\\ \[ \Pr(Y=1 | X )= \frac{\exp(\beta_0 + \beta_1X_1 + \ldots + \beta_qX_q)}{1+\exp(\beta_0 + \beta_1X_1 + \ldots + \beta_qX_q)} \] These can be used to develop classification algorithms, predict probability of response prospectively, etc. \vspace{0.5cm} {\bf Example:} For all data records, Y's are either 1 (event, death, etc) or 0 (non-event,etc), but the model predicts Prob(Y=1) for each record. Depending on cut-point for predicting/assigning a case to be an event, how many do you get right? Adjust the cut-point based on this to optimize classification. \medskip \end{itemize} \end{itemize} %\newpage %\begin{center} {\bf More on Logistic Regression} \end{center} %\begin{itemize} %\item Recall that logistic regression provides a way of relating predictor variables $X$ to binary (0,1) outcome variable $Y$. This outcome variable is an indicator for "yes/no", "event/non-event", etc %\item The model actually predicts $\Pr(Y=1 | X)$. This equals the expected value (mean) of $Y$, as in linear regression. How? %\begin{itemize} %\item Note that for a binary variable, the mean of $Y$ is $ \sum_{j=1}^{N_{i}} Y_i / N_i = p_i$, or the proportion of $Y$ being equal to 1. %\item For a binomial random variable (independent draws, probability $p$ of success for each draw), this proportion of $Y$ equalling 1 is the same as $\Pr(Y=1)$ %\end{itemize} %\item the model for $\Pr(Y=1 | X)$ is nonlinear. We model linearly via the logit transform: %\[ \log \left\{ \frac{p}{1-p} \right\} = \beta_0 + \beta_1 x \] %\newpage %\begin{center} {\bf Logistic Regression} \end{center} %\item[-] The model for $\Pr(Y=1 | X)$ is nonlinear. We model linearly via the logit transform: %\[ \log \left\{ \frac{p}{1-p} \right\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots \beta_q X_q \] %\item[-] $\beta_j$ in the model is the difference in log odds for a one unit increment in $X_j$. This is same as log odds ratio. Exponentiating gives the odds ratio, reflecting the relative odds of event (being a `1`) for a one-unit increment in $X$. Note that this is a multiplicative difference (rather than additive as in difference of means) %\item[-] Ex: $\beta_{age} = 0.336$. Then $\exp(0.336) = 1.4$. For a one-year increase in age, relative odds of event goes up by 1.4, or 40\%, adjusting for other predictors. %\newpage %\begin{center} {\bf Logistic Regression} \end{center} %\item[-] The model can produce a probability of event for each case in the dataset, based on the predictors $X$, via the equation %\[ %\Pr(Y=1 | X )= \frac{\exp(\beta_0 + \beta_1X_1 + \ldots + \beta_qX_q)}{1+\exp(\beta_0 + \beta_1X_1 + \ldots + \beta_qX_q)} %\] %\item With these values, one can assess how well individual cases may be classified with respect to outcome. \newpage \begin{center} {\bf Assumptions and Model Diagnostic} \end{center} \vspace{1ex} \begin{itemize} \item {\bf The assumptions:} No systematic bias in measurement assumption is still required. Uncorrelated observations assumption is still required. No strong multicollinearity is still required (and you can still use VIF to check for violation of assumption). Linearity (w.r.t link function) is still required. No influential observations is still required. Normality and constant variance assumptions are no longer required. The error term is Bernoulli distributed. \item Many of the approaches we discussed before for addressing assumption violations in MLR may still apply for logistic regression. Confounders should still be always included. Interaction terms should not be left alone without main effects. Categorical variables should be considered as a whole set. ...and so on. \newpage \begin{center} {\bf Logistic Regression - Diagnostics} \end{center} \item Several diagnostic quantities, aiming to detect outliers and influential points, are defined (C\&H 12.5). These borrow concepts from linear regression. \begin{itemize} \item {\sl Pearson residuals} and {\sl deviance residuals} plotted against the predicted probabilities or an index measure are similarly assessed for large deviations. \item A similar {\sl Leverage} measure can be derived to identify the effect of particular observations. \item The {\sl DBETA} measure determines the change in regression coefficients when observation $i$ is omitted, implicitly evaluating influence. %are defined as $$ \frac{y_i - \hat y_i }{\sqrt{\var(\hat y_i)}} $$ %where mean of Y is $\mu_i=n_i p_i $ and $V(y_i)=n_i pi_i(1-p_i)$ %$$ r_i=\frac{y_i-\hat{\mu}_i}{\sqrt{\hat{V}(y_i)}}=\frac{y_i-n_i\hat{\pi}_i}{\sqrt{n_i\hat{\pi}_i(1-\hat{\pi}_i)}} $$ \end{itemize} \newpage \begin{center} {\bf Logistic Regression - Model Significance and Goodness of Fit Measures} \end{center} \item[-] A commonly used goodness of fit measure in logistic regression is the Hosmer-Lemeshow test. The test groups the $n$ observations into groups (according to their estimated probability of event) and calculates the corresponding generalized Pearson $\chi^2$ statistic. Usually deciles (10 groups) are used. $$ H = \sum_{g=1}^G \frac{(O_g - E_g)^2 }{ n_g(\hat p_g (1 - \hat p_g)) }, $$ where $O_g$ is the number of events in group $g$ and $E_g = n_g \times \hat p_g$ is the expected number of events \item[-] In this type of test, a {\sl large} p-value ($>$ 0.05) indicates good correspondence between observed and predicted outcomes. \newpage \begin{center} {\bf Logistic Regression - Model Significance and Goodness of Fit Measures} \end{center} {\scriptsize \begin{verbatim} . logit sta typ age Iteration 0: log likelihood = -100.08048 Iteration 1: log likelihood = -87.895217 . . . Iteration 5: log likelihood = -86.537821 Logistic regression Number of obs = 200 LR chi2(2) = 27.09 Prob > chi2 = 0.0000 Log likelihood = -86.537821 Pseudo R2 = 0.1353 ------------------------------------------------------------------------------ sta | Coef. Std. Err. z P>|z| [95\% Conf. Interval] -------------+---------------------------------------------------------------- typ | 2.453535 .75257 3.26 0.001 .978525 3.928545 age | .0340162 .0106944 3.18 0.001 .0130556 .0549767 _cons | -5.508762 1.033511 -5.33 0.000 -7.534407 -3.483118 ------------------------------------------------------------------------------ . estat gof, group(10) table Logistic model for sta, goodness-of-fit test (Table collapsed on quantiles of estimated probabilities) +--------------------------------------------------------+ | Group | Prob | Obs_1 | Exp_1 | Obs_0 | Exp_0 | Total | |-------+--------+-------+-------+-------+-------+-------| | 1 | 0.0362 | 0 | 0.5 | 20 | 19.5 | 20 | | 2 | 0.0478 | 1 | 0.9 | 20 | 20.1 | 21 | | 3 | 0.0812 | 1 | 1.2 | 18 | 17.8 | 19 | | 4 | 0.1209 | 2 | 1.8 | 18 | 18.2 | 20 | | 5 | 0.1916 | 2 | 3.1 | 18 | 16.9 | 20 | |-------+--------+-------+-------+-------+-------+-------| | 6 | 0.2531 | 7 | 4.6 | 14 | 16.4 | 21 | | 7 | 0.2936 | 6 | 6.1 | 16 | 15.9 | 22 | | 8 | 0.3376 | 5 | 5.8 | 13 | 12.2 | 18 | | 9 | 0.3846 | 8 | 7.3 | 12 | 12.7 | 20 | | 10 | 0.5186 | 8 | 8.7 | 11 | 10.3 | 19 | +--------------------------------------------------------+ number of observations = 200 number of groups = 10 Hosmer-Lemeshow chi2(8) = 2.95 Prob > chi2 = 0.9372 \end{verbatim}} \newpage \begin{center} {\bf Logistic Regression - Goodness of Fit Measures} \end{center} \item This result looks very positive - good fit. However, Pseudo-$R^2 = 13\%$, and is not very high. \item Pseudo-$R^2$ measures proportion reduction in log-likelihood over null model. This is a useful measure, but more like another F-test than a measure of model explanatory power. \item The Hosmer-Lemeshow g.o.f. test is more valuable as a means to identify major systematic variation that is not explained. Large p-value does not assure that prediction will be highly accurate, etc. \item AIC and BIC measures can also be used to select among models. %\newpage %\begin{center} {\bf Logistic Regression - Predictive Ability} \end{center} %\item Since the model generates $\Pr(Y=1)$ for all cases, we can assess how well we might predict failure or success with the model, if $\Pr(Y=1)$ is meaningful. Specifically: %\begin{itemize} %\item We can simply assess how many correct and incorrect predictions the model made for a given $\Pr(Y=1)$ criterion, such as 0.5. Ordinary sensitivity/specificity type calculations can be used to assess the result. %\item We can try to select a probability value that optimizes the above parameters. Note that as in sensitivity/specificity problems, there is a trade-off between predicting cases and non-cases correctly. %\end{itemize} \newpage \begin{center} {\bf Summary -- Logistic Regression Models} \end{center} \medskip \item Logistic regression can be extended to a multinomial outcome variable, where $Y$ equals one of several mutually exclusive nominal categories (see C\&H). \item Similarly, ordinal logistic regression permits an ordered categorical response variable. \item These models fit into a general class of models in which a ``link" function of the mean of $Y$ is modeled by a linear combination of predictors. %%%%% END OF LOGISTIC REGRESSION \end{itemize} \end{slide} \end{document} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%