GLM Intro: Framingham Heart Study

The Framingham heart study is a cohort longitudinal study on residents of a small Massachusetts town (Framingham) that has been studying risk factors for cardiovascular disease (CVD) for over 70 years, since 1948. The data set we’ll use here contains data on 4699 subjects (men and women between the ages of 30 and 62). The subjects were given biennial exams for blood pressure, serum cholesterol, and relative weight. These data are the 30-year follow-up data. Major endpoints include the occurrence of coronary heart disease (CHD) and deaths from CHD, cerebrovasular accident (CVA or stroke), cancer, and other causes.

Analysis

Load in Framingham data:

fram <- read.table("http://www.math.montana.edu/shancock/data/Framingham.txt")
# Check data were read in correctly:
head(fram)

##   sex sbp dbp scl chdfate followup age  bmi month   id
## 1   1 120  80 267       1       18  55 25.0     8 2642
## 2   1 130  78 192       1       35  53 28.4    12 4627
## 3   1 144  90 207       1      109  61 25.1     8 2568
## 4   1  92  66 231       1      147  48 26.2    11 4192
## 5   1 162  98 271       1      169  39 28.4    11 3977
## 6   2 212 118 182       1      199  61 33.3     2  659

tail(fram)

##      sex sbp dbp scl chdfate followup age  bmi month   id
## 4694   2 136  92 197       0    11688  45 23.1     5 1976
## 4695   1 130  88 213       0    11688  47 28.4     9 3195
## 4696   2 112  68 252       0    11688  40 22.0     4 1674
## 4697   1 112  76 177       0    11688  40 23.5    12 4526
## 4698   2 125  75 244       0    11688  47 30.4    12 4536
## 4699   2 142  70 255       0    11688  44 23.3     4 1786

dim(fram)

## [1] 4699   10

names(fram)

##  [1] "sex"      "sbp"      "dbp"      "scl"      "chdfate"  "followup"
##  [7] "age"      "bmi"      "month"    "id"

We will only consider one potential predictor, systolic blood pressure, sbp on the response variable chdfate. We would like to investigate the research question: Does SBP influence probability of CHD?

Data Visualization and Descriptive Statistics

Univariate distributions:

fram %>% ggplot(aes(x = chdfate)) +
    geom_bar()

hist(fram$sbp)

Simple scatterplot and side-by-side boxplots (note response and explanatory variables reversed axes than typically used):

plot(chdfate~sbp, xlab="Systolic BP", ylab="Coronary Heart Disease Indicator", data=fram)

boxplot(sbp ~ chdfate, data=fram)

Since it is hard to visualize a response variable of 0’s and 1’s, let’s create subgroups for SBP and look at the proportion who experienced CHD in each group.

fram$sbpgrp <- with(fram,
               cut(sbp, breaks = c(min(sbp),126,146,166,max(sbp)),
            include.lowest=TRUE)
)
# Count number of CHD incidences within each group:
counts <- with(fram,
               table(list(chdfate,sbpgrp))
)
counts

##    .2
## .1  [80,126] (126,146] (146,166] (166,270]
##   0     1663       978       395       190
##   1      534       503       255       181

# Proportions in each group:
props <- counts[2,]/colSums(counts)
props

##  [80,126] (126,146] (146,166] (166,270] 
## 0.2430587 0.3396354 0.3923077 0.4878706

Now plot proportions and add to the original scatterplot.

# SBP Midpoints:
mids = c(103,136,156,218)
plot(chdfate~sbp, xlab="Systolic BP", ylab="Coronary Heart Disease Indicator", data=fram)
points(props~mids, xlab="Systolic BP", ylab="Proportion with CHD", col="tomato", pch = 16)
lines(lowess(fram$chdfate~fram$sbp), col="red")

Models

Linear probability model:

mod1 <- glm(chdfate ~ sbp, family=binomial(link="identity"), data=fram)
summary(mod1)

## 
## Call:
## glm(formula = chdfate ~ sbp, family = binomial(link = "identity"), 
##     data = fram)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8829  -0.8729  -0.7565   1.3610   1.9201  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.1889767  0.0386344  -4.891    1e-06 ***
## sbp          0.0037744  0.0002925  12.906   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5844.1  on 4698  degrees of freedom
## Residual deviance: 5689.0  on 4697  degrees of freedom
## AIC: 5693
## 
## Number of Fisher Scoring iterations: 4

Logistic regression model: The default link for binomial family is logit.

mod2 <- glm(chdfate~sbp, family=binomial, data=fram)
summary(mod2)

## 
## Call:
## glm(formula = chdfate ~ sbp, family = binomial, data = fram)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8320  -0.8668  -0.7634   1.3677   1.8368  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.008809   0.189815  -15.85   <2e-16 ***
## sbp          0.016593   0.001385   11.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5844.1  on 4698  degrees of freedom
## Residual deviance: 5695.7  on 4697  degrees of freedom
## AIC: 5699.7
## 
## Number of Fisher Scoring iterations: 4

Probit regression model:

mod3 <- glm(chdfate~sbp, family=binomial(link="probit"), data=fram)
summary(mod3)

## 
## Call:
## glm(formula = chdfate ~ sbp, family = binomial(link = "probit"), 
##     data = fram)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8426  -0.8680  -0.7618   1.3660   1.8506  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.8529512  0.1144066  -16.20   <2e-16 ***
## sbp          0.0102093  0.0008407   12.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5844.1  on 4698  degrees of freedom
## Residual deviance: 5694.3  on 4697  degrees of freedom
## AIC: 5698.3
## 
## Number of Fisher Scoring iterations: 4

Add the three fitted models to the scatterplot:

plot(chdfate~sbp, xlab="Systolic BP", ylab="Coronary Heart Disease Indicator", data=fram)
points(props~mids, xlab="Systolic BP", ylab="Proportion with CHD", col="tomato", pch = 16)
# Linear:
abline(mod1,lwd=2,col="darkgreen")
# Logistic:
curve(exp(mod2$coef[1]+mod2$coef[2]*x)/(1+exp(mod2$coef[1]+mod2$coef[2]*x)),
        add=T,lwd=2,lty=2,from=90, to=300,col="red")
# Probit:
curve(pnorm(mod3$coef[1]+mod3$coef[2]*x),
        add=T,lwd=2,lty=3,col="blue")
legend("topleft",c("Linear","Logit","Probit"),lty=c(1,2,3),
    col=c("darkgreen","red","blue"))

Let’s continue using mod2, the logistic regression model.

Interpreting coefficients

Interpretations are always on the odds scale in logistic regression.

exp(mod2$coef)

## (Intercept)         sbp 
##  0.04935042  1.01673181

(Intercept) The estimated odds of CHD for an individual with 0 mmHg systolic blood pressure are 0.0494 to 1.

sbp A one mmHg increase in systolic blood pressure is associated with a 1.673% increase in estimated odds of CHD.

Confidence intervals for coefficients

Confidence interval for slope - interpret on odds scale:

exp(confint(mod2))

## Waiting for profiling to be done...

##                  2.5 %     97.5 %
## (Intercept) 0.03394333 0.07144715
## sbp         1.01398776 1.01951091

A one mmHg increase in systolic blood pressure is associated with an increase of between 1.40% to 1.95% in the odds of CHD.

Odds ratios for an x change in SBP

Estimated odds ratio of CHD for SBP = 140 compared to SBP = 120:

exp(mod2$coef[2] * (140-120))

##      sbp 
## 1.393568

# same as
exp(mod2$coef[2] * (110-90))

##      sbp 
## 1.393568

To calculate a 95% CI for this odds ratio, first calculate a CI for \(\beta\):

ztable <- summary(mod2)$coefficients
ztable

##                Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept) -3.00880898 0.189815484 -15.85123 1.378551e-56
## sbp          0.01659338 0.001385338  11.97785 4.642017e-33

CIbeta <- ztable[2,1] + c(-1,1)*qnorm(.975)*ztable[2,2]
CIbeta

## [1] 0.01387816 0.01930859

Then multiply by 20 and transform the endpoints to get CI for \(e^{20\beta}\):

exp(CIbeta * 20)

## [1] 1.319910 1.471337

Median effective level

Estimated median effective level:

-mod2$coef[1]/mod2$coef[2]

## (Intercept) 
##    181.3259

The estimated probability of CHD is 0.5 when systolic blood pressure is 181.33 mmHg.

Your rate of change in the odds of CHD with SBP is highest around an SBP of 181.33 mmHg.

Predictions

Estimated risk of CHD and rate of change in risk for a one unit increase in SBP when SBP is around 140 (hypertension):

co <- mod2$coef
p140 <- exp(co[1]+co[2]*140)/(1+exp(co[1]+co[2]*140))
# Risk
p140

## (Intercept) 
##   0.3349822

# Rate of change
co[2] * p140 * (1-p140)

##         sbp 
## 0.003696492

The predicted probability of CHD for an individual with systolic blood pressure of 140 mmMg is 0.335. That is, there is an estimated 33.5% chance of CHD at some point in the 30 years of the study.

For each additional mmMg for that individual, the predicted probability of CHD increases (additively) by 0.0037.

More efficiently, we can write a function to calculate estimated probabilities:

prob <- function(co,x){
    # co = vector of coefficients (logistic model 1 predictor)
    # x = x-value
    as.numeric(exp(co[1]+co[2]*x)/(1+exp(co[1]+co[2]*x)))
}
# Predicted risk of CHD when SBP = 120:
prob(mod2$coef,120)

## [1] 0.2654944

Predictions using Logistic model, by hand:

co <- mod2$coef
# Odds for sbp = 110:
exp(mod2$coef[1]+mod2$coef[2]*110)

## (Intercept) 
##   0.3061936

# Probability for sbp = 110:
exp(mod2$coef[1]+mod2$coef[2]*110)/(1+exp(mod2$coef[1]+mod2$coef[2]*110))

## (Intercept) 
##   0.2344167

or by using the predict function:

?predict.glm
newdata <- data.frame(sbp = c(110,150))
# Predicted probabilities:
predict(mod2,newdata,type="response")

##         1         2 
## 0.2344167 0.3728984

# Predicted log-odds:
predict(mod2,newdata,type="link")

##          1          2 
## -1.1835377 -0.5198026

# Predicted odds:
exp(predict(mod2,newdata,type="link"))

##         1         2 
## 0.3061936 0.5946379

Checking Predictive Power

Check the predictive power of our logistic regression model. First, create a classification table using a cutoff of the sample proportion of those that experienced CHD.

# Predictive probabilities
probs <- fitted(mod2)
# Proportion in sample with chdfate = 1
mean(fram$chdfate)  # About 0.31

## [1] 0.313471

# Use 0.31 as cutoff to calculate sensitivity and specificity:
# pred prob > 0.31 -> 1; <= 0.31 -> 0
preds <- as.numeric(probs>0.31)
table(fram$chdfate,preds)

##    preds
##        0    1
##   0 2036 1190
##   1  698  775

Sensitivity = P(y-hat = 1 | y = 1):

775/(698+775)

## [1] 0.5261371

Specificity = P(y-hat = 0 | y = 0):

2036/(2036+1190)

## [1] 0.6311221

The classification table is sensitive to our choice of cutoff. How do answers change if we use a cutoff of 0.5?

preds <- as.numeric(probs>0.5)
table(fram$chdfate,preds)

##    preds
##        0    1
##   0 3134   92
##   1 1381   92

ROC Curve

library(ROCR)

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

# arguments = fitted probabilities; observed 0/1
pred.obj <- prediction(probs,fram$chdfate)  
# Plot ROC curve
plot(performance(pred.obj,"tpr","fpr"))

# Area under the ROC curve
performance(pred.obj,"auc")

## An object of class "performance"
## Slot "x.name":
## [1] "None"
## 
## Slot "y.name":
## [1] "Area under the ROC curve"
## 
## Slot "alpha.name":
## [1] "none"
## 
## Slot "x.values":
## list()
## 
## Slot "y.values":
## [[1]]
## [1] 0.6122628
## 
## 
## Slot "alpha.values":
## list()

Our model is a very poor predictor of CHD.

Examine logistic model with predictors sbp and sex

# Recode sex to be an indicator of female
fram$sex <- fram$sex-1

No interaction

Logistic regression with two predictors - no interaction between sbp and sex:

mod4 <- glm(chdfate ~ sbp + sex, family = "binomial", data = fram)
summary(mod4)

## 
## Call:
## glm(formula = chdfate ~ sbp + sex, family = "binomial", data = fram)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7431  -0.8935  -0.6838   1.2765   1.9910  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.76479    0.19401  -14.25   <2e-16 ***
## sbp          0.01785    0.00142   12.57   <2e-16 ***
## sex         -0.78261    0.06535  -11.97   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5844.1  on 4698  degrees of freedom
## Residual deviance: 5549.7  on 4696  degrees of freedom
## AIC: 5555.7
## 
## Number of Fisher Scoring iterations: 4

exp(coef(mod4))

## (Intercept)         sbp         sex 
##  0.06298909  1.01801052  0.45720929

Interpretations?

spb:

For every one mmMg increase in SBP, the estimated increase of odds of CHD is 1.8%, controlling for sex.
A one mmMg increase in SBP is associated with an estimated 1.8% of the odds of CHD, adjusting for sex.
A one mmMg increase in SBP is associated with an estimated 1.8% of the odds of CHD when comparing individuals of the same sex.

sex:

The estimated odds of CHD for females are 54.28% lower than for males, controlling for SBP.
Among individuals with the same SBP, the odds ratio of CHD for females to males is 0.457.

Interaction

Logistic regression with two predictors - interaction between sbp and sex:

mod5 <- glm(chdfate ~ sbp*sex, family = "binomial", data = fram)
summary(mod5)

## 
## Call:
## glm(formula = chdfate ~ sbp * sex, family = "binomial", data = fram)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9162  -0.9263  -0.6713   1.2860   2.0446  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.088799   0.310533  -6.726 1.74e-11 ***
## sbp          0.012758   0.002315   5.512 3.55e-08 ***
## sex         -1.866700   0.400881  -4.656 3.22e-06 ***
## sbp:sex      0.008048   0.002934   2.744  0.00608 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5844.1  on 4698  degrees of freedom
## Residual deviance: 5542.2  on 4695  degrees of freedom
## AIC: 5550.2
## 
## Number of Fisher Scoring iterations: 4

exp(coef(mod5))

## (Intercept)         sbp         sex     sbp:sex 
##   0.1238357   1.0128399   0.1546331   1.0080807

Look at the model within each sex:

Males (sex = 0): (\(x\) = sbp) \[ \log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = -2.0888 + 0.0128x \]

The estimated odds of CHD for males increase by 1.28% for each additional mmMg in systolic blood pressure.

A one mmMG increase in systolic blood pressure is associated with an estimated 1.28% increase in the odds of CHD for males.

Females (sex = 1): (\(x\) = sbp)

# Intercept for females
coef(mod5)[1]+coef(mod5)[3]

## (Intercept) 
##   -3.955499

# Slope for females
coef(mod5)[2]+coef(mod5)[4]

##        sbp 
## 0.02080645

\[ \begin{eqnarray*} \log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) &= (-2.0888-1.8667) + (0.0128+0.008048)x \\ &= -3.9555 + 0.0208x \end{eqnarray*} \] For every one mmMg increase in SBP, the estimated increase of odds of CHD is 2.1% for females.

Interpret interaction coefficient directly

exp(coef(mod5))

## (Intercept)         sbp         sex     sbp:sex 
##   0.1238357   1.0128399   0.1546331   1.0080807

How the effect of SBP changes with sex:

The increase in estimated odds for a one mmHg increase in SBP is 0.8% higher for females than for males.

How the effect of sex changes with SBP:

The change in estimated odds from males to females is 0.8% higher for each 1 mmHG increase in SBP.

The estimated odds ratio of CHD for females compared to males increases by 0.8% for each additional mmHG in systolic blood pressure.