The Framingham heart study is a cohort longitudinal study on residents of a small Massachusetts town (Framingham) that has been studying risk factors for cardiovascular disease (CVD) for over 70 years, since 1948. The data set we’ll use here contains data on 4699 subjects (men and women between the ages of 30 and 62). The subjects were given biennial exams for blood pressure, serum cholesterol, and relative weight. These data are the 30-year follow-up data. Major endpoints include the occurrence of coronary heart disease (CHD) and deaths from CHD, cerebrovasular accident (CVA or stroke), cancer, and other causes.
Identify the common factors or characteristics that contribute to cardiovascular disease (CVD) by following its development over a long period of time in a large group of participants who had not yet developed overt symptoms of CVD or suffered a heart attack or stroke.
sex : Sex (1 = male; 2 = female)
sbp : Systolic Blood Pressure
dbp : Diastolic Blood Pressure
scl : Serum Cholesterol
chdfate : Coronary Heart Disease Indicator followup : Follow-up in Days
age : Age in Years
bmi : Body Mass Index (wt (kg) / h^2 (m)
month : Study Month of Baseline Exam
id : Subject ID
Load in Framingham data:
fram <- read.table("http://www.math.montana.edu/shancock/data/Framingham.txt")
# Check data were read in correctly:
head(fram)##   sex sbp dbp scl chdfate followup age  bmi month   id
## 1   1 120  80 267       1       18  55 25.0     8 2642
## 2   1 130  78 192       1       35  53 28.4    12 4627
## 3   1 144  90 207       1      109  61 25.1     8 2568
## 4   1  92  66 231       1      147  48 26.2    11 4192
## 5   1 162  98 271       1      169  39 28.4    11 3977
## 6   2 212 118 182       1      199  61 33.3     2  659tail(fram)##      sex sbp dbp scl chdfate followup age  bmi month   id
## 4694   2 136  92 197       0    11688  45 23.1     5 1976
## 4695   1 130  88 213       0    11688  47 28.4     9 3195
## 4696   2 112  68 252       0    11688  40 22.0     4 1674
## 4697   1 112  76 177       0    11688  40 23.5    12 4526
## 4698   2 125  75 244       0    11688  47 30.4    12 4536
## 4699   2 142  70 255       0    11688  44 23.3     4 1786dim(fram)## [1] 4699   10names(fram)##  [1] "sex"      "sbp"      "dbp"      "scl"      "chdfate"  "followup"
##  [7] "age"      "bmi"      "month"    "id"We will only consider one potential predictor, systolic blood pressure, sbp on the response variable chdfate. We would like to investigate the research question: Does SBP influence probability of CHD?
Univariate distributions:
fram %>% ggplot(aes(x = chdfate)) +
    geom_bar()hist(fram$sbp) Simple scatterplot and side-by-side boxplots (note response and explanatory variables reversed axes than typically used):
plot(chdfate~sbp, xlab="Systolic BP", ylab="Coronary Heart Disease Indicator", data=fram)boxplot(sbp ~ chdfate, data=fram)Since it is hard to visualize a response variable of 0’s and 1’s, let’s create subgroups for SBP and look at the proportion who experienced CHD in each group.
fram$sbpgrp <- with(fram,
               cut(sbp, breaks = c(min(sbp),126,146,166,max(sbp)),
            include.lowest=TRUE)
)
# Count number of CHD incidences within each group:
counts <- with(fram,
               table(list(chdfate,sbpgrp))
)
counts##    .2
## .1  [80,126] (126,146] (146,166] (166,270]
##   0     1663       978       395       190
##   1      534       503       255       181# Proportions in each group:
props <- counts[2,]/colSums(counts)
props##  [80,126] (126,146] (146,166] (166,270] 
## 0.2430587 0.3396354 0.3923077 0.4878706Now plot proportions and add to the original scatterplot.
# SBP Midpoints:
mids = c(103,136,156,218)
plot(chdfate~sbp, xlab="Systolic BP", ylab="Coronary Heart Disease Indicator", data=fram)
points(props~mids, xlab="Systolic BP", ylab="Proportion with CHD", col="tomato", pch = 16)
lines(lowess(fram$chdfate~fram$sbp), col="red")Linear probability model:
mod1 <- glm(chdfate ~ sbp, family=binomial(link="identity"), data=fram)
summary(mod1)## 
## Call:
## glm(formula = chdfate ~ sbp, family = binomial(link = "identity"), 
##     data = fram)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8829  -0.8729  -0.7565   1.3610   1.9201  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.1889767  0.0386344  -4.891    1e-06 ***
## sbp          0.0037744  0.0002925  12.906   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5844.1  on 4698  degrees of freedom
## Residual deviance: 5689.0  on 4697  degrees of freedom
## AIC: 5693
## 
## Number of Fisher Scoring iterations: 4Logistic regression model: The default link for binomial family is logit.
mod2 <- glm(chdfate~sbp, family=binomial, data=fram)
summary(mod2)## 
## Call:
## glm(formula = chdfate ~ sbp, family = binomial, data = fram)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8320  -0.8668  -0.7634   1.3677   1.8368  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.008809   0.189815  -15.85   <2e-16 ***
## sbp          0.016593   0.001385   11.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5844.1  on 4698  degrees of freedom
## Residual deviance: 5695.7  on 4697  degrees of freedom
## AIC: 5699.7
## 
## Number of Fisher Scoring iterations: 4Probit regression model:
mod3 <- glm(chdfate~sbp, family=binomial(link="probit"), data=fram)
summary(mod3)## 
## Call:
## glm(formula = chdfate ~ sbp, family = binomial(link = "probit"), 
##     data = fram)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8426  -0.8680  -0.7618   1.3660   1.8506  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.8529512  0.1144066  -16.20   <2e-16 ***
## sbp          0.0102093  0.0008407   12.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5844.1  on 4698  degrees of freedom
## Residual deviance: 5694.3  on 4697  degrees of freedom
## AIC: 5698.3
## 
## Number of Fisher Scoring iterations: 4Add the three fitted models to the scatterplot:
plot(chdfate~sbp, xlab="Systolic BP", ylab="Coronary Heart Disease Indicator", data=fram)
points(props~mids, xlab="Systolic BP", ylab="Proportion with CHD", col="tomato", pch = 16)
# Linear:
abline(mod1,lwd=2,col="darkgreen")
# Logistic:
curve(exp(mod2$coef[1]+mod2$coef[2]*x)/(1+exp(mod2$coef[1]+mod2$coef[2]*x)),
        add=T,lwd=2,lty=2,from=90, to=300,col="red")
# Probit:
curve(pnorm(mod3$coef[1]+mod3$coef[2]*x),
        add=T,lwd=2,lty=3,col="blue")
legend("topleft",c("Linear","Logit","Probit"),lty=c(1,2,3),
    col=c("darkgreen","red","blue"))Let’s continue using mod2, the logistic regression model.
Interpretations are always on the odds scale in logistic regression.
exp(mod2$coef)## (Intercept)         sbp 
##  0.04935042  1.01673181(Intercept) The estimated odds of CHD for an individual with 0 mmHg systolic blood pressure are 0.0494 to 1.
sbp A one mmHg increase in systolic blood pressure is associated with a 1.673% increase in estimated odds of CHD.
Confidence interval for slope - interpret on odds scale:
exp(confint(mod2))## Waiting for profiling to be done...##                  2.5 %     97.5 %
## (Intercept) 0.03394333 0.07144715
## sbp         1.01398776 1.01951091A one mmHg increase in systolic blood pressure is associated with an increase of between 1.40% to 1.95% in the odds of CHD.
Estimated odds ratio of CHD for SBP = 140 compared to SBP = 120:
exp(mod2$coef[2] * (140-120))##      sbp 
## 1.393568# same as
exp(mod2$coef[2] * (110-90))##      sbp 
## 1.393568To calculate a 95% CI for this odds ratio, first calculate a CI for \(\beta\):
ztable <- summary(mod2)$coefficients
ztable##                Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept) -3.00880898 0.189815484 -15.85123 1.378551e-56
## sbp          0.01659338 0.001385338  11.97785 4.642017e-33CIbeta <- ztable[2,1] + c(-1,1)*qnorm(.975)*ztable[2,2]
CIbeta## [1] 0.01387816 0.01930859Then multiply by 20 and transform the endpoints to get CI for \(e^{20\beta}\):
exp(CIbeta * 20)## [1] 1.319910 1.471337Estimated median effective level:
-mod2$coef[1]/mod2$coef[2]## (Intercept) 
##    181.3259The estimated probability of CHD is 0.5 when systolic blood pressure is 181.33 mmHg.
Your rate of change in the odds of CHD with SBP is highest around an SBP of 181.33 mmHg.
Estimated risk of CHD and rate of change in risk for a one unit increase in SBP when SBP is around 140 (hypertension):
co <- mod2$coef
p140 <- exp(co[1]+co[2]*140)/(1+exp(co[1]+co[2]*140))
# Risk
p140## (Intercept) 
##   0.3349822# Rate of change
co[2] * p140 * (1-p140)##         sbp 
## 0.003696492The predicted probability of CHD for an individual with systolic blood pressure of 140 mmMg is 0.335. That is, there is an estimated 33.5% chance of CHD at some point in the 30 years of the study.
For each additional mmMg for that individual, the predicted probability of CHD increases (additively) by 0.0037.
More efficiently, we can write a function to calculate estimated probabilities:
prob <- function(co,x){
    # co = vector of coefficients (logistic model 1 predictor)
    # x = x-value
    as.numeric(exp(co[1]+co[2]*x)/(1+exp(co[1]+co[2]*x)))
}
# Predicted risk of CHD when SBP = 120:
prob(mod2$coef,120)## [1] 0.2654944Predictions using Logistic model, by hand:
co <- mod2$coef
# Odds for sbp = 110:
exp(mod2$coef[1]+mod2$coef[2]*110)## (Intercept) 
##   0.3061936# Probability for sbp = 110:
exp(mod2$coef[1]+mod2$coef[2]*110)/(1+exp(mod2$coef[1]+mod2$coef[2]*110))## (Intercept) 
##   0.2344167or by using the predict function:
?predict.glm
newdata <- data.frame(sbp = c(110,150))
# Predicted probabilities:
predict(mod2,newdata,type="response")##         1         2 
## 0.2344167 0.3728984# Predicted log-odds:
predict(mod2,newdata,type="link")##          1          2 
## -1.1835377 -0.5198026# Predicted odds:
exp(predict(mod2,newdata,type="link"))##         1         2 
## 0.3061936 0.5946379Check the predictive power of our logistic regression model. First, create a classification table using a cutoff of the sample proportion of those that experienced CHD.
# Predictive probabilities
probs <- fitted(mod2)
# Proportion in sample with chdfate = 1
mean(fram$chdfate)  # About 0.31## [1] 0.313471# Use 0.31 as cutoff to calculate sensitivity and specificity:
# pred prob > 0.31 -> 1; <= 0.31 -> 0
preds <- as.numeric(probs>0.31)
table(fram$chdfate,preds)##    preds
##        0    1
##   0 2036 1190
##   1  698  775Sensitivity = P(y-hat = 1 | y = 1):
775/(698+775)## [1] 0.5261371Specificity = P(y-hat = 0 | y = 0):
2036/(2036+1190)## [1] 0.6311221The classification table is sensitive to our choice of cutoff. How do answers change if we use a cutoff of 0.5?
preds <- as.numeric(probs>0.5)
table(fram$chdfate,preds)##    preds
##        0    1
##   0 3134   92
##   1 1381   92library(ROCR)## Loading required package: gplots## 
## Attaching package: 'gplots'## The following object is masked from 'package:stats':
## 
##     lowess# arguments = fitted probabilities; observed 0/1
pred.obj <- prediction(probs,fram$chdfate)  
# Plot ROC curve
plot(performance(pred.obj,"tpr","fpr"))# Area under the ROC curve
performance(pred.obj,"auc")## An object of class "performance"
## Slot "x.name":
## [1] "None"
## 
## Slot "y.name":
## [1] "Area under the ROC curve"
## 
## Slot "alpha.name":
## [1] "none"
## 
## Slot "x.values":
## list()
## 
## Slot "y.values":
## [[1]]
## [1] 0.6122628
## 
## 
## Slot "alpha.values":
## list()Our model is a very poor predictor of CHD.
# Recode sex to be an indicator of female
fram$sex <- fram$sex-1Logistic regression with two predictors - no interaction between sbp and sex:
mod4 <- glm(chdfate ~ sbp + sex, family = "binomial", data = fram)
summary(mod4)## 
## Call:
## glm(formula = chdfate ~ sbp + sex, family = "binomial", data = fram)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7431  -0.8935  -0.6838   1.2765   1.9910  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.76479    0.19401  -14.25   <2e-16 ***
## sbp          0.01785    0.00142   12.57   <2e-16 ***
## sex         -0.78261    0.06535  -11.97   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5844.1  on 4698  degrees of freedom
## Residual deviance: 5549.7  on 4696  degrees of freedom
## AIC: 5555.7
## 
## Number of Fisher Scoring iterations: 4exp(coef(mod4))## (Intercept)         sbp         sex 
##  0.06298909  1.01801052  0.45720929Interpretations?
spb:
sex:
Logistic regression with two predictors - interaction between sbp and sex:
mod5 <- glm(chdfate ~ sbp*sex, family = "binomial", data = fram)
summary(mod5)## 
## Call:
## glm(formula = chdfate ~ sbp * sex, family = "binomial", data = fram)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9162  -0.9263  -0.6713   1.2860   2.0446  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.088799   0.310533  -6.726 1.74e-11 ***
## sbp          0.012758   0.002315   5.512 3.55e-08 ***
## sex         -1.866700   0.400881  -4.656 3.22e-06 ***
## sbp:sex      0.008048   0.002934   2.744  0.00608 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5844.1  on 4698  degrees of freedom
## Residual deviance: 5542.2  on 4695  degrees of freedom
## AIC: 5550.2
## 
## Number of Fisher Scoring iterations: 4exp(coef(mod5))## (Intercept)         sbp         sex     sbp:sex 
##   0.1238357   1.0128399   0.1546331   1.0080807Look at the model within each sex:
Males (sex = 0): (\(x\) = sbp) \[ \log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = -2.0888 + 0.0128x \]
The estimated odds of CHD for males increase by 1.28% for each additional mmMg in systolic blood pressure.
OR
A one mmMG increase in systolic blood pressure is associated with an estimated 1.28% increase in the odds of CHD for males.
Females (sex = 1): (\(x\) = sbp)
# Intercept for females
coef(mod5)[1]+coef(mod5)[3]## (Intercept) 
##   -3.955499# Slope for females
coef(mod5)[2]+coef(mod5)[4]##        sbp 
## 0.02080645\[ \begin{eqnarray*} \log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) &= (-2.0888-1.8667) + (0.0128+0.008048)x \\ &= -3.9555 + 0.0208x \end{eqnarray*} \] For every one mmMg increase in SBP, the estimated increase of odds of CHD is 2.1% for females.
exp(coef(mod5))## (Intercept)         sbp         sex     sbp:sex 
##   0.1238357   1.0128399   0.1546331   1.0080807How the effect of SBP changes with sex:
The increase in estimated odds for a one mmHg increase in SBP is 0.8% higher for females than for males.
OR
How the effect of sex changes with SBP:
The change in estimated odds from males to females is 0.8% higher for each 1 mmHG increase in SBP.
OR
The estimated odds ratio of CHD for females compared to males increases by 0.8% for each additional mmHG in systolic blood pressure.