---
title: 'GLM Intro: Framingham Heart Study'
author: "Stacey Hancock"
date: "2/6/2020"
output:
  html_document: default
  pdf_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
```

The [Framingham heart study](https://www.framinghamheartstudy.org/fhs-about/) is a cohort longitudinal study on residents of a small Massachusetts town (Framingham) that has been
studying risk factors for cardiovascular disease (CVD) for over 70 years, since 1948. The data set we'll use here contains data
on 4699 subjects (men and women between the ages of 30 and 62).
The subjects were given biennial exams for blood pressure,
serum cholesterol, and relative weight. These data are
the 30-year follow-up data. Major endpoints include
the occurrence of coronary heart disease (CHD) and
deaths from CHD, cerebrovasular accident (CVA or stroke),
cancer, and other causes. 

## Scientific Objective
Identify the common factors or characteristics that contribute to cardiovascular disease (CVD) by following its development over a long period of time in a large group of participants who had not yet developed overt symptoms of CVD or suffered a heart attack or stroke.

## Variable Descriptions
sex      : Sex (1 = male; 2 = female)  
sbp      : Systolic Blood Pressure  
dbp      : Diastolic Blood Pressure  
scl      : Serum Cholesterol  
chdfate  : Coronary Heart Disease Indicator
followup : Follow-up in Days  
age      : Age in Years  
bmi      : Body Mass Index (wt (kg) / h^2 (m)  
month    : Study Month of Baseline Exam  
id       : Subject ID

## Analysis
Load in Framingham data:
```{r}
fram <- read.table("http://www.math.montana.edu/shancock/data/Framingham.txt")
# Check data were read in correctly:
head(fram)
tail(fram)
dim(fram)
names(fram)
```

We will only consider one potential predictor, systolic blood pressure, `sbp` on the response variable `chdfate`. We would like to investigate
the research question: _Does SBP influence probability of CHD?_

### Data Visualization and Descriptive Statistics

Univariate distributions:
```{r}
fram %>% ggplot(aes(x = chdfate)) +
    geom_bar()
hist(fram$sbp)
```
Simple scatterplot and side-by-side boxplots (note
response and explanatory variables reversed axes than
typically used):
```{r}
plot(chdfate~sbp, xlab="Systolic BP", ylab="Coronary Heart Disease Indicator", data=fram)
boxplot(sbp ~ chdfate, data=fram)
```

Since it is hard to visualize a response variable of 0's and 1's,
let's create subgroups for SBP and look at the proportion
who experienced CHD in each group.
```{r}
fram$sbpgrp <- with(fram,
               cut(sbp, breaks = c(min(sbp),126,146,166,max(sbp)),
			include.lowest=TRUE)
)
# Count number of CHD incidences within each group:
counts <- with(fram,
               table(list(chdfate,sbpgrp))
)
counts
# Proportions in each group:
props <- counts[2,]/colSums(counts)
props
```

Now plot proportions and add to the original scatterplot.
```{r}
# SBP Midpoints:
mids = c(103,136,156,218)
plot(chdfate~sbp, xlab="Systolic BP", ylab="Coronary Heart Disease Indicator", data=fram)
points(props~mids, xlab="Systolic BP", ylab="Proportion with CHD", col="tomato", pch = 16)
lines(lowess(fram$chdfate~fram$sbp), col="red")
```


### Models

_Linear probability model_:
```{r}
mod1 <- glm(chdfate ~ sbp, family=binomial(link="identity"), data=fram)
summary(mod1)
```


_Logistic regression model_:
The default link for binomial family is logit.
```{r}
mod2 <- glm(chdfate~sbp, family=binomial, data=fram)
summary(mod2)
```


_Probit regression model_:
```{r}
mod3 <- glm(chdfate~sbp, family=binomial(link="probit"), data=fram)
summary(mod3)
```

Add the three fitted models to the scatterplot:
```{r}
plot(chdfate~sbp, xlab="Systolic BP", ylab="Coronary Heart Disease Indicator", data=fram)
points(props~mids, xlab="Systolic BP", ylab="Proportion with CHD", col="tomato", pch = 16)
# Linear:
abline(mod1,lwd=2,col="darkgreen")
# Logistic:
curve(exp(mod2$coef[1]+mod2$coef[2]*x)/(1+exp(mod2$coef[1]+mod2$coef[2]*x)),
		add=T,lwd=2,lty=2,from=90, to=300,col="red")
# Probit:
curve(pnorm(mod3$coef[1]+mod3$coef[2]*x),
		add=T,lwd=2,lty=3,col="blue")
legend("topleft",c("Linear","Logit","Probit"),lty=c(1,2,3),
	col=c("darkgreen","red","blue"))
```

Let's continue using `mod2`, the logistic regression model.

### Interpreting coefficients

Interpretations are always on the odds scale in logistic regression.
```{r}
exp(mod2$coef)
```
`(Intercept)` The estimated odds of CHD for an individual with 0 mmHg systolic blood pressure are 0.0494 to 1.

`sbp` A one mmHg increase in systolic blood pressure is associated with a 1.673\% increase in estimated odds of CHD.

### Confidence intervals for coefficients

Confidence interval for slope - interpret on odds scale:
```{r}
exp(confint(mod2))
```
A one mmHg increase in systolic blood pressure is associated with an increase of between 1.40\% to 1.95\% in the odds of CHD.

### Odds ratios for an x change in SBP
Estimated odds ratio of CHD for SBP = 140 compared to SBP = 120:
```{r}
exp(mod2$coef[2] * (140-120))
# same as
exp(mod2$coef[2] * (110-90))
```

To calculate a 95% CI for this odds ratio, first calculate a CI
for $\beta$:
```{r}
ztable <- summary(mod2)$coefficients
ztable
CIbeta <- ztable[2,1] + c(-1,1)*qnorm(.975)*ztable[2,2]
CIbeta
```
Then multiply by 20 and transform the endpoints to get CI for $e^{20\beta}$:
```{r}
exp(CIbeta * 20)
```

### Median effective level
Estimated median effective level:
```{r}
-mod2$coef[1]/mod2$coef[2]
```
The estimated probability of CHD is 0.5 when systolic blood pressure is 181.33 mmHg.

Your rate of change in the odds of CHD with SBP is highest around an SBP of 181.33 mmHg.


### Predictions
Estimated risk of CHD and rate of change in risk for a one unit increase in SBP
when SBP is around 140 (hypertension):
```{r}
co <- mod2$coef
p140 <- exp(co[1]+co[2]*140)/(1+exp(co[1]+co[2]*140))
# Risk
p140
# Rate of change
co[2] * p140 * (1-p140)
```

The predicted probability of CHD for an individual with systolic blood pressure of 140 mmMg is 0.335. That is,
there is an estimated 33.5\% chance of CHD at some point in the 30 years of the study. 

For each additional mmMg for that individual, the predicted probability of CHD increases (additively) by 0.0037.

More efficiently, we can write a function to calculate estimated probabilities:
```{r}
prob <- function(co,x){
	# co = vector of coefficients (logistic model 1 predictor)
	# x = x-value
	as.numeric(exp(co[1]+co[2]*x)/(1+exp(co[1]+co[2]*x)))
}
# Predicted risk of CHD when SBP = 120:
prob(mod2$coef,120)
```


Predictions using Logistic model, by hand:
```{r}
co <- mod2$coef
# Odds for sbp = 110:
exp(mod2$coef[1]+mod2$coef[2]*110)
# Probability for sbp = 110:
exp(mod2$coef[1]+mod2$coef[2]*110)/(1+exp(mod2$coef[1]+mod2$coef[2]*110))
```
or by using the predict function:
```{r}
?predict.glm
newdata <- data.frame(sbp = c(110,150))
# Predicted probabilities:
predict(mod2,newdata,type="response")
# Predicted log-odds:
predict(mod2,newdata,type="link")
# Predicted odds:
exp(predict(mod2,newdata,type="link"))
```

#### Checking Predictive Power
Check the predictive power of our logistic regression model. First, create a classification table using a cutoff of the sample proportion of those that experienced CHD.
```{r}
# Predictive probabilities
probs <- fitted(mod2)
# Proportion in sample with chdfate = 1
mean(fram$chdfate)  # About 0.31

# Use 0.31 as cutoff to calculate sensitivity and specificity:
# pred prob > 0.31 -> 1; <= 0.31 -> 0
preds <- as.numeric(probs>0.31)
table(fram$chdfate,preds)
```

Sensitivity = P(y-hat = 1 | y = 1):
```{r}
775/(698+775)
```

Specificity = P(y-hat = 0 | y = 0):
```{r}
2036/(2036+1190)
```
The classification table is sensitive to our choice of cutoff. How do answers change if we use a cutoff of 0.5?
```{r}
preds <- as.numeric(probs>0.5)
table(fram$chdfate,preds)
```

##### ROC Curve
```{r}
library(ROCR)
# arguments = fitted probabilities; observed 0/1
pred.obj <- prediction(probs,fram$chdfate)  
# Plot ROC curve
plot(performance(pred.obj,"tpr","fpr"))
# Area under the ROC curve
performance(pred.obj,"auc")
```
Our model is a very poor predictor of CHD.

### Examine logistic model with predictors sbp and sex
```{r}
# Recode sex to be an indicator of female
fram$sex <- fram$sex-1
```

#### No interaction
Logistic regression with two predictors - no interaction between sbp and sex:
```{r}
mod4 <- glm(chdfate ~ sbp + sex, family = "binomial", data = fram)
summary(mod4)
exp(coef(mod4))
```

_Interpretations?_

`spb`:  

- For every one mmMg increase in SBP, the estimated increase of odds of CHD is 1.8\%, controlling for sex.
- A one mmMg increase in SBP is associated with an estimated 1.8\% of the odds of CHD, adjusting for sex.
- A one mmMg increase in SBP is associated with an estimated 1.8\% of the odds of CHD when comparing individuals of the same sex.

`sex`:

- The estimated odds of CHD for females are 54.28\% lower than for males, controlling for SBP.
- Among individuals with the same SBP, the odds ratio of CHD for females to males is 0.457.

#### Interaction
Logistic regression with two predictors - interaction between sbp and sex:
```{r}
mod5 <- glm(chdfate ~ sbp*sex, family = "binomial", data = fram)
summary(mod5)
exp(coef(mod5))
```

Look at the model within each sex:

_Males_ (sex = 0): ($x$ = sbp)
$$
\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = -2.0888 + 0.0128x 
$$

The estimated odds of CHD for _males_ increase by 1.28\% for each additional mmMg in systolic blood pressure.

OR

A one mmMG increase in systolic blood pressure is associated with an estimated 1.28\% increase in the odds of CHD for _males_. 

_Females_ (sex = 1): ($x$ = sbp)
```{r}
# Intercept for females
coef(mod5)[1]+coef(mod5)[3]
# Slope for females
coef(mod5)[2]+coef(mod5)[4]
```
$$
\begin{eqnarray*}
\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) &= (-2.0888-1.8667) + (0.0128+0.008048)x \\
&= -3.9555 + 0.0208x
\end{eqnarray*}
$$
For every one mmMg increase in SBP, the estimated increase of odds of CHD is 2.1\% for _females_.

##### Interpret interaction coefficient directly
```{r}
exp(coef(mod5))
```

How the effect of SBP changes with sex:

The increase in estimated odds for a one mmHg increase in SBP
is 0.8\% higher for females than for males.

OR

How the effect of sex changes with SBP:

The change in estimated odds from males to females is 0.8\% higher for each 1 mmHG increase in SBP.

OR

The estimated odds ratio of CHD for females compared to males increases by 0.8\% for each additional mmHG in systolic blood pressure.