---
title: "Models for Matched Pairs"
author: "Stacey Hancock"
date: "4/9/2020"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(PropCIs)  # diffpropci.Wald.mp and scoreci.mp
library(gee)      # marginal models fit with generalized estimating equations
library(lme4)     # generalized linear mixed models
```

# Dependent Proportions: McNemar Test

## Example: Asthma Prevalence over Time

Researchers record whether each of 500 children has asthma at age 13, then follow up with the same children to see whether they have asthma 7 years later (age 20).

```{r}
dat <- matrix(c(50, 8, 22, 420), 2, 2)
dimnames(dat) <- list(Age13 = c("Yes", "No"), Age20 = c("Yes", "No"))
dat
```

Researchers suspect that the prevalence of asthma decreases over time:

$H_0$: The marginal probability of asthma at age 13 is *equal to* the marginal probability of asthma at age 20.

$H_a$: The marginal probability of asthma at age 13 is *greater than* the marginal probability of asthma at age 20.

Under $H_0$ that the two marginal probabilities are equal, the count in each off-diagonal cell has a binomial distribution with $22 + 8 = 30$ trials and success probability 0.5: (Yes, No) cell $\sim$ Bin$(30, 0.5)$. We can use this distribution to calculate our p-value:

```{r}
# One-sided p-value: P(X >= 22) when X ~ Bin(30, 0.5)
pbinom(21, 30, .5, lower.tail = FALSE)
```

That is, the probability of 22 or more individuals falling in the (Yes, No) cell is 0.008, so our observed data are unlikely to have occurred under $H_0$. Thus, we have strong evidence that the probability of asthma decreases from age 13 to age 20. But by how much?

### Confidence intervals for correlated difference in proportions

Do in class: Calculations by "hand".
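As a sketch of the "hand" calculation: the Wald interval for a difference in correlated proportions uses only the discordant cell counts and the sample size, so it should agree with `diffpropci.Wald.mp` below up to rounding. (Object names like `n12` and `n21` are just illustrative labels for the off-diagonal counts.)

```{r}
# Wald CI by hand for the matched-pairs difference p1 - p2.
# n12 = (Yes at 13, No at 20) count; n21 = (No at 13, Yes at 20) count.
n12 <- 22; n21 <- 8; n <- 500
d  <- (n12 - n21)/n                        # estimated difference: 0.028
se <- sqrt(n12 + n21 - (n12 - n21)^2/n)/n  # Wald standard error
d + c(-1, 1) * qnorm(0.975) * se           # 95% confidence interval
```

Note that only the 30 discordant pairs contribute to the standard error; the 470 concordant pairs carry no information about the difference.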
```{r}
diffpropci.Wald.mp(dat[2,1], dat[1,2], sum(dat), 0.95)  # arguments: n21, n12, n, conf.level
scoreci.mp(dat[2,1], dat[1,2], sum(dat), 0.95)
```

### `mcnemar.test` R function

The built-in function `mcnemar.test` calculates a two-sided p-value using a chi-squared approximation:

```{r}
mcnemar.test(dat, correct = FALSE)

# Test statistic
X2 <- (22 - 8)^2/(22 + 8)
X2

# p-value
pchisq(X2, 1, lower.tail = FALSE)
```

### Wrong analysis

What if we had ignored the dependence and recorded the data as two independent samples of 500 individuals each?

```{r}
dat.indep <- matrix(c(72, 58, 428, 442), 2, 2)
dimnames(dat.indep) <- list(Age = c("13", "20"), Asthma = c("Yes", "No"))
dat.indep
chisq.test(dat.indep, correct = FALSE)
```

The chi-squared test of independence uses a two-sided alternative, so our one-sided p-value would be

```{r}
0.188/2
```

Our conclusion is substantially different!

# Dependent Proportions: Logistic Regression

First, we need the data in long format: each row is a single observation on a single individual, so each individual contributes two rows.

```{r}
dat.long <- data.frame(
  ID = rep(1:500, each = 2),
  Age = rep(c(1, 2), 500),      # 1 = age 13, 2 = age 20
  y = c(rep(c(1, 1), 50),       # Yes, Yes
        rep(c(1, 0), 22),       # Yes, No
        rep(c(0, 1), 8),        # No, Yes
        rep(c(0, 0), 420))      # No, No
)
head(dat.long)
tail(dat.long)
```

## Marginal Model

```{r}
mod.gee <- gee(y ~ Age, id = ID, family = binomial(link = "logit"),
               corstr = "exchangeable", data = dat.long)
summary(mod.gee)
```

## Generalized Linear Mixed Model

```{r}
mod.glmm <- glmer(y ~ Age + (1 | ID), family = binomial(link = "logit"),
                  data = dat.long)
summary(mod.glmm)
```
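As a quick sanity check on the subject-specific (conditional) model, a standard result for matched binary pairs (see, e.g., Agresti, *Categorical Data Analysis*) is that the conditional ML estimate of the occasion effect depends only on the discordant pairs: $\hat\beta = \log(n_{21}/n_{12})$. This is a benchmark sketch, not the `glmer` estimate itself, which is computed by adaptive quadrature and may differ somewhat:

```{r}
# Conditional ML benchmark for the age-20 effect with matched pairs:
# only discordant pairs carry information, and beta-hat = log(n21/n12),
# where n21 pairs changed No -> Yes and n12 pairs changed Yes -> No.
n12 <- 22; n21 <- 8
beta.cond <- log(n21/n12)   # subject-specific log odds ratio, about -1.01
exp(beta.cond)              # conditional odds ratio 8/22, about 0.36
```

Because the GLMM effect is conditional on the random intercept, it is typically larger in magnitude than the population-averaged GEE estimate of the same comparison.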