---
title: "Models for Matched Pairs"
author: "Stacey Hancock"
date: "4/9/2020"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(PropCIs)  # diffpropci.Wald.mp and scoreci.mp
library(gee)      # marginal models fit with generalized estimating equations
library(lme4)     # generalized linear mixed models
```

# Dependent Proportions: McNemar Test

## Example: Asthma Prevalence over Time

Researchers record whether each of 500 children has asthma at age 13, then follow up with the same children to see whether they have asthma 7 years later (age 20).

```{r}
dat <- matrix(c(50, 8, 22, 420), 2, 2)
dimnames(dat) <- list(Age13 = c("Yes", "No"), Age20 = c("Yes", "No"))
dat
```

Researchers suspect that the prevalence of asthma decreases over time:

$H_0$: The marginal probability of asthma at age 13 is *equal to* the marginal probability of asthma at age 20.

$H_a$: The marginal probability of asthma at age 13 is *greater than* the marginal probability of asthma at age 20.

Under $H_0$ that the two marginal probabilities are equal, the count in each off-diagonal cell has a binomial distribution with $22 + 8 = 30$ trials and success probability 0.5: (Yes, No) cell $\sim$ Bin$(30, 0.5)$. We can use this distribution to calculate our p-value:

```{r}
# One-sided p-value: P(X >= 22) when X ~ Bin(30, 0.5)
pbinom(21, 30, .5, lower.tail = FALSE)
```

That is, the probability of 22 or more individuals falling in the (Yes, No) cell is 0.008, so our observed data are unlikely to have occurred under $H_0$. Thus, we have strong evidence that the probability of asthma decreases from age 13 to age 20. But by how much?

### Confidence intervals for correlated difference in proportions

Do in class: Calculations by "hand".
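As a sketch of the "hand" calculation: the Wald interval for a difference in correlated proportions uses only the discordant cell counts and the sample size, so it should agree with `diffpropci.Wald.mp` below up to rounding. (Object names like `n12` and `n21` are just illustrative labels for the off-diagonal counts.)

```{r}
# Wald CI by hand for the matched-pairs difference p1 - p2.
# n12 = (Yes at 13, No at 20) count; n21 = (No at 13, Yes at 20) count.
n12 <- 22; n21 <- 8; n <- 500
d  <- (n12 - n21)/n                        # estimated difference: 0.028
se <- sqrt(n12 + n21 - (n12 - n21)^2/n)/n  # Wald standard error
d + c(-1, 1) * qnorm(0.975) * se           # 95% confidence interval
```

Note that only the 30 discordant pairs contribute to the standard error; the 470 concordant pairs carry no information about the difference.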
```{r}
diffpropci.Wald.mp(dat[2,1], dat[1,2], sum(dat), 0.95)  # arguments: n21, n12, n, conf.level
scoreci.mp(dat[2,1], dat[1,2], sum(dat), 0.95)
```

### `mcnemar.test` R function

The built-in function `mcnemar.test` calculates a two-sided p-value using a chi-squared approximation:

```{r}
mcnemar.test(dat, correct = FALSE)

# Test statistic
X2 <- (22 - 8)^2/(22 + 8)
X2

# p-value
pchisq(X2, 1, lower.tail = FALSE)
```

### Wrong analysis

What if we had ignored the dependence and recorded the data as two independent samples of 500 individuals each?

```{r}
dat.indep <- matrix(c(72, 58, 428, 442), 2, 2)
dimnames(dat.indep) <- list(Age = c("13", "20"), Asthma = c("Yes", "No"))
dat.indep
chisq.test(dat.indep, correct = FALSE)
```

The chi-squared test of independence uses a two-sided alternative, so our one-sided p-value would be

```{r}
0.188/2
```

Our conclusion is substantially different!

# Dependent Proportions: Logistic Regression

First, we need the data in long format: each row is a single observation on a single individual, so each individual contributes two rows.

```{r}
dat.long <- data.frame(
  ID = rep(1:500, each = 2),
  Age = rep(c(1, 2), 500),      # 1 = age 13, 2 = age 20
  y = c(rep(c(1, 1), 50),       # Yes, Yes
        rep(c(1, 0), 22),       # Yes, No
        rep(c(0, 1), 8),        # No, Yes
        rep(c(0, 0), 420))      # No, No
)
head(dat.long)
tail(dat.long)
```

## Marginal Model

```{r}
mod.gee <- gee(y ~ Age, id = ID, family = binomial(link = "logit"),
               corstr = "exchangeable", data = dat.long)
summary(mod.gee)
```

## Generalized Linear Mixed Model

```{r}
mod.glmm <- glmer(y ~ Age + (1 | ID), family = binomial(link = "logit"),
                  data = dat.long)
summary(mod.glmm)
```
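As a quick sanity check on the subject-specific (conditional) model, a standard result for matched binary pairs (see, e.g., Agresti, *Categorical Data Analysis*) is that the conditional ML estimate of the occasion effect depends only on the discordant pairs: $\hat\beta = \log(n_{21}/n_{12})$. This is a benchmark sketch, not the `glmer` estimate itself, which is computed by adaptive quadrature and may differ somewhat:

```{r}
# Conditional ML benchmark for the age-20 effect with matched pairs:
# only discordant pairs carry information, and beta-hat = log(n21/n12),
# where n21 pairs changed No -> Yes and n12 pairs changed Yes -> No.
n12 <- 22; n21 <- 8
beta.cond <- log(n21/n12)   # subject-specific log odds ratio, about -1.01
exp(beta.cond)              # conditional odds ratio 8/22, about 0.36
```

Because the GLMM effect is conditional on the random intercept, it is typically larger in magnitude than the population-averaged GEE estimate of the same comparison.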