Researchers record whether 500 children have asthma or not at age 13, then follow up with these same children to see if they have asthma or not 7 years later (age 20).
dat <- matrix(c(50,8,22,420),2,2)
dimnames(dat) <- list(Age13 = c("Yes","No"),
Age20 = c("Yes","No"))
dat
## Age20
## Age13 Yes No
## Yes 50 22
## No 8 420
Researchers think that prevalence of asthma decreases over time:
\(H_0\): The marginal probability of asthma at age 13 is equal to the marginal probability of asthma at 20.
\(H_a\): The marginal probability of asthma at age 13 is greater than the marginal probability of asthma at 20.
Under \(H_0\) that the two marginal probabilities are equal, the off-diagonal cell counts each have a binomial distribution with \(22+8 = 30\) trials and probability 0.5: (Yes,No) cell ~ Bin(22+8,.5). We can use this distribution to calculate our p-value:
# One-sided p-value:
pbinom(21,30,.5,lower.tail=FALSE)
## [1] 0.008062401
That is, the probability of 22 or more individuals in the (Yes,No) cell is 0.008. Our observed data are unlikely to have occurred under \(H_0\). Thus, we have strong evidence that the probability of asthma decreases from age 13 to 20. But by how much?
mcnemar.test
R functionThe built-in function mcnemar.test
calculates a two-sided p-value using a chi-squared approximation:
mcnemar.test(dat,correct=FALSE)
##
## McNemar's Chi-squared test
##
## data: dat
## McNemar's chi-squared = 6.5333, df = 1, p-value = 0.01059
# Test statistic
X2 <- (22 - 8)^2/(22+8)
X2
## [1] 6.533333
# p-value
pchisq(X2, 1, lower.tail=FALSE)
## [1] 0.01058714
What if we had ignored the dependence, and recorded the data as two independent samples of 500 individuals each:
dat.indep <- matrix(c(72,58,428,442),2,2)
dimnames(dat.indep) <- list(Age = c("13","20"),
Asthma = c("Yes","No"))
dat.indep
## Asthma
## Age Yes No
## 13 72 428
## 20 58 442
chisq.test(dat.indep,correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: dat.indep
## X-squared = 1.733, df = 1, p-value = 0.188
The chi-squared test for independence is a two-sided alternative, so our one-sided p-value would be
0.188/2
## [1] 0.094
Our conclusion is substantially different!
First, we need the data in long format: Each row is a single observation on a single individual (so each individual has two rows).
dat.long <- data.frame(
ID = rep(1:500, each = 2),
Age = c(rep(c(1,2), 500)), # 1 = age 13, 2 = age 20
y = c(rep(c(1,1),50), # Yes, Yes
rep(c(1,0),22), # Yes, No
rep(c(0,1),8), # No, Yes
rep(c(0,0),420)) # No, No
)
head(dat.long)
## ID Age y
## 1 1 1 1
## 2 1 2 1
## 3 2 1 1
## 4 2 2 1
## 5 3 1 1
## 6 3 2 1
tail(dat.long)
## ID Age y
## 995 498 1 0
## 996 498 2 0
## 997 499 1 0
## 998 499 2 0
## 999 500 1 0
## 1000 500 2 0
mod.gee <- gee(y ~ Age, id = ID, family=binomial(link = "logit"),
corstr = "exchangeable", data = dat.long)
## Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
## running glm to get initial regression estimate
## (Intercept) Age
## -1.5340473 -0.2484098
summary(mod.gee)
##
## GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
## gee S-function, version 4.13 modified 98/01/27 (1998)
##
## Model:
## Link: Logit
## Variance to Mean Relation: Binomial
## Correlation Structure: Exchangeable
##
## Call:
## gee(formula = y ~ Age, id = ID, data = dat.long, family = binomial(link = "logit"),
## corstr = "exchangeable")
##
## Summary of Residuals:
## Min 1Q Median 3Q Max
## -0.144 -0.144 -0.116 -0.116 0.884
##
##
## Coefficients:
## Estimate Naive S.E. Naive z Robust S.E. Robust z
## (Intercept) -1.5340473 0.17788632 -8.623751 0.17800595 -8.617955
## Age -0.2484098 0.09661973 -2.571005 0.09679686 -2.566300
##
## Estimated Scale Parameter: 1.002004
## Number of Iterations: 1
##
## Working Correlation
## [,1] [,2]
## [1,] 1.0000000 0.7423729
## [2,] 0.7423729 1.0000000
mod.glmm <- glmer(y ~ (1|ID) + Age, family=binomial(link = "logit"),
data = dat.long)
summary(mod.glmm)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: y ~ (1 | ID) + Age
## Data: dat.long
##
## AIC BIC logLik deviance df.resid
## 416.1 430.8 -205.1 410.1 997
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.68798 -0.00766 -0.00247 -0.00247 1.83701
##
## Random effects:
## Groups Name Variance Std.Dev.
## ID (Intercept) 345.1 18.58
## Number of obs: 1000, groups: ID, 500
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.4569 0.9161 -8.140 3.94e-16 ***
## Age -2.2633 0.7187 -3.149 0.00164 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## Age -0.660