Dependent Proportions: McNemar Test

Example: Asthma Prevalence over Time

Researchers record whether each of 500 children has asthma at age 13, then follow up with the same children 7 years later (age 20) and record whether they have asthma.

dat <- matrix(c(50,8,22,420),2,2)
dimnames(dat) <- list(Age13 = c("Yes","No"), 
                      Age20 = c("Yes","No"))
dat
##      Age20
## Age13 Yes  No
##   Yes  50  22
##   No    8 420

Researchers think that prevalence of asthma decreases over time:

\(H_0\): The marginal probability of asthma at age 13 is equal to the marginal probability of asthma at age 20.
\(H_a\): The marginal probability of asthma at age 13 is greater than the marginal probability of asthma at age 20.

Under \(H_0\) that the two marginal probabilities are equal, and conditioning on the total number of discordant pairs, \(22 + 8 = 30\), the count in the (Yes, No) cell has a binomial distribution with 30 trials and success probability 0.5: \(n_{12} \sim \mathrm{Bin}(30, 0.5)\), where \(n_{12} = 22\) and \(n_{21} = 8\) denote the off-diagonal counts. We can use this distribution to calculate our p-value:

# One-sided p-value:
pbinom(21,30,.5,lower.tail=FALSE)
## [1] 0.008062401
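
The same one-sided p-value can also be obtained from the built-in exact binomial test, shown here as an equivalent check:

binom.test(22, 30, p = 0.5, alternative = "greater")$p.value
# should equal the pbinom() result above (about 0.008)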

That is, the probability of 22 or more individuals in the (Yes,No) cell is 0.008. Our observed data are unlikely to have occurred under \(H_0\). Thus, we have strong evidence that the probability of asthma decreases from age 13 to 20. But by how much?

Confidence intervals for a difference between correlated proportions

Do in class: Calculations by “hand”.
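
One way to do the Wald calculation "by hand" is sketched below; it uses the discordant counts defined above and should reproduce the diffpropci.Wald.mp interval that follows (variable names here are just illustrative):

# Wald interval for the difference in correlated proportions, "by hand"
n12 <- dat["Yes","No"]   # asthma at 13 but not at 20: 22
n21 <- dat["No","Yes"]   # asthma at 20 but not at 13: 8
n   <- sum(dat)          # 500 children
d   <- (n12 - n21)/n     # estimated difference in marginal proportions: 0.028
se  <- sqrt(n12 + n21 - (n12 - n21)^2/n)/n
d + c(-1, 1) * qnorm(0.975) * se
# roughly (0.0067, 0.0493)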

library(PropCIs)  # provides diffpropci.Wald.mp() and scoreci.mp()
diffpropci.Wald.mp(dat[2,1], dat[1,2],
                   sum(dat), 0.95)  # n21, n12, n, conf.level
## 
## 
## 
## data:  
## 
## 95 percent confidence interval:
##  0.006670404 0.049329596
## sample estimates:
## [1] 0.028
scoreci.mp(dat[2,1], dat[1,2],
           sum(dat), 0.95)
## 
## 
## 
## data:  
## 
## 95 percent confidence interval:
##  0.006972441 0.051656601

mcnemar.test R function

The built-in function mcnemar.test calculates a two-sided p-value using a chi-squared approximation:
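
Its test statistic is \(X^2 = (n_{12} - n_{21})^2 / (n_{12} + n_{21})\), which is compared to a \(\chi^2_1\) distribution; the hand calculation below confirms the reported value.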

mcnemar.test(dat,correct=FALSE)
## 
##  McNemar's Chi-squared test
## 
## data:  dat
## McNemar's chi-squared = 6.5333, df = 1, p-value = 0.01059
# Test statistic
X2 <- (22 - 8)^2/(22+8)
X2
## [1] 6.533333
# p-value
pchisq(X2, 1, lower.tail=FALSE)
## [1] 0.01058714
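
Setting correct = FALSE turns off mcnemar.test's default continuity correction. For comparison, the corrected statistic and an exact two-sided binomial p-value (both standard alternatives, sketched here) are:

# Continuity-corrected statistic used by mcnemar.test(correct = TRUE)
X2c <- (abs(22 - 8) - 1)^2/(22 + 8)
pchisq(X2c, 1, lower.tail = FALSE)   # somewhat larger p-value (about 0.018)
# Exact two-sided McNemar p-value from the binomial distribution
2 * pbinom(21, 30, 0.5, lower.tail = FALSE)   # about 0.016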

Wrong analysis

What if we had ignored the dependence and treated the data as two independent samples of 500 individuals each:

dat.indep <- matrix(c(72,58,428,442),2,2)
dimnames(dat.indep) <- list(Age = c("13","20"),
                            Asthma = c("Yes","No"))
dat.indep
##     Asthma
## Age  Yes  No
##   13  72 428
##   20  58 442
chisq.test(dat.indep,correct=FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  dat.indep
## X-squared = 1.733, df = 1, p-value = 0.188

The chi-squared test of independence uses a two-sided alternative, so the corresponding one-sided p-value would be

0.188/2
## [1] 0.094

Our conclusion is substantially different!
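
The disagreement comes from the strong positive correlation between the two measurements on the same child: ignoring it overstates the standard error of the difference in proportions. A quick sketch of the two standard errors:

# Standard error treating the two ages as independent samples (wrong here)
p13 <- 72/500; p20 <- 58/500
sqrt(p13*(1 - p13)/500 + p20*(1 - p20)/500)   # about 0.021
# Standard error from the matched-pairs (Wald) analysis
sqrt(22 + 8 - (22 - 8)^2/500)/500             # about 0.011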

Dependent Proportions: Logistic Regression

First, we need the data in long format: each row is a single observation on a single individual (so each individual has two rows).

dat.long <- data.frame(
    ID = rep(1:500, each = 2),
    Age = c(rep(c(1,2), 500)),  # 1 = age 13, 2 = age 20
    y = c(rep(c(1,1),50),  # Yes, Yes
          rep(c(1,0),22),  # Yes, No
          rep(c(0,1),8),   # No, Yes
          rep(c(0,0),420)) # No, No
)
head(dat.long)
##   ID Age y
## 1  1   1 1
## 2  1   2 1
## 3  2   1 1
## 4  2   2 1
## 5  3   1 1
## 6  3   2 1
tail(dat.long)
##       ID Age y
## 995  498   1 0
## 996  498   2 0
## 997  499   1 0
## 998  499   2 0
## 999  500   1 0
## 1000 500   2 0
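
As a quick check that the reshaping is correct, cross-tabulating the two observations per child should reproduce the paired table above (a sketch; it relies on the rows being ordered age 13 then age 20 within each ID):

with(dat.long, table(Age13 = y[Age == 1], Age20 = y[Age == 2]))
# should give the same counts as dat: 50, 22, 8, 420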

Marginal Model

library(gee)  # for gee()
mod.gee <- gee(y ~ Age, id = ID, family=binomial(link = "logit"),
               corstr = "exchangeable", data = dat.long)
## Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
## running glm to get initial regression estimate
## (Intercept)         Age 
##  -1.5340473  -0.2484098
summary(mod.gee)
## 
##  GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
##  gee S-function, version 4.13 modified 98/01/27 (1998) 
## 
## Model:
##  Link:                      Logit 
##  Variance to Mean Relation: Binomial 
##  Correlation Structure:     Exchangeable 
## 
## Call:
## gee(formula = y ~ Age, id = ID, data = dat.long, family = binomial(link = "logit"), 
##     corstr = "exchangeable")
## 
## Summary of Residuals:
##    Min     1Q Median     3Q    Max 
## -0.144 -0.144 -0.116 -0.116  0.884 
## 
## 
## Coefficients:
##               Estimate Naive S.E.   Naive z Robust S.E.  Robust z
## (Intercept) -1.5340473 0.17788632 -8.623751  0.17800595 -8.617955
## Age         -0.2484098 0.09661973 -2.571005  0.09679686 -2.566300
## 
## Estimated Scale Parameter:  1.002004
## Number of Iterations:  1
## 
## Working Correlation
##           [,1]      [,2]
## [1,] 1.0000000 0.7423729
## [2,] 0.7423729 1.0000000
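
On the odds-ratio scale, a rough interpretation of the Age coefficient using the robust (sandwich) standard error from the output above (a sketch; values copied from the summary):

est <- -0.2484098; se <- 0.09679686
exp(est)                                  # estimated marginal odds ratio, about 0.78
exp(est + c(-1, 1) * qnorm(0.975) * se)   # 95% Wald CI, about (0.65, 0.94)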

Generalized Linear Mixed Model

library(lme4)  # for glmer()
mod.glmm <- glmer(y ~ (1|ID) + Age, family=binomial(link = "logit"),
                  data = dat.long)
summary(mod.glmm)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: y ~ (1 | ID) + Age
##    Data: dat.long
## 
##      AIC      BIC   logLik deviance df.resid 
##    416.1    430.8   -205.1    410.1      997 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.68798 -0.00766 -0.00247 -0.00247  1.83701 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  ID     (Intercept) 345.1    18.58   
## Number of obs: 1000, groups:  ID, 500
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -7.4569     0.9161  -8.140 3.94e-16 ***
## Age          -2.2633     0.7187  -3.149  0.00164 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##     (Intr)
## Age -0.660
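
The Age coefficient from glmer (about -2.26) is subject-specific: it is the change in log-odds for a given child, conditional on that child's random intercept, so with a random-intercept variance this large it is much bigger in magnitude than the population-averaged GEE coefficient (about -0.25). As a rough sanity check, assuming the usual logit attenuation approximation \(\beta_{\text{marginal}} \approx \beta_{\text{conditional}} / \sqrt{1 + 0.346\,\sigma^2}\) (which is crude for variances this large):

# Approximate marginal slope implied by the GLMM fit
-2.2633 / sqrt(1 + 0.346 * 345.1)   # roughly -0.21, the same order as the GEE estimate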