--- title: "Randomization-based Inference in R" author: "Stacey Hancock" date: "1/23/2020" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` This R Markdown file accompanies the examples in the "Randomization-Based Inference for Contingency Tables" handout in Stat 439. Examples were taken from a handout by Allan Rossman at the Northwest Two-Year College Mathematics Conference, April 23, 2016. We need to first load the `mosaic` package: ```{r, results=FALSE, message=FALSE} library(mosaic) ``` The following R functions are in the `mosaic` package: `shuffle`, `prop`, `diffprop`, `do`, `histogram`, `oddsRatio`. The `mosaic` package also loads the `ggformula` package, which contains the R function `gf_bar`. ## Tagging Penguins Are metal bands used for tagging harmful to penguins? Researchers (Saraux et al., 2011) investigated this question with a sample of 20 penguins near Antarctica. All of these penguins had already been tagged with RFID chips, and the researchers randomly assigned 10 of them to receive a metal band on their flippers in addition to the RFID chip. The other 10 penguins did not receive a metal band. Researchers then kept track of which penguins survived for the 4.5-year study and which did not. The researchers found that 9 of the 20 penguins survived, of whom 3 had a metal band and 6 did not. First, input these data as a data frame where each row is one penguin. ```{r} penguins <- data.frame( band = factor(rep(c("metal","control"), each=10), levels=c("metal","control")), survive = factor(rep(c("yes","no","yes","no"), times=c(3,7,6,4)), levels=c("yes","no")) ) ``` Descriptive statistics: ```{r} # Create 2x2 table tab <- xtabs(~band+survive, data = penguins) tab # Sample proportions prop(~survive|band, data = penguins) # Visualizations gf_bar(~survive|band, data = penguins, position = position_dodge(), fill = ~survive) mosaicplot(tab, main = "Survival by Band") ``` We will conduct a randomization test to test if metal bands used for tagging are harmful to penguins, where "harmful" is measured by survival. ### Using difference in proportions as our summary statistic ```{r} # Observed difference in proportions obs <- diffprop(survive ~ band, data = penguins) # Calculate 1000 differences in proportions simulated # under the null hypothesis. nulldist <- do(1000)*diffprop(survive ~ shuffle(band), data = penguins) # Estimated p-value mean(nulldist >= obs) histogram(~ nulldist, groups=(nulldist >= obs), xlab="Simulated Difference in Proportions") ``` ### Using odds ratio or relative risk as our summary statistic ```{r} obs <- oddsRatio(tab) ## Note direction! See help file oddsRatio(tab, verbose = TRUE) nulldist <- do(1000)*oddsRatio(xtabs(~shuffle(band)+survive, data = penguins)) # Estimated p-value mean(nulldist >= obs) hist(nulldist$OR, xlab="Simulated Odds Ratio", breaks=20) abline(v=obs) ``` _See_ [Randomization-based inference using the `mosaic` package](https://cran.r-project.org/web/packages/mosaic/vignettes/Resampling.pdf) _by Kaplan, Horton, and Pruim for more information._ ## Fisher's Exact Test Let's perform Fisher's Exact Test on these data. First, examine the probability distribution of the count in the (1,1) cell under the assumption that the null hypothesis holds, and all marginal totals need to remain fixed. ```{r} probs <- dhyper(0:9, 9, 11, 10) obs.prob <- dhyper(3, 9, 11, 10) barplot(probs, names.arg = 0:9, xlab = "(1,1) Cell Count", main = "Probabilities under Null") abline(h=obs.prob, col="red") # p-value for one-sided alternative: # metal bands less likely to survive --> P(n11 <= 3) phyper(3, 9, 11, 10) # p-value for two-sided alternative # = probability of something as or less likely than observed sum(probs[probs<=obs.prob]) ``` R also has a built-in function to conduct Fisher's Exact Test: ```{r} fisher.test(tab, alternative="less") ```