---
title: "Chi-squared Tests of Independence"
author: "Stacey Hancock"
date: "1/28/2020"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r include=FALSE}
library(tidyverse)
```

There are two popular large-sample test statistics
used for testing independence between two
categorical variables: the _Pearson chi-squared statistic_,
$$
X^2 = \sum_{i,j} \frac{(n_{ij} - \mbox{exp}_{ij})^2}{\mbox{exp}_{ij}},
$$
and the _likelihood ratio test statistic_,
$$
G^2 = 2 \sum_{i,j} n_{ij}\log\left(\frac{n_{ij}}{\mbox{exp}_{ij}}\right).
$$
In each case, $\mbox{exp}_{ij}$ is called the "expected count" for the ($i$, $j$)th cell, and is computed by
$$
\mbox{exp}_{ij} = \frac{(\mbox{row $i$ total}) \times (\mbox{column $j$ total})}{n}.
$$
This is the value we would expect to
see in the ($i$, $j$)th cell _if the
two variables were independent_. Why? It's
the value that would make the distribution of
conditional proportions identical for each row (or column).

Each of these test statistics is a summary statistic
that attempts to measure how far away the observed counts
are from what we'd expect to see under the null hypothesis
of independence, summarized into a single value.
In an $I\times J$ table, for large samples (e.g., at least 5 in each cell), each of these statistics has an
approximate $\chi^2$ distribution with degrees of freedom
$(I - 1)\times(J-1)$ when $H_0$ holds.

### Example 1: Swedish Fish Consumption and Prostate Cancer
#### Data input as a table

Medical researchers followed 6272 Swedish men for 30 
years to see if there was an association between the 
amount of fish in their diet and prostate cancer 
(“Fatty Fish Consumption and Risk of Prostate Cancer,” Lancet, June 2001). 

Here are the data (in a 2x2 table):
```{r}
fish <- matrix(c(110,2420,2769,507,
                 14,201,209,42),
               nrow = 4, ncol = 2,
               dimnames = list(fish_consumption = c("never_seldom","small","moderate","large"),
                               prostate_cancer = c("no","yes")))
fish
```

Since we have a large sample (at least 5 in each cell),
we are going to use a chi-squared test of independence to
test the null hypothesis that fish consumption level and
incidence of prostate cancer are independent. First,
let's do the calculations "by hand".

We can calculate the expected counts using a bit
of matrix algebra. Specifically, calculate the outer product
between the vector of row sums and the vector of column sums.
```{r}
# Row sums
rowSums(fish)
# Column sums
colSums(fish)
# Sample size
sum(fish)
# Expected counts = (row total) x (col total)/6272
# Calculate using an outer product of row and col sums
exp_counts <- outer(rowSums(fish), colSums(fish))/sum(fish)
```

Now that we have our observed and expected counts,
we can calculate our chi-squared test statistic
and the p-value. With four rows and two columns, our
degrees of freedom are $(4 - 1)\times (2 - 1) = 3$.
```{r}
test_stat <- sum((fish - exp_counts)^2/(exp_counts))
test_stat
pval <- pchisq(test_stat, df = 3, lower.tail=FALSE)
pval
```
With this large of a p-value, we have little evidence
of an association between fish consumption level and
incidence of prostate cancer among Swedish men
similar to those recruited for the study.

Now, let's check our calculations using the
`chisq.test` function. If we assign the output
of this function to an object, we can then
extract many features of the test out of this object.
```{r}
fish_test <- chisq.test(fish, correct=FALSE)
attributes(fish_test)
```

The code below displays (in the following order):

- the default test output
- observed counts
- expected counts
- chi-squared test statistic
- p-value
- residuals
- standardized residuals

```{r}
fish_test
fish_test$observed
fish_test$expected
fish_test$statistic
fish_test$p.value
fish_test$residuals
fish_test$stdres
```

The residuals are the Pearson residuals:
$$
\frac{\mbox{observed} - \mbox{expected}}{\sqrt{\mbox{expected}}}
$$
Note that the sum of the squared residuals is equal to the
chi-squared test statistic.
```{r}
fish_test$statistic
sum(fish_test$residuals^2)
```
The standardized residuals are calculated as given
in the formula (2.5) in Section 2.4.5 of the Agresti textbook.
For large samples, under $H_0$, these standardized residuals
have an approximate standard normal distribution.
Thus, standardized residuals beyond -2 or 2 indicate
lack of fit with the independence assumption. Visually,
this can be seen in this mosaic plot:
```{r}
mosaicplot(fish, shade=TRUE)
```
For example, the Never/Seldom-No Prostate Cancer cell was slightly
lower than expected, and the Moderate-No Prostate Cancer
was slightly larger than expected. However,
since every standardized residual in this
table was between $-2.0$ and $2.0$, we
don't see much departure from what we'd
expect to see if the two variables were independent.

We can also compute the likelihood ratio test statistic.
```{r}
lrt_test_stat <- 2*sum(fish * log(fish/exp_counts))
lrt_test_stat
pchisq(lrt_test_stat, df = 3, lower.tail=FALSE)
```
With the large sample size, the values of the chi-squared
test statistic $X^2$ and the likelihood ratio test-statistic
$G^2$ are similar, and they give the same conclusion.

### Example 2: Nightlights and Nearsightedness
#### Data input as a data.frame

A survey of 479 children found that those who had
slept with a nightlight or in a fully lit room
before the age of 12 had a higher incidence
of nearsightedness (myopia) later in childhood
(_Sacramento Bee_, May 13, 1999, pp. A1, A18).
(Taken from Example 2.2 in Utts and Heckard, 5th ed.)

Import the raw data into R:
```{r}
eyesight <- read.csv("http://www.math.montana.edu/shancock/data/Nighlights_Nearsightedness.csv")
# Re-order ordinal factors (since R orders alphabetically)
eyesight$SleptWith = factor(eyesight$SleptWith,
                            levels = c("Darkness",
                                       "Nightlight",
                                       "Full Light" ))
```
The object `eyesight` should have appeared in your RStudio Environment. Click on it to view the data set.

A two-way table summarizing these data can be
created using `xtabs`.
```{r}
xtabs(~ SleptWith + Nearsightedness, data = eyesight)
```

First, let's visualize the data with a bar plot.
Note that the following code requires the `dplyr` and
`ggplot2` packages from the `tidyverse`, which should
have been loaded at the beginning of your .Rmd file.
```{r}
eyesight %>%
    ggplot(aes(x = SleptWith, fill = Nearsightedness)) +
    geom_bar(position = position_fill()) +
    labs(
        title = "Level of Myopia vs Sleeping Light Level",
        x = "Slept With", y = "Proportion"
    )
```

We can do a chi-squared test using the raw data frame with
the following syntax.
```{r}
chisq.test(eyesight$SleptWith, eyesight$Nearsightedness)
```

Note the warning! Why might the chi-squared approximation
be incorrect in this case? Hint: Look at the counts
in each cell. What other method would be more appropriate?

Fisher's Exact Test can be extended to two-way tables
of larger dimensions than $2\times 2$ using a
multivariate extension of the hypergeometric distribution.
```{r}
fisher.test(eyesight$SleptWith, eyesight$Nearsightedness)
```

We see that there is strong evidence that the level
of light slept with as a child has an association
with development of myopia later in life.

Even though the assumptions are violated for the
chi-squared test, the residuals still provide us
with information on where the dependence
between the two variables is strongest.
```{r}
mosaicplot(~ SleptWith + Nearsightedness,
           data = eyesight, shade = TRUE, las = 2,
           main = "Nearsightedness vs Nightlights")
```

We see that there are a lot fewer subjects
in the Myopia-Darkness cell than expected
under the assumption of independence, and
a lot more subjects in the Myopia-Full Light
cell than expected. Thus, it seems that
the probability of developing Myopia increases
with the level of light.