Lab 5 - Built-in R Functions for Inference
R Functions for Inference
We have been introduced to three built-in R functions that will conduct normal-based hypothesis tests and confidence intervals for particular scenarios:
R Function | Scenario(s) | Conditions |
prop.test |
|
At least 10 "successes" and 10 "failures" in each group |
chisq.test |
|
At least 5 in each cell of the table |
t.test |
|
Either the quantitative response variable(s) (or differences) need to be approximately normal, or your sample size(s) should be at least 30 (larger if the data are heavily skewed). |
We practiced using prop.test in Lab 4. In this lab, will be practice using the chisq.test and t.test functions. Look up their help files to get a sense of the inputs.
?chisq.test
?t.test
Chi-squared Tests
Goodness of fit test for a single categorical variable
In this section, we are going to practice summarizing, visualizing, and analyzing categorical data using the chisq.test function. We will start with one-way tables and move to two-way tables:
Are the tulip colors in a field equally distributed? Consider the tulip data below from a random sample of tulips in the field:
Color | Count |
Red | 123 |
White | 102 |
Yellow | 81 |
Let's summarize and visualize the data.
counts <- c(123, 102, 81)
n <- sum(counts)
# Sample proportions:
counts/n
barplot(counts/n, names.arg = c("Red","White","Yellow"), ylab="Proportion",
main = "Observed Distribution of Tulip Colors")
It seems that Red is most common, occurring 40% of the time, and yellow least common at 26%. Does this mean tulips in the entire field are not equally distributed? Or did this just happen by chance. Let's test the hypotheses
H0: pred = 1/3, pwhite = 1/3, pyellow = 1/3
Ha: At least two probabilities differ.
First, do the test calculations "by hand":
# Expected counts
expected <- n*c(1/3, 1/3, 1/3)
expected
# Chi-squared test statistic
stat <- sum( (counts - expected)^2/expected )
stat
# p-value
pchisq(stat, df = 2, lower.tail=FALSE)
With a p-value of 0.0132, there is fairly strong evidence that the tulip colors are not equally distributed (do not occur in the same proportions) across the entire field. Specifically how do the observed counts differ from what we'd expect to see under the null?
residuals <- (counts - expected)/sqrt(expected)
residuals
We can also plot the expected counts versus the observed counts in a bar plot.
barplot(rbind(counts, expected), names.arg = c("Red","White","Yellow"), beside=TRUE,
ylab="Count", legend.text=c("Observed","Expected"), col=c(3,5),
main="Observed versus Expected Counts")
There were more Red tulips than we would expect, and less White tulips than we would expect, if the colors were equally distributed. Note that the squared residuals add up to the chi-squared test statistic.
Note: The R function "rbind" stands for "row bind". It creates a matrix where each row is one argument in the function. When you give the "barplot" function a matrix, it plots the columns of the matrix for each category.
Exercise
Based on the tulip data above, perform a goodness-of-fit test to determine if the distribution pred = .40, pwhite = .35, pyellow = .25 is a reasonable fit to the data.
We can use the chisq.test function to check our calculations above:
chisq.test(counts, p=c(1/3,1/3,1/3), correct=FALSE)
chisq.test(counts, p=c(.4,.35,.25), correct=FALSE)
Chi-squared tests of independence for two categorical variables
Returning to the tulip example, suppose we wanted to know whether the distribution of tulip colors differed across three fields, A, B, and C. We already looked at data from field A. Here is a summary of the entire data set:
Field | |||
Color | A | B | C |
Red | 123 | 240 | 100 |
White | 102 | 200 | 35 |
Yellow | 81 | 163 | 47 |
Let's first visualize the data.
red <- c(123,240,100)
white <- c(102,200,35)
yellow <- c(81,163,47)
tulips <- rbind(red, white, yellow)
tulips
barplot(tulips, names.arg=c("A","B","C"), col=c("red","white","yellow"),
ylab="Frequency", xlab="Field", beside=TRUE,
main="Number of tulips by color and field")
Note that a barplot of counts is not really a fair comparison since the sample sizes differ across fields. Instead, we'd like to look at the conditional proportions within fields:
# Number of tulips collected in each field
colSums(tulips)
tulips.prop <- rbind(red/colSums(tulips), white/colSums(tulips), yellow/colSums(tulips))
barplot(tulips.prop, names.arg=c("A","B","C"), col=c("red","white","yellow"),
ylab="Conditional Proportions", xlab="Field", beside=FALSE,
main="Distribution of Colors by Field")
The tulip color distribution seems very similar between fields A and B, but C appears to have a larger number of red tulips. Is this due to chance, or do the distributions truly differ?
H0: Field and tulip color are independent.
Ha: Field and tulip color and dependent.
The chisq.test function will do our test calculations for us.
chisq.test(tulips, correct=FALSE)
chisq.test(tulips, correct=FALSE)$observed
chisq.test(tulips, correct=FALSE)$expected
These data provide evidence to suggest that the distribution of tulip color is dependent on field.
Exercise
Calculate the expected counts, the chi-squared test statistic, and the p-value for the previous example without using the chisq.test function.
t-Tests
For the following sections, let's use the anorexia data set from lecture. Read this data set into your R session:
Anorexia <- read.table("http://www.math.montana.edu/shancock/data/Anorexia.txt", header=TRUE)
Single Mean
The primary diagnosis for anorexia is a weight that is less than 85% of what is considered normal for that person's height and age. In this population of patients, researchers consider a patient anorexic if his or her weight is lower than 90 lbs. Is there evidence that this sample of patients is anorexic prior to treatment? Let's use the t.test to answer this question.
t.test(Anorexia$Before, alternative = "less", mu = 90, conf.level = 0.95)
The first argument to the t.test function is the vector of responses. We then specify the direction of the alternative ("less", "greater", or "two.sided" (default)), the null value, and a confidence level. (The t.test function output gives both a hypothesis test and a confidence interval for the parameter). The sample mean weight before treatment was 82.44 lbs, and we have strong evidence the true mean weight of the population from where we took this sample from is less than 90 lbs.
Note that with a one-sided alternative, R returns a "one-sided" confidence interval. If we want a two-sided confidence interval, we need to specify a two-sided alternative:
t.test(Anorexia$Before, alternative = "two.sided", mu = 90, conf.level = 0.95)
Paired Mean Difference
Let's consider only patients on the Family treatment.
Fam <- Anorexia[ Anorexia$Therapy == "Family", ]
Fam
Was the average weight gain in this group large enough to provide evidence that the Family treatment worked? We can use the t.test in two ways to answer this question. First, we could calculate the difference in weight (After - Before) for each patient, and then use the t.test function as if it's a single mean scenario with a zero null value.
t.test(Fam$Y, alternative="greater", mu = 0, conf.level = 0.95)
Or, we gave give the t.test the before weights and the after weights and tell R that the data are paired.
t.test(Fam$After, Fam$Before, alternative="greater", mu = 0, conf.level = 0.95, paired = TRUE)
The default null value is zero, and the default confidence level is 0.95, so we could have run the above code without those arguments.
t.test(Fam$After, Fam$Before, alternative="greater", paired = TRUE)
Difference in Means
To compare two means from independent groups, we give the t.test function the two vectors of responses as the first and second argument. We could specify "paired = FALSE" as an argument, but this is the default, so it is unnecessary. Here is the R code to calculate a hypothesis test of H0: "The true mean weight gain is equal between the Family and Control treatments" versus Ha: "The true mean weight gain is greater on the Family treatment than the Control". R will also output a confidence interval for the true difference in means (first argument - second argument).
Gains.Fam <- Anorexia$Y[ Anorexia$Therapy == "Family" ]
Gains.Cont <- Anorexia$Y[ Anorexia$Therapy == "Control" ]
t.test(Gains.Fam, Gains.Cont, alternative="greater")
Inference for quantitative data
For extra practice, work through the Inference for quantitative data lab by Andrew Bray, but instead of using his inference function (which is not a built-in R function - it loads with the .Rdata file for the lab), practice using the t.test function. If you are unable to load the data set from his lab page, use the following R command instead:
nc <- read.table("http://www.math.montana.edu/shancock/data/nc.txt", header=TRUE)