Load required libraries:

library(tidyverse)

In this case study, we will examine part of the analysis in Dalal et al. (1989) on O-ring data on shuttle launches prior to the 1986 Challenger shuttle explosion. Variables in this data set are:

Note that the Challenger launched at a temperature of 31 degrees F.

Copy and paste data from: http://www.math.montana.edu/shancock/data/Challenger_Data.html

Orings <- structure(list(
  Flight = structure(c(1L, 2L, 3L, 8L, 17L, 22L, 
    23L, 24L, 4L, 5L, 6L, 7L, 9L, 11L, 12L, 10L, 14L, 13L, 15L, 16L, 
    18L, 19L, 20L), .Label = c("1", "2", "3", "41-B", "41-C", "41-D", 
    "41-G", "5", "51-A", "51-B", "51-C", "51-D", "51-F", "51-G", 
    "51-I", "51-J", "6", "61-A", "61-B", "61-C", "61-I", "7", "8", 
    "9"), class = "factor"), 
  Date = structure(c(8L, 22L, 5L, 21L, 
    6L, 11L, 15L, 24L, 4L, 7L, 16L, 18L, 20L, 2L, 9L, 10L, 12L, 13L, 
    14L, 17L, 19L, 23L, 1L), .Label = c("01/12/86", "01/24/85", "01/28/86", 
    "02/03/84", "03/22/82", "04/04/83", "04/06/84", "04/12/81", "04/12/85", 
    "04/29/85", "06/16/83", "06/17/85", "07/29/85", "08/27/85", "08/30/83", 
    "08/30/84", "10/03/85", "10/05/84", "10/30/85", "11/08/84", "11/11/82", 
    "11/12/81", "11/26/85", "11/28/83"), class = "factor"), 
  Field = c(0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 3, 0, 0, 0, 0, 0, 0, 2, 0, 1), 
  Temp = c(66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78, 
    67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58), 
  Pres = c(50, 50, 50, 50, 50, 50, 100, 100, 200, 200, 200, 
    200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200)), 
  .Names = c("Flight", "Date", "Field", "Temp", "Pres"), 
  row.names = c(NA, 23L), class = "data.frame")

Let’s replicate the statistical analysis of scientists before the Challenger launch. First, they only looked at flights that experienced some thermal distress. Let’s subset the data:

Orings.fail <- Orings %>%
  filter(Field != 0)

They examined a plot of these data, and then fit a simple linear regression.

Orings.fail %>% ggplot(aes(x = Temp, y = Field)) +
  geom_point() + 
  ylim(0,6) +
  labs(x = "Temperature (F)",
       y = "Number of O-ring Failures",
       title = "Launches with at least one O-ring failure") +
  geom_smooth(method = "lm", se = FALSE)

mod1 <- lm(Field ~ Temp, data = Orings.fail)
summary(mod1)
## 
## Call:
## lm(formula = Field ~ Temp, data = Orings.fail)
## 
## Residuals:
##       1       2       3       4       5       6       7 
## -0.2690 -0.5991 -0.4467 -0.2690  1.2994  0.8580 -0.5737 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.04649    2.66929   1.141    0.305
## Temp        -0.02539    0.04160  -0.610    0.568
## 
## Residual standard error: 0.8315 on 5 degrees of freedom
## Multiple R-squared:  0.06934,    Adjusted R-squared:  -0.1168 
## F-statistic: 0.3726 on 1 and 5 DF,  p-value: 0.5683

With a p-value of 0.568, it was concluded that temperature had no significant effect on the predicted number of O-ring failures…. what is wrong with this analysis?

Let’s look at an appropriate analysis, including all of the past launches (not just those that experienced O-ring failure!).

Orings %>% ggplot(aes(x = Temp, y = Field)) +
  geom_point() + 
  ylim(0,6) +
  labs(x = "Temperature (F)",
       y = "Number of O-ring Failures",
       title = "All past launches") +
  geom_smooth(method = "lm", se = FALSE)
## Warning: Removed 13 rows containing missing values (geom_smooth).

mod2 <- lm(Field ~ Temp, data = Orings)
summary(mod2)
## 
## Call:
## lm(formula = Field ~ Temp, data = Orings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6582 -0.4389 -0.1594  0.1551  1.9058 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  4.79365    1.40926   3.402  0.00269 **
## Temp        -0.06266    0.02016  -3.108  0.00532 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6673 on 21 degrees of freedom
## Multiple R-squared:  0.3151, Adjusted R-squared:  0.2825 
## F-statistic: 9.661 on 1 and 21 DF,  p-value: 0.005321

Now, the p-value for the slope of a simple linear regression is 0.00532 – a very different conclusion!

But is simple linear regression appropriate? (Hint: Is the response variable normally distributed?)

For a binomial response variable, we use logistic regression.

mod3 <- glm(cbind(Field, 6) ~ Temp, family = binomial, data = Orings)
summary(mod3)
## 
## Call:
## glm(formula = cbind(Field, 6) ~ Temp, family = binomial, data = Orings)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9301  -0.7612  -0.5216  -0.1925   2.4573  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  5.20471    2.79211   1.864  0.06231 . 
## Temp        -0.11816    0.04368  -2.705  0.00683 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24.038  on 22  degrees of freedom
## Residual deviance: 16.386  on 21  degrees of freedom
## AIC: 34.564
## 
## Number of Fisher Scoring iterations: 5

Logistic regression models a transformation of the probability of failure.

Orings %>% ggplot(aes(x = Temp, y = Field/6)) +
  geom_point() + 
  ylim(0,1) +
  labs(x = "Temperature (F)",
       y = "Probability of O-ring Failure",
       title = "All past launches") +
    stat_function(fun = function(x){
        exp(mod3$coef[1] + mod3$coef[2]*x)/(1+exp(mod3$coef[1] + mod3$coef[2]*x))
    })

So what was the predicted probability an O-ring fails at a temperature of 31 degrees F?

pred.prob <- predict(mod3, newdata = data.frame(Temp = 31), type = "response")
pred.prob
##         1 
## 0.8237404

Thus, the probability of at least one O-ring failing is

sum(dbinom(1:6, 6, pred.prob))
## [1] 0.99997