--- title: "Model Selection with Horseshoe Crab Data" author: "Stacey Hancock" date: "3/25/2020" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) library(MASS) # For glm.nb function library(bestglm) # For automated model selection source("http://www.math.montana.edu/shancock/courses/stat439/R/binary.gof.R") ``` # Data set from Agresti, 2007 Study of nesting horseshoe crabs (J. Brockman, Ethology, 1996). Each female crab in the study had a male crab attached to her in her nest. The study investigated factors that affect whether the female crab had any other males, called satellites, residing nearby her. Explanatory variables thought possibly to affect this included the female crab's color, spine condition, weight, and carapace width. The response outcome for each female crab is her number of satellites. ### Variable descriptions - color: 1 - light medium, 2 - medium, 3 - dark medium, 4 - dark - spine: 1 - both good, 2 - one worn or broken, 3 - both worn or broken - width: carapace width in cm - satell: number of satellites - weight: weight in kg # Data import First, read in the data set and convert color and spine to factors: ```{r} crab <- read.table("http://www.math.montana.edu/shancock/data/horseshoe.txt",header=T) crab$colorF <- factor(crab$color, levels=c(1,2,3,4), labels=c("LM","M","DM","D")) crab$spineF <- factor(crab$spine, levels=c(1,2,3), labels=c("Good","Soso","Poor")) ``` # Research Goal Our goal is to determine, of the four predictors available, what best predicts the number of satellites. ### Model Selection Plan 1. Exploratory data analysis to examine relationships between number of satellites and each predictor individually. 2. Model selection via analysis of deviance (`anova`), forward/backward selection (`drop1`,`add1`,`step`), or best subsets (`bestglm`). 3. Check assumptions, diagnostics, and goodness-of-fit. 4. Write paragraph summarizing what the model tells us about the relationships between predictors and response. # Part I: Model Selection with Poisson Response Recall: Overdispersion was a problem with these count data. We will use Negative Binomial regression to address this issue. ```{r} mod1 <- glm.nb(satell ~ width, data = crab) summary(mod1) ``` ### Model Selection # Part I: Model Selection with Binomial Response Let's convert our response variable to an indicator variable of whether the female crab had satellites or not. ```{r} crab$satell_ind <- as.numeric(crab$satell > 0) ``` Note: Overdispersion is not a problem with binary data -- why? See the help file for the `bestglm` function -- it needs the response variable in the last column of the data set. ```{r} crab <- crab[, names(crab) != "satell"] names(crab) ``` The `bestglm` function will fit every combination of predictors (all subsets), and return the one with the minimum value of the information criteria you select. ```{r} best.AIC <- bestglm(crab, family = binomial, IC = "AIC") best.BIC <- bestglm(crab, family = binomial, IC = "BIC") ``` ### Model Selection