STAT 408 - Statistical Learning: Predictive Modeling

Exercise - Prediction for Capital Bike Share

# Load the Capital Bike Share data
bikes <- read.csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/Bike.csv')
set.seed(11142017)
num.obs <- nrow(bikes)
# Hold out a random 30% of the rows as the test set
test.ids <- base::sample(1:num.obs, size=round(num.obs*.3))
test.bikes <- bikes[test.ids,]
# Every row not sampled for the test set goes to the training set
train.bikes <- bikes[-test.ids,]
dim(bikes)

## [1] 10886    12

dim(test.bikes)

## [1] 3266   12

dim(train.bikes)

## [1] 7620   12
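The same 30/70 split is used again for the Titanic data later on, so it may help to see the pattern written once as a function. The sketch below is only an illustration: `split_train_test`, its arguments, and the `toy` data frame are hypothetical names that are not part of the course code.

```r
# Hypothetical helper (not from the original exercise): split a data frame
# into training and test sets by sampling row indices without replacement.
split_train_test <- function(data, test.frac = 0.3, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n <- nrow(data)
  test.ids <- sample(seq_len(n), size = round(n * test.frac))
  # setdiff() keeps this correct even if test.ids happens to be empty
  train.ids <- setdiff(seq_len(n), test.ids)
  list(train = data[train.ids, , drop = FALSE],
       test  = data[test.ids, , drop = FALSE])
}

# Small synthetic data frame so the sketch runs without downloading Bike.csv
toy <- data.frame(id = 1:10, y = (1:10)^2)
parts <- split_train_test(toy, test.frac = 0.3, seed = 1)
nrow(parts$test)   # 3 rows
nrow(parts$train)  # 7 rows
```

Because the test indices are drawn without replacement and the training indices are their complement, no row can appear in both sets.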

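# Baseline linear model for count, fit on the training set
lm.bikes <- lm(count ~ holiday + atemp,
               data=train.bikes)
# Mean absolute deviation (MAD) of the test-set predictions
lm.mad <- mean(abs(test.bikes$count -
                     predict(lm.bikes,test.bikes)))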

Create another predictive model and compare its test-set MAD with that of the linear model above (about 129). Do not use casual or registered as predictors, however, because those two columns sum to the total count.
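One possible way to set up the comparison is sketched below. To stay runnable without the download, it uses simulated data rather than the real Bike.csv. The helper `mad_error`, the `sim` data frame, and the `humidity` column are assumptions made for illustration; only `holiday`, `atemp`, and `count` come from the code above.

```r
# Mean absolute deviation between observed and predicted values
mad_error <- function(actual, predicted) mean(abs(actual - predicted))

# Simulated stand-in for the bike data (NOT the real Bike.csv).
# The humidity column is an assumed extra predictor.
set.seed(408)
n <- 500
sim <- data.frame(holiday  = rbinom(n, 1, 0.05),
                  atemp    = runif(n, 5, 40),
                  humidity = runif(n, 20, 100))
sim$count <- round(pmax(0, 50 + 6 * sim$atemp - 1.5 * sim$humidity +
                             rnorm(n, sd = 40)))

# Same 30% / 70% test / training split as in the exercise
sim.test.ids <- sample(seq_len(n), size = round(n * 0.3))
test.sim  <- sim[sim.test.ids, ]
train.sim <- sim[-sim.test.ids, ]

# Baseline model and a candidate model with one more predictor
lm.base <- lm(count ~ holiday + atemp, data = train.sim)
lm.more <- lm(count ~ holiday + atemp + humidity, data = train.sim)

# Compare both models on the same held-out rows
mad.base <- mad_error(test.sim$count, predict(lm.base, test.sim))
mad.more <- mad_error(test.sim$count, predict(lm.more, test.sim))
c(baseline = mad.base, candidate = mad.more)
```

With the exercise objects, the same helper could be called as `mad_error(test.bikes$count, predict(your.model, test.bikes))`, where `your.model` is whatever model you fit on `train.bikes`. Both models should always be scored on the same test set so the MAD values are comparable.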

Exercise: Predict Titanic Survival

library(dplyr)  # provides %>% and filter()
titanic <- read.csv(
  'http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/titanic.csv')
set.seed(11142017)
# Keep only passengers with a recorded age
titanic <- titanic %>% filter(!is.na(Age))
num.pass <- nrow(titanic)
# Hold out a random 30% of the passengers as the test set
test.ids <- base::sample(1:num.pass, size=round(num.pass*.3))
test.titanic <- titanic[test.ids,]
train.titanic <- titanic[-test.ids,]
dim(titanic)

## [1] 714  12

dim(test.titanic)

## [1] 214  12

dim(train.titanic)

## [1] 500  12

See if you can improve the classification error from the model below.

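# Baseline logistic regression: survival as a function of age only
glm.titanic <- glm(Survived ~ Age, data=train.titanic, family = binomial)
# Round predicted probabilities to 0/1; the error is the share of test passengers misclassified
Class.Error <- mean(test.titanic$Survived != round(predict(glm.titanic, test.titanic, type='response')))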

The logistic regression model that uses only age misclassifies about 40% of the passengers in the test set.
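A minimal sketch of how a richer model might be compared against the age-only baseline follows. It runs on simulated data, not the real titanic.csv. The helper `class_error`, the `sim` data frame, and the `Sex` column are assumptions for illustration; the 0.5 cutoff mirrors the `round()` call in the baseline code.

```r
# Share of cases where the 0/1 prediction (probability cut at 0.5) is wrong
class_error <- function(model, newdata, outcome) {
  prob <- predict(model, newdata, type = "response")
  mean(newdata[[outcome]] != as.numeric(prob > 0.5))
}

# Simulated stand-in for the Titanic data (NOT the real titanic.csv).
# Sex is an assumed column name used as an extra predictor.
set.seed(408)
n <- 400
sim <- data.frame(Age = round(runif(n, 1, 70)),
                  Sex = sample(c("female", "male"), n, replace = TRUE))
p.survive <- plogis(1 - 0.01 * sim$Age - 2 * (sim$Sex == "male"))
sim$Survived <- rbinom(n, 1, p.survive)

# Same 30% / 70% test / training split as in the exercise
sim.test.ids <- sample(seq_len(n), size = round(n * 0.3))
test.sim  <- sim[sim.test.ids, ]
train.sim <- sim[-sim.test.ids, ]

# Age-only baseline and a candidate model with one more predictor
glm.age  <- glm(Survived ~ Age, data = train.sim, family = binomial)
glm.more <- glm(Survived ~ Age + Sex, data = train.sim, family = binomial)

# Compare both models on the same held-out passengers
err.age  <- class_error(glm.age, test.sim, "Survived")
err.more <- class_error(glm.more, test.sim, "Survived")
c(age.only = err.age, age.plus.sex = err.more)
```

With the exercise objects, the call would look like `class_error(your.model, test.titanic, "Survived")`, where `your.model` is a model fit on `train.titanic`. Any predictor you add must exist as a column in the data and should have few missing values in both the training and test sets.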