STAT 408 - Week 4: Tidy Data, Data Manipulation, and Processing

The dataset

First, read in the data set, which is available at http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/BaltimoreTowing.csv.

baltimore.tow <- 
  read.csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/BaltimoreTowing.csv', 
           stringsAsFactors = FALSE)
# Drop the leading "$" from totalPaid and convert the remainder to a number
baltimore.tow$totalNumeric <- 
  as.numeric(substr(baltimore.tow$totalPaid, 
                    start = 2, stop = nchar(baltimore.tow$totalPaid)))
str(baltimore.tow)

## 'data.frame':    30263 obs. of  6 variables:
##  $ vehicleType      : chr  "Van" "Car" "Car" "Car" ...
##  $ vehicleMake      : chr  "LEXUS" "Mercedes" "Chysler" "Chevrolet" ...
##  $ vehicleModel     : chr  "" "" "Cirrus" "Cavalier" ...
##  $ receivingDateTime: chr  "10/24/2010 12:41:00 PM" "04/28/2015 09:27:00 AM" "07/23/2015 07:55:00 AM" "10/23/2010 11:35:00 AM" ...
##  $ totalPaid        : chr  "$322.00" "$130.00" "$280.00" "$1057.00" ...
##  $ totalNumeric     : num  322 130 280 1057 469 ...

Exercise: group_by()

Now use group_by() to compute the average towing cost for each vehicle type.
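One possible approach is sketched below on a small made-up data frame, so that it runs without downloading the full file. The name toy.tow and its values are invented for illustration; only the column names vehicleType and totalNumeric come from the data set above. The sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Toy stand-in for baltimore.tow (values invented for illustration)
toy.tow <- data.frame(
  vehicleType  = c("Car", "Car", "Van", "Van", "Truck"),
  totalNumeric = c(130, 280, 322, 100, 469),
  stringsAsFactors = FALSE
)

# Group by vehicle type, then average the towing cost within each group
avg.cost <- toy.tow %>%
  group_by(vehicleType) %>%
  summarize(meanCost = mean(totalNumeric, na.rm = TRUE))

avg.cost
```

The same pipeline should apply to the full data by replacing toy.tow with baltimore.tow.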

Goal 1: Vehicles Towed by Year

The first goal is to determine how many vehicles were towed for each year in the data set.

We don’t have a column for year, and the first observation for receiving date is “10/24/2010 12:41:00 PM”.
Describe a process for extracting the year from this string.
What R functions are you familiar with that might be useful here?

Exercise: Using the substr() function

Use the substr() function to extract year and create a new variable in R.

# baltimore.tow$Year <-
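As a hint, here is a minimal sketch on two example strings taken from the str() output above. It assumes every value follows the same zero-padded "MM/DD/YYYY HH:MM:SS AM" layout, so the year always occupies characters 7 through 10. The names dates and year are used only for this illustration.

```r
# Example strings in the same layout as receivingDateTime
dates <- c("10/24/2010 12:41:00 PM", "04/28/2015 09:27:00 AM")

# Characters 7 through 10 hold the four-digit year
year <- substr(dates, start = 7, stop = 10)
year
```

If some rows were not zero-padded, fixed positions would pick up the wrong characters, which is one reason to consider splitting on delimiters instead.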

Exercise: Using the strsplit() function

Now we can extract the year from the pieces of the date string stored in pieces.mat.

#baltimore.tow$Year <-
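The code that builds pieces.mat does not appear in this section, so the following is only one way such a matrix might be constructed. It is shown on two example strings, and the five-column layout (month, day, year, time, AM/PM) is an assumption of this sketch.

```r
# Example strings in the same layout as receivingDateTime
dates <- c("10/24/2010 12:41:00 PM", "04/28/2015 09:27:00 AM")

# Split on either "/" or a space: month, day, year, time, AM/PM
pieces <- strsplit(dates, split = "[/ ]")

# Stack the pieces into a matrix with one row per observation
pieces.mat <- matrix(unlist(pieces), ncol = 5, byrow = TRUE)

# Under this layout the year is the third column
year <- pieces.mat[, 3]
year
```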

Goal 2: Type of Vehicles Towed by Month

Next we wish to compute how many vehicles were towed in the AM and PM for each type of vehicle.

Before doing so, we want to take a closer look at the vehicle types in the data set and perhaps create more useful groups.

Messy Data: Data Cleaning

Spelling errors can be addressed by reassigning the misspelled vehicle makes to the correct spelling.

baltimore.tow$vehicleMake[baltimore.tow$vehicleMake == 
                            'Peterbelt'] <- 'Peterbilt'
baltimore.tow$vehicleMake[baltimore.tow$vehicleMake == 
                            'Izuzu'] <- 'Isuzu'
baltimore.tow$vehicleMake[baltimore.tow$vehicleMake == 
                            'Frightliner'] <- 'Freightliner'
baltimore.tow$vehicleMake[baltimore.tow$vehicleMake == 
                            'Internantional'] <- 'International'

Also note that many of the groupings have mis-classified vehicles, but we will not focus on that yet.

Exercise: Delete Misc. Type Vehicles

First we will delete golf carts, boats, and trailers. There are several ways to do this; consider making a new data frame called balt.tow.small that does not include golf carts, boats, and trailers.

balt.tow.small <-
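One of the several ways is logical indexing with %in%, sketched here on a toy data frame. The labels "Golf Cart", "Boat", and "Trailer" are assumptions; check unique(baltimore.tow$vehicleType) for the exact spellings used in the data. The names toy.tow, drop.types, and toy.small are for illustration only.

```r
# Toy stand-in for baltimore.tow (labels are assumed, not taken from the data)
toy.tow <- data.frame(
  vehicleType = c("Car", "Golf Cart", "Boat", "Van", "Trailer"),
  stringsAsFactors = FALSE
)

# Vehicle types to exclude (assumed spellings)
drop.types <- c("Golf Cart", "Boat", "Trailer")

# Keep only the rows whose type is NOT in the exclusion list
toy.small <- toy.tow[!(toy.tow$vehicleType %in% drop.types), , drop = FALSE]
toy.small
```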

Exercise: Create Additional Groups

Now we need to create a variable for the additional groups below.

Cars - (Car, convertible)
Large Cars - (SUV, Station Wagon, Sport Utility Vehicle, Van, Taxi)
Trucks - (Pick-up Truck, Pickup Truck)
Large Trucks - (Truck, Tractor Trailer, Tow Truck, Tractor, Construction Equipment, Commercial Truck)
Bikes - (Motor Cycle (Street Bike), Dirt Bike, All terrain - 4 wheel bike, Mini-Bike)

Solution: Create Additional Groups

One way to create groups is to add a new variable that maps each vehicle type to its group.
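A minimal sketch of this idea uses a named lookup vector built from the group definitions listed in the exercise. The type labels are copied as written in that list and may not match the capitalization in the data exactly; the names types, group.lookup, and group are illustrative.

```r
# A few example vehicle types (spelled as in the group list)
types <- c("Car", "SUV", "Pickup Truck", "Tow Truck", "Dirt Bike", "Taxi")

# Lookup: names are vehicle types, values are the new groups
group.lookup <- c(
  "Car" = "Cars", "convertible" = "Cars",
  "SUV" = "Large Cars", "Station Wagon" = "Large Cars",
  "Sport Utility Vehicle" = "Large Cars", "Van" = "Large Cars",
  "Taxi" = "Large Cars",
  "Pick-up Truck" = "Trucks", "Pickup Truck" = "Trucks",
  "Truck" = "Large Trucks", "Tractor Trailer" = "Large Trucks",
  "Tow Truck" = "Large Trucks", "Tractor" = "Large Trucks",
  "Construction Equipment" = "Large Trucks",
  "Commercial Truck" = "Large Trucks",
  "Motor Cycle (Street Bike)" = "Bikes", "Dirt Bike" = "Bikes",
  "All terrain - 4 wheel bike" = "Bikes", "Mini-Bike" = "Bikes"
)

# Index the lookup by type; types not in the lookup become NA
group <- unname(group.lookup[types])
group
```

On the full data, the same indexing could be applied to the vehicleType column, and any NA results would flag types that still need a group.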

Ready for Calculations?

First we need to extract the AM/PM tag from the time-date character string.

As the tag that we are looking for falls at the end of the string, we can use nchar() to find the length of the string.
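For instance, on two example strings from the data, the last two characters can be pulled out as in the sketch below. The names dates, n, and am.pm are for illustration only.

```r
# Example strings in the same layout as receivingDateTime
dates <- c("10/24/2010 12:41:00 PM", "04/28/2015 09:27:00 AM")

# Length of each string
n <- nchar(dates)

# The AM/PM tag is the final two characters
am.pm <- substr(dates, start = n - 1, stop = n)
am.pm
```

Because substr() is vectorized over start and stop, this also works when the strings differ in length.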

Solution: Aggregate

We could use aggregate() for this calculation.
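The original solution code is not reproduced here, so the following is only a sketch of how aggregate() could count tows by group and AM/PM. The toy data frame and the column names Group, AmPm, and n are hypothetical.

```r
# Toy data: one row per towed vehicle (values invented for illustration)
toy.tow <- data.frame(
  Group = c("Cars", "Cars", "Cars", "Trucks", "Trucks"),
  AmPm  = c("AM", "PM", "PM", "AM", "AM"),
  stringsAsFactors = FALSE
)

# Add a column of ones so that summing it counts the rows
toy.tow$n <- 1

# Sum n within each combination of Group and AmPm
counts <- aggregate(n ~ Group + AmPm, data = toy.tow, FUN = sum)
counts
```

Combinations that never occur in the data (here, Trucks in the PM) do not appear in the result.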

Solution: dplyr
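A dplyr version might look like the following sketch, which counts tows for each combination of group and AM/PM. The toy data frame and the column names Group and AmPm are hypothetical, and the dplyr package is assumed to be installed.

```r
library(dplyr)

# Toy data: one row per towed vehicle (values invented for illustration)
toy.tow <- data.frame(
  Group = c("Cars", "Cars", "Cars", "Trucks", "Trucks"),
  AmPm  = c("AM", "PM", "PM", "AM", "AM"),
  stringsAsFactors = FALSE
)

# Count the rows within each Group and AmPm combination
counts <- toy.tow %>%
  group_by(Group, AmPm) %>%
  summarize(n = n()) %>%
  ungroup()

counts
```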