Data Science On a Bicycle, Part 4: Not So Fast!

You’d think that after three episodes I would have had enough of the Bay Area BikeShare data. Wrong! As always, you can get the code on GitHub!

I was thinking about the two kinds of cyclists in the data set, customers and subscribers. The former, probably mostly tourists, get an access card for a few days, while the latter have a yearly subscription and presumably cycle regularly. So the subscribers should be in better shape than the customers, right? Let’s find out!

We calculate the pace of a trip between two stations as the duration of that trip divided by the average duration of all trips between the same two stations. A pace below 1 means the rider was faster than average on that route.
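
To make the definition concrete, here is a minimal sketch on made-up data. The station names and durations are invented for illustration and do not come from the BikeShare file; the point is only that the mean has to be taken per route, which is what a grouped mutate() does.

```r
library(dplyr)

# Made-up trips, for illustration only: two routes, three trips each.
toy <- data.frame(
  Start.Station = c("A", "A", "A", "B", "B", "B"),
  End.Station   = c("B", "B", "B", "C", "C", "C"),
  Duration      = c(300, 400, 500, 900, 1000, 1100),  # seconds
  stringsAsFactors = FALSE
)

toy.with.pace <- toy %>%
  group_by(Start.Station, End.Station) %>%
  # Inside a grouped mutate(), mean(Duration) is the mean of that route only
  mutate(Pace = Duration / mean(Duration)) %>%
  ungroup()

toy.with.pace
```

The 300-second trip on the A to B route ends up with a pace of 0.75, since the average on that route is 400 seconds, while every trip of exactly average length gets a pace of 1.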

library(readr)
library(ggplot2)
library(dplyr)
library(lubridate)
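# Read data
trip.data <- read_csv('201508_trip_data.csv.gz')
trips.with.pace <- trip.data %>% 
  # Make column names syntactic ("Start Station" becomes Start.Station)
  rename_with(make.names) %>%
  group_by(Start.Station, End.Station) %>%
  # mutate() respects the grouping, so the mean is taken per route;
  # transform() would divide by the overall mean instead
  mutate(Pace = Duration / mean(Duration)) %>%
  ungroup() %>%
  # Drop extreme outliers (more than three times the route average)
  filter(Pace < 3)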
ggplot(trips.with.pace, aes(Pace, fill=Subscriber.Type)) +
  geom_histogram(position = 'fill')

Unmistakably, subscribers are faster. So they’re in better shape, right? Not necessarily. They might simply know some shortcuts, or a customer, likely being a tourist, might take the scenic route between two stations more often. One way to (partially) answer this question is to look within the group of subscribers, for whom we know the zip code, and see whether pace differs between zip codes. Locals should know better shortcuts than outsiders.

Let’s first find out what the fastest zip code is. We want a reasonable number of trips to sample from and a small standard deviation of paces, so that we can make statements with some certainty. This means we have to filter our data.
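
Why insist on many trips per zip code? A rough sketch, using the textbook standard error of the mean (standard deviation divided by the square root of the sample size). The spread and the sample sizes below are invented for illustration, not taken from the data.

```r
# Hypothetical numbers, chosen only for illustration.
sd.pace <- 0.4             # assumed spread of paces within a zip code
samples <- c(30, 300, 3000) # assumed numbers of trips

# Standard error of the mean: shrinks with the square root of the sample size
std.error <- sd.pace / sqrt(samples)

data.frame(Samples = samples, Std.Error = round(std.error, 3))
```

With the same spread, ten times more trips cuts the uncertainty of the average by a factor of about three. The filter below keeps zip codes with more than 300 trips and a standard deviation below half the average pace.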

pace <- trips.with.pace %>%
  group_by(Zip.Code) %>%
  summarize(`Average Pace` = mean(Pace), `Sd Pace` = sd(Pace), `Samples` = n()) %>%
  filter(`Sd Pace` < `Average Pace` * 0.5, Samples > 300) %>%
  arrange(`Average Pace`) %>%
  head(10)
ggplot(pace, aes(x=Zip.Code, y=`Average Pace`)) +
  geom_bar(stat='identity', fill='steelblue4', color='black') +
  geom_errorbar(aes(ymin=`Average Pace` - `Sd Pace`,
                    ymax=`Average Pace` + `Sd Pace`),
                width=.3)

[Figure: pace_vs_zip, average pace by zip code]

The good people of Pleasanton (94588) seem to outperform the rest fairly consistently. Do they know a shortcut?

trips.with.pace %>%
  filter(Zip.Code == "94588") %>%
  group_by(Start.Station, End.Station) %>%
  summarize(Count=n(), `Average Pace`=mean(Pace)) %>%
  ungroup() %>%
  arrange(desc(Count)) %>%
  head() %>%
  print() # nifty!

Source: local data frame [5 x 4]

              Start.Station       End.Station Count
1         Steuart at Market   2nd at Townsend   169
2           2nd at Townsend Steuart at Market   138
3 Broadway St at Battery St Market at Sansome    79
4         Market at Sansome 2nd at South Park    79
5    Embarcadero at Sansome Steuart at Market    52
Variables not shown: Average Pace (dbl)

The route between Steuart at Market and 2nd at Townsend is so far from their home that it’s safe to assume that Pleasantonians don’t know a particularly good shortcut. The fastest route (you can convince yourself on Google Maps) is also the scenic one. However, the relatively large and almost equal number of trips in both directions could mean that one particularly adept cyclist from Pleasanton makes up most of the faster-than-average trips. Whoever you are, keep pushing on!
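
The single-cyclist hunch can’t be proven from trip counts alone, but one hint would be whether those trips cluster at the same hour of the day, as a regular commute would. Here is a rough sketch of such a check on made-up start times. The column name Start.Date and the month/day/year hour:minute format are assumptions about the trip file, so adjust them to whatever the data actually contains.

```r
library(dplyr)
library(lubridate)

# Made-up start times, for illustration only. The column name Start.Date and
# the "m/d/Y H:M" format are assumptions, not confirmed by the data.
toy <- data.frame(
  Start.Date = c("8/3/2015 8:05", "8/4/2015 8:10", "8/5/2015 8:02",
                 "8/5/2015 17:30", "8/6/2015 8:07"),
  stringsAsFactors = FALSE
)

trips.per.hour <- toy %>%
  # Parse the timestamp and keep only the hour of day
  mutate(Hour = hour(mdy_hm(Start.Date))) %>%
  # Count trips per hour, busiest hour first
  count(Hour, name = "Trips") %>%
  arrange(desc(Trips))

trips.per.hour
```

On the real data one would first filter to the Pleasanton zip code and the route in question. A sharp peak at one hour would be consistent with a single commuter, though it would not rule out several people on the same schedule.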