Data Science On a Bicycle, Part 2: Rich & Poor

After our first look at the Bay Area BikeShare data, it’s now time to dig a bit deeper. We will see that the data has many more stories to tell, for example about where the rich work and when the poor rise. And who knows, maybe I’ll be able to get some of the excitement I feel when crunching number across to you! Take some time to think about what questions you would like to ask the data and leave me a comment!

As last time, we have our trusty R by our side, together with dplyr, readr, lubridate, and ggplot2 from the Hadleyverse. I’ve updated the githup repository to get you started if you feel like playing with the data yourself.

More Data

One of the things you’ll always want to do after you’ve had your first look at a data set is enhancing it, finding some other data that you can combine with your original set to get additional information. We’ll use the median household income per zip code from the US census bureau. We have Zip code of the customers for may of the trips in our trips CSV file. So let’s join them with the median income data.

library(readr)
library(ggplot2)
library(dplyr)
library(lubridate)
# Read data
trip_data <- read_csv('201508_trip_data.csv.gz')
# Convert data columns
trip_data$`Start Date` <-
  parse_date_time(trip_data$`Start Date`, orders = c(&quot;mdyHM&quot;))
trip_data$`End Date` <-
  parse_date_time(trip_data$`End Date`, orders = c(&quot;mdyHM&quot;))
# Load census data and join it with the trip data
census_data <- read_csv('DEC_00_SF3_HCT012.csv.gz')
trips_with_demographics <- trip_data %<%
  inner_join(census_data, c(&quot;Zip Code&quot; = &quot;GEO.id2&quot;))

Note that the column names don’t match up, which is why we have to use the second argument to

inner_join

. Now what can we do with this new data frame? We could ask where people from a neighborhood with high or low median household income go.

# Calculate per-destination mean income
# NB: 'mean' is the mean of the median household income
trips_by_income <- trips_with_demographics %<%
  filter(!is.na(`Total`)) %<%
  group_by(`End Station`) %<%
  summarize(`Mean Income` = mean(Total),
            `Trip Count` = n()) %<%
  arrange(desc(`Mean Income`))
# Calculate and plot where people from areas
# with high/low median household income go
rich_destinations <- trips_by_income %<% head(20)
poor_destinations <- trips_by_income %<% tail(20)
ggplot(rich_destinations, aes(x=`End Station`, y=`Mean Income`)) +
  geom_bar(stat='identity', fill='steelblue') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle('Trip Destinations, High Income')
ggplot(poor_destinations, aes(x=`End Station`, y=`Mean Income`)) +
  geom_bar(stat='identity', fill='steelblue') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle('Trip Destinations, Low Income')

R never ceases to amaze me. In a few dozen lines of code, you go from raw CSV to pretty looking plots. Another thing to be mentioned here is that, data cleaning and obtaining sets to enhance your data aside, 90% of EDA is grouping and summarizing the right way. The plots look like this:

high_income_destinations low_income_destinations

Now look at that! From the lowest average income to the highest, we have a factor of two. This is a big effect and probably says a lot about those destinations. Now I’m not very familiar with the bay area’s demographics, but let’s have a closer look anyway. A popular destination might be a place where someone works, or shops, but for some places a high number of trips ending there will be explained by residents arriving. Let’s look at “Grant Avenue at Columbus Avenue”, a popular low-income destination, and just around the corner of “The Stinking Rose”, a garlic-themed restaurant that I visited a while back and would warmly recommend (seriously). Now the restaurant aside, the area is clearly residential, so looking at the zip codes should reveal that people arriving there are predominantly residents (plus a few garlic enthusiasts).

# Extract and plot zip codes of people goint to Grant/Columbus
grant_ave_zips <- trip_data %<%
  filter(`End Station` == &quot;Grant Avenue at Columbus Avenue&quot;) %<%
  group_by(`Zip Code`) %<%
  summarise(`Trip Count` = n()) %<%
  arrange(desc(`Trip Count`)) %<%
  head(10)
ggplot(grant_ave_zips, aes(x=`Zip Code`, y=`Trip Count`)) +
  geom_bar(stat='identity', fill='steelblue') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle('Grant Avenue Arrivals, by Zip Code')

The plot supports our suspicion.

grant_ave_arrivals

So how can we filter out residents? How about we look at trips in the morning, say, before 11. People should be on their commute at that time. Shall we find out where high earners work?

# Let's now look at the early trips, which should end where
# people work or switch to different means of transportation ...
early_trips_top_income <- trips_with_demographics %<%
  filter(hour(`End Date`) < 11) %<%
  group_by(`End Station`) %<%
  summarize(`Mean Income` = mean(Total),
            `Trip Count` = n()) %<%
  arrange(desc(`Mean Income`)) %<%
  filter(!is.na(`Mean Income`)) %<%
  head(10)
ggplot(early_trips_top_income, aes(x=`End Station`, y=`Mean Income`)) +
  geom_bar(stat='identity', fill='steelblue') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle(&quot;Popular Destinations Before 11am, High Income&quot;)

popular_early_high_income

Hello high income, hello Palo Alto, hi Mountain View! The biggest (positive) surprise here is that the silicon valley elite seems to like cycling.

One last thing. The saying goes that the early bird catches the worm. But how well does the early bird earn? Let’s plot average income vs. trip time and marvel one last time how easy our life is in R.

# Do early risers earn more? Let's find out!
trips_for_plot <- trips_with_demographics %<%
  filter(hour(`Start Date`) <= 5, !is.na(Total)) %<%
  transmute(Income = Total,
            Hour=hour(`Start Date`) +
             round(minute(`Start Date`) / 10) / 6) %<%
  group_by(Hour) %<%
  summarize(`Mean Income`=mean(Income), `Trip Count`=n())
ggplot(trips_for_plot, aes(x=`Hour`, y=`Mean Income`)) +
  geom_path(colour='steelblue') +
  ggtitle(&quot;Mean Income vs. Trip Time&quot;)

time_vs_income

It seems like the saying falls short. The early bird is actually not that well off. The sweet spot seems to be around 8am to 9am. Also, high-income seems to favor finishing work before 6pm. Every hour after that seems to cost you a few thousand dollars of yearly income. Let’s find out how much the next episode! I hope you enjoyed my rambling and will stay tuned.