After our first look at the Bay Area BikeShare data, it’s now time to dig a bit deeper. We will see that the data has many more stories to tell, for example about where the rich work and when the poor rise. And who knows, maybe I’ll be able to get some of the excitement I feel when crunching numbers across to you! Take some time to think about what questions you would like to ask the data and leave me a comment!
As last time, we have our trusty R by our side, together with dplyr, readr, lubridate, and ggplot2 from the Hadleyverse. I’ve updated the GitHub repository to get you started if you feel like playing with the data yourself.
One of the things you’ll always want to do after you’ve had your first look at a data set is to enhance it: find some other data that you can combine with your original set to get additional information. We’ll use the median household income per zip code from the US Census Bureau. Our trips CSV file contains the customer’s zip code for many of the trips, so let’s join them with the median income data.
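To make the join concrete, here is a minimal sketch on toy data. The data frames, their column names (`zip_code`, `zip`, `median_income`), and all values are made up for illustration and are not the names used in the real files; the choice of `inner_join` is also an assumption.

```r
library(dplyr)

# Toy stand-ins for the trips CSV and the census table.
# Column names and values are hypothetical placeholders.
trips <- tibble(
  trip_id     = 1:4,
  end_station = c("A", "A", "B", "B"),
  zip_code    = c("11111", "22222", "11111", NA)
)
income <- tibble(
  zip           = c("11111", "22222"),
  median_income = c(50000, 100000)
)

# The key columns have different names in the two tables,
# so `by` maps trips$zip_code onto income$zip.
trips_income <- trips %>%
  inner_join(income, by = c("zip_code" = "zip"))
```

An inner join drops trips without a usable zip code, which is usually what you want here since those rows cannot contribute an income value.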
Note that the column names don’t match up, which is why we have to use the second argument to the join function. Now what can we do with this new data frame? We could ask where people from a neighborhood with high or low median household income go.
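That question boils down to grouping the joined trips by destination and averaging the income. The following is a sketch under assumptions: `trips_income`, `end_station`, and `median_income` are hypothetical names, and the values are toy data, not results.

```r
library(dplyr)
library(ggplot2)

# Toy joined data: one row per trip, with the rider's zip-level median income.
trips_income <- tibble(
  end_station   = c("A", "A", "B", "B", "C"),
  median_income = c(50000, 70000, 90000, 110000, 80000)
)

# Average income of arriving riders per destination, highest first.
station_income <- trips_income %>%
  group_by(end_station) %>%
  summarise(avg_income = mean(median_income), n_trips = n()) %>%
  arrange(desc(avg_income))

# Horizontal bar chart, stations ordered by average income.
p <- ggplot(station_income,
            aes(x = reorder(end_station, avg_income), y = avg_income)) +
  geom_col() +
  coord_flip() +
  labs(x = "End station", y = "Average median household income (USD)")
```

Keeping `n_trips` alongside the average makes it easy to filter out stations with only a handful of trips before plotting.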
R never ceases to amaze me. In a few dozen lines of code, you go from raw CSV to pretty-looking plots. It’s also worth mentioning that, data cleaning and finding data sets to enhance your data aside, 90% of EDA is grouping and summarizing the right way. The plots look like this:
Now look at that! From the lowest average income to the highest, we have a factor of two. This is a big effect and probably says a lot about those destinations. Now I’m not very familiar with the Bay Area’s demographics, but let’s have a closer look anyway. A popular destination might be a place where someone works, or shops, but for some places a high number of trips ending there will be explained by residents arriving. Let’s look at “Grant Avenue at Columbus Avenue”, a popular low-income destination, and just around the corner from “The Stinking Rose”, a garlic-themed restaurant that I visited a while back and would warmly recommend (seriously). Now the restaurant aside, the area is clearly residential, so looking at the zip codes should reveal that people arriving there are predominantly residents (plus a few garlic enthusiasts).
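Checking this amounts to counting rider zip codes for a single destination. Here is a rough sketch; the station label "X", the column names, and the zip codes are all placeholders rather than real data.

```r
library(dplyr)

# Toy trips; "X" stands in for the station of interest.
trips <- tibble(
  end_station = c("X", "X", "X", "Y"),
  zip_code    = c("11111", "11111", "22222", "33333")
)

# Number of arrivals at the station per rider zip code, most common first.
zip_counts <- trips %>%
  filter(end_station == "X") %>%
  count(zip_code, sort = TRUE)
```

If the station is mostly used by residents, the station's own zip code should dominate the top of `zip_counts`.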
The plot supports our suspicion.
So how can we filter out residents? How about we look only at trips in the morning, say, before 11am? People should be on their commute at that time. Shall we find out where high earners work?
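A sketch of the morning filter, using lubridate’s `hour()` on the trip timestamp. Using the start time (rather than the end time), the column names, and the toy values are all assumptions made for illustration.

```r
library(dplyr)
library(lubridate)

# Toy joined data with a parsed start timestamp.
trips_income <- tibble(
  end_station   = c("A", "A", "B", "B"),
  start_date    = ymd_hm(c("2014-09-01 08:15", "2014-09-01 17:30",
                           "2014-09-01 09:00", "2014-09-01 10:59")),
  median_income = c(60000, 40000, 100000, 80000)
)

# Keep only trips that start before 11am, then average income per destination.
morning_income <- trips_income %>%
  filter(hour(start_date) < 11) %>%
  group_by(end_station) %>%
  summarise(avg_income = mean(median_income), n_trips = n()) %>%
  arrange(desc(avg_income))
```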
Hello high income, hello Palo Alto, hi Mountain View! The biggest (positive) surprise here is that the Silicon Valley elite seems to like cycling.
One last thing. The saying goes that the early bird catches the worm. But how well does the early bird earn? Let’s plot average income against the time of day of the trip and marvel one last time at how easy our life is in R.
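One way to get there is to bucket trips by hour of day and average the income per bucket. Again a sketch on toy data: bucketing by start hour and the column names are assumptions, and the numbers are invented.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Toy joined data; incomes are made up.
trips_income <- tibble(
  start_date    = ymd_hm(c("2014-09-01 06:10", "2014-09-01 08:20",
                           "2014-09-01 08:40", "2014-09-01 18:05")),
  median_income = c(50000, 80000, 100000, 70000)
)

# Average income per hour of day (0-23).
hourly_income <- trips_income %>%
  mutate(start_hour = hour(start_date)) %>%
  group_by(start_hour) %>%
  summarise(avg_income = mean(median_income))

# Line plot of average income over the day.
p <- ggplot(hourly_income, aes(x = start_hour, y = avg_income)) +
  geom_line() +
  geom_point() +
  labs(x = "Hour of day", y = "Average median household income (USD)")
```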
It seems like the saying falls short. The early bird is actually not that well off. The sweet spot seems to be around 8am to 9am. Also, high earners seem to favor finishing work before 6pm. Every hour after that seems to cost you a few thousand dollars of yearly income. Let’s find out how much in the next episode! I hope you enjoyed my rambling and will stay tuned.