We’re back with the third part of a series of posts in which I try to convey what a pleasure it is to let data tell you stories when you have the right tools and know how to use them. We work with R, using dplyr, readr, lubridate, and ggplot2. There is a GitHub repository with all the code should you wish to follow along.
We are looking at a data set of bike trips around the Bay Area. In part 1, we looked at where people go at what times, and in part 2 we combined our data with the median household incomes of the ZIP codes of some of the cyclists. Let’s load our data once more.
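A minimal sketch of the loading step. The file and column names here are assumptions (the repository has the actual paths); a two-row stand-in CSV is written first so the snippet runs on its own.

```r
library(readr)
library(dplyr)
library(lubridate)

# Stand-in CSV so the snippet is self-contained; in the post, the real
# trip data (already joined with median incomes) is loaded instead.
writeLines(c("start_date,Median Income",
             "2014-03-03 08:12:00,72000",
             "2014-03-03 17:45:00,91000"),
           "trips_demo.csv")

trips <- read_csv("trips_demo.csv",
                  col_types = cols(start_date = col_character())) %>%
  mutate(start_date = ymd_hms(start_date),  # parse timestamps
         Hour = hour(start_date))           # hour of day the trip started
```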
Last time, we saw that the median income varies quite a bit with the time the trip was made. Let’s have a look at it again.
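The plot from last time can be sketched roughly like this, assuming a `trips` data frame with an `Hour` and a `Median Income` column (synthetic values stand in for the real data here so the snippet runs standalone):

```r
library(dplyr)
library(ggplot2)

# Synthetic stand-in data; the real `trips` frame comes from the loading step.
set.seed(42)
trips <- tibble(Hour = rep(0:23, each = 20),
                `Median Income` = 60000 + rnorm(24 * 20, sd = 8000))

# One box of incomes per hour of day the trip started.
p <- ggplot(trips, aes(x = factor(Hour), y = `Median Income`)) +
  geom_boxplot() +
  labs(x = "Hour of trip start", y = "Median household income (USD)")
```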
The nine-to-five white-collar workers stick out somewhat (okay, it’s more like eight-to-four, but let’s not split hairs). The less well-off tend to start their commute earlier, and their trips spread out into later hours. But how much more sleep comes with a higher income? Let’s find out. We’ll fit a linear model to the data between, say, 5am and 9am and look at the coefficients. In times of deep learning beating humans at Go and gradient boosting winning competitions left and right, I feel that good old linear models get too little attention. Yes, they might not perform that well in many cases, but we get coefficients in units we can easily understand, and P-values which, despite all the controversy, give us some idea about the legitimacy of what we do.
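The morning fit looks something like the following. The formula and the name `morning_trips` match the output below; the `trips` frame is synthetic here (with a built-in slope of 1000 dollars per hour) so the snippet runs standalone:

```r
library(dplyr)

# Synthetic stand-in for the real trip data, with a known income-vs-hour slope.
set.seed(1)
trips <- tibble(Hour = sample(0:23, 1000, replace = TRUE)) %>%
  mutate(`Median Income` = 50000 + 1000 * Hour + rnorm(n(), sd = 5000))

# Restrict to the morning commute window and regress income on start hour.
morning_trips <- trips %>% filter(Hour >= 5, Hour <= 9)
fit <- lm(`Median Income` ~ Hour, data = morning_trips)
summary(fit)
coef(fit)[["Hour"]]  # dollars of median income per hour of start time
```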
Call:
lm(formula = `Median Income` ~ Hour, data = morning_trips)

Residuals:
   Min     1Q Median     3Q    Max
-62541 -15982   -470  14990 128412

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 52556.58     563.26   93.31   <2e-16 ***
Hour         1081.33      66.72   16.21   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21830 on 112855 degrees of freedom
Multiple R-squared:  0.002322,	Adjusted R-squared:  0.002314
F-statistic: 262.7 on 1 and 112855 DF,  p-value: < 2.2e-16
The output of R’s linear model scares some people, but it’s actually quite easy to interpret. In our case, it says that median income rises by roughly 1,100 dollars (with a standard error of about 70 dollars) for every hour later the morning commute starts. Put the other way around: an hour more sleep comes with about 1,100 dollars more median income. R is also very sure that this is not a random fluctuation in the data; it estimates the probability that there is no effect at all to be virtually zero. However, the R-squared is about as bad as it gets (the start hour explains only about 0.2% of the variance in income), so we really shouldn’t have made this fit. But since this post is more educational than scientific, let’s go with what we’ve got for now.
So how much longer do you have to stay at work if you’re not well off? Let’s repeat our analysis for the afternoon.
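The afternoon version is the same fit over a later window. The name `afternoon_trips` matches the output below, but the 3pm-to-7pm cutoffs are an assumption (the post doesn’t state them), and `trips` is again synthetic so the snippet runs standalone:

```r
library(dplyr)

# Synthetic stand-in data with income falling as the start hour gets later,
# mimicking the direction of the effect reported in the post.
set.seed(1)
trips <- tibble(Hour = sample(0:23, 1000, replace = TRUE)) %>%
  mutate(`Median Income` = 90000 - 1700 * Hour + rnorm(n(), sd = 5000))

# Assumed afternoon commute window: 3pm to 7pm.
afternoon_trips <- trips %>% filter(Hour >= 15, Hour <= 19)
fit <- lm(`Median Income` ~ Hour, data = afternoon_trips)
summary(fit)
```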
Call:
lm(formula = `Median Income` ~ Hour, data = afternoon_trips)

Residuals:
   Min     1Q Median     3Q    Max
-66309 -14323      3  14477 127949

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  92084.4      808.1  113.95   <2e-16 ***
Hour         -1707.0       45.7  -37.35   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21520 on 133114 degrees of freedom
Multiple R-squared:  0.01037,	Adjusted R-squared:  0.01036
F-statistic:  1395 on 1 and 133114 DF,  p-value: < 2.2e-16
R thinks that for every 1,700 dollars more income (with a standard error of about 50 dollars), you’ll be home an hour earlier. The tiny P-value again says that there is virtually no chance that the effect is random.
How do we interpret this? It would be wrong to conclude that people from low-income areas work longer hours. Most likely, the effect is caused by white-collar workers having a more regular work schedule, while the commuting times of lower-paid workers are simply spread out more. Of course we can’t be sure, but it seems a reasonable explanation of what we see. But that’s the thing with data: it can support your hypothesis, but never really prove it. Maybe we’ll dig a little deeper next time.