We’re back with the third part of a series of posts in which I try to convey what a pleasure it is to let data tell you stories when you have the right tools and know how to use them. We work with R, using dplyrreadrlubridate, and ggplot2. There is a githup repository with all the code should you wish to follow along.

We are looking at a data set of bike trips around the bay area. In part 1, we looked at where people go at what times and in part 2 we combined our data with the median household incomes of the zip codes of some of the cyclists. Let’s load our data once more.

library(readr)
library(ggplot2)
library(dplyr)
library(lubridate)
# Read data
trip_data <- read_csv('201508_trip_data.csv.gz')
# Convert data columns
trip_data$`Start Date` <- parse_date_time(trip_data$`Start Date`, orders = c("mdyHM"))
trip_data$`End Date` <- parse_date_time(trip_data$`End Date`, orders = c("mdyHM"))
# Load census data and join it with the trip data
census_data <- read_csv('DEC_00_SF3_HCT012.csv.gz')
trips_with_demographics <- trip_data %>%
  inner_join(census_data, c("Zip Code" = "GEO.id2"))

Last time, we saw that the median income varies quite a bit with the time the trip was made. Let’s have a look at it again.

# Clean up the data a little ...
income_by_start_hour <- trips_with_demographics %>%
  transmute(`Median Income`=Total,
            Hour = hour(`Start Date`) +
              minute(`Start Date`) / 60) %>%
  na.omit()
# And plot it.
ggplot(income_by_start_hour, aes(x=Hour, y=`Median Income`)) +
  geom_smooth()

income_vs_time_smoothWe have the nine-to-five white collar workers stick out somewhat (okay, it’s more like eight-to-four, but let’s not split hairs). The less well-off tend to start their commute earlier, and their trips spread out to later hours. But how much more sleep comes with a higher income? Let’s find out. We’ll fit a linear model to the data between, say, 5am and 9am and look at the coefficients. In the times of deep learning beating humans at Go and gradient boosting winning competitions left and right, I feel that good old linear models get too little attention. Yes, they might not perform that well in many cases, but we get coefficients in units we can easily understand and P-values which, besides all controversy give us some idea about the legitimacy of what we do.

morning_trips <- income_by_start_hour %>%
  filter(Hour > 4, Hour < 10)
morning_model <- lm(`Median Income`~Hour, morning_trips)
print(summary(morning_model))
Call:
lm(formula = `Median Income` ~ Hour, data = morning_trips)

Residuals:
   Min     1Q Median     3Q    Max 
-62541 -15982   -470  14990 128412 

Coefficients:
            Estimate Std. Error t value Pr(>\|t\|)    
(Intercept) 52556.58     563.26   93.31   <2e-16 ***
Hour         1081.33      66.72   16.21   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21830 on 112855 degrees of freedom
Multiple R-squared:  0.002322,	Adjusted R-squared:  0.002314 
F-statistic: 262.7 on 1 and 112855 DF,  p-value: < 2.2e-16

The output of R’s linear model scares some people, but it’s actually quite easy to interpret. In our case, it says that you get an hour more sleep for every 1100 dollar more median income, with an error of 70 or so dollars. R is very sure that this not a random fluctuation in the data, it estimates the probability that there is no effect at all to be virtually zero. However, the R squared is really as bad as it gets, so we really shouldn’t have made this fit. But since this post is more educational than scientific, let’s go with what we’ve got for now.

So how much longer do you have to stay at work if you’re not well off? Let’s repeat our analysis for the afternoon.

afternoon_trips &lt;- income_by_start_hour %&gt;%
  filter(Hour &gt; 15, Hour &lt; 21)
afternoon_model &lt;- lm(`Median Income`~Hour, afternoon_trips)
print(summary(afternoon_model))
Call:
lm(formula = `Median Income` ~ Hour, data = afternoon_trips)

Residuals:
   Min     1Q Median     3Q    Max 
-66309 -14323      3  14477 127949 

Coefficients:
            Estimate Std. Error t value Pr(>\|t\|)    
(Intercept)  92084.4      808.1  113.95   <2e-16 ***
Hour         -1707.0       45.7  -37.35   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21520 on 133114 degrees of freedom
Multiple R-squared:  0.01037,	Adjusted R-squared:  0.01036 
F-statistic:  1395 on 1 and 133114 DF,  p-value: < 2.2e-16

R thinks that for every 1700 dollars more income (with an error of about 50 dollars), you’ll be home an hour earlier. The P-value being tiny says again that there is virtually no chance for the effect being random.

How do we interpret this? It would be wrong to conclude that people from low-income areas work longer hours. Most likely the effect is caused by white collar workers having a more regular work schedule, while the commuting times of lower-paid workers are simply spread out more. Of course we can’t be sure, but it seems a reasonable explanation of what we see. But that’s the thing with data. It can support your hypothesis, but never really prove it. Maybe we’ll dig a little deeper next time.