We have talked about the BABS open data set many times before. It lists bike trips in the San Francisco Bay Area, with start and end points, date, time, and some extra information about the rider. In this episode we look at some basic clustering and at some surprising results from this well-known data set. The plan is to find classes of typical days in terms of bike usage. One would expect, for example, different usage patterns on weekdays and weekends, and we will discover some fun things beyond these basics as we go along. Let’s dive right in.

## Some Data Processing

We start off by dissecting our data set into dates, hours, and trip counts. We can do this with one line of Python code, using the pandas library. Let me know if you want a future post to give an introduction to pandas. For now, all we need is the groupby function and some housekeeping.
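The original snippet is not reproduced here, so the following is only a minimal sketch of such a groupby chain. The tiny `trips` frame and its `start_date` column are hypothetical stand-ins for the real trip table, and the step numbers in the comments are my guess at how the four steps map onto the chain.

```python
import pandas as pd

# Hypothetical stand-in for the trip table: one row per trip.
# The column name "start_date" is an assumption, not taken from the data set.
trips = pd.DataFrame({
    "start_date": pd.to_datetime([
        "2014-09-01 00:05", "2014-09-01 00:17", "2014-09-01 00:48",
        "2014-09-01 03:30", "2014-09-01 04:02", "2014-09-01 04:40",
    ]),
})

# Housekeeping: split the timestamp into a date and an hour of day.
trips["date"] = trips["start_date"].dt.date
trips["hour"] = trips["start_date"].dt.hour

hourly = (
    trips.groupby(["date", "hour"])  # (#1) group rows sharing date and hour
         .size()                     # (#2) count the rows (trips) per group
         .rename("count")            # (#3) give the counts a sensible name
         .reset_index()              # (#4) turn the result into a flat data frame
)
print(hourly)
```

Hours without any trips simply do not show up as rows at this stage.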

The first step (#1) groups all rows with the same date and hour together; the second (#2) counts how many rows (i.e. bike trips) we have on a given day for a given hour. The last two steps (#3 and #4) make a data frame with sensible column names. The result should look like this.

| | date | hour | count |
|---|---|---|---|
| 0 | 2014-09-01 | 0 | 3 |
| 1 | 2014-09-01 | 3 | 1 |
| 2 | 2014-09-01 | 4 | 2 |
| 3 | 2014-09-01 | 5 | 1 |
| 4 | 2014-09-01 | 6 | 1 |

Now we want to pivot the table we created to get the hourly counts, creating one row with hourly counts for each day. Pandas data frames have a built-in pivot function that does just that.
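As a rough sketch of this pivot step, assuming a long table with the columns `date`, `hour` and `count`: the small `hourly` frame below is made up for illustration, and the `reindex`/`fillna` housekeeping is my own addition to fill hours without trips with zeros.

```python
import pandas as pd

# Hypothetical long-format table: one row per (date, hour) with a trip count.
hourly = pd.DataFrame({
    "date": ["2014-09-01", "2014-09-01", "2014-09-02", "2014-09-02"],
    "hour": [0, 3, 1, 16],
    "count": [3, 1, 1, 111],
})

daily = (
    hourly.pivot(index="date", columns="hour", values="count")
          .reindex(columns=range(24))  # make sure all 24 hours exist as columns
          .fillna(0)                   # hours without trips become 0
)
print(daily.iloc[:, :4])
```

Because the pivot introduces missing values before they are filled, the counts end up as floats.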

The result looks like this (some columns left out).

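| date / hour | 0 | 1 | 2 | 3 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|
| 2014-09-01 | 3.0 | 0.0 | 0.0 | 1.0 | 49.0 | 28.0 | 23.0 |
| 2014-09-02 | 0.0 | 1.0 | 0.0 | 0.0 | 43.0 | 51.0 | 111.0 |
| 2014-09-03 | 0.0 | 1.0 | 1.0 | 1.0 | 46.0 | 47.0 | 107.0 |
| 2014-09-04 | 2.0 | 0.0 | 0.0 | 0.0 | 47.0 | 56.0 | 122.0 |
| 2014-09-05 | 2.0 | 2.0 | 0.0 | 2.0 | 41.0 | 52.0 | 119.0 |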

To illustrate the contents of the new data frame, let’s plot the second row, which contains the trips taken on the second day in our data frame (2014-09-02, a Tuesday). The result looks like this.
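A plot like this could be produced roughly as follows. This is a sketch, not the original plotting code: the `daily` frame is a cut-down stand-in holding only a handful of hours, and the bar chart and axis labels are my own choices.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Cut-down stand-in for the pivoted frame: two days, five of the 24 hours.
daily = pd.DataFrame(
    [[3.0, 1.0, 49.0, 28.0, 23.0],
     [0.0, 0.0, 43.0, 51.0, 111.0]],
    index=pd.Index(["2014-09-01", "2014-09-02"], name="date"),
    columns=pd.Index([0, 3, 14, 15, 16], name="hour"),
)

second_day = daily.iloc[1]         # second row = second day in the frame
ax = second_day.plot(kind="bar")   # one bar per hour
ax.set_xlabel("hour of day")
ax.set_ylabel("number of trips")
ax.set_title(str(second_day.name))
# Call plt.show() or plt.savefig(...) to actually display or store the figure.
```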

We see a peak of activity in the morning hours, which likely consists of people commuting to work, and a corresponding afternoon peak of people coming back from work.

## Clustering

Let’s think about the clustering task a little. Clustering is a sub-discipline of unsupervised learning where one has data points available, but no target variable. We want to find days that look alike, not e.g. predict the number of trips taken. Famous clustering algorithms include the k-Means algorithm and the family of hierarchical clustering algorithms.

So what can we expect from the output of a clustering algorithm applied to our data set? Well, we’ll get a label for each day, such that days that look alike should get the same label and thus be grouped together. From what we’ve seen above, at least two clusters should be present: weekdays (where we have peak activity caused by commuters) and weekends.

### The Elbow Method

Algorithms like k-Means expect the number of clusters k to be an input, determined before running the algorithm. What do we do if we don’t know the number of clusters beforehand? Well, we can use a trick known as the elbow method. Here, one plots the number of clusters against the inertia, a measure of how well the points cluster together (it’s defined as the sum of the squared distances of the points to their respective cluster centers). The resulting curve should be monotonically decreasing, since the more clusters we have, the lower the sum of distances to the closest cluster center becomes. However, once we pass the ‘correct’ number of clusters, the gain from each additional cluster gets smaller and smaller, which creates an elbow shape in the plot. Below, you’ll find the elbow plot I made with artificially created data to illustrate this. The data consists of three sources of Gaussian 2D noise. Successive runs of the k-Means algorithm with different values for k give different clusters and different inertias. The resulting elbow plot would suggest 2 or 3 clusters (sometimes it’s not 100% obvious what the right answer is).
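Here is a small sketch of how the numbers behind such an elbow plot can be computed. It assumes scikit-learn’s `KMeans` (the post does not say which implementation was used), whose `inertia_` attribute is the sum of squared distances to the closest center. The blob centers, sample sizes and seeds are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three sources of Gaussian 2D noise around hypothetical centers.
blob_centers = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 5.0]])
points = np.vstack(
    [c + rng.normal(scale=1.0, size=(100, 2)) for c in blob_centers]
)

ks = range(1, 9)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(model.inertia_)  # sum of squared distances to closest center

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `ks` gives the elbow curve; the bend is where adding another cluster stops paying off.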

Now let’s look at the elbow plot from our cycling data.

From the plot it looks like 4 is a good number of clusters. Let’s have a look at the cluster centers, i.e. the typical days for each group.
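To show how cluster centers can be read as typical days, here is a sketch on synthetic data, again assuming scikit-learn’s `KMeans`. The day profiles are invented (two commuter peaks on weekdays, one broad hump on weekends), so only two clusters are used here; on the real data one would pass the 4 suggested by the elbow plot.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
hours = np.arange(24)

def bump(center, height, width=1.5):
    """Gaussian-shaped bump of trips around a given hour."""
    return height * np.exp(-0.5 * ((hours - center) / width) ** 2)

commute = bump(8, 100) + bump(17, 100)  # invented weekday profile
leisure = bump(14, 30, width=4.0)       # invented weekend profile

dates = pd.date_range("2014-09-01", periods=28, freq="D")
rows = [leisure if d.dayofweek >= 5 else commute for d in dates]
noise = rng.normal(scale=2.0, size=(len(dates), 24))
daily = pd.DataFrame(np.array(rows) + noise, index=dates, columns=hours)

n_clusters = 2  # synthetic data has two day types; the post uses 4 on real data
model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(daily.values)

# One row per cluster, one column per hour: the "typical day" of each group.
centers = pd.DataFrame(model.cluster_centers_, columns=hours)
labels = pd.Series(model.labels_, index=dates, name="cluster")
print(centers.round(1))
```

Each row of `centers` can be plotted like a single day to get a cluster center plot.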

It looks like clusters 1-3 are weekdays of some sort and class 0 contains weekends. Time for some quality control.

### Quality Control

Inspecting the cluster centers, we see immediately that classes 1 and 3 look very similar. This is typically a reason for suspicion. But let’s press ahead for now and revisit this later. One thing you absolutely need to do after applying any clustering algorithm is look at the cluster sizes. If you see big disparities, you could have issues like outliers that you need to deal with. Our cluster sizes look like this.
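Counting cluster sizes takes one line once the labels are available. In this sketch the `labels` array is a made-up stand-in for the labels a fitted clustering model would return.

```python
import numpy as np
import pandas as pd

# Hypothetical cluster labels, one per day.
labels = np.array([1, 1, 3, 0, 0, 1, 1, 1, 2, 3, 0, 0, 1])

# Number of days per cluster, ordered by cluster label.
sizes = pd.Series(labels).value_counts().sort_index()
print(sizes)
```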

Now, class 2 looks like it contains very few days. A quick look at the cluster center plot above tells us that this class contains lower-traffic days, so that’s nothing too strange. Now what about those similar classes 1 and 3?

Let’s look at which weekdays the days labeled 0-3 fall on.
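One way to tabulate labels against weekdays is `pd.crosstab`. The sketch below uses invented labels for two weeks of dates, so the numbers say nothing about the real data.

```python
import pandas as pd

dates = pd.date_range("2014-09-01", periods=14, freq="D")  # starts on a Monday

# Invented labels for illustration: Mon-Thu -> 1, Fri -> 3, Sat/Sun -> 0.
labels = pd.Series([1, 1, 1, 1, 3, 0, 0] * 2, name="cluster")

# dayofweek encodes Monday as 0 and Sunday as 6.
weekdays = pd.Series(dates.dayofweek, name="weekday")

# Rows: cluster label, columns: weekday, cells: number of days.
weekday_table = pd.crosstab(labels, weekdays)
print(weekday_table)
```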

Looks like our clustering cleanly puts almost all Fridays into class 3 and Monday-Thursday into class 1. This is quite exciting: we can tell whether it’s a Friday or another weekday just by looking at the way people cycle! Weird, but these slightly wacky results are what I love about my job. Now what about the small cluster, labeled 2? Let’s repeat the plot above, now grouping by month.
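Grouping by month works the same way as grouping by weekday. In this sketch the labels are again invented (every December business day is placed in cluster 2) purely to show the mechanics.

```python
import pandas as pd

# Business days only; freq="B" skips weekends but knows nothing about holidays.
dates = pd.date_range("2014-11-24", "2014-12-12", freq="B")

# Invented labels: cluster 2 for December dates, cluster 1 otherwise.
labels = pd.Series([2 if d.month == 12 else 1 for d in dates], name="cluster")
months = pd.Series(dates.month, name="month")

# Rows: cluster label, columns: month number, cells: number of days.
month_table = pd.crosstab(labels, months)
print(month_table)
```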

Group 2 contains a high number of December weekdays. Not as clean a cut as our Friday/other-weekday split, but also quite neat, especially considering that we capture 10 out of the 15 or so proper December weekdays in this group.

That was today’s data adventure. I hope you’ve enjoyed it. Next time, as promised, we’ll have a look at the traffic chaos we created last time.