Clustering 101, or: On Fridays, People Bike Differently!
We have talked about the BABS open data set many times before. It lists bike trips in the San Francisco Bay Area, with start and end point, date, time, and some extra information about the rider. In this episode we want to look at some basic clustering, and at some surprising results from this well-known data set. The plan is to find classes of typical days in terms of bike usage. One would expect, for example, different usage patterns between weekdays and weekends, and we will actually discover some fun things beyond these basics as we go along. Let’s dive right in.
Some Data Processing
We start off by dissecting our data set into dates, hours, and trip counts. We can do this with one line of Python code, using the pandas library. Send me a Tweet if you want some future post to give an introduction to pandas. For now, all we need is the groupby function and some housekeeping.
counts = (babs.groupby(['date', 'hour'])  #1
              .size()                     #2
              .rename('count')            #3
              .reset_index())             #4
The first step (#1) is grouping all rows with the same date and hour together, the second (#2) counts how many rows (i.e. bike trips) we have on a given day, for a given hour. The last two steps (#3) and (#4) make a data frame with sensible column names. The result should look like this.
|   | date | hour | count |
|---|---|---|---|
| 0 | 2014-09-01 | 0 | 3 |
| 1 | 2014-09-01 | 3 | 1 |
| 2 | 2014-09-01 | 4 | 2 |
| 3 | 2014-09-01 | 5 | 1 |
| 4 | 2014-09-01 | 6 | 1 |
Now we want to pivot the table we created to get the hourly counts, creating one row with hourly counts for each day. Pandas data frames have a built-in pivot_table function that does just that.
X = counts.pivot_table(columns='hour', values='count', index='date').fillna(0)
The result looks like this (some columns left out).
| hour | 0 | 1 | 2 | 3 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|
| date |  |  |  |  |  |  |  |
| 2014-09-01 | 3.0 | 0.0 | 0.0 | 1.0 | 49.0 | 28.0 | 23.0 |
| 2014-09-02 | 0.0 | 1.0 | 0.0 | 0.0 | 43.0 | 51.0 | 111.0 |
| 2014-09-03 | 0.0 | 1.0 | 1.0 | 1.0 | 46.0 | 47.0 | 107.0 |
| 2014-09-04 | 2.0 | 0.0 | 0.0 | 0.0 | 47.0 | 56.0 | 122.0 |
| 2014-09-05 | 2.0 | 2.0 | 0.0 | 2.0 | 41.0 | 52.0 | 119.0 |
To illustrate the contents of the new data frame, let’s plot the second row, containing the trips taken on the second day (2014-09-02, a Tuesday) in our data frame. The result looks like this.
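If you want to recreate the plot yourself, a minimal matplotlib sketch could look like the following (assuming X is the pivoted frame from above; the styling of the original figure may differ):

```python
import matplotlib.pyplot as plt

# Hourly trip counts of the second row of the pivoted frame (2014-09-02).
day = X.iloc[1]
day.plot()
plt.xlabel('hour of day')
plt.ylabel('number of trips')
plt.show()
```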
We see a peak of activity in the morning hours, likely people commuting to work, and a corresponding afternoon peak of people coming back from work.
Clustering
Let’s think about the clustering task a little. Clustering is a sub-discipline of unsupervised learning where one has data points available, but no target variable. We want to find days that look alike, not e.g. predict the number of trips taken. Famous clustering algorithms include the k-Means algorithm and the family of hierarchical clustering algorithms.
So what can we expect from the output of a clustering algorithm applied to our data set? Well, we’ll get a label for each day, such that days looking alike get the same label and are thus grouped together. From what we’ve seen above, at least two clusters should be present: weekdays (where we have peak activity caused by commuters) and weekends.
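In scikit-learn terms, the labelling step boils down to a call like the following sketch (the value of n_clusters here is just a placeholder; we pick a better value in the next section):

```python
from sklearn.cluster import KMeans

# One cluster label per row of X, i.e. per day.
# n_clusters=2 is only a placeholder; the elbow method below suggests a better value.
labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)
```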
The Elbow Method
Algorithms like k-Means expect the number of clusters k to be an input, determined before running the algorithm. What do we do if we don’t know the number of clusters beforehand? Well, we can use a trick known as the elbow method. Here, one plots the number of clusters against the inertia, a measure of how well the points cluster together (it’s defined as the sum of the squared distances of the points to their respective cluster centers). The resulting curve should be monotonically decreasing, since of course the more clusters we have, the lower the sum of distances to the closest cluster center becomes. However, once we reach the ‘correct’ number of clusters, the gain gets smaller and smaller and thus creates an elbow shape in the plot. Below, you’ll find the elbow plot I made with artificially created data to illustrate this. The data consists of three sources of Gaussian 2D noise. Successive runs of the k-Means algorithm with different values for k give different clusterings and different inertias; the resulting elbow plot would suggest 2 or 3 clusters (sometimes it’s not 100% obvious what the right answer is).
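As an illustration, here is a sketch of how such an elbow plot can be produced with scikit-learn on synthetic data (the blob parameters are made up for this example and are not the exact ones behind the plot in the post):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: three 2D Gaussian blobs.
points, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Run k-Means for a range of k and record the inertia of each fit.
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, random_state=0).fit(points).inertia_ for k in ks]

plt.plot(ks, inertias, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('inertia')
plt.show()
```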
Now let’s look at the elbow plot from our cycling data.
From the plot it looks like 4 is a good number of clusters. Let’s have a look at the cluster centers, i.e. the typical days for each group.
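The fitting step itself might look roughly like this (a sketch, assuming X is the pivoted frame from above; random_state is added here for reproducibility, and since the numbering of the clusters depends on the initialization, it may not match the labels 0-3 used below):

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Fit k-Means with the four clusters suggested by the elbow plot.
kmeans = KMeans(n_clusters=4, random_state=0).fit(X)

# Each cluster center is a vector of typical hourly counts; plot one curve per cluster.
for label, center in enumerate(kmeans.cluster_centers_):
    plt.plot(X.columns, center, label=f'cluster {label}')
plt.xlabel('hour of day')
plt.ylabel('number of trips')
plt.legend()
plt.show()
```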
It looks like classes 1-3 are weekdays of some sort and class 0 contains the weekends. Time for some quality control.
Quality Control
Inspecting the cluster centers, we see immediately that classes 1 and 3 look very similar. This is typically a reason for suspicion, but let’s press ahead for now and revisit it later. One thing you absolutely need to do after applying any clustering algorithm is to look at the cluster sizes. If you see big disparities, you could have issues like outliers that you need to deal with. Our cluster sizes look like this.
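Getting the cluster sizes is a one-liner (again a sketch, using the kmeans object from the fit above):

```python
import numpy as np

# Number of days assigned to each of the four clusters.
print(np.bincount(kmeans.labels_))
```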
Now, class 2 looks like it contains very few days. A quick look at the cluster center plot above tells us that this class contains lower-traffic days, so that’s nothing too strange. And what about those similar classes 1 and 3?
Let’s look at which weekdays the days labeled 0-3 fall on, respectively.
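One way to compute such a breakdown is a cross-tabulation of the cluster labels against the day of the week; a sketch (assuming the dates sit in the index of X and kmeans is the fit from above) might look like this:

```python
import pandas as pd

# Cross-tabulate cluster labels against the day of the week.
weekday = pd.Series(pd.to_datetime(X.index).day_name(), name='weekday')
cluster = pd.Series(kmeans.labels_, name='cluster')
print(pd.crosstab(cluster, weekday))
```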
Looks like our clustering cleanly puts almost all Fridays into class 3 and Monday-Thursday into class 1. This is quite exciting: we can tell whether it’s a Friday or another weekday just by looking at the way people cycle! Weird, but these slightly wacky results are what I love about my job. Now what about the small cluster, labeled with 2? Let’s repeat the plot above, now grouping by the month.
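The monthly breakdown can be obtained in the same way, swapping the weekday for the month (again a sketch, built on the cluster series from the previous snippet):

```python
import pandas as pd

# Same cross-tabulation as above, now against the month instead of the weekday.
month = pd.Series(pd.to_datetime(X.index).month_name(), name='month')
print(pd.crosstab(cluster, month))
```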
Class 2 contains a high number of December weekdays. Not as cleanly cut as our Friday/other-weekday dissection, but also quite neat, especially considering that we capture 10 out of the 15 or so proper December weekdays in this group.
That was today’s data adventure; I hope you’ve enjoyed it. Next time we’ll take, as promised, a look at the traffic chaos we created last time.