After we familiarized ourselves with Keras in the last post, now is the time to get more serious. Much has been said and written about neural networks, and nobody working in analytics nowadays can really escape the hype. Most of the time, however, you'll only read about neural networks for classification or regression, that is to say in a supervised learning setting. That is all quite interesting, but there are exciting things you can do with unsupervised problems as well.

Autoencoders in Keras

Autoencoders are neural networks in which the input and the output have the same dimensions, so that at training time the same data set can be used as both input and target. Typically you'll find one or more hidden layers arranged symmetrically in between, with a smaller layer at the center. This smaller layer encodes the input, thus in effect performing dimensionality reduction. To create our own autoencoder, we can pretty much copy what the good people at the Keras blog did a while back. So first, we want an autoencoder.

from keras.models import Model
from keras.layers import Dense, Input

# sizes of the bottleneck and the two intermediate layers
compressed_dim = 2
intermediate_dim_big = 12
intermediate_dim_small = 6

# X is the pandas data frame of daily trip counts introduced below
input_day = Input(shape=(len(X.columns),))

# encoder: input -> 12 -> 6 -> 2 neurons
encoded = Dense(intermediate_dim_big, activation='selu')(input_day)
encoded = Dense(intermediate_dim_small, activation='selu')(encoded)
encoded = Dense(compressed_dim, activation='selu')(encoded)

# decoder: 2 -> 6 -> 12 -> original size
decoded = Dense(intermediate_dim_small, activation='selu')(encoded)
decoded = Dense(intermediate_dim_big, activation='selu')(decoded)
decoded = Dense(len(X.columns), activation='sigmoid')(decoded)

autoencoder = Model(input_day, decoded)

This defines an autoencoder that takes a pandas data frame (we'll have a look at the data in a second) as input, sends it through hidden layers of 12, 6, and 2 neurons respectively, and then restores it to its original size using the same layering in reverse. The encoded layer thus has just two neurons, into which we try to squeeze all the information contained in each row of data. We would like to inspect intermediate results, so let's also define the encoder.

encoder = Model(input_day, encoded)

The Data

As in the last post, we're using the Oslo City Bike data; this time, it's 3.3 million trips collected in 2016 and 2017. We arrange the data such that each row represents one day and each column contains the trip count at a given hour, which in theory gives us 24 columns. However, the bike sharing system in Oslo shuts down from 11pm to 6am, leaving us with 17 columns.
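In case you want to build the same matrix yourself, here's a minimal sketch of the reshaping, assuming the raw trips sit in a data frame called trips with a start_time timestamp column (both names are my assumptions, not the actual column names in the published data). I also scale the counts to [0, 1] so that the sigmoid output layer above can reproduce them.

import pandas as pd

# assumed layout: one row per trip with a 'start_time' timestamp column
trips['date'] = trips['start_time'].dt.floor('D')
trips['hour'] = trips['start_time'].dt.hour

# one row per day, one column per hour, values are trip counts
X = trips.groupby(['date', 'hour']).size().unstack(fill_value=0)

# keep only the hours in which the system is open (6am to 11pm)
X = X.loc[:, (X.columns >= 6) & (X.columns <= 22)]

# scale everything to [0, 1] to match the sigmoid output layer
X = X / X.values.max()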

Plotting a few of our rows should give you a feeling for the data. The days shown are a Saturday (2016-06-11) and the following Sunday (2016-06-12), with their characteristic flat curves lacking commuting peaks, and the following Monday and Tuesday (2016-06-13/14). The weekdays show two peaks corresponding to the morning and afternoon commutes.

Let’s Encode

Training the model on our data is easy as pie.

autoencoder.compile(optimizer='adam', loss='mean_squared_error')
autoencoder.fit(
    X.values, X.values,  # the same data serves as input and target
    batch_size=5,
    epochs=300,
    validation_split=0.1)

A few things you should take care of when trying to reproduce my results:

  • You have to take some care in choosing the batch size. Too small a batch may not contain both weekdays and weekend days, which sometimes leads to trouble during training. Too big a batch, and variation within a month (e.g. things slowing down during the holiday season) starts to become an issue.
  • Using a plain ReLU activation can leave you with dead units ("dying ReLUs") in the encoded layer, effectively deactivating one of the two neurons. In the few cases I tried, the network never recovered, which is why the code above uses SELU instead.

This will hardly come as any surprise to practitioners, but I thought it might be useful information for newcomers, since neural networks can easily feel fiddly when starting out.

Results

The resulting network does an excellent job of reconstructing the hourly demand on a given day. The plots below show the reconstructed counts for the days shown above alongside the actual numbers, and the two are very close. That is pretty accurate considering that the whole reconstruction comes from just two neurons.
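If you'd like to reproduce those comparison plots, a short sketch with matplotlib could look like this (assuming X keeps the days as a DatetimeIndex):

import matplotlib.pyplot as plt

# reconstruct every day from its two-neuron encoding
reconstructed = autoencoder.predict(X.values)

for day in ['2016-06-11', '2016-06-13']:  # the Saturday and Monday from above
    i = X.index.get_loc(day)
    plt.plot(X.columns, X.values[i], label=day + ' actual')
    plt.plot(X.columns, reconstructed[i], '--', label=day + ' reconstructed')

plt.xlabel('hour of day')
plt.ylabel('scaled trip count')
plt.legend()
plt.show()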

Now let's see what is going on in the encoded layer. Below is a plot of the two bottleneck neurons' activations on our data.
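Those activations come straight out of the encoder model we defined earlier; a quick sketch:

import matplotlib.pyplot as plt

# two numbers per day: the activations of the two bottleneck neurons
encoded_days = encoder.predict(X.values)

plt.scatter(encoded_days[:, 0], encoded_days[:, 1])
plt.xlabel('neuron 1 activation')
plt.ylabel('neuron 2 activation')
plt.show()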

It looks like the points fall into two clusters. This makes a lot of sense: as we saw above, we have two distinct usage patterns, relatively flat weekends and weekdays with two commuting peaks. To explore this in a little more detail, I've performed a k-means clustering on the data set, giving me the two clusters depicted by the colors below.
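The clustering is a one-liner with scikit-learn; here's a sketch of one way to do it, applied to the two-dimensional encoded points from the snippet above:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# assign each day to one of the two clusters visible in the encoded space
labels = KMeans(n_clusters=2, random_state=0).fit_predict(encoded_days)

plt.scatter(encoded_days[:, 0], encoded_days[:, 1], c=labels)
plt.xlabel('neuron 1 activation')
plt.ylabel('neuron 2 activation')
plt.show()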

Now let's have a look at which weekdays the points in each of the clusters represent (remember that each row in our original data, and thus each encoded point, represents the counts of trips taken at each hour of a given day).
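With the cluster labels from above in hand, this boils down to looking at the data frame's index; a sketch, assuming the days are stored as a DatetimeIndex:

# map each encoded point back to the weekday of the row it came from
day_names = X.index.day_name()

for cluster in (0, 1):
    print('cluster', cluster)
    print(day_names[labels == cluster].value_counts())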

Just as expected, the first cluster mainly contains weekdays, and the second one Saturdays and Sundays. The weekdays mixed into the second cluster are probably holidays, which would give those days weekend-like usage patterns. Easter Monday and Whit Monday (the Monday after Pentecost) are national holidays in Norway, and these are the extra Mondays in the second cluster.

I hope you’ve enjoyed this data adventure and stay tuned for new posts coming soon. Now go and encode some data yourself!