Take as a simple example a series of coin tosses. If you start flipping a coin and the first three outcomes are heads, a naive model will claim that the coin is biased. But intuitively we would say that three tries are probably too few to say anything with certainty. Would you base a business decision on this? What about 30 heads in a row?

To illustrate this with data, we’ll have a look at the COMPAS data set that ProPublica did a great analysis on. The short story is that models are built to assess the risk that a criminal defendant will re-offend. We’ll just look at the ground truth, whether or not the defendant did re-offend, given their age group. Let’s set this up in Python.

```
import pandas
import numpy
from scipy.stats import binom
import matplotlib.pyplot as plt
url = ('https://github.com/dhesse/IN-STK5000-Autumn21'
       '/raw/main/compas-scores-two-years.csv.gz')
data = pandas.read_csv(url) # load the data
data.groupby('age_cat')['two_year_recid'].mean() # point estimate
```

```
age_cat
25 - 45            0.459723
Greater than 45    0.315990
Less than 25       0.565075
Name: two_year_recid, dtype: float64
```

We start with what is called a *point estimate*. This is the most
naive model. It counts the total number of re-offending defendants in
each age category and divides by the total. We certainly see a
difference. Members of the youngest age category seem almost twice as
likely to re-offend than those of the oldest. But what about the
errors? Did we see enough data? Should we trust this, or are we
judging prematurely, just like with the coin tosses?

A simple method we can apply is called *bootstrapping*: from our $N$ examples we draw $N$ new ones with replacement. We repeat this a large number of times (in our case 10,000) and can then calculate basic statistics, like confidence intervals, on the resampled data (under some assumptions you can look up in the linked article). For us, a simple histogram will do.

```
bss = pandas.DataFrame([
    data.sample(data.shape[0], replace=True)
        .groupby('age_cat')['two_year_recid'].mean()
    for _ in range(10000)])
bss.plot.hist(bins=30)
```

We see that in our 10,000 samples there is virtually no overlap between the categories, suggesting that there is actually a material difference in re-offense rates of the different age groups.
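If you want numbers rather than a picture, percentile confidence intervals can be read straight off the bootstrap distribution. A self-contained sketch on synthetic data (the rate of 0.46 loosely mirrors the 25 - 45 group above; it is not the real data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for one age group: 1 = re-offended, 0 = did not
outcomes = rng.binomial(1, 0.46, size=2000)

# Bootstrap: draw N observations with replacement, record the mean
boot_means = np.array([
    rng.choice(outcomes, size=outcomes.size, replace=True).mean()
    for _ in range(10_000)])

# 95% percentile confidence interval
lo, hi = np.percentile(boot_means, [2.5, 97.5])
```

On the real data, the same idea applied to `bss` (e.g. `bss.quantile([0.025, 0.975])`) gives per-age-group intervals.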

What if we actually modeled this?

Next we fit a binomial model to each age group in the data, by simply dividing the number of re-offenders by the number of defendants in each group. We then sample from each model and compare with the bootstrap results.

```
params = pandas.DataFrame({
    'N': data.groupby('age_cat').size(),
    'k': data.groupby('age_cat')['two_year_recid'].sum(),
})
binomial = params.apply(
    lambda x: binom.rvs(x['N'],  # make a model and draw samples
                        x['k']/x['N'],
                        size=5000)/x['N'],
    axis=1, result_type='expand').T
ax = binomial.plot.hist(bins=40, density=True, alpha=0.5)
bss.plot.hist(bins=40, ax=ax, density=True, alpha=0.5)
plt.legend(bbox_to_anchor=(1,1))
```

As we can see, the two estimates basically lie on top of each other. Fitting a binomial model and bootstrapping are distinct methods with different assumptions, yet they give us the same result. It often pays off to cross-check like this, especially when the stakes are high, as when deciding whether someone should be sent to prison.

I hope you enjoyed today’s data adventure and will stay tuned for
more! In the meantime maybe reflect on how much you can trust *your*
models.

There seem to be two main factors at play. The first is a much higher detection rate: during March, around 10% of the tests came back positive; now it is around 1%.

The second factor concerns the age distribution of those who fall ill.

As you can see from the graph (source: MSIS, Sep. 27, 2020), most registered cases in March (Norwegian: mars) were between 40 and 60 years of age, while now it is mainly people in their 20s who are affected. For now this is good news, since younger patients are much less likely to develop a severe form of the illness.

The older age groups will however most likely be affected at a later stage if cases keep on rising.

The graph is created with the following Python script with data downloaded from MSIS.
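The original script isn’t reproduced here; as a rough sketch of how such a comparison plot can be built, consider the snippet below. The numbers, age groups, and column names are made up for illustration, not actual MSIS data.

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import matplotlib.pyplot as plt

# Hypothetical stand-in for the MSIS export: cases per age group and month
cases = pd.DataFrame({
    'month': ['March'] * 3 + ['September'] * 3,
    'age_group': ['20-29', '40-49', '50-59'] * 2,
    'cases': [150, 480, 520, 610, 140, 90]})

# One column per month, then a grouped bar chart per age group
pivot = cases.pivot(index='age_group', columns='month', values='cases')
ax = pivot.plot.bar(rot=0)
ax.set_ylabel('Registered cases')
plt.tight_layout()
```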

**Disclaimer:** As I said before, I’m not a medical professional. This
blog post aims at investigating some interesting data analysis
techniques, not a scientific study on the spread of the COVID-19
pandemic.

The paper starts with a more or less simple observation. Under the broad assumption that only a relatively small fraction of the total population gets infected at a given time, one can integrate the SIR model’s differential equations to obtain the following expression for $\Delta T(t)$, the number of newly infected individuals at time $t$.

\[\Delta T(t+\tau) = b(R_t)\,\Delta T(t)\,,\]

with

\[b(R_t) \approx \exp\left\{ \tau \gamma (R_t - 1) \right\}\,.\]

For our data, $\tau = 1\,\mathrm{day}$, and $\gamma$ is the inverse of the infectious period of around 5.2 days according to this study. In other words, given a number of newly infected people $\Delta T$ at time $t$ and a fixed reproduction number $R_t$, we would expect to see $b$ times that number at time $t + \tau$. Now we can define a likelihood function for $R_t$ by assuming a Poisson distribution for $\Delta T(t + \tau)$ with mean $b(R_t)\,\Delta T(t)$:

\[\Delta T(t + \tau) \sim \operatorname{Poisson}\left[b(R_t)\,\Delta T(t)\right]\,.\]

For the case numbers in Italy, taken from here, we get the following likelihood functions for the first few days with a significant number for $\Delta T$.

One caveat here is that the raw case numbers need to be smoothed, since the reported data fluctuates strongly:

This gives observations that are incompatible with our likelihood model, i.e. the support of the likelihood functions doesn’t overlap. Hence we introduce a Gaussian smoothing to the case numbers:
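One way to do such smoothing is `gaussian_filter1d` from SciPy. The growth rate and noise level below are made up for illustration:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(1)
# Noisy daily case counts scattered around an exponential trend
raw = 100 * np.exp(0.1 * np.arange(30)) * rng.lognormal(0, 0.3, 30)

# Gaussian smoothing with a standard deviation of about two days
smoothed = gaussian_filter1d(raw, sigma=2)
```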

Once we have our likelihood function defined, we can use successive Bayesian updates starting with a uniform distribution for $R_{t = 0}$.

\[P[R|\Delta T(t+1) \leftarrow \Delta T(t)] = \frac{P[\Delta T(t+1) \leftarrow \Delta T(t)|R]\, P[R]}{P[\Delta T(t+1) \leftarrow \Delta T(t)]}\,.\]

With this procedure, we not only get an estimate for $R_t$, but also a confidence interval:
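A minimal grid-based version of these successive updates might look as follows. The case numbers are synthetic, and the grid bounds and resolution are arbitrary choices of mine:

```python
import numpy as np
from scipy.stats import poisson

gamma = 1 / 5.2  # inverse of the ~5.2-day infectious period
tau = 1.0        # one-day time step

def b(R):
    return np.exp(tau * gamma * (R - 1))

# Synthetic smoothed daily new-case counts, growing roughly exponentially
delta_T = np.round(50 * np.exp(0.15 * np.arange(12))).astype(int)

R_grid = np.linspace(0.1, 6.0, 500)
posterior = np.full(R_grid.size, 1 / R_grid.size)  # uniform prior

for t in range(len(delta_T) - 1):
    # Poisson likelihood of the next day's count, then Bayes' rule
    likelihood = poisson.pmf(delta_T[t + 1], b(R_grid) * delta_T[t])
    posterior = likelihood * posterior
    posterior /= posterior.sum()

R_map = R_grid[np.argmax(posterior)]  # maximum a posteriori estimate
```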

This can be very useful e.g. when there are very few cases which will make our estimates more uncertain. Take South Korea, for example.

Even though the estimation method used here is rough and the results should not be used to inform public policy, it can help organizations get a read on how measures in various countries are affecting the spread of COVID-19, and thus on where to expect a relaxing of public measures and where restrictions will stay in effect for a while longer.

I hope you’ve enjoyed this data adventure. Stay home, stay safe, and stay tuned for the next one!

**Disclaimer:** I’m not a medical expert. This post is meant to
demonstrate very basic data analysis techniques to help interested
readers to process available data on COVID-19 and shed some light on
common simple techniques used to analyze disease data in
general. *This is not a scientific study* by any stretch of
imagination and *should not be mistaken for one*.

The highest quality data I could find on COVID-19 cases comes from Johns Hopkins University. It is unfortunately in a format that’s a little awkward to work with, so we need to massage it a bit first. The following code loads the data into a Pandas data frame as a proper time series, with a multi-index for ‘Country/Region’ and ‘Province/State’. If you’re not interested in per-province data, you can group by ‘Country/Region’ and sum the cases.
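A sketch of that massage on a tiny stand-in frame (the real file has the same wide layout with one column per date; the rows and numbers here are just for illustration):

```python
import pandas as pd

# Two rows mimicking the JHU CSSE wide format
raw = pd.DataFrame({
    'Province/State': [None, 'Hubei'],
    'Country/Region': ['Italy', 'China'],
    'Lat': [41.9, 30.98], 'Long': [12.6, 112.27],
    '3/1/20': [1694, 65914], '3/2/20': [2036, 66337]})

# Date columns become a DatetimeIndex, countries/provinces the columns
cases = (raw.drop(columns=['Lat', 'Long'])
            .set_index(['Country/Region', 'Province/State'])
            .T)
cases.index = pd.to_datetime(cases.index, format='%m/%d/%y')

# Per-country totals: sum over the 'Province/State' level
by_country = cases.T.groupby(level='Country/Region').sum().T
```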

Now we can, for example, recreate pretty plots visualizing the virus’s spread in various countries. As an example, here is the growth of the number of confirmed cases in major European countries, counted from the day the number of confirmed cases was closest to 200.

Let’s get to the basic reproduction number $R_0$. It is roughly the number of cases one single case of a disease will produce on average. Now this is clearly more a sledgehammer than a precision tool. You’ve probably heard of ‘Super-spreaders’.

But even though the number of infections caused by an individual case will vary widely according to many different factors, $R_0$ is an interesting quantity, since it tells us much about the outlook for a disease: a number smaller than one means the disease will gradually disappear, while a larger number means a faster spread.

Now we will go down the dangerous road of fitting the exponential curves shown above. This is something you really should not do if you’re after precise results, but since all we want to do is get a feeling for how $R_0$ is obtained in principle, the fits will do for now.
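In its simplest form, this just means fitting a straight line to the logarithm of the case counts; the slope is the growth rate $r$. A noise-free toy example (the growth rate of 0.19 per day is made up, though close to the fitted values below):

```python
import numpy as np

# Synthetic early-phase case counts growing at r = 0.19 per day
days = np.arange(20)
cases = 200 * np.exp(0.19 * days)

# A line fitted to log-cases recovers the growth rate as its slope
r, intercept = np.polyfit(days, np.log(cases), 1)
```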

One of the most common models to describe the dynamics of a pandemic is the SIR model. It assumes that individuals in a population of size $N$ can be in either one of three states: Susceptible, infected, or recovered. The dynamics of the number $S$ of susceptible, $I$ of infected, and $R$ of recovered persons is then described by a set of three coupled ordinary differential equations:

\[\begin{align*} \frac{\mathrm{d}S}{\mathrm{d}t} &= -\frac{\beta I S}N\,,\\ \frac{\mathrm{d}I}{\mathrm{d}t} &= \frac{\beta I S}N - \gamma I\,,\\ \frac{\mathrm{d}R}{\mathrm{d}t} &= \gamma I\,. \end{align*}\]

Here, $\beta$ is the product of the average number of contacts per person per time interval and the probability of infecting each of these contacts, and $\gamma$ is the rate of recovery or mortality. The first equation tells us how quickly people pass from susceptible to infected. The last one states that people move from infected to recovered at rate $\gamma$. The rate of change of the number of infected (middle equation) is then the negative sum of the other two. This yields a conservation law, $\dot S + \dot I + \dot R = 0$, and hence $S(t) + I(t) + R(t) = N$. This means we don’t consider changes in population size (such as births or deaths). There are many variations of this basic model, dealing e.g. with different rates in different age groups.

The much talked about $R_0$ is then defined as the rate of infection divided by the rate at which people go from infected to recovered:

\[R_0 = \frac \beta \gamma\,.\]

Now, looking at this paper by Wallinga and Lipsitch, we see that we can *estimate* $R_0$ (simply $R$ in their notation) from the exponential rise in the *onset* of the disease, using the formula

\[R_0 = 1 + r\,T_c\,,\]

where $r$ is the growth rate (from fitting the exponential rise in early case numbers) and $T_c$ is the disease’s generation interval. (This is the form their estimator takes for the exponentially distributed generation interval implied by the SIR model.)

Some code to get you started could look like this:

The results we’re getting are the following:

Country | r | 2.5% conf. | 97.5% conf. |
---|---|---|---|
Spain | 0.230066 | 0.211205 | 0.248928 |
Italy | 0.172809 | 0.160298 | 0.185320 |
France | 0.171130 | 0.159568 | 0.182692 |
Germany | 0.188302 | 0.174079 | 0.202524 |

As we could have guessed from the curves shown above, we get similar results for all the countries we consider. The differences in the rates can be attributed to actual differences in how the disease spreads in those countries, or to the methodology employed to identify and count cases.

Now all we are missing is the generation interval. In this study, we get an estimate of 5.2 days, with a confidence interval of length 3. We can model this using a normal curve centered at 5.2 with standard deviation close to one. We now generate bootstrap samples from the table above, calculate the mean, and draw samples from a normal distribution for $T_c$. The results look like this:
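That procedure can be sketched in a few lines. Here I use the estimator $R_0 = 1 + r\,T_c$, which is what the Wallinga-Lipsitch formula reduces to for the exponentially distributed generation interval of the SIR model; the standard deviation of 0.75 days for $T_c$ is my own rough choice:

```python
import numpy as np

rng = np.random.default_rng(0)
# Fitted exponential growth rates per day, from the table above
r_fits = np.array([0.230066, 0.172809, 0.171130, 0.188302])

# Bootstrap over countries: resample the four rates and average
r_samples = rng.choice(r_fits, size=(10_000, 4), replace=True).mean(axis=1)
# Generation interval: normal around 5.2 days
T_c_samples = rng.normal(5.2, 0.75, size=10_000)

R0_samples = 1 + r_samples * T_c_samples
lo, mid, hi = np.percentile(R0_samples, [5, 50, 95])
```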

With this, we get an estimate for $R_0$ of 2.01, with a (5%, 95%) confidence interval from 1.67 to 2.31, which is very similar to the numbers we heard in the press. This would mean that each case of COVID-19 yields two more infections on average. Of course this result is derived making a number of assumptions and hand-waving arguments, but after all we arrived at a credible estimate and can quote a confidence interval on our result.

What does this tell us? On the policy side the required action is clear: the value of $R_0$, around two, needs to be pushed below one. To achieve this, there are two levers available (remember the definitions of $R_0$ and $\beta$). Firstly, the number of social contacts can be limited; governments are working to achieve this by closing schools, limiting public life, etc. Secondly, one can try to reduce the probability that a contact leads to an infection, by washing your hands, avoiding touching your face, wearing a mask, and so on.

I hope you’ve learned something about the modeling of disease and got a better understanding of the numbers presented in discussions around COVID-19, and the rationale behind many of the measures that we see implemented around the globe. Stay safe, and tuned for the next data adventure!

The issue is that the logs are just dumped as gzipped CSV files into an S3 bucket. Lots of files. And this is where Metaflow comes in. Let’s make a simple flow that will download all the data and dump it into a Pandas DataFrame:

I actually had to run the script increasing the number of splits allowed in a `foreach` in Metaflow using the `--max-num-splits` parameter. Here probably lies the most severe limitation of this approach. So I wouldn’t use this as a production solution, but to get to know Metaflow a bit better, this is fine.

So what did I find out? One thing is quite intriguing: Plotting the percentage of traffic going to each unique uri stem, we get the following picture.
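The plotted quantity is just a normalized value count. On a toy log excerpt (the real frame comes from the flow above; these requests are made up):

```python
import pandas as pd

# Toy access-log excerpt with one row per request
logs = pd.DataFrame({'uri_stem': [
    'robots.txt', 'robots.txt', 'wp-login.php', '/', '/']})

# Percentage of traffic going to each unique URI stem
share = logs['uri_stem'].value_counts(normalize=True) * 100
```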

Let’s break this down a little bit. The most popular URI stem is `robots.txt`. This is good-citizen robots checking what part of `data-adventures.com` they are allowed to crawl. Looking a little further down the list, we see the stem `wp-login.php`. This is actually hackers trying to find vulnerable (read: the admin forgot or was too lazy to change the default password) WordPress sites. This seems to be a very common thing, and Darknet Diaries ran an interesting podcast on it. The general story goes that someone injects JavaScript code into a vulnerable website that then skims credit card data off of it. Scary stuff.

I hope you’ve enjoyed this data adventure and stay tuned for more.

To give it a spin, I’ve made a simple workflow that downloads survey results for general elections in Germany and ultimately transforms them into a Pandas data frame. The ultimate goal is to look at the biases of the various institutes compiling the surveys. For now, we can make pretty graphs like this one.

The setup in Metaflow is very simple. Below you find a snippet containing the first steps. All one needs is a class that contains as methods all the steps required for one’s workflow. Even parallel execution of steps is possible, as you can see:

You see that there is really not much to it. Take a closer look at the tutorials or at my repository to see how all of this integrates with Jupyter and the rest of your workflows. So what’s the verdict overall?

On the plus side:

- Open source
- Easy to use
- Integration with many popular libraries
- Easy to use with AWS
- Scaling seems straightforward enough
- There is some rudimentary version control

On the minus side:

- It seems like for now you’re stuck with AWS (so if your company chooses e.g. Azure it will at least mean some pain)
- Scaling to very big datasets (e.g. where you would naturally use Spark) might be an issue since you’re running plain Python

Some tasks, especially when operating on large datasets, are complex. Sometimes it makes sense to hide some of this complexity, and sometimes it doesn’t. One issue I see right off the bat is that Metaflow saves (checkpoints) your data at every step of every run. Depending on the nature of your workflow, this means you might end up with a lot of data lying around in AWS that needs to be cleaned up to avoid generating costs. This is still a manual task.

So, do I like Metaflow? Yes, absolutely. It seems like a great solution for some challenges that almost all organizations using data science, and even lone-wolf data scientists, will face. Is it a silver bullet? Probably not.

If this caught your attention, you might want to use the code I wrote to get your data from Strava. Happy analyzing!

Each session is logged at a resolution of one second, containing

- Time
- Heart Rate (bpm)
- Speed (km/h)
- Pace (min/km)
- Cadence (steps/minute)
- Altitude (m)
- Distance (m)

The device used to log the runs (the discontinued Polar M400 running
watch) also creates columns for additional metrics that it doesn’t
log. These are left blank in the files. The data is provided as
gzipped CSV files. *Note that no data cleaning was attempted*. This
means that the data contains outliers and measurement errors. Below
you see a sample plot representing the altitude profile of a 20-kilometer run, including a number of apparent measurement errors.

The data set contains sessions of various distances ranging from a few kilometers to 20+ kilometers.

I’ll probably do some more sophisticated analyses on the data later on, but as inspiration for things you could do with these data, let’s find out how much slower I go uphill. I’ve added a column measuring ascent in percent. I fit an ordinary least squares model from the Statsmodels package, predicting speed from heart rate (and its square), seconds run, and ascent (as well as its square). The data is grouped into two-minute sections to remove autocorrelation. This relatively simple model explains roughly a third of the variance in the data.

Model: | OLS | Adj. R-squared: | 0.310 |
---|---|---|---|
Dependent Variable: | Speed (km/h) | AIC: | 2801.2025 |
Date: | 2018-07-15 13:12 | BIC: | 2830.8959 |
No. Observations: | 1042 | Log-Likelihood: | -1394.6 |
Df Model: | 5 | F-statistic: | 94.36 |
Df Residuals: | 1036 | Prob (F-statistic): | 5.84e-82 |
R-squared: | 0.313 | Scale: | 0.85614 |

 | Coef. | Std.Err. | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
const. | -10.3113 | 3.3817 | -3.0491 | 0.0024 | -16.9472 | -3.6755 |
HR (bpm) | 0.2789 | 0.0483 | 5.7732 | 0.0000 | 0.1841 | 0.3737 |
Seconds | 0.0001 | 0.0000 | 4.1677 | 0.0000 | 0.0000 | 0.0001 |
HR^2 | -0.0010 | 0.0002 | -5.5585 | 0.0000 | -0.0013 | -0.0006 |
Ascend (%) | -0.0572 | 0.0073 | -7.7989 | 0.0000 | -0.0716 | -0.0428 |
Ascend^2 | -0.0017 | 0.0002 | -11.6011 | 0.0000 | -0.0020 | -0.0015 |
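A fit of this shape can be reproduced on synthetic data. The sketch below uses `numpy.linalg.lstsq` instead of Statsmodels to stay self-contained; the "true" coefficients and variable ranges are loosely modeled on the table above, not the real runs:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
hr = rng.uniform(120, 180, n)      # heart rate (bpm)
seconds = rng.uniform(0, 7200, n)  # seconds into the run
ascent = rng.uniform(-20, 20, n)   # ascent in percent

# Synthetic speed with coefficients close to the fitted ones
speed = (-10 + 0.28 * hr - 0.001 * hr**2 + 0.0001 * seconds
         - 0.057 * ascent - 0.0017 * ascent**2
         + rng.normal(0, 0.9, n))

# Design matrix: intercept, HR, Seconds, HR^2, Ascend, Ascend^2
X = np.column_stack([np.ones(n), hr, seconds, hr**2, ascent, ascent**2])
coef, *_ = np.linalg.lstsq(X, speed, rcond=None)
```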

So how much does running up or down a hill slow me down? The mean ascent over the two-minute windows I’ve used is distributed like this.

So the relevant ascent range is roughly between -20% and 20%. Taking the quadratic effect into account (remember I used the ascent as well as its square in the model), the influence of running up or down a hill looks like the following graph.

On the steep downhill sections I gain roughly half a km/h, while on the steep uphill sections I lose more than twice that. This makes complete sense, since I try to avoid running downhill too fast to save my knees.
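Plugging the two fitted coefficients into the quadratic makes the asymmetry explicit:

```python
def ascent_effect(a):
    """Speed change (km/h) at ascent a (%), from the fitted coefficients."""
    return -0.0572 * a - 0.0017 * a**2

gain_downhill = ascent_effect(-20)  # roughly +0.46 km/h
loss_uphill = ascent_effect(20)     # roughly -1.82 km/h
```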

This is it for now, have fun with the data and stay tuned for more data adventures.

Autoencoders are neural networks in which the input and the
output have the same dimensions, such that at training time the same
data set can be used as both. Typically you’ll find one or more hidden
layers symmetrically in between with a smaller layer at the
center. This smaller layer then *encodes* the input, thus in effect
performing dimensionality reduction. To create our own
autoencoder, we can pretty much copy what the good people at the
Keras blog did a while back. So first, we want an autoencoder.

This defines an autoencoder that uses a pandas data frame (we’ll have a look at the data in a second) as input, then sends it to hidden layers consisting of 12, 6, and 2 neurons respectively, to then restore it to its original size using the same layering in reverse. The encoded layer thus has just two neurons, in which we try to squeeze all the information collected in each row of data. We would like to inspect intermediate results, so let’s also define the encoder.
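A sketch of such a model in Keras: the layer sizes come from the text, while the activation choices are mine (the text later warns that a plain ReLU in the encoded layer can kill a neuron, so I use a linear activation there):

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = 17  # one column per operating hour

# Symmetric stack: 17 -> 12 -> 6 -> 2 -> 6 -> 12 -> 17
inputs = keras.Input(shape=(n_features,))
x = layers.Dense(12, activation='relu')(inputs)
x = layers.Dense(6, activation='relu')(x)
encoded = layers.Dense(2, activation='linear')(x)  # the 2-neuron code
x = layers.Dense(6, activation='relu')(encoded)
x = layers.Dense(12, activation='relu')(x)
outputs = layers.Dense(n_features, activation='linear')(x)

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, encoded)  # to inspect intermediate results
autoencoder.compile(optimizer='adam', loss='mse')
```

Training is then a matter of `autoencoder.fit(frame, frame, ...)`, with the same data as input and target.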

As in the last post, we’re using the Oslo City Bike data. This time, we’re using 3.3 million trips collected in 2016 and 2017. We arrange the data such that each row represents one day and each column contains the trip count at a given hour, giving us in theory 24 columns. But the bike sharing system in Oslo shuts down from 11pm to 6am, leaving us with 17 columns. Plotting a few of our rows should give you a feeling for the data.

The days shown are a Saturday (2016-06-11) and the following Sunday (2016-06-12), with their characteristic flat curves lacking commuting peaks and the following Monday and Tuesday (2016-06-13/14). The weekdays show two peaks corresponding to the morning and afternoon commutes.

Training the model on our data is easy as pie.

A few things you should take care of when trying to reproduce my results:

- One has to take some care in choosing the batch size. Too small a batch and you don’t catch weekdays and weekends, sometimes leading to trouble during training. Too big, and you’ll see effects of variation in one month (e.g. things slowing down during holiday season) becoming an issue.
- Using a plain ReLU activation can lead to problems of vanishing gradients in the encoded layer, effectively deactivating one neuron. In the few cases I tried the network never recovered.

This will hardly come as any surprise to practitioners, but I thought it might be useful information for newcomers, since neural networks can easily feel fiddly when starting out.

The resulting network does an excellent job of reconstructing the hourly demand on a given day. The plots below show the actual and reconstructed counts for the days shown above; the reconstructions track the actual numbers very closely. This is pretty accurate considering that the whole information has to pass through just two neurons.

Now let’s see what is going on in the encoded layer. Below is the plot of the activation of the neurons in it on our data.

It looks like the points fall in two clusters. This makes a lot of sense since as we saw above, we have two distinct usage patterns, relatively flat weekends and weekdays with two commuting peaks. To explore this in a little more detail, I’ve performed a k-means clustering on the data set, giving me the two clusters as depicted by the colors below.

Now let’s have a look which weekdays the points in each of the clusters represent (remember that each row in our original data, and thus each encoded point represents the counts of trips taken at each hour of a given day).

Just as expected, the first cluster mainly contains weekdays, and the second one Saturdays and Sundays. The mixed-in weekdays probably contain a number of holidays, giving those days weekend-like usage patterns. Easter Monday and the Monday after Pentecost are national holidays in Norway, and these are the extra Mondays in the second cluster.

I hope you’ve enjoyed this data adventure and stay tuned for new posts coming soon. Now go and encode some data yourself!

You probably know I like to work with bike sharing trip data, but this time I wanted to look at the open data provided by the Oslo City Bike program. I downloaded some of their datasets containing 2.5 million trips taken in Oslo between April 1st, 2017 and September 20th, 2017 (Norwegians *love* to cycle). The number of daily trips varies quite a bit, as the plot below shows (on May 17th, for example, the system was shut down because it is a national holiday).

I made a dataset containing the trips taken with a resolution of one hour, looking like this:

date | trips |
---|---|
2017-05-01 10:00:00 | 353.0 |
2017-05-01 11:00:00 | 617.0 |
2017-05-01 12:00:00 | 817.0 |
2017-05-01 13:00:00 | 1023.0 |
2017-05-01 14:00:00 | 1145.0 |
We will use autoregressive models to predict the number of trips being taken at a given hour, using the number of trips taken in the past as inputs. Some pandas magic gives us the following dataset.

date | 1d | 1h | 2d | 2h | 3d | 3h | 4d | 4h | 7d | target |
---|---|---|---|---|---|---|---|---|---|---|
2017-05-01 10:00:00 | 324.0 | 244.0 | 461.0 | 162.0 | 250.0 | 94.0 | 311.0 | 50.0 | 88.0 | 353.0 |
2017-05-01 11:00:00 | 589.0 | 353.0 | 664.0 | 244.0 | 394.0 | 162.0 | 394.0 | 94.0 | 86.0 | 617.0 |
2017-05-01 12:00:00 | 801.0 | 617.0 | 845.0 | 353.0 | 472.0 | 244.0 | 412.0 | 162.0 | 119.0 | 817.0 |
2017-05-01 13:00:00 | 890.0 | 817.0 | 966.0 | 617.0 | 468.0 | 353.0 | 457.0 | 244.0 | 182.0 | 1023.0 |
2017-05-01 14:00:00 | 1025.0 | 1023.0 | 1100.0 | 817.0 | 593.0 | 617.0 | 609.0 | 353.0 | 288.0 | 1145.0 |
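The "pandas magic" is essentially a handful of `shift` calls on the hourly series. A self-contained toy version (the counts here are just a ramp, not the real trip data):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2017-05-01', periods=24 * 10, freq='h')
trips = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# Lagged copies of the series become covariates, the series the target
lags = {'1d': 24, '1h': 1, '2d': 48, '2h': 2, '3d': 72,
        '3h': 3, '4d': 96, '4h': 4, '7d': 168}
df = pd.DataFrame({name: trips.shift(h) for name, h in lags.items()})
df['target'] = trips
df = df.dropna()  # keep only rows where every lag exists
```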

The target is the number of trips taken, the *lagged* data are our
covariates. We’ll use the last week of the data as a test set and
train our models on the rest.

To get started with Keras, all you need to do is install it, e.g. using `pip install keras`.
Building our first model is pretty straightforward: after reading the excellent user guide and documentation, you’ll soon enough write your own. Quite a contrast to working with raw Tensorflow!

So how well does our basic model do? As a comparison, I’ve also fitted a gradient boosted tree and a more fancy neural network employing long short-term memory (LSTM) units (building one is super easy in Keras). The good news is that the problem is a quite harmless one, and predictions are close to the actual values.

The errors are decently distributed, and plotting the predictions against the actuals doesn’t show any significant non-linearities.

Staring at the above plot for a little bit, one could argue that the neural networks show a little less variance than the tree in this example. The residual errors on the test set are virtually identical for all three methods, though. This is not surprising; of course a neural network does as well as any other method on a harmless dataset. The thing is, using Keras, it is also as easy to train as any other model, e.g. one from scikit-learn.

I hope you’ve enjoyed this post and feel inspired to give Keras a try. As always, stay tuned for the next data adventure, where we’ll have even more fun with neural networks!
