As an example we’ll use a very simple dataset featured on Kaggle, the cardiovascular disease dataset.

```
import pandas
data = pandas.read_csv("cardio_train.csv.zip", sep=";")
```

The data describes a sample of individuals, giving some medical and lifestyle attributes. It contains

- id
- age
- gender
- height
- weight
- ap_hi
- ap_lo
- cholesterol
- gluc
- smoke
- alco
- active
- cardio

I won’t go into all the details as they can be found on the dataset page at Kaggle. We will focus on the variable `cardio`, the absence or presence of cardiovascular disease, as well as the `smoke` column, an indicator of a self-reported smoking habit. Let’s use Pandas to look at what correlates with the presence of cardiovascular disease first.

```
data.corr()['cardio'].sort_values()[:-1].plot.bar()
```

Most of this is not too surprising: old age, high cholesterol, and high weight are highly correlated with the presence of cardiovascular disease. A comment on the `cholesterol` indicator: I know, I should not use `corr` with categorical variables, but given how they are coded, it will not do too much harm in exploratory analysis.

One thing stands out though: Smoking seems *negatively* correlated with cardiovascular disease. This can’t possibly be right, can it? Let’s look at what correlates with `smoke` in turn.

```
data.corr()['smoke'].sort_values()[:-1].plot.bar()
```

Now we’re on to something! It turns out that age is negatively correlated with smoking: *older* participants in the study are *less* likely to smoke. So the negative correlation between a smoking habit and cardiovascular disease might simply reflect the correlation between age and smoking. That makes a lot of sense.
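To see how such a confounder can flip the sign of a correlation, here is a small synthetic sketch. All numbers are made up, not fitted to this dataset: age drives disease risk up and smoking prevalence down, while smoking itself slightly *increases* risk.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
age = rng.uniform(30, 65, n)
# Younger people smoke more in this toy model...
smoke = rng.random(n) < 0.6 - 0.014 * (age - 30)
# ...while disease risk grows with age and (directly) with smoking.
disease = rng.random(n) < 0.012 * (age - 30) + 0.03 * smoke

# Marginally, smoking looks protective:
print(np.corrcoef(smoke, disease)[0, 1])  # negative
# Within a narrow age band, the harmful direct effect reappears:
band = (age >= 45) & (age < 50)
print(disease[band & smoke].mean() - disease[band & ~smoke].mean())  # positive
```

The marginal correlation is negative only because smokers are, on average, younger; holding age roughly fixed recovers the harmful effect.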

Let’s try to test this hypothesis. One simple thing we could do is fit a model that estimates the effect of smoking on cardiovascular disease, *correcting* for age. We would expect the apparent protective effect of smoking to go away, or even reverse. We’ll use a simple logistic regression in statsmodels.

```
import statsmodels.formula.api as smf
fit_results = smf.logit("cardio ~ smoke + age", data).fit()
fit_results.summary2()
```

| Model: | Logit | Pseudo R-squared: | 0.042 |
|---|---|---|---|
| Dependent Variable: | cardio | AIC: | 92989.5481 |
| Date: | 2022-06-06 09:25 | BIC: | 93017.0168 |
| No. Observations: | 70000 | Log-Likelihood: | -46492. |
| Df Model: | 2 | LL-Null: | -48520. |
| Df Residuals: | 69997 | LLR p-value: | 0.0000 |
| Converged: | 1.0000 | Scale: | 1.0000 |
| No. Iterations: | 5.0000 | | |

| | Coef. | Std.Err. | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -3.9221 | 0.0645 | -60.8510 | 0.0000 | -4.0484 | -3.7957 |
| smoke | -0.0309 | 0.0275 | -1.1220 | 0.2619 | -0.0848 | 0.0231 |
| age | 0.0002 | 0.0000 | 61.5588 | 0.0000 | 0.0002 | 0.0002 |

First things first: Looking at the metrics here, we see that we didn’t create a very good model. If we were to use this to predict cardiovascular disease in a population, we’d do a piss-poor job. But that’s not the goal here.

Looking at the confidence intervals on the fit coefficients, we can see that taking age into account, we don’t see a significant effect of smoking anymore, zero is compatible with our coefficient within errors. We recently talked about why model errors and prediction errors matter.
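As a side note, the fitted log-odds can be read as odds ratios. A quick sketch, with the coefficients copied from the summary above (`age` in this dataset is, per the Kaggle page, recorded in days, which is why the per-unit coefficient looks so tiny):

```python
import numpy as np

coef_smoke, coef_age = -0.0309, 0.0002  # copied from the fit above
print(np.exp(coef_smoke))         # odds ratio for smokers vs. non-smokers
print(np.exp(coef_age * 365.25))  # odds ratio per *year* of age
```

So smoking shifts the odds by a statistically insignificant factor near one, while each year of age multiplies the odds by roughly 1.08.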

Now what would have happened if we had fitted a model looking only at the smoking habit?

```
fit_results = smf.logit("cardio ~ smoke", data).fit()
fit_results.summary2()
```

```
Optimization terminated successfully.
Current function value: 0.693027
Iterations 3
```

| Model: | Logit | Pseudo R-squared: | 0.000 |
|---|---|---|---|
| Dependent Variable: | cardio | AIC: | 97027.7864 |
| Date: | 2022-06-06 09:24 | BIC: | 97046.0989 |
| No. Observations: | 70000 | Log-Likelihood: | -48512. |
| Df Model: | 1 | LL-Null: | -48520. |
| Df Residuals: | 69998 | LLR p-value: | 4.1673e-05 |
| Converged: | 1.0000 | Scale: | 1.0000 |
| No. Iterations: | 3.0000 | | |

| | Coef. | Std.Err. | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.0084 | 0.0079 | 1.0647 | 0.2870 | -0.0071 | 0.0239 |
| smoke | -0.1093 | 0.0267 | -4.0956 | 0.0000 | -0.1617 | -0.0570 |

Hopefully we would have seen how poor a job this model does and stopped right there. But without a keen eye on model metrics, looking only at the fit coefficients, we could have concluded that yes, within errors, smoking really does have a protective effect when it comes to heart disease.

This is of course an absurd conclusion. But we see this only because the negative health impact of smoking is common knowledge. This is not always the case when it comes to complex business problems. Subtleties can easily be overlooked and the wrong conclusions drawn when people blindly apply models to problems they don’t fully understand and the business side is not sufficiently involved. I’ve seen this too many times, and it can cost businesses dearly: in monetary terms when the wrong decisions are taken, but often more importantly in a loss of trust in AI that will hurt much more in the long term.

I hope you’ve enjoyed this data adventure and stay tuned for more!

For the project, I chose to use Particle devices. They come with great documentation and a lot of infrastructure. You can, for example, flash firmware to the devices over the air and manage large fleets of IoT devices, and they seamlessly connect with web hooks, Microsoft Azure IoT Hub, and Google Cloud Pub/Sub. In a first version of the project, I made a web hook to connect to an AWS Lambda function that would then write the data to a cloud database. I’ve since also looked at the other integrations; they seem to work flawlessly and are easy to set up.

To keep things simple, I’ll describe in a future post how to set up your own home server to read the data coming from our sensors and do some simple data analysis on it.

For the acquisition of environmental data, I chose a BME 680 environmental sensor breakout board. The wiring is quite simple; I am using a breadboard like so:

There was some minimal soldering involved, but nothing that requires particularly advanced soldering skills.

Particle microcontrollers are programmed in C++ using the Visual Studio Code plugin called Particle Workbench. Their website has great how-to guides to get a new user started. Luckily they are also mostly Arduino-compatible, such that the BME 680 Arduino Library works fine in the Particle ecosystem.

You can find my code repository here which glues all of this together. In the end, I get in my debug console the environmental data from the BME 680:

```
{"resistance": 612379,"humidity": 36.89,"pressure": 1003.96,"temperature": 25.90}
```
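On the receiving side, a payload like this is straightforward to handle; a minimal sketch in Python:

```python
import json

# The payload string as it arrives from the BME 680 sketch above.
payload = ('{"resistance": 612379,"humidity": 36.89,'
           '"pressure": 1003.96,"temperature": 25.90}')
reading = json.loads(payload)
print(reading['temperature'])  # 25.9
```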

I hope you enjoyed today’s data adventure and stay tuned for more. We will soon dive deeper into the software side of the project.

We start off with the COMPAS data we used in the last post. We will fit a model predicting two-year recidivism given age, the number of juvenile misdemeanors, and the number of priors.

```
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
features = ['age', 'juv_misd_count', 'priors_count']
target = 'two_year_recid'
X = data[features]
y = data[target]
model = DecisionTreeClassifier(max_depth=5)
```

Now, we would usually do something like cross-validating to see how well our model does on data it hasn’t seen before.

```
cross_val_score(model, X, y)
```

```
array([0.6950797 , 0.67290367, 0.63270963, 0.6950797 , 0.67822469])
```

Looking naively at this, we’d say we’re probably doing a good enough job, being correct in about 70% of the cases. Not too bad! But what if we want to know the recidivism *probability* for an individual? Let’s look, as an example, at a 30-year-old individual with three juvenile misdemeanors and no priors.

```
model = DecisionTreeClassifier(max_depth=5).fit(X, y)
model.predict_proba([[30, 3, 0]])[0][0]
```

```
0.716304347826087
```

The model gives the individual roughly a 72% chance of not re-offending within two years, i.e. about a 28% risk of recidivism. The model we chose, a decision tree, estimates this probability from the leaf node the individual ends up in, by comparing the number of positive and negative cases there. So far so good. But should we base a high-stakes decision on this? What is the error? We can estimate it using the bootstrap method. Let’s generate 1000 bootstrap samples.

```
import matplotlib.pyplot as plt

models = []
for _ in range(1000):
    sample = data.sample(data.shape[0], replace=True)
    model = DecisionTreeClassifier(max_depth=5)\
        .fit(sample[features], sample[target])
    models.append(model)
plt.hist([m.predict_proba([[30, 3, 0]])[0][0] for m in models], bins=20)
```

As we see, the bootstrap gives us a sizable error bar on the probability prediction. What your final decision will be depends on many factors. The potential cost of a false positive will have to be weighed against that of a false negative. As will, among many other factors, how representative you think the data is for your use case. How would you decide?

I hope you’ve enjoyed today’s data adventure and stay tuned for more!

Take as a simple example a series of coin tosses. If you start flipping a coin and the first three outcomes are heads, a naive model will claim that the coin is biased. But intuitively we would say that three tries are probably too few to say anything with certainty. Would you base a business decision on this? What about 30 heads in a row?
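To put numbers on that intuition, here is a quick sketch with scipy: the probability that a *fair* coin produces heads on every one of $n$ tosses.

```python
from scipy.stats import binom

# P(all n tosses are heads | fair coin) = binom.pmf(n, n, 0.5) = 0.5**n
print(binom.pmf(3, 3, 0.5))    # 0.125 -- hardly conclusive
print(binom.pmf(30, 30, 0.5))  # ~9.3e-10 -- now a fair coin looks very unlikely
```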

To illustrate this with data, we’ll have a look at the COMPAS data set that ProPublica did a great analysis on. The short story is that models are built to assess the risk that a criminal defendant will re-offend. We’ll just look at the ground truth, whether or not the defendant did re-offend, given their age group. Let’s set this up in Python.

```
import pandas
import numpy
from scipy.stats import binom
import matplotlib.pyplot as plt

url = ('https://github.com/dhesse/IN-STK5000-Autumn21'
       '/raw/main/compas-scores-two-years.csv.gz')
data = pandas.read_csv(url)  # load the data
data.groupby('age_cat')['two_year_recid'].mean()  # point estimate
```

```
age_cat
25 - 45 0.459723
Greater than 45 0.315990
Less than 25 0.565075
Name: two_year_recid, dtype: float64
```

We start with what is called a *point estimate*. This is the most naive model: it counts the total number of re-offending defendants in each age category and divides by the total. We certainly see a difference. Members of the youngest age category seem almost twice as likely to re-offend as those of the oldest. But what about the errors? Did we see enough data? Should we trust this, or are we judging prematurely, just like with the coin tosses?

A simple method we can apply is called *bootstrapping*. From our $N$ examples we sample $N$ new ones with replacement. We do this a large number of times (in our case 10,000) and can then calculate basic statistics on the results (under some assumptions you can look up in the linked article), like confidence intervals. For us, a simple histogram will do.

```
bss = pandas.DataFrame([
    data.sample(data.shape[0], replace=True)
        .groupby('age_cat')['two_year_recid'].mean()
    for _ in range(10000)])
bss.plot.hist(bins=30)
```

We see that in our 10,000 samples there is virtually no overlap between the categories, suggesting that there is actually a material difference in re-offense rates of the different age groups.

What if we actually modeled this?

Next we fit a binomial model to each age group in the data, by simply dividing the number of re-offenders by the number of defendants in each group. We then sample from each model and compare with the bootstrap results.

```
params = pandas.DataFrame({
    'N': data.groupby('age_cat').size(),
    'k': data.groupby('age_cat')['two_year_recid'].sum(),
})
binomial = params.apply(
    lambda x: binom.rvs(x['N'],       # make a model and draw samples
                        x['k']/x['N'],
                        size=5000)/x['N'],
    axis=1, result_type='expand').T
ax = binomial.plot.hist(bins=40, density=True, alpha=0.5)
bss.plot.hist(bins=40, ax=ax, density=True, alpha=0.5)
plt.legend(bbox_to_anchor=(1, 1))
```

As we can see, the two estimates lie basically on top of each other. The binomial model and the bootstrap are distinct methods with different assumptions, yet they give us the same result. It often pays off to cross-check like this, especially when the stakes are high, like deciding whether someone should be sent to prison or not.

I hope you enjoyed today’s data adventure and will stay tuned for
more! In the meantime maybe reflect on how much you can trust *your*
models.

There seem to be two main factors at play. The first is a much higher detection rate: during March, around 10% of the tests came back positive, while now it is around 1%.

The second factor concerns the age distribution of those who fall ill.

As you can see from the graph (source: MSIS, Sep. 27, 2020), most registered cases in March (Norwegian: Mars) were between 40 and 60 years of age, while now it’s mainly those in their 20s who are affected. This is for now good news since the younger patients are much less likely to develop a severe form of the illness.

The older age groups will however most likely be affected at a later stage if cases keep on rising.

The graph is created with the following Python script with data downloaded from MSIS.
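A sketch of how such a grouped bar chart can be made with pandas. The counts below are made up and only stand in for the MSIS export, whose exact format isn’t shown here:

```python
import pandas as pd

# Made-up registered case counts per age group for the two months.
cases = pd.DataFrame({
    'March': [120, 310, 480, 510, 290, 110],
    'September': [450, 980, 420, 260, 140, 60],
}, index=['0-19', '20-29', '30-39', '40-49', '50-59', '60+'])

ax = cases.plot.bar()
ax.set_ylabel('registered cases')
```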

**Disclaimer:** As I said before, I’m not a medical professional. This blog post aims at investigating some interesting data analysis techniques; it is not a scientific study of the spread of the COVID-19 pandemic.

The paper starts with a more or less simple observation. Under the broad assumption that only a relatively small fraction of the total population gets infected at a given time, one can integrate the SIR model’s differential equations to obtain the following expression for $\Delta T(t)$, the number of newly infected individuals at time $t$.

\[\Delta T(t+\tau) = b(R_t)\,\Delta T(t)\,,\]

with

\[b(R_t) \approx \exp\left\{ \tau \gamma (R_t - 1) \right\}\,.\]

For our data, $\tau = 1\,\mathrm{day}$, and $\gamma$ is the inverse of the infectious period of around 5.2 days according to this study. In other words, given a number of newly infected people $\Delta T$ at time $t$ and a fixed reproduction number $R_t$, we would expect to see $b$ times that number at time $t + \tau$. Now we can define a likelihood function for $R_t$ by assuming a Poisson distribution for $\Delta T(t + \tau)$ with mean $b(R_t)\,\Delta T(t)$:

\[\Delta T(t + \tau) \sim \operatorname{Poisson}\left[b(R_t)\,\Delta T(t)\right]\,.\]

For the case numbers in Italy, taken from here, we get the following likelihood functions for the first few days with a significant number for $\Delta T$.
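As a sketch of this likelihood, one can evaluate it on a grid of $R_t$ values. The case numbers here are hypothetical, not the Italian data:

```python
import numpy as np
from scipy.stats import poisson

gamma = 1 / 5.2  # inverse of the ~5.2 day infectious period
tau = 1          # one day between observations

def b(R):
    return np.exp(tau * gamma * (R - 1))

new_cases_today, new_cases_tomorrow = 1000, 1150  # made-up numbers
R_grid = np.linspace(0, 6, 601)
likelihood = poisson.pmf(new_cases_tomorrow, b(R_grid) * new_cases_today)
print(R_grid[np.argmax(likelihood)])  # maximum-likelihood value of R_t
```

A 15% day-over-day growth corresponds to $R_t = 1 + 5.2\ln 1.15 \approx 1.73$, which is where the likelihood peaks.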

One caveat here is that the raw case numbers need to be smoothed. The reported data is fluctuating very strongly:

This gives observations that are incompatible with our likelihood model, i.e. the supports of the likelihood functions don’t overlap. Hence we introduce a Gaussian smoothing to the case numbers:

Once we have our likelihood function defined, we can use successive Bayesian updates starting with a uniform distribution for $R_{t = 0}$.

\[P[R|\Delta T(t+1) \leftarrow \Delta T(t)] = \frac{P[\Delta T(t+1) \leftarrow \Delta T(t)|R]\, P[R]}{P[\Delta T(t+1) \leftarrow \Delta T(t)]}\,.\]

With this procedure, we not only get an estimate for $R_t$, but also a confidence interval:
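Sketched in code, a grid-based version of these successive updates could look like the following; the counts are made up and assumed already smoothed:

```python
import numpy as np
from scipy.stats import poisson

gamma, tau = 1 / 5.2, 1
R_grid = np.linspace(0, 6, 601)
posterior = np.full_like(R_grid, 1 / len(R_grid))  # uniform prior at t = 0

new_cases = [1000, 1100, 1250, 1400]  # hypothetical smoothed daily counts
for today, tomorrow in zip(new_cases, new_cases[1:]):
    b = np.exp(tau * gamma * (R_grid - 1))
    likelihood = poisson.pmf(tomorrow, b * today)
    posterior = likelihood * posterior
    posterior /= posterior.sum()  # normalize; this becomes the next prior

print(R_grid[np.argmax(posterior)])  # posterior mode for R_t
```

Quantiles of the final `posterior` then give the confidence interval directly.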

This can be very useful e.g. when there are very few cases which will make our estimates more uncertain. Take South Korea, for example.

Even though the estimation method here is rough and the results should not be used to inform public policy, it can help organizations get a read on how measures in various countries are affecting the spread of COVID-19. This can give a somewhat better read on where to expect relaxing of public measures and where the restrictions will be in effect for a longer while.

I hope you’ve enjoyed this data adventure. Stay home, stay safe, and stay tuned for the next one!

**Disclaimer:** I’m not a medical expert. This post is meant to demonstrate very basic data analysis techniques to help interested readers process available data on COVID-19, and to shed some light on common simple techniques used to analyze disease data in general. *This is not a scientific study* by any stretch of the imagination and *should not be mistaken for one*.

The highest quality data I could find on COVID-19 cases comes from Johns Hopkins University. It is unfortunately in a format that’s a little awkward to work with, so we need to massage it a bit to get it into a form that is easy to work on. The following code loads the data into a Pandas data frame, as a proper time series with a multi-index for ‘Country/Region’ and ‘Province/State’. If you’re not interested in per-province data, you could group by ‘Country/Region’ and sum the cases.
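The reshaping idea can be sketched on a toy frame in the same wide layout as the Johns Hopkins CSVs, with one column per date (the numbers below are made up):

```python
import pandas as pd

raw = pd.DataFrame({
    'Province/State': [None, 'Hubei'],
    'Country/Region': ['Italy', 'China'],
    'Lat': [41.9, 30.97], 'Long': [12.5, 112.27],
    '3/1/20': [1694, 66337], '3/2/20': [2036, 66907],
})

# Drop coordinates, move the identifiers into a multi-index, and
# transpose so that each column becomes one time series.
tidy = (raw.drop(columns=['Lat', 'Long'])
           .set_index(['Country/Region', 'Province/State'])
           .T)
tidy.index = pd.to_datetime(tidy.index)  # a proper datetime index
print(tidy[('China', 'Hubei')].iloc[-1])
```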

Now we can, for example, recreate pretty plots visualizing the virus spread in various countries. As an example, here is the growth of the number of confirmed cases in major European countries from the day the number of confirmed cases was closest to 200.

Let’s get to the basic reproduction number $R_0$. It is roughly the number of cases one single case of a disease will produce on average. Now this is clearly more a sledgehammer than a precision tool. You’ve probably heard of ‘Super-spreaders’.

But even though the number of infections caused by an individual case will vary widely according to many different factors, $R_0$ is still an interesting quantity, since it tells us a lot about the outlook for a disease: a number smaller than one means the disease will gradually disappear, while a number greater than one means it will keep spreading, and the larger it is, the faster.

Now we will go down the dangerous road of fitting the exponential curves shown above. This is something you really should not do if you’re after precise results, but since all we want to do is get a feeling for how $R_0$ is obtained in principle, the fits will do for now.

One of the most common models to describe the dynamics of a pandemic is the SIR model. It assumes that individuals in a population of size $N$ can be in either one of three states: Susceptible, infected, or recovered. The dynamics of the number $S$ of susceptible, $I$ of infected, and $R$ of recovered persons is then described by a set of three coupled ordinary differential equations:

\[\begin{align*} \frac{\mathrm{d}S}{\mathrm{d}t} &= -\frac{\beta I S}N\,,\\ \frac{\mathrm{d}I}{\mathrm{d}t} &= \frac{\beta I S}N - \gamma I\,,\\ \frac{\mathrm{d}R}{\mathrm{d}t} &= \gamma I\,. \end{align*}\]

Here, $\beta$ is the product of the average number of contacts per person per time interval and the probability of infecting each of these contacts, and $\gamma$ the rate of recovery or mortality. The first equation tells us how quickly people pass from susceptible to infected. The last one states that people move from infected to recovered at rate $\gamma$. The change in the number of infected (middle equation) is then the negative sum of those two. This yields a conservation law, $\dot S + \dot I + \dot R = 0$, and hence $S(t) + I(t) + R(t) = N$. This means that we don’t consider changes in population size (like births or deaths). There are many variations of this basic model that deal e.g. with different rates in different age groups, etc.
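These equations are easy to integrate numerically; a quick sketch with scipy, using made-up parameters $\beta = 0.4/\mathrm{day}$ and $\gamma = 1/5.2\,\mathrm{days}$:

```python
import numpy as np
from scipy.integrate import solve_ivp

N, beta, gamma = 1_000_000, 0.4, 1 / 5.2  # made-up population and rates

def sir(t, y):
    S, I, R = y
    return [-beta * I * S / N,
            beta * I * S / N - gamma * I,
            gamma * I]

# Start with 100 infected individuals and integrate for half a year.
sol = solve_ivp(sir, (0, 180), [N - 100, 100, 0])
S, I, R = sol.y[:, -1]
print(S + I + R)  # the conservation law: S + I + R stays equal to N
```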

The much talked about $R_0$ is then defined as the rate of infection divided by the rate at which people go from infected to recovered:

\[R_0 = \frac \beta \gamma\,.\]

Now looking at this paper by Wallinga and Lipsitch, we see that we can *estimate* $R_0$ (simply $R$ in their notation) by looking at the exponential rise in the *onset* of the disease, using the formula

\[R_0 \approx 1 + r\,T_c\,,\]

where $r$ is the growth rate (from fitting the exponential rise in early case numbers) and $T_c$ the disease’s generation interval.

Some code to get you started could look like this:
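A rough sketch of such a fit, using made-up early case counts growing at roughly 23% per day, together with the Wallinga–Lipsitch relation $R_0 \approx 1 + r\,T_c$ as an assumption:

```python
import numpy as np
from scipy.optimize import curve_fit

days = np.arange(8)
cases = np.array([210, 260, 330, 400, 500, 620, 760, 950])  # made up

def exponential(t, c0, r):
    return c0 * np.exp(r * t)

(c0, r), cov = curve_fit(exponential, days, cases, p0=(200, 0.2))
print(r)            # fitted daily growth rate
print(1 + r * 5.2)  # rough R0 estimate, assuming T_c = 5.2 days
```

The diagonal of `cov` then gives the variance of the fitted parameters, from which the confidence intervals below can be derived.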

The results we’re getting are the following:

| Country | r | 2.5% conf. | 97.5% conf. |
|---|---|---|---|
| Spain | 0.230066 | 0.211205 | 0.248928 |
| Italy | 0.172809 | 0.160298 | 0.185320 |
| France | 0.171130 | 0.159568 | 0.182692 |
| Germany | 0.188302 | 0.174079 | 0.202524 |

As we could have guessed looking at the curves shown above, we get similar results for the countries we consider. The different rates can be attributed to actual differences in how the disease spreads in those countries, or to differences in the methodology employed to identify and count the cases.

Now all we are missing is the generation interval. In this study, we get an estimate of 5.2 days, with a confidence interval of length 3. We can model this using a normal curve centered at 5.2 with standard deviation close to one. We now generate bootstrap samples from the table above, calculate the mean, and draw samples from a normal distribution for $T_c$. The results look like this:
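A sketch of that sampling procedure, with the growth rates copied from the table above, $T_c \sim \mathcal{N}(5.2, 0.9)$, and the relation $R_0 \approx 1 + r\,T_c$ as a working assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
r_values = np.array([0.230066, 0.172809, 0.171130, 0.188302])  # from the table

n = 10_000
# Bootstrap the mean growth rate and draw generation intervals:
r_boot = rng.choice(r_values, size=(n, len(r_values)), replace=True).mean(axis=1)
T_c = rng.normal(5.2, 0.9, size=n)
R0 = 1 + r_boot * T_c

print(np.percentile(R0, [5, 50, 95]).round(2))
```

This lands near a median of two with a (5%, 95%) interval close to the one quoted below.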

With this, we get an estimate for $R_0$ of 2.01, with a (5%, 95%) confidence interval from 1.67 to 2.31, which is very similar to the numbers we heard in the press. This would mean that each case of COVID-19 yields two more infections on average. Of course this result is derived making a number of assumptions and hand-waving arguments, but after all we arrived at a credible estimate and can quote a confidence interval on our result.

What does this tell us? On the policy side the required action is clear: the value for $R_0$ of around two needs to be pushed down below one. To achieve this, there are two levers available (remember the definitions of $R_0$ and $\beta$). Firstly, the number of social contacts can be limited. Governments are working to achieve this by closing schools, limiting public life, etc. Secondly, one can try to reduce the probability that a contact leads to an infection. This can be done by washing your hands, avoiding touching your face, wearing a mask, and so on.

I hope you’ve learned something about the modeling of disease and got a better understanding of the numbers presented in discussions around COVID-19, and the rationale behind many of the measures that we see implemented around the globe. Stay safe, and tuned for the next data adventure!

The issue is that the logs are just dumped as gzipped CSV files into an S3 bucket. Lots of files. And this is where Metaflow comes in. Let’s make a simple flow that will download all the data and dump it into a Pandas DataFrame:

I actually had to run the script increasing the number of splits allowed in a `foreach` in Metaflow using the `--max-num-splits` parameter. Here probably lies the most severe limitation of this approach. So I wouldn’t use this as a production solution, but to get to know Metaflow a bit better, this is fine.

So what did I find out? One thing is quite intriguing: Plotting the percentage of traffic going to each unique uri stem, we get the following picture.

Let’s break this down a little bit. The most popular URI stem is `robots.txt`. This is good citizen robots checking what part of `data-adventures.com` they are allowed to crawl. Now looking a little further down the list, we see the stem `wp-login.php`. This is actually hackers trying to find vulnerable (read: the admin forgot/was too lazy to change the default password) Wordpress sites. This seems to be a very common thing, and Darknet Diaries ran an interesting podcast on this. The general story goes that someone injects JavaScript code into a vulnerable website that then skims credit card data off of it. Scary stuff.

I hope you’ve enjoyed this data adventure and stay tuned for more.

To give it a spin, I’ve made a simple workflow that downloads survey results for general elections in Germany and ultimately transforms them into a Pandas data frame. The ultimate goal will be to look at biases of the various institutes compiling the surveys. For now, we can make pretty graphs like this one.

The setup in Metaflow is very simple. Below you find a snippet containing the first steps. All one needs is a class that contains as methods all the steps required for one’s workflow. Even parallel execution of steps is possible, as you can see:

You see that there is really not much to it. Take a closer look at the tutorials or at my repository to see how all of this integrates with Jupyter and the rest of your workflows. So what’s the verdict overall?

On the plus side:

- Open source
- Easy to use
- Integration with many popular libraries
- Easy to use with AWS
- Scaling seems straightforward enough
- There is some rudimentary version control

On the minus side:

- It seems like for now you’re stuck with AWS (so if your company chooses e.g. Azure it will at least mean some pain)
- Scaling to very big datasets (e.g. where you would naturally use Spark) might be an issue since you’re running plain Python

Some tasks, especially when operating on large datasets, are complex. It sometimes makes sense to hide some of this complexity, and sometimes it doesn’t. One issue I see right off the bat is that Metaflow saves (checkpoints) your data at every step of every run. Depending on the nature of your workflow, this means you might end up with a lot of data lying around in AWS that needs to be cleaned up so as not to generate costs. This is still a manual task.

So, do I like Metaflow? Yes, absolutely. It seems like a great solution for some challenges that almost all organizations using data science, and even lone-wolf data scientists, will face. Is it a silver bullet? Probably not.

If this caught your attention, you might want to use the code I wrote to get your data from Strava. Happy analyzing!
