Yes, I know, everyone and their brother is talking about the novel coronavirus SARS-CoV-2. There is a lot of bad information and misinformation out there. For this reason I thought it would be a good idea to go on a data adventure together: to have a look at the numbers (and at how to get them into an easy-to-use format so you can explore them yourself) and to make our very own rough estimate of the basic reproduction number $R_0$ that people keep talking about.

Disclaimer: I’m not a medical expert. This post is meant to demonstrate very basic data analysis techniques to help interested readers process available data on COVID-19, and to shed some light on common simple techniques used to analyze disease data in general. This is not a scientific study by any stretch of the imagination and should not be mistaken for one.

Reading the data

The highest quality data I could find on COVID-19 cases comes from Johns Hopkins University. It is unfortunately in a format that’s a little awkward to work with, so we need to massage the data a bit to get it into a form that is easy to work with. The following code loads the data into a pandas data frame with a proper datetime index for the rows and a multi-index over ‘Country/Region’ and ‘Province/State’ for the columns. If you’re not interested in per-province data, you can group by ‘Country/Region’ and sum the cases.

import pandas
url = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/"
       "master/csse_covid_19_data/csse_covid_19_time_series/"
       "time_series_covid19_confirmed_global.csv")
df_c = pandas.read_csv(url)  # load data
del df_c['Long'] # remove 'Long' column 
del df_c['Lat']  # remove 'Lat' column 
# now, set the index to reflect country and province
df_c = df_c.set_index(['Country/Region','Province/State'])\
           .transpose() # transpose to get dates as rows
# make a proper datetime index
df_c = df_c.set_index(pandas.to_datetime(df_c.index))

Now we can, for example, recreate the pretty plots visualizing the spread of the virus in various countries. As an example, here is the growth of the number of confirmed cases in major European countries, starting from the day each country’s count was closest to 200.
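Such a plot can be produced with matplotlib. The sketch below uses a small synthetic stand-in for `df_c` (so it runs on its own); in the real analysis you would use the frame built above, grouped and summed by ‘Country/Region’. The growth rates used to fake the data are purely illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")           # render off-screen; drop this in a notebook
import matplotlib.pyplot as plt

# Stand-in for df_c: synthetic exponential case counts per country.
# In the real analysis, use the frame loaded from the JHU data above.
dates = pd.date_range("2020-02-15", periods=30, freq="D")
df_c = pd.DataFrame(
    {c: 100 * np.exp(r * np.arange(30))
     for c, r in [("Italy", 0.17), ("Germany", 0.19), ("Spain", 0.23)]},
    index=dates,
)

fig, ax = plt.subplots()
for country in df_c.columns:
    series = df_c[country]
    # align each country at the day its count is closest to 200
    start = (series - 200).abs().idxmin()
    aligned = series[series.index >= start].reset_index(drop=True)
    ax.plot(aligned.index, aligned.values, label=country)
ax.set_yscale("log")            # exponential growth appears as a straight line
ax.set_xlabel("days since ~200 confirmed cases")
ax.set_ylabel("confirmed cases")
ax.legend()
fig.savefig("cases.png")
```

The logarithmic y-axis is the important choice here: exponential growth shows up as a straight line, which makes the growth rates of different countries easy to compare by eye.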

Estimating $R_0$

Let’s get to the basic reproduction number $R_0$. It is, roughly, the average number of new cases that a single case of a disease will produce. This is clearly more a sledgehammer than a precision tool; you’ve probably heard of ‘super-spreaders’, for example.

But even though the number of infections caused by an individual case varies widely with many different factors, $R_0$ is still an interesting quantity, since it tells us a lot about the outlook for a disease: a value smaller than one means the disease will gradually die out, while a value greater than one means it will keep spreading, and the larger it is, the faster the spread.

Now we will go down the dangerous road of fitting the exponential curves shown above. This is something you really should not do if you’re after precise results, but since all we want is a feeling for how $R_0$ is obtained in principle, the fits will do for now.

SIR model

One of the most common models to describe the dynamics of an epidemic is the SIR model. It assumes that individuals in a population of size $N$ can be in one of three states: susceptible, infected, or recovered. The dynamics of the number $S$ of susceptible, $I$ of infected, and $R$ of recovered persons is then described by a set of three coupled ordinary differential equations:

\[\begin{align*} \frac{\mathrm{d}S}{\mathrm{d}t} &= -\frac{\beta I S}N\,,\\ \frac{\mathrm{d}I}{\mathrm{d}t} &= \frac{\beta I S}N - \gamma I\,,\\ \frac{\mathrm{d}R}{\mathrm{d}t} &= \gamma I\,. \end{align*}\]

Here, $\beta$ is the product of the average number of contacts per person per time interval and the probability of infecting each of these contacts, and $\gamma$ the rate of recovery or mortality. The first equation tells us how quickly people pass from susceptible to being infected. The last one states that people move from infected to recovered at rate $\gamma$. The number of infected (middle equation) is then the negative sum of those. This yields a conservation law, $\dot S + \dot I + \dot R = 0$, and hence $S(t) + I(t) + R(t) = N$. This means that we don’t consider changes in population size (like births or deaths). There are many variations of this basic model that deal e.g. with different rates in different age groups, etc.
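The equations above are easy to integrate numerically. Here is a minimal sketch using `scipy.integrate.odeint`; the population size and the values of $\beta$ and $\gamma$ are illustrative choices (picked so that $\beta/\gamma = 2$), not fitted parameters.

```python
import numpy as np
from scipy.integrate import odeint

def sir(y, t, beta, gamma, N):
    """Right-hand side of the SIR equations."""
    S, I, R = y
    dS = -beta * I * S / N
    dI = beta * I * S / N - gamma * I
    dR = gamma * I
    return [dS, dI, dR]

N = 1_000_000            # population size (illustrative)
beta, gamma = 0.4, 0.2   # beta/gamma = 2 (illustrative)
t = np.linspace(0, 200, 201)

# start with a single infected person
S, I, R = odeint(sir, [N - 1, 1, 0], t, args=(beta, gamma, N)).T
```

A quick sanity check on the output is the conservation law from the text: `S + I + R` should stay equal to `N` for all times (up to numerical error), while `S` decreases and `R` increases monotonically.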

The much talked about $R_0$ is then defined as the rate of infection divided by the rate at which people go from infected to recovered:

\[R_0 = \frac \beta \gamma\]

Now, looking at this paper by Wallinga and Lipsitch, we see that we can estimate $R_0$ (simply $R$ in their notation) from the exponential rise at the onset of the disease, using the formula

\[R_0 = 1 + r\,T_c\,,\]

where $r$ is the growth rate (from fitting the exponential rise in early case numbers) and $T_c$ the disease’s generation interval.

Some code to get you started could look like this:

import numpy
import statsmodels.api as sm

def get_interval(n, N, colname, df):
    # slice of the case counts between the first day with >= n cases
    # and the first day with > N cases
    first = (df.index[df[colname] >= n][0] - df.index[0]).days
    last = (df.index[df[colname] > N][0] - df.index[0]).days
    return df.reset_index()[colname][first:last]

def fit_onset(n, N, country):
    # fit log(cases) linearly against time: the slope is the growth rate r
    y = numpy.log(get_interval(n, N, (country, float('nan')), df_c))
    X = sm.add_constant(numpy.arange(len(y)))
    res = sm.OLS(y, X).fit()
    return res.params[1], res.conf_int()[1]  # growth rate and its 95% CI

The results we’re getting are the following:

| Country | $r$      | 2.5% conf. | 97.5% conf. |
|---------|----------|------------|-------------|
| Spain   | 0.230066 | 0.211205   | 0.248928    |
| Italy   | 0.172809 | 0.160298   | 0.185320    |
| France  | 0.171130 | 0.159568   | 0.182692    |
| Germany | 0.188302 | 0.174079   | 0.202524    |

As we could have guessed from the curves shown above, we get similar results for the countries we consider. The remaining differences in the rates can be attributed either to actual differences in how the disease spreads in those countries or to the methodology employed to identify and count cases.

Now all we are missing is the generation interval. In this study, we find an estimate of 5.2 days, with a confidence interval of length 3 days. We can model this using a normal distribution centered at 5.2 with a standard deviation close to one. We now generate bootstrap samples from the table above, calculate the mean growth rate, and draw samples from the normal distribution for $T_c$. The results look like this:
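The bootstrap described above can be sketched as follows. The growth rates come from the table; modeling $T_c$ as normal with a standard deviation derived from a length-3 interval (half-width divided by 1.96) is our own simplifying assumption, and the seed and sample count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)

# growth rates r per country, from the fits above
rates = np.array([0.230066, 0.172809, 0.171130, 0.188302])

n_boot = 10_000
# bootstrap: resample the country rates with replacement and average
r_samples = rng.choice(rates, size=(n_boot, len(rates))).mean(axis=1)
# generation interval T_c ~ Normal(5.2, sd), with sd = 3 / (2 * 1.96) ≈ 0.77
Tc_samples = rng.normal(5.2, 3 / (2 * 1.96), size=n_boot)

# Wallinga–Lipsitch formula, applied to each bootstrap sample
R0_samples = 1 + r_samples * Tc_samples
lo, med, hi = np.percentile(R0_samples, [5, 50, 95])
```

Because both the growth rate and the generation interval are resampled, the spread of `R0_samples` reflects the uncertainty in both inputs at once, which is what the quoted interval is based on.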

With this, we get an estimate for $R_0$ of 2.01, with a confidence interval (5th to 95th percentile) from 1.67 to 2.31, which is very similar to the numbers we heard in the press. This would mean that each case of COVID-19 yields two more infections on average. Of course this result is derived from a number of assumptions and hand-waving arguments, but we have nevertheless arrived at a credible estimate and can quote a confidence interval on our result.

What does this tell us? On the policy side, the required action is clear: the value of $R_0$, around two, needs to be pushed below one. There are two levers available to achieve this (remember the definitions of $R_0$ and $\beta$). First, the number of social contacts can be limited; governments are working to achieve this by closing schools, limiting public life, and so on. Second, one can try to reduce the probability that a contact leads to an infection, by washing your hands, avoiding touching your face, wearing a mask, and so on.

I hope you’ve learned something about the modeling of diseases and gained a better understanding of the numbers presented in discussions around COVID-19, and of the rationale behind many of the measures we see implemented around the globe. Stay safe, and stay tuned for the next data adventure!