On the last data adventure, we estimated the novel Coronavirus’ basic reproduction number $R_0$ using some Python scripting and basic exponential fitting. As much fun as that was, I was wondering if one could gain a more dynamic understanding of the situation. Luckily, I stumbled upon this blogpost, based on this journal article, accompanied by this notebook, which attempts estimating a dynamic version of the reproduction number, called $R_t$. The interesting idea behind $R_t$ is that it will give some indication on how well measures aimed at reducing the spread of COVID-19 work in a given country or region.

Disclaimer: As I said before, I’m not a medical professional. This blog post aims at investigating some interesting data analysis techniques, not a scientific study on the spread of the COVID-19 pandemic.

The Method

The paper starts with a more or less simple observation. Under the broad assumption that only a relatively small fraction of the total population gets infected at a given time, one can integrate the SIR model’s differential equations to obtain the following expression for $\Delta T(t)$, the number of newly infected individuals at time $t$.


For our data, $\tau = 1\,\mathrm{day}$, and $\gamma$ is the inverse of the infectuous period of around 5.2 days according to this study. So in other words, with a given number of newly infected people $\Delta T$ at time $t$, and a fixed reproduction number $R_t$, we would expect to see $b$ times that number at time $t + \tau$. Now we can define a likelihood function for $R_t$ by assuming a Poisson distribution for $\Delta T(t + \tau)$ with mean $b(R_t)\,\Delta T(t)$.

For the case numbers in Italy, taken from here, we get the following likelihood functions for the first few days with a significant number for $\Delta T$.

One caveat here is that the raw case numbers need to be smoothed. The reported data is fluctuating very strongly:

This gives observations that are incompatible with our likelihood model, i.e. the support of the likelihood functions doesn’t overlap. Hence we introduce a Gaussian smoothing to the case numbers:

Bayesian Estimates

Once we have our likelihood function defined, we can use successive Bayesian updates starting with a uniform distribution for $R_{t = 0}$.

With this procedure, we not only get an estimate for $R_t$, but also a confidence interval:

This can be very useful e.g. when there are very few cases which will make our estimates more uncertain. Take South Korea, for example.

Even though the estimation method here is rough and results should not be used to inform public policy, it can serve for organizations to get a read on how measures in various countries are effecting the spread of COVID-19. This can give a somewhat better read on where to expect relaxing of public measures and where the restrictions will be in effect for a longer while.

I hope you’ve enjoyed this data adventure. Stay home, stay safe, and stay tuned for the next one!