In applied AI, the predictive power of a model is often secondary. When business decisions are based on the predictions of a machine learning model, confidence in those predictions is often more important than pushing model metrics as far as they will go. Let me explain.

Take as a simple example a series of coin tosses. If you start flipping a coin and the first three outcomes are heads, a naive model will claim that the coin is biased. But intuitively we would say that three tries are probably too few to say anything with certainty. Would you base a business decision on this? What about 30 heads in a row?
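
To put rough numbers on that intuition, here is a quick back-of-the-envelope check (a sketch using scipy, separate from the analysis below): under a fair coin, the probability of seeing nothing but heads shrinks very fast with the number of tosses.

from scipy.stats import binom

for k in (3, 30):
    # probability of k heads in k tosses of a fair coin
    print(k, binom.pmf(k, k, 0.5))
# 3 tosses: 0.125, entirely plausible by chance
# 30 tosses: roughly 9e-10, very hard to explain with a fair coin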

To illustrate this with data, we’ll have a look at the COMPAS data set that ProPublica did a great analysis on. The short story: models are built to assess the risk that a criminal defendant will re-offend. We’ll just look at the ground truth, whether or not the defendant did re-offend, given their age group. Let’s set this up in Python.

import pandas
import numpy
from scipy.stats import binom
import matplotlib.pyplot as plt


url = ('https://github.com/dhesse/IN-STK5000-Autumn21'
       '/raw/main/compas-scores-two-years.csv.gz')
data = pandas.read_csv(url) # load the data
data.groupby('age_cat')['two_year_recid'].mean() # point estimate
age_cat
25 - 45            0.459723
Greater than 45    0.315990
Less than 25       0.565075
Name: two_year_recid, dtype: float64

We start with what is called a point estimate. This is the most naive model: it counts the number of re-offending defendants in each age category and divides by the number of defendants in that category. We certainly see a difference. Members of the youngest age category appear almost twice as likely to re-offend as those of the oldest. But what about the errors? Did we see enough data? Should we trust this, or are we judging prematurely, just like with the coin tosses?
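
Before answering that, one quick sanity check (a one-liner; output omitted here) is how many defendants actually fall into each age category, since small groups make for shaky estimates.

data.groupby('age_cat').size() # number of defendants per age category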

Bootstrap

A simple method we can apply is called bootstrapping. From our $N$ examples we draw $N$ new ones with replacement. We do this a large number of times (in our case 10,000) and can then calculate basic statistics on the resampled estimates (under some assumptions you can look up in the linked article), like confidence intervals. For us, a simple histogram will do.

bss = pandas.DataFrame([
    data.sample(data.shape[0], replace=True)  # resample N rows with replacement
    .groupby('age_cat')['two_year_recid'].mean()
    for _ in range(10000)])                   # 10,000 bootstrap replicates
bss.plot.hist(bins=30)

(Figure: histogram of the bootstrapped re-offense rates for each age group)

We see that in our 10,000 samples there is virtually no overlap between the categories, suggesting that there is actually a material difference in re-offense rates of the different age groups.
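
If you prefer numbers over a picture, the confidence intervals mentioned above can be read straight off the bootstrap samples, for instance as percentile intervals (a sketch; the exact bounds will vary slightly between runs):

bss.quantile([0.025, 0.975]) # 95% percentile intervals per age group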

What if we actually modeled this?

Binomial Model

Next, we fit a binomial model to each age group in the data by simply dividing the number of re-offenders by the number of defendants in each group, which is the maximum-likelihood estimate of that group’s re-offense probability. We then sample from each model and compare with the bootstrap results.

params = pandas.DataFrame({
    'N': data.groupby('age_cat').size(),                   # defendants per group
    'k': data.groupby('age_cat')['two_year_recid'].sum(),  # re-offenders per group
})
binomial = params.apply(
    lambda x: binom.rvs(x['N'],         # draw samples from Binomial(N, k/N)
                        x['k']/x['N'],
                        size=5000)/x['N'],                  # convert counts back to rates
    axis=1, result_type='expand').T
ax = binomial.plot.hist(bins=40, density=True, alpha=0.5)
bss.plot.hist(bins=40, ax=ax, density=True, alpha=0.5)
plt.legend(bbox_to_anchor=(1, 1))
ax = binomial.plot.hist(bins=40, density=True, alpha=0.5)
bss.plot.hist(bins=40, ax=ax, density=True, alpha=0.5)
plt.legend(bbox_to_anchor=(1,1))

(Figure: binomial-model samples overlaid on the bootstrap histograms)

As we can see, the two estimates lie basically on top of each other. Building a binomial model and bootstrapping are distinct methods with different assumptions, yet they give us the same result. It often pays off to go this route and cross-check your uncertainty estimates, especially if the stakes are high, like when deciding whether someone should be sent to prison or not.
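
To make “on top of each other” a bit more concrete, here is one way to compare the two (a sketch; the numbers will vary with the random draws), putting the 95% percentile intervals from both approaches side by side:

pandas.concat({'bootstrap': bss.quantile([0.025, 0.975]),
               'binomial': binomial.quantile([0.025, 0.975])})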

I hope you enjoyed today’s data adventure and will stay tuned for more! In the meantime maybe reflect on how much you can trust your models.