When using data science in a business context, we aim to make decisions based on our data. This can have several advantages: machines can potentially make decisions faster, at higher volume, and with fewer errors than humans, giving organizations that master data-driven decision making a huge competitive advantage. However, not all models and data sets are created equal, and business insight can make the difference between winning and shooting yourself in the foot.

As an example, we’ll use a very simple dataset featured on Kaggle: the cardiovascular disease dataset.

import pandas

# The CSV inside the Kaggle zip is semicolon-separated.
data = pandas.read_csv("cardio_train.csv.zip", sep=";")

The data describes a sample of individuals, giving some medical and lifestyle attributes. It contains the following columns:

  • id
  • age
  • gender
  • height
  • weight
  • ap_hi
  • ap_lo
  • cholesterol
  • gluc
  • smoke
  • alco
  • active
  • cardio

I won’t go into all the details, as they can be found on the dataset page at Kaggle. We will focus on the variable cardio, the absence or presence of cardiovascular disease, as well as the smoke column, an indicator of a self-reported smoking habit. Let’s use Pandas to look at what correlates with the presence of cardiovascular disease first.
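The correlation scan itself isn’t reproduced here; with the frame loaded above, it would amount to something like data.corr()["cardio"]. As a self-contained sketch, on a made-up miniature frame (the values below are invented purely to make the snippet runnable, they are not the Kaggle data):

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle frame: same column names,
# invented values, just enough rows to demonstrate the call.
toy = pd.DataFrame({
    "age":    [18393, 20228, 18857, 17623, 22113, 19834],  # age in days
    "weight": [62.0, 85.0, 64.0, 82.0, 90.0, 71.0],
    "smoke":  [0, 0, 0, 1, 0, 1],
    "cardio": [0, 1, 1, 1, 1, 0],
})

# Correlation of every column with the target, strongest positive first.
corr_with_cardio = toy.corr()["cardio"].sort_values(ascending=False)
print(corr_with_cardio)
```

On the real frame the same one-liner produces the ranking discussed below; the target’s self-correlation of 1.0 at the top is a useful sanity check that you picked the right column.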



Most of this is not too surprising: old age, high cholesterol, and high weight are highly correlated with the presence of cardiovascular disease. A comment on the cholesterol indicator: I know I should not use corr with categorical variables, but given how they are coded, it will not do too much harm in exploratory analysis.

One thing stands out though: Smoking seems negatively correlated with cardiovascular disease. This can’t possibly be right, can it? Let’s look at what correlates with smoke in turn.
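That second pass isn’t shown either; it is the same call with the target swapped, data.corr()["smoke"]. For a single pair of columns, pandas also offers Series.corr, sketched here on invented values (again, not the Kaggle data):

```python
import pandas as pd

# Made-up miniature sample; in it, the smokers happen to be the
# younger participants, mirroring the pattern discussed in the text.
toy = pd.DataFrame({
    "age":   [18393, 20228, 18857, 17623, 22113, 19834],  # age in days
    "smoke": [1, 0, 0, 1, 0, 0],
})

# Pearson correlation between age and smoking status.
age_smoke_corr = toy["age"].corr(toy["smoke"])
print(age_smoke_corr)
```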



Now we’re on to something! It turns out that age is negatively correlated with smoking: younger participants in the study are more likely to smoke. So the negative correlation between a smoking habit and cardiovascular disease might have to do with the correlation between age and smoking. That makes a lot of sense.

Let’s try to test this hypothesis. One simple thing we could do is fit a model that estimates the effect of smoking on cardiovascular disease while correcting for age. We would expect the apparent heart-disease-reducing effect of smoking to go away, or even reverse. We’ll use a simple logistic regression in statsmodels.

import statsmodels.formula.api as smf

# Logistic regression of disease status on smoking, controlling for age.
fit_results = smf.logit("cardio ~ smoke + age", data).fit()
print(fit_results.summary2())
Model:               Logit             Pseudo R-squared:  0.042
Dependent Variable:  cardio            AIC:               92989.5481
Date:                2022-06-06 09:25  BIC:               93017.0168
No. Observations:    70000             Log-Likelihood:    -46492.
Df Model:            2                 LL-Null:           -48520.
Df Residuals:        69997             LLR p-value:       0.0000
Converged:           1.0000            Scale:             1.0000
No. Iterations:      5.0000

               Coef.   Std.Err.         z   P>|z|    [0.025   0.975]
Intercept    -3.9221     0.0645  -60.8510  0.0000   -4.0484  -3.7957
smoke        -0.0309     0.0275   -1.1220  0.2619   -0.0848   0.0231
age           0.0002     0.0000   61.5588  0.0000    0.0002   0.0002

First things first: Looking at the metrics here, we see that we didn’t create a very good model. If we were to use this to predict cardiovascular disease in a population, we’d do a piss-poor job. But that’s not the goal here.

Looking at the confidence intervals on the fit coefficients, we can see that once age is taken into account, there is no longer a significant effect of smoking: zero is compatible with our coefficient within errors. We recently talked about why model errors and prediction errors matter.
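The “zero is compatible” reading can be made mechanical: a 95% interval that contains zero means the data cannot rule out a null effect. Using the smoke row from the summary above:

```python
# 95% confidence interval for the smoke coefficient in the
# age-adjusted model, copied from the summary above.
ci_low, ci_high = -0.0848, 0.0231

# The effect is distinguishable from zero only if the whole
# interval lies on one side of it.
significant = not (ci_low <= 0.0 <= ci_high)
print(significant)  # prints False: zero sits inside the interval
```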

Now what would have happened if we had fitted a model looking only at the smoking habit?

# Same regression, but without the age correction.
fit_results = smf.logit("cardio ~ smoke", data).fit()
print(fit_results.summary2())
Optimization terminated successfully.
         Current function value: 0.693027
         Iterations 3
Model:               Logit             Pseudo R-squared:  0.000
Dependent Variable:  cardio            AIC:               97027.7864
Date:                2022-06-06 09:24  BIC:               97046.0989
No. Observations:    70000             Log-Likelihood:    -48512.
Df Model:            1                 LL-Null:           -48520.
Df Residuals:        69998             LLR p-value:       4.1673e-05
Converged:           1.0000            Scale:             1.0000
No. Iterations:      3.0000

               Coef.   Std.Err.        z   P>|z|    [0.025   0.975]
Intercept     0.0084     0.0079   1.0647  0.2870   -0.0071   0.0239
smoke        -0.1093     0.0267  -4.0956  0.0000   -0.1617  -0.0570

Hopefully we would have seen how poor a job the model does and stopped right there. But if we hadn’t kept a keen eye on the model metrics and had instead looked only at the fit coefficients, we could have come to the conclusion that yes, within errors, smoking really does have a protective effect when it comes to heart disease.
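How poor a job, exactly? McFadden’s pseudo R-squared is one minus the ratio of the model log-likelihood to the null log-likelihood, and with the values quoted in the summary above it is essentially zero:

```python
# Log-likelihoods copied from the smoke-only model's summary.
ll_model = -48512.0
ll_null = -48520.0

# McFadden's pseudo R-squared: 1 - LL_model / LL_null.
pseudo_r2 = 1.0 - ll_model / ll_null
print(round(pseudo_r2, 5))  # ~0.00016: barely better than no model at all
```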

This is of course an absurd conclusion. But we see this only because the negative health impact of smoking is common knowledge. That is not always the case when it comes to complex business problems. Subtleties can easily be overlooked and the wrong conclusions drawn when people blindly apply models to problems they don’t fully understand and the business is not sufficiently involved. I’ve seen this too many times, and it can cost businesses dearly: in monetary terms when the wrong decisions are taken, but often more importantly in a loss of trust in AI that hurts much more in the long term.

I hope you’ve enjoyed this data adventure and stay tuned for more!