When using data science in a business context, we aim to make decisions based on our data. This can have several advantages. Machines can potentially make decisions faster, at higher volume, and with fewer errors than humans, giving organizations that master data-driven decision making a huge competitive advantage. However, not all models and data sets are created equal, and business insight can make the difference between winning and shooting yourself in the foot.

As an example, we’ll use a very simple dataset featured on Kaggle: the cardiovascular disease dataset.

import pandas

data = pandas.read_csv("cardio_train.csv.zip", sep=";")

The data describes a sample of individuals, giving some medical and lifestyle attributes. It contains the following columns:

  • id
  • age
  • gender
  • height
  • weight
  • ap_hi
  • ap_lo
  • cholesterol
  • gluc
  • smoke
  • alco
  • active
  • cardio

I won’t go into all the details, as they can be found on the dataset page at Kaggle. We will focus on two variables: cardio, the absence or presence of cardiovascular disease, and smoke, an indicator of a self-reported smoking habit. Let’s first use Pandas to look at what correlates with the presence of cardiovascular disease.

data.corr()['cardio'].sort_values()[:-1].plot.bar()

Most of this is not too surprising: old age, high cholesterol, and high weight are all positively correlated with the presence of cardiovascular disease. A comment on the cholesterol indicator: I know, I should not use corr with categorical variables, but given how they are coded, it will not do too much harm in an exploratory analysis.
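For a coded categorical column such as cholesterol, an alternative to a correlation coefficient is to look at the disease rate within each level. The following is a minimal sketch of that idea; the helper name rate_by_level and the toy rows are made up for illustration and are not taken from the Kaggle file.

```python
import pandas

def rate_by_level(df, column, target="cardio"):
    """Mean of a 0/1 target within each level of a categorical column."""
    return df.groupby(column)[target].mean()

# Made-up toy rows for illustration only, NOT taken from the Kaggle file.
toy = pandas.DataFrame({
    "cholesterol": [1, 1, 1, 1, 2, 2, 3, 3],
    "cardio":      [0, 0, 1, 0, 1, 0, 1, 1],
})

# One disease rate per cholesterol level, no assumption of linear spacing.
rates = rate_by_level(toy, "cholesterol")
print(rates)
```

On the real data, the same call would be rate_by_level(data, "cholesterol").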

One thing stands out though: Smoking seems negatively correlated with cardiovascular disease. This can’t possibly be right, can it? Let’s look at what correlates with smoke in turn.

data.corr()['smoke'].sort_values()[:-1].plot.bar()

Now we’re on to something! It turns out that age is negatively correlated with smoking: older participants in the study are less likely to smoke. Since older participants are also more likely to have cardiovascular disease, the negative correlation between a smoking habit and cardiovascular disease might simply reflect the correlation between age and smoking. That makes a lot of sense.
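A quick way to inspect this age pattern is to split the participants into age bands and compare the share of smokers and the share of disease cases in each band. The sketch below assumes the column names age, smoke and cardio from the dataset; the helper rates_by_age_band and the toy rows are hypothetical and only there to make the example runnable.

```python
import pandas

def rates_by_age_band(df, n_bands=4):
    """Share of smokers and of cardio cases per age quantile band."""
    # Quantile bands work regardless of the unit the age column is stored in.
    bands = pandas.qcut(df["age"], q=n_bands)
    return df.groupby(bands, observed=True)[["smoke", "cardio"]].mean()

# Made-up toy rows for illustration only, NOT taken from the Kaggle file.
toy = pandas.DataFrame({
    "age":    [35, 38, 41, 44, 52, 55, 58, 61],
    "smoke":  [1, 1, 0, 1, 0, 0, 1, 0],
    "cardio": [0, 0, 0, 1, 1, 0, 1, 1],
})

summary = rates_by_age_band(toy, n_bands=2)
print(summary)
```

On the real data, rates_by_age_band(data) would show whether the smoking share falls and the disease share rises from the youngest to the oldest band.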

Let’s try to test this hypothesis. One simple thing we could do is fit a model that estimates the effect of smoking on cardiovascular disease while correcting for age. We would expect the apparent heart-disease-reducing effect of smoking to go away, or even reverse. We’ll use a simple logistic regression in statsmodels.

import statsmodels.formula.api as smf
fit_results = smf.logit("cardio ~ smoke + age", data).fit()
fit_results.summary2()

Model:               Logit             Pseudo R-squared:  0.042
Dependent Variable:  cardio            AIC:               92989.5481
Date:                2022-06-06 09:25  BIC:               93017.0168
No. Observations:    70000             Log-Likelihood:    -46492.
Df Model:            2                 LL-Null:           -48520.
Df Residuals:        69997             LLR p-value:       0.0000
Converged:           1.0000            Scale:             1.0000
No. Iterations:      5.0000

            Coef.    Std.Err.  z         P>|z|   [0.025   0.975]
Intercept   -3.9221  0.0645    -60.8510  0.0000  -4.0484  -3.7957
smoke       -0.0309  0.0275    -1.1220   0.2619  -0.0848  0.0231
age         0.0002   0.0000    61.5588   0.0000  0.0002   0.0002

First things first: judging by the fit metrics, such as the pseudo R-squared of 0.042, we didn’t create a very good model. If we were to use it to predict cardiovascular disease in a population, we’d do a piss-poor job. But that’s not the goal here.

Looking at the confidence intervals on the fit coefficients, we can see that once age is taken into account, there is no longer a significant effect of smoking: the interval for the smoke coefficient, from -0.0848 to 0.0231, includes zero. We recently talked about why model errors and prediction errors matter.
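A logit coefficient is a log-odds ratio, so exponentiating the coefficient and its interval bounds turns them into an odds ratio, which is often easier to discuss with a non-technical audience. Here is a small sketch using the smoke row of the summary above; the helper name odds_ratio is an illustrative assumption, not part of statsmodels.

```python
import math

def odds_ratio(coef, ci_low, ci_high):
    """Convert a logit coefficient and its interval bounds to odds ratios."""
    return tuple(math.exp(x) for x in (coef, ci_low, ci_high))

# Values copied from the smoke row of the age-adjusted model summary.
smoke_or, smoke_or_low, smoke_or_high = odds_ratio(-0.0309, -0.0848, 0.0231)

print(f"odds ratio {smoke_or:.3f}, "
      f"interval [{smoke_or_low:.3f}, {smoke_or_high:.3f}]")

# An odds ratio of 1.0 means "no effect"; here the interval spans it.
includes_one = smoke_or_low < 1.0 < smoke_or_high
print("interval includes 1.0:", includes_one)
```

An interval that includes an odds ratio of 1.0 is the same statement as a coefficient interval that includes zero.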

Now what would have happened if we had fitted a model looking only at the smoking habit?

fit_results = smf.logit("cardio ~ smoke", data).fit()
fit_results.summary2()

Optimization terminated successfully.
         Current function value: 0.693027
         Iterations 3

Model:               Logit             Pseudo R-squared:  0.000
Dependent Variable:  cardio            AIC:               97027.7864
Date:                2022-06-06 09:24  BIC:               97046.0989
No. Observations:    70000             Log-Likelihood:    -48512.
Df Model:            1                 LL-Null:           -48520.
Df Residuals:        69998             LLR p-value:       4.1673e-05
Converged:           1.0000            Scale:             1.0000
No. Iterations:      3.0000

            Coef.    Std.Err.  z        P>|z|   [0.025   0.975]
Intercept   0.0084   0.0079    1.0647   0.2870  -0.0071  0.0239
smoke       -0.1093  0.0267    -4.0956  0.0000  -0.1617  -0.0570

Hopefully we would have seen how poor a job the model does and stopped right there. But if we hadn’t kept a keen eye on the model metrics and had instead looked only at the fit coefficients, we could have come to the conclusion that yes, within errors, smoking really does have a protective effect when it comes to heart disease.
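This trap is a case of confounding, and a few lines of arithmetic can make it concrete. The sketch below uses invented counts (they are not derived from the Kaggle data) in which smokers have a higher disease rate than non-smokers within each age group, yet a lower rate overall, simply because most smokers are in the younger group.

```python
# Hypothetical counts, invented purely to illustrate confounding.
# Layout: counts[age_group][is_smoker] = (people, disease_cases)
counts = {
    "younger": {True: (800, 160), False: (200, 30)},
    "older":   {True: (200, 120), False: (800, 440)},
}

def disease_rate(groups, smoker):
    """Disease rate for smokers or non-smokers, pooled over the given groups."""
    people = sum(counts[g][smoker][0] for g in groups)
    cases = sum(counts[g][smoker][1] for g in groups)
    return cases / people

# Ignoring age: smokers minus non-smokers over the whole population.
crude_gap = disease_rate(counts, True) - disease_rate(counts, False)

# Correcting for age: the same comparison inside each age group.
gaps_by_age = {
    g: disease_rate([g], True) - disease_rate([g], False) for g in counts
}

print(f"crude gap: {crude_gap:+.2f}")
for group, gap in gaps_by_age.items():
    print(f"{group} gap: {gap:+.2f}")
```

With these made-up numbers the crude gap is negative while both within-group gaps are positive, the same kind of sign change that adding age to the model is meant to expose.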

This is of course an absurd conclusion. But we see this only because the negative health impact of smoking is common knowledge. This is not always the case when it comes to complex business problems. Subtleties can easily be overlooked, and the wrong conclusions drawn, when people blindly apply models to problems they don’t fully understand and the business side is not sufficiently involved. I’ve seen this too many times, and it can cost businesses dearly: in monetary terms when the wrong decisions are taken, but often more importantly in a loss of trust in AI, which will hurt much more in the long term.

I hope you’ve enjoyed this data adventure. Stay tuned for more!