DISCLAIMER

You don’t take things you read here seriously, do you? You really shouldn’t. At least not by and large. I hope you’ll get some neat ideas and will be inspired to try out a technology you haven’t used before every now and then, but it would be foolish to draw any conclusions from things I find in the datasets presented here. I mean, I’m a data scientist, and most of the methods I use in this blog are pretty sound, but don’t make life decisions based on things you read here. Consider yourself warned.

The Wonderful World of Online Dating

The other day, I came across a very neat data set containing approximately 60k anonymized profiles taken from the dating site OkCupid. Users have entered things like salary, habits (e.g. drinking, drug use, etc.), education, age, and a few more. One usually has to be quite careful with self-reported numbers (social desirability bias comes to mind), and taking a sample from a dating website as representative for the population at large certainly will introduce one hell of a selection bias. And yes, you should be very aware of which kind of biases your data will likely exhibit. It can be hard to quantify biases (c.f. the polls and the US’ president-elect) but it’s usually just a matter of common sense to identify the worst ones you will likely have to deal with.

But then sometimes you can ask a question that just navigates around the biases. Like this one: Assuming that a user being inactive for more than 14 days indicates that one found true love, what features of the profile make a person attractive? Age? Income? Education? Luckily, we can find out.

Let’s first have a look at those 14 days. The histogram of inactive days looks like this:

Even though 14 days inactivity meaning you found love is arbitrary and possibly overly optimistic, but looking at the graph it kind of makes sense (and I’m an optimist).

So what makes people being inactive? Let’s fit a logistic regression to the data. We shouldn’t have any illusions about it’s predictive power, but at least we will be able to answer an important question: What makes people fall in love? Brains or money? Hence the call for my model looks like this (be warned, you’ll need to do some data cleansing before this will work):

model <- glm(active~age+income+education+sex,
                 data = data,
                 family=binomial(link='logit'))

Plotting the relevant variables gives us the following graph.

So the romantics can rest assured, it’s what’s in your brain that counts, not how deep your pockets are.