I’ve been working a fair bit with anomaly detection in the last months, and browsing through Andrew Ng’s excellent machine learning course, I was curious to try out his anomaly detection algorithm. It reads a bit like this:

  1. Select a number of features \( X_1, \ldots, X_n \).

  2. If any variable \( X_i \) does not look Gaussian enough (for some definition of “Gaussian enough”), find a suitable transformation \( x’_i = g_i(x) \) with e.g. \( g_i(x) = x^{0.5} \) and use the \( X’_i \) instead of \( X_i \) where needed.

  3. Calculate the mean and standard deviation \( \mu_i \) and \( \sigma_i \) of your new features.

  4. Calculate the Gaussian density \( f(x_i’|\mu_i,\sigma_i) = \frac 1 {\sqrt{2 \pi \sigma^2}} e^{-(x_i’ - \sigma_i)^2 / (2 \sigma^2)} \).

  5. Calculate the total density \( F = \prod_i f(x_i’|\mu_i, \sigma_i) \) and call the observation an anomaly if \( F < \epsilon \) for some chosen \( \epsilon \).

So let’s go back to our soccer data, fire up our R, and see what we can find. As our features, we use the passes per minute (PPM) and passes completed per minute (PCPM) for each player. In a real-world anomaly detection scenario, one wouldn’t only use two features, and especially two that are so strongly correlated. But let’s have some fun with this. Applying some simple transformations (roots, basically), we find that both features can be made into something that looks at least somewhat Gaussian.

The fitting is done easily with a few dplyr group-by-and-summarize iterations and I’ll leave the details to the eager reader (it’s a good and easy exercise). The result will, plotted in [ggplot2] look something like this.

Now, using two independent Gaussians to fit our data makes the edges of the plot look all wrong. Just the outer fringes are marked as anomalous while some points that look outlier-ish to the naked eye are not. Also inspecting the points manually, one doesn’t feel encouraged.

Name PPM PCPM P
Toni Kroos 1.1456140 1.0666667 1.480058e-08
Bruno Soriano 1.0892857 1.0714286 3.962323e-08
Nuri Şahin 1.0222222 0.9555556 1.123398e-06
Granit Xhaka 1.0205128 0.9333333 1.684373e-06
Andrés Iniesta 0.9888889 0.9083333 4.268718e-06
Lukas Podolski 1.0000000 0.8888889 4.847362e-06

The long answer has to do with the fact that density estimation is hard. The short answer is using a multivariate Gaussian that allows for correlations in the variables. I used the fantastic mclust package to do mine. The results look way more interesting.

If we look again at the players in the lowest-density regions, we actually see something that looks like anomalies (and just poor passing skills).

Name PPM PCPM
Divock Origi 0.22222222 0.05555556
Conor McLaughlin 0.17777778 0.08888889
Kyle Lafferty 0.12322275 0.06635071

The topic of density estimation in general is quite interesting, I’d encourage you to read up on some of the techniques out there and play around with mclust.