Anomaly Detection in R: Euro 2016 Edition.
I’ve been working a fair bit with anomaly detection in the last months, and browsing through Andrew Ng’s excellent machine learning course, I was curious to try out his anomaly detection algorithm. It reads a bit like this:

Select a number of features \( X_1, \ldots, X_n \).

If any variable \( X_i \) does not look Gaussian enough (for some definition of “Gaussian enough”), find a suitable transformation \( x’_i = g_i(x) \) with e.g. \( g_i(x) = x^{0.5} \) and use the \( X’_i \) instead of \( X_i \) where needed.

Calculate the mean and standard deviation \( \mu_i \) and \( \sigma_i \) of your new features.

Calculate the Gaussian density \( f(x_i’\mu_i,\sigma_i) = \frac 1 {\sqrt{2 \pi \sigma^2}} e^{(x_i’  \sigma_i)^2 / (2 \sigma^2)} \).

Calculate the total density \( F = \prod_i f(x_i’\mu_i, \sigma_i) \) and call the observation an anomaly if \( F < \epsilon \) for some chosen \( \epsilon \).
So let’s go back to our soccer data, fire up our R, and see what we can find. As our features, we use the passes per minute (PPM) and passes completed per minute (PCPM) for each player. In a realworld anomaly detection scenario, one wouldn’t only use two features, and especially two that are so strongly correlated. But let’s have some fun with this. Applying some simple transformations (roots, basically), we find that both features can be made into something that looks at least somewhat Gaussian.
The fitting is done easily with a few dplyr groupbyandsummarize iterations and I’ll leave the details to the eager reader (it’s a good and easy exercise). The result will, plotted in [ggplot2] look something like this.
Now, using two independent Gaussians to fit our data makes the edges of the plot look all wrong. Just the outer fringes are marked as anomalous while some points that look outlierish to the naked eye are not. Also inspecting the points manually, one doesn’t feel encouraged.
Name PPM PCPM PToni Kroos  1.1456140  1.0666667  1.480058e08 
Bruno Soriano  1.0892857  1.0714286  3.962323e08 
Nuri Şahin  1.0222222  0.9555556  1.123398e06 
Granit Xhaka  1.0205128  0.9333333  1.684373e06 
Andrés Iniesta  0.9888889  0.9083333  4.268718e06 
Lukas Podolski  1.0000000  0.8888889  4.847362e06 
The long answer has to do with the fact that density estimation is hard. The short answer is using a multivariate Gaussian that allows for correlations in the variables. I used the fantastic mclust package to do mine. The results look way more interesting.
If we look again at the players in the lowestdensity regions, we actually see something that looks like anomalies (and just poor passing skills).
Name PPM PCPMDivock Origi  0.22222222  0.05555556 
Conor McLaughlin  0.17777778  0.08888889 
Kyle Lafferty  0.12322275  0.06635071 
The topic of density estimation in general is quite interesting, I’d encourage you to read up on some of the techniques out there and play around with mclust.