A tale of goals, passes, web scraping, and being the only German who is not into Soccer.

Usually, I’m a proud member of the small club of Germans who don’t care too much for soccer and make up for this by at least following the national selection on big competitions, like this year’s Euro. After Germany has been kicked out by the host France in the semi-finals, it’s time for some soul searching and many a discussion was had about whether the German style of playing (focusing on ball control and short, safe passing) is as good as it seemed in the aftermath of the last world cup, won by Germany.

Not being the biggest Soccer enthusiast, I don’t feel to qualified to enter this discussion, but maybe we can find interesting data we can crunch to give us some insights.

Web Scraping 201 - What About JavaScript?

It’s about time I talked about web scraping. Writing a spider that follows links and extracts useful information is an important way for any data scientist worth his or her salt to gather interesting data. Now there are very nice frameworks around for this task. If you like Python, you should look into scrapy. Scala/Java users have with JSoup a powerful tool at their hands to extract entities from HTML pages (similar to beautifulsoup for Python).

An important thing to keep in mind is that you should have a look at the

robots.txt

page of the URL you want to extract information from to make sure they are okay with your spider grabbing data from their servers. Luckily, the UEFA does allow crawling their statistics on the Euro 2016 at the time of writing.

There is one more possible challenge though. Many pages are not static, but built at the time of viewing using JavaScript. This means that we cannot simply use a tool like scrapy to extract data since it doesn’t execute JavaScript. I had to write a set of Scala scripts using Selenium to get around this problem. Selenium is actually a web testing tool, but very suitable for jobs like this. Go check it out and play around with it. It’s really easy to use and a very handy tool to know.

import collection.JavaConversions._
import org.openqa.selenium.WebDriver
import org.openqa.selenium.By

object Crawl {

  def getUrls(driver: WebDriver) = {
    for (link <- driver.findElements(By.tagname("a"))) yield link.getAttribute("href")
  }

  def crawl(url: String, driver: WebDriver): Unit = {
    driver.get(url)
    println(getUrls.mkString("\n"))
    for (newUrl <- urls) craw(newUrl, driver) // might run a long time!
  }
}

Do Passes Matter?

After spending some time extracting the data, let’s look at some numbers and graphs. And models, of course. UEFA provides us with lots of interesting information on players and teams, but in this post I’ll focus on two main indicators: Passes and attempts to score. Now just comparing the total number of passes and attempts per team should take the number of games into account as well as if or not the game was decided in overtime (which is in turn only applicable after the group stage). As for passes, UEFA has good numbers on all teams that made it past the group stages. So let’s look at those.

APM

The bar chart above shows the passes per minute (PPM) per team, averaged over the players’ individual PPM. The color indicates which is the highest stage that the team reached. Now this is an interesting plot, as it shows that there is virtually no correlation between if a team passes the ball around a lot and when it was kicked out.

But what about goals? You can pass the ball around all you want, but you won’t win unless you score some goals. So let’s repeat the last plot, but with the attempts to score per minute (APM) instead of the PPM.

PPM

Due to some inconsistencies in the data, we now also have some teams that were kicked out in the group stage in our plot, most notably at first and next to last position. The teams of Austria and Turkey, both kicked out at the group stage, have the highest and next to lowest number (only Italy had less attempts per minute) of APMs.

I guess this is what makes soccer exciting. You can play very successful with a number of strategies, passing a lot or very little, making many or few attempts to score. To some degree it’s randomness that keeps us at the edge of our seats.

Nerd Zone: Ordinal Logistic Regression

Sometimes it’s just not enough to look at bar charts, but you have to quantify what you see. And since Germans take their soccer serious, I have all intention to back my ramblings with some numbers.

Let’s make sure that there is really not connection between APM, PPM, and at what stage you get kicked out. In order to do so, we use Ordinal Logistic Regression, using the MASS R package (check the tutorial here). The model we build using gives as expected a really bad AIC score.

Call:
    polr(formula = Stage ~ AvgPPM + AvgAPM, data = features)
    
    Coefficients:
             Value Std. Error t value
    AvgPPM   3.288      2.635  1.2479
    AvgAPM -22.803     50.428 -0.4522
    
    Intercepts:
                               Value    Std. Error t value 
    Group stage\|Quarter-finals  -0.3130   1.3682    -0.2288
    Quarter-finals\|Round of 16   0.5285   1.3426     0.3936
    Round of 16\|Semi-finals      2.3271   1.4676     1.5856
    
    Residual Deviance: 57.45138 
    AIC: 67.45138

So what does the null model say?

Call:
    polr(formula = Stage ~ 1, data = features)
    
    No coefficients
    
    Intercepts:
                               Value   Std. Error t value
    Group stage\|Quarter-finals -0.9808  0.4787    -2.0489
    Quarter-finals\|Round of 16 -0.1823  0.4282    -0.4257
    Round of 16\|Semi-finals     1.5040  0.5528     2.7209
    
    Residual Deviance: 59.05298 
    AIC: 65.05298 

We see that the features basically don’t change the AIC score, as we expected. So we can conclude that our PPM and APM values don’t influence a team’s success and all that is left is to wish both France and Portugal the best of luck for tonight’s final. May the best team win!