Data Science For a Cause, Grouping Conflicts in Africa

Last time we started looking at a fascinating data set from the Armed Conflict Location & Event Data Project (ACLED), which lists incidents in armed conflicts all over Africa and in some countries in South and Southeast Asia since 1997, in great detail. Make sure you have a look; if you like, you can download my code from GitHub to get started.

We will keep using R, with the excellent dplyr package for data wrangling and ggmap for plotting maps. Today we want to find groups of incidents that are geographically close together and that occurred at similar times. This is a fairly straightforward unsupervised learning problem, namely clustering. In particular, we will use hierarchical clustering, implemented in R in the hclust function.
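To get a feel for hclust before turning to the conflict data, here is a minimal sketch on a handful of made-up points. The points and the names toy.points, toy.model and toy.clusters are invented for illustration and have nothing to do with the ACLED data.

```r
# Toy data: two made-up groups of points in the plane (not ACLED data).
toy.points <- rbind(
    c(0.0, 0.0), c(0.1, 0.2), c(0.2, 0.1),   # group near the origin
    c(5.0, 5.0), c(5.1, 4.9), c(4.9, 5.2))   # group near (5, 5)

# dist() computes pairwise Euclidean distances by default;
# hclust() then merges the closest groups step by step
# (complete linkage is its default method).
toy.model <- hclust(dist(toy.points))

# Cut the tree into two groups; each point gets a cluster label.
toy.clusters <- cutree(toy.model, k=2)
print(toy.clusters)
```

The first three points should end up with one label and the last three with another, which is the kind of grouping we are after for the incidents.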

Before we can apply it, though, we should normalize our data. This is always wise when you apply methods like clustering that compare variables measured in different units. The dist function, which calculates the distance matrix that in turn is the input to hclust, needs numerical input, so we convert our date column into a numerical value simply by calling

time=scale(as.numeric(dmy(EVENT_DATE)))
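To make the conversion concrete, here is a small sketch on made-up dates. It uses base R's as.Date with an explicit day/month/year format as a stand-in for lubridate's dmy, so that it runs without extra packages. The dates and the names event.dates, days and time.scaled are assumptions for illustration only.

```r
# Made-up event dates in day/month/year form (illustrative only).
event.dates <- c("01/01/1997", "15/06/2005", "31/12/2015")

# as.Date() with an explicit format stands in for lubridate's dmy() here.
# as.numeric() on a Date gives the number of days since 1970-01-01.
days <- as.numeric(as.Date(event.dates, format="%d/%m/%Y"))
print(days)

# scale() subtracts the mean and divides by the standard deviation,
# returning a one-column matrix; as.numeric() flattens it to a vector.
time.scaled <- as.numeric(scale(days))
print(mean(time.scaled))  # approximately 0
print(sd(time.scaled))    # approximately 1
```

The raw values are in the thousands, while latitude and longitude span only tens of degrees, which is why the rescaling matters for the distance calculation.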

Now if we didn’t use the scale function to remove the mean and bring the standard deviation to one, we would run into trouble when we start comparing this column with latitude and longitude, which are measured in completely different units. Our code looks like this:

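library(readr)
library(ggmap)
library(dplyr)
library(lubridate)
# Load the ACLED incidents (read_csv reads the gzipped file directly).
conflict.data <- read_csv(
    "ACLED Version 6 All Africa 1997-2015_csv_dyadic.csv.gz")
# Keep incidents with more than 10 fatalities and add scaled copies of
# latitude, longitude and date (mean 0, standard deviation 1).
conflict.data <- conflict.data %>%
    filter(FATALITIES > 10) %>%
    mutate(lat=scale(LATITUDE),
           long=scale(LONGITUDE),
           time=scale(as.numeric(dmy(EVENT_DATE)))
           )
# Hierarchical clustering on the Euclidean distances between incidents.
model <- hclust(
    dist(
        conflict.data %>%
        select(lat, long, time)))
# Cut the tree at height 3 and store each incident's cluster label.
conflict.data$CLUSTER <- as.factor(cutree(model, h=3))
# Bounding box around all incidents, extended by 20% on each side.
bbox <- make_bbox(LONGITUDE,
                  LATITUDE,
                  data=conflict.data,
                  f=0.2)
# Fetch the background map for that bounding box.
africa <- get_map(bbox)
# Plot the incidents on the map, colored by cluster.
ggmap(africa) +
    geom_point(aes(x=LONGITUDE,
                   y=LATITUDE,
                   color=CLUSTER),
               data=conflict.data,
               size=2.5) +
    xlim(-20, 40) +
    ylim(-35, 35)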
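A note on the cutree call: with h=3 the tree is cut at a fixed height, so the number of clusters is not chosen in advance but follows from the data. The sketch below illustrates this on random made-up data standing in for the three scaled columns; toy.data, by.height and by.count are hypothetical names, and the value k=4 is picked arbitrarily.

```r
set.seed(42)  # make the made-up data reproducible

# Random stand-in for three scaled columns (not ACLED data).
toy.data <- data.frame(lat=rnorm(50), long=rnorm(50), time=rnorm(50))
toy.model <- hclust(dist(toy.data))

# Cutting at a height: the number of clusters follows from the data.
by.height <- cutree(toy.model, h=3)
print(table(by.height))  # cluster sizes

# Cutting into a fixed number of clusters instead.
by.count <- cutree(toy.model, k=4)
print(length(unique(by.count)))
```

Printing the cluster sizes with table is a quick way to check whether a chosen height gives a handful of sizeable groups or many tiny ones.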

Now let’s look at the output.

Clusters, map data by Google

It seems like the clustering algorithm does a sensible job. Stay tuned for a more in-depth analysis of the clusters.