In the last two posts I explored a fascinating data set published by the Armed Conflict Location & Event Data Project (ACLED). You can find the code I used on github. The ACLED data lists incidents from armed conflicts all over Africa and in some countries in South and Southeast Asia, in great detail and going back to 1997.

At first, we used R and ggmap to plot the recorded incidents on a map. These can be protests or riots, but also outbursts of violence with many fatalities. To get a better overview, we then used hierarchical clustering to group those incidents temporally and spatially into conflicts. Something like this might be of value for an NGO or a news outlet in need of a system that automatically assigns a reported incident to a conflict without human intervention.

While the clustering worked well, we had to play a little trick that you might have noticed if you read the last post. In our code, we applied the following transformations to our data.

conflict.data <- conflict.data %>%
    filter(FATALITIES > 10) %>%                        # only keep incidents with more than 10 fatalities
    mutate(lat  = scale(LATITUDE),                     # centre and scale the coordinates ...
           long = scale(LONGITUDE),
           time = scale(as.numeric(dmy(EVENT_DATE))))  # ... and the event date

There, in the filter() call, you have it: we only considered incidents with more than 10 fatalities. This might make sense if one is only interested in major armed conflicts, but we did it purely because of the limited memory on the computer the code ran on. Time to bring out the big(ish) guns.

Apache Spark is currently my favorite big data tool. You can run it on your laptop, on a small cluster, or on a whole warehouse of machines, all with the same code, the same interface, and the same language (today, we'll use Python). And it's lightning fast. The first step is to read the data into Spark. The ACLED set has some extra newlines in the description column, which is why we have to jump through some hoops (CSV files usually use newlines as separators between records).

import datetime
import gzip

from dateutil.parser import parse

def read():
    # The description column can contain newlines inside quoted fields, so we
    # keep appending lines until the number of quote characters is even,
    # i.e. until the record is complete.
    with gzip.open('ACLED Version 6 All Africa 1997-2015_csv_dyadic.csv.gz') as inputFile:
        fullLine = ""
        for line in inputFile:
            fullLine += line
            if fullLine.count('"') % 2 == 0:
                yield fullLine.replace("\n", " ")
                fullLine = ""

sourceFile = sc.parallelize(read())

def parseLine(line):
    # Extract latitude, longitude, the age of the event in seconds,
    # and the number of fatalities (the last column).
    parts = line.split(',')
    try:
        return ((float(parts[19]),
                 float(parts[20]),
                 (datetime.datetime.now() - parse(parts[3])).total_seconds()),
                int(parts[-1]))
    except Exception:
        # Lines that don't parse (e.g. stray commas inside quoted fields) are dropped below.
        return None

data = sourceFile\
    .map(parseLine)\
    .filter(lambda x: x is not None)
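
Before we cluster anything, a quick sanity check tells us how many incidents survived the parsing and what a record looks like:

data.count()    # number of incidents that parsed cleanly
data.first()    # ((latitude, longitude, age in seconds), fatalities)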

Now we’re ready to cluster. Let’s look at the clustering methods Spark supports. In the documentation of MLlib, Spark’s machine learning library, we find a number of choices. For simplicity, let’s stick to k-means. The advantage of k-means is that it is a simple and fast algorithm; the drawback is that we have to specify the number of clusters we want to find up front. This can be a problem for some applications, but since this is an educational post, let’s go for it. Remember that we have to scale our inputs as discussed last time; luckily, Spark has a class that does just that for us.

from pyspark.mllib.clustering import KMeans
from pyspark.mllib.feature import StandardScaler

features = data.map(lambda x: x[0])      # (latitude, longitude, age in seconds)
fatalities = data.map(lambda x: x[1])

# Scale each feature to zero mean and unit variance, as discussed last time.
scaler = StandardScaler(withMean=True, withStd=True).fit(features)
normalizedData = scaler.transform(features)

nClusters = 10
model = KMeans.train(normalizedData, nClusters)
labels = model.predict(normalizedData)

To plot the data on a map, we use the amazing basemap package in Python. You can check the notebook I used for this post if you are interested in the gory details. The result looks like this.

[Figure africa15: the clustered incidents plotted on a map of Africa]
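
If you don't want to dig through the notebook, the plot boils down to something like the following minimal sketch (not the exact notebook code; the map extent and styling here are arbitrary choices):

import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

# Pull the coordinates and cluster labels back to the driver. Both RDDs derive
# deterministically from the same source, so their order matches.
coords = features.collect()        # [(latitude, longitude, age in seconds), ...]
clusters = labels.collect()        # one cluster index per incident
lats = [c[0] for c in coords]
lons = [c[1] for c in coords]

# A Mercator map roughly covering Africa.
m = Basemap(projection='merc', llcrnrlat=-40, urcrnrlat=40,
            llcrnrlon=-20, urcrnrlon=55, resolution='l')
m.drawcoastlines()
m.drawcountries()

x, y = m(lons, lats)               # project longitude/latitude to map coordinates
plt.scatter(x, y, c=clusters, s=5)
plt.show()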

Again, the clustering looks sensible. A news outlet that wants to automatically assign conflict identifiers to incident reports might just use a model like this (the sketch below shows what that would look like). Next time, I would like to dive a bit deeper into the data set and explore the individual clusters.
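
Assigning a fresh incident report would amount to scaling its coordinates and date with the same scaler and asking the model for the nearest centroid. A minimal sketch, reusing the scaler and model from above with made-up values:

import datetime
from dateutil.parser import parse

# Hypothetical new incident: latitude, longitude, event date.
new_lat, new_lon, new_date = 9.05, 7.49, "01/06/2015"
age = (datetime.datetime.now() - parse(new_date)).total_seconds()

scaled = scaler.transform([new_lat, new_lon, age])   # transform works on a single vector, too
cluster_id = model.predict(scaled)
print("Incident assigned to conflict cluster %d" % cluster_id)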