After a first look at Metaflow in the last data adventure, today we’ll use it for something more useful. I run this blog with Jekyll as my static site generator, upload the whole thing to an AWS S3 bucket, and serve it through AWS CloudFront. I used Google Analytics for a while, but stopped due to privacy concerns. However, CloudFront offers simple logging functionality that gives the inquisitive admin the option to do some basic analysis of how people use their website.
The issue is that the logs are just dumped as gzipped CSV files into an S3 bucket. Lots of files. And this is where Metaflow comes in. Let’s make a simple flow that will download all the data and dump it into a Pandas DataFrame:
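The per-file work in such a flow boils down to decompressing one log object and turning it into rows. Here is a minimal sketch of that step using only the standard library. It assumes the usual CloudFront standard log layout (tab-separated values with `#Version` and `#Fields:` header lines); the function name `parse_cloudfront_log` and the shortened sample columns are made up for illustration.

```python
import csv
import gzip


def parse_cloudfront_log(raw: bytes) -> list[dict]:
    """Turn one gzipped CloudFront log file into a list of row dicts."""
    lines = gzip.decompress(raw).decode("utf-8").splitlines()

    fields = []      # column names, taken from the '#Fields:' header line
    data_lines = []  # everything that is not a header or an empty line
    for line in lines:
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line.startswith("#") or not line.strip():
            continue
        else:
            data_lines.append(line)

    # Assumption: fields are tab-separated and never quoted.
    reader = csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(fields, row)) for row in reader]


if __name__ == "__main__":
    # Hypothetical, heavily shortened sample; real logs have many more columns.
    sample = (
        "#Version: 1.0\n"
        "#Fields: date time cs-method cs-uri-stem sc-status\n"
        "2020-01-01\t00:00:01\tGET\t/robots.txt\t200\n"
        "2020-01-01\t00:00:02\tGET\t/wp-login.php\t404\n"
    )
    rows = parse_cloudfront_log(gzip.compress(sample.encode("utf-8")))
    print(len(rows), "rows parsed")
```

In a flow, each foreach branch could call a function like this on the bytes of one log file, and the join step could combine the per-file rows with `pandas.DataFrame(rows)`.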
I actually had to run the script with a higher limit on the number of
splits allowed in a foreach, which Metaflow lets you set through a
parameter. Here probably lies the most severe limitation of this
approach. So I wouldn’t use this as a production solution, but to get
to know Metaflow a bit better, this is fine.
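One possible way around the split limit, which is not what I did here, is to fan out over batches of files instead of single files, so the number of foreach branches stays fixed no matter how many logs pile up. A minimal sketch, assuming a hypothetical helper `chunk` and made-up file names:

```python
def chunk(items, n_chunks):
    """Split items into at most n_chunks contiguous batches of roughly equal size."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    # Ceiling division: smallest batch size that fits everything into n_chunks.
    size = max(1, -(-len(items) // n_chunks))
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == "__main__":
    # Hypothetical keys; in the flow these would be the log files in the bucket.
    keys = [f"logs/file-{i:04d}.gz" for i in range(250)]
    batches = chunk(keys, 100)
    print(len(batches), "batches, largest has", max(len(b) for b in batches), "keys")
```

Each foreach branch would then loop over the keys in its batch, trading some parallelism for a bounded number of splits.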
So what did I find out? One thing is quite intriguing: plotting the percentage of traffic going to each unique URI stem, we get the following picture.
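The numbers behind such a plot are just a normalized count per URI stem. A minimal sketch with the standard library, assuming rows are dicts keyed by the CloudFront field name `cs-uri-stem`; the function name `uri_stem_shares` and the sample rows are made up for illustration:

```python
from collections import Counter


def uri_stem_shares(rows, column="cs-uri-stem"):
    """Return (stem, percentage of requests) pairs, most requested first."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return [(stem, 100.0 * n / total) for stem, n in counts.most_common()]


if __name__ == "__main__":
    # Hypothetical sample rows, not real traffic data.
    rows = [
        {"cs-uri-stem": "/robots.txt"},
        {"cs-uri-stem": "/robots.txt"},
        {"cs-uri-stem": "/wp-login.php"},
        {"cs-uri-stem": "/"},
    ]
    for stem, share in uri_stem_shares(rows):
        print(f"{share:5.1f}%  {stem}")
```

With the data already in a Pandas DataFrame `df`, `df["cs-uri-stem"].value_counts(normalize=True) * 100` should give the same percentages.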
Let’s break this down a little bit. The most popular URI stem is
robots.txt. These are good-citizen robots checking which parts of
data-adventures.com they are allowed to crawl. Looking a bit
further down the list, we see the stem
wp-login.php. These are
actually hackers trying to find vulnerable (read: the admin forgot or was
too lazy to change the default password) WordPress
sites. This seems to be a very common thing, and Darknet
Diaries ran an interesting podcast episode on this. The general
idea is to plant code on the
website that then skims credit card data off of it. Scary stuff.
I hope you’ve enjoyed this data adventure and stay tuned for more.