After we had a first look at Metaflow in the last data adventure, today we'll use Metaflow for something more useful. I'm running this blog on Jekyll as my static site generator and upload the whole thing to an AWS S3 bucket, serving it through AWS CloudFront. Now, I used Google Analytics for a while, but stopped due to privacy concerns. However, CloudFront offers a simple logging feature that gives the inquisitive admin the option to do some basic analysis of how people use their website.

The issue is that the logs are just dumped as gzipped CSV files into an S3 bucket. Lots of files. And this is where Metaflow comes in. Let's write a simple flow that downloads all the data and concatenates it into a Pandas DataFrame:

import boto3
import pandas
from metaflow import FlowSpec, step

FIELDS = ("date time x-edge-location sc-bytes c-ip cs-method cs(Host) "
          "cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) "
          "x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes "
          "time-taken x-forwarded-for ssl-protocol ssl-cipher "
          "x-edge-response-result-type cs-protocol-version fle-status "
          "fle-encrypted-fields c-port time-to-first-byte "
          "x-edge-detailed-result-type sc-content-type sc-content-len "
          "sc-range-start sc-range-end").split(' ')
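These names map one-to-one onto the whitespace-separated columns of each CloudFront log line. To make that concrete, here is a sketch that parses a made-up record against a subset of the fields (both the log line and its values are invented for illustration):

```python
# A subset of the CloudFront log schema above, for illustration only.
FIELDS = "date time x-edge-location sc-bytes c-ip cs-method".split(' ')

# A made-up, tab-separated log line in the same column order.
line = "2023-01-15\t12:34:56\tFRA50-C1\t5120\t203.0.113.7\tGET"

record = dict(zip(FIELDS, line.split('\t')))
print(record['cs-method'])  # GET
```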

class GetLogsAsDataFrame(FlowSpec):

    @step
    def start(self):
        # List every log object in the bucket and fan out over them.
        self.bucket_name = 'your-S3-bucket-here'
        self.bucket = boto3.client('s3').list_objects(Bucket=self.bucket_name)
        self.contents = self.bucket['Contents']
        self.next(self.bucket_df, foreach='contents')

    @step
    def bucket_df(self):
        # Read one gzipped log file into a DataFrame; the first two
        # lines are CloudFront header comments, hence skiprows=2.
        obj = boto3.client('s3').get_object(Bucket=self.bucket_name,
                                            Key=self.input['Key'])
        self.df = pandas.read_csv(obj['Body'], compression='gzip',
                                  skiprows=2, delimiter=r'\s+',
                                  names=FIELDS)
        self.next(self.join)

    @step
    def join(self, inputs):
        # Concatenate the per-file DataFrames into one.
        self.data = pandas.concat(input.df for input in inputs)
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == '__main__':
    GetLogsAsDataFrame()

I actually had to raise the number of splits allowed in a foreach using Metaflow's --max-num-splits parameter, since the flow spawns one task per log file. Here probably lies the most severe limitation of this approach: I wouldn't use it as a production solution, but for getting to know Metaflow a bit better, it's fine.

The Results

So what did I find out? One thing is quite intriguing: plotting the percentage of traffic going to each unique URI stem, we get the following picture.
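The numbers behind such a plot are just a normalized value_counts over the cs-uri-stem column. A minimal sketch, with made-up rows standing in for the flow's data artifact:

```python
import pandas as pd

# Toy stand-in for the concatenated log DataFrame produced by the flow.
df = pd.DataFrame({'cs-uri-stem': ['/robots.txt', '/robots.txt',
                                   '/wp-login.php', '/index.html']})

# Percentage of requests per unique URI stem, most popular first.
share = df['cs-uri-stem'].value_counts(normalize=True) * 100
print(share)
# share.plot.barh() would then draw the corresponding bar chart.
```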

Let's break this down a little. The most popular URI stem is robots.txt: these are good-citizen robots checking which parts of data-adventures.com they are allowed to crawl. Looking a bit further down the list, we see the stem wp-login.php. This is hackers probing for vulnerable WordPress sites (read: the admin forgot, or was too lazy, to change the default password). This seems to be a very common thing, and Darknet Diaries ran an interesting podcast episode on it. The general story goes that someone injects JavaScript code into a vulnerable website that then skims credit card data off of it. Scary stuff.

I hope you’ve enjoyed this data adventure and stay tuned for more.