In December, Netflix open-sourced Metaflow, a data science production framework. You can think of it as a one-stop shop for your data analytics workflow needs: a corporate suit for your data mess. It is very easy to use, Python-based, and supports many popular libraries such as scikit-learn and PyTorch, as well as the AWS cloud (which comes as no big surprise, since Netflix runs on AWS).

To give it a spin, I’ve made a simple workflow that downloads survey results for general elections in Germany and ultimately transforms them into a pandas DataFrame. The ultimate goal is to look at the biases of the various institutes compiling the surveys. For now, we can make pretty graphs like this one.

[Figure: Surveys]

The setup in Metaflow is very simple. Below is a snippet containing the first steps. All one needs is a class whose methods are the steps of the workflow, each decorated with @step and ending with a call to self.next() that names the following step. Even parallel execution of steps is possible, as you can see:

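import requests
from metaflow import FlowSpec, step

# get_sites() is defined elsewhere and is not shown in this snippet.


class DownloadData(FlowSpec):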
    @step
    def start(self):
        """
        Download site list.
        """
        self.sites = get_sites()
        self.next(self.load_site, foreach="sites")

    @step
    def load_site(self):
        """
        Download individual sites.
        """
        self.site_name = self.input["name"]
        response = requests.get(self.input["url"])
        response.encoding = response.apparent_encoding
        self.site_data = response.text
        self.next(self.join)

    @step
    def join(self, inputs):
        """
        Save list of raw sites
        """
        # One entry per parallel branch; "inp" avoids shadowing the built-in input()
        self.sites = [
            {"name": inp.site_name, "text": inp.site_data}
            for inp in inputs
        ]
        self.next(self.end)

    @step
    def end(self):
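        pass


if __name__ == "__main__":
    # Instantiating the flow hands control to Metaflow's command-line interface
    DownloadData()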
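
To make the branch-and-join shape of the flow above more concrete, here is a rough plain-Python analogy using only the standard library. It is a sketch, not how Metaflow itself works internally: `fake_load_site` and `fan_out_and_join` are hypothetical names, and the download is replaced by a stand-in so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_load_site(site):
    """Stand-in for the load_site step: handles exactly one item of the fan-out."""
    # A real step would download site["url"]; here the page text is only faked.
    return {"name": site["name"], "text": "<html>" + site["name"] + "</html>"}


def fan_out_and_join(sites, worker=fake_load_site):
    """Run `worker` once per site concurrently, then gather the results."""
    # Fan-out: one call per list element, similar in spirit to foreach="sites"
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(worker, sites))
    # Join: pool.map returns results in input order, so this is the merged list
    return results
```

Calling `fan_out_and_join([{"name": "a", "url": "..."}])` returns a list with one dictionary per site, which mirrors what the join step collects into self.sites.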

You see that there is really not much to it. Take a closer look at the tutorials or at my repository to see how all of this integrates with Jupyter and the rest of your workflows. So what’s the verdict overall?

The Good

  • Open source
  • Easy to use
  • Integration with many popular libraries
  • Easy to use with AWS
  • Scaling seems straightforward enough
  • There is some rudimentary version control

The Bad

  • It seems like for now you’re stuck with AWS (so if your company chooses, e.g., Azure, it will mean at least some pain)
  • Scaling to very big datasets (e.g. where you would naturally use Spark) might be an issue since you’re running plain Python

The Ugly

Some tasks, especially when operating on large datasets, are complex. Sometimes it makes sense to hide some of this complexity, and sometimes it doesn’t. One issue I see right off the bat is that Metaflow saves (checkpoints) your data at every step of every run. Depending on the nature of your workflow, this means you might end up with a lot of data lying around in AWS that needs to be cleaned up so that it does not generate costs. This is still a manual task.
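
Before cleaning anything up, it helps to see where the data has piled up. The following is a minimal sketch for the local case only, assuming the run data sits in a directory on disk. The function names are hypothetical and not part of Metaflow, and data stored in S3 would need a different tool.

```python
from pathlib import Path


def storage_by_subdir(root):
    """Map each direct subdirectory of `root` to the total size in bytes of its files."""
    sizes = {}
    for child in Path(root).iterdir():
        if child.is_dir():
            # Walk the whole subtree and add up the regular files only
            sizes[child.name] = sum(
                f.stat().st_size for f in child.rglob("*") if f.is_file()
            )
    return sizes


def largest_first(root):
    """Return (name, bytes) pairs sorted so the biggest subdirectory comes first."""
    return sorted(storage_by_subdir(root).items(), key=lambda kv: kv[1], reverse=True)
```

Pointing `largest_first` at the directory where local runs keep their data (I am assuming `.metaflow` in the working directory here) gives a quick overview of what is worth deleting. The deletion itself remains a manual decision.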

So, do I like Metaflow? Yes, absolutely. It seems like a great solution for some challenges that almost all organizations using data science, and even lone-wolf data scientists, will face. Is it a silver bullet? Probably not.