In the last post, I talked about the usefulness of REPLs, which is indeed hard to overstate. Exploratory data analysis would be a lot of hassle without the read-evaluate-print loop. We had a closer look at Jupyter in particular, and first attempts at analyzing the data from the post on food and inflation can be found on GitHub.

So REPLs are great tools. Really great. I mean, for exploratory analysis. Most of it, that is. Let me explain. REPLs have one major drawback. Code in e.g. a Jupyter notebook is very hard to test, and the thing about code that’s hard to test is that it usually doesn’t get tested. At all. It’s just in our nature to be lazy, so if something is a hassle you usually don’t do it unless it’s strictly required. And clearly your code will execute just fine without tests (after you fiddle with it until it seems to work alright, at which point you save and slowly walk away from your computer). But do you know it’s doing the right thing?

Let’s take a minute to talk about test driven development (or TDD). Some argue that you don’t need unit testing, for various reasons. Like it slows you down when coding. Or it gives you a false sense of security. And so on. All the arguments I’ve heard against TDD have one thing in common though. They’re wrong. I tend to agree with Uncle Bob on the matter. If you are unfamiliar with unit testing, I recommend you have a look at some introductory material.

Now, we’re not software developers but data scientists. I don’t demand that you unit test every last line of your analysis. But I do believe it is a good idea to break out what you can from a notebook and write some tests for it. Why? Code reuse! If you have that neat function in a separate file with some tests going on, you’ll be more likely to re-use it. Why? Because you trust it. Because it’s tested. Let’s have a look.

Last time, we had a neat little function that would normalize a series.

def normalize(x):
    return pd.Series(x.Value.values / 
                     sum(x.Value[x.year == 2000]),
                     index=x.year)

Any function you write might be a good candidate to be put in a separate file. Even though this specific one might not be the best example, let’s break it out and write a test. The unittest package is one of the obvious and popular choices for this task.

import unittest

import pandas as pd
# ...plus an import of normalize from wherever you decide to put it.

class TestNormalize(unittest.TestCase):
    """Testing the normalize function."""
    def test_with_known_values(self):
        test_data = pd.DataFrame({'year': [2000, 2001],
                                  'Value': [2, 4]})
        expected_result = pd.Series([1.0, 2.0],
                                    index=[2000, 2001])
        expected_result.index.name = 'year'
        pd.testing.assert_series_equal(
            normalize(test_data),
            expected_result)
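If you put the function and the test in their own files (say normalize.py and test_normalize.py — the file names here are just an example, pick whatever fits your project), the standard unittest entry point makes the test file runnable on its own:

# At the bottom of the (hypothetical) test_normalize.py:
if __name__ == '__main__':
    unittest.main()

Alternatively, running python -m unittest from the project directory discovers and runs everything matching test*.py.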

Now, while writing the test you should notice a few strange things. What happens if there is no year 2000 in our data frame? What happens if we have it multiple times? This would break our code. Just a few more lines and two more tests make the function much more reliable, and by proxy make you much more likely to use it in a future analysis.

class NoYear2kError(Exception):
    pass

class DuplicateYearError(Exception):
    pass

def normalize(x):
    if not any(x.year == 2000):
        raise NoYear2kError()
    if any(x.groupby('year').size() > 1):
        raise DuplicateYearError()
    return pd.Series(x.Value.values /
                     sum(x.Value[x.year == 2000]),
                     index=x.year)

class TestNormalize(unittest.TestCase):
    """Testing the normalize function."""
    def test_with_known_values(self):
        test_data = pd.DataFrame({'year': [2000, 2001],
                                  'Value': [2, 4]})
        expected_result = pd.Series([1.0, 2.0],
                                    index=[2000, 2001])
        expected_result.index.name = 'year'
        pd.testing.assert_series_equal(
            normalize(test_data),
            expected_result)
    def test_y2k_not_in_years_raises(self):
        with self.assertRaises(NoYear2kError):
            normalize(pd.DataFrame({'year': []}))
    def test_duplicate_year_raises(self):
        with self.assertRaises(DuplicateYearError):
            normalize(pd.DataFrame({'year': [2000]*2}))

Of course it’s far from perfect, but you get the idea. Maybe you want to make it better as homework? One possible direction is sketched below. After that, let’s do the same with the reading and pre-processing of our data.
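Just to illustrate the kind of improvement I have in mind (this is a sketch of my own, not code from the actual analysis), the base year could be turned into a parameter instead of hard-coding 2000, with a more general exception name to match:

class MissingBaseYearError(Exception):
    """Hypothetical, more general replacement for NoYear2kError."""
    pass

def normalize_to_year(x, base_year=2000):
    # Same checks as normalize above, but against an arbitrary base year.
    if not any(x.year == base_year):
        raise MissingBaseYearError()
    if any(x.groupby('year').size() > 1):
        raise DuplicateYearError()
    return pd.Series(x.Value.values /
                     sum(x.Value[x.year == base_year]),
                     index=x.year)

The tests from above would carry over almost unchanged; you would just add one for a non-default base year.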

class NoDateInDataError(Exception):
    pass

class NoSisalInDataError(Exception):
    pass

class NoStandardCurrencyInDataError(Exception):
    pass

def extract_year(df):
    if 'Date' not in df:
        raise NoDateInDataError()
    df['year'] = df.Date.apply(lambda x: int(x[4:8]))
    return df

SLU = 'Standard Local Currency/tonne'

def clean_data(df):
    if not any(df.item == 'Sisal'):
        raise NoSisalInDataError
    if not any(df.Unit == SLU):
        raise NoStandardCurrencyInDataError
    return df[(df.item == 'Sisal') &
              (df.Unit == SLU)]

class TestExtractYear(unittest.TestCase):
    """You should test more here, connected to the expected string
    format in the date column."""
    def test_no_Date_column_raises(self):
        with self.assertRaises(NoDateInDataError):
            extract_year(pd.DataFrame({}))
    def test_known_values(self):
        test_data = pd.DataFrame({'Date': ["1-1-1234"]})
        expected_result = pd.Series([1234], name='year')
        pd.testing.assert_series_equal(
            extract_year(test_data).year,
            expected_result)
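    def test_non_numeric_date_raises(self):
        # One example of the extra format checks mentioned in the docstring
        # above; this particular test is my own addition, not part of the
        # original analysis. int() raises a ValueError when the sliced
        # characters are not digits, so a malformed date blows up loudly
        # instead of silently producing nonsense.
        with self.assertRaises(ValueError):
            extract_year(pd.DataFrame({'Date': ["not-a-date"]}))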

class TestCleanData(unittest.TestCase):
    def setUp(self):
        self.test_data = pd.DataFrame(
            {'item': ["Sisal", "Hay"]*2,
             'Unit': [SLU, SLU, 'dollar', 'dollar']})
    def test_cleans_non_sisal(self):
        result = clean_data(self.test_data)
        self.assertTrue(all(result.item == "Sisal"))
    def test_cleans_non_slc(self):
        result = clean_data(self.test_data)
        self.assertTrue(all(result.Unit == SLU))
    def test_no_sisal_raises(self):
        with self.assertRaises(NoSisalInDataError):
            clean_data(pd.DataFrame(
                {'item': [None],
                 'Unit': [SLU]}))
    def test_no_slc_raises(self):
        with self.assertRaises(NoStandardCurrencyInDataError):
            clean_data(pd.DataFrame(
                {'item': ['Sisal'],
                 'Unit': [None]}))

Much more reliable, and more likely to be re-used. Imagine you’re looking at your team’s repository. Which parts of a colleague’s code would you rather use: the ones with tests or the ones without? Our notebook becomes much more readable as well.
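As a rough sketch of what a notebook cell could look like once the tested functions live in their own module (the module name preprocessing and the file name below are placeholders, not the actual ones from the analysis):

import pandas as pd

# 'preprocessing' is a hypothetical module holding the tested functions.
from preprocessing import extract_year, clean_data, normalize

raw = pd.read_csv("prices.csv")        # placeholder file name
sisal = clean_data(extract_year(raw))  # tested, reusable pre-processing
normalized = normalize(sisal)          # series indexed by year, base year 2000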

So if you have a useful function in your EDA, be a good data scientist, put it in a separate file, test, refactor, and save yourself and your colleagues some time in the future.