What a strange animal a data scientist is, somewhere in between a craftsman and an artist, working in code, on data, visualizing, modeling, tinkering. Most of you will agree, though, that much of what we do is more craft than art, and as with every good craftsman, we need good tools. Sure, a truly skilled worker can create beautiful things with sub-par tools, but he or she won’t be nearly as efficient as with high-quality tools. So what are those? For me, everything starts with the very basic needs of navigating, finding, and modifying files, for which I use shell tools and a great editor, such as vim or emacs. But this post is not about those basic building blocks. Not that they aren’t important, on the contrary, but that’s probably a topic for another post.

Today I want to talk about the language that we use to manipulate and visualize data, and how we interact with that language. There are many good choices, such as Python or R. One important aspect of these languages is that we have what is called a REPL available. REPL stands for read-eval-print loop, but the term often refers to the program that lets us execute this loop, like a command line. Why not write scripts straight away? Maybe we need to try various transformations on the data, look at the results, and adjust the transformations based on what we see. If loading the data takes a long time, e.g. because we read it from a REST API, we would like the data to stay in our computer’s memory. And that’s what a REPL is really about: persistence. The results of our last computation remain available for inspection and further modification after each step, somewhat like when executing a program in a debugger.
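As a toy illustration of that persistence (load_observations and normalize are hypothetical stand-ins for whatever your slow loading and transformation steps happen to be), even the plain Python shell keeps everything around between evaluations:

>>> data = load_observations()                  # the slow step runs only once
>>> data[:3]                                    # inspect a few rows
>>> cleaned = [normalize(row) for row in data]  # refine, data still in memory
>>> cleaned[:3]                                 # inspect again, adjust, repeat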

Now the two languages I mentioned above, R and Python, are very different in some respects. R is a specialized tool for statistical analysis; Python is a general-purpose language. If you ask me which one to learn, I’d say: learn both! If you insist on one, choose Python, if only because it’s general purpose and you can, for example, write a simple web server in a few lines of code. Things like that can come in very handy when building a proof of concept. But I’m getting ahead of myself.
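To give you an idea of what I mean, here is such a server, straight from the Python 2 standard library (just a sketch; it serves the files in the current directory on port 8000, and in Python 3 the modules are called http.server and socketserver):

import SimpleHTTPServer
import SocketServer

# serve the current directory over HTTP on port 8000
httpd = SocketServer.TCPServer(("", 8000), SimpleHTTPServer.SimpleHTTPRequestHandler)
httpd.serve_forever()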

So let’s say we want to look at the price data we investigated last time, in a REPL, in Python. The most bare-bones REPL after the plain Python shell is IPython. We can use the gzip and csv modules from the standard library to inspect the data, normalize the prices, and make a plot, like this.

Python 2.7.11 (default, Mar  1 2016, 18:40:10)
Type "copyright", "credits" or "license" for more information.

IPython 4.1.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import gzip

In [2]: import csv

In [3]: from itertools import islice

In [4]: from collections import defaultdict

In [5]: with gzip.GzipFile("ObservationData_vhltxod.csv.gz") as input_file:
   ...:     obs_reader = csv.reader(input_file, delimiter=',')
   ...:     for row in islice(obs_reader, 5):
   ...:         print row
   ...:
['country', 'item', 'element', 'Unit', 'Date', 'Value']
['Afghanistan', 'Almonds, with shell', 'Producer Price (Local Currency Unit/tonne)', 'Local Currency Unit/tonne', '1/1/1993 12:00:00 AM', '46000']
['Afghanistan', 'Almonds, with shell', 'Producer Price (Local Currency Unit/tonne)', 'Local Currency Unit/tonne', '1/1/1994 12:00:00 AM', '50000']
['Afghanistan', 'Almonds, with shell', 'Producer Price (Local Currency Unit/tonne)', 'Local Currency Unit/tonne', '1/1/1995 12:00:00 AM', '62000']
['Afghanistan', 'Almonds, with shell', 'Producer Price (Local Currency Unit/tonne)', 'Local Currency Unit/tonne', '1/1/1996 12:00:00 AM', '50000']

In [6]: sisal_prices = defaultdict(dict)

In [7]: with gzip.GzipFile("ObservationData_vhltxod.csv.gz") as input_file:
   ...:     obs_reader = csv.reader(input_file, delimiter=',')
   ...:     for row in obs_reader:
   ...:         country, product, element, unit, date, value = row
   ...:         if product == 'Sisal' and unit == 'Standard Local Currency/tonne':
   ...:             year = int(date[4:8])
   ...:             sisal_prices[country][year] = float(value)
   ...:

In [8]: sisal_normalized_prices = defaultdict(float)

In [9]: ncountries = 0

In [10]: for country in sisal_prices:
   ....:     if 2000 in sisal_prices[country]:
   ....:         for year, value in sisal_prices[country].items():
   ....:             sisal_normalized_prices[year] += value / sisal_prices[country][2000]
   ....:         ncountries += 1
   ....:

In [11]: for i in sisal_normalized_prices:
   ....:     sisal_normalized_prices[i] = sisal_normalized_prices[i] / ncountries
   ....:

In [12]: import matplotlib.pyplot as plt
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')

In [13]: xvals, yvals = sorted(sisal_normalized_prices), [sisal_normalized_prices[i] for i in sorted(sisal_normalized_prices)]

In [14]: plt.plot(xvals, yvals)
Out[14]: [<matplotlib.lines.Line2D at 0x...>]

In [15]: plt.xlabel("year")
Out[15]: <matplotlib.text.Text at 0x...>

In [16]: plt.ylabel("sisal price, normalized")
Out[16]: <matplotlib.text.Text at 0x...>

In [17]: plt.show()

Here’s the resulting plot.

[Figure: sisal price, normalized, plotted by year]

While this is nice, some dislike the bare-bones look and feel. Editing can be cumbersome, and once you end your session you have to jump through a lot of hoops just to replay it. Luckily, Jupyter comes to our rescue. Jupyter notebooks are basically a nicer-looking version of our REPL, using a web browser as the interface. They support Markdown, LaTeX formatting, inline graphics, and much more.
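If you want to give it a try, installing and starting a notebook server is usually as simple as this (assuming you have pip; the second command opens the notebook dashboard in your browser, from which you can create and run notebooks):

pip install jupyter
jupyter notebook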

Reading the data was also quite painful above. Even though one should know how to parse a .csv file in pure Python, I would recommend using a package that makes manipulating data easier, like Pandas. Rewritten with Pandas in a Jupyter notebook, our REPL session looks like this.
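Since I can’t embed the full notebook here, the following sketch captures roughly what the cells contain. Note that, unlike the loop above, taking the mean across columns averages only over the countries that actually have data in a given year; I’d consider that a feature rather than a bug.

import pandas as pd
import matplotlib.pyplot as plt

# Pandas reads gzipped csv directly and picks up the column names from the header
df = pd.read_csv("ObservationData_vhltxod.csv.gz")

# keep only the sisal prices quoted in standard local currency per tonne
sisal = df[(df["item"] == "Sisal") &
           (df["Unit"] == "Standard Local Currency/tonne")].copy()
sisal["year"] = pd.to_datetime(sisal["Date"]).dt.year

# one column per country, one row per year, then normalize by the year-2000 price
prices = sisal.pivot_table(index="year", columns="country", values="Value")
prices = prices.loc[:, prices.loc[2000].notnull()]  # countries with a 2000 price
normalized = (prices / prices.loc[2000]).mean(axis=1)

normalized.plot()
plt.xlabel("year")
plt.ylabel("sisal price, normalized")
plt.show()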

Much nicer, but our notebook-based analysis has its issues too.

  • Testing. Notebooks are hard to test. This might be okay for an exploratory data analysis (or maybe not?), but one should not write bigger projects as notebooks.

  • Code reuse. Once the code is in a notebook rather than in a file, it can’t easily be included in another analysis, other than by copy-and-paste, leading to duplication.

  • Source control. Because notebooks do not keep the code separate from the results, we can easily run into trouble when we want to put one under source control (e.g. git). If the output changes every time we execute a cell, one has to strip the output automatically (e.g. thusly, or with the commands sketched below), which might not be ideal in many cases.
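Two ways to do the stripping (the notebook name is just an example, and the exact flags may differ between versions): nbconvert ships a ClearOutputPreprocessor, and the nbstripout package installs itself as a git filter so outputs are removed transparently on commit.

# strip outputs once, in place
jupyter nbconvert --to notebook --inplace \
    --ClearOutputPreprocessor.enabled=True analysis.ipynb

# or let git do it automatically via a filter
pip install nbstripout
nbstripout --install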

In the next few posts, I will talk some more about various aspects of what I consider useful tools. If you have any requests, please leave a comment.