• A Data Scientist's Toolbox, Part 2: Testing your code.

    In the last post, I talked about the usefulness of REPLs, which is indeed hard to overstate. Exploratory data analysis would be a lot of hassle without the read-evaluate-print loop. We had a closer look at Jupyter in particular, and first attempts at analyzing the data from the post on food and inflation can be found on GitHub.

  • A Data Scientist's Toolbox, Part 1: REPLs

    What a strange animal a data scientist is, somewhere in between a craftsman and an artist, working in code, on data, visualizing, modeling, tinkering. Most of you will agree, though, that much of what we do is more craft than art, and as with every good craftsman, we need good tools. Sure, a truly skilled worker can create beautiful things with sub-par tools, but he or she won’t be nearly as efficient as with high-quality tools. So what are those? For me, everything starts with the very basic needs of navigating, finding, and modifying files, for which I use shell tools and a great editor, such as vim or emacs. But this post is not about those basic building blocks. Not that they aren’t important, on the contrary, but that’s probably a topic for another post.

  • Food and Inflation

    After my last post a friend said: “What?! You blogged about baby names? Again?” This was when I realized that I really needed to kick my obsession and move on to a new data set. The good people at Open Data for Africa came to my rescue with a great data set from the Food and Agriculture Organization containing producer prices for primary crops, live animals, and livestock primary products, altogether 200 commodities, collected in 130 countries between 1900 and 2014. They claim that the data they collect represent 97% of the world’s gross agricultural produce, which of course makes them an absolute treat for a data scientist’s lazy Sunday afternoon.

  • iPhones, Twitter, and Baby Names, Again.

    By now you may know that finding out something about people by knowing their first name is one of my hobbies. A while back we investigated whether people whose age suggests that their parents were following a trend, rather than setting one, when choosing their name use language differently.

  • Good Data Scientist, Bad Data Scientist.

    I recently chatted with a colleague who has much more experience than me and who said he would never call himself a data scientist. I couldn’t really get out of him what he prefers instead (he gave some more examples of what he’d rather not be called, though), but this conversation got me thinking. I’m happy to use the job title of data scientist, if only to differentiate our role from more classical ones such as statistician, analyst, or software or data engineer.

  • Listen To Twitter, Part 2: Links!

    Last time we used Spark and Scala to look at what hashtags are used together with #DataScience on Twitter, which is nice, but misses an important thing people use Twitter for: sharing links. So what are the most popular data science related sites on Twitter? In order to find out, we again use Spark, writing a short Scala script thusly:

  • What's Hot In #DataScience? Listen To Twitter, Part 2.

    Last time, I presented some glue code that helps you stash away tweets matching user-provided hashtags in MongoDB. I used this code to listen in on #DataScience Tweets for one lazy Friday afternoon (night time here in Norway), and now it’s time to crunch the numbers! Since one could in principle collect a huge number of Tweets with this method, I decided to use Apache Spark for the analysis, using the Scala interface. I consider myself a reasonably experienced C++ and Python programmer (I was a Software Engineer in a former life), but am relatively new to Scala. Which is great, because one should learn new languages every now and again, but it also means that you should take the code sample below with a grain of salt.
