I recently chatted with a colleague who has much more experience than I do and who said he would never call himself a data scientist. I couldn’t really get out of him what he prefers instead (he gave some more examples of what he’d rather not be called, though), but the conversation got me thinking. I’m happy to use the job title of data scientist, if only to differentiate our role from more classical ones such as statistician, analyst, software engineer, or data engineer.

Then I came across this article. Even though it doesn’t answer what a data scientist is, it sheds some light on which people shouldn’t call themselves data scientists. These days there seem to be a lot of those around. While the article raises great points, I don’t agree with the order in which it ranks them, and I believe it misses the point a little. Let me explain.

Curiosity

I think being curious about the world around us is the most important trait of a data scientist. Our profession is all about wanting to understand how things work and improve them. If you’re not curious about how businesses work, how people think, how they behave, and what they talk about, you’re probably better off pursuing something else.

This curiosity should extend to technology. A good data scientist should have a natural curiosity about what’s hot in technology and be eager to try out new APIs and the like. I would be skeptical of anyone in the field who has never tried, say, mining Tweets, if only for fun.

Skills

I think that skills follow curiosity. You need to be confident in at least one serious programming language. No, Matlab doesn’t count. Python is fine. Java and C++ are as well. Scala if you’re so inclined. I insist on this for two reasons.

  • Get things done. Sometimes coding is the only way to get stuff done quickly. Stashing Tweets into MongoDB can be done in a few lines of Python, but good luck doing this in most enterprise tools. You’ll probably need hours just to install the right ones.

  • Thinking. Programming teaches you to think in a certain way that can be extremely helpful when working with data and algorithms. This is especially true of functional (or functionally oriented) languages. I agree that you shouldn’t learn to program for the sake of programming, but I do believe that for a data scientist it is a must. Solid coding skills are what distinguish you from a clean-cut analyst.
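To make the "few lines of Python" claim from the first point concrete, here is a minimal sketch. It assumes the Tweets have already been collected into a newline-delimited JSON file and that the third-party pymongo package and a local MongoDB server are available. The file name, database name, collection name, and helper functions are all made up for illustration.

```python
import json


def parse_tweets(lines):
    """Turn newline-delimited JSON strings into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def stash(tweets, collection):
    """Insert tweet dicts into a MongoDB-style collection.

    `collection` only needs an insert_many() method, as pymongo collections
    have. Returns the number of documents inserted.
    """
    if not tweets:
        # insert_many rejects an empty list, so bail out early.
        return 0
    result = collection.insert_many(tweets)
    return len(result.inserted_ids)


def main(path="tweets.jsonl", uri="mongodb://localhost:27017"):
    """Hypothetical entry point: the path, URI, and names are assumptions."""
    from pymongo import MongoClient  # third-party; imported only when needed

    client = MongoClient(uri)
    with open(path, encoding="utf-8") as fh:
        count = stash(parse_tweets(fh), client["demo"]["tweets"])
    print(f"stored {count} tweets")
```

Calling main() with a file of collected Tweets would store them in the (assumed) demo.tweets collection. Fetching the Tweets themselves is deliberately left out, since that depends on whichever API and credentials you use.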

Often businesses want something more enterprise-grade than your Python scripts, so you’ll need to get familiar with SAS or SPSS, but that shouldn’t be too much of a hurdle.

But tools should be just that, tools. You should be driven by your curiosity, and coding or enterprise tools should just be a means to the end of answering all those questions that you and your boss have. The exact same thing is true for statistics. Do you have to know how your models work under the hood? What the pitfalls are? How to properly evaluate their validity? Absolutely! Why? Because those are just more tools that help you extract information from data and substantiate (or refute) your findings.

Communication

Once you have gained insights or made a great predictive model, you need to be able to _convince people to act on what you found_. The reason for the whole exercise was to learn something about the world and the people in it. If you are genuinely interested in learning how things work, you usually have an idea or two about how one can make them better. And I’m serious when I say that if that’s not the case for you, you’re probably not born to be a data scientist. This is an aspect where the article cited above falls short.

Of course your presentations should be sleek, with top-notch visualizations and great story-telling. But chances are you’ll not convince anyone if your heart is not in it.

This is actually the point where I see most articles on what a data scientist should and shouldn’t be fall short. It’s the attitude that counts, more than knowing how to code, present, and do A/B testing.