Last time, I presented some glue code that helps you stash away tweets matching user-provided hashtags in MongoDB. I used this code to listen in on #DataScience tweets for one lazy Friday afternoon (night time here in Norway), and now it's time to crunch the numbers! Since one could in principle collect a huge number of tweets with this method, I decided to use Apache Spark for the analysis, using the Scala interface. I consider myself a reasonably experienced C++ and Python programmer (I was a Software Engineer in a former life), but am relatively new to Scala. That's great, because one should learn a new language every now and again, but it also means that you should take the code sample below with a grain of salt.

First, you need to get your Tweets from MongoDB into Spark, using the mongo-hadoop connector. The kind MongoDB people developing the connector give detailed instructions on how to use it with Spark. The only steps are:

  • Download the Spark jars here. You only need the one called something like mongo-hadoop-spark.jar.

  • Download the MongoDB Java driver here (which will send you here). You want the file called mongodb-driver-x.y.z.jar.

Now you can start up your Spark console with the --jars option to point to the two files you just downloaded (separated by a comma).
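For example, assuming both jars sit in your working directory (and with the actual driver version in place of x.y.z), the invocation looks something like:

    spark-shell --jars mongo-hadoop-spark.jar,mongodb-driver-x.y.z.jar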

Alternatively, you can use the toree kernel for Jupyter, or use Zeppelin. The repository I made contains a Jupyter notebook using toree.

Now that we have the tools sorted out, it's time to let the data tell us something. Take some time to look at the available entities you can get from a tweet; there are lots of opportunities here. For the sake of demonstration, we will make a quick plot of the most popular hashtags used along with #DataScience (the tag we filtered on when collecting the tweets).

The code first reads the data from our MongoDB instance. It then extracts the "entities" -> "hashtags" list from each MongoDB object and inspects the text, which we convert to lowercase, since we're not interested in different capitalizations of the same tag. The last step is to filter out #DataScience itself, which we used to collect the data in the first place; the conversion to a list makes the result sortable. We use R (cf. the repository for the R code) to plot the result. Here Zeppelin is ahead of toree, which at the time of writing doesn't support inline graphics, which is why I went on the R detour.
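For concreteness, here is a minimal sketch of that pipeline in Scala, along the lines of what the notebook in the repository does. The sc SparkContext is the one provided by the Spark shell (or toree), and the mongo.input.uri below, with a database called twitter and a collection called tweets on localhost, is a placeholder you should point at wherever your collector stashed the data:

```scala
import scala.collection.JavaConverters._

import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat

// Placeholder URI: database "twitter", collection "tweets" on a local MongoDB.
val mongoConfig = new Configuration()
mongoConfig.set("mongo.input.uri", "mongodb://localhost:27017/twitter.tweets")

// Each stored tweet comes back as an (ObjectId, BSONObject) pair.
val tweets = sc.newAPIHadoopRDD(
  mongoConfig,
  classOf[MongoInputFormat],
  classOf[Object],
  classOf[BSONObject])

// Walk the "entities" -> "hashtags" list of every tweet, lower-case the tag text,
// drop the #datascience tag we filtered on, and count what remains.
val topTags = tweets
  .flatMap { case (_, tweet) =>
    val entities = tweet.get("entities").asInstanceOf[BSONObject]
    entities.get("hashtags").asInstanceOf[java.util.List[AnyRef]].asScala
      .map(_.asInstanceOf[BSONObject].get("text").toString.toLowerCase)
  }
  .filter(_ != "datascience")
  .map(tag => (tag, 1))
  .reduceByKey(_ + _)
  .sortBy(-_._2)
  .take(10)

topTags.foreach(println)
```

The take(10) at the end brings the counts back to the driver, where they can be written out to a file for the R plot.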

[Plot: topTags, the most popular hashtags appearing alongside #DataScience]

Big data seems to be very much in fashion, with over three times as many mentions as the closest contender, a cluster of topics around the data science job market. #machinelearning, #iot, and #security are other obvious members of the top 10. Stay tuned for more number-crunching!