Which freaking big data programming language should I use?

Ian Pointer | April 1, 2016
When it comes to wrangling data at scale, R, Python, Scala, and Java have you covered -- mostly

Python tends to be supported in big data processing frameworks, but at the same time, it tends not to be a first-class citizen. For example, new features in Spark will almost always appear first in the Scala/Java bindings, and it may take a few minor versions for those updates to be made available in PySpark (especially true for the Spark Streaming/MLlib side of development).

As opposed to R, Python is a traditional object-oriented language, so most developers will be fairly comfortable working with it, whereas first exposure to R or Scala can be quite intimidating. A slight issue is the requirement of correct white-spacing in your code. This splits people between "this is great for enforcing readability" and those of us who believe that in 2016 we shouldn't need to fight an interpreter to get a program running because a line has one character out of place (you might guess where I fall on this issue).
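To make the whitespace complaint concrete, here is a minimal sketch that compiles two versions of the same function, one of which has a single extra leading space on its last line. The names `GOOD_SOURCE`, `BAD_SOURCE`, and `check` are made up for this illustration.

```python
# Two versions of the same function, held as strings so we can compile them
# without crashing this script.
GOOD_SOURCE = (
    "def total(values):\n"
    "    result = 0\n"
    "    for v in values:\n"
    "        result += v\n"
    "    return result\n"
)

# Identical, except the final line is indented by five spaces instead of four.
BAD_SOURCE = GOOD_SOURCE.replace("    return result", "     return result")


def check(source):
    """Return 'ok' if the source compiles, otherwise the error class name."""
    try:
        compile(source, "<example>", "exec")
        return "ok"
    except SyntaxError as err:  # IndentationError is a subclass of SyntaxError
        return type(err).__name__


print(check(GOOD_SOURCE))  # ok
print(check(BAD_SOURCE))   # IndentationError
```

The second version is rejected before a single line of it runs, which is exactly the "one character out of place" situation described above.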


Ah, Scala -- of the four languages in this article, Scala is the one that leans back effortlessly against the wall with everybody admiring its type system. Running on the JVM, Scala is a mostly successful marriage of the functional and object-oriented paradigms, and it's currently making huge strides in the financial world and companies that need to operate on very large amounts of data, often in a massively distributed fashion (such as Twitter and LinkedIn). It's also the language that drives both Spark and Kafka.

Because it runs on the JVM, it immediately gets access to the Java ecosystem for free, but it also has a wide variety of "native" libraries for handling data at scale (in particular Twitter's Algebird and Summingbird). It also includes a very handy REPL for interactive development and analysis, as Python and R do.

I'm very fond of Scala, if you can't tell, as it includes lots of useful programming features like pattern matching and is considerably less verbose than standard Java. However, there's often more than one way to do something in Scala, and the language advertises this as a feature. And that's good! But given that it has a Turing-complete type system and all sorts of squiggly operators ('/:' for foldLeft and ':\' for foldRight), it is quite easy to open a Scala file and think you're looking at a particularly nasty bit of Perl. A set of good practices and guidelines to follow when writing Scala is needed (Databricks' are reasonable).
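For readers who haven't met the two folds hiding behind those squiggly operators, here is a rough Python sketch of what they compute. This is not Scala's implementation; `fold_left` and `fold_right` are hypothetical helpers built on `functools.reduce`, and subtraction is used only because it makes the grouping visible.

```python
from functools import reduce


def fold_left(items, zero, op):
    """Combine from the left: op(op(op(zero, a), b), c)."""
    return reduce(op, items, zero)


def fold_right(items, zero, op):
    """Combine from the right: op(a, op(b, op(c, zero)))."""
    return reduce(lambda acc, item: op(item, acc), reversed(items), zero)


def subtract(x, y):
    return x - y


numbers = [1, 2, 3]

print(fold_left(numbers, 0, subtract))   # ((0 - 1) - 2) - 3 = -6
print(fold_right(numbers, 0, subtract))  # 1 - (2 - (3 - 0)) = 2
```

With an operation like addition the two give the same answer, so the direction only matters when the operation is not associative.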

The other downside: the Scala compiler is a touch slow, to the extent that it brings back the days of the classic "compiling!" XKCD strip. Still, it has the REPL, big data support, and Web-based notebooks in the form of Jupyter and Zeppelin, so I forgive a lot of its quirks.

