Which freaking big data programming language should I use?

Ian Pointer | April 1, 2016
When it comes to wrangling data at scale, R, Python, Scala, and Java have you covered -- mostly


Finally, there's always Java -- unloved, forlorn, owned by a company that only seems to care about it when there's money to be made by suing Google, and completely unfashionable. Only drones in the enterprise use Java! Yet Java could be a great fit for your big data project. Consider Hadoop MapReduce -- Java. HDFS? Written in Java. Even Storm, Kafka, and Spark run on the JVM (in Clojure and Scala), meaning that Java is a first-class citizen of these projects. Then there are new technologies like Google Cloud Dataflow (now Apache Beam), which until very recently supported Java only.

Java may not be the ninja rock star language of choice. But while the ninja rock stars are straining to sort out the nest of callbacks in their Node.js application, Java gives you access to a large ecosystem of profilers, debuggers, monitoring tools, libraries for enterprise security and interoperability, and much more besides, most of which have been battle-tested over the past two decades. (I'm sorry, everybody; Java turns 21 this year and we are all old.)

The main complaints against Java are its heavy verbosity and the lack of a REPL (present in R, Python, and Scala) for iterative development. I've seen 10 lines of Scala-based Spark code balloon into a 200-line monstrosity in Java, complete with huge type declarations that take up most of the screen. However, the new lambda support in Java 8 does a lot to rectify this situation. Java is never going to be as compact as Scala, but Java 8 really does make developing in Java less painful.
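To make the verbosity point concrete, here is a minimal sketch of the same callback written both ways. It is not Spark code: the `Mapper` interface and the `mapAll` helper are hypothetical stand-ins for the kind of single-method function types a framework asks you to implement, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LambdaDemo {
    /** Hypothetical single-method interface, standing in for a framework callback type. */
    interface Mapper<T, R> {
        R apply(T input);
    }

    /** Hypothetical helper: applies the mapper to every element and collects the results. */
    static <T, R> List<R> mapAll(List<T> items, Mapper<T, R> mapper) {
        List<R> out = new ArrayList<>();
        for (T item : items) {
            out.add(mapper.apply(item));
        }
        return out;
    }

    /** Pre-Java 8 style: the callback is an anonymous inner class. */
    static List<Integer> lengthsVerbose(List<String> words) {
        return mapAll(words, new Mapper<String, Integer>() {
            @Override
            public Integer apply(String word) {
                return word.length();
            }
        });
    }

    /** Java 8 style: the same callback written as a lambda. */
    static List<Integer> lengthsLambda(List<String> words) {
        return mapAll(words, word -> word.length());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("hadoop", "spark", "kafka");
        System.out.println(lengthsVerbose(words)); // prints [6, 5, 5]
        System.out.println(lengthsLambda(words));  // prints [6, 5, 5]
    }
}
```

Six lines of ceremony shrink to one, and the compiler infers the types the anonymous class had to spell out. Multiply that by every map, filter, and reduce in a real job and the difference adds up quickly.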

As for the REPL? OK, you got me there -- currently, anyhow. Java 9 (out next year) will include JShell for all your REPL needs.

Drumroll, please

Which language should you use for your big data project? I'm afraid I'm going to take the coward's way out and come down firmly on the side of "it depends." If you're doing heavy data analysis with obscure statistical calculations, then you'd be crazy not to favor R. If you're doing NLP or intensive neural network processing across GPUs, then Python is a good bet. And for a hardened, production streaming solution with all the important operational tooling, Java or Scala are definitely great choices.

Of course, it doesn't have to be either/or. For example, with Spark, you can train your model and machine learning pipeline with R or Python with data at rest, then serialize that pipeline out to storage, where it can be used by your production Scala Spark Streaming application. While you shouldn't go overboard (your team will quickly suffer language fatigue otherwise), using a heterogeneous set of languages that play to particular strengths can bring dividends to a big data project.

Source: InfoWorld

