
What’s next for open-source Spark?

Jon Gold | Feb. 10, 2017
A crowd of more than 1,500 gathered Wednesday in Boston to hear about the future of the open-source big data engine.



Boston -- A conference focused on a single open-source project sounds like the sort of event that would feature a lone keynote speaker addressing maybe 100 interested parties in a lecture hall at a local college. Spark Summit East was very much the opposite.

In a cavernous ballroom at the Hynes Convention Center, a total of 1,503 people watched five keynote speakers lay out the future of Spark, the big data processing engine originally developed at the University of California, Berkeley by Matei Zaharia. Spark underlies huge data-driven applications used by major players like Salesforce, Facebook, IBM and many others, helping to organize, analyze and surface specific grains of sand from beach-sized databases.

Part of the reason Spark has taken off in such a big way, Zaharia said from the stage, is that Moore’s Law has slowed considerably of late. While the average data center network connection is about 10 times faster than it was seven years ago, and the average storage I/O rate has grown by a similar amount, CPU speeds have remained roughly the same.

Hardware manufacturers are working around the problem by using simpler devices like GPUs and FPGAs, but moving applications onto completely different silicon can be a lot of work, he noted. Spark is moving to take advantage of new hardware platforms, according to Zaharia, but it is also working to maximize performance on existing systems.

“The effort to do this is called Project Tungsten, which began about two years ago, to optimize Spark’s CPU and memory usage using two things – a binary storage format that escapes the [Java virtual machine] and is no longer tied to the limits of that, and runtime code generation,” he said.
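As a rough illustration of the two ideas Zaharia named, here is a toy Python sketch. It is not Spark's code and says nothing about how Tungsten is actually implemented: the row layout, the function names `pack_rows` and `compile_filter`, and the sample data are all hypothetical. It only shows, under those assumptions, what it can mean to keep rows in a packed binary buffer rather than as individual objects, and to generate a specialized function at run time.

```python
import struct

# Hypothetical fixed-width row layout: 8-byte integer id, 8-byte float score.
ROW = struct.Struct("<qd")

def pack_rows(rows):
    """Store all rows in one contiguous byte buffer instead of one object per row."""
    buf = bytearray(ROW.size * len(rows))
    for i, (row_id, score) in enumerate(rows):
        ROW.pack_into(buf, i * ROW.size, row_id, score)
    return bytes(buf)

def compile_filter(expression):
    """Build and compile a filter function specialized to `expression` at run time.

    `expression` must be trusted Python source that refers to `row_id` and `score`.
    """
    source = (
        "def generated(buf):\n"
        "    out = []\n"
        "    for row_id, score in ROW.iter_unpack(buf):\n"
        f"        if {expression}:\n"
        "            out.append(row_id)\n"
        "    return out\n"
    )
    namespace = {"ROW": ROW}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["generated"]

# Made-up sample data, for illustration only.
rows = [(1, 0.25), (2, 0.75), (3, 0.9)]
buf = pack_rows(rows)
high_scores = compile_filter("score > 0.5")
print(len(buf), high_scores(buf))
```

The three sample rows occupy 16 bytes each in a single buffer, and the generated function reads values straight out of that buffer rather than going through a generic, interpreted filter.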

Michael Armbrust is an engineer at Databricks, which sells a hosted Spark environment and is one of the chief sponsors of the summit. He traced the genesis of Spark back to Zaharia’s realization that cross-machine complexity (that is, errors caused by using large groups of computers to work on a single problem) was going to be a stumbling block.

Optimization is still a key concept for Spark development, but the way in which that optimization happens is a little different.

“Roll forward to the year 2013, and a lot of people are using Spark, but what we’re finding is that a lot of people are spending their time tuning their computation,” said Armbrust. “You have to make sure you’re minimizing overheads like garbage collection, you want to make sure you’re getting the last inches of performance out of your cores … what you really want is just a high-level language that allows you to quickly and concisely express common computations.”

 
