
What’s next for open-source Spark?

Jon Gold | Feb. 10, 2017
A crowd of more than 1,500 gathered Wednesday in Boston to hear about the future for the open-source big data engine.

This, coupled with the fact that 95% or more of Spark users are running SQL datasets, led to the development of Spark SQL, a language that “allow[s] you to just quickly say what you want Spark to figure out, and you leave it up to Spark to figure out exactly the most efficient way to perform that computation.”

Salesforce Senior Engineering Manager Alexis Roos detailed how his team is using some of Spark's capabilities to broaden the horizons of the company’s flagship Sales Cloud and Salesforce Inbox products.

“Using AI, we can make Salesforce Inbox smarter,” said Roos, before outlining the kind of complex connections the system can make to identify the right people as hot leads and to work out which contacts to make, and in what order.

[Image: Spark Salesforce demo. Credit: Spark]

“We want to tell users why an email is important, but we don’t want to stop there,” Roos said. “We also want to tell them what they should do about it.”

Few people are working with bigger datasets than Cotton Seed, senior principal engineer at MIT and Harvard’s Broad Institute, which studies genomics using reams of digital information that rival YouTube for sheer scale. Broad – pronounced to rhyme with “road” – generates 17TB of new genome data every day, and manages a total of 45 petabytes of information.

YouTube’s still bigger at 25TB per day and 86 petabytes total, but that will change soon, according to Seed. By 2025, he said, genomics research around the world will be taking in more than 20 exabytes – or 20 billion gigabytes – per year.

“That would be about $400 million a month, just in raw storage costs,” he said, adding that the compute tasks required to analyze that data in, for example, Google’s cloud would result in fees of nearly $6 billion per month, along with an “are you sure you typed that right?” query from the Google Cloud Platform estimation tool.
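As a rough sanity check on those figures, the short Python sketch below converts 20 exabytes to gigabytes using decimal (SI) units and works out the per-gigabyte storage rate that a $400 million monthly bill would imply. The variable names are illustrative, and the assumption that the monthly figure covers one full year's intake held in storage is ours, not Seed's.

```python
# Decimal (SI) units: 1 exabyte = 10**18 bytes, 1 gigabyte = 10**9 bytes.
BYTES_PER_GB = 10**9
BYTES_PER_EB = 10**18


def exabytes_to_gigabytes(exabytes: int) -> int:
    """Convert a whole number of exabytes to gigabytes (decimal units)."""
    return exabytes * BYTES_PER_EB // BYTES_PER_GB


# Figure quoted in the article: more than 20 exabytes per year by 2025.
yearly_volume_gb = exabytes_to_gigabytes(20)

# Figure quoted in the article: about $400 million a month in raw storage.
monthly_storage_cost_usd = 400_000_000

# Assumption (not from the article): the monthly bill covers one full
# year's intake, i.e. all 20 exabytes are held in storage at once.
implied_usd_per_gb_month = monthly_storage_cost_usd / yearly_volume_gb

print(f"{yearly_volume_gb:,} GB")
print(f"${implied_usd_per_gb_month:.2f} per GB per month")
```

Under that assumption, the two quoted numbers work out to roughly two cents per gigabyte per month.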

“It’s really gonna require innovations in computing technology and large data to continue to maintain our current pace of innovation in biomedicine,” Seed said, with some understatement.

For the moment, Seed’s team has created Hail, a platform built on Spark and designed to process genetic data more efficiently. It uses a high-level language, à la Spark SQL, to automate certain basic analysis tasks, is highly scalable, and is designed to be easy to use for non-computer scientists, i.e., most people in genomics labs.

 
