
comScore CTO shares big data lessons

Thor Olavsrud | June 30, 2015
Digital analytics company comScore has been using Hadoop to process its big data since 2009. With six years of experience using the technology in production, comScore CTO Mike Brown shares some of the lessons he's learned.

CIOs should think small on big data (at first)

So given his long experience with Hadoop in production, what advice does Brown have for CIOs who are just starting to implement big data technologies? First, start small.

"Everybody is thrilled by the notion of big data, but start small," Brown says. "The technology is there to allow you to scale up, but taking a subset of your data and pounding on that for a while and working on it, that will allow you to demonstrate value to the business much faster."

The really important thing, he says, is to get past the proof of concept (PoC) and put your project into production.

"Choose one thing to try to provide value to show that this does work," he says. "Then get that into production. I'm fearful that some places choose to leave their big data projects as the evergreen PoC. It doesn't get real until you've got it in production. It can be hard, but that's the big thing to do. Once you do that, then it's quick to build momentum."

Brown notes that it's also essential to give careful consideration to the hardware you select. One of the things that really helped Hadoop catch on, he says, is that you can scale with commodity hardware. But that doesn't mean you can skimp.

"When we first started out, I think the conventional wisdom out there from a Hadoop perspective was to go with high-density, low-speed drives," he says. "But when you get into analytics, slow drives make the shuffle be kind of slow."

comScore ran headfirst into that problem when it first began working with Hadoop. In a MapReduce job, a 'shuffle and sort' handoff occurs between the 'Map' phase and the 'Reduce' phase: the mapper output is sorted by key, partitioned across the reducers (assuming there is more than one), and moved to the nodes that will run the reducer tasks, where it is written to disk. This is where low-speed drives can become a big bottleneck, Brown says.
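To make that handoff concrete, here is a minimal Python sketch of the MapReduce data flow for a word count. It is illustrative only: the function names and in-memory lists are stand-ins, not Hadoop's actual API, and a real job spills each partition to disk during the shuffle, which is exactly where drive speed matters.

    from itertools import groupby
    from operator import itemgetter

    def map_phase(lines):
        # 'Map': emit a (word, 1) pair for every word in the input.
        for line in lines:
            for word in line.split():
                yield (word, 1)

    def shuffle_and_sort(pairs, num_reducers=2):
        # The handoff Brown describes: partition mapper output by key,
        # then sort each partition by key. In Hadoop this data is
        # written to disk, so slow drives bottleneck the job here.
        partitions = [[] for _ in range(num_reducers)]
        for key, value in pairs:
            partitions[hash(key) % num_reducers].append((key, value))
        return [sorted(p, key=itemgetter(0)) for p in partitions]

    def reduce_phase(partition):
        # 'Reduce': sum the counts for each key in a sorted partition.
        for key, group in groupby(partition, key=itemgetter(0)):
            yield (key, sum(count for _, count in group))

    lines = ["big data big lessons", "data in production"]
    for partition in shuffle_and_sort(map_phase(lines)):
        print(dict(reduce_phase(partition)))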

"It's worth doing some traditional IOPS testing on what your drives can do," he says.

"IOPS is really driving a lot of this stuff," he adds. "I've heard of some shops putting in all SSDs now."

Another area that's important to focus on, Brown says, is quality assurance -- of your data.

Keep up with your algorithms

"I think the big thing, especially in the data area, is you actually need data QA," he says. "Did the algorithm do what it's supposed to do? Algorithms can need maintenance just like software does."

Finally, he says, make sure that you're thinking long-term.

