comScore is no stranger to big data. The digital analytics company was founded in 1999 with the goal of providing intelligence to companies about what's happening online.
In those early days, the volume of data at the company's fingertips was relatively modest. But that wouldn't last long.
"Things got really interesting for us starting back in 2009 from a volume perspective," says Mike Brown, the company's first software engineer, now its CTO. "Prior to that, we had been in the 50 billion to 100 billion events per month space."
Let the big data flow
Starting in the summer of 2009, it was like someone opened the sluice gates on the dam; data volume increased dramatically and has continued to grow ever since. In December of last year, Brown says comScore recorded more than 1.9 trillion events -- more than 10 terabytes of data ingested every single day.
Back in 2004, before Hadoop was a twinkle in the eyes of Doug Cutting and Mike Carafella, comScore had begun building its own grid processing stack to deal with its data. But in 2009, five years into that project, comScore was struggling to implement its new Unified Digital Measurement (UDM) initiative and the volume of data and processing requirements were growing fast.
"The census has been huge," Brown says. "Ninety percent of the top 100 media properties participate in that program now and every page is sending us a call."
The company now has about 50 different data sources across the census and panel classes, Brown says.
To accommodate the rising tide of data, comScore started a new round of infrastructure upgrades. It became apparent that its custom-built grid processing stack wasn't going to be able to scale with the need. Fortunately, there was a promising new technology gaining steam that might fit the bill: Apache Hadoop.
Putting data on the MapR
After experimenting with Apache Hadoop, the company decided to go with MapR Technologies' distribution.
"I think we were the first production MapR client," Brown says. "Our cluster has grown to a decent size. We have 450 nodes in our production cluster and that has 10 petabytes of addressable disk space, 40 terabytes of RAM and 18,000+ CPUs."
One of the big deciding factors in favor of MapR's distribution is its support for NFS.
"HDFS is great internally, but to get data in and out of Hadoop, you have to do some kind of HDFS export," Brown says. "With MapR, you can just mount [HDFS] as NFS and then use native tools whether they're in Windows, Unix, Linux or whatever. NFS allowed our enterprise systems to easily access the data in the cluster."
Sign up for CIO Asia eNewsletters.