With its powerful data mining capabilities, Hadoop is bringing together people across different places and even across different generations.
While Hadoop continues to grow in popularity, most reported use cases for the open-source data-processing platform revolve around ad targeting or some other specialized task. But at O'Reilly's Strata-Hadoop World, held last week in New York, a number of Internet services talked about how they use Hadoop to bring people together.
Ancestry.com is using Hadoop as the cornerstone of a new service that allows users to submit a sample of their DNA and then have Ancestry.com look for matches to far-flung relatives, both alive and long-deceased. And dating service eHarmony uses Hadoop to refine its process of matching its millions of members.
In both cases, Hadoop has excelled at comparing hundreds or even thousands of variables across millions of different entities, a job much too large for traditional relational databases or even data warehouses.
"Hadoop is one of those key tools that has allowed us to create a massively scalable system," said Ancestry.com Chief Technology Officer Scott Sorensen in one presentation. The company is moving from proprietary tools to Hadoop to parse its large and ever-growing amount of data, he said.
Ancestry.com generates around US$480 million a year in revenue from people who use the service to chart their ancestry, drawing on their own documents as well as Ancestry.com's repository of 12 billion public records, about 10 petabytes' worth of data.
Hadoop powers a new company service called AncestryDNA. A user can send in a saliva sample, along with US$99, and the company will read 700,000 SNPs (single-nucleotide polymorphisms, pronounced "snips") from the DNA in the sample and load the results into Hadoop, which will compare them to the snips from more than 200,000 other samples collected by the company. The company can then provide a list of far-flung relatives, whose family connections can go back 10 generations or more.
Half of a person's DNA comes from each biological parent. "Small changes in that DNA over generations leave bread crumbs that are like a view into history," Sorensen said. Ancestry.com can use the snips to determine a user's mix of ethnicities, as well as match the user with distant relatives.
Hadoop proved to be uniquely suited for this task because it excels at taking one person's 700,000 snips and comparing them to snips from hundreds of thousands of other people's DNA to find matches. The service can find, on average, 40 fourth cousins for every customer who submits a sample. That result will only improve as more people submit their DNA, Sorensen said.
The company used a number of algorithms developed in academia for finding hidden matches in DNA. But the engineers at Ancestry.com had to parallelize the algorithms to run them across a multinode Hadoop deployment. On a traditional scale-up architecture, comparing 120,000 sets of DNA would take Ancestry.com up to four weeks.
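To see why the job grows so quickly, and why it splits well across many machines, consider a rough sketch in Python. It is not Ancestry.com's code: the four toy samples, the simple "fraction of identical markers" score and the 0.8 threshold are all made up for illustration and stand in for the academic matching algorithms, and a thread pool stands in for the nodes of a Hadoop cluster. The one piece taken from the figures above is the arithmetic: 120,000 samples yield roughly 7.2 billion distinct pairs to compare.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

# Hypothetical toy data: each sample is a tuple of genotype codes (0, 1 or 2),
# one per marker. A real sample would carry about 700,000 markers.
SAMPLES = {
    "A": (0, 1, 2, 1, 0, 2, 1, 1),
    "B": (0, 1, 2, 1, 0, 2, 1, 0),
    "C": (2, 0, 0, 1, 2, 0, 0, 1),
    "D": (0, 1, 2, 0, 0, 2, 1, 1),
}


def pair_count(n):
    """Number of unordered pairs among n samples: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def similarity(a, b):
    """Fraction of marker positions where two samples carry the same code.

    This is an illustrative stand-in, not a real relatedness measure.
    """
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / len(a)


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def score_chunk(pairs):
    """Map step: score every pair in one chunk, independently of other chunks."""
    return [(p, q, similarity(SAMPLES[p], SAMPLES[q])) for p, q in pairs]


def find_matches(threshold=0.8, chunk_size=2, workers=3):
    """Score all pairs chunk by chunk, then keep the pairs above the threshold."""
    pairs = list(combinations(sorted(SAMPLES), 2))
    # Each chunk needs no data from any other chunk, so the chunks can be
    # handed to separate workers (threads here, cluster nodes in practice).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunk_results = pool.map(score_chunk, chunked(pairs, chunk_size))
        scored = [row for rows in chunk_results for row in rows]
    # Reduce step: filter the combined scores down to likely matches.
    return [(p, q) for p, q, score in scored if score >= threshold]


if __name__ == "__main__":
    print(pair_count(120_000))  # pairs among 120,000 samples
    print(find_matches())       # toy pairs scoring at least 0.8
```

Because no chunk of pairs depends on any other, adding machines shortens the run roughly in proportion, which is the property a scale-out platform such as Hadoop exploits and a single large server cannot.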