Move over, Hadoop. There is another highly scalable data processing powerhouse in town: Apache Giraph. Facebook is using the technology to bring a new style of search to its billion users.
When Facebook built its Graph Search service, the social networking company picked Giraph over other social graphing technologies -- such as GraphLab and the Hadoop-based Apache Hive -- because of Giraph's speed and immense scalability.
"Analyzing these real-world graphs at our scale ... with available software was impossible last year. We needed a programming framework to express a wide range of graph algorithms in a simple way and scale them to massive datasets," wrote Facebook software engineer Avery Ching, in a blog post that discussed Facebook's use of the technology.
With a little modification, Facebook has used Giraph to analyze a trillion edges, or connections between different entities, in under four minutes.
In addition to using Giraph for its Graph Search, Facebook also plans to use the software for other duties such as targeting ads and ranking data.
"Open Graph allows application developers to connect objects in their applications with real-world actions (such as user X is listening to song Y)," Ching explained.
Facebook's Graph Search, while still not as mature as regular search services such as Google's, may be the first widespread public introduction to the benefits of using social graphs.
A social graph maps the complex relationships between many different entities (called nodes). A node can be anything: a person, a restaurant, a city. Nodes are connected by edges, each of which records a relationship between two nodes. An edge might assert, for instance, that a particular person lives in a certain city.
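To make the node-and-edge vocabulary concrete, here is a minimal sketch in Python of a graph stored as adjacency lists. The class name `SocialGraph`, the edge labels and the sample data are all hypothetical, chosen only for illustration; this is not how Facebook or Giraph represents its graph.

```python
class SocialGraph:
    """Toy directed graph with labeled edges (illustrative only)."""

    def __init__(self):
        self.nodes = {}  # node id -> kind of entity, e.g. "person"
        self.edges = {}  # source id -> list of (label, target id)

    def add_node(self, node_id, kind):
        self.nodes[node_id] = kind

    def add_edge(self, source, label, target):
        # An edge only makes sense between two known nodes.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.setdefault(source, []).append((label, target))

    def neighbors(self, source, label=None):
        # Targets reachable from `source`, optionally filtered by edge label.
        return [t for (l, t) in self.edges.get(source, [])
                if label is None or l == label]


graph = SocialGraph()
graph.add_node("alice", "person")         # hypothetical sample data
graph.add_node("lisbon", "city")
graph.add_node("cafe_azul", "restaurant")
graph.add_edge("alice", "lives_in", "lisbon")
graph.add_edge("alice", "likes", "cafe_azul")

home = graph.neighbors("alice", "lives_in")
```

Asking for the `lives_in` neighbors of `alice` follows exactly one edge, the kind of lookup a graph search repeats across a very large number of connections.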
Yahoo first developed Giraph using the principles set forth in a 2010 paper published by Google engineers, "Pregel: a system for large-scale graph processing."
Google designed Pregel around the Bulk Synchronous Parallel model of computing, in which work proceeds in synchronized rounds, so that it could process graphs built from very large data sets on lots of commodity servers.
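The Bulk Synchronous Parallel idea can be sketched in a few lines of Python. In the toy below, every vertex repeatedly takes the largest value it has heard of and passes it on, and all messages sent in one round are delivered only at the start of the next. The function name `run_supersteps`, the data layout and the sample graph are assumptions made for this sketch; it runs in a single process and is not the Pregel or Giraph API.

```python
def run_supersteps(edges, values, max_supersteps=30):
    """Propagate the maximum value through a graph in synchronized rounds.

    edges:  vertex -> list of out-neighbors
    values: vertex -> initial integer value
    """
    values = dict(values)
    inbox = {v: [] for v in values}
    active = set(values)  # every vertex takes part in the first round
    superstep = 0
    while active and superstep < max_supersteps:
        outbox = {v: [] for v in values}
        for v in active:
            new_value = max([values[v]] + inbox[v])
            changed = new_value > values[v]
            values[v] = new_value
            # Send only in the first round or when the value improved.
            if superstep == 0 or changed:
                for neighbor in edges.get(v, []):
                    outbox[neighbor].append(new_value)
        # Barrier: messages become visible only in the next round.
        inbox = outbox
        # A vertex with no incoming messages stays idle.
        active = {v for v, msgs in inbox.items() if msgs}
        superstep += 1
    return values, superstep


# Hypothetical four-vertex chain a - b - c - d, with edges in both directions.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
start = {"a": 3, "b": 6, "c": 2, "d": 1}
final_values, rounds = run_supersteps(chain, start)
```

On this sample every vertex ends up holding 6, the largest starting value. In a real deployment the vertices would be spread across many machines, and the barrier between rounds is what keeps those machines in step.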
As it did with Hadoop, Yahoo bequeathed Giraph to the Apache Software Foundation, where it is now a fully open-source project worked on by developers from Facebook, LinkedIn, Twitter and Hortonworks.
Because Giraph is written in Java, Ching explained, it can connect very easily with the various parts of Facebook's Hadoop deployment, which it relies upon for data storage management and resource scheduling.
Facebook stores its user-generated data in a data warehouse running on Apache Hive, a component of Hadoop. Giraph, however, can generate graphs four times faster than Hive itself. Because it runs on Hadoop's MapReduce, a Giraph job can be split across multiple servers so it can be executed in parallel.