Most implementations of a Hadoop platform include at least some of these subprojects, as they are often necessary for exploiting big data. For example, most organizations choose to use HDFS as the primary distributed file system and HBase as a database, which can store billions of rows of data. And the use of MapReduce or the more recent Spark is almost a given since they bring speed and agility to the Hadoop platform.
With MapReduce, developers can create programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers. The MapReduce framework is broken down into two functional areas:
- Map, a function that parcels out work to different nodes in the distributed cluster.
- Reduce, a function that collates the work and resolves the results into a single value.
One of MapReduce’s primary advantages is that it is fault-tolerant, which it accomplishes by monitoring each node in the cluster; each node is expected to report back periodically with completed work and status updates. If a node remains silent for longer than the expected interval, a master node makes note and reassigns the work to other nodes.
Apache Hadoop, an open-source framework that uses MapReduce at its core, was developed two years later. Originally built to index the now-obscure Nutch search engine, Hadoop is now used in virtually every major industry for a wide range of big data jobs. Thanks to Hadoop’s Distributed File System and YARN (Yet Another Resource Negotiator), the software lets users treat massive data sets spread across thousands of devices as if they were all on one enormous machine.
In 2009, University of California at Berkeley researchers developed Apache Spark as an alternative to MapReduce. Because Spark performs calculations in parallel using in-memory storage, it can be up to 100 times faster than MapReduce. Spark can work as a standalone framework or inside Hadoop.
Even with Hadoop, you still need a way to store and access the data. That’s typically done via a NoSQL database like MongoDB, like CouchDB, or Cassandra, which specialize in handling unstructured or semi-structured data distributed across multiple machines. Unlike in data warehousing, where massive amounts and types of data are converged into a unified format and stored in a single data store, these tools don’t change the underlying nature or location of the data—emails are still emails, sensor data is still sensor data—and can be stored virtually anywhere.
Still, having massive amounts of data stored in a NoSQL database across clusters of machines isn’t much good until you do something with it. That’s where big data analytics comes in. Tools like Tableau, Splunk, and Jasper BI let you parse that data to identify patterns, extract meaning, and reveal new insights. What you do from there will vary depending on your needs.
InfoWorld Executive Editor Galen Gruman, InfoWorld Contributing Editor Steve Nunez, and freelance writers Frank Ohlhorst and Dan Tynan contributed to this story.
Sign up for CIO Asia eNewsletters.