Big data lets you analyze much more data from more sources, but at lower resolution. Thus, we will be living with both traditional data warehouses and the new style for some time to come.
The technology breakthroughs behind big data
Achieving the four required facets of big data (volume, variety, nondestructive use, and speed) took several technology breakthroughs, including the development of a distributed file system (Hadoop's HDFS), a method to make sense of disparate data on the fly (first Google’s MapReduce, and more recently Apache Spark), and a cloud/internet infrastructure for accessing and moving the data as needed.
Until about a dozen years ago, it wasn’t possible to manipulate more than a relatively small amount of data at any one time. (Well, we all thought our data warehouses were massive at the time. The context has shifted dramatically since then as the internet produced and connected data everywhere.) Limitations on the amount and location of data storage, computing power, and the ability to handle disparate data formats from multiple sources made the task all but impossible.
Then, sometime around 2003, researchers at Google developed MapReduce. This programming technique simplifies dealing with large data sets by first mapping the data to a series of key/value pairs, then performing calculations on all the values that share a key to reduce them to a single value. Each chunk of data is processed in parallel on hundreds or thousands of low-cost machines. This massive parallelism allowed Google to generate faster search results from increasingly large volumes of data.
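To make the map and reduce steps concrete, here is a minimal single-machine sketch in Python that counts words. It is only an illustration of the idea, not Google's or Hadoop's implementation: the function names (`map_chunk`, `shuffle`, `reduce_key`, `word_count`) are made up for this example, and a thread pool stands in for the many separate machines a real cluster would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map step: emit a (key, value) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group values by key so each key can be reduced on its own."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_key(key, values):
    """Reduce step: collapse all values for one key to a single value."""
    return key, sum(values)


def word_count(chunks, workers=4):
    # Threads are a stand-in (an assumption of this sketch) for the
    # low-cost machines that would each process one chunk.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    groups = shuffle(mapped)
    return dict(reduce_key(k, v) for k, v in groups.items())


chunks = ["big data big clusters", "data moves fast", "big jobs run fast"]
counts = word_count(chunks)
print(counts["big"], counts["data"], counts["fast"])  # prints: 3 2 2
```

Because each chunk is mapped independently and each key is reduced independently, both steps can be spread across as many workers as are available.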
Hadoop, an open source Apache project inspired by Google's published work on MapReduce and distributed storage, was another of the breakthroughs that made big data possible. It consists of two key services:
- reliable data storage using the Hadoop Distributed File System (HDFS)
- high-performance parallel data processing using a technique called MapReduce.
Hadoop runs on a collection of commodity, shared-nothing servers. You can add or remove servers in a Hadoop cluster at will; the system detects and compensates for hardware or system problems on any server. Hadoop, in other words, is self-healing. It can deliver data—and run large-scale, high-performance processing jobs—in spite of system changes or failures.
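One way a cluster can compensate for a lost server is to keep several copies of each block of data and re-copy any block that falls below the target count; HDFS replicates blocks across servers along these lines. The toy simulation below sketches that idea only. It is not Hadoop's actual placement or recovery algorithm, and the names (`REPLICAS`, `place_blocks`, `handle_failure`, the `node`/`blk` labels) and the replica count of 3 are assumptions chosen for illustration.

```python
import random

REPLICAS = 3  # assumed number of copies kept per block (illustrative)


def place_blocks(block_ids, servers, replicas=REPLICAS, seed=0):
    """Assign each block to `replicas` distinct servers."""
    rng = random.Random(seed)
    return {b: set(rng.sample(sorted(servers), replicas)) for b in block_ids}


def handle_failure(placement, servers, failed, replicas=REPLICAS, seed=1):
    """Drop the failed server and re-copy any under-replicated block."""
    rng = random.Random(seed)
    alive = sorted(set(servers) - {failed})
    for holders in placement.values():
        holders.discard(failed)
        # Surviving servers that do not yet hold this block.
        spare = [s for s in alive if s not in holders]
        while len(holders) < replicas and spare:
            holders.add(spare.pop(rng.randrange(len(spare))))
    return alive


servers = {"node1", "node2", "node3", "node4", "node5"}
placement = place_blocks(["blk_a", "blk_b", "blk_c"], servers)
alive = handle_failure(placement, servers, failed="node2")

# Every block should be back to full strength without the failed node.
healthy = all(
    len(h) == REPLICAS and "node2" not in h for h in placement.values()
)
print(healthy)
```

As long as enough servers survive, every block returns to its target number of copies, which is the sense in which such a cluster can keep serving data through a failure.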
Although Hadoop provides a platform for data storage and parallel processing, the real value comes from add-ons, cross-integration, and custom implementations of the technology. To that end, Hadoop offers subprojects, which add functionality and new capabilities to the platform:
- Hadoop Common: The common utilities that support the other Hadoop subprojects.
- Chukwa: A data collection system for managing large distributed systems.
- HBase: A scalable, distributed database that supports structured data storage for large tables.
- HDFS: A distributed file system that provides high-throughput access to application data.
- Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- MapReduce: A software framework for distributed processing of large data sets on compute clusters.
- Pig: A high-level data-flow language and execution framework for parallel computation.
- ZooKeeper: A high-performance coordination service for distributed applications.