I write a lot about Hadoop, for the obvious reason that it's the biggest thing going on right now. Last year everyone was talking about it -- and this year everyone is rolling it out.
Some are still in the installation and deployment phase. Some are in the early implementation phases. Some have already used it to tackle real and previously intractable business problems. But trust me, Hadoop is the hottest ticket in town.
Hadoop and the techniques surrounding it are a bit complicated. The complexity doesn't come from mathematics or even computing theories -- much of it is the nature of the beast. There are so many technologies and components to choose from, often with overlapping capabilities, that it's hard to decide what to use to solve which problems.
If you try to learn the entire Hadoop ecosystem at once, you'll find yourself on a fool's errand. Instead, focus on the foundation, then develop a passing familiarity with the solutions for common problems -- and get a sense of what you need.
Core/foundational elements (these you must know)
Hadoop's underpinning is HDFS, a big distributed file system. Think of it as RAID plus CIFS or NFS over the Internet: Instead of striping to disks, it stripes over multiple servers. It offers redundant, reliable, cheap storage. While Hadoop may be most famous for implementing the MapReduce algorithm, HDFS is really the Hadoop ecosystem's foundation. Yet it is a replaceable base, because there are alternative distributed file systems available.
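To make the striping-over-servers idea concrete, here is a tiny conceptual sketch in Python. It is an illustration of the idea only, not real HDFS code; the block size, replication factor, and node names are all made up (real HDFS defaults to 128 MB blocks and a replication factor of 3).

```python
# Conceptual sketch of HDFS-style block storage (illustration only, not HDFS's
# actual implementation): a file is split into fixed-size blocks, and each
# block is replicated to several "datanodes" so losing one node loses no data.

BLOCK_SIZE = 8      # tiny for illustration; real HDFS defaults to 128 MB
REPLICATION = 3     # matches HDFS's default replication factor

def store(data: bytes, nodes: list[dict]) -> list[list[str]]:
    """Split data into blocks and replicate each block across nodes.
    Returns, for each block, the names of the nodes holding a copy."""
    placements = []
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for b, block in enumerate(blocks):
        # Rotate through the nodes so replicas land on distinct machines.
        targets = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
        for node in targets:
            node["blocks"][b] = block
        placements.append([node["name"] for node in targets])
    return placements

nodes = [{"name": f"node{i}", "blocks": {}} for i in range(5)]
layout = store(b"hello distributed file system!", nodes)
print(layout)  # each block lives on 3 of the 5 nodes
```

Because every block exists on three machines, any single node can fail and the file is still fully reconstructable -- the same property that lets HDFS run reliably on cheap commodity hardware.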
YARN (Yet Another Resource Negotiator), which sits on top of HDFS, is a bit hard to describe because, like Hadoop itself, it isn't any one thing. At its base, YARN is a cluster manager for Hadoop, and MapReduce is built on top of it. Yet part of the motivation for YARN is to let the Hadoop ecosystem expand beyond algorithms and technologies based on MapReduce. YARN is built on these concepts:
- MapReduce: You've heard of this. In the case of Hadoop, it is both an algorithm and an API. Abstractly, the algorithm depends on distributing both the data (via HDFS) and the processing (via YARN): a job is split and "mapped" out to many nodes, and the partial results are "reduced" into a final answer. MapReduce is also a Java API provided by Hadoop that lets you create jobs to be processed using the MapReduce algorithm.
Here's an example of how the MapReduce algorithm works: A client named Jim calls, and he's in construction. He knows you, but you don't remember him. Let's say it's 1955, and instead of a computer you have a giant Rolodex (notecards on a spindle) of all your contacts, organized by last name. Jim is going to get wise and realize you don't know who he is long before you flip to the notecard reading "James Smith, Contractor, Steel Building Construction." But suppose instead you have 10 duplicate Rolodexes -- perhaps half covering A-M and the other half N-Z -- and at least 11 people in the office. You could ask your office manager (while you scribble frantically on a pad) to find every card for a "James, Jim, or Jimmy" that mentions "construction" or "contractor."
Your office manager could ask 10 people in the office to start grabbing cards from their Rolodexes. The office manager might also divvy up the job, asking one person to handle only A-B, and so on. The office manager has "mapped" the job to the nodes. The nodes then find the matching cards (the results) and send them to the office manager, who combines them onto one sheet of paper and "reduces" them into the final answer (in our case, perhaps throwing out duplicates or providing a count). With any luck you get one card; worst case, you make a new request based on what you "cold read" from the client on the phone.
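The Rolodex story translates directly into map and reduce steps. Here is a minimal sketch in Python (illustrative only -- real Hadoop jobs use the Java MapReduce API, and the card data here is invented): each "worker" filters its own shard of cards, and the "office manager" merges and de-duplicates the results.

```python
# Minimal map/reduce sketch of the Rolodex search (an illustration, not the
# Hadoop Java API). The cards and search terms are made up for the example.
from itertools import chain

cards = [
    "James Smith, Contractor, Steel Building Construction",
    "Jim Beam, Distiller",
    "Jane Doe, Architect",
    "Jimmy Jones, Construction Foreman",
    "John Brown, Plumber",
]

def map_shard(shard):
    """Each worker filters its own slice of the cards (the 'map' step)."""
    names = ("james", "jim", "jimmy")
    trades = ("construction", "contractor")
    return [c for c in shard
            if any(n in c.lower() for n in names)
            and any(t in c.lower() for t in trades)]

def reduce_results(results):
    """The office manager merges and de-duplicates (the 'reduce' step)."""
    return sorted(set(chain.from_iterable(results)))

# "Divvy up the job": split the cards between 2 workers, map, then reduce.
shards = [cards[:len(cards) // 2], cards[len(cards) // 2:]]
matches = reduce_results(map(map_shard, shards))
print(matches)
```

Note that the map step is embarrassingly parallel -- no worker needs to see another worker's shard -- which is exactly why the algorithm scales so well across a cluster.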
- (Global) Resource Manager: Negotiates resources among the various applications requesting them. Its underlying scheduler is pluggable and can enforce different guarantees, such as SLAs, fairness, and capacity constraints.
- Node Managers: Node Managers are per-machine slaves that monitor and report resource utilization (CPU, memory, disk, network) back to the Resource Manager.
- Application Masters: Each application making requests of your Hadoop cluster generally has one Application Master, though some tools, such as Pig, use a single Application Master across multiple applications. The Application Master composes Resource Requests to the Resource Manager and negotiates (resource) Containers in collaboration with the Node Managers. These Containers, distributed across the nodes and managed by the Node Managers, are the actual resources allocated to the application's requests.
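The interplay of these three components can be sketched in a few lines of Python. This is a toy model only -- real YARN's protocols, scheduling policies, and APIs are far richer -- and the node names, capacities, and request sizes below are invented for illustration.

```python
# Toy sketch of YARN-style container allocation (conceptual only; real YARN
# is far richer). Node names and memory capacities here are made up.
from dataclasses import dataclass

@dataclass
class Container:
    node: str
    memory_mb: int

class ResourceManager:
    def __init__(self, node_capacity):
        # Node Managers report their capacity; track what's free per node.
        self.free = dict(node_capacity)

    def allocate(self, app_id: str, memory_mb: int):
        """Grant a Container on the node with the most free memory, if it fits.
        (A stand-in for the pluggable scheduler's placement decision.)"""
        node = max(self.free, key=self.free.get)
        if self.free[node] < memory_mb:
            return None  # the request waits until resources free up
        self.free[node] -= memory_mb
        return Container(node=node, memory_mb=memory_mb)

rm = ResourceManager({"node1": 4096, "node2": 2048})
c1 = rm.allocate("app-1", 3072)   # lands on node1 (most free memory)
c2 = rm.allocate("app-1", 2048)   # node2 now has the most free memory
print(c1.node, c2.node)
```

In real YARN, the Application Master sends the Resource Requests, the scheduler inside the Resource Manager makes the placement decision, and the Node Managers launch and monitor the granted Containers; the sketch collapses all of that into one method for clarity.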
- Pig: Technically, Pig is the tool and Pig Latin is the language, but nearly everyone uses Pig to refer to both, as allingca ita igpa atinla isa illysa. While it is possible to execute jobs on Hadoop using SQL, you'll find SQL a relatively limited language for potentially unstructured or differently structured data. Pig feels to me like what you'd get if Perl, SQL, and regular expressions had a love child. It isn't super hard to learn or use, and it lets you create MapReduce jobs in far fewer lines of code than the MapReduce API requires. There are lots of things you can MapReduce with Pig that you can't do with SQL.
- Hive and Impala: Hive is essentially a framework for data warehousing on top of Hadoop. If you love your SQL and really want to run SQL on Hadoop, Hive is for you. Impala is another implementation of the same general idea; it is also open source but not hosted at Apache. Why can't we all get along? Well, Hive is more or less backed by Hortonworks and Impala by Cloudera. Cloudera says Impala is faster, and most third-party benchmarks seem to agree (though not with the 100x speedup Cloudera claims). Cloudera doesn't claim that Impala is a complete replacement for Hive (it ships Hive as well), only that Impala is superior for some use cases.