16 for '16: What you must know about Hadoop and Spark right now

Andrew C. Oliver | Jan. 8, 2016
Amazingly, Hadoop has been redefined in the space of a year. Let's take a look at all the salient parts of this roiling ecosystem and what they mean.

1. Spark

Spark is obviating the need not only for MapReduce and Tez, but possibly also for tools like Pig. Moreover, Spark's RDD and DataFrames APIs aren't bad ways to do ETL and other data transformations. Meanwhile, Tableau and other data visualization vendors have announced their intent to support Spark directly.
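To make that concrete, here's a minimal sketch (in Scala, against the Spark 1.x API current as of this writing) of the kind of ETL you might otherwise hand to Pig: read raw text off HDFS, drop bad rows, and write a cleaned, columnar copy back out. The paths and column names are invented for illustration.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object SparkEtlSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("etl-sketch"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Read raw comma-separated text from HDFS, drop malformed rows, and type the fields.
        val rows = sc.textFile("hdfs:///data/raw/events.csv")
          .map(_.split(","))
          .filter(_.length == 3)
          .map(f => (f(0), f(1), f(2).toLong))

        // Treat the RDD as a DataFrame, filter it, and write it back out as Parquet.
        rows.toDF("userId", "action", "ts")
          .filter($"ts" > 0)
          .select("userId", "action", "ts")
          .write.parquet("hdfs:///data/clean/events")

        sc.stop()
      }
    }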

2. Hive

Hive lets you run SQL queries against text files or structured files. When you use Hive, those files usually live in HDFS; Hive catalogs them and exposes them as if they were tables. Your favorite SQL tool can connect to Hive via JDBC or ODBC.

In short, Hive is a boring, slow, useful tool. By default, it converts your SQL into MapReduce jobs. You can switch it to use the DAG-based Tez, which is much faster. You can also switch it to use Spark, but the word "alpha" doesn't really capture the experience.

You need to know Hive because so many Hadoop projects begin with "let's dump the data somewhere" and thereafter "oh by the way, we want to look at it in a [favorite SQL charting tool]." Hive is the most straightforward way to do that. You may need other tools to do that performantly (such as Phoenix or Impala).
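For example, talking to HiveServer2 over JDBC looks like any other JDBC connection, and you can flip the execution engine for a session with a single SET statement. Here's a minimal Scala sketch; the host, user, table, and query are hypothetical.

    import java.sql.DriverManager

    object HiveJdbcSketch {
      def main(args: Array[String]): Unit = {
        // HiveServer2's JDBC driver and connection URL.
        Class.forName("org.apache.hive.jdbc.HiveDriver")
        val conn = DriverManager.getConnection(
          "jdbc:hive2://hive-host:10000/default", "analyst", "")
        try {
          val stmt = conn.createStatement()
          // Run this session on Tez instead of MapReduce.
          stmt.execute("SET hive.execution.engine=tez")
          val rs = stmt.executeQuery(
            "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")
          while (rs.next()) {
            println(s"${rs.getString("page")}\t${rs.getLong("hits")}")
          }
        } finally {
          conn.close()
        }
      }
    }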

3. Kerberos

I loathe Kerberos, and it isn't all that fond of me, either. Unfortunately, it's the only fully implemented authentication for Hadoop. You can use tools like Ranger or Sentry to reduce the pain, but you'll still probably integrate with Active Directory via Kerberos.
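In practice, a service talking to a Kerberized cluster usually authenticates with a keytab through Hadoop's UserGroupInformation API rather than an interactive kinit. A minimal Scala sketch follows; the principal and keytab path are hypothetical.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.security.UserGroupInformation

    object KerberosLoginSketch {
      def main(args: Array[String]): Unit = {
        // Tell the Hadoop client libraries that the cluster expects Kerberos.
        val conf = new Configuration()
        conf.set("hadoop.security.authentication", "kerberos")
        UserGroupInformation.setConfiguration(conf)

        // Log in as a service principal using a keytab file.
        UserGroupInformation.loginUserFromKeytab(
          "etl-svc@EXAMPLE.COM", "/etc/security/keytabs/etl-svc.keytab")

        println(s"Logged in as: ${UserGroupInformation.getLoginUser.getUserName}")
      }
    }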

4. Ranger/Sentry

If you don't use Ranger or Sentry, then each little bit of your big data platform will do its own authentication and authorization. There will be no central control, and each component will have its own weird way of looking at the world.

But which one to choose: Ranger or Sentry? Well, Ranger seems a bit ahead and more complete at the moment, but it's Hortonworks' baby. Sentry is Cloudera's baby. Each supports the part of the Hadoop stack that its vendor supports. If you're not planning to get support from Cloudera or Hortonworks, then I'd say Ranger is the better offering at the moment. However, Cloudera's head start on Spark and the big security plans the company announced as part of its One Platform strategy will certainly pull Sentry ahead. (Frankly, if Apache were functioning properly, it would pressure both vendors to work together on one offering.)

5. HBase/Phoenix

HBase is a perfectly acceptable column family data store. It's also built into your favorite Hadoop distributions, it's supported by Ambari, and it connects nicely with Hive. If you add Phoenix, you can even use your favorite business intelligence tool to query HBase as if it were a SQL database. If you're ingesting a stream of data via Kafka and Spark or Storm, then HBase is a reasonable landing place for that data to persist, at least until you do something else with it.
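To illustrate, here's a minimal Scala sketch of querying an HBase table through Phoenix's JDBC driver; the ZooKeeper quorum and table name are hypothetical.

    import java.sql.DriverManager

    object PhoenixQuerySketch {
      def main(args: Array[String]): Unit = {
        // Phoenix exposes HBase tables over a standard JDBC driver;
        // the URL points at the cluster's ZooKeeper quorum.
        Class.forName("org.apache.phoenix.jdbc.PhoenixDriver")
        val conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181")
        try {
          val rs = conn.createStatement()
            .executeQuery("SELECT event_type, COUNT(*) FROM events GROUP BY event_type")
          while (rs.next()) {
            println(s"${rs.getString(1)}\t${rs.getLong(2)}")
          }
        } finally {
          conn.close()
        }
      }
    }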

 
