I love the elephant. The elephant loves me. Nothing is perfect, however, and sometimes friends fight.
Here are the things I fight with Hadoop about.
1. Pig vs. Hive
You cannot use Hive UDFs in Pig. You have to use HCatalog to access Hive tables in Pig. You cannot use Pig UDFs in Hive. Whether it's one little piece of extra functionality I need while in Hive but don't feel like writing a full-on Pig script for, or it's the "gee, I could easily do this if I were just in Hive" moment while I'm writing Pig scripts, I frequently think, "Tear down this wall!" when I'm writing in either.
2. Being forced to store all my shared libraries in HDFS
This is a recurring theme in Hadoop. If you store your Pig script on HDFS, then Pig automatically assumes any JAR files will be there as well (I'm working on fixing that myself). This general theme repeats in Oozie and other tools. It's usually sensible, but at times, having an organization-wide forced shared library version is painful. Besides, more than half the time, these are the same JAR files you installed everywhere you installed the client, so why store them twice? This is being fixed in Pig. How about everything else?
3. Oozie
Debugging is not fun, and the docs have lots of examples that use the old schema. When you get an error, it usually has nothing to do with whatever you did wrong. It may be a "protocol error" for a configuration typo, or a schema validation error for a schema that passes the schema validator but fails on the server. To a great degree, Oozie is like Ant or Maven, except distributed, with no tooling, and a bit brittle.
4. Error messages
You're joking, right? Speaking of error messages, my favorite is the one where any of the Hadoop tools says, "failure, no error returned," which translates to "something happened, good luck finding it."
5. Kerberos
'Nuff said? If you want to secure Hadoop in a way that was relatively thought out, you get to use Kerberos. Remember Kerberos, and how fun and antiquated it is? So you go straight to LDAP, except that nothing in Hadoop is integrated: no single sign-on, no SAML, no OAuth, and nothing passes the credentials around (instead, each piece re-authenticates and re-authorizes). Even more fun, each part of the Hadoop ecosystem wrote its own LDAP support, so it's inconsistent.
6. Knox
Because writing a proper LDAP connector needs to be done at least 100 more times in Java before we get it right. Gosh, go look at that code. It doesn't really pool connections properly. In fact, I kind of think Knox was created out of a zeal for Java or something. You could do the same with a well-written Apache config, mod_proxy, and mod_rewrite. In fact, that's basically what Knox is, except in Java. To boot, after it authenticates and authorizes, it doesn't pass the information on to Hive or WebHDFS or whatever you're accessing, so that service gets to do it all over again.
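To make the Apache comparison concrete, here is a rough sketch of what such a front end might look like: an httpd virtual host that checks credentials against LDAP and then proxies to WebHDFS. Every host name, port, path, and LDAP DN in it is a placeholder I made up for illustration, and it is not a drop-in Knox replacement, just the general shape of the idea.

```apache
# Hypothetical gateway in front of WebHDFS. All host names, ports, file
# paths, and LDAP DNs are placeholders, not values from a real cluster.
# Assumes mod_ssl, mod_proxy, mod_proxy_http, mod_ldap, and
# mod_authnz_ldap are loaded.
<VirtualHost *:443>
    ServerName gateway.example.com

    SSLEngine on
    SSLCertificateFile    /etc/httpd/tls/gateway.crt
    SSLCertificateKeyFile /etc/httpd/tls/gateway.key

    <Location "/webhdfs/">
        # Authenticate the caller against LDAP before anything is proxied.
        AuthType Basic
        AuthName "Hadoop gateway"
        AuthBasicProvider ldap
        AuthLDAPURL "ldap://ldap.example.com:389/ou=people,dc=example,dc=com?uid"
        Require valid-user

        # Inside <Location>, ProxyPass takes only the backend URL.
        ProxyPass        "http://namenode.example.com:50070/webhdfs/"
        ProxyPassReverse "http://namenode.example.com:50070/webhdfs/"
    </Location>
</VirtualHost>
```

Note that WebHDFS redirects reads and writes to individual DataNodes, so a real setup would also need to rewrite those redirects, which is presumably where mod_rewrite would earn its keep. And like Knox, this sketch does nothing to hand the authenticated identity to the service behind it.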