I suspect this is the last year that Pig makes my list. Spark is much faster and can be used for a lot of the same ETL cases -- and Pig Latin (yes, that's what they call the language you write with Pig) is a bit bizarre and often frustrating. As you might imagine, running Pig on top of Spark entails hard work.
Theoretically, people doing SQL on Hive can move to Pig in the same way that they used to go from SQL to PL/SQL, but in truth, Pig isn't as easy as PL/SQL. There might be room for something between plain old SQL and full-on Spark, but I don't think Pig is it. Coming from the other direction is Apache Nifi, which might let you do some of the same ETL with little or no code. We already use Kettle to reduce the amount of ETL code we write, which is pretty darn nice.
YARN and Mesos enable you to queue and schedule jobs across the cluster. Everyone is experimenting with various approaches: Spark to YARN, Spark to Mesos, Spark to YARN to Mesos, and so on. But know that Spark's Standalone mode isn't very realistic for busy multijob, multi-user clusters. If you're not using Spark exclusively and still running Hadoop batches, then go with YARN for now.
Nifi would have had to try hard not to be an improvement over Oozie. Various vendors are calling Nifi the answer to the Internet of things, but that's marketing noise. In truth, Nifi is like Spring Integration for Hadoop. You need to pipe data through transforms and queues, then land it somewhere on a schedule -- or from various sources based on a trigger. Add a pretty GUI and that's Nifi. The power is that someone wrote an awful lot of connectors for it.
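To make "pipe data through transforms and queues, then land it somewhere" concrete, here is a minimal sketch in plain Python. It is not Nifi's API and has none of its connectors, scheduling, or GUI; the names `stage` and `run_flow` are made up for illustration. It only shows the shape of the idea: each transform runs on its own thread, with a queue between neighbors.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def stage(transform, inbox, outbox):
    """Pull records from inbox, apply transform, push results to outbox."""
    while True:
        record = inbox.get()
        if record is _DONE:
            outbox.put(_DONE)  # propagate shutdown downstream
            return
        outbox.put(transform(record))


def run_flow(source, transforms, sink):
    """Wire source -> transforms (one thread each) -> sink (hypothetical helper)."""
    queues = [queue.Queue() for _ in range(len(transforms) + 1)]
    workers = [
        threading.Thread(target=stage, args=(t, queues[i], queues[i + 1]))
        for i, t in enumerate(transforms)
    ]
    for w in workers:
        w.start()
    for record in source:
        queues[0].put(record)
    queues[0].put(_DONE)
    # Drain the final queue and land each record in the sink.
    while True:
        record = queues[-1].get()
        if record is _DONE:
            break
        sink.append(record)
    for w in workers:
        w.join()
    return sink


if __name__ == "__main__":
    landed = run_flow(
        source=["  alpha ", "beta  "],
        transforms=[str.strip, str.upper],
        sink=[],
    )
    print(landed)  # ['ALPHA', 'BETA']
```

Writing this much is easy. The hard part, and the reason to reach for a tool, is the long tail of sources, sinks, retries, and triggers that sit on either end.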
If you need this today but want something a bit more mature, use Pentaho's Kettle (along with other associated kitchenware, such as Spoon). These tools have been working in production for a while. We've used them. They're pretty nice, honestly.
While Knox is perfectly adequate edge protection, all it does is provide a reverse proxy written in Java with authentication. It's not very well written; for one thing, it obscures errors. For another, even though it relies on URL rewriting, merely adding a new service behind it requires a whole Java implementation.
You need to know Knox because if someone wants edge protection, this is the "approved" means of providing it. Frankly, a small modification or add-on for HTTPD's mod_proxy would have been more functional and offered a broader range of authentication options.
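For a sense of how little an authenticating reverse proxy needs at its core, here is a rough sketch of the two decisions it makes per request: is the caller authenticated, and where does the rewritten URL point? This is neither Knox's configuration model nor mod_proxy's. The routing table, hostnames, ports, and credentials are all hypothetical, and the actual forwarding of the request is left out.

```python
import base64
import hmac

# Hypothetical routing table: gateway path prefix -> backend base URL.
ROUTES = {
    "/gateway/webhdfs/": "http://namenode.example:50070/webhdfs/",
    "/gateway/oozie/": "http://oozie.example:11000/oozie/",
}

# Hypothetical credential store; a real deployment would use LDAP or similar.
USERS = {"etl": "s3cret"}


def rewrite(path):
    """Map an incoming gateway path to a backend URL, or None if unrouted."""
    for prefix, backend in ROUTES.items():
        if path.startswith(prefix):
            return backend + path[len(prefix):]
    return None


def authenticated(header):
    """Check an HTTP Basic Authorization header against USERS."""
    if not header or not header.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(header[6:]).decode()
    except ValueError:  # covers malformed base64 and invalid UTF-8
        return False
    user, _, password = decoded.partition(":")
    expected = USERS.get(user)
    if expected is None:
        return False
    # Constant-time comparison on bytes to avoid leaking timing information.
    return hmac.compare_digest(password.encode(), expected.encode())


def handle(path, auth_header):
    """Return the (status, backend_url) pair the proxy would act on."""
    if not authenticated(auth_header):
        return 401, None
    target = rewrite(path)
    return (200, target) if target else (404, None)
```

In this sketch, putting a new service behind the gateway is one more entry in `ROUTES`, which is the kind of effort you would expect from something built on URL rewriting.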