16 for '16: What you must know about Hadoop and Spark right now

Andrew C. Oliver | Jan. 8, 2016
Amazingly, Hadoop has been redefined in the space of a year. Let's take a look at all the salient parts of this roiling ecosystem and what they mean

15. Scala/Python

Technically, you can use Java 8 for Spark or Hadoop jobs. In reality, though, Java 8 support is an afterthought, there so salespeople can tell big companies they can still use their Java developers. The truth is that Java 8, used right, is a new language -- and in that context, I consider it a bad knockoff of Scala.

For Spark in particular, Java trails Scala and possibly even Python. I don't really care for Python myself, but it's reasonably well supported by Spark and other tools, and its robust libraries make it the language of choice for many data science, machine learning, and statistical applications. Scala is your first choice for Spark and, increasingly, other toolsets; for the more "mathy" stuff, you may still need Python or R because of those libraries.
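
To make the Scala point concrete, here's a minimal word-count sketch against Spark's Scala RDD API. The HDFS paths are made up; this is an illustration, not a recipe.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("word-count")
        val sc = new SparkContext(conf)

        // Read text, split on whitespace, and count each word
        val counts = sc.textFile("hdfs:///data/input.txt")   // hypothetical input path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.saveAsTextFile("hdfs:///data/output")          // hypothetical output path
        sc.stop()
      }
    }

The Java 8 version of the same job runs fine, but you end up spelling out tuple and function types that Scala simply infers, which is most of why Scala reads better here.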

Remember: If you write jobs in Java 7, you're silly. If you use Java 8, it's because someone lied to your boss.

16. Zeppelin/Databricks

The notebook concept, which most of us first encountered with IPython Notebook, is a hit. Write some SQL or Spark code along with some markdown describing it, add a graph and execute it on the fly, then save the notebook so someone else can derive something from your result.

Ultimately, your data science is documented and executed -- and the charts are pretty!
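
As a rough sketch of how that reads in Zeppelin (assuming the stock %md, %spark, and %sql interpreters, with a made-up table and path), a single note might look like this:

    %md
    ## Daily signups
    A short markdown paragraph explaining what the code below does.

    %spark
    // Scala paragraph: load a (hypothetical) events file and expose it to SQL
    val events = sqlContext.read.json("hdfs:///data/events.json")
    events.registerTempTable("events")

    %sql
    -- SQL paragraph: Zeppelin can render this result as a table or a chart
    SELECT date, count(*) AS signups
    FROM events
    GROUP BY date
    ORDER BY date

The point is that the prose, the code, and the chart all live in one saved, rerunnable place.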

Databricks has a head start, and its solution has matured since I last noted being underwhelmed with it. On the other hand, Zeppelin is open source and isn't tied to buying cloud services from Databricks. You should know one of these tools. Learn one and it won't be a big leap to learn the other.

New technologies to watch

I wouldn't throw these technologies into production yet, but you should certainly know about them.

Kylin: Some queries need low latency, so HBase handles those; larger analytics queries aren't a good fit for HBase, so Hive handles those instead. Moreover, joining a few tables over and over to calculate a result is slow, so "prejoining" and "precalculating" that data into cubes is a major advantage for such datasets. This is where Kylin comes in.
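
To show what "prejoining" and "precalculating" buy you, here's a conceptual Spark/Scala sketch with made-up table names. It illustrates the idea behind cubes, not Kylin's own API:

    import org.apache.spark.sql.functions.sum

    // Hypothetical fact and dimension tables
    val sales = sqlContext.table("sales")
    val products = sqlContext.table("products")

    // Prejoin and precalculate: aggregate once across the dimensions queries will ask about
    val cube = sales.join(products, "product_id")
      .groupBy("category", "region", "month")
      .agg(sum("amount").as("total_amount"))

    // Persist the result; low-latency queries read this table instead of redoing the join
    cube.write.saveAsTable("sales_cube")

Kylin's job is to build, store, and serve cubes like this for you instead of leaving you to hand-roll and maintain such pipelines.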

Kylin is this year's up-and-comer. We've already seen people using Kylin in production, but I'd suggest a bit more caution. Because Kylin isn't for everything, its adoption isn't as broad as Spark's, but Kylin has similar energy behind it. You should know at least a little about it at this point.

Atlas/Navigator: Atlas is Hortonworks' new data governance tool. It isn't even close to fully baked yet, but it's making progress. I expect it will probably surpass Cloudera's Navigator, though if history repeats itself, it will have a less fancy GUI. If you need to know the lineage of a table or, say, map security by tag rather than column by column, then either Atlas or Navigator could be your tool. Governance is a hot topic these days. You should know what one of these doohickeys does.

 
