Subscribe / Unsubscribe Enewsletters | Login | Register

Pencil Banner

The 10 worst big data practices

Andrew C. Oliver | July 18, 2014
It's a new world full of shiny toys, but some have sharp edges. Don't hurt yourself or others. Learn to play nice with them.

What about, say, using Mahout to find customer orders that are common outliers? In most companies, most customer orders resemble each other. But what about the orders that happen often enough but don't match common ones? These may be too small for salespeople to care about, but they may indicate a future line of business for your company (that is, actual business development). If you can't drum up at least a couple of good real-world uses for Hadoop, maybe you don't need it after all.

5. Thinking Hive is the be-all, end-all. You know SQL. You like SQL. You've been doing SQL. I get it, man, but maybe you can grow, too? Maybe you should reach deep down a decade or three and remember the young kid who learned SQL and saw the worlds it opened up for him. Now imagine him learning another thing at the same time.

You can be that kid again. You can learn Pig, at least. It won't hurt ... much. Think of it as PL/SQL on steroids with maybe a touch of acid. You can do this! I believe in you! To do a larger bit of analytics, you may need a bigger tool set that may include Hive, Pig, MapReduce, Uzi, and more. Never say, "Hive can't do it, so we can't do it." The whole point of big data is to expand beyond what you could do with one technology.

6. Treating HBase like an RDBMS. You went nosing around Hadoop and realized indeed there was a database. Maybe you found Cassandra, but most likely you found HBase. Phew, a database -- now I don't have to try so hard! Trust me, HDFS-plus-Hive will drain less glucose from your head muscle (IANAD).

The only real commonality between HBase and your RDBMS is that both have something resembling a table. You can do things with HBase that would make your RDBMS's head spin, but the reverse is also true. HBase is good for what HBase is good for, and it is terrible at nearly everything else. If you try and represent your whole RDBMS schema as-is in HBase, you will experience a searing hot migraine that will make your head explode.

7. Installing 100 nodes by hand. Oh my gosh, really? You are going to hand-install Hadoop and all its moving parts on 100 nodes by Tuesday? Nothing beats those hand-rolled bits and bytes, huh? That is all fun and good until someone loses a node and you're hand-rolling those too. At the very least, use Puppet -- actually, use Ambari (or your distribution's equivalent) first.

8. RAID/LVM/SAN/VMing your data nodes. Hadoop stripes blocks of data across multiple nodes, and RAID stripes it across multiple disks. Put them together, what do you have? A roaring, low-performing, latent mess. This isn't even turducken -- it's like roasting a turkey inside of a turkey. Likewise, LVM is great for internal file systems, but you're not really going to randomly decide all hundred of your data nodes need to be larger, instead of, like, adding a few more data nodes.


Previous Page  1  2  3  Next Page 

Sign up for CIO Asia eNewsletters.