And your SAN, your holy SAN -- loved by many, I/O bound, and latent to all. You're using HDFS for a higher burst rate, so now you're going to stick everything back in the box? The idea is to scale horizontally -- how are you going to do that across the same network pipe to the same box o' disks?
Hey, EMC will sell you more SAN, but maybe you need to think outside the box. VMs are great. However, if you want high-end performance, I/O is king. Fine, you can virtualize the name node and much of the rest of Hadoop, but nothing beats bare metal for data nodes. You can achieve much of the same advantage as virtualization with devops tools. Even most of the cloud vendors are offering metal options.
9. Treating HDFS as just a file system. If you dump stuff onto HDFS, you haven't necessarily accomplished anything. The tooling around it is important, of course. Now you can Hive, Pig, and MapReduce it, but you have to think a bit about what, why, and where you're dumping things onto HDFS. You need to think about how you're going to secure all of this and for whom.
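One way to make the "what, why, where, and for whom" concrete is to write it down before anything lands on the cluster. Here is a minimal sketch in plain Python, assuming nothing about your setup: the zone names, the `/data/...` path scheme, the `Dataset` fields, and the example group are all hypothetical, and it doesn't call any Hadoop API. It only shows the kind of bookkeeping worth doing up front.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical landing zones; substitute whatever your team agrees on.
ZONES = {"raw", "curated", "sandbox"}


@dataclass(frozen=True)
class Dataset:
    name: str         # what is being stored
    purpose: str      # why it is being stored
    zone: str         # where it lands
    owner_group: str  # for whom: the group that should be able to read it
    mode: str         # POSIX-style permission string, e.g. "750"


def landing_path(ds: Dataset, day: date) -> str:
    """Build a predictable, date-partitioned directory for one day's load."""
    if ds.zone not in ZONES:
        raise ValueError(f"unknown zone: {ds.zone}")
    if not ds.purpose.strip():
        raise ValueError("every dataset needs a stated purpose")
    return f"/data/{ds.zone}/{ds.name}/dt={day.isoformat()}"


def world_readable(ds: Dataset) -> bool:
    """True if the 'other' permission digit grants read access."""
    return (int(ds.mode[-1]) & 4) != 0


# Illustrative values only.
clicks = Dataset("web_clicks", "funnel analysis", "raw", "analytics", "750")
print(landing_path(clicks, date(2015, 1, 15)))
print(world_readable(clicks))
```

The point isn't this particular layout. It's that a dataset with no stated purpose or owner gets rejected before it ever becomes an anonymous directory nobody dares delete.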
10. Whoo, shiny! Also known as, "today is Thursday, let's move to Spark." Yes, Hadoop is a growing ecosystem, and you want to stay ahead of the curve. I feel you, man, but let's remember that freedom is just another word for nothing left to lose. Once you have real data and real users, you don't have the same amount of freedom as when you had no real responsibility. Now you must have a plan.
Fortunately, you have the tools to manage that evolution and move forward responsibly. Maybe you don't get to deploy this week's cool thing while it is fresh, but you don't have to run Hadoop 1.1 anymore, either. As with any technology -- or anything in life -- find that moderate path that prevents you from being the last gazelle in the pack or the first lemming off the cliff.
This is the current top 10 I'm seeing in the field. How's your big data project going? What anti-patterns or patterns have you found?