This vendor-written piece has been edited by Executive Networks Media to eliminate product promotion, but readers should note it will likely favour the submitter's approach.
2016 marks the 10th anniversary of Hadoop. This birthday gives us an opportunity to celebrate, and also to reflect on how we got here and where we are going.
Hadoop has come to symbolise big data, itself central to this century's industrial revolution: the digital transformation of business. Ten years ago, digital business was limited to a few sectors, like e-commerce and media. Since then, we have seen digital technology become essential to nearly every industry. Every industry is becoming data driven, built around its information systems. Big data tools like Hadoop let each industry benefit from all the data it generates.
Hadoop did not cause digital transformation, but it is a critical component of this larger story. Thus by exploring Hadoop's history, we can better understand the century we are now in.
Pre-Hadoop, there were two software traditions, which I will call "enterprise" and "hacker". In the enterprise tradition, vendors developed and sold software to businesses that ran it; the two rarely collaborated. Enterprise software relied on a Relational Database Management System (RDBMS) to address almost every problem. Users trusted only their RDBMS to store and process business data. If it was not in the RDBMS, it was not business data.
In the hacker tradition, software was largely used by the same party that developed it, at universities, research centers, and Silicon Valley web companies. Developers wrote software to address specific problems, like routing network traffic, generating and serving web pages, and so on. I came out of this latter tradition, having worked on search engines for over a decade. We had little use for an RDBMS, since it did not scale to searching the entire web: it was too slow, too inflexible, and too expensive.
In 2000, I launched the Apache Lucene project, working in open source for the first time. The methodology was a revelation. I could collaborate with more than just the developers at my employer, plus I could keep working on the same software when I changed employers. But most important, I learned just how great open source is at making software popular. When software is not encumbered by licensing restrictions, users feel much more comfortable trying it and building their businesses around it, without the risks of hard dependencies on opaque, commercial software. When users find problems, they can get involved and help to fix them, increasing the size of the development team. In short, open source is an accelerant for software adoption and development.
A few years later, around 2004, while working on the Apache Nutch project, we arrived at a second insight. We were trying to build a distributed system that could process billions of web pages. It had been rough going: the software was difficult to develop and operate. We heard rumors that Google engineers had a system where they could, with just a few lines of code, write a computation that would run in parallel on thousands of machines, reliably processing many terabytes in minutes. Then Google published two papers describing how this all worked, a distributed filesystem (GFS) with an execution engine (MapReduce) running on top of it. This approach would make Nutch a much more viable system. Moreover, these tools could probably be used in a lot of other applications. MapReduce had unprecedented potential for large-scale data analysis, but was at that time only available to engineers at Google.
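The appeal of the model the papers described can be sketched in miniature. In MapReduce, the programmer supplies only two functions, a map and a reduce, and the framework handles partitioning, distribution, and fault tolerance across thousands of machines. The toy Python below (function names are mine, and everything runs on one machine, so it illustrates the programming model rather than Hadoop's or Google's actual API) counts words across a set of documents:

```python
from collections import defaultdict
from itertools import chain

def map_fn(document):
    # Map phase: emit an intermediate (key, value) pair per word.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce phase: combine all values emitted for one key.
    return (word, sum(counts))

def run_mapreduce(documents, map_fn, reduce_fn):
    # "Shuffle" phase: group intermediate values by key. In a real
    # cluster this grouping happens across machines; here it is local.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(d) for d in documents):
        groups[key].append(value)
    # One reduce call per distinct key.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

The point the Google papers made is that only `map_fn` and `reduce_fn` are application code; everything else is reusable framework, which is why a few lines sufficed to process terabytes.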