Combining these two insights (the efficacy of open source for spreading technology and the broad applicability of Google's approach), we realized that an open-source implementation of Google's ideas would not only help us in Nutch, but had the potential to become a very successful open-source project. With that realization, Mike Cafarella and I began implementing such a distributed filesystem and MapReduce engine in Nutch.
By 2005, we had this new Google-inspired version of Nutch limping on clusters of 20 to 40 computers, Mike at the University of Washington and me at the Internet Archive. However, I realized that, with just a couple of us working part time, it would take many years to make this software stable and reliable enough for anyone to use easily. Moreover, to truly fulfill its promise, the software needed to be tested and debugged on thousand-computer clusters, which we did not have. The technology needed more engineers and more hardware.
Late that year I gave a talk about Nutch to folks at Yahoo! and learned that they had a great need for this kind of software. They also had a team of skilled engineers to work on it, and plenty of hardware. It was a perfect match.
So in January 2006, I joined Yahoo!. Shortly thereafter we separated the distributed filesystem and MapReduce software from Nutch into a new project, "Hadoop", named after my son's stuffed elephant. With the addition of a dozen or so Yahoo! engineers and access to thousands of their computers, we made rapid progress. By 2007 we had a relatively stable, reliable system that could process petabytes using affordable commodity hardware.
Hadoop was then a game changer. Developers could much more quickly and easily build better methods of advertising, spell-checking, page layout, and so on. Increasingly, users outside of Yahoo! started to deploy Hadoop, at companies like Facebook, Twitter, and LinkedIn. Other projects, like Apache Pig, Apache Hive, and Apache HBase, were soon built on top of Hadoop. Academic researchers began to use Hadoop. We had reached the target I had initially imagined: a popular open-source project that enabled easy, affordable storage and analysis of bulk data.
Little did I know that things were only getting started. A few venture capitalists approached me suggesting that this software might be useful outside of the web and academia. I thought they were crazy. Banks, insurance companies and railways would never run the open-source "hacker" software that I worked on. But the VCs persisted, and, in 2008, they funded Cloudera, the first company whose express mission was to bring Hadoop and related technologies to traditional enterprises.
A year later, in 2009, I began to recognize this possibility. If we could make Hadoop approachable to Fortune 500 companies, it had the potential to change their businesses. As companies were adopting more technology, from websites and call centers to cash registers and bar code scanners, more and more data about their businesses passed through their fingers. If institutions could capture and use more of this data, they could better understand and improve their businesses. Traditional RDBMS-based technologies were a poor match in several dimensions: they were too rigid to support variable, messy data and rapid experimentation; they did not scale easily to petabytes; and they were very expensive. Even a small Hadoop cluster could permit companies to ask and answer bigger questions than before, learning and improving. So I joined Cloudera. This was clearly where Hadoop would make the biggest difference going forward.