Webscale pioneers like Netflix, Google, Amazon and Etsy have made a science of breaking their own applications and infrastructure so that they can determine if their application and operations architecture is complete and robust. While few IT shops run their apps and infrastructure in the same way as these behemoths, valuable lessons for CIOs and CTOs of all stripes can be found in the innovative practices these companies have created.
In a theoretical sense, webscale companies have a simpler problem to solve than most organizations. These players run one or a few massive services, and while there are lots of components to these services, they are generally better understood and built to work together, unlike what you find in traditional enterprise IT. A typical shop has dozens of interacting components with dependencies that are often neither documented nor widely understood.
The only way that webscale companies manage tens of thousands of servers is by automating absolutely everything. Webscale companies are also usually very disciplined about making their development and test environments identical to their production environments in as many ways as possible. Because, for the most part, webscale companies practice DevOps, they are developing operational procedures, searching for vulnerabilities, and creating automated responses all through the development and release management cycle.
It is important to remember, however, that any webscale development or test environment is really just a shadow of the production environment, which contains complexities and scale that cannot be replicated. Deploying into production is therefore always risky.
The ravages of Chaos Monkey
Chaos Monkey is the original tool Netflix used to reduce risks in its production environment.
Chaos Monkey randomly shuts down servers, services and other components to make sure that failure does not lead to any disruption to users. In practice, Chaos Monkey tests two things:
- The ability of the application and operational architecture to conceal failure
- The quality of automated responses for recovering from failure
Ideally, Chaos Monkey should be able to wreak havoc all day long and users should never notice, simulating a process of continuous disaster recovery. When Chaos Monkey does cause a problem, either the application and operational architecture or the automation of disaster recovery must be fixed.
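To make the two checks concrete, here is a minimal sketch in Python of the idea, not Netflix's actual tool. It is a toy simulation: the `Cluster` class, its `min_healthy` parameter and the `chaos_run` function are hypothetical names invented for illustration, and "killing" an instance just removes a name from a set rather than touching any real server.

```python
import random


class Cluster:
    """Toy model of a replicated service (hypothetical, for illustration only)."""

    def __init__(self, names, min_healthy):
        self.all_names = list(names)
        self.healthy = set(names)
        self.min_healthy = min_healthy  # replica count automation restores

    def kill(self, name):
        # Simulated failure: the instance simply stops being healthy.
        self.healthy.discard(name)

    def serves_users(self):
        # Failure is concealed as long as at least one replica is still up.
        return len(self.healthy) >= 1

    def auto_recover(self):
        # Automated response: bring back failed instances until the target is met.
        for name in self.all_names:
            if len(self.healthy) >= self.min_healthy:
                break
            if name not in self.healthy:
                self.healthy.add(name)


def chaos_run(cluster, rounds, rng):
    """Kill one random healthy instance per round.

    Returns the number of rounds in which users would have noticed an outage.
    """
    user_visible = 0
    for _ in range(rounds):
        if cluster.healthy:
            victim = rng.choice(sorted(cluster.healthy))
            cluster.kill(victim)
        if not cluster.serves_users():
            user_visible += 1
        cluster.auto_recover()
    return user_visible


if __name__ == "__main__":
    rng = random.Random(0)
    redundant = Cluster(["web-1", "web-2", "web-3"], min_healthy=3)
    fragile = Cluster(["web-1"], min_healthy=1)
    print("redundant outages:", chaos_run(redundant, 100, rng))
    print("fragile outages:", chaos_run(fragile, 100, rng))
```

In this sketch the redundant cluster absorbs every kill because a surviving replica conceals the failure and the recovery step restores capacity before the next round, while the single-instance cluster produces a user-visible outage on every kill. A nonzero count is the signal that either the architecture or the recovery automation needs work.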
Netflix also now has its own simian army, including Chaos Gorilla, which takes out an entire availability zone. (To read more about havoc that can be wreaked by other members of the simian army, such as Latency Monkey, Conformity Monkey, Doctor Monkey and Security Monkey, see the Netflix blog on that topic.)
Each of the webscale players has its own bag of tricks for being its own worst enemy. When I was at Google, we used firewall rules to simulate network outages. Etsy has developed a huge arsenal of automated tests so it can deploy changes many times a day, confident that any problems will be quickly found. Loudcloud encouraged discipline by offering a 100% SLA with financial remedies. Amazon doesn't talk much about what it has learned. I wish it did.