What IT can learn
The typical IT shop would benefit from acting more aggressively as its own worst enemy. Most IT operations are based on manual checklists: when a server goes down, a procedure for bringing it back up is written down somewhere. As anyone with significant experience knows, it is one thing to have a checklist and another to have a well-tested, consistently used checklist, just as it is one thing to have a backup and another thing altogether to have used that backup to successfully restore key applications.
As a first step, IT can be more aggressive about its manual procedures in the following ways:
- When you cause an outage, do it violently. Don't shut down a server step by step. Pull out the network cable or shut off the power. At Google we used the firewall to inflict such violence.
- Don't let someone who owns the service cause the outage. Owners tend to go easy on their baby.
- Have someone who is not involved in the service run the restore procedure from the checklist. A fresh pair of eyes exposes steps that are missing, ambiguous, or out of date, which is how the checklist's quality actually improves.
- The more your development and test environment matches your production infrastructure, the more of this work you can do in dev/test.
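The steps above can be sketched as a small drill script. This is a minimal, hypothetical sketch in Python: the host names, the `ipmitool` and `iptables` command templates, and the ownership check are illustrative assumptions, not a real tool. It runs in dry-run mode by default, so the abrupt kill is printed rather than executed.

```python
import random
import subprocess

# Abrupt failure modes only -- no graceful shutdown, mirroring
# "pull the cable". The iptables rule mimics using the firewall
# to inflict the outage, as described above.
KILL_METHODS = {
    "power_off": "ipmitool -H {host}-bmc power off",
    "firewall_drop": "iptables -I INPUT -s {host} -j DROP",
}

def run_drill(hosts, owners, operator, dry_run=True):
    """Pick a random victim host and take it down violently.

    Refuses to run if the operator owns the service: owners go easy
    on their baby, so someone else has to pull the plug.
    """
    if operator in owners:
        raise ValueError(f"{operator} owns this service; pick an outsider")
    victim = random.choice(hosts)
    method, cmd_template = random.choice(sorted(KILL_METHODS.items()))
    cmd = cmd_template.format(host=victim)
    if dry_run:
        return f"DRY RUN [{method}]: {cmd}"
    subprocess.run(cmd.split(), check=True)  # only in a real, sanctioned drill
    return f"EXECUTED [{method}]: {cmd}"
```

After the kill, hand the restore checklist to someone uninvolved and time the recovery end to end.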
Creating effective, tested manual procedures is the prerequisite for automation, which is a much stronger foundation for disaster recovery than manual methods.
Practice being your own worst enemy
Google has an annual companywide ritual called DiRT (Disaster Recovery Testing), dedicated to finding vulnerabilities and improving manual and automated responses. Every year, the program grows in scope and quality. In essence, even at a place like Google, where everything is highly automated, the company has to practice being its own worst enemy to be good at it.
Focus automation of disaster recovery on the most mission-critical and fragile parts of your apps and infrastructure. Automation not only allows a faster response to outages but also helps you spin up new servers faster in response to scaling events.
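The automated-response idea can be sketched as a watchdog loop. This is a hedged illustration, not a production design: the `probe` and `restart` callables are hypothetical stand-ins for a real health check (say, an HTTP probe) and a real recovery action.

```python
import time

def watchdog(probe, restart, max_failures=3, interval=0.0, max_checks=100):
    """Probe a service and restart it after repeated failures.

    `probe` returns True when the service is healthy; `restart`
    brings it back. Automating this loop turns a paged human
    working through a paper checklist into a response measured
    in seconds.
    """
    failures = 0
    restarts = 0
    for _ in range(max_checks):
        if probe():
            failures = 0  # healthy: reset the failure streak
        else:
            failures += 1
            if failures >= max_failures:
                restart()
                restarts += 1
                failures = 0
        time.sleep(interval)
    return restarts
```

Requiring several consecutive failures before restarting is a deliberate choice: it keeps one flaky probe from triggering a restart storm.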
Once you have automation in place for all the types of outages you can foresee, then it's time to unleash your own simian army. Start out in the test or dev environments until it is hard for you to destroy your app with an outage. Next, hold your breath and try it in production during a maintenance window or at some other low-risk time. When doing this, make sure you include all perspectives. Test from the end-user perspective, not just the interaction between servers.
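A single round of that simian-army style testing can be sketched as follows. This is a minimal, assumed-shape example: `kill` and `end_user_check` are hypothetical hooks for your environment, where the check should exercise the app the way a real user would, for example an HTTP request through the load balancer, rather than only probing server-to-server links.

```python
import random

def chaos_round(instances, kill, end_user_check):
    """Kill one random instance, then verify from the user's side.

    Raises if the app is visibly broken afterwards -- in a test or
    dev environment, that is a cheap lesson; in production, run this
    only in a maintenance window or another low-risk time.
    """
    victim = random.choice(list(instances))
    kill(victim)
    if not end_user_check():
        raise RuntimeError(f"user-visible outage after killing {victim}")
    return victim
```

Run rounds like this repeatedly in dev/test until outages stop breaking the app, then graduate to production cautiously.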
The most important thing is to take the time to be your own worst enemy. Doing so can make the difference between an outage that lasts a few minutes and one that lasts hours or even days.