Don't play the blame game
Whether it's external problems or an increasing willingness to try "more risky fare like fail-fast experimentation, open hackathons, and citizen developer programs," CIOs are even more likely to face major IT failures, Dion Hinchcliffe, VP and principal analyst at Constellation Research, told CIO.com.
"The first step is to prepare for failures with solid contingency plans, but it's also key to learn from failure through an honest and open, blame-free process."
He admits that "this can be hard for IT for practical reasons - given the already maximized work schedules - as well as human ones: A hit to morale can occur when really digging into the root cause of failures and observing dysfunction."
If the investigation focuses on assigning blame rather than understanding the systemic failures that led to the incident, you won't make staff feel safe enough to share information, suggest solutions, warn you about possible issues or absorb the lessons of the incident.
To help avoid blame, Nather suggests "not looking backwards and rehashing it and saying 'If only this had happened...' It's better to say, 'If we assume this could happen again, how could we respond better this time?'" Not only does that remove the notion of finding fault, but it's also more realistic. "Everyone would like to look at an incident and say, 'We'll never have that happen again,' but you can't really say that!"
Rather than assigning blame, Lambert recommends understanding the reasoning behind decisions. "Often, doing dumb stuff is about not having time to do good stuff. People make trade-offs that they're not necessarily happy with, but sometimes you just have to do that. Sit down with the person who made those trade-offs and ask them why. What were the pressures, what was the information they had that made these trade-offs make sense?"
Don't call it a post mortem
Although the term "blameless post-mortem" is common - popularized by companies like Etsy, whose tracker for the process is called Morgue - Nather suggests picking a friendlier phrase. "If you call it a post-mortem that sounds so terribly morbid! The term we use is an after action report. We try to make it a very positive thing, rather than thinking of it as 'having survived the battle we will now count our wounded and dead'."
Don't call it human error
When British Airways had to cancel all flights from Gatwick and Heathrow airports over a bank holiday weekend this May, it blamed the IT failure that stranded some 75,000 travellers on human error. A contractor appears to have turned the uninterruptible power supply off, and the power surge when it was turned back on damaged systems in the airline's data center. BA promised an independent investigation, but its initial explanation raised questions over the design of both the power and backup systems.