How to create a company culture that can weather failure

Mary Branscombe | Aug. 16, 2017
In technology, things go wrong all the time, sometimes catastrophically. But if you stop paying attention after you fix the immediate problem, you’re missing out on the benefit of learning from experience.

"Part of your responsibility as a CIO is to build these relationships," explains Nather. "The system admins should be your eyes and ears. You want to have the culture where someone will come into your office and close the door and say, 'There's something I think you ought to know.' If you can get that, you can build a resilient organization."

Treating IT and security as being a business service rather than a point of control helps create that kind of culture. "If you take the attitude that you're there to help everyone else with their business, that's very different from sitting in an ivory tower and saying, 'Ooh you did something wrong, you missed a spot'," she says.

Do learn from others' mistakes

Thinking about what you'd do differently the next time a problem occurs is useful, but you can also think about how you'd tackle problems you haven't run into yet.

"What I see in very mature organizations is that they also try to learn from other people's incidents," says Nather. "Ask, 'If that were to happen to us, what would it look like, how could we detect it and how could we respond to it?'"

Your competitors might or might not share the details of incidents they've faced and fixed (formally or informally), but you can also watch organizations with a similar technology setup and risk profile in other industries. Security vendors often blog step-by-step analyses of incidents. Nather also recommends the Twitter account @badthingsdaily, which regularly comes up with scenarios: "'Your partner database just went down,' 'A tornado just destroyed your backup data centers.' You can take them and talk them through. You can even go through the exercise of building the tool or doing the scripting to be able to automate the detection so that's one less thing your people have to worry about doing manually."

These tabletop exercises can be more palatable than the 'chaos monkey' approach pioneered by Netflix to simulate failure by deliberately shutting down some systems. "For less mature organizations, actually breaking something is a real concern, which is why even talking through it without actually doing anything can be very useful."
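
The scripting Nather mentions can start very small. As an illustration only, here is a minimal Python sketch of one way to automate detection for the "partner database just went down" scenario: raise an alert after several consecutive failed health checks, without breaking anything in production. The class name, the threshold of three and the list of probe results are all assumptions made for this example, not anything described by Nather or used by a particular tool.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class OutageDetector:
    """Flags a dependency as down after several consecutive failed probes.

    Hypothetical helper for a tabletop exercise; the default threshold
    is an assumption and should be tuned per service.
    """
    name: str
    threshold: int = 3
    failures: int = 0

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result; return True when an alert should fire."""
        if probe_ok:
            # Any successful probe resets the streak of failures.
            self.failures = 0
            return False
        self.failures += 1
        # Fire exactly once, at the moment the threshold is first reached.
        return self.failures == self.threshold


def first_alert(detector: OutageDetector, results: Iterable[bool]) -> Optional[int]:
    """Replay recorded probe results; return the index that triggers an alert."""
    for index, probe_ok in enumerate(results):
        if detector.record(probe_ok):
            return index
    return None


# Replay a scripted scenario: one blip that recovers, then a real outage.
scenario = [True, False, False, True, False, False, False, False]
alert_at = first_alert(OutageDetector("partner-db"), scenario)
```

Replaying a scripted list of results rather than probing a live system keeps the exercise safe for teams that are not ready to break anything for real, and the same detector could later be fed by an actual health check.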

Do have processes that take into account that people get tired

Many incident reports include a phrase like "it was now three o'clock in the morning" followed by a decision that actually prolonged the problem, but Lambert points out that "being late at night doesn't change the frequency of alerts."

"Incidents caused by failures of machines and networks are not more frequent out of hours, but they are harder to respond to." For one thing, during the day there are more people around to spot problems sooner. For another, unless you have dedicated support staff working shifts, "the person who has to deal with it has to get paged, they might be tired or distracted."
