How to create a company culture that can weather failure

Mary Branscombe | Aug. 16, 2017
In technology, things go wrong all the time, sometimes catastrophically. But if you stop paying attention after you fix the immediate problem, you’re missing out on the benefit of learning from experience.

That puts Target in a much better place than if it had only fixed the immediate problems and then stopped. "Other organizations that have been breached have circled the wagons. Their attorneys didn't let them say anything, they're not learning from the breach, they're not changing their spending on security and it's very clear they will fall to the same kind of breach again later."

The difference is as much culture as technology, Nather said. "It's all in how they responded and made something positive come out of what was a terrible situation."

If you want to turn problems into learning experiences, there are some key do's and don'ts.


Do follow up

Adding this step to your playbook of what to do when things go wrong may seem obvious, but you have to follow up on an incident so you can learn from it.

"Schedule a formal review of the incident and identify next steps," says Stephen Burgess, consultant at the Uptime Institute. He suggests holding regular meetings that track each incident to a final resolution, to make sure the longer-term changes actually happen.

"From the root cause should come any formalized lessons learned, which in turn must clearly identify whether there are any final corrective actions. Maintain scrutiny and open status of the failure incident until there is managerial confirmation that final corrective actions have been performed." That might mean training, changing policies, processes and procedures, or making proactive repairs and infrastructure upgrades.

Sam Lambert, senior director of infrastructure at GitHub, suggests that IT could learn from other disciplines. "Other industries that build things and build things to last and want to learn from failures in things they build, carry out investigations as standard operating procedure. Look at flight investigations and how useful they've been for aviation safety."

View failures as a chance to get ahead of similar potential problems, Lambert says. "If a failure case comes up and we recognize that failure case could be systemic in some other system, analyzing it gives us an opportunity to look at what may go wrong in the future."

He points to several areas where GitHub has gone beyond fixing the immediate problem to improve its systems generally. "We've learned about cause and effect: one service going wrong can affect other services even when they're not the cause of the problem. We've learned ways to build in safeguards and do checking in our development process. We've learned to respect the time necessary to make systems resilient the first time. We've also learned that some things can't be prevented and you've just got to accept that and understand that you have to learn from them each time."

