
Failure is inevitable, so manage risk

Bernard Golden | Nov. 2, 2012
The Amazon Outage in Perspective

There Is No "Right" Equipment, No Matter What Your SLA Says
There's only one problem: the solution proposed by commenters is outmoded, inappropriate and unsustainable.

First, it assumes that availability can be increased by use of enterprise-grade equipment. The fact is, every type of equipment fails, often at inconvenient times. Believing that availability can magically improve by simply using the "right" equipment is doomed to failure.

Resource failure is an unfortunate reality. The primary issue is what user organisations should really do to protect themselves from hardware failure. I view the "negotiate harder on the SLA" strategy as akin to "the beatings will continue until morale improves": it makes the SLA-demander feel better but is unlikely to result in any actual improvement.

Many of the cloud providers commenting on the AWS outage propose this kind of solution. In my view, this demonstrates how poorly they understand the issue. Their hardware will fail, too. Those engaged in taunting a competitor when it experiences a service failure should remember that pride goes before a fall.

Second, ironclad change control processes are not actually going to reduce resource failure, because anything involving human interaction is subject to mistakes, which result in failure. It's instructive to note that neither major AWS outage was the result of hardware failure; both came from human error, specifically human error that interacted with system design assumptions that failed to account for the type of error that occurred. And even organisations that are strongly ITIL-oriented experience human-caused problems.

Finally, the solutions proposed don't account for the world of the future. Every company is going to experience a massive increase in IT scale; believing that just putting in place rigid enough processes, with enough checks and balances, will reduce failure just doesn't recognise how inadequate that approach is for this new IT world. No IT organisation (and no cloud provider) will be able to afford enough people (or enough enterprise-grade equipment) to pursue this type of solution. 

Redundancy, Failover Have Been Best Practices For a Long Time
The true solution to resource failure has long been known: redundancy and failover. Instead of a single server, use two; if one goes down, it's possible to switch over to the second to keep an application running. It's just that, in the past, implementing redundancy was unaffordable except for a tiny percentage of truly mission-critical applications, given the cost of hardware and software.
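To make the idea concrete, here is a minimal sketch of failover across two redundant servers. It is an illustration only, not how any particular cloud provider or application does it: the names `call_with_failover`, `AllReplicasFailed`, `primary` and `secondary` are hypothetical, and the two "servers" are plain functions standing in for real network calls.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when every redundant server has failed to answer."""


def call_with_failover(servers: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each redundant server in order and return the first successful reply."""
    errors = []
    for server in servers:
        try:
            return server(request)
        except ConnectionError as exc:
            # This server is down; remember why and fall through to the next one.
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} server(s) failed: {errors}")


# Hypothetical stand-ins for real servers: the primary is down, the secondary is healthy.
def primary(request: str) -> str:
    raise ConnectionError("primary is unreachable")


def secondary(request: str) -> str:
    return f"secondary handled {request!r}"


print(call_with_failover([primary, secondary], "GET /status"))
# prints: secondary handled 'GET /status'
```

The application keeps running because the caller never depends on any single server being up. An outage reaches the user only when every replica in the list fails at once.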

The genius of cloud computing is that it offers the ability to address this redundancy easily and cheaply. Many users have designed their apps to be resilient in the face of individual resource failure and have protected themselves against it, unlike those who pursue the traditional solutions proffered by many commenters, which will inevitably result in an outage when the enterprise-grade equipment fails.

 
