An endless stream of tweets and blog posts have noted, described and bewailed last week's Amazon Web Services outage. Some people characterised the outage as an indictment of public cloud computing in general. Others, some of whom work at other cloud providers, characterised it as indicative of AWS-specific shortcomings. Still others used the event as an opportunity to outline how users have to be sure to hammer home SLA penalty clauses during contract negotiations, just to ensure protection from outages.
Most of these responses are reflective of bias or the commenter's own agenda and fail to draw the proper lessons from this outage. More crucially, they fail to offer really useful advice or recommendations, preferring to proffer outmoded or alternative solutions that do not provide appropriate risk mitigation strategies appropriate for the new world of IT.
Analysis: Amazon Outage Started Small, Snowballed Into 12-Hour Event
The first thing to look at is what risk really is. Wikipedia calls it "the probable frequency and probable magnitude of future loss." In other words, risk can be ascertained by how often a problem occurs and how much that problem is likely to cost. Naturally, one has to evaluate how valuable mitigation efforts to address a risk are, given the cost of mitigation. Spending $1 to protect oneself against a $1,000 loss would seem to make sense, while spending $1,000 to protect oneself against a $1 loss is foolish.
Amazon Outages Show That Failure Is An Option
The question for users is whether this outage presents a large enough loss that continuing to use AWS is no longer justified (i.e., is too risky) and that other solutions should be pursued. Certainly there are now applications running on AWS that represent millions or even tens of millions of dollars of annual revenue, so this question is quite germane.
In terms of this specific outage, Amazon posted an explanation that describes it as a combination of some planned maintenance, a failure to update some internal configuration files and a programmatic memory leak. The result was poor availability of Amazon's Elastic Block Storage (EBS) service.
Interestingly, the last large AWS outage was also am EBS failure, although even more interestingly, it had an entirely different cause, though human error was the trigger for the previous outage as well. In both cases, someone misconfigured an EBS resource, which triggered an unexpected condition, resulting in a service outage.
Most interesting of all, AWS says users shouldn't be surprised by this occurrence. Amazon's No. 1 design principle: "everything fails all the time; design your application for failure and it will never fail."
Many people are outraged by this, feeling that a service provider should take responsibility for ensuring 100 percent (or at least "five nines") of service availability. Amazon's attitude, they imply, is irresponsible. The right solution, they say, is that users should look to a provider that is willing to take responsibility and provide a service that is truly reliable, made possible by use of so-called "enterprise-grade" hardware and software backstopped by ironclad change control.
Sign up for CIO Asia eNewsletters.