AWS vows to not get caught twice with the same bug. It has instituted new alarms to prevent this specific incident from happening again, and has also modified the broader EBS memory monitoring and alerts for detecting if new hardware is not being accepted into the system. "We believe we can make adjustments to reduce the impact of any similar correlated failure or degradation of EBS servers within an Availability Zone," AWS says.
Gartner analyst Kyle Hilgendorf says it's slightly surprising that human error caused a DNS propagation issue, which led to much of an availability zone going down. "They're supposed to be deploying the best and brightest to handle these systems," he says. But accidents happen. The bigger flaw, he says, is that AWS did not have alerts in place to catch the issue earlier. "That's the damaging part," he says. "A week passed and no one noticed memory was continuously leaking. That was the unacceptable part of this."
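To illustrate the kind of alert Hilgendorf is describing, here is a minimal Python sketch of a check that flags steadily rising memory use. It is not AWS's monitoring; the function name, the sample window and every threshold are assumptions chosen only for illustration.

```python
def looks_like_leak(samples_mb, min_samples=24, min_growth_mb=100.0,
                    min_rising_fraction=0.9):
    """Return True if memory readings rise steadily across the window.

    samples_mb: memory readings in megabytes, oldest first (hypothetical input).
    All thresholds are illustrative assumptions, not values from AWS.
    """
    if len(samples_mb) < min_samples:
        return False  # not enough history to judge a trend

    # Count how many consecutive readings went up.
    rises = sum(later > earlier
                for earlier, later in zip(samples_mb, samples_mb[1:]))
    rising_fraction = rises / (len(samples_mb) - 1)

    # Total growth from the first reading to the last.
    growth_mb = samples_mb[-1] - samples_mb[0]

    return growth_mb >= min_growth_mb and rising_fraction >= min_rising_fraction


# Hourly readings that climb 5 MB every hour for two days.
leaking = [1000 + 5 * hour for hour in range(48)]
# Readings that wobble but do not trend upward.
steady = [1000 + (hour % 2) * 5 for hour in range(48)]

print(looks_like_leak(leaking))
print(looks_like_leak(steady))
```

A check along these lines, run against a week of readings, is the sort of safeguard that would surface a slow leak well before servers run out of memory.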
So could it have been prevented?
AWS says that customers who heeded the company's advice about using multiple availability zones were able to tolerate the outage, for the most part. But some customers, including at least one Network World reader (see comments section of this story), reported that even with a multi-AZ architecture they still had problems moving workloads into healthy AZs. AWS says it messed this up, too.
The company uses a throttling system that prevents individual users from overwhelming the system because they are moving large workloads all at once. During the outage, AWS enabled a heightened throttling policy to ensure the system remained stable. "Unfortunately, the throttling policy that was put in place was too aggressive," AWS admits. It says policies have been changed to ensure the throttling is not as aggressive during future incidents.
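AWS has not published how its throttling works, but a token bucket is one common way to build such a limit. The Python sketch below assumes that approach to show how a policy tightened during an incident can turn away most legitimate requests. The class name, rates and burst sizes are hypothetical.

```python
class TokenBucket:
    """Simple token-bucket throttle (an assumed design, not AWS's)."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec   # tokens added per second
        self.burst = burst         # maximum tokens the bucket can hold
        self.tokens = burst        # start with a full bucket
        self.last = 0.0            # time of the previous request

    def allow(self, now):
        # Refill for the time elapsed since the last request, up to the cap.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        # Spend one token if available; otherwise reject the request.
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def admitted(bucket, arrival_times):
    """Count how many requests the bucket lets through."""
    return sum(bucket.allow(t) for t in arrival_times)


# One API request per second for ten seconds (illustrative workload).
arrivals = range(10)

normal_policy = TokenBucket(rate_per_sec=2.0, burst=5)
incident_policy = TokenBucket(rate_per_sec=0.25, burst=1)

print(admitted(normal_policy, arrivals))    # 10 of 10 requests admitted
print(admitted(incident_policy, arrivals))  # 3 of 10 requests admitted
```

Under the looser settings every request gets through; under the tighter ones, the same customer sees seven of ten calls rejected, which is the kind of effect customers reported when trying to move workloads to healthy zones.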
AWS is making it up to customers, too. The company is automatically crediting any customers who were subject to its aggressive throttling policy between 12:06 and 2:33 p.m. on Oct. 22 for their entire usage of EC2, EBS and ELB during that three-hour period. The credit will appear on their October bill.
Some have been concerned about the US-East-1 region in Northern Virginia, seeing as it has been the site of three of the company's major recent outages. Hilgendorf says it is the oldest, largest and cheapest region for AWS, so an issue there affects a disproportionately large number of customers.
The latest outage is the third major one in two years, which means downtime events at AWS may be starting to add up, according to one analyst. While acknowledging that AWS is the "clear leader" in the infrastructure as a service (IaaS) market, Technology Business Research's Jillian Mirandi says that if major outages continue to happen at AWS, some of its biggest customers -- like Netflix, Foursquare, Pinterest and Heroku -- could look elsewhere. "If major companies such as these continue to experience outages, they will be tempted to move services onto competing IaaS products," she recently wrote.