Amazon Web Services says power outages, software bugs and rebooting bottlenecks led to a "significant impact to many customers" last week, according to a detailed post-mortem report the company released today about the service disruption.
As storms raged through the mid-Atlantic on Friday night, AWS experienced power outages that initially impacted the company's Elastic Compute Cloud (EC2), Elastic Block Store (EBS) and Relational Database Service (RDS) offerings, but extended into "control plane" services, such as its Elastic Load Balancer (ELB), which is designed to shift traffic away from impacted areas of the company's service.
AWS experienced multiple power outages on Friday night, most of which were handled by a backup generator kicking in to supply power. Shortly before 8 p.m. PDT, a backup generator failed to fully kick in after a power outage. The company's "uninterruptable power supply", another backup, was depleted within seven minutes. For 10 minutes beginning at 8:04 p.m. PDT, parts of the impacted data center did not have power, which brought down the EC2 and EBS services in the impacted area.
As a result, for more than an hour between 8:04 and 9:10 p.m. PDT on Friday, customers were unable to create new EC2 instances or EBS volumes. The "vast majority" of the instances came back online between 11:15 p.m. PDT and just after midnight, AWS says, but recovery was delayed somewhat by a bottleneck in the server booting process caused by the large number of reboot requests. AWS says removing that bottleneck is something it will work on to speed recovery after a power failure.
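To illustrate why a surge of reboot requests stretches recovery time, here is a minimal sketch of a queue draining at a fixed boot rate. The function name and every number in it are hypothetical; AWS's report, as described here, does not give the size of the queue or the boot rate.

```python
import math


def minutes_to_drain(pending_reboots: int, boots_per_minute: int) -> int:
    """Minutes needed to work through a reboot queue at a fixed boot rate.

    Both arguments are illustrative assumptions, not figures from AWS.
    """
    if pending_reboots < 0:
        raise ValueError("pending_reboots must not be negative")
    if boots_per_minute <= 0:
        raise ValueError("boots_per_minute must be positive")
    # Round up: a partly filled final minute still counts as a full minute.
    return math.ceil(pending_reboots / boots_per_minute)


if __name__ == "__main__":
    # Hypothetical figures: the same queue drains twice as fast
    # when the boot rate is doubled.
    print(minutes_to_drain(6000, 50))
    print(minutes_to_drain(6000, 100))
```

In this simplified model, recovery time grows in direct proportion to the number of queued reboots, which is why raising the boot rate shortens an outage.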
AWS breaks its regions up into multiple availability zones (AZs), which are designed to be isolated from failure. Even though the issues on Friday were centered in a single AZ, AWS ran into more trouble when load balancers attempted to switch traffic to unaffected AZs. "As the power and systems returned, a large number of ELBs came up in a state which triggered a bug we hadn't seen before," the company wrote. The bug caused a flood of requests which, combined with EC2 instances coming back online, created a backlog in the system.
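The idea of shifting traffic away from an impacted AZ can be sketched as follows. This is a simplified illustration and not how AWS's load balancers are implemented; the zone names, the health map and the even-split policy are all assumptions made for the example.

```python
from typing import Dict, List


def healthy_zones(zone_health: Dict[str, bool]) -> List[str]:
    """Return the zones currently marked healthy, in a stable order."""
    zones = sorted(zone for zone, ok in zone_health.items() if ok)
    if not zones:
        raise RuntimeError("no healthy zones available")
    return zones


def split_traffic(requests: int, zone_health: Dict[str, bool]) -> Dict[str, int]:
    """Spread requests evenly across healthy zones only (hypothetical policy)."""
    zones = healthy_zones(zone_health)
    base, extra = divmod(requests, len(zones))
    # The first `extra` zones each take one leftover request.
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


if __name__ == "__main__":
    # Hypothetical zone names; "zone-a" stands in for the impacted AZ.
    health = {"zone-a": False, "zone-b": True, "zone-c": True}
    print(split_traffic(100, health))
```

The sketch also shows why failover adds load elsewhere: every request that would have gone to the impacted zone lands on the remaining ones, so any extra flood of requests during recovery compounds the backlog.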
Meanwhile, the company's cloud-based relational database service suffered both from the EBS outage and from another software bug. Customers whose RDS instances were in the impacted AZ had to wait for EBS to be restored, which for most of them happened by 11 p.m. PDT. For some customers who had RDS spread across multiple AZs, AWS says a software bug prevented automatic failover to the unaffected AZs. AWS says it has known about the bug since April and has a mitigation for it, which is in beta and will be rolled out in the coming weeks.