
How one user (successfully) managed the Amazon cloud reboot

Brandon Butler | Oct. 1, 2014
The lesson: Prepare for failure

Kevin Felichko didn't get as much sleep as he wanted to on Monday night.

Felichko is the CTO of PropertyRoom.com, an online auction site for seized goods that runs entirely on Amazon Web Services' cloud. Late last week AWS announced that it would be rebooting up to 10% of its virtual machines, known as Elastic Compute Cloud (EC2) instances. For a company like PropertyRoom.com, which processes tens of millions of dollars' worth of online auctions entirely through Amazon's cloud, that could have been a big problem.

But Felichko says it turned out to be manageable. One key to using IaaS cloud computing resources is to prepare for failure, a point that even Amazon CTO Werner Vogels preaches. That is what Felichko and his four-person tech team did when they migrated to Amazon's cloud earlier this year.

On Friday PropertyRoom.com was notified by Amazon that most of the reboots of its instances would happen late Monday evening. Late Monday night, Amazon informed Felichko that the reboots would be delayed until Tuesday morning. Having stayed up late monitoring the situation, Felichko was slightly frustrated that the maintenance window had been moved at the last minute. But on Tuesday the reboots happened and the PropertyRoom.com website never went down.

However inconvenient the process was, Felichko says it could have been much worse, and he's thankful it wasn't. He credits heeding the advice of AWS and cloud experts: build cloud applications to be flexible in the face of uncertainty.

Using a service named CloudWatch, which monitors the health of EC2 instances, Felichko has set up the system so that if any instance serving the front end of the website goes down, CloudFormation (a tool that sets up and deploys AWS services) automatically moves the front-end web server to another healthy instance. The services are spread across multiple AWS Availability Zones (AZs), which are separate data centers within a single region of AWS's cloud.
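To make the idea concrete, here is a minimal sketch of that kind of self-healing front end, written as a plain Python simulation rather than against any AWS API. It is not PropertyRoom.com's actual configuration. The class names, zone names and the `reconcile` method are all hypothetical and chosen only for illustration: a health check marks an instance as down, and a reconcile step replaces it in whichever zone currently has the fewest healthy instances.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Instance:
    instance_id: str
    zone: str
    healthy: bool = True


@dataclass
class FrontEndFleet:
    """Toy model (hypothetical, not an AWS API) of a self-healing web tier."""
    zones: list[str]
    desired: int
    instances: list[Instance] = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1), repr=False)

    def launch(self) -> Instance:
        # Count healthy instances per zone, then pick the least-loaded zone.
        load = {zone: 0 for zone in self.zones}
        for inst in self.instances:
            if inst.healthy:
                load[inst.zone] += 1
        zone = min(self.zones, key=lambda z: load[z])
        inst = Instance(f"web-{next(self._ids)}", zone)
        self.instances.append(inst)
        return inst

    def reconcile(self) -> list[Instance]:
        """Drop unhealthy instances and launch replacements up to `desired`."""
        self.instances = [i for i in self.instances if i.healthy]
        launched = []
        while len(self.instances) < self.desired:
            launched.append(self.launch())
        return launched


# Start four front-end instances spread over two zones.
fleet = FrontEndFleet(zones=["zone-a", "zone-b"], desired=4)
fleet.reconcile()

# Simulate a reboot taking one instance down, then let the fleet heal itself.
fleet.instances[0].healthy = False
replacements = fleet.reconcile()
for inst in replacements:
    print(f"launched {inst.instance_id} in {inst.zone}")
```

In a real deployment the health signal and the replacement would come from the managed services rather than from hand-written code. The point of the sketch is only that capacity is restored automatically, and that spreading instances across zones means a single reboot or data-center problem removes just part of the fleet.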

So, when Felichko learned about the reboot, he was fairly confident the system would work on its own to migrate the workloads off any instance that shut down and onto a running one. It worked as planned, mostly.

The one issue Felichko ran into was that an instance serving a back-end inventory-management function got stuck in a reboot cycle and would not fully restart. That created something of a domino effect because the company's order processing system is tied closely to inventory. Felichko reached out to an AWS customer service representative, who resolved the issue: it turned out to be a hardware problem in AWS's data center, and that instance was taken offline.

 
