The second of two lengthy outages that hit Visual Studio Online in November was caused by the same kind of issues as the first, Microsoft has just disclosed.
Visual Studio Online, which allows developers to plan and track software development projects and share code, was unavailable for just over seven and a half hours on Monday of last week. That outage was preceded by another the prior week.
In both cases, updates to Azure were to blame, and procedures that hadn't been followed delayed the process of finding a fix.
That's according to Brian Harry, a Microsoft Technical Fellow who this week described what went wrong in a candid, detailed blog post and in an interview.
His postmortem account is a model of transparency for cloud services, with details of what went wrong and why last week's outage was "probably 3-4 hours longer than it had to be due to inefficiencies." The detail is key to running cloud services, according to Harry, who works as the Product Unit Manager for Team Foundation Server.
"With service outages, in a modern devops world it's all about root cause analysis. It's important we get the service back up, but root cause comes first, because just getting the service back up with the threat it will happen again is not a victory," he said on Thursday.
It may also become a model for other Microsoft services, including Azure, where some customers were disappointed with communications during the last outage. "When we have a major issue with a high-profile service everybody cares; you get all kinds of people involved in trying to help communicate. The end result was less transparent and less empathetic communications than I think we would want. As a result you're going to see changes in the way we communicate about Azure outages. You really have to have someone willing to stand out there and say 'I own this and this is what I'm doing about it,'" he said.
In the latest outage case, an update to the SQL Azure cloud database service included a new feature designed to automatically find and repair databases with unusually high numbers of errors. Using the gradual rollout system that Microsoft calls "flighting," this was loaded in one SQL Azure region, where it caused problems for a Visual Studio Online procedure that generates a lot of duplicate record errors — but is designed to ignore them. Trying to handle the 170,000 exceptions per minute being generated — and successfully ignored — took so many resources that it made a key database lock up in a few hours.
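The pattern described here, inserting records and deliberately ignoring duplicate-key errors, can be illustrated with a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in for SQL Azure, and the table and function names (`events`, `insert_ignoring_duplicates`) are hypothetical; this is not Visual Studio Online's actual code. The point is that every ignored duplicate still surfaces as a database error, which is the kind of signal an error-counting feature would see.

```python
import sqlite3


def insert_ignoring_duplicates(conn, rows):
    """Insert rows one by one, swallowing duplicate-key errors.

    Returns (inserted, ignored). Hypothetical helper for illustration only.
    """
    inserted = 0
    ignored = 0
    for row_id, payload in rows:
        try:
            conn.execute(
                "INSERT INTO events (id, payload) VALUES (?, ?)",
                (row_id, payload),
            )
            inserted += 1
        except sqlite3.IntegrityError:
            # The application treats a duplicate as harmless, but the
            # database has still raised and recorded an error for it.
            ignored += 1
    return inserted, ignored


# In-memory SQLite database standing in for the real cloud database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

rows = [(1, "a"), (2, "b"), (1, "a"), (2, "b"), (3, "c")]
inserted, ignored = insert_ignoring_duplicates(conn, rows)
row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(f"inserted={inserted} ignored={ignored} rows={row_count}")
```

From the application's point of view the ignored errors are routine. A component that counts errors per database without knowing that intent could still treat them as a sign of trouble.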
Some things in Microsoft's procedure for handling cloud problems went the way they were supposed to. Monitoring systems spotted the problem late Sunday night, over an hour before customers in Europe started tweeting about not being able to access their accounts. The update was being rolled out one region at a time, unlike the previous Azure update, which was mistakenly deployed in multiple locations. Once the Visual Studio Online problem was identified, the update was stopped before it deployed in the next region, and Harry believes no other Azure customers were affected. The new feature also came with the option to turn it off.
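The region-by-region rollout with a halt on the first sign of trouble can be sketched roughly as follows. This is an illustrative outline only, not Microsoft's flighting system: the function `staged_rollout`, the region names, and the `deploy` and `is_healthy` callbacks are all assumptions made for the example.

```python
def staged_rollout(regions, deploy, is_healthy):
    """Deploy to one region at a time, stopping at the first unhealthy one.

    Returns (deployed_regions, halted_at), where halted_at is None if
    every region was updated without a health check failing.
    """
    deployed = []
    for region in regions:
        deploy(region)
        deployed.append(region)
        if not is_healthy(region):
            # Stop here so the remaining regions never receive the update.
            return deployed, region
    return deployed, None


# Hypothetical regions and a health check that fails in the first one.
regions = ["region-a", "region-b", "region-c"]
updated = []
deployed, halted_at = staged_rollout(
    regions,
    deploy=updated.append,
    is_healthy=lambda region: region != "region-a",
)
print(deployed, halted_at)
```

Because the loop returns as soon as a health check fails, the damage is limited to the regions already updated, which matches the account of the update being stopped before it reached the next region.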