But it took more than three hours to identify that the problem was caused by the SQL Azure change, because the Azure operations team didn't have documentation covering the new feature or even showing it was part of the update. Even when they knew the update was the problem and the feature could be turned off, it took another 45 minutes to work out how, and then another hour to find and fix an unrelated bug in the roaming feature that lets developers have their Visual Studio settings automatically synced to another machine.
Coping with the always-changing cloud
The details about the outage are a window into how Microsoft is building cloud services.
"In any large-scale system there will be failure, and you have got to be resilient to those failures," Harry said. "There are things we are doing to become more resilient, but the level of investment to do this well is quite high, and it takes time."
All but one of the main Visual Studio Online systems have been converted into smaller redundant services; the one remaining, the Shared Platform Service, is the one that was affected here. A "circuit breaker" system, which turns off specific features or groups of users to keep the rest of the system available, won't cover all features for another two to three months and isn't yet mature enough to trip automatically.
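To make the idea concrete, here is a minimal sketch in Python of a manually tripped breaker of the kind described: an operator switches off a feature or a group of users, and requests for everything else continue to be served. The class name `FeatureBreaker`, the `handle_request` helper, and the feature and group names are all hypothetical, chosen for illustration; this is not Microsoft's implementation, and it leaves out automatic tripping.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureBreaker:
    """Hypothetical kill switch for features and user groups (illustrative only)."""

    disabled_features: set = field(default_factory=set)
    disabled_groups: set = field(default_factory=set)

    def trip(self, feature=None, group=None):
        # Manual trip: an operator turns off a feature, a user group, or both.
        if feature is not None:
            self.disabled_features.add(feature)
        if group is not None:
            self.disabled_groups.add(group)

    def reset(self, feature=None, group=None):
        # Re-enable a feature or group; discard() ignores names that were never tripped.
        self.disabled_features.discard(feature)
        self.disabled_groups.discard(group)

    def allows(self, feature, group):
        # A request passes only if neither its feature nor its group is switched off.
        return (
            feature not in self.disabled_features
            and group not in self.disabled_groups
        )


def handle_request(breaker, feature, group):
    # Degrade only the switched-off part; everything else keeps being served.
    if not breaker.allows(feature, group):
        return "unavailable"
    return "ok"
```

With a breaker like this, calling `breaker.trip(feature="settings_roaming")` would make requests for that one feature return "unavailable" while requests for other features still return "ok".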
As Azure becomes a critical underpinning for other Microsoft services, there are also questions about coordinating changes, which are staggered across different regions. Harry is keen to see an Azure-wide "canary" region, similar to the fast ring in the Windows 10 technical preview and the Office 365 First Release program. "Imagine if any customer could sign up to have resources in that region, so that not only do we get to test our services all together as we roll out, but our customers who are building on Azure could choose to have some fraction of testing or production in the Azure canary region and get an early peek at changes that are coming."
The role of developers in devops
Most cloud outages are caused either by changes or by combinations of problems outside the service that expose hidden problems. The trick is spotting something going wrong before the situation becomes critical, and responding. "We get many terabytes of telemetry a day," Harry said. "You need tools to search it, mine it and understand what it's telling you."
Getting this right involves developers as well as operations, which helps explain the pattern of layoffs at Microsoft earlier in the year: they were aimed at bringing those teams closer together. "We're on a journey of transferring more responsibility for things you would traditionally call ops to the development team," Harry said. The engineering team on Visual Studio Online is now in charge of deploying the code it writes.