Another change came after an outage where an alert hours in advance indicated a problem, "but it was buried in noisy alerts and it looked like alerts that are traditionally ignored, so they ignored it." Now developers are responsible for closing alerts; if an alert triggers too easily and gets ignored, that responsibility includes fixing it so that it is more useful.
This is all part of the way Microsoft is doing DevOps at scale. It's not just the operations team that gets paged in the middle of the night when things go wrong. Even senior executives take turns carrying pagers overnight for major incidents. As well as the 24/7 monitoring team, Harry has developers around the world who can assess the problem, with an on-call engineer for each service available within 15 minutes.
That 15-minute window is the Visual Studio Online team policy. "Each team is finding their way to how they manage this," he says. Reaching someone who understood the SQL Azure change took over an hour.
Making that work comes down not just to rotating who is on call, but to how leaders focus on understanding what went wrong rather than who was to blame.
"When I say we, I often mean we, Microsoft," Harry explains. "It's not my purpose to point fingers and say that team needs to improve, but to really think as one company and to think about accountability in a slightly bigger way. One of my first rules is, everybody is allowed to make a mistake; nobody is allowed to repeat a mistake."