You've heard the adage "practice makes perfect," but at PagerDuty, an organization focused on helping IT operations and DevOps teams manage incident resolution through software, failure makes perfect. Or at least practicing failure helps this company improve its products and services, keeping customers happy and engineers in control.
This approach started in the fall of 2013, when an engineer at PagerDuty became fed up with the fact that his team couldn't discover product bugs early enough in the production stages. It was difficult for the engineers to be proactive and find solutions to problems before customers encountered the bugs themselves. So, inspired by Netflix's controlled failure testing, the team decided to introduce Failure Fridays.
It might sound counterintuitive, but it's grown into a tradition that has helped his team become better prepared when disaster strikes. And it's become an important ingredient in running a successful business, especially at a company like PagerDuty.
"We were mainly inspired by what Netflix had been doing already, how they tested and prepped for infrastructure resiliency, introducing failure scenarios into [their] infrastructure in a live environment and being able to do that in a controlled and safe manner," says Tim Armandpour, vice president of Engineering at PagerDuty.
Introducing Failure Friday
Every Friday, Armandpour's team leads get together to figure out what they want to test that week. It can be anything from a newly launched service to taking an entire availability zone or data center out of commission. The goal is to make sure they can stay up and running in the face of emergencies, all without affecting customers.
It's about understanding failure scenarios, says Armandpour, and establishing best practices and developing thoughtful strategies for when things go wrong. It's also about fostering a team bond, so that everyone can work together under pressure and remain calm through a "controlled and intentional" approach, he says.
PagerDuty is in the business of digital disaster preparedness. The company offers software that helps businesses better approach, handle and remedy the late-night emergencies and disaster scenarios that can occur with technology. For Armandpour, Failure Fridays seemed like a natural extension of the company's overall mission. "We actually started to practice what we preached around how quickly you can get from identifying an issue to actually resolving it," he says. "We want to make sure we're at our best when our customers are at their worst."
What they've learned
Managing three data centers hosted through two different cloud providers, PagerDuty strives for an "always on" environment, so clients are never without their data. And they've gotten close, thanks in part to what they've learned from past Failure Fridays.