For example, Armandpour's team has uncovered scaling issues within their infrastructure, bugs in their products and redundancies in their process. In one instance, they uncovered a bug in Apache ZooKeeper that had been causing a persistent problem for years, and then alerted that community to the issue.
Taking a controlled approach
Of course, Failure Fridays aren't done without care -- a lot of planning, strategy and thought goes into each failure scenario. War rooms are set up, teams are briefed and everyone has a handle on what is about to happen and what part they will play. For example, Armandpour's team will often take down an entire data center -- which is one-third of their infrastructure -- and then methodically bring it back up within an hour without a single customer noticing, which is the point. He says it has helped them become better at triaging problems and implementing fixes, as well as build confidence in the team and even develop practices around fire-drill scenarios.
"We're big believers in the notion that you need to plan for things that will go wrong, especially those things that aren't in your control," says Armandpour. And, as he points out, when you are relying on third-party cloud infrastructure for part of your business, "you only have so much control."
Each business will have to build its own approach to Failure Fridays to be successful, he points out. There isn't a simple formula for everyone to follow. Smaller businesses might have a few employees dedicated to managing failure scenarios, while a bigger company might treat it as a more centralized fire drill, he says. However you approach it, the goal is to stay two steps ahead and be proactive instead of reactive -- so you're never left struggling to fix a problem for a customer that could have been prevented.
"Building that super strong culture where you're not panicking in moments of failure, which I think is fairly commonplace, you build a ton of trust and empathy inside your organization that I think is absolutely invaluable, especially as organizations grow and infrastructures get more complex," Armandpour says.