Insufficient devops coverage
Sometimes the failure comes from the way devops is applied to a particular project.
A company that originates vehicle leases has a large number of partners scattered across the United States. When customers enter a partner location and want to lease a vehicle, their information and request are processed through a custom application. Much of this information has to be verified through third-party services, since this is a financial transaction and none of the financial companies involved wants to be stuck holding a bad lease.
"The devops setup for this software is focused around server metrics, primarily response times and breakdowns for various requests, along with deployment statistics and automation," says Nathaniel Rowe, a software consultant who worked with the lease origination company, which he declined to identify.
"A few weeks back, we had what amounted to a total system outage due to a hole in the monitoring," Rowe says. "A necessary third-party validation service had a network outage that brought their entire infrastructure down."
This shouldn't have been a problem, Rowe says. But because of the initial subpar construction of the software -- which was offshored for a bargain rate -- all the lease submission processes were tightly linked to the service that went down. "In a company like this, that means the money stops flowing," he says.
The issue was incomplete devops coverage: the team relied on system metrics and did not actively monitor the outside resources that operations depended on. "That was a low-visibility hole in our coverage, which was masked by the fact that 99 percent of issues are explicitly code-based problems rather than due to outside interference," Rowe says.
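To make the idea of actively monitoring outside resources concrete, here is a minimal sketch in Python. It is not the company's actual tooling, which the article does not describe. The names `ProbeResult`, `tcp_probe`, `probe_dependency` and `check_all` are hypothetical, and a plain TCP connection attempt stands in for whatever health check a real third-party service would expose.

```python
import socket
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ProbeResult:
    name: str
    healthy: bool
    latency_s: float
    detail: str = ""


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> Callable[[], None]:
    """Build a probe that tries to open a TCP connection (assumed health signal)."""
    def probe() -> None:
        # Raises OSError (e.g. a timeout or refused connection) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            pass
    return probe


def probe_dependency(name: str, probe: Callable[[], None]) -> ProbeResult:
    """Run one active check; any exception marks the dependency unhealthy."""
    start = time.monotonic()
    try:
        probe()
    except Exception as exc:
        return ProbeResult(name, False, time.monotonic() - start, repr(exc))
    return ProbeResult(name, True, time.monotonic() - start)


def check_all(probes: Dict[str, Callable[[], None]]) -> Dict[str, ProbeResult]:
    """Probe every registered outside dependency and collect the results."""
    return {name: probe_dependency(name, p) for name, p in probes.items()}
```

Run on a schedule, a check like this would report an unreachable validation service directly, instead of leaving the team to infer it from response-time metrics on their own servers. A real probe would more likely call the provider's status or health endpoint than open a bare socket.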
Once the outage became known, the development team jumped in, decoupled the validation code, and inserted procedures to bypass it, which allowed the company's partners to save the information they had entered into the system.
"We identified the root cause by contacting the service provider and receiving the information from them about what happened," Rowe says. "To safeguard against this in the future, any time a network failure like that occurs, a global setting is triggered to reroute the submission process to save successfully and notify partners that the corresponding service is down."
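The safeguard Rowe describes, a global setting that reroutes submissions so they save successfully, can be sketched roughly as follows. This is an illustration under assumptions, not the team's code: `SubmissionRouter`, its fields, the status strings and the notice text are all hypothetical, and `validate` stands in for the call to the third-party service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SubmissionRouter:
    # Hypothetical stand-in for the third-party validation call.
    validate: Callable[[Dict], bool]
    # The "global setting": once True, submissions skip validation.
    validation_bypassed: bool = False
    saved: List[Dict] = field(default_factory=list)
    notices: List[str] = field(default_factory=list)

    def submit(self, lease: Dict) -> str:
        if not self.validation_bypassed:
            try:
                status = "validated" if self.validate(lease) else "rejected"
                self.saved.append({**lease, "status": status})
                return status
            except (ConnectionError, TimeoutError):
                # Network failure: flip the setting and tell partners once.
                self.validation_bypassed = True
                self.notices.append(
                    "Validation service is down; submissions are being "
                    "saved and will be validated later."
                )
        # Bypass path: keep the partner's data instead of failing the request.
        self.saved.append({**lease, "status": "pending_validation"})
        return "pending_validation"
```

In this sketch the first failed call flips the flag, so later submissions skip the dead service entirely and are stored as pending. Something would still need to reset the flag and re-run validation on the pending records once the provider recovers; the article does not say how the team handles that step.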
A major benefit of this failure was that time and money are now dedicated to patching these holes in monitoring and to automatic recovery for other weak spots in the system, Rowe says.
Forgetting about people and process
When Brian Dawson, now devops evangelist at CloudBees, was working as a process consultant for a vendor on a contract with a U.S. government agency several years ago, he had one of his first experiences with devops. It was not a good one.