If on-call workers are constantly paged for non-emergency issues, they will start to resent the job. Klippert says they might even "snooze superfluous alarms" out of annoyance, and that habit could someday lead to inadvertently hitting snooze on a crucial emergency alert. She calls alerts like this "noise," because that's literally what your on-call workers experience: loud alarms and flashing screens, sometimes at all hours of the night. And if they act on that "noise" only to find a routine issue rather than an emergency, you're essentially creating a "boy who cried wolf" scenario.
"Alerts should be a trusted source of truth as to the health of your business -- applications and infrastructure -- and are critical to business and IT success," she says.
Tim Armandpour, vice president of Engineering at PagerDuty, says you need to know what you're monitoring and why, and ensure that anything tied to on-call work is absolutely necessary. As long as everyone is clear on what constitutes an emergency, your workers will know that every alert is important.
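One way to make "what constitutes an emergency" explicit is to write the rule down where everyone can read it. The Python sketch below is only an illustration: the severity labels, the `Alert` class and the `route` function are hypothetical names, not taken from PagerDuty or any other product.

```python
from dataclasses import dataclass

# Assumed severity labels; only the ones listed here are allowed to page a person.
PAGING_SEVERITIES = {"critical"}


@dataclass
class Alert:
    name: str
    severity: str  # "critical", "warning" or "info" (hypothetical labels)


def route(alert: Alert) -> str:
    """Return "page" for emergencies and "ticket" for everything else."""
    if alert.severity in PAGING_SEVERITIES:
        return "page"
    return "ticket"
```

With a rule like this, a `warning` about disk usage becomes a ticket for the morning, while a `critical` outage still pages the on-call worker.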
Have resources in place
"Don't let your on-call engineers feel like they're on an island, with a Google doc and a rotary phone at four in the morning," says Klippert. Make sure the right plans are in place when engineers and IT workers respond to an alert. That way, if they need additional support from other IT staff or engineers, there's a way to re-route the alert to the right person.
A great way to make sure the right resources are in place is to train everyone properly, so that emergency situations are as low-stress as possible. Andrew says to document processes and procedures and have new on-call staff shadow more experienced workers. Give them a chance to understand how the process works, from reacting to incidents and analyzing issues and debug logs to relaying relevant information to other internal teams. Consider automating as many steps as possible, too, so everything remains consistent and streamlined.
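Re-routing an alert to the right person is one of the steps that can be automated. The sketch below shows the idea of an escalation chain in Python. The `escalate` function, the responder names and the `acknowledges` callback are all assumptions made for illustration; a real paging tool would supply its own mechanism and wait a set time before moving on.

```python
from typing import Callable, Optional, Sequence


def escalate(chain: Sequence[str],
             acknowledges: Callable[[str], bool]) -> Optional[str]:
    """Notify each responder in order and return the first one who acknowledges.

    Returns None if nobody in the chain acknowledges the alert.
    """
    for responder in chain:
        if acknowledges(responder):
            return responder
    return None


# Hypothetical chain: the primary does not answer, so the alert moves on.
chain = ["primary", "secondary", "engineering-manager"]
owner = escalate(chain, lambda responder: responder == "secondary")
```

Because the chain is plain data, it can be documented and reviewed alongside the rest of the on-call procedures.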
Establish best practices
Armandpour says that on-call work doesn't have to be stressful if there are processes and procedures in place that give your on-call staff some peace of mind. If on-call workers face extra work or confusion every time they respond to an alert, the experience only gets worse. He suggests going as far as determining your biggest failure scenarios and then making them a reality, to give everyone more experience.
"Introduce controlled failure to a system in order to exercise monitoring and process, and fix problems before they happen for real. Practicing failure also means those on-call have a set of best practices to follow when an incident does occur. It makes everyone feel more prepared and better equipped to handle whatever comes their way," he says.
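A controlled-failure drill can be sketched in a few lines of Python. This is a toy under stated assumptions: `run_drill` and the in-memory `service` dictionary are hypothetical stand-ins for a real system and its monitoring, and a real drill would run against a staging environment with people watching.

```python
from typing import Callable


def run_drill(check_health: Callable[[], bool],
              inject_failure: Callable[[], None],
              restore: Callable[[], None]) -> bool:
    """Inject a failure, report whether the health check noticed, always restore."""
    inject_failure()
    try:
        detected = not check_health()
    finally:
        # Restore the system even if the health check itself raises.
        restore()
    return detected


# Toy stand-in for a real service: a flag that the health check reads.
service = {"up": True}
detected = run_drill(
    check_health=lambda: service["up"],
    inject_failure=lambda: service.update(up=False),
    restore=lambda: service.update(up=True),
)
```

If `detected` comes back false, the drill has found a monitoring gap before a real incident did.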