ON-CALL NIGHTMARE: 5 ways to fix your on-call alerts

Warren Veerasingam
Sep 4, 2018

“Wait a minute. We boost the signal. That’s it. We transmit that telephone number through Torchwood itself, using all the power of the Rift. And we’ve got Mister Smith. He can link up with every telephone exchange on the Earth. He can get the whole world to call the same number, all at the same time. Billions of phones, calling out all at once. Transmitting, then this Subwave Network is going to become visible. I mean, to the Daleks. Yes, and they’ll trace it back to me. But my life doesn’t matter. Not if it saves the Earth” —said Harriet to the rest of the team at Torchwood (Doctor Who — The Stolen Earth)

If only we had a way to call the Doctor, or some advanced communication technology that combines all future knowledge to resolve our on-call alerts, I wouldn't be writing this article.

1. Alerts should be directed to the right team or person

Here’s a scenario: You’re a developer, and you don’t have access to the infrastructure or servers. At 2 am on a Saturday, you get an automated alert on your phone informing you that the CPU is high. And within minutes, the application goes down.

At this point, what do you do?
You might need to call the DevOps team for access to the infrastructure and ask them to add more servers, or more CPU and memory to the existing ones. This extra step is unnecessary and only increases the downtime.

The fastest way to resolve an alert is to direct it to the person or team that can actually fix the issue. When there are more points of communication between the source and the receiver, the original message may be lost or become inaccurate. The fewer the hops, the faster the issue gets resolved.

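As a rough sketch of that idea, the snippet below routes each alert to the team that owns the affected service, so the page lands with people who have the access to act on it. The service names, team handles, and notify function are hypothetical placeholders, not a real paging API.

```python
# Hypothetical sketch: route each alert to the team that owns the service.
# Service names, team handles, and notify() are placeholders.

OWNERS = {
    "checkout-api": "payments-oncall",
    "search-index": "search-oncall",
    "k8s-nodes": "platform-oncall",
}

def notify(team: str, message: str) -> None:
    # In a real setup this would page via your alerting tool of choice.
    print(f"[page -> {team}] {message}")

def route_alert(service: str, message: str) -> None:
    # Fall back to a catch-all team so no alert is silently dropped.
    team = OWNERS.get(service, "sre-oncall")
    notify(team, f"{service}: {message}")

route_alert("checkout-api", "CPU usage above threshold")
```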

2. Phone call alerts vs. email alerts

Product owners want to monitor everything — CPU, memory, network, bandwidth, and so on. They, however, do not want to be responsible for resolving the alerts.

We should be selective when deciding which alerts deserve a phone call and which only need an email. You do not want your phone to go off in the middle of the night because your server's CPU spiked to 70% and dropped back to 30% a minute later. You need to measure and set the right thresholds for paging the person on-call. The on-call person will come to regard unnecessary or false alerts as noise, and eventually ignore that noise.

If an alert does not require urgent attention, configure it to be sent as an email or a Slack message. Developers and DevOps engineers can then review the alert and adjust the threshold to prevent future misleading alerts.
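
One way to encode that rule of thumb is to page only when a threshold has been breached for a sustained period, and downgrade short spikes to email. The thresholds, sample counts, and channel names below are illustrative assumptions, not recommendations.

```python
# Hypothetical sketch: decide how loud an alert should be.
# Thresholds, durations, and channel names are illustrative assumptions.

def choose_channel(cpu_samples, threshold=90.0, sustained_samples=3):
    """Return 'phone' only if the last N samples all breach the threshold."""
    recent = cpu_samples[-sustained_samples:]
    if len(recent) == sustained_samples and all(s >= threshold for s in recent):
        return "phone"   # sustained breach: wake someone up
    if any(s >= threshold for s in cpu_samples):
        return "email"   # brief spike: review during working hours
    return "none"        # nothing to report

print(choose_channel([35, 92, 31]))   # brief spike -> 'email'
print(choose_channel([92, 95, 97]))   # sustained   -> 'phone'
```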

3. Set a meaningful message in your alert

The on-call person should be able to tell what the issue is from the alert message as quickly as possible. The alert should carry a meaningful message. For example:

Bad Alert:

“High CPU usage”

Good Alert:

“High CPU usage on 10.168.16.10. CPU exceeded 100% on 3 consecutive checks within 5 minutes. [link to metrics dashboard]”

The alert message should include a link to the metrics, graphs, and history, so the on-call person can see the server's recent behavior and trends.
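
A small formatting helper makes it easy to enforce that structure everywhere alerts are produced. This is only a sketch: the helper name, field names, and dashboard URL are made-up examples.

```python
# Hypothetical sketch: build an alert message that says what, where, and how bad,
# plus a link to the relevant dashboard. All values below are made-up examples.

def build_alert(metric, host, threshold, breaches, window_min, dashboard_url):
    return (
        f"High {metric} on {host}. "
        f"{metric} exceeded {threshold:.0f}% on {breaches} consecutive checks "
        f"within {window_min} minutes. {dashboard_url}"
    )

print(build_alert("CPU usage", "10.168.16.10", 100, 3, 5,
                  "https://dashboard.example.com/hosts/10.168.16.10"))
```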

4. Fix the problem to prevent future alerts or, better yet, create a self-healing infrastructure

We tend to look at our application's health in a very binary way: an application is either up or down. If it's up, it's good; otherwise, we need to restart it.

We should shift our paradigm and view applications in a non-binary way. If the application is unhealthy, we automate steps to make it healthy again.

When we are ill, our immune system does not simply let us die and restart our body again. Our immune system is designed to recognize the cells that make up our bodies and repel any foreign invaders such as viruses.

Perhaps we could design a similar self-healing system to maintain and improve the health of our applications.

We should practice writing automation scripts to fix the issue when an alert is triggered.

Over time, we should study the patterns and behavior of these failures, and fix the application or the infrastructure to prevent future issues.
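
Here is a minimal sketch of that remediation loop, assuming hypothetical check_health, restart_service, and page_oncall helpers: when a service looks unhealthy, run the known remediation first and only page a human if it does not recover.

```python
# Hypothetical sketch of a self-healing loop. check_health(), restart_service(),
# and page_oncall() stand in for whatever your monitoring and tooling provide.
import time

def check_health(service):
    # Placeholder: in practice, hit a health endpoint or read a metric.
    return False

def restart_service(service):
    print(f"restarting {service}")

def page_oncall(message):
    print(f"PAGE: {message}")

def heal(service, retries=2, wait_s=30):
    for attempt in range(retries):
        if check_health(service):
            return                      # healthy again, no human needed
        restart_service(service)        # automated remediation step
        time.sleep(wait_s)              # give the service time to recover
    if not check_health(service):
        page_oncall(f"{service} still unhealthy after {retries} automated restarts")

heal("checkout-api", wait_s=1)
```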

5. Set a protocol and a workflow: the steps to take and the resources to use to resolve an alert

When you are new to a team, you may not know the necessary steps to resolve an alert. This can be stressful and may cause more damage.

Every team should have an on-call manifesto. This manifesto should describe the steps to take to resolve each alert for each application.

For example,
Alert A — may require you to:
1) Close any database connection
2) Restart the application
3) Notify the development team

Alert B — may require you to:
1) Shut down the server
2) Increase the memory size of the server
3) Restart the application

Proper documentation on how to resolve an issue can reduce our application's downtime.
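
The manifesto can even live next to the code as data, so the steps are versioned and one lookup away at 2 am. This is just a sketch; the alert names and steps mirror the hypothetical examples above.

```python
# Hypothetical sketch: an on-call manifesto kept as versioned data, so the
# steps for each alert are easy to find. Alert names and steps are examples.

RUNBOOK = {
    "Alert A": [
        "Close any open database connections",
        "Restart the application",
        "Notify the development team",
    ],
    "Alert B": [
        "Shut down the server",
        "Increase the memory size of the server",
        "Restart the application",
    ],
}

def print_runbook(alert):
    for i, step in enumerate(RUNBOOK.get(alert, ["No runbook found: escalate"]), 1):
        print(f"{i}) {step}")

print_runbook("Alert B")
```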

Conclusion

The goal is to get fewer alerts. The more resilient we make our infrastructure and applications, the fewer alerts are triggered. Alerts should be directed to the right person or team, so that someone with the permissions and tools can resolve the issue as quickly as possible. We have to assign priorities to alerts: urgent alerts should go out as phone calls, while warnings can be sent as emails, because alerts that do not need immediate attention quickly become noise. The on-call person should be able to see the issue at a glance from the alert message. We should design systems that can heal themselves when something goes wrong. And every team should have an on-call manifesto, so your team knows the proper steps to take to resolve an issue.
