Alerting

Alerting is the system that wakes you up at 3am when your application is on fire. It watches metrics, compares them to thresholds, and sends notifications when something crosses the line. CPU above 95% for 5 minutes? Alert. Error rate above 2%? Alert. Zero transactions processed in 10 minutes? Definitely alert.

Good alerting is precise. Every alert should be actionable: when it fires, someone needs to do something right now. If the response to an alert is "check it and do nothing," the alert should not exist. Alert fatigue is real. Google's SRE book makes the same argument: every page should be actionable, because an engineer can only react with urgency a few times a day before fatigue sets in. Once most alerts are non-actionable, the team starts ignoring all of them. That is when real incidents get missed.

Routing matters as much as detection. Critical alerts go to PagerDuty and page the on-call engineer's phone. High-priority alerts create tickets in the backlog. Informational alerts go to a Slack channel. Matching severity to notification channel prevents a 3am page for something that could wait until morning.
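
The severity-to-channel mapping can be sketched as a small lookup table. This is an illustration only: the channel names (page_on_call, create_ticket, post_to_slack) are hypothetical labels, not calls into PagerDuty, a ticketing system, or Slack.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    INFO = "info"


# Hypothetical channel labels. A real setup would invoke the paging,
# ticketing, and chat integrations instead of returning a string.
ROUTES = {
    Severity.CRITICAL: "page_on_call",
    Severity.HIGH: "create_ticket",
    Severity.INFO: "post_to_slack",
}


def route(severity: Severity) -> str:
    """Return the notification channel for an alert of the given severity."""
    return ROUTES[severity]


print(route(Severity.CRITICAL))  # page_on_call
print(route(Severity.INFO))      # post_to_slack
```

Keeping the mapping in one table makes it easy to review which conditions are allowed to page someone.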

Examples

An alert catches a memory leak before it causes an outage.

The alert rule evaluates container memory usage over a 15-minute window. Memory normally sits at 60%. The alert threshold is 85%, and the condition must hold for 5 minutes before the alert fires. At 2pm, memory starts climbing steadily because of a leak in the new deployment. It crosses 85% at about 2:15pm and stays above it, so the alert fires at 2:20pm. The on-call engineer rolls back the deployment before memory hits 100% and the container crashes.
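
A sustained-threshold check like the one in this scenario might look roughly like the sketch below. The function name first_fire_minute and the linear leak profile (60% at 2:00pm, reaching 100% at 2:24pm) are assumptions made for illustration; real alerting systems express this as rule configuration rather than application code.

```python
def first_fire_minute(samples, threshold, sustain_minutes):
    """Return the minute at which a sustained-threshold alert fires.

    samples holds one reading per minute. The alert fires once readings
    have stayed at or above threshold for sustain_minutes minutes.
    Returns None if the alert never fires.
    """
    breach_start = None
    for minute, value in enumerate(samples):
        if value >= threshold:
            if breach_start is None:
                breach_start = minute  # first minute of the current breach
            if minute - breach_start >= sustain_minutes:
                return minute
        else:
            breach_start = None  # a dip below the threshold resets the timer
    return None


# Assumed leak profile: minute 0 is 2:00pm, memory rises linearly from 60%.
samples = [min(100.0, 60 + 5 * m / 3) for m in range(31)]
print(first_fire_minute(samples, threshold=85, sustain_minutes=5))  # 20
```

Under these assumptions the alert fires at minute 20 (2:20pm), a few minutes before memory would reach 100%. A brief spike that drops back below 85% resets the timer and never fires.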

A team restructures their alerting to reduce noise.

The team receives 40 alerts per week. They audit each one: 12 are actionable, 18 are informational, and 10 are false positives. They delete the false-positive alerts, convert the informational ones to dashboard metrics, and keep the 12 actionable ones. Alert volume drops 70%, from 40 to 12 per week. Response time to real alerts improves from 15 minutes to 3 minutes because engineers trust the system again.
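
The audit arithmetic can be checked in a few lines. The dictionary keys are just labels chosen for this sketch.

```python
# Weekly alert counts from the audit, by category.
audit = {"actionable": 12, "informational": 18, "false_positive": 10}

total = sum(audit.values())         # 40 alerts per week before the cleanup
kept = audit["actionable"]          # only actionable alerts remain as alerts
reduction = (total - kept) / total  # fraction of alert volume removed

print(f"{total} -> {kept} alerts per week ({reduction:.0%} reduction)")
```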

Business metric alerts catch a silent failure.

All technical metrics look normal: servers are up, latency is low, error rates are zero. But the alert on 'signups per hour' fires because signups dropped from 50/hour to 0 in the last 30 minutes. Investigation reveals that the signup form's submit button was accidentally hidden by a CSS change. No errors were generated because the button was never clicked. Technical monitoring would not have caught this.
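
A business-metric check of this kind could be sketched as follows. The function name signup_alert and the min_fraction cutoff of 20% are assumptions for illustration; the right cutoff depends on how much signup volume normally varies.

```python
def signup_alert(recent_count, baseline_per_hour, window_minutes, min_fraction=0.2):
    """Return True when signups in the recent window fall far below baseline.

    min_fraction is an assumed cutoff: alert when the window holds fewer
    than 20% of the signups the baseline rate would predict.
    """
    expected = baseline_per_hour * window_minutes / 60
    return recent_count < expected * min_fraction


# Baseline of 50 signups/hour, so about 25 are expected in 30 minutes.
print(signup_alert(recent_count=0, baseline_per_hour=50, window_minutes=30))   # True
print(signup_alert(recent_count=22, baseline_per_hour=50, window_minutes=30))  # False
```

Comparing against an expected count rather than against zero also catches partial drops. At very low traffic the expected count approaches zero and a check like this stops being useful.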

Frequently asked questions

What is the difference between an alert and a notification?

An alert demands immediate action. It pages someone, interrupts their work, and expects a response. A notification is informational: a Slack message, an email summary, a dashboard update. The distinction matters because using page-level alerts for informational events causes alert fatigue. Reserve paging alerts for situations where a human must act within minutes. Everything else is a notification.

How do you set good alert thresholds?

Start by measuring baseline behavior over at least two weeks. If your normal error rate is 0.1% with occasional spikes to 0.5%, setting the alert at 0.2% will produce constant false alarms. Set it at 1%, which is clearly abnormal. Use sustained thresholds (condition must be true for 5 minutes) to avoid alerting on momentary spikes. Review and adjust thresholds monthly based on alert accuracy.
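
One way to turn a measured baseline into a threshold is sketched below. The sample error rates are invented to match the example above, and the 2x headroom over the worst baseline reading is an assumed rule of thumb, not a standard.

```python
def breaches(baseline_rates, threshold):
    """Count baseline readings that would have crossed the threshold."""
    return sum(1 for rate in baseline_rates if rate > threshold)


def suggest_threshold(baseline_rates, headroom=2.0):
    """Suggest a threshold as a multiple of the worst baseline reading."""
    return max(baseline_rates) * headroom


# Invented baseline error rates in percent: mostly 0.1% with spikes up to 0.5%.
baseline = [0.1, 0.1, 0.3, 0.1, 0.5, 0.1, 0.1, 0.4, 0.1, 0.1]

print(breaches(baseline, 0.2))      # 3 false alarms from normal behavior
print(breaches(baseline, 1.0))      # 0
print(suggest_threshold(baseline))  # 1.0
```

Replaying candidate thresholds against baseline data shows how many alerts normal behavior alone would have triggered.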
