Monitoring
MON-ih-ter-ing
Continuously checking system health using metrics, dashboards, and alerts to detect problems.
Monitoring is watching your systems to know when something is wrong. It is the practice of collecting metrics from your applications and infrastructure, displaying them on dashboards, and setting up alerts that page someone when thresholds are crossed. It is your early warning system.
Good monitoring answers three questions: Is the system up? Is the system healthy? Is the system getting worse? You measure CPU usage, memory consumption, request latency, error rates, queue depths, and disk space. You set thresholds: if error rate exceeds 1%, page the on-call engineer. If disk usage exceeds 85%, create a ticket.
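The threshold-to-action mapping above can be sketched in a few lines. This is an illustrative sketch only; the metric names and the `evaluate` helper are made up for this example, not any monitoring tool's API.

```python
# Sketch of threshold-based alerting using the example thresholds from the
# text: error rate above 1% pages the on-call engineer, disk usage above 85%
# creates a ticket. Metric names are illustrative assumptions.

def evaluate(metrics: dict) -> list[str]:
    """Map current metric values to actions."""
    actions = []
    if metrics.get("error_rate", 0.0) > 0.01:   # error rate above 1%
        actions.append("page on-call engineer")
    if metrics.get("disk_usage", 0.0) > 0.85:   # disk usage above 85%
        actions.append("create ticket")
    return actions

print(evaluate({"error_rate": 0.02, "disk_usage": 0.50}))
# → ['page on-call engineer']
```

Real systems evaluate rules like these continuously against a time-series database rather than one snapshot, but the core logic is this simple comparison.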
The tools are mature. Prometheus collects metrics. Grafana displays dashboards. PagerDuty sends alerts. Datadog does all three. But the hard part is not the tools. It is knowing what to monitor. Teams that monitor everything drown in noise. Teams that monitor too little miss critical signals. The art is identifying the metrics that actually indicate user impact.
Examples
A team sets up monitoring for a new API.
They create a Grafana dashboard with four panels: request rate (current: 500 req/s), error rate (current: 0.2%), p50 latency (45ms), and p99 latency (180ms). They set alerts: error rate above 1% pages the on-call engineer. P99 latency above 500ms creates a high-priority ticket. On the third day, the error rate alert fires at 2am. The on-call engineer finds that the database connection pool is exhausted and fixes it in 20 minutes.
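The p50 and p99 figures on that dashboard are latency percentiles. A minimal sketch of how they are computed, using the nearest-rank convention (monitoring systems typically estimate percentiles from histograms instead, but the idea is the same):

```python
# Nearest-rank percentile: the smallest sample value such that at least
# p% of all samples are less than or equal to it.
import math

def percentile(samples: list[float], p: float) -> float:
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical latency samples in milliseconds.
latencies_ms = [40, 42, 45, 47, 50, 55, 60, 120, 150, 180]
print(percentile(latencies_ms, 50))  # → 50
print(percentile(latencies_ms, 99))  # → 180
```

Note how the p99 is dominated by the slowest requests: that is why teams alert on tail latency, not just the median.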
A monitoring system generates too many false alarms.
The team gets 15 alerts per day, but only 2 require action. Engineers start ignoring alerts. A real incident goes unnoticed for 40 minutes because the alert was dismissed as noise. The team reviews every alert over two weeks, eliminates the noisy ones, adjusts thresholds, and adds context to remaining alerts. Alert volume drops to 3 per day, all actionable.
A company monitors business metrics alongside technical ones.
The dashboard shows not just server health but also signups per hour, checkout completions, and API key activations. When a deploy causes a 30% drop in checkout completions, the business metric alerts the team before any technical metric shows a problem. The server is healthy. The code has a bug that silently skips the final checkout step.
Frequently asked questions
What are the most important metrics to monitor?
Google's SRE book defines the 'four golden signals': latency (how long requests take), traffic (how many requests per second), errors (how many requests fail), and saturation (how full the system is). If you monitor these four things well, you will catch most problems. Add business metrics (conversion rates, signup rates) to catch issues that do not manifest as technical failures.
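As a sketch, all four golden signals can be derived from a batch of request records. The record shape, the measurement window, and the `capacity_rps` figure are assumptions made for this example:

```python
# Derive the four golden signals from request records collected over a
# window. Each record is assumed to have a duration and an HTTP status.
import math

def golden_signals(requests: list[dict], window_seconds: float,
                   capacity_rps: float) -> dict:
    durations = sorted(r["duration_ms"] for r in requests)
    failures = sum(1 for r in requests if r["status"] >= 500)
    rps = len(requests) / window_seconds
    return {
        "latency_p99_ms": durations[math.ceil(0.99 * len(durations)) - 1],
        "traffic_rps": rps,                 # load on the system
        "error_rate": failures / len(requests),
        "saturation": rps / capacity_rps,   # fraction of capacity in use
    }

# 98 fast successes and 2 slow server errors in a 10-second window.
reqs = ([{"duration_ms": 50, "status": 200}] * 98
        + [{"duration_ms": 400, "status": 500}] * 2)
print(golden_signals(reqs, window_seconds=10, capacity_rps=20))
```

Even this toy version shows the point of the four signals: one pass over the data tells you how slow, how busy, how broken, and how close to capacity the system is.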
How do you avoid alert fatigue?
Only alert on conditions that require immediate human action. If an alert fires and the response is 'wait and see,' it should not be an alert. It should be a dashboard metric. Route alerts by severity: pages for critical issues, tickets for non-urgent ones, dashboards for informational. Review alert frequency monthly and tune or remove alerts that are not actionable.
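The severity-based routing described above can be sketched as a simple lookup. The channel names are placeholders; a real setup would call a paging service, a ticketing API, and a dashboard annotation endpoint:

```python
# Route alerts by severity: pages for critical issues, tickets for
# non-urgent ones, dashboards for informational. Channel names are
# illustrative placeholders.

ROUTES = {
    "critical": "page",       # requires immediate human action
    "warning": "ticket",      # handle during business hours
    "info": "dashboard",      # visible on a panel, no notification
}

def route(alert: dict) -> str:
    # Unknown severities fall back to the dashboard rather than paging anyone.
    return ROUTES.get(alert.get("severity"), "dashboard")

print(route({"name": "error_rate_high", "severity": "critical"}))  # → page
```

The fallback matters: an alert that might not need immediate action should default to the least noisy channel, which is exactly the discipline that prevents alert fatigue.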
Related terms
Observability: the ability to understand a system's internal state by examining its outputs: logs, metrics, and traces.
Alerting: automatically notifying engineers when system metrics cross predefined thresholds indicating problems.
Service level agreement: a contractual commitment to specific performance and availability levels.
Service level indicator: a specific metric used to measure the reliability of a service, like latency or error rate.
Incident: an unplanned event that disrupts a service or degrades it below its expected quality, requiring a coordinated response.

Want the complete playbook?
Picks and Shovels is the definitive guide to developer marketing. Amazon #1 bestseller with practical strategies from 30 years of marketing to developers.