Monitoring
MON-ih-ter-ing
Continuously checking system health using metrics, dashboards, and alerts to detect problems.
Monitoring is watching your systems to know when something is wrong. It is the practice of collecting metrics from your applications and infrastructure, displaying them on dashboards, and setting up alerts that page someone when thresholds are crossed. It is your early warning system.
Good monitoring answers three questions: Is the system up? Is the system healthy? Is the system getting worse? You measure CPU usage, memory consumption, request latency, error rates, queue depths, and disk space. You set thresholds: if error rate exceeds 1%, page the on-call engineer. If disk usage exceeds 85%, create a ticket.
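The threshold-to-action mapping above can be sketched in a few lines. This is an illustrative sketch only; the metric names and the `evaluate` helper are made up for this example, not any monitoring tool's API.

```python
# Sketch of threshold-based alerting using the example thresholds from the
# text: error rate above 1% pages the on-call engineer, disk usage above 85%
# creates a ticket. Metric names are illustrative assumptions.

def evaluate(metrics: dict) -> list[str]:
    """Map current metric values to actions."""
    actions = []
    if metrics.get("error_rate", 0.0) > 0.01:   # error rate above 1%
        actions.append("page on-call engineer")
    if metrics.get("disk_usage", 0.0) > 0.85:   # disk usage above 85%
        actions.append("create ticket")
    return actions

print(evaluate({"error_rate": 0.02, "disk_usage": 0.50}))
# → ['page on-call engineer']
```

Real systems evaluate rules like these continuously against a time-series database rather than one snapshot, but the core logic is this simple comparison.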
The tools are mature. Prometheus collects metrics. Grafana displays dashboards. PagerDuty sends alerts. Datadog does all three. But the hard part is not the tools. It is knowing what to monitor. Teams that monitor everything drown in noise. Teams that monitor too little miss critical signals. The art is identifying the metrics that actually indicate user impact.
Examples
A team sets up monitoring for a new API.
They create a Grafana dashboard with four panels: request rate (current: 500 req/s), error rate (current: 0.2%), p50 latency (45ms), and p99 latency (180ms). They set alerts: error rate above 1% pages the on-call engineer. P99 latency above 500ms creates a high-priority ticket. On the third day, the error rate alert fires at 2am. The on-call engineer finds that the database connection pool is exhausted and fixes it in 20 minutes.
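The p50 and p99 figures on that dashboard are latency percentiles. A minimal sketch of how they are computed, using the nearest-rank convention (monitoring systems typically estimate percentiles from histograms instead, but the idea is the same):

```python
# Nearest-rank percentile: the smallest sample value such that at least
# p% of all samples are less than or equal to it.
import math

def percentile(samples: list[float], p: float) -> float:
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical latency samples in milliseconds.
latencies_ms = [40, 42, 45, 47, 50, 55, 60, 120, 150, 180]
print(percentile(latencies_ms, 50))  # → 50
print(percentile(latencies_ms, 99))  # → 180
```

Note how the p99 is dominated by the slowest requests: that is why teams alert on tail latency, not just the median.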
A monitoring system generates too many false alarms.
The team gets 15 alerts per day, but only 2 require action. Engineers start ignoring alerts. A real incident goes unnoticed for 40 minutes because the alert was dismissed as noise. The team reviews every alert over two weeks, eliminates the noisy ones, adjusts thresholds, and adds context to remaining alerts. Alert volume drops to 3 per day, all actionable.
A company monitors business metrics alongside technical ones.
The dashboard shows not just server health but also signups per hour, checkout completions, and API key activations. When a deploy causes a 30% drop in checkout completions, the business metric alerts the team before any technical metric shows a problem. The server is healthy. The code has a bug that silently skips the final checkout step.
Frequently asked questions
What are the most important metrics to monitor?
Google's SRE book defines the 'four golden signals': latency (how long requests take), traffic (how many requests per second), errors (how many requests fail), and saturation (how full the system is). If you monitor these four things well, you will catch most problems. Add business metrics (conversion rates, signup rates) to catch issues that do not manifest as technical failures.
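As a sketch, all four golden signals can be derived from a batch of request records. The record shape, the measurement window, and the `capacity_rps` figure are assumptions made for this example:

```python
# Derive the four golden signals from request records collected over a
# window. Each record is assumed to have a duration and an HTTP status.
import math

def golden_signals(requests: list[dict], window_seconds: float,
                   capacity_rps: float) -> dict:
    durations = sorted(r["duration_ms"] for r in requests)
    failures = sum(1 for r in requests if r["status"] >= 500)
    rps = len(requests) / window_seconds
    return {
        "latency_p99_ms": durations[math.ceil(0.99 * len(durations)) - 1],
        "traffic_rps": rps,                 # load on the system
        "error_rate": failures / len(requests),
        "saturation": rps / capacity_rps,   # fraction of capacity in use
    }

# 98 fast successes and 2 slow server errors in a 10-second window.
reqs = ([{"duration_ms": 50, "status": 200}] * 98
        + [{"duration_ms": 400, "status": 500}] * 2)
print(golden_signals(reqs, window_seconds=10, capacity_rps=20))
```

Even this toy version shows the point of the four signals: one pass over the data tells you how slow, how busy, how broken, and how close to capacity the system is.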
How do you avoid alert fatigue?
Only alert on conditions that require immediate human action. If an alert fires and the response is 'wait and see,' it should not be an alert. It should be a dashboard metric. Route alerts by severity: pages for critical issues, tickets for non-urgent ones, dashboards for informational. Review alert frequency monthly and tune or remove alerts that are not actionable.
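The severity-based routing described above can be sketched as a simple lookup. The channel names are placeholders; a real setup would call a paging service, a ticketing API, and a dashboard annotation endpoint:

```python
# Route alerts by severity: pages for critical issues, tickets for
# non-urgent ones, dashboards for informational. Channel names are
# illustrative placeholders.

ROUTES = {
    "critical": "page",       # requires immediate human action
    "warning": "ticket",      # handle during business hours
    "info": "dashboard",      # visible on a panel, no notification
}

def route(alert: dict) -> str:
    # Unknown severities fall back to the dashboard rather than paging anyone.
    return ROUTES.get(alert.get("severity"), "dashboard")

print(route({"name": "error_rate_high", "severity": "critical"}))  # → page
```

The fallback matters: an alert that might not need immediate action should default to the least noisy channel, which is exactly the discipline that prevents alert fatigue.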
Related terms
Observability: the ability to understand a system's internal state by examining its outputs: logs, metrics, and traces.
Alerting: automatically notifying engineers when system metrics cross predefined thresholds indicating problems.
Service level agreement: a contractual commitment to specific performance and availability levels.
Service level indicator: a specific metric used to measure the reliability of a service, like latency or error rate.
Incident: an unplanned event that disrupts a service or degrades it below its expected quality, requiring a coordinated response.

Want the complete playbook?
Picks and Shovels is the definitive guide to developer marketing. Amazon #1 bestseller with practical strategies from 30 years of marketing to developers.