Runbook

RUN-book

A documented set of step-by-step procedures for handling specific operational tasks or incidents.

A runbook is a document that tells an on-call engineer exactly what to do when something goes wrong. It is a step-by-step checklist: check this dashboard, run this query, restart this service, escalate to this team. The goal is to turn incident response from an art (a senior engineer investigates from scratch) into a process (anyone on call follows the steps).

Good runbooks are specific. Not "investigate the database" but "run this SQL query to check connection count. If the count exceeds 100, restart the connection pooler with this command. If that does not resolve the issue, page the database team." The person reading the runbook at 3 AM should not need to think creatively. They should follow the steps.
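
The database step above is effectively a small decision procedure, and writing it out as one can show whether a step is specific enough. The following is a minimal Python sketch of that idea, not a real integration: the function name, the `Action` values, and the `pooler_already_restarted` flag are hypothetical, and only the threshold of 100 comes from the example.

```python
from enum import Enum

CONNECTION_LIMIT = 100  # threshold quoted in the runbook step


class Action(Enum):
    # Hypothetical labels for the outcomes the runbook step describes.
    NO_ACTION = "connection count is healthy; look elsewhere"
    RESTART_POOLER = "restart the connection pooler"
    PAGE_DATABASE_TEAM = "page the database team"


def next_action(connection_count: int, pooler_already_restarted: bool) -> Action:
    """Return the next runbook step for the observed connection count."""
    if connection_count <= CONNECTION_LIMIT:
        return Action.NO_ACTION
    if not pooler_already_restarted:
        return Action.RESTART_POOLER
    # The restart did not bring the count down, so escalate.
    return Action.PAGE_DATABASE_TEAM


print(next_action(140, pooler_already_restarted=False).value)
# prints "restart the connection pooler"
```

If a step cannot be expressed as a check, a threshold, and a next action like this, it probably still asks the reader to improvise.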

Runbooks live alongside the services they describe. A common pattern is a runbook directory in each service's repository, or a shared wiki (Notion, Confluence, GitHub Wiki) with runbooks organized by service and alert. The best teams link runbooks directly to their monitoring alerts: when an alert fires, the notification includes a link to the relevant runbook. The on-call engineer clicks the link and starts following steps immediately.
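
One way to picture the alert-to-runbook link is a lookup table keyed by alert name. The Python sketch below is illustrative only: the base URL, the alert names, and the `build_notification` helper are assumptions made for the example, and real tools such as PagerDuty or Datadog have their own ways of attaching links to alerts.

```python
RUNBOOK_BASE_URL = "https://wiki.example.com/runbooks"  # hypothetical location

# Hypothetical mapping from alert name to runbook page.
RUNBOOKS = {
    "api-error-rate-high": "api/error-rate",
    "db-connections-exhausted": "database/connection-exhaustion",
}


def build_notification(alert_name: str, summary: str) -> str:
    """Format an alert message that always carries a runbook line."""
    path = RUNBOOKS.get(alert_name)
    if path is None:
        # Surface the gap so the team notices alerts without a runbook.
        link = "no runbook linked; consider writing one after the incident"
    else:
        link = f"{RUNBOOK_BASE_URL}/{path}"
    return f"[{alert_name}] {summary}\nRunbook: {link}"


print(build_notification("api-error-rate-high", "API error rate above 5%"))
```

Making the missing-runbook case visible in the notification itself is a design choice: it turns an undocumented alert into a prompt to write the runbook.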

Examples

An on-call engineer handles an alert at 2 AM.

PagerDuty fires an alert: 'API error rate above 5%.' The alert includes a runbook link. The engineer opens it and follows the steps: check the API dashboard in Datadog (error rate is 7%), check the most common error (503 from the user service), check the user service health (3 of 5 pods are in CrashLoopBackOff), and trigger a rolling restart of the user service deployment to replace the failing pods (kubectl rollout restart). The error rate drops to 0.3% within 2 minutes. Total resolution time: 6 minutes. The engineer did not need to understand the entire system.

A team creates runbooks for common incidents.

The team reviews their last 20 incidents and finds five patterns that account for 80% of pages: database connection exhaustion, certificate expiration, disk space alerts, memory leaks, and deployment failures. They write a runbook for each, with exact commands, expected outputs, and escalation criteria. On-call rotations become manageable for junior engineers. Mean time to resolution drops from 45 minutes to 12 minutes.
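
The review described here amounts to counting incidents by cause and keeping the most frequent causes until they cover a target share of pages. A rough Python sketch of that analysis follows; the `top_patterns` helper and the sample incident list are made up for illustration and are not the team's actual data.

```python
from collections import Counter


def top_patterns(incident_causes: list[str], coverage: float = 0.8) -> list[tuple[str, int]]:
    """Return the most frequent causes that together reach the coverage share."""
    counts = Counter(incident_causes)
    total = sum(counts.values())
    selected = []
    covered = 0
    for cause, count in counts.most_common():
        if covered / total >= coverage:
            break
        selected.append((cause, count))
        covered += count
    return selected


# Illustrative sample of 20 incidents (invented for this sketch).
sample = (
    ["database connection exhaustion"] * 6
    + ["certificate expiration"] * 4
    + ["disk space"] * 3
    + ["memory leak"] * 2
    + ["deployment failure"] * 2
    + ["dns outage", "bad config push", "vendor api down"]
)

for cause, count in top_patterns(sample):
    print(f"{count:2d}  {cause}")
```

The causes this returns are the candidates for a first batch of runbooks; the one-off incidents at the tail are usually better left to investigation.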

A runbook prevents an escalation.

The junior on-call engineer gets paged for high memory usage on the search service. The runbook says: check the memory graph (is it a spike or a gradual leak?). It is a gradual leak. The runbook says: restart the service to clear the leak and create a Jira ticket for the memory leak investigation. The engineer restarts the service, memory returns to normal, and the ticket is created. No senior engineer is woken up. The leak is investigated during business hours.

In practice

Runbook template

RUNBOOK: [Service/Process Name]
Owner: [Team] | Last updated: [DATE] | Review cadence: Quarterly

OVERVIEW
[What this service does in 1-2 sentences]

ARCHITECTURE
[Key components, dependencies, data flow]

ACCESS
- Dashboard: [URL]
- Logs: [URL/command]
- Metrics: [URL]
- On-call: [Rotation link]

COMMON ALERTS AND RESPONSES

Alert: [Alert name]
Severity: [SEV level]
Symptoms: [What the user sees]
Diagnosis:
  1. [Check this first]
  2. [Then check this]
Resolution:
  1. [Step to fix]
  2. [Step to verify]
Escalation: [Who to page if this does not resolve in X minutes]

DEPLOYMENT
- Deploy command: [command]
- Rollback command: [command]
- Feature flags: [list]

EMERGENCY PROCEDURES
- Kill switch: [How to disable the service]
- Data recovery: [Backup location and restore steps]

Frequently asked questions

How do you keep runbooks up to date?

Update the runbook every time you use it. If a step is wrong, fix it immediately. If a command changed, update it. If you discovered a better approach during the incident, add it. Make runbook updates part of the postmortem action items. Some teams assign a quarterly runbook review to each service owner. The worst runbook is one that has not been updated in a year, because any of its steps might be wrong.
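
A quarterly review is easier to enforce if staleness can be checked mechanically. The Python sketch below assumes each runbook carries a "Last updated:" field with an ISO date (YYYY-MM-DD), as in the template header; that date format, the 90-day approximation of a quarter, and the function names are assumptions for illustration.

```python
import re
from datetime import date, timedelta
from typing import Optional

REVIEW_INTERVAL = timedelta(days=90)  # "Quarterly", approximated as 90 days


def last_updated(runbook_text: str) -> Optional[date]:
    """Extract the 'Last updated' date, assuming a YYYY-MM-DD value."""
    match = re.search(r"Last updated:\s*(\d{4}-\d{2}-\d{2})", runbook_text)
    return date.fromisoformat(match.group(1)) if match else None


def is_stale(runbook_text: str, today: date) -> bool:
    """Report whether the runbook is overdue for review."""
    updated = last_updated(runbook_text)
    # A runbook with no parseable date is treated as stale so it gets reviewed.
    return updated is None or today - updated > REVIEW_INTERVAL


header = "Owner: Search | Last updated: 2024-01-10 | Review cadence: Quarterly"
print(is_stale(header, today=date(2024, 6, 1)))
# prints "True"
```

A check like this could run on a schedule over the runbook directory and open a ticket for each stale file, though how to wire that up depends on where the runbooks live.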

Should runbooks replace incident investigation?

No. Runbooks handle known problems with known solutions. They are the first line of defense. If the runbook steps do not resolve the issue, the engineer escalates and investigates. Runbooks handle the 80% of incidents that are variations of known patterns. The remaining 20% require creative investigation, deep system knowledge, and collaboration. Runbooks free up engineer attention for the hard problems by making the easy ones routine.

Related terms

On-call

A rotation where engineers are responsible for responding to production alerts and incidents outside business hours.

Incident

An unplanned event that disrupts a service or degrades it below its expected quality, requiring a coordinated response.

Postmortem

A written analysis of an incident: what happened, why, and what the team will do to prevent it from recurring.

Root cause analysis (RCA)

A systematic process for identifying the underlying cause of an incident rather than just fixing the symptoms.

Alerting

Automatically notifying engineers when system metrics cross predefined thresholds indicating problems.
