Postmortem

pohst-MOR-tum

A written analysis of an incident: what happened, why, and what the team will do to prevent it from recurring.

A postmortem is written once an incident has been resolved. The best postmortems are blameless: they focus on systemic causes (missing monitoring, unclear runbooks, insufficient testing) rather than individual mistakes.

The structure is usually straightforward: timeline, impact, root cause, contributing factors, and action items. The timeline is minute-by-minute. The impact section quantifies damage (users affected, revenue lost, SLO budget consumed). The root cause goes deeper than "a bad deploy" to ask why the bad deploy was possible in the first place. Action items are specific, assigned, and tracked to completion.
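The "SLO budget consumed" figure in the impact section is simple arithmetic. The following is a minimal sketch, assuming an availability SLO measured over a fixed window and an incident in which every minute counts as full downtime. The function name and the 99.9%, 30-day, and 12-minute figures are illustrative assumptions, not data from a real incident.

```python
def error_budget_consumed(slo_target: float, window_minutes: float,
                          downtime_minutes: float) -> float:
    """Return the fraction of the window's error budget used by an incident.

    Assumes an availability SLO where every minute of the incident counts
    as full downtime; partial degradation would need a weighted figure.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    # Allowed downtime for the window: 0.1% of 30 days is 43.2 minutes.
    budget_minutes = window_minutes * (1 - slo_target)
    return downtime_minutes / budget_minutes


# Illustrative numbers: 99.9% SLO, 30-day window, 12-minute outage.
window = 30 * 24 * 60
fraction = error_budget_consumed(0.999, window, 12)
print(f"{fraction:.0%} of the monthly error budget consumed")
```

With these assumed inputs, a single 12-minute outage uses a little over a quarter of the month's budget, which is the kind of concrete number an impact section should state.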

Companies with strong postmortem cultures learn faster. Cloudflare publishes detailed postmortems on its blog. Google's SRE book dedicates an entire chapter to the practice. The goal is not to write a report and file it away; it is to make the system more resilient. A postmortem without completed action items is just documentation of your failure to learn.

Examples

A team writes a postmortem after a database outage.

The postmortem reveals that a schema migration locked a table for 12 minutes during peak traffic. Root cause: the migration was not tested against production-sized data. Action items: require migration dry-runs against a production replica, add lock-duration monitoring, and update the deployment checklist. Each of the three items is assigned an owner and a deadline.

A company holds a postmortem review meeting.

The incident commander presents the postmortem to 20 engineers. They walk through the timeline, discuss what worked (detection was fast) and what did not (the runbook was outdated). An engineer suggests a new action item: automated runbook testing. The meeting produces three additional action items beyond the original five.

A VP uses postmortem data to justify infrastructure investment.

Over the past year, 8 of 12 SEV-1 postmortems cite the legacy message queue as a contributing factor. The VP compiles the data: 47 hours of cumulative downtime, $2.3M in estimated revenue impact, 180 engineering hours spent on incident response. The board approves a $1.5M project to replace the queue.
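This kind of roll-up is easy to automate once postmortems are stored in a structured form. The sketch below is one possible approach, assuming each postmortem has been reduced to a small record. The `PostmortemRecord` class, its fields, and the sample incidents are all hypothetical and exist only for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PostmortemRecord:
    """Hypothetical summary of one postmortem; the fields are assumptions."""
    incident_id: str
    severity: int                      # 1 means SEV-1
    downtime_hours: float
    contributing_factors: list[str] = field(default_factory=list)


def factor_counts(records: list[PostmortemRecord],
                  severity: int = 1) -> tuple[int, Counter]:
    """Count how many postmortems at one severity cite each factor."""
    selected = [r for r in records if r.severity == severity]
    counts: Counter = Counter()
    for record in selected:
        # set() so a factor listed twice in one postmortem counts once.
        counts.update(set(record.contributing_factors))
    return len(selected), counts


# Made-up records for illustration only.
records = [
    PostmortemRecord("INC-101", 1, 3.5, ["legacy message queue", "stale runbook"]),
    PostmortemRecord("INC-102", 1, 1.0, ["legacy message queue"]),
    PostmortemRecord("INC-103", 1, 2.0, ["missing alert"]),
    PostmortemRecord("INC-104", 2, 0.5, ["legacy message queue"]),
]

total, counts = factor_counts(records)
for factor, n in counts.most_common():
    print(f"{factor}: cited in {n} of {total} SEV-1 postmortems")
```

Counting each factor once per postmortem keeps the result comparable to a statement like "8 of 12 SEV-1 postmortems cite the queue."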

In practice

Postmortem template (Google SRE format)

INCIDENT POSTMORTEM
Date: [DATE] | Severity: SEV-[N] | Duration: [X hours]
Authors: [Names] | Status: [Draft/Final]

SUMMARY
[1-2 sentence description of what happened and the impact]

IMPACT
- Duration: [Start time] to [End time] ([X] hours)
- Users affected: [Number/percentage]
- Revenue impact: [$X or N/A]
- SLA impact: [Yes/No, details]

TIMELINE (all times UTC)
| Time  | Event |
|-------|-------|
| HH:MM | [What happened] |

ROOT CAUSE
[Detailed technical explanation of why this happened]

TRIGGER
[What specific action or event triggered the incident]

DETECTION
- How detected: [Alert/Customer report/Manual]
- Time to detect: [X minutes]
- Gap: [What should have caught this sooner]

RESOLUTION
[What was done to resolve the incident]

ACTION ITEMS
| Action | Type | Owner | Bug/Ticket | Priority |
|--------|------|-------|------------|----------|
| [Fix]  | Mitigate/Prevent/Detect | [Name] | [Link] | P[0-3] |

LESSONS LEARNED
What went well:
- [Item]
What went poorly:
- [Item]
Where we got lucky:
- [Item]
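Because action items only matter if they are assigned and finished, some teams check the ACTION ITEMS table automatically. The following is a rough sketch of such a check, assuming the table has been loaded into simple records. The `ActionItem` class, its field names, and the owners and ticket IDs are hypothetical; the three actions are borrowed from the database outage example.

```python
from dataclasses import dataclass


@dataclass
class ActionItem:
    """One row of the ACTION ITEMS table; field names are illustrative."""
    action: str
    kind: str        # "Mitigate", "Prevent", or "Detect"
    owner: str
    ticket: str
    priority: int    # 0 for P0 through 3 for P3
    done: bool = False


def missing_fields(items: list[ActionItem]) -> list[ActionItem]:
    """Return items that have no owner or no ticket link."""
    return [i for i in items if not i.owner.strip() or not i.ticket.strip()]


def completion_rate(items: list[ActionItem]) -> float:
    """Return the fraction of items marked done (0.0 for an empty list)."""
    if not items:
        return 0.0
    return sum(i.done for i in items) / len(items)


# Owners and ticket IDs below are made up for the example.
items = [
    ActionItem("Require migration dry-runs on a replica", "Prevent",
               "alice", "TICKET-1", 1, done=True),
    ActionItem("Add lock-duration monitoring", "Detect", "bob", "TICKET-2", 1),
    ActionItem("Update the deployment checklist", "Prevent", "", "", 2),
]

for item in missing_fields(items):
    print(f"Missing owner or ticket: {item.action}")
print(f"Completed: {completion_rate(items):.0%}")
```

A check like this could run before a postmortem moves from Draft to Final, so that no action item is published without an owner and a ticket.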

Frequently asked questions

What makes a postmortem 'blameless'?

A blameless postmortem assumes that everyone involved made reasonable decisions given the information they had at the time. It focuses on systems, processes, and tools rather than individual errors. Instead of 'Engineer X deployed bad code,' it asks 'Why did our deployment process allow untested code to reach production?' The goal is to fix systems, not punish people.

When should you write a postmortem?

Write a postmortem for every SEV-1 and SEV-2 incident, and for any SEV-3 that reveals a systemic problem. Some teams also write postmortems for near-misses: situations where a problem was caught before it caused user impact. Publish the postmortem within 48 to 72 hours, while the incident is still fresh. Waiting longer means losing details and momentum on action items.
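The rule above is mechanical enough to encode in incident tooling. The following is a small sketch, assuming severities are integers (1 for SEV-1) and that the 72-hour clock starts when the incident is resolved, which the text does not specify. Both function names are hypothetical, and near-misses are left out because covering them is a per-team policy choice.

```python
from datetime import datetime, timedelta, timezone


def needs_postmortem(severity: int, systemic: bool = False) -> bool:
    """Every SEV-1 and SEV-2, plus any SEV-3 that exposes a systemic problem."""
    if severity in (1, 2):
        return True
    return severity == 3 and systemic


def publish_deadline(resolved_at: datetime, max_hours: int = 72) -> datetime:
    """Latest publication time, assuming the clock starts at resolution."""
    return resolved_at + timedelta(hours=max_hours)


# Illustrative incident: a SEV-3 that exposed a systemic gap.
resolved = datetime(2024, 1, 8, 14, 30, tzinfo=timezone.utc)
if needs_postmortem(3, systemic=True):
    print("Publish by", publish_deadline(resolved).isoformat())
```

Keeping timestamps timezone-aware matches the template's convention of recording all times in UTC.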

Related terms

Incident

An unplanned event that disrupts a service or degrades it below its expected quality, requiring a coordinated response.

Severity levels

A classification system (SEV-1 through SEV-4) that ranks incidents by impact and urgency to determine response priority.

On-call

A rotation in which engineers take turns being responsible for responding to production alerts and incidents, including outside business hours.

SLO

Service level objective: an internal reliability target for a service, like 99.9% availability or p99 latency under 200ms.

Availability (uptime)

The percentage of time a system is operational and accessible to users.
