Root cause analysis (RCA) - Definition and examples

Root cause analysis is the practice of asking "why did this really happen?" after an incident. Not just "the server crashed" but "why did the server crash?" And then "why did that happen?" And again "why?" This is the "Five Whys" technique, and it keeps going until you find the systemic issue that, if fixed, would prevent the problem from recurring.

The point is to stop firefighting. If a database runs out of disk space and you add more disk, you have fixed the symptom. If you ask why the disk filled up and discover that logs are never rotated, you have found the root cause. Fix the log rotation, and that failure mode does not come back. Without RCA, teams fix the same problems over and over.

RCA is a key part of the postmortem process. After an incident is resolved, the team meets (or writes an async document) to analyze what happened, identify the root cause, and create action items to prevent recurrence. The best RCAs are blameless: they focus on systems and processes, not on which person made a mistake. The question is "why did the system allow this to happen?" not "who did this?"

Examples

A team investigates a production outage.

The API went down for 45 minutes. The immediate cause: the database connection pool was exhausted. Why? A new query was missing an index and took 30 seconds per call instead of 30 milliseconds, so each request held its connection long enough for the pool to run dry. Why was there no index? The migration was written but never applied to production. Why? The deployment script does not run migrations automatically. Root cause: manual migration process. Action item: add automatic migration execution to the CI/CD pipeline.
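
The action item above could look something like the following deploy step. This is a minimal sketch, not a real pipeline: SQLite stands in for the production database, and the `MIGRATIONS` list, the `schema_migrations` table, and the `apply_pending_migrations` function are hypothetical names chosen for illustration. The idea is that the deploy applies any migration not yet recorded as applied, so a written-but-forgotten index cannot happen silently.

```python
import sqlite3
from typing import List

# Hypothetical ordered migrations; a real project would usually load these from files.
MIGRATIONS = [
    ("001_create_orders",
     "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)"),
    ("002_index_orders_customer",
     "CREATE INDEX idx_orders_customer ON orders (customer_id)"),
]


def apply_pending_migrations(conn: sqlite3.Connection) -> List[str]:
    """Apply every migration not yet recorded, in order. Returns the names applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}

    newly_applied = []
    for name, sql in MIGRATIONS:
        if name in applied:
            continue  # already applied by an earlier deploy
        conn.execute(sql)
        # Record the migration so later deploys skip it.
        with conn:
            conn.execute("INSERT INTO schema_migrations (name) VALUES (?)", (name,))
        newly_applied.append(name)
    return newly_applied
```

A pipeline would call `apply_pending_migrations` before routing traffic to the new version. Running it twice is safe: the second run finds nothing pending and returns an empty list.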

A billing error affects 200 customers.

Customers are charged twice for their subscription. The immediate cause: the webhook handler processes the same event twice. Why? The handler is not idempotent. Why? There is no idempotency check in the code. Why? The team did not have a standard pattern for webhook handling. Root cause: missing engineering standards for webhook processing. Action items: add idempotency keys to all webhook handlers, create a shared webhook processing library, and refund affected customers.
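
One way to make a webhook handler idempotent is to record each event ID and refuse to process an ID twice. The sketch below assumes each event carries a unique `id`; the table names, the event fields, and the `handle_webhook` function are hypothetical, and SQLite stands in for whatever database the service actually uses.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    """Create an in-memory store; a real service would use its own database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE charges (customer_id TEXT, amount_cents INTEGER)")
    return conn


def handle_webhook(conn: sqlite3.Connection, event: dict) -> bool:
    """Apply the event at most once.

    Returns True if the event was processed, False if it was a duplicate.
    """
    try:
        # One transaction: commits if both inserts succeed, rolls back otherwise.
        with conn:
            # The primary key rejects an event ID that was already recorded.
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event["id"],),
            )
            conn.execute(
                "INSERT INTO charges (customer_id, amount_cents) VALUES (?, ?)",
                (event["customer_id"], event["amount_cents"]),
            )
        return True
    except sqlite3.IntegrityError:
        # Duplicate delivery: nothing was charged, so it can be acknowledged safely.
        return False
```

Recording the event ID and the charge in the same transaction is the design choice that matters here: a redelivered event fails on the first insert, so the charge is never written a second time.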

A team uses the Five Whys technique.

A deploy caused a 10-minute outage. Why did the deploy break production? The new version had a bug. Why was the bug not caught? There are no tests for that code path. Why are there no tests? The feature was rushed before a deadline. Why was it rushed? The scope was not reduced when the timeline shortened. Root cause: the team does not have a process for descoping when timelines change. Action item: add a mandatory scope review whenever a deadline is moved earlier.

Frequently asked questions

How do you know when you have found the root cause?

The root cause is the deepest systemic issue that, if fixed, would prevent the entire category of problem from recurring. If your fix only prevents this specific incident, you have not gone deep enough. If your fix would also prevent similar incidents in other services, you are probably at the right level. For example, "add an index to this query" fixes one slow query. "Add query performance testing to CI" catches that whole class of slow query before it reaches production.
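
As a rough illustration of what a query performance check in CI might do, the sketch below asks the database for its query plan and flags full table scans. It assumes SQLite as a stand-in and relies on SQLite's `EXPLAIN QUERY PLAN` output, where a full scan is reported in a detail string beginning with `SCAN`. The `full_table_scans` helper and the `orders` schema are hypothetical; a production database would need its own `EXPLAIN` syntax and plan format.

```python
import sqlite3


def full_table_scans(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> list:
    """Return the plan steps that scan a whole table or index for the given query."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # Each plan row is (id, parent, notused, detail); detail describes the step.
    return [row[3] for row in plan if row[3].startswith("SCAN")]


def demo() -> tuple:
    """Check the same query before and after adding an index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER)"
    )
    query = "SELECT * FROM orders WHERE customer_id = ?"

    before = full_table_scans(conn, query, (42,))  # no index yet: full scan expected
    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
    after = full_table_scans(conn, query, (42,))   # index lookup: no scan expected
    return before, after
```

A CI job could run a check like this over the queries a service issues and fail the build when any of them reports a scan on a large table.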

What is a blameless postmortem?

A blameless postmortem focuses on system and process failures, not individual mistakes. Instead of "John deployed the bad code," it asks "Why did the system allow bad code to be deployed?" The answer might be: no automated tests, no canary deployment, no rollback mechanism. Blame discourages reporting. If engineers fear being blamed, they hide mistakes. If the culture is blameless, engineers report problems early, share lessons, and the whole team learns. Many companies known for strong reliability cultures practice blameless postmortems.
