Rollback

A rollback is the act of reverting your production system to the previous version when a deploy goes wrong. The new version shipped a bug. Error rates are spiking. Users are seeing 500 errors. You roll back: the old, working version is restored, and the problem stops. Then you fix the bug at your own pace and try again.

Fast rollbacks are the safety net that makes frequent deployments possible. If deploying is scary because a bad deploy means hours of downtime, teams deploy less often. If a bad deploy means a 2-minute rollback, teams deploy with confidence. The ability to roll back quickly is more valuable than the ability to prevent all bugs, because you cannot prevent all bugs.

Rollback strategies vary by technology. Container deployments roll back by pointing to the previous container image. Blue-green deployments roll back by switching traffic back to the old environment. Database rollbacks are harder because data changes (new rows, modified records) cannot be easily reverted. This is why smart teams separate code deployments from data migrations and ensure migrations are backwards compatible, so that the previous version of the code still works against the migrated schema.
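
As a rough illustration of the blue-green case, the sketch below models a router that sends all traffic to one of two environments. The class name BlueGreenRouter, the environment labels, and the version numbers are assumptions made for this example; a real setup would flip a load balancer or DNS target rather than a Python attribute.

```python
from dataclasses import dataclass


@dataclass
class BlueGreenRouter:
    """Toy model: all traffic goes to exactly one of two environments."""

    live: str = "blue"   # environment currently receiving traffic
    idle: str = "green"  # environment holding the other version

    def switch(self) -> str:
        """Swap live and idle. The same operation releases and rolls back."""
        self.live, self.idle = self.idle, self.live
        return self.live


# Hypothetical versions installed in each environment.
versions = {"blue": "2.13.0", "green": "2.14.0"}
router = BlueGreenRouter()

router.switch()                   # release: traffic moves to green
released = versions[router.live]

router.switch()                   # rollback: traffic moves back to blue
restored = versions[router.live]
```

The point of the design is that the old environment is never torn down during the release, so rolling back is the same cheap operation as releasing.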

Examples

A deployment causes a spike in error rates.

The team deploys version 2.14.0 at 10:00 AM. By 10:05, the error rate jumps from 0.1% to 8%. The on-call engineer triggers a rollback. Kubernetes replaces the running pods with pods running the previous container image (2.13.0). By 10:08, error rates return to normal. Total user impact: 8 minutes. The team investigates the bug in 2.14.0, finds a nil pointer dereference in the new caching layer, fixes it, and redeploys as 2.14.1.

A team cannot roll back due to a database migration.

Version 3.0 includes a migration that renames the 'email' column to 'primary_email'. After deploying, a critical bug is found. The team cannot roll back to version 2.9 because the old code references the 'email' column, which no longer exists. From then on, the team designs migrations to be additive: add the new column, deploy code that reads both, migrate data, then remove the old column in a later release.
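
The additive sequence can be sketched with an in-memory SQLite database. The table name users, the sample row, and the two reader functions are assumptions for illustration only; the column names 'email' and 'primary_email' come from the scenario above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users (email) VALUES ('ada@example.com')")

# Step 1: add the new column. The old column stays in place.
conn.execute("ALTER TABLE users ADD COLUMN primary_email TEXT")

# Step 2: backfill the new column from the old one.
conn.execute("UPDATE users SET primary_email = email WHERE primary_email IS NULL")


def read_email_old_code(db: sqlite3.Connection) -> str:
    """Stand-in for version 2.9, which only knows about 'email'."""
    return db.execute("SELECT email FROM users WHERE id = 1").fetchone()[0]


def read_email_new_code(db: sqlite3.Connection) -> str:
    """Stand-in for version 3.0: prefer the new column, fall back to the old."""
    query = "SELECT COALESCE(primary_email, email) FROM users WHERE id = 1"
    return db.execute(query).fetchone()[0]


old_result = read_email_old_code(conn)
new_result = read_email_new_code(conn)

# Step 3, in a later release once no deployed code reads 'email':
# drop the old column. It is deliberately not done here.
```

Because 'email' still exists after steps 1 and 2, both the old and the new code can run against the same schema, which is what keeps a rollback possible. While both versions may be live, new code would also need to keep writing to the old column.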

An automated rollback catches a problem before humans notice.

The deployment pipeline includes automated canary analysis. After deploying the new version to 5% of traffic, the pipeline compares error rates, latency, and CPU usage against the old version. The new version shows 3x higher latency on the /checkout endpoint. The pipeline automatically rolls back and notifies the team in Slack. User impact is limited to the 5% of traffic on the canary. No human intervention required.
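
A minimal sketch of the comparison step might look like the following. The Metrics fields, the canary_verdict function, the 2x threshold, and the sample numbers are all assumptions for this example; real canary analysis tools typically use statistical tests over time windows rather than a single ratio.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    error_rate: float       # fraction of failed requests, e.g. 0.001 = 0.1%
    p95_latency_ms: float   # 95th percentile latency in milliseconds
    cpu_percent: float      # average CPU utilisation


def canary_verdict(baseline: Metrics, canary: Metrics, max_ratio: float = 2.0) -> list[str]:
    """Return the names of metrics where the canary exceeds max_ratio x baseline."""
    regressions = []
    for name in ("error_rate", "p95_latency_ms", "cpu_percent"):
        base = getattr(baseline, name)
        new = getattr(canary, name)
        # Note: with a zero baseline, any positive canary value is flagged.
        if new > base * max_ratio:
            regressions.append(name)
    return regressions


# Hypothetical readings: the canary's latency is 3x the baseline.
baseline = Metrics(error_rate=0.001, p95_latency_ms=120.0, cpu_percent=40.0)
canary = Metrics(error_rate=0.001, p95_latency_ms=360.0, cpu_percent=42.0)

regressions = canary_verdict(baseline, canary)
action = "rollback" if regressions else "promote"
```

With these sample numbers only the latency metric crosses the threshold, so the pipeline would choose to roll back rather than promote the canary.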

Frequently asked questions

When should you roll back vs. roll forward?

Roll back when the fix is not immediately obvious and users are being impacted right now. Get back to the known-good state first, then debug at your own pace. Roll forward (deploy a fix) when the bug is simple, the fix is already written, and you can deploy it faster than rolling back. A one-line typo fix that deploys in 3 minutes is faster to roll forward. A complex logic bug in a new feature is faster to roll back.

How do you make rollbacks fast?

Three things. First, keep the previous version's artifacts (container images, build outputs) readily available, not just the latest. Second, automate the rollback so it is a single button click or CLI command, not a manual process. Third, separate database migrations from code deployments. If the code can roll back without needing to reverse a migration, rollbacks take seconds instead of hours.
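
The first two points can be sketched as a small release history that retains recent artifacts and exposes rollback as one call. The ReleaseHistory class, the keep parameter, and the image tags are hypothetical names chosen for this example, not part of any particular deployment tool.

```python
from collections import deque


class ReleaseHistory:
    """Keeps the last few deployed image tags so a rollback target always exists."""

    def __init__(self, keep: int = 5) -> None:
        # Oldest entries are discarded automatically once 'keep' is exceeded.
        self._images: deque[str] = deque(maxlen=keep)

    def deploy(self, image: str) -> None:
        self._images.append(image)

    @property
    def current(self) -> str:
        return self._images[-1]

    def rollback(self) -> str:
        """Discard the current release and return the previous one."""
        if len(self._images) < 2:
            raise RuntimeError("no previous release retained to roll back to")
        self._images.pop()
        return self._images[-1]


history = ReleaseHistory(keep=3)
history.deploy("app:2.13.0")
history.deploy("app:2.14.0")

restored = history.rollback()  # one call, no manual steps
```

The error raised when only one release is retained shows why keeping just the latest artifact is not enough: there would be nothing to roll back to.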
