SRE (Site Reliability Engineering) - Definition and examples

SRE (Site Reliability Engineering) is a discipline created by Google in 2003 to solve a specific problem: how do you keep massive, complex systems running reliably without burning out your operations team? The answer is to treat operations as a software engineering problem. SREs write code to automate away operational work, define reliability targets with math, and use error budgets to balance reliability against feature velocity.

The core concept is the error budget. If your SLO (service level objective) is 99.9% availability, you have a 0.1% allowance for failure, which is about 43 minutes of downtime in a 30-day month. That is your error budget. If you have budget remaining, you can take risks: deploy more often, run experiments, migrate databases. If your budget is spent, you freeze changes and focus on reliability. This turns the abstract question "are we reliable enough?" into a concrete number that teams can act on.
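The arithmetic behind an error budget is short enough to sketch in code. The function below is an illustration, not a standard API: the name error_budget_minutes and the 30-day default window are assumptions chosen to match the figures in this article.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO.

    slo is a fraction such as 0.999 (99.9%). The 30-day window is an
    assumption; teams also use calendar months or rolling windows.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


if __name__ == "__main__":
    for slo in (0.999, 0.9995):
        minutes = error_budget_minutes(slo)
        print(f"SLO {slo:.2%}: {minutes:.1f} minutes per 30 days")
```

For a 99.9% SLO this works out to 43.2 minutes per 30 days, and for 99.95% to 21.6 minutes, which is where the rounded figures of 43 and 22 minutes come from.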

SRE teams are typically embedded within engineering organizations. They are software engineers who happen to work on reliability. They build monitoring systems, automate incident response, design capacity planning tools, and write the infrastructure that keeps services running. The book Site Reliability Engineering, written by Google engineers and published by O'Reilly in 2016, became the definitive guide and helped make SRE a standard practice at companies running services at scale.

Examples

An SRE team implements error budgets.

The payment service has an SLO of 99.95% availability (about 22 minutes of allowed downtime per month). In January, two incidents consume 18 minutes of the budget. The SRE team communicates this to the product team: only about 4 minutes of budget remain. The product team postpones a risky database migration to February and focuses on smaller, safer changes for the rest of January.
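The decision in this example can be expressed as a small budget check. This is a minimal sketch under stated assumptions: the split of the 18 minutes into incidents of 10 and 8 minutes, the worst-case downtime estimates, and the can_ship policy are all hypothetical and not part of the scenario above.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_minutes: float
    consumed_minutes: float

    @property
    def remaining_minutes(self) -> float:
        return self.allowed_minutes - self.consumed_minutes


def budget_status(slo: float, incident_minutes: list[float],
                  window_days: int = 30) -> BudgetStatus:
    """Compare downtime consumed by incidents against the SLO's allowance."""
    allowed = (1.0 - slo) * window_days * 24 * 60
    return BudgetStatus(allowed, sum(incident_minutes))


def can_ship(status: BudgetStatus, worst_case_minutes: float) -> bool:
    """Hypothetical policy: a change may ship only if its estimated
    worst-case downtime fits inside the remaining budget."""
    return worst_case_minutes <= status.remaining_minutes


if __name__ == "__main__":
    # Assumed split of the 18 minutes into two incidents.
    status = budget_status(0.9995, [10.0, 8.0])
    print(f"remaining: {status.remaining_minutes:.1f} minutes")
    print("database migration (assumed 15 min worst case):", can_ship(status, 15.0))
    print("small config change (assumed 1 min worst case):", can_ship(status, 1.0))
```

With the unrounded allowance of 21.6 minutes, the remaining budget is 3.6 minutes, which the example rounds to 4. Under the assumed estimates, the migration is blocked and the small change is allowed.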

An SRE automates a toil-heavy process.

Every week, an engineer manually rotates TLS certificates for 40 services. It takes 3 hours. The SRE writes a system that automatically rotates certificates 30 days before expiration, validates the new certificates, and alerts if rotation fails. Three hours of weekly toil becomes zero. The SRE moves on to the next source of manual work.
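The scheduling part of such a system comes down to a date comparison. The sketch below only covers deciding which certificates are due. Issuing, validating, and alerting are left out, and the service names and expiry dates are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Rotate once a certificate is within 30 days of expiring, as in the example.
ROTATION_WINDOW = timedelta(days=30)


def needs_rotation(not_after: datetime, now: datetime,
                   window: timedelta = ROTATION_WINDOW) -> bool:
    """True when the certificate expires within the rotation window
    (or has already expired)."""
    return not_after - now <= window


def due_for_rotation(expiries: dict[str, datetime], now: datetime) -> list[str]:
    """Names of services whose certificates should be rotated, sorted."""
    return sorted(
        name for name, not_after in expiries.items()
        if needs_rotation(not_after, now)
    )


if __name__ == "__main__":
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    # Hypothetical inventory: service name -> certificate expiry time.
    expiries = {
        "payments": now + timedelta(days=10),
        "search": now + timedelta(days=90),
        "auth": now + timedelta(days=30),
    }
    print(due_for_rotation(expiries, now))
```

Passing the current time in as an argument, rather than reading the clock inside the function, keeps the check easy to test. A real system would run this on a schedule and hand the resulting list to whatever issues the new certificates.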

A company hires its first SRE.

The startup has 15 engineers and is experiencing monthly outages. They hire an SRE who defines SLOs for each service, sets up Datadog dashboards and alerts, creates runbooks for common failures, and implements automated rollbacks. Within six months, mean time to recovery drops from 2 hours to 15 minutes. The on-call rotation goes from dreaded to manageable.
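Mean time to recovery is the average time between detecting an incident and resolving it. A minimal sketch of that calculation follows. The incident timestamps are invented, and real teams differ on whether the clock starts at detection or at the start of impact.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(
    incidents: list[tuple[datetime, datetime]],
) -> timedelta:
    """Average of (resolved - detected) over a list of incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


if __name__ == "__main__":
    # Hypothetical incidents as (detected, resolved) pairs.
    incidents = [
        (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 9, 10)),
        (datetime(2024, 3, 14, 22, 30), datetime(2024, 3, 14, 22, 50)),
    ]
    print(mean_time_to_recovery(incidents))
```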

Frequently asked questions

How is an SRE different from a DevOps engineer?

An SRE is a software engineer who applies engineering principles to operations. A DevOps engineer typically focuses on CI/CD pipelines, infrastructure automation, and tooling. SREs tend to go deeper: they define SLOs, manage error budgets, analyze failure patterns, and design systems for reliability. In practice, the overlap is significant. At Google, SREs are expected to spend at least 50% of their time on engineering work rather than operational work. At many companies, the DevOps engineer and SRE roles are nearly identical. The title matters less than the skills.

What is toil in SRE?

Toil is manual, repetitive, automatable work that scales linearly with service growth. Restarting a server manually is toil. Rotating certificates by hand is toil. Reviewing logs to find errors instead of having alerts is toil. SRE teams aim to spend no more than 50% of their time on toil. The other 50% goes to engineering work that eliminates toil permanently. If an SRE is spending 80% of their time on toil, the team is understaffed or under-automated.
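The 50% guideline is easy to check if the team tracks how its time is spent. The sketch below assumes a simple log of hours per category. The category names, the hours, and the choice of which categories count as toil are hypothetical.

```python
TOIL_CAP = 0.5  # the 50% guideline described above


def toil_fraction(hours_by_category: dict[str, float],
                  toil_categories: set[str]) -> float:
    """Share of total logged hours spent in categories labeled as toil."""
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours logged")
    toil = sum(
        hours for category, hours in hours_by_category.items()
        if category in toil_categories
    )
    return toil / total


def over_toil_cap(fraction: float, cap: float = TOIL_CAP) -> bool:
    """True when toil exceeds the cap."""
    return fraction > cap


if __name__ == "__main__":
    # Hypothetical one-week log for one engineer, in hours.
    week = {"manual_restarts": 20.0, "cert_rotation": 12.0, "automation": 8.0}
    fraction = toil_fraction(week, {"manual_restarts", "cert_rotation"})
    print(f"toil: {fraction:.0%}, over cap: {over_toil_cap(fraction)}")
```

In this made-up week, 32 of 40 hours are toil, so the share is 80% and the cap is exceeded.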
