Incident

An incident is an event that breaks or degrades a production service enough to require a coordinated response. A single failed request is not an incident. A 30-minute outage affecting thousands of customers is. The line between "everything is fine" and "we have an incident" is usually defined by severity levels and impact thresholds.

Incident management is the structured process for detecting, responding to, and resolving incidents. Good teams have a clear playbook: who gets paged, who leads the response, how communication flows, and when to escalate. PagerDuty, Opsgenie, and incident.io are common tools for paging and coordinating the response. The incident commander runs the response while others investigate, communicate with stakeholders, and implement fixes.

Much of the value of incident management comes after the incident is resolved. Postmortems document what went wrong, why, and how to prevent recurrence. Companies with strong incident cultures (Google, Cloudflare, Atlassian) publish postmortems publicly, building trust through transparency. A company that never declares incidents is not more reliable. It is more likely just not measuring.

Examples

An on-call engineer declares a SEV-1 incident.

Monitoring shows the API error rate at 25% and rising. The on-call engineer declares a SEV-1, which automatically pages the incident commander, creates a Slack channel (#inc-20260215-api-outage), and notifies the VP of Engineering. Within five minutes, six engineers are in the channel triaging the problem.
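The automation in this example can be sketched in a few lines of Python. This is a minimal illustration, not any particular tool's API: the `Incident` class, the `declare` function, and the `ROUTING` table are hypothetical names, and the SEV-2 and SEV-3 routing rows are assumptions added for contrast. Only the SEV-1 behavior and the channel naming pattern come from the example above.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Incident:
    severity: str        # e.g. "SEV-1"
    slug: str            # short description, e.g. "api-outage"
    declared_on: date
    actions: list[str] = field(default_factory=list)

    @property
    def channel(self) -> str:
        # Naming pattern taken from the example: #inc-YYYYMMDD-slug
        return f"#inc-{self.declared_on:%Y%m%d}-{self.slug}"


# Hypothetical routing table. Only the SEV-1 row reflects the example;
# the SEV-2 and SEV-3 rows are assumptions.
ROUTING = {
    "SEV-1": [("page", "incident-commander"), ("notify", "vp-engineering")],
    "SEV-2": [("page", "incident-commander")],
    "SEV-3": [],
}


def declare(severity: str, slug: str, declared_on: date) -> Incident:
    """Record the actions a declaration would trigger (no real side effects)."""
    incident = Incident(severity, slug, declared_on)
    incident.actions.append(f"create channel {incident.channel}")
    for verb, role in ROUTING.get(severity, []):
        incident.actions.append(f"{verb} {role}")
    return incident


incident = declare("SEV-1", "api-outage", date(2026, 2, 15))
print(incident.channel)  # #inc-20260215-api-outage
```

In a real setup each recorded action would call the paging or chat tool's own API. Keeping the routing in a table makes the severity policy easy to review and change.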

A company manages customer communication during an incident.

The status page is updated within 10 minutes of declaration: "We are investigating elevated error rates on the REST API." Updates follow every 15 minutes. The support team sends proactive emails to affected enterprise customers. After resolution, a detailed postmortem is published within 48 hours.
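The communication cadence above reduces to simple date arithmetic. The sketch below assumes the three intervals from the example (10 minutes, 15 minutes, 48 hours). The function names and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Intervals taken from the example above.
FIRST_UPDATE = timedelta(minutes=10)
UPDATE_INTERVAL = timedelta(minutes=15)
POSTMORTEM_WINDOW = timedelta(hours=48)


def status_update_deadlines(declared: datetime, resolved: datetime) -> list[datetime]:
    """Times by which a status page update is due while the incident is open."""
    deadlines = []
    due = declared + FIRST_UPDATE
    while due < resolved:
        deadlines.append(due)
        due += UPDATE_INTERVAL
    return deadlines


def postmortem_deadline(resolved: datetime) -> datetime:
    """Latest publication time for the postmortem."""
    return resolved + POSTMORTEM_WINDOW


# Hypothetical incident: declared at 12:00, resolved at 12:47.
declared = datetime(2026, 2, 15, 12, 0)
resolved = datetime(2026, 2, 15, 12, 47)
for due in status_update_deadlines(declared, resolved):
    print(due.strftime("%H:%M"))  # 12:10, 12:25, 12:40
print(postmortem_deadline(resolved))  # 2026-02-17 12:47:00
```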

A team tracks incident metrics over time.

Over the past quarter, the team had 12 incidents: 2 SEV-1, 4 SEV-2, and 6 SEV-3. Mean time to detect (MTTD) was 3 minutes. Mean time to resolve (MTTR) was 47 minutes. The two SEV-1s both involved the same database cluster. The team prioritizes a database migration as a result.
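Metrics like these can be computed from per-incident timestamps. The sketch below uses a hypothetical `IncidentRecord` type and made-up sample data. It also assumes both MTTD and MTTR are measured from the moment the problem started; some teams instead measure MTTR from detection, so check which definition your tooling uses.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class IncidentRecord:
    severity: str
    started: datetime    # when the problem began
    detected: datetime   # when monitoring or a person noticed it
    resolved: datetime   # when service was restored


def _mean_minutes(deltas: list[timedelta]) -> float:
    return mean(d.total_seconds() / 60 for d in deltas)


def mttd(records: list[IncidentRecord]) -> float:
    """Mean time to detect, in minutes (start to detection)."""
    return _mean_minutes([r.detected - r.started for r in records])


def mttr(records: list[IncidentRecord]) -> float:
    """Mean time to resolve, in minutes (start to resolution; an assumption)."""
    return _mean_minutes([r.resolved - r.started for r in records])


# Made-up sample data for illustration only.
t0 = datetime(2026, 1, 10, 9, 0)
records = [
    IncidentRecord("SEV-1", t0, t0 + timedelta(minutes=2), t0 + timedelta(minutes=40)),
    IncidentRecord("SEV-2", t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=54)),
]

print(mttd(records))                         # 3.0
print(mttr(records))                         # 47.0
print(Counter(r.severity for r in records))  # one SEV-1, one SEV-2
```

Grouping the same records by affected component is how a pattern such as "both SEV-1s involved the same database cluster" would surface.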

Frequently asked questions

When should you declare an incident?

Declare an incident when a production issue meets any of these criteria: it affects a significant number of users, it breaches an SLO, it requires coordination across multiple teams, or it cannot be resolved by a single engineer in a few minutes. When in doubt, declare. It is far better to declare and resolve quickly than to let a problem grow while debating whether it counts.
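The criteria above can be written as a single predicate. This is a sketch under stated assumptions: the `Issue` fields and both thresholds (`SIGNIFICANT_USERS`, `QUICK_FIX_MINUTES`) are hypothetical placeholders, since "significant" and "a few minutes" depend on the service. An unknown fix time is treated as a reason to declare, following the "when in doubt, declare" rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Issue:
    affected_users: int
    slo_breached: bool
    teams_involved: int
    est_minutes_to_fix: Optional[float] = None  # None means "nobody knows yet"


# Hypothetical thresholds; tune these to your own service.
SIGNIFICANT_USERS = 1000
QUICK_FIX_MINUTES = 5


def should_declare(issue: Issue) -> bool:
    """Return True if the issue meets any of the declaration criteria."""
    fix_time_unknown_or_long = (
        issue.est_minutes_to_fix is None
        or issue.est_minutes_to_fix > QUICK_FIX_MINUTES
    )
    return any([
        issue.affected_users >= SIGNIFICANT_USERS,
        issue.slo_breached,
        issue.teams_involved > 1,
        fix_time_unknown_or_long,  # when in doubt, declare
    ])


print(should_declare(Issue(20, False, 1, est_minutes_to_fix=2)))  # False
print(should_declare(Issue(20, False, 1)))                        # True
```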

What is the difference between an incident and a bug?

A bug is a defect in code. An incident is an event that disrupts service. Many incidents are caused by bugs, but not all bugs cause incidents. A bug in a rarely used feature might go unnoticed for months. An incident demands immediate attention because users are being affected right now. Some incidents are not caused by bugs at all: hardware failures, traffic spikes, and third-party outages can all disrupt service.
