Incident
IN-sih-dent
An unplanned event that disrupts a service or degrades it below its expected quality, requiring a coordinated response.
An incident is an event that breaks or degrades a production service enough to require a coordinated response. A single failed request is not an incident. A 30-minute outage affecting thousands of customers is. The line between "everything is fine" and "we have an incident" is usually defined by severity levels and impact thresholds.
Incident management is the structured process for detecting, responding to, and resolving incidents. Good teams have a clear playbook: who gets paged, who leads the response, how communication flows, and when to escalate. PagerDuty, Opsgenie, and incident.io are common tools for paging and coordination. The incident commander runs the response while others investigate, communicate with stakeholders, and implement fixes.
The real value of incident management is what happens after. Postmortems document what went wrong, why, and how to prevent recurrence. Companies with strong incident cultures (Google, Cloudflare, Atlassian) publish postmortems, building trust through transparency. A company that reports no incidents is not necessarily more reliable. It may simply not be measuring.
Examples
An on-call engineer declares a SEV-1 incident.
Monitoring shows the API error rate at 25% and rising. The on-call engineer declares a SEV-1, which automatically pages the incident commander, creates a Slack channel (#inc-20260215-api-outage), and notifies the VP of Engineering. Within five minutes, six engineers are in the channel triaging the problem.
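The channel name in this example follows a date-plus-summary pattern. A minimal sketch of how such a name might be generated, assuming the pattern is `#inc-YYYYMMDD-slug` (inferred from the single name above; the function `incident_channel_name` is hypothetical, not part of any tool's API):

```python
import re
from datetime import date


def incident_channel_name(declared_on: date, summary: str) -> str:
    """Build a channel name such as '#inc-20260215-api-outage'."""
    # Lowercase the summary and collapse every run of characters that is
    # not a letter or digit into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    return f"#inc-{declared_on:%Y%m%d}-{slug}"


print(incident_channel_name(date(2026, 2, 15), "API outage"))
# prints: #inc-20260215-api-outage
```

A predictable name lets responders find the channel from the declaration date alone, and sorts channels chronologically.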
A company manages customer communication during an incident.
The status page is updated within 10 minutes of declaration: 'We are investigating elevated error rates on the REST API.' Updates follow every 15 minutes. The support team sends proactive emails to affected enterprise customers. After resolution, a detailed postmortem is published within 48 hours.
A team tracks incident metrics over time.
Over the past quarter, the team had 12 incidents: 2 SEV-1, 4 SEV-2, and 6 SEV-3. Mean time to detect (MTTD) was 3 minutes. Mean time to resolve (MTTR) was 47 minutes. The two SEV-1s both involved the same database cluster. The team prioritizes a database migration as a result.
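Metrics like these can be computed from per-incident timestamps. The sketch below is one way to do it, under stated assumptions: the `Incident` record and its field names are hypothetical, the two sample incidents are invented for illustration, and both means are measured from the moment the problem started (some teams measure time to resolve from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    severity: str
    started_at: datetime   # when the problem began
    detected_at: datetime  # when monitoring or a person noticed it
    resolved_at: datetime  # when service was restored


def mean_minutes(deltas: list[timedelta]) -> float:
    """Average a list of durations and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mttd_minutes(incidents: list[Incident]) -> float:
    """Mean time to detect: start of the problem until detection."""
    return mean_minutes([i.detected_at - i.started_at for i in incidents])


def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean time to resolve: start of the problem until resolution."""
    return mean_minutes([i.resolved_at - i.started_at for i in incidents])


# Invented sample data, not taken from a real system.
t0 = datetime(2026, 1, 10, 9, 0)
sample = [
    Incident("SEV-1", t0, t0 + timedelta(minutes=2), t0 + timedelta(minutes=40)),
    Incident("SEV-2", t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=54)),
]

print(mttd_minutes(sample))  # 3.0
print(mttr_minutes(sample))  # 47.0
```

Grouping the same records by severity or by affected component is what surfaces patterns such as repeated failures in one database cluster.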
Frequently asked questions
When should you declare an incident?
Declare an incident when a production issue meets any of these criteria: it affects a significant number of users, it breaches an SLO, it requires coordination across multiple teams, or it cannot be resolved by a single engineer in a few minutes. When in doubt, declare. It is far better to declare and resolve quickly than to let a problem grow while debating whether it counts.
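The "any of these criteria" rule can be written as a simple check. This is a sketch only: the `IssueSignal` fields are hypothetical names, and the `SIGNIFICANT_USERS` threshold is an assumed placeholder, since what counts as "significant" depends on the service.

```python
from dataclasses import dataclass

# Assumed placeholder; each team sets its own threshold for "significant".
SIGNIFICANT_USERS = 100


@dataclass
class IssueSignal:
    affected_users: int
    slo_breached: bool
    teams_involved: int
    quick_fix_by_one_engineer: bool  # resolvable by one person in minutes


def should_declare(issue: IssueSignal) -> bool:
    """Return True if the issue meets any one of the declaration criteria."""
    return any([
        issue.affected_users >= SIGNIFICANT_USERS,
        issue.slo_breached,
        issue.teams_involved > 1,
        not issue.quick_fix_by_one_engineer,
    ])


minor = IssueSignal(affected_users=3, slo_breached=False,
                    teams_involved=1, quick_fix_by_one_engineer=True)
slo_hit = IssueSignal(affected_users=3, slo_breached=True,
                      teams_involved=1, quick_fix_by_one_engineer=True)

print(should_declare(minor))    # False
print(should_declare(slo_hit))  # True
```

Because the criteria are combined with `any`, a single true signal is enough, which matches the "when in doubt, declare" guidance.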
What is the difference between an incident and a bug?
A bug is a defect in code. An incident is an event that disrupts service. Many incidents are caused by bugs, but not all bugs cause incidents. A bug in a rarely used feature might go unnoticed for months. An incident demands immediate attention because users are being affected right now. Some incidents are not caused by bugs at all: hardware failures, traffic spikes, and third-party outages are common examples.
Related terms
A classification system (SEV-1 through SEV-4) that ranks incidents by impact and urgency to determine response priority.
A written analysis of an incident: what happened, why, and what the team will do to prevent it from recurring.
A rotation where engineers are responsible for responding to production alerts and incidents outside business hours.
The percentage of time a system is operational and accessible to users.
The percentage of requests that fail compared to total requests, usually measured over a rolling time window.
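The last definition above, failed requests as a share of all requests over a rolling window, can be sketched as follows. The class `RollingErrorRate` is a hypothetical illustration that assumes requests are recorded in non-decreasing timestamp order; real monitoring systems usually compute this from aggregated counters rather than per-request records.

```python
from collections import deque


class RollingErrorRate:
    """Track the share of failed requests within the last `window_seconds`."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, failed) pairs, oldest first
        self._failures = 0

    def record(self, timestamp: float, failed: bool) -> None:
        """Add one request outcome, then drop outcomes older than the window."""
        self._events.append((timestamp, failed))
        self._failures += int(failed)
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_failed = self._events.popleft()
            self._failures -= int(old_failed)

    def rate(self) -> float:
        """Return failures / total for the current window (0.0 if empty)."""
        if not self._events:
            return 0.0
        return self._failures / len(self._events)


tracker = RollingErrorRate(window_seconds=60)
for ts, failed in [(0, True), (10, False), (20, False), (30, True)]:
    tracker.record(ts, failed)
print(tracker.rate())  # 0.5

tracker.record(65, False)  # the request at t=0 falls out of the window
print(tracker.rate())  # 0.25
```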
