Error rate
AIR-ur rayt
The percentage of requests that fail compared to total requests, usually measured over a rolling time window.
Error rate is the percentage of requests to a service that fail. If your API handles 10,000 requests per minute and 50 return errors, your error rate is 0.5%. It is one of the most important SLIs because it directly measures whether your service is working.
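A minimal sketch of how a rolling-window error rate could be computed is shown here. The class name `RollingErrorRate`, its methods, and the 60-second default window are illustrative assumptions, not part of any particular monitoring tool.

```python
from collections import deque


class RollingErrorRate:
    """Tracks the share of failed requests over the last `window_seconds`.

    Hypothetical helper for illustration only.
    """

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, failed) pairs, oldest first
        self._failures = 0      # failed requests currently inside the window

    def record(self, timestamp: float, failed: bool) -> None:
        """Add one request outcome, then drop anything older than the window."""
        self._events.append((timestamp, failed))
        if failed:
            self._failures += 1
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_failed = self._events.popleft()
            if old_failed:
                self._failures -= 1

    def rate_percent(self) -> float:
        """Failed requests as a percentage of all requests in the window."""
        total = len(self._events)
        return 100.0 * self._failures / total if total else 0.0


# 10,000 requests spread over one minute, 50 of them failing.
tracker = RollingErrorRate(window_seconds=60.0)
for i in range(10_000):
    tracker.record(timestamp=i * 0.006, failed=(i % 200 == 0))
print(tracker.rate_percent())  # 0.5
```

A production system would more likely aggregate counters per time bucket than store every request, but the ratio being computed is the same.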
Not all errors are equal. A 500 Internal Server Error means something broke on your side. A 429 Too Many Requests means the client is sending too much traffic and your rate limiter is working correctly. A 404 Not Found might be a typo in the URL. Good error rate tracking distinguishes between server errors (5xx), which are your fault, and client errors (4xx), which are usually not.
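The 5xx versus 4xx split described above can be sketched as a small classifier. The function names `classify` and `error_rates` and the sample traffic are assumptions made for illustration.

```python
from collections import Counter


def classify(status: int) -> str:
    """Bucket an HTTP status code by who is usually responsible for it."""
    if 500 <= status <= 599:
        return "server_error"   # the service failed
    if 400 <= status <= 499:
        return "client_error"   # the request was rejected
    return "ok"


def error_rates(statuses: list[int]) -> dict[str, float]:
    """Return server and client error rates as percentages of all requests."""
    counts = Counter(classify(s) for s in statuses)
    total = len(statuses)
    if total == 0:
        return {"server_error": 0.0, "client_error": 0.0}
    return {
        "server_error": 100.0 * counts["server_error"] / total,
        "client_error": 100.0 * counts["client_error"] / total,
    }


# Hypothetical sample: 100 requests, one 500, two 429s, one 404.
sample = [200] * 96 + [500] + [429, 429] + [404]
print(error_rates(sample))  # {'server_error': 1.0, 'client_error': 3.0}
```

Keeping the two rates separate lets the server error rate drive alerts while the client error rate stays available for investigation.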
Error rate spikes are often the fastest signal that something is wrong. Latency might degrade slowly. Availability might look fine in aggregate. But a sudden jump from 0.1% to 5% error rate is unmistakable. Many alerting setups fire on error rate first. Stripe, for example, monitors error rates per API endpoint per customer segment, catching problems before they become widespread outages.
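One simple way to turn "a sudden jump" into an alert rule is to require the current rate to be both high in absolute terms and well above its recent baseline. The function `is_spike` and its default thresholds (1% floor, 10x baseline) are assumptions chosen for illustration, not recommended values.

```python
def is_spike(
    baseline_percent: float,
    current_percent: float,
    min_percent: float = 1.0,
    ratio: float = 10.0,
) -> bool:
    """Return True when the current error rate looks like a spike.

    Both conditions must hold:
      - the current rate is at least `min_percent`, so tiny rates never page
      - the current rate is at least `ratio` times the baseline
    """
    return (
        current_percent >= min_percent
        and current_percent >= ratio * baseline_percent
    )


print(is_spike(baseline_percent=0.1, current_percent=5.0))   # True
print(is_spike(baseline_percent=0.1, current_percent=0.12))  # False
```

The absolute floor keeps a low-traffic service from paging when a handful of failures moves the rate from, say, 0.01% to 0.2%.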
Examples
An on-call engineer investigates an error rate spike.
At 2:14 AM, the error rate on the payments API jumps from 0.05% to 12%. The engineer checks the error breakdown: all 5xx errors, all from the same downstream dependency. They fail over to the backup provider and the error rate drops to 0.03% within two minutes.
A team sets an SLO based on error rate.
The search service has a 0.1% error rate SLO measured over a rolling 30-day window. That means no more than 1 in 1,000 requests can fail. With 100 million requests per month, the error budget allows 100,000 failed requests. The team tracks this budget daily on their dashboard.
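The error budget arithmetic in this example can be written out as a short sketch. The function names are hypothetical, and the SLO is passed as a string so that `Fraction` can represent 0.1% exactly instead of relying on floating-point rounding.

```python
from fractions import Fraction


def error_budget(total_requests: int, slo_error_percent: str) -> int:
    """Number of requests allowed to fail under the SLO (rounded down)."""
    allowed_fraction = Fraction(slo_error_percent) / 100
    return int(total_requests * allowed_fraction)


def budget_remaining_percent(budget: int, failed_so_far: int) -> float:
    """Share of the error budget still unspent, as a percentage."""
    return 100.0 * (budget - failed_so_far) / budget


budget = error_budget(total_requests=100_000_000, slo_error_percent="0.1")
print(budget)  # 100000

# Hypothetical mid-month check: 25,000 failures so far.
print(budget_remaining_percent(budget, failed_so_far=25_000))  # 75.0
```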
A product manager uses error rate data in a customer conversation.
An enterprise customer reports intermittent failures. The PM pulls error rate data filtered by customer ID: 0.8% error rate over the past week, 8x the platform average. Engineering traces it to a specific query pattern that triggers a timeout. They ship a fix within 48 hours.
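Filtering error rate by customer, as the PM does here, amounts to grouping request outcomes by customer ID before dividing. The sketch below assumes request logs are available as `(customer_id, status_code)` pairs; the function name, the customer IDs, and the sample data are invented for illustration.

```python
from collections import defaultdict


def error_rate_by_customer(requests) -> dict[str, float]:
    """Server error rate (percent) per customer.

    `requests` is an iterable of (customer_id, status_code) pairs.
    Only 5xx responses count as failures.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for customer_id, status in requests:
        totals[customer_id] += 1
        if 500 <= status <= 599:
            failures[customer_id] += 1
    return {c: 100.0 * failures[c] / totals[c] for c in totals}


# Hypothetical week of traffic: one customer hits timeouts (504) far more often.
logs = (
    [("enterprise-co", 504)] * 8 + [("enterprise-co", 200)] * 992
    + [("typical-co", 500)] * 1 + [("typical-co", 200)] * 999
)
rates = error_rate_by_customer(logs)
print(rates["enterprise-co"])  # 0.8
print(rates["typical-co"])     # 0.1
```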
Frequently asked questions
What is an acceptable error rate?
It depends on the service. Payment processing APIs typically target below 0.01% (one in ten thousand). General-purpose APIs target below 0.1%. Internal tools might tolerate 1%. The key is setting an explicit target (SLO) rather than assuming any error rate is acceptable. Zero errors is neither realistic nor necessary.
Should client errors (4xx) count toward error rate?
Usually no. A 4xx response means the client sent a bad request, not that your service is broken. Most teams track server errors (5xx) as the primary error rate SLI. But tracking 4xx separately is still useful. A sudden spike in 400 Bad Request errors might indicate a broken client SDK or a confusing API change.
Related terms
Service level indicator: a specific metric used to measure the reliability of a service, like latency or error rate.
Service level objective: an internal reliability target for a service, like 99.9% availability or p99 latency under 200ms.
Availability: the percentage of time a system is operational and accessible to users.
Incident: an unplanned event that disrupts a service or degrades it below its expected quality, requiring a coordinated response.
Rate limiting: restricting how many requests a client can make to an API within a time window to prevent abuse and overload.
