Error rate

Error rate is the percentage of requests to a service that fail. If your API handles 10,000 requests per minute and 50 return errors, your error rate is 0.5%. It is one of the most important SLIs because it directly measures whether your service is working.
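
The arithmetic above can be sketched in a few lines of Python. The function name error_rate and the choice to report 0% when there is no traffic are assumptions made for illustration, not part of any particular monitoring tool.

```python
def error_rate(failed: int, total: int) -> float:
    """Return the share of failed requests as a percentage."""
    if total == 0:
        # Assumption: an idle service is reported as 0% rather than "undefined".
        return 0.0
    return 100.0 * failed / total


# The example from the text: 50 failures out of 10,000 requests.
rate = error_rate(failed=50, total=10_000)
print(f"{rate:.2f}%")  # prints "0.50%"
```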

Not all errors are equal. A 500 Internal Server Error means your code broke. A 429 Too Many Requests means the client is sending too much traffic and your rate limiter is working correctly. A 404 Not Found might be a typo in the URL. Good error rate tracking distinguishes between server errors (5xx), which are your fault, and client errors (4xx), which are usually not.
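
One way to keep server and client errors apart is to bucket responses by HTTP status class before computing any rate. The sketch below assumes you already have a list of status codes for a time window; the names classify and breakdown are hypothetical.

```python
from collections import Counter


def classify(status: int) -> str:
    """Bucket an HTTP status code into ok, client_error, or server_error."""
    if 500 <= status <= 599:
        return "server_error"
    if 400 <= status <= 499:
        return "client_error"
    return "ok"


def breakdown(statuses: list[int]) -> dict[str, float]:
    """Return the percentage of responses in each bucket."""
    counts = Counter(classify(s) for s in statuses)
    total = len(statuses)
    if total == 0:
        return {"ok": 0.0, "client_error": 0.0, "server_error": 0.0}
    return {
        bucket: 100.0 * counts[bucket] / total
        for bucket in ("ok", "client_error", "server_error")
    }


# Made-up sample window: 95 successes, 3 client errors, 2 server errors.
sample = [200] * 95 + [404, 429, 400] + [500, 503]
print(breakdown(sample))
```

In this sample, only the 2% in the server_error bucket would count toward a 5xx-based error rate; the 3% of client errors would be tracked separately.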

Error rate spikes are often the fastest signal that something is wrong. Latency might degrade slowly. Availability might look fine in aggregate. But a sudden jump from 0.1% to 5% error rate is unmistakable, which is why many teams alert on error rate first. Stripe, for example, monitors error rates per API endpoint per customer segment, catching problems before they become widespread outages.
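
A spike check like the one described can be approximated by comparing the current window against a baseline. This is a minimal sketch, not how any specific alerting system works: the function is_spike, the 1% floor, and the 10x multiplier are all illustrative assumptions that you would tune for your own traffic.

```python
def is_spike(
    current_pct: float,
    baseline_pct: float,
    min_pct: float = 1.0,   # assumed floor: ignore rates below 1%
    factor: float = 10.0,   # assumed multiplier over the baseline
) -> bool:
    """Return True if the current error rate looks like a spike.

    Both conditions must hold, so a tiny baseline does not turn
    ordinary noise into an alert.
    """
    return current_pct >= min_pct and current_pct >= factor * baseline_pct


# The jump from the text: 0.1% baseline, 5% now.
print(is_spike(current_pct=5.0, baseline_pct=0.1))   # True
# A small wobble above baseline stays quiet.
print(is_spike(current_pct=0.15, baseline_pct=0.1))  # False
```

Requiring both an absolute floor and a relative jump is a design choice: the floor suppresses alerts on low-traffic noise, and the multiplier keeps a service with a naturally higher baseline from paging constantly.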

Examples

An on-call engineer investigates an error rate spike.

At 2:14 AM, the error rate on the payments API jumps from 0.05% to 12%. The engineer checks the error breakdown: all 5xx errors, all from the same downstream dependency. They fail over to the backup provider and the error rate drops to 0.03% within two minutes.

A team sets an SLO based on error rate.

The search service has a 0.1% error rate SLO measured over a rolling 30-day window. That means no more than 1 in 1,000 requests can fail. With 100 million requests per month, the error budget allows 100,000 failed requests. The team tracks this budget daily on their dashboard.
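
The budget arithmetic in this example can be sketched as follows. The function names and the figure of 40,000 failures so far are hypothetical, chosen only to show how a daily dashboard number might be derived.

```python
def error_budget(total_requests: int, slo_error_pct: float) -> int:
    """Number of requests allowed to fail in the window under the SLO."""
    return round(total_requests * slo_error_pct / 100)


def budget_status(budget: int, failed_so_far: int) -> tuple[int, float]:
    """Return (remaining failures allowed, percent of budget consumed)."""
    remaining = budget - failed_so_far
    consumed_pct = 100.0 * failed_so_far / budget if budget else 0.0
    return remaining, consumed_pct


# The example from the text: 100 million requests, 0.1% error rate SLO.
budget = error_budget(total_requests=100_000_000, slo_error_pct=0.1)
print(budget)  # 100000

# Hypothetical mid-window reading: 40,000 requests have failed so far.
remaining, consumed = budget_status(budget, failed_so_far=40_000)
print(remaining, f"{consumed:.0f}%")  # 60000 40%
```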

A product manager uses error rate data in a customer conversation.

An enterprise customer reports intermittent failures. The PM pulls error rate data filtered by customer ID: 0.8% error rate over the past week, 8x the platform average. Engineering traces it to a specific query pattern that triggers a timeout. They ship a fix within 48 hours.
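
Filtering error rate by customer amounts to grouping requests by customer ID before dividing. The sketch below assumes each log record is a (customer_id, status) pair and that only 5xx responses count as failures; the customer names and counts are made up to mirror the 0.8% versus 0.1% comparison above.

```python
from collections import defaultdict


def error_rate_by_customer(records: list[tuple[str, int]]) -> dict[str, float]:
    """Return the 5xx error rate (as a percentage) for each customer ID."""
    totals: dict[str, int] = defaultdict(int)
    failures: dict[str, int] = defaultdict(int)
    for customer_id, status in records:
        totals[customer_id] += 1
        if 500 <= status <= 599:  # assumption: only server errors count
            failures[customer_id] += 1
    return {c: 100.0 * failures[c] / totals[c] for c in totals}


# Made-up records: "acme" fails 8 of 1,000 requests, "globex" fails 1 of 1,000.
records = (
    [("acme", 504)] * 8 + [("acme", 200)] * 992
    + [("globex", 500)] * 1 + [("globex", 200)] * 999
)
rates = error_rate_by_customer(records)
print({customer: f"{pct:.1f}%" for customer, pct in rates.items()})
```

Comparing each customer's rate against the platform-wide figure is what surfaces an outlier like the one in this example.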

Frequently asked questions

What is an acceptable error rate?

It depends on the service. Payment processing APIs typically target below 0.01% (one in ten thousand). General-purpose APIs target below 0.1%. Internal tools might tolerate 1%. The key is setting an explicit target (SLO) rather than assuming any error rate is acceptable. Zero errors is neither realistic nor necessary.

Should client errors (4xx) count toward error rate?

Usually no. A 4xx response means the client sent a bad request, not that your service is broken. Most teams track server errors (5xx) as the primary error rate SLI. But tracking 4xx separately is still useful. A sudden spike in 400 Bad Request errors might indicate a broken client SDK or a confusing API change.
