SLI
ess-ell-eye
Service level indicator: a specific metric used to measure the reliability of a service, like latency or error rate.
An SLI (service level indicator) is a quantitative measurement of some aspect of your service's behavior. It is the raw data that tells you whether your service is healthy. Common SLIs include request latency, error rate, throughput, and availability.
Good SLIs measure what users experience, not what your infrastructure reports. A server might report low, healthy-looking CPU usage while returning errors on every request. The SLI that matters is the error rate seen by users, not the CPU metric seen by your monitoring. This distinction separates useful SLIs from vanity metrics.
Choose SLIs carefully because they drive everything downstream. Your SLOs are targets for SLIs. Your SLAs are promises based on SLOs. Your alerts fire when SLIs breach thresholds. Pick the wrong SLI and you optimize for the wrong thing. Most services need only three to five SLIs, typically availability, latency (p50 and p99), error rate, and throughput.
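The chain from SLI to SLO to alert can be sketched in a few lines. This is a minimal illustration, not a production alerting setup: the Slo class, the function names, and the 99.9% target are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """A target for one SLI (hypothetical structure for illustration)."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must be good


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded; treated as 1.0 when there was no traffic."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def breaches(sli_value: float, slo: Slo) -> bool:
    """True when the measured SLI is below the SLO target, i.e. an alert should fire."""
    return sli_value < slo.target


# Assumed sample numbers: 150 failed requests out of 100,000.
slo = Slo(name="availability", target=0.999)
sli = availability_sli(good_requests=99_850, total_requests=100_000)
print(f"{slo.name}: {sli:.4f} (target {slo.target}) breach={breaches(sli, slo)}")
```

The SLI is only the measured number. The target lives in the SLO, so the same measurement can be judged against a stricter or looser objective without changing how it is collected.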
Examples
A team selects SLIs for a checkout API.
They choose four SLIs: availability (percentage of non-5xx responses), latency (p99 response time), error rate (percentage of failed transactions), and throughput (successful checkouts per minute). Each SLI maps to a user-visible outcome. If latency spikes, carts get abandoned. If error rate climbs, revenue drops.
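As a rough sketch, the four checkout SLIs could be computed from a window of request records like this. The Request fields, the nearest-rank percentile method, and the sample data are assumptions for illustration; a real service would usually read these values from its metrics system.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # time the client waited for the response
    checkout_ok: bool  # whether the checkout transaction completed


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def checkout_slis(requests, window_minutes):
    """Compute the four SLIs over one measurement window."""
    if not requests:
        raise ValueError("no requests in window")
    total = len(requests)
    non_5xx = sum(1 for r in requests if r.status < 500)
    failed = sum(1 for r in requests if not r.checkout_ok)
    return {
        "availability_pct": 100 * non_5xx / total,
        "p99_latency_ms": percentile([r.latency_ms for r in requests], 99),
        "error_rate_pct": 100 * failed / total,
        "throughput_per_min": (total - failed) / window_minutes,
    }


# Assumed sample window of 5 minutes: 97 successes, 2 server errors, 1 declined payment.
sample = (
    [Request(200, 120.0, True)] * 97
    + [Request(500, 900.0, False)] * 2
    + [Request(402, 150.0, False)]
)
print(checkout_slis(sample, window_minutes=5))
```

In this sample, availability and error rate do not sum to 100% because the declined payment is a failed transaction but not a 5xx response, which is one reason the team tracks both.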
An engineer discovers a misleading SLI.
The team measures availability by checking whether a health endpoint behind the load balancer returns 200 OK. But the health check does not test the database connection. The SLI shows 100% availability during a database outage because the health endpoint still responds. They fix the health check to include a database query.
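The difference between the two health checks can be sketched as follows. This is an assumption-heavy illustration: it uses Python's built-in sqlite3 module as a stand-in for the team's real database, and the function names are hypothetical.

```python
import sqlite3


def shallow_health() -> int:
    """Original check: only proves the process can answer, so it always returns 200."""
    return 200


def deep_health(connect) -> int:
    """Revised check: also runs a trivial query and returns 503 if the database fails."""
    try:
        conn = connect()
        try:
            conn.execute("SELECT 1").fetchone()
        finally:
            conn.close()
        return 200
    except sqlite3.Error:
        return 503


def healthy_db():
    # Stand-in for a working database connection.
    return sqlite3.connect(":memory:")


def broken_db():
    # Simulates an outage by raising the error a failed connection would raise.
    raise sqlite3.OperationalError("database unavailable")


print(shallow_health(), deep_health(healthy_db))  # 200 200
print(shallow_health(), deep_health(broken_db))   # 200 503
```

During the simulated outage, the shallow check still reports 200 while the deep check reports 503, which is the gap that made the original SLI misleading.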
A platform team standardizes SLIs across services.
The company has 40 microservices, each with different monitoring setups. The platform team defines four standard SLIs and ships a shared library that instruments them automatically. Within a quarter, every service reports the same metrics in the same format, making cross-service debugging much faster.
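One way such a shared library might work is a decorator that every service wraps around its request handlers. The sketch below is hypothetical: the instrument and report names, the in-memory store, and the checkout handler are assumptions, and a real library would export to a metrics backend rather than keep data in a dictionary.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory store keyed by service name (assumption: stands in for a metrics backend).
_METRICS = defaultdict(lambda: {"requests": 0, "errors": 0, "latencies_ms": []})


def instrument(service: str):
    """Wrap a handler so request count, errors, and latency are recorded the same way everywhere."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            stats = _METRICS[service]
            stats["requests"] += 1
            try:
                return handler(*args, **kwargs)
            except Exception:
                stats["errors"] += 1
                raise
            finally:
                stats["latencies_ms"].append((time.perf_counter() - start) * 1000)
        return wrapper
    return decorator


def report(service: str) -> dict:
    """Return the recorded metrics for one service in a single shared format."""
    stats = _METRICS[service]
    total = stats["requests"]
    return {
        "service": service,
        "requests": total,
        "error_rate_pct": 100 * stats["errors"] / total if total else 0.0,
        "max_latency_ms": max(stats["latencies_ms"], default=0.0),
    }


@instrument("checkout")
def handle_checkout(cart_total: float) -> str:
    if cart_total <= 0:
        raise ValueError("empty cart")
    return "ok"


handle_checkout(25.0)
try:
    handle_checkout(0)
except ValueError:
    pass
print(report("checkout"))
```

Because every service reports through the same wrapper, the output has the same keys and units everywhere, which is what makes cross-service comparison straightforward.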
Frequently asked questions
What are the most common SLIs?
The four most common SLIs are availability (percentage of successful requests), latency (how long requests take, usually p50 and p99), error rate (percentage of requests that fail), and throughput (requests per second the system handles). Most services need all four. Some add saturation (how full your resources are) as a fifth.
How is an SLI different from a regular metric?
Every SLI is a metric, but not every metric is an SLI. CPU usage, memory consumption, and disk I/O are useful metrics for debugging, but they are not SLIs. SLIs specifically measure user-facing service quality. They answer 'Is the user having a good experience?', not 'Is the server busy?'
Related terms
Service level objective: an internal reliability target for a service, like 99.9% availability or p99 latency under 200ms.
Service level agreement: a contractual commitment to specific performance and availability levels.
Error rate: the percentage of requests that fail compared to total requests, usually measured over a rolling time window.
Availability: the percentage of time a system is operational and accessible to users.
Incident: an unplanned event that disrupts a service or degrades it below its expected quality, requiring a coordinated response.