Observability

Observability is the ability to understand what is happening inside your system by looking at what comes out of it. If your application is slow, can you figure out why without adding new code? If a request fails, can you trace its path through 15 microservices to find where it broke? That is observability.

It rests on three pillars: logs (what happened), metrics (how much and how fast), and traces (the path a request took through the system). Logs tell you that a payment failed. Metrics tell you that payment failure rates spiked from 0.1% to 5% at 3:14pm. Traces tell you that the failure happened in the fraud detection service because the third-party API timed out after 30 seconds.
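
As a rough illustration, the same failed payment can be expressed as all three signals. The sketch below uses only the Python standard library. The field names, the service names, and the in-memory counter are assumptions made for this example, not any vendor's schema.

```python
import json
from collections import Counter

# Hypothetical in-memory metric store: outcome -> count.
payment_outcomes = Counter()


def observe_payment(trace_id, succeeded, spans):
    """Produce a log line, update a metric, and build the trace for one payment.

    `spans` is a list of (service, duration_seconds, ok) tuples in call order.
    """
    status = "ok" if succeeded else "failed"

    # Metric: how much, aggregated across all requests.
    payment_outcomes[status] += 1

    # Trace: the path this one request took, one span per service.
    trace = [
        {"trace_id": trace_id, "service": service, "duration_s": duration, "ok": ok}
        for service, duration, ok in spans
    ]

    # Log: what happened, as a single structured event.
    log_line = json.dumps(
        {"event": "payment_processed", "trace_id": trace_id, "status": status}
    )
    return log_line, trace


def failure_rate():
    """Share of observed payments that failed (0.0 when nothing was observed)."""
    total = sum(payment_outcomes.values())
    return payment_outcomes["failed"] / total if total else 0.0


log_line, trace = observe_payment(
    "trace-1",
    False,
    [("api-gateway", 0.02, True), ("fraud-detection", 30.0, False)],
)
```

The shared `trace_id` is what lets an investigator move from the log line to the trace, and the failing span is the one that points at the fraud detection service.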

Observability is different from monitoring. Monitoring tells you when something is broken. Observability helps you figure out why. With monitoring, you set up alerts for known failure modes. With observability, you can investigate failure modes you did not anticipate. Datadog, Grafana, Honeycomb, and New Relic are among the major platforms in this space. Honeycomb, co-founded by Charity Majors and Christine Yen, helped popularize the term in the context of modern distributed systems.

Examples

A team debugs a latency spike in production.

Users report slow page loads. The team opens their observability platform and sees p99 latency jumped from 200ms to 2 seconds at 2:30pm. They trace a slow request through the system: the API gateway, the user service, the database. The trace reveals the database query is taking 1.8 seconds because a missing index causes a full table scan on a table that grew past 10 million rows overnight.
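
Once a trace points at a slow query, a query plan check is one way to confirm the missing index. The sketch below uses SQLite from the Python standard library purely as a stand-in, since the scenario does not name a database. The `orders` table, its columns, and the index name are hypothetical.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (user_id, total) VALUES (?, ?)",
    ((i % 1000, float(i)) for i in range(10_000)),
)

sql = "SELECT total FROM orders WHERE user_id = ?"

# Without an index on user_id, SQLite has to scan the whole table.
plan_before = query_plan(conn, sql, (42,))

# With the index in place, the plan switches to an index search.
conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")
plan_after = query_plan(conn, sql, (42,))
```

The exact wording of the plan text varies between SQLite versions, but the first plan reports a scan of `orders` and the second reports a search using `idx_orders_user_id`.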

An SRE team sets up observability for a new service.

The service emits structured JSON logs with request IDs, duration, and status codes. It exports metrics (request count, latency histogram, error rate) to Prometheus. It sends distributed traces to Jaeger with span context propagated across service boundaries. Within a week, the team can answer most questions about the service's behavior without SSHing into a server.
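
A minimal sketch of the structured logging part, using Python's standard `logging` module. The logger name `checkout` and the field names `request_id`, `duration_ms`, and `status_code` are assumptions chosen to match the description above. Exporting metrics to Prometheus and traces to Jaeger would need their own client libraries and is not shown.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Extra fields copied from the record when present (assumed names).
    FIELDS = ("request_id", "duration_ms", "status_code")

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        for field in self.FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


def build_logger(stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers if called twice
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()  # stands in for stdout or a log shipper
logger = build_logger(stream)
logger.info(
    "request completed",
    extra={"request_id": "req-123", "duration_ms": 48, "status_code": 200},
)
```

Because every line is a JSON object with the same keys, a log backend can filter and aggregate by `request_id` or `status_code` instead of relying on text search.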

A startup outgrows console.log debugging.

The team has been debugging by adding console.log statements, deploying, checking the output, and repeating. This worked with one service. With twelve microservices, it no longer scales, because a single request produces output scattered across many processes. They adopt OpenTelemetry for distributed tracing and Grafana Loki for log aggregation. Debugging time drops from hours to minutes because they can see the full request lifecycle across all services.
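
Seeing a full request lifecycle depends on every service passing along the same trace ID. The sketch below hand-rolls the W3C Trace Context `traceparent` header format (version, trace ID, parent span ID, flags) to show the idea. It is not the OpenTelemetry SDK, which handles this propagation itself, and the function names are hypothetical.

```python
import secrets


def new_traceparent():
    """Start a trace: version 00, random trace and span IDs, sampled flag 01."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(incoming):
    """Keep the caller's trace ID but mint a new span ID for this service."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def call_downstream(headers):
    """Hypothetical downstream hop: continue the trace found in the headers."""
    outgoing = dict(headers)
    outgoing["traceparent"] = child_traceparent(headers["traceparent"])
    return outgoing


# The first service starts the trace; the next one continues it.
gateway_headers = {"traceparent": new_traceparent()}
user_service_headers = call_downstream(gateway_headers)
```

Both headers carry the same trace ID, so a tracing backend can stitch the two spans into one request timeline even though they were recorded by different services.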

Frequently asked questions

What is the difference between observability and monitoring?

Monitoring watches for known problems: is the server up? Is CPU above 90%? Is the error rate above 1%? Observability lets you investigate unknown problems: why is this specific user seeing slow responses? What changed between yesterday and today? Monitoring answers 'is something wrong?' Observability answers 'what is wrong and why?' You need both.
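
The contrast can be sketched in a few lines of Python. The event fields (`user`, `region`, `build`, `duration_ms`, `error`) and the sample values are invented for illustration. The point is that the alert encodes a question decided in advance, while the grouping function answers a question chosen during the investigation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical request events, one dict per request.
events = [
    {"user": "alice", "region": "eu", "build": "v41", "duration_ms": 120, "error": False},
    {"user": "bob", "region": "us", "build": "v42", "duration_ms": 1900, "error": False},
    {"user": "carol", "region": "us", "build": "v42", "duration_ms": 2100, "error": True},
    {"user": "dave", "region": "eu", "build": "v41", "duration_ms": 110, "error": False},
]


def error_rate_alert(events, threshold=0.01):
    """Monitoring: a predefined check against a known failure mode."""
    rate = sum(e["error"] for e in events) / len(events)
    return rate > threshold


def mean_duration_by(events, field):
    """Observability: slice the same events by any field, chosen after the fact."""
    groups = defaultdict(list)
    for e in events:
        groups[e[field]].append(e["duration_ms"])
    return {key: mean(values) for key, values in groups.items()}


alert_firing = error_rate_alert(events)          # "is something wrong?"
by_build = mean_duration_by(events, "build")     # "what changed?"
by_region = mean_duration_by(events, "region")   # "who is affected?"
```

Slicing by fields nobody planned to alert on only works if the events carry those fields in the first place, which is why rich, structured events matter.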

What is OpenTelemetry and why does it matter?

OpenTelemetry (OTel) is an open-source, vendor-neutral standard and set of SDKs for collecting logs, metrics, and traces. Before OTel, most observability vendors had their own SDKs and agents. If you wanted to switch from Datadog to Honeycomb, you often had to re-instrument large parts of your codebase. OTel standardizes the instrumentation. You instrument once and can send data to any compatible backend. It is supported by the major cloud providers and most observability vendors.
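
The "instrument once" idea is essentially a stable interface between application code and the backend. The sketch below shows that design pattern in plain Python. It is not the OpenTelemetry API, and every class and function name in it is hypothetical.

```python
from typing import Protocol


class TelemetryBackend(Protocol):
    """The one interface the application code depends on (hypothetical)."""

    def export(self, span: dict) -> None: ...


class InMemoryBackend:
    """Stand-in for one vendor: keeps spans in a list."""

    def __init__(self):
        self.spans = []

    def export(self, span):
        self.spans.append(span)


class TextLineBackend:
    """Stand-in for another vendor: renders spans as text lines."""

    def __init__(self):
        self.lines = []

    def export(self, span):
        self.lines.append(f"{span['name']} {span['duration_ms']}ms")


def handle_checkout(backend: TelemetryBackend):
    """Application code: instrumented once, unaware of which backend is used."""
    backend.export({"name": "checkout", "duration_ms": 42})


# Switching vendors means passing a different backend, not editing handle_checkout.
backend_a = InMemoryBackend()
backend_b = TextLineBackend()
handle_checkout(backend_a)
handle_checkout(backend_b)
```

In a real OTel setup the equivalent swap usually happens in exporter configuration rather than in application code.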
