Tracing
TRAY-sing
Following a request's path through distributed services to understand performance and find failures.
Tracing tracks a single request as it flows through your system. In a modern application, one user action might touch an API gateway, an authentication service, a business logic service, a database, a cache, and a third-party API. Tracing connects all of these into a single story: request came in, went here, waited there, failed here.
Each step in the trace is called a span. A span has a start time, a duration, metadata, and a parent span. The top-level span is the root. Child spans represent sub-operations. Together, they form a trace tree that shows exactly where time was spent and where failures occurred.
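The parent-child structure described above can be sketched in a few lines. This is a hypothetical minimal model, not a real tracing library's API; production tools such as OpenTelemetry carry far more metadata, but the tree shape is the same.

```python
from dataclasses import dataclass, field

# Hypothetical minimal span model: name, timing, and a parent link.
# Real tracing libraries add trace ids, attributes, status codes, etc.
@dataclass
class Span:
    name: str
    start_ms: float                       # when the operation began
    duration_ms: float                    # how long it took
    parent: "Span | None" = None          # None for the root span
    children: list["Span"] = field(default_factory=list)

    def child(self, name: str, start_ms: float, duration_ms: float) -> "Span":
        span = Span(name, start_ms, duration_ms, parent=self)
        self.children.append(span)
        return span

# Build a small trace tree for one request.
root = Span("GET /checkout", start_ms=0, duration_ms=120)
auth = root.child("auth-service", start_ms=2, duration_ms=15)
query = auth.child("SELECT user", start_ms=4, duration_ms=10)
```

Walking this tree from the root shows exactly where the 120ms went, which is all a trace viewer like Jaeger really does at scale.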
Distributed tracing became essential as monoliths gave way to microservices. In a monolith, you can follow a request with a stack trace. In 50 microservices communicating over the network, a stack trace only shows you what happened inside one service. Tracing connects the dots across service boundaries. Jaeger, Zipkin, and the tracing capabilities in Datadog and Honeycomb are the primary tools.
Examples
A developer investigates a slow API response.
The API responds in 3 seconds instead of the expected 200ms. The developer pulls up the trace in Jaeger. The trace shows: API gateway (5ms), auth service (15ms), order service (20ms), payment service (2,900ms). Within the payment service, a span shows a call to the fraud detection API that took 2,850ms due to a timeout. The root cause is clear without checking a single log file.
A team adds tracing to a legacy system.
The system has 30 services with no tracing. The team adds OpenTelemetry instrumentation to the 5 most critical services first. They propagate trace context through HTTP headers. Within a week, they discover that 40% of their latency comes from redundant database queries: three services query the same user data for the same request because there is no shared cache.
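Propagating trace context through HTTP headers, as the team above does, typically uses the W3C Trace Context `traceparent` header, which is what OpenTelemetry emits. The sketch below shows the idea with hand-rolled helpers (illustrative names, not a library API): every service shares the same trace id, but each issues a new span id.

```python
import secrets

# Sketch of W3C Trace Context propagation via the "traceparent" header:
# version "00" - 32-hex trace id - 16-hex span id - flags ("01" = sampled).
def make_traceparent() -> str:
    trace_id = secrets.token_hex(16)  # shared by every span in the request
    span_id = secrets.token_hex(8)    # identifies this service's span
    return f"00-{trace_id}-{span_id}-01"

def continue_trace(incoming: str) -> str:
    # A downstream service keeps the trace id but mints a new span id,
    # so its work attaches to the same trace tree.
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

header = make_traceparent()          # set by the first instrumented service
downstream = continue_trace(header)  # forwarded on each outbound HTTP call
```

Because the trace id survives every hop, the tracing backend can stitch spans from all 5 services back into one request, which is how the redundant database queries became visible.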
An SRE uses traces to identify a cascading failure.
Service A is timing out. The trace shows that Service A calls Service B, which calls Service C, which calls an external API. The external API is returning 503 errors. Service C retries three times, taking 30 seconds. Service B's timeout is 10 seconds, so it fails. Service A's timeout is 5 seconds, so it fails immediately. The fix: add circuit breakers at Service C so retries do not cascade.
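A circuit breaker like the one the SRE adds can be sketched in a few lines. This is a simplified illustration under assumed defaults, not a production implementation (real ones, e.g. in resilience libraries, add half-open probes, metrics, and concurrency safety): after enough consecutive failures, calls fail fast instead of retrying a dependency that is already down.

```python
import time

# Hypothetical minimal circuit breaker: after max_failures consecutive
# failures the circuit "opens" and calls fail immediately for reset_after
# seconds, so retries stop piling onto a broken external API.
class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cool-down over: allow a trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success resets the failure count
        return result
```

With this in front of the external API call, Service C fails in milliseconds once the circuit opens, so Service B and Service A stay within their own timeouts instead of cascading.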
Frequently asked questions
What is the difference between tracing and logging?
Logging records individual events within a single service. Tracing connects events across multiple services into a single request story. A log entry tells you 'this service processed a request in 50ms.' A trace tells you 'this user's request spent 50ms in the order service, 30ms in the payment service, and 200ms waiting for a database query in the inventory service.' Tracing gives you the full picture.
Do you need to trace every request?
No. Tracing every request generates massive amounts of data and is expensive. Most teams sample: trace 1% or 10% of requests in normal operation and increase sampling during incidents. Tail-based sampling is smarter: it keeps traces for slow or failed requests and discards traces for fast, successful ones. This way you have detailed data for the requests that matter.
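The tail-based sampling decision described above can be sketched as a single function. Thresholds and the baseline rate here are illustrative assumptions, not values from any particular tool.

```python
import random

# Sketch of a tail-based sampling decision: always keep slow or failed
# traces, keep only a small random fraction of fast, successful ones.
def keep_trace(duration_ms: float, has_error: bool,
               slow_threshold_ms: float = 1000.0,
               baseline_rate: float = 0.01) -> bool:
    if has_error or duration_ms >= slow_threshold_ms:
        return True                          # the requests that matter
    return random.random() < baseline_rate   # sample the boring ones
```

The catch is that this decision needs the whole trace (total duration, final status), so tail-based sampling happens in a collector after spans arrive, which costs more buffering than deciding up front at 1% or 10%.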
Related terms
Observability: The ability to understand a system's internal state by examining its outputs: logs, metrics, and traces.
Logging: Recording timestamped events from an application to help with debugging and auditing.
Monitoring: Continuously checking system health using metrics, dashboards, and alerts to detect problems.
Latency: The time delay between a request being sent and a response being received.
Microservices: An architecture where an application is built as a collection of small, independent services.

Want the complete playbook?
Picks and Shovels is the definitive guide to developer marketing. Amazon #1 bestseller with practical strategies from 30 years of marketing to developers.