OpenTelemetry Tracing Playbook (Sampling, Correlation, Tail-Based Investigation)
Date: 2026-02-26
Category: knowledge
Domain: observability / backend reliability
Why this matters
Teams often say “we have tracing” when what they actually have is:
- spans without consistent service/resource naming,
- random head sampling that drops the exact slow/error traces you need,
- no log/metric correlation,
- no runbook for using traces during incidents.
Result: pretty waterfall screenshots, weak operational value.
The goal is to treat tracing as an investigation system, not a dashboard ornament.
Mental model: traces are for causality, not averages
Metrics answer:
- “How often?”
- “How bad?”
Traces answer:
- “Why did this specific request fail or get slow?”
- “Where exactly did latency accumulate across services?”
If your SLOs are metric-driven, your incident triage should be trace-assisted.
Architecture that works in production
- Instrument app code with OpenTelemetry SDK
- Export OTLP to OpenTelemetry Collector
- Collector does:
- enrichment (resource attributes),
- filtering,
- sampling policies,
- fan-out to backends (Tempo/Jaeger/vendor)
- Correlate traces with:
- logs (`trace_id`, `span_id`),
- metrics (exemplars + span metrics)
Collector-first architecture avoids app-level exporter sprawl and gives a single policy plane.
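A minimal Collector pipeline along these lines might look as follows; the backend endpoint and the environment value are placeholders, not recommendations:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  # Enrichment: stamp every span with the deployment environment.
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert
  batch:

exporters:
  # Fan-out target is illustrative; swap for Tempo/Jaeger/vendor endpoints.
  otlp/tempo:
    endpoint: tempo.internal:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlp/tempo]
```

Because sampling and filtering live in this one config, changing policy never requires redeploying application code.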
Naming & attributes: the boring part that decides success
Service identity
Standardize at least:
- `service.name`
- `service.namespace`
- `service.version`
- `deployment.environment`
If these are inconsistent, cross-service queries become painful fast.
HTTP/RPC semantic attributes
Use OTel semantic conventions (HTTP method, route, status, peer/service attributes).
Do not invent ad-hoc keys unless you must.
Business attributes (sparingly)
Useful examples:
- tenant tier (`tenant.plan`),
- checkout/payment flow id,
- model version (for ML calls),
- exchange/venue id (for trading systems).
Rule: attributes should improve triage decisions. If not, don’t add them.
Sampling strategy: move beyond naive head sampling
1) Head sampling (cheap, early)
Pros:
- low overhead,
- simple.
Cons:
- misses rare tail failures and high-latency paths by chance.
Good baseline:
- keep 100% for errors where feasible (note: head decisions are made before the outcome is known, so guaranteed error-keep usually requires tail sampling),
- modest percentage for healthy traffic.
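Head sampling is usually a deterministic function of the trace ID, so every service in the call path makes the same keep/drop decision and sampled traces stay complete. A stdlib-only sketch of that idea (the hashing scheme is illustrative; OTel SDKs ship ratio-based samplers with their own algorithm):

```python
import hashlib

def head_sample(trace_id: str, ratio: float) -> bool:
    """Deterministic head-sampling decision: hash the trace ID into [0, 1)
    and keep the trace if it falls under the configured ratio. Every
    service that sees the same trace_id reaches the same decision."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < ratio

# ratio=1.0 keeps everything, ratio=0.0 drops everything;
# any fixed ratio gives the same answer on every service.
keep = head_sample("4bf92f3577b34da6a3ce929d0e0e4736", 0.1)
```

The determinism is the whole point: per-service coin flips would produce partial traces.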
2) Tail sampling (collector-side, decision after seeing spans)
Pros:
- can retain traces that are slow/error/high-value,
- much better for incident forensics.
Cons:
- collector memory/CPU cost,
- more operational tuning.
Typical policy stack:
- keep all traces with server error status,
- keep traces above latency threshold (e.g., p99-like duration),
- keep 100% for critical endpoints/tenants,
- probabilistic keep for the rest.
In practice, hybrid works best: light head + policy-rich tail.
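The policy stack above maps directly onto the Collector contrib `tail_sampling` processor. A sketch of that config; the thresholds, route values, and percentages are illustrative starting points, not recommendations:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s        # how long to buffer spans before deciding
    num_traces: 50000         # in-memory trace budget (drives Collector RAM)
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 500}
      - name: keep-critical-routes
        type: string_attribute
        string_attribute: {key: http.route, values: [/checkout, /payments]}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
```

Policies are evaluated per trace; a trace is kept if any policy matches, so the probabilistic policy acts as the floor for healthy traffic.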
Correlation pattern (non-optional)
A reliable incident workflow needs all three:
- Metrics → detect (SLO burn, latency spike, error budget)
- Trace → explain causality (where and why)
- Logs → inspect details (payload/result/edge-case context)
Implementation checklist:
- inject
trace_id/span_idinto structured logs, - propagate W3C Trace Context headers (
traceparent,tracestate), - enable exemplars linking metrics points to trace IDs.
Without this chain, teams bounce between tools and lose minutes during incidents.
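The log-injection step can be sketched with the standard library alone. This is a hedged illustration: the contextvar and JSON field names are assumptions, and in a real service the OTel SDK manages the active span context and its logging instrumentation does the injection.

```python
import contextvars
import json
import logging

# Active trace context; in a real service the OTel SDK owns this.
current_trace = contextvars.ContextVar("current_trace", default=("", ""))

class TraceContextFilter(logging.Filter):
    """Copy trace_id/span_id from the active context onto every record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id, record.span_id = current_trace.get()
        return True

class JsonFormatter(logging.Formatter):
    """Emit structured lines that are joinable with traces on trace_id."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "message": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
        })

def build_logger() -> logging.Logger:
    logger = logging.getLogger("app")
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.addFilter(TraceContextFilter())
    logger.setLevel(logging.INFO)
    return logger
```

Once every log line carries the trace ID, the trace→log pivot in an incident is a single filtered query instead of timestamp archaeology.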
Incident runbook (15-minute first pass)
- Start from alert window (time + service + endpoint)
- Query traces for:
- error status,
- top latency contributors,
- affected dependency/service pairs.
- Compare good vs bad traces:
- which span changed duration/failure rate?
- Pivot into logs for same trace IDs
- Classify issue quickly:
- dependency saturation,
- retry storm,
- timeout misbudget,
- deploy regression,
- tenant-specific hot key/path.
- Apply known mitigation (rate limit, rollback, brownout, timeout tune)
Tracing should shorten MTTD/MTTR, not add detective theater.
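The good-vs-bad comparison step can be mechanized. A sketch under a simplifying assumption: each trace is reduced to a list of (span name, duration in ms) pairs for the same request path.

```python
def biggest_span_regression(good, bad):
    """Return the span name whose duration grew the most between a
    healthy trace and an unhealthy trace of the same request path."""
    good_ms = dict(good)
    deltas = {name: ms - good_ms.get(name, 0.0) for name, ms in bad}
    return max(deltas, key=deltas.get)

good = [("gateway", 5.0), ("auth", 10.0), ("db.query", 20.0)]
bad  = [("gateway", 6.0), ("auth", 11.0), ("db.query", 950.0)]
# Here db.query accounts for almost all of the added latency.
```

The same delta idea generalizes to failure rates: diff the error counts per span name across a sample of good and bad traces.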
High-value derived metrics from traces
Generate span-derived metrics to reduce query load:
- request rate by route + status,
- latency histogram by route/service,
- error ratio by dependency,
- queue/wait vs execution time split,
- external API call fan-out count.
These guide where to open raw traces first.
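Span-to-metric aggregation is simple to reason about even before wiring up the Collector's span-metrics pipeline. A sketch, assuming each span record carries service, route, status, and duration fields (the bucket bounds are illustrative):

```python
from collections import defaultdict

# Histogram bucket upper bounds in milliseconds (illustrative).
BUCKETS = [10, 50, 100, 500, 1000, float("inf")]

def spans_to_metrics(spans):
    """Aggregate raw spans into request counts, error ratios, and
    latency histograms keyed by (service, route)."""
    counts = defaultdict(int)
    errors = defaultdict(int)
    hist = defaultdict(lambda: [0] * len(BUCKETS))
    for s in spans:
        key = (s["service"], s["route"])
        counts[key] += 1
        if s["status"] == "error":
            errors[key] += 1
        for i, bound in enumerate(BUCKETS):
            if s["duration_ms"] <= bound:
                hist[key][i] += 1
                break
    return {
        key: {
            "requests": counts[key],
            "error_ratio": errors[key] / counts[key],
            "latency_buckets": hist[key],
        }
        for key in counts
    }
```

Note the key is (service, route), not the raw URL: this only stays cheap if route labels are normalized, which is the next section's first failure mode.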
Common failure modes
- Inconsistent route labels (`/users/:id` vs `/users/123`): cardinality explodes, aggregation becomes noisy.
- No privacy filtering: PII/secrets leak into span attributes/events.
- Single global sample rate: critical flows under-sampled, low-value traffic over-sampled.
- No collector SLOs: tracing pipeline drops data silently during incidents.
- No ownership: everyone emits spans, nobody curates conventions.
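When the framework does not expose a templated route, route normalization is often a small regex fix at instrumentation or Collector time. A sketch with illustrative patterns covering numeric and UUID-style path segments:

```python
import re

# Illustrative patterns: numeric IDs and UUID-like path segments.
_ID = re.compile(r"/\d+(?=/|$)")
_UUID = re.compile(
    r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(?=/|$)"
)

def normalize_route(path: str) -> str:
    """Collapse concrete IDs to placeholders so route labels stay
    low-cardinality in span attributes and derived metrics."""
    path = _UUID.sub("/:uuid", path)  # UUIDs first, before digit matching
    return _ID.sub("/:id", path)
```

Prefer the framework's own route template (e.g. what OTel's `http.route` convention expects) when available; regex normalization is the fallback.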
Practical rollout plan
Week 1:
- lock naming conventions,
- instrument one critical user journey end-to-end,
- establish trace↔log correlation.
Week 2:
- deploy collector with basic tail policies,
- create “golden query pack” for on-call engineers.
Week 3:
- add span metrics + exemplar links,
- define sampling budgets by service criticality.
Week 4:
- run game-day incident drill using only metrics+trace+logs workflow,
- refine policies based on what was hard to find.
Decision cheat sheet
- Need lowest cost quickly? → start head sampling + strict conventions.
- Need better incident forensics? → add tail sampling in collector.
- Drowning in traces? → derive span metrics + route normalization.
- On-call still slow? → improve trace↔log join + prebuilt queries.
References (researched)
- OpenTelemetry docs (overview/spec/concepts): https://opentelemetry.io/docs/
- OTel semantic conventions: https://opentelemetry.io/docs/specs/semconv/
- OTel Collector (architecture + processors): https://opentelemetry.io/docs/collector/
- Tail Sampling Processor (OTel Collector contrib): https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor
- W3C Trace Context: https://www.w3.org/TR/trace-context/
- Grafana Tempo docs (trace backend + exemplars): https://grafana.com/docs/tempo/latest/
- Jaeger docs: https://www.jaegertracing.io/docs/
- Google SRE Workbook (Monitoring distributed systems / alerting context): https://sre.google/workbook/monitoring/