OpenTelemetry Tracing Playbook (Sampling, Correlation, Tail-Based Investigation)
Date: 2026-02-26
Category: knowledge
Domain: observability / backend reliability
Why this matters
Teams often say “we have tracing” when what they actually have is:
- spans without consistent service/resource naming,
- random head sampling that drops the exact slow/error traces you need,
- no log/metric correlation,
- no runbook for using traces during incidents.
Result: pretty waterfall screenshots, weak operational value.
The goal is to treat tracing as an investigation system, not a dashboard ornament.
Mental model: traces are for causality, not averages
Metrics answer:
- “How often?”
- “How bad?”
Traces answer:
- “Why did this specific request fail or get slow?”
- “Where exactly did latency accumulate across services?”
If your SLOs are metric-driven, your incident triage should be trace-assisted.
Architecture that works in production
- Instrument app code with OpenTelemetry SDK
- Export OTLP to OpenTelemetry Collector
- Collector does:
- enrichment (resource attributes),
- filtering,
- sampling policies,
- fan-out to backends (Tempo/Jaeger/vendor)
- Correlate traces with:
- logs (`trace_id`, `span_id`),
- metrics (exemplars + span metrics)
Collector-first architecture avoids app-level exporter sprawl and gives a single policy plane.
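A minimal Collector pipeline along these lines might look as follows; the backend endpoint and the environment value are placeholders, not recommendations:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  # Enrichment: stamp every span with the deployment environment.
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert
  batch:

exporters:
  # Fan-out target is illustrative; swap for Tempo/Jaeger/vendor endpoints.
  otlp/tempo:
    endpoint: tempo.internal:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlp/tempo]
```

Because sampling and filtering live in this one config, changing policy never requires redeploying application code.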
Naming & attributes: the boring part that decides success
Service identity
Standardize at least:
- `service.name`
- `service.namespace`
- `service.version`
- `deployment.environment`
If these are inconsistent, cross-service queries become painful fast.
HTTP/RPC semantic attributes
Use OTel semantic conventions (HTTP method, route, status, peer/service attributes).
Do not invent ad-hoc keys unless you must.
Business attributes (sparingly)
Useful examples:
- tenant tier (`tenant.plan`),
- checkout/payment flow id,
- model version (for ML calls),
- exchange/venue id (for trading systems).
Rule: attributes should improve triage decisions. If not, don’t add them.
Sampling strategy: move beyond naive head sampling
1) Head sampling (cheap, early)
Pros:
- low overhead,
- simple.
Cons:
- misses rare tail failures and high-latency paths by chance.
Good baseline:
- keep 100% for errors where feasible (note: head decisions are made before the outcome is known, so guaranteed error-keep usually requires tail sampling),
- modest percentage for healthy traffic.
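Head sampling is usually a deterministic function of the trace ID, so every service in the call path makes the same keep/drop decision and sampled traces stay complete. A stdlib-only sketch of that idea (the hashing scheme is illustrative; OTel SDKs ship ratio-based samplers with their own algorithm):

```python
import hashlib

def head_sample(trace_id: str, ratio: float) -> bool:
    """Deterministic head-sampling decision: hash the trace ID into [0, 1)
    and keep the trace if it falls under the configured ratio. Every
    service that sees the same trace_id reaches the same decision."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < ratio

# ratio=1.0 keeps everything, ratio=0.0 drops everything;
# any fixed ratio gives the same answer on every service.
keep = head_sample("4bf92f3577b34da6a3ce929d0e0e4736", 0.1)
```

The determinism is the whole point: per-service coin flips would produce partial traces.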
2) Tail sampling (collector-side, decision after seeing spans)
Pros:
- can retain traces that are slow/error/high-value,
- much better for incident forensics.
Cons:
- collector memory/CPU cost,
- more operational tuning.
Typical policy stack:
- keep all traces with server error status,
- keep traces above latency threshold (e.g., p99-like duration),
- keep 100% for critical endpoints/tenants,
- probabilistic keep for the rest.
In practice, hybrid works best: light head + policy-rich tail.
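The policy stack above maps directly onto the Collector contrib `tail_sampling` processor. A sketch of that config; the thresholds, route values, and percentages are illustrative starting points, not recommendations:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s        # how long to buffer spans before deciding
    num_traces: 50000         # in-memory trace budget (drives Collector RAM)
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 500}
      - name: keep-critical-routes
        type: string_attribute
        string_attribute: {key: http.route, values: [/checkout, /payments]}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
```

Policies are evaluated per trace; a trace is kept if any policy matches, so the probabilistic policy acts as the floor for healthy traffic.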
Correlation pattern (non-optional)
A reliable incident workflow needs all three:
- Metrics → detect (SLO burn, latency spike, error budget)
- Trace → explain causality (where and why)
- Logs → inspect details (payload/result/edge-case context)
Implementation checklist:
- inject
trace_id/span_idinto structured logs, - propagate W3C Trace Context headers (
traceparent,tracestate), - enable exemplars linking metrics points to trace IDs.
Without this chain, teams bounce between tools and lose minutes during incidents.
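The log-injection step can be sketched with the standard library alone. This is a hedged illustration: the contextvar and JSON field names are assumptions, and in a real service the OTel SDK manages the active span context and its logging instrumentation does the injection.

```python
import contextvars
import json
import logging

# Active trace context; in a real service the OTel SDK owns this.
current_trace = contextvars.ContextVar("current_trace", default=("", ""))

class TraceContextFilter(logging.Filter):
    """Copy trace_id/span_id from the active context onto every record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id, record.span_id = current_trace.get()
        return True

class JsonFormatter(logging.Formatter):
    """Emit structured lines that are joinable with traces on trace_id."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "message": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
        })

def build_logger() -> logging.Logger:
    logger = logging.getLogger("app")
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.addFilter(TraceContextFilter())
    logger.setLevel(logging.INFO)
    return logger
```

Once every log line carries the trace ID, the trace→log pivot in an incident is a single filtered query instead of timestamp archaeology.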
Incident runbook (15-minute first pass)
- Start from alert window (time + service + endpoint)
- Query traces for:
- error status,
- top latency contributors,
- affected dependency/service pairs.
- Compare good vs bad traces:
- which span changed duration/failure rate?
- Pivot into logs for same trace IDs
- Classify issue quickly:
- dependency saturation,
- retry storm,
- timeout misbudget,
- deploy regression,
- tenant-specific hot key/path.
- Apply known mitigation (rate limit, rollback, brownout, timeout tune)
Tracing should shorten MTTD/MTTR, not add detective theater.
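The good-vs-bad comparison step can be mechanized. A sketch under a simplifying assumption: each trace is reduced to a list of (span name, duration in ms) pairs for the same request path.

```python
def biggest_span_regression(good, bad):
    """Return the span name whose duration grew the most between a
    healthy trace and an unhealthy trace of the same request path."""
    good_ms = dict(good)
    deltas = {name: ms - good_ms.get(name, 0.0) for name, ms in bad}
    return max(deltas, key=deltas.get)

good = [("gateway", 5.0), ("auth", 10.0), ("db.query", 20.0)]
bad  = [("gateway", 6.0), ("auth", 11.0), ("db.query", 950.0)]
# Here db.query accounts for almost all of the added latency.
```

The same delta idea generalizes to failure rates: diff the error counts per span name across a sample of good and bad traces.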
High-value derived metrics from traces
Generate span-derived metrics to reduce query load:
- request rate by route + status,
- latency histogram by route/service,
- error ratio by dependency,
- queue/wait vs execution time split,
- external API call fan-out count.
These guide where to open raw traces first.
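Span-to-metric aggregation is simple to reason about even before wiring up the Collector's span-metrics pipeline. A sketch, assuming each span record carries service, route, status, and duration fields (the bucket bounds are illustrative):

```python
from collections import defaultdict

# Histogram bucket upper bounds in milliseconds (illustrative).
BUCKETS = [10, 50, 100, 500, 1000, float("inf")]

def spans_to_metrics(spans):
    """Aggregate raw spans into request counts, error ratios, and
    latency histograms keyed by (service, route)."""
    counts = defaultdict(int)
    errors = defaultdict(int)
    hist = defaultdict(lambda: [0] * len(BUCKETS))
    for s in spans:
        key = (s["service"], s["route"])
        counts[key] += 1
        if s["status"] == "error":
            errors[key] += 1
        for i, bound in enumerate(BUCKETS):
            if s["duration_ms"] <= bound:
                hist[key][i] += 1
                break
    return {
        key: {
            "requests": counts[key],
            "error_ratio": errors[key] / counts[key],
            "latency_buckets": hist[key],
        }
        for key in counts
    }
```

Note the key is (service, route), not the raw URL: this only stays cheap if route labels are normalized, which is the next section's first failure mode.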
Common failure modes
- Inconsistent route labels (`/users/:id` vs `/users/123`): cardinality explodes, aggregation becomes noisy.
- No privacy filtering: PII/secrets leak into span attributes/events.
- Single global sample rate: critical flows under-sampled, low-value traffic over-sampled.
- No collector SLOs: tracing pipeline drops data silently during incidents.
- No ownership: everyone emits spans, nobody curates conventions.
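When the framework does not expose a templated route, route normalization is often a small regex fix at instrumentation or Collector time. A sketch with illustrative patterns covering numeric and UUID-style path segments:

```python
import re

# Illustrative patterns: numeric IDs and UUID-like path segments.
_ID = re.compile(r"/\d+(?=/|$)")
_UUID = re.compile(
    r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(?=/|$)"
)

def normalize_route(path: str) -> str:
    """Collapse concrete IDs to placeholders so route labels stay
    low-cardinality in span attributes and derived metrics."""
    path = _UUID.sub("/:uuid", path)  # UUIDs first, before digit matching
    return _ID.sub("/:id", path)
```

Prefer the framework's own route template (e.g. what OTel's `http.route` convention expects) when available; regex normalization is the fallback.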
Practical rollout plan
Week 1:
- lock naming conventions,
- instrument one critical user journey end-to-end,
- establish trace↔log correlation.
Week 2:
- deploy collector with basic tail policies,
- create “golden query pack” for on-call engineers.
Week 3:
- add span metrics + exemplar links,
- define sampling budgets by service criticality.
Week 4:
- run game-day incident drill using only metrics+trace+logs workflow,
- refine policies based on what was hard to find.
Decision cheat sheet
- Need lowest cost quickly? → start head sampling + strict conventions.
- Need better incident forensics? → add tail sampling in collector.
- Drowning in traces? → derive span metrics + route normalization.
- On-call still slow? → improve trace↔log join + prebuilt queries.
References (researched)
- OpenTelemetry docs (overview/spec/concepts): https://opentelemetry.io/docs/
- OTel semantic conventions: https://opentelemetry.io/docs/specs/semconv/
- OTel Collector (architecture + processors): https://opentelemetry.io/docs/collector/
- Tail Sampling Processor (OTel Collector contrib): https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor
- W3C Trace Context: https://www.w3.org/TR/trace-context/
- Grafana Tempo docs (trace backend + exemplars): https://grafana.com/docs/tempo/latest/
- Jaeger docs: https://www.jaegertracing.io/docs/
- Google SRE Workbook (Monitoring distributed systems / alerting context): https://sre.google/workbook/monitoring/