OpenTelemetry Tracing Playbook (Sampling, Correlation, Tail-Based Investigation)

2026-02-26 · software

Category: knowledge
Domain: observability / backend reliability

Why this matters

Teams often say “we have tracing” when what they actually have is:

  • auto-instrumentation switched on with default settings,
  • one global sample rate,
  • no correlation with logs or metrics,
  • no runbook that actually consumes the traces.

Result: pretty waterfall screenshots, weak operational value.

The goal is to treat tracing as an investigation system, not a dashboard ornament.


Mental model: traces are for causality, not averages

Metrics answer:

  • how often something happens and how large the aggregate impact is.

Traces answer:

  • which request was affected, through which services it travelled, and which span introduced the latency or error.

If your SLOs are metric-driven, your incident triage should be trace-assisted.


Architecture that works in production

  1. Instrument app code with OpenTelemetry SDK
  2. Export OTLP to OpenTelemetry Collector
  3. Collector does:
    • enrichment (resource attributes),
    • filtering,
    • sampling policies,
    • fan-out to backends (Tempo/Jaeger/vendor)
  4. Correlate traces with:
    • logs (trace_id, span_id),
    • metrics (exemplars + span metrics)

Collector-first architecture avoids app-level exporter sprawl and gives a single policy plane.
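The enrich → filter → fan-out flow above can be sketched in plain Python. Everything here (the `Span` shape, the function names, the backend lists) is illustrative, not the OpenTelemetry Collector's actual API; it only shows why a single policy plane is useful.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    service: str
    duration_ms: float
    status_error: bool = False
    attributes: dict = field(default_factory=dict)

def enrich(span, resource):
    # Resource attributes (env, region, cluster) are stamped once, centrally,
    # instead of per-application.
    span.attributes.update(resource)
    return span

def keep(span):
    # Filtering: drop health checks before they reach any backend.
    return span.name != "GET /healthz"

def fan_out(span, backends):
    # One policy plane, many sinks (Tempo/Jaeger/vendor in the real pipeline).
    for backend in backends:
        backend.append(span)

tempo, vendor = [], []
resource = {"deployment.environment": "prod", "cloud.region": "eu-west-1"}
for s in [Span("GET /healthz", "api", 1.0), Span("GET /users/:id", "api", 42.0)]:
    s = enrich(s, resource)
    if keep(s):
        fan_out(s, [tempo, vendor])
```

Because the apps only export OTLP, swapping a backend or tightening a filter is a collector config change, not a redeploy of every service.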


Naming & attributes: the boring part that decides success

Service identity

Standardize at least:

  • service.name,
  • service.version,
  • deployment.environment,
  • service.namespace (if you group services by team or domain).

If these are inconsistent, cross-service queries become painful fast.

HTTP/RPC semantic attributes

Use OTel semantic conventions (HTTP method, route, status, peer/service attributes).
Do not invent ad-hoc keys unless you must.
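A cheap way to enforce this is a lint step over span attributes. The allow-list below names current OTel HTTP semantic-convention keys; the checker itself is an illustrative helper, not part of any OTel SDK.

```python
# Real OTel HTTP semconv keys (stable conventions).
SEMCONV_HTTP = {
    "http.request.method",
    "http.route",                  # templated route, never the raw path
    "http.response.status_code",
    "server.address",
}

def check_attributes(attrs: dict) -> list[str]:
    """Return ad-hoc HTTP-ish keys that should be renamed to a semconv key."""
    return sorted(k for k in attrs if k.startswith("http") and k not in SEMCONV_HTTP)

bad = check_attributes({
    "http.request.method": "GET",
    "http.route": "/users/:id",
    "httpStatus": 200,             # ad-hoc key: flag it
})
```

Running this in CI against emitted spans (or in the collector as a sampled audit) catches convention drift before it poisons cross-service queries.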

Business attributes (sparingly)

Useful examples:

  • tenant or customer identifier (in a bounded-cardinality form),
  • plan or tier,
  • feature-flag variant on a critical path.

Rule: attributes should improve triage decisions. If not, don’t add them.


Sampling strategy: move beyond naive head sampling

1) Head sampling (cheap, early)

Pros:

  • the decision is made once at the root span and propagated downstream for free,
  • cost is predictable and the operational model is trivial.

Cons:

  • the decision happens before the outcome is known, so rare errors and slow tails are dropped at the same rate as healthy traffic.

Good baseline:

  • ParentBased(TraceIdRatioBased) at a few percent for high-volume services, 100% for low-traffic critical flows.
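The head-sampling decision can be sketched in a few lines. This mirrors the idea behind OTel's TraceIdRatioBased sampler (compare low bits of the trace ID against a bound derived from the ratio), though the SDK's exact arithmetic may differ.

```python
import random

def head_sample(trace_id: int, ratio: float) -> bool:
    # Use the low 64 bits of the trace ID so every service computes the
    # same answer for the same trace -- no coordination needed.
    bound = int(ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound

random.seed(7)
ids = [random.getrandbits(128) for _ in range(10_000)]
kept = sum(head_sample(t, 0.10) for t in ids)
# kept lands close to 1,000 (roughly 10% of 10,000)
```

The key property: sampling is deterministic per trace ID, so every service in the call chain independently reaches the same keep/drop decision.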

2) Tail sampling (collector-side, decision after seeing spans)

Pros:

  • decides after seeing the whole trace, so it can keep every error and every slow outlier while dropping boring traffic.

Cons:

  • spans must be buffered until the trace completes (memory cost),
  • all spans of a trace must reach the same collector instance (trace-ID-aware load balancing),
  • one more stateful component to operate.

Typical policy stack:

  • always sample on error status,
  • always sample above a latency threshold,
  • rate-limited probabilistic sampling for everything else.

In practice, hybrid works best: light head + policy-rich tail.
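The policy stack can be sketched as a decision function over a completed trace. Policy names, span shape, and the 500 ms threshold are illustrative assumptions, not the Collector's tail-sampling processor configuration.

```python
import random

def tail_decision(spans: list[dict], base_ratio: float = 0.05) -> str:
    # Policy 1: always keep traces containing an error span.
    if any(s.get("error") for s in spans):
        return "keep:error"
    # Policy 2: always keep slow traces (end-to-end latency over budget).
    total_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if total_ms > 500:
        return "keep:latency"
    # Policy 3: probabilistic keep for the boring majority.
    return "keep:probabilistic" if random.random() < base_ratio else "drop"

fast_ok = [{"start_ms": 0, "end_ms": 40,  "error": False}]
slow    = [{"start_ms": 0, "end_ms": 900, "error": False}]
failed  = [{"start_ms": 0, "end_ms": 30,  "error": True}]
```

Policies are evaluated in priority order; the probabilistic fallback is what keeps a representative baseline of healthy traffic for comparison during incidents.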


Correlation pattern (non-optional)

A reliable incident workflow needs all three:

  1. Metrics → detect (SLO burn, latency spike, error budget)
  2. Trace → explain causality (where and why)
  3. Logs → inspect details (payload/result/edge-case context)

Implementation checklist:

  • every log line carries trace_id and span_id (injected by the SDK/agent, not hand-copied),
  • latency histograms emit exemplars that link a bucket to a concrete trace,
  • span-derived metrics exist for the services you page on,
  • each tool deep-links to the others (metric → exemplar → trace → logs for that trace_id).

Without this chain, teams bounce between tools and lose minutes during incidents.
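The "logs carry trace_id/span_id" step can be done with a stdlib logging filter. The `current_span` holder here is a stand-in for the real context API (in the OTel Python SDK that lookup is `trace.get_current_span()`); the IDs are the W3C traceparent example values, used purely for illustration.

```python
import io
import logging

# Stand-in for the active span context an OTel SDK would provide.
current_span = {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
                "span_id": "00f067aa0ba902b7"}

class TraceContextFilter(logging.Filter):
    def filter(self, record):
        # Stamp the active trace context onto every record.
        record.trace_id = current_span["trace_id"]
        record.span_id = current_span["span_id"]
        return True

buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))
log = logging.getLogger("checkout")
log.addFilter(TraceContextFilter())
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False
log.warning("payment retry scheduled")
line = buf.getvalue().strip()
```

With every log line stamped this way, "pivot into logs for the same trace IDs" becomes a single filtered query instead of a timestamp-matching exercise.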


Incident runbook (15-minute first pass)

  1. Start from alert window (time + service + endpoint)
  2. Query traces for:
    • error status,
    • top latency contributors,
    • affected dependency/service pairs.
  3. Compare good vs bad traces:
    • which span changed duration/failure rate?
  4. Pivot into logs for same trace IDs
  5. Classify issue quickly:
    • dependency saturation,
    • retry storm,
    • timeout misbudget,
    • deploy regression,
    • tenant-specific hot key/path.
  6. Apply known mitigation (rate limit, rollback, brownout, timeout tune)

Tracing should shorten MTTD/MTTR, not add detective theater.


High-value derived metrics from traces

Generate span-derived metrics to reduce query load:

  • request rate, error ratio, and duration percentiles per service + route (RED),
  • latency per dependency pair (caller service → callee service),
  • exemplar-linked histograms so a spike pivots straight into a trace.

These guide where to open raw traces first.
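The span-metrics idea, folded into plain Python: aggregate finished spans into per-(service, route) RED metrics so dashboards never query raw traces. Field names and the naive percentile method are illustrative assumptions.

```python
from collections import defaultdict

def red_metrics(spans):
    agg = defaultdict(lambda: {"count": 0, "errors": 0, "durations": []})
    for s in spans:
        key = (s["service"], s["route"])
        agg[key]["count"] += 1
        agg[key]["errors"] += int(s["error"])
        agg[key]["durations"].append(s["duration_ms"])
    out = {}
    for key, a in agg.items():
        durs = sorted(a["durations"])
        # Naive nearest-rank p95; real pipelines use histogram buckets.
        p95 = durs[int(0.95 * (len(durs) - 1))]
        out[key] = {"rate": a["count"],
                    "error_ratio": a["errors"] / a["count"],
                    "p95_ms": p95}
    return out

spans = (
    [{"service": "api", "route": "/users/:id", "duration_ms": d, "error": False}
     for d in range(10, 110, 10)]
    + [{"service": "api", "route": "/users/:id", "duration_ms": 400, "error": True}]
)
m = red_metrics(spans)[("api", "/users/:id")]
```

In a real deployment this aggregation runs in the collector (span-metrics style) so it sees pre-sampling traffic and stays cheap to query.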


Common failure modes

  1. Inconsistent route labels (/users/:id vs /users/123)
    Cardinality explodes, aggregation becomes noisy.

  2. No privacy filtering
    PII/secrets leak into span attributes/events.

  3. Single global sample rate
    Critical flows under-sampled, low-value traffic over-sampled.

  4. No collector SLOs
    Tracing pipeline drops data silently during incidents.

  5. No ownership
    Everyone emits spans, nobody curates conventions.
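Failure modes 1 and 2 are both fixable with one attribute processor at the collector. The regex, the deny-list keys, and the choice to rewrite `http.route` in place are illustrative assumptions; real deployments should match their own ID formats and secret inventory.

```python
import re

# Collapse numeric IDs and UUIDs in path segments into a template.
ID_SEGMENT = re.compile(
    r"/\d+|/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")
# Keys that must never reach a tracing backend.
DENY_KEYS = {"user.email", "card.number", "authorization"}

def scrub(attrs: dict) -> dict:
    out = {}
    for k, v in attrs.items():
        if k in DENY_KEYS:
            out[k] = "[REDACTED]"
        elif k == "http.route" and isinstance(v, str):
            out[k] = ID_SEGMENT.sub("/:id", v)
        else:
            out[k] = v
    return out

clean = scrub({
    "http.route": "/users/123/orders/456",
    "user.email": "a@example.com",
    "http.request.method": "GET",
})
```

Running this centrally (rather than trusting each team's instrumentation) is what keeps route cardinality bounded and PII out of span attributes by default.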


Practical rollout plan

Week 1:

  • instrument one or two paging-critical services, ship OTLP through a collector, standardize resource attributes (service.name, service.version, deployment.environment).

Week 2:

  • adopt semantic conventions, normalize route labels, add privacy filtering at the collector.

Week 3:

  • roll out the sampling strategy (light head sampling plus tail policies), wire trace_id/span_id into logs and exemplars into metrics.

Week 4:

  • generate span-derived metrics, fold the 15-minute runbook into alert payloads, assign convention ownership and collector SLOs.

Decision cheat sheet

  • Low traffic, critical flow → head-sample at or near 100%.
  • High volume, cost pressure → light head sampling plus tail policies.
  • Cannot afford to miss error traces → tail policy with always-sample-on-error.
  • Debugging a single tenant or route → temporary attribute-targeted tail policy.
  • Dashboards querying raw traces → span-derived metrics first, traces second.
