OpenTelemetry Exemplars for Metrics↔Traces Correlation (Production Playbook)

2026-03-23 · software

Scope: Practical rollout guide for linking latency metrics to concrete traces using OpenTelemetry exemplars, Prometheus, and Grafana.


1) Why this matters

Dashboards tell you what is slow (p95/p99 spikes). Traces tell you why it was slow (DB lock, retry storm, cold cache, etc.).

Exemplars are the bridge: they attach a specific trace context to selected metric observations, so you can jump directly from a bad point on a graph to the corresponding trace.


2) Mental model (what an exemplar actually is)

From the OpenTelemetry metrics SDK spec, an exemplar records a measurement sample with:

    • The measured value and the timestamp of the observation.
    • The trace_id and span_id of the span active at observation time.
    • Filtered attributes: measurement attributes not retained on the metric point, preserved on the exemplar.

Two-stage decision path:

  1. ExemplarFilter decides if a measurement is eligible (trace_based, always_on, always_off).
  2. ExemplarReservoir performs final sampling/storage for exemplars per timeseries.
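The two stages can be sketched in plain Python (illustrative only; these class and function names are not the OTel SDK API, and the SDK's fixed-size reservoir algorithm differs in detail):

```python
import random

def trace_based_filter(span_sampled: bool) -> bool:
    """Stage 1 (ExemplarFilter): with trace_based, a measurement is
    eligible only if it was taken inside a sampled span."""
    return span_sampled

class SimpleReservoir:
    """Stage 2 (ExemplarReservoir): keep at most `size` exemplars per
    timeseries, here via naive reservoir sampling."""
    def __init__(self, size: int = 4):
        self.size = size
        self.seen = 0
        self.exemplars: list[tuple[float, str]] = []  # (value, trace_id)

    def offer(self, value: float, trace_id: str) -> None:
        self.seen += 1
        if len(self.exemplars) < self.size:
            self.exemplars.append((value, trace_id))
        elif random.randrange(self.seen) < self.size:
            self.exemplars[random.randrange(self.size)] = (value, trace_id)

reservoir = SimpleReservoir(size=2)
for i, span_sampled in enumerate([True, False, True, True]):
    if trace_based_filter(span_sampled):                        # stage 1
        reservoir.offer(value=float(i), trace_id=f"trace-{i}")  # stage 2

print(len(reservoir.exemplars))  # 2: capped by reservoir size
```

In real SDKs the stage-1 knob is the spec-defined environment variable OTEL_METRICS_EXEMPLAR_FILTER (values: always_on, always_off, trace_based).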

Key default behavior to remember: the SDK's default ExemplarFilter is trace_based, so exemplars are recorded only when the current span context is valid and sampled; your trace sampling rate therefore directly caps exemplar volume.


3) Pipeline prerequisites (end-to-end)

Correlation only works if every hop preserves exemplar data:

  1. App instrumentation emits exemplars (typically on histograms).
  2. Scrape/protocol path uses an exemplar-capable exposition format (OpenMetrics text or Prometheus protobuf); the classic Prometheus text format has no exemplar syntax and silently drops them.
  3. Metrics backend stores exemplars.
  4. UI is configured to link exemplar label (trace_id) to tracing datasource (Tempo/Jaeger/etc.).

If any one of these is missing, you will see metrics and traces separately but no clickable bridge.
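For hop 3 (backend storage), Prometheus-side exemplar support is opt-in. A minimal sketch, with flag and field names taken from the Prometheus feature-flag docs (verify against your version):

```yaml
# Start Prometheus with the feature flag:
#   prometheus --enable-feature=exemplar-storage
#
# prometheus.yml fragment sizing the in-memory circular buffer:
storage:
  exemplars:
    max_exemplars: 100000
```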


4) Practical implementation pattern

4.1 Instrument where exemplars are highest value

Prioritize latency histograms for user-facing critical paths, for example:

    • Inbound HTTP/RPC server duration on edge and core services.
    • Calls to downstream dependencies (databases, caches, third-party APIs).
    • Queue/async processing latency on SLO-relevant workflows.

Why histograms first: you care most about outliers in tail buckets, and exemplars are perfect for drilling into those.
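For example, a Grafana panel query that exemplar diamonds typically attach to (the metric name and label are illustrative):

```promql
histogram_quantile(
  0.99,
  sum by (le) (rate(http_server_duration_seconds_bucket{service="checkout"}[5m]))
)
```

With exemplars enabled on the panel, points near a p99 spike carry the trace_id of a real slow request.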

4.2 Ensure stable cross-signal identity labels

Keep service.name, environment, and (if relevant) cluster consistent across metrics/logs/traces. Correlation UX is much smoother when all signals share the same identity spine.
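One low-friction way to pin that identity spine is the spec-defined resource environment variables (the values shown are illustrative):

```shell
export OTEL_SERVICE_NAME="checkout"
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=prod,k8s.cluster.name=eu-west-1"
```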

4.3 Prometheus storage and capacity knobs

Exemplar storage is opt-in: start Prometheus with --enable-feature=exemplar-storage. Storage is a fixed-size in-memory circular buffer shared across all series, sized via storage.exemplars.max_exemplars in prometheus.yml.

Rule-of-thumb memory estimate: budget max_exemplars times the per-exemplar overhead (value, timestamp, trace_id/span_id labels), and verify the per-exemplar figure against your version's documentation.
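As a back-of-envelope sketch (the ~100-bytes-per-exemplar figure is an assumption to validate against your Prometheus version):

```python
# Rough memory estimate for Prometheus's in-memory exemplar buffer.
# Assumption: each stored exemplar (value, timestamp, trace_id/span_id
# labels) costs on the order of ~100 bytes.
BYTES_PER_EXEMPLAR = 100    # assumed average, not a documented constant
max_exemplars = 100_000     # storage.exemplars.max_exemplars setting

estimate_mb = max_exemplars * BYTES_PER_EXEMPLAR / 1_000_000
print(f"~{estimate_mb:.0f} MB resident for the exemplar buffer")
```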

4.4 Remote write forwarding

If forwarding to Grafana Cloud/Mimir-compatible backends, explicitly enable exemplar forwarding (send_exemplars: true) in the remote_write configuration; it defaults to false, and exemplars are silently dropped otherwise.
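A minimal prometheus.yml fragment for this (the URL is a placeholder):

```yaml
remote_write:
  - url: "https://prometheus.example.com/api/v1/push"
    send_exemplars: true   # exemplars are dropped without this
```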


5) Rollout strategy (safe in production)

Phase A — One service, one histogram. Enable exemplars on a single user-facing latency histogram, then verify the full click-through (diamond on the panel opens the matching trace) before expanding.

Phase B — Expand by SLO criticality. Roll out to the remaining services in order of SLO impact, reusing the same identity labels and dashboard patterns.

Phase C — Tighten cost controls. With coverage in place, tune exemplar buffer size, trace sampling rate, and retention against observed cost.


6) Common failure modes and fixes

  1. No diamonds in panel / no exemplars visible

    • Verify backend storage enabled.
    • Verify panel type supports exemplars (Grafana Time series panel).
    • Verify query actually targets instrument with exemplars.
  2. Diamonds exist but link is broken/404

    • Check datasource correlation mapping (trace_id label name, Tempo/Jaeger datasource selection).
    • Ensure trace backend retention still includes linked traces.
  3. Exemplars exist but traces missing

    • Tail sampling may have dropped the trace even though metric exemplar was emitted.
    • Align tracing sampling/retention policy with exemplar expectations.
  4. Cardinality/cost surprises

    • Don’t attach many custom exemplar labels.
    • Keep shared identity labels controlled and consistent.
  5. Format/protocol mismatch

    • Ensure scrape/export path supports exemplar-capable format (OpenMetrics/proto paths as appropriate in your stack).
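A quick check for this failure mode: request the OpenMetrics exposition directly and look for the exemplar suffix after the `#` (endpoint, metric name, and IDs here are illustrative):

```shell
curl -s -H 'Accept: application/openmetrics-text; version=1.0.0; charset=utf-8' \
  http://localhost:8080/metrics | grep 'trace_id'
# An exemplar-bearing bucket line looks like:
# http_server_duration_seconds_bucket{le="0.5"} 42 # {trace_id="4bf92f35...",span_id="00f067aa..."} 0.43 1700000000.0
```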

7) Opinionated defaults (good starting point)

    • Keep the SDK ExemplarFilter at its trace_based default.
    • Attach exemplars to latency histograms first; leave counters and gauges alone initially.
    • Enable Prometheus exemplar storage and set send_exemplars: true on remote write.
    • Configure the Grafana trace_id link to your tracing datasource before wide rollout.

8) Bottom line

Exemplars are one of the highest-ROI observability upgrades because they remove the manual search step between "SLO graph looks bad" and "which exact request was bad?".

Treat exemplar correlation as a productized path (instrumentation + storage + UI mapping + sampling policy), not a one-off dashboard trick.

