OpenTelemetry Collector Reliability Playbook: Backpressure, Queues, WAL, and Scale (2026)

2026-03-31 · software

TL;DR

Put memory_limiter first in every pipeline, bound every network exporter with a sending queue plus retry, back those queues with file_storage where restart loss is unacceptable, and scale collectors only after confirming the backend itself is healthy.
1) Failure model first (what actually loses data)

In production, telemetry loss usually happens through one of these paths:

  1. Endpoint outage + queue overflow
    • exporter queue fills, new batches get dropped.
  2. Endpoint outage longer than retry budget
    • oldest queued data ages out (max_elapsed_time) and is dropped.
  3. Collector restart/crash without persistence
    • in-memory queue is gone.
  4. Collector overload without effective backpressure
    • memory spikes, forced drops, or OOM kill.
  5. Backend is slow/saturated
    • queues grow without bound; adding collectors only increases pressure.

Treat these as explicit risk branches in your observability SLO, not edge cases.
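Each failure path above maps to a specific configuration surface. A sketch of the mapping (component names are the standard Collector ones; the numbers are placeholders, not recommendations):

```yaml
processors:
  memory_limiter:            # path 4: refuse data instead of OOM-killing
    limit_mib: 2048
exporters:
  otlphttp:
    sending_queue:
      queue_size: 5000       # path 1: sized from ingest rate x tolerated outage
      storage: file_storage  # path 3: queue survives restarts
    retry_on_failure:
      max_elapsed_time: 10m  # path 2: bounds how long queued data may age out
# path 5 has no config fix: it needs backend capacity or ingest shaping
```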


2) Reliability architecture (practical default)

Recommended topology

  • Per-node or sidecar agents receive telemetry locally and forward over OTLP to a gateway tier.
  • A horizontally scaled gateway tier owns batching, queueing, WAL persistence, and the export path to the backend.
  • The backend is reached only through the gateway, so retry and buffering policy lives in one place.

Why this works

  • Backpressure propagates cleanly: the gateway's memory limiter refuses data, agents retry from their own small queues, and applications stay insulated from backend outages.
  • Durable state (disk-backed queues) is concentrated in a small, well-monitored fleet instead of on every node.
  • Scaling and outage handling become gateway-tier decisions with one set of dashboards.


3) Backpressure-first configuration strategy

A. Memory limiter (first processor)

Key behavior:

  • Above the soft limit (limit_mib minus spike_limit_mib), the limiter refuses incoming data; the refusal surfaces as a retryable error to the upstream sender.
  • Above the hard limit (limit_mib), it also forces garbage collection to pull memory back down.
  • Refusal is backpressure by design: senders with retry logic back off instead of the collector being OOM-killed.

Operational rules:

  • Always first in the processors list, in every pipeline.
  • Keep limit_mib comfortably below the container memory limit, with GOMEMLIMIT set between the two.
  • Treat sustained refusals as a capacity signal to act on, not noise to suppress.
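A minimal limiter block for containerized deployments. The percentage form reads limits from the cgroup, which keeps the config portable across pod sizes (values are illustrative, not recommendations):

```yaml
processors:
  memory_limiter:
    check_interval: 1s          # how often memory usage is sampled
    limit_percentage: 80        # hard limit as a share of cgroup memory
    spike_limit_percentage: 25  # soft limit = hard limit minus this margin
```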

B. Batch after limiter

batch improves compression and export efficiency, but oversizing it raises burst amplitude and adds pipeline latency. Tune it against your latency budget, not maximum throughput alone.
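A batch block tuned for a latency budget rather than raw throughput might look like this (numbers are illustrative starting points):

```yaml
processors:
  batch:
    timeout: 200ms            # upper bound on latency the batcher may add
    send_batch_size: 512      # flush early once this many items accumulate
    send_batch_max_size: 1024 # hard cap; oversized batches are split
```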

C. Export queue + retry

Every network exporter should define:

  • sending_queue: enabled, with a deliberately sized queue_size and num_consumers.
  • retry_on_failure: enabled, with initial_interval, max_interval, and a max_elapsed_time aligned to your loss tolerance.

This creates controlled buffering during transient downstream failures.
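Sizing the queue is arithmetic: queue_size counts queued requests (batches) by default, so buffered time ≈ queue_size × batch size ÷ ingest rate. A sketch with assumed example rates:

```yaml
exporters:
  otlphttp:
    endpoint: https://observability-backend.example.com  # placeholder
    sending_queue:
      enabled: true
      queue_size: 5000   # 5000 req x 1024 spans / 20k spans/s ≈ 256 s of buffer
      num_consumers: 8   # parallel senders draining the queue
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 10m  # after this a request is dropped; align with SLO
```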


4) WAL persistence (file_storage) when you need restart durability

Use file_storage for exporter queues when "collector restarts = data loss" is unacceptable.

Benefits:

  • Queued-but-unsent data survives collector restarts, crashes, and rolling upgrades.
  • Outage buffering is bounded by disk capacity rather than RAM.

Risks:

  • The disk can fill; monitor volume usage and alert well before exhaustion.
  • WAL I/O adds latency and can cap throughput on slow disks.
  • An unwritable or corrupted storage directory quietly removes the durability you think you have.

Minimum rule: if SLO cannot tolerate restart-window loss, use persistent queue storage and monitor disk aggressively.
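A file_storage setup covering the risks above. The compaction and fsync knobs exist in the extension; the specific values here are illustrative:

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/storage
    timeout: 1s              # file-lock acquisition timeout
    fsync: false             # true trades throughput for stronger durability
    compaction:
      on_rebound: true       # reclaim disk after a large queue drains
      directory: /var/lib/otelcol/storage

exporters:
  otlphttp:
    sending_queue:
      storage: file_storage  # ties this exporter's queue to the WAL
```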


5) Scale triggers and anti-triggers

Scale-up signals

  • Sustained CPU near limits across replicas.
  • Queue utilization (queue_size / queue_capacity) climbing while backend latency is normal.
  • Memory-limiter refusals during otherwise healthy backend operation.

Do NOT scale collectors yet when

  • queue growth coincides with rising backend latency or error rates, or
  • retries are failing uniformly across every replica.

That often means backend saturation. More collectors = more pressure = faster failure.


6) Baseline config template (gateway-oriented)

extensions:
  health_check:
  file_storage:
    directory: /var/lib/otelcol/storage

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2048
    spike_limit_mib: 410
  batch:
    timeout: 5s
    send_batch_size: 1024
    send_batch_max_size: 2048

exporters:
  otlphttp:
    endpoint: https://observability-backend.example.com
    sending_queue:
      enabled: true
      storage: file_storage
      queue_size: 5000
      num_consumers: 8
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 10m


service:
  extensions: [health_check, file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]

Adjust numbers from measured traffic, not copy-paste dogma.


7) SLO-oriented runbook

Incident A: queue climbing fast

  1. Check backend health/latency first.
  2. Confirm retry + queue settings are active.
  3. If backend healthy but collectors saturated, scale collector replicas.
  4. If backend unhealthy, protect ingestion (sampling/rate limits) and fix backend before scaling collector aggressively.

Incident B: memory limiter refusing continuously

  1. Confirm limiter is first in processors.
  2. Verify GOMEMLIMIT and container memory limits are coherent.
  3. Reduce burstiness (batch size/timeout), raise capacity, or scale out.
  4. Audit receivers/exporters that don’t handle retries correctly.
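Step 2's coherence check can be sketched as a Kubernetes container spec (assuming a Kubernetes deployment; the image tag and numbers are placeholders):

```yaml
containers:
  - name: otelcol
    image: otel/opentelemetry-collector-contrib:latest
    env:
      - name: GOMEMLIMIT
        value: "3276MiB"   # ~80% of the container limit below
    resources:
      limits:
        memory: 4Gi        # memory_limiter's limit_mib should sit under GOMEMLIMIT
```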

Incident C: data lost during restart

  1. Confirm queue persistence (file_storage) is configured and writable.
  2. Verify disk headroom and WAL I/O behavior.
  3. Recalculate retry horizon vs worst-case downstream outage.
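Step 3 is a back-of-envelope calculation; as comments, with example rates that are assumptions, not measurements:

```yaml
# retry horizon check (example numbers):
#   ingest rate:       20,000 spans/s
#   batch size:        1,024 spans   -> ~20 requests/s enqueued
#   queue_size:        5,000 requests -> ~256 s (~4 min) of buffer
#   max_elapsed_time:  10m
# worst-case covered outage = min(queue buffer, max_elapsed_time) ≈ 4 min.
# If the backend can realistically be down 30 min, raise both (disk-backed):
exporters:
  otlphttp:
    sending_queue:
      queue_size: 40000        # ≈ 34 min of buffer at the example rate
    retry_on_failure:
      max_elapsed_time: 35m
```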

8) Metrics to dashboard (minimum set)

  • otelcol_exporter_queue_size vs otelcol_exporter_queue_capacity (utilization).
  • otelcol_exporter_enqueue_failed_* and otelcol_exporter_send_failed_* (dropped vs failed sends).
  • otelcol_processor_refused_* from memory_limiter (backpressure engaged).
  • otelcol_receiver_refused_* (pressure pushed back to senders).
  • otelcol_process_memory_rss, plus disk usage on the file_storage volume.

Alerting hints:

  • Page on any sustained enqueue failures: that is active data loss.
  • Warn on queue utilization above ~80% for more than a few minutes.
  • Warn on sustained memory-limiter refusals; correlate with backend latency before scaling.
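The alerting hints expressed as Prometheus alert rules (a sketch; metric names follow the Collector's self-telemetry for the traces signal, and the thresholds are starting points, not tuned values):

```yaml
groups:
  - name: otelcol-reliability
    rules:
      - alert: ExporterQueueNearFull
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 5m
      - alert: ExporterEnqueueFailures     # enqueue failure = dropped data
        expr: rate(otelcol_exporter_enqueue_failed_spans[5m]) > 0
      - alert: MemoryLimiterRefusing
        expr: rate(otelcol_processor_refused_spans[5m]) > 0
        for: 10m
```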


9) Rollout plan (low-drama)

  1. Baseline current loss/latency profile.
  2. Enable memory_limiter + exporter queue/retry in canary.
  3. Enable file_storage for canary gateways.
  4. Add queue/refusal SLO alerts before broad rollout.
  5. Scale test with synthetic backend slowdown.
  6. Promote by signal type (traces -> metrics -> logs) to reduce blast radius.

10) Final take

Collector reliability is less about one magic component and more about system discipline:

  • memory_limiter first, everywhere.
  • bounded queues and retries on every network exporter.
  • persistence where the SLO demands it.
  • scaling decisions driven by backend health, not queue depth alone.

If one change must happen this week: make memory_limiter first, and verify every exporter path has queue + retry configured and monitored.

