OpenTelemetry Collector Reliability Playbook: Backpressure, Queues, WAL, and Scale (2026)

2026-03-31 · software

TL;DR

Put memory_limiter first in every pipeline, bound every network exporter with a sending queue plus retry, back those queues with file_storage where restart loss is unacceptable, and scale collectors only after confirming the backend itself is healthy.
1) Failure model first (what actually loses data)

In production, telemetry loss usually happens through one of these paths:

  1. Endpoint outage + queue overflow
    • exporter queue fills, new batches get dropped.
  2. Endpoint outage longer than retry budget
    • oldest queued data ages out (max_elapsed_time) and is dropped.
  3. Collector restart/crash without persistence
    • in-memory queue is gone.
  4. Collector overload without effective backpressure
    • memory spikes, forced drops, or OOM kill.
  5. Backend is slow/saturated
    • queues grow without bound; adding collectors only increases pressure.

Treat these as explicit risk branches in your observability SLO, not edge cases.
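Each failure path above maps to a specific configuration surface. A sketch of the mapping (component names are the standard Collector ones; the numbers are placeholders, not recommendations):

```yaml
processors:
  memory_limiter:            # path 4: refuse data instead of OOM-killing
    limit_mib: 2048
exporters:
  otlphttp:
    sending_queue:
      queue_size: 5000       # path 1: sized from ingest rate x tolerated outage
      storage: file_storage  # path 3: queue survives restarts
    retry_on_failure:
      max_elapsed_time: 10m  # path 2: bounds how long queued data may age out
# path 5 has no config fix: it needs backend capacity or ingest shaping
```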


2) Reliability architecture (practical default)

Recommended topology

  • Per-node or sidecar agents receive telemetry locally and forward over OTLP to a gateway tier.
  • A horizontally scaled gateway tier owns batching, queueing, WAL persistence, and the export path to the backend.
  • The backend is reached only through the gateway, so retry and buffering policy lives in one place.

Why this works

  • Backpressure propagates cleanly: the gateway's memory limiter refuses data, agents retry from their own small queues, and applications stay insulated from backend outages.
  • Durable state (disk-backed queues) is concentrated in a small, well-monitored fleet instead of on every node.
  • Scaling and outage handling become gateway-tier decisions with one set of dashboards.


3) Backpressure-first configuration strategy

A. Memory limiter (first processor)

Key behavior:

  • Above the soft limit (limit_mib minus spike_limit_mib), the limiter refuses incoming data; the refusal surfaces as a retryable error to the upstream sender.
  • Above the hard limit (limit_mib), it also forces garbage collection to pull memory back down.
  • Refusal is backpressure by design: senders with retry logic back off instead of the collector being OOM-killed.

Operational rules:

  • Always first in the processors list, in every pipeline.
  • Keep limit_mib comfortably below the container memory limit, with GOMEMLIMIT set between the two.
  • Treat sustained refusals as a capacity signal to act on, not noise to suppress.
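A minimal limiter block for containerized deployments. The percentage form reads limits from the cgroup, which keeps the config portable across pod sizes (values are illustrative, not recommendations):

```yaml
processors:
  memory_limiter:
    check_interval: 1s          # how often memory usage is sampled
    limit_percentage: 80        # hard limit as a share of cgroup memory
    spike_limit_percentage: 25  # soft limit = hard limit minus this margin
```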

B. Batch after limiter

batch improves compression and export efficiency, but oversizing it raises burst amplitude and adds pipeline latency. Tune it against your latency budget, not maximum throughput alone.
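A batch block tuned for a latency budget rather than raw throughput might look like this (numbers are illustrative starting points):

```yaml
processors:
  batch:
    timeout: 200ms            # upper bound on latency the batcher may add
    send_batch_size: 512      # flush early once this many items accumulate
    send_batch_max_size: 1024 # hard cap; oversized batches are split
```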

C. Export queue + retry

Every network exporter should define:

  • sending_queue: enabled, with a deliberately sized queue_size and num_consumers.
  • retry_on_failure: enabled, with initial_interval, max_interval, and a max_elapsed_time aligned to your loss tolerance.

This creates controlled buffering during transient downstream failures.
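Sizing the queue is arithmetic: queue_size counts queued requests (batches) by default, so buffered time ≈ queue_size × batch size ÷ ingest rate. A sketch with assumed example rates:

```yaml
exporters:
  otlphttp:
    endpoint: https://observability-backend.example.com  # placeholder
    sending_queue:
      enabled: true
      queue_size: 5000   # 5000 req x 1024 spans / 20k spans/s ≈ 256 s of buffer
      num_consumers: 8   # parallel senders draining the queue
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 10m  # after this a request is dropped; align with SLO
```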


4) WAL persistence (file_storage) when you need restart durability

Use file_storage for exporter queues when "collector restarts = data loss" is unacceptable.

Benefits:

  • Queued-but-unsent data survives collector restarts, crashes, and rolling upgrades.
  • Outage buffering is bounded by disk capacity rather than RAM.

Risks:

  • The disk can fill; monitor volume usage and alert well before exhaustion.
  • WAL I/O adds latency and can cap throughput on slow disks.
  • An unwritable or corrupted storage directory quietly removes the durability you think you have.

Minimum rule: if SLO cannot tolerate restart-window loss, use persistent queue storage and monitor disk aggressively.
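A file_storage setup covering the risks above. The compaction and fsync knobs exist in the extension; the specific values here are illustrative:

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/storage
    timeout: 1s              # file-lock acquisition timeout
    fsync: false             # true trades throughput for stronger durability
    compaction:
      on_rebound: true       # reclaim disk after a large queue drains
      directory: /var/lib/otelcol/storage

exporters:
  otlphttp:
    sending_queue:
      storage: file_storage  # ties this exporter's queue to the WAL
```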


5) Scale triggers and anti-triggers

Scale-up signals

  • Sustained CPU near limits across replicas.
  • Queue utilization (queue_size / queue_capacity) climbing while backend latency is normal.
  • Memory-limiter refusals during otherwise healthy backend operation.

Do NOT scale collectors yet when

  • queue growth coincides with rising backend latency or error rates, or
  • retries are failing uniformly across every replica.

That often means backend saturation. More collectors = more pressure = faster failure.


6) Baseline config template (gateway-oriented)

extensions:
  health_check:
  file_storage:
    directory: /var/lib/otelcol/storage

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2048
    spike_limit_mib: 410
  batch:
    timeout: 5s
    send_batch_size: 1024
    send_batch_max_size: 2048

exporters:
  otlphttp:
    endpoint: https://observability-backend.example.com
    sending_queue:
      enabled: true
      storage: file_storage
      queue_size: 5000
      num_consumers: 8
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 10m


service:
  extensions: [health_check, file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]

Adjust numbers from measured traffic, not copy-paste dogma.


7) SLO-oriented runbook

Incident A: queue climbing fast

  1. Check backend health/latency first.
  2. Confirm retry + queue settings are active.
  3. If backend healthy but collectors saturated, scale collector replicas.
  4. If backend unhealthy, protect ingestion (sampling/rate limits) and fix backend before scaling collector aggressively.

Incident B: memory limiter refusing continuously

  1. Confirm limiter is first in processors.
  2. Verify GOMEMLIMIT and container memory limits are coherent.
  3. Reduce burstiness (batch size/timeout), raise capacity, or scale out.
  4. Audit receivers/exporters that don’t handle retries correctly.
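Step 2's coherence check can be sketched as a Kubernetes container spec (assuming a Kubernetes deployment; the image tag and numbers are placeholders):

```yaml
containers:
  - name: otelcol
    image: otel/opentelemetry-collector-contrib:latest
    env:
      - name: GOMEMLIMIT
        value: "3276MiB"   # ~80% of the container limit below
    resources:
      limits:
        memory: 4Gi        # memory_limiter's limit_mib should sit under GOMEMLIMIT
```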

Incident C: data lost during restart

  1. Confirm queue persistence (file_storage) is configured and writable.
  2. Verify disk headroom and WAL I/O behavior.
  3. Recalculate retry horizon vs worst-case downstream outage.
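Step 3 is a back-of-envelope calculation; as comments, with example rates that are assumptions, not measurements:

```yaml
# retry horizon check (example numbers):
#   ingest rate:       20,000 spans/s
#   batch size:        1,024 spans   -> ~20 requests/s enqueued
#   queue_size:        5,000 requests -> ~256 s (~4 min) of buffer
#   max_elapsed_time:  10m
# worst-case covered outage = min(queue buffer, max_elapsed_time) ≈ 4 min.
# If the backend can realistically be down 30 min, raise both (disk-backed):
exporters:
  otlphttp:
    sending_queue:
      queue_size: 40000        # ≈ 34 min of buffer at the example rate
    retry_on_failure:
      max_elapsed_time: 35m
```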

8) Metrics to dashboard (minimum set)

  • otelcol_exporter_queue_size vs otelcol_exporter_queue_capacity (utilization).
  • otelcol_exporter_enqueue_failed_* and otelcol_exporter_send_failed_* (dropped vs failed sends).
  • otelcol_processor_refused_* from memory_limiter (backpressure engaged).
  • otelcol_receiver_refused_* (pressure pushed back to senders).
  • otelcol_process_memory_rss, plus disk usage on the file_storage volume.

Alerting hints:

  • Page on any sustained enqueue failures: that is active data loss.
  • Warn on queue utilization above ~80% for more than a few minutes.
  • Warn on sustained memory-limiter refusals; correlate with backend latency before scaling.
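The alerting hints expressed as Prometheus alert rules (a sketch; metric names follow the Collector's self-telemetry for the traces signal, and the thresholds are starting points, not tuned values):

```yaml
groups:
  - name: otelcol-reliability
    rules:
      - alert: ExporterQueueNearFull
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 5m
      - alert: ExporterEnqueueFailures     # enqueue failure = dropped data
        expr: rate(otelcol_exporter_enqueue_failed_spans[5m]) > 0
      - alert: MemoryLimiterRefusing
        expr: rate(otelcol_processor_refused_spans[5m]) > 0
        for: 10m
```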


9) Rollout plan (low-drama)

  1. Baseline current loss/latency profile.
  2. Enable memory_limiter + exporter queue/retry in canary.
  3. Enable file_storage for canary gateways.
  4. Add queue/refusal SLO alerts before broad rollout.
  5. Scale test with synthetic backend slowdown.
  6. Promote by signal type (traces -> metrics -> logs) to reduce blast radius.

10) Final take

Collector reliability is less about one magic component and more about system discipline:

  • memory_limiter first, everywhere.
  • bounded queues and retries on every network exporter.
  • persistence where the SLO demands it.
  • scaling decisions driven by backend health, not queue depth alone.

If one change must happen this week: make memory_limiter first, and verify every exporter path has queue + retry configured and monitored.

