OpenTelemetry Collector Reliability Playbook: Backpressure, Queues, WAL, and Scale (2026)
TL;DR
- Collector reliability is mostly a flow-control problem: ingest rate vs processing/export capacity.
- Use `memory_limiter` as the first processor so overload becomes controlled backpressure, not random OOM.
- Exporters should always have `sending_queue` + `retry_on_failure`; add `file_storage` (WAL) when restart-loss is unacceptable.
- Scale decisions should be driven by queue/refusal metrics, not CPU alone.
- If backend is saturated, adding collectors can make things worse. Fix backend bottlenecks first.
1) Failure model first (what actually loses data)
In production, telemetry loss usually happens through one of these paths:
- Endpoint outage + queue overflow: exporter queue fills, new batches get dropped.
- Endpoint outage longer than retry budget: oldest queued data ages out (`max_elapsed_time`) and is dropped.
- Collector restart/crash without persistence: in-memory queue is gone.
- Collector overload without effective backpressure: memory spikes, forced drops, or OOM kill.
- Backend is slow/saturated: queue rises forever; scaling collectors only increases pressure.
Treat these as explicit risk branches in your observability SLO, not edge cases.
2) Reliability architecture (practical default)
Recommended topology
- Agent tier close to workloads (DaemonSet/Sidecar): receive + light processing.
- Gateway tier centralized: heavier processing, sampling, export fan-out.
- Optional durable bus (Kafka) between tiers if you need stronger decoupling.
Why this works
- Local ingestion remains resilient to backend hiccups.
- Heavy transforms and external egress are isolated.
- Each tier scales independently by signal/load profile.
3) Backpressure-first configuration strategy
A. Memory limiter (first processor)
Key behavior:
- Soft limit exceeded -> processor refuses incoming data with non-permanent errors.
- Hard limit exceeded -> forces GC and keeps refusing until memory falls.
Operational rules:
- Put `memory_limiter` first in the processor chain.
- Set `check_interval` near 1s for fast reaction.
- Start `spike_limit_mib` at roughly 20% of the hard limit.
- Set `GOMEMLIMIT` to roughly 80% of the Collector's hard memory limit.
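The rules of thumb above can be applied mechanically. The helper below is an illustrative sketch (the function name, the 10% container headroom, and the rounding are assumptions, not official formulas):

```python
def memory_limiter_settings(container_limit_mib: int) -> dict:
    """Derive illustrative memory_limiter settings from a container
    memory limit, using the rules of thumb above:
    - hard limit slightly below the container limit (runtime headroom),
    - spike_limit_mib ~20% of the hard limit,
    - GOMEMLIMIT ~80% of the hard limit.
    These are starting points to tune, not guarantees."""
    limit_mib = int(container_limit_mib * 0.9)   # ~10% headroom (assumption)
    spike_limit_mib = int(limit_mib * 0.2)       # ~20% of hard limit
    gomemlimit_mib = int(limit_mib * 0.8)        # ~80% of hard limit
    return {
        "limit_mib": limit_mib,
        "spike_limit_mib": spike_limit_mib,
        "GOMEMLIMIT": f"{gomemlimit_mib}MiB",
    }

# Example: a 2.5 GiB container
print(memory_limiter_settings(2560))
```

Feed the resulting `limit_mib`/`spike_limit_mib` into the processor config and export `GOMEMLIMIT` as an environment variable on the Collector container.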
B. Batch after limiter
`batch` improves compression and export efficiency, but an oversized batch increases burst amplitude.
Tune it against your latency budget, not maximum throughput alone.
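As an illustration, a batch configuration tuned for a modest latency budget might look like this (the specific values are assumptions to adapt to measured traffic, not recommendations):

```yaml
processors:
  batch:
    timeout: 2s                # upper bound on latency added by batching
    send_batch_size: 1024      # flush early once this many items buffer up
    send_batch_max_size: 2048  # hard cap so a burst cannot oversize a request
```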
C. Export queue + retry
Every network exporter should define:
- `sending_queue.enabled: true`
- sufficient `queue_size`
- bounded retry policy (`initial_interval`, `max_interval`, `max_elapsed_time`)
This creates controlled buffering during transient downstream failures.
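To size that buffer, it helps to estimate how long a full queue can absorb a downstream outage. The sketch below assumes `queue_size` is measured in batches/requests (the queue unit has varied across Collector versions, so verify for yours):

```python
def queue_buffer_seconds(queue_size: int, batches_per_second: float) -> float:
    """Rough time a full sending_queue can absorb a downstream outage,
    assuming queue_size counts batches and ingest rate is steady.
    Treat the result as an order-of-magnitude estimate."""
    return queue_size / batches_per_second

# Example: 5000 queued batches at ~10 batches/s of ingest
print(queue_buffer_seconds(5000, 10.0), "seconds of buffering")
```

If that number is shorter than your realistic outage window, raise `queue_size`, add persistence, or accept the loss explicitly in the SLO.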
4) WAL persistence (file_storage) when you need restart durability
Use file_storage for exporter queues when "collector restarts = data loss" is unacceptable.
Benefits:
- Survives pod/node restarts.
- Replays queued telemetry after restart.
Risks:
- Disk full / slow disk = new failure mode.
- WAL is not an infinite buffer; retry horizon still matters.
Minimum rule: if SLO cannot tolerate restart-window loss, use persistent queue storage and monitor disk aggressively.
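For the disk-monitoring side, a back-of-envelope sizing sketch can anchor the alert thresholds. The function below is illustrative (the safety factor is an assumption, and real WAL overhead varies with encoding and compression):

```python
def wal_disk_bytes(ingest_bytes_per_sec: float,
                   worst_outage_sec: float,
                   safety_factor: float = 2.0) -> float:
    """Back-of-envelope WAL sizing: enough disk to hold the worst-case
    outage worth of telemetry, with headroom. Assumes the WAL stores
    roughly the on-wire payload size, which is only approximately true."""
    return ingest_bytes_per_sec * worst_outage_sec * safety_factor

# Example: 5 MB/s ingest, 30-minute worst-case outage, 2x headroom
needed = wal_disk_bytes(5 * 1024**2, 30 * 60)
print(f"{needed / 1024**3:.1f} GiB")
```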
5) Scale triggers and anti-triggers
Scale-up signals
- `otelcol_processor_refused_*` rises (memory limiter refusing data).
- `otelcol_exporter_queue_size / otelcol_exporter_queue_capacity` sustained above ~0.6-0.7.
- enqueue failures increase (`otelcol_exporter_enqueue_failed_*`).
Do NOT scale collectors yet when
- queue remains near capacity and send failures remain high.
- backend latency/error is rising.
That often means backend saturation. More collectors = more pressure = faster failure.
6) Baseline config template (gateway-oriented)
```yaml
extensions:
  health_check:
  file_storage:
    directory: /var/lib/otelcol/storage

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2048
    spike_limit_mib: 410
  batch:
    timeout: 5s
    send_batch_size: 1024
    send_batch_max_size: 2048

exporters:
  otlphttp:
    endpoint: https://observability-backend.example.com
    sending_queue:
      enabled: true
      storage: file_storage
      queue_size: 5000
      num_consumers: 8
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 10m
    # auth: configure an authenticator extension here if the backend requires it

service:
  extensions: [health_check, file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
```
Adjust numbers from measured traffic, not copy-paste dogma.
7) SLO-oriented runbook
Incident A: queue climbing fast
- Check backend health/latency first.
- Confirm retry + queue settings are active.
- If backend healthy but collectors saturated, scale collector replicas.
- If backend unhealthy, protect ingestion (sampling/rate limits) and fix backend before scaling collector aggressively.
Incident B: memory limiter refusing continuously
- Confirm limiter is first in processors.
- Verify `GOMEMLIMIT` and container memory limits are coherent.
- Reduce burstiness (`batch` size/timeout), raise capacity, or scale out.
- Audit receivers/exporters that don't handle retries correctly.
Incident C: data lost during restart
- Confirm queue persistence (`file_storage`) is configured and writable.
- Verify disk headroom and WAL I/O behavior.
- Recalculate retry horizon vs worst-case downstream outage.
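One way to recalculate that horizon is to approximate the exponential backoff schedule implied by the retry settings. The sketch below assumes a 1.5x multiplier (a common backoff default) and ignores jitter, so treat its output as an estimate:

```python
def retry_schedule(initial_s: float, max_interval_s: float,
                   max_elapsed_s: float, multiplier: float = 1.5):
    """Approximate the retry wait times until max_elapsed_time is
    exhausted: intervals grow by `multiplier`, capped at max_interval,
    and retrying stops once the next wait would exceed max_elapsed.
    Jitter is ignored, so real schedules will differ somewhat."""
    waits, elapsed, interval = [], 0.0, initial_s
    while elapsed + interval <= max_elapsed_s:
        waits.append(interval)
        elapsed += interval
        interval = min(interval * multiplier, max_interval_s)
    return waits, elapsed

# Example: initial 5s, cap 30s, give up after 10 minutes
waits, horizon = retry_schedule(5.0, 30.0, 600.0)
print(len(waits), "retries over", horizon, "seconds before data is dropped")
```

If `horizon` is shorter than a plausible backend outage, data will age out: extend `max_elapsed_time`, add WAL persistence, or both.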
8) Metrics to dashboard (minimum set)
- `otelcol_processor_refused_spans|metrics|logs`
- `otelcol_exporter_queue_size`
- `otelcol_exporter_queue_capacity`
- `otelcol_exporter_enqueue_failed_*`
- `otelcol_exporter_send_failed_*`
- `otelcol_exporter_sent_*`
- Process memory + GC stats
- WAL disk usage / filesystem free space
Alerting hints:
- Queue ratio > 70% for N minutes
- Refused data non-zero sustained
- Enqueue failures > 0 sustained
- WAL disk free below safety threshold
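Assuming the Collector's internal metrics are scraped into Prometheus, the hints above could translate into rules like the following (metric names, durations, and thresholds are illustrative; exact names vary by Collector version and telemetry configuration):

```yaml
groups:
  - name: otelcol-reliability
    rules:
      - alert: OtelcolQueueNearFull
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.7
        for: 10m
      - alert: OtelcolRefusingData
        expr: rate(otelcol_processor_refused_spans[5m]) > 0
        for: 10m
      - alert: OtelcolEnqueueFailures
        expr: rate(otelcol_exporter_enqueue_failed_spans[5m]) > 0
        for: 10m
```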
9) Rollout plan (low-drama)
- Baseline current loss/latency profile.
- Enable `memory_limiter` + exporter queue/retry in canary.
- Enable `file_storage` for canary gateways.
- Add queue/refusal SLO alerts before broad rollout.
- Scale test with synthetic backend slowdown.
- Promote by signal type (traces -> metrics -> logs) to reduce blast radius.
10) Final take
Collector reliability is less about one magic component and more about system discipline:
- explicit backpressure,
- bounded buffering,
- persistence where needed,
- and metrics-driven scaling.
If one change must happen this week: make `memory_limiter` the first processor, and verify every exporter path has queue + retry configured and monitored.
Evidence anchors / references
- OpenTelemetry Collector configuration docs: https://opentelemetry.io/docs/collector/configuration/
- OpenTelemetry resiliency guide (queue/retry/WAL patterns): https://opentelemetry.io/docs/collector/resiliency/
- OpenTelemetry scaling guide (queue/refusal metrics and scale decisions): https://opentelemetry.io/docs/collector/scaling/
- Memory limiter processor README (soft/hard limits, retry semantics, GOMEMLIMIT guidance): https://github.com/open-telemetry/opentelemetry-collector/blob/main/processor/memorylimiterprocessor/README.md