Dead-Letter Queue (DLQ) Replay Governance: Poison-Message Handling Playbook

Date: 2026-03-01
Category: knowledge
Domain: software / distributed systems / reliability engineering

Why this matters

A DLQ is not a trash can. It is a risk buffer.

If you treat the DLQ as “we’ll check later,” incidents accumulate quietly.

Good systems assume DLQ events are inevitable and make replay safe, bounded, and auditable.


First principle: classify failures before replay

Not all DLQ messages are equal. Replay policy must depend on failure class.

Failure classes

  1. Transient dependency failure

    • timeout, 503, short network partition
    • likely replayable after backoff
  2. Capacity/overload failure

    • dependency saturation, queue backlog collapse
    • replayable only with rate limits and retry budget
  3. Data contract/schema failure

    • missing fields, incompatible schema, parse errors
    • not replayable until producer/consumer contract is fixed
  4. Business rule failure (deterministic reject)

    • invalid state transition, permanent validation failure
    • replay usually wrong; route to manual/compensation path
  5. Code bug / deterministic handler crash

    • same message always crashes current version
    • replay only after deploy fix + canary redrive

Rule: replay only when the failure mode has moved from deterministic to non-deterministic (or has been fixed).
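The classification rule above can be sketched as a small policy table. This is a minimal sketch, not a prescribed implementation; the enum names, policy strings, and the `fix_deployed` flag are illustrative assumptions.

```python
from enum import Enum

class FailureClass(Enum):
    TRANSIENT = "transient_dependency"
    OVERLOAD = "capacity_overload"
    SCHEMA = "data_contract"
    BUSINESS_REJECT = "business_rule"
    CODE_BUG = "deterministic_crash"

# One replay policy per failure class, mirroring the five classes above.
REPLAY_POLICY = {
    FailureClass.TRANSIENT: "replay_after_backoff",
    FailureClass.OVERLOAD: "replay_rate_limited",
    FailureClass.SCHEMA: "hold_until_contract_fixed",
    FailureClass.BUSINESS_REJECT: "route_to_compensation",
    FailureClass.CODE_BUG: "hold_until_deploy_then_canary",
}

def is_replayable_now(failure_class: FailureClass, fix_deployed: bool = False) -> bool:
    """Replay only once the failure has moved from deterministic to
    non-deterministic (transient/overload) or has been fixed."""
    if failure_class in (FailureClass.TRANSIENT, FailureClass.OVERLOAD):
        return True
    if failure_class in (FailureClass.SCHEMA, FailureClass.CODE_BUG):
        return fix_deployed
    # Business rejects go to a manual/compensation path, never blind replay.
    return False
```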


Architecture pattern (production baseline)

A) Primary queue/topic with bounded retry

B) DLQ with rich failure envelope

When dead-lettering, include metadata:

{
  "messageId": "...",
  "source": "orders.events.v1",
  "failedAt": "2026-03-01T17:20:00+09:00",
  "attempt": 8,
  "errorClass": "SchemaValidationError",
  "errorCode": "ORDERS_422_MISSING_FIELD",
  "handlerVersion": "consumer@2.3.1",
  "traceId": "...",
  "payload": {"...": "..."}
}

Without this envelope, replay becomes archaeology.
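A sketch of building that envelope at dead-letter time, using the same field names as the JSON above. The function signature is an assumption; deriving `errorClass` from the exception type is one convention, not the only one.

```python
import json
from datetime import datetime, timezone

def build_dlq_envelope(message_id: str, source: str, attempt: int,
                       error: Exception, error_code: str,
                       handler_version: str, trace_id: str,
                       payload: dict) -> dict:
    """Wrap a failed message with the metadata needed for safe replay."""
    return {
        "messageId": message_id,
        "source": source,
        # UTC timestamp of the dead-lettering event, ISO 8601.
        "failedAt": datetime.now(timezone.utc).isoformat(),
        "attempt": attempt,
        "errorClass": type(error).__name__,
        "errorCode": error_code,
        "handlerVersion": handler_version,
        "traceId": trace_id,
        "payload": payload,
    }
```

The envelope serializes cleanly with `json.dumps`, so it can be published as the DLQ message body.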

C) Replay lane (separate from normal traffic)

Never bulk-replay directly into the same overloaded path that just failed.


Minimal control state machine

Use an explicit queue health state:

Transition examples

State machines beat “operator vibes” during incidents.
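A minimal sketch of such a state machine. The state names (HEALTHY, DEGRADED, PAUSED, REPLAYING) and the allowed transitions are assumptions for illustration, not a fixed taxonomy.

```python
# Hypothetical queue-health states and their legal transitions.
TRANSITIONS = {
    "HEALTHY": {"DEGRADED"},
    "DEGRADED": {"HEALTHY", "PAUSED"},
    "PAUSED": {"REPLAYING"},
    "REPLAYING": {"HEALTHY", "PAUSED"},  # an aborted replay drops back to PAUSED
}

def transition(state: str, target: str) -> str:
    """Reject any transition not explicitly allowed, so an operator
    cannot jump straight from HEALTHY to REPLAYING mid-incident."""
    if target not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {target}")
    return target
```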


Replay governance policy (the core)

1) Replay admission checks

Before replaying a batch, verify:

If any check fails: do not replay.
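The all-or-nothing admission gate can be expressed directly. The specific check names below are illustrative assumptions; the invariant is that a single failing check blocks the whole batch.

```python
def admit_replay(checks: dict) -> bool:
    """Admit a replay batch only if every admission check passes."""
    return all(checks.values())

# Example: one failing check blocks the batch (check names are hypothetical).
batch_checks = {
    "root_cause_identified": True,
    "fix_deployed_or_transient": True,
    "downstream_capacity_headroom": True,
    "idempotency_controls_verified": False,
}
```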

2) Replay strategy ladder

Start small and escalate only if stable:

  1. single-message probe
  2. 100-message canary batch
  3. 1% backlog stream
  4. progressive ramp (5% -> 20% -> 50% -> 100%)

At each stage, evaluate:

3) Abort conditions

Immediately stop replay on:

Automation should enforce stop conditions without waiting for manual heroics.
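One way to wire the ladder and the abort conditions together is to give each stage an error budget and stop the moment it is exceeded. This is a sketch under assumptions: `replay_one` and `error_budget` are hypothetical names, and real abort conditions would also watch downstream health, not just the replay failure rate.

```python
def run_stage(messages, replay_one, error_budget: float) -> bool:
    """Replay one ladder stage (probe, canary, 1%, ramp); abort as soon
    as the observed failure rate exceeds the stage's error budget."""
    failures = 0
    for count, msg in enumerate(messages, start=1):
        if not replay_one(msg):
            failures += 1
        if failures / count > error_budget:
            return False  # abort condition tripped: stop, do not escalate
    return True  # stage stable: safe to move to the next rung
```

Escalation then becomes: run the single-message probe, and only if `run_stage` returns True proceed to the 100-message canary, and so on up the ladder.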


Poison-message containment patterns

Pattern A: Quarantine topic/queue

Messages that fail deterministic validation go to quarantine.*, not back into the retry loop.

Pattern B: Transform-and-replay patch lane

For known fixable shape issues (e.g., a field rename), run a controlled transform job with:

Pattern C: Compensate instead of replay

For business-invalid events, emit compensating workflow tasks instead of forcing replay.

Pattern D: Human-in-the-loop approval

High-risk financial/customer-impact flows require manual approval for large replay windows.
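The four containment patterns can be combined into a single routing decision. A minimal sketch, assuming the envelope format from earlier; the routing keys, the `BusinessRuleViolation` class name, and the `high_risk` flag are illustrative, not a fixed taxonomy.

```python
def route_poison_message(envelope: dict, known_transform=None,
                         high_risk: bool = False):
    """Pick a containment path for a message that must not be
    replayed as-is. Returns (destination, transformed_payload)."""
    if known_transform is not None:
        # Pattern B: known fixable shape issue -> transform-and-replay lane.
        return ("transform_lane", known_transform(envelope["payload"]))
    if envelope["errorClass"] == "BusinessRuleViolation":
        # Pattern C: business-invalid -> compensating workflow, not replay.
        return ("compensation_workflow", None)
    if high_risk:
        # Pattern D: large/high-impact windows wait for human approval.
        return ("awaiting_manual_approval", None)
    # Pattern A: deterministic validation failures -> quarantine topic.
    return ("quarantine." + envelope["source"], None)
```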


Idempotency is non-negotiable for replay

Replay without idempotency is duplicate side effects with extra steps.

Required controls:

If idempotency is weak, keep replay rate near zero until controls are in place.
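A minimal sketch of a dedupe gate in front of the handler. The in-memory set stands in for a durable dedupe store (e.g. a keyed table or cache); in production the check-and-record step would need to be atomic.

```python
class IdempotentReplayer:
    """Dedupe by messageId so a replayed message cannot apply its
    side effect twice."""
    def __init__(self, handler):
        self.handler = handler
        self.processed = set()  # stand-in for a durable dedupe store

    def replay(self, envelope: dict) -> str:
        msg_id = envelope["messageId"]
        if msg_id in self.processed:
            return "skipped_duplicate"  # same outcome label as replay_audit
        self.handler(envelope["payload"])
        self.processed.add(msg_id)
        return "success"
```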


Data model sketch for DLQ operations

create table dlq_events (
  id bigserial primary key,
  message_id text not null,
  source text not null,
  error_class text not null,
  error_code text,
  failed_at timestamptz not null,
  payload jsonb not null,
  metadata jsonb not null
);

create table replay_jobs (
  replay_id uuid primary key,
  created_at timestamptz not null default now(),
  created_by text not null,
  scope jsonb not null,
  status text not null check (status in ('planned','running','paused','completed','aborted')),
  max_rate_per_sec int not null,
  max_concurrency int not null,
  stop_policy jsonb not null,
  notes text
);

create table replay_audit (
  replay_id uuid not null,
  message_id text not null,
  replayed_at timestamptz not null default now(),
  outcome text not null check (outcome in ('success','failed','skipped_duplicate','quarantined')),
  reason text,
  primary key (replay_id, message_id)
);

This gives an audit trail for what was replayed, skipped, or quarantined.


SLOs and metrics that actually matter

Flow metrics

Quality metrics

Safety metrics

Governance metrics

If you only track backlog size, you miss the true risk.


30-day implementation blueprint

Week 1:

Week 2:

Week 3:

Week 4:

Goal: replay becomes routine operations, not a one-off panic ritual.


Anti-patterns

  1. Infinite retry before DLQ

    • burns resources, delays diagnosis, increases blast radius.
  2. “Just redrive all” button during incident

    • turns one failure into two outages.
  3. No error classification, only stack traces

    • impossible to prioritize and automate policy.
  4. Replay on hot path with no throttling

    • self-inflicted overload.
  5. No replay auditability

    • no proof of what was recovered vs lost.
  6. Ignoring deterministic business rejects

    • replay loop hides product/data correctness issues.

Bottom line

DLQ handling is not just queue configuration; it is an operational control system.

When done right, DLQ stops being a graveyard and becomes a reliable recovery pipeline.
