Redis Streams Consumer-Group Reliability Playbook
Why this note
Redis Streams is great for low-latency, operationally simple event pipelines when you already run Redis. But many teams hit reliability issues (stuck PEL, unbounded stream growth, duplicate side effects) before they hit throughput limits.
This playbook focuses on practical reliability patterns for production.
TL;DR
- Treat Redis Streams consumer groups as at-least-once delivery by default.
- Design handlers as idempotent (or externally deduplicated).
- Monitor and control the Pending Entries List (PEL) aggressively.
- Use `XAUTOCLAIM` (or `XCLAIM`) to recover abandoned messages.
- Set explicit retention (`XTRIM` with `MAXLEN`/`MINID`) so memory does not silently explode.
- If you need long retention, replay at large scale, and multi-tenant throughput isolation, consider Kafka-class systems.
1) Correct mental model
Delivery semantics
Redis Streams + consumer groups give you:
- Ordered IDs within a stream
- Group-based work distribution
- Redelivery via pending/claim flows
But practically, you should assume:
- At-least-once processing
- Possible duplicate processing after crashes/timeouts
- Need for idempotent side effects
Core lifecycle
- Producer appends with `XADD`
- Consumer group reads with `XREADGROUP`
- Message enters the group's PEL until acked
- Consumer performs the side effect
- Consumer `XACK`s the message
If the consumer dies after the side effect but before `XACK`, the message is redelivered. That is expected behavior.
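The ack-after-side-effect ordering can be sketched as a small batch loop. This is a minimal sketch: `process_batch` and `handle` are hypothetical names, and in a real worker the entries would come from `XREADGROUP` and the returned IDs would be passed to `XACK`.

```python
def process_batch(entries, handle):
    """Run the side effect for each entry, then collect IDs to ack.

    entries: list of (message_id, fields) pairs, shaped like an
             XREADGROUP reply.
    handle:  idempotent side-effect function; raising means "retry later".
    Only IDs whose side effect completed are returned for acking, so a
    crash mid-batch leaves unprocessed entries in the PEL for redelivery.
    """
    acked = []
    for message_id, fields in entries:
        try:
            handle(fields)            # durable side effect first
        except Exception:
            break                     # stop; remaining entries stay pending
        acked.append(message_id)      # safe to XACK this ID now
    return acked

# Example: the second entry fails, so only the first ID gets acked.
seen = []
def handle(fields):
    if fields.get("poison"):
        raise RuntimeError("downstream unavailable")
    seen.append(fields["order"])

ids = process_batch(
    [("1-0", {"order": "A"}), ("1-1", {"poison": True}), ("1-2", {"order": "C"})],
    handle,
)
```

Stopping the whole batch on the first failure preserves per-batch ordering; the stalled and trailing entries are simply redelivered later.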
2) Reliability design patterns
Pattern A – Idempotent handler boundary (non-negotiable)
Use one of:
- Idempotency key table keyed on (stream + message ID)
- Upsert/merge semantics in the target store
- External dedupe cache with bounded TTL
Rule: a redelivered message must be safe to process again.
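A minimal sketch of the idempotency-key approach, with an in-memory set standing in for the durable key table (`IdempotencyGuard` is a hypothetical name; a production version would make the side effect and the key insert atomic, e.g. a unique-key write in the target store):

```python
class IdempotencyGuard:
    """Dedupe side effects by (stream, message_id).

    The in-memory set stands in for a durable idempotency-key table.
    Note: recording the key only after the side effect means a crash
    in between still allows one duplicate -- that is the at-least-once
    window a transactional key insert would close.
    """
    def __init__(self):
        self._seen = set()

    def run_once(self, stream, message_id, side_effect):
        key = (stream, message_id)
        if key in self._seen:
            return False              # redelivery: skip the side effect
        side_effect()                 # durable write goes here
        self._seen.add(key)           # record only after success
        return True

# A redelivered message ID triggers the side effect exactly once.
guard = IdempotencyGuard()
charges = []
guard.run_once("mystream", "1711450000000-0", lambda: charges.append(42))
guard.run_once("mystream", "1711450000000-0", lambda: charges.append(42))  # duplicate
```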
Pattern B – Ack only after durable side effect
`XACK` means "done".
- Acking early loses messages on worker crash.
- Acking late increases PEL size but preserves correctness.
Prefer correctness; tune capacity to keep lag reasonable.
Pattern C – Reclaim abandoned messages
Use periodic sweeper logic:
- Detect stale pending messages (`XPENDING`)
- Reassign with `XAUTOCLAIM` (preferred on Redis 6.2+)
- Retry with bounded attempts; dead-letter after a threshold
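The sweeper's decision logic can be kept as a pure function over `XPENDING` detail rows; `sweep_decisions` and the thresholds are hypothetical names, and the resulting ID lists would feed `XCLAIM`/`XAUTOCLAIM` and the DLQ writer respectively:

```python
def sweep_decisions(pending, idle_threshold_ms=60_000, max_deliveries=10):
    """Classify XPENDING detail rows into claim / dead-letter / leave alone.

    pending: list of (message_id, consumer, idle_ms, delivery_count),
    matching the shape of XPENDING's detailed reply.
    """
    to_claim, to_dlq = [], []
    for message_id, _consumer, idle_ms, deliveries in pending:
        if idle_ms < idle_threshold_ms:
            continue                      # owner may still be working on it
        if deliveries >= max_deliveries:
            to_dlq.append(message_id)     # likely poison: dead-letter it
        else:
            to_claim.append(message_id)   # hand to a recovery consumer
    return to_claim, to_dlq

claim, dlq = sweep_decisions([
    ("1-0", "c1",  5_000,  1),   # fresh: leave with its owner
    ("1-1", "c2", 90_000,  2),   # stale: reclaim
    ("1-2", "c2", 90_000, 12),   # stale and exhausted: dead-letter
])
```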
Pattern D – Dead-letter stream
After N failed attempts:
- Write the event + metadata to `mystream.dlq`
- `XACK` the original
- Alert if the DLQ rate crosses a threshold
A DLQ keeps the main pipeline moving while preserving forensic data.
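One way to keep the forensic data useful is to fix the shape of the dead-letter entry. A sketch, with `dlq_fields` as a hypothetical helper; a sweeper would `XADD` this map to `mystream.dlq` and only then `XACK` the original:

```python
import json
import time

def dlq_fields(stream, message_id, fields, deliveries, error):
    """Build the field map written to the dead-letter stream.

    Keeps the original payload plus the forensic metadata: source
    stream, original ID, attempt count, last error, failure time.
    """
    return {
        "src_stream": stream,
        "src_id": message_id,
        "deliveries": str(deliveries),
        "error": error,
        "failed_at": str(int(time.time())),
        "payload": json.dumps(fields),   # original fields, replayable later
    }

entry = dlq_fields("mystream", "1-7", {"order": "A"}, 12, "timeout")
```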
3) Operational guardrails
Stream growth control
Never leave retention implicit.
Use one of:
- `XADD ... MAXLEN ~ <n>` for an approximate cap by entry count
- `XTRIM mystream MAXLEN ~ <n>` for background-ish trimming
- `XTRIM mystream MINID <id>` to enforce age/time-based retention
Choose retention from replay/SLA requirements, not guesswork.
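Because stream IDs are `<ms-timestamp>-<seq>`, a time-based retention window maps directly to a `MINID` cutoff. A sketch (`minid_for_window` is a hypothetical helper; the result would be passed to `XTRIM mystream MINID <id>`):

```python
def minid_for_window(now_ms, retention_ms):
    """Compute the MINID cutoff for time-based retention.

    Stream IDs embed a millisecond timestamp, so trimming to
    MINID <now - window> drops every entry older than the window.
    """
    return f"{now_ms - retention_ms}-0"

# A 72h window evaluated at a fixed clock value:
cutoff = minid_for_window(1_711_450_000_000, 72 * 3600 * 1000)
```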
PEL hygiene
Track per group/consumer:
- Pending count
- Oldest pending idle time
- Reclaim rate
- Retry-count distribution
- Messages pending beyond SLA (e.g., > 60s)
High pending idle time is often a better early-warning signal than raw lag.
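The signals above can be computed straight from `XPENDING` detail rows; `pel_metrics` is a hypothetical helper, and in production the result would be exported to your metrics system:

```python
def pel_metrics(pending, sla_ms=60_000):
    """Summarize XPENDING detail rows into the health signals above.

    pending: list of (message_id, consumer, idle_ms, delivery_count).
    """
    if not pending:
        return {"count": 0, "oldest_idle_ms": 0, "beyond_sla": 0, "max_deliveries": 0}
    idles = [row[2] for row in pending]
    return {
        "count": len(pending),                            # pending count
        "oldest_idle_ms": max(idles),                     # early-warning signal
        "beyond_sla": sum(1 for i in idles if i > sla_ms),
        "max_deliveries": max(row[3] for row in pending), # retry distribution tail
    }

m = pel_metrics([("1-0", "c1", 5_000, 1), ("1-1", "c2", 90_000, 4)])
```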
Consumer identity discipline
Use stable, explicit consumer names (host/pod + instance ID).
- Avoid random names each restart (PEL becomes messy)
- Clean up dead consumers when needed (`XGROUP DELCONSUMER`)
Backpressure policy
When downstream is unhealthy:
- Slow or pause readers intentionally
- Keep bounded in-memory worker queues
- Shed low-priority work before everything times out
Without explicit backpressure, retries become a feedback loop.
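A bounded hand-off between the stream reader and workers makes the backpressure explicit. A sketch (`BoundedWorkQueue` is a hypothetical name; on a `False` return the reader would pause its `XREADGROUP` loop instead of buffering):

```python
from queue import Queue, Full

class BoundedWorkQueue:
    """Bounded hand-off between the stream reader and worker pool.

    When the queue is full, the reader gets an explicit signal to back
    off (or shed low-priority work) instead of buffering forever.
    """
    def __init__(self, maxsize=1000):
        self._q = Queue(maxsize=maxsize)
        self.shed = 0

    def offer(self, item, low_priority=False):
        try:
            self._q.put_nowait(item)
            return True               # accepted; reader may keep reading
        except Full:
            if low_priority:
                self.shed += 1        # drop low-priority work first
            return False              # signal the reader to pause

q = BoundedWorkQueue(maxsize=2)
results = [q.offer("a"), q.offer("b"), q.offer("c", low_priority=True)]
```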
4) Failure-mode checklist
Symptom: pending count climbs forever
Likely causes:
- Consumer crashes before `XACK`
- Downstream dependency stalls
- No claim sweeper
Actions:
- Check oldest pending idle + owning consumer
- Trigger `XAUTOCLAIM` for messages idle longer than the threshold
- Scale workers if downstream is healthy
- Route poison messages to DLQ
Symptom: duplicates causing money-loss side effects
Likely cause: non-idempotent handler.
Actions:
- Add idempotency key on side-effect boundary
- Reprocess replay safely in staging
- Block deploys without idempotency tests
Symptom: Redis memory pressure from streams
Likely causes:
- Missing retention trim policy
- Oversized messages
Actions:
- Set a `MAXLEN`/`MINID` policy
- Move blobs to object storage; keep pointers in the stream
- Revisit retention windows by business value
5) Suggested baseline defaults (starting point)
- Batch read size: 50–500 (workload-dependent)
- Block timeout: 1000–5000 ms
- Claim idle threshold: 30–120 s
- Max delivery attempts before DLQ: 5–20
- Retention: explicit, based on replay objective (e.g., 24h/72h)
These are not universal constants; calibrate with production telemetry.
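If you keep these knobs in one place, calibration is easier to audit. A sketch (`ConsumerTuning` is a hypothetical name; values mirror the starting points above):

```python
from dataclasses import dataclass

@dataclass
class ConsumerTuning:
    """Starting-point knobs from the list above; calibrate with telemetry."""
    batch_size: int = 100            # within the 50-500 guidance
    block_ms: int = 2000             # XREADGROUP BLOCK timeout
    claim_idle_ms: int = 60_000      # XAUTOCLAIM min-idle-time
    max_deliveries: int = 10         # DLQ threshold
    retention_hours: int = 72        # drives MAXLEN/MINID trimming
```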
6) When to move from Redis Streams to Kafka-class backbone
Consider migration when you need:
- Much longer retention with large replay volume
- Higher fan-out with strict throughput isolation
- Richer stream governance across many teams
- Stronger ecosystem around large-scale ETL/analytics
Stay on Redis Streams when:
- Latency and operational simplicity matter more
- Existing Redis footprint is strong
- Workload scale remains within memory + persistence budget
7) Minimal runbook commands
```
# create group (once)
XGROUP CREATE mystream mygroup $ MKSTREAM

# consume
XREADGROUP GROUP mygroup consumer-1 COUNT 100 BLOCK 2000 STREAMS mystream >

# inspect pending summary
XPENDING mystream mygroup

# inspect pending details
XPENDING mystream mygroup - + 20

# reclaim idle pending messages
XAUTOCLAIM mystream mygroup consumer-recovery 60000 0-0 COUNT 100

# ack after successful processing
XACK mystream mygroup 1711450000000-0

# trim stream (approximate)
XTRIM mystream MAXLEN ~ 1000000
```
References
- Redis Streams data type docs: https://redis.io/docs/latest/develop/data-types/streams/
- `XREADGROUP`: https://redis.io/docs/latest/commands/xreadgroup/
- `XPENDING`: https://redis.io/docs/latest/commands/xpending/
- `XCLAIM`: https://redis.io/docs/latest/commands/xclaim/
- Kafka EOS design context (for comparison): https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging