Redis Streams Consumer-Group Reliability Playbook
Why this note
Redis Streams is great for low-latency, operationally simple event pipelines when you already run Redis. But many teams hit reliability issues (stuck PEL, unbounded stream growth, duplicate side effects) before they hit throughput limits.
This playbook focuses on practical reliability patterns for production.
TL;DR
- Treat Redis Streams consumer groups as at-least-once delivery by default.
- Design handlers as idempotent (or externally deduplicated).
- Monitor and control the Pending Entries List (PEL) aggressively.
- Use `XAUTOCLAIM` (or `XCLAIM`) to recover abandoned messages.
- Set explicit retention (`XTRIM` with `MAXLEN`/`MINID`) so memory does not silently explode.
- If you need long retention, replay at large scale, and multi-tenant throughput isolation, consider Kafka-class systems.
1) Correct mental model
Delivery semantics
Redis Streams + consumer groups give you:
- Ordered IDs within a stream
- Group-based work distribution
- Redelivery via pending/claim flows
But practically, you should assume:
- At-least-once processing
- Possible duplicate processing after crashes/timeouts
- Need for idempotent side effects
Core lifecycle
- Producer appends with `XADD`
- Consumer group reads with `XREADGROUP`
- Message enters the group's PEL until acked
- Consumer performs the side effect
- Consumer `XACK`s the message
If the consumer dies after the side effect but before `XACK`, the message is redelivered. That is expected behavior.
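The ack-after-side-effect ordering can be sketched as a small batch loop. This is a minimal sketch: `process_batch` and `handle` are hypothetical names, and in a real worker the entries would come from `XREADGROUP` and the returned IDs would be passed to `XACK`.

```python
def process_batch(entries, handle):
    """Run the side effect for each entry, then collect IDs to ack.

    entries: list of (message_id, fields) pairs, shaped like an
             XREADGROUP reply.
    handle:  idempotent side-effect function; raising means "retry later".
    Only IDs whose side effect completed are returned for acking, so a
    crash mid-batch leaves unprocessed entries in the PEL for redelivery.
    """
    acked = []
    for message_id, fields in entries:
        try:
            handle(fields)            # durable side effect first
        except Exception:
            break                     # stop; remaining entries stay pending
        acked.append(message_id)      # safe to XACK this ID now
    return acked

# Example: the second entry fails, so only the first ID gets acked.
seen = []
def handle(fields):
    if fields.get("poison"):
        raise RuntimeError("downstream unavailable")
    seen.append(fields["order"])

ids = process_batch(
    [("1-0", {"order": "A"}), ("1-1", {"poison": True}), ("1-2", {"order": "C"})],
    handle,
)
```

Stopping the whole batch on the first failure preserves per-batch ordering; the stalled and trailing entries are simply redelivered later.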
2) Reliability design patterns
Pattern A – Idempotent handler boundary (non-negotiable)
Use one of:
- Idempotency key table keyed on (stream + message ID)
- Upsert/merge semantics in the target store
- External dedupe cache with bounded TTL
Rule: a redelivered message must be safe to process again.
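A minimal sketch of the idempotency-key approach, with an in-memory set standing in for the durable key table (`IdempotencyGuard` is a hypothetical name; a production version would make the side effect and the key insert atomic, e.g. a unique-key write in the target store):

```python
class IdempotencyGuard:
    """Dedupe side effects by (stream, message_id).

    The in-memory set stands in for a durable idempotency-key table.
    Note: recording the key only after the side effect means a crash
    in between still allows one duplicate -- that is the at-least-once
    window a transactional key insert would close.
    """
    def __init__(self):
        self._seen = set()

    def run_once(self, stream, message_id, side_effect):
        key = (stream, message_id)
        if key in self._seen:
            return False              # redelivery: skip the side effect
        side_effect()                 # durable write goes here
        self._seen.add(key)           # record only after success
        return True

# A redelivered message ID triggers the side effect exactly once.
guard = IdempotencyGuard()
charges = []
guard.run_once("mystream", "1711450000000-0", lambda: charges.append(42))
guard.run_once("mystream", "1711450000000-0", lambda: charges.append(42))  # duplicate
```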
Pattern B – Ack only after durable side effect
`XACK` means "done".
- Acking early loses messages on worker crash.
- Acking late increases PEL size but preserves correctness.
Prefer correctness; tune capacity to keep lag reasonable.
Pattern C – Reclaim abandoned messages
Use periodic sweeper logic:
- Detect stale pending messages (`XPENDING`)
- Reassign with `XAUTOCLAIM` (preferred on Redis 6.2+)
- Retry with bounded attempts; dead-letter after a threshold
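The sweeper's decision logic can be kept as a pure function over `XPENDING` detail rows; `sweep_decisions` and the thresholds are hypothetical names, and the resulting ID lists would feed `XCLAIM`/`XAUTOCLAIM` and the DLQ writer respectively:

```python
def sweep_decisions(pending, idle_threshold_ms=60_000, max_deliveries=10):
    """Classify XPENDING detail rows into claim / dead-letter / leave alone.

    pending: list of (message_id, consumer, idle_ms, delivery_count),
    matching the shape of XPENDING's detailed reply.
    """
    to_claim, to_dlq = [], []
    for message_id, _consumer, idle_ms, deliveries in pending:
        if idle_ms < idle_threshold_ms:
            continue                      # owner may still be working on it
        if deliveries >= max_deliveries:
            to_dlq.append(message_id)     # likely poison: dead-letter it
        else:
            to_claim.append(message_id)   # hand to a recovery consumer
    return to_claim, to_dlq

claim, dlq = sweep_decisions([
    ("1-0", "c1",  5_000,  1),   # fresh: leave with its owner
    ("1-1", "c2", 90_000,  2),   # stale: reclaim
    ("1-2", "c2", 90_000, 12),   # stale and exhausted: dead-letter
])
```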
Pattern D – Dead-letter stream
After N failed attempts:
- Write the event + metadata to `mystream.dlq`
- `XACK` the original
- Alert if the DLQ rate crosses a threshold
A DLQ keeps the main pipeline moving while preserving forensic data.
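One way to keep the forensic data useful is to fix the shape of the dead-letter entry. A sketch, with `dlq_fields` as a hypothetical helper; a sweeper would `XADD` this map to `mystream.dlq` and only then `XACK` the original:

```python
import json
import time

def dlq_fields(stream, message_id, fields, deliveries, error):
    """Build the field map written to the dead-letter stream.

    Keeps the original payload plus the forensic metadata: source
    stream, original ID, attempt count, last error, failure time.
    """
    return {
        "src_stream": stream,
        "src_id": message_id,
        "deliveries": str(deliveries),
        "error": error,
        "failed_at": str(int(time.time())),
        "payload": json.dumps(fields),   # original fields, replayable later
    }

entry = dlq_fields("mystream", "1-7", {"order": "A"}, 12, "timeout")
```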
3) Operational guardrails
Stream growth control
Never leave retention implicit.
Use one of:
- `XADD ... MAXLEN ~ <n>` for an approximate cap by entry count
- `XTRIM mystream MAXLEN ~ <n>` for background-ish trimming
- `XTRIM mystream MINID <id>` to enforce age/time-based retention
Choose retention from replay/SLA requirements, not guesswork.
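Because stream IDs are `<ms-timestamp>-<seq>`, a time-based retention window maps directly to a `MINID` cutoff. A sketch (`minid_for_window` is a hypothetical helper; the result would be passed to `XTRIM mystream MINID <id>`):

```python
def minid_for_window(now_ms, retention_ms):
    """Compute the MINID cutoff for time-based retention.

    Stream IDs embed a millisecond timestamp, so trimming to
    MINID <now - window> drops every entry older than the window.
    """
    return f"{now_ms - retention_ms}-0"

# A 72h window evaluated at a fixed clock value:
cutoff = minid_for_window(1_711_450_000_000, 72 * 3600 * 1000)
```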
PEL hygiene
Track per group/consumer:
- Pending count
- Oldest pending idle time
- Reclaim rate
- Retry-count distribution
- Messages pending beyond SLA (e.g., > 60s)
High pending idle time is often a better early-warning signal than raw lag.
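The signals above can be computed straight from `XPENDING` detail rows; `pel_metrics` is a hypothetical helper, and in production the result would be exported to your metrics system:

```python
def pel_metrics(pending, sla_ms=60_000):
    """Summarize XPENDING detail rows into the health signals above.

    pending: list of (message_id, consumer, idle_ms, delivery_count).
    """
    if not pending:
        return {"count": 0, "oldest_idle_ms": 0, "beyond_sla": 0, "max_deliveries": 0}
    idles = [row[2] for row in pending]
    return {
        "count": len(pending),                            # pending count
        "oldest_idle_ms": max(idles),                     # early-warning signal
        "beyond_sla": sum(1 for i in idles if i > sla_ms),
        "max_deliveries": max(row[3] for row in pending), # retry distribution tail
    }

m = pel_metrics([("1-0", "c1", 5_000, 1), ("1-1", "c2", 90_000, 4)])
```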
Consumer identity discipline
Use stable, explicit consumer names (host/pod + instance ID).
- Avoid random names each restart (PEL becomes messy)
- Clean up dead consumers when needed (`XGROUP DELCONSUMER`)
Backpressure policy
When downstream is unhealthy:
- Slow or pause readers intentionally
- Keep bounded in-memory worker queues
- Shed low-priority work before everything times out
Without explicit backpressure, retries become a feedback loop.
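A bounded hand-off between the stream reader and workers makes the backpressure explicit. A sketch (`BoundedWorkQueue` is a hypothetical name; on a `False` return the reader would pause its `XREADGROUP` loop instead of buffering):

```python
from queue import Queue, Full

class BoundedWorkQueue:
    """Bounded hand-off between the stream reader and worker pool.

    When the queue is full, the reader gets an explicit signal to back
    off (or shed low-priority work) instead of buffering forever.
    """
    def __init__(self, maxsize=1000):
        self._q = Queue(maxsize=maxsize)
        self.shed = 0

    def offer(self, item, low_priority=False):
        try:
            self._q.put_nowait(item)
            return True               # accepted; reader may keep reading
        except Full:
            if low_priority:
                self.shed += 1        # drop low-priority work first
            return False              # signal the reader to pause

q = BoundedWorkQueue(maxsize=2)
results = [q.offer("a"), q.offer("b"), q.offer("c", low_priority=True)]
```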
4) Failure-mode checklist
Symptom: pending count climbs forever
Likely causes:
- Consumer crashes before `XACK`
- Downstream dependency stalls
- No claim sweeper
Actions:
- Check oldest pending idle + owning consumer
- Trigger `XAUTOCLAIM` for messages idle longer than the threshold
- Scale workers if downstream is healthy
- Route poison messages to DLQ
Symptom: duplicates causing money-loss side effects
Likely cause: non-idempotent handler.
Actions:
- Add idempotency key on side-effect boundary
- Reprocess replay safely in staging
- Block deploys without idempotency tests
Symptom: Redis memory pressure from streams
Likely causes:
- Missing retention trim policy
- Oversized messages
Actions:
- Set a `MAXLEN`/`MINID` policy
- Move blobs to object storage; keep pointers in the stream
- Revisit retention windows by business value
5) Suggested baseline defaults (starting point)
- Batch read size: 50–500 (workload-dependent)
- Block timeout: 1000–5000 ms
- Claim idle threshold: 30–120 s
- Max delivery attempts before DLQ: 5–20
- Retention: explicit, based on replay objective (e.g., 24h/72h)
These are not universal constants; calibrate with production telemetry.
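If you keep these knobs in one place, calibration is easier to audit. A sketch (`ConsumerTuning` is a hypothetical name; values mirror the starting points above):

```python
from dataclasses import dataclass

@dataclass
class ConsumerTuning:
    """Starting-point knobs from the list above; calibrate with telemetry."""
    batch_size: int = 100            # within the 50-500 guidance
    block_ms: int = 2000             # XREADGROUP BLOCK timeout
    claim_idle_ms: int = 60_000      # XAUTOCLAIM min-idle-time
    max_deliveries: int = 10         # DLQ threshold
    retention_hours: int = 72        # drives MAXLEN/MINID trimming
```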
6) When to move from Redis Streams to Kafka-class backbone
Consider migration when you need:
- Much longer retention with large replay volume
- Higher fan-out with strict throughput isolation
- Richer stream governance across many teams
- Stronger ecosystem around large-scale ETL/analytics
Stay on Redis Streams when:
- Latency and operational simplicity matter more
- Existing Redis footprint is strong
- Workload scale remains within memory + persistence budget
7) Minimal runbook commands
```
# create group (once)
XGROUP CREATE mystream mygroup $ MKSTREAM

# consume
XREADGROUP GROUP mygroup consumer-1 COUNT 100 BLOCK 2000 STREAMS mystream >

# inspect pending summary
XPENDING mystream mygroup

# inspect pending details
XPENDING mystream mygroup - + 20

# reclaim idle pending messages
XAUTOCLAIM mystream mygroup consumer-recovery 60000 0-0 COUNT 100

# ack after successful processing
XACK mystream mygroup 1711450000000-0

# trim stream (approximate)
XTRIM mystream MAXLEN ~ 1000000
```
References
- Redis Streams data type docs: https://redis.io/docs/latest/develop/data-types/streams/
- `XREADGROUP`: https://redis.io/docs/latest/commands/xreadgroup/
- `XPENDING`: https://redis.io/docs/latest/commands/xpending/
- `XCLAIM`: https://redis.io/docs/latest/commands/xclaim/
- Kafka EOS design context (for comparison): https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging