Outbox + Inbox Pattern Practical Playbook
Why this matters
Most "exactly-once" claims in distributed systems are really at-least-once delivery plus idempotent consumption. Ignore that truth and retries turn into duplicate side effects.
This playbook is for building event-driven services that survive crashes, retries, and network jitter without corrupting business state.
Core model (production reality)
- Producer side: use an Outbox table in the same transaction as domain state changes.
- Relay side: publish outbox rows to broker (Kafka/SQS/NATS/etc.) with retry.
- Consumer side: use an Inbox table (or dedupe key store) to make handlers idempotent.
- Result: operationally reliable "effectively-once" business outcomes.
1) Producer: Transactional Outbox
When a command updates domain data, write event records in the same DB transaction.
Minimal outbox schema
- id (ULID/UUID)
- aggregate_type
- aggregate_id
- event_type
- payload_json
- created_at
- published_at (nullable)
- attempt_count
- last_error
Rules
- Domain write and outbox insert must commit together.
- Never publish directly from the request thread after commit as your only mechanism; a crash between commit and publish silently loses the event.
- Relay is asynchronous and retry-safe.
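A minimal sketch of the producer side, using sqlite3 and an illustrative "orders" aggregate (table and column names are assumptions, not a prescribed schema). The key point is that the domain write and the outbox insert share one transaction:

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (
        id TEXT PRIMARY KEY,
        aggregate_type TEXT NOT NULL,
        aggregate_id TEXT NOT NULL,
        event_type TEXT NOT NULL,
        payload_json TEXT NOT NULL,
        created_at TEXT NOT NULL DEFAULT (datetime('now')),
        published_at TEXT,
        attempt_count INTEGER NOT NULL DEFAULT 0,
        last_error TEXT
    );
""")

def place_order(order_id: str) -> None:
    # Domain write and outbox insert commit (or roll back) together.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'placed')",
            (order_id,))
        conn.execute(
            "INSERT INTO outbox (id, aggregate_type, aggregate_id, "
            "event_type, payload_json) VALUES (?, 'Order', ?, 'OrderPlaced', ?)",
            (str(uuid.uuid4()), order_id, json.dumps({"order_id": order_id})))

place_order("ord-1")
```

If the process dies anywhere in `place_order`, either both rows exist or neither does; there is no window where the order is saved but the event is lost.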
2) Relay: Safe publisher loop
Relay process:
- Select unpublished outbox rows in small batches.
- Publish each event.
- Mark published_at only after broker ack.
- On failure, increment attempt_count, store last_error, and retry with backoff.
Operational guardrails
- Use bounded batch size.
- Exponential backoff + jitter.
- Dead-letter after threshold (with alert).
- Track publish lag (now - created_at) as an SLO metric.
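The relay loop and its guardrails can be sketched as follows. `publish` is a stand-in for your broker client and must block until the broker acks; batch size, attempt cap, and backoff parameters are illustrative. A real relay would sleep `backoff_seconds(attempts)` before revisiting failed rows:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE outbox (
    id TEXT PRIMARY KEY, event_type TEXT, payload_json TEXT,
    created_at TEXT DEFAULT (datetime('now')),
    published_at TEXT, attempt_count INTEGER DEFAULT 0, last_error TEXT)""")
conn.execute("INSERT INTO outbox (id, event_type, payload_json) "
             "VALUES ('e1', 'OrderPlaced', '{}')")

BATCH_SIZE = 100    # bounded batch: never drain the whole table at once
MAX_ATTEMPTS = 5    # past this, rows go to the dead-letter path

def backoff_seconds(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    # Exponential backoff with full jitter, capped.
    return random.uniform(0, min(cap, base * 2 ** attempt))

def relay_once(publish) -> None:
    rows = conn.execute(
        "SELECT id, event_type, payload_json, attempt_count FROM outbox "
        "WHERE published_at IS NULL AND attempt_count < ? "
        "ORDER BY created_at LIMIT ?",
        (MAX_ATTEMPTS, BATCH_SIZE)).fetchall()
    for row_id, event_type, payload, attempts in rows:
        try:
            publish(event_type, payload)  # must block until broker ack
            conn.execute(
                "UPDATE outbox SET published_at = datetime('now') WHERE id = ?",
                (row_id,))
        except Exception as exc:
            # Record the failure; the next pass retries after backoff.
            conn.execute(
                "UPDATE outbox SET attempt_count = attempt_count + 1, "
                "last_error = ? WHERE id = ?", (str(exc), row_id))
    conn.commit()

sent = []
relay_once(lambda event_type, payload: sent.append((event_type, payload)))
```

Note that a crash between `publish` and the `UPDATE` re-sends the event on the next pass; that duplicate is exactly what the consumer-side inbox absorbs.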
3) Consumer: Inbox idempotency
Every consumer should treat duplicates as normal.
Inbox schema (per consumer)
- consumer_name
- message_id (from event id)
- processed_at
- status (done|failed)
- unique key: (consumer_name, message_id)
Handler pattern
- Begin transaction.
- Try to insert (consumer_name, message_id).
- If duplicate key: message already handled → ACK and exit.
- Run business side effects.
- Mark inbox status done.
- Commit.
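The handler pattern above can be sketched with sqlite3; the unique key on (consumer_name, message_id) is what turns duplicate delivery into a no-op. The "billing" consumer and "balances" table are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inbox (
        consumer_name TEXT NOT NULL,
        message_id TEXT NOT NULL,
        processed_at TEXT DEFAULT (datetime('now')),
        status TEXT NOT NULL DEFAULT 'done',
        PRIMARY KEY (consumer_name, message_id));
    CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER);
""")
conn.execute("INSERT INTO balances VALUES ('acct-1', 0)")

def handle(message_id: str, amount: int) -> bool:
    """Returns True if side effects ran, False if this was a duplicate."""
    try:
        # One transaction: inbox insert + business write commit together.
        with conn:
            conn.execute(
                "INSERT INTO inbox (consumer_name, message_id) "
                "VALUES ('billing', ?)", (message_id,))
            conn.execute(
                "UPDATE balances SET amount = amount + ? "
                "WHERE account = 'acct-1'", (amount,))
        return True
    except sqlite3.IntegrityError:
        # Duplicate key: already handled. ACK and exit without side effects.
        return False

handle("msg-1", 50)
handle("msg-1", 50)  # redelivery: no second credit
```

Because the duplicate check and the business write share one transaction, a crash mid-handler rolls back both, and redelivery safely reruns the whole thing.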
If side effects involve external APIs, include idempotency keys on those API calls too.
4) Versioning and replay safety
Event contracts evolve. Keep handlers replayable.
- Add an event_version field.
- Prefer additive payload changes.
- Keep upcasters/adapters for old versions.
- Replays should run in sandbox/staging first.
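An upcaster can be as simple as a chain of additive transformations that lift an old payload to the current shape before the handler runs. The field names here (`name`, `display_name`) are purely illustrative:

```python
def upcast(event: dict) -> dict:
    # Each step lifts one version to the next; chain more steps as
    # the contract evolves. Changes are additive: old fields survive.
    version = event.get("event_version", 1)
    if version == 1:
        event = {**event,
                 "display_name": event.get("name", ""),
                 "event_version": 2}
    return event
```

Handlers then only ever see the latest version, which keeps replays of old events safe without forking handler logic per version.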
5) Failure modes checklist
- Relay down, app up (outbox growth alert exists)
- Broker down (retry + backoff)
- Consumer crash mid-handler (inbox dedupe blocks duplicate effects)
- Poison message path (DLQ + operator runbook)
- Clock skew and ordering assumptions reviewed
- Reprocessing tool requires explicit scope + dry-run
6) Metrics that actually matter
- Outbox backlog count
- Outbox max age / p95 age
- Publish failure rate
- Inbox duplicate hit rate (should be >0 in real systems)
- Consumer retry-to-success ratio
- DLQ rate
Avoid the vanity metric of "messages sent" reported without lag/failure context.
7) 30-minute implementation plan
- Add outbox table + write path in existing transaction.
- Add relay worker with bounded batch + retry.
- Add inbox table for one critical consumer.
- Add duplicate-safe handler transaction.
- Add 4 dashboards: backlog, lag, failures, DLQ.
- Run chaos drill: kill relay/consumer and verify recovery.
TL;DR
Reliable event-driven architecture is not magic exactly-once transport. It is:
- transactional outbox,
- idempotent inbox,
- explicit retries,
- and observability for backlog/lag/failure.
Design for duplicates and crashes from day one, and production incidents become recoverable operations instead of data forensics.