Outbox + Inbox Pattern Practical Playbook
Why this matters
Most "exactly-once" claims in distributed systems are really at-least-once delivery plus idempotent consumption. Ignore that truth and retries turn into duplicate side effects.
This playbook is for building event-driven services that survive crashes, retries, and network jitter without corrupting business state.
Core model (production reality)
- Producer side: use an Outbox table in the same transaction as domain state changes.
- Relay side: publish outbox rows to broker (Kafka/SQS/NATS/etc.) with retry.
- Consumer side: use an Inbox table (or dedupe key store) to make handlers idempotent.
- Result: operationally reliable "effectively-once" business outcomes.
1) Producer: Transactional Outbox
When a command updates domain data, write event records in the same DB transaction.
Minimal outbox schema
- id (ULID/UUID)
- aggregate_type
- aggregate_id
- event_type
- payload_json
- created_at
- published_at (nullable)
- attempt_count
- last_error
Rules
- Domain write and outbox insert must commit together.
- Never publish directly from the request thread after commit as your only mechanism; a crash between commit and publish silently loses the event.
- Relay is asynchronous and retry-safe.
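A minimal sketch of the producer side, using sqlite3 and an illustrative "orders" aggregate (table and column names are assumptions, not a prescribed schema). The key point is that the domain write and the outbox insert share one transaction:

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (
        id TEXT PRIMARY KEY,
        aggregate_type TEXT NOT NULL,
        aggregate_id TEXT NOT NULL,
        event_type TEXT NOT NULL,
        payload_json TEXT NOT NULL,
        created_at TEXT NOT NULL DEFAULT (datetime('now')),
        published_at TEXT,
        attempt_count INTEGER NOT NULL DEFAULT 0,
        last_error TEXT
    );
""")

def place_order(order_id: str) -> None:
    # Domain write and outbox insert commit (or roll back) together.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'placed')",
            (order_id,))
        conn.execute(
            "INSERT INTO outbox (id, aggregate_type, aggregate_id, "
            "event_type, payload_json) VALUES (?, 'Order', ?, 'OrderPlaced', ?)",
            (str(uuid.uuid4()), order_id, json.dumps({"order_id": order_id})))

place_order("ord-1")
```

If the process dies anywhere in `place_order`, either both rows exist or neither does; there is no window where the order is saved but the event is lost.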
2) Relay: Safe publisher loop
Relay process:
- Select unpublished outbox rows in small batches.
- Publish each event.
- Mark published_at only after broker ack.
- On failure, increment attempt_count, store last_error, and retry with backoff.
Operational guardrails
- Use bounded batch size.
- Exponential backoff + jitter.
- Dead-letter after threshold (with alert).
- Track publish lag (now - created_at) as an SLO metric.
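The relay loop and its guardrails can be sketched as follows. `publish` is a stand-in for your broker client and must block until the broker acks; batch size, attempt cap, and backoff parameters are illustrative. A real relay would sleep `backoff_seconds(attempts)` before revisiting failed rows:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE outbox (
    id TEXT PRIMARY KEY, event_type TEXT, payload_json TEXT,
    created_at TEXT DEFAULT (datetime('now')),
    published_at TEXT, attempt_count INTEGER DEFAULT 0, last_error TEXT)""")
conn.execute("INSERT INTO outbox (id, event_type, payload_json) "
             "VALUES ('e1', 'OrderPlaced', '{}')")

BATCH_SIZE = 100    # bounded batch: never drain the whole table at once
MAX_ATTEMPTS = 5    # past this, rows go to the dead-letter path

def backoff_seconds(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    # Exponential backoff with full jitter, capped.
    return random.uniform(0, min(cap, base * 2 ** attempt))

def relay_once(publish) -> None:
    rows = conn.execute(
        "SELECT id, event_type, payload_json, attempt_count FROM outbox "
        "WHERE published_at IS NULL AND attempt_count < ? "
        "ORDER BY created_at LIMIT ?",
        (MAX_ATTEMPTS, BATCH_SIZE)).fetchall()
    for row_id, event_type, payload, attempts in rows:
        try:
            publish(event_type, payload)  # must block until broker ack
            conn.execute(
                "UPDATE outbox SET published_at = datetime('now') WHERE id = ?",
                (row_id,))
        except Exception as exc:
            # Record the failure; the next pass retries after backoff.
            conn.execute(
                "UPDATE outbox SET attempt_count = attempt_count + 1, "
                "last_error = ? WHERE id = ?", (str(exc), row_id))
    conn.commit()

sent = []
relay_once(lambda event_type, payload: sent.append((event_type, payload)))
```

Note that a crash between `publish` and the `UPDATE` re-sends the event on the next pass; that duplicate is exactly what the consumer-side inbox absorbs.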
3) Consumer: Inbox idempotency
Every consumer should treat duplicates as normal.
Inbox schema (per consumer)
- consumer_name
- message_id (from event id)
- processed_at
- status (done|failed)
- unique key: (consumer_name, message_id)
Handler pattern
- Begin transaction.
- Try to insert (consumer_name, message_id).
- If duplicate key: message already handled → ACK and exit.
- Run business side effects.
- Mark inbox status done.
- Commit.
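The handler pattern above can be sketched with sqlite3; the unique key on (consumer_name, message_id) is what turns duplicate delivery into a no-op. The "billing" consumer and "balances" table are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inbox (
        consumer_name TEXT NOT NULL,
        message_id TEXT NOT NULL,
        processed_at TEXT DEFAULT (datetime('now')),
        status TEXT NOT NULL DEFAULT 'done',
        PRIMARY KEY (consumer_name, message_id));
    CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER);
""")
conn.execute("INSERT INTO balances VALUES ('acct-1', 0)")

def handle(message_id: str, amount: int) -> bool:
    """Returns True if side effects ran, False if this was a duplicate."""
    try:
        # One transaction: inbox insert + business write commit together.
        with conn:
            conn.execute(
                "INSERT INTO inbox (consumer_name, message_id) "
                "VALUES ('billing', ?)", (message_id,))
            conn.execute(
                "UPDATE balances SET amount = amount + ? "
                "WHERE account = 'acct-1'", (amount,))
        return True
    except sqlite3.IntegrityError:
        # Duplicate key: already handled. ACK and exit without side effects.
        return False

handle("msg-1", 50)
handle("msg-1", 50)  # redelivery: no second credit
```

Because the duplicate check and the business write share one transaction, a crash mid-handler rolls back both, and redelivery safely reruns the whole thing.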
If side effects involve external APIs, include idempotency keys on those API calls too.
4) Versioning and replay safety
Event contracts evolve. Keep handlers replayable.
- Add an event_version field.
- Prefer additive payload changes.
- Keep upcasters/adapters for old versions.
- Replays should run in sandbox/staging first.
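An upcaster can be as simple as a chain of additive transformations that lift an old payload to the current shape before the handler runs. The field names here (`name`, `display_name`) are purely illustrative:

```python
def upcast(event: dict) -> dict:
    # Each step lifts one version to the next; chain more steps as
    # the contract evolves. Changes are additive: old fields survive.
    version = event.get("event_version", 1)
    if version == 1:
        event = {**event,
                 "display_name": event.get("name", ""),
                 "event_version": 2}
    return event
```

Handlers then only ever see the latest version, which keeps replays of old events safe without forking handler logic per version.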
5) Failure modes checklist
- Relay down, app up (outbox growth alert exists)
- Broker down (retry + backoff)
- Consumer crash mid-handler (inbox dedupe blocks duplicate effects)
- Poison message path (DLQ + operator runbook)
- Clock skew and ordering assumptions reviewed
- Reprocessing tool requires explicit scope + dry-run
6) Metrics that actually matter
- Outbox backlog count
- Outbox max age / p95 age
- Publish failure rate
- Inbox duplicate hit rate (should be >0 in real systems)
- Consumer retry-to-success ratio
- DLQ rate
Avoid the vanity metric of "messages sent" reported without lag/failure context.
7) 30-minute implementation plan
- Add outbox table + write path in existing transaction.
- Add relay worker with bounded batch + retry.
- Add inbox table for one critical consumer.
- Add duplicate-safe handler transaction.
- Add 4 dashboards: backlog, lag, failures, DLQ.
- Run chaos drill: kill relay/consumer and verify recovery.
TL;DR
Reliable event-driven architecture is not magic exactly-once transport. It is:
- transactional outbox,
- idempotent inbox,
- explicit retries,
- and observability for backlog/lag/failure.
Design for duplicates and crashes from day one, and production incidents become recoverable operations instead of data forensics.