Outbox + Inbox Pattern Practical Playbook

2026-02-22 · software

Why this matters

Most "exactly-once" claims in distributed systems are really at-least-once delivery + idempotent consumption. If you skip this truth, retries turn into duplicate side effects.

This playbook is for building event-driven services that survive crashes, retries, and network jitter without corrupting business state.


Core model (production reality)


1) Producer: Transactional Outbox

When a command updates domain data, write event records in the same DB transaction.

Minimal outbox schema
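A minimal sketch in SQLite-flavored DDL, wrapped in Python so it runs as-is. Only `published_at`, `attempt_count`, and `last_error` are named elsewhere in this playbook; the other columns and types are assumptions:

```python
import sqlite3

# Hypothetical minimal outbox table. Column names beyond published_at,
# attempt_count, and last_error (referenced by the relay rules) are assumptions.
OUTBOX_DDL = """
CREATE TABLE IF NOT EXISTS outbox (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    aggregate_id  TEXT NOT NULL,      -- which domain entity the event is about
    event_type    TEXT NOT NULL,      -- e.g. 'order.placed'
    payload       TEXT NOT NULL,      -- serialized event body (JSON)
    created_at    TEXT NOT NULL DEFAULT (datetime('now')),
    published_at  TEXT,               -- NULL until the broker acks
    attempt_count INTEGER NOT NULL DEFAULT 0,
    last_error    TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.execute(OUTBOX_DDL)
```

On a real database you would also index `published_at` (or a partial index on unpublished rows) so the relay's batch select stays cheap.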

Rules

  1. Domain write and outbox insert must commit together.
  2. Never rely on publishing directly from the request thread after commit as your only mechanism; a crash between commit and publish silently loses the event.
  3. Relay is asynchronous and retry-safe.
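Rule 1 can be sketched as a single transaction covering both writes. This is a self-contained toy using SQLite; the `orders` table, `place_order` function, and event shape are hypothetical:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type TEXT NOT NULL,
    payload TEXT NOT NULL,
    published_at TEXT
);
""")

def place_order(conn, order_id):
    # Rule 1: the domain write and the outbox insert commit together.
    with conn:  # one transaction: both statements commit, or both roll back
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'placed')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("order.placed", json.dumps({"order_id": order_id})),
        )

place_order(conn, "o-1")
```

If the process dies before `place_order` returns, neither row exists; if it dies after, the event is durably queued for the relay.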

2) Relay: Safe publisher loop

Relay process:

  1. Select unpublished outbox rows in small batches.
  2. Publish each event.
  3. Mark published_at only after broker ack.
  4. On failure, increment attempt_count, store last_error, retry with backoff.
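The four steps above can be sketched as one relay pass. The `publish` stub stands in for a real broker client, and the backoff scheduling is left to the caller; both are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE outbox (
    id INTEGER PRIMARY KEY, payload TEXT,
    published_at TEXT, attempt_count INTEGER DEFAULT 0, last_error TEXT)""")
conn.executemany("INSERT INTO outbox (id, payload) VALUES (?, ?)",
                 [(1, "a"), (2, "b")])

def publish(payload):
    # Stand-in for a real broker client; raise here to simulate failure.
    pass

def relay_once(conn, batch_size=100):
    # Step 1: select unpublished rows in a small batch, oldest first.
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published_at IS NULL "
        "ORDER BY id LIMIT ?", (batch_size,)).fetchall()
    for row_id, payload in rows:
        try:
            publish(payload)  # Step 2: publish the event.
            # Step 3: mark published_at only after the broker ack.
            with conn:
                conn.execute(
                    "UPDATE outbox SET published_at = datetime('now') "
                    "WHERE id = ?", (row_id,))
        except Exception as exc:
            # Step 4: record the failure; the caller retries with backoff.
            with conn:
                conn.execute(
                    "UPDATE outbox SET attempt_count = attempt_count + 1, "
                    "last_error = ? WHERE id = ?", (str(exc), row_id))
    return len(rows)

relay_once(conn)
```

Note the duplicate window this creates on purpose: a crash between broker ack and the `published_at` update republishes the event on restart, which is exactly why the consumer-side inbox exists.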

Operational guardrails

  1. Keep batches bounded so a large backlog cannot starve the loop.
  2. Use exponential backoff between retries.
  3. Cap attempt_count and route exhausted messages to a DLQ.
  4. Alert on backlog growth, not only on individual publish failures.


3) Consumer: Inbox idempotency

Every consumer should treat duplicates as normal.

Inbox schema (per consumer)
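A minimal sketch, again as SQLite DDL wrapped in Python. The composite key on `(consumer_name, message_id)` matches step 2 of the handler pattern below; the `status` and `received_at` columns are assumptions:

```python
import sqlite3

# Hypothetical per-consumer inbox. The unique (consumer_name, message_id)
# pair is what makes duplicate deliveries detectable at insert time.
INBOX_DDL = """
CREATE TABLE IF NOT EXISTS inbox (
    consumer_name TEXT NOT NULL,
    message_id    TEXT NOT NULL,
    status        TEXT NOT NULL DEFAULT 'processing',  -- 'processing' | 'done'
    received_at   TEXT NOT NULL DEFAULT (datetime('now')),
    PRIMARY KEY (consumer_name, message_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute(INBOX_DDL)
```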

Handler pattern

  1. Begin transaction.
  2. Try insert (consumer_name, message_id).
  3. If duplicate key: message already handled → ACK and exit.
  4. Run business side effects.
  5. Mark inbox status done.
  6. Commit.
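The six steps above can be sketched as one function. The `balances` table and the billing example are hypothetical stand-ins for your business side effects:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inbox (
    consumer_name TEXT NOT NULL, message_id TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'processing',
    PRIMARY KEY (consumer_name, message_id));
CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER NOT NULL);
INSERT INTO balances VALUES ('acct-1', 0);
""")

def handle(conn, consumer, message_id, account, delta):
    """Duplicate-safe handler; returns True if the side effects ran."""
    try:
        with conn:  # Steps 1 and 6: one transaction around everything.
            # Step 2: try to claim (consumer_name, message_id).
            conn.execute(
                "INSERT INTO inbox (consumer_name, message_id) VALUES (?, ?)",
                (consumer, message_id))
            # Step 4: business side effect.
            conn.execute(
                "UPDATE balances SET amount = amount + ? WHERE account = ?",
                (delta, account))
            # Step 5: mark the inbox row done.
            conn.execute(
                "UPDATE inbox SET status = 'done' "
                "WHERE consumer_name = ? AND message_id = ?",
                (consumer, message_id))
    except sqlite3.IntegrityError:
        # Step 3: duplicate key means already handled -> ACK and exit.
        return False
    return True

handle(conn, "billing", "m-1", "acct-1", 50)
handle(conn, "billing", "m-1", "acct-1", 50)  # redelivery: no double charge
```

Because the inbox claim and the side effect share a transaction, a crash between them rolls both back, and the redelivery starts from a clean slate.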

If side effects involve external APIs, include idempotency keys on those API calls too.
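One way to do that is to derive the external idempotency key deterministically from the message identity, so every retry of the same message sends the same key. The `Idempotency-Key` header name is a common provider convention, not something this playbook prescribes:

```python
import hashlib

def idempotency_key(consumer_name, message_id):
    # Deterministic per (consumer, message): retries of the same message
    # produce the same key, so the external provider can dedupe the call.
    return hashlib.sha256(f"{consumer_name}:{message_id}".encode()).hexdigest()

# Hypothetical usage with an HTTP client (header name varies by provider):
# requests.post(url, json=body,
#               headers={"Idempotency-Key": idempotency_key("billing", "m-1")})
```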


4) Versioning and replay safety

Event contracts evolve. Keep handlers replayable.
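One way to keep old events replayable is to dispatch on the `(event_type, version)` pair and never delete a handler for a version that still exists in storage. The registry, event names, and default-fill logic below are all hypothetical:

```python
# Hypothetical version-aware dispatch: historical events stay replayable
# because every (event_type, version) pair keeps a registered handler.
HANDLERS = {}

def handles(event_type, version):
    def register(fn):
        HANDLERS[(event_type, version)] = fn
        return fn
    return register

@handles("order.placed", 1)
def order_placed_v1(payload):
    # v1 events had no currency field; default it so replays are deterministic.
    return {"order_id": payload["order_id"], "currency": "USD"}

@handles("order.placed", 2)
def order_placed_v2(payload):
    return {"order_id": payload["order_id"], "currency": payload["currency"]}

def dispatch(event):
    handler = HANDLERS[(event["type"], event["version"])]
    return handler(event["payload"])
```

Tolerating unknown fields in newer payloads, and defaulting missing fields in older ones, keeps a full replay of the outbox deterministic.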


5) Failure modes checklist

  1. Relay crashes after publish but before marking published_at → duplicate delivery; the inbox absorbs it.
  2. Consumer crashes mid-handler → transaction rolls back; redelivery retries cleanly.
  3. Broker outage → outbox backlog grows, but domain writes keep committing.
  4. Poison message → attempt_count climbs; cap it and route to the DLQ.


6) Metrics that actually matter

  1. Outbox backlog: count of unpublished rows.
  2. Publish lag: delay from created_at to published_at.
  3. Failure rate: attempt_count growth and last_error frequency.
  4. DLQ depth and age of its oldest message.

Avoid the vanity metric of "messages sent" without lag or failure context.


7) 30-minute implementation plan

  1. Add outbox table + write path in existing transaction.
  2. Add relay worker with bounded batch + retry.
  3. Add inbox table for one critical consumer.
  4. Add duplicate-safe handler transaction.
  5. Add 4 dashboards: backlog, lag, failures, DLQ.
  6. Run chaos drill: kill relay/consumer and verify recovery.

TL;DR

Reliable event-driven architecture is not magic exactly-once transport. It is:

  1. At-least-once delivery, accepted openly.
  2. A transactional outbox, so events commit with the domain write.
  3. An idempotent inbox, so duplicates are no-ops.

Design for duplicates and crashes from day one, and production incidents become recoverable operations instead of data forensics.