Saga Orchestration vs Choreography — Practical Playbook

2026-02-23 · software

Date: 2026-02-23
Category: knowledge

Why this matters

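Distributed transactions fail at the seams: payment succeeds, inventory fails, email retries forever, and support gets the ticket. Sagas keep business state consistent without pretending that two-phase commit (2PC) works across independently deployed services: each step commits locally, and a failure triggers compensating actions for the steps already completed.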

Core model (short)

Two styles

1) Orchestration (central workflow brain)

A coordinator service decides the next step and timeout policy.
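To make the coordinator idea concrete, here is a minimal in-process sketch. The names `Step`, `run_saga`, `max_attempts`, and `retry_window_s` are hypothetical, chosen for illustration; a production coordinator would also persist its progress and enforce a hard deadline on each call, which this sketch leaves out.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]       # forward action
    compensate: Callable[[dict], None]   # business-level undo for this step
    max_attempts: int = 3                # retry budget for this step
    retry_window_s: float = 5.0          # stop retrying once this much time has passed


def run_saga(steps: List[Step], ctx: dict) -> str:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    done: List[Step] = []
    for step in steps:
        started = time.monotonic()
        for attempt in range(1, step.max_attempts + 1):
            try:
                step.action(ctx)
                done.append(step)
                break
            except Exception:
                window_closed = time.monotonic() - started > step.retry_window_s
                if attempt == step.max_attempts or window_closed:
                    # The failed step never completed, so only earlier steps are undone.
                    for finished in reversed(done):
                        finished.compensate(ctx)
                    return "COMPENSATED"
    return "COMPLETED"


if __name__ == "__main__":
    log: List[str] = []

    def out_of_stock(ctx: dict) -> None:
        raise RuntimeError("out of stock")

    steps = [
        Step("charge", lambda c: log.append("charge"), lambda c: log.append("refund")),
        Step("reserve", out_of_stock, lambda c: log.append("release"), max_attempts=2),
    ]
    print(run_saga(steps, {}), log)
```

The point of the design is that the step order, the retry policy, and the compensation order all live in one place, which is what makes an orchestrated saga easy to inspect.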

Best when:

Pros

Cons

2) Choreography (event-driven dance)

Each service reacts to domain events and emits new events.
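The same order flow can be sketched without a coordinator. In this illustrative example an in-memory `EventBus` stands in for a real broker, and the service functions and event names (`OrderPlaced`, `PaymentCaptured`, `InventoryRejected`, and so on) are hypothetical. Each handler knows only which event it consumes and which events it emits.

```python
from collections import defaultdict, deque
from typing import Callable, Dict, List, Tuple

Event = Tuple[str, dict]  # (event type, payload)
Handler = Callable[[dict], List[Event]]


class EventBus:
    """In-memory stand-in for a broker; a real system would use a durable one."""

    def __init__(self) -> None:
        self.handlers: Dict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        queue = deque([(event_type, payload)])
        while queue:
            etype, data = queue.popleft()
            self.history.append(etype)
            for handler in self.handlers[etype]:
                queue.extend(handler(data))  # handlers return follow-up events


def payment_service(order: dict) -> List[Event]:
    return [("PaymentCaptured", order)]


def inventory_service(order: dict) -> List[Event]:
    if order["qty"] > order["stock"]:
        return [("InventoryRejected", order)]
    return [("InventoryReserved", order)]


def refund_service(order: dict) -> List[Event]:
    # Compensation is just another reaction to an event.
    return [("PaymentRefunded", order)]


def build_bus() -> EventBus:
    bus = EventBus()
    bus.subscribe("OrderPlaced", payment_service)
    bus.subscribe("PaymentCaptured", inventory_service)
    bus.subscribe("InventoryRejected", refund_service)
    return bus


if __name__ == "__main__":
    bus = build_bus()
    bus.publish("OrderPlaced", {"order_id": "o-1", "qty": 5, "stock": 2})
    print(bus.history)
```

No single function describes the whole flow here. You can only see it by reading the subscriptions together, which is the trade-off of this style.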

Best when:

Pros

Cons

Decision heuristic

Use this as a quick classifier:

Hybrid pattern (recommended in practice)

Reliability rules that actually prevent pain

  1. Idempotency keys everywhere

    • Command key for forward action
    • Separate key namespace for compensation
  2. Timeouts are business decisions

    • Technical timeout != business timeout
    • Encode per-step deadline + max retry window
  3. Outbox + Inbox pattern

    • Atomic publish via transactional outbox
    • Consumer inbox table for de-duplication
  4. Compensation semantics first

    • “Undo” is not always the inverse action
    • Some steps need a refund, not a delete
  5. Poison message quarantine

    • Dead-letter with reason code
    • Human-runbook path for non-retriable failures
  6. Versioned saga contract

    • Event schema version + compatibility window
    • Never roll out breaking event changes in one shot
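Rules 1 and 3 above can be sketched together with the standard-library `sqlite3` module. The table names, the `order-placed:<id>` key format, and the function names are assumptions for illustration. In a real deployment the producer and consumer would have separate databases, and the relay would hand messages to a broker rather than a callback.

```python
import json
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (msg_id TEXT PRIMARY KEY, payload TEXT NOT NULL,
                             published INTEGER NOT NULL DEFAULT 0);
        CREATE TABLE inbox (msg_id TEXT PRIMARY KEY);
        CREATE TABLE reservations (order_id TEXT PRIMARY KEY);
    """)
    return conn


def place_order(conn: sqlite3.Connection, order_id: str) -> None:
    """Producer: the state change and the outgoing message commit together."""
    with conn:  # commits on success, rolls back on exception
        conn.execute("INSERT INTO orders VALUES (?, 'PLACED')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (msg_id, payload) VALUES (?, ?)",
            (f"order-placed:{order_id}", json.dumps({"order_id": order_id})),
        )


def relay(conn: sqlite3.Connection, deliver) -> None:
    """Publish pending outbox rows. A crash between deliver and UPDATE causes a
    re-send, so delivery is at-least-once and consumers must de-duplicate."""
    rows = conn.execute(
        "SELECT msg_id, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for msg_id, payload in rows:
        deliver(msg_id, payload)
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE msg_id = ?", (msg_id,))


def handle_order_placed(conn: sqlite3.Connection, msg_id: str, payload: str) -> bool:
    """Consumer: the inbox row and the side effect share one transaction, so a
    duplicate delivery hits the primary key and changes nothing."""
    try:
        with conn:
            conn.execute("INSERT INTO inbox VALUES (?)", (msg_id,))
            order_id = json.loads(payload)["order_id"]
            conn.execute("INSERT INTO reservations VALUES (?)", (order_id,))
        return True
    except sqlite3.IntegrityError:
        return False  # message id already recorded: treat as processed


if __name__ == "__main__":
    conn = make_db()
    place_order(conn, "o-1")
    sent = []
    relay(conn, lambda m, p: sent.append((m, p)))
    for msg_id, payload in sent * 2:  # simulate a duplicate delivery
        print(msg_id, handle_order_placed(conn, msg_id, payload))
```

The message id doubles as the idempotency key. A compensation message would use its own key prefix so that it can never collide with the forward action it reverses.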

Minimal state model (orchestrated)

Recommended saga states:

Store:
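As one illustration of an orchestrated saga record, the sketch below pairs a small state enum with a guarded transition method. The state names, the transition table, and the stored fields (`saga_id`, `current_step`, `history`) are assumptions made for the example, not a canonical set.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class SagaState(Enum):
    STARTED = "started"
    COMPENSATING = "compensating"
    COMPLETED = "completed"
    COMPENSATED = "compensated"
    FAILED = "failed"  # compensation itself failed; needs a human


# Hypothetical transition table: terminal states allow no further moves.
ALLOWED = {
    SagaState.STARTED: {SagaState.COMPLETED, SagaState.COMPENSATING},
    SagaState.COMPENSATING: {SagaState.COMPENSATED, SagaState.FAILED},
    SagaState.COMPLETED: set(),
    SagaState.COMPENSATED: set(),
    SagaState.FAILED: set(),
}


@dataclass
class SagaRecord:
    saga_id: str
    state: SagaState = SagaState.STARTED
    current_step: int = 0
    history: List[Tuple[SagaState, SagaState]] = field(default_factory=list)

    def transition(self, new_state: SagaState) -> None:
        """Move to new_state, rejecting any move the table does not allow."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.history.append((self.state, new_state))
        self.state = new_state


if __name__ == "__main__":
    record = SagaRecord("saga-1")
    record.transition(SagaState.COMPENSATING)
    record.transition(SagaState.COMPENSATED)
    print(record.state.name, len(record.history))
```

Keeping the allowed transitions in data rather than scattered `if` statements makes illegal moves fail loudly instead of silently corrupting a saga.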

Practical SLOs

Track these weekly:

Alert on drift, not just absolute thresholds.
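A drift check can be as simple as comparing the current week against a trailing baseline. The sketch below uses a z-score. The function name `drifted`, the threshold of 3 standard deviations, and the example metric (a weekly compensation rate) are illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import Sequence


def drifted(history: Sequence[float], current: float,
            z_threshold: float = 3.0, min_sigma: float = 1e-9) -> bool:
    """True if `current` is more than z_threshold standard deviations away
    from the mean of `history` (the trailing baseline)."""
    baseline = mean(history)
    # Floor sigma so a perfectly flat history does not divide by zero.
    sigma = max(pstdev(history), min_sigma)
    return abs(current - baseline) / sigma > z_threshold


if __name__ == "__main__":
    weekly_rate = [0.010, 0.012, 0.011, 0.009]  # hypothetical compensation rates
    absolute_limit = 0.05                       # hypothetical static threshold
    this_week = 0.020
    print("over absolute limit:", this_week > absolute_limit)
    print("drifted from baseline:", drifted(weekly_rate, this_week))
```

In this example the rate roughly doubles yet stays far below the static limit, so only the drift check flags it.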

Anti-pattern checklist

Rollout plan (safe)

  1. Pick one high-value workflow.
  2. Draw an explicit step/compensation graph.
  3. Add idempotency and outbox/inbox before adding new retries.
  4. Ship the observability dashboard first.
  5. Run failure-injection drills (timeout, duplicate event, partial outage).
  6. Expand only after manual intervention rate is stable.
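The drill in step 5 can start small. The sketch below simulates lost acknowledgements: the handler runs, the sender sees a timeout, and the event is redelivered. That exercises the timeout and duplicate-event cases together. `run_drill`, `Ledger`, and the `ack_loss_rate` parameter are hypothetical names, and a partial outage would need a separate harness.

```python
import random
from typing import Callable, Iterable


def run_drill(handler: Callable[[dict], None], events: Iterable[dict],
              ack_loss_rate: float = 0.3, max_attempts: int = 3,
              seed: int = 42) -> None:
    """Deliver each event, redelivering whenever the ack is 'lost'."""
    rng = random.Random(seed)  # seeded so a failing drill can be replayed
    for event in events:
        for _ in range(max_attempts):
            handler(event)
            if rng.random() >= ack_loss_rate:
                break  # ack received; move on to the next event
            # Ack lost: the sender times out and delivers the same event again.


class Ledger:
    """Toy consumer; `idempotent=False` shows what the drill should catch."""

    def __init__(self, idempotent: bool) -> None:
        self.idempotent = idempotent
        self.seen = set()
        self.total = 0

    def handle(self, event: dict) -> None:
        if self.idempotent:
            if event["id"] in self.seen:
                return  # duplicate delivery: ignore
            self.seen.add(event["id"])
        self.total += event["amount"]


if __name__ == "__main__":
    events = [{"id": "e1", "amount": 10}, {"id": "e2", "amount": 5}]
    for idempotent in (True, False):
        ledger = Ledger(idempotent)
        run_drill(ledger.handle, events, ack_loss_rate=1.0)  # every ack is lost
        print("idempotent" if idempotent else "naive", ledger.total)
```

The pass criterion is that the final state is the same with and without injected faults.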

If your system is young, orchestration gives control. If your domain is mature, choreography gives scale. Most real systems should do both, intentionally rather than accidentally.