Saga Orchestration vs Choreography — Practical Playbook

Date: 2026-02-23
Category: knowledge

Why this matters

Distributed transactions fail at the seams: payment succeeds, inventory fails, email retries forever, and support gets the ticket. Sagas are how you keep business state consistent without pretending two-phase commit (2PC) works across independently deployed services.

Core model (short)

Saga = sequence of local transactions + compensations.
Forward action does work (reserve inventory).
Compensation semantically undoes work (release inventory).
Guarantee is eventual consistency, not instant atomicity.
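
The core model above can be sketched in a few lines. This is a minimal in-process illustration, not a production runner: the Step type, run_saga, and the step names are hypothetical, and a real saga would persist progress and retry failed compensations.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]        # local transaction (forward action)
    compensation: Callable[[], None]  # semantic undo for that action


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception:
            # The failed step never completed, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensation()
            return False
    return True


log: List[str] = []


def fail_payment() -> None:
    raise RuntimeError("card declined")


ok = run_saga([
    Step("reserve_inventory",
         lambda: log.append("reserve inventory"),
         lambda: log.append("release inventory")),
    Step("charge_payment",
         fail_payment,
         lambda: log.append("refund payment")),
])
print(ok, log)  # False ['reserve inventory', 'release inventory']
```

Note that the result is eventually consistent: between the reservation and its release, other readers can observe the reserved inventory.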

Two styles

1) Orchestration (central workflow brain)

A coordinator service decides the next step and enforces the timeout policy.

Best when:

Flow is long and business-critical
Strong observability/auditability is required
Teams want one place to enforce retries, deadlines, rollback policy

Pros

Clear state machine and ownership
Easier incident debugging and replay
Policy changes in one place

Cons

Coordinator can become bottleneck/mono-brain
Risk of coupling if every team depends on coordinator release cycle

2) Choreography (event-driven dance)

Each service reacts to domain events and emits new events.

Best when:

Domain boundaries are mature
Teams are autonomous
Flows are relatively simple and additive

Pros

Loose coupling, local autonomy
Easy to add new listeners

Cons

Hidden control flow (“who triggers what?”)
Harder global reasoning and failure tracing
Easy to create event loops/duplicate side-effects
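
A toy choreography makes the hidden control flow visible. This sketch assumes an in-memory event bus; the event names, handlers, and the max_events guard are all hypothetical. Notice that no single function describes the whole flow: it only emerges from the handler registrations.

```python
from collections import defaultdict, deque
from typing import Callable, Dict, List, Tuple

Event = Tuple[str, dict]  # (event name, payload)
handlers: Dict[str, List[Callable[[dict], List[Event]]]] = defaultdict(list)
trace: List[str] = []


def on(event_name: str):
    """Register a handler that reacts to one event and returns new events."""
    def register(fn):
        handlers[event_name].append(fn)
        return fn
    return register


@on("OrderPlaced")
def reserve_inventory(payload: dict) -> List[Event]:
    return [("InventoryReserved", payload)]


@on("InventoryReserved")
def charge_payment(payload: dict) -> List[Event]:
    if payload["amount"] > 100:  # arbitrary rule to force a failure path
        return [("PaymentFailed", payload)]
    return [("PaymentCaptured", payload)]


@on("PaymentFailed")
def release_inventory(payload: dict) -> List[Event]:
    return [("InventoryReleased", payload)]  # compensation as a reaction


def publish(event: Event, max_events: int = 50) -> None:
    queue = deque([event])
    while queue:
        name, payload = queue.popleft()
        trace.append(name)
        if len(trace) > max_events:
            raise RuntimeError("possible event loop")  # crude loop guard
        for handler in handlers[name]:
            queue.extend(handler(payload))


publish(("OrderPlaced", {"order_id": "o-1", "amount": 250}))
print(trace)
# ['OrderPlaced', 'InventoryReserved', 'PaymentFailed', 'InventoryReleased']
```

Adding a listener is one decorator, which is the upside. Answering "who triggers what?" requires reading every handler, which is the downside.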

Decision heuristic

Use this as a quick classifier:

Need strict timeline/SLA and a clear single source of truth? → Orchestration
Need high team autonomy and composable extensions? → Choreography
If uncertain: start with orchestrated core + choreographed side effects.

Hybrid pattern (recommended in practice)

Orchestrate the critical money/state path (order, payment, inventory).
Choreograph non-critical observers (analytics, notifications, enrichment).
Keep a hard contract: observers must be idempotent and never block core completion.
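
The observer contract can be sketched as follows. The function and observer names are hypothetical, and the fan-out is done inline only to keep the example small; in practice it would more likely go through an outbox and a broker. The point is that observer failures are recorded and never propagate into the core path, and that a duplicate delivery does not double-count.

```python
from typing import Callable, Dict, List, Set

Observer = Callable[[Dict], None]


def complete_order(order: Dict, observers: List[Observer]) -> Dict:
    """Finish the orchestrated core path, then fan out to observers."""
    event = {"order_id": order["order_id"], "status": "COMPLETED"}
    failures: List[str] = []
    for notify in observers:
        try:
            notify(dict(event))  # each observer gets its own copy
        except Exception as exc:
            # Observer errors are recorded, never raised into the core path.
            failures.append(f"{notify.__name__}: {exc}")
    return {**event, "observer_failures": failures}


counted: Set[str] = set()


def record_analytics(event: Dict) -> None:
    counted.add(event["order_id"])  # set insert is naturally idempotent


def send_email(event: Dict) -> None:
    raise ConnectionError("smtp unavailable")


order = {"order_id": "o-1"}
first = complete_order(order, [record_analytics, send_email])
second = complete_order(order, [record_analytics, send_email])  # duplicate delivery
print(first["status"], len(counted), first["observer_failures"])
# COMPLETED 1 ['send_email: smtp unavailable']
```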

Reliability rules that actually prevent pain

Idempotency keys everywhere
- Command key for forward action
- Separate key namespace for compensation
Timeouts are business decisions
- Technical timeout != business timeout
- Encode per-step deadline + max retry window
Outbox + Inbox pattern
- Atomic publish via transactional outbox
- Consumer inbox table for de-duplication
Compensation semantics first
- “Undo” is not always the inverse action
- Some steps need a refund, not a delete
Poison message quarantine
- Dead-letter with reason code
- Human-runbook path for non-retriable failures
Versioned saga contract
- Event schema version + compatibility window
- Never roll out breaking event changes in one shot
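
As a rough illustration of the idempotency-key and outbox/inbox rules, here is a sketch using Python's built-in sqlite3 module. The table names, the key format, and the relay function are assumptions made for the example. A real relay would publish to a broker; here it calls the consumer directly and delivers every message twice to simulate at-least-once delivery.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE reservations (order_id TEXT PRIMARY KEY, sku TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         msg_key TEXT, payload TEXT, published INTEGER DEFAULT 0);
    CREATE TABLE inbox (msg_key TEXT PRIMARY KEY);
""")


def idempotency_key(kind: str, saga_id: str, step: str) -> str:
    # kind is "cmd" or "comp": compensations get their own key namespace.
    return f"{kind}:{saga_id}:{step}"


def reserve(saga_id: str, order_id: str, sku: str) -> None:
    """Producer: state change and outbox row commit in one transaction."""
    key = idempotency_key("cmd", saga_id, "reserve_inventory")
    with db:  # commits on success, rolls back on exception
        db.execute("INSERT INTO reservations VALUES (?, ?)", (order_id, sku))
        db.execute("INSERT INTO outbox (msg_key, payload) VALUES (?, ?)",
                   (key, json.dumps({"order_id": order_id, "sku": sku})))


def handle(key: str, payload: str, effects: list) -> bool:
    """Consumer: the inbox primary key rejects duplicate deliveries."""
    try:
        with db:
            db.execute("INSERT INTO inbox (msg_key) VALUES (?)", (key,))
            effects.append(json.loads(payload))  # stand-in for the local side effect
    except sqlite3.IntegrityError:
        return False  # already processed, skip
    return True


def relay(effects: list) -> None:
    rows = db.execute(
        "SELECT id, msg_key, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, key, payload in rows:
        handle(key, payload, effects)
        handle(key, payload, effects)  # simulated redelivery
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))


effects: list = []
reserve("saga-1", "o-1", "sku-9")
relay(effects)
print(len(effects))  # 1
```

Because the compensation key lives in a separate namespace ("comp:saga-1:reserve_inventory"), a release command can never be mistaken for a duplicate of the reservation it undoes.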

Minimal state model (orchestrated)

Recommended saga states:

PENDING
RUNNING
WAITING_RETRY
COMPENSATING
COMPLETED
FAILED_FINAL

Store:

saga_id, correlation_id, step, attempt
state, last_error_code, deadline_at
started_at, updated_at, completed_at
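
One way to hold this state in code is an enum plus a record with a guarded transition method. The states and stored fields come from the lists above; the transition table is an assumption for illustration (for example, it does not model retries of a failing compensation), so adjust it to your own flow.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class SagaState(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    WAITING_RETRY = "WAITING_RETRY"
    COMPENSATING = "COMPENSATING"
    COMPLETED = "COMPLETED"
    FAILED_FINAL = "FAILED_FINAL"


# Assumed transition table; terminal states have no outgoing edges.
ALLOWED = {
    SagaState.PENDING: {SagaState.RUNNING},
    SagaState.RUNNING: {SagaState.WAITING_RETRY, SagaState.COMPENSATING,
                        SagaState.COMPLETED},
    SagaState.WAITING_RETRY: {SagaState.RUNNING, SagaState.COMPENSATING},
    SagaState.COMPENSATING: {SagaState.FAILED_FINAL},
    SagaState.COMPLETED: set(),
    SagaState.FAILED_FINAL: set(),
}


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class SagaRecord:
    saga_id: str
    correlation_id: str
    deadline_at: datetime
    step: str = ""
    attempt: int = 0
    state: SagaState = SagaState.PENDING
    last_error_code: Optional[str] = None
    started_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)
    completed_at: Optional[datetime] = None

    def transition(self, new_state: SagaState,
                   error_code: Optional[str] = None) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state
        if error_code is not None:
            self.last_error_code = error_code
        self.updated_at = _now()
        if not ALLOWED[new_state]:  # terminal state reached
            self.completed_at = self.updated_at


saga = SagaRecord("saga-1", "corr-1", deadline_at=_now() + timedelta(minutes=15))
saga.transition(SagaState.RUNNING)
saga.transition(SagaState.COMPENSATING, error_code="PAYMENT_DECLINED")
saga.transition(SagaState.FAILED_FINAL)
print(saga.state.name, saga.last_error_code)  # FAILED_FINAL PAYMENT_DECLINED
```

Rejecting illegal transitions in one place is what makes the orchestrated state machine auditable: every persisted state change has gone through the same check.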

Practical SLOs

Track these weekly:

Saga completion rate (%)
p95 end-to-end latency
Compensation rate (%)
Retry fan-out (avg retries per completed saga)
Manual intervention rate (%)

Alert on drift from the recent baseline, not just on absolute thresholds.
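
A drift check can be as simple as comparing the current week against a trailing baseline. The sketch below is one assumed approach (mean plus a multiple of the standard deviation, with a minimum delta to avoid alerting on noise); the function name, parameters, and sample numbers are invented for illustration.

```python
from statistics import mean, pstdev
from typing import List


def drifted(history: List[float], current: float,
            sigmas: float = 3.0, min_delta: float = 0.5) -> bool:
    """True when `current` deviates from the trailing baseline by more than
    `sigmas` standard deviations, or `min_delta`, whichever is larger."""
    baseline = mean(history)
    spread = pstdev(history)
    return abs(current - baseline) > max(sigmas * spread, min_delta)


# Weekly compensation rate (%), hypothetical values.
compensation_rate = [1.8, 2.1, 2.0, 1.9, 2.2]

print(drifted(compensation_rate, 3.4))  # True
print(drifted(compensation_rate, 2.3))  # False
```

A fixed "alert above 5%" rule would stay silent at 3.4%, even though that is a 70% jump over the 2.0% baseline.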

Anti-pattern checklist

“At least once delivery, exactly once effects” not enforced → ❌
Compensation missing for one step because “rare” → ❌
No global correlation id across services → ❌
Event names encode transport details, not domain facts → ❌
Retrying blindly without backoff/jitter → ❌
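
For the last item, a common remedy is capped exponential backoff with full jitter, bounded by a retry window. This is a sketch under assumed defaults (base, cap, and max_attempts are placeholders, and retry_schedule is a hypothetical helper); it only computes the delays and leaves the actual sleeping and calling to the caller.

```python
import random
from typing import List, Optional


def retry_schedule(max_retry_window: float, base: float = 0.5,
                   cap: float = 30.0, max_attempts: int = 10,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return retry delays in seconds.

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)]
    (full jitter). The schedule stops once the next delay would push the
    total past max_retry_window, or after max_attempts delays.
    """
    rng = rng or random.Random()
    delays: List[float] = []
    elapsed = 0.0
    for attempt in range(max_attempts):
        delay = rng.uniform(0, min(cap, base * 2 ** attempt))
        if elapsed + delay > max_retry_window:
            break
        delays.append(delay)
        elapsed += delay
    return delays


schedule = retry_schedule(max_retry_window=60.0, rng=random.Random(42))
print([round(d, 2) for d in schedule])
```

The jitter spreads retries from many sagas over time so they do not hit a recovering dependency in lockstep, and the window ties the technical retry policy back to the business deadline.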

Rollout plan (safe)

1) Pick one high-value workflow.
2) Draw an explicit step/compensation graph.
3) Add idempotency and outbox/inbox before adding new retries.
4) Ship with an observability dashboard first.
5) Run failure-injection drills (timeout, duplicate event, partial outage).
6) Expand only after the manual intervention rate is stable.

If your system is young, orchestration gives control. If your domain is mature, choreography gives scale. Most real systems should do both—intentionally, not accidentally.