Saga Orchestration vs Choreography — Practical Playbook
Date: 2026-02-23
Category: knowledge
Why this matters
Distributed transactions fail at the seams: payment succeeds, inventory fails, email retries forever, and support gets the ticket. Sagas are how you keep business consistency without pretending 2PC works across modern services.
Core model (short)
- Saga = sequence of local transactions + compensations.
- Forward action does work (reserve inventory).
- Compensation semantically undoes work (release inventory).
- Guarantee is eventual consistency, not instant atomicity.
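The core model can be sketched in a few lines. This is a minimal in-process illustration, not a production engine: the names Step and run_saga are hypothetical, and it assumes compensations themselves do not fail.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # forward local transaction
    compensate: Callable[[], None]  # semantic undo of that transaction


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse.

    Returns True if every step succeeded, False if the saga was rolled back.
    """
    done: List[Step] = []
    for step in steps:
        try:
            step.action()
            done.append(step)
        except Exception:
            # The failed step never completed, so only earlier steps are undone.
            for prev in reversed(done):
                prev.compensate()
            return False
    return True
```

For example, if reserve inventory succeeds and charge payment raises, run_saga calls the inventory compensation (release inventory) and returns False. The payment compensation is not called because the payment step never completed.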
Two styles
1) Orchestration (central workflow brain)
A coordinator service decides the next step and timeout policy.
Best when:
- Flow is long and business-critical
- Strong observability/auditability is required
- Teams want one place to enforce retries, deadlines, rollback policy
Pros
- Clear state machine and ownership
- Easier incident debugging and replay
- Policy changes in one place
Cons
- Coordinator can become bottleneck/mono-brain
- Risk of coupling if every team depends on coordinator release cycle
2) Choreography (event-driven dance)
Each service reacts to domain events and emits new events.
Best when:
- Domain boundaries are mature
- Teams are autonomous
- Flows are relatively simple and additive
Pros
- Loose coupling, local autonomy
- Easy to add new listeners
Cons
- Hidden control flow (“who triggers what?”)
- Harder global reasoning and failure tracing
- Easy to create event loops/duplicate side-effects
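A toy in-memory bus shows both the appeal and the risk of choreography. Everything here is an assumption for illustration (the EventBus class, the event names, the max_events guard); a real system would use a message broker. Note that no single place spells out the whole flow: it only emerges from the subscriptions.

```python
from collections import defaultdict, deque
from typing import Callable, Dict, List, Tuple

Event = Tuple[str, dict]  # (event name, payload)
Handler = Callable[[dict], List[Event]]


class EventBus:
    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)
        self.trace: List[str] = []  # order in which events were processed

    def subscribe(self, name: str, handler: Handler) -> None:
        self._handlers[name].append(handler)

    def publish(self, name: str, payload: dict, max_events: int = 100) -> None:
        """Deliver an event and everything it triggers, breadth-first."""
        queue = deque([(name, payload)])
        processed = 0
        while queue:
            if processed >= max_events:
                # Crude guard against the event loops mentioned above.
                raise RuntimeError("possible event loop")
            event_name, body = queue.popleft()
            self.trace.append(event_name)
            processed += 1
            for handler in self._handlers[event_name]:
                queue.extend(handler(body))  # handlers emit follow-up events


def reserve_inventory(order: dict) -> List[Event]:
    return [("InventoryReserved", order)]


def charge_payment(order: dict) -> List[Event]:
    return [("PaymentCaptured", order)]


bus = EventBus()
bus.subscribe("OrderPlaced", reserve_inventory)
bus.subscribe("InventoryReserved", charge_payment)
bus.publish("OrderPlaced", {"order_id": "o-1"})
```

After the publish call, bus.trace holds OrderPlaced, InventoryReserved, PaymentCaptured in that order. Adding a listener is one subscribe call, but answering "who triggers what?" means reading every subscription.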
Decision heuristic
Use this as a quick classifier:
- Need a strict timeline/SLA and a clear single source of truth? → Orchestration
- Need high team autonomy and composable extensions? → Choreography
- If uncertain: start with orchestrated core + choreographed side effects.
Hybrid pattern (recommended in practice)
- Orchestrate the critical money/state path (order, payment, inventory).
- Choreograph non-critical observers (analytics, notifications, enrichment).
- Keep a hard contract: observers must be idempotent and never block core completion.
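The "never block core completion" contract can be sketched as follows. This is a simplified, synchronous illustration with hypothetical names (complete_order, notify_observers); in practice observers would usually consume the event asynchronously from a broker, which gives the same isolation.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("saga.observers")


def notify_observers(event: dict, observers: List[Callable[[dict], None]]) -> int:
    """Fan out to non-critical observers; return how many of them failed."""
    failures = 0
    for observer in observers:
        try:
            observer(event)
        except Exception:
            # An observer failure is logged, never propagated to the core path.
            logger.exception("observer %r failed", observer)
            failures += 1
    return failures


def complete_order(
    order_id: str,
    core_steps: List[Callable[[str], None]],
    observers: List[Callable[[dict], None]],
) -> str:
    # Orchestrated critical path: any exception here propagates to the caller.
    for step in core_steps:
        step(order_id)
    # Choreographed side effects: best effort only.
    notify_observers({"type": "OrderCompleted", "order_id": order_id}, observers)
    return "COMPLETED"
```

With this split, a broken notification service costs a log line and a counter increment, while a payment failure still stops the order.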
Reliability rules that actually prevent pain
Idempotency keys everywhere
- Command key for forward action
- Separate key namespace for compensation
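One possible way to derive the keys, assuming each step runs at most once per saga. The key layout (namespace, saga id, step name) and the in-memory result store are illustrative assumptions; a real store would be a database table with a unique constraint on the key.

```python
import hashlib
from typing import Any, Callable, Dict


def idempotency_key(namespace: str, saga_id: str, step: str) -> str:
    """Deterministic key: the same (namespace, saga, step) always maps to the same key."""
    raw = f"{namespace}:{saga_id}:{step}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class IdempotentExecutor:
    """Runs an operation at most once per key and replays the stored result."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}  # assumption: stands in for a durable table

    def execute(self, key: str, operation: Callable[[], Any]) -> Any:
        if key in self._results:
            return self._results[key]
        result = operation()
        self._results[key] = result
        return result


# Separate namespaces keep a compensation from being mistaken for a
# duplicate of the forward action on the same step.
forward_key = idempotency_key("cmd", "saga-42", "charge_payment")
compensation_key = idempotency_key("comp", "saga-42", "charge_payment")
```

If the forward action and its compensation shared one key, a deduplicating consumer could silently drop the refund as "already processed".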
Timeouts are business decisions
- Technical timeout != business timeout
- Encode per-step deadline + max retry window
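A sketch of what encoding both limits per step might look like. The field names, the three outcomes (RETRY, STOP_RETRYING, COMPENSATE), and the example numbers are assumptions chosen for illustration; the point is only that the technical per-call timeout and the business deadline are separate settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepPolicy:
    call_timeout_s: float      # technical: how long one attempt may take
    max_retry_window_s: float  # how long automatic retries may continue
    step_deadline_s: float     # business: how long the step may stay unresolved


def next_action(policy: StepPolicy, started_at: float, now: float) -> str:
    """Decide what to do with a step that has not succeeded yet."""
    elapsed = now - started_at
    if elapsed >= policy.step_deadline_s:
        return "COMPENSATE"     # business deadline passed: roll back
    if elapsed >= policy.max_retry_window_s:
        return "STOP_RETRYING"  # hold for a human or for the deadline
    return "RETRY"


# Hypothetical values: a 2 s HTTP timeout, 5 min of retries, 15 min business deadline.
payment_policy = StepPolicy(call_timeout_s=2.0, max_retry_window_s=300.0, step_deadline_s=900.0)
```

Here a payment call that times out after 2 seconds is a technical event, while abandoning the payment after 15 minutes is a business decision that triggers compensation.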
Outbox + Inbox pattern
- Atomic publish via transactional outbox
- Consumer inbox table for de-duplication
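Both halves can be illustrated with the standard-library sqlite3 module. Table and function names are hypothetical, and the relay that reads unpublished outbox rows and sends them to a broker is omitted. The sketch assumes the handler writes through the same connection, so its effects commit together with the inbox row.

```python
import json
import sqlite3
from typing import Callable


def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE inbox (message_id TEXT PRIMARY KEY);
        """
    )


def place_order(conn: sqlite3.Connection, order_id: str) -> None:
    # Outbox: the state change and the event row commit together or not at all.
    with conn:
        conn.execute("INSERT INTO orders (id, status) VALUES (?, 'PLACED')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderPlaced", json.dumps({"order_id": order_id})),
        )


def handle_once(conn: sqlite3.Connection, message_id: str, handler: Callable[[], None]) -> bool:
    """Inbox: return True if the handler ran, False if the message was a duplicate."""
    try:
        with conn:
            # The primary key rejects a message id that was already recorded.
            conn.execute("INSERT INTO inbox (message_id) VALUES (?)", (message_id,))
            handler()  # if this raises, the inbox row is rolled back too
    except sqlite3.IntegrityError:
        return False
    return True
```

A redelivered message hits the primary-key constraint and is skipped, which is what turns at-least-once delivery into effectively-once processing.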
Compensation semantics first
- “Undo” is not always inverse action
- Some steps need refund, not delete
Poison message quarantine
- Dead-letter with reason code
- Human-runbook path for non-retriable failures
Versioned saga contract
- Event schema version + compatibility window
- Never roll out breaking event changes in one shot
Minimal state model (orchestrated)
Recommended saga states:
- PENDING
- RUNNING
- WAITING_RETRY
- COMPENSATING
- COMPLETED
- FAILED_FINAL
Store:
- saga_id, correlation_id, step, attempt
- state, last_error_code, deadline_at
- started_at, updated_at, completed_at
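The states and stored fields above map naturally onto an enum and a record type. The transition table below is an assumption (the playbook lists the states, not the allowed moves between them), so treat it as a starting point to adapt.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class SagaState(str, Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    WAITING_RETRY = "WAITING_RETRY"
    COMPENSATING = "COMPENSATING"
    COMPLETED = "COMPLETED"
    FAILED_FINAL = "FAILED_FINAL"


# Assumed transitions; terminal states allow none.
ALLOWED_TRANSITIONS = {
    SagaState.PENDING: {SagaState.RUNNING},
    SagaState.RUNNING: {SagaState.WAITING_RETRY, SagaState.COMPENSATING, SagaState.COMPLETED},
    SagaState.WAITING_RETRY: {SagaState.RUNNING, SagaState.COMPENSATING},
    SagaState.COMPENSATING: {SagaState.FAILED_FINAL},
    SagaState.COMPLETED: set(),
    SagaState.FAILED_FINAL: set(),
}
TERMINAL = {SagaState.COMPLETED, SagaState.FAILED_FINAL}


@dataclass
class SagaRecord:
    saga_id: str
    correlation_id: str
    step: str
    deadline_at: datetime
    started_at: datetime
    updated_at: datetime
    attempt: int = 0
    state: SagaState = SagaState.PENDING
    last_error_code: Optional[str] = None
    completed_at: Optional[datetime] = None

    def transition(self, new_state: SagaState, now: datetime) -> None:
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state.value} -> {new_state.value}")
        self.state = new_state
        self.updated_at = now
        if new_state in TERMINAL:
            self.completed_at = now


# Usage with hypothetical ids and a 15-minute deadline.
t0 = datetime(2026, 2, 23, 12, 0, tzinfo=timezone.utc)
record = SagaRecord("saga-42", "corr-7", "reserve_inventory",
                    deadline_at=t0 + timedelta(minutes=15), started_at=t0, updated_at=t0)
record.transition(SagaState.RUNNING, t0)
record.transition(SagaState.COMPLETED, t0 + timedelta(seconds=5))
```

Rejecting illegal transitions in one place is what makes the orchestrator's state auditable: a saga cannot move from COMPLETED back to RUNNING without raising.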
Practical SLOs
Track these weekly:
- Saga completion rate (%)
- p95 end-to-end latency
- Compensation rate (%)
- Retry fan-out (avg retries per completed saga)
- Manual intervention rate (%)
Alert on drift, not just absolute thresholds.
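As a rough sketch, the weekly numbers and a drift check could be computed like this. The input shape (one dict per finished saga with state, latency_s, retries, compensated, manual), the nearest-rank p95, and the 20% relative tolerance are all assumptions for illustration.

```python
from typing import Dict, List


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile; expects a non-empty list."""
    ordered = sorted(values)
    rank = -(-95 * len(ordered) // 100)  # ceil(0.95 * n) using integer math
    return ordered[rank - 1]


def weekly_metrics(sagas: List[dict]) -> Dict[str, float]:
    total = len(sagas)
    completed = [s for s in sagas if s["state"] == "COMPLETED"]
    return {
        "completion_rate": len(completed) / total,
        "p95_latency_s": p95([s["latency_s"] for s in sagas]),
        "compensation_rate": sum(1 for s in sagas if s["compensated"]) / total,
        "retries_per_completed": (
            sum(s["retries"] for s in completed) / len(completed) if completed else 0.0
        ),
        "manual_intervention_rate": sum(1 for s in sagas if s["manual"]) / total,
    }


def drifted(current: float, baseline: float, rel_tolerance: float = 0.2) -> bool:
    """True when the metric moved more than rel_tolerance relative to its baseline."""
    if baseline == 0:
        return current != 0
    return abs(current - baseline) / abs(baseline) > rel_tolerance
```

A compensation rate that moves from 2% to 5% is still small in absolute terms, but drifted(0.05, 0.02) flags it because the metric more than doubled against its baseline.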
Anti-pattern checklist
- At-least-once delivery without enforced exactly-once effects (no idempotent handlers) → ❌
- Compensation missing for one step because “rare” → ❌
- No global correlation id across services → ❌
- Event names encode transport details, not domain facts → ❌
- Retrying blindly without backoff/jitter → ❌
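For the last item, a common remedy is capped exponential backoff with full jitter. The sketch below is one way to write it; the function names and default values are assumptions, and sleep and rng are injectable so the behavior can be tested without waiting.

```python
import random
import time
from typing import Any, Callable


def backoff_delay(attempt: int, base_s: float = 0.5, cap_s: float = 30.0, rng: Any = random) -> float:
    """Full jitter: a uniform delay between 0 and min(cap_s, base_s * 2**attempt)."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def retry_with_backoff(
    operation: Callable[[], Any],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
    rng: Any = random,
) -> Any:
    """Call operation until it succeeds or max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, rng=rng))
```

The randomization matters as much as the exponent: without it, many sagas that failed together retry together and hit the recovering dependency in synchronized waves.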
Rollout plan (safe)
- Pick one high-value workflow.
- Draw explicit step/compensation graph.
- Add idempotency and outbox/inbox before introducing any new retries; retries without them duplicate side effects.
- Ship with observability dashboard first.
- Run failure-injection drills (timeout, duplicate event, partial outage).
- Expand only after manual intervention rate is stable.
If your system is young, orchestration gives control. If your domain is mature, choreography gives scale. Most real systems should do both—intentionally, not accidentally.