Shadow Traffic & Dark Launch Playbook (Practical)
Date: 2026-02-23
Category: knowledge
Why this matters
When releases fail in production, the root cause is often not code correctness but environment mismatch (real traffic shape, payload skew, latency cascades, retries, noisy neighbors).
Shadow traffic and dark launches reduce that mismatch by testing new code paths under realistic conditions before user-visible cutover.
Core concepts
1) Shadow traffic
Duplicate live requests to a candidate system (read-only / side-effect suppressed) and compare its behavior with the current system.
- Goal: detect correctness + latency + stability gaps under real traffic
- Rule: shadow path must not trigger external side effects (emails, payments, writes)
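A minimal sketch of the mirroring rule, assuming in-process handlers (`primary` and `candidate` are hypothetical callables; real systems usually mirror at the proxy or service-mesh layer). The candidate receives a flagged copy so downstream code can suppress side effects, its response is recorded for diffing, and its failures never reach the user:

```python
import random

DIFFS = []            # (request, primary_resp, shadow_resp) for the comparator job
SHADOW_FAILURES = []  # shadow-path errors, observed but never user-visible

def handle_request(request, primary, candidate, sample_rate=0.05):
    """Serve from the primary; mirror a sampled copy to the candidate."""
    response = primary(request)                   # user-facing path, unchanged
    if random.random() < sample_rate:
        shadow_req = dict(request, shadow=True)   # flag suppresses side effects downstream
        try:
            DIFFS.append((request, response, candidate(shadow_req)))
        except Exception:
            SHADOW_FAILURES.append(request)       # shadow errors must never reach users
    return response
```

Note the asymmetry: the primary response is always returned as-is, and the candidate call is wrapped so an exception in the new path cannot affect the live request.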
2) Dark launch
Deploy feature code to production, but keep user exposure at 0% (or internal-only) via flags/routing.
- Goal: validate runtime behavior safely in prod
- Rule: separate deployment from release
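"Separate deployment from release" means the new code path ships to production behind a flag that starts at 0% (internal-only). A sketch, assuming a hypothetical `new_checkout` flag and an in-memory flag store:

```python
# Flag state would normally live in a flag service; this is an in-memory stand-in.
FLAGS = {"new_checkout": {"enabled": False, "allow_internal": True}}

def use_new_path(user, flag="new_checkout"):
    """Deployment already put the code in prod; this flag alone decides release."""
    f = FLAGS[flag]
    if f["enabled"]:
        return True
    # Dark-launch posture: only internal users ever hit the new path.
    return f["allow_internal"] and user.get("internal", False)
```

Flipping `enabled` is the release; rolling back is flipping it off, with no redeploy.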
3) Progressive exposure
0% → internal → 1% → 5% → 25% → 100%, with explicit guardrails at each step.
Execution blueprint
Phase A โ Readiness checks
- Define golden signals per endpoint:
- p50/p95/p99 latency
- error rate (5xx, timeout)
- resource pressure (CPU, memory)
- output divergence (schema/value deltas)
- Define rollback triggers up front (no ad-hoc debate during incident).
- Add request correlation IDs to old/new paths for exact diffing.
Phase B โ Shadow mode
- Mirror a bounded percentage of traffic (start at 1-5%).
- Disable or sandbox side effects in candidate path.
- Compute diff metrics:
- response code parity
- semantic parity (important fields)
- tail latency drift
- Log only sampled payloads if PII risk exists.
Phase C โ Dark launch
- Keep user-facing routing at 0%, but execute new path in production infra.
- Validate:
- connection pool behavior
- cache churn
- retry storms / thundering herd
- downstream quota interaction
Phase D โ Progressive release
- Increase exposure stepwise by flag or segment.
- Pause between steps for markout window (e.g., 20-60 min) to observe delayed failures.
- Auto-stop if guardrail breached.
Guardrail examples (copy/adapt)
- error_rate_new <= error_rate_old + 0.2%p
- p99_new <= p99_old * 1.15
- timeout_new <= timeout_old + 0.1%p
- semantic_divergence <= 0.5%
If any breached for N consecutive windows (e.g., 3 x 5min), freeze rollout and revert traffic split.
Common failure patterns
- Fake shadowing: replay uses synthetic traffic, not real burstiness.
- Leaky side effects: supposedly read-only path still triggers webhooks or writes.
- Schema drift blind spot: status code matches but payload contract differs.
- No rollback contract: team argues under pressure instead of executing predefined threshold.
- Observability lag: metrics delayed, causing over-advancement of rollout.
Minimal implementation checklist
- Traffic duplicator with per-route sampling
- Side-effect kill switch for candidate path
- Correlation ID propagated across services
- Comparator job (status + semantic + latency)
- Feature flag with segmented rollout support
- Auto-halt policy wired to SLO guardrails
- One-command rollback runbook
Quick decision rule
Use shadow traffic when correctness confidence is low.
Use dark launch when integration/runtime confidence is low.
Use both for high-risk releases.
The key insight: release safety is an operations design problem, not just a testing problem.