Shadow Traffic & Dark Launch Playbook (Practical)
Date: 2026-02-23
Category: knowledge
Why this matters
When releases fail in production, the root cause is often not code correctness but environment mismatch (real traffic shape, payload skew, latency cascades, retries, noisy neighbors).
Shadow traffic and dark launches reduce that mismatch by testing new code paths under realistic conditions before user-visible cutover.
Core concepts
1) Shadow traffic
Duplicate live requests to a candidate system (read-only / side-effect suppressed) and compare its behavior with the current system.
- Goal: detect correctness + latency + stability gaps under real traffic
- Rule: shadow path must not trigger external side effects (emails, payments, writes)
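A minimal sketch of the mirroring rule, assuming in-process handlers (`primary` and `candidate` are hypothetical callables; real systems usually mirror at the proxy or service-mesh layer). The candidate receives a flagged copy so downstream code can suppress side effects, its response is recorded for diffing, and its failures never reach the user:

```python
import random

DIFFS = []            # (request, primary_resp, shadow_resp) for the comparator job
SHADOW_FAILURES = []  # shadow-path errors, observed but never user-visible

def handle_request(request, primary, candidate, sample_rate=0.05):
    """Serve from the primary; mirror a sampled copy to the candidate."""
    response = primary(request)                   # user-facing path, unchanged
    if random.random() < sample_rate:
        shadow_req = dict(request, shadow=True)   # flag suppresses side effects downstream
        try:
            DIFFS.append((request, response, candidate(shadow_req)))
        except Exception:
            SHADOW_FAILURES.append(request)       # shadow errors must never reach users
    return response
```

Note the asymmetry: the primary response is always returned as-is, and the candidate call is wrapped so an exception in the new path cannot affect the live request.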
2) Dark launch
Deploy feature code to production, but keep user exposure at 0% (or internal-only) via flags/routing.
- Goal: validate runtime behavior safely in prod
- Rule: separate deployment from release
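"Separate deployment from release" means the new code path ships to production behind a flag that starts at 0% (internal-only). A sketch, assuming a hypothetical `new_checkout` flag and an in-memory flag store:

```python
# Flag state would normally live in a flag service; this is an in-memory stand-in.
FLAGS = {"new_checkout": {"enabled": False, "allow_internal": True}}

def use_new_path(user, flag="new_checkout"):
    """Deployment already put the code in prod; this flag alone decides release."""
    f = FLAGS[flag]
    if f["enabled"]:
        return True
    # Dark-launch posture: only internal users ever hit the new path.
    return f["allow_internal"] and user.get("internal", False)
```

Flipping `enabled` is the release; rolling back is flipping it off, with no redeploy.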
3) Progressive exposure
0% → internal → 1% → 5% → 25% → 100%, with explicit guardrails at each step.
Execution blueprint
Phase A โ Readiness checks
- Define golden signals per endpoint:
- p50/p95/p99 latency
- error rate (5xx, timeout)
- resource pressure (CPU, memory)
- output divergence (schema/value deltas)
- Define rollback triggers up front (no ad-hoc debate during incident).
- Add request correlation IDs to old/new paths for exact diffing.
Phase B โ Shadow mode
- Mirror a bounded percentage of traffic (start at 1-5%).
- Disable or sandbox side effects in candidate path.
- Compute diff metrics:
- response code parity
- semantic parity (important fields)
- tail latency drift
- Log only sampled payloads if PII risk exists.
Phase C โ Dark launch
- Keep user-facing routing at 0%, but execute new path in production infra.
- Validate:
- connection pool behavior
- cache churn
- retry storms / thundering herd
- downstream quota interaction
Phase D โ Progressive release
- Increase exposure stepwise by flag or segment.
- Pause between steps for markout window (e.g., 20-60 min) to observe delayed failures.
- Auto-stop if guardrail breached.
Guardrail examples (copy/adapt)
- error_rate_new <= error_rate_old + 0.2%p
- p99_new <= p99_old * 1.15
- timeout_new <= timeout_old + 0.1%p
- semantic_divergence <= 0.5%
If any breached for N consecutive windows (e.g., 3 x 5min), freeze rollout and revert traffic split.
Common failure patterns
- Fake shadowing: replay uses synthetic traffic, not real burstiness.
- Leaky side effects: supposedly read-only path still triggers webhooks or writes.
- Schema drift blind spot: status code matches but payload contract differs.
- No rollback contract: team argues under pressure instead of executing predefined threshold.
- Observability lag: metrics delayed, causing over-advancement of rollout.
Minimal implementation checklist
- Traffic duplicator with per-route sampling
- Side-effect kill switch for candidate path
- Correlation ID propagated across services
- Comparator job (status + semantic + latency)
- Feature flag with segmented rollout support
- Auto-halt policy wired to SLO guardrails
- One-command rollback runbook
Quick decision rule
Use shadow traffic when correctness confidence is low.
Use dark launch when integration/runtime confidence is low.
Use both for high-risk releases.
The key insight: release safety is an operations design problem, not just a testing problem.