Brownout + Load Shedding Playbook (Practical)
Date: 2026-02-25
Category: knowledge
Domain: distributed systems / reliability engineering
Why this matters
Most incidents are not binary (up/down). They are overload gradients:
- latency stretches first,
- retries amplify pressure,
- dependency pools saturate,
- then hard failures cascade.
Brownout is the discipline of intentionally reducing non-critical work before collapse.
Load shedding is the enforcement mechanism: reject/defer work when capacity is unsafe.
The goal is simple: keep core user journeys alive while the system is stressed.
Core mental model
Think in four service states:
- Green: full features, normal SLO
- Yellow: disable expensive optional features
- Orange: enforce strict admission control, degrade responses
- Red: preserve only critical APIs, shed the rest
This is better than ad-hoc per-incident toggles because behavior becomes predictable and testable.
Brownout budget (what to trim first)
Classify each endpoint or feature by criticality:
- P0 (must survive): auth, order placement, payments commit, portfolio read
- P1 (important): search, recommendations, analytics summaries
- P2 (optional): personalization, heavy enrichments, non-essential fan-out
Rule: in overload, cut P2 first, then P1, and never P0 except in an absolute emergency.
A practical trick: define an explicit "optional work budget" (e.g., max 20% CPU / DB QPS for P2). Once exceeded, auto-disable optional code paths.
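The optional-work budget can be sketched as a sliding-window counter that auto-disables P2 paths once their share of recent work exceeds the budget. This is a minimal illustration; the 20% budget, the 10-second window, and the class/method names are assumptions, and a real system would meter CPU or DB QPS rather than request counts.

```python
import threading
import time
from collections import deque

class OptionalWorkBudget:
    """Hypothetical sketch: tracks the share of recent requests that did
    optional (P2) work and disables optional paths past the budget."""

    def __init__(self, budget=0.20, window_s=10.0):  # assumed values
        self.budget = budget
        self.window_s = window_s
        self.events = deque()  # (timestamp, is_optional) pairs
        self.lock = threading.Lock()

    def record(self, is_optional, now=None):
        now = time.monotonic() if now is None else now
        with self.lock:
            self.events.append((now, is_optional))
            self._evict(now)

    def optional_allowed(self, now=None):
        """True while optional work stays within its budgeted share."""
        now = time.monotonic() if now is None else now
        with self.lock:
            self._evict(now)
            if not self.events:
                return True
            optional = sum(1 for _, opt in self.events if opt)
            return optional / len(self.events) <= self.budget

    def _evict(self, now):
        # drop samples that fell out of the sliding window
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()
```

Callers check `optional_allowed()` before entering a P2 code path and `record()` what they actually did, so the budget enforces itself without manual toggling.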
Minimal implementation blueprint
1) Admission at the edge
At gateway/ingress:
- enforce global + per-tenant token buckets,
- reserve capacity slices for P0 traffic,
- reject excess with explicit 429 or 503 + Retry-After.
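Edge admission with a protected P0 slice can be sketched as a single token bucket whose bottom slice only critical traffic may consume. A minimal sketch, assuming one bucket per gateway instance; the rate and the 30% reserve are illustrative, not recommendations.

```python
import time

class PriorityAdmission:
    """Hypothetical sketch: global token bucket with a capacity slice
    reserved for P0 traffic. Non-P0 requests may not drain the reserve."""

    def __init__(self, rate_per_s=100.0, p0_reserve=0.30, now=None):
        self.rate = rate_per_s
        self.capacity = rate_per_s            # one second of burst
        self.reserve = p0_reserve * self.capacity  # P0-only slice
        self.tokens = self.capacity
        self.last = time.monotonic() if now is None else now

    def _refill(self, now):
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

    def admit(self, priority, now=None):
        """priority: 0 for P0 (critical), 1+ for everything else."""
        now = time.monotonic() if now is None else now
        self._refill(now)
        floor = 0.0 if priority == 0 else self.reserve
        if self.tokens - 1.0 >= floor:
            self.tokens -= 1.0
            return True   # accepted
        return False      # caller responds 429/503 + Retry-After
```

Per-tenant fairness layers a second, smaller bucket per tenant in front of this global one.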
2) In-service concurrency guard
Per instance:
- cap in-flight requests,
- fail fast when at cap,
- prefer bounded queues over unbounded wait.
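The in-flight cap above can be as small as a non-blocking semaphore wrapped in a context manager. A sketch, assuming per-instance enforcement; the cap of 64 and the exception name are assumptions.

```python
import threading
from contextlib import contextmanager

class OverloadError(RuntimeError):
    """Raised when the instance is already at its in-flight cap."""

class ConcurrencyGuard:
    """Hypothetical sketch: caps in-flight requests per instance and
    fails fast at the cap instead of queueing unboundedly."""

    def __init__(self, max_in_flight=64):  # assumed cap
        self.sem = threading.Semaphore(max_in_flight)

    @contextmanager
    def slot(self):
        # blocking=False: reject immediately rather than wait at the cap
        if not self.sem.acquire(blocking=False):
            raise OverloadError("at capacity, shed this request")
        try:
            yield
        finally:
            self.sem.release()
```

Request handlers wrap their work in `with guard.slot():` and translate `OverloadError` into a 503 at the framework boundary.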
3) Brownout toggles per dependency
For expensive downstream calls:
- allow fallback mode (cached/stale/default response),
- short-circuit non-critical fan-outs,
- set smaller timeout budgets during stress.
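One way to sketch such a dependency toggle: under brownout, skip the live call and serve stale or default data; on timeout, short-circuit to the same fallback. The `fetch_live` callable, the cache dict, and the 0.8 s timeout are hypothetical.

```python
def get_enrichment(key, fetch_live, cache, brownout_active, timeout_s=0.8):
    """Hypothetical sketch of a brownout toggle around an expensive
    downstream call. timeout_s would shrink further under stress."""
    default = {"items": [], "degraded": True}
    if brownout_active:
        # brownout: never pay for the live call, serve stale/default
        return cache.get(key, default)
    try:
        value = fetch_live(key, timeout=timeout_s)
        cache[key] = value
        return value
    except TimeoutError:
        # short-circuit to stale data instead of propagating the stall
        return cache.get(key, default)
```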
4) Retry discipline
- one retry owner only,
- retry budget (e.g., <= 20% of original request volume),
- exponential backoff + jitter,
- no retries when the response already signals overload (429/503).
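The retry rules above can be sketched as a budget check plus full-jitter backoff. A simplified sketch: the 20% ratio matches the example, but the process-lifetime counters (instead of a sliding window) and the constants in `backoff_s` are assumptions.

```python
import random

class RetryBudget:
    """Hypothetical sketch: caps retries at a fixed fraction of
    original request volume, owned by a single retry layer."""

    def __init__(self, ratio=0.20):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self, status):
        # never retry responses that already signal overload
        if status in (429, 503):
            return False
        if self.requests == 0:
            return False
        return (self.retries + 1) / self.requests <= self.ratio

    def record_retry(self):
        self.retries += 1

def backoff_s(attempt, base=0.1, cap=5.0):
    """Exponential backoff with full jitter (assumed base/cap)."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```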
What to degrade first (high ROI)
- N+1 enrichments (recommendations, badges, social counts)
- Large page-size queries (force smaller limits)
- Synchronous analytics writes (switch to async buffering)
- Expensive consistency modes where eventual consistency is acceptable
- Read-after-write strictness for non-critical screens
Many teams recover 30-60% headroom by trimming these before touching core transaction paths.
Overload signals to trigger brownout
Use multiple signals (not one):
- queue depth or in-flight saturation,
- p95/p99 latency vs baseline,
- dependency timeout/error ratio,
- CPU/memory pressure,
- rejection rate trend.
Trigger policy example:
- Yellow when any 2 signals breach for 2 minutes,
- Orange when any 3 signals breach for 1 minute,
- Red when the P0 latency SLO is at risk.
Hysteresis matters: use different enter/exit thresholds to prevent flap.
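A toy version of the trigger with hysteresis: enter a level on the breach counts from the example above, but only step back down once strictly fewer signals breach. The exit thresholds are assumptions, and breach durations (the 1- and 2-minute windows) are omitted for brevity.

```python
# Hypothetical sketch of level transitions with hysteresis.
LEVELS = ["green", "yellow", "orange", "red"]
ENTER = {"yellow": 2, "orange": 3, "red": 4}  # signals breaching to enter
EXIT = {"yellow": 1, "orange": 2, "red": 3}   # must drop below to leave

def next_level(current, breaching):
    """breaching: number of overload signals currently past threshold."""
    idx = LEVELS.index(current)
    # escalate to the highest level whose entry condition is met
    for level in reversed(LEVELS[1:]):
        if breaching >= ENTER[level] and LEVELS.index(level) > idx:
            return level
    # de-escalate one step only once below the exit threshold (hysteresis)
    if idx > 0 and breaching < EXIT[current]:
        return LEVELS[idx - 1]
    return current
```

Because enter and exit thresholds differ, a signal count hovering at the entry boundary cannot flap the system between levels.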
Client-facing behavior (important)
Graceful degradation should feel intentional:
- return partial data with explicit metadata (degraded=true),
- preserve response shape when possible,
- give actionable error semantics (429 with retry hints),
- avoid silent timeouts.
A clear, smaller response is better than a full response that arrives too late.
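A hypothetical response builder showing the intent: the degraded response keeps the same top-level shape as the full one, with degradation made explicit in metadata. Field names here are illustrative, not a prescribed schema.

```python
def build_response(core, optional=None, degraded=False, retry_after_s=None):
    """Hypothetical sketch: stable response shape whether or not
    optional enrichments were shed."""
    resp = {
        "data": core,                                   # P0 payload
        "optional": optional if optional is not None else {},
        "meta": {"degraded": degraded},                 # explicit signal
    }
    if retry_after_s is not None:
        # surfaced to clients alongside a 429/503 status
        resp["meta"]["retry_after_s"] = retry_after_s
    return resp
```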
Observability checklist
Track per priority class (P0/P1/P2):
- accepted vs shed request rate,
- in-flight and queue wait,
- brownout level transitions,
- fallback hit rate,
- success latency during each level,
- retry amplification factor.
Key question during incidents:
"Did shedding improve P0 latency and success?" If yes, policy works.
Safe rollout sequence
- Define criticality tags for all major endpoints.
- Add no-op brownout levels (observe only).
- Turn on Yellow policies in one service.
- Run game day with synthetic overload.
- Add Orange policies + retry budget enforcement.
- Document Red runbook with explicit manual override.
Do not launch all levels at once fleet-wide.
Common failure modes
- No priority separation → important traffic gets dropped along with bulk traffic.
- Unbounded queues → latency death spiral before shedding activates.
- Aggressive client retries → overload gets worse even as the server sheds.
- Toggle sprawl → operators cannot reason quickly under pressure.
- No drills → policy exists on paper but fails during real incidents.
Decision cheat sheet
- Need to preserve core transaction path under burst? → Brownout + priority admission.
- Tail latency exploding during dependency slowdown? → Shorter deadlines + optional fan-out cut.
- Frequent retry storms? → Retry budget + overload-aware no-retry signals.
- Multi-tenant fairness required? → Per-tenant quotas + protected P0 reserve.
Bottom line: brownout is controlled quality reduction that prevents uncontrolled availability collapse.
References (researched)
- AWS Builders Library - Using load shedding to avoid overload:
  https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/
- Google SRE Book - Handling Overload:
  https://sre.google/sre-book/handling-overload/
- Google SRE Book - Addressing Cascading Failures:
  https://sre.google/sre-book/addressing-cascading-failures/
- Envoy docs - Overload Manager:
  https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/overload_manager