Circuit Breaker Half-Open Tuning Playbook (Practical)
Why this matters
Most teams add circuit breakers, but leave half-open behavior at defaults. That creates two classic failures:
- Flap loops: breaker opens/closes repeatedly under unstable dependency latency.
- Recovery drag: dependency is healthy again, but traffic recovers too slowly.
Half-open tuning is the bridge between protection and fast recovery.
Mental model
Treat a circuit breaker as a state machine that tracks confidence in the dependency:
- Closed: normal operation, collect rolling stats.
- Open: hard reject/fallback for a cool-down period.
- Half-Open: controlled probe mode to estimate current health.
Half-open is not "try again randomly." It is a bounded experiment.
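The three states above can be sketched as a small state machine. This is a minimal illustration, not a production breaker: the class name `Breaker`, its parameters, and the injectable `clock` are all assumptions made for the example, and the rolling-stats logic that decides when to trip is left out.

```python
import enum
import time


class State(enum.Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class Breaker:
    """Toy three-state breaker; names and defaults are illustrative only."""

    def __init__(self, open_wait_s=3.0, min_samples=10, success_rate=0.95,
                 clock=time.monotonic):
        self.open_wait_s = open_wait_s
        self.min_samples = min_samples
        self.success_rate = success_rate
        self.clock = clock
        self.state = State.CLOSED
        self.opened_at = 0.0
        self.probes = []  # probe outcomes collected while half-open

    def trip(self):
        """Move to Open (the rolling-stats check that calls this is omitted)."""
        self.state = State.OPEN
        self.opened_at = self.clock()

    def allow(self):
        """Return True if a request may be sent to the dependency."""
        cooled_down = self.clock() - self.opened_at >= self.open_wait_s
        if self.state is State.OPEN and cooled_down:
            self.state = State.HALF_OPEN
            self.probes = []
        return self.state is not State.OPEN

    def record_probe(self, ok):
        """Half-open only: decide once the bounded sample is complete."""
        if self.state is not State.HALF_OPEN:
            return
        self.probes.append(ok)
        if len(self.probes) >= self.min_samples:
            rate = sum(self.probes) / len(self.probes)
            if rate >= self.success_rate:
                self.state = State.CLOSED
            else:
                self.trip()
```

The probe window is what makes half-open a bounded experiment: no decision is taken until `min_samples` outcomes are in, and a failed experiment restarts the cool-down.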
Core knobs (and practical defaults)
1) Open duration (openWaitMs)
How long to stay open before probing.
- Too short → noise-driven flapping.
- Too long → unnecessary user-facing degradation.
Start with:
- openWaitMs = max(3 * p95_latency_ms, 1000) for low-latency services
- openWaitMs = max(2 * p95_latency_ms, 3000) for heavier downstreams
Then adjust by observed flap rate.
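The two starting formulas can be written as one helper. The function name and the `heavy_downstream` flag are assumptions for illustration. Which dependencies count as "heavier" is a judgment call the playbook leaves to the reader.

```python
def open_wait_ms(p95_latency_ms: float, heavy_downstream: bool = False) -> float:
    """Starting value for openWaitMs, before tuning against observed flap rate."""
    if heavy_downstream:
        return max(2 * p95_latency_ms, 3000)
    return max(3 * p95_latency_ms, 1000)
```

For example, a low-latency service with a p95 of 120 ms sits on the 1000 ms floor, while a p95 of 500 ms gives 1500 ms.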
2) Probe concurrency (halfOpenMaxInFlight)
Max concurrent trial requests in half-open.
Start with:
- 1 for fragile dependencies
- 2–5 for stable, horizontally scaled dependencies
Rule: if a higher layer already retries, keep half-open concurrency lower, because each retry multiplies the probe load.
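One way to enforce the in-flight cap is a non-blocking semaphore: a probe that cannot get a slot is rejected (or served from fallback) rather than queued. The `ProbeGate` class is a hypothetical name for this sketch.

```python
import threading


class ProbeGate:
    """Caps concurrent half-open probes; callers that get False should fall back."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_acquire(self) -> bool:
        # Non-blocking: never queue requests behind a possibly unhealthy dependency.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        # Call once the probe finishes, whether it succeeded or failed.
        self._slots.release()
```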
3) Probe sample size (halfOpenMinSamples)
How many probe outcomes before close/open decision.
Start with 20–50 requests (or 10 for low-QPS paths).
Decision quality is poor below 10 unless traffic is very deterministic.
4) Success threshold (halfOpenSuccessRate)
Min success ratio in probe window to close breaker.
Start with:
- 95% for user-critical paths
- 90% for tolerant async paths
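Sample size and success threshold combine into a single decision: wait until enough probes are in, then compare the success ratio to the threshold. The sketch below assumes probe outcomes are recorded as booleans; the function name and return values are illustrative.

```python
from typing import Optional, Sequence


def probe_decision(outcomes: Sequence[bool],
                   min_samples: int = 20,
                   success_rate: float = 0.95) -> Optional[str]:
    """Return "close", "open", or None while the sample is still too small."""
    if len(outcomes) < min_samples:
        return None  # keep probing; deciding now would be noise-driven
    rate = sum(outcomes) / len(outcomes)
    return "close" if rate >= success_rate else "open"
```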
5) Slow-call threshold (slowCallRate)
Count near-timeouts as failures, not only hard errors.
If timeout is 2s, mark >1.2s as "slow" and include in open/half-open criteria. This catches brownouts earlier.
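A minimal sketch of the slow-call rule, assuming each call is recorded as an `(ok, duration_s)` pair. The 1.2 s default mirrors the example above; the function names are hypothetical.

```python
from typing import Iterable, Tuple


def counts_as_failure(ok: bool, duration_s: float,
                      slow_threshold_s: float = 1.2) -> bool:
    """A call is bad if it errored or if it succeeded but was slow."""
    return (not ok) or duration_s > slow_threshold_s


def bad_call_rate(calls: Iterable[Tuple[bool, float]],
                  slow_threshold_s: float = 1.2) -> float:
    """Fraction of calls that failed or were slow; 0.0 for an empty window."""
    calls = list(calls)
    if not calls:
        return 0.0
    bad = sum(counts_as_failure(ok, d, slow_threshold_s) for ok, d in calls)
    return bad / len(calls)
```

In a brownout, most calls still return successfully, so a plain error rate stays near zero while this combined rate rises.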
Recommended policy by path type
User-facing synchronous API
- openWaitMs: 2–5s
- halfOpenMaxInFlight: 1–2
- halfOpenMinSamples: 30
- halfOpenSuccessRate: 95%
- aggressive slow-call detection
Async worker dependency
- openWaitMs: 5–15s
- halfOpenMaxInFlight: 3–10
- halfOpenMinSamples: 50
- halfOpenSuccessRate: 90%
- allow longer timeout, still track slow calls
Third-party rate-limited API
- openWaitMs: align with provider reset windows
- halfOpenMaxInFlight: 1
- halfOpenMinSamples: 10–20
- halfOpenSuccessRate: 95%
- use jitter + token bucket to avoid synchronized thundering
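These policies could be captured as per-path presets. In the sketch below, the class and key names are assumptions. Where the playbook gives a range, the code picks one value inside it as a starting point. The third-party `open_wait_ms` is left unset because it depends on the provider's reset window.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HalfOpenPolicy:
    open_wait_ms: Optional[int]  # None: derive from the provider's reset window
    half_open_max_in_flight: int
    half_open_min_samples: int
    half_open_success_rate: float


# Starting points chosen from within the recommended ranges; tune per dependency.
PRESETS = {
    "user_facing_sync": HalfOpenPolicy(3000, 1, 30, 0.95),
    "async_worker": HalfOpenPolicy(10000, 3, 50, 0.90),
    "third_party_rate_limited": HalfOpenPolicy(None, 1, 10, 0.95),
}
```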
Anti-flapping guardrails
Hysteresis thresholds
- Open if failure/slow rate > 50%
- Close only if failure/slow rate < 10% for probe window
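The gap between the two thresholds is what prevents flapping: a bad-call rate of, say, 30% neither opens a closed breaker nor closes a probing one. A minimal sketch, with state names as plain strings and a hypothetical function name:

```python
def next_state(state: str, bad_rate: float,
               open_above: float = 0.50, close_below: float = 0.10) -> str:
    """Apply hysteresis: different thresholds for opening and closing."""
    if state == "closed" and bad_rate > open_above:
        return "open"
    if state == "half-open":
        # bad_rate here is measured over the completed probe window.
        return "closed" if bad_rate < close_below else "open"
    return state
```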
Minimum state dwell time
- Stay closed for at least N seconds before opening again, unless the error rate is catastrophic.
Jittered probe scheduling
- Randomize probe start by ±10–20% to avoid herd behavior across pods.
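Jitter can be applied directly to the open wait, so each pod leaves the open state at a slightly different time. A small sketch, assuming a symmetric uniform jitter. The 15% default is an arbitrary pick inside the 10–20% range, and the optional `rng` argument exists only to make the function testable.

```python
import random
from typing import Optional


def jittered_wait_ms(open_wait_ms: float, jitter: float = 0.15,
                     rng: Optional[random.Random] = None) -> float:
    """Scale the wait by a random factor in [1 - jitter, 1 + jitter]."""
    source = rng if rng is not None else random
    return open_wait_ms * (1 + source.uniform(-jitter, jitter))
```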
Per-error-class weighting
- Weight connection resets and timeouts more heavily than 5xx errors that return quickly.
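Weighting can be as simple as a lookup table applied before computing the failure rate. The class names and the 2:1 weights below are assumptions for illustration only; the playbook does not prescribe specific values.

```python
# Hypothetical weights: transport-level failures count double.
WEIGHTS = {
    "ok": 0.0,
    "fast_5xx": 1.0,
    "timeout": 2.0,
    "connection_reset": 2.0,
}


def weighted_failure_rate(outcomes):
    """Weighted failures per call, capped at 1.0; 0.0 for an empty window."""
    if not outcomes:
        return 0.0
    score = sum(WEIGHTS[o] for o in outcomes) / len(outcomes)
    return min(1.0, score)
```

With these weights, two timeouts in ten calls score 0.4, while two fast 5xx responses score 0.2.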
Observability: what to chart
Track per dependency:
- breaker_state (closed/open/half-open)
- transitions count and reason
- probe success rate
- half-open probe latency distribution
- fallback rate
- user-visible error rate and latency
Key SLO-like indicators:
- Flap Rate: transitions/hour
- Recovery Time: open → stable closed duration
- False Open Ratio: opened without meaningful user-level impact
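Flap rate and recovery time can both be derived from a log of state transitions. The sketch below assumes the log is a time-ordered list of `(timestamp_s, new_state)` pairs, and it treats a closed state as "stable" once it has lasted at least `stable_s` seconds. That definition of stable is an assumption and should be tuned to the service.

```python
def flap_rate_per_hour(transitions, window_s: float) -> float:
    """Transitions per hour over an observation window of window_s seconds."""
    return len(transitions) * 3600.0 / window_s


def recovery_times(transitions, stable_s: float, end_ts: float):
    """Seconds from first open to the start of the next stable closed period."""
    result = []
    opened_at = None
    for i, (ts, state) in enumerate(transitions):
        if state == "open" and opened_at is None:
            opened_at = ts  # start of an outage episode
        elif state == "closed" and opened_at is not None:
            # How long did this closed period last?
            next_ts = transitions[i + 1][0] if i + 1 < len(transitions) else end_ts
            if next_ts - ts >= stable_s:
                result.append(ts - opened_at)
                opened_at = None
    return result
```

A re-open shortly after closing does not end the episode, so flapping shows up as one long recovery rather than several short ones.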
Runbook snippet
When flap rate spikes:
- Verify dependency latency/error regime changed.
- Increase openWaitMs by 1.5x.
- Reduce halfOpenMaxInFlight by 1 step.
- Tighten slow-call detection if brownout observed.
- Re-evaluate after 30–60 minutes.
When recovery is too slow:
- Increase halfOpenMaxInFlight slightly.
- Reduce openWaitMs by 20%.
- Keep success threshold unchanged first.
- Confirm user error rate does not regress.
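The numeric parts of both runbook branches can be sketched as two small helpers. The function names are hypothetical. "1 step" and "slightly" are interpreted here as a change of 1, which is an assumption, and probe concurrency is never reduced below 1.

```python
def tune_for_flapping(open_wait_ms: int, max_in_flight: int):
    """Flap rate spiked: wait longer and probe more gently."""
    return round(open_wait_ms * 1.5), max(1, max_in_flight - 1)


def tune_for_slow_recovery(open_wait_ms: int, max_in_flight: int):
    """Recovery too slow: probe sooner and a little wider.

    The success threshold is deliberately left untouched.
    """
    return round(open_wait_ms * 0.8), max_in_flight + 1
```

Apply one adjustment at a time and re-check the dashboards before applying another, so that each change can be attributed.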
Implementation checklist
- Breaker parameters are per dependency, not global defaults.
- Slow-call tracking enabled and surfaced.
- Half-open probe limits and sample size explicitly configured.
- Jitter + hysteresis added.
- Dashboard includes flap rate and recovery time.
- Runbook and owner assigned.
One-line takeaway
A circuit breaker is only as good as its half-open tuning: treat recovery as controlled experimentation, not hope-driven retries.