Circuit Breaker Half-Open Tuning Playbook (Practical)

2026-02-23 · software

Why this matters

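Most teams add circuit breakers but leave half-open behavior at defaults. That creates two classic failures: flapping (the breaker closes too early and trips again) and slow recovery (the breaker stays open long after the dependency is healthy).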

Half-open tuning is the bridge between protection and fast recovery.


Mental model

Treat a circuit breaker as a state machine with confidence:

Half-open is not "try again randomly." It is a bounded experiment.
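
To make the "bounded experiment" idea concrete, here is a minimal sketch of the state machine. The class name, method names, and default values are illustrative assumptions, not taken from any particular library; a real breaker would also track failures in the closed state and cap probe concurrency.

```python
import enum
import time


class State(enum.Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class Breaker:
    """Toy breaker. Names and defaults are assumptions for illustration."""

    def __init__(self, open_wait_s=10.0, min_samples=20, success_rate=0.9,
                 clock=time.monotonic):
        self.open_wait_s = open_wait_s
        self.min_samples = min_samples
        self.success_rate = success_rate
        self.clock = clock
        self.state = State.CLOSED
        self.opened_at = 0.0
        self.probe_results = []

    def trip(self):
        """Move to open and restart the open-duration timer."""
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.probe_results = []

    def allow_request(self):
        """Open -> half-open once the wait has elapsed; reject while open."""
        if (self.state is State.OPEN
                and self.clock() - self.opened_at >= self.open_wait_s):
            self.state = State.HALF_OPEN  # the bounded experiment starts
        return self.state is not State.OPEN

    def record_probe(self, ok):
        """In half-open, collect min_samples outcomes, then decide once."""
        if self.state is not State.HALF_OPEN:
            return
        self.probe_results.append(ok)
        if len(self.probe_results) < self.min_samples:
            return
        rate = sum(self.probe_results) / len(self.probe_results)
        if rate >= self.success_rate:
            self.state = State.CLOSED
            self.probe_results = []
        else:
            self.trip()
```

The point of the sketch is that half-open makes exactly one decision per probe window, from a fixed sample, rather than reacting to each individual request.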


Core knobs (and practical defaults)

1) Open duration (openWaitMs)

How long to stay open before probing.

Start with:

Then adjust by observed flap rate.

2) Probe concurrency (halfOpenMaxInFlight)

Max concurrent trial requests in half-open.

Start with:

Rule: if a higher layer already retries, keep half-open concurrency lower, because those retries multiply the load that reaches the recovering dependency.
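
One way to enforce the cap is a small non-blocking gate, sketched below. `ProbeGate` and its methods are hypothetical names; the assumption is that requests over the cap are rejected immediately rather than queued.

```python
import threading


class ProbeGate:
    """Caps concurrent half-open probes; requests over the cap are rejected."""

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Return True if this request may run as a probe."""
        with self._lock:
            if self._in_flight >= self._max:
                return False  # fail fast instead of queueing behind probes
            self._in_flight += 1
            return True

    def release(self):
        """Call when a probe finishes, whether it succeeded or failed."""
        with self._lock:
            self._in_flight = max(0, self._in_flight - 1)
```

Callers would wrap each probe in `try_acquire()` and `release()` (ideally in a `try`/`finally`), and treat a `False` result the same as an open breaker.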

3) Probe sample size (halfOpenMinSamples)

How many probe outcomes before close/open decision.

Start with 20–50 requests (or 10 for low-QPS paths).

Decision quality is poor below 10 unless traffic is very deterministic.
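
A quick binomial calculation shows why small samples decide badly. The sketch below assumes independent probes and uses made-up numbers: a dependency that is still unhealthy (70% true success) and a 90% close threshold.

```python
from math import comb


def chance_of_false_close(n, true_rate, threshold):
    """P(observed success ratio >= threshold) over n independent probes."""
    return sum(
        comb(n, k) * true_rate**k * (1 - true_rate) ** (n - k)
        for k in range(n + 1)
        if k / n >= threshold
    )


# Assumed scenario: 70% true success rate, breaker closes at >= 90% observed.
for n in (5, 10, 20, 50):
    p = chance_of_false_close(n, true_rate=0.7, threshold=0.9)
    print(f"n={n:>2}: closes by luck with probability {p:.3f}")
```

Under these assumptions the breaker closes by luck roughly one time in six at 5 probes and still about 15% of the time at 10; the chance only drops to a few percent at around 20. Real traffic is rarely independent, so treat this as a lower bound on how noisy small windows are.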

4) Success threshold (halfOpenSuccessRate)

Min success ratio in probe window to close breaker.

Start with:

5) Slow-call threshold (slowCallRate)

Count near-timeouts as failures, not only hard errors.

If the timeout is 2 s, mark calls slower than 1.2 s as "slow" and include them in the open and half-open criteria. This catches brownouts earlier.
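
A minimal sketch of that classification, using the 2 s timeout and 1.2 s cutoff from the example above. `CallResult`, `counts_as_failure`, and the `slow_fraction` parameter are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallResult:
    duration_s: float
    error: bool  # hard failure: exception, 5xx, or timeout


def counts_as_failure(result, timeout_s=2.0, slow_fraction=0.6):
    """A call counts against the breaker if it errored or ran near the timeout."""
    slow_cutoff_s = timeout_s * slow_fraction  # 2.0 s * 0.6 = 1.2 s
    return result.error or result.duration_s > slow_cutoff_s


def failure_or_slow_rate(results, timeout_s=2.0, slow_fraction=0.6):
    """Combined failure/slow ratio to feed into open and half-open decisions."""
    if not results:
        return 0.0
    bad = sum(counts_as_failure(r, timeout_s, slow_fraction) for r in results)
    return bad / len(results)
```

Expressing the cutoff as a fraction of the timeout keeps the two values in step when the timeout is retuned.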


Recommended policy by path type

User-facing synchronous API

Async worker dependency

Third-party rate-limited API


Anti-flapping guardrails

  1. Hysteresis thresholds

    • Open if failure/slow rate > 50%
    • Close only if failure/slow rate < 10% for probe window
  2. Minimum state dwell time

    • Stay closed for at least N seconds before opening again, unless the error rate is catastrophic.
  3. Jittered probe scheduling

    • Randomize probe start by ±10–20% to avoid herd behavior across pods.
  4. Per-error-class weighting

    • Weight connection resets and timeouts more heavily than 5xx responses that return quickly.
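
Three of these guardrails (hysteresis, jitter, and per-error-class weighting) can be sketched in a few lines. The 50% and 10% thresholds come from the list above; the weight values, the 15% jitter, and all function names are assumptions for illustration. Minimum dwell time is omitted to keep the sketch short.

```python
import random

OPEN_ABOVE = 0.50   # open if failure/slow rate exceeds this
CLOSE_BELOW = 0.10  # close only if failure/slow rate is below this

# Hypothetical weights: transport-level failures count double.
ERROR_WEIGHTS = {
    "ok": 0.0,
    "http_5xx_fast": 1.0,
    "timeout": 2.0,
    "connection_reset": 2.0,
}


def weighted_bad_rate(outcomes):
    """Weighted failure score divided by call count, capped at 1.0."""
    if not outcomes:
        return 0.0
    score = sum(ERROR_WEIGHTS[o] for o in outcomes)
    return min(1.0, score / len(outcomes))


def next_is_open(currently_open, bad_rate):
    """Hysteresis: rates between the two thresholds keep the current state."""
    if bad_rate > OPEN_ABOVE:
        return True
    if bad_rate < CLOSE_BELOW:
        return False
    return currently_open


def jittered_wait_ms(open_wait_ms, jitter=0.15, rng=random):
    """Shift the probe start by up to +/- jitter so pods do not probe in lockstep."""
    return open_wait_ms * (1.0 + rng.uniform(-jitter, jitter))
```

The gap between the two thresholds is what prevents flapping: a dependency hovering at 30% failures neither trips a closed breaker nor closes an open one.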

Observability: what to chart

Track per dependency:

Key SLO-like indicators:


Runbook snippet

When flap rate spikes:

  1. Verify that the dependency's latency/error regime has actually changed.
  2. Increase openWaitMs by 1.5x.
  3. Reduce halfOpenMaxInFlight by 1 step.
  4. Tighten slow-call detection if a brownout is observed.
  5. Re-evaluate after 30–60 minutes.

When recovery is too slow:

  1. Increase halfOpenMaxInFlight slightly.
  2. Reduce openWaitMs by 20%.
  3. Keep success threshold unchanged first.
  4. Confirm user error rate does not regress.
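
The numeric steps in both runbooks can be captured as pure functions over the config, which makes each change easy to review and roll back. This is a sketch: the dataclass and function names are hypothetical, "1 step" is assumed to mean one request, and "slightly" is assumed to mean +1.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class HalfOpenConfig:
    open_wait_ms: int
    half_open_max_in_flight: int
    half_open_success_rate: float


def on_flap_spike(cfg):
    """Back off: wait 1.5x longer and probe with one fewer concurrent request."""
    return replace(
        cfg,
        open_wait_ms=round(cfg.open_wait_ms * 1.5),
        half_open_max_in_flight=max(1, cfg.half_open_max_in_flight - 1),
    )


def on_slow_recovery(cfg):
    """Probe 20% sooner and one request wider; success threshold is unchanged."""
    return replace(
        cfg,
        open_wait_ms=round(cfg.open_wait_ms * 0.8),
        half_open_max_in_flight=cfg.half_open_max_in_flight + 1,
    )
```

For example, starting from an assumed config of `HalfOpenConfig(10000, 3, 0.9)`, `on_flap_spike` yields a 15000 ms wait with 2 probes in flight, and `on_slow_recovery` yields an 8000 ms wait with 4. The manual steps (verifying the regime change, watching user error rate) still need a human.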

Implementation checklist


One-line takeaway

A circuit breaker is only as good as its half-open tuning: treat recovery as controlled experimentation, not hope-driven retries.