Circuit Breaker Half-Open Tuning Playbook (Practical)

Why this matters

Most teams add circuit breakers but leave half-open behavior at defaults. That creates two classic failures:

Flap loops: breaker opens/closes repeatedly under unstable dependency latency.
Recovery drag: dependency is healthy again, but traffic recovers too slowly.

Half-open tuning is the bridge between protection and fast recovery.

Mental model

Treat a circuit breaker as a state machine that tracks how much confidence you have in the dependency:

Closed: normal operation, collect rolling stats.
Open: hard reject/fallback for a cool-down period.
Half-Open: controlled probe mode to estimate current health.

Half-open is not “try again randomly.” It is a bounded experiment.
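The three states can be sketched as a small state machine. This is a minimal illustration, not a production breaker: the class and method names (`Breaker`, `trip`, `allow_request`, `record_probe`) are hypothetical, the closed-state failure tracking is left out (the caller invokes `trip()`), and the probe rule is simplified to "N consecutive successes close, any failure reopens".

```python
import enum
import time


class State(enum.Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class Breaker:
    """Minimal three-state breaker; parameters are illustrative placeholders."""

    def __init__(self, open_wait_s=3.0, probe_samples=3, clock=time.monotonic):
        self.state = State.CLOSED
        self.open_wait_s = open_wait_s
        self.probe_samples = probe_samples
        self.clock = clock  # injectable so the cool-down can be tested
        self.opened_at = 0.0
        self.probe_successes = 0

    def trip(self):
        """Enter Open and start the cool-down timer."""
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.probe_successes = 0

    def allow_request(self):
        """Reject while Open; move to Half-Open once the cool-down has elapsed."""
        if self.state is State.OPEN and self.clock() - self.opened_at >= self.open_wait_s:
            self.state = State.HALF_OPEN
        return self.state is not State.OPEN

    def record_probe(self, ok):
        """Half-Open only: any failure reopens, enough successes close."""
        if self.state is not State.HALF_OPEN:
            return
        if not ok:
            self.trip()
            return
        self.probe_successes += 1
        if self.probe_successes >= self.probe_samples:
            self.state = State.CLOSED
```

Note that `allow_request` only answers "may a request pass at all"; capping how many probes run at once in half-open is a separate concern.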

Core knobs (and practical defaults)

1) Open duration (`openWaitMs`)

How long to stay open before probing.

Too short → noise-driven flapping.
Too long → unnecessary user-facing degradation.

Start with:

openWaitMs = max(3 * p95_latency_ms, 1000) for low-latency services
openWaitMs = max(2 * p95_latency_ms, 3000) for heavier downstreams

Then adjust by observed flap rate.
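The two starting formulas translate directly into a helper. The function name and the `heavy_downstream` flag are assumptions for illustration; the constants are the ones given above.

```python
def open_wait_ms(p95_latency_ms: float, heavy_downstream: bool = False) -> float:
    """Starting value for openWaitMs; tune afterwards by observed flap rate."""
    if heavy_downstream:
        return max(2 * p95_latency_ms, 3000)
    return max(3 * p95_latency_ms, 1000)


print(open_wait_ms(200))         # 1000 (the 1s floor wins over 3 * 200)
print(open_wait_ms(500))         # 1500
print(open_wait_ms(2000, True))  # 4000
```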

2) Probe concurrency (`halfOpenMaxInFlight`)

Max concurrent trial requests in half-open.

Start with:

1 for fragile dependencies
2–5 for stable, horizontally scaled dependencies

Rule: if a higher layer already retries, keep half-open concurrency lower, because each retry adds to the probe load on the dependency.
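One way to enforce the in-flight cap is a non-blocking semaphore: a caller that cannot get a slot falls back immediately instead of queueing. `ProbeGate` is a hypothetical name, and rejecting rather than queueing is an assumption of this sketch.

```python
import threading


class ProbeGate:
    """Caps concurrent half-open probes; extra callers are rejected, not queued."""

    def __init__(self, max_in_flight: int = 1):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_acquire(self) -> bool:
        # Non-blocking: returns False at once when all probe slots are taken.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        # Call in a finally block once the probe has completed.
        self._slots.release()
```

Typical use: `if gate.try_acquire(): try: call_dependency() finally: gate.release()`, otherwise serve the fallback.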

3) Probe sample size (`halfOpenMinSamples`)

How many probe outcomes to collect before deciding to close or reopen.

Start with 20–50 requests (or 10 for low-QPS paths).

Decision quality is poor below 10 unless traffic is very deterministic.

4) Success threshold (`halfOpenSuccessRate`)

Min success ratio in probe window to close breaker.

Start with:

95% for user-critical paths
90% for tolerant async paths
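Sample size and success threshold work together: no decision until enough probes are in, then compare the ratio. A minimal sketch, with a hypothetical function name and the defaults taken from the user-critical starting values:

```python
from typing import Optional


def probe_decision(successes: int, failures: int,
                   min_samples: int = 30,
                   success_rate: float = 0.95) -> Optional[str]:
    """Return 'close', 'open', or None while the probe window is still filling."""
    total = successes + failures
    if total < min_samples:
        return None  # not enough evidence yet; stay half-open
    return "close" if successes / total >= success_rate else "open"
```

With the defaults, 29 successes out of 30 (about 96.7%) closes the breaker, while 28 out of 30 (about 93.3%) reopens it.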

5) Slow-call threshold (`slowCallRate`)

Count near-timeouts as failures, not only hard errors.

If the timeout is 2s, mark calls slower than 1.2s (60% of the timeout) as “slow” and include them in the open and half-open criteria. This catches brownouts earlier.
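A sketch of that classification, assuming the slow cutoff is expressed as a fraction of the timeout (0.6 reproduces the 2s / 1.2s example). The function names and the `slow_fraction` parameter are illustrative, not part of any particular library.

```python
def classify_call(duration_s: float, error: bool,
                  timeout_s: float = 2.0, slow_fraction: float = 0.6) -> str:
    """Label one call as 'failure', 'slow', or 'ok'."""
    if error or duration_s >= timeout_s:
        return "failure"
    if duration_s > slow_fraction * timeout_s:
        return "slow"  # near-timeout: counted against the dependency
    return "ok"


def bad_rate(labels: list) -> float:
    """Share of calls that were failures or slow; feeds open/half-open criteria."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label != "ok") / len(labels)
```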

Recommended policy by path type

User-facing synchronous API

openWaitMs: 2–5s
halfOpenMaxInFlight: 1–2
halfOpenMinSamples: 30
halfOpenSuccessRate: 95%
aggressive slow-call detection

Async worker dependency

openWaitMs: 5–15s
halfOpenMaxInFlight: 3–10
halfOpenMinSamples: 50
halfOpenSuccessRate: 90%
allow longer timeout, still track slow calls

Third-party rate-limited API

openWaitMs: align with provider reset windows
halfOpenMaxInFlight: 1
halfOpenMinSamples: 10–20
halfOpenSuccessRate: 95%
use jitter + token bucket to avoid synchronized thundering
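The three profiles can be kept as data and used to sanity-check real configs. This is a sketch: the `HalfOpenPolicy` type, the dictionary keys, and the `out_of_range` helper are hypothetical names; the ranges are the ones listed above, with `None` standing in for "align with provider reset windows".

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class HalfOpenPolicy:
    open_wait_ms: Optional[Tuple[int, int]]  # None: derive from provider reset window
    max_in_flight: Tuple[int, int]
    min_samples: Tuple[int, int]
    success_rate: float


POLICIES = {
    "user_facing_sync": HalfOpenPolicy((2000, 5000), (1, 2), (30, 30), 0.95),
    "async_worker": HalfOpenPolicy((5000, 15000), (3, 10), (50, 50), 0.90),
    "third_party_rate_limited": HalfOpenPolicy(None, (1, 1), (10, 20), 0.95),
}


def out_of_range(path_type: str, open_wait_ms: int,
                 max_in_flight: int, min_samples: int) -> List[str]:
    """Return the knob names whose values fall outside the recommended range."""
    p = POLICIES[path_type]
    checks = [
        ("openWaitMs", open_wait_ms, p.open_wait_ms),
        ("halfOpenMaxInFlight", max_in_flight, p.max_in_flight),
        ("halfOpenMinSamples", min_samples, p.min_samples),
    ]
    return [name for name, value, rng in checks
            if rng is not None and not rng[0] <= value <= rng[1]]
```

A non-empty result is a prompt to review the setting, not an error: the ranges are starting points.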

Anti-flapping guardrails

Hysteresis thresholds
- Open if failure/slow rate > 50%
- Close only if failure/slow rate < 10% for probe window
Minimum state dwell time
- Stay closed for at least N seconds before opening again, unless the error rate is catastrophic.
Jittered probe scheduling
- Randomize probe start by ±10–20% to avoid herd behavior across pods.
Per-error-class weighting
- Weight connection resets and timeouts more heavily than 5xx errors that return quickly.
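Three of these guardrails fit in a few lines each. The sketch below uses the 50% / 10% thresholds from the list; the function names, the 15% jitter default, and the error-class weights are assumptions chosen for illustration, and the weighted-rate formula is one of several reasonable definitions.

```python
import random

OPEN_ABOVE = 0.50   # closed -> open when the bad rate exceeds this
CLOSE_BELOW = 0.10  # half-open -> closed only when the bad rate is under this


def next_state(state: str, bad_rate: float) -> str:
    """Hysteresis: the gap between the two thresholds damps flapping."""
    if state == "closed" and bad_rate > OPEN_ABOVE:
        return "open"
    if state == "half-open":  # evaluated once the probe window is full
        return "closed" if bad_rate < CLOSE_BELOW else "open"
    return state


def jittered_wait_ms(open_wait_ms: float, jitter: float = 0.15, rng=random) -> float:
    """Spread probe start times by +/- jitter so pods do not probe in lockstep."""
    return open_wait_ms * (1 + rng.uniform(-jitter, jitter))


# Hypothetical weights: resets and timeouts count double, fast 5xx counts once.
ERROR_WEIGHTS = {"timeout": 2.0, "connection_reset": 2.0, "fast_5xx": 1.0}


def weighted_bad_rate(outcomes: list) -> float:
    """outcomes holds 'ok' or an error class; errors are scaled by their weight."""
    bad = sum(ERROR_WEIGHTS.get(o, 1.0) for o in outcomes if o != "ok")
    ok = sum(1 for o in outcomes if o == "ok")
    return bad / (ok + bad) if (ok + bad) else 0.0
```

A closed breaker seeing a 30% bad rate stays closed, but a half-open breaker seeing the same 30% reopens. That asymmetry is the hysteresis.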

Observability: what to chart

Track per dependency:

breaker_state (closed/open/half-open)
transitions count and reason
probe success rate
half-open probe latency distribution
fallback rate
user-visible error rate and latency

Key SLO-like indicators:

Flap Rate: transitions/hour
Recovery Time: time from the first open until the breaker is stably closed again
False Open Ratio: share of opens that happened without meaningful user-level impact
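Flap rate and recovery time can both be derived from the transition log. A sketch, assuming transitions are recorded as time-ordered `(timestamp_s, new_state)` tuples and that "stable" means no further transition for `stable_s` seconds; both the log shape and the function names are assumptions. False open ratio is omitted because it needs user-impact labels that the log alone does not carry.

```python
def flap_rate_per_hour(transitions: list, window_s: float) -> float:
    """Transitions per hour over an observation window of window_s seconds."""
    return len(transitions) * 3600 / window_s


def recovery_times(transitions: list, stable_s: float, end_s: float) -> list:
    """Seconds from the first open of an outage to a close that then held."""
    times = []
    outage_start = None
    for i, (ts, state) in enumerate(transitions):
        if state == "open" and outage_start is None:
            outage_start = ts
        elif state == "closed" and outage_start is not None:
            # The close counts only if nothing else happens for stable_s seconds.
            next_ts = transitions[i + 1][0] if i + 1 < len(transitions) else end_s
            if next_ts - ts >= stable_s:
                times.append(ts - outage_start)
                outage_start = None
    return times


log = [(0, "open"), (5, "half-open"), (6, "closed"),
       (8, "open"), (13, "half-open"), (15, "closed")]
print(flap_rate_per_hour(log, window_s=100))        # 216.0
print(recovery_times(log, stable_s=60, end_s=100))  # [15]
```

In the example the close at t=6 does not count as recovery because the breaker reopened 2 seconds later, so the whole episode is one 15-second recovery.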

Runbook snippet

When flap rate spikes:

Verify whether the dependency's latency or error regime actually changed.
Increase openWaitMs by 1.5x.
Reduce halfOpenMaxInFlight by 1 step.
Tighten slow-call detection if brownout observed.
Re-evaluate after 30–60 minutes.

When recovery is too slow:

Increase halfOpenMaxInFlight slightly.
Reduce openWaitMs by 20%.
Keep success threshold unchanged first.
Confirm user error rate does not regress.
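The numeric steps of both runbooks can be captured as pure functions so that adjustments are repeatable and reviewable. This sketch assumes "1 step" and "slightly" both mean a change of 1, with a floor of 1 probe; the `Tuning` type and function names are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Tuning:
    open_wait_ms: float
    half_open_max_in_flight: int


def on_flap_spike(t: Tuning) -> Tuning:
    """Wait 1.5x longer and probe with one fewer concurrent request (min 1)."""
    return replace(
        t,
        open_wait_ms=t.open_wait_ms * 1.5,
        half_open_max_in_flight=max(1, t.half_open_max_in_flight - 1),
    )


def on_slow_recovery(t: Tuning) -> Tuning:
    """Wait 20% less and allow one more concurrent probe; threshold untouched."""
    return replace(
        t,
        open_wait_ms=t.open_wait_ms * 0.8,
        half_open_max_in_flight=t.half_open_max_in_flight + 1,
    )
```

The non-numeric steps (verifying the regime change, watching user error rate, re-evaluating after 30–60 minutes) stay with the operator.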

Implementation checklist

Breaker parameters are per dependency, not global defaults.
Slow-call tracking enabled and surfaced.
Half-open probe limits and sample size explicitly configured.
Jitter + hysteresis added.
Dashboard includes flap rate and recovery time.
Runbook and owner assigned.

One-line takeaway

A circuit breaker is only as good as its half-open tuning: treat recovery as controlled experimentation, not hope-driven retries.
