Adaptive Concurrency Limit Playbook (Practical)
Date: 2026-02-24
Category: knowledge
Domain: distributed systems / overload control
Why this matters
Rate limits (RPS) protect edges, but they do not directly protect in-flight work inside a service.
Under bursty traffic or dependency slowdowns, overload usually unfolds like this:
- in-flight requests keep growing,
- queueing delay explodes,
- tail latency rises,
- retries add more pressure,
- then everything times out together.
Adaptive concurrency control flips this: instead of a static limit, each node continuously estimates its safe in-flight window from latency signals and sheds load early.
Core idea
Treat concurrency limit like a TCP congestion window.
- If latency stays near the no-load baseline (minRTT), cautiously increase the limit.
- If latency drifts up (a queue is forming), reduce the limit.
- Repeat continuously with small windows.
A common gradient-style update:
gradient = (minRTT + buffer) / sampleRTT
newLimit = oldLimit * gradient + headroom
Where:
- sampleRTT is a percentile (often p90) over a short window,
- buffer absorbs normal variance,
- headroom (often sqrt(limit)) keeps controlled growth when healthy.
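As a minimal sketch, one update step of the gradient rule above in Python (the buffer value and the min/max clamp bounds are illustrative assumptions, not prescribed defaults):

```python
import math

def update_limit(old_limit, min_rtt_ms, sample_rtt_ms,
                 buffer_ms=10.0, min_limit=3, max_limit=1000):
    """One gradient-style update of the concurrency limit.

    When sampleRTT stays near minRTT, the gradient is ~1 and the
    headroom term lets the limit grow; when sampleRTT drifts up
    (queueing), the gradient drops below 1 and the limit shrinks.
    """
    gradient = (min_rtt_ms + buffer_ms) / sample_rtt_ms
    headroom = math.sqrt(old_limit)  # controlled growth when healthy
    new_limit = old_limit * gradient + headroom
    return max(min_limit, min(max_limit, int(new_limit)))
```

Running this with a healthy sample (sampleRTT near minRTT) raises the limit; a sample well above minRTT shrinks it.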
What to deploy first (minimal viable setup)
- Per-instance limiter (no global coordinator).
- Fast reject path when at limit (HTTP 429/503 or gRPC unavailable).
- Short sampling window (e.g., 100ms–250ms).
- p90 latency aggregate for stability.
- Retry policy: one owner + jittered backoff + retry budget.
This already prevents most queue-meltdown incidents.
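The minimal setup above can be sketched as a non-blocking in-flight counter with a fast reject path; the class and method names are illustrative, not any specific library's API:

```python
import threading

class InFlightLimiter:
    """Per-instance limiter with a fast reject path (sketch).

    try_acquire() never blocks: when the limit is reached, the caller
    should reject immediately (HTTP 429/503 or gRPC UNAVAILABLE)
    instead of queueing the request.
    """
    def __init__(self, limit=20):
        self._limit = limit
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        with self._lock:
            if self._in_flight >= self._limit:
                return False  # fast reject, no queueing
            self._in_flight += 1
            return True

    def release(self):
        with self._lock:
            self._in_flight -= 1

    def set_limit(self, new_limit):
        # Called periodically by the adaptive update loop.
        with self._lock:
            self._limit = new_limit
```

A request handler wraps its work in `try_acquire()`/`release()`; the adaptive loop only ever touches `set_limit()`.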
Practical defaults (starting points)
Use these as first pass, then tune with production histograms:
- Sampling/update interval: 100ms
- Sample percentile: p90
- minRTT recalculation interval: 30s–90s
- minRTT request count: 30–100
- min concurrency floor during minRTT probe: 3
- Headroom term: sqrt(limit)
- minRTT jitter: 5–15% of probe interval
Why jitter matters: without it, many instances probe minRTT at the same time and all temporarily clamp concurrency, creating synchronized error spikes.
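A jittered probe schedule is a one-liner; the function name and the 60s/10% defaults here are illustrative:

```python
import random

def next_probe_delay(base_interval_s=60.0, jitter_frac=0.10):
    """Next minRTT probe delay with a +/- jitter_frac random spread.

    Each instance draws its own offset, so probes (and the temporary
    concurrency clamp they cause) are de-synchronized across the fleet.
    """
    jitter = base_interval_s * jitter_frac
    return base_interval_s + random.uniform(-jitter, jitter)
```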
Server-side vs client-side
Server-side limiter
Best for protecting service health and keeping latency bounded.
- Rejects excess before queues explode.
- Stops retry storms from turning into total collapse.
- Keeps instance responsive so autoscaling has time to work.
Client-side limiter
Best for protecting callers from slow dependencies.
- Fail fast when downstream saturates.
- Acts as backpressure for batch systems.
- Prefer one retry owner to avoid multiplicative retries.
In practice: deploy both eventually, but start server-side on critical dependencies.
Partitioning for QoS (important)
One global limiter can starve important traffic during overload.
Use partitioned shares by traffic class, for example:
live: 90%, batch: 10%
And allow non-guaranteed traffic to consume only excess. This preserves user-facing SLOs while still using spare capacity.
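One way to sketch this admission rule in Python; the class names, the `GUARANTEED` set, and the exact borrow rule are assumptions for illustration:

```python
GUARANTEED = {"live"}  # assumption: 'batch' is best-effort

def can_admit(cls, total_limit, shares, in_flight):
    """Admission check under partitioned concurrency shares (sketch).

    shares maps class -> reserved fraction, e.g. {"live": 0.9,
    "batch": 0.1}. Guaranteed traffic is admitted whenever capacity
    remains; best-effort traffic may only use capacity left over
    after every guaranteed class's reservation is honored.
    """
    used = sum(in_flight.values())
    if used >= total_limit:
        return False  # instance is at its limit
    if cls in GUARANTEED:
        return True   # guaranteed traffic uses any free slot
    # Best-effort: never eat into unused guaranteed reservations.
    reserved_unused = sum(
        max(0, int(total_limit * shares[g]) - in_flight.get(g, 0))
        for g in GUARANTEED
    )
    return used + reserved_unused < total_limit
```

With a 90/10 split and an idle live class, batch is still capped at its 10% slice, while live may burst into batch's spare capacity; this is what preserves the user-facing SLO.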
Observability you need on day 1
Track these per endpoint + instance + traffic class:
- current concurrency limit (gauge)
- in-flight requests (gauge)
- blocked/rejected requests (counter)
- minRTT / sampleRTT (gauges or histograms)
- computed gradient (gauge)
- retry attempts per request
- success-after-retry rate
If rejections rise while latency p99 stays flat (or falls), the limiter is likely doing its job.
If rejections and latency p99 both rise, you may be probing too aggressively, or retries are still amplifying load.
Failure modes and fixes
- Synchronized minRTT probes → add jitter; stagger probe schedules.
- Too much false backoff from noise → increase the sample window slightly; use a percentile (p90/p95), not the raw mean.
- Limit stuck too low → ensure the headroom term exists; verify the minRTT baseline is refreshed.
- Retry storms despite the limiter → enforce a single retry owner, retry budgets, and exponential backoff with jitter.
- Health-check traffic contaminates samples → exclude health checks from the latency sampling path.
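The retry-storm fix can be sketched as a retry budget plus full-jitter backoff; the 10% ratio and timing defaults are illustrative assumptions:

```python
import random

class RetryBudget:
    """Token-style retry budget (sketch): retries are allowed only up
    to a fixed fraction of recently observed requests, so retries can
    never multiply load during an outage."""
    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def on_request(self):
        self.requests += 1

    def try_spend(self):
        # Returns True (and spends a token) only while under budget.
        if self.retries < self.ratio * self.requests:
            self.retries += 1
            return True
        return False

def full_jitter_backoff(attempt, base_s=0.1, cap_s=5.0):
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))
```

The single retry owner (typically the outermost caller) consults `try_spend()` before each retry and sleeps for `full_jitter_backoff(attempt)`; inner layers never retry.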
Safe rollout plan
- Pick one high-QPS endpoint with known tail-latency pain.
- Enable adaptive limiter in shadow/observe mode if available.
- Turn on enforced rejections with conservative limits.
- Confirm p99 improves and queue metrics stabilize.
- Add QoS partitions (live/batch/write/read).
- Roll out service-by-service, not fleet-wide at once.
Success signals:
- flatter p99 under burst,
- fewer timeout cascades,
- faster incident recovery without manual retuning.
Decision cheat sheet
- Dynamic infra / autoscaling makes static limits stale? → adaptive concurrency.
- Tail latency spikes during traffic bursts? → adaptive concurrency + fast reject.
- Critical traffic must survive overload? → partitioned concurrency shares.
- Frequent retry storms? → combine the limiter with retry budget ownership.
Bottom line: adaptive concurrency is a practical anti-cascade control loop: keep queues short, latency stable, and failure localized.
References (researched)
- Netflix Tech Blog: Adaptive Concurrency Limits @ Netflix
  https://netflixtechblog.medium.com/performance-under-load-3e6fa9a60581
- Netflix concurrency-limits (open source)
  https://github.com/Netflix/concurrency-limits
- Envoy Adaptive Concurrency Filter docs
  https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/adaptive_concurrency_filter