Adaptive Concurrency Limit Playbook (Practical)
Date: 2026-02-24
Category: knowledge
Domain: distributed systems / overload control
Why this matters
Rate limits (RPS) protect edges, but they do not directly protect in-flight work inside a service.
Under bursty traffic or dependency slowdowns, overload usually unfolds like this:
- in-flight requests keep growing,
- queueing delay explodes,
- tail latency rises,
- retries add more pressure,
- then everything times out together.
Adaptive concurrency control flips this: instead of a static limit, each node continuously estimates its safe in-flight window from latency signals and sheds load early.
Core idea
Treat concurrency limit like a TCP congestion window.
- If latency stays near the no-load baseline (minRTT), cautiously increase the limit.
- If latency drifts up (a queue is forming), reduce the limit.
- Repeat continuously with small windows.
A common gradient-style update:
gradient = (minRTT + buffer) / sampleRTT
newLimit = oldLimit * gradient + headroom
Where:
- sampleRTT is a percentile (often p90) over a short window,
- buffer absorbs normal variance,
- headroom (often sqrt(limit)) keeps controlled growth when healthy.
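As a minimal sketch, one update step of the gradient rule above in Python (the buffer value and the min/max clamp bounds are illustrative assumptions, not prescribed defaults):

```python
import math

def update_limit(old_limit, min_rtt_ms, sample_rtt_ms,
                 buffer_ms=10.0, min_limit=3, max_limit=1000):
    """One gradient-style update of the concurrency limit.

    When sampleRTT stays near minRTT, the gradient is ~1 and the
    headroom term lets the limit grow; when sampleRTT drifts up
    (queueing), the gradient drops below 1 and the limit shrinks.
    """
    gradient = (min_rtt_ms + buffer_ms) / sample_rtt_ms
    headroom = math.sqrt(old_limit)  # controlled growth when healthy
    new_limit = old_limit * gradient + headroom
    return max(min_limit, min(max_limit, int(new_limit)))
```

Running this with a healthy sample (sampleRTT near minRTT) raises the limit; a sample well above minRTT shrinks it.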
What to deploy first (minimal viable setup)
- Per-instance limiter (no global coordinator).
- Fast reject path when at limit (HTTP 429/503 or gRPC unavailable).
- Short sampling window (e.g., 100ms–250ms).
- p90 latency aggregate for stability.
- Retry policy: one owner + jittered backoff + retry budget.
This already prevents most queue-meltdown incidents.
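The minimal setup above can be sketched as a non-blocking in-flight counter with a fast reject path; the class and method names are illustrative, not any specific library's API:

```python
import threading

class InFlightLimiter:
    """Per-instance limiter with a fast reject path (sketch).

    try_acquire() never blocks: when the limit is reached, the caller
    should reject immediately (HTTP 429/503 or gRPC UNAVAILABLE)
    instead of queueing the request.
    """
    def __init__(self, limit=20):
        self._limit = limit
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        with self._lock:
            if self._in_flight >= self._limit:
                return False  # fast reject, no queueing
            self._in_flight += 1
            return True

    def release(self):
        with self._lock:
            self._in_flight -= 1

    def set_limit(self, new_limit):
        # Called periodically by the adaptive update loop.
        with self._lock:
            self._limit = new_limit
```

A request handler wraps its work in `try_acquire()`/`release()`; the adaptive loop only ever touches `set_limit()`.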
Practical defaults (starting points)
Use these as first pass, then tune with production histograms:
- Sampling/update interval: 100ms
- Sample percentile: p90
- minRTT recalculation interval: 30s–90s
- minRTT request count: 30–100
- min concurrency floor during minRTT probe: 3
- Headroom term: sqrt(limit)
- minRTT jitter: 5–15% of probe interval
Why jitter matters: without it, many instances probe minRTT at the same time and all temporarily clamp concurrency, creating synchronized error spikes.
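A jittered probe schedule is a one-liner; the function name and the 60s/10% defaults here are illustrative:

```python
import random

def next_probe_delay(base_interval_s=60.0, jitter_frac=0.10):
    """Next minRTT probe delay with a +/- jitter_frac random spread.

    Each instance draws its own offset, so probes (and the temporary
    concurrency clamp they cause) are de-synchronized across the fleet.
    """
    jitter = base_interval_s * jitter_frac
    return base_interval_s + random.uniform(-jitter, jitter)
```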
Server-side vs client-side
Server-side limiter
Best for protecting service health and keeping latency bounded.
- Rejects excess before queues explode.
- Stops retry storms from turning into total collapse.
- Keeps instance responsive so autoscaling has time to work.
Client-side limiter
Best for protecting callers from slow dependencies.
- Fail fast when downstream saturates.
- Acts as backpressure for batch systems.
- Prefer one retry owner to avoid multiplicative retries.
In practice: deploy both eventually, but start server-side on critical dependencies.
Partitioning for QoS (important)
One global limiter can starve important traffic during overload.
Use partitioned shares by traffic class, for example:
live: 90%, batch: 10%
And allow non-guaranteed traffic to consume only excess. This preserves user-facing SLOs while still using spare capacity.
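One way to sketch this admission rule in Python; the class names, the `GUARANTEED` set, and the exact borrow rule are assumptions for illustration:

```python
GUARANTEED = {"live"}  # assumption: 'batch' is best-effort

def can_admit(cls, total_limit, shares, in_flight):
    """Admission check under partitioned concurrency shares (sketch).

    shares maps class -> reserved fraction, e.g. {"live": 0.9,
    "batch": 0.1}. Guaranteed traffic is admitted whenever capacity
    remains; best-effort traffic may only use capacity left over
    after every guaranteed class's reservation is honored.
    """
    used = sum(in_flight.values())
    if used >= total_limit:
        return False  # instance is at its limit
    if cls in GUARANTEED:
        return True   # guaranteed traffic uses any free slot
    # Best-effort: never eat into unused guaranteed reservations.
    reserved_unused = sum(
        max(0, int(total_limit * shares[g]) - in_flight.get(g, 0))
        for g in GUARANTEED
    )
    return used + reserved_unused < total_limit
```

With a 90/10 split and an idle live class, batch is still capped at its 10% slice, while live may burst into batch's spare capacity; this is what preserves the user-facing SLO.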
Observability you need on day 1
Track these per endpoint + instance + traffic class:
- current concurrency limit (gauge)
- in-flight requests (gauge)
- blocked/rejected requests (counter)
- minRTT / sampleRTT (gauges or histograms)
- computed gradient (gauge)
- retry attempts per request
- success-after-retry rate
If rejections rise while latency p99 stays flat (or falls), the limiter is likely doing its job.
If rejections and latency p99 both rise, you may be probing too aggressively, or retries are still amplifying load.
Failure modes and fixes
- Synchronized minRTT probes → add jitter; stagger probe schedules.
- Too much false backoff from noise → increase the sample window slightly; use a percentile (p90/p95), not the raw mean.
- Limit stuck too low → ensure the headroom term exists; verify the minRTT baseline is refreshed.
- Retry storms despite the limiter → enforce a single retry owner, retry budgets, and exponential backoff with jitter.
- Health-check traffic contaminates samples → exclude health checks from the latency sampling path.
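The retry-storm fix can be sketched as a retry budget plus full-jitter backoff; the 10% ratio and timing defaults are illustrative assumptions:

```python
import random

class RetryBudget:
    """Token-style retry budget (sketch): retries are allowed only up
    to a fixed fraction of recently observed requests, so retries can
    never multiply load during an outage."""
    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def on_request(self):
        self.requests += 1

    def try_spend(self):
        # Returns True (and spends a token) only while under budget.
        if self.retries < self.ratio * self.requests:
            self.retries += 1
            return True
        return False

def full_jitter_backoff(attempt, base_s=0.1, cap_s=5.0):
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))
```

The single retry owner (typically the outermost caller) consults `try_spend()` before each retry and sleeps for `full_jitter_backoff(attempt)`; inner layers never retry.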
Safe rollout plan
- Pick one high-QPS endpoint with known tail-latency pain.
- Enable adaptive limiter in shadow/observe mode if available.
- Turn on enforced rejections with conservative limits.
- Confirm p99 improves and queue metrics stabilize.
- Add QoS partitions (live/batch/write/read).
- Roll out service-by-service, not fleet-wide at once.
Success signals:
- flatter p99 under burst,
- fewer timeout cascades,
- faster incident recovery without manual retuning.
Decision cheat sheet
- Dynamic infra / autoscaling makes static limits stale? → adaptive concurrency.
- Tail latency spikes during traffic bursts? → adaptive concurrency + fast reject.
- Critical traffic must survive overload? → partitioned concurrency shares.
- Frequent retry storms? → combine the limiter with retry budget ownership.
Bottom line: adaptive concurrency is a practical anti-cascade control loop: keep queues short, latency stable, and failure localized.
References (researched)
- Netflix Tech Blog: Adaptive Concurrency Limits @ Netflix
  https://netflixtechblog.medium.com/performance-under-load-3e6fa9a60581
- Netflix concurrency-limits (open source)
  https://github.com/Netflix/concurrency-limits
- Envoy Adaptive Concurrency Filter docs
  https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/adaptive_concurrency_filter