Adaptive Concurrency Limit Playbook (Practical)

2026-02-24 · software

Category: knowledge
Domain: distributed systems / overload control

Why this matters

Rate limits (RPS) protect edges, but they do not directly protect in-flight work inside a service.

Under bursty traffic or dependency slowdowns, the typical failure sequence is:

  1. Queues grow as in-flight work exceeds safe capacity.
  2. Latency climbs, so callers start timing out.
  3. Timeouts trigger retries, which add more load.
  4. The overload cascades until nodes melt down.

Adaptive concurrency control flips this: instead of a fixed static limit, each node continuously estimates the safe in-flight window from latency signals and sheds excess load early.


Core idea

Treat concurrency limit like a TCP congestion window.

A common gradient-style update (the form used by gradient limiters such as the one in Netflix's concurrency-limits library):

  newLimit = currentLimit × (minRTT / sampleRTT) + headroom

Where:

  - minRTT is the observed no-load baseline latency for the endpoint.
  - sampleRTT is the recent latency aggregate (e.g., p90 over the sampling window).
  - headroom is a small additive term (often √currentLimit) that lets the limit probe for more capacity.

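As a concrete sketch, one step of such an update might look like the following Python; the gradient clamp, smoothing factor, and √limit headroom follow the shape of Netflix's gradient limiter, but the exact constants here are illustrative:

```python
import math

def update_limit(limit: float, min_rtt: float, sample_rtt: float,
                 min_limit: float = 1.0, max_limit: float = 1000.0) -> float:
    """One gradient-style update of the concurrency limit."""
    # gradient < 1.0 means latency is inflated vs. the no-load baseline.
    # Clamping to [0.5, 1.0] damps noise and caps how fast we back off.
    gradient = max(0.5, min(1.0, min_rtt / sample_rtt))
    headroom = math.sqrt(limit)          # room to probe for more capacity
    target = limit * gradient + headroom
    # Exponential smoothing avoids a jumpy limit.
    smoothed = 0.8 * limit + 0.2 * target
    return max(min_limit, min(max_limit, smoothed))
```

When sampleRTT matches the baseline, the headroom term nudges the limit upward; when latency doubles or worse, the limit shrinks toward half its current value.
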
What to deploy first (minimal viable setup)

  1. Per-instance limiter (no global coordinator).
  2. Fast reject path when at limit (HTTP 429/503 or gRPC UNAVAILABLE).
  3. Short sampling window (e.g., 100ms–250ms).
  4. p90 latency aggregate for stability.
  5. Retry policy: one owner + jittered backoff + retry budget.

This already prevents most queue-meltdown incidents.
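
A minimal sketch of steps 1–2, assuming a per-instance counter and a non-blocking reject path (class and method names are hypothetical):

```python
import threading

class InstanceLimiter:
    """Per-instance concurrency limiter with a fast reject path."""

    def __init__(self, limit: int):
        self._limit = limit
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        # Non-blocking: when at the limit, reject immediately instead of
        # queueing. The caller maps False to HTTP 429/503 or gRPC UNAVAILABLE.
        with self._lock:
            if self._in_flight >= self._limit:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        with self._lock:
            self._in_flight -= 1
```

A request handler wraps its work in try_acquire/release; the limit value itself would be driven by the gradient-style update loop.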


Practical defaults (starting points)

Use these as a first pass, then tune with production histograms:

  - Sampling window: 100ms–250ms.
  - Latency aggregate: p90 (p95 if samples are noisy).
  - Headroom term: √currentLimit.
  - minRTT probe interval: tens of seconds, jittered per instance.
  - Retry budget: roughly 10% of request volume.

Why jitter matters: without it, many instances probe minRTT at the same time and all temporarily clamp concurrency, creating synchronized error spikes.
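
A sketch of a jittered probe schedule; the 30s base interval and 50% jitter fraction are illustrative assumptions:

```python
import random

def next_probe_delay(base_interval_s: float = 30.0,
                     jitter_frac: float = 0.5) -> float:
    """Seconds until this instance next probes for a fresh minRTT baseline.

    The random spread keeps instances from probing (and temporarily
    clamping concurrency) in lockstep.
    """
    return base_interval_s + random.uniform(0.0, jitter_frac * base_interval_s)
```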

Server-side vs client-side

Server-side limiter

Best for protecting service health and keeping latency bounded.

Client-side limiter

Best for protecting callers from slow dependencies.

In practice: deploy both eventually, but start server-side on critical dependencies.


Partitioning for QoS (important)

One global limiter can starve important traffic during overload.

Use partitioned shares by traffic class, for example:

  - live (user-facing): large guaranteed share
  - batch: small guaranteed share
  - background/best-effort: no guarantee, excess only

And allow non-guaranteed traffic to consume only excess. This preserves user-facing SLOs while still using spare capacity.
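
One way to sketch this, assuming classes beyond their guarantee (and best-effort traffic) may only consume capacity left after reserving every class's unused guaranteed share (names and shares are illustrative):

```python
class PartitionedLimiter:
    """Splits a total concurrency limit into per-class guaranteed shares."""

    def __init__(self, total, guaranteed):
        self.total = total              # total concurrency limit
        self.guaranteed = guaranteed    # e.g. {"live": 6, "batch": 2}
        self.in_flight = {c: 0 for c in guaranteed}
        self.in_flight["best_effort"] = 0

    def try_acquire(self, cls):
        used = sum(self.in_flight.values())
        if used >= self.total:
            return False
        # Guaranteed classes are always admitted up to their own share.
        if cls in self.guaranteed and self.in_flight[cls] < self.guaranteed[cls]:
            self.in_flight[cls] += 1
            return True
        # Beyond its share (or best-effort): only consume true excess,
        # i.e. capacity left after reserving unused guaranteed shares.
        reserved = sum(max(0, g - self.in_flight[c])
                       for c, g in self.guaranteed.items())
        if used + reserved < self.total:
            self.in_flight[cls] += 1
            return True
        return False

    def release(self, cls):
        self.in_flight[cls] -= 1
```

With total=10 and guaranteed shares live=6, batch=2, best-effort traffic can take at most the 2 units of excess, and live requests are never starved of their 6.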

Observability you need on day 1

Track these per endpoint + instance + traffic class:

  - current limit and in-flight count
  - rejection (fast-reject) rate
  - latency p50/p90/p99
  - caller retry rate
  - queue depth, if any queueing remains

If rejections ↑ while latency p99 ↓, the limiter is likely doing its job. If both rejections ↑ and latency p99 ↑, you may be probing too aggressively, or retries are still amplifying load.
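
That rule of thumb can be captured as a tiny triage helper (heuristic only; what counts as "up" is left to your alerting thresholds):

```python
def limiter_verdict(rejections_up: bool, p99_up: bool) -> str:
    """Rough interpretation of rejection and tail-latency trends."""
    if rejections_up and not p99_up:
        return "limiter shedding load effectively"
    if rejections_up and p99_up:
        return "check probe aggressiveness and retry amplification"
    if p99_up:
        return "latency rising without shedding: limit may be too high"
    return "no overload signal"
```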


Failure modes and fixes

  1. Synchronized minRTT probes
    → Add jitter; stagger probe schedules.

  2. Too much false backoff from noise
    → Increase the sample window slightly; use a percentile (p90/p95), not the raw mean.

  3. Limit stuck too low
    → Ensure the headroom term exists; verify the minRTT baseline is refreshed.

  4. Retry storms despite the limiter
    → Enforce a single retry owner + retry budgets + exponential backoff with jitter.

  5. Health-check traffic contaminates samples
    → Exclude health checks from the latency sampling path.
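
Fix #4 can be sketched as a retry budget plus full-jitter exponential backoff; the 10% budget ratio and the backoff constants are common defaults, not requirements:

```python
import random

class RetryBudget:
    """Allows retries only up to a fixed fraction of recent request volume."""

    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def on_request(self) -> None:
        self.requests += 1

    def can_retry(self) -> bool:
        # When the budget is exhausted, fail fast instead of amplifying load.
        if self.retries + 1 > self.requests * self.ratio:
            return False
        self.retries += 1
        return True

def backoff_delay(attempt: int, base_s: float = 0.05, cap_s: float = 2.0) -> float:
    """Exponential backoff with full jitter."""
    return random.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))
```

A production budget would decay its counters over a sliding window; here they only grow, which is enough to show the mechanism.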

Safe rollout plan

  1. Pick one high-QPS endpoint with known tail-latency pain.
  2. Enable adaptive limiter in shadow/observe mode if available.
  3. Turn on enforced rejections with conservative limits.
  4. Confirm p99 improves and queue metrics stabilize.
  5. Add QoS partitions (live/batch/write/read).
  6. Roll out service-by-service, not fleet-wide at once.

Success signals:

  - p99 latency stays flat or improves during bursts.
  - Rejections rise briefly under overload, then settle.
  - Queue depth stays bounded, with no synchronized error spikes.
  - Caller retry rate does not climb while load is being shed.


Decision cheat sheet

  - Protecting your own service's health → server-side limiter.
  - Protecting callers from a slow dependency → client-side limiter.
  - Mixed traffic criticality → partition shares by QoS class.
  - Retries amplifying overload → retry budget + single retry owner.

Bottom line: adaptive concurrency is a practical anti-cascade control loop: it keeps queues short, latency stable, and failures localized.
