Rate Limiting Algorithms Playbook (Token Bucket, GCRA, Concurrency)

2026-02-26 · software

Category: knowledge
Domain: backend / API reliability

Why this matters

Most teams treat rate limiting as a single knob (N req/s). In production, that is rarely enough.

Real overload comes from multiple failure shapes:

  • sustained high request rates,
  • short bursts that exceed steady-state capacity,
  • slow or expensive requests that pile up in flight.

A practical limiter stack should control rate, burst, and in-flight concurrency together.


Mental model: what you are actually protecting

Protect 3 different resources explicitly:

  1. Ingress frequency (requests/time)
  2. Burst tolerance (short-term spikes)
  3. In-flight work (concurrent active requests)

If you only limit (1), you can still melt on (3).


Algorithm cheat sheet

1) Fixed window counter

count(requests in [t, t+window]) <= limit

Pros

  • Trivial to implement; O(1) state per key.

Cons

  • Window-edge bursts: up to 2x the limit can pass across a window boundary.

Use when: internal tools, non-critical endpoints, quick MVP.
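
The counter above can be sketched in a few lines. This is an illustrative, single-process version (class and parameter names are mine, not from any particular library); the clock is injectable so the behavior is easy to test.

```python
import time


class FixedWindowLimiter:
    """Allow at most `limit` requests per `window` seconds per key (sketch)."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.counts = {}  # (key, window id) -> request count

    def allow(self, key, now=None) -> bool:
        now = time.monotonic() if now is None else now
        bucket = (key, int(now // self.window))  # id of the current fixed window
        self.counts[bucket] = self.counts.get(bucket, 0) + 1
        return self.counts[bucket] <= self.limit
```

Note the window-edge weakness: a client can spend its full budget at the end of one window and again at the start of the next, briefly doubling throughput.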

2) Sliding window log/counter

Tracks recent request history over a rolling time window.

Pros

  • Accurate, fair accounting with no window-edge spikes.

Cons

  • The log variant stores a timestamp per request (memory grows with request rate); the counter variant is an approximation.

Use when: fairness matters and cardinality is manageable.
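
A minimal sketch of the log variant, assuming a single process and an injectable clock (names are illustrative): keep per-key timestamps in a deque, evict anything older than the window, and admit only while the log is under the limit.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLog:
    """Allow at most `limit` requests in any trailing `window` seconds (sketch)."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.logs = defaultdict(deque)  # key -> timestamps of recent requests

    def allow(self, key, now=None) -> bool:
        now = time.monotonic() if now is None else now
        log = self.logs[key]
        # Evict entries that have aged out of the rolling window.
        while log and now - log[0] >= self.window:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```

The memory cost is visible here: one stored timestamp per admitted request, which is why high-cardinality deployments usually prefer the approximate counter variant or GCRA.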

3) Token bucket

Bucket fills at rate r up to capacity b; request cost consumes tokens.

Pros

  • Bounds the average rate at r while allowing controlled bursts up to b; O(1) state per key.

Cons

  • Burst capacity must be sized deliberately; distributed deployments need atomic refill-and-consume updates.

Use when: default choice for API edges and per-tenant limits.
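
A lazy-refill sketch of the bucket above (illustrative names; the clock is passed in as `now` to keep it deterministic): tokens are credited for elapsed time on each call rather than by a background job, capped at capacity.

```python
class TokenBucket:
    """Refill at `rate` tokens/sec up to `capacity`; a request consumes `cost` tokens."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full: an initial burst of up to b is allowed
        self.last = 0.0         # timestamp of the last refill

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Lazy refill: credit tokens for elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The `cost` parameter is what lets one bucket serve heterogeneous endpoints: charge an expensive route 10 tokens and a cheap one 1.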

4) GCRA (Generic Cell Rate Algorithm)

A leaky-bucket-equivalent algorithm using time arithmetic (e.g., tracking TAT: Theoretical Arrival Time).

Pros

  • A single timestamp of state per key; smooth request spacing with no background refill job.

Cons

  • Less intuitive to reason about; burst allowance is expressed indirectly via the tolerance parameter.

Use when: you need precise smoothing with compact state (e.g., Redis-backed large cardinality limits).

5) Concurrency limiter (separate from rate limiter)

active_requests(subject) <= k

Pros

  • Directly protects workers, memory, and downstream connections; naturally adapts to per-request cost.

Cons

  • Slots must be released reliably (crashes and timeouts leak them); does not bound arrival rate on its own.

Use when: endpoints are heterogeneous in cost or latency tails are incident-prone.
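
In-process, this is little more than a semaphore with load-shedding semantics. A sketch (names are illustrative; the `RuntimeError` stands in for whatever your framework maps to a 429/503):

```python
import threading
from contextlib import contextmanager


class ConcurrencyLimiter:
    """Cap in-flight requests at k; reject rather than queue when saturated (sketch)."""

    def __init__(self, k: int):
        self.sem = threading.BoundedSemaphore(k)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: shed load instead of queueing behind slow requests.
        if not self.sem.acquire(blocking=False):
            raise RuntimeError("over concurrency limit")
        try:
            yield
        finally:
            self.sem.release()  # release even if the handler raised
```

The `try/finally` is the important part: a leaked slot permanently shrinks capacity, which is the classic failure mode of hand-rolled concurrency caps.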


What good production stacks do

A robust pattern is layered limits:

  1. Global emergency cap (service-wide)
  2. Per-tenant token bucket/GCRA
  3. Per-endpoint limiter (stricter for expensive routes)
  4. Concurrency cap (global + endpoint-class)
  5. Priority lanes (critical traffic reserve)

This mirrors real-world guidance from mature API providers: separate rate and concurrency controls rather than forcing one limiter to solve both.
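
The layering above reduces to "run the checks in order, broadest first, and report which layer rejected." A hypothetical piece of glue code (the limiter objects themselves are elided; each layer is just a zero-argument check here):

```python
def admit(checks):
    """Run layered limiter checks in order; return (allowed, failing_scope).

    `checks` is a list of (scope_name, zero-arg check) pairs, ordered from
    broadest (global) to narrowest (endpoint), so a cheap global rejection
    happens before any per-tenant state is touched.
    """
    for scope, check in checks:
        if not check():
            return (False, scope)
    return (True, None)


# Usage: the failing scope feeds directly into the 429 payload.
allowed, scope = admit([
    ("global",   lambda: True),
    ("tenant",   lambda: False),   # e.g. this tenant's token bucket is empty
    ("endpoint", lambda: True),
])
# allowed == False, scope == "tenant"
```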


Practical policy template

Example for a multi-tenant API (illustrative values):

  • Per-tenant: token bucket at r = 100 req/s, burst b = 200.
  • Expensive endpoints: a stricter per-endpoint bucket (e.g., 10 req/s) plus a token cost > 1 per call.
  • Concurrency: cap of 50 in-flight requests per tenant, with a global cap sized to worker capacity.
  • Global emergency cap: a service-wide limiter that sheds load before saturation.

Response contract:

  • 429 Too Many Requests (rate) or 503 (overload/concurrency), with a Retry-After header.
  • A machine-readable body naming the limiter type and scope so clients can adapt.
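
A rejection body might look like the following (field names are hypothetical, shown only to make the contract concrete):

```python
import json

# Illustrative 429 body: limiter type, scope, and a retry hint.
payload = {
    "error": "rate_limited",
    "limiter": "token_bucket",   # which algorithm tripped
    "scope": "tenant",           # global / tenant / endpoint
    "retry_after_seconds": 2.5,  # also surfaced as a Retry-After header
}
body = json.dumps(payload)
```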


Tuning workflow (the part most teams skip)

  1. Measure baseline
    • p50/p95/p99 latency, active in-flight, reject rate by endpoint/tenant.
  2. Start permissive
    • dry-run / shadow accounting first.
  3. Set burst intentionally
    • enough for user spikes, not enough for thundering herd.
  4. Add concurrency guardrails
    • especially for high-latency / high-CPU routes.
  5. Use hysteresis and cooldowns
    • avoid limit flap during recovery.
  6. Audit client behavior
    • exponential backoff + jitter, retry budgets.

If clients retry aggressively on 429 without jitter, your limiter becomes a self-amplifying failure loop.
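
The standard antidote on the client side is exponential backoff with "full jitter": each retry sleeps a uniform random amount between zero and the capped exponential bound, so a fleet of simultaneously rejected clients desynchronizes instead of stampeding back together. A sketch (parameter names are illustrative):

```python
import random


def backoff_delays(base: float = 0.5, cap: float = 30.0, attempts: int = 5):
    """Yield 'full jitter' retry delays: uniform(0, min(cap, base * 2**n))."""
    for n in range(attempts):
        yield random.uniform(0, min(cap, base * 2 ** n))
```

Pair this with a retry budget (e.g., abandon after `attempts` tries) so a sustained outage does not turn every client into an infinite retry loop.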


Common failure modes

  1. Single limiter for all endpoint costs
    • cheap and expensive calls treated equally → unfair saturation.
  2. No concurrency cap
    • request rate looks healthy while workers are fully pinned.
  3. Window edge abuse (fixed window)
    • periodic traffic spikes slip through.
  4. Missing limiter reason in responses
    • clients cannot adapt correctly.
  5. No dry-run stage
    • surprise customer impact on rollout.

Decision matrix (quick)

  Need                                  Pick
  ------------------------------------  --------------------------------
  Quick MVP, internal tools             Fixed window counter
  Fairness, manageable cardinality      Sliding window log/counter
  Default API edge / per-tenant limits  Token bucket
  Precise smoothing, compact state      GCRA
  Heterogeneous cost, latency tails     Concurrency limiter (extra layer)

Minimal implementation blueprint

  1. Edge limiter: token bucket or GCRA per tenant/IP.
  2. Service limiter: in-process or shared concurrency cap by endpoint class.
  3. Structured 429 payload:
    • limiter type,
    • scope (global/tenant/endpoint),
    • retry hint.
  4. Observability:
    • allowed/blocked counts,
    • queueing time,
    • in-flight saturation,
    • retry amplification.
  5. Runbooks:
    • temporary raised limits,
    • emergency brownout mode,
    • manual tenant throttle override.
