Rate Limiting Algorithms Playbook (Token Bucket, GCRA, Concurrency)
Date: 2026-02-26
Category: knowledge
Domain: backend / API reliability
Why this matters
Most teams treat rate limiting as a single knob (N req/s). In production, that is rarely enough.
Real overload comes from multiple failure shapes:
- bursty traffic that starves shared dependencies,
- long-running requests that consume worker slots,
- noisy tenants that crowd out critical flows,
- synchronized retries that amplify incidents.
A practical limiter stack should control rate, burst, and in-flight concurrency together.
Mental model: what you are actually protecting
Protect three different resources explicitly:
1) Ingress frequency (requests/time)
2) Burst tolerance (short-term spikes)
3) In-flight work (concurrent active requests)
If you only limit (1), you can still melt on (3).
Algorithm cheat sheet
1) Fixed window counter
count(requests in [t, t+window]) <= limit
Pros
- trivial to implement,
- cheap storage.
Cons
- boundary effects (double-spend around window edges),
- allows spiky behavior.
Use when: internal tools, non-critical endpoints, quick MVP.
2) Sliding window log/counter
Tracks recent request history in rolling time.
Pros
- fairer than fixed windows,
- smoother enforcement.
Cons
- more state and compute,
- higher implementation complexity at scale.
Use when: fairness matters and cardinality is manageable.
3) Token bucket
Bucket fills at rate r up to capacity b; request cost consumes tokens.
Pros
- simple and production-proven,
- naturally supports burst (b) + sustained rate (r),
- intuitive tuning knobs.
Cons
- naïve distributed implementations can drift,
- does not directly cap long-running in-flight work.
Use when: default choice for API edges and per-tenant limits.
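A single-process token bucket sketch using lazy refill (no background drip; names are illustrative). Time is passed in explicitly so the logic is deterministic under test.

```python
class TokenBucket:
    """Refill at `rate` tokens/sec up to `capacity`; each request consumes `cost` tokens."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full: allows an initial burst of size b
        self.last = 0.0         # timestamp of the last refill

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Lazy refill: credit tokens for the time elapsed since the last call.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The distributed-drift con above comes from running many such independent buckets: a shared atomic store (or an algorithm like GCRA over one key) is the usual fix.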
4) GCRA (Generic Cell Rate Algorithm)
A leaky-bucket-equivalent algorithm using time arithmetic (e.g., tracking TAT: Theoretical Arrival Time).
Pros
- smooth rolling-window behavior,
- efficient state footprint (often one key per subject),
- no periodic “drip worker” required in common implementations.
Cons
- conceptually harder than token bucket,
- debugging/tuning is less intuitive for many teams.
Use when: you need precise smoothing with compact state (e.g., Redis-backed large cardinality limits).
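The TAT bookkeeping can be sketched in a few lines (single-process; the burst parameterization and names are my assumptions, though redis-cell applies the same idea atomically over one Redis key):

```python
class GCRA:
    """GCRA via Theoretical Arrival Time (TAT).
    T = emission interval (1/rate); tau = allowed earliness = T * (burst - 1)."""

    def __init__(self, rate: float, burst: int):
        self.t = 1.0 / rate              # ideal spacing between requests
        self.tau = self.t * (burst - 1)  # how far ahead of schedule a request may arrive
        self.tat = 0.0                   # theoretical arrival time of the next conforming request

    def allow(self, now: float) -> bool:
        tat = max(self.tat, now)
        if tat - now > self.tau:
            return False                 # arrived too early: non-conforming
        self.tat = tat + self.t          # only one float of state per subject
        return True
```

Note the compact state: one timestamp per subject, versus a token count plus a refill timestamp for a token bucket.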
5) Concurrency limiter (separate from rate limiter)
active_requests(subject) <= k
Pros
- directly protects CPU, DB pools, thread/worker slots,
- strong defense for expensive endpoints.
Cons
- can reduce throughput if k is too low,
- needs request lifecycle instrumentation.
Use when: endpoints are heterogeneous in cost or latency tails are incident-prone.
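A concurrency cap is just bounded admission around the request lifecycle. A minimal in-process sketch using a semaphore with non-blocking acquire (shed load instead of queueing; the exception type is illustrative):

```python
import threading
from contextlib import contextmanager


class ConcurrencyLimiter:
    """Cap in-flight requests at `k`; reject immediately when all slots are taken."""

    def __init__(self, k: int):
        self.sem = threading.BoundedSemaphore(k)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: a full house means immediate rejection,
        # which the edge would map to HTTP 429 or 503.
        if not self.sem.acquire(blocking=False):
            raise RuntimeError("concurrency limit exceeded")
        try:
            yield
        finally:
            self.sem.release()  # instrumentation of the full lifecycle is the hard part
```

The context manager guarantees the slot is released even if the handler raises, which is exactly the "lifecycle instrumentation" con above.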
What good production stacks do
A robust pattern is layered limits:
- Global emergency cap (service-wide)
- Per-tenant token bucket/GCRA
- Per-endpoint limiter (stricter for expensive routes)
- Concurrency cap (global + endpoint-class)
- Priority lanes (critical traffic reserve)
This mirrors real-world guidance from mature API providers: separate rate and concurrency controls rather than forcing one limiter to solve both.
Practical policy template
Example for a multi-tenant API:
- Global: 2,000 req/s, burst 4,000
- Tenant default: 50 req/s, burst 100
- Heavy endpoint (/reports/export): 5 req/s, burst 10, concurrency 2
- Critical endpoint (/checkout/confirm): reserved concurrency lane + softer per-tenant cap
Response contract:
- 429 Too Many Requests for policy exceedance
- include Retry-After when possible
- include explicit limiter reason (e.g., global-rate, endpoint-concurrency)
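The response contract can be sketched as a small helper that builds the status, headers, and a machine-readable body (field names here are illustrative, not a standard):

```python
def rejection_response(limiter: str, scope: str, retry_after_s: int) -> tuple[int, dict, dict]:
    """Build a 429 carrying the limiter reason so clients can adapt per-limiter."""
    headers = {"Retry-After": str(retry_after_s)}  # standard HTTP retry hint
    body = {
        "error": "rate_limited",
        "limiter": limiter,              # e.g. "global-rate", "endpoint-concurrency"
        "scope": scope,                  # "global" | "tenant" | "endpoint"
        "retry_after_seconds": retry_after_s,
    }
    return 429, headers, body
```

Exposing the limiter reason lets a client back off differently for a tenant-wide cap than for a single hot endpoint.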
Tuning workflow (the part most teams skip)
- Measure baseline: p50/p95/p99 latency, active in-flight, reject rate by endpoint/tenant.
- Start permissive: dry-run / shadow accounting first.
- Set burst intentionally: enough for user spikes, not enough for thundering herd.
- Add concurrency guardrails: especially for high-latency / high-CPU routes.
- Use hysteresis and cooldowns: avoid limit flap during recovery.
- Audit client behavior: exponential backoff + jitter, retry budgets.
If clients retry aggressively on 429 without jitter, your limiter becomes a self-amplifying failure loop.
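The client-side fix is capped exponential backoff with jitter. A sketch of the "full jitter" variant (constants are illustrative):

```python
import random


def backoff_with_jitter(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter backoff: sleep a uniform random amount in
    [0, min(cap, base * 2**attempt)), so retries decorrelate instead of
    arriving in synchronized waves."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

A server-supplied Retry-After header, when present, should take precedence over the computed delay; the jitter matters most when many clients were rejected at the same instant.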
Common failure modes
- Single limiter for all endpoint costs: cheap and expensive calls treated equally → unfair saturation.
- No concurrency cap: request rate looks healthy while workers are fully pinned.
- Window edge abuse (fixed window): periodic traffic spikes slip through.
- Missing limiter reason in responses: clients cannot adapt correctly.
- No dry-run stage: surprise customer impact on rollout.
Decision matrix (quick)
- Need quick implementation? → Fixed window (short-term only).
- Need solid default for APIs? → Token bucket.
- Need smooth rolling behavior + compact state? → GCRA.
- Expensive/long-running requests dominate incidents? → Add concurrency limiter immediately.
- Multi-tenant fairness + business tiers? → Hierarchical per-tenant + per-endpoint + priority lanes.
Minimal implementation blueprint
- Edge limiter: token bucket or GCRA per tenant/IP.
- Service limiter: in-process or shared concurrency cap by endpoint class.
- Structured 429 payload:
- limiter type,
- scope (global/tenant/endpoint),
- retry hint.
- Observability:
- allowed/blocked counts,
- queueing time,
- in-flight saturation,
- retry amplification.
- Runbooks:
- temporary raised limits,
- emergency brownout mode,
- manual tenant throttle override.
References (researched)
- RFC 2697 — A Single Rate Three Color Marker (srTCM): https://datatracker.ietf.org/doc/html/rfc2697
- RFC 2698 — A Two Rate Three Color Marker (trTCM): https://datatracker.ietf.org/doc/html/rfc2698
- NGINX limit_req module (leaky bucket): https://nginx.org/en/docs/http/ngx_http_limit_req_module.html
- Cloudflare Engineering — “How we built rate limiting capable of scaling to millions of domains”: https://blog.cloudflare.com/counting-things-a-lot-of-different-things/
- Cloudflare API rate limits reference: https://developers.cloudflare.com/fundamentals/api/reference/limits/
- Stripe docs — API rate and concurrency limits: https://docs.stripe.com/rate-limits
- Stripe Engineering — “Scaling your API with rate limiters”: https://stripe.com/blog/rate-limiters
- Brandur — “Rate Limiting, Cells, and GCRA”: https://brandur.org/rate-limiting
- Redis blog — redis-cell and GCRA: https://redis.io/blog/redis-cell-rate-limiting-redis-module/