Rate Limiting Algorithms Playbook (Token Bucket, GCRA, Concurrency)

2026-02-26 · software

Category: knowledge
Domain: backend / API reliability

Why this matters

Most teams treat rate limiting as a single knob (N req/s). In production, that is rarely enough.

Real overload comes from multiple failure shapes:

  • sustained high request rates,
  • short bursts that exceed steady-state capacity,
  • slow or expensive requests that pile up in flight.

A practical limiter stack should control rate, burst, and in-flight concurrency together.


Mental model: what you are actually protecting

Protect 3 different resources explicitly:

  1. Ingress frequency (requests/time)
  2. Burst tolerance (short-term spikes)
  3. In-flight work (concurrent active requests)

If you only limit (1), you can still melt on (3).


Algorithm cheat sheet

1) Fixed window counter

count(requests in [t, t+window]) <= limit

Pros

  • Trivial to implement; O(1) state per key.

Cons

  • Window-edge bursts: up to 2x the limit can pass across a window boundary.

Use when: internal tools, non-critical endpoints, quick MVP.
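
The counter above can be sketched in a few lines. This is an illustrative, single-process version (class and parameter names are mine, not from any particular library); the clock is injectable so the behavior is easy to test.

```python
import time


class FixedWindowLimiter:
    """Allow at most `limit` requests per `window` seconds per key (sketch)."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.counts = {}  # (key, window id) -> request count

    def allow(self, key, now=None) -> bool:
        now = time.monotonic() if now is None else now
        bucket = (key, int(now // self.window))  # id of the current fixed window
        self.counts[bucket] = self.counts.get(bucket, 0) + 1
        return self.counts[bucket] <= self.limit
```

Note the window-edge weakness: a client can spend its full budget at the end of one window and again at the start of the next, briefly doubling throughput.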

2) Sliding window log/counter

Tracks recent request history over a rolling time window.

Pros

  • Accurate, fair accounting with no window-edge spikes.

Cons

  • The log variant stores a timestamp per request (memory grows with request rate); the counter variant is an approximation.

Use when: fairness matters and cardinality is manageable.
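
A minimal sketch of the log variant, assuming a single process and an injectable clock (names are illustrative): keep per-key timestamps in a deque, evict anything older than the window, and admit only while the log is under the limit.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLog:
    """Allow at most `limit` requests in any trailing `window` seconds (sketch)."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.logs = defaultdict(deque)  # key -> timestamps of recent requests

    def allow(self, key, now=None) -> bool:
        now = time.monotonic() if now is None else now
        log = self.logs[key]
        # Evict entries that have aged out of the rolling window.
        while log and now - log[0] >= self.window:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```

The memory cost is visible here: one stored timestamp per admitted request, which is why high-cardinality deployments usually prefer the approximate counter variant or GCRA.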

3) Token bucket

Bucket fills at rate r up to capacity b; request cost consumes tokens.

Pros

  • Bounds the average rate at r while allowing controlled bursts up to b; O(1) state per key.

Cons

  • Burst capacity must be sized deliberately; distributed deployments need atomic refill-and-consume updates.

Use when: default choice for API edges and per-tenant limits.
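
A lazy-refill sketch of the bucket above (illustrative names; the clock is passed in as `now` to keep it deterministic): tokens are credited for elapsed time on each call rather than by a background job, capped at capacity.

```python
class TokenBucket:
    """Refill at `rate` tokens/sec up to `capacity`; a request consumes `cost` tokens."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full: an initial burst of up to b is allowed
        self.last = 0.0         # timestamp of the last refill

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Lazy refill: credit tokens for elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The `cost` parameter is what lets one bucket serve heterogeneous endpoints: charge an expensive route 10 tokens and a cheap one 1.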

4) GCRA (Generic Cell Rate Algorithm)

A leaky-bucket-equivalent algorithm using time arithmetic (e.g., tracking TAT: Theoretical Arrival Time).

Pros

  • A single timestamp of state per key; smooth request spacing with no background refill job.

Cons

  • Less intuitive to reason about; burst allowance is expressed indirectly via the tolerance parameter.

Use when: you need precise smoothing with compact state (e.g., Redis-backed large cardinality limits).

5) Concurrency limiter (separate from rate limiter)

active_requests(subject) <= k

Pros

  • Directly protects workers, memory, and downstream connections; naturally adapts to per-request cost.

Cons

  • Slots must be released reliably (crashes and timeouts leak them); does not bound arrival rate on its own.

Use when: endpoints are heterogeneous in cost or latency tails are incident-prone.
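
In-process, this is little more than a semaphore with load-shedding semantics. A sketch (names are illustrative; the `RuntimeError` stands in for whatever your framework maps to a 429/503):

```python
import threading
from contextlib import contextmanager


class ConcurrencyLimiter:
    """Cap in-flight requests at k; reject rather than queue when saturated (sketch)."""

    def __init__(self, k: int):
        self.sem = threading.BoundedSemaphore(k)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: shed load instead of queueing behind slow requests.
        if not self.sem.acquire(blocking=False):
            raise RuntimeError("over concurrency limit")
        try:
            yield
        finally:
            self.sem.release()  # release even if the handler raised
```

The `try/finally` is the important part: a leaked slot permanently shrinks capacity, which is the classic failure mode of hand-rolled concurrency caps.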


What good production stacks do

A robust pattern is layered limits:

  1. Global emergency cap (service-wide)
  2. Per-tenant token bucket/GCRA
  3. Per-endpoint limiter (stricter for expensive routes)
  4. Concurrency cap (global + endpoint-class)
  5. Priority lanes (critical traffic reserve)

This mirrors real-world guidance from mature API providers: separate rate and concurrency controls rather than forcing one limiter to solve both.
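
The layering above reduces to "run the checks in order, broadest first, and report which layer rejected." A hypothetical piece of glue code (the limiter objects themselves are elided; each layer is just a zero-argument check here):

```python
def admit(checks):
    """Run layered limiter checks in order; return (allowed, failing_scope).

    `checks` is a list of (scope_name, zero-arg check) pairs, ordered from
    broadest (global) to narrowest (endpoint), so a cheap global rejection
    happens before any per-tenant state is touched.
    """
    for scope, check in checks:
        if not check():
            return (False, scope)
    return (True, None)


# Usage: the failing scope feeds directly into the 429 payload.
allowed, scope = admit([
    ("global",   lambda: True),
    ("tenant",   lambda: False),   # e.g. this tenant's token bucket is empty
    ("endpoint", lambda: True),
])
# allowed == False, scope == "tenant"
```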


Practical policy template

Example for a multi-tenant API (illustrative values):

  • Per-tenant: token bucket at r = 100 req/s, burst b = 200.
  • Expensive endpoints: a stricter per-endpoint bucket (e.g., 10 req/s) plus a token cost > 1 per call.
  • Concurrency: cap of 50 in-flight requests per tenant, with a global cap sized to worker capacity.
  • Global emergency cap: a service-wide limiter that sheds load before saturation.

Response contract:

  • 429 Too Many Requests (rate) or 503 (overload/concurrency), with a Retry-After header.
  • A machine-readable body naming the limiter type and scope so clients can adapt.
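
A rejection body might look like the following (field names are hypothetical, shown only to make the contract concrete):

```python
import json

# Illustrative 429 body: limiter type, scope, and a retry hint.
payload = {
    "error": "rate_limited",
    "limiter": "token_bucket",   # which algorithm tripped
    "scope": "tenant",           # global / tenant / endpoint
    "retry_after_seconds": 2.5,  # also surfaced as a Retry-After header
}
body = json.dumps(payload)
```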


Tuning workflow (the part most teams skip)

  1. Measure baseline
    • p50/p95/p99 latency, active in-flight, reject rate by endpoint/tenant.
  2. Start permissive
    • dry-run / shadow accounting first.
  3. Set burst intentionally
    • enough for user spikes, not enough for thundering herd.
  4. Add concurrency guardrails
    • especially for high-latency / high-CPU routes.
  5. Use hysteresis and cooldowns
    • avoid limit flap during recovery.
  6. Audit client behavior
    • exponential backoff + jitter, retry budgets.

If clients retry aggressively on 429 without jitter, your limiter becomes a self-amplifying failure loop.
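
The standard antidote on the client side is exponential backoff with "full jitter": each retry sleeps a uniform random amount between zero and the capped exponential bound, so a fleet of simultaneously rejected clients desynchronizes instead of stampeding back together. A sketch (parameter names are illustrative):

```python
import random


def backoff_delays(base: float = 0.5, cap: float = 30.0, attempts: int = 5):
    """Yield 'full jitter' retry delays: uniform(0, min(cap, base * 2**n))."""
    for n in range(attempts):
        yield random.uniform(0, min(cap, base * 2 ** n))
```

Pair this with a retry budget (e.g., abandon after `attempts` tries) so a sustained outage does not turn every client into an infinite retry loop.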


Common failure modes

  1. Single limiter for all endpoint costs
    • cheap and expensive calls treated equally → unfair saturation.
  2. No concurrency cap
    • request rate looks healthy while workers are fully pinned.
  3. Window edge abuse (fixed window)
    • periodic traffic spikes slip through.
  4. Missing limiter reason in responses
    • clients cannot adapt correctly.
  5. No dry-run stage
    • surprise customer impact on rollout.

Decision matrix (quick)

  Need                                  Pick
  ------------------------------------  --------------------------------
  Quick MVP, internal tools             Fixed window counter
  Fairness, manageable cardinality      Sliding window log/counter
  Default API edge / per-tenant limits  Token bucket
  Precise smoothing, compact state      GCRA
  Heterogeneous cost, latency tails     Concurrency limiter (extra layer)

Minimal implementation blueprint

  1. Edge limiter: token bucket or GCRA per tenant/IP.
  2. Service limiter: in-process or shared concurrency cap by endpoint class.
  3. Structured 429 payload:
    • limiter type,
    • scope (global/tenant/endpoint),
    • retry hint.
  4. Observability:
    • allowed/blocked counts,
    • queueing time,
    • in-flight saturation,
    • retry amplification.
  5. Runbooks:
    • temporary raised limits,
    • emergency brownout mode,
    • manual tenant throttle override.
