LLM Serving Throughput/Latency Playbook: Continuous Batching, Paged KV, and Speculative Decoding
Date: 2026-03-09
Category: knowledge
Domain: software / ML systems / inference infrastructure
Why this matters
LLM serving usually fails in one of two ways:
- you optimize tokens/sec but p95 TTFT (time-to-first-token) explodes,
- or you optimize interactive latency and GPU utilization collapses.
The practical goal is not "max throughput" in isolation. It is:
stable latency SLO under bursty mixed workloads while keeping $/1M tokens predictable.
Core model: what actually bottlenecks
For most production workloads:
- Prefill (prompt processing) is compute-heavy.
- Decode (next-token loop) is memory-bandwidth + scheduler heavy.
- KV cache becomes the dominant capacity limiter as context grows.
So the stack should combine:
- scheduler optimization (continuous/in-flight batching),
- memory optimization (paged KV + prefix reuse),
- algorithmic acceleration (speculative decoding),
- kernel optimization (FlashAttention-family kernels).
Treat these as complementary, not mutually exclusive.
1) Continuous batching first (highest practical ROI)
What it is
Instead of waiting for a batch window and running monolithic batches, the server re-packs work every decode step (iteration-level scheduling).
Why it works
- reduces head-of-line blocking,
- backfills slots immediately when a request finishes,
- mixes prefill/decode more efficiently under variable-length outputs.
Common failure mode
Teams enable continuous batching but keep a naive queue policy, so short prompts are still trapped behind long prefills.
Practical policy
- separate the prefill token budget from the decode token budget per step,
- cap the per-request prefill chunk size,
- apply age-based fairness to avoid starvation,
- run explicit "interactive" vs "bulk" lanes when traffic is mixed.
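The policy above can be sketched as a per-step planner. This is a minimal illustration, not any engine's actual scheduler: the `Request` fields, the `(interactive, age)` priority key, and the budget defaults are all assumptions chosen to make the chunking and fairness logic concrete.

```python
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    prompt_tokens: int       # total prefill work for this request
    interactive: bool = False
    age: int = 0             # steps spent waiting, for fairness
    prefilled: int = 0       # prefill tokens already processed

def plan_step(waiting, prefill_budget=2048, max_chunk=512):
    """Pick prefill work for one iteration of the scheduler.

    - prefill is chunked so one long prompt cannot monopolize a step,
    - the interactive lane goes first, then the oldest waiters,
    - decode tokens for running requests are budgeted separately (not shown).
    """
    candidates = sorted(waiting, key=lambda r: (not r.interactive, -r.age))
    plan, budget = [], prefill_budget
    for req in candidates:
        if budget == 0:
            break
        remaining = req.prompt_tokens - req.prefilled
        chunk = min(remaining, max_chunk, budget)
        if chunk > 0:
            plan.append((req.rid, chunk))
            req.prefilled += chunk
            budget -= chunk
    for req in waiting:
        req.age += 1  # anyone still waiting ages toward higher priority
    return plan
```

Note how a 4000-token prompt only consumes whatever budget is left after shorter and interactive requests are served, which is exactly what prevents head-of-line blocking.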
2) Paged KV cache to reclaim capacity
Why it matters
Contiguous KV allocation fragments memory badly for variable sequence lengths. Paged KV stores the cache in fixed-size blocks reached through an indirection table, so allocation and reuse are far more efficient.
Rule of thumb
If context length is large and request lengths vary widely, paged KV is often non-negotiable for good GPU occupancy.
Capacity intuition
Approximate KV bytes/token (decoder-only, MHA/GQA-style):
KV bytes per token ≈ 2 × layers × kv_heads × head_dim × dtype_bytes
Multiply by active tokens in memory to estimate pressure. This quickly dominates VRAM.
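The rule of thumb above is easy to turn into a capacity estimate. A minimal sketch, using an illustrative Llama-2-7B-like configuration (32 layers, 32 KV heads, head_dim 128, fp16) as the example; plug in your own model's numbers:

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    # leading 2 = one K tensor and one V tensor per layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Example config: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes)
per_tok = kv_bytes_per_token(32, 32, 128, 2)   # 524,288 bytes = 512 KiB/token
total_gib = per_tok * 50_000 / 2**30           # 50k active tokens ~= 24.4 GiB
```

At half a MiB per token, 50k tokens resident in memory already consume most of a 40 GB card after weights, which is why KV capacity, not compute, often sets the batch size ceiling.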
Operational tips
- tune block size empirically (fragmentation vs lookup overhead tradeoff),
- monitor KV block churn/eviction rate,
- validate prefix-sharing correctness (multi-sample / branching paths).
3) Prefix reuse (Radix/prefix cache) for repeated prompts
Many real workloads repeat large shared prefixes (system prompt, tool schema, few-shot scaffold).
Prefix-aware caching can cut prefill cost substantially and improve TTFT.
Use when:
- many requests share stable instruction templates,
- retrieval wrapper is constant and only document chunk differs near the tail,
- agent loops repeat planning scaffolds.
Do not assume 100% hit rates; track prefix-hit ratio by route.
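Tracking that ratio is cheap to do from token IDs. A hypothetical telemetry helper (the `PrefixStats` class and its method names are invented for illustration, not part of any serving framework):

```python
from collections import defaultdict

class PrefixStats:
    """Per-route prefix-cache effectiveness counters (illustrative only)."""
    def __init__(self):
        self.hit_tokens = defaultdict(int)
        self.total_tokens = defaultdict(int)

    @staticmethod
    def shared_prefix_len(a, b):
        """Length of the common leading run of two token-ID sequences."""
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    def record(self, route, prompt_ids, cached_ids):
        self.hit_tokens[route] += self.shared_prefix_len(prompt_ids, cached_ids)
        self.total_tokens[route] += len(prompt_ids)

    def hit_ratio(self, route):
        total = self.total_tokens[route]
        return self.hit_tokens[route] / total if total else 0.0
```

A route whose hit ratio sits near zero is telling you its "shared" template is not actually stable (timestamps, user IDs, or reordered fields in the prefix are common culprits).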
4) Speculative decoding for extra decode speed
Concept
A smaller draft model proposes several tokens; the target model then verifies them in a single parallel pass, accepting the longest agreeing prefix.
Best case
- draft is cheap and aligned enough with target,
- output entropy is moderate,
- you need higher throughput without changing final distribution (algorithm-dependent setup).
When it disappoints
- high rejection rate from a weakly matched draft model,
- too much overhead in orchestration,
- short outputs where setup costs dominate.
Decision metric
Track accepted draft tokens / proposed tokens and end-to-end tokens/sec delta, not just kernel-level speed.
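A sketch of that decision metric as code. The 0.6 acceptance floor and 1.05 speedup floor are assumed example guardrails, not recommended universal thresholds; calibrate them against your own cost model:

```python
def speculation_report(proposed, accepted, base_tok_s, spec_tok_s):
    """Summarize whether speculative decoding is paying off.

    proposed/accepted: cumulative draft-token counters from the engine,
    base_tok_s/spec_tok_s: end-to-end tokens/sec without and with speculation.
    """
    accept_ratio = accepted / proposed if proposed else 0.0
    speedup = spec_tok_s / base_tok_s if base_tok_s else 0.0
    return {
        "accept_ratio": accept_ratio,
        "e2e_speedup": speedup,
        # Example guardrails: low acceptance or near-zero net gain => disable.
        "keep_enabled": accept_ratio >= 0.6 and speedup > 1.05,
    }
```

The point of reporting both numbers is that a high kernel-level accept ratio can still coexist with a flat or negative end-to-end delta once orchestration overhead is counted.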
5) Kernel layer: FlashAttention-family acceleration
Even with good scheduling, attention kernels can cap utilization. FlashAttention-2 improves work partitioning and can materially raise achieved FLOPs utilization.
Interpretation for operators:
- better kernels raise the ceiling,
- scheduler + memory policy determine how often you actually hit it.
You need both.
6) Queueing strategy that preserves user experience
Treat serving as a queueing problem, not a single benchmark.
Recommended defaults:
- class-based queues: interactive, standard, bulk,
- admission control when queue delay budget is exhausted,
- max_new_tokens guards per class,
- retry budgets at gateway (prevent retry storms),
- explicit overload mode (degrade gracefully before total collapse).
If you only optimize average latency, p99 tails will still hurt product UX.
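The admission-control default above can be sketched as follows. The class names, delay budgets, and `ClassedQueues` helper are all illustrative assumptions; the one real idea is rejecting at enqueue time once the head-of-queue wait already exceeds the class budget:

```python
import time
from collections import deque

# Example per-class queue-delay budgets in seconds (tune per product SLO).
DELAY_BUDGET_S = {"interactive": 1.0, "standard": 10.0, "bulk": 120.0}

class ClassedQueues:
    def __init__(self, budgets=DELAY_BUDGET_S):
        self.budgets = budgets
        self.queues = {c: deque() for c in budgets}

    def admit(self, cls, request, now=None):
        """Reject up front once the class's queue-delay budget is exhausted."""
        now = now if now is not None else time.monotonic()
        q = self.queues[cls]
        if q and now - q[0][0] > self.budgets[cls]:
            return False  # shed load instead of queuing into a certain SLO miss
        q.append((now, request))
        return True
```

Shedding at admission keeps the overload failure mode "some requests get a fast 429" instead of "every request times out", which is the graceful-degradation behavior the checklist asks for.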
7) Metrics that actually predict incidents
Minimum dashboard:
- TTFT p50/p95/p99
- ITL (inter-token latency) p50/p95
- tokens/sec per GPU (prefill vs decode split)
- queue wait distribution by traffic class
- KV cache usage, fragmentation proxy, eviction/churn
- speculative accept ratio
- prefix cache hit ratio
- OOM/restart count and cause tags
Alert on sustained tail drift, not just one-minute spikes.
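"Sustained drift, not one-minute spikes" translates into a windowed check. A minimal sketch with assumed parameters (a 1.3x breach ratio over a 3-to-5-sample window is an example, not a standard):

```python
from collections import deque

class TailDriftAlert:
    """Fire only when p95 breaches its baseline for a full window of samples."""
    def __init__(self, baseline_p95_ms, ratio=1.3, window=5):
        self.baseline = baseline_p95_ms
        self.ratio = ratio
        self.recent = deque(maxlen=window)

    def observe(self, p95_ms):
        self.recent.append(p95_ms > self.baseline * self.ratio)
        # A single spike leaves at least one False in the window; sustained
        # drift fills the whole window with breaches.
        return len(self.recent) == self.recent.maxlen and all(self.recent)
```

The same shape works for any of the tail metrics on the dashboard (TTFT, ITL, queue wait); only the baseline and window change.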
8) Rollout plan (low-risk)
- Baseline the current engine on a fixed workload mix.
- Enable continuous batching + policy tuning.
- Enable paged KV + verify memory behavior under long contexts.
- Turn on prefix reuse for high-repeat routes.
- A/B speculative decoding on selected models/traffic only.
- Compare by SLO + cost, not throughput alone.
Promotion gate example:
- p95 TTFT non-regression (or improved),
- p95 ITL non-regression,
- ≥ X% throughput gain at equal quality,
- no increase in OOM/eviction incident rate.
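The gate reduces to a simple predicate over the A/B metrics. The dict keys and the 10% default gain are illustrative placeholders for whatever your benchmark harness emits:

```python
def promotion_gate(baseline, candidate, min_throughput_gain=0.10):
    """Gate check for promoting a serving-stack change (thresholds are examples).

    baseline/candidate: dicts with p95_ttft_ms, p95_itl_ms, tok_s, oom_rate.
    """
    return (
        candidate["p95_ttft_ms"] <= baseline["p95_ttft_ms"]    # TTFT non-regression
        and candidate["p95_itl_ms"] <= baseline["p95_itl_ms"]  # ITL non-regression
        and candidate["tok_s"] >= baseline["tok_s"] * (1 + min_throughput_gain)
        and candidate["oom_rate"] <= baseline["oom_rate"]      # no new OOM incidents
    )
```

Wiring this into CI means a change that buys throughput by regressing tails gets rejected automatically instead of debated per launch.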
Quick anti-pattern checklist
- Chasing synthetic benchmark TPS with unrealistic prompt length distribution
- No separation between interactive and batch traffic
- Ignoring KV memory accounting until OOMs appear
- Turning on speculative decoding without acceptance telemetry
- Reporting only average latency
- No overload behavior (system fails "all at once")
References
- vLLM / PagedAttention paper (SOSP 2023): https://arxiv.org/abs/2309.06180
- FlashAttention-2 paper: https://arxiv.org/abs/2307.08691
- Speculative Decoding (ICML 2023): https://proceedings.mlr.press/v202/leviathan23a.html
- SGLang paper (RadixAttention, structured LM programs): https://arxiv.org/abs/2312.07104
- Hugging Face: Continuous batching from first principles: https://huggingface.co/blog/continuous_batching
- Hugging Face TGI docs (features + maintenance-mode note): https://huggingface.co/docs/text-generation-inference/en/index
- TGI PagedAttention conceptual doc: https://huggingface.co/docs/text-generation-inference/en/conceptual/paged_attention
- TensorRT-LLM overview (in-flight batching, paged attention, speculative decoding features): https://nvidia.github.io/TensorRT-LLM/overview.html
One-line takeaway
In production LLM serving, real wins come from co-designing scheduler + KV memory + decoding algorithm, then judging success by tail latency and cost stability, not headline tokens/sec.