LLM Serving Throughput/Latency Playbook: Continuous Batching, Paged KV, and Speculative Decoding
Date: 2026-03-09
Category: knowledge
Domain: software / ML systems / inference infrastructure
Why this matters
LLM serving usually fails in one of two ways:
- you optimize tokens/sec but p95 TTFT (time-to-first-token) explodes,
- or you optimize interactive latency and GPU utilization collapses.
The practical goal is not "max throughput" in isolation. It is:
stable latency SLO under bursty mixed workloads while keeping $/1M tokens predictable.
Core model: what actually bottlenecks
For most production workloads:
- Prefill (prompt processing) is compute-heavy.
- Decode (next-token loop) is memory-bandwidth + scheduler heavy.
- KV cache becomes the dominant capacity limiter as context grows.
So the stack should combine:
- scheduler optimization (continuous/in-flight batching),
- memory optimization (paged KV + prefix reuse),
- algorithmic acceleration (speculative decoding),
- kernel optimization (FlashAttention-family kernels).
Treat these as complementary, not mutually exclusive.
1) Continuous batching first (highest practical ROI)
What it is
Instead of waiting for a batch window and running monolithic batches, the server re-packs work every decode step (iteration-level scheduling).
Why it works
- reduces head-of-line blocking,
- backfills slots immediately when a request finishes,
- mixes prefill/decode more efficiently under variable-length outputs.
Common failure mode
Teams enable continuous batching but keep a naive queue policy, so short prompts are still trapped behind long prefills.
Practical policy
- separate the prefill token budget from the decode token budget per step,
- cap the per-request prefill chunk size,
- apply age-based fairness to avoid starvation,
- run explicit "interactive" vs "bulk" lanes when traffic is mixed.
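The policy above can be sketched as a per-step planner. This is a minimal illustration, not any engine's actual scheduler: the `Request` fields, the `(interactive, age)` priority key, and the budget defaults are all assumptions chosen to make the chunking and fairness logic concrete.

```python
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    prompt_tokens: int       # total prefill work for this request
    interactive: bool = False
    age: int = 0             # steps spent waiting, for fairness
    prefilled: int = 0       # prefill tokens already processed

def plan_step(waiting, prefill_budget=2048, max_chunk=512):
    """Pick prefill work for one iteration of the scheduler.

    - prefill is chunked so one long prompt cannot monopolize a step,
    - the interactive lane goes first, then the oldest waiters,
    - decode tokens for running requests are budgeted separately (not shown).
    """
    candidates = sorted(waiting, key=lambda r: (not r.interactive, -r.age))
    plan, budget = [], prefill_budget
    for req in candidates:
        if budget == 0:
            break
        remaining = req.prompt_tokens - req.prefilled
        chunk = min(remaining, max_chunk, budget)
        if chunk > 0:
            plan.append((req.rid, chunk))
            req.prefilled += chunk
            budget -= chunk
    for req in waiting:
        req.age += 1  # anyone still waiting ages toward higher priority
    return plan
```

Note how a 4000-token prompt only consumes whatever budget is left after shorter and interactive requests are served, which is exactly what prevents head-of-line blocking.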
2) Paged KV cache to reclaim capacity
Why it matters
Contiguous KV allocation fragments memory badly for variable sequence lengths. Paged KV stores the cache in fixed-size blocks reached through an indirection table, so allocation and reuse are far more efficient.
Rule of thumb
If context length is large and request lengths vary widely, paged KV is often non-negotiable for good GPU occupancy.
Capacity intuition
Approximate KV bytes/token (decoder-only, MHA/GQA-style):
KV bytes per token ≈ 2 × layers × kv_heads × head_dim × dtype_bytes
Multiply by active tokens in memory to estimate pressure. This quickly dominates VRAM.
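The rule of thumb above is easy to turn into a capacity estimate. A minimal sketch, using an illustrative Llama-2-7B-like configuration (32 layers, 32 KV heads, head_dim 128, fp16) as the example; plug in your own model's numbers:

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    # leading 2 = one K tensor and one V tensor per layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Example config: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes)
per_tok = kv_bytes_per_token(32, 32, 128, 2)   # 524,288 bytes = 512 KiB/token
total_gib = per_tok * 50_000 / 2**30           # 50k active tokens ~= 24.4 GiB
```

At half a MiB per token, 50k tokens resident in memory already consume most of a 40 GB card after weights, which is why KV capacity, not compute, often sets the batch size ceiling.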
Operational tips
- tune block size empirically (fragmentation vs lookup overhead tradeoff),
- monitor KV block churn/eviction rate,
- validate prefix-sharing correctness (multi-sample / branching paths).
3) Prefix reuse (Radix/prefix cache) for repeated prompts
Many real workloads repeat large shared prefixes (system prompt, tool schema, few-shot scaffold).
Prefix-aware caching can cut prefill cost substantially and improve TTFT.
Use when:
- many requests share stable instruction templates,
- retrieval wrapper is constant and only document chunk differs near the tail,
- agent loops repeat planning scaffolds.
Do not assume 100% hit rates; track prefix-hit ratio by route.
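Tracking that ratio is cheap to do from token IDs. A hypothetical telemetry helper (the `PrefixStats` class and its method names are invented for illustration, not part of any serving framework):

```python
from collections import defaultdict

class PrefixStats:
    """Per-route prefix-cache effectiveness counters (illustrative only)."""
    def __init__(self):
        self.hit_tokens = defaultdict(int)
        self.total_tokens = defaultdict(int)

    @staticmethod
    def shared_prefix_len(a, b):
        """Length of the common leading run of two token-ID sequences."""
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    def record(self, route, prompt_ids, cached_ids):
        self.hit_tokens[route] += self.shared_prefix_len(prompt_ids, cached_ids)
        self.total_tokens[route] += len(prompt_ids)

    def hit_ratio(self, route):
        total = self.total_tokens[route]
        return self.hit_tokens[route] / total if total else 0.0
```

A route whose hit ratio sits near zero is telling you its "shared" template is not actually stable (timestamps, user IDs, or reordered fields in the prefix are common culprits).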
4) Speculative decoding for extra decode speed
Concept
A smaller draft model proposes several tokens; the target model then verifies them in a single parallel pass, accepting the longest agreeing prefix.
Best case
- draft is cheap and aligned enough with target,
- output entropy is moderate,
- you need higher throughput without changing final distribution (algorithm-dependent setup).
When it disappoints
- high rejection rate from a weakly matched draft model,
- too much overhead in orchestration,
- short outputs where setup costs dominate.
Decision metric
Track accepted draft tokens / proposed tokens and end-to-end tokens/sec delta, not just kernel-level speed.
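A sketch of that decision metric as code. The 0.6 acceptance floor and 1.05 speedup floor are assumed example guardrails, not recommended universal thresholds; calibrate them against your own cost model:

```python
def speculation_report(proposed, accepted, base_tok_s, spec_tok_s):
    """Summarize whether speculative decoding is paying off.

    proposed/accepted: cumulative draft-token counters from the engine,
    base_tok_s/spec_tok_s: end-to-end tokens/sec without and with speculation.
    """
    accept_ratio = accepted / proposed if proposed else 0.0
    speedup = spec_tok_s / base_tok_s if base_tok_s else 0.0
    return {
        "accept_ratio": accept_ratio,
        "e2e_speedup": speedup,
        # Example guardrails: low acceptance or near-zero net gain => disable.
        "keep_enabled": accept_ratio >= 0.6 and speedup > 1.05,
    }
```

The point of reporting both numbers is that a high kernel-level accept ratio can still coexist with a flat or negative end-to-end delta once orchestration overhead is counted.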
5) Kernel layer: FlashAttention-family acceleration
Even with good scheduling, attention kernels can cap utilization. FlashAttention-2 improves work partitioning and can materially raise achieved FLOPs utilization.
Interpretation for operators:
- better kernels raise the ceiling,
- scheduler + memory policy determine how often you actually hit it.
You need both.
6) Queueing strategy that preserves user experience
Treat serving as a queueing problem, not a single benchmark.
Recommended defaults:
- class-based queues: interactive, standard, bulk,
- admission control when queue delay budget is exhausted,
- max_new_tokens guards per class,
- retry budgets at gateway (prevent retry storms),
- explicit overload mode (degrade gracefully before total collapse).
If you only optimize average latency, p99 tails will still hurt product UX.
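The admission-control default above can be sketched as follows. The class names, delay budgets, and `ClassedQueues` helper are all illustrative assumptions; the one real idea is rejecting at enqueue time once the head-of-queue wait already exceeds the class budget:

```python
import time
from collections import deque

# Example per-class queue-delay budgets in seconds (tune per product SLO).
DELAY_BUDGET_S = {"interactive": 1.0, "standard": 10.0, "bulk": 120.0}

class ClassedQueues:
    def __init__(self, budgets=DELAY_BUDGET_S):
        self.budgets = budgets
        self.queues = {c: deque() for c in budgets}

    def admit(self, cls, request, now=None):
        """Reject up front once the class's queue-delay budget is exhausted."""
        now = now if now is not None else time.monotonic()
        q = self.queues[cls]
        if q and now - q[0][0] > self.budgets[cls]:
            return False  # shed load instead of queuing into a certain SLO miss
        q.append((now, request))
        return True
```

Shedding at admission keeps the overload failure mode "some requests get a fast 429" instead of "every request times out", which is the graceful-degradation behavior the checklist asks for.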
7) Metrics that actually predict incidents
Minimum dashboard:
- TTFT p50/p95/p99
- ITL (inter-token latency) p50/p95
- tokens/sec per GPU (prefill vs decode split)
- queue wait distribution by traffic class
- KV cache usage, fragmentation proxy, eviction/churn
- speculative accept ratio
- prefix cache hit ratio
- OOM/restart count and cause tags
Alert on sustained tail drift, not just one-minute spikes.
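"Sustained drift, not one-minute spikes" translates into a windowed check. A minimal sketch with assumed parameters (a 1.3x breach ratio over a 3-to-5-sample window is an example, not a standard):

```python
from collections import deque

class TailDriftAlert:
    """Fire only when p95 breaches its baseline for a full window of samples."""
    def __init__(self, baseline_p95_ms, ratio=1.3, window=5):
        self.baseline = baseline_p95_ms
        self.ratio = ratio
        self.recent = deque(maxlen=window)

    def observe(self, p95_ms):
        self.recent.append(p95_ms > self.baseline * self.ratio)
        # A single spike leaves at least one False in the window; sustained
        # drift fills the whole window with breaches.
        return len(self.recent) == self.recent.maxlen and all(self.recent)
```

The same shape works for any of the tail metrics on the dashboard (TTFT, ITL, queue wait); only the baseline and window change.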
8) Rollout plan (low-risk)
- Baseline the current engine on a fixed workload mix.
- Enable continuous batching + policy tuning.
- Enable paged KV + verify memory behavior under long contexts.
- Turn on prefix reuse for high-repeat routes.
- A/B speculative decoding on selected models/traffic only.
- Compare by SLO + cost, not throughput alone.
Promotion gate example:
- p95 TTFT non-regression (or improved),
- p95 ITL non-regression,
- ≥ X% throughput gain at equal quality,
- no increase in OOM/eviction incident rate.
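The gate reduces to a simple predicate over the A/B metrics. The dict keys and the 10% default gain are illustrative placeholders for whatever your benchmark harness emits:

```python
def promotion_gate(baseline, candidate, min_throughput_gain=0.10):
    """Gate check for promoting a serving-stack change (thresholds are examples).

    baseline/candidate: dicts with p95_ttft_ms, p95_itl_ms, tok_s, oom_rate.
    """
    return (
        candidate["p95_ttft_ms"] <= baseline["p95_ttft_ms"]    # TTFT non-regression
        and candidate["p95_itl_ms"] <= baseline["p95_itl_ms"]  # ITL non-regression
        and candidate["tok_s"] >= baseline["tok_s"] * (1 + min_throughput_gain)
        and candidate["oom_rate"] <= baseline["oom_rate"]      # no new OOM incidents
    )
```

Wiring this into CI means a change that buys throughput by regressing tails gets rejected automatically instead of debated per launch.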
Quick anti-pattern checklist
- Chasing synthetic benchmark TPS with unrealistic prompt length distribution
- No separation between interactive and batch traffic
- Ignoring KV memory accounting until OOMs appear
- Turning on speculative decoding without acceptance telemetry
- Reporting only average latency
- No overload behavior (system fails "all at once")
References
- vLLM / PagedAttention paper (SOSP 2023): https://arxiv.org/abs/2309.06180
- FlashAttention-2 paper: https://arxiv.org/abs/2307.08691
- Speculative Decoding (ICML 2023): https://proceedings.mlr.press/v202/leviathan23a.html
- SGLang paper (RadixAttention, structured LM programs): https://arxiv.org/abs/2312.07104
- Hugging Face: Continuous batching from first principles: https://huggingface.co/blog/continuous_batching
- Hugging Face TGI docs (features + maintenance-mode note): https://huggingface.co/docs/text-generation-inference/en/index
- TGI PagedAttention conceptual doc: https://huggingface.co/docs/text-generation-inference/en/conceptual/paged_attention
- TensorRT-LLM overview (in-flight batching, paged attention, speculative decoding features): https://nvidia.github.io/TensorRT-LLM/overview.html
One-line takeaway
In production LLM serving, real wins come from co-designing scheduler + KV memory + decoding algorithm, then judging success by tail latency and cost stability, not headline tokens/sec.