LLM Serving Throughput/Latency Playbook: Continuous Batching, Paged KV, and Speculative Decoding

2026-03-09 · software

Category: knowledge
Domain: software / ML systems / inference infrastructure

Why this matters

LLM serving usually fails in one of two ways: latency tails blow past the SLO under bursty traffic, or the fleet is overprovisioned and cost per token drifts upward.

The practical goal is not "max throughput" in isolation. It is:

stable latency SLO under bursty mixed workloads while keeping $/1M tokens predictable.


Core model: what actually bottlenecks

For most production workloads:

  1. Prefill (prompt processing) is compute-heavy.
  2. Decode (next-token loop) is memory-bandwidth + scheduler heavy.
  3. KV cache becomes the dominant capacity limiter as context grows.

So the stack should combine:

  1. Continuous batching (iteration-level scheduling).
  2. Paged KV cache (capacity and fragmentation).
  3. Prefix reuse (cheaper prefill for repeated prompts).
  4. Speculative decoding (faster decode loop).
  5. Fast attention kernels (per-step compute efficiency).

Treat these as complementary, not mutually exclusive.


1) Continuous batching first (highest practical ROI)

What it is

Instead of waiting for a batch window and running monolithic batches, the server re-packs work every decode step (iteration-level scheduling).

Why it works

Finished sequences release their slots immediately and new requests join mid-flight, so the GPU stays busy instead of idling while the longest sequence in a fixed batch drains.
Common failure mode

Teams enable continuous batching but keep naive queue policy, so short prompts are still trapped behind long prefills.

Practical policy

Cap prefill tokens per scheduling step (or chunk long prefills) so in-flight decodes keep making progress, and admit short requests ahead of long prompts when the queue backs up.
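A minimal sketch of one iteration-level scheduling step, assuming a hypothetical token-budget admission policy (all names here are invented for illustration, not any engine's API):

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    prompt_tokens: int
    max_new_tokens: int
    generated: int = 0

def step_schedule(running, waiting, prefill_token_budget=2048):
    # Retire finished sequences so their slots free up this very step.
    running = [r for r in running if r.generated < r.max_new_tokens]
    # Admit waiting requests FIFO, but cap prefill tokens per step so a
    # single long prompt cannot stall every in-flight decode.
    budget = prefill_token_budget
    while waiting and waiting[0].prompt_tokens <= budget:
        req = waiting.popleft()
        budget -= req.prompt_tokens
        running.append(req)
    return running

# Usage: one finished request retires, a short prompt is admitted,
# and a 4000-token prompt waits for a later step's budget.
running = [Request(0, 100, 4, generated=4)]
waiting = deque([Request(1, 500, 32), Request(2, 4000, 32)])
running = step_schedule(running, waiting)
```

Real schedulers also account for KV block availability before admitting; this sketch only shows the per-step re-packing idea.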

2) Paged KV cache to reclaim capacity

Why it matters

Contiguous KV allocation fragments memory badly for variable sequence lengths. Paged KV stores cache in blocks + indirection table, so allocation/reuse is far more efficient.
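The block-plus-indirection idea can be sketched with minimal bookkeeping (block size and class names are assumptions; engines like vLLM keep analogous block tables):

```python
class PagedKV:
    """Minimal paged-KV bookkeeping sketch: sequences map logical token
    positions to fixed-size physical blocks via a per-sequence table."""
    def __init__(self, num_blocks, block_size=16):
        self.free = list(range(num_blocks))   # free physical block ids
        self.block_size = block_size          # tokens per block
        self.tables = {}                      # seq_id -> [physical block ids]

    def append_token(self, seq_id, pos):
        table = self.tables.setdefault(seq_id, [])
        if pos % self.block_size == 0:        # crossed a block boundary
            table.append(self.free.pop())     # grab a fresh physical block
        return table[pos // self.block_size]  # physical block for this token

    def release(self, seq_id):
        # Blocks return to the pool whole, so no fragmentation accrues.
        self.free.extend(self.tables.pop(seq_id, []))

kv = PagedKV(num_blocks=4, block_size=2)
b0 = kv.append_token("a", 0)   # allocates first block
b1 = kv.append_token("a", 1)   # same block, no allocation
b2 = kv.append_token("a", 2)   # second block
kv.release("a")                # all blocks reusable again
```

Because allocation happens one block at a time, a sequence never reserves more than one partially filled block, which is where the occupancy win comes from.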

Rule of thumb

If context length is large and request lengths vary widely, paged KV is often non-negotiable for good GPU occupancy.

Capacity intuition

Approximate KV bytes/token (decoder-only, MHA/GQA-style):

KV bytes per token ≈ 2 × layers × kv_heads × head_dim × dtype_bytes

Multiply by active tokens in memory to estimate pressure. This quickly dominates VRAM.
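The formula above as a small calculator, using a Llama-2-7B-like configuration as an illustrative (assumed) example:

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    # Factor 2 = one K tensor and one V tensor per layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Example: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes)
per_tok = kv_bytes_per_token(32, 32, 128, 2)      # 512 KiB per token
# Eight concurrent 4096-token sequences:
total_gib = per_tok * 4096 * 8 / 2**30
```

At these numbers, eight 4k-context sequences already consume 16 GiB of KV cache, before weights and activations, which is why KV quickly dominates VRAM.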

Operational tips

Watch KV block utilization and preemption/swap rates, reserve headroom for bursty long-context traffic, and load-test with realistic length distributions rather than uniform prompts.

3) Prefix reuse (Radix/prefix cache) for repeated prompts

Many real workloads repeat large shared prefixes (system prompt, tool schema, few-shot scaffold).

Prefix-aware caching can cut prefill cost substantially and improve TTFT.

Use when:

  1. Many routes share a large system prompt or tool schema.
  2. Few-shot scaffolds repeat across requests.
  3. Multi-turn chat re-sends the conversation prefix each turn.
Do not assume 100% hit rates; track prefix-hit ratio by route.
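The core lookup is just a longest-shared-prefix match over cached token sequences; a naive sketch (real radix caches use a trie over token blocks, not a linear scan):

```python
def longest_cached_prefix(prompt_tokens, cached_prefixes):
    """Return the length of the longest cached prefix that the prompt
    starts with; those tokens can skip prefill entirely."""
    best = 0
    for prefix in cached_prefixes:
        n = len(prefix)
        if n > best and prompt_tokens[:n] == prefix:
            best = n
    return best

cached = [[1, 2, 3], [1, 2, 3, 4, 5]]
hit = longest_cached_prefix([1, 2, 3, 4, 5, 9], cached)   # 5 tokens reused
miss = longest_cached_prefix([7, 8], cached)              # 0 tokens reused
```

The ratio of `hit` lengths to total prompt tokens, aggregated per route, is the prefix-hit ratio worth dashboarding.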


4) Speculative decoding for extra decode speed

Concept

A smaller draft model proposes several tokens; the target model verifies and accepts them in a single parallel forward pass, preserving the target model's output distribution.

Best case

High draft acceptance (predictable text, a well-matched draft model) at low to moderate batch sizes, where decode is memory-bound and verification parallelism is nearly free.

When it disappoints

Low acceptance rates, or already-saturated GPUs, where verification overhead and wasted draft work can make end-to-end throughput worse than plain decoding.
Decision metric

Track accepted draft tokens / proposed tokens and end-to-end tokens/sec delta, not just kernel-level speed.
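The measured acceptance rate can be turned into an expected tokens-per-verify-step estimate. This uses the standard geometric approximation with a flat per-token acceptance probability `alpha`, which is a simplification (real acceptance varies by position):

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per target-model verify step when the
    draft proposes k tokens, each accepted with probability alpha.
    Sum of the geometric series 1 + alpha + ... + alpha^k."""
    if alpha >= 1.0:
        return k + 1  # every draft token plus the bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With 60% acceptance and 4 draft tokens, each verify step emits
# roughly 2.3 tokens instead of 1.
est = expected_tokens_per_step(0.6, 4)
```

If this estimate times the (cheaper) per-step cost does not beat plain decode tokens/sec for your batch sizes, speculation is not worth enabling on that route.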


5) Kernel layer: FlashAttention-family acceleration

Even with good scheduling, attention kernels can cap utilization. FlashAttention-2 improves work partitioning and can materially raise achieved FLOPs utilization.

Interpretation for operators:

Kernels raise the ceiling on achievable throughput; scheduling and KV memory management determine whether you actually reach it.
You need both.


6) Queueing strategy that preserves user experience

Treat serving as a queueing problem, not a single benchmark.

Recommended defaults:

  1. Separate queues (or priorities) for short interactive and long batch prompts.
  2. Admission control with a bounded queue, not unbounded buffering.
  3. Timeouts and load shedding under overload so tails stay bounded.
If you only optimize average latency, p99 tails will still hurt product UX.


7) Metrics that actually predict incidents

Minimum dashboard:

  1. TTFT and per-token latency (p50/p95/p99) by route.
  2. Queue depth and time-in-queue.
  3. KV cache utilization and preemption/swap rate.
  4. Prefix-cache hit ratio.
  5. Speculative acceptance rate and net tokens/sec delta.
  6. Cost per 1M tokens.
Alert on sustained tail drift, not just one-minute spikes.
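The "sustained drift, not spikes" rule can be encoded as a simple consecutive-breach counter (a hypothetical sketch; window size and sustain count are illustrative):

```python
class TailDriftAlert:
    """Fire only when the windowed p99 stays above the SLO for `sustain`
    consecutive windows, so a single one-minute spike never pages."""
    def __init__(self, slo_ms, sustain=3):
        self.slo_ms = slo_ms
        self.sustain = sustain
        self.breaches = 0

    def observe(self, window_p99_ms):
        # Consecutive-breach counter resets on any healthy window.
        self.breaches = self.breaches + 1 if window_p99_ms > self.slo_ms else 0
        return self.breaches >= self.sustain

alert = TailDriftAlert(slo_ms=500, sustain=3)
fired = [alert.observe(x) for x in [600, 600, 400, 600, 600, 600]]
```

The lone healthy window (400 ms) resets the counter, so only the final run of three breaches fires.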


8) Rollout plan (low-risk)

  1. Baseline current engine with fixed workload mix.
  2. Enable continuous batching + policy tuning.
  3. Enable paged KV + verify memory behavior under long contexts.
  4. Turn on prefix reuse for high-repeat routes.
  5. A/B speculative decoding on selected models/traffic only.
  6. Compare by SLO + cost, not throughput alone.

Promotion gate example:

Promote only if p99 TTFT and p99 per-token latency are within an agreed slack of baseline (e.g., +10%) and cost per 1M tokens does not regress.

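One way to encode such a gate as a CI check (the 10%/5% thresholds and metric names are illustrative, not prescriptive):

```python
def promotion_gate(base, cand, ttft_slack=0.10, cost_slack=0.05):
    """Candidate config passes only if it neither regresses p99 TTFT by
    more than ttft_slack nor cost per 1M tokens by more than cost_slack."""
    ttft_ok = cand["p99_ttft_ms"] <= base["p99_ttft_ms"] * (1 + ttft_slack)
    cost_ok = cand["cost_per_mtok"] <= base["cost_per_mtok"] * (1 + cost_slack)
    return ttft_ok and cost_ok

base = {"p99_ttft_ms": 800, "cost_per_mtok": 10.0}
good = promotion_gate(base, {"p99_ttft_ms": 850, "cost_per_mtok": 10.2})
bad = promotion_gate(base, {"p99_ttft_ms": 950, "cost_per_mtok": 10.2})
```

Gating on both tail latency and cost prevents "promoting" a config that wins the throughput benchmark while losing the SLO.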
Quick anti-pattern checklist

  1. Benchmarking with uniform prompt/output lengths.
  2. Chasing headline tokens/sec while p99 TTFT regresses.
  3. Enabling speculative decoding fleet-wide without tracking acceptance.
  4. Ignoring KV preemption until long-context traffic arrives.
  5. Tuning kernels while the scheduler still starves short requests.


One-line takeaway

In production LLM serving, real wins come from co-designing scheduler + KV memory + decoding algorithm, then judging success by tail latency and cost stability, not headline tokens/sec.