Runtime GC Pause Bursts and Dispatch-Phase Slippage (Practical Playbook)

2026-03-16 · finance

Audience: small quant teams running low-latency execution stacks (JVM/Go/Node/Python)


Why this matters

Most slippage stacks model market microstructure well but treat the application runtime as "outside noise."
In live execution that assumption fails: garbage collection (GC) pauses and allocator pressure create dispatch bunching, where child orders that should have gone out on a planned cadence are instead released in a post-pause cluster.

This is not a rare infra curiosity. It is a repeatable control-loop distortion between strategy intent and wire-time execution.


1) Failure mechanism (from heap pressure to bps leak)

A practical causal chain:

  1. Allocation spikes (market-data bursts, short-lived object churn)
  2. Runtime pauses or mutator throttling (STW or effective pause)
  3. Event-loop / scheduler lag grows
  4. Child-order cadence drifts from planned schedule
  5. Post-pause catch-up sends clustered child slices
  6. Queue-entry timing worsens, adverse selection rises
  7. Router switches to aggressive fallback near deadline
  8. Realized IS tail (q95+) degrades

Treat this as a pipeline incident class, not a one-off anomaly.
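The lag that feeds steps 3–5 can be measured directly before it shows up in fills. A minimal asyncio sketch (function name and sampling parameters are illustrative, not from any particular stack):

```python
import asyncio
import time

async def sample_loop_lag(interval_s=0.01, n=100):
    """Sample event-loop lag (step 3 of the chain above): request a
    fixed sleep and measure how much later the scheduler actually
    resumed us. Sustained overshoot is the precursor to dispatch drift."""
    lags = []
    for _ in range(n):
        t0 = time.perf_counter()
        await asyncio.sleep(interval_s)
        # Overshoot beyond the requested sleep is scheduler/GC-induced lag.
        lags.append(max(time.perf_counter() - t0 - interval_s, 0.0))
    return lags
```

Feed the q95/q99 of these samples into the telemetry described in the next section, rather than the mean.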


2) Metrics that expose GC-driven slippage

Core telemetry (the minimum instrumentation set; see the Day 1-2 rollout step):

  1. GC pause duration, frequency, and cause (STW vs concurrent phases where the runtime exposes them)
  2. Event-loop / scheduler lag, sampled continuously, tracked at q95/q99
  3. Allocation rate and heap occupancy
  4. Planned vs actual dispatch timestamps per child order

Execution-coupled metrics

  1. Dispatch Drift Ratio (DDR)

\[ DDR = \frac{\text{median}(|\Delta t_{actual}-\Delta t_{target}|)}{\Delta t_{target}} \]

  2. Burst Compression Index (BCI)

\[ BCI = \frac{\#\{\text{children sent in }[0,\tau]\text{ post-pause}\}}{\#\{\text{children expected in }[0,\tau]\}} \]

  3. Queue Age Loss (QAL)

Approximate expected queue-age erosion from bursty re-entry vs planned cadence.

  4. GC-linked Slippage Uplift (GSLU)

\[ GSLU = IS_{bps}^{\text{pause windows}} - IS_{bps}^{\text{matched non-pause windows}} \]

If DDR/BCI rise while GSLU widens, runtime is now a first-class slippage driver.
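Under the assumption of a uniform planned cadence, DDR and BCI reduce to a few lines over raw dispatch timestamps. A hedged sketch (function names are ours, not a standard API):

```python
from statistics import median

def dispatch_drift_ratio(actual_gaps, target_gap):
    """DDR: median absolute deviation of inter-dispatch gaps from the
    target gap, normalized by the target gap."""
    return median(abs(g - target_gap) for g in actual_gaps) / target_gap

def burst_compression_index(post_pause_times, tau, target_gap):
    """BCI: children actually sent within tau seconds after a pause,
    versus how many the planned cadence would have sent in that window.
    Assumes a uniform cadence of one child per target_gap."""
    sent = sum(1 for t in post_pause_times if 0.0 <= t <= tau)
    expected = max(tau / target_gap, 1.0)
    return sent / expected
```

A BCI well above 1 after pauses, together with a rising DDR, is the quantitative signature of the catch-up bursts described in section 1.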


3) Modeling branch cost (pause-aware expected cost)

Use a simple branch model for each decision interval:

\[ E[\Delta IS] = p_{\text{on-time}}C_{\text{on-time}} + p_{\text{delayed}}C_{\text{delayed}} + p_{\text{burst}}C_{\text{burst}} + p_{\text{deadline-cross}}C_{\text{deadline-cross}} \]

Branch probabilities should be conditioned on observed runtime state (recent pause activity, event-loop lag, allocation pressure) and on market state (spread, depth, volatility); the same pause length costs more in a fragile book than in a calm one.

Operationally, C_burst and C_deadline_cross are often underestimated in mean-only models. Fit q90/q95 explicitly.
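The branch model itself is a plain probability-weighted sum. A minimal sketch, assuming the four branches above with probabilities and costs fitted externally:

```python
def expected_is_delta(probs, costs):
    """Pause-aware expected IS cost for one decision interval.

    probs/costs are dicts keyed by branch. Probabilities must sum to 1.
    Costs for 'burst' and 'deadline_cross' should come from fitted
    q90/q95 tails, not means, per the note above."""
    branches = ("on_time", "delayed", "burst", "deadline_cross")
    assert abs(sum(probs[b] for b in branches) - 1.0) < 1e-9
    return sum(probs[b] * costs[b] for b in branches)
```

Even with small burst/deadline probabilities, tail-fitted costs dominate the expectation, which is exactly why mean-only fits understate the damage.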


4) Control policy (runtime-aware execution states)

GREEN (stable runtime): dispatch on the planned cadence with no adjustments.

AMBER (pause risk rising): widen slice spacing and trim aggressiveness so that a pause, if it lands, costs less.

RED (active pause or burst recovery): hold new child dispatch, then re-enter at a rate-limited cadence rather than flushing the backlog.

SAFE (tail protection): cap participation and prefer passive tactics until runtime telemetry recovers.

Key rule: never let the scheduler "repay missed slices" in one burst without guardrails.
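One way to wire the states and the no-burst-repayment rule into code. The thresholds and function names here are illustrative, not a prescribed implementation:

```python
def runtime_state(pause_active, loop_lag_q95_ms, tail_breach, lag_amber_ms=5.0):
    """Map runtime telemetry to an execution state.
    Thresholds are illustrative; tune them against your own DDR/GSLU data."""
    if tail_breach:       # q95 IS gate already breached: protect the tail
        return "SAFE"
    if pause_active:      # in a pause or its burst-recovery window
        return "RED"
    if loop_lag_q95_ms > lag_amber_ms:
        return "AMBER"
    return "GREEN"

def capped_catchup(missed_slices, max_per_interval=1):
    """Guardrail: never repay schedule debt in one burst. Release at most
    max_per_interval extra slices per interval; carry the rest forward."""
    release = min(missed_slices, max_per_interval)
    return release, missed_slices - release
```

The catch-up cap is the concrete form of the key rule: schedule debt is amortized over intervals instead of dumped into the book at once.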


5) Engineering mitigations by layer

Runtime layer: reduce short-lived object churn on the hot path, pre-allocate and pool buffers, and tune the collector for pause predictability over raw throughput.

Process architecture: isolate the dispatch path from allocation-heavy market-data handling, e.g. in a separate process or thread with its own heap or arena where the runtime allows.

Policy layer: wire the GREEN/AMBER/RED/SAFE states from section 4 into the scheduler, including the rate-limited catch-up guardrail.


6) Validation plan (7-day practical rollout)

Day 1-2
Instrument gc/event-loop/dispatch metrics and tag execution windows.

Day 3-4
Build matched-window attribution (pause vs non-pause) and estimate GSLU.
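On pre-matched windows, the GSLU estimate reduces to a difference of means. A sketch, assuming matching on symbol, time-of-day, and volatility has already happened upstream:

```python
from statistics import mean

def gslu_bps(windows):
    """GSLU: mean IS (bps) over pause-tagged windows minus mean IS over
    their matched non-pause windows.

    `windows` is an iterable of (is_bps, had_pause) pairs produced by
    the upstream matching step."""
    pause = [w for w, p in windows if p]
    calm = [w for w, p in windows if not p]
    if not pause or not calm:
        return None  # not enough data to attribute either way
    return mean(pause) - mean(calm)
```

A persistently positive GSLU over Days 3-4 is the evidence that justifies turning on the AMBER/RED policy in Day 5's shadow mode.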

Day 5
Enable AMBER/RED policy in shadow mode (no live action, decision logging only).

Day 6
Canary live on small symbol bucket with hard rollback gates.

Day 7
Review q95 IS, completion rate, and burst incidence; expand only if all gates pass.

Rollback trigger example: canary q95 IS degrades beyond a preset gate versus the matched baseline, or burst incidence rises above its pre-rollout level.


7) Common anti-patterns

  1. "Infra issue, not trading issue" mindset
    Runtime incidents still hit execution PnL; model them.

  2. Mean-only monitoring
    Pause damage is tail-heavy; q95/q99 is the main signal.

  3. Aggressive catch-up logic
    Repaying schedule debt instantly is usually queue-priority suicide.

  4. No co-analysis with market state
    Same pause length has different cost in calm vs fragile books.


Bottom line

GC pause behavior is not merely a systems-health metric. In automated execution, it is a microstructure timing variable.
If runtime observability is disconnected from slippage attribution, teams will keep misdiagnosing tail losses as "random market noise."

Treat pause risk as a modeled branch in the execution controller, and most of the avoidable runtime-induced bps leakage becomes governable.

