Page-Fault Storm Dispatch Jitter Slippage Playbook

2026-03-19 · finance

Why this matters

Slippage models often price spread + impact + queue risk, but ignore a host-side cost: memory fault turbulence.

When a strategy process hits bursts of minor/major faults (cold mmap pages, reclaim pressure, transparent huge page (THP) compaction side-effects), the decision loop stalls, the order-dispatch cadence drifts out of phase with its schedule, and child orders cluster into moments of worse liquidity.

That creates a hidden basis-point leak even when network and venue metrics look normal.


Failure mechanism (infra -> execution)

  1. Working-set miss / reclaim event raises page-fault service time.
  2. Event-loop thread pauses in fault handling and/or reclaim path.
  3. Signal-to-order latency variance spikes (especially p95/p99).
  4. Child-order schedule compresses (pause -> burst recovery).
  5. Queue priority decays and adverse-selection exposure rises.

This is a classic timing-convexity tax: rare stalls disproportionately damage tail execution outcomes.
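
The pause -> burst pattern in steps 2 to 4 can be made concrete with a toy schedule. This is a minimal sketch, not a model of any real dispatcher: `dispatch_times`, its parameters, and the assumption that every order that came due during a stall is sent the instant the loop resumes are all hypothetical.

```python
def dispatch_times(n_orders, interval_ms, stall_at_ms=None, stall_ms=0.0):
    """Actual send times for a fixed-cadence child-order schedule.

    Assumption: a stall blocks the loop over [stall_at_ms, stall_at_ms + stall_ms),
    and every order that came due in that window is sent when the loop resumes.
    """
    times = []
    for i in range(n_orders):
        due = i * interval_ms
        stalled = (
            stall_at_ms is not None
            and stall_at_ms <= due < stall_at_ms + stall_ms
        )
        times.append(stall_at_ms + stall_ms if stalled else due)
    return times


def gaps(times):
    """Inter-dispatch gaps between consecutive sends."""
    return [b - a for a, b in zip(times, times[1:])]


clean = dispatch_times(10, 100.0)
stalled = dispatch_times(10, 100.0, stall_at_ms=250.0, stall_ms=400.0)

print(gaps(clean))
print(gaps(stalled))
```

In this toy run the clean schedule has uniform 100 ms gaps, while the stalled one shows a single 450 ms gap followed by three zero-gap sends, which is the schedule compression described in step 4.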


Observable metrics

Use a dedicated feature bundle rather than a single counter.

1) MFR: Minor Fault Rate

2) MJS95: Major-Fault Jitter Service p95

3) FBS: Fault-Burst Score

4) DGL: Dispatch Gap Lift

5) QAD: Queue-Age Decay
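
As a minimal sketch of how two of these features might be computed, assume MFR means minor faults per second of wall time and DGL means the current window's p95 dispatch gap divided by a baseline p95. Both definitions are assumptions made for illustration, and the helper names are hypothetical. The fault counters come from Python's standard `resource.getrusage`, which is available on Unix only.

```python
import math
import resource


def fault_counts():
    """Cumulative (minor, major) page faults for this process (Unix only)."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return ru.ru_minflt, ru.ru_majflt


def fault_rates(prev, curr, dt_s):
    """Faults per second between two fault_counts() samples taken dt_s apart."""
    return (curr[0] - prev[0]) / dt_s, (curr[1] - prev[1]) / dt_s


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 1]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def dispatch_gap_lift(window_gaps_ms, baseline_gaps_ms, q=0.95):
    """Assumed DGL: window q-percentile gap over baseline q-percentile gap."""
    return percentile(window_gaps_ms, q) / percentile(baseline_gaps_ms, q)
```

A value of `dispatch_gap_lift` near 1.0 would indicate an undisturbed cadence under these assumptions; values well above 1.0 would indicate that the tail of the gap distribution has widened relative to baseline.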


Modeling pattern

A practical residual model:
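
As a minimal sketch, assuming the residual is realized slippage minus the baseline model's prediction and that the fault bundle enters additively (both are assumptions, as are the symbols), such a model could be written as:

```latex
\[
r_t = s_t^{\text{realized}} - \hat{s}_t^{\text{base}}, \qquad
r_t \approx \beta_0 + \beta_1\,\mathrm{MFR}_t + \beta_2\,\mathrm{MJS95}_t
      + \beta_3\,\mathrm{FBS}_t + \beta_4\,\mathrm{DGL}_t
      + \beta_5\,\mathrm{QAD}_t + \varepsilon_t
\]
```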

For fast operations, maintain both:

  1. Mean residual head (expected cost)
  2. q95 residual head (tail protection)

On many desks, fault features are weak predictors of the mean residual but strong predictors of the q95 residual. That is exactly where the silent slippage tax hides.
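
The difference between the two heads comes down to the loss each one minimizes. A common choice, assumed here rather than taken from any specific desk's setup, is squared error for the mean head and pinball (quantile) loss for the q95 head. The sketch below uses made-up residuals in bps to show why a rare-stall tail barely moves the mean head but dominates the q95 head.

```python
def mse(y_true, y_pred):
    """Mean squared error: minimized by the conditional mean."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def pinball_loss(y_true, y_pred, q):
    """Pinball loss: minimized by the conditional q-quantile."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


# Hypothetical residuals (bps): 18 clean fills and 2 stall-affected fills.
residuals = [0.0] * 18 + [10.0] * 2
at_mean = [1.0] * len(residuals)   # constant prediction at the sample mean
at_tail = [10.0] * len(residuals)  # constant prediction at the stall level

print(mse(residuals, at_mean), mse(residuals, at_tail))
print(pinball_loss(residuals, at_mean, 0.95), pinball_loss(residuals, at_tail, 0.95))
```

Squared error prefers the prediction near the mean, while the q95 pinball loss prefers the prediction at the stall level, so a feature that only explains the two stall-affected fills would matter mostly to the q95 head.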


Regime state machine

GREEN_FAULT_CLEAN

AMBER_FAULT_WARMING

RED_FAULT_STRESS

SAFE_CONTAIN

Use hysteresis to avoid rapid state flapping.
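
Hysteresis here means that the score needed to enter a state is higher than the score needed to leave it. The sketch below assumes a single stress score in [0, 1] drives the machine and moves at most one state per update; the thresholds and the single-score input are hypothetical, only the state names come from the list above.

```python
STATES = [
    "GREEN_FAULT_CLEAN",
    "AMBER_FAULT_WARMING",
    "RED_FAULT_STRESS",
    "SAFE_CONTAIN",
]

# Hypothetical thresholds: ENTER[i] is the score needed to move up into state i,
# EXIT[i] is the score below which state i is left downward. EXIT < ENTER.
ENTER = [None, 0.3, 0.6, 0.9]
EXIT = [None, 0.2, 0.5, 0.8]


def next_state(idx, score):
    """Advance at most one state per update, using separate up/down thresholds."""
    if idx + 1 < len(STATES) and score >= ENTER[idx + 1]:
        return idx + 1
    if idx > 0 and score < EXIT[idx]:
        return idx - 1
    return idx


def run(scores, idx=0):
    """Return the state name after each score update, starting from state idx."""
    path = []
    for s in scores:
        idx = next_state(idx, s)
        path.append(STATES[idx])
    return path


print(run([0.1, 0.35, 0.25, 0.28, 0.15]))
```

With these thresholds the machine stays in AMBER while the score hovers between 0.2 and 0.3 and returns to GREEN only once the score drops below 0.2; a single 0.3 threshold would have flapped on the same sequence.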


Control actions by state

GREEN -> AMBER

AMBER -> RED

RED -> SAFE_CONTAIN


Fast diagnostics checklist

  1. Did MFR/FBS spike before DGL and slippage tail lift?
  2. Is network/venue telemetry stable while host fault metrics degrade?
  3. Are bursts concentrated around opens/news windows (worst convexity)?
  4. Did containment reduce q95 residual within the expected half-life?

If the answer to all four is yes, this is likely fault-driven execution degradation, not pure market turbulence.
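
Item 1 of the checklist is an ordering question, which can be checked mechanically on aligned time buckets. The following is a rough sketch under stated assumptions: both series are sampled on the same buckets, a "spike" is the first bucket above a fixed threshold, and the function names, thresholds, and `max_lead` window are hypothetical.

```python
def first_breach(series, threshold):
    """Index of the first sample strictly above threshold, or None."""
    for i, value in enumerate(series):
        if value > threshold:
            return i
    return None


def fault_led(fault_series, gap_series, fault_thr, gap_thr, max_lead=5):
    """True if the fault spike precedes the dispatch-gap spike by at most max_lead buckets.

    A same-bucket breach counts as fault-led, because bucketed sampling
    cannot resolve ordering inside one bucket.
    """
    f = first_breach(fault_series, fault_thr)
    d = first_breach(gap_series, gap_thr)
    if f is None or d is None:
        return False
    return 0 <= d - f <= max_lead


fbs = [0.1, 0.1, 0.7, 0.8, 0.2]   # hypothetical fault-burst scores
dgl = [1.0, 1.0, 1.1, 2.5, 1.2]   # hypothetical dispatch gap lift values
print(fault_led(fbs, dgl, fault_thr=0.5, gap_thr=2.0))
```

A threshold-crossing check like this is only a first pass; it does not establish causation, which is why the remaining checklist items rule out network and venue explanations.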


Deployment playbook (safe rollout)

  1. Shadow phase: log fault bundle + residual attribution only
  2. Advisory phase: produce non-binding state recommendations
  3. Canary phase: apply controls to a small flow slice
  4. Promotion gate: require q95 improvement without completion-rate damage
  5. Rollback rule: auto-disable if miss-rate or opportunity-cost spikes beyond budget
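
The promotion gate and rollback rule in steps 4 and 5 can be expressed as two small predicates. This is a sketch under assumptions: the `SliceStats` fields, the default tolerances, and the comparison of a canary slice against a control slice are hypothetical and would need to be set from the desk's own budgets.

```python
from dataclasses import dataclass


@dataclass
class SliceStats:
    q95_residual_bps: float      # tail slippage residual for the slice
    completion_rate: float       # fraction of parent quantity completed
    miss_rate: float             # fraction of orders that missed their window
    opportunity_cost_bps: float  # cost attributed to unfilled quantity


def promote(control, canary, min_q95_gain_bps=0.5, max_completion_drop=0.01):
    """Promotion gate: q95 must improve without material completion-rate damage."""
    gain = control.q95_residual_bps - canary.q95_residual_bps
    drop = control.completion_rate - canary.completion_rate
    return gain >= min_q95_gain_bps and drop <= max_completion_drop


def should_rollback(canary, miss_budget, opp_cost_budget_bps):
    """Rollback rule: disable controls if either budget is exceeded."""
    return (
        canary.miss_rate > miss_budget
        or canary.opportunity_cost_bps > opp_cost_budget_bps
    )


control = SliceStats(6.0, 0.98, 0.02, 1.0)
canary = SliceStats(5.0, 0.975, 0.02, 1.0)
print(promote(control, canary), should_rollback(canary, 0.05, 2.0))
```

Keeping the two checks separate means a canary can fail promotion without triggering a rollback, which matches the distinction between steps 4 and 5.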

Common mistakes


Bottom line

Page-fault storms are a microstructure-relevant latency regime.

If your model ignores fault-driven dispatch jitter, you will underprice tail slippage, over-trust passive fills during turbulence, and pay hidden bps during exactly the windows that matter most.