Page-Fault Storm Dispatch Jitter Slippage Playbook
Why this matters
Slippage models often price spread + impact + queue risk, but ignore a host-side cost: memory fault turbulence.
When a strategy process hits bursts of minor/major faults (cold mmap pages, reclaim pressure, THP compaction side-effects), the decision loop stalls, order dispatch cadence dephases, and child orders cluster into worse liquidity moments.
That creates a hidden basis-point leak even when network and venue metrics look normal.
Failure mechanism (infra -> execution)
- Working-set miss / reclaim event raises page-fault service time.
- Event-loop thread pauses in fault handling and/or reclaim path.
- Signal-to-order latency variance spikes (especially p95/p99).
- Child-order schedule compresses (pause -> burst recovery).
- Queue priority decays and adverse-selection exposure rises.
This is a classic timing-convexity tax: rare stalls disproportionately damage tail execution outcomes.
Observable metrics
Use a dedicated feature bundle rather than a single counter.
1) MFR: Minor Fault Rate
- Per-strategy process minor faults/sec (rolling windows: 1s, 10s, 60s)
- Useful for detecting cold-page churn and mmap walk pressure
2) MJS95: Major-Fault Jitter Service p95
- p95 major-fault service time (or proxy from stall traces)
- Directly captures worst blocking paths
3) FBS: Fault-Burst Score
- Burstiness index of faults (e.g., variance-to-mean or p95/median in short windows)
- Distinguishes steady background faults from dangerous clusters
4) DGL: Dispatch Gap Lift
- Incremental lift in inter-dispatch gap vs clean baseline
- Core bridge metric from infra to execution timing
5) QAD: Queue-Age Decay
- Delta in expected queue age / rank proxy during fault bursts
- Converts timing turbulence into microstructure damage
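A minimal sketch of how three of these features could be computed, assuming a Linux host and fixed-interval sampling. The function names, the choice of variance-to-mean for FBS, and the "relative lift minus one" definition of DGL are illustrative assumptions, not a reference implementation.

```python
from __future__ import annotations

import statistics
from typing import Sequence


def read_fault_counters(pid: str = "self") -> tuple[int, int]:
    """Return cumulative (minor, major) fault counts from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as fh:
        raw = fh.read()
    # The comm field can contain spaces or parentheses, so split after the last ')'.
    fields = raw[raw.rfind(")") + 2:].split()
    # fields[0] is the state (field 3), so minflt (field 10) and majflt (field 12)
    # sit at indexes 7 and 9.
    return int(fields[7]), int(fields[9])


def fault_rate(cumulative_counts: Sequence[int], interval_s: float) -> list[float]:
    """MFR: turn cumulative counts sampled every interval_s into faults/sec."""
    return [(b - a) / interval_s
            for a, b in zip(cumulative_counts, cumulative_counts[1:])]


def fault_burst_score(faults_per_window: Sequence[float]) -> float:
    """FBS as a variance-to-mean ratio: 0 for perfectly steady faults, large for clusters."""
    mean = statistics.fmean(faults_per_window)
    if mean == 0:
        return 0.0
    return statistics.pvariance(faults_per_window) / mean


def dispatch_gap_lift(dispatch_ts: Sequence[float], baseline_gap_s: float) -> float:
    """DGL: mean inter-dispatch gap relative to the clean baseline, minus 1."""
    gaps = [b - a for a, b in zip(dispatch_ts, dispatch_ts[1:])]
    return statistics.fmean(gaps) / baseline_gap_s - 1.0
```

Two windows with the same total fault count can score very differently: `fault_burst_score([5, 5, 5, 5])` is zero, while `fault_burst_score([0, 0, 0, 20])` is large. That gap is the reason to track burstiness separately from the raw rate.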
Modeling pattern
A practical residual model:
IS_residual_t = f(market_state_t, order_state_t, fault_state_t)
fault_state_t = {MFR, MJS95, FBS, DGL}
For fast operations, maintain both:
- Mean residual head (expected cost)
- q95 residual head (tail protection)
On many desks, fault features add little to the mean head but carry real signal in the q95 head. That is where the silent slippage tax hides.
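One common way to train a q95 head is the quantile (pinball) loss, which penalizes under-prediction far more than over-prediction. The sketch below assumes that choice; the text above does not prescribe a loss, and the residual values used in the usage note are made up for illustration.

```python
from typing import Sequence


def pinball_loss(y_true: Sequence[float], y_pred: Sequence[float], q: float = 0.95) -> float:
    """Mean quantile (pinball) loss at quantile q."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        err = y - p
        # Under-prediction (err >= 0) costs q per unit; over-prediction costs (1 - q).
        total += q * err if err >= 0 else (q - 1.0) * err
    return total / len(y_true)
```

With 90 residuals of 1 bp and 10 residuals of 20 bps, the sample mean is 2.9 bps, but a constant prediction of 20 bps has a lower pinball loss at q=0.95 than a constant prediction of 1 bp. The rare stalls dominate the tail head even though they barely move the mean head.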
Regime state machine
GREEN_FAULT_CLEAN
- MFR stable, DGL near baseline
- Normal scheduling
AMBER_FAULT_WARMING
- MFR drift up, mild burstiness
- Pre-emptive soft controls
RED_FAULT_STRESS
- FBS high + DGL spike + q95 residual widening
- Reduce aggression, avoid bursty catch-up behavior
SAFE_CONTAIN
- Fault turbulence persists or escalates
- Hard containment to protect tail risk and execution completion integrity
Use hysteresis to avoid rapid state flapping.
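A sketch of hysteresis on the four states, assuming the fault bundle has already been collapsed into one normalized stress score in [0, 1]. The score, the threshold values, and the calm-tick rule are all hypothetical. For brevity it also ignores the time-based "persists" condition for SAFE_CONTAIN and escalates on score alone.

```python
from enum import IntEnum


class FaultRegime(IntEnum):
    GREEN_FAULT_CLEAN = 0
    AMBER_FAULT_WARMING = 1
    RED_FAULT_STRESS = 2
    SAFE_CONTAIN = 3


# Hypothetical thresholds: each state is entered at a higher score than it is left at.
ENTER = {FaultRegime.AMBER_FAULT_WARMING: 0.30,
         FaultRegime.RED_FAULT_STRESS: 0.60,
         FaultRegime.SAFE_CONTAIN: 0.85}
EXIT = {FaultRegime.AMBER_FAULT_WARMING: 0.20,
        FaultRegime.RED_FAULT_STRESS: 0.50,
        FaultRegime.SAFE_CONTAIN: 0.75}


class RegimeMachine:
    def __init__(self, calm_ticks_required: int = 3) -> None:
        self.state = FaultRegime.GREEN_FAULT_CLEAN
        self.calm_ticks_required = calm_ticks_required
        self._calm_ticks = 0

    def update(self, score: float) -> FaultRegime:
        # Escalate immediately, to the highest state whose entry threshold is met.
        target = self.state
        for level in FaultRegime:
            if level > self.state and score >= ENTER[level]:
                target = level
        if target > self.state:
            self.state = target
            self._calm_ticks = 0
            return self.state
        # Step down one level only after enough consecutive ticks below the exit threshold.
        if self.state > FaultRegime.GREEN_FAULT_CLEAN and score < EXIT[self.state]:
            self._calm_ticks += 1
            if self._calm_ticks >= self.calm_ticks_required:
                self.state = FaultRegime(self.state - 1)
                self._calm_ticks = 0
        else:
            self._calm_ticks = 0
        return self.state
```

The asymmetry is deliberate: escalation is instant, while de-escalation needs both a lower threshold and sustained calm. A score oscillating around a single boundary therefore cannot flip the policy every tick.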
Control actions by state
GREEN -> AMBER
- Prefault likely hot datasets before session windows
- Verify memory pinning/NUMA placement for hot processes
- Tighten memory observability sampling intervals while keeping probe overhead negligible
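Prefaulting can be as simple as touching one byte per page of a mapped file before the session window. A minimal sketch, assuming the hot dataset is a regular file mapped read-only; `prefault_file` is a hypothetical helper name. It does not pin pages, so they can still be evicted under reclaim pressure. Locking pages in memory is a separate OS-level step not shown here.

```python
import mmap
import os


def prefault_file(path: str) -> int:
    """Touch one byte per page of a read-only mapping; return the number of pages touched."""
    size = os.path.getsize(path)
    if size == 0:
        return 0  # an empty file cannot be mapped
    touched = 0
    checksum = 0
    with open(path, "rb") as fh, mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for offset in range(0, size, mmap.PAGESIZE):
            # Reading a byte makes the kernel fault the page in now,
            # instead of on the first access from the decision loop.
            checksum ^= mm[offset]
            touched += 1
    return touched
```

Run it from a warm-up step outside the latency-critical thread, so the faults it triggers are paid before the decision loop needs the data.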
AMBER -> RED
- Lower burst size and cap catch-up aggressiveness
- Stretch child-order cadence to reduce pause-then-sweep behavior
- Shift to more conservative venue/routing posture during instability windows
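Capping catch-up can be sketched as a per-tick limit on how much backlog may be released on top of the normal schedule. The function names and the unit ("slices per tick") are assumptions for illustration. A real scheduler would also need to respect the order's deadline.

```python
from __future__ import annotations


def plan_release(scheduled_per_tick: int, backlog: int,
                 max_catchup_per_tick: int) -> tuple[int, int]:
    """Return (slices to send this tick, remaining backlog) under a catch-up cap."""
    catchup = min(backlog, max_catchup_per_tick)
    return scheduled_per_tick + catchup, backlog - catchup


def drain_backlog(scheduled_per_tick: int, backlog: int,
                  max_catchup_per_tick: int) -> list[int]:
    """Per-tick send sizes until the backlog left by a stall is cleared."""
    if backlog > 0 and max_catchup_per_tick <= 0:
        raise ValueError("a positive cap is needed to drain a backlog")
    sends = []
    while backlog > 0:
        now, backlog = plan_release(scheduled_per_tick, backlog, max_catchup_per_tick)
        sends.append(now)
    return sends
```

After a stall that leaves five slices behind at one slice per tick, a cap of 2 yields sends of 3, 3, 2 instead of a single sweep of 6. The schedule recovers over three ticks rather than in one burst.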
RED -> SAFE_CONTAIN
- Freeze non-essential background tasks on strategy hosts
- Apply strict risk envelope: lower participation caps, stricter tail budget gates
- If persistent, trigger host-level remediation lane before resuming normal policy
Fast diagnostics checklist
- Did MFR/FBS spike before DGL and slippage tail lift?
- Is network/venue telemetry stable while host fault metrics degrade?
- Are bursts concentrated around opens/news windows (worst convexity)?
- Did containment reduce q95 residual within the expected half-life?
If the answer to all four is yes, this is likely fault-driven execution degradation, not pure market turbulence.
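The first question (did fault metrics lead DGL?) can be checked roughly with a lagged correlation. A sketch, assuming both series are sampled on the same clock and are longer than `max_lag + 1`; `best_lead` is a hypothetical helper. A positive lead is consistent with fault-driven jitter but does not prove causation.

```python
from __future__ import annotations

import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def best_lead(fault_series: Sequence[float], dgl_series: Sequence[float],
              max_lag: int) -> tuple[int, float]:
    """Find the lag (in samples) at which fault_series[t] best matches dgl_series[t + lag]."""
    best = (0, pearson(fault_series, dgl_series))
    for lag in range(1, max_lag + 1):
        corr = pearson(fault_series[:-lag], dgl_series[lag:])
        if corr > best[1]:
            best = (lag, corr)
    return best
```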
Deployment playbook (safe rollout)
- Shadow phase: log fault bundle + residual attribution only
- Advisory phase: produce non-binding state recommendations
- Canary phase: apply controls to a small flow slice
- Promotion gate: require q95 improvement without completion-rate damage
- Rollback rule: auto-disable if miss-rate or opportunity-cost spikes beyond budget
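The promotion gate and rollback rule can be written as two explicit predicates over canary statistics. A sketch only: the field names, units, and every threshold below are placeholders to be replaced with desk-specific budgets.

```python
from dataclasses import dataclass


@dataclass
class FlowStats:
    q95_residual_bps: float      # tail implementation-shortfall residual
    completion_rate: float       # fraction of parent quantity completed
    miss_rate: float             # fraction of orders missing their schedule
    opportunity_cost_bps: float  # cost attributed to unfilled quantity


def promotion_ok(control: FlowStats, canary: FlowStats,
                 min_q95_gain_bps: float = 0.5,
                 max_completion_drop: float = 0.005) -> bool:
    """Promote only if the tail improves and completion is not materially worse."""
    tail_improved = canary.q95_residual_bps <= control.q95_residual_bps - min_q95_gain_bps
    completion_held = canary.completion_rate >= control.completion_rate - max_completion_drop
    return tail_improved and completion_held


def should_rollback(canary: FlowStats, miss_rate_budget: float,
                    opportunity_cost_budget_bps: float) -> bool:
    """Auto-disable if either the miss rate or the opportunity cost exceeds its budget."""
    return (canary.miss_rate > miss_rate_budget
            or canary.opportunity_cost_bps > opportunity_cost_budget_bps)
```

Keeping the two checks separate means a canary can fail promotion without triggering rollback. That is the expected outcome when controls are harmless but not yet helpful.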
Common mistakes
- Treating minor faults as harmless because major faults are low
- Looking only at averages (tail is where damage lives)
- Enabling aggressive catch-up after pauses (creates toxic bursts)
- Ignoring host memory behavior because NIC/latency dashboards look green
Bottom line
Page-fault storms are a microstructure-relevant latency regime.
If your model ignores fault-driven dispatch jitter, you will underprice tail slippage, over-trust passive fills during turbulence, and pay hidden bps during exactly the windows that matter most.