Page-Fault Storm Dispatch Jitter Slippage Playbook

Why this matters

Slippage models often price spread + impact + queue risk, but ignore a host-side cost: memory fault turbulence.

When a strategy process hits bursts of minor or major faults (cold mmap pages, reclaim pressure, THP compaction side effects), the decision loop stalls, order dispatch cadence drifts out of phase with the intended schedule, and child orders cluster into moments of worse liquidity.

That creates a hidden basis-point leak even when network and venue metrics look normal.

Failure mechanism (infra -> execution)

Working-set miss / reclaim event raises page-fault service time.
Event-loop thread pauses in fault handling and/or reclaim path.
Signal-to-order latency variance spikes (especially p95/p99).
Child-order schedule compresses (pause -> burst recovery).
Queue priority decays and adverse-selection exposure rises.

This is a classic timing-convexity tax: rare stalls disproportionately damage tail execution outcomes.

Observable metrics

Use a dedicated feature bundle rather than a single counter.

1) MFR — Minor Fault Rate

Per-strategy process minor faults/sec (rolling windows: 1s, 10s, 60s)
Useful for detecting cold-page churn and mmap walk pressure
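
A minimal sketch of an MFR sampler, assuming a Linux host where /proc/<pid>/stat exposes cumulative minflt (field 10) and majflt (field 12). The class name FaultRateWindow and the polling approach are illustrative assumptions, not a prescribed collector.

```python
from collections import deque


def parse_fault_counters(stat_line):
    """Return (minflt, majflt) from one line of /proc/<pid>/stat."""
    # comm (field 2) may contain spaces or parentheses, so split after the last ')'.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); minflt is field 10 and majflt is field 12.
    return int(rest[7]), int(rest[9])


def read_fault_counters(pid):
    """Read cumulative fault counters for a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_fault_counters(f.read())


class FaultRateWindow:
    """Rolling faults/sec over a fixed horizon, fed with cumulative counter samples."""

    def __init__(self, horizon_s):
        self.horizon_s = horizon_s
        self.samples = deque()  # (timestamp_s, cumulative_count)

    def add(self, ts, count):
        self.samples.append((ts, count))
        # Drop the oldest sample while the next-oldest still spans the horizon.
        while len(self.samples) > 2 and ts - self.samples[1][0] >= self.horizon_s:
            self.samples.popleft()
        t0, c0 = self.samples[0]
        dt = ts - t0
        return (count - c0) / dt if dt > 0 else 0.0
```

One window per horizon (1s, 10s, 60s) can be fed from the same poll of read_fault_counters, so the three rates stay aligned in time.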

2) MJS95 — Major-Fault Jitter Service p95

p95 major-fault service time (or proxy from stall traces)
Directly captures worst blocking paths

3) FBS — Fault-Burst Score

Burstiness index of faults (e.g., variance-to-mean or p95/median in short windows)
Distinguishes steady background faults from dangerous clusters
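
As a rough illustration of the two burstiness indices mentioned above, the sketch below computes a variance-to-mean ratio and a p95/median ratio over per-bucket fault counts. The bucket width and the function names are assumptions; either index would need calibration against a clean baseline before it drives any state change.

```python
import statistics


def variance_to_mean(counts):
    """Variance-to-mean ratio of per-bucket fault counts (about 1 for Poisson-like arrivals)."""
    mean = statistics.fmean(counts)
    if mean == 0:
        return 0.0
    return statistics.pvariance(counts) / mean


def p95_over_median(counts):
    """Ratio of the 95th percentile bucket to the median bucket."""
    median = statistics.median(counts)
    if median == 0:
        # Undefined ratio: report infinity only if any faults occurred at all.
        return float("inf") if max(counts) > 0 else 0.0
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(counts, n=20, method="inclusive")[-1]
    return p95 / median


steady = [10] * 20           # same fault count in every bucket
bursty = [0] * 19 + [200]    # same total, concentrated in one bucket
print(variance_to_mean(steady), variance_to_mean(bursty))
```

Both series have the same mean fault rate, which is why a plain rate counter cannot tell them apart while the burst score can.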

4) DGL — Dispatch Gap Lift

Incremental lift in inter-dispatch gap vs clean baseline
Core bridge metric from infra to execution timing
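
One possible way to compute DGL is sketched below. It assumes the lift is measured on the p95 inter-dispatch gap relative to a clean-baseline p95, since a pause followed by a compressed burst can leave the mean gap almost unchanged. That choice of statistic, and the function names, are assumptions rather than a fixed definition.

```python
import statistics


def dispatch_gaps(timestamps_ms):
    """Inter-dispatch gaps from an ordered list of dispatch timestamps."""
    return [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]


def dispatch_gap_lift(gaps, baseline_p95):
    """Relative lift of the observed p95 gap over a clean-baseline p95 gap."""
    if baseline_p95 <= 0:
        raise ValueError("baseline_p95 must be positive")
    p95 = statistics.quantiles(gaps, n=20, method="inclusive")[-1]
    return p95 / baseline_p95 - 1.0


# Clean cadence: one child order every 100 ms.
clean = dispatch_gaps(list(range(0, 2100, 100)))
# Stalled cadence: two 1000 ms pauses among otherwise normal gaps.
stalled = [100] * 18 + [1000] * 2

print(dispatch_gap_lift(clean, 100.0), dispatch_gap_lift(stalled, 100.0))
```

A lift near zero means dispatch timing matches the baseline; a large positive lift flags stalls even when average throughput looks normal.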

5) QAD — Queue-Age Decay

Delta in expected queue age / rank proxy during fault bursts
Converts timing turbulence into microstructure damage

Modeling pattern

A practical residual model:

IS_residual_t = f(market_state_t, order_state_t, fault_state_t)
fault_state_t = {MFR, MJS95, FBS, DGL, QAD}

For fast operational use, maintain two model heads:

Mean residual head (expected cost)
q95 residual head (tail protection)

On many desks, fault features add little to the mean head but carry real signal in the q95 head. That is exactly where the silent slippage tax hides.
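
To make the mean-versus-q95 distinction concrete, here is a small sketch of the pinball (quantile) loss that a q95 head would typically be trained on. The residual numbers are made up for illustration, and best_constant is a hypothetical helper standing in for a real model fit.

```python
def pinball_loss(y_true, y_pred, q):
    """Average quantile loss: under-prediction costs q, over-prediction costs 1 - q."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        err = y - p
        total += q * err if err >= 0 else (q - 1.0) * err
    return total / len(y_true)


def best_constant(residuals, q, candidates):
    """Pick the constant prediction with the lowest pinball loss (toy stand-in for a model)."""
    n = len(residuals)
    return min(candidates, key=lambda c: pinball_loss(residuals, [c] * n, q))


# Illustrative residuals in bps: mostly zero, with a 10% tail of 5 bps stalls.
residuals = [0.0] * 90 + [5.0] * 10
grid = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

print(best_constant(residuals, 0.5, grid))   # median-style head
print(best_constant(residuals, 0.95, grid))  # tail head
```

In this toy data the median-style head settles on 0 bps while the q95 head settles on 5 bps, which mirrors how a rare fault-driven stall can be invisible to a central estimate yet dominate the tail.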

Regime state machine

GREEN_FAULT_CLEAN

MFR stable, DGL near baseline
Normal scheduling

AMBER_FAULT_WARMING

MFR drift up, mild burstiness
Pre-emptive soft controls

RED_FAULT_STRESS

FBS high + DGL spike + q95 residual widening
Reduce aggression, avoid bursty catch-up behavior

SAFE_CONTAIN

Fault turbulence persists or escalates
Hard containment to protect tail risk and execution completion integrity

Use hysteresis to avoid rapid state flapping.
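
A minimal sketch of the four-state machine with hysteresis follows. It assumes the fault bundle has already been reduced to a single stress score (a hypothetical composite of FBS, DGL, and q95 widening), and the thresholds, exit margin, and dwell count are placeholder values. Escalation is immediate; de-escalation steps down one level only after the score has stayed clearly below the entry threshold for several consecutive updates.

```python
from enum import IntEnum


class FaultRegime(IntEnum):
    GREEN_FAULT_CLEAN = 0
    AMBER_FAULT_WARMING = 1
    RED_FAULT_STRESS = 2
    SAFE_CONTAIN = 3


class RegimeMachine:
    def __init__(self, enter=(1.0, 2.0, 3.0), exit_margin=0.25, dwell=3):
        self.enter = enter              # score needed to enter AMBER, RED, SAFE_CONTAIN
        self.exit_margin = exit_margin  # how far below the entry level counts as calm
        self.dwell = dwell              # consecutive calm updates needed to step down
        self.state = FaultRegime.GREEN_FAULT_CLEAN
        self._calm = 0

    def update(self, score):
        # Level implied by the score alone, ignoring history.
        target = sum(score >= t for t in self.enter)
        if target > self.state:
            self.state = FaultRegime(target)
            self._calm = 0
        elif (self.state > FaultRegime.GREEN_FAULT_CLEAN
              and score < self.enter[self.state - 1] - self.exit_margin):
            self._calm += 1
            if self._calm >= self.dwell:
                self.state = FaultRegime(self.state - 1)
                self._calm = 0
        else:
            self._calm = 0  # inside the hysteresis band: hold state, reset the calm counter
        return self.state


machine = RegimeMachine()
for score in [0.5, 1.1, 0.9, 1.1, 0.9, 0.5, 0.5, 0.5]:
    print(score, machine.update(score).name)
```

With these placeholder settings, a score oscillating between 0.9 and 1.1 holds AMBER instead of flapping, and the machine returns to GREEN only after three calm readings. A production version would likely also add a persistence condition before entering SAFE_CONTAIN, which this sketch omits.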

Control actions by state

GREEN -> AMBER

Prefault likely hot datasets before session windows
Verify memory pinning/NUMA placement for hot processes
Tighten memory observability sampling (without adding probe overhead)
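
The prefault step can be sketched as below: map a file read-only and touch one byte per page before the session window. Note the assumption: run from a helper process, this only warms the OS page cache (turning would-be major faults into minor ones). To remove minor faults as well, the strategy process would need to touch or lock its own mapping, for example via Linux facilities such as mlock or MAP_POPULATE. The function name is illustrative.

```python
import mmap
import os


def prefault_file(path):
    """Touch one byte per page of a file mapping; return the number of pages touched."""
    size = os.path.getsize(path)
    if size == 0:
        return 0  # an empty file cannot be mapped
    touched = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for offset in range(0, size, mmap.PAGESIZE):
                _ = mm[offset]  # reading the byte forces the page to be resident
                touched += 1
    return touched
```

Running this over the datasets a strategy is known to read at the open moves the fault cost to a time when dispatch latency does not matter.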

AMBER -> RED

Lower burst size and cap catch-up aggressiveness
Stretch child-order cadence to reduce pause-then-sweep behavior
Shift to more conservative venue/routing posture during instability windows
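
The cadence-stretching idea can be illustrated with a small pacing helper: after a stall, the quantity that is due gets spread over upcoming slots with a per-slot cap, rather than swept in one burst. The function name, the cap multiplier, and the slot model are assumptions for illustration only.

```python
def paced_catchup(due_qty, normal_clip, catchup_mult=1.5):
    """Split the quantity due after a stall into per-slot clips capped at a multiple of normal size."""
    cap = int(normal_clip * catchup_mult)
    if cap <= 0:
        raise ValueError("normal_clip * catchup_mult must allow at least one share per slot")
    plan = []
    remaining = due_qty
    while remaining > 0:
        clip = min(cap, remaining)
        plan.append(clip)
        remaining -= clip
    return plan


# 1000 shares due after a stall, normal clip 200, catch-up capped at 1.5x.
print(paced_catchup(1000, 200))  # [300, 300, 300, 100]
```

Lowering catchup_mult toward 1.0 in RED trades slower schedule recovery for less clustering into thin liquidity.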

RED -> SAFE_CONTAIN

Freeze non-essential background tasks on strategy hosts
Apply a strict risk envelope: lower participation caps, stricter tail budget gates
If stress persists, trigger the host-level remediation lane before resuming normal policy

Fast diagnostics checklist

Did MFR/FBS spike before DGL and slippage tail lift?
Is network/venue telemetry stable while host fault metrics degrade?
Are bursts concentrated around opens/news windows (worst convexity)?
Did containment reduce q95 residual within the expected half-life?

If the answer to all four is yes, this is likely fault-driven execution degradation, not pure market turbulence.
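
The first checklist question (did fault metrics move before dispatch gaps?) can be approximated with a lagged correlation over aligned time buckets. The sketch below is a simple heuristic under that assumption, not a causal test; the function names and the bucket alignment are illustrative.

```python
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def best_lead(fault_series, gap_series, max_lag):
    """Return (lag, r) maximizing correlation; positive lag means faults lead gaps."""
    n = len(fault_series)
    best_lag, best_r = 0, -2.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = fault_series[:n - lag], gap_series[lag:]
        else:
            x, y = fault_series[-lag:], gap_series[:n + lag]
        r = pearson(x, y)
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r


faults = [0, 0, 0, 9, 0, 0, 0, 0, 0, 0]   # fault burst in bucket 3
gaps = [0, 0, 0, 0, 0, 7, 0, 0, 0, 0]     # dispatch gap lift in bucket 5
print(best_lead(faults, gaps, max_lag=4))
```

A consistently positive best lag across incidents supports the fault-driven reading; a zero or negative lag suggests the gaps have another cause.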

Deployment playbook (safe rollout)

Shadow phase: log fault bundle + residual attribution only
Advisory phase: produce non-binding state recommendations
Canary phase: apply controls to a small flow slice
Promotion gate: require q95 improvement without completion-rate damage
Rollback rule: auto-disable if miss-rate or opportunity-cost spikes beyond budget
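
The promotion gate can be written down as an explicit check, sketched here with placeholder thresholds. The parameter names and default budgets are assumptions; real values would come from the desk's own tail and completion budgets.

```python
def promotion_gate(q95_control_bps, q95_canary_bps,
                   completion_control, completion_canary,
                   min_q95_gain_bps=0.5, max_completion_drop=0.005):
    """Pass only if the canary improves q95 residual without hurting completion rate."""
    q95_gain = q95_control_bps - q95_canary_bps          # positive means canary is better
    completion_drop = completion_control - completion_canary
    return q95_gain >= min_q95_gain_bps and completion_drop <= max_completion_drop


print(promotion_gate(12.0, 10.5, 0.980, 0.979))  # tail improved, completion intact
print(promotion_gate(12.0, 10.5, 0.980, 0.950))  # tail improved, completion damaged
```

Keeping both conditions in one function makes it harder to promote a control that buys tail improvement simply by trading less.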

Common mistakes

Treating minor faults as harmless because major faults are low
Looking only at averages (tail is where damage lives)
Enabling aggressive catch-up after pauses (creates toxic bursts)
Ignoring host memory behavior because NIC/latency dashboards look green

Bottom line

Page-fault storms are a microstructure-relevant latency regime.

If your model ignores fault-driven dispatch jitter, you will underprice tail slippage, over-trust passive fills during turbulence, and pay hidden bps during exactly the windows that matter most.