Transparent Hugepage Compaction Shock Slippage Playbook
Why this exists
Execution stacks can look perfectly healthy on CPU utilization, NIC counters, and app-level p50 latency while still leaking tail slippage.
One overlooked source: Linux THP compaction behavior (direct reclaim/compaction on allocation paths, plus background khugepaged collapse pressure) creating short scheduler stalls and bursty dispatch cadence.
Those micro-stalls can desynchronize child-order timing and queue priority, especially in thin/fast books.
Core failure mode
When memory is fragmented and THP policy is too aggressive for low-latency workloads:
- allocation path hits compaction/reclaim work,
- event loop pauses briefly (or repeatedly),
- child orders bunch after the pause,
- queue-priority age is reset on clustered replaces,
- toxicity rises as fills happen late in micro-regime transitions.
Result: tail implementation shortfall rises without obvious infra alarms.
Slippage decomposition with memory-compaction terms
For parent order (i):
[ IS_i = C_{delay} + C_{impact} + C_{miss} + C_{compaction-shock} ]
Where:
[ C_{compaction-shock} = C_{stall} + C_{burst-bunch} + C_{queue-reset} ]
- Stall: execution-loop pause while memory path is contended
- Burst-bunch: multiple child actions released together after pause
- Queue-reset: replace/cancel clustering that destroys queue age
Feature set (production-ready)
1) Kernel/MM pressure features
- `thp_enabled_mode` (`always|madvise|never`)
- `thp_defrag_mode` (`always|defer|defer+madvise|madvise|never`)
- `/proc/vmstat`: `compact_stall`, `compact_fail`, `compact_success`
- `/proc/vmstat`: `thp_fault_alloc`, `thp_fault_fallback`, `thp_collapse_alloc`
- `pgscan_kswapd*`, `pgscan_direct*` deltas
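The counters above can be sampled directly from `/proc/vmstat`. A minimal Python sketch, assuming you snapshot on a timer and feed per-interval deltas into your feature pipeline (the tracked key set and helper names are illustrative; the counter names themselves are real vmstat fields):

```python
# Parse compaction/THP counters from /proc/vmstat text and diff two snapshots.
TRACKED = {
    "compact_stall", "compact_fail", "compact_success",
    "thp_fault_alloc", "thp_fault_fallback", "thp_collapse_alloc",
}

def parse_vmstat(text: str) -> dict:
    """Extract tracked counters (plus pgscan_* reclaim counters) from vmstat text."""
    out = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        key, value = parts
        if key in TRACKED or key.startswith("pgscan_"):
            out[key] = int(value)
    return out

def vmstat_delta(prev: dict, curr: dict) -> dict:
    """Per-interval increments; these deltas are the MM-pressure features."""
    return {k: v - prev.get(k, 0) for k, v in curr.items()}
```

In production you would read `/proc/vmstat` periodically, diff consecutive snapshots, and emit the deltas as host-level time series.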
2) Runtime stall features
- event-loop gap p95/p99 (µs)
- scheduler delay p95/p99 (runnable -> running)
- allocator latency tails for critical paths
- GC/pause tags (to avoid confounding with language runtime pauses)
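Event-loop gap tails can be captured with a tick recorded once per loop iteration; compaction or reclaim stalls then surface as outliers in the gap distribution. A sketch (the class name and quantile method are illustrative, not an existing API):

```python
import time

class LoopGapMonitor:
    """Record inter-iteration gaps of an event loop, in microseconds,
    so stall windows (compaction, reclaim, GC) show up in the gap tail."""

    def __init__(self):
        self._last_us = None
        self.gaps_us = []

    def tick(self, now_us=None):
        """Call once per loop iteration; now_us override is for testing."""
        now = time.monotonic_ns() / 1_000 if now_us is None else now_us
        if self._last_us is not None:
            self.gaps_us.append(now - self._last_us)
        self._last_us = now

    def quantile(self, q):
        """Empirical quantile of observed gaps (0 if no samples yet)."""
        if not self.gaps_us:
            return 0.0
        s = sorted(self.gaps_us)
        return s[min(len(s) - 1, int(q * len(s)))]
```

Export `quantile(0.95)` and `quantile(0.99)` per interval; pair them with GC/pause tags so language-runtime pauses are not misattributed to MM pressure.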
3) Dispatch microstructure features
- child inter-send gap distribution
- burstiness index (Fano/overdispersion of child sends)
- replace cluster size per 100ms window
- ACK dispersion before/after detected stall windows
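The burstiness index above is a plain Fano factor over child-send counts per fixed window; a minimal sketch:

```python
def fano_index(counts):
    """Fano factor (variance/mean) of child-send counts per fixed window.
    ~1 for Poisson-like dispatch; >> 1 signals post-stall bunching."""
    n = len(counts)
    if n == 0:
        return 0.0
    mean = sum(counts) / n
    if mean == 0:
        return 0.0
    var = sum((c - mean) ** 2 for c in counts) / n
    return var / mean
```

Evenly paced dispatch scores near zero here; the same total volume released in one post-stall burst scores far above one.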
4) Outcome features
- passive fill ratio by latency bucket
- markout ladders (10ms, 100ms, 1s, 5s)
- completion deficit vs deadline
- branch labels:
smooth,stall_then_catchup,stall_then_panic_cross
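The markout ladder can be computed per fill from mid prices sampled at each horizon; a sketch under the sign convention that positive markout means the market moved in the fill's favor (function names are illustrative):

```python
def markout_bps(fill_px, side, mid_after):
    """Signed markout in bps at one horizon: positive = favorable,
    negative = adverse (the late/toxic-fill signature)."""
    sign = 1.0 if side == "buy" else -1.0
    return sign * (mid_after - fill_px) / fill_px * 1e4

def markout_ladder(fill_px, side, mids_by_horizon):
    """mids_by_horizon: {horizon_ms: mid_price}, following the
    10ms/100ms/1s/5s ladder from the feature list."""
    return {h: markout_bps(fill_px, side, m)
            for h, m in sorted(mids_by_horizon.items())}
```

Bucketing these markouts by the `smooth` / `stall_then_catchup` / `stall_then_panic_cross` branch labels is what makes the stall-to-toxicity link visible.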
Model architecture
Use a baseline + overlay structure.
- Baseline slippage model
- existing desk model (impact/fill/toxicity/deadline)
- Compaction-shock overlay
- predicts incremental uplift during MM stress windows: `delta_is_mean`, `delta_is_q95`
Final estimate:
[ \hat{IS}_{final} = \hat{IS}_{baseline} + \Delta\hat{IS}_{mm-shock} ]
Train overlay with episode labeling around compaction-stall bursts and matched non-burst controls (same symbol/session regime) to separate MM effects from market volatility.
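The matched-control step can be sketched as follows, assuming windows are dicts and the caller supplies accessors for the stall flag and the symbol/session regime key (this schema is illustrative, not a fixed API):

```python
import random

def matched_controls(windows, is_stall, regime_key, seed=0):
    """Pair each stall-burst window with a non-stall control window drawn
    from the same regime bucket, so overlay training compares like with like."""
    rng = random.Random(seed)
    controls_by_regime = {}
    for w in windows:
        if not is_stall(w):
            controls_by_regime.setdefault(regime_key(w), []).append(w)
    pairs = []
    for w in windows:
        if is_stall(w):
            pool = controls_by_regime.get(regime_key(w))
            if pool:  # stall windows with no same-regime control are dropped
                pairs.append((w, rng.choice(pool)))
    return pairs
```

The overlay target is then the realized-IS difference within each (stall, control) pair, which nets out market-volatility effects shared by the regime bucket.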
Regime controller
State A: MM_CLEAN
- low compaction activity, stable dispatch gaps
- normal tactic mix
State B: MM_WATCH
- rising compaction stalls or THP fallback pressure
- reduce replace churn, tighten child-size cap
State C: MM_SHOCK
- confirmed stall+bunch signature
- avoid fragile passive tactics, switch to smoother pacing template
State D: SAFE_MM_INTEGRITY
- repeated bursts with completion risk
- conservative fallback policy + symbol-level kill-switch availability
Use hysteresis + minimum dwell times to avoid control flapping.
Desk metrics
- CSI (Compaction Stall Intensity): weighted `compact_stall` delta + stall-latency tail score
- TFR (THP Fallback Ratio): `thp_fault_fallback / (thp_fault_alloc + thp_fault_fallback)`
- DBI (Dispatch Burst Index): overdispersion of child-send counts per fixed window
- QRL (Queue Reset Load): replace/cancel clusters per minute
- MSU (Memory Shock Uplift): realized IS minus baseline IS during MM-shock windows
Track by host + strategy + symbol-liquidity bucket.
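Two of these metrics reduce to one-liners over interval deltas; a sketch (function names are illustrative):

```python
def thp_fallback_ratio(thp_fault_alloc, thp_fault_fallback):
    """TFR over an interval: share of THP page faults that fell back to
    base pages because a contiguous hugepage could not be allocated."""
    total = thp_fault_alloc + thp_fault_fallback
    return thp_fault_fallback / total if total else 0.0

def memory_shock_uplift(realized_is_bps, baseline_is_bps):
    """MSU: realized minus baseline-model IS, computed only inside
    windows the regime controller tagged as MM shock."""
    return realized_is_bps - baseline_is_bps
```

A rising TFR is an early fragmentation signal: allocation demand for hugepages is outrunning what compaction can supply, which is exactly when direct-compaction stalls start appearing.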
Practical mitigation ladder
- Observe first
- instrument vmstat + dispatch gaps + slippage branch labels
- Policy hardening
- prefer `madvise` over global `always` for THP in latency-critical services
- avoid aggressive defrag in critical execution paths
- Memory layout discipline
- isolate latency-critical processes from heavy memory churn workloads
- pin/control background tasks that trigger compaction pressure
- Execution fallback rails
- when `MM_SHOCK`, cap child-size jumps and replace frequency
- throttle panic catch-up to prevent queue-reset cascades
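The policy-hardening step can be automated as a host check. The kernel exposes the THP policy in `/sys/kernel/mm/transparent_hugepage/{enabled,defrag}` with the active mode in brackets; a sketch that parses that format and flags latency-risky settings (the helper names and warning wording are illustrative):

```python
def active_thp_mode(sysfs_text):
    """Parse the active mode from a THP sysfs file, where the kernel
    marks the current choice in brackets, e.g. 'always [madvise] never'."""
    return sysfs_text.split("[", 1)[1].split("]", 1)[0]

def check_thp_policy(enabled_text, defrag_text):
    """Flag policies risky for latency-critical hosts: global 'always'
    enablement, or defrag modes that compact synchronously on the fault
    path instead of deferring to kcompactd."""
    warnings = []
    if active_thp_mode(enabled_text) == "always":
        warnings.append(
            "THP enabled=always; prefer madvise for latency-critical services")
    defrag = active_thp_mode(defrag_text)
    if defrag in ("always", "madvise"):
        warnings.append(
            "defrag=%s compacts on the allocation path; prefer defer+madvise"
            % defrag)
    return warnings
```

Running this against the two sysfs files on every execution host turns the "prefer `madvise`, avoid aggressive defrag" guidance into a checkable invariant.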
Failure drills (must run)
- Synthetic memory-fragmentation drill
- verify `MM_WATCH` triggers before tail slippage explosion
- Compaction-burst replay drill
- confirm overlay uplift captures q95 deterioration
- Fallback-policy drill
- ensure the `SAFE_MM_INTEGRITY` action path is one-command reversible
- Confounder drill
- separate MM shock from GC/NIC spikes in incident tagging
Anti-patterns
- Treating THP as globally “on/off good/bad” with no workload segmentation
- Monitoring only CPU utilization while ignoring runnable delay/event-gap tails
- Letting catch-up logic fire unbounded after a short stall
- Explaining all tail slippage as “market noise” when host-level burst signatures are visible
Bottom line
THP can improve throughput/TLB behavior, but in execution systems the wrong compaction profile can quietly inject latency bursts that your router converts into queue-priority loss and tail-cost blowups.
Model memory-compaction shock as a first-class slippage component, and wire explicit regime controls before infra jitter becomes basis-point leakage.