Automatic NUMA Balancing Migration-Shock Slippage Playbook
Why this exists
In low-latency execution stacks, we often blame market volatility for tail slippage spikes.
But some spikes are self-inflicted by host-level memory locality churn:
- Linux automatic NUMA balancing periodically unmaps pages,
- accesses trigger NUMA hinting faults,
- pages and/or tasks are migrated,
- dispatch threads experience bursty latency and cache/TLB disruption,
- queue priority decays while strategy logic still believes it is “on-time”.
This playbook treats that churn as a first-class slippage driver.
Core failure mode
Automatic NUMA balancing is page-fault driven and adaptive. That is useful for generic throughput workloads, but it can hurt deterministic execution when scanner/migration activity aligns with decision bursts.
The practical path to slippage:
- Scanner marks regions for hinting faults
- Fault bursts raise per-thread service-time variance
- Page migration copies memory (overhead-heavy step)
- Dispatch cadence becomes uneven (micro-clustering)
- Cancel/replace timing drifts, queue age resets, adverse selection rises
Result: q95/q99 implementation shortfall rises even when p50 decision latency looks acceptable.
Slippage decomposition with NUMA terms
For parent order \(i\):
\[ IS_i = C_{delay} + C_{impact} + C_{miss} + C_{numa} \]
where
\[ C_{numa} = C_{hint-fault} + C_{migration-copy} + C_{dispatch-jitter} + C_{queue-decay} \]
Interpretation:
- Hint-fault term: interrupt/fault handling overhead during active decisions
- Migration-copy term: page-copy cost from misplaced memory correction
- Dispatch-jitter term: irregular child-order release cadence
- Queue-decay term: higher cancel/replace and late-arrival penalty
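The decomposition above is additive, so attribution is a straight sum once each term has been estimated. A minimal sketch, with all basis-point values hypothetical:

```python
# Illustrative decomposition of implementation shortfall (IS) into the
# NUMA-aware terms above. All basis-point inputs are hypothetical.
def is_decomposition(c_delay, c_impact, c_miss,
                     c_hint_fault, c_migration_copy,
                     c_dispatch_jitter, c_queue_decay):
    """Return (total IS, C_numa), both in the same units (e.g. bps)."""
    c_numa = c_hint_fault + c_migration_copy + c_dispatch_jitter + c_queue_decay
    return c_delay + c_impact + c_miss + c_numa, c_numa

total, c_numa = is_decomposition(
    c_delay=1.2, c_impact=2.5, c_miss=0.4,       # classic terms (bps)
    c_hint_fault=0.3, c_migration_copy=0.5,      # NUMA terms (bps)
    c_dispatch_jitter=0.4, c_queue_decay=0.6)
print(round(total, 2), round(c_numa, 2))  # → 5.9 1.8
```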
Observability blueprint (production-safe)
1) Kernel and NUMA control plane
- Global toggle: cat /proc/sys/kernel/numa_balancing
- Scan-rate knobs: numa_balancing_scan_delay_ms, numa_balancing_scan_period_min_ms, numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
2) NUMA activity counters (/proc/vmstat)
Track deltas per 1s/5s bucket:
- numa_pte_updates
- numa_huge_pte_updates
- numa_hint_faults
- numa_hint_faults_local
- numa_pages_migrated
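A minimal delta tracker for these counters can be sketched as below. Parsing is shown against a string so the logic is easy to test; in production you would re-read /proc/vmstat once per bucket.

```python
# Per-bucket delta tracker for the /proc/vmstat NUMA counters listed above.
# The counters are cumulative and monotonic, so bucket values are simple
# differences between consecutive snapshots.
NUMA_KEYS = ("numa_pte_updates", "numa_huge_pte_updates",
             "numa_hint_faults", "numa_hint_faults_local",
             "numa_pages_migrated")

def parse_vmstat(text):
    """Extract the NUMA counters from /proc/vmstat-style 'name value' lines."""
    out = {}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        if name in NUMA_KEYS:
            out[name] = int(value)
    return out

def deltas(prev, curr):
    """Per-bucket increments between two snapshots."""
    return {k: curr[k] - prev[k] for k in NUMA_KEYS}

prev = parse_vmstat("numa_pte_updates 1000\nnuma_huge_pte_updates 2\n"
                    "numa_hint_faults 800\nnuma_hint_faults_local 700\n"
                    "numa_pages_migrated 50\n")
curr = parse_vmstat("numa_pte_updates 1600\nnuma_huge_pte_updates 2\n"
                    "numa_hint_faults 1400\nnuma_hint_faults_local 900\n"
                    "numa_pages_migrated 450\n")
print(deltas(prev, curr))
```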
3) Execution-path joins
Join NUMA deltas with execution telemetry on a common clock:
- decision→wire latency p50/p95/p99
- child dispatch inter-arrival CV
- cancel/replace burst density
- passive fill ratio by latency bucket
- short-horizon markout ladder
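The "common clock" join above can be sketched by keying both streams on a shared time bucket. Field names here are illustrative, not a fixed schema:

```python
# Join NUMA counter deltas with execution telemetry on a shared 1-second
# bucket. Both inputs are lists of dicts carrying a nanosecond timestamp.
def bucket(ts_ns, width_ns=1_000_000_000):
    """Map a nanosecond timestamp to its 1-second bucket index."""
    return ts_ns // width_ns

def join_on_clock(numa_rows, exec_rows):
    """Attach the matching NUMA-delta row (if any) to each telemetry row."""
    numa_by_bucket = {bucket(r["ts_ns"]): r for r in numa_rows}
    joined = []
    for r in exec_rows:
        n = numa_by_bucket.get(bucket(r["ts_ns"]))
        if n is not None:
            joined.append({**r, **{k: v for k, v in n.items() if k != "ts_ns"}})
    return joined

rows = join_on_clock(
    [{"ts_ns": 5_000_000_123, "hint_faults": 600, "pages_migrated": 400}],
    [{"ts_ns": 5_400_000_000, "p99_us": 180.0, "cr_bursts": 3}])
print(rows)
```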
Desk metrics
RHF (Remote Hint Fault share)
\( RHF = 1 - \frac{\Delta numa\_hint\_faults\_local}{\Delta numa\_hint\_faults + \epsilon} \)
MCR (Migration Churn Rate)
\( MCR = \frac{\Delta numa\_pages\_migrated}{\Delta t} \)
NSI (NUMA Scan Intensity)
Proxy from scan-size and scan-period settings plus hint-fault velocity
DCJ (Dispatch Cadence Jitter)
p99 child-gap / p50 child-gap
NUS (NUMA Uplift to Slippage)
Realized IS minus baseline IS during matched windows with elevated RHF/MCR
Segment all five by host, strategy, symbol-liquidity bucket, and session phase.
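The first three directly computable metrics can be sketched from the per-bucket counter deltas and child-order gap percentiles. Epsilon guards the zero-fault case:

```python
# Desk-metric sketches. Inputs are per-bucket deltas of the /proc/vmstat
# counters and child-order dispatch-gap percentiles; example values below
# are hypothetical.
def rhf(d_hint_faults, d_hint_faults_local, eps=1e-9):
    """Remote Hint Fault share: 1 - local/total hint faults."""
    return 1.0 - d_hint_faults_local / (d_hint_faults + eps)

def mcr(d_pages_migrated, dt_s):
    """Migration Churn Rate: pages migrated per second."""
    return d_pages_migrated / dt_s

def dcj(p99_child_gap_us, p50_child_gap_us):
    """Dispatch Cadence Jitter: tail-to-median dispatch-gap ratio."""
    return p99_child_gap_us / p50_child_gap_us

print(round(rhf(600, 200), 3))  # two thirds of hint faults were remote
print(mcr(400, 5.0))            # pages migrated per second
print(dcj(180.0, 20.0))         # heavy-tailed dispatch cadence
```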
Regime controller
State A: NUMA_STABLE
- low RHF, low MCR, tight DCJ
- normal tactics
State B: NUMA_WATCH
- RHF rising and hint-fault velocity inflecting
- soften cancel/replace aggressiveness
- cap child burst size
State C: NUMA_SHOCK
- sustained high MCR + degraded DCJ + queue decay signals
- switch to smoother pacing template
- avoid fragile queue-chasing logic
State D: SAFE_NUMA_CONSERVATIVE
- repeated shock episodes with deadline risk
- prioritize completion reliability over micro-priority games
- stricter participation and retry ceilings
Use hysteresis and minimum dwell to prevent flip-flop.
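The four-state controller with hysteresis and minimum dwell can be sketched as follows. All thresholds are placeholders to be calibrated per host and strategy, and the escalation rule to the conservative state is an assumption for illustration:

```python
# Sketch of the regime controller above. Minimum dwell suppresses
# flip-flop; repeated shock episodes escalate to the conservative state.
# Thresholds (mcr > 500 pages/s, dcj > 8, rhf > 0.3, 3 episodes) are
# illustrative placeholders, not calibrated values.
from dataclasses import dataclass

@dataclass
class RegimeController:
    state: str = "NUMA_STABLE"
    dwell: int = 0            # buckets spent since last transition
    shock_episodes: int = 0
    min_dwell: int = 5        # hysteresis: no transition before this

    def step(self, rhf, mcr, dcj):
        self.dwell += 1
        if self.dwell < self.min_dwell:
            return self.state                  # minimum dwell in force
        shock = mcr > 500.0 and dcj > 8.0      # sustained churn + jitter
        watch = rhf > 0.3 or mcr > 100.0
        if shock:
            self.shock_episodes += 1
            nxt = ("SAFE_NUMA_CONSERVATIVE"
                   if self.shock_episodes >= 3 else "NUMA_SHOCK")
        elif watch:
            nxt = "NUMA_WATCH"
        else:
            nxt = "NUMA_STABLE"
        if nxt != self.state:
            self.state, self.dwell = nxt, 0    # reset dwell on transition
        return self.state
```

With min_dwell = 5, even a hard shock signal is ignored for the first four buckets after a transition, which is exactly the flip-flop suppression the playbook calls for.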
Mitigation ladder
Placement first, balancing second
- Pin critical execution threads and memory policy where feasible.
- If workload is already statically NUMA-tuned, keep auto-balancing off for that path.
If balancing must stay enabled, slow the scanner for critical hosts
- Increase scan delay / reduce scan aggressiveness within tested guardrails.
- Validate no throughput cliff for non-latency-critical services.
Isolate migration-heavy components
- Separate research/backfill batch jobs from live execution NUMA domains.
Model-aware execution adaptation
- Feed RHF/MCR/DCJ into slippage overlay model and tactic gates in real time.
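Before applying any of the ladder, it helps to snapshot what the host currently exposes. A hedged sketch of a pre-trade knob check, assuming the classic /proc/sys locations (recent kernels moved the scan knobs out of /proc/sys, so every read is treated as optional):

```python
# Pre-trade host check: read the balancing toggle and, where present, the
# scan-rate knobs. Knob locations vary by kernel version, so a missing
# file simply yields None rather than an error.
import os

def read_numa_knobs(base="/proc/sys/kernel"):
    knobs = ("numa_balancing",
             "numa_balancing_scan_delay_ms",
             "numa_balancing_scan_period_min_ms",
             "numa_balancing_scan_period_max_ms",
             "numa_balancing_scan_size_mb")
    out = {}
    for k in knobs:
        try:
            with open(os.path.join(base, k)) as f:
                out[k] = int(f.read().split()[0])
        except (OSError, ValueError, IndexError):
            out[k] = None  # knob absent or unreadable on this kernel
    return out

print(read_numa_knobs())
```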
Validation protocol
A/B host policy test
- Compare baseline vs tuned NUMA settings using same symbols and session buckets.
Counterfactual uplift estimation
- Matched windows: same spread/volatility/participation, different RHF+MCR regimes.
Tail KPI acceptance gates
- Promote only if q95/q99 IS and deadline-miss rate improve without adverse completion drift.
Rollback criteria
- If completion deficit or market-impact term worsens beyond threshold, revert immediately.
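The counterfactual uplift step (the NUS metric) can be sketched as a matched-window difference: pair elevated-RHF/MCR windows with calm windows in the same spread/volatility/participation bucket, then average the IS gap. Window fields and thresholds here are illustrative:

```python
# Matched-window NUS estimate. Each window dict carries the regime
# signals (rhf, mcr), matching buckets, and realized IS in bps.
# Thresholds (rhf > 0.3, mcr > 100) are illustrative placeholders.
def match_key(w):
    return (w["spread_bkt"], w["vol_bkt"], w["part_bkt"])

def numa_uplift(windows, rhf_hi=0.3, mcr_hi=100.0):
    """Mean IS of elevated windows minus matched calm-window baseline."""
    hot = [w for w in windows if w["rhf"] > rhf_hi or w["mcr"] > mcr_hi]
    calm = [w for w in windows if w["rhf"] <= rhf_hi and w["mcr"] <= mcr_hi]
    calm_by_key = {}
    for w in calm:
        calm_by_key.setdefault(match_key(w), []).append(w["is_bps"])
    diffs = []
    for w in hot:
        base = calm_by_key.get(match_key(w))
        if base:  # only hot windows with a matched calm baseline count
            diffs.append(w["is_bps"] - sum(base) / len(base))
    return sum(diffs) / len(diffs) if diffs else None
```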
Anti-patterns
- Blaming all tail slippage on market regime without host-level attribution
- Watching only CPU% while ignoring hint-fault and migration counters
- Turning off auto-balancing globally without checking static-placement hygiene
- Tuning scan knobs without matching execution outcomes (queue age, markout, IS tails)
Bottom line
Automatic NUMA balancing is neither “always good” nor “always bad.”
For latency-sensitive execution, its hint-fault/migration dynamics can create hidden queue-priority tax. Treat NUMA activity as a modeled slippage factor, not background noise.
References
- Linux kernel sysctl docs (numa_balancing, memory-tiering mode): https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html
- Linux kernel v5.9 sysctl docs (detailed scan knobs): https://www.kernel.org/doc/html/v5.9/admin-guide/sysctl/kernel.html
- Red Hat RHEL 7 virtualization tuning guide (automatic NUMA balancing behavior): https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/7/html/virtualization_tuning_and_optimization_guide/sect-virtualization_tuning_optimization_guide-numa-auto_numa_balancing
- openSUSE tuning guide (NUMA balancing steps, vmstat counters, overhead notes): https://doc.opensuse.org/documentation/leap/tuning/html/book-tuning/cha-tuning-numactl.html
- proc_vmstat(5) field reference: https://www.man7.org/linux/man-pages/man5/proc_vmstat.5.html