Automatic NUMA Balancing Migration-Shock Slippage Playbook

2026-03-17 · finance

Why this exists

In low-latency execution stacks, we often blame market volatility for tail slippage spikes.

But some spikes are self-inflicted by host-level memory locality churn: hinting faults and page migrations from automatic NUMA balancing landing inside decision windows.

This playbook treats that churn as a first-class slippage driver.


Core failure mode

Automatic NUMA balancing is page-fault driven and adaptive. That is useful for generic throughput workloads, but it can hurt deterministic execution when scanner/migration activity aligns with decision bursts.

The practical path to slippage:

  1. Scanner marks regions for hinting faults
  2. Fault bursts raise per-thread service-time variance
  3. Page migration copies memory (overhead-heavy step)
  4. Dispatch cadence becomes uneven (micro-clustering)
  5. Cancel/replace timing drifts, queue age resets, adverse selection rises

Result: q95/q99 implementation shortfall rises even when p50 decision latency looks acceptable.
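
A synthetic illustration of that effect, assuming a 5% stall incidence (all numbers are made up for the sketch):

```python
# Synthetic illustration: rare migration-induced stalls lift tail latency
# while the median stays flat. All numbers are invented for the sketch.
from statistics import median, quantiles

base = [20] * 95          # decision latency in microseconds, quiet path
stalls = [20 + 400] * 5   # 5% of decisions absorb a migration stall
lat = base + stalls

p50 = median(lat)
q99 = quantiles(lat, n=100)[-1]   # 99th percentile, exclusive method
print(p50, q99)  # 20.0 420.0
```

The median is blind to a 5% stall population; only the tail quantiles move.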


Slippage decomposition with NUMA terms

For parent order (i):

[ IS_i = C_{delay} + C_{impact} + C_{miss} + C_{numa} ]

where

[ C_{numa} = C_{hint-fault} + C_{migration-copy} + C_{dispatch-jitter} + C_{queue-decay} ]

Interpretation:

  • C_{hint-fault}: decision-path latency absorbed while hinting faults are serviced.
  • C_{migration-copy}: stall cost while pages are copied between nodes.
  • C_{dispatch-jitter}: cost of uneven dispatch cadence (micro-clustering).
  • C_{queue-decay}: adverse selection and lost queue priority as cancel/replace timing drifts.

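A toy version of the decomposition, with every cost a hypothetical number in bps of parent-order value:

```python
# Toy per-order slippage decomposition with the NUMA term broken out.
# All cost values are hypothetical numbers in bps of parent-order value.

def c_numa(hint_fault, migration_copy, dispatch_jitter, queue_decay):
    """C_numa as the sum of its four components."""
    return hint_fault + migration_copy + dispatch_jitter + queue_decay

def implementation_shortfall(c_delay, c_impact, c_miss, numa):
    """IS_i = C_delay + C_impact + C_miss + C_numa."""
    return c_delay + c_impact + c_miss + numa

numa = c_numa(hint_fault=0.4, migration_copy=0.9,
              dispatch_jitter=0.5, queue_decay=0.7)
is_i = implementation_shortfall(c_delay=1.2, c_impact=2.0, c_miss=0.3, numa=numa)
numa_share = numa / is_i
print(round(is_i, 2), round(numa_share, 2))  # 6.0 0.42
```

The useful output is the NUMA share of total shortfall per segment, not the absolute level.
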
Observability blueprint (production-safe)

1) Kernel and NUMA control plane

Snapshot per host: kernel.numa_balancing (on/off), the scanner pacing knobs (scan delay, scan period min/max, scan size; sysctl vs debugfs location varies by kernel version), and node/CPU topology (numactl --hardware).

2) NUMA activity counters (/proc/vmstat)

Track deltas per 1s/5s bucket:

  • numa_hint_faults and numa_hint_faults_local (remote share = 1 − local/total)
  • numa_pages_migrated
  • pgmigrate_success / pgmigrate_fail
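
A minimal poller sketch for the /proc/vmstat NUMA counters. The counter names are the standard kernel ones; parsing works on plain text so the sketch runs anywhere, and on a live host you would read open("/proc/vmstat").read() at each tick instead:

```python
# Per-bucket deltas of the NUMA counters exposed in /proc/vmstat.

NUMA_KEYS = ("numa_hint_faults", "numa_hint_faults_local",
             "numa_pages_migrated", "pgmigrate_success", "pgmigrate_fail")

def parse_vmstat(text):
    """Keep only the NUMA-related 'name value' lines as ints."""
    out = {}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        if name in NUMA_KEYS:
            out[name] = int(value)
    return out

def deltas(prev, curr):
    """Counter increments between two snapshots (one polling bucket)."""
    return {k: curr[k] - prev[k] for k in curr if k in prev}

snap0 = parse_vmstat("numa_hint_faults 1000\nnuma_hint_faults_local 900\n"
                     "numa_pages_migrated 50\npgmigrate_success 50\npgmigrate_fail 0")
snap1 = parse_vmstat("numa_hint_faults 1600\nnuma_hint_faults_local 1100\n"
                     "numa_pages_migrated 450\npgmigrate_success 440\npgmigrate_fail 10")
d = deltas(snap0, snap1)
remote_share = 1 - d["numa_hint_faults_local"] / d["numa_hint_faults"]
print(d["numa_pages_migrated"], round(remote_share, 2))  # 400 0.67
```

Deltas, not raw values, are what matter: the counters are monotonic since boot.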

3) Execution-path joins

Join NUMA deltas with execution telemetry (decision latency, dispatch timestamps, cancel/replace acks, fill outcomes) on a common clock, keyed by host and time bucket.
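
A sketch of that join on (host, time bucket); all field names here are illustrative, not a schema:

```python
# Join per-bucket NUMA deltas onto execution telemetry keyed by
# (host, time bucket). Field and key names are illustrative.

numa_by_bucket = {
    ("host-a", 100): {"pages_migrated": 400, "remote_hint_faults": 200},
}
exec_events = [
    {"host": "host-a", "bucket": 100, "order_id": "o1", "decision_us": 38},
    {"host": "host-a", "bucket": 101, "order_id": "o2", "decision_us": 12},
]

NO_ACTIVITY = {"pages_migrated": 0, "remote_hint_faults": 0}  # quiet-bucket default
joined = [{**e, **numa_by_bucket.get((e["host"], e["bucket"]), NO_ACTIVITY)}
          for e in exec_events]

print([row["pages_migrated"] for row in joined])  # [400, 0]
```

Buckets with no NUMA activity must join to explicit zeros, or the quiet baseline disappears from the regression.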


Desk metrics

Track remote hint-fault rate (RHF), migration copy rate (MCR), dispatch-cadence jitter (DCJ), deadline-miss rate, and q95/q99 implementation shortfall. Segment all five by host, strategy, symbol-liquidity bucket, and session phase.


Regime controller

State A: NUMA_STABLE. Counters quiescent; run normal tactics.

State B: NUMA_WATCH. Hint-fault activity elevated; tighten monitoring and widen safety margins.

State C: NUMA_SHOCK. Migration bursts overlap decision windows; gate aggressive tactics.

State D: SAFE_NUMA_CONSERVATIVE. Shock persists; slow or disable balancing on the affected host and run conservative execution until counters recover.

Use hysteresis and minimum dwell to prevent flip-flop.
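
A controller sketch with both guards. State names follow the playbook; the thresholds, dwell counts, and the choice of migration rate as the sole input are placeholder assumptions:

```python
# Regime controller sketch with hysteresis and minimum dwell.
# Thresholds/dwell values are placeholders, not tuned numbers.

STABLE, WATCH, SHOCK, SAFE = ("NUMA_STABLE", "NUMA_WATCH",
                              "NUMA_SHOCK", "SAFE_NUMA_CONSERVATIVE")

class RegimeController:
    def __init__(self, enter_watch=100.0, enter_shock=1000.0,
                 hyst=0.5, min_dwell=3, shock_ticks_to_safe=10):
        self.enter_watch = enter_watch        # rate to enter WATCH
        self.enter_shock = enter_shock        # rate to enter SHOCK
        self.hyst = hyst                      # exit threshold = hyst * entry
        self.min_dwell = min_dwell            # ticks before a transition is allowed
        self.shock_ticks_to_safe = shock_ticks_to_safe
        self.state, self.dwell, self.shock_run = STABLE, 0, 0

    def _target(self, rate):
        if rate >= self.enter_shock:
            return SHOCK
        if self.state in (SHOCK, SAFE) and rate >= self.hyst * self.enter_shock:
            return self.state                 # hysteresis: hold the high state
        if rate >= self.enter_watch:
            return WATCH
        if self.state != STABLE and rate >= self.hyst * self.enter_watch:
            return self.state
        return STABLE

    def step(self, rate):
        """Consume one bucket's migration rate; return the current regime."""
        self.shock_run = self.shock_run + 1 if rate >= self.enter_shock else 0
        target = self._target(rate)
        if self.shock_run >= self.shock_ticks_to_safe:
            target = SAFE                     # persistent shock -> degraded mode
        self.dwell += 1
        if target != self.state and self.dwell >= self.min_dwell:
            self.state, self.dwell = target, 0
        return self.state

ctl = RegimeController(min_dwell=1)
print(ctl.step(2000))  # NUMA_SHOCK
```

Exit thresholds sit below entry thresholds, and no transition fires before the dwell count elapses, which together suppress flip-flop at a noisy boundary.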


Mitigation ladder

  1. Placement first, balancing second

    • Pin critical execution threads and memory policy where feasible.
    • If workload is already statically NUMA-tuned, keep auto-balancing off for that path.
  2. If balancing must stay enabled, slow the scanner for critical hosts

    • Increase scan delay / reduce scan aggressiveness within tested guardrails.
    • Validate no throughput cliff for non-latency-critical services.
  3. Isolate migration-heavy components

    • Separate research/backfill batch jobs from live execution NUMA domains.
  4. Model-aware execution adaptation

    • Feed RHF/MCR/DCJ into slippage overlay model and tactic gates in real time.
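
Steps 1 and 2 can be staged through a dry-run command builder before anything touches a host. A sketch, assuming pre-5.18-style sysctl knob names (newer kernels move the scanner knobs under /sys/kernel/debug/sched/; verify on your fleet first):

```python
# Dry-run builder for per-host NUMA-balancing mitigations. Knob names
# assume older sysctl-style kernels; location varies by kernel version.

def numa_mitigation_commands(disable=False, scan_delay_ms=None):
    """Return the shell commands the mitigation would run, without running them."""
    cmds = []
    if disable:
        cmds.append("sysctl -w kernel.numa_balancing=0")
    if scan_delay_ms is not None:
        # Slow the scanner instead of disabling it outright.
        cmds.append(f"sysctl -w kernel.numa_balancing_scan_delay_ms={scan_delay_ms}")
    return cmds

print(numa_mitigation_commands(disable=True))
# ['sysctl -w kernel.numa_balancing=0']
```

Emitting commands rather than executing them keeps the change reviewable and lets the same builder drive both the A/B test and the rollback path.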

Validation protocol

  1. A/B host policy test

    • Compare baseline vs tuned NUMA settings using same symbols and session buckets.
  2. Counterfactual uplift estimation

    • Matched windows: same spread/volatility/participation, different RHF+MCR regimes.
  3. Tail KPI acceptance gates

    • Promote only if q95/q99 IS and deadline-miss rate improve without adverse completion drift.
  4. Rollback criteria

    • If completion deficit or market-impact term worsens beyond threshold, revert immediately.

Anti-patterns

  • Blaming market volatility for tail spikes without checking host NUMA counters first.
  • Disabling balancing fleet-wide without validating throughput impact on non-latency-critical services.
  • Tuning scanner knobs in production without A/B validation or rollback criteria.
  • Declaring health from p50 decision latency while q95/q99 shortfall drifts.

Bottom line

Automatic NUMA balancing is neither “always good” nor “always bad.”

For latency-sensitive execution, its hint-fault/migration dynamics can create hidden queue-priority tax. Treat NUMA activity as a modeled slippage factor, not background noise.
