MSI-X Vector Affinity Drift as a Hidden Slippage Driver (Practical Playbook)

2026-03-21 · finance

Category: research
Audience: low-latency execution teams running Linux multi-queue NIC paths


Why this matters

Most execution stacks assume packet-ingest latency is stationary once the host is "tuned." In practice, MSI-X interrupt vector affinity can drift after events like:

  • NIC driver resets,
  • irqbalance or orchestration agents re-tuning IRQ placement,
  • CPU-set changes that move execution workers,
  • reboots where boot scripts re-apply a stale or conflicting policy.

When vector→CPU mapping drifts away from the intended topology, market-data and order-ACK paths pick up hidden timing taxes:

  1. more cross-NUMA memory traffic,
  2. extra cache misses and softirq wakeups,
  3. bursty p95/p99 handler latency,
  4. decision-time vs market-time phase drift.

That often shows up in TCA as “random tail slippage,” even though the root cause is operational and measurable.


Failure mechanism (affinity drift -> execution timing tax)

  1. RX/TX queues are created with MSI-X vectors and expected CPU locality.
  2. Affinity drifts (vector lands on non-target core/NUMA node).
  3. NAPI poll/softirq execution shifts away from execution-thread locality.
  4. Packet handoff and cache-coherency overhead rise under burst flow.
  5. Decision loop sees stale/phase-shifted market state at critical moments.
  6. Child-order timing degrades -> queue position loss and urgency catch-up.

Result: tail-heavy slippage with little change in median latency.


Slippage decomposition with affinity term

For parent order i:

IS_i = C_impact + C_timing + C_routing + C_affinity

Where:

C_affinity = C_numa-cross + C_cache-miss + C_softirq-jitter + C_causal-drift


Operational metrics (new)

1) VAM - Vector Affinity Mismatch

Share of active vectors pinned outside intended CPU mask.

VAM = (# vectors not in target mask) / (# active vectors)
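A minimal sketch of computing VAM from `/proc/irq/*/smp_affinity_list` contents. The vector numbers and target mask below are illustrative assumptions, not values from any particular host:

```python
from typing import Dict, Set

def parse_cpu_list(s: str) -> Set[int]:
    """Parse a Linux cpulist string like '2-3,8' into a set of CPU ids."""
    cpus = set()
    for part in s.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus

def vam(vector_affinity: Dict[int, str], target_mask: Set[int]) -> float:
    """VAM: share of active vectors pinned outside the intended CPU mask."""
    if not vector_affinity:
        return 0.0
    mismatched = sum(
        1 for cpulist in vector_affinity.values()
        if not parse_cpu_list(cpulist) <= target_mask
    )
    return mismatched / len(vector_affinity)

# Illustrative: vectors 41-44 expected on NUMA-local cores {2,3,4,5};
# vector 43 has drifted to CPU 12.
affinity = {41: "2", 42: "3", 43: "12", 44: "4-5"}
print(vam(affinity, {2, 3, 4, 5}))  # 0.25
```

In production the `affinity` dict would be populated by reading each vector's `smp_affinity_list` file.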

2) NRS - NUMA Remote Share

Fraction of packet-processing events executed on non-local NUMA node versus NIC-local target policy.

3) HJ95 - Handler Jitter p95

p95 delta between packet hardware/software timestamp and first userspace-consumable event timestamp.
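A sketch of the HJ95 computation, assuming paired hardware/software and userspace timestamps have already been collected (e.g. via NIC timestamping); nearest-rank p95 is used here for simplicity:

```python
import math

def p95(samples):
    """p95 via nearest-rank on sorted samples."""
    xs = sorted(samples)
    rank = math.ceil(0.95 * len(xs)) - 1
    return xs[rank]

def hj95(hw_ts, user_ts):
    """HJ95: p95 of (first userspace-consumable event ts - packet hw/sw ts)."""
    return p95([u - h for h, u in zip(hw_ts, user_ts)])

# Illustrative timestamps (arbitrary units).
hw = [100, 200, 300, 400]
user = [105, 207, 309, 411]
print(hj95(hw, user))  # deltas 5,7,9,11 -> p95 = 11
```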

4) SAD - Softirq Asymmetry Delta

Imbalance score across per-CPU softirq load in the target CPU set.
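One way to score SAD from per-CPU NET_RX counts (as exposed in `/proc/softirqs`). The scoring function below, max relative deviation from the per-CPU mean, is an assumed form; any imbalance measure over the target CPU set works:

```python
def sad(net_rx_counts, target_cpus):
    """SAD: imbalance of NET_RX softirq load across the target CPU set.
    Scored as max absolute deviation from the mean, normalized by the mean
    (assumed definition)."""
    loads = [net_rx_counts[c] for c in target_cpus]
    mean = sum(loads) / len(loads)
    if mean == 0:
        return 0.0
    return max(abs(x - mean) for x in loads) / mean

# Illustrative counts: CPU 5 is absorbing a disproportionate share.
counts = {2: 1_000_000, 3: 1_050_000, 4: 950_000, 5: 4_000_000}
print(round(sad(counts, [2, 3, 4, 5]), 3))
```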

5) CDT - Causality Drift Tax

Incremental markout/IS during high-VAM regimes versus matched low-VAM windows.
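A deliberately simplified CDT sketch: matched-pair mean difference between high-VAM and low-VAM window costs. The host/session fixed effects described under the identification strategy are omitted here; the cost values are illustrative:

```python
def cdt(high_vam_costs, low_vam_costs):
    """CDT: incremental IS/markout in high-VAM windows versus matched
    low-VAM windows. Inputs are per-matched-pair cost estimates (e.g. bps);
    fixed-effects controls are left out of this sketch."""
    diffs = [h - l for h, l in zip(high_vam_costs, low_vam_costs)]
    return sum(diffs) / len(diffs)

# Illustrative matched pairs (bps): drifted regime vs aligned regime.
print(round(cdt([5.2, 6.1, 4.8], [3.9, 4.2, 3.6]), 2))  # ~1.47 bps tax
```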


What to log in production

Host/kernel layer

  • per-vector CPU placement: /proc/irq/*/smp_affinity_list plus /proc/interrupts deltas,
  • per-CPU NET_RX softirq counts (feeds SAD),
  • NUMA-local vs remote packet-processing share (feeds NRS),
  • NIC driver reset and CPU-set change events.

Execution layer

  • packet hardware/software timestamps vs first userspace-consumable event timestamps (feeds HJ95),
  • per-window VAM/NRS/SAD snapshots tagged to hosts and sessions,
  • child-order timing and markouts for regime comparison (feeds CDT).


Identification strategy (causal)

  1. Build two regimes:
    • AFFINITY_ALIGNED (low VAM),
    • AFFINITY_DRIFTED (high VAM).
  2. Match windows by spread, volatility, participation, symbol liquidity, venue mix.
  3. Estimate incremental tail cost (CDT) with host/session fixed effects.
  4. Run controlled canary:
    • re-apply deterministic IRQ pinning,
    • freeze execution-thread CPU placement,
    • disable conflicting auto-tuners during test.
  5. Promote only if CDT and p99 handler jitter improve without completion-rate damage.

Regime state machine

AFFINITY_ALIGNED: low VAM; normal tactics.

AFFINITY_WARN: VAM rising; alert and schedule reconciliation.

AFFINITY_DRIFTED: high VAM; remediate and de-rate latency-sensitive tactics.

SAFE_CONTAIN: remediation failed or incident in progress; restrict urgency-driven flow to contain losses.

Use hysteresis and minimum dwell time to avoid policy flapping.
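The hysteresis and dwell-time logic above can be sketched as follows. Thresholds and dwell counts are illustrative assumptions; SAFE_CONTAIN is entered by external incident controls and is omitted here:

```python
class AffinityRegime:
    """VAM-driven regime tracker with hysteresis and minimum dwell time.
    Thresholds are illustrative; SAFE_CONTAIN transitions are handled by
    external incident tooling and not modeled here."""

    def __init__(self, warn=0.05, drift=0.20, clear=0.02, min_dwell=5):
        self.warn, self.drift, self.clear = warn, drift, clear
        self.min_dwell = min_dwell          # updates to hold a state before switching
        self.state, self.dwell = "AFFINITY_ALIGNED", 0

    def update(self, vam: float) -> str:
        target = self.state                  # default: stay put (hysteresis band)
        if vam >= self.drift:
            target = "AFFINITY_DRIFTED"
        elif vam >= self.warn:
            target = "AFFINITY_WARN"
        elif vam <= self.clear:              # must fall below clear, not just warn
            target = "AFFINITY_ALIGNED"
        self.dwell += 1
        if target != self.state and self.dwell >= self.min_dwell:
            self.state, self.dwell = target, 0
        return self.state
```

Note that a VAM reading between `clear` and `warn` keeps the current state, which is what prevents policy flapping at the boundary.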


Control ladder

  1. Declare target topology explicitly
    • maintain versioned vector->CPU policy by host class.
  2. Pin execution threads coherently
    • align app critical threads with intended IRQ/NAPI CPUs.
  3. Apply idempotent affinity reconciler
    • periodic verifier/remediator for /proc/irq/*/smp_affinity_list drift.
  4. Guard against automation conflicts
    • coordinate irqbalance, orchestration agents, and boot scripts.
  5. Integrate host-regime into slippage model
    • include VAM/NRS/HJ95 as live features for tactic gating.
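A sketch of the idempotent reconciler from step 3: compare each IRQ's `smp_affinity_list` against a versioned policy and re-pin only what drifted. The policy shape `{irq_number: cpulist_string}` is an assumption; writing requires root, so a dry-run mode reports intended changes instead:

```python
import os

def reconcile_irq_affinity(policy, irq_root="/proc/irq", dry_run=True):
    """Idempotent reconciler: re-pin IRQs whose smp_affinity_list has
    drifted from the policy {irq_number: cpulist_string}. Returns the
    list of (irq, observed, wanted) changes; no-op when already aligned."""
    changes = []
    for irq, want in policy.items():
        path = os.path.join(irq_root, str(irq), "smp_affinity_list")
        try:
            with open(path) as f:
                have = f.read().strip()
        except OSError:
            continue  # IRQ disappeared (e.g. driver reset); leave to alerting
        if have != want:
            changes.append((irq, have, want))
            if not dry_run:
                with open(path, "w") as f:
                    f.write(want)
    return changes
```

Running it periodically (and logging the returned changes) gives both the remediation and the drift audit trail in one loop.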

Failure drills

  1. Driver-reset drift drill
    • trigger controlled NIC reset; verify remap automation and alerting.
  2. CPU-set perturbation drill
    • move execution workers; confirm locality alarms and rollback.
  3. Burst replay drill
    • replay peak traffic with induced mismatch; validate CDT sensitivity.
  4. Containment drill
    • force transition to SAFE_CONTAIN and confirm loss containment.

Common mistakes

  • treating host tuning as a one-time event rather than a monitored state,
  • letting irqbalance or orchestration agents silently fight explicit pinning,
  • watching only median latency while the damage lives in p95/p99,
  • attributing drift-regime tail slippage to "random" market noise in TCA.

Bottom line

MSI-X affinity drift is a microstructure-relevant infra risk, not just a systems hygiene issue.

If vector locality drifts, your decision clock drifts from market clock. That leak appears as tail slippage and weak queue entry quality. Treat affinity state as a first-class model feature and attach explicit remediation + containment controls.

