RSS Indirection Queue-Polarization Slippage Playbook

2026-03-20 · finance

Why this exists

A low-latency execution host can look healthy on headline metrics (CPU average, packet loss alarms, median decision latency) and still leak p95/p99 implementation shortfall.

A common hidden driver is RSS queue polarization: receive-side scaling (RSS) hashes a disproportionate share of heavy flows onto a few NIC queues, so the cores serving those queues fall behind while headline averages stay flat.

This is an infra-originated slippage tax that often gets misattributed to "market randomness."


Core failure mode

  1. RSS hashing maps a disproportionate share of heavy flows to a few queues.
  2. Per-queue IRQ load diverges (/proc/interrupts), and corresponding CPUs burn softirq budget.
  3. NAPI polling and backlog pressure rise non-uniformly across cores.
  4. Feed-to-decision age widens for symbols bound to hot queues.
  5. Dispatch cadence becomes phase-shifted versus the true microstructure clock.
  6. Passive fill probability drops; corrective crosses/cancels rise.

Result: tail slippage inflation with deceptively stable medians.
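The divergence in step 2 can be quantified directly. A minimal sketch, assuming per-queue interrupt deltas have already been sampled between two reads of /proc/interrupts (queue names and counts below are illustrative, not from a real host):

```python
from statistics import mean, pstdev

def queue_imbalance(irq_deltas: dict[str, int]) -> float:
    """Coefficient of variation of per-queue IRQ deltas.

    ~0.0 means queues drain evenly; values approaching 1.0 or more
    mean a few queues absorb most of the interrupt load."""
    counts = list(irq_deltas.values())
    mu = mean(counts)
    return pstdev(counts) / mu if mu else 0.0

# Illustrative per-queue deltas between two /proc/interrupts samples
# (queue names are hypothetical; real names depend on the NIC driver).
balanced = {"eth0-rx-0": 10_000, "eth0-rx-1": 9_800,
            "eth0-rx-2": 10_100, "eth0-rx-3": 10_050}
polarized = {"eth0-rx-0": 36_000, "eth0-rx-1": 2_000,
             "eth0-rx-2": 1_500, "eth0-rx-3": 500}

print(f"balanced CV:  {queue_imbalance(balanced):.3f}")
print(f"polarized CV: {queue_imbalance(polarized):.3f}")
```

A rising CV on a fixed sampling cadence is an early tell before softnet backlog counters move.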


Slippage decomposition with queue-polarization term

For parent order i:

  IS_i = C_{impact} + C_{timing} + C_{routing} + C_{rss-pol}

where:

  C_{rss-pol} = C_{stale-md} + C_{irq-jitter} + C_{queue-miss}


Production feature set

1) Queue-balance / kernel features

2) Execution-timing features

3) Outcome features
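The playbook does not pin down QPI's exact construction; one plausible sketch for the queue-balance feature family is an entropy-deficit index over per-queue traffic shares (0 = perfectly even spread, 1 = fully polarized). This specific form is an assumption, not the production definition:

```python
import math

def qpi(per_queue_counts: list[int]) -> float:
    """Queue Polarization Index sketch: 1 minus the normalized
    entropy of per-queue traffic shares. 0.0 = traffic spread evenly
    across queues, 1.0 = all traffic concentrated on one queue."""
    total, n = sum(per_queue_counts), len(per_queue_counts)
    if total == 0 or n < 2:
        return 0.0
    shares = [c / total for c in per_queue_counts if c > 0]
    entropy = -sum(p * math.log(p) for p in shares)
    return 1.0 - entropy / math.log(n)
```

SPI and ISI could be built the same way over softirq-time and inter-service-gap distributions; the entropy-deficit form is one reasonable choice among several.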


Practical metrics (new)

Track by host, NIC, kernel version, queue map, NUMA profile, and session segment.


Identification strategy (causal, not just correlation)

Use a matched-window design:

  1. Match on spread, volatility, participation, and time-of-day.
  2. Compare high-QPI windows vs low-QPI windows within same host class.
  3. Add host fixed effects and interaction terms (QPI × volatility, SPI × urgency).
  4. Validate with controlled indirection-table reweights during canary windows.

If tail IS drops after flattening queue weights while market state is held constant, the uplift is infra-causal.
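The matched-window comparison in steps 1-3 can be sketched as follows. The window fields (`spread_bkt`, `vol_bkt`, `tod_bkt`, `qpi`, `is_p95`) and the QPI threshold are illustrative assumptions, not a production schema:

```python
from collections import defaultdict
from statistics import mean

def matched_qpi_uplift(windows: list[dict], qpi_threshold: float = 0.3):
    """Compare tail implementation shortfall between high-QPI and
    low-QPI windows, matched on market-state buckets.

    Each window is a dict with keys 'spread_bkt', 'vol_bkt',
    'tod_bkt', 'qpi', 'is_p95' (field names are illustrative).
    Returns the mean high-minus-low p95 IS gap across matched cells,
    or None if no cell contains both sides."""
    cells = defaultdict(lambda: {"hi": [], "lo": []})
    for w in windows:
        key = (w["spread_bkt"], w["vol_bkt"], w["tod_bkt"])
        side = "hi" if w["qpi"] >= qpi_threshold else "lo"
        cells[key][side].append(w["is_p95"])
    gaps = [mean(c["hi"]) - mean(c["lo"])
            for c in cells.values() if c["hi"] and c["lo"]]
    return mean(gaps) if gaps else None

windows = [
    {"spread_bkt": 0, "vol_bkt": 0, "tod_bkt": 0, "qpi": 0.5, "is_p95": 8.0},
    {"spread_bkt": 0, "vol_bkt": 0, "tod_bkt": 0, "qpi": 0.1, "is_p95": 5.0},
    {"spread_bkt": 1, "vol_bkt": 0, "tod_bkt": 0, "qpi": 0.6, "is_p95": 9.0},
    {"spread_bkt": 1, "vol_bkt": 0, "tod_bkt": 0, "qpi": 0.2, "is_p95": 6.0},
]
print(matched_qpi_uplift(windows))
```

Host fixed effects and interaction terms belong in a proper regression; this sketch only shows the cell-matching skeleton.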


Regime controller

State A: RSS_BALANCED

State B: RSS_DRIFT

State C: RSS_POLARIZED

State D: RSS_CONTAIN

Use hysteresis + minimum dwell times to avoid policy flapping.
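A minimal sketch of the hysteresis-plus-dwell logic, covering the BALANCED/DRIFT/POLARIZED transitions (escalation to RSS_CONTAIN is left to operator policy and not modeled here); thresholds, margin, and dwell are placeholders, not calibrated values:

```python
from enum import Enum

class RssState(Enum):
    BALANCED = "RSS_BALANCED"
    DRIFT = "RSS_DRIFT"
    POLARIZED = "RSS_POLARIZED"

_SEVERITY = {RssState.BALANCED: 0, RssState.DRIFT: 1, RssState.POLARIZED: 2}

class RegimeController:
    """Hysteresis + minimum-dwell state machine over a QPI stream."""

    def __init__(self, enter_drift=0.2, enter_pol=0.5,
                 exit_margin=0.05, min_dwell=2):
        self.state = RssState.BALANCED
        self.enter_drift, self.enter_pol = enter_drift, enter_pol
        self.exit_margin, self.min_dwell = exit_margin, min_dwell
        self.ticks = 0  # ticks spent in the current state

    def _target(self, qpi: float) -> RssState:
        if qpi >= self.enter_pol:
            return RssState.POLARIZED
        if qpi >= self.enter_drift:
            return RssState.DRIFT
        return RssState.BALANCED

    def step(self, qpi: float) -> RssState:
        self.ticks += 1
        target = self._target(qpi)
        if _SEVERITY[target] < _SEVERITY[self.state]:
            # Hysteresis on de-escalation: require clearing the entry
            # threshold of the current state by an extra margin.
            floor = (self.enter_pol if self.state is RssState.POLARIZED
                     else self.enter_drift)
            if qpi >= floor - self.exit_margin:
                target = self.state
        # Minimum dwell prevents policy flapping in either direction.
        if target is not self.state and self.ticks >= self.min_dwell:
            self.state, self.ticks = target, 0
        return self.state
```

With `min_dwell=2`, a single noisy QPI sample cannot flip the regime, and a state is only demoted once QPI sits clearly below the threshold that admitted it.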


Mitigation ladder (ops + model)

  1. Inspect and flatten RSS indirection drift
    • check ethtool -x map symmetry and queue weights
  2. Align IRQ affinity with NUMA/thread pinning
    • audit /proc/irq/*/smp_affinity(_list) vs execution core layout
  3. Tune RPS/RFS where queue count < effective CPU set
    • avoid blind enabling; validate IPI overhead trade-offs
  4. Tune NAPI/softnet budgets for burst envelopes
    • revisit net.core.netdev_budget(_usecs) under canary replay
  5. Promote polarization features into live throttles
    • reduce tactical aggressiveness when QPI/SPI exceed thresholds
  6. Recalibrate after NIC driver/kernel changes
    • queue mapping changes invalidate prior coefficients
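Step 5's live throttle can be sketched as a simple de-rating function; the thresholds, slope, and floor below are illustrative placeholders, not calibrated values:

```python
def throttle_aggressiveness(base_aggr: float, qpi: float, spi: float,
                            qpi_thresh: float = 0.4, spi_thresh: float = 0.4,
                            slope: float = 2.0, floor: float = 0.25) -> float:
    """De-rate tactical aggressiveness linearly in the worst
    threshold breach among QPI/SPI, down to a hard floor so the
    strategy degrades rather than halts."""
    excess = max(qpi - qpi_thresh, spi - spi_thresh, 0.0)
    return base_aggr * max(floor, 1.0 - slope * excess)
```

Keeping a nonzero floor matters: a throttle that can reach zero quietly turns a kernel-layer incident into a silent trading halt.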

Failure drills (must run)

  1. Synthetic hash-skew drill
    • force skewed indirection weights in staging and verify QPI→RPU response
  2. IRQ-affinity drift drill
    • intentionally perturb affinity, confirm detection + auto-containment
  3. Burst replay drill
    • validate that controller transitions reduce HAF95 and tail IS
  4. Rollback drill
    • prove deterministic return to baseline map/policy under incident conditions
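The synthetic hash-skew drill (item 1) can be prototyped off-box before touching staging. The indirection-table model and polarization metric below are simplified stand-ins: real NICs use a Toeplitz hash over the flow tuple, not a uniform random hash.

```python
import math
import random

def polarization(counts: list[int]) -> float:
    """1 minus the normalized entropy of per-queue packet shares."""
    total = sum(counts)
    shares = [c / total for c in counts if c]
    entropy = -sum(p * math.log(p) for p in shares)
    return 1.0 - entropy / math.log(len(counts))

def run_drill(indir_table: list[int], n_queues: int = 8,
              n_flows: int = 10_000, seed: int = 7) -> float:
    """Hash synthetic flows through an RSS-style 128-slot indirection
    table and measure the resulting queue polarization."""
    rng = random.Random(seed)
    counts = [0] * n_queues
    for _ in range(n_flows):
        counts[indir_table[rng.getrandbits(32) % len(indir_table)]] += 1
    return polarization(counts)

flat = [slot % 8 for slot in range(128)]              # even queue weights
skewed = [0] * 96 + [slot % 8 for slot in range(32)]  # 75% of slots -> queue 0

print(f"flat:   {run_drill(flat):.3f}")
print(f"skewed: {run_drill(skewed):.3f}")
```

If the detection pipeline does not flag the skewed table well above the flat baseline, fix the feature before the staging drill, not during it.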

Anti-patterns


Bottom line

RSS queue polarization is a tradable infrastructure regime, not a benign kernel detail.

If you model it explicitly (QPI/ISI/SPI/HAF95) and wire it into execution control, you can convert hidden tail slippage into a measurable, suppressible risk budget.

