RSS Indirection Queue-Polarization Slippage Playbook
Why this exists
A low-latency execution host can look healthy on headline metrics (CPU average, packet-loss alarms, median decision latency) and still leak implementation shortfall at p95/p99.
A common hidden driver is RSS queue polarization:
- flow hashes over-concentrate on a subset of RX queues,
- those queues’ IRQ + softirq paths saturate first,
- event age becomes queue-dependent,
- strategy decisions desynchronize from actual book state,
- queue-priority outcomes degrade in bursts.
This is an infra-originated slippage tax that often gets misattributed to "market randomness."
Core failure mode
- RSS hashing maps a disproportionate share of heavy flows to a few queues.
- Per-queue IRQ load diverges (`/proc/interrupts`), and the corresponding CPUs burn softirq budget.
- NAPI polling and backlog pressure rise non-uniformly across cores.
- Feed-to-decision age widens for symbols bound to hot queues.
- Dispatch cadence becomes phase-shifted versus the true microstructure clock.
- Passive fill probability drops; corrective crosses/cancels rise.
Result: tail slippage inflation with deceptively stable medians.
Slippage decomposition with queue-polarization term
For parent order (i):
[ IS_i = C_{impact} + C_{timing} + C_{routing} + C_{rss-pol} ]
Where:
[ C_{rss-pol} = C_{stale-md} + C_{irq-jitter} + C_{queue-miss} ]
- (C_{stale-md}): decisions made on queue-aged market data
- (C_{irq-jitter}): dispatch variance from IRQ/softirq phase noise
- (C_{queue-miss}): higher cancel/reprice tax after missed queue windows
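The decomposition above can be sketched directly. This is a minimal illustration, not a production schema; the field names and the basis-points unit are assumptions.

```python
# Sketch: implementation-shortfall decomposition with an explicit
# RSS-polarization term, following the formulas above.
from dataclasses import dataclass

@dataclass
class ShortfallDecomposition:
    impact_bps: float      # C_impact
    timing_bps: float      # C_timing
    routing_bps: float     # C_routing
    stale_md_bps: float    # C_stale-md: decisions made on queue-aged market data
    irq_jitter_bps: float  # C_irq-jitter: dispatch variance from IRQ/softirq noise
    queue_miss_bps: float  # C_queue-miss: cancel/reprice tax after missed windows

    @property
    def rss_pol_bps(self) -> float:
        # C_rss-pol = C_stale-md + C_irq-jitter + C_queue-miss
        return self.stale_md_bps + self.irq_jitter_bps + self.queue_miss_bps

    @property
    def is_bps(self) -> float:
        # IS_i = C_impact + C_timing + C_routing + C_rss-pol
        return self.impact_bps + self.timing_bps + self.routing_bps + self.rss_pol_bps
```

Keeping the three sub-terms separate matters: each points at a different mitigation (data freshness, dispatch scheduling, cancel policy).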
Production feature set
1) Queue-balance / kernel features
- RX IRQ counts by queue and CPU (`/proc/interrupts`)
- NET_RX softirq skew by CPU (`/proc/softirqs`)
- `/proc/net/softnet_stat` pressure terms (drops, time-squeeze, backlog growth)
- NIC per-queue packet/byte counters (`ethtool -S <dev>`)
- RSS indirection-table shape (`ethtool -x <dev>`, long form `ethtool --show-rxfh-indir <dev>`)
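A minimal sketch of the first feature, per-queue RX IRQ counts from `/proc/interrupts`. The `eth0-rx-0` naming convention is driver-dependent (an assumption here); adjust the pattern for your NIC.

```python
# Sketch: total per-RX-queue IRQ counts from /proc/interrupts text.
# Assumes queue IRQ actions are named like "<dev>-rx-<n>" (driver-dependent).
import re

def rx_irq_counts(interrupts_text: str, dev: str = "eth0") -> dict[str, int]:
    """Map RX-queue IRQ action name -> IRQ count summed across CPUs."""
    counts: dict[str, int] = {}
    pattern = re.compile(rf"{re.escape(dev)}\S*rx\S*")
    for line in interrupts_text.splitlines()[1:]:  # skip the CPU header row
        parts = line.split()
        if not parts or not parts[0].endswith(":"):
            continue  # not an IRQ row
        name = parts[-1]  # last column is the action name
        if pattern.fullmatch(name):
            # Per-CPU counts are the purely numeric columns after the IRQ number.
            counts[name] = sum(int(p) for p in parts[1:] if p.isdigit())
    return counts
```

Feeding these counts into a concentration metric (max/mean, or a Herfindahl score) is what the QPI/ISI definitions below formalize.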
2) Execution-timing features
- feed age-at-decision by symbol/venue (p50/p95/p99)
- decision→wire latency quantiles by queue bucket
- cancel/replace ACK drift conditioned on queue hotness
- child-order burstiness near queue-polarization spikes
3) Outcome features
- passive fill ratio delta in hot vs cold queue regimes
- short-horizon markout ladder (10ms/100ms/1s/5s)
- incremental IS in `BALANCED` vs `POLARIZED` windows
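The markout ladder above can be computed per fill with a small helper. This is a sketch; `mid_at` is a hypothetical stand-in for your mid-price lookup.

```python
# Sketch: short-horizon markout ladder (10ms/100ms/1s/5s) for a single fill.
def markout_ladder(fill_px, side, fill_ts_ns, mid_at, horizons_ms=(10, 100, 1000, 5000)):
    """Return {horizon_ms: signed markout}, positive = favorable.

    side: +1 for a buy, -1 for a sell.
    mid_at: callable ts_ns -> mid price at that timestamp (assumption).
    """
    return {
        h: side * (mid_at(fill_ts_ns + h * 1_000_000) - fill_px)
        for h in horizons_ms
    }
```

Comparing the ladder in hot-queue vs cold-queue regimes is what turns markouts into a polarization outcome feature.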
Practical metrics (new)
QPI (Queue Polarization Index): [ QPI = \frac{\max_q \lambda_q}{\frac{1}{Q}\sum_q \lambda_q} - 1 ], where (\lambda_q) is the per-queue RX packet rate and (Q) the queue count.
ISI (IRQ Skew Index): Herfindahl-style concentration over per-queue IRQ shares.
SPI (Softirq Pressure Index): weighted score of NET_RX skew plus `softnet_stat` squeeze/drop growth.
HAF95 (Hot-queue Age p95): p95 feed age for symbols predominantly mapped to hot queues.
RPU (Realized Polarization Uplift): tail IS uplift versus a matched `BALANCED` baseline.
Track by host, NIC, kernel version, queue map, NUMA profile, and session segment.
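The two purely mechanical metrics, QPI and ISI, reduce to a few lines. A minimal sketch following the definitions above:

```python
# Sketch: QPI and ISI from per-queue rates/counts, per the definitions above.
def qpi(rates):
    """Queue Polarization Index: max per-queue rate over mean rate, minus 1.

    0.0 for a perfectly balanced queue set; grows as one queue dominates."""
    mean = sum(rates) / len(rates)
    return max(rates) / mean - 1.0

def isi(irq_counts):
    """IRQ Skew Index: Herfindahl concentration of per-queue IRQ shares.

    Ranges from 1/Q (balanced across Q queues) up to 1.0 (one queue takes all)."""
    total = sum(irq_counts)
    return sum((c / total) ** 2 for c in irq_counts)
```

For example, one queue at 300 pkt/s against three at 100 pkt/s gives QPI = 1.0, i.e. the hot queue runs at double the mean.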
Identification strategy (causal, not just correlation)
Use a matched-window design:
- Match on spread, volatility, participation, and time-of-day.
- Compare high-QPI windows vs low-QPI windows within same host class.
- Add host fixed effects and interaction terms (`QPI × volatility`, `SPI × urgency`).
- Validate with controlled indirection-table reweights during canary windows.
If tail IS drops after flattening queue weights while market state is held constant, the uplift is infra-causal.
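The matched-window comparison can be sketched as follows. The bucketing scheme and QPI threshold are illustrative assumptions; in practice match buckets would be built from spread, volatility, participation, and time-of-day as described above.

```python
# Sketch: matched-window tail-IS comparison. Within each market-state bucket,
# compare p95 IS between high-QPI and low-QPI windows; the difference is the
# candidate infra-causal uplift.
from statistics import quantiles

def p95(xs):
    # 19th of 19 cut points at n=20 is the 95th percentile.
    return quantiles(xs, n=20)[-1]

def tail_is_uplift(windows, qpi_threshold=0.5):
    """windows: iterable of (match_bucket, qpi, is_bps) tuples.
    Returns {bucket: p95 IS in high-QPI windows minus p95 IS in low-QPI windows}."""
    by_bucket = {}
    for bucket, q, is_bps in windows:
        hot = q >= qpi_threshold
        by_bucket.setdefault(bucket, {True: [], False: []})[hot].append(is_bps)
    return {
        b: p95(groups[True]) - p95(groups[False])
        for b, groups in by_bucket.items()
        if len(groups[True]) >= 2 and len(groups[False]) >= 2  # quantiles need >= 2 points
    }
```

A persistently positive uplift that disappears after a canary indirection-table flattening is the causal signature the text describes.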
Regime controller
State A: RSS_BALANCED
- low QPI/ISI/SPI
- standard pacing and routing
State B: RSS_DRIFT
- rising QPI with intermittent softirq skew
- dampen cancel/reprice churn, mild burst smoothing
State C: RSS_POLARIZED
- sustained hot-queue concentration, widening HAF95
- stricter burst caps, more conservative passive placement horizons
State D: RSS_CONTAIN
- persistent polarization plus deadline risk
- route urgent flow to cleaner hosts/queues, prioritize completion certainty over queue-capture optimism
Use hysteresis + minimum dwell times to avoid policy flapping.
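The four-state ladder with hysteresis and dwell can be sketched as a small state machine. Thresholds here are placeholders to be fit per host class, not recommendations, and this version keys only on QPI for brevity (the text also gates on ISI/SPI/HAF95).

```python
# Sketch: regime controller with hysteresis (separate enter/release thresholds)
# and a minimum dwell time to avoid policy flapping.
STATES = ["RSS_BALANCED", "RSS_DRIFT", "RSS_POLARIZED", "RSS_CONTAIN"]

class RegimeController:
    def __init__(self, enter=(0.5, 1.0, 2.0), release=(0.3, 0.7, 1.5), min_dwell=30):
        self.enter = enter        # QPI needed to escalate out of state i
        self.release = release    # QPI below which state i+1 de-escalates
        self.min_dwell = min_dwell
        self.state = 0            # index into STATES
        self.dwell = 0            # updates spent in the current state

    def update(self, qpi_val: float) -> str:
        self.dwell += 1
        if self.dwell >= self.min_dwell:  # minimum dwell before any transition
            if self.state < 3 and qpi_val >= self.enter[self.state]:
                self.state, self.dwell = self.state + 1, 0  # escalate one level
            elif self.state > 0 and qpi_val < self.release[self.state - 1]:
                self.state, self.dwell = self.state - 1, 0  # de-escalate one level
        return STATES[self.state]
```

The gap between each `enter` and `release` pair is the hysteresis band: QPI must fall meaningfully, not just dip below the entry threshold, before the controller relaxes.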
Mitigation ladder (ops + model)
- Inspect and flatten RSS indirection drift
- check `ethtool -x` map symmetry and queue weights
- Align IRQ affinity with NUMA/thread pinning
- audit `/proc/irq/*/smp_affinity(_list)` vs the execution core layout
- Tune RPS/RFS where queue count < effective CPU set
- avoid blind enabling; validate IPI overhead trade-offs
- Tune NAPI/softnet budgets for burst envelopes
- revisit `net.core.netdev_budget(_usecs)` under canary replay
- Promote polarization features into live throttles
- reduce tactical aggressiveness when QPI/SPI exceed thresholds
- Recalibrate after NIC driver/kernel changes
- queue mapping changes invalidate prior coefficients
Failure drills (must run)
- Synthetic hash-skew drill
- force skewed indirection weights in staging and verify QPI→RPU response
- IRQ-affinity drift drill
- intentionally perturb affinity, confirm detection + auto-containment
- Burst replay drill
- validate that controller transitions reduce HAF95 and tail IS
- Rollback drill
- prove deterministic return to baseline map/policy under incident conditions
Anti-patterns
- Treating RSS as "set-and-forget" after initial bring-up
- Watching only aggregate CPU while per-queue skew explodes
- Using average feed age and ignoring hot-queue tail age
- Tuning execution logic without measuring IRQ/softirq concentration
Bottom line
RSS queue polarization is a tradable infrastructure regime, not a benign kernel detail.
If you model it explicitly (QPI/ISI/SPI/HAF95) and wire it into execution control, you can convert hidden tail slippage into a measurable, suppressible risk budget.
References
- Linux kernel networking scaling guide (RSS/RPS/RFS/XPS): https://docs.kernel.org/networking/scaling.html
- Legacy `scaling.txt` (detailed RSS + IRQ guidance): https://www.kernel.org/doc/Documentation/networking/scaling.txt
- SMP IRQ affinity documentation: https://docs.kernel.org/core-api/irq/irq-affinity.html
- `ethtool` manual (`--show-rxfh-indir`, `--set-rxfh-indir`, stats): https://man7.org/linux/man-pages/man8/ethtool.8.html
- RHEL IRQ/SoftIRQ/NAPI tuning guide: https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/10/html/network_troubleshooting_and_performance_tuning/tuning-irq-balancing
- RSS++ paper summary (dynamic indirection-table adaptation, tail-latency impact): https://www.semanticscholar.org/paper/RSS%2B%2B:-load-and-state-aware-receive-side-scaling-Katsikas-Maguire/3abe7c89764da4c252386767f2f68980cf9c095e