XPS TX-Queue Polarization & Wire-Cadence Slippage Playbook

2026-03-20 · finance


Why this exists

Execution stacks can look healthy on classic dashboards (median decision latency, low drop rate, acceptable CPU headroom) while still bleeding p95/p99 implementation shortfall.

A frequent blind spot is TX-path queue polarization: XPS maps concentrate many active senders onto a few hot TX queues while the rest sit idle, degrading wire-time spacing.

This is an infra-originated slippage tax that is often mislabeled as "alpha decay" or "random venue noise."


Core failure mode

  1. XPS maps too many active senders onto a small set of TX queues.
  2. Hot TX queues absorb bursty enqueue pressure; cold queues idle.
  3. Driver/NIC completion locality diverges from application locality, increasing lock/cache contention.
  4. Wire-time spacing becomes uneven (packet bunching + short droughts).
  5. Order/cancel cadence phase-shifts versus true order-book replenishment cadence.
  6. Passive queue-capture probability falls; corrective aggression rises.

Result: tail slippage inflation with deceptively stable medians.
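The first step in the chain above is measurable. The playbook tracks a TX Queue Concentration Index (TQCI); its exact definition is not given here, so the sketch below assumes a normalized Herfindahl index over per-queue TX packet counts (obtainable e.g. from `ethtool -S` counters): 0 means perfectly balanced, 1 means fully polarized.

```python
# Hypothetical TQCI sketch: normalized Herfindahl concentration over
# per-TX-queue packet counts. The specific formula is an assumption,
# not the playbook's canonical definition.
def tqci(per_queue_pkts: list[int]) -> float:
    total = sum(per_queue_pkts)
    n = len(per_queue_pkts)
    if total == 0 or n < 2:
        return 0.0
    shares = [p / total for p in per_queue_pkts]
    hhi = sum(s * s for s in shares)          # raw Herfindahl, in [1/n, 1]
    return (hhi - 1.0 / n) / (1.0 - 1.0 / n)  # rescale to [0, 1]

balanced = tqci([1000, 980, 1010, 995])   # near 0: load spread evenly
polarized = tqci([3900, 50, 30, 5])       # near 1: one hot queue
```

A reading drifting upward over a session is the early signature of step 1 (senders collapsing onto few queues) before cadence metrics visibly degrade.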


Slippage decomposition with TX-polarization term

For parent order i:

[ IS_i = C_{impact} + C_{timing} + C_{routing} + C_{tx-pol} ]

Where:

[ C_{tx-pol} = C_{wire-jitter} + C_{completion-drift} + C_{queue-miss} ]


Production feature set

1) TX queue / kernel features

2) Execution-timing features

3) Outcome features


Practical metrics (new)

Track TQCI, XMD, WCV95, and CDI by host, NIC/driver, kernel version, XPS profile, qdisc profile, and strategy cohort.
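WCV95 is the cadence-side metric; its exact construction is not spelled out here, so the sketch below reads it as the p95 inter-departure gap between TX timestamps divided by the median gap. A value near 1 means even spacing; values well above 1 indicate the bunching-plus-droughts pattern from the core failure mode.

```python
import statistics

# Hypothetical WCV95 sketch from packet TX timestamps (seconds, sorted).
# The p95/median-gap definition is an assumption for illustration.
def wire_cadence_metrics(tx_ts: list[float]) -> dict:
    gaps = [b - a for a, b in zip(tx_ts, tx_ts[1:])]  # assumes >= 2 timestamps
    gaps.sort()
    med = statistics.median(gaps)
    p95 = gaps[min(len(gaps) - 1, int(0.95 * len(gaps)))]
    return {"median_gap": med,
            "p95_gap": p95,
            "wcv95": p95 / med if med else float("inf")}

even = wire_cadence_metrics([0.0, 1.0, 2.0, 3.0, 4.0])       # wcv95 == 1.0
bunched = wire_cadence_metrics([0.0, 0.1, 0.2, 0.3, 3.0])    # wcv95 >> 1
```

Because the ratio is scale-free, the same guardrail threshold can apply across strategies with different target send rates.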


Identification strategy (causal, not just correlation)

Use a matched-window design:

  1. Match on spread, volatility, participation, urgency, and session segment.
  2. Compare high-TQCI/XMD windows vs low-TQCI/XMD windows within same host class.
  3. Add host and strategy fixed effects with interactions (TQCI × urgency, XMD × volatility).
  4. Run controlled canaries by rebalancing XPS maps (CPU and/or RXQ-based) while holding strategy logic constant.

If tail IS improves after map rebalance and cadence metrics normalize, the uplift is infra-causal.
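Step 2 of the matched-window design can be sketched minimally: bucket windows by a matching key (spread/volatility/participation/urgency/session segment), split each bucket into high-TQCI and low-TQCI sides, and average the within-bucket tail-IS gap. Field names and thresholds below are illustrative, not a production schema.

```python
from collections import defaultdict

# Matched-window tail comparison sketch. A positive return value means
# high-TQCI windows carry higher tail IS than matched low-TQCI windows.
def matched_tail_gap(windows, tqci_hi=0.6, tqci_lo=0.2, q=0.95):
    groups = defaultdict(lambda: {"hi": [], "lo": []})
    for w in windows:
        if w["tqci"] >= tqci_hi:
            groups[w["match_key"]]["hi"].append(w["is_bps"])
        elif w["tqci"] <= tqci_lo:
            groups[w["match_key"]]["lo"].append(w["is_bps"])

    def quantile(xs):
        xs = sorted(xs)
        return xs[min(len(xs) - 1, int(q * len(xs)))]

    gaps = [quantile(g["hi"]) - quantile(g["lo"])
            for g in groups.values() if g["hi"] and g["lo"]]
    return sum(gaps) / len(gaps) if gaps else None  # avg within-cell tail gap
```

This comparison alone is still correlational; steps 3 and 4 (fixed effects, XPS-rebalance canaries) are what upgrade it to a causal claim.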


Regime controller

State A: TX_BALANCED

State B: TX_DRIFT

State C: TX_POLARIZED

State D: TX_CONTAIN

Use hysteresis + minimum dwell times to avoid policy flapping.
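The four states with hysteresis and dwell can be sketched as a small ladder controller. The state names come from the playbook; the TQCI thresholds, the escalate/de-escalate gap, and the dwell value are placeholder assumptions.

```python
# Regime controller sketch: escalates one state per breach, de-escalates on a
# strictly lower threshold (hysteresis), and ignores inputs until min_dwell
# observations have accumulated in the current state (anti-flapping).
class TxRegimeController:
    def __init__(self, up=(0.3, 0.6, 0.8), down=(0.25, 0.5, 0.7), min_dwell=5):
        self.states = ["TX_BALANCED", "TX_DRIFT", "TX_POLARIZED", "TX_CONTAIN"]
        self.up, self.down, self.min_dwell = up, down, min_dwell
        self.idx, self.dwell = 0, 0

    def step(self, tqci: float) -> str:
        self.dwell += 1
        if self.dwell >= self.min_dwell:
            if self.idx < 3 and tqci >= self.up[self.idx]:
                self.idx, self.dwell = self.idx + 1, 0      # escalate
            elif self.idx > 0 and tqci < self.down[self.idx - 1]:
                self.idx, self.dwell = self.idx - 1, 0      # de-escalate
        return self.states[self.idx]
```

Keeping each `down[i]` strictly below `up[i]` is what prevents a TQCI reading oscillating around one threshold from flapping the policy.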


Mitigation ladder (ops + model)

  1. Audit and flatten XPS mapping
    • rebalance xps_cpus / xps_rxqs so active sender sets are not over-collapsed.
  2. Align completion locality
    • verify TX completion IRQ affinity and app thread pinning coherence.
  3. Control qdisc pacing behavior
    • validate fq (or chosen qdisc) settings (quantum, maxrate, pacing mode) against burst envelope.
  4. Validate BQL behavior under burst load
    • detect queue overfill/underfill oscillation and retune driver/stack knobs where possible.
  5. Elevate TX-polarization features into live policy
    • downshift tactical aggressiveness when TQCI/XMD/WCV95 breach guardrails.
  6. Recalibrate after kernel/driver/qdisc changes
    • infra upgrades invalidate old coefficients and thresholds.
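Step 1 of the ladder (audit the XPS mapping) can be scripted against the real sysfs files `/sys/class/net/<iface>/queues/tx-*/xps_cpus`, whose contents are comma-separated hex CPU bitmasks. The sketch below parses those masks and reports, per CPU, how many TX queues it can select; a large CPU set all collapsed onto one queue is the over-collapse smell. The audit function takes the masks as a dict so it can be tested without a live NIC.

```python
# XPS audit sketch. Mask format matches sysfs xps_cpus output
# (comma-separated 32-bit hex groups, e.g. "00000001,0000000f").
def parse_cpu_mask(mask: str) -> set[int]:
    bits = int(mask.replace(",", ""), 16)
    return {i for i in range(bits.bit_length()) if bits >> i & 1}

def xps_audit(queue_masks: dict[str, str]) -> dict[int, int]:
    per_cpu: dict[int, int] = {}
    for _queue, mask in queue_masks.items():
        for cpu in parse_cpu_mask(mask):
            per_cpu[cpu] = per_cpu.get(cpu, 0) + 1
    return per_cpu  # cpu -> number of TX queues it maps onto

# Example: 8 sender CPUs all collapsed onto tx-0, queues tx-1..tx-3 unused.
skewed = xps_audit({"tx-0": "ff", "tx-1": "00", "tx-2": "00", "tx-3": "00"})
```

In production the dict would be populated by globbing the per-queue sysfs files for the NIC under audit; the flattening fix is then writing rebalanced masks back (or switching to `xps_rxqs`), which is the change the canary in the identification section holds everything else constant against.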

Failure drills (must run)

  1. Synthetic TX-map skew drill
    • intentionally collapse many active CPUs into few TX queues in staging.
  2. Completion-affinity drift drill
    • perturb IRQ affinity and validate CDI detection + containment response.
  3. Burst replay drill
    • replay high-burst sessions and verify regime transitions suppress tail IS.
  4. Rollback drill
    • prove deterministic return to baseline XPS/qdisc profile and stable cadence.
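Drill 1 (synthetic TX-map skew) can be driven by a toy traffic generator: hash a large sender population onto a deliberately collapsed queue set and confirm the monitoring stack flags the resulting load concentration before running the real staging exercise. Parameters below are illustrative defaults.

```python
import random

# Synthetic skew drill sketch: n_senders CPUs forced onto `collapse_to`
# TX queues out of n_queues, mimicking an over-collapsed XPS map.
def simulate_skew(n_senders=32, n_queues=8, collapse_to=2, pkts=10_000, seed=7):
    rng = random.Random(seed)
    counts = [0] * n_queues
    for _ in range(pkts):
        sender = rng.randrange(n_senders)
        counts[sender % collapse_to] += 1   # collapsed map: few queues absorb all
    top_share = max(counts) / pkts          # hottest queue's load share
    return counts, top_share

counts, top_share = simulate_skew()  # queues 2..7 stay cold, top queue ~50%
```

Sweeping `collapse_to` from `n_queues` down to 1 gives a staging curve of concentration versus alert firing, which is exactly what the drill is meant to validate.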

Anti-patterns

  • Booking infra-originated tail slippage as "alpha decay" or "random venue noise."
  • Treating TQCI/XMD correlation with tail IS as causal without matched windows or canaries.
  • Carrying old coefficients and thresholds across kernel/driver/qdisc changes.


Bottom line

RX-path balance is only half the story.

If TX queue polarization is left unmodeled, you get hidden cadence distortion that quietly taxes queue priority and markouts.

Make TX concentration and cadence first-class slippage features (TQCI/XMD/WCV95/CDI), and you convert a "mysterious tail" into an observable, controllable execution risk budget.

