RPS/RFS Steering Churn as a Hidden Slippage Driver (Practical Playbook)

2026-03-21 · finance

Date: 2026-03-21
Category: research
Audience: low-latency execution teams running Linux multi-queue ingest/routing paths


Why this matters

Most slippage stacks model market microstructure and tactic behavior, but ignore a kernel-side timing distortion: receive steering churn from RPS/RFS.

In stable conditions, RPS/RFS steering can improve cache locality and throughput. In unstable conditions (thread migration, undersized flow tables, poor queue/CPU maps), the steering target changes frequently, and that churn introduces:

  1. remote backlog queueing + extra IPIs,
  2. per-flow CPU handoff delays,
  3. event-time distortion between market-data ingest and order-state/control feedback.

That distortion often appears in TCA as "random tail slippage" while the root cause is deterministic control-plane behavior.


Failure mechanism (flow-steering churn -> execution timing tax)

  1. Packet enters RX queue and receives RSS/RPS hash.
  2. get_rps_cpu() selects a target CPU from rps_cpus / RFS flow tables.
  3. Packet is enqueued on remote CPU backlog; remote CPU is kicked by IPI.
  4. Scheduler moves consumer thread (or flow-table entry collides/ages), so desired CPU changes.
  5. Steering updates lag behind packets still outstanding on the old CPU, so CPU ownership flips in bursts.
  6. Market-data and order-state timelines de-synchronize at the application boundary.

Result: causality jitter. Trading logic reacts to slightly stale or phase-shifted state, which worsens queue-entry timing and forces late urgency catch-up.


Slippage decomposition with steering term

For parent order \(i\):

\[ IS_i = C_{\text{impact}} + C_{\text{timing}} + C_{\text{routing}} + C_{\text{steer}} \]

Where:

\[ C_{\text{steer}} = C_{\text{ipi}} + C_{\text{backlog}} + C_{\text{churn}} + C_{\text{causal-drift}} \]


Operational metrics (new)

1) SMI - Steering Migration Intensity

Per-flow rate of target-CPU changes over short windows.

\[ \mathrm{SMI} = \frac{\#(\text{cpu\_target changes})}{\text{flow-time}} \]
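As a rough illustration of the SMI definition, the sketch below counts target-CPU changes per flow and divides by the observed flow lifetime. The event tuple layout `(timestamp_s, flow_id, target_cpu)` and the function name are assumptions made for this example; how such events are captured (tracing, sampling) is left open.

```python
from collections import defaultdict


def steering_migration_intensity(events):
    """Return {flow_id: target-CPU changes per second of observed flow time}.

    `events` is an iterable of (timestamp_s, flow_id, target_cpu) tuples,
    ordered by time within each flow. This layout is hypothetical.
    """
    first_ts, last_ts, last_cpu = {}, {}, {}
    changes = defaultdict(int)
    for ts, flow, cpu in events:
        if flow not in first_ts:
            first_ts[flow] = ts
        elif cpu != last_cpu[flow]:
            changes[flow] += 1  # steering target moved for this flow
        last_ts[flow] = ts
        last_cpu[flow] = cpu

    smi = {}
    for flow in first_ts:
        span = last_ts[flow] - first_ts[flow]
        # A flow seen only once has no measurable flow-time; report 0.
        smi[flow] = changes[flow] / span if span > 0 else 0.0
    return smi


sample = [
    (0.0, "a", 1), (0.0, "b", 3),
    (0.5, "a", 2), (1.0, "a", 2),
    (2.0, "a", 1), (2.0, "b", 3),
]
result = steering_migration_intensity(sample)
```

In this sample, flow "a" changes target twice over two seconds and flow "b" never moves.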

2) RBI95 - Remote Backlog Injection p95

p95 delay from NIC receive timestamp (or earliest software ingress stamp) to start of protocol processing on target CPU.
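A minimal sketch of the RBI95 calculation, assuming paired ingress and processing-start timestamps are already available in nanoseconds on a comparable clock. The function name and the nearest-rank percentile convention are choices made for this example, not something the playbook prescribes.

```python
def rbi95(rx_ts_ns, proc_start_ns):
    """Nearest-rank p95 of (processing start - receive) delays in nanoseconds.

    Both inputs are equal-length sequences of timestamps for the same
    packets; the pairing and the clock alignment are assumed, not verified.
    """
    delays = sorted(p - r for r, p in zip(rx_ts_ns, proc_start_ns))
    if not delays:
        raise ValueError("no samples")
    # Integer form of ceil(0.95 * n), avoiding float rounding surprises.
    rank = (95 * len(delays) + 99) // 100
    return delays[rank - 1]


rx = [0] * 100
start = list(range(1, 101))  # synthetic delays of 1..100 ns
p95 = rbi95(rx, start)
```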

3) IWB - IPI Wakeup Burden

Share of RX-path processing episodes requiring remote IPI wakeups.
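One possible proxy for IWB uses two snapshots of `/proc/net/softnet_stat`. The sketch below assumes the layout found on many recent kernels, where each row is one CPU, values are hexadecimal, column 0 is the processed-packet count, and column 9 is `received_rps` (incremented when the CPU is kicked by an RPS IPI). That column position should be checked against the running kernel, and the ratio is per packet rather than per processing episode, so treat it as an approximation.

```python
def parse_softnet_stat(text):
    """Parse softnet_stat text into per-CPU dicts (column layout assumed)."""
    rows = []
    for line in text.strip().splitlines():
        cols = [int(c, 16) for c in line.split()]
        rows.append({
            "processed": cols[0],
            # Older kernels expose fewer columns; fall back to 0 there.
            "received_rps": cols[9] if len(cols) > 9 else 0,
        })
    return rows


def iwb_proxy(before, after):
    """Share of newly processed packets that coincided with RPS IPI kicks."""
    d_processed = sum(a["processed"] - b["processed"] for a, b in zip(after, before))
    d_rps = sum(a["received_rps"] - b["received_rps"] for a, b in zip(after, before))
    return d_rps / d_processed if d_processed > 0 else 0.0


# Synthetic single-CPU snapshots for illustration only.
snap0 = parse_softnet_stat("00000064 0 0 0 0 0 0 0 0 0000000a 0 0 0\n")
snap1 = parse_softnet_stat("000000c8 0 0 0 0 0 0 0 0 0000001e 0 0 0\n")
ratio = iwb_proxy(snap0, snap1)
```

In production the two snapshots would be read from the real file a fixed interval apart.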

4) FCR - Flow-Collision Ratio

Estimated collision pressure in rps_sock_flow_entries / per-queue rps_flow_cnt tables (proxy via high SMI + table occupancy/flow volume).

5) CDT - Causality Drift Tax

Incremental short-horizon markout / tail IS during high-SMI+high-RBI regimes versus matched low-churn regimes.
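As a simplified illustration, CDT can be approximated as the gap in tail implementation shortfall between churn and stable samples. The sketch assumes the two samples have already been matched on market conditions and omits any fixed-effects adjustment; the function names and the p95 default are hypothetical choices.

```python
def nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no samples")
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def causality_drift_tax(stable_is_bps, churn_is_bps, pct=95):
    """Tail IS (bps) in churn windows minus tail IS in matched stable windows."""
    return nearest_rank(churn_is_bps, pct) - nearest_rank(stable_is_bps, pct)


# Synthetic data: churn windows are uniformly 2 bps worse.
stable = list(range(1, 101))
churn = [x + 2 for x in stable]
cdt = causality_drift_tax(stable, churn)
```

A positive value indicates the churn regime carries extra tail cost; a value near zero suggests steering is not the driver in that sample.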


What to log in production

Kernel / network layer

Execution layer


Identification strategy (causal)

  1. Match windows by spread, volatility, participation, and venue mix.
  2. Segment into STEER_STABLE vs STEER_CHURN using SMI/RBI thresholds.
  3. Estimate incremental tail cost (CDT) with host + symbol + session fixed effects.
  4. Run controlled canaries:
    • stabilize IRQ affinity / queue maps,
    • right-size rps_sock_flow_entries and rps_flow_cnt,
    • narrow rps_cpus to NUMA-local sets,
    • reduce app-thread migration on critical handlers.
  5. Promote only if CDT drops without completion-rate degradation.
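For the canary step that narrows rps_cpus to NUMA-local sets, a small check like the following can flag CPUs in a queue's mask that fall outside a NUMA node. It assumes the usual sysfs text formats: a comma-grouped hexadecimal mask for `rps_cpus` and a range list such as `0-3,8-11` for the node's `cpulist`. The helper names are made up for this sketch.

```python
def parse_cpu_mask(mask_text):
    """'00000000,0000000f' -> {0, 1, 2, 3}."""
    digits = mask_text.strip().replace(",", "")
    value = int(digits, 16) if digits else 0
    return {bit for bit in range(value.bit_length()) if (value >> bit) & 1}


def parse_cpu_list(list_text):
    """'0-3,8-11' -> {0, 1, 2, 3, 8, 9, 10, 11}."""
    cpus = set()
    for part in list_text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def non_local_cpus(rps_cpus_mask, node_cpulist):
    """CPUs enabled in the RPS mask that are not on the given NUMA node."""
    return parse_cpu_mask(rps_cpus_mask) - parse_cpu_list(node_cpulist)


ok = non_local_cpus("00000000,0000030f", "0-3,8-11")
bad = non_local_cpus("f0", "0-3")
```

An empty result means the mask is fully node-local; any returned CPU is a candidate for removal in the canary.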

Regime state machine

STEER_STABLE

STEER_IMBALANCED

STEER_CHURN

STEER_SAFE_CONTAIN

Use hysteresis + minimum dwell to avoid controller flapping.
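A minimal sketch of how hysteresis and minimum dwell could be combined for these four states. Everything numeric here is hypothetical: the scalar `stress` input (for example, a normalized blend of SMI and RBI95), the enter/exit thresholds, and the 5-second dwell are placeholders, and the playbook does not specify transition rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    name: str
    enter: float  # stress at or above this moves up into the level
    exit: float   # stress below this moves down out of the level


# Placeholder thresholds; exit < enter gives the hysteresis band.
LEVELS = [
    Level("STEER_STABLE", 0.0, 0.0),
    Level("STEER_IMBALANCED", 1.0, 0.8),
    Level("STEER_CHURN", 2.0, 1.6),
    Level("STEER_SAFE_CONTAIN", 3.0, 2.4),
]


class SteerRegime:
    def __init__(self, min_dwell_s=5.0):
        self.idx = 0
        self.min_dwell_s = min_dwell_s
        self.since = float("-inf")  # time of last transition

    def update(self, now_s, stress):
        """Feed one observation; return the (possibly unchanged) state name."""
        if now_s - self.since < self.min_dwell_s:
            return LEVELS[self.idx].name  # still inside minimum dwell
        target = self.idx
        while target + 1 < len(LEVELS) and stress >= LEVELS[target + 1].enter:
            target += 1
        if target == self.idx:  # no escalation, so consider de-escalation
            while target > 0 and stress < LEVELS[target].exit:
                target -= 1
        if target != self.idx:
            self.idx, self.since = target, now_s
        return LEVELS[self.idx].name


regime = SteerRegime(min_dwell_s=5.0)
trace = [regime.update(t, s) for t, s in
         [(0.0, 0.5), (1.0, 2.1), (2.0, 0.1), (7.0, 1.8), (8.0, 1.0)]]
```

In the trace, the drop at t=2.0 is ignored because of dwell, and the reading of 1.8 at t=7.0 keeps the state in STEER_CHURN because it sits inside the hysteresis band.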


Control ladder

  1. Fix topology first
    • Align IRQ affinity, rps_cpus, and app CPU pinning by NUMA/cache locality.
  2. Right-size flow tables
    • Increase rps_sock_flow_entries / rps_flow_cnt when collision pressure is visible.
  3. Avoid redundant steering layers
    • If RSS already gives clean 1:1 queue/CPU mapping, aggressive RPS may add churn with little upside.
  4. Bound scheduler migration on critical consumers
    • Thread movement can create avoidable RFS target churn.
  5. Promote steering-health features into live execution logic
    • Treat SMI/RBI as first-class slippage features, not just infra telemetry.
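For the flow-table step of the ladder, one way to derive candidate sizes is sketched below. It assumes the kernel rounds both settings up to a power of two and follows the common suggestion of dividing the global table across RX queues; the `headroom` multiplier and the expected-flow estimate are hypothetical inputs that each team would have to calibrate.

```python
def next_pow2(n):
    """Smallest power of two that is >= n (minimum 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def size_flow_tables(expected_active_flows, rx_queues, headroom=2.0):
    """Suggest rps_sock_flow_entries and a per-queue rps_flow_cnt.

    `headroom` is an illustrative safety factor, not a kernel parameter.
    """
    if rx_queues < 1:
        raise ValueError("rx_queues must be >= 1")
    sock_entries = next_pow2(int(expected_active_flows * headroom))
    per_queue = next_pow2(max(1, sock_entries // rx_queues))
    return {"rps_sock_flow_entries": sock_entries, "rps_flow_cnt": per_queue}


plan = size_flow_tables(expected_active_flows=10_000, rx_queues=8)
```

The resulting values would be written to `/proc/sys/net/core/rps_sock_flow_entries` and each queue's `rps_flow_cnt` file under `/sys/class/net/<dev>/queues/rx-<n>/`, ideally inside a canary with rollback.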

Failure drills

  1. Flow-fanout stress drill
    • replay high concurrent-flow load; verify SMI/FCR alarms.
  2. Queue-map swap drill
    • test controlled IRQ/RPS remaps with rollback triggers.
  3. Migration stress drill
    • intentionally perturb app pinning; validate CDT sensitivity.
  4. Tail-protection drill
    • force transition to STEER_SAFE_CONTAIN on repeated p95 breach.

Common mistakes


Bottom line

RPS/RFS are not just throughput knobs; they are slippage-relevant timing controls.

When steering churn rises, execution clocks dephase from market clocks, and tail costs inflate. Model steering regime explicitly and attach controls to it; otherwise infra-side causality drift will keep leaking hidden basis points.
