RX Interrupt Coalescing as a Hidden Slippage Engine (Practical Playbook)
Date: 2026-03-16
Category: research
Audience: low-latency execution teams running Linux NIC stacks (kernel or user-space) between market data ingest and routing
Why this matters
Execution teams often model slippage with market-state features (spread, volatility, imbalance, queue signals) and strategy-state features (urgency, residual, participation).
But many desks still ignore a control-plane source of cost drift:
- market-data packets arrive at the NIC,
- the NIC coalesces interrupts (rx-usecs, rx-frames, adaptive modes),
- packets are delivered to software in bursts,
- decision loops and child-order emits phase-lock to those bursts,
- queue entry timing degrades and tail markouts widen.
This appears as “random latency noise” in TCA, while the root cause is often deterministic batching in the receive path.
1) Mechanism: how coalescing leaks into execution cost
Interrupt coalescing delays IRQ delivery until either:
- a time threshold is met, or
- a frame-count threshold is met.
So the application sees packet arrivals in clusters rather than at their near-original micro-timing.
Let true packet arrivals be ({t_i}), and app-visible arrivals be ({\tilde t_i}). Under coalescing:
[ \tilde t_i = t_i + \delta_i, ]
where (\delta_i) is state-dependent (traffic intensity, queue occupancy, coalescing settings, NAPI/softirq contention).
The desk then optimizes on (\tilde t_i), not (t_i):
- signal freshness is overstated inside bursts,
- urgency updates synchronize with software delivery cadence,
- child orders bunch,
- queue quality and adverse selection worsen.
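The two-threshold mechanism can be sketched in a few lines. This is a simplified model (it ignores NAPI polling, per-queue effects, and adaptive modes); the parameter names mirror the ethtool settings, and all frames held in a batch are stamped with the IRQ instant:

```python
def coalesce(arrivals_us, rx_usecs=64.0, rx_frames=32):
    """Model RX interrupt coalescing: an IRQ fires when either the time
    threshold (rx_usecs after the first frame of a batch) or the
    frame-count threshold (rx_frames) is met; every held frame is then
    delivered at the IRQ instant."""
    delivered, batch, deadline = [], [], 0.0
    for t in sorted(arrivals_us):
        if batch and t > deadline:
            delivered += [deadline] * len(batch)  # timer fired before this frame
            batch = []
        if not batch:
            deadline = t + rx_usecs               # arm timer on first frame of batch
        batch.append(t)
        if len(batch) >= rx_frames:
            delivered += [t] * len(batch)         # frame-count threshold fired
            batch = []
    if batch:
        delivered += [deadline] * len(batch)      # flush trailing batch at timer expiry
    return delivered
```

With 10 us spaced arrivals and rx_usecs=64, frames that arrived evenly surface in clusters of roughly seven, which is exactly the (\tilde t_i = t_i + \delta_i) distortion described above.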
2) Slippage branch model with coalescing distortion
For each parent order, model three execution branches:
- TIMELY branch: low delivery distortion, normal queue interaction, cost (C_T)
- BURST-SYNC branch: clustered decisions/dispatch, queue competition rises, cost (C_B)
- LATE-RECOVERY branch: residual catch-up after stale windows, aggressive cleanup, cost (C_L)
Expected cost:
[ E[C] = p_T C_T + p_B C_B + p_L C_L, ]
with typical ordering (C_T < C_B < C_L).
Most desks only optimize (C_T)-centric behavior. The practical gain comes from reducing (p_B) and (p_L).
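A worked instance of the branch model, with made-up illustrative costs in bps (only the ordering (C_T < C_B < C_L) is taken from the text), shows how shifting probability mass out of the expensive branches moves (E[C]):

```python
def expected_cost(p, c):
    """E[C] = p_T*C_T + p_B*C_B + p_L*C_L over the three branches."""
    assert abs(sum(p.values()) - 1.0) < 1e-9
    return sum(p[k] * c[k] for k in p)

# Illustrative branch costs (made-up numbers), respecting C_T < C_B < C_L:
costs = {"TIMELY": 1.0, "BURST_SYNC": 2.5, "LATE_RECOVERY": 6.0}
before = {"TIMELY": 0.80, "BURST_SYNC": 0.15, "LATE_RECOVERY": 0.05}
after_ = {"TIMELY": 0.87, "BURST_SYNC": 0.10, "LATE_RECOVERY": 0.03}

# Reducing p_B and p_L moves mass into the cheap branch:
saving = expected_cost(before, costs) - expected_cost(after_, costs)
```

Here the saving comes entirely from branch probabilities, not from improving (C_T) itself, which is the practical point of the model.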
3) Detection metrics (new KPI set)
3.1 Market Data Distortion Ratio (MDDR)
Compare ingress-level inter-arrival structure (packet capture / NIC timestamp) vs app-level event spacing:
[ MDDR = \frac{Q95(\Delta \tilde t)}{Q95(\Delta t)}. ]
Persistent (MDDR \gg 1) indicates receive-path time dilation.
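A minimal stdlib computation of MDDR, assuming you already have aligned ingress-level and app-level timestamp series (in microseconds) for the same packets:

```python
from statistics import quantiles

def q95(xs):
    # 95th percentile via 20-quantiles, stdlib only (exclusive method)
    return quantiles(xs, n=20)[-1]

def mddr(ingress_ts, app_ts):
    """MDDR = Q95 of app-visible inter-arrivals / Q95 of ingress inter-arrivals."""
    d_ingress = [b - a for a, b in zip(ingress_ts, ingress_ts[1:])]
    d_app = [b - a for a, b in zip(app_ts, app_ts[1:])]
    return q95(d_app) / q95(d_ingress)
```

Evenly spaced wire arrivals that the app sees in coalesced clusters produce an inflated Q95 of app inter-arrivals (the inter-burst gap), hence MDDR well above 1.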
3.2 Burst Delivery Concentration (BDC)
In short windows (w) (e.g., 100–500 (\mu s)):
[ BDC = \frac{\max_w N_w}{\sum_w N_w}, ]
where (N_w) is delivered market-data events in window (w). High BDC means software sees compressed bursts.
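BDC is a one-liner over fixed windows; a sketch with a hypothetical 250 (\mu s) window:

```python
import math
from collections import Counter

def bdc(event_ts_us, window_us=250.0):
    """BDC = max_w N_w / sum_w N_w, where N_w counts delivered
    market-data events in fixed windows of width window_us."""
    counts = Counter(math.floor(t / window_us) for t in event_ts_us)
    return max(counts.values()) / len(event_ts_us)
```

Uniform delivery over four windows yields BDC = 0.25, while a heavily coalesced stream that lands almost everything in one window pushes BDC toward 1.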
3.3 Decision-Burst Coupling (DBC)
Correlation between delivery bursts and child-order emit bursts. High DBC means execution cadence is being driven by NIC/software batching rather than market intent.
3.4 Coalescing Tax Estimate (CTE)
[ CTE_{\tau} = E[M_{\tau}\mid \text{high BDC}] - E[M_{\tau}\mid \text{low BDC}], ]
for (\tau\in{1s,5s,30s}) markout horizons.
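The CTE conditional-mean difference, assuming one (markout, BDC) pair per episode at a fixed horizon (\tau); the hi/lo BDC cutoffs here are hypothetical and would need calibration:

```python
from statistics import mean

def cte(markouts_bps, bdc_values, hi=0.6, lo=0.3):
    """CTE = E[markout | high BDC] - E[markout | low BDC], with
    hypothetical BDC cutoffs hi and lo; mid-range episodes are dropped."""
    hi_side = [m for m, b in zip(markouts_bps, bdc_values) if b >= hi]
    lo_side = [m for m, b in zip(markouts_bps, bdc_values) if b <= lo]
    return mean(hi_side) - mean(lo_side)
```

A persistently positive CTE at short horizons is the direct money measure of the coalescing tax.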
3.5 Data/Control Causality Drift (DCCD)
Fraction of episodes where control acknowledgements (ACK/drop-copy/risk updates) arrive with inconsistent ordering relative to market-data transition timing under heavy coalescing.
4) Feature contract additions for slippage models
Add control-plane features explicitly:
- rx_usecs, rx_frames, adaptive RX/TX state,
- per-queue IRQ rate and coalesced event counters,
- NAPI budget hits / softirq backlog / ksoftirqd utilization,
- NIC queue occupancy or drop counters,
- app ingest burstiness statistics (p50/p95/p99 inter-arrival),
- dispatch burst statistics,
- ACK latency tails and sequencing drift indicators,
- residual urgency and deadline headroom.
Without these, models misattribute control artifacts to “market regime shifts.”
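As a feature-contract sketch, one observation row of the additions might look like the following (field names are illustrative, not a fixed schema):

```python
from dataclasses import dataclass

@dataclass
class ControlPlaneFeatures:
    """One sampled row of the control-plane feature additions;
    all field names are illustrative."""
    rx_usecs: int                # current coalescing time threshold
    rx_frames: int               # current coalescing frame threshold
    adaptive_rx: bool            # adaptive RX coalescing enabled?
    irq_rate_per_queue: float    # IRQs/sec on the feed queue
    napi_budget_hits: int        # NAPI budget exhaustion count
    ksoftirqd_util: float        # softirq thread utilization [0, 1]
    nic_drops: int               # NIC queue drop counter delta
    ingest_dt_p99_us: float      # app ingest inter-arrival p99
    dispatch_burst_p99: float    # child-order emit burstiness p99
    ack_latency_p99_us: float    # ACK latency tail
    residual_urgency: float      # strategy-side residual urgency
    deadline_headroom_s: float   # time left to completion deadline
```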
5) Live state machine
Use a coalescing-aware execution controller:
FLOW_CLEAN
- low MDDR/BDC/DBC
- standard tactic menu
DELIVERY_CLUSTERED
- rising BDC + mild CTE
- reduce cancel/replace churn, smooth dispatch
PHASE_LOCKED
- high DBC + adverse CTE
- anti-burst pacing, tighter aggression gates, reserve capacity for cancels/exits
SAFE_PACING
- sustained high CTE or DCCD breaches
- prioritize completion reliability and tail containment over marginal spread capture
Require hysteresis and minimum dwell to avoid state flapping.
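The controller skeleton with hysteresis and minimum dwell can be sketched as follows; all threshold values are illustrative placeholders, and the 0.8x tightening factor implements the hysteresis (metrics must be comfortably clear of a state's entry conditions before de-escalation):

```python
STATES = ("FLOW_CLEAN", "DELIVERY_CLUSTERED", "PHASE_LOCKED", "SAFE_PACING")

def target(bdc_, dbc_, cte_, dccd, scale=1.0):
    """Map current KPIs to a severity level; thresholds are placeholders."""
    if dccd > 0.05 * scale or cte_ > 3.0 * scale:
        return 3  # SAFE_PACING
    if dbc_ > 0.6 * scale and cte_ > 1.0 * scale:
        return 2  # PHASE_LOCKED
    if bdc_ > 0.4 * scale:
        return 1  # DELIVERY_CLUSTERED
    return 0      # FLOW_CLEAN

def step(state, dwell, bdc_, dbc_, cte_, dccd, min_dwell=50):
    """Escalate immediately; de-escalate one level only after min_dwell
    controller ticks AND with tightened (0.8x) thresholds (hysteresis)."""
    up = target(bdc_, dbc_, cte_, dccd, scale=1.0)
    down = target(bdc_, dbc_, cte_, dccd, scale=0.8)
    if up > state:
        return up, 0                  # escalate without delay
    if down < state and dwell >= min_dwell:
        return state - 1, 0           # de-escalate one level, reset dwell
    return state, dwell + 1           # hold state, accumulate dwell
```

De-escalating one level at a time, only after the dwell period, is what prevents state flapping when KPIs hover near a threshold.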
6) Controls that usually work
Control A — Separate low-latency feed paths from non-critical traffic
Avoid sharing queue/CPU paths between latency-critical market data and background/control traffic where possible.
Control B — Coalescing policy by traffic class
For low-latency feed handlers, test lower rx-usecs / rx-frames or non-adaptive settings. Throughput-optimized defaults are often too batchy for micro-timing-sensitive execution.
Control C — NAPI/CPU affinity hygiene
Pin IRQ and poll-heavy paths to stable CPU sets; reduce contention from unrelated workloads.
Control D — Anti-burst dispatch shaping
If delivery bursts are unavoidable, shape child-order emission with bounded micro-jitter and debt-aware pacing to break phase-lock loops.
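Control D reduces to two invariants: never emit earlier than intended, and never emit two children closer than a minimum gap, with bounded random jitter on top to break phase-lock. A sketch with illustrative gap and jitter parameters:

```python
import random

def shape_dispatch(emit_ts_us, min_gap_us=50.0, jitter_us=20.0, seed=0):
    """Anti-burst dispatch shaping sketch: enforce a minimum inter-emit
    gap and add bounded uniform micro-jitter so child-order emission
    stops phase-locking to delivery bursts. Guarantees each shaped time
    is >= its intended time and consecutive gaps are >= min_gap_us."""
    rng = random.Random(seed)  # deterministic here; use live entropy in production
    shaped, last = [], float("-inf")
    for t in sorted(emit_ts_us):
        t_out = max(t, last + min_gap_us) + rng.uniform(0.0, jitter_us)
        shaped.append(t_out)
        last = t_out
    return shaped
```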
Control E — Causality guardrails
When DCCD breaches threshold, down-weight fragile microstructure features and shift to robust fallback tactics until ordering confidence recovers.
7) Validation protocol
Offline replay
- Rebuild ingress and app timelines.
- Label high-BDC intervals.
- Estimate CTE by symbol/venue/time bucket.
Shadow mode
- Run coalescing-aware controller with no live action changes.
- Compare projected branch outcomes against production baseline.
Canary rollout
- 5–10% traffic slice,
- strict rollback on completion degradation or q95 slippage deterioration.
Promotion gates:
- lower BDC/DBC,
- non-inferior fill completion,
- improved q95/q99 slippage,
- stable or improved reject/retry rates.
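The promotion gates collapse into a single boolean check; the dict keys below are illustrative names for the four gates, not an established schema:

```python
def passes_promotion_gates(canary, baseline):
    """True only if the canary slice beats baseline on burst metrics and
    tail slippage while staying non-inferior on completion and rejects."""
    return (canary["bdc"] < baseline["bdc"]                          # lower BDC
            and canary["dbc"] < baseline["dbc"]                      # lower DBC
            and canary["fill_completion"] >= baseline["fill_completion"]
            and canary["slip_q95"] <= baseline["slip_q95"]           # tail slippage
            and canary["slip_q99"] <= baseline["slip_q99"]
            and canary["reject_retry_rate"] <= baseline["reject_retry_rate"])
```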
8) Failure patterns to avoid
Single throughput-tuned NIC profile for all workloads
Great for bulk traffic, harmful for micro-timing-sensitive routing.
Adaptive coalescing with no observability
Auto-tuned settings can drift silently across regimes.
Treating packet batching as harmless jitter
It can systematically alter queue-entry timing and branch probabilities.
No dual-timeline instrumentation
If you only log app timestamps, coalescing effects are nearly invisible.
Mean-only optimization
Tail costs (q95/q99) are where batching damage concentrates.
9) 10-day implementation plan
Days 1–2
Instrument ingress-vs-app timing and queue-level coalescing metadata.
Days 3–4
Build MDDR/BDC/DBC/CTE dashboards by strategy-symbol-venue.
Days 5–6
Estimate branch model (TIMELY / BURST-SYNC / LATE-RECOVERY).
Days 7–8
Enable controller in shadow mode with anti-burst shaping and causality guards.
Day 9
Canary with hard rollback criteria.
Day 10
Publish v1 runbook; schedule weekly recalibration by regime.
Bottom line
RX interrupt coalescing is not just a systems tuning detail. In low-latency execution, it can become a first-class slippage driver by reshaping event timing, synchronizing decision bursts, and inflating tail cleanup cost.
If you measure and control coalescing-induced clustering explicitly, you can often recover meaningful basis points without touching alpha logic.
References
- Linux ethtool manual (--show-coalesce, --coalesce)
  https://man7.org/linux/man-pages/man8/ethtool.8.html
- Linux Kernel Documentation: NAPI
  https://docs.kernel.org/networking/napi.html
- Linux Kernel Documentation: Scaling in the Networking Stack (RSS/RPS/RFS/XPS)
  https://docs.kernel.org/networking/scaling.html
- Red Hat RHEL 10: Tuning interrupt coalescence settings
  https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/10/html/network_troubleshooting_and_performance_tuning/tuning-interrupt-coalescence-settings
- DPDK release notes (interrupt mode PMD context)
  https://dpdk.readthedocs.io/en/v16.04/rel_notes/release_2_1.html