RX Interrupt Coalescing as a Hidden Slippage Engine (Practical Playbook)

2026-03-16 · finance

Audience: low-latency execution teams running Linux NIC stacks (kernel or user-space) between market data ingest and routing


Why this matters

Execution teams often model slippage with market-state features (spread, volatility, imbalance, queue signals) and strategy-state features (urgency, residual, participation).

But many desks still ignore a control-plane source of cost drift:

  1. market-data packets arrive at the NIC,
  2. the NIC coalesces interrupts (rx-usecs, rx-frames, adaptive modes),
  3. packets are delivered to software in bursts,
  4. decision loops and child-order emits phase-lock to those bursts,
  5. queue entry timing degrades and tail markouts widen.

This appears as “random latency noise” in TCA, while the root cause is often deterministic batching in the receive path.


1) Mechanism: how coalescing leaks into execution cost

Interrupt coalescing delays IRQ delivery until either:

  1. a timeout expires (rx-usecs microseconds after the first pending packet), or
  2. enough packets have accumulated (rx-frames).

So the application sees packet arrivals in clusters instead of with near-original micro-timing.

Let true packet arrivals be \(\{t_i\}\), and app-visible arrivals be \(\{\tilde t_i\}\). Under coalescing:

\[ \tilde t_i = t_i + \delta_i, \]

where \(\delta_i\) is state-dependent (traffic intensity, queue occupancy, coalescing settings, NAPI/softirq contention).

The desk then optimizes on \(\tilde t_i\), not \(t_i\): queue-join decisions, cancel/replace timing, and aggression triggers all key off the distorted timeline.
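The distortion \(\delta_i\) can be simulated directly. A minimal sketch under a simplified NIC model (an interrupt fires either when rx-usecs have elapsed since the first pending packet or when rx-frames packets are pending; real NICs and NAPI add further structure):

```python
def coalesce(arrivals, rx_usecs=50, rx_frames=8):
    """Simulate NIC interrupt coalescing: packets queue until either
    rx_usecs have elapsed since the first pending packet or rx_frames
    packets are pending; the whole batch is then delivered at once.
    `arrivals` are true wire timestamps in seconds, sorted ascending."""
    timeout = rx_usecs * 1e-6
    delivered, pending = [], []
    for t in arrivals:
        # Fire any timer-triggered interrupt that expires before this arrival.
        while pending and t >= pending[0] + timeout:
            fire = pending[0] + timeout
            delivered += [fire] * len(pending)
            pending = []
        pending.append(t)
        if len(pending) >= rx_frames:      # frame-count trigger
            delivered += [t] * len(pending)
            pending = []
    if pending:                            # flush the final batch on timeout
        delivered += [pending[0] + timeout] * len(pending)
    return delivered

# Example: three packets 10 us apart, then a straggler 500 us later.
# The first three are delivered together when the 50 us timer fires.
deliveries = coalesce([0.0, 10e-6, 20e-6, 500e-6])
```

Subtracting `arrivals` from `deliveries` element-wise gives the per-packet \(\delta_i\) that the rest of this note treats as the hidden cost driver.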


2) Slippage branch model with coalescing distortion

For each parent order, model three execution branches:

  1. TIMELY branch: low delivery distortion, normal queue interaction, cost \(C_T\)
  2. BURST-SYNC branch: clustered decisions/dispatch, queue competition rises, cost \(C_B\)
  3. LATE-RECOVERY branch: residual catch-up after stale windows, aggressive cleanup, cost \(C_L\)

Expected cost:

\[ E[C] = p_T C_T + p_B C_B + p_L C_L, \]

with typical ordering \(C_T < C_B < C_L\).

Most desks only optimize \(C_T\)-centric behavior. The practical gain comes from reducing \(p_B\) and \(p_L\).
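A toy illustration of the branch arithmetic, with hypothetical per-branch costs in bps and made-up probability mixes (real values would come from your TCA data):

```python
def expected_cost(p, c):
    """E[C] = p_T*C_T + p_B*C_B + p_L*C_L over the three branches."""
    assert abs(sum(p) - 1.0) < 1e-9, "branch probabilities must sum to 1"
    return sum(pi * ci for pi, ci in zip(p, c))

# Hypothetical branch costs in bps, ordered C_T < C_B < C_L.
costs = (1.0, 3.0, 8.0)
# Hypothetical probability mixes before/after anti-burst controls:
# mass moves from BURST-SYNC and LATE-RECOVERY into TIMELY.
before = (0.70, 0.22, 0.08)
after = (0.88, 0.09, 0.03)
saving = expected_cost(before, costs) - expected_cost(after, costs)
```

Even though \(C_T\) is untouched, shifting probability mass out of the expensive branches drops expected cost from 2.00 to 1.39 bps in this made-up mix.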


3) Detection metrics (new KPI set)

3.1 Market Data Distortion Ratio (MDDR)

Compare ingress-level inter-arrival structure (packet capture / NIC timestamp) vs app-level event spacing:

\[ MDDR = \frac{Q95(\Delta \tilde t)}{Q95(\Delta t)}. \]

Persistent \(MDDR \gg 1\) indicates receive-path time dilation.

3.2 Burst Delivery Concentration (BDC)

In short windows \(w\) (e.g., 100–500 \(\mu s\)):

\[ BDC = \frac{\max_w N_w}{\sum_w N_w}, \]

where \(N_w\) is the number of delivered market-data events in window \(w\). High BDC means software sees compressed bursts.
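MDDR and BDC are cheap to compute from two timestamp streams. A sketch, assuming lists of wire-level and app-level event times in seconds (the 250 µs window is one of the illustrative values above):

```python
def q(xs, p):
    """p-th percentile with linear interpolation between closest ranks."""
    s = sorted(xs)
    k = (len(s) - 1) * p / 100.0
    f, c = int(k), min(int(k) + 1, len(s) - 1)
    return s[f] + (s[c] - s[f]) * (k - f)

def mddr(wire_ts, app_ts):
    """Q95 of app-level inter-arrival gaps over Q95 of wire-level gaps."""
    gaps = lambda ts: [b - a for a, b in zip(ts, ts[1:])]
    return q(gaps(app_ts), 95) / q(gaps(wire_ts), 95)

def bdc(app_ts, window=250e-6):
    """Share of delivered events landing in the single busiest window."""
    counts = {}
    for t in app_ts:
        b = int(t // window)
        counts[b] = counts.get(b, 0) + 1
    return max(counts.values()) / len(app_ts)
```

For example, 100 wire events spaced evenly at 10 µs but delivered to the app in ten batches of ten produce an MDDR near 10 and a BDC far above the uniform baseline.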

3.3 Decision-Burst Coupling (DBC)

Correlation between delivery bursts and child-order emit bursts. High DBC means execution cadence is being driven by NIC/software batching rather than market intent.

3.4 Coalescing Tax Estimate (CTE)

\[ CTE_{\tau} = E[M_{\tau} \mid \text{high BDC}] - E[M_{\tau} \mid \text{low BDC}], \]

for markout horizons \(\tau \in \{1s, 5s, 30s\}\).
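A minimal estimator of the conditional markout difference, assuming per-episode markouts and BDC readings are already joined; the 0.5 split threshold is arbitrary here and would be set from the observed BDC distribution in practice:

```python
def coalescing_tax(markouts, bdc_values, threshold=0.5):
    """CTE for one horizon: mean markout over high-BDC episodes minus
    mean markout over low-BDC episodes, split at a BDC threshold.
    A negative value means high-BDC episodes carry worse markouts."""
    hi = [m for m, b in zip(markouts, bdc_values) if b >= threshold]
    lo = [m for m, b in zip(markouts, bdc_values) if b < threshold]
    if not hi or not lo:
        raise ValueError("need episodes on both sides of the threshold")
    return sum(hi) / len(hi) - sum(lo) / len(lo)
```

In production this would be computed per \(\tau\) and per strategy-symbol-venue cell, not pooled, since the tax concentrates unevenly.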

3.5 Data/Control Causality Drift (DCCD)

Fraction of episodes where control acknowledgements (ACK/drop-copy/risk updates) arrive with inconsistent ordering relative to market-data transition timing under heavy coalescing.


4) Feature contract additions for slippage models

Add control-plane features explicitly:

  1. trailing-window MDDR, BDC, and DBC
  2. CTE by markout horizon
  3. DCCD breach flags
  4. active coalescing configuration (rx-usecs, rx-frames, adaptive mode) per receive queue

Without these, models misattribute control artifacts to “market regime shifts.”


5) Live state machine

Use a coalescing-aware execution controller:

  1. FLOW_CLEAN

    • low MDDR/BDC/DBC
    • standard tactic menu
  2. DELIVERY_CLUSTERED

    • rising BDC + mild CTE
    • reduce cancel/replace churn, smooth dispatch
  3. PHASE_LOCKED

    • high DBC + adverse CTE
    • anti-burst pacing, tighter aggression gates, reserve capacity for cancels/exits
  4. SAFE_PACING

    • sustained high CTE or DCCD breaches
    • prioritize completion reliability and tail containment over marginal spread capture

Require hysteresis and minimum dwell to avoid state flapping.
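A sketch of the controller skeleton. All thresholds below are placeholders, dwell-based gating stands in for full hysteresis (separate entry/exit thresholds would be the next refinement), and transitions move one state at a time:

```python
STATES = ["FLOW_CLEAN", "DELIVERY_CLUSTERED", "PHASE_LOCKED", "SAFE_PACING"]

class CoalescingController:
    """Minimal controller sketch: pick a target state from the KPIs,
    then step toward it one state at a time, only after a minimum
    dwell (in KPI updates) to avoid state flapping."""

    def __init__(self, min_dwell=5):
        self.state = "FLOW_CLEAN"
        self.min_dwell = min_dwell
        self.dwell = 0

    def _target(self, mddr, bdc, dbc, cte):
        # Placeholder thresholds; calibrate from the KPI dashboards.
        if cte > 2.0 or dbc > 0.8:
            return "SAFE_PACING" if cte > 4.0 else "PHASE_LOCKED"
        if bdc > 0.4 or mddr > 3.0:
            return "DELIVERY_CLUSTERED"
        return "FLOW_CLEAN"

    def update(self, mddr, bdc, dbc, cte):
        self.dwell += 1
        target = self._target(mddr, bdc, dbc, cte)
        if target != self.state and self.dwell >= self.min_dwell:
            # One step toward the target; reset dwell on every transition.
            i, j = STATES.index(self.state), STATES.index(target)
            self.state = STATES[i + (1 if j > i else -1)]
            self.dwell = 0
        return self.state
```

De-escalation is gated by the same dwell as escalation, so a brief clean window inside a clustered regime does not bounce tactics back prematurely.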


6) Controls that usually work

Control A — Separate low-latency feed paths from non-critical traffic

Avoid sharing queue/CPU paths between latency-critical market data and background/control traffic where possible.

Control B — Coalescing policy by traffic class

For low-latency feed handlers, test lower rx-usecs / rx-frames or non-adaptive settings (on kernel stacks these are exposed via ethtool -C). Throughput-optimized defaults are often too batchy for micro-timing-sensitive execution.

Control C — NAPI/CPU affinity hygiene

Pin IRQ and poll-heavy paths to stable CPU sets; reduce contention from unrelated workloads.

Control D — Anti-burst dispatch shaping

If delivery bursts are unavoidable, shape child-order emission with bounded micro-jitter and debt-aware pacing to break phase-lock loops.
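Bounded micro-jitter shaping can be as simple as scheduling emits with a minimum gap plus a small random offset. A sketch (the gap and jitter values are illustrative; debt-aware pacing, which would adapt the gap to outstanding residual, is omitted):

```python
import random

def shape_dispatch(n_orders, now, min_gap=50e-6, jitter=20e-6, seed=None):
    """Schedule n_orders child-order emits starting at `now` (seconds):
    enforce a minimum inter-send gap and add bounded uniform jitter so
    emit timing stops phase-locking to delivery bursts.
    Returns the scheduled send times, strictly increasing as long as
    jitter < min_gap."""
    rng = random.Random(seed)
    times, t = [], now
    for _ in range(n_orders):
        times.append(t + rng.uniform(0.0, jitter))
        t += min_gap
    return times
```

With these defaults, consecutive emits are separated by 30–70 µs instead of firing as one cluster the instant a coalesced batch lands.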

Control E — Causality guardrails

When DCCD breaches threshold, down-weight fragile microstructure features and shift to robust fallback tactics until ordering confidence recovers.


7) Validation protocol

Offline replay

Replay captured dual-timeline data (wire vs app timestamps) under candidate coalescing and pacing settings; re-estimate branch probabilities and the KPI set.

Shadow mode

Run the controller with no routing impact; log recommended state and tactic deltas against live behavior and projected CTE.

Canary rollout

Route a small, symbol/venue-scoped share of flow through the controller with hard rollback criteria.

Promotion gates:

  1. CTE and tail markouts (q95/q99) improve
  2. no degradation in fill rate or completion reliability
  3. no increase in DCCD breaches or state flapping

8) Failure patterns to avoid

  1. Single throughput-tuned NIC profile for all workloads
    Great for bulk traffic, harmful for micro-timing-sensitive routing.

  2. Adaptive coalescing with no observability
    Auto-tuned settings can drift silently across regimes.

  3. Treating packet batching as harmless jitter
    It can systematically alter queue-entry timing and branch probabilities.

  4. No dual-timeline instrumentation
    If you only log app timestamps, coalescing effects are nearly invisible.

  5. Mean-only optimization
    Tail costs (q95/q99) are where batching damage concentrates.


9) 10-day implementation plan

Days 1–2
Instrument ingress-vs-app timing and queue-level coalescing metadata.

Days 3–4
Build MDDR/BDC/DBC/CTE dashboards by strategy-symbol-venue.

Days 5–6
Estimate branch model (TIMELY / BURST-SYNC / LATE-RECOVERY).

Days 7–8
Enable controller in shadow mode with anti-burst shaping and causality guards.

Day 9
Canary with hard rollback criteria.

Day 10
Publish v1 runbook; schedule weekly recalibration by regime.


Bottom line

RX interrupt coalescing is not just a systems tuning detail. In low-latency execution, it can become a first-class slippage driver by reshaping event timing, synchronizing decision bursts, and inflating tail cleanup cost.

If you measure and control coalescing-induced clustering explicitly, you can often recover meaningful basis points without touching alpha logic.

