NAPI Busy-Poll Latency-Cliff Slippage Playbook

2026-04-10 · finance

Audience: low-latency execution teams running Linux socket/epoll-based market-data or order-entry paths


Why this matters

Busy polling is one of those tuning knobs that looks magical in a benchmark.

But the same mechanism can quietly become a slippage engine in production.

Linux NAPI busy polling explicitly trades CPU cycles for lower latency. It can be enabled per socket with SO_BUSY_POLL, globally with net.core.busy_read / net.core.busy_poll, and in more advanced forms with SO_PREFER_BUSY_POLL, per-epoll busy-poll parameters, and IRQ suspension.
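As a concrete sketch of the per-socket form, the snippet below sets SO_BUSY_POLL via `setsockopt`. Note the assumptions: Python's socket module does not export this constant on every version, so the value 46 from Linux's asm-generic socket headers is hard-coded as a fallback, and the call may be refused (e.g. insufficient privileges on older kernels, or a non-Linux host), which the helper reports rather than raises.

```python
import socket

# SO_BUSY_POLL is not exported by every Python build; 46 is its value in
# Linux's include/uapi/asm-generic/socket.h (assumption: mainstream arches —
# verify against your kernel headers).
SO_BUSY_POLL = getattr(socket, "SO_BUSY_POLL", 46)

def enable_busy_poll(sock: socket.socket, usecs: int) -> bool:
    """Request per-socket busy polling for `usecs` microseconds.

    Returns True on success, False if the kernel refuses the option
    (missing privileges, unsupported platform, etc.).
    """
    try:
        sock.setsockopt(socket.SOL_SOCKET, SO_BUSY_POLL, usecs)
        return True
    except OSError:
        return False

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        print("busy poll enabled:", enable_busy_poll(s, 50))
    finally:
        s.close()
```

Per-socket enablement like this is the scalpel the playbook recommends later; the global `net.core.busy_read` / `net.core.busy_poll` sysctls apply the same window to every socket on the box.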

That is where the hidden tax shows up:

  1. a hot worker spins and gets beautiful latency,
  2. the polling thread steals CPU from strategy / risk / logging / sibling network work,
  3. IRQ delivery is deferred or suppressed while the application keeps up,
  4. then one scheduling miss, load spike, or NAPI-ID mismatch kicks the path back toward IRQ-driven delivery,
  5. and the receive path becomes bimodal — sometimes very fast, sometimes suddenly batchy.

For a trading system, bimodal timing is worse than slightly slow-but-stable timing: a feed that is sometimes very fast and sometimes suddenly batchy is harder to model, pace, and risk-check than one that is uniformly a few microseconds slower.

So the practical risk is not “busy polling is bad.” The risk is busy polling that looks great at p50 while destabilizing p95/p99 execution quality.


1) Mechanism: how busy polling turns into a slippage tax

Linux NAPI documentation describes busy polling as letting a user process check for incoming packets before the device interrupt fires. That can reduce latency, but it burns CPU and changes how packet delivery interacts with IRQ masking, NAPI budgets, and epoll dispatch.

At a high level, the path becomes:

  1. packets arrive at the NIC,
  2. instead of waiting purely for IRQ-driven wakeups, the userspace thread spins for up to busy_read / busy_poll / SO_BUSY_POLL microseconds,
  3. if packets show up during that spin window, the thread consumes them with low wakeup latency,
  4. if not, CPU cycles are spent without useful work,
  5. if SO_PREFER_BUSY_POLL / IRQ-suspension-style behavior is in play, the system may defer IRQs longer to favor polling,
  6. once the app misses cadence or traffic shape changes, delivery can snap back toward IRQ-driven or deferred-IRQ behavior.

That creates a latency cliff rather than a smooth latency curve.

Let:

  - (t_i) — baseline IRQ-driven delivery latency for packet (i),
  - (p_i) — busy-poll spin time spent on the attempt, capped at the configured window,
  - (h_i) — the spin time at which a fast hit lands, if one does,
  - (s_i) — latency added to co-located work by CPU steal,
  - (f_i) — extra delay when the path falls back toward IRQ/deferred-IRQ delivery.

Then a useful operational decomposition of the app-visible latency is:

[ \tilde t_i = t_i + \min(p_i, h_i) + s_i + f_i. ]

The seductive part is that fast-hit episodes shrink the median. The expensive part is that CPU-steal plus fallback episodes fatten the tail.

And tails matter more than medians when the desk pays through markouts, fill quality, and schedule completion.


2) Three execution branches created by busy polling

For slippage modeling, busy polling is best treated as a branch process, not a scalar latency feature.

Branch A — HOT_POLL

Busy poll finds packets quickly.

Characteristics: low wakeup latency, packets harvested inside the spin window, stable cadence, high PHR.

Cost: (C_H)

Branch B — CPU_STEAL

Busy poll spins often, but packet hits are sparse or bursty.

Characteristics: heavy spin time with little packet yield, CPU stolen from strategy / risk / logging / sibling network work.

Cost: (C_S)

Branch C — FALLBACK_IRQ

Busy poll misses cadence; system falls back to IRQ/deferred-IRQ delivery or mixed mode.

Characteristics: batchy delivery, higher and more variable app-visible latency, bimodal inter-arrival timing.

Cost: (C_F)

Expected execution cost becomes:

[ E[C] = p_H C_H + p_S C_S + p_F C_F, ]

with typical ordering:

[ C_H < C_S < C_F. ]
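The branch mix makes the tail arithmetic concrete. A toy calculation of (E[C]) under the three-branch model (all probabilities and per-fill costs here are made-up illustrative numbers, not calibrated values):

```python
def expected_cost(probs: dict[str, float], costs: dict[str, float]) -> float:
    """E[C] = p_H*C_H + p_S*C_S + p_F*C_F over the three branches."""
    assert abs(sum(probs.values()) - 1.0) < 1e-9, "branch probabilities must sum to 1"
    return sum(probs[b] * costs[b] for b in probs)

# Illustrative per-fill costs in bps: hot poll cheap, fallback expensive.
costs = {"HOT_POLL": 0.2, "CPU_STEAL": 0.8, "FALLBACK_IRQ": 3.0}

# A well-contained deployment vs. one drifting into steal/fallback episodes.
tuned  = expected_cost({"HOT_POLL": 0.90, "CPU_STEAL": 0.05, "FALLBACK_IRQ": 0.05}, costs)
drifty = expected_cost({"HOT_POLL": 0.60, "CPU_STEAL": 0.25, "FALLBACK_IRQ": 0.15}, costs)
print(f"tuned E[C]={tuned:.2f} bps, drifty E[C]={drifty:.2f} bps")
```

Both deployments would show a fast median (most events still land in HOT_POLL), but the drifting mix roughly doubles the expected per-fill cost — which is exactly the p50-flattering, tail-fattening pattern described above.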

The operational goal is not to maximize busy polling everywhere. It is to keep the sockets that genuinely benefit in HOT_POLL, and to detect and price CPU_STEAL and FALLBACK_IRQ episodes before they show up as slippage.


3) Observable signatures in production

3.1 Beautiful p50, worsening p99 slippage

You see lower median ingress-to-decision latency, but p95/p99 markouts and completion deficits drift the wrong way.

3.2 CPU rises without matching packet-yield gain

3.3 Mixed poll/IRQ timing clusters

3.4 Execution cadence phase-locks to worker polling behavior

3.5 Performance collapses when thread placement changes

Busy poll looked great on an isolated core, but degrades when the worker shares a core with strategy, risk, logging, or sibling network work, or when affinity changes move it off its NAPI-aligned core.

3.6 Tail damage appears during “moderate” load, not only during peaks

This is classic latency-cliff behavior: not outright saturation, just enough disturbance to break the hot-poll assumption.


4) KPI set for busy-poll-aware slippage modeling

4.1 Poll Hit Ratio (PHR)

Fraction of receive attempts where busy polling retrieves packets before fallback.

[ PHR = \frac{\text{busy-poll hit events}}{\text{busy-poll opportunities}} ]

Low PHR with high CPU means you are paying for spin without getting the latency benefit.

4.2 Poll Waste Ratio (PWR)

Share of busy-poll time spent with no useful packet harvest.

[ PWR = \frac{\text{empty or low-yield busy-poll time}}{\text{total busy-poll time}} ]
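Both ratios fall out of simple per-window counters. A minimal sketch (the counter names and sample values are hypothetical; in practice they would come from your receive-loop instrumentation):

```python
def poll_hit_ratio(hits: int, opportunities: int) -> float:
    """PHR: fraction of busy-poll opportunities that harvested packets."""
    return hits / opportunities if opportunities else 0.0

def poll_waste_ratio(empty_poll_us: float, total_poll_us: float) -> float:
    """PWR: share of busy-poll time spent with no useful packet harvest."""
    return empty_poll_us / total_poll_us if total_poll_us else 0.0

# Example counters over a one-second window (hypothetical values).
phr = poll_hit_ratio(hits=700, opportunities=1000)
pwr = poll_waste_ratio(empty_poll_us=120_000, total_poll_us=400_000)
print(f"PHR={phr:.2f} PWR={pwr:.2f}")
```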

4.3 Poll-to-IRQ Fallback Rate (PIFR)

How often the receive path exits hot polling and re-enters IRQ/deferred-IRQ handling.

High PIFR is a direct warning that the path is no longer stable enough to model as “fast.”

4.4 Receive Bimodality Index (RBI)

Measure separation between the fast busy-poll mode and the slower fallback mode in app-visible inter-arrival or ingress-to-decision latency.

A practical version is mixture separation or a simple ratio such as:

[ RBI = \frac{Q99(L)}{Q50(L)} ]

combined with cluster detection on latency histograms.
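One way to compute the ratio form of RBI on app-visible latencies, using a plain nearest-rank quantile (a sketch; the synthetic sample is constructed to be bimodal on purpose):

```python
def quantile(sorted_xs: list[float], q: float) -> float:
    """Nearest-rank quantile on a pre-sorted sample (no interpolation)."""
    idx = min(len(sorted_xs) - 1, int(q * len(sorted_xs)))
    return sorted_xs[idx]

def rbi(latencies_us: list[float]) -> float:
    """Receive Bimodality Index: Q99(L) / Q50(L)."""
    xs = sorted(latencies_us)
    return quantile(xs, 0.99) / quantile(xs, 0.50)

# Synthetic bimodal sample: 95% fast busy-poll mode (~5us), 5% fallback (~80us).
sample = [5.0] * 95 + [80.0] * 5
print(f"RBI={rbi(sample):.1f}")
```

A unimodal path keeps RBI near 1; the fallback mode above pushes it to 16, which is the kind of separation that should also show up as two clusters in the latency histogram.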

4.5 Busy-Poll Slippage Tax (BPST)

Estimate the incremental execution cost attributable to unstable busy-poll regimes:

[ BPST_{\tau} = E[M_{\tau} \mid \text{high PWR or high PIFR}] - E[M_{\tau} \mid \text{stable high PHR}] ]

for markout horizons (\tau \in \{1\text{s}, 5\text{s}, 30\text{s}\}).
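In sketch form, BPST at one horizon is a conditional-mean difference over labeled episodes. The episode schema and regime labels below are placeholders: how you threshold PWR/PIFR into "stable" vs "unstable" is a calibration choice, and the markout numbers are invented for illustration.

```python
from statistics import mean

def bpst(episodes: list[dict]) -> float:
    """Busy-Poll Slippage Tax at one markout horizon:
    mean markout under unstable busy-poll regimes minus mean markout
    under stable hot-poll regimes.
    Each episode: {"regime": "STABLE" | "UNSTABLE", "markout_bps": float}.
    """
    unstable = [e["markout_bps"] for e in episodes if e["regime"] == "UNSTABLE"]
    stable = [e["markout_bps"] for e in episodes if e["regime"] == "STABLE"]
    return mean(unstable) - mean(stable)

# Hypothetical 5s-markout episodes, labeled upstream by PWR/PIFR thresholds.
eps = [
    {"regime": "STABLE", "markout_bps": -0.1},
    {"regime": "STABLE", "markout_bps": 0.1},
    {"regime": "UNSTABLE", "markout_bps": 0.9},
    {"regime": "UNSTABLE", "markout_bps": 1.1},
]
print(f"BPST_5s = {bpst(eps):.2f} bps")
```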

4.6 CPU Steal Coupling (CSC)

Correlation between busy-poll intensity and degradation in strategy-loop / risk-loop / serializer timing.

This is the metric that reveals whether the network path is improving itself by hurting the rest of the stack.
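A minimal CSC sketch is a plain Pearson correlation between per-interval busy-poll CPU share and co-located loop timing; the interval samples below are hypothetical, and in production you would compute this per core over rolling windows.

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Per-interval samples (hypothetical): busy-poll CPU share on a core vs.
# strategy-loop p99 on the same core. A strongly positive correlation means
# the network path is improving itself at the strategy loop's expense.
poll_share = [0.1, 0.3, 0.5, 0.7, 0.9]
strat_p99_us = [12.0, 14.0, 19.0, 25.0, 34.0]
print(f"CSC = {pearson(poll_share, strat_p99_us):.2f}")
```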


5) Feature contract additions

If a slippage model ignores busy-poll state, it will misattribute infra-driven timing distortion to market regime changes.

Add features such as:

Socket / kernel settings

  - per-socket SO_BUSY_POLL / SO_PREFER_BUSY_POLL state, busy_read / busy_poll values, polling budget.

NAPI / IRQ path state

  - NAPI ID per descriptor (and whether an epoll context mixes IDs), gro_flush_timeout / IRQ-deferral configuration, observed poll-vs-IRQ delivery mix.

CPU scheduling state

  - polling-core affinity, whether the core is shared with strategy / risk / logging work, run-queue pressure on that core.

App-level timing state

  - rolling PHR / PWR / PIFR / RBI, poll-cadence stability.

Execution-state variables

  - current branch label (HOT_POLL / CPU_STEAL / FALLBACK_IRQ), per-episode markout and completion outcomes.


6) Highest-risk situations

6.1 Global enablement instead of selective enablement

Kernel docs explicitly note that selective SO_BUSY_POLL is the preferred method. Global sysctls are a blunt instrument. If the whole box busy-polls, the desk may improve one hot path while quietly degrading everything else.

6.2 Shared-core worker design

Busy polling on a core shared with strategy, risk, logging, compression, storage, or GC work is an invitation to CPU-steal slippage.

6.3 epoll workers mixing different NAPI IDs

Kernel docs call out that epoll-based busy polling assumes file descriptors in the same epoll context share the same NAPI ID. If they do not, “fast path” expectations become unstable.

6.4 Over-large polling windows

A larger busy_poll / busy_read window may improve hit rate in light tests but can turn empty spins into a real CPU tax in production.

6.5 Prefer-busy-poll without disciplined cadence

If IRQ deferral / suspension is configured assuming the application will poll regularly, then stalls or pacing drift become more expensive than in a normal IRQ-driven setup.

6.6 p50-first tuning culture

Busy polling is a classic trap for teams that celebrate median latency wins while ignoring p95/p99 fill quality and schedule completion.


7) Live state machine

CLEAN_IRQ

STABLE_HOT_POLL

Trigger: high PHR, low PWR, PIFR near zero over the evaluation window.

Actions: hold settings steady; treat low-latency signals at full weight.

POLL_WASTE

Trigger: PWR climbing and CSC turning positive while PHR stays low.

Actions: shrink the polling window or budget, or return the affected sockets to IRQ-driven delivery; watch co-located loop timing.

FALLBACK_MIXED

Trigger: PIFR and RBI spike; the latency histogram turns bimodal.

Actions: down-weight latency-sensitive signals and widen pacing; stop modeling the path as "fast."

SAFE_BATCHED

Trigger: fallback persists or poll cadence cannot be restored.

Actions: disable busy poll on the affected sockets and run IRQ-driven with deliberate batching until KPIs recover.
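The states above can be driven directly off the section-4 KPIs. The classifier below is a minimal sketch: the thresholds are illustrative placeholders, not calibrated values, and a real controller would add hysteresis so the state does not flap.

```python
def classify_state(phr: float, pwr: float, pifr: float, busy_poll_on: bool) -> str:
    """Map rolling KPI readings to the playbook's live states.

    Thresholds are illustrative placeholders; calibrate per deployment
    and add hysteresis / dwell times before acting on transitions.
    """
    if not busy_poll_on:
        return "CLEAN_IRQ"
    if pifr > 0.20:                     # frequent exits from hot polling
        return "FALLBACK_MIXED"
    if pwr > 0.50 and phr < 0.30:       # paying spin cost without the hits
        return "POLL_WASTE"
    if phr > 0.70 and pifr < 0.05:      # stable fast path
        return "STABLE_HOT_POLL"
    return "SAFE_BATCHED"               # ambiguous/degraded: run conservatively

print(classify_state(phr=0.85, pwr=0.10, pifr=0.02, busy_poll_on=True))
```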


8) Controls that usually work

Control A — Use busy polling only on the exact sockets that benefit

Prefer per-socket enablement over box-wide sysctl enablement. Treat busy polling as a scalpel, not a theme.

Control B — Keep worker/NAPI affinity disciplined

If you want stable busy-poll behavior, pair each epoll worker with descriptors that share a single NAPI ID, and pin that worker to a core aligned with that NAPI context.

Control C — Tune for tail, not median

A setting that improves p50 ingress latency but worsens p95 markout or completion deficit is a losing setting.

Control D — Separate usecs from budget

Time window and packet budget solve different problems. A higher polling budget can help under sustained load, but it can also deepen CPU monopolization if not bounded carefully.

Control E — Treat gro_flush_timeout / deferred IRQ knobs as part of the same system

Kernel docs explicitly describe the tradeoff: longer IRQ deferral favors the polling path but makes a missed poll cadence more expensive, while shorter deferral snaps back to interrupt-driven delivery sooner.

Do not tune busy poll in isolation from deferred-IRQ behavior.

Control F — Add CPU-protection guardrails

If busy poll is on, hard-monitor strategy-loop / risk-loop / serializer timing on every core that hosts a polling worker, along with CSC itself.

Control G — Canary by symbol/venue/time bucket

Do not trust one synthetic benchmark or one quiet symbol. Busy-poll gain is regime-sensitive.


9) Validation protocol

Offline reconstruction

Build dual timelines: kernel-side delivery (NAPI/IRQ events, poll-vs-interrupt mode) and app-visible receive timing (inter-arrival and ingress-to-decision).

Label episodes into HOT_POLL / CPU_STEAL / FALLBACK_IRQ regimes.

Estimate branch-specific slippage and completion outcomes.

Shadow mode

Keep production settings unchanged, but compute what the controller would have done under busy-poll-aware state classification.

Canary rollout

Start with a narrow traffic slice: one venue-symbol bucket on one host, with per-socket enablement only.

Promotion gates: BPST shrinks, tail markouts and completion outcomes improve, and CSC stays flat versus control.

Rollback triggers: PIFR or RBI spikes, CSC turns positive, or p99 slippage worsens versus control.


10) Failure patterns to avoid

  1. Turning on global busy_read / busy_poll because a benchmark looked nice
    This often converts one local win into a system-wide resource fight.

  2. Assuming busy polling is a pure receive-path improvement
    It is also a CPU scheduling policy in disguise.

  3. Ignoring same-NAPI-ID discipline in epoll worker design
    Then the supposed fast path is structurally inconsistent.

  4. Letting latency mode toggle silently with load
    Bimodal timing must be modeled explicitly, not averaged away.

  5. Using only median latency dashboards
    Slippage lives in the tails and in branch switches.

  6. Failing to separate polling gain from fallback pain
    The desk needs both branch benefits and branch failure costs.


11) 10-day implementation plan

Days 1–2
Instrument busy-poll settings, worker CPU state, NAPI/IRQ timing, and app receive distributions.

Days 3–4
Build PHR / PWR / PIFR / RBI / CSC dashboards.

Days 5–6
Estimate branch model: HOT_POLL / CPU_STEAL / FALLBACK_IRQ.

Days 7–8
Run shadow controller with state-aware signal down-weighting and pacing changes.

Day 9
Canary on a narrow venue-symbol slice with hard rollback triggers.

Day 10
Publish operating ranges for busy_poll, busy_read, budget, and IRQ-defer companions; schedule weekly recalibration.


Bottom line

Busy polling is not just a low-latency trick. In an execution stack, it is a timing-regime selector.

When it works, it pulls packets closer to the strategy. When it half-works, it steals CPU. When it breaks cadence, it creates a latency cliff and tail slippage.

So the right question is not:

“Did busy polling reduce receive latency?”

The right question is:

“Did busy polling improve execution quality after charging for CPU steal, fallback timing, and tail cleanup?”

That is the standard that saves basis points instead of winning benchmarks.

