NAPI Busy-Poll Latency-Cliff Slippage Playbook
Date: 2026-04-10
Category: research
Audience: low-latency execution teams running Linux socket/epoll-based market-data or order-entry paths
Why this matters
Busy polling is one of those tuning knobs that looks magical in a benchmark:
- median receive latency drops,
- packets arrive before the device interrupt fires,
- and the strategy feels “closer” to the wire.
But the same mechanism can quietly become a slippage engine in production.
Linux NAPI busy polling explicitly trades CPU cycles for lower latency. It can be enabled per socket with SO_BUSY_POLL, globally with net.core.busy_read / net.core.busy_poll, and in more advanced forms with SO_PREFER_BUSY_POLL, per-epoll busy-poll parameters, and IRQ suspension.
That is where the hidden tax shows up:
- a hot worker spins and gets beautiful latency,
- the polling thread steals CPU from strategy / risk / logging / sibling network work,
- IRQ delivery is deferred or suppressed while the application keeps up,
- then one scheduling miss, load spike, or NAPI-ID mismatch kicks the path back toward IRQ-driven delivery,
- and the receive path becomes bimodal — sometimes very fast, sometimes suddenly batchy.
For a trading system, bimodal timing is worse than slightly slow-but-stable timing:
- queue-entry timing becomes inconsistent,
- signal freshness becomes regime-dependent inside the same symbol session,
- microstructure features drift because the app’s visibility path changes,
- and parent schedulers overreact to artificial underfills or stale state.
So the practical risk is not “busy polling is bad.” The risk is busy polling that looks great at p50 while destabilizing p95/p99 execution quality.
1) Mechanism: how busy polling turns into a slippage tax
Linux NAPI documentation describes busy polling as letting a user process check for incoming packets before the device interrupt fires. That can reduce latency, but it burns CPU and changes how packet delivery interacts with IRQ masking, NAPI budgets, and epoll dispatch.
At a high level, the path becomes:
- packets arrive at the NIC,
- instead of waiting purely for IRQ-driven wakeups, the userspace thread spins for up to busy_read / busy_poll / SO_BUSY_POLL microseconds,
- if packets show up during that spin window, the thread consumes them with low wakeup latency,
- if not, CPU cycles are spent without useful work,
- if SO_PREFER_BUSY_POLL / IRQ-suspension-style behavior is in play, the system may defer IRQs longer to favor polling,
- once the app misses cadence or traffic shape changes, delivery can snap back toward IRQ-driven or deferred-IRQ behavior.
That creates a latency cliff rather than a smooth latency curve.
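The cliff shape is easy to reproduce with a toy Monte Carlo model. The sketch below is purely illustrative: the hit probability and the two latency bands are made-up numbers, not measurements, but they show how a path that is fast 90% of the time still produces an ugly p99.

```python
import random
import statistics

def simulate_receive_latency_us(n, p_hot, seed=7):
    """Toy model of the two delivery modes: hot busy-poll hits land in a few
    microseconds; misses fall back to IRQ/deferred-IRQ delivery and arrive
    tens of microseconds later, often batched. All numbers are illustrative."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        if rng.random() < p_hot:
            out.append(rng.uniform(2.0, 5.0))      # hot-poll hit
        else:
            out.append(rng.uniform(40.0, 120.0))   # fallback, batchy
    return out

lat = simulate_receive_latency_us(10_000, p_hot=0.9)
q = statistics.quantiles(lat, n=100)
print(f"p50={q[49]:.1f}us p99={q[98]:.1f}us ratio={q[98] / q[49]:.1f}x")
```

Even with a 90% hit rate, the p99 sits in the fallback band, an order of magnitude above the median: the distribution is a cliff, not a curve.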
Let:
- (t_i): true packet arrival time at the NIC / NAPI-visible ingress,
- (\tilde t_i): application-visible receive time,
- (p_i): time spent in busy-poll spin before receive attempt resolves,
- (s_i): scheduler / CPU-steal delay suffered by the strategy thread or its siblings,
- (f_i): fallback delay when busy poll fails and delivery returns to IRQ/deferred-IRQ handling.
Then a useful operational decomposition is:
[ \tilde t_i = t_i + p_i + s_i + f_i, ]
where (p_i) collapses to near zero on fast busy-poll hits and approaches the full spin window otherwise, and (f_i = 0) when no fallback occurs.
The seductive part is that fast-hit episodes shrink the median. The expensive part is that CPU-steal plus fallback episodes fatten the tail.
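As a concrete instance of the decomposition (the timestamps below are illustrative, in microseconds):

```python
def app_visible_time(t_i, p_i, s_i, f_i):
    """Operational decomposition from the text: true arrival time plus
    busy-poll spin time, scheduler/CPU-steal delay, and fallback delay."""
    return t_i + p_i + s_i + f_i

# Fast-hit episode: tiny spin, no steal, no fallback -> shrinks the median.
fast = app_visible_time(1000.0, 1.5, 0.0, 0.0)

# Miss episode: full spin window burned, then scheduler steal and IRQ
# fallback stack on top -> fattens the tail.
miss = app_visible_time(1000.0, 50.0, 12.0, 80.0)

print(fast, miss)
```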
And tails matter more than medians when the desk pays for:
- late passive entry,
- missed fade windows,
- catch-up marketable flow,
- residual urgency spikes,
- or control-loop instability near deadlines.
2) Three execution branches created by busy polling
For slippage modeling, busy polling is best treated as a branch process, not a scalar latency feature.
Branch A — HOT_POLL
Busy poll finds packets quickly.
Characteristics:
- low app-visible receive latency,
- stable per-thread cadence,
- minimal IRQ interference,
- improved queue-entry timing.
Cost: (C_H)
Branch B — CPU_STEAL
Busy poll spins often, but packet hits are sparse or bursty.
Characteristics:
- CPU time is burned on empty or low-yield polling,
- strategy / risk / serialization / GC / logging threads lose cycles,
- downstream dispatch becomes uneven.
Cost: (C_S)
Branch C — FALLBACK_IRQ
Busy poll misses cadence; system falls back to IRQ/deferred-IRQ delivery or mixed mode.
Characteristics:
- receive timing becomes bimodal,
- packets arrive in clumps,
- signal-to-action delay widens just when the app is already stressed,
- residual catch-up becomes more aggressive.
Cost: (C_F)
Expected execution cost becomes:
[ E[C] = p_H C_H + p_S C_S + p_F C_F, ]
with typical ordering:
[ C_H < C_S < C_F. ]
The operational goal is not to maximize busy polling everywhere. It is to:
- increase (p_H) only where it is structurally sustainable,
- keep (p_S) low by controlling CPU waste,
- and prevent (p_F) from appearing as a hidden tail regime.
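The branch model is directly computable. The cost values below are illustrative placeholders in bps of slippage, chosen only to respect the ordering (C_H < C_S < C_F); a desk would estimate them from labeled episodes.

```python
def expected_branch_cost(p_hot, p_steal, p_fallback, c_hot, c_steal, c_fallback):
    """E[C] = p_H*C_H + p_S*C_S + p_F*C_F from the branch model above.
    Branch probabilities must sum to 1; costs are in bps (illustrative)."""
    assert abs(p_hot + p_steal + p_fallback - 1.0) < 1e-9
    return p_hot * c_hot + p_steal * c_steal + p_fallback * c_fallback

# Same branch costs, different branch mix: a well-contained configuration
# versus one where CPU steal and fallback episodes have crept in.
tuned  = expected_branch_cost(0.90, 0.08, 0.02, c_hot=0.2, c_steal=1.0, c_fallback=4.0)
sloppy = expected_branch_cost(0.60, 0.25, 0.15, c_hot=0.2, c_steal=1.0, c_fallback=4.0)
print(f"tuned={tuned:.3f} bps  sloppy={sloppy:.3f} bps")
```

Note that the expected cost nearly triples even though the per-branch costs never changed; only the branch probabilities moved.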
3) Observable signatures in production
3.1 Beautiful p50, worsening p99 slippage
You see lower median ingress-to-decision latency, but:
- q95/q99 markout worsens,
- catch-up flow increases,
- completion quality degrades near bursts.
3.2 CPU rises without matching packet-yield gain
- worker core utilization rises sharply,
- packets-per-busy-poll opportunity falls,
- application jitter grows despite lower median socket latency.
3.3 Mixed poll/IRQ timing clusters
- app-level inter-arrival times become bimodal,
- one mode looks “wire-close,”
- the other looks suspiciously batchy.
3.4 Execution cadence phase-locks to worker polling behavior
- child-order bursts line up with epoll / recv cycles,
- quote reaction timing looks unnaturally quantized,
- cancel/replace bursts cluster after polling misses.
3.5 Performance collapses when thread placement changes
Busy poll looked great on an isolated core, but degrades when:
- the worker migrates,
- sibling work lands on the same core,
- NAPI ownership / queue mapping changes,
- or multiple NAPI IDs sneak into one epoll worker design.
3.6 Tail damage appears during “moderate” load, not only during peaks
This is classic latency-cliff behavior: not outright saturation, just enough disturbance to break the hot-poll assumption.
4) KPI set for busy-poll-aware slippage modeling
4.1 Poll Hit Ratio (PHR)
Fraction of receive attempts where busy polling retrieves packets before fallback.
[ PHR = \frac{\text{busy-poll hit events}}{\text{busy-poll opportunities}} ]
Low PHR with high CPU means you are paying for spin without getting the latency benefit.
4.2 Poll Waste Ratio (PWR)
Share of busy-poll time spent with no useful packet harvest.
[ PWR = \frac{\text{empty or low-yield busy-poll time}}{\text{total busy-poll time}} ]
4.3 Poll-to-IRQ Fallback Rate (PIFR)
How often the receive path exits hot polling and re-enters IRQ/deferred-IRQ handling.
High PIFR is a direct warning that the path is no longer stable enough to model as “fast.”
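The three ratios defined so far (PHR, PWR, PIFR) fall out of a handful of per-interval counters. The field names below are this playbook's vocabulary, not kernel statistics; how the counters are populated (eBPF, app instrumentation, driver stats) is deployment-specific.

```python
from dataclasses import dataclass

@dataclass
class BusyPollCounters:
    """Per-interval counters; names follow this playbook, not kernel stats."""
    poll_opportunities: int = 0   # receive attempts where busy poll could fire
    poll_hits: int = 0            # attempts where polling retrieved packets
    busy_poll_time_us: float = 0.0
    empty_poll_time_us: float = 0.0
    fallback_events: int = 0      # exits back to IRQ/deferred-IRQ delivery

    def phr(self):
        return self.poll_hits / max(self.poll_opportunities, 1)

    def pwr(self):
        return self.empty_poll_time_us / max(self.busy_poll_time_us, 1e-9)

    def pifr(self):
        return self.fallback_events / max(self.poll_opportunities, 1)

c = BusyPollCounters(poll_opportunities=10_000, poll_hits=8_200,
                     busy_poll_time_us=500_000.0, empty_poll_time_us=90_000.0,
                     fallback_events=300)
print(f"PHR={c.phr():.2f} PWR={c.pwr():.2f} PIFR={c.pifr():.3f}")
```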
4.4 Receive Bimodality Index (RBI)
Measure separation between the fast busy-poll mode and the slower fallback mode in app-visible inter-arrival or ingress-to-decision latency.
A practical version is mixture separation or a simple ratio such as:
[ RBI = \frac{Q99(L)}{Q50(L)} ]
combined with cluster detection on latency histograms.
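The quantile-ratio form of RBI is a one-liner over observed latencies. The synthetic samples below are illustrative: one tight unimodal cluster versus a 90/10 mix of a fast poll mode and a slow fallback mode.

```python
import statistics

def rbi(latencies_us):
    """Receive Bimodality Index: Q99(L) / Q50(L) over app-visible latency.
    Values far above the unimodal baseline suggest a fast poll mode
    coexisting with a slow fallback mode."""
    q = statistics.quantiles(latencies_us, n=100)
    return q[98] / q[49]

unimodal = [3.0 + 0.01 * i for i in range(1000)]      # one tight mode
bimodal  = [3.0] * 900 + [90.0] * 100                 # poll mode + fallback mode
print(f"unimodal RBI={rbi(unimodal):.1f}  bimodal RBI={rbi(bimodal):.1f}")
```

The ratio alone cannot distinguish bimodality from one heavy tail, which is why the text pairs it with cluster detection on the histogram.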
4.5 Busy-Poll Slippage Tax (BPST)
Estimate the incremental execution cost attributable to unstable busy-poll regimes:
[ BPST_{\tau} = E[M_{\tau} \mid \text{high PWR or high PIFR}] - E[M_{\tau} \mid \text{stable high PHR}] ]
for markout horizon (\tau \in \{1s, 5s, 30s\}).
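A minimal estimator of this conditional difference, assuming episodes have already been flagged as unstable (high PWR or high PIFR) versus stable high-PHR; the markout values are illustrative.

```python
import statistics

def bpst(markouts_bps, unstable_flags):
    """Busy-Poll Slippage Tax at one horizon: mean markout cost over
    unstable episodes minus mean over stable high-PHR episodes."""
    unstable = [m for m, bad in zip(markouts_bps, unstable_flags) if bad]
    stable   = [m for m, bad in zip(markouts_bps, unstable_flags) if not bad]
    if not unstable or not stable:
        return 0.0  # cannot estimate a difference with an empty side
    return statistics.fmean(unstable) - statistics.fmean(stable)

markouts = [0.2, 0.3, 0.25, 1.1, 0.9, 1.3]           # bps, illustrative
flags    = [False, False, False, True, True, True]    # True = unstable episode
print(f"BPST = {bpst(markouts, flags):.2f} bps")
```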
4.6 CPU Steal Coupling (CSC)
Correlation between busy-poll intensity and degradation in strategy-loop / risk-loop / serializer timing.
This is the metric that reveals whether the network path is improving itself by hurting the rest of the stack.
5) Feature contract additions
If a slippage model ignores busy-poll state, it will misattribute infra-driven timing distortion to market regime changes.
Add features such as:
Socket / kernel settings
- net.core.busy_read
- net.core.busy_poll
- so_busy_poll_usecs
- so_busy_poll_budget
- so_prefer_busy_poll
- epoll busy-poll params (busy_poll_usecs, busy_poll_budget, prefer_busy_poll)
NAPI / IRQ path state
- gro_flush_timeout
- napi_defer_hard_irqs
- irq_suspend_timeout_ns
- per-NAPI ID assignment consistency
- IRQ masked duration
- softirq / ksoftirqd activity
- NAPI budget exhaustion indicators
CPU scheduling state
- worker-core utilization
- run-queue depth
- involuntary context switches
- core migrations
- sibling-thread CPU share
- isolated-core vs shared-core flag
App-level timing state
- ingress-to-app latency distribution
- app inter-arrival burstiness
- epoll wakeup cadence
- child-order emit burstiness
- cancel/replace clustering
Execution-state variables
- residual urgency
- deadline headroom
- queue-entry loss estimate
- completion deficit after fallback episodes
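One way to make the contract concrete is a typed feature row. The fields below are a representative subset with illustrative names, not a fixed schema; the point is that busy-poll state travels alongside execution state into the model.

```python
from dataclasses import dataclass, asdict

@dataclass
class BusyPollFeatureRow:
    """One model-input row spanning the feature groups above.
    Field names are illustrative, not a fixed schema."""
    busy_poll_usecs: int          # socket/kernel settings
    prefer_busy_poll: bool
    napi_id_consistent: bool      # NAPI / IRQ path state
    worker_core_util: float       # CPU scheduling state
    involuntary_ctx_switches: int
    rbi: float                    # app-level timing state
    pifr: float
    residual_urgency: float       # execution-state variables

row = BusyPollFeatureRow(busy_poll_usecs=50, prefer_busy_poll=True,
                         napi_id_consistent=True, worker_core_util=0.93,
                         involuntary_ctx_switches=12, rbi=1.4, pifr=0.02,
                         residual_urgency=0.1)
print(len(asdict(row)), "features in this sketch")
```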
6) Highest-risk situations
6.1 Global enablement instead of selective enablement
Kernel docs explicitly note that selective SO_BUSY_POLL is the preferred method. Global sysctls are a blunt instrument. If the whole box busy-polls, the desk may improve one hot path while quietly degrading everything else.
6.2 Shared-core worker design
Busy polling on a core shared with strategy, risk, logging, compression, storage, or GC work is an invitation to CPU-steal slippage.
6.3 epoll workers mixing different NAPI IDs
Kernel docs call out that epoll-based busy polling assumes file descriptors in the same epoll context share the same NAPI ID. If they do not, “fast path” expectations become unstable.
6.4 Over-large polling windows
A larger busy_poll / busy_read window may improve hit rate in light tests but can turn empty spins into a real CPU tax in production.
6.5 Prefer-busy-poll without disciplined cadence
If IRQ deferral / suspension is configured assuming the application will poll regularly, then stalls or pacing drift become more expensive than in a normal IRQ-driven setup.
6.6 p50-first tuning culture
Busy polling is a classic trap for teams that celebrate median latency wins while ignoring p95/p99 fill quality and schedule completion.
7) Live state machine
CLEAN_IRQ
- Busy polling disabled or irrelevant.
- Stable IRQ-driven behavior.
- Baseline model.
STABLE_HOT_POLL
Trigger:
- high PHR,
- low PWR,
- low PIFR,
- acceptable CSC.
Actions:
- allow latency-sensitive passive tactics,
- trust short-horizon microstructure signals more,
- keep monitoring tail metrics.
POLL_WASTE
Trigger:
- rising PWR,
- CPU cost climbs faster than hit rate.
Actions:
- reduce aggression on marginal micro-signals,
- shrink or disable busy polling on non-critical sockets,
- protect sibling strategy/risk threads.
FALLBACK_MIXED
Trigger:
- PIFR breach,
- RBI widens,
- app timing becomes bimodal.
Actions:
- down-weight fragile queue-timing features,
- smooth child-order pacing,
- tighten cancel/replace churn budget,
- switch to more robust completion logic.
SAFE_BATCHED
Trigger:
- sustained high CSC or high BPST,
- tail slippage deteriorates beyond limit.
Actions:
- prioritize stability over raw median latency,
- disable or sharply constrain busy polling,
- revert to known-good IRQ/deferred-IRQ configuration,
- re-enter only after hysteresis thresholds clear.
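A stateless sketch of the classifier behind this state machine. The thresholds are placeholders a desk would calibrate per venue and worker; a production controller would also add the hysteresis mentioned above so SAFE_BATCHED is not re-entered and exited on every tick.

```python
# Placeholder thresholds; calibrate per venue/worker before use.
LIMITS = {"phr": 0.80, "pwr": 0.30, "pifr": 0.05,
          "rbi": 5.0, "csc": 0.50, "bpst": 0.50}

def classify_state(phr, pwr, pifr, rbi, csc, bpst, limits=LIMITS):
    """One-step classification into the section-7 states from current KPIs.
    The most defensive state (SAFE_BATCHED) is checked first so it wins."""
    if csc > limits["csc"] or bpst > limits["bpst"]:
        return "SAFE_BATCHED"
    if pifr > limits["pifr"] or rbi > limits["rbi"]:
        return "FALLBACK_MIXED"
    if pwr > limits["pwr"]:
        return "POLL_WASTE"
    if phr > limits["phr"]:
        return "STABLE_HOT_POLL"
    return "CLEAN_IRQ"  # busy poll not earning its keep: baseline model

print(classify_state(phr=0.9, pwr=0.1, pifr=0.01, rbi=1.5, csc=0.1, bpst=0.1))
```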
8) Controls that usually work
Control A — Use busy polling only on the exact sockets that benefit
Prefer per-socket enablement over box-wide sysctl enablement. Treat busy polling as a scalpel, not a theme.
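A minimal per-socket enablement sketch, with assumptions spelled out: Python's socket module may not export SO_BUSY_POLL, so the Linux UAPI value (46) is used as a fallback, and raising the value above net.core.busy_read typically requires CAP_NET_ADMIN, so the helper reports failure instead of raising.

```python
import socket
import sys

# SO_BUSY_POLL is 46 in the Linux UAPI headers; Python may not export it.
SO_BUSY_POLL = getattr(socket, "SO_BUSY_POLL", 46)

def enable_busy_poll(sock, usecs):
    """Try to enable busy polling on this socket only; return True on success.

    Setting a value above net.core.busy_read generally needs CAP_NET_ADMIN,
    so EPERM (and non-Linux platforms) are reported as False, not raised.
    """
    if not sys.platform.startswith("linux"):
        return False
    try:
        sock.setsockopt(socket.SOL_SOCKET, SO_BUSY_POLL, usecs)
        return True
    except OSError:
        return False

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    print("busy poll enabled:", enable_busy_poll(s, 50))  # 50 us spin window
    s.close()
```

This is the scalpel version: one market-data socket opts in, and every other socket on the box keeps normal IRQ-driven behavior.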
Control B — Keep worker/NAPI affinity disciplined
If you want stable busy-poll behavior, pair:
- a dedicated worker thread,
- a stable CPU placement policy,
- and a clean NAPI/queue ownership story.
Control C — Tune for tail, not median
A setting that improves p50 ingress latency but worsens q95 markout or completion deficit is a losing setting.
Control D — Separate usecs from budget
Time window and packet budget solve different problems. A higher polling budget can help under sustained load, but it can also deepen CPU monopolization if not bounded carefully.
Control E — Treat gro_flush_timeout / deferred IRQ knobs as part of the same system
Kernel docs explicitly describe tradeoffs here:
- too large → great batching, bad unloaded latency,
- too small → IRQ interference breaks user busy-poll flow.
Do not tune busy poll in isolation from deferred-IRQ behavior.
Control F — Add CPU-protection guardrails
If busy poll is on, hard-monitor:
- sibling-thread starvation,
- run-queue depth,
- involuntary context switches,
- and strategy-loop deadline misses.
Control G — Canary by symbol/venue/time bucket
Do not trust one synthetic benchmark or one quiet symbol. Busy-poll gain is regime-sensitive.
9) Validation protocol
Offline reconstruction
Build dual timelines:
- NIC / NAPI-adjacent ingress time where possible,
- app-visible receive / decision / dispatch time.
Label episodes into:
- STABLE_HOT_POLL,
- POLL_WASTE,
- FALLBACK_MIXED.
Estimate branch-specific slippage and completion outcomes.
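The offline labeling step can be as simple as thresholding per-episode KPI summaries. The cutoffs below are placeholders for calibration against the reconstructed timelines.

```python
def label_episode(phr, pwr, pifr, pifr_max=0.05, pwr_max=0.30):
    """Map one episode's KPI summary to the offline labels above.
    Thresholds are placeholders to be calibrated from reconstruction."""
    if pifr > pifr_max:
        return "FALLBACK_MIXED"
    if pwr > pwr_max:
        return "POLL_WASTE"
    return "STABLE_HOT_POLL"

# (phr, pwr, pifr) per episode; values illustrative.
episodes = [(0.90, 0.10, 0.01), (0.70, 0.50, 0.02), (0.60, 0.20, 0.12)]
print([label_episode(*e) for e in episodes])
```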
Shadow mode
Keep production settings unchanged, but compute what the controller would have done under busy-poll-aware state classification.
Canary rollout
Start with a narrow traffic slice:
- specific worker,
- specific venue,
- specific symbol class,
- bounded hours.
Promotion gates:
- improved or neutral q95/q99 slippage,
- stable completion rate,
- lower BPST,
- acceptable CPU overhead,
- no CSC breach.
Rollback triggers:
- rising PIFR,
- widening RBI,
- strategy-loop starvation,
- completion deficit growth.
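The gates and triggers above can be wired into a single decision function. The metric keys and sign conventions are illustrative assumptions (*_trend > 0 means worsening, *_delta is canary minus control); the structural point is that rollback triggers are evaluated before promotion gates.

```python
def canary_decision(m):
    """Apply the rollback triggers first, then the promotion gates.
    Keys and sign conventions are illustrative, not a fixed schema."""
    rollback = (m["pifr_trend"] > 0 or m["rbi_trend"] > 0
                or m["strategy_loop_misses"] > 0
                or m["completion_deficit_trend"] > 0)
    if rollback:
        return "ROLLBACK"
    promote = (m["q99_slip_delta_bps"] <= 0        # improved or neutral tail
               and m["completion_rate_stable"]
               and m["bpst_delta_bps"] < 0         # lower slippage tax
               and m["cpu_overhead_ok"]
               and m["csc_ok"])                    # no CPU-steal coupling breach
    return "PROMOTE" if promote else "HOLD"

healthy = {"pifr_trend": 0, "rbi_trend": 0, "strategy_loop_misses": 0,
           "completion_deficit_trend": 0, "q99_slip_delta_bps": -0.1,
           "completion_rate_stable": True, "bpst_delta_bps": -0.05,
           "cpu_overhead_ok": True, "csc_ok": True}
print(canary_decision(healthy))
```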
10) Failure patterns to avoid
Turning on global busy_read / busy_poll because a benchmark looked nice
This often converts one local win into a system-wide resource fight.
Assuming busy polling is a pure receive-path improvement
It is also a CPU scheduling policy in disguise.
Ignoring same-NAPI-ID discipline in epoll worker design
Then the supposed fast path is structurally inconsistent.
Letting latency mode toggle silently with load
Bimodal timing must be modeled explicitly, not averaged away.
Using only median latency dashboards
Slippage lives in the tails and in branch switches.
Failing to separate polling gain from fallback pain
The desk needs both branch benefits and branch failure costs.
11) 10-day implementation plan
Days 1–2
Instrument busy-poll settings, worker CPU state, NAPI/IRQ timing, and app receive distributions.
Days 3–4
Build PHR / PWR / PIFR / RBI / CSC dashboards.
Days 5–6
Estimate branch model: HOT_POLL / CPU_STEAL / FALLBACK_IRQ.
Days 7–8
Run shadow controller with state-aware signal down-weighting and pacing changes.
Day 9
Canary on a narrow venue-symbol slice with hard rollback triggers.
Day 10
Publish operating ranges for busy_poll, busy_read, budget, and IRQ-defer companions; schedule weekly recalibration.
Bottom line
Busy polling is not just a low-latency trick. In an execution stack, it is a timing-regime selector.
When it works, it pulls packets closer to the strategy. When it half-works, it steals CPU. When it breaks cadence, it creates a latency cliff and tail slippage.
So the right question is not:
“Did busy polling reduce receive latency?”
The right question is:
“Did busy polling improve execution quality after charging for CPU steal, fallback timing, and tail cleanup?”
That is the standard that saves basis points instead of winning benchmarks.
References
Linux Kernel Documentation: NAPI
https://docs.kernel.org/networking/napi.html
Linux Kernel Documentation: /proc/sys/net/core (busy_read, busy_poll, NAPI budgets)
https://docs.kernel.org/admin-guide/sysctl/net.html
Amazon EC2: Improve network latency for Linux based EC2 instances
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ena-improve-network-latency-linux.html