Linux NAPI Busy-Poll Playbook (Lowering RX Wakeup Latency without Full Kernel Bypass)
Date: 2026-03-18
Category: knowledge
Why this matters
If your service is latency-sensitive, a lot of tail pain is simply wakeup timing:
- packet arrives,
- interrupt/softirq scheduling happens,
- userspace wakes later than ideal,
- request latency jumps even when average CPU looks fine.
NAPI busy polling is a Linux mechanism that lets userspace spend CPU cycles to pull packets earlier, often shrinking p99/p999 network-path latency.
It is not a free win:
- CPU usage rises,
- power usage rises,
- poor tuning can hurt system fairness.
Treat it as a controlled tail-latency lever, not a default-on knob.
1) Mental model: trading idle efficiency for wakeup speed
Default path (interrupt-driven):
- NIC raises interrupt.
- NAPI poll runs in softirq context.
- Socket becomes ready.
- Userspace gets scheduled and reads.
Busy-poll path:
- userspace (via the recv/poll/epoll path) spins for a bounded time polling the NIC queue directly before falling back to IRQ-driven wakeups.
Practical effect:
- lower waiting variance in fast paths,
- higher CPU burn to buy that lower variance.
2) Core controls you should know
Per-socket: SO_BUSY_POLL
- Sets approximate busy-poll time in microseconds on blocking receive.
- From socket(7): increasing the value requires CAP_NET_ADMIN.
- Available since Linux 3.11.
Global defaults: net.core.busy_read and net.core.busy_poll
net.core.busy_read
- Default busy-poll timeout for socket reads.
- Kernel docs recommend 50us if enabling globally.
- Default is 0 (off).
net.core.busy_poll
- Busy-poll timeout for poll/select paths on sockets with SO_BUSY_POLL.
- Kernel docs note rough guidance:
- ~50us for several sockets,
- ~100us for several hundred sockets,
- if much larger fan-in, epoll-based designs are usually better.
- Default is 0 (off).
NAPI/epoll path knobs (advanced)
From kernel NAPI docs:
- SO_PREFER_BUSY_POLL / epoll prefer_busy_poll
- SO_BUSY_POLL_BUDGET / epoll busy_poll_budget
- per-NAPI software coalescing knobs: gro_flush_timeout and napi_defer_hard_irqs
- optional per-NAPI irq-suspend-timeout for IRQ suspension mode
These are powerful but easier to misconfigure; treat as stage-2 optimization.
3) Preconditions and quick checks
Kernel capability check
Busy polling requires CONFIG_NET_RX_BUSY_POLL.
# one of these, depending on distro
zgrep CONFIG_NET_RX_BUSY_POLL /proc/config.gz
grep CONFIG_NET_RX_BUSY_POLL /boot/config-$(uname -r)
Current sysctl state
sysctl net.core.busy_read
sysctl net.core.busy_poll
NIC/queue awareness (for advanced epoll busy poll)
If doing per-thread queue affinity patterns, inspect queue/channel layout:
ethtool -l <iface>
ethtool -x <iface> # if RSS indirection is supported
4) Rollout strategy (safe order)
Phase A (recommended start): per-socket only
Keep global sysctls at 0 initially. Enable busy poll only on target sockets in your app.
Why:
- limits blast radius,
- easy A/B on one service,
- avoids accidental CPU burn by unrelated workloads.
Phase B: small global baseline (if justified)
If many sockets/components need it and behavior is stable:
sudo sysctl -w net.core.busy_read=50
sudo sysctl -w net.core.busy_poll=50
Then retest. Increase slowly only if tail improvements justify CPU/power cost.
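If the Phase B values stick, persist them with a sysctl.d drop-in rather than ad-hoc sysctl -w (the file name is illustrative):

```
# /etc/sysctl.d/90-busy-poll.conf (file name is illustrative)
net.core.busy_read = 50
net.core.busy_poll = 50
```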
Phase C: epoll/NAPI advanced tuning
Only after evidence from phases A/B:
- tune busy_poll_budget,
- consider SO_PREFER_BUSY_POLL,
- evaluate the gro_flush_timeout + napi_defer_hard_irqs tradeoff,
- optionally evaluate irq-suspend-timeout for high-RPS dedicated workers.
5) Observability: what to watch during canary
Pair latency outcomes with CPU and packet-path health.
Latency outcomes
- request p50/p95/p99/p999
- socket recv-to-app handling delay (if instrumented)
- tail jitter under bursty traffic
CPU/power cost
- per-core utilization (especially worker cores)
- ksoftirqd behavior
- package power/thermal if relevant to host class
Example spot checks:
mpstat -P ALL 1
cat /proc/softirqs
Packet backlog/drop symptoms
cat /proc/net/softnet_stat
netstat -s
Business guardrails
- throughput unchanged or acceptable,
- no starvation of non-target workloads,
- no unacceptable power budget regression.
6) Common footguns
Global enable too early
- Turns a targeted latency tweak into a machine-wide CPU tax.
Timeouts set too high
- Can reduce fairness and waste cycles for little extra tail benefit.
Ignoring queue/thread affinity
- Busy polling works best when thread, socket flow locality, and RX queue mapping are coherent.
Mixing unrelated NAPI IDs in one epoll worker (advanced path)
- Kernel NAPI docs highlight epoll busy-poll assumptions around shared NAPI context.
Measuring averages only
- Busy polling is mainly a tail-latency tool; p50 may barely move.
7) Minimal canary recipe
- Pick one latency-critical service and one traffic slice.
- Baseline 24h (or equivalent peak+off-peak windows).
- Enable per-socket SO_BUSY_POLL with a small value (e.g., 25–50us).
- Compare:
- p99/p999 improvement,
- CPU/core pressure,
- backlog/drop counters,
- power budget impact.
- If the benefit is real and stable, test a 50us -> 75us ladder.
- Promote only if tail gains are persistent and infra cost is acceptable.
8) When to use this vs kernel-bypass stacks
Use NAPI busy poll first when:
- you need incremental latency reduction,
- you want to keep kernel networking semantics/tooling,
- operational simplicity matters.
Move to AF_XDP/DPDK-style paths when:
- latency targets are beyond what tuned kernel path can deliver,
- CPU/power tradeoffs are still acceptable,
- team can operate more specialized datapath complexity.
Busy polling is often the “middle ground” before full bypass.
References
- Linux kernel sysctl networking docs (busy_read, busy_poll):
  https://www.kernel.org/doc/html/latest/admin-guide/sysctl/net.html
- Linux kernel NAPI docs (busy polling, epoll path, IRQ mitigation/suspension):
  https://www.kernel.org/doc/html/latest/networking/napi.html
- socket(7) (SO_BUSY_POLL behavior/capability notes):
  https://man7.org/linux/man-pages/man7/socket.7.html