Linux NAPI Busy-Poll Playbook (Lowering RX Wakeup Latency without Full Kernel Bypass)
Date: 2026-03-18
Category: knowledge
Why this matters
If your service is latency-sensitive, a lot of tail pain is simply wakeup timing:
- packet arrives,
- interrupt/softirq scheduling happens,
- userspace wakes later than ideal,
- request latency jumps even when average CPU looks fine.
NAPI busy polling is a Linux mechanism that lets userspace spend CPU cycles to pull packets earlier, often shrinking p99/p999 network-path latency.
It is not a free win:
- CPU usage rises,
- power usage rises,
- poor tuning can hurt system fairness.
Treat it as a controlled tail-latency lever, not a default-on knob.
1) Mental model: trading idle efficiency for wakeup speed
Default path (interrupt-driven):
- NIC raises interrupt.
- NAPI poll runs in softirq context.
- Socket becomes ready.
- Userspace gets scheduled and reads.
Busy-poll path:
- userspace (via the recv/poll/epoll path) spins for a bounded time polling the NIC queue directly before falling back to IRQ-driven wakeups.
Practical effect:
- lower waiting variance in fast paths,
- higher CPU burn to buy that lower variance.
2) Core controls you should know
Per-socket: SO_BUSY_POLL
- Sets approximate busy-poll time in microseconds on blocking receive.
- From socket(7): increasing the value requires CAP_NET_ADMIN.
- Available since Linux 3.11.
Global defaults: net.core.busy_read and net.core.busy_poll
net.core.busy_read
- Default busy-poll timeout for socket reads.
- Kernel docs recommend 50us if enabling globally.
- Default is 0 (off).
net.core.busy_poll
- Busy-poll timeout for poll/select paths on sockets with SO_BUSY_POLL.
- Kernel docs note rough guidance:
- ~50us for several sockets,
- ~100us for several hundred sockets,
- if much larger fan-in, epoll-based designs are usually better.
- Default is 0 (off).
NAPI/epoll path knobs (advanced)
From kernel NAPI docs:
- SO_PREFER_BUSY_POLL / epoll prefer_busy_poll
- SO_BUSY_POLL_BUDGET / epoll busy_poll_budget
- per-NAPI software coalescing knobs: gro_flush_timeout and napi_defer_hard_irqs
- optional per-NAPI irq-suspend-timeout for IRQ suspension mode
These are powerful but easier to misconfigure; treat as stage-2 optimization.
3) Preconditions and quick checks
Kernel capability check
Busy polling requires CONFIG_NET_RX_BUSY_POLL.
# one of these, depending on distro
zgrep CONFIG_NET_RX_BUSY_POLL /proc/config.gz
grep CONFIG_NET_RX_BUSY_POLL /boot/config-$(uname -r)
Current sysctl state
sysctl net.core.busy_read
sysctl net.core.busy_poll
NIC/queue awareness (for advanced epoll busy poll)
If doing per-thread queue affinity patterns, inspect queue/channel layout:
ethtool -l <iface>
ethtool -x <iface> # if RSS indirection is supported
4) Rollout strategy (safe order)
Phase A (recommended start): per-socket only
Keep global sysctls at 0 initially. Enable busy poll only on target sockets in your app.
Why:
- limits blast radius,
- easy A/B on one service,
- avoids accidental CPU burn by unrelated workloads.
Phase B: small global baseline (if justified)
If many sockets/components need it and behavior is stable:
sudo sysctl -w net.core.busy_read=50
sudo sysctl -w net.core.busy_poll=50
Then retest. Increase slowly only if tail improvements justify CPU/power cost.
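If the Phase B values stick, persist them with a sysctl.d drop-in rather than ad-hoc sysctl -w (the file name is illustrative):

```
# /etc/sysctl.d/90-busy-poll.conf (file name is illustrative)
net.core.busy_read = 50
net.core.busy_poll = 50
```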
Phase C: epoll/NAPI advanced tuning
Only after evidence from phases A/B:
- tune busy_poll_budget,
- consider SO_PREFER_BUSY_POLL,
- evaluate the gro_flush_timeout + napi_defer_hard_irqs tradeoff,
- optionally evaluate irq-suspend-timeout for high-RPS dedicated workers.
5) Observability: what to watch during canary
Pair latency outcomes with CPU and packet-path health.
Latency outcomes
- request p50/p95/p99/p999
- socket recv-to-app handling delay (if instrumented)
- tail jitter under bursty traffic
CPU/power cost
- per-core utilization (especially worker cores)
- ksoftirqd behavior
- package power/thermal if relevant to host class
Example spot checks:
mpstat -P ALL 1
cat /proc/softirqs
Packet backlog/drop symptoms
cat /proc/net/softnet_stat
netstat -s
Business guardrails
- throughput unchanged or acceptable,
- no starvation of non-target workloads,
- no unacceptable power budget regression.
6) Common footguns
Global enable too early
- Turns a targeted latency tweak into a machine-wide CPU tax.
Timeouts set too high
- Can reduce fairness and waste cycles for little extra tail benefit.
Ignoring queue/thread affinity
- Busy polling works best when thread, socket flow locality, and RX queue mapping are coherent.
Mixing unrelated NAPI IDs in one epoll worker (advanced path)
- Kernel NAPI docs highlight epoll busy-poll assumptions around shared NAPI context.
Measuring averages only
- Busy polling is mainly a tail-latency tool; p50 may barely move.
7) Minimal canary recipe
- Pick one latency-critical service and one traffic slice.
- Baseline 24h (or equivalent peak+off-peak windows).
- Enable per-socket SO_BUSY_POLL with a small value (e.g., 25–50us).
- Compare:
- p99/p999 improvement,
- CPU/core pressure,
- backlog/drop counters,
- power budget impact.
- If the benefit is real and stable, test a 50us -> 75us ladder.
- Promote only if tail gains are persistent and infra cost is acceptable.
8) When to use this vs kernel-bypass stacks
Use NAPI busy poll first when:
- you need incremental latency reduction,
- you want to keep kernel networking semantics/tooling,
- operational simplicity matters.
Move to AF_XDP/DPDK-style paths when:
- latency targets are beyond what tuned kernel path can deliver,
- CPU/power tradeoffs are still acceptable,
- team can operate more specialized datapath complexity.
Busy polling is often the “middle ground” before full bypass.
References
- Linux kernel sysctl networking docs (busy_read, busy_poll):
  https://www.kernel.org/doc/html/latest/admin-guide/sysctl/net.html
- Linux kernel NAPI docs (busy polling, epoll path, IRQ mitigation/suspension):
  https://www.kernel.org/doc/html/latest/networking/napi.html
- socket(7) (SO_BUSY_POLL behavior/capability notes):
  https://man7.org/linux/man-pages/man7/socket.7.html