Linux NAPI Busy-Poll Playbook (Lowering RX Wakeup Latency without Full Kernel Bypass)

2026-03-18 · software


Why this matters

If your service is latency-sensitive, a lot of tail pain is simply wakeup timing: the gap between a packet landing in the NIC and your thread actually running.

NAPI busy polling is a Linux mechanism that lets userspace spend CPU cycles to pull packets earlier, often shrinking p99/p999 network-path latency.

It is not a free win: the polling core burns CPU (and power) while it waits, and that cost is paid whether or not traffic arrives.

Treat it as a controlled tail-latency lever, not a default-on knob.


1) Mental model: trading idle efficiency for wakeup speed

Default path (interrupt-driven):

  1. NIC raises interrupt.
  2. NAPI poll runs in softirq context.
  3. Socket becomes ready.
  4. Userspace gets scheduled and reads.

Busy-poll path:

  1. Userspace blocks in read/poll/epoll_wait with busy polling enabled.
  2. Instead of sleeping, the kernel repeatedly invokes the NAPI poll routine for the socket's RX queue.
  3. Packets are picked up as soon as they appear, skipping the interrupt, softirq, and scheduler-wakeup steps.

Practical effect: interrupt and wakeup latency disappear from the tail, but the polling core runs hot while it waits.


2) Core controls you should know

Per-socket: SO_BUSY_POLL

Set via setsockopt(SOL_SOCKET, SO_BUSY_POLL, &usec): the number of microseconds the kernel will busy poll on a blocking receive when no data is ready.

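A minimal sketch of the per-socket path in Python. Note the assumptions: Python's socket module does not export SO_BUSY_POLL, so the numeric value 46 from <asm-generic/socket.h> is hard-coded, and setting the option may require CAP_NET_ADMIN, so the helper degrades gracefully.

```python
import socket

# SO_BUSY_POLL is not exported by Python's socket module; 46 is its value
# in <asm-generic/socket.h> on common architectures (assumption).
SO_BUSY_POLL = getattr(socket, "SO_BUSY_POLL", 46)

def enable_busy_poll(sock, usec):
    """Try to enable busy polling on `sock` for `usec` microseconds.

    Returns the effective value read back via getsockopt, or None if the
    kernel refuses (setting SO_BUSY_POLL may need CAP_NET_ADMIN, and old
    or non-Linux kernels may not support it at all)."""
    try:
        sock.setsockopt(socket.SOL_SOCKET, SO_BUSY_POLL, usec)
    except OSError:
        return None
    return sock.getsockopt(socket.SOL_SOCKET, SO_BUSY_POLL)

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    print("busy-poll usec:", enable_busy_poll(s, 50))
    s.close()
```

Because the setting is per socket, this is the lever Phase A below relies on: only the sockets you touch pay the CPU cost.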
Global defaults: net.core.busy_read and net.core.busy_poll

net.core.busy_read: default busy-poll time in microseconds for blocking read/recv on all sockets (0 disables; SO_BUSY_POLL overrides it per socket).

net.core.busy_poll: busy-poll time in microseconds applied in poll() and select() (0 disables).

NAPI/epoll path knobs (advanced)

From the kernel NAPI docs (Documentation/networking/napi.rst):

  • SO_INCOMING_NAPI_ID: read back which NAPI (RX queue) context a socket's traffic arrives on, so sockets can be grouped per worker.
  • SO_PREFER_BUSY_POLL: ask the kernel to stay in busy-poll mode rather than re-arming interrupts between polls.
  • Per-device IRQ deferral: napi_defer_hard_irqs and gro_flush_timeout keep hard interrupts off while userspace is polling.
  • epoll busy polling: newer kernels accept per-epoll parameters via the EPIOCSPARAMS ioctl.

These are powerful but easier to misconfigure; treat them as a stage-2 optimization.
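As a concrete taste of the advanced path, a hedged sketch of reading a socket's NAPI ID. Assumptions: SO_INCOMING_NAPI_ID is 56 per <asm-generic/socket.h> (Python does not export it), and the kernel returns 0 until at least one packet has been received on a busy-poll-capable driver.

```python
import socket

# SO_INCOMING_NAPI_ID (56 in <asm-generic/socket.h>; not exported by
# Python's socket module) reports the NAPI context that last delivered
# data to this socket.
SO_INCOMING_NAPI_ID = getattr(socket, "SO_INCOMING_NAPI_ID", 56)

def napi_id(sock):
    """Return the NAPI ID for `sock`, or 0 if no packet has been seen yet
    (or the kernel lacks CONFIG_NET_RX_BUSY_POLL)."""
    try:
        return sock.getsockopt(socket.SOL_SOCKET, SO_INCOMING_NAPI_ID)
    except OSError:
        return 0
```

The intended use: after accepting connections, group sockets that share a NAPI ID onto the same epoll worker, so each worker busy-polls exactly one RX queue (see footgun 4 below).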


3) Preconditions and quick checks

Kernel capability check

Busy polling requires CONFIG_NET_RX_BUSY_POLL.

# one of these, depending on distro
zgrep CONFIG_NET_RX_BUSY_POLL /proc/config.gz
grep CONFIG_NET_RX_BUSY_POLL /boot/config-$(uname -r)

Current sysctl state

sysctl net.core.busy_read
sysctl net.core.busy_poll

NIC/queue awareness (for advanced epoll busy poll)

If doing per-thread queue affinity patterns, inspect queue/channel layout:

ethtool -l <iface>
ethtool -x <iface>   # if RSS indirection is supported
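When ethtool is unavailable (or you want this inside a health check), sysfs exposes the active queues as rx-*/tx-* directories. A small sketch; the sysfs_root parameter is an assumption added here to make the helper testable.

```python
from pathlib import Path

def rx_queue_count(iface, sysfs_root="/sys/class/net"):
    """Count rx-* queue directories for `iface` under sysfs.

    For many drivers this matches the active RX (or combined) channel
    count that `ethtool -l` reports."""
    qdir = Path(sysfs_root) / iface / "queues"
    return sum(1 for p in qdir.iterdir() if p.name.startswith("rx-"))
```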

4) Rollout strategy (safe order)

Phase A (recommended start): per-socket only

Keep global sysctls at 0 initially. Enable busy poll only on target sockets in your app.

Why: the blast radius stays inside one application, rollback is a code or flag change rather than a host-wide sysctl, and unrelated sockets on the box keep interrupt-driven behavior.
Phase B: small global baseline (if justified)

If many sockets/components need it and behavior is stable:

sudo sysctl -w net.core.busy_read=50
sudo sysctl -w net.core.busy_poll=50

Then retest. Increase slowly only if tail improvements justify CPU/power cost.

Phase C: epoll/NAPI advanced tuning

Only move here after Phases A/B have produced evidence. Candidates: SO_PREFER_BUSY_POLL, per-device IRQ deferral (napi_defer_hard_irqs, gro_flush_timeout), and grouping sockets by NAPI ID per epoll worker.
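The per-device IRQ deferral knobs live in sysfs. A minimal sketch of setting them; the sysfs_root parameter is an assumption added for testability, and real writes require root.

```python
from pathlib import Path

def set_irq_deferral(iface, defer_irqs, flush_timeout_ns,
                     sysfs_root="/sys/class/net"):
    """Write the per-device IRQ deferral knobs from the kernel NAPI docs:
    napi_defer_hard_irqs (a count of deferrals) and gro_flush_timeout
    (nanoseconds). Needs root on a real system."""
    dev = Path(sysfs_root) / iface
    (dev / "napi_defer_hard_irqs").write_text(f"{defer_irqs}\n")
    (dev / "gro_flush_timeout").write_text(f"{flush_timeout_ns}\n")
```

Both knobs act together: interrupts stay masked for up to defer_irqs NAPI cycles as long as something (e.g., your busy-polling thread) keeps servicing the queue within gro_flush_timeout.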


5) Observability: what to watch during canary

Pair latency outcomes with CPU and packet-path health.

Latency outcomes: p50 vs p99/p999 on the request path; busy polling should move the tail, not the median.

CPU/power cost: utilization on the polling cores and NET_RX softirq time; expect polling cores to look pegged while they wait.

Example spot checks:

mpstat -P ALL 1
cat /proc/softirqs

Packet backlog/drop symptoms

cat /proc/net/softnet_stat
netstat -s
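The softnet_stat output is one hex row per CPU and easy to misread. A small parser sketch, assuming only that the first three columns (packets processed, dropped, and time_squeeze) are stable across kernel versions; later columns vary.

```python
def softnet_totals(text):
    """Sum the first three columns of /proc/net/softnet_stat across CPUs:
    packets processed, packets dropped (backlog full), and time_squeeze
    (the poll loop ran out of budget or time). Fields are hexadecimal."""
    processed = dropped = squeezed = 0
    for line in text.splitlines():
        cols = line.split()
        if len(cols) >= 3:
            processed += int(cols[0], 16)
            dropped += int(cols[1], 16)
            squeezed += int(cols[2], 16)
    return processed, dropped, squeezed

if __name__ == "__main__":
    try:
        with open("/proc/net/softnet_stat") as f:
            print(softnet_totals(f.read()))
    except FileNotFoundError:
        pass  # non-Linux host or procfs not mounted
```

During a canary, watch the deltas: rising dropped or time_squeeze counts while busy polling is on means the packet path is losing, not winning.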

Business guardrails: request error rate, throughput, and cost per request should hold steady; abort the canary on any regression there.


6) Common footguns

  1. Global enable too early

    • Turns a targeted latency tweak into a machine-wide CPU tax.
  2. Timeouts set too high

    • Can reduce fairness and waste cycles for little extra tail benefit.
  3. Ignoring queue/thread affinity

    • Busy polling works best when thread, socket flow locality, and RX queue mapping are coherent.
  4. Mixing unrelated NAPI IDs in one epoll worker (advanced path)

    • Kernel NAPI docs highlight epoll busy-poll assumptions around shared NAPI context.
  5. Measuring averages only

    • Busy polling is mainly a tail-latency tool; p50 may barely move.

7) Minimal canary recipe

  1. Pick one latency-critical service and one traffic slice.
  2. Baseline 24h (or equivalent peak+off-peak windows).
  3. Enable per-socket SO_BUSY_POLL with a small value (e.g., 25–50us).
  4. Compare:
    • p99/p999 improvement,
    • CPU/core pressure,
    • backlog/drop counters,
    • power budget impact.
  5. If the benefit is real and stable, step the timeout up a 50us -> 75us ladder and re-measure at each rung.
  6. Promote only if tail gains are persistent and infra cost is acceptable.
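The comparison in step 4 hinges on tail percentiles, which averages hide. A self-contained nearest-rank sketch (tail_report and its percentile choices are illustrative, not a prescribed tool):

```python
import math

def pctile(samples, p):
    """Nearest-rank percentile (0 < p <= 100) of a list of latency samples."""
    s = sorted(samples)
    k = max(1, math.ceil(p / 100 * len(s)))  # 1-based rank
    return s[k - 1]

def tail_report(baseline_us, canary_us):
    """Compare the step-4 percentiles between baseline and canary runs.

    Returns {percentile: (baseline_value, canary_value)} so p50 can be
    checked for stability while p99/p99.9 are checked for improvement."""
    return {f"p{p}": (pctile(baseline_us, p), pctile(canary_us, p))
            for p in (50, 99, 99.9)}
```

Feeding both windows through the same function avoids the classic mistake of comparing a percentile from one tool against a mean from another.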

8) When to use this vs kernel-bypass stacks

Use NAPI busy poll first when:

  • you need tens-of-microseconds tail improvement, not an order of magnitude,
  • you want to keep the normal socket API, the kernel TCP/IP stack, and your existing tooling,
  • only a few latency-critical services on the host need it.

Move to AF_XDP/DPDK-style paths when:

  • you need consistent single-digit-microsecond latency or very high packet rates,
  • you can afford dedicated cores, custom packet processing, and the operational burden of bypassing the kernel stack.
Busy polling is often the “middle ground” before full bypass.
