Linux io_uring Networking (Multishot + Provided Buffers + Zero-Copy Send) — Practical Playbook
Date: 2026-03-24
Category: knowledge
Scope: Production-oriented guidance for building low-latency network servers with io_uring without accidental reordering, buffer-lifetime bugs, or CQ overflow surprises.
1) Why this matters
Most teams adopt io_uring for one reason: fewer syscalls and better tail latency under load.
For networking specifically, the big unlocks are:
- Multishot accept/recv: fewer repost loops.
- Provided buffer rings: explicit buffer ownership and better in-flight scaling.
- Send bundles + send buffer select: more efficient TX with less kernel round-trip overhead.
- Zero-copy send (IORING_OP_SEND_ZC): reduced copy cost when payloads are large enough and buffer lifetime is managed correctly.
But these features are easy to misuse. The dangerous pattern is chasing throughput and accidentally breaking ordering or reuse safety.
2) Non-negotiable mental model
io_uring is asynchronous, and completion order is not guaranteed to match submission order.
For stream sockets, treat this as an operational rule:
- Never overlap multiple sends on the same socket unless your design explicitly preserves order (e.g., provided-buffer send pipeline).
- Never overlap multiple receives on the same socket unless using a safe multishot design with well-defined buffer ownership.
This is not theory; io_uring(7) explicitly warns that background poll arming and internal behavior can reorder execution/completion.
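A minimal sketch of the one-send-in-flight rule for a stream socket. It is plain C bookkeeping with no liburing calls, and `conn_tx`, `tx_push`, `tx_next`, and `tx_complete` are hypothetical names. The intended flow is that the CQE handler calls `tx_complete()` and then `tx_next()`, and only prepares a new send SQE when `tx_next()` returns a chunk, so two sends on the same socket never race each other.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TX_QUEUE_CAP 64  /* hypothetical per-connection queue depth */

struct tx_chunk {
    const void *buf;
    size_t len;
};

/* Hypothetical per-connection TX state: FIFO of pending chunks, at most one send in flight. */
struct conn_tx {
    struct tx_chunk q[TX_QUEUE_CAP];
    size_t head, count;
    bool in_flight;  /* true while a send SQE for this socket is outstanding */
};

/* Append a chunk; returns false when the queue is full (apply backpressure). */
static bool tx_push(struct conn_tx *c, const void *buf, size_t len) {
    if (c->count == TX_QUEUE_CAP) return false;
    c->q[(c->head + c->count) % TX_QUEUE_CAP] = (struct tx_chunk){buf, len};
    c->count++;
    return true;
}

/* Returns the chunk to submit next, or NULL if a send is already in flight
 * or nothing is queued. Marks the socket busy when it hands out a chunk. */
static const struct tx_chunk *tx_next(struct conn_tx *c) {
    if (c->in_flight || c->count == 0) return NULL;
    c->in_flight = true;
    return &c->q[c->head];
}

/* Call from the send CQE handler with cqe->res. */
static void tx_complete(struct conn_tx *c, int res) {
    assert(c->in_flight && c->count > 0);
    struct tx_chunk *h = &c->q[c->head];
    c->in_flight = false;
    if (res < 0) return;              /* error: chunk stays queued, caller decides */
    if ((size_t)res < h->len) {       /* short send: resend the tail first */
        h->buf = (const char *)h->buf + res;
        h->len -= (size_t)res;
        return;
    }
    c->head = (c->head + 1) % TX_QUEUE_CAP;  /* chunk fully sent */
    c->count--;
}
```

Keeping a short write at the head of the queue preserves byte order even when the kernel sends only part of a chunk.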
3) Feature baseline (kernel-aware planning)
Use this as a minimum compatibility checklist:
- Provided buffer ring registration (io_uring_register_buf_ring): available since 5.19.
- Multishot accept: available since 5.19.
- Multishot recv: available since 6.0.
- Recv/send bundles + send buffer selection feature flag (IORING_FEAT_SEND_BUF_SELECT): available since 6.10.
- Incremental provided-buffer consumption (IOU_PBUF_RING_INC): available since 6.12.
Operationally: treat kernel version as part of your runtime feature flags, not just an infra detail.
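One way to turn the checklist above into runtime feature flags is a small table keyed by the listed minimum versions. This is a sketch: `feature_mask_for_kernel`, `parse_release`, and the `FEAT_*` bits are hypothetical names. Because distribution kernels backport io_uring features, production code should confirm each feature by probing (for example liburing's `io_uring_get_probe_ring()` with `io_uring_opcode_supported()` for opcodes, and the `features` field filled in by `io_uring_queue_init_params()`) rather than trusting the version string alone.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical feature bits for this service's own gating logic. */
enum {
    FEAT_BUF_RING     = 1u << 0, /* io_uring_register_buf_ring: 5.19 */
    FEAT_MSHOT_ACCEPT = 1u << 1, /* multishot accept: 5.19 */
    FEAT_MSHOT_RECV   = 1u << 2, /* multishot recv: 6.0 */
    FEAT_BUNDLES      = 1u << 3, /* recv/send bundles + send buffer select: 6.10 */
    FEAT_PBUF_INC     = 1u << 4, /* IOU_PBUF_RING_INC: 6.12 */
};

struct feat_min {
    unsigned bit, major, minor;
};

static const struct feat_min FEATURE_TABLE[] = {
    {FEAT_BUF_RING, 5, 19}, {FEAT_MSHOT_ACCEPT, 5, 19}, {FEAT_MSHOT_RECV, 6, 0},
    {FEAT_BUNDLES, 6, 10},  {FEAT_PBUF_INC, 6, 12},
};

/* Parse release strings such as "6.8.0-45-generic" (e.g. from uname(2)). */
static int parse_release(const char *rel, unsigned *major, unsigned *minor) {
    return sscanf(rel, "%u.%u", major, minor) == 2 ? 0 : -1;
}

/* Version-based upper bound on what may be enabled; confirm with probing. */
static unsigned feature_mask_for_kernel(unsigned major, unsigned minor) {
    unsigned mask = 0;
    for (size_t i = 0; i < sizeof FEATURE_TABLE / sizeof FEATURE_TABLE[0]; i++) {
        const struct feat_min *f = &FEATURE_TABLE[i];
        if (major > f->major || (major == f->major && minor >= f->minor))
            mask |= f->bit;
    }
    return mask;
}
```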
4) Recommended architecture
4.1 Ring model
Prefer per-core (or per-worker) ring ownership over a giant shared ring. You get:
- less cross-thread coordination,
- clearer ownership for connection/buffer lifetimes,
- simpler backpressure.
4.2 Submission mode
IORING_SETUP_SQPOLL can reduce syscall overhead, but it is not a universal “faster” switch.
Use it when:
- request rate is high and steady,
- you can keep the poll thread busy,
- CPU budget can absorb dedicated polling behavior.
Avoid defaulting to it for bursty/idle-heavy workloads where wake/sleep churn dominates.
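A sketch of opting into SQPOLL only for the steady, high-rate case, assuming the uapi `struct io_uring_params` from `<linux/io_uring.h>`. `make_ring_params` is a hypothetical helper, and the 2000 ms idle value in the usage line is illustrative, not a recommendation. `sq_thread_idle` is the knob behind the wake/sleep churn described above: once the poll thread has been idle for that many milliseconds it sleeps, and the next submission has to wake it (liburing's `io_uring_submit()` handles that wakeup).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <linux/io_uring.h> /* struct io_uring_params, IORING_SETUP_SQPOLL */

/* Hypothetical helper: enable SQPOLL only for steady, high-rate workloads.
 * The returned params are meant for liburing's io_uring_queue_init_params(). */
static struct io_uring_params make_ring_params(bool steady_high_rate, unsigned idle_ms) {
    struct io_uring_params p;
    memset(&p, 0, sizeof p); /* reserved fields must be zero */
    if (steady_high_rate) {
        assert(idle_ms > 0);
        p.flags |= IORING_SETUP_SQPOLL;
        /* Milliseconds the kernel poll thread spins without work before
         * sleeping; after that, the next submit must wake it. */
        p.sq_thread_idle = idle_ms;
    }
    return p;
}
```

Usage would look like `struct io_uring_params p = make_ring_params(true, 2000);` followed by `io_uring_queue_init_params(4096, &ring, &p);`. A longer idle window trades CPU for fewer wakeups, which is why bursty or mostly idle services usually do better without SQPOLL.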
4.3 Buffer ownership
Adopt provided buffers early for RX/TX pipelines. It makes ownership explicit and supports more in-flight operations than naive per-request heap buffers.
5) Safe receive pipeline (practical)
5.1 Multishot recv requirements that are easy to miss
For io_uring_prep_recv_multishot():
- len must be 0,
- IOSQE_BUFFER_SELECT must be set,
- MSG_WAITALL must not be set,
- each CQE must be checked for IORING_CQE_F_MORE; if it is absent, repost a new multishot recv.
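The requirements above can be encoded as a pre-submit check plus a CQE decoder. This sketch models the logic in plain C: the CQE flag values are mirrored from the kernel uapi header so it compiles standalone (real code gets them from `<liburing.h>`), and `mshot_recv_args_ok`, `decode_recv_cqe`, `struct recv_event`, and `CQE_BUFFER_SHIFT` are hypothetical names. A `terminated` event (no `IORING_CQE_F_MORE`) means the multishot request is gone: repost after data or `-ENOBUFS` (once buffers are refilled), but tear the connection down on EOF or other errors.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h> /* MSG_WAITALL */

/* Mirrored from include/uapi/linux/io_uring.h; use <liburing.h> in real code. */
#ifndef IORING_CQE_F_BUFFER
#define IORING_CQE_F_BUFFER (1U << 0)
#endif
#ifndef IORING_CQE_F_MORE
#define IORING_CQE_F_MORE (1U << 1)
#endif
#define CQE_BUFFER_SHIFT 16 /* value of IORING_CQE_BUFFER_SHIFT */

/* Hypothetical pre-submit check for io_uring_prep_recv_multishot() arguments. */
static bool mshot_recv_args_ok(size_t len, int msg_flags, bool buffer_select) {
    return len == 0 && buffer_select && !(msg_flags & MSG_WAITALL);
}

enum recv_action { RECV_DATA, RECV_EOF, RECV_ERROR };

struct recv_event {
    enum recv_action action;
    int bytes;           /* valid for RECV_DATA */
    uint16_t bid;        /* provided-buffer ID, valid when has_buffer */
    bool has_buffer;     /* IORING_CQE_F_BUFFER was set */
    bool terminated;     /* IORING_CQE_F_MORE absent: request is no longer armed */
    bool out_of_buffers; /* res == -ENOBUFS: the buffer group ran dry */
};

static struct recv_event decode_recv_cqe(int32_t res, uint32_t flags) {
    struct recv_event ev = {0};
    ev.has_buffer = (flags & IORING_CQE_F_BUFFER) != 0;
    if (ev.has_buffer)
        ev.bid = (uint16_t)(flags >> CQE_BUFFER_SHIFT);
    ev.terminated = (flags & IORING_CQE_F_MORE) == 0;
    if (res > 0) {
        assert(ev.has_buffer); /* data from a buffer-select recv names its buffer */
        ev.action = RECV_DATA;
        ev.bytes = res;
    } else if (res == 0) {
        ev.action = RECV_EOF;
    } else {
        ev.action = RECV_ERROR;
        ev.out_of_buffers = (res == -ENOBUFS);
    }
    return ev;
}
```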
5.2 Provided buffer ring hygiene
For each buffer group (bgid):
- allocate page-aligned ring memory,
- initialize ring with helper APIs,
- track ring head/tail carefully,
- alert on ring starvation (available buffers near zero).
If a CQE has IORING_CQE_F_BUFFER set, extract the selected buffer ID (cqe->flags >> IORING_CQE_BUFFER_SHIFT) and return/recycle that buffer only after app-level processing is done.
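To make "recycle only after processing" checkable, the app can keep a shadow of which buffer IDs it currently holds. This is a sketch under simplifying assumptions: one buffer group, fixed-size buffers, and no incremental consumption (`IOU_PBUF_RING_INC` lets one buffer serve several completions, which this model does not cover). `struct bgroup` and its functions are hypothetical; the real hand-back to the kernel would go through liburing's `io_uring_buf_ring_add()` and `io_uring_buf_ring_advance()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BGID_NBUFS 8 /* hypothetical buffer count for this group */

/* App-side shadow of one buffer group (bgid). */
struct bgroup {
    bool held_by_app[BGID_NBUFS]; /* true: app owns it, kernel must not pick it */
    unsigned available;           /* buffers currently published to the kernel */
    unsigned low_watermark;       /* starvation alert threshold */
};

static void bgroup_init(struct bgroup *g, unsigned low_watermark) {
    for (unsigned i = 0; i < BGID_NBUFS; i++) g->held_by_app[i] = false;
    g->available = BGID_NBUFS;
    g->low_watermark = low_watermark;
}

/* A CQE carried IORING_CQE_F_BUFFER with this bid: ownership moves to the app.
 * Returns false on a protocol violation (bid out of range or already held). */
static bool bgroup_take(struct bgroup *g, uint16_t bid) {
    if (bid >= BGID_NBUFS || g->held_by_app[bid]) return false;
    assert(g->available > 0);
    g->held_by_app[bid] = true;
    g->available--;
    return true;
}

/* Processing finished: hand the buffer back (buf_ring add + advance in liburing).
 * Returns false on a double recycle, i.e. illegal reuse. */
static bool bgroup_recycle(struct bgroup *g, uint16_t bid) {
    if (bid >= BGID_NBUFS || !g->held_by_app[bid]) return false;
    g->held_by_app[bid] = false;
    g->available++;
    return true;
}

/* Feed this into the "available buffers near zero" alert. */
static bool bgroup_starving(const struct bgroup *g) {
    return g->available <= g->low_watermark;
}
```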
5.3 CQ overflow risk
Multishot designs can flood CQ under burst traffic if consumer pace lags. Budget CQ depth with burst headroom, not average traffic.
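A worked sizing rule for "burst headroom, not average traffic": CQ entries ≥ peak completion rate × worst-case gap between CQ drains × headroom factor. The formula and the `cq_entries_for` helper are assumptions for illustration, not a kernel rule. The result would be passed as `cq_entries` together with `IORING_SETUP_CQSIZE` in `struct io_uring_params`; the kernel rounds it up to a power of two, which the sketch mirrors.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest power of two >= x (x >= 1). */
static uint32_t next_pow2(uint32_t x) {
    uint32_t p = 1;
    while (p < x) p <<= 1;
    return p;
}

/* Hypothetical CQ budget from peak completions/second, the worst-case gap
 * between CQ drains in microseconds, and headroom in percent (200 = 2x). */
static uint32_t cq_entries_for(uint64_t peak_cqes_per_sec, uint64_t max_drain_gap_us,
                               uint32_t headroom_pct) {
    /* CQEs that can land while the consumer is away, rounded up. */
    uint64_t burst = (peak_cqes_per_sec * max_drain_gap_us + 999999) / 1000000;
    uint64_t budget = (burst * headroom_pct + 99) / 100;
    if (budget == 0) budget = 1;
    assert(budget <= UINT32_MAX / 2); /* keep next_pow2 from overflowing */
    return next_pow2((uint32_t)budget);
}
```

For example, 100,000 completions/s with drains at most 2 ms apart gives a 200-CQE burst; doubling it for headroom gives 400, which rounds up to 512 entries.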
6) Safe send pipeline (practical)
6.1 Classic send vs bundle send
Bundle send (io_uring_prep_send_bundle) with provided buffers can reduce per-chunk overhead and preserve stronger sequencing in a pipeline model.
Key details:
- set IOSQE_BUFFER_SELECT,
- set buf_group,
- for current behavior, pass len=0 (otherwise -EINVAL),
- the CQE res is the total number of bytes sent, and the buffer ID in the CQE flags is the starting buffer ID.
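Because a bundle-send CQE names only the starting buffer ID and a byte count, the app has to map `res` back onto the buffers it queued before reusing them. This is a sketch, assuming the app records buffers in the order it added them to the send buffer ring and that the kernel consumes them in that order; `struct tx_ring_shadow`, `shadow_push`, `shadow_complete`, and `CQE_BUFFER_SHIFT` are hypothetical. How the kernel accounts a partially sent final buffer is not modeled: the sketch only reports the leftover bytes so the caller can resend the tail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrored from the uapi header; use <liburing.h> in real code. */
#ifndef IORING_CQE_F_BUFFER
#define IORING_CQE_F_BUFFER (1U << 0)
#endif
#define CQE_BUFFER_SHIFT 16 /* value of IORING_CQE_BUFFER_SHIFT */

#define TXQ_CAP 32

/* Hypothetical record of buffers in the order they were added to the send ring. */
struct tx_ring_shadow {
    uint16_t bid[TXQ_CAP];
    uint32_t len[TXQ_CAP];
    size_t head, count;
};

static void shadow_push(struct tx_ring_shadow *s, uint16_t bid, uint32_t len) {
    assert(s->count < TXQ_CAP);
    size_t i = (s->head + s->count) % TXQ_CAP;
    s->bid[i] = bid;
    s->len[i] = len;
    s->count++;
}

/* Account one bundle-send CQE. Returns how many buffers were fully sent (now
 * safe to reuse), or -1 if the starting bid does not match the shadow head.
 * Bytes sent from a following, partially sent buffer go to *partial_bytes.
 * Errors and zero-byte results are left to the caller. */
static int shadow_complete(struct tx_ring_shadow *s, int32_t res, uint32_t cqe_flags,
                           uint32_t *partial_bytes) {
    *partial_bytes = 0;
    if (res <= 0 || !(cqe_flags & IORING_CQE_F_BUFFER) || s->count == 0) return 0;
    uint16_t start = (uint16_t)(cqe_flags >> CQE_BUFFER_SHIFT);
    if (start != s->bid[s->head]) return -1;
    uint32_t left = (uint32_t)res;
    int done = 0;
    while (s->count > 0 && left >= s->len[s->head]) {
        left -= s->len[s->head];
        s->head = (s->head + 1) % TXQ_CAP;
        s->count--;
        done++;
    }
    *partial_bytes = left;
    return done;
}
```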
6.2 Zero-copy send (SEND_ZC) lifecycle
io_uring_prep_send_zc() typically emits two CQEs:
- the send result CQE, often with IORING_CQE_F_MORE set,
- a notification CQE with IORING_CQE_F_NOTIF, meaning the buffer memory is now safe to reuse.
Critical rule:
- Do not reuse/free payload memory until notification CQE confirms completion.
If you enable IORING_SEND_ZC_REPORT_USAGE, the notification CQE's res reports whether the kernel had to fall back to copying (IORING_NOTIF_USAGE_ZC_COPIED set) or achieved true zero-copy, which is useful for tracking real-world effectiveness.
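The two-CQE lifecycle maps onto a three-state slot per payload. This is a sketch in plain C with flag values mirrored from the uapi header (use `<liburing.h>` in real code); `struct zc_slot`, `zc_on_cqe`, and `zc_can_reuse` are hypothetical. The key point is that a result CQE carrying `IORING_CQE_F_MORE` still leaves the buffer pinned, and only the `IORING_CQE_F_NOTIF` CQE makes it reusable. The sketch assumes that when the result CQE lacks `IORING_CQE_F_MORE`, no notification follows and the buffer is free immediately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrored from include/uapi/linux/io_uring.h; use <liburing.h> in real code. */
#ifndef IORING_CQE_F_MORE
#define IORING_CQE_F_MORE (1U << 1)
#endif
#ifndef IORING_CQE_F_NOTIF
#define IORING_CQE_F_NOTIF (1U << 3)
#endif
#ifndef IORING_NOTIF_USAGE_ZC_COPIED
#define IORING_NOTIF_USAGE_ZC_COPIED (1U << 31)
#endif

/* Hypothetical per-payload state for one SEND_ZC request. */
enum zc_state { ZC_SUBMITTED, ZC_AWAIT_NOTIF, ZC_RELEASABLE };

struct zc_slot {
    enum zc_state state;
    int32_t sent; /* res of the result CQE (bytes or -errno) */
    bool copied;  /* notification reported a copy fallback */
};

/* Feed every CQE whose user_data maps to this slot. */
static void zc_on_cqe(struct zc_slot *z, int32_t res, uint32_t flags) {
    assert(z->state != ZC_RELEASABLE); /* no CQEs expected after release */
    if (flags & IORING_CQE_F_NOTIF) {
        /* Notification: the kernel no longer references the payload. With
         * IORING_SEND_ZC_REPORT_USAGE, res carries the copied bit. */
        z->copied = ((uint32_t)res & IORING_NOTIF_USAGE_ZC_COPIED) != 0;
        z->state = ZC_RELEASABLE;
        return;
    }
    /* Result CQE: F_MORE means a notification will follow, so the
     * buffer stays pinned even though the send has "completed". */
    z->sent = res;
    z->state = (flags & IORING_CQE_F_MORE) ? ZC_AWAIT_NOTIF : ZC_RELEASABLE;
}

static bool zc_can_reuse(const struct zc_slot *z) {
    return z->state == ZC_RELEASABLE;
}
```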
7) Version-gated rollout strategy
Phase A — correctness first
- single outstanding recv + single outstanding send per socket,
- no ZC,
- explicit state machine and deterministic tests.
Phase B — multishot receive
- enable multishot accept/recv,
- add CQE flag assertions (F_MORE, F_BUFFER),
- verify no buffer leaks under disconnect storms.
Phase C — provided-buffer send / bundle send
- gate by IORING_FEAT_SEND_BUF_SELECT,
- compare syscall rate, p99 latency, and reorder defects.
Phase D — SEND_ZC
- start on large payload classes only,
- track notif lag, copied-vs-zc ratio, ENOMEM frequency,
- fall back to normal send automatically on adverse kernels/NIC paths.
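Phase D's automatic fallback can be a small per-worker policy. The thresholds here (`min_payload`, `window`, `max_copied_pct`, `max_enomem`) are hypothetical knobs, not kernel defaults; `copied` is assumed to come from the notification CQE with `IORING_SEND_ZC_REPORT_USAGE` enabled, and `use_send_zc` / `zc_record` are illustrative names.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tuning knobs; pick values per service and NIC. */
struct zc_policy {
    uint32_t min_payload;    /* only payloads >= this use SEND_ZC */
    uint32_t window;         /* completions per evaluation window */
    uint32_t max_copied_pct; /* disable ZC if more than this % were copied */
    uint32_t max_enomem;     /* disable ZC after more -ENOMEM results than this */
};

struct zc_stats {
    uint32_t completed, copied, enomem;
    bool disabled;
};

static bool use_send_zc(const struct zc_policy *p, const struct zc_stats *s, uint32_t len) {
    return !s->disabled && len >= p->min_payload;
}

/* Call once per finished SEND_ZC request: res from the result CQE,
 * copied from the notification (false if none arrived). */
static void zc_record(const struct zc_policy *p, struct zc_stats *s, int32_t res, bool copied) {
    assert(p->window > 0);
    if (res == -ENOMEM) {
        s->enomem++;
    } else {
        s->completed++;
        if (copied) s->copied++;
    }
    if (s->enomem > p->max_enomem ||
        (s->completed >= p->window &&
         (uint64_t)s->copied * 100 > (uint64_t)s->completed * p->max_copied_pct))
        s->disabled = true; /* fall back to plain send */
    if (s->completed >= p->window)
        s->completed = s->copied = s->enomem = 0; /* start a new window */
}
```

Once `disabled` is set, payloads go through the normal send path; when and how to re-enable ZC is left open in this sketch.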
8) Observability checklist (must-have)
At minimum, export:
- CQ depth usage %, CQ overflow/error counters,
- multishot repost count (unexpected drops of
F_MORE), - provided-buffer ring available count by
bgid, - send_zc notification lag (submit → notif),
- copied bytes ratio when
IORING_SEND_ZC_REPORT_USAGEenabled, - per-socket outstanding send/recv invariant violations.
Good SLO guardrail:
- “No CQ overflow” and “No illegal buffer reuse” should be hard errors, not soft warnings.
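A sketch of the hard-error guardrail that compares two metric snapshots. The struct, the `check_guardrails` helper, and the warning threshold are hypothetical; `cq_ready` would typically come from liburing's `io_uring_cq_ready()` and `cq_overflow` from the ring's `cq.koverflow` counter. On kernels with `IORING_FEAT_NODROP`, overflowed CQEs are held back rather than dropped, so the sketch also treats a pending overflow (`io_uring_cq_has_overflow()` in liburing) as fatal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-ring metrics snapshot exported by the service. */
struct ring_metrics {
    uint32_t cq_entries;    /* CQ size */
    uint32_t cq_ready;      /* unreaped CQEs */
    uint32_t cq_overflow;   /* kernel overflow counter */
    bool overflow_pending;  /* CQEs parked in the kernel overflow list */
    uint32_t illegal_reuse; /* app-detected buffer reuse violations */
};

enum guard { GUARD_OK, GUARD_WARN, GUARD_FATAL };

/* Overflow and illegal reuse are hard errors; high CQ occupancy only warns. */
static enum guard check_guardrails(const struct ring_metrics *prev,
                                   const struct ring_metrics *cur, uint32_t warn_pct) {
    assert(cur->cq_entries > 0);
    if (cur->overflow_pending || cur->cq_overflow != prev->cq_overflow ||
        cur->illegal_reuse != prev->illegal_reuse)
        return GUARD_FATAL;
    if ((uint64_t)cur->cq_ready * 100 >= (uint64_t)cur->cq_entries * warn_pct)
        return GUARD_WARN;
    return GUARD_OK;
}
```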
9) Common failure patterns
Assuming FIFO completion equals FIFO wire behavior
Fix: explicit per-socket sequencing discipline.
Reusing a ZC buffer after the first CQE
Fix: wait for the notification CQE (F_NOTIF).
Multishot silently stopped
Fix: always check F_MORE and repost immediately.
Provided-buffer starvation
Fix: ring refill thresholds + alerts + backpressure.
Kernel-feature mismatch
Fix: runtime capability probing and feature gates.
10) Bottom line
io_uring networking wins come less from one magic flag and more from disciplined ownership:
- socket-level sequencing,
- buffer-lifetime correctness,
- version-aware feature gating,
- CQ/ring observability.
If you treat multishot/provided-buffer/ZC as a cohesive pipeline design instead of independent toggles, you can usually get both lower overhead and safer tail behavior.
References
- io_uring overview and ordering cautions: https://man7.org/linux/man-pages/man7/io_uring.7.html
- Ring setup and SQPOLL behavior/privilege notes: https://man7.org/linux/man-pages/man2/io_uring_setup.2.html
- Multishot recv semantics and constraints: https://man7.org/linux/man-pages/man3/io_uring_prep_recv_multishot.3.html
- Multishot accept semantics: https://man7.org/linux/man-pages/man3/io_uring_prep_multishot_accept.3.html
- Provided buffer ring registration/details: https://man7.org/linux/man-pages/man3/io_uring_register_buf_ring.3.html
- Send / send bundle with provided buffers: https://man7.org/linux/man-pages/man3/io_uring_prep_send_bundle.3.html
- Zero-copy send semantics (SEND_ZC): https://man7.org/linux/man-pages/man3/io_uring_prep_send_zc.3.html
- io_uring zero-copy RX (kernel docs): https://docs.kernel.org/networking/iou-zcrx.html