Kernel Bypass on Linux for Low-Latency Trading: DPDK vs AF_XDP vs io_uring Playbook
Date: 2026-03-04
Category: knowledge
Why this matters
For execution systems, “network stack choice” is not an infra detail. It directly changes:
- wire-to-app latency distribution (especially p99/p99.9),
- jitter under bursty market-data load,
- CPU efficiency per packet,
- operational blast radius during incidents.
If you pick the wrong path, you either:
- overpay in latency/jitter (missed queue priority, worse fills), or
- overpay in operational complexity (fragile deploys, harder debugging/on-call).
This playbook is a practical chooser between DPDK, AF_XDP, and io_uring-based networking.
60-second mental model
DPDK
- Full user-space packet I/O model with poll-mode drivers.
- Maximizes control and performance, minimizes kernel path usage.
- Usually highest complexity (driver binding, hugepages, CPU pinning, NUMA discipline).
AF_XDP
- XDP + XSK sockets redirect packets from a NIC RX queue into a user-space UMEM.
- “Kernel-assisted bypass”: stays closer to the Linux ecosystem than a full DPDK path.
- Strong middle ground for many low-latency systems.
io_uring networking
- Completion-based async I/O with shared SQ/CQ rings.
- Not a full packet-plane replacement like DPDK/AF_XDP, but strong for scalable socket I/O.
- Best when you need lower syscall overhead + modern async model without deep NIC pipeline surgery.
What the kernel and project docs actually say
From Linux AF_XDP docs:
- AF_XDP sockets use RX/TX rings + UMEM with FILL/COMPLETION rings.
- Packets are redirected via XDP program + XSKMAP.
- Supports XDP_SKB (fallback/copy path) and XDP_DRV (driver path).
From DPDK PMD docs:
- PMDs are designed around polling, burst APIs, and no interrupt-driven hot path (except specific events).
- Architecture assumes explicit core/queue ownership and NUMA-aware memory layout.
From io_uring docs/wiki:
- Shared rings reduce syscall overhead and allow submit+wait patterns.
- Multi-shot accept/recv and provided-buffer models are key for high-connection/packetized workloads.
Decision matrix (practical)
Choose DPDK when
- You need the lowest possible jitter and strictest CPU/NIC queue control.
- You can dedicate cores aggressively (busy polling is acceptable).
- You can own the ops complexity (hugepages, vfio-pci, NIC binding discipline).
- Your team is comfortable with deep packet-path observability and tuning.
Typical fit:
- ultra-latency-sensitive market-data handlers,
- specialized feed handlers,
- colocated execution gateways with tight SLOs.
Choose AF_XDP when
- You need near-bypass performance but want to stay closer to Linux-native ops.
- You want queue-level user-space packet access without fully abandoning kernel ecosystem tooling.
- You can run XDP programs and manage per-queue mapping intentionally.
- You need a staged path from conventional Linux networking toward lower latency.
Typical fit:
- low-latency market-data consumers with stronger operability constraints,
- mixed workloads where some services still rely on normal kernel networking.
Choose io_uring networking when
- Your bottleneck is syscall/event-loop overhead in socket workloads (not full packet-plane bypass).
- You need high connection scale and efficient async semantics.
- You want modernized networking I/O without committing to full NIC bypass architecture.
Typical fit:
- order gateways and control-plane services with high concurrency,
- internal service mesh edges where tail latency matters but full bypass is overkill.
Latency-vs-operability tradeoff (rule of thumb)
- DPDK: best raw control/perf ceiling, highest integration burden.
- AF_XDP: strong middle point, generally better operability/perf balance.
- io_uring: easiest migration from classic socket apps, best for async efficiency rather than absolute packet-path minimization.
Don’t optimize only mean latency. Decide on:
- p99.9 target,
- jitter budget during open/close bursts,
- on-call recovery time when packet path misbehaves.
Non-negotiable engineering checklist (any option)
- Core and queue ownership are explicit
  - No accidental queue sharing in hot paths.
- NUMA locality is enforced
  - NIC queue, CPU core, and memory pool align on the same NUMA node whenever possible.
- Busy-poll budget is measured, not guessed
  - Trading lower latency for runaway CPU use without capacity planning is a hidden outage risk.
- Backpressure policy is designed upfront
  - Drop, coalesce, or degrade modes must be deterministic.
- Tail metrics are first-class
  - p50 improvements with p99.9 regressions are usually a net loss for execution quality.
- Rollback path is one command
  - Network-path experiments without instant rollback are incident debt.
Migration ladder (safe)
- Baseline the current socket path
  - Capture p50/p95/p99/p99.9 + CPU + drop/retry behavior per session window.
- Introduce io_uring for selected socket services
  - Validate event-loop simplification and syscall reduction.
- Pilot AF_XDP on one feed/one venue segment
  - Keep strict canary and route isolation.
- Escalate to DPDK only where justified by a tail-SLO gap
  - Not as a default religion; only where measured gains beat the complexity tax.
This sequencing avoids “jump to hardest architecture first” mistakes.
Benchmark design that avoids self-deception
Benchmark each candidate under:
- normal load,
- open/close burst profile,
- packet-size mix representative of actual feeds,
- induced microbursts,
- CPU pressure from non-network tasks.
Track at minimum:
- one-way app ingress latency (if timestamping permits),
- inter-arrival jitter at app boundary,
- packet loss/drop/reorder,
- core saturation and thermal throttling risk,
- recovery time after burst ends.
If your benchmark excludes bursty regimes, it is not useful for trading systems.
Common footguns
- Comparing copy-mode AF_XDP vs tuned DPDK and calling it final
  - Ensure mode/driver assumptions are explicit.
- Ignoring IOMMU/VFIO implications for DPDK rollout
  - Driver binding and the security/permissions model are part of production design.
- Treating io_uring as a drop-in “always faster epoll”
  - Gains depend on event-loop redesign, batching, and buffer strategy.
- Overfitting to lab RTT/traffic
  - Use session-aware market burst replay, not synthetic steady streams only.
- No observability parity across options
  - If one path has weaker telemetry, postmortems become guesswork.
One-page recommendation for most desks
If your team is building practical low-latency trading infra (not HFT nanosecond wars), a robust default sequence is:
- start with solid socket architecture,
- adopt io_uring where async efficiency is bottlenecked,
- move hot market-data paths to AF_XDP,
- reserve DPDK for paths where measured p99.9 gains materially improve execution outcomes.
Treat packet-path architecture as a portfolio of choices by service class, not a single ideology.
References
- Linux Kernel Docs — AF_XDP: https://docs.kernel.org/networking/af_xdp.html
- Linux Kernel 4.18 notes (AF_XDP introduction context): https://kernelnewbies.org/Linux_4.18
- DPDK Programmer’s Guide — Poll Mode Driver: https://doc.dpdk.org/guides-24.03/prog_guide/poll_mode_drv.html
- DPDK Linux GSG — System Requirements (hugepages, kernel baseline): https://doc.dpdk.org/guides/linux_gsg/sys_reqs.html
- DPDK Linux GSG — Linux Drivers (vfio-pci, binding model): https://doc.dpdk.org/guides/linux_gsg/linux_drivers.html
- io_uring(7) Linux man page: https://man7.org/linux/man-pages/man7/io_uring.7.html
- liburing wiki — io_uring and networking in 2023: https://github.com/axboe/liburing/wiki/io_uring-and-networking-in-2023
- LWN — Accelerating networking with AF_XDP: https://lwn.net/Articles/750845/
One-sentence takeaway
Pick DPDK / AF_XDP / io_uring per service-class tail SLO and ops maturity: optimize p99.9 and recovery behavior, not just headline microbench latency.