Kernel Bypass on Linux for Low-Latency Trading: DPDK vs AF_XDP vs io_uring Playbook
Date: 2026-03-04
Category: knowledge
Why this matters
For execution systems, “network stack choice” is not an infra detail. It directly changes:
- wire-to-app latency distribution (especially p99/p99.9),
- jitter under bursty market-data load,
- CPU efficiency per packet,
- operational blast radius during incidents.
If you pick the wrong path, you either:
- overpay in latency/jitter (missed queue priority, worse fills), or
- overpay in operational complexity (fragile deploys, harder debugging/on-call).
This playbook is a practical chooser between DPDK, AF_XDP, and io_uring-based networking.
60-second mental model
DPDK
- Full user-space packet I/O model with poll-mode drivers.
- Maximizes control and performance, minimizes kernel path usage.
- Usually highest complexity (driver binding, hugepages, CPU pinning, NUMA discipline).
AF_XDP
- XDP + XSK sockets redirect packets from a NIC RX queue into a user-space UMEM.
- “Kernel-assisted bypass”: stays closer to the Linux ecosystem than a full DPDK path.
- Strong middle ground for many low-latency systems.
io_uring networking
- Completion-based async I/O with shared SQ/CQ rings.
- Not a full packet-plane replacement like DPDK/AF_XDP, but strong for scalable socket I/O.
- Best when you need lower syscall overhead + modern async model without deep NIC pipeline surgery.
What the kernel and project docs actually say
From Linux AF_XDP docs:
- AF_XDP sockets use RX/TX rings + UMEM with FILL/COMPLETION rings.
- Packets are redirected via XDP program + XSKMAP.
- Supports XDP_SKB (fallback/copy path) and XDP_DRV (driver path).
From DPDK PMD docs:
- PMDs are designed around polling, burst APIs, and no interrupt-driven hot path (except specific events).
- Architecture assumes explicit core/queue ownership and NUMA-aware memory layout.
From io_uring docs/wiki:
- Shared rings reduce syscall overhead and allow submit+wait patterns.
- Multi-shot accept/recv and provided-buffer models are key for high-connection/packetized workloads.
Decision matrix (practical)
Choose DPDK when
- You need the lowest possible jitter and strictest CPU/NIC queue control.
- You can dedicate cores aggressively (busy polling is acceptable).
- You can own the ops complexity (hugepages, vfio-pci, NIC binding discipline).
- Your team is comfortable with deep packet-path observability and tuning.
Typical fit:
- ultra-latency-sensitive market-data handlers,
- specialized feed handlers,
- colocated execution gateways with tight SLOs.
Choose AF_XDP when
- You need near-bypass performance but want to stay closer to Linux-native ops.
- You want queue-level user-space packet access without fully abandoning kernel ecosystem tooling.
- You can run XDP programs and manage per-queue mapping intentionally.
- You need a staged path from conventional Linux networking toward lower latency.
Typical fit:
- low-latency market-data consumers with stronger operability constraints,
- mixed workloads where some services still rely on normal kernel networking.
Choose io_uring networking when
- Your bottleneck is syscall/event-loop overhead in socket workloads (not full packet-plane bypass).
- You need high connection scale and efficient async semantics.
- You want modernized networking I/O without committing to full NIC bypass architecture.
Typical fit:
- order gateways and control-plane services with high concurrency,
- internal service mesh edges where tail latency matters but full bypass is overkill.
Latency-vs-operability tradeoff (rule of thumb)
- DPDK: best raw control/perf ceiling, highest integration burden.
- AF_XDP: strong middle point, generally better operability/perf balance.
- io_uring: easiest migration from classic socket apps, best for async efficiency rather than absolute packet-path minimization.
Don’t optimize only mean latency. Decide on:
- p99.9 target,
- jitter budget during open/close bursts,
- on-call recovery time when packet path misbehaves.
Non-negotiable engineering checklist (any option)
- Core and queue ownership are explicit
  - No accidental queue sharing in hot paths.
- NUMA locality is enforced
  - NIC queue, CPU core, and memory pool align on the same NUMA node whenever possible.
- Busy-poll budget is measured, not guessed
  - Trading lower latency for runaway CPU use without capacity planning is a hidden outage risk.
- Backpressure policy is designed upfront
  - Drop, coalesce, or degrade modes must be deterministic.
- Tail metrics are first-class
  - p50 improvements with p99.9 regressions are usually a net loss for execution quality.
- Rollback path is one command
  - Network-path experiments without instant rollback are incident debt.
Migration ladder (safe)
- Baseline the current socket path
  - Capture p50/p95/p99/p99.9 + CPU + drop/retry behavior per session window.
- Introduce io_uring for selected socket services
  - Validate event-loop simplification and syscall reduction.
- Pilot AF_XDP on one feed/one venue segment
  - Keep strict canary and route isolation.
- Escalate to DPDK only where justified by a tail-SLO gap
  - Not as a default religion; only where measured gains beat the complexity tax.
This sequencing avoids “jump to hardest architecture first” mistakes.
Benchmark design that avoids self-deception
Benchmark each candidate under:
- normal load,
- open/close burst profile,
- packet-size mix representative of actual feeds,
- induced microbursts,
- CPU pressure from non-network tasks.
Track at minimum:
- one-way app ingress latency (if timestamping permits),
- inter-arrival jitter at app boundary,
- packet loss/drop/reorder,
- core saturation and thermal throttling risk,
- recovery time after burst ends.
If your benchmark excludes bursty regimes, it is not useful for trading systems.
Common footguns
- Comparing copy-mode AF_XDP vs tuned DPDK and calling it final
  - Ensure mode/driver assumptions are explicit.
- Ignoring IOMMU/VFIO implications for DPDK rollout
  - Driver binding and the security/permissions model are part of production design.
- Treating io_uring as a drop-in “always faster epoll”
  - Gains depend on event-loop redesign, batching, and buffer strategy.
- Overfitting to lab RTT/traffic
  - Use session-aware market burst replay, not synthetic steady streams only.
- No observability parity across options
  - If one path has weaker telemetry, postmortems become guesswork.
One-page recommendation for most desks
If your team is building practical low-latency trading infra (not HFT nanosecond wars), a robust default sequence is:
- start with solid socket architecture,
- adopt io_uring where async efficiency is bottlenecked,
- move hot market-data paths to AF_XDP,
- reserve DPDK for paths where measured p99.9 gains materially improve execution outcomes.
Treat packet-path architecture as a portfolio of choices by service class, not a single ideology.
References
- Linux Kernel Docs — AF_XDP: https://docs.kernel.org/networking/af_xdp.html
- Linux Kernel 4.18 notes (AF_XDP introduction context): https://kernelnewbies.org/Linux_4.18
- DPDK Programmer’s Guide — Poll Mode Driver: https://doc.dpdk.org/guides-24.03/prog_guide/poll_mode_drv.html
- DPDK Linux GSG — System Requirements (hugepages, kernel baseline): https://doc.dpdk.org/guides/linux_gsg/sys_reqs.html
- DPDK Linux GSG — Linux Drivers (vfio-pci, binding model): https://doc.dpdk.org/guides/linux_gsg/linux_drivers.html
- io_uring(7) Linux man page: https://man7.org/linux/man-pages/man7/io_uring.7.html
- liburing wiki — io_uring and networking in 2023: https://github.com/axboe/liburing/wiki/io_uring-and-networking-in-2023
- LWN — Accelerating networking with AF_XDP: https://lwn.net/Articles/750845/
One-sentence takeaway
Pick DPDK / AF_XDP / io_uring per service-class tail SLO and ops maturity: optimize p99.9 and recovery behavior, not just headline microbench latency.