NUMA-Aware Process Placement for Low-Latency Systems: Practical Playbook
Date: 2026-03-06
Category: software
Purpose: A field guide for reducing tail latency and jitter by aligning CPU placement, memory policy, and allocation behavior on NUMA hosts.
Why this matters
On NUMA machines, memory access time depends on where the CPU and memory live. Local-node memory is faster than remote-node memory, and remote access can amplify p99/p999 latency under load.
For latency-sensitive services (matching/risk engines, gateways, real-time analytics), NUMA mistakes typically show up as:
- unexplained tail spikes despite low average CPU
- poor scaling after moving from 1 socket to multi-socket
- noisy benchmark variance across identical runs
- “works fine off-peak, degrades at busy times” behavior
NUMA tuning is usually not about one magic flag. It is about consistent placement policy end-to-end.
One-screen mental model
- Scheduler chooses where threads run (CPU affinity, cpuset/cgroup constraints).
- Kernel chooses where pages are allocated (NUMA memory policy).
- If thread and page end up on different nodes, you pay remote access costs.
- Automatic NUMA balancing may migrate pages/tasks to improve locality, but that itself adds overhead.
Goal: keep hot threads + hot memory on the same node by design, not by luck.
Placement building blocks (Linux)
1) Discover topology first
numactl --hardware
lscpu -e=CPU,NODE,SOCKET,CORE
You need a stable map of sockets, nodes, and cores before pinning anything.
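A minimal sketch of turning that topology dump into a stable CPU→node map. The here-doc holds hypothetical `lscpu -e` output; on a real host, pipe in `lscpu -e=CPU,NODE,SOCKET,CORE` instead.

```shell
#!/bin/sh
# Build a CPU -> NUMA node map from lscpu-style extended output.
# The sample data below is hypothetical; on a real host replace the
# here-doc with:  lscpu -e=CPU,NODE,SOCKET,CORE | cpu_node_map
cpu_node_map() {
  # Skip the header row; column 1 is the CPU id, column 2 its NUMA node.
  awk 'NR > 1 { print "cpu" $1 " -> node" $2 }'
}

cpu_node_map <<'EOF'
CPU NODE SOCKET CORE
0   0    0      0
1   0    0      1
2   1    1      2
3   1    1      3
EOF
```

Pinning decisions should be driven from a map like this, not from assumed CPU numbering: CPU IDs do not always enumerate contiguously per node.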
2) Pin CPU execution domain
numactl supports:
- --cpunodebind=<nodes>: run on CPUs of specific NUMA nodes
- --physcpubind=<cpus>: pin to exact CPU IDs
3) Set memory policy intentionally
numactl/kernel policies you’ll use most:
- --membind=<nodes> / MPOL_BIND: allocate only from the selected node set
- --preferred=<node> / MPOL_PREFERRED: prefer one node, fallback allowed
- --preferred-many=<mask> / MPOL_PREFERRED_MANY: prefer a set of nodes
- --interleave=<nodes> / MPOL_INTERLEAVE: spread pages across nodes (bandwidth-oriented)
- --weighted-interleave=<nodes> / MPOL_WEIGHTED_INTERLEAVE: weighted distribution (newer kernels)
- --localalloc / MPOL_LOCAL: allocate on the current CPU's local node
Important: cpuset restrictions take precedence over memory policy.
Automatic NUMA balancing: when to use it
Kernel numa_balancing samples access patterns (by periodically unmapping pages and observing NUMA hinting faults) and migrates pages or tasks for locality. This can improve placement for drifting workloads, but the sampling and migration themselves add overhead.
Practical rule:
- Enable / keep default when thread-memory locality is naturally dynamic.
- Disable or constrain when you already pin threads tightly and want deterministic tails.
Kernel docs explicitly note there is no universal guarantee: balancing overhead may or may not be offset by improved locality.
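A small read-only sketch for checking the current balancing state; the disable step is shown as an echoed command rather than executed, since changing it requires root and should go through your normal change process. The fallback string is an assumption for hosts where the sysctl file is absent.

```shell
#!/bin/sh
# Read the automatic NUMA balancing state (0 = off). Read-only:
# the sysctl command that would disable it is only echoed.
balancing_state() {
  f=/proc/sys/kernel/numa_balancing
  if [ -r "$f" ]; then
    cat "$f"
  else
    echo "unavailable"   # e.g. non-NUMA kernel or restricted container
  fi
}

echo "numa_balancing=$(balancing_state)"
# For tightly pinned, latency-critical hosts (run as root, via change control):
echo "would run: sysctl -w kernel.numa_balancing=0"
```

Pair any change to this knob with before/after tail-latency measurement, per the verification loop below.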
Battle-tested operating patterns
Pattern A — Strict single-node isolation (lowest jitter)
Use for ultra-latency-critical workers.
numactl --cpunodebind=1 --membind=1 ./engine
Pros:
- strong locality
- predictable tails
Risks:
- allocation failure or reclaim pressure if node memory is tight
- operational fragility during bursts
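Because strict binding fails hard when the node runs out of memory, it is worth gating the launch on a headroom check. A sketch, using hypothetical sample text in the format of `/sys/devices/system/node/node<N>/meminfo`; on a real host, read that file instead of the here-doc.

```shell
#!/bin/sh
# Headroom gate before strict --membind: refuse to launch if the target
# node's free memory is below a threshold. Sample meminfo text is
# hypothetical; on a real host pipe in:
#   cat /sys/devices/system/node/node1/meminfo
node_free_kb() {
  # Line format: "Node 1 MemFree:  <kB> kB" -> take the next-to-last field.
  awk '/MemFree:/ { print $(NF-1) }'
}

free_kb=$(node_free_kb <<'EOF'
Node 1 MemTotal:       32768000 kB
Node 1 MemFree:         4096000 kB
EOF
)
need_kb=2097152   # example threshold: require 2 GiB of headroom
if [ "$free_kb" -ge "$need_kb" ]; then
  echo "ok: node has ${free_kb} kB free"
else
  echo "abort: only ${free_kb} kB free on node"
fi
```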
Pattern B — Preferred local with controlled fallback (safer default)
numactl --cpunodebind=1 --preferred=1 ./engine
Pros:
- mostly local under normal conditions
- less failure-prone than strict bind
Risks:
- fallback can hide locality regressions unless monitored
Pattern C — Interleave for memory bandwidth workloads
Use for scans/large in-memory analytics, not ultra-low-latency critical paths.
numactl --interleave=all ./batch_analytics
Pros:
- better bandwidth utilization
Risks:
- single hot access path may still suffer higher latency
Verification loop (what to measure, not guess)
1) Confirm effective policy and page placement
- /proc/<pid>/numa_maps for per-range policy and node page counts
- numastat -p <pid> for per-node process memory distribution
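Summing the `N<node>=<pages>` fields across numa_maps lines gives a quick per-node page total for the process. The sample lines below are hypothetical but follow the numa_maps format; on a real host, pipe in `cat /proc/<pid>/numa_maps`.

```shell
#!/bin/sh
# Aggregate per-node page counts from /proc/<pid>/numa_maps output.
# Sample lines are hypothetical; real input:  cat /proc/<pid>/numa_maps
node_pages() {
  awk '{ for (i = 1; i <= NF; i++)
           if ($i ~ /^N[0-9]+=/) { split($i, kv, "="); pages[kv[1]] += kv[2] } }
       END { for (n in pages) print n, pages[n] }' | sort
}

node_pages <<'EOF'
7f2a00000000 bind:1 anon=4096 dirty=4096 N1=4096 kernelpagesize_kB=4
7f2b00000000 default anon=512 dirty=512 N0=256 N1=256 kernelpagesize_kB=4
EOF
```

For a worker bound to node 1, a growing N0 total here is an immediate locality regression signal.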
2) Watch allocator-level NUMA health
numastat default counters:
- numa_hit: allocation satisfied on the intended node
- numa_miss / numa_foreign: locality mismatch signals
- local_node vs other_node: immediate locality quality check
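These counters reduce to a single trendable ratio. A sketch that sums hit/miss across node columns of numastat-style output and reports the local-allocation fraction; the counter values are hypothetical sample data, and on a real host you would pipe in `numastat` instead.

```shell
#!/bin/sh
# Local-allocation ratio from numastat-style counters (one row per
# counter, one column per node). Sample data is hypothetical; real
# input:  numastat | local_ratio
local_ratio() {
  awk '/^numa_hit/  { for (i = 2; i <= NF; i++) hit  += $i }
       /^numa_miss/ { for (i = 2; i <= NF; i++) miss += $i }
       END { printf "local_ratio=%.4f\n", hit / (hit + miss) }'
}

local_ratio <<'EOF'
numa_hit 5000000 4900000
numa_miss 60000 40000
EOF
```

Alert on the trend (ratio falling), not the absolute value: counters are cumulative since boot, so sample deltas over a window.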
3) Correlate with latency/tail metrics
Track before/after:
- p95/p99/p999 latency
- jitter (stddev / MAD)
- throughput-per-core
- major page faults and reclaim activity
4) Optional deep dive
Use perf c2c when cacheline ping-pong or cross-node contention is suspected. It can surface remote/local HITM and peer-load patterns for hot cachelines.
DPDK / packet path note (important in practice)
For DPDK-style data planes, reserve hugepages per node deliberately and align memory to socket layout. DPDK docs emphasize NUMA-aware hugepage reservation and recommend socket-specific memory controls (e.g., --socket-mem) over coarse global memory sizing.
If queues/threads are pinned per socket but memory pools are not, cross-socket traffic can destroy deterministic latency.
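Per-node hugepage reservation is done through sysfs, one node at a time. A dry-run sketch: the reservation commands are echoed rather than executed, and the node IDs, page counts, core list, and application name are illustrative, not prescriptive.

```shell
#!/bin/sh
# Per-node 2 MiB hugepage reservation for a DPDK-style data plane.
# Dry run: commands are echoed, not executed. Counts are examples.
reserve_hugepages() {
  node="$1"; count="$2"
  path="/sys/devices/system/node/node${node}/hugepages/hugepages-2048kB/nr_hugepages"
  echo "echo $count > $path"
}

reserve_hugepages 0 1024
reserve_hugepages 1 1024
# Then start the app with per-socket memory matching the reservation, e.g.:
echo "dpdk-app -l 2-5 --socket-mem=1024,1024"
```

The point of `--socket-mem=1024,1024` is symmetry with the pinned queues: each socket's threads draw from that socket's hugepage pool.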
Common anti-patterns
- Pinning CPU but leaving memory policy implicit.
- Using strict membind without capacity headroom.
- Turning on every “NUMA optimization” flag at once (no attribution).
- Benchmarking without fixed affinity and then trusting results.
- Ignoring cpuset/cgroup limits that silently override intended policy.
Rollout strategy (production-safe)
- Baseline current tail + NUMA counters.
- Apply one policy change to one service shard (canary).
- Hold load profile constant (traffic shape + concurrency).
- Compare 24h window: tails, misses/foreign, reclaim, error rate.
- Keep/revert, then move to next shard.
Treat NUMA tuning like SLO surgery: isolate one variable per experiment.
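The "compare 24h window" step can be mechanized by diffing two numastat snapshots taken at the window boundaries. A sketch with hypothetical snapshot contents; the `/tmp` paths and counter values are illustrative.

```shell
#!/bin/sh
# Diff two numastat snapshots (start/end of a canary window) and report
# growth in locality misses. Snapshot contents here are hypothetical.
miss_delta() {
  before="$1"; after="$2"
  b=$(awk '/^numa_miss/ { for (i = 2; i <= NF; i++) s += $i } END { print s }' "$before")
  a=$(awk '/^numa_miss/ { for (i = 2; i <= NF; i++) s += $i } END { print s }' "$after")
  echo "numa_miss delta: $((a - b))"
}

cat > /tmp/numastat.before <<'EOF'
numa_miss 1000 2000
EOF
cat > /tmp/numastat.after <<'EOF'
numa_miss 1500 2600
EOF
miss_delta /tmp/numastat.before /tmp/numastat.after
```

A flat miss delta alongside improved tails is the keep signal; a rising delta means the canary's placement policy is leaking remote allocations.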
Quick decision matrix
- Need absolute lowest jitter? → cpunodebind + membind (with headroom).
- Need robustness over strictness? → cpunodebind + preferred.
- Need aggregate memory bandwidth? → interleave / weighted-interleave.
- Already tightly pinned and stable? → test with NUMA balancing off.
- Workload drifts over time? → test with balancing on + observe overhead.
Practical takeaway
NUMA is not a micro-optimization on multi-socket hosts. It is part of core correctness for predictable latency.
Make CPU affinity, memory policy, and observability agree with each other. If one is implicit, tails will find it.
References
- Linux man page: numa(7) — overview and /proc/<pid>/numa_maps behavior
  https://man7.org/linux/man-pages/man7/numa.7.html
- Linux kernel docs: NUMA memory policy (scope, modes, cpuset interaction)
  https://docs.kernel.org/admin-guide/mm/numa_memory_policy.html
- Linux man page: numactl(8) — policy and affinity options
  https://man7.org/linux/man-pages/man8/numactl.8.html
- Linux man page: set_mempolicy(2) — policy modes and newer flags
  https://man7.org/linux/man-pages/man2/set_mempolicy.2.html
- Linux kernel sysctl docs: numa_balancing and promote rate limit
  https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html
- Linux man page: numastat(8) — per-node system/process NUMA stats
  https://man7.org/linux/man-pages/man8/numastat.8.html
- Linux man page: perf-c2c(1) — cache-to-cache contention analysis
  https://man7.org/linux/man-pages/man1/perf-c2c.1.html
- DPDK getting started / system requirements (hugepages + NUMA)
  https://doc.dpdk.org/guides/linux_gsg/sys_reqs.html
- DPDK EAL guide (memory modes and NUMA implications)
  https://doc.dpdk.org/guides-20.11/prog_guide/env_abstraction_layer.html