NUMA-Aware Process Placement for Low-Latency Systems: Practical Playbook

2026-03-06 · software

Purpose: A field guide for reducing tail latency and jitter by aligning CPU placement, memory policy, and allocation behavior on NUMA hosts.


Why this matters

On NUMA machines, memory access time depends on where the CPU and memory live. Local-node memory is faster than remote-node memory, and remote access can amplify p99/p999 latency under load.

For latency-sensitive services (matching/risk engines, gateways, real-time analytics), NUMA mistakes typically show up as:

  1. Elevated p99/p999 under load while averages look healthy.
  2. Unexplained jitter that tracks traffic shifts, restarts, or failovers.
  3. Rising numa_miss/numa_foreign counters in numastat.
  4. Throughput cliffs when one node's memory fills and allocations spill remote.

NUMA tuning is usually not about one magic flag. It is about consistent placement policy end-to-end.


One-screen mental model

  1. Scheduler chooses where threads run (CPU affinity, cpuset/cgroup constraints).
  2. Kernel chooses where pages are allocated (NUMA memory policy).
  3. If thread and page end up on different nodes, you pay remote access costs.
  4. Automatic NUMA balancing may migrate pages/tasks to improve locality, but that itself adds overhead.

Goal: keep hot threads + hot memory on the same node by design, not by luck.


Placement building blocks (Linux)

1) Discover topology first

numactl --hardware
lscpu -e=cpu,node,socket,core

You need a stable map of sockets, nodes, and cores before pinning anything.
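
If numactl is not installed, the same topology is visible in sysfs. A minimal sketch, assuming the standard Linux sysfs layout (present even on single-node hosts):

```shell
# Print each NUMA node's CPU list straight from sysfs
for n in /sys/devices/system/node/node*; do
  printf '%s: cpus=%s\n' "$(basename "$n")" "$(cat "$n/cpulist")"
done
```

This gives the node-to-CPU map without any extra tooling; cross-check it against numactl --hardware once numactl is available.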

2) Pin CPU execution domain

numactl supports:

  1. --cpunodebind=<nodes>: run only on the CPUs of the given node(s).
  2. --physcpubind=<cpus>: pin to explicit CPU IDs for finer, per-core control.

For per-thread pinning inside a process, combine this with sched_setaffinity/pthread_setaffinity_np; cpuset cgroups constrain the whole group.


3) Set memory policy intentionally

numactl/kernel policies you’ll use most:

  1. --localalloc: allocate on the node of the running CPU (the kernel default).
  2. --membind=<nodes>: strict; allocate only on the given node(s), reclaiming or failing rather than spilling remote.
  3. --preferred=<node>: prefer one node, fall back to others under pressure.
  4. --interleave=<nodes>: round-robin pages across nodes for bandwidth.

Important: cpuset restrictions take precedence over memory policy.
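
You can see the cpuset-level limits that will override your intended policy directly in procfs; Cpus_allowed_list and Mems_allowed_list are standard fields on Linux:

```shell
# Effective CPU and memory-node masks after cgroup/cpuset restrictions
grep -E 'Cpus_allowed_list|Mems_allowed_list' /proc/self/status
```

If Mems_allowed_list does not include the node you are binding to, the binding cannot take effect as written.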


Automatic NUMA balancing: when to use it

Kernel numa_balancing can sample access patterns (via unmap + fault) and migrate for locality. This can improve placement for drifting workloads, but introduces extra overhead.

Practical rule:

  1. Explicitly pinned latency-critical hosts (Patterns A/B below): turn numa_balancing off; its sampling faults only add jitter to placement you already control.
  2. Drifting, unpinned, general-purpose workloads: leave it on and measure.

Kernel docs explicitly note there is no universal guarantee: balancing overhead may or may not be offset by improved locality.
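
To check and pin the setting explicitly (a sketch assuming the standard sysctl; the file is absent on kernels built without NUMA balancing):

```shell
# 1 = automatic NUMA balancing on, 0 = off
f=/proc/sys/kernel/numa_balancing
if [ -r "$f" ]; then
  cat "$f"
else
  echo "numa_balancing not available on this kernel"
fi
# To disable (requires root):
#   sysctl -w kernel.numa_balancing=0
```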


Battle-tested operating patterns

Pattern A — Strict single-node isolation (lowest jitter)

Use for ultra-latency-critical workers.

numactl --cpunodebind=1 --membind=1 ./engine

Pros:

  1. Deterministic locality: threads and pages both stay on node 1.
  2. No remote-node accesses on the hot path; lowest jitter of the three patterns.

Risks:

  1. Strict --membind has no fallback: exhausting node 1's memory means reclaim stalls or allocation failure/OOM, not a remote spill.
  2. Requires sizing the working set to node capacity with headroom.

Pattern B — Preferred local with controlled fallback (safer default)

numactl --cpunodebind=1 --preferred=1 ./engine

Pros:

  1. Hot allocations land locally in the common case.
  2. Under node-memory pressure the kernel falls back to other nodes instead of failing.

Risks:

  1. The fallback is silent: remote spill degrades tails with no error. Watch numa_foreign/other_node after deploys and failovers.

Pattern C — Interleave for memory bandwidth workloads

Use for scans/large in-memory analytics, not ultra-low-latency critical paths.

numactl --interleave=all ./batch_analytics

Pros:

  1. Spreads pages across all memory controllers, maximizing aggregate bandwidth.
  2. Avoids filling and hotspotting a single node.

Risks:

  1. Most accesses are remote by design, so per-access latency gets worse.
  2. Wrong choice for latency-critical request paths.


Verification loop (what to measure, not guess)

1) Confirm effective policy and page placement

numactl --show
numastat -p <pid>

numactl --show prints the effective CPU/node bindings and memory policy; numastat -p breaks down a process's resident memory by node.
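
For a quick self-check of page placement, assuming Linux procfs: each mapping line in /proc/<pid>/numa_maps carries N<node>=<pages> entries showing where that mapping's pages live.

```shell
# Per-mapping page counts by node for the current shell
grep -m 3 -E 'N[0-9]+=' /proc/self/numa_maps 2>/dev/null \
  || echo "numa_maps not available"
```

Run the same grep against your service's PID after applying a policy to confirm pages actually landed on the intended node.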

2) Watch allocator-level NUMA health

numastat default counters:

  1. numa_hit: allocations satisfied on the intended node.
  2. numa_miss: intended node was full, so the page came from another node (counted on the node that supplied it).
  3. numa_foreign: allocations intended for this node that another node had to serve.
  4. interleave_hit: interleaved allocations that landed on the intended node.
  5. local_node / other_node: allocations relative to the node of the running CPU.
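
The same counters are exposed per node in sysfs, which helps on hosts without the numastat binary (a sketch assuming standard Linux paths):

```shell
# Raw per-node allocation counters (numa_hit, numa_miss, numa_foreign, ...)
for n in /sys/devices/system/node/node*; do
  echo "== $(basename "$n")"
  cat "$n/numastat"
done
```

Sample these before and after a change; a healthy pinned deployment shows numa_miss and numa_foreign essentially flat.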

3) Correlate with latency/tail metrics

Track before/after:

  1. p50/p99/p999 latency and worst-case jitter under the same load profile.
  2. numa_miss/numa_foreign growth rates per node.
  3. Direct reclaim and compaction activity on the bound node.
  4. Throughput and error rate, to catch capacity regressions.

4) Optional deep dive

Use perf c2c when cacheline ping-pong or cross-node contention is suspected. It can surface remote/local HITM and peer-load patterns for hot cachelines:

perf c2c record -a -- sleep 10
perf c2c report --stats


DPDK / packet path note (important in practice)

For DPDK-style data planes, reserve hugepages per node deliberately and align memory to socket layout. DPDK docs emphasize NUMA-aware hugepage reservation and recommend socket-specific memory controls (e.g., --socket-mem) over coarse global memory sizing.

If queues/threads are pinned per socket but memory pools are not, cross-socket traffic can destroy deterministic latency.
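
Per-node hugepage reservation can be done through sysfs before the data plane starts. A sketch, assuming 2MiB pages on node 0; the echo needs root, and dpdk_app is a placeholder for your EAL binary:

```shell
# Reserve 1024 x 2MiB hugepages on node 0 only (run as root):
#   echo 1024 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
# Verify the global view:
grep -i hugepages_total /proc/meminfo
# Then give the app socket-local memory, e.g. 1024 MB on socket 0, none on socket 1:
#   ./dpdk_app -l 2-7 -n 4 --socket-mem=1024,0
```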


Common anti-patterns

  1. Pinning CPU but leaving memory policy implicit.
  2. Using strict membind without capacity headroom.
  3. Turning on every “NUMA optimization” flag at once (no attribution).
  4. Benchmarking without fixed affinity and then trusting results.
  5. Ignoring cpuset/cgroup limits that silently override intended policy.

Rollout strategy (production-safe)

  1. Baseline current tail + NUMA counters.
  2. Apply one policy change to one service shard (canary).
  3. Hold load profile constant (traffic shape + concurrency).
  4. Compare 24h window: tails, misses/foreign, reclaim, error rate.
  5. Keep/revert, then move to next shard.

Treat NUMA tuning like SLO surgery: isolate one variable per experiment.


Quick decision matrix

  Workload profile                        | Suggested policy
  ----------------------------------------|--------------------------------
  Ultra-low-latency, fits one node        | --cpunodebind=N --membind=N
  Latency-sensitive, uncertain footprint  | --cpunodebind=N --preferred=N
  Bandwidth-bound scans/analytics         | --interleave=all
  Drifting, unpinned general workloads    | defaults + numa_balancing on

Practical takeaway

NUMA is not a micro-optimization on multi-socket hosts. It is part of core correctness for predictable latency.

Make CPU affinity, memory policy, and observability agree with each other. If one is implicit, tails will find it.

