Lock-Free Ring Buffer Memory Ordering Playbook (SPSC/MPMC, Practical)
Date: 2026-03-19
Category: knowledge
Why this matters
Ring buffers are the default hot-path queue in low-latency systems (market gateways, telemetry pipelines, streaming engines).
Most production failures are not “algorithm is wrong.” They are:
- wrong memory ordering (rare stale/duplicate reads),
- false-sharing cache fights,
- no backpressure contract,
- benchmark numbers that collapse under real contention.
Treat ring buffers as a concurrency contract, not just an array with two counters.
1) Topology first: pick the simplest queue that fits
Before touching atomics, lock in producer/consumer topology:
- SPSC (single-producer/single-consumer): simplest, fastest, easiest to reason about.
- MPSC: producer contention exists; correctness cost rises.
- SPMC: consumer contention exists.
- MPMC: hardest (usually needs per-slot sequence protocol).
Rule: if architecture allows SPSC lanes per core/shard, do that first.
2) Mental model: ownership vs visibility
Two separate truths:
- Ownership: who may mutate which index/state.
- Visibility: when another thread is guaranteed to observe those writes.
Typical bug: “index changed” is visible, but payload write is not safely published yet.
Use atomics to publish state transitions (empty -> ready -> consumed), not every field write.
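The publish pattern above can be shown in a minimal C++ sketch; the names publish/try_read and the single-int payload are illustrative assumptions, not from the source:

```cpp
#include <atomic>

// Payload is written with plain stores; only the state transition
// (empty -> ready) is published atomically.
int payload = 0;
std::atomic<bool> ready{false};

void publish(int v) {
    payload = v;                                   // plain payload write
    ready.store(true, std::memory_order_release);  // publish the transition
}

bool try_read(int& out) {
    if (!ready.load(std::memory_order_acquire))    // pairs with the release
        return false;
    out = payload;  // guaranteed to observe the payload write above
    return true;
}
```

The release store "drags" the earlier payload write with it: any thread whose acquire load sees `ready == true` is guaranteed to see `payload` as well.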
3) SPSC memory-order recipe (the baseline)
Assume:
- producer owns head,
- consumer owns tail,
- both read each other’s index atomically.
Producer (enqueue)
- Read local head (often relaxed).
- Read consumer tail with acquire when checking full.
- Write payload into slot.
- Publish new head with release.
Consumer (dequeue)
- Read local tail (often relaxed).
- Read producer head with acquire when checking empty.
- Read payload from slot.
- Publish new tail with release.
Interpretation:
- Producer’s payload writes happen-before the consumer sees the advanced head.
- Consumer’s slot release happens-before the producer sees the advanced tail.
If you use relaxed ordering everywhere, the queue may “work in tests” and then fail on CPUs or compilers with weaker ordering guarantees.
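The recipe above can be condensed into a small C++ sketch; SpscRing and its method names are assumptions for illustration, not a drop-in production type:

```cpp
#include <atomic>
#include <cstddef>

// Minimal SPSC ring buffer sketch. head_ is written only by the producer,
// tail_ only by the consumer; counters grow monotonically and are masked.
template <typename T, std::size_t Capacity>  // Capacity must be a power of two
class SpscRing {
    static_assert((Capacity & (Capacity - 1)) == 0, "power-of-two capacity");
public:
    bool try_push(const T& v) {
        const std::size_t h = head_.load(std::memory_order_relaxed);  // own index
        const std::size_t t = tail_.load(std::memory_order_acquire);  // consumer's index
        if (h - t == Capacity) return false;                          // full
        buf_[h & (Capacity - 1)] = v;                                 // payload write
        head_.store(h + 1, std::memory_order_release);                // publish
        return true;
    }
    bool try_pop(T& out) {
        const std::size_t t = tail_.load(std::memory_order_relaxed);  // own index
        const std::size_t h = head_.load(std::memory_order_acquire);  // producer's index
        if (t == h) return false;                                     // empty
        out = buf_[t & (Capacity - 1)];                               // payload read
        tail_.store(t + 1, std::memory_order_release);                // release slot
        return true;
    }
private:
    alignas(64) std::atomic<std::size_t> head_{0};  // producer-owned cache line
    alignas(64) std::atomic<std::size_t> tail_{0};  // consumer-owned cache line
    T buf_[Capacity];
};
```

Each side stores only its own index; the release on the publishing store pairs with the other side’s acquire load, which is exactly the happens-before contract stated above.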
4) MPMC: use per-slot sequence numbers, not blind index racing
For MPMC, simple head/tail ownership disappears.
Practical pattern (Vyukov-style bounded MPMC queue):
- each slot has a sequence number,
- producer reserves position via CAS,
- producer writes payload, then release-stores slot sequence to “ready,”
- consumer CAS-reserves position,
- consumer acquire-loads slot readiness, reads payload,
- consumer release-stores sequence to next cycle state.
Why this scales better than naive global locks:
- avoids one giant critical section,
- encodes slot lifecycle directly,
- separates reservation from publication.
If you need MPMC and cannot prove your protocol on paper, use a battle-tested implementation.
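A hedged C++ sketch of the slot-sequence protocol above, closely following the Vyukov bounded-MPMC design linked in the references (MpmcQueue and its member names are assumptions, and this compresses details a production queue must get right):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Each cell carries a sequence number encoding its lifecycle:
// seq == pos            -> free for the producer claiming position pos
// seq == pos + 1        -> ready for the consumer claiming position pos
// seq == pos + Capacity -> recycled for the next lap
template <typename T, std::size_t Capacity>  // Capacity must be a power of two
class MpmcQueue {
    static_assert((Capacity & (Capacity - 1)) == 0, "power-of-two capacity");
    struct Cell {
        std::atomic<std::size_t> seq;
        T value;
    };
public:
    MpmcQueue() {
        for (std::size_t i = 0; i < Capacity; ++i)
            cells_[i].seq.store(i, std::memory_order_relaxed);
    }
    bool try_push(const T& v) {
        std::size_t pos = enqueue_pos_.load(std::memory_order_relaxed);
        for (;;) {
            Cell& c = cells_[pos & (Capacity - 1)];
            const std::size_t seq = c.seq.load(std::memory_order_acquire);
            const std::intptr_t diff = (std::intptr_t)seq - (std::intptr_t)pos;
            if (diff == 0) {
                // Cell is free: reserve this position via CAS.
                if (enqueue_pos_.compare_exchange_weak(pos, pos + 1,
                        std::memory_order_relaxed)) {
                    c.value = v;
                    c.seq.store(pos + 1, std::memory_order_release);  // mark ready
                    return true;
                }  // CAS failure reloaded pos; retry at new position.
            } else if (diff < 0) {
                return false;  // queue full
            } else {
                pos = enqueue_pos_.load(std::memory_order_relaxed);  // lost race
            }
        }
    }
    bool try_pop(T& out) {
        std::size_t pos = dequeue_pos_.load(std::memory_order_relaxed);
        for (;;) {
            Cell& c = cells_[pos & (Capacity - 1)];
            const std::size_t seq = c.seq.load(std::memory_order_acquire);
            const std::intptr_t diff = (std::intptr_t)seq - (std::intptr_t)(pos + 1);
            if (diff == 0) {
                // Cell is ready: reserve this position via CAS.
                if (dequeue_pos_.compare_exchange_weak(pos, pos + 1,
                        std::memory_order_relaxed)) {
                    out = c.value;
                    c.seq.store(pos + Capacity, std::memory_order_release);  // recycle
                    return true;
                }
            } else if (diff < 0) {
                return false;  // queue empty
            } else {
                pos = dequeue_pos_.load(std::memory_order_relaxed);
            }
        }
    }
private:
    Cell cells_[Capacity];
    alignas(64) std::atomic<std::size_t> enqueue_pos_{0};
    alignas(64) std::atomic<std::size_t> dequeue_pos_{0};
};
```

Note how reservation (the CAS on the position counter) is separate from publication (the release store of the cell sequence), which is the property that lets producers and consumers make progress on different slots concurrently.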
5) Non-negotiable performance hygiene
5.1 Cache-line padding
Pad hot atomics (head, tail, producer/consumer-local counters) to separate cache lines.
Without padding, cross-core ping-pong can dominate latency even when logic is correct.
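One way to sketch the padding (64 bytes is an assumed line size; std::hardware_destructive_interference_size from <new> is the portable C++17 constant where the toolchain provides it):

```cpp
#include <atomic>
#include <cstddef>

// Put each hot atomic on its own cache line so producer and consumer
// updates do not false-share. 64 is an assumed line size for x86-like CPUs.
struct RingIndices {
    alignas(64) std::atomic<std::size_t> head{0};  // producer-owned line
    alignas(64) std::atomic<std::size_t> tail{0};  // consumer-owned line
};
```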
5.2 Power-of-two capacity + masking
Use index & (capacity - 1) instead of modulo, and size capacity for burst tolerance.
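The masking trick is a one-liner; kCapacity and slot_index are illustrative names. It only equals modulo when capacity is a power of two, which is worth asserting at compile time:

```cpp
#include <cstddef>

constexpr std::size_t kCapacity = 1024;
static_assert((kCapacity & (kCapacity - 1)) == 0, "capacity must be a power of two");

inline std::size_t slot_index(std::size_t counter) {
    return counter & (kCapacity - 1);  // single AND instead of a divide
}
```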
5.3 Avoid mixed ownership writes
Each role should mostly write its own counters. Shared write ownership = coherence tax.
5.4 Batch where acceptable
Publishing in micro-batches can reduce coherence traffic, but increases visibility delay. Tie this to SLO, not “looks faster.”
6) Backpressure is part of correctness
Queue-full behavior must be explicit:
- block/spin/yield,
- drop newest,
- drop oldest,
- route to overflow lane,
- escalate upstream throttle.
No policy means hidden data loss or tail-latency explosions.
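A hedged sketch of making the contract explicit; OverflowPolicy, QueueStats, enqueue_with_policy, and TinyQueue are illustrative assumptions, not a standard API:

```cpp
#include <cstddef>

enum class OverflowPolicy { DropNewest, DropOldest };

struct QueueStats {
    std::size_t enqueue_full = 0;    // enqueue fail/full count
    std::size_t dropped_oldest = 0;  // items evicted to make room
};

template <typename Q, typename T>
bool enqueue_with_policy(Q& q, const T& v, OverflowPolicy p, QueueStats& s) {
    if (q.try_push(v)) return true;
    ++s.enqueue_full;                  // loss/stall is always counted
    if (p == OverflowPolicy::DropOldest) {
        T discarded{};
        if (q.try_pop(discarded)) ++s.dropped_oldest;
        return q.try_push(v);          // retry with the freed slot
    }
    return false;                      // DropNewest: caller sees the failure
}

// Tiny single-threaded bounded queue, only to exercise the helper.
struct TinyQueue {
    int data[2]; std::size_t head = 0, tail = 0;
    bool try_push(int v) { if (head - tail == 2) return false; data[head++ & 1] = v; return true; }
    bool try_pop(int& out) { if (tail == head) return false; out = data[tail++ & 1]; return true; }
};
```

The point of the helper is that every full-queue event increments a counter the caller can export, so "hidden data loss" becomes an observable metric instead of a silent drop.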
Track at minimum:
- enqueue fail/full count,
- occupancy high-watermark,
- producer stall time,
- consumer lag histogram.
7) Common footguns
- Publishing the index before the payload write is visible (release/acquire contract broken).
- Using volatile as concurrency control (volatile is not a substitute for atomic ordering).
- One benchmark thread pinned, real workload unpinned (the “fast in lab, unstable in prod” pattern).
- Ignoring NUMA placement (remote memory access can dominate queue cost).
- No overflow policy in incident mode (the queue becomes an unbounded latency amplifier).
8) Minimal rollout checklist
- Start with SPSC reference implementation and correctness tests.
- Add stress tests with randomized producer/consumer pacing.
- Add architecture diversity tests (different CPU families if possible).
- Validate with TSAN/model checks where applicable.
- Measure p50/p95/p99 under realistic contention, not microbench only.
- Add queue observability (depth/fulls/lag) before production.
- Define overflow policy and incident fallback.
- Only then consider MPSC/MPMC upgrades.
References
- C++ memory order reference: https://en.cppreference.com/w/cpp/atomic/memory_order
- LMAX Disruptor (sequencing/ring-buffer design): https://lmax-exchange.github.io/disruptor/
- Linux kernel lockless ring buffer design notes: https://www.kernel.org/doc/Documentation/trace/ring-buffer-design.txt
- Bounded MPMC queue (D. Vyukov): https://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
- Rust atomics and memory ordering (nomicon): https://doc.rust-lang.org/nomicon/atomics.html