Lock-Free Ring Buffer Memory Ordering Playbook (SPSC/MPMC, Practical)

2026-03-19 · software

Why this matters

Ring buffers are the default hot-path queue in low-latency systems (market gateways, telemetry pipelines, streaming engines).

Most production failures are not “algorithm is wrong.” They are:

  1. Ordering/visibility bugs: the index update becomes visible before the payload write.
  2. False sharing: hot counters on one cache line causing cross-core ping-pong.
  3. Missing backpressure policy: queue-full behavior left implicit, so data loss is silent.
Treat ring buffers as a concurrency contract, not just an array with two counters.


1) Topology first: pick the simplest queue that fits

Before touching atomics, lock in producer/consumer topology:

  1. SPSC: one producer, one consumer. Cheapest ordering, simplest correctness proofs.
  2. MPSC / SPMC: contention on one side only; the other side keeps its SPSC discipline.
  3. MPMC: many producers, many consumers; requires per-slot state (see Section 4).

Rule: if architecture allows SPSC lanes per core/shard, do that first.


2) Mental model: ownership vs visibility

Two separate truths:

  1. Ownership: who may mutate which index/state.
  2. Visibility: when another thread is guaranteed to observe those writes.

Typical bug: “index changed” is visible, but payload write is not safely published yet.

Use atomics to publish state transitions (empty -> ready -> consumed), not every field write.


3) SPSC memory-order recipe (the baseline)

Assume:

  1. Exactly one producer thread and one consumer thread.
  2. Power-of-two capacity with monotonically growing head/tail counters.
  3. head is producer-owned and tail is consumer-owned; each is an atomic that the other side only reads.
Producer (enqueue)

  1. Read local head (often relaxed).
  2. Read consumer tail with acquire when checking full.
  3. Write payload into slot.
  4. Publish new head with release.

Consumer (dequeue)

  1. Read local tail (often relaxed).
  2. Read producer head with acquire when checking empty.
  3. Read payload from slot.
  4. Publish new tail with release.

Interpretation:

  1. The producer's release store of head pairs with the consumer's acquire load of head: if the consumer sees the new head, it also sees the payload written before it.
  2. Symmetrically, the consumer's release store of tail tells the producer the slot is safe to reuse.
If you use only relaxed everywhere, it may “work in tests” and fail on different CPUs/compilers.


4) MPMC: use per-slot sequence numbers, not blind index racing

For MPMC, simple head/tail ownership disappears.

Practical pattern (Vyukov-style bounded MPMC queue):

  1. Each slot carries an atomic sequence number encoding its state.
  2. A producer claims a slot by CASing the enqueue position, writes the payload, then release-stores seq = pos + 1.
  3. A consumer claims a slot by CASing the dequeue position, reads the payload, then release-stores seq = pos + capacity.

Why this scales better than naive global locks:

  1. Producers and consumers contend on different counters.
  2. Payload publication is per-slot, so unrelated slots do not serialize each other.
  3. A failed CAS retries locally instead of blocking every participant.

If you need MPMC and cannot prove your protocol on paper, use a battle-tested implementation.


5) Non-negotiable performance hygiene

5.1 Cache-line padding

Pad hot atomics (head, tail, producer/consumer-local counters) to separate cache lines.

Without padding, cross-core ping-pong can dominate latency even when logic is correct.

5.2 Power-of-two capacity + masking

Use index & (capacity - 1) instead of modulo, and size capacity for burst tolerance.

5.3 Avoid mixed ownership writes

Each role should mostly write its own counters. Shared write ownership = coherence tax.

5.4 Batch where acceptable

Publishing in micro-batches can reduce coherence traffic, but increases visibility delay. Tie this to SLO, not “looks faster.”


6) Backpressure is part of correctness

Queue-full behavior must be explicit:

  1. Drop-newest: reject the incoming item and count it.
  2. Drop-oldest: overwrite; acceptable for sampled telemetry, not for transactions.
  3. Block or spin with a bounded wait.
  4. Fail fast to the caller with an error.

No policy means hidden data loss or tail-latency explosions.

Track at minimum:

  1. Queue depth (current and high-water mark).
  2. Full/drop counts per policy.
  3. Consumer lag (publish-to-consume latency).


7) Common footguns

  1. Publishing index before payload write is visible
    (release/acquire contract broken)

  2. Using volatile as concurrency control
    (volatile is not a substitute for atomic ordering)

  3. One benchmark thread pinned, real workload unpinned
    “Fast in lab, unstable in prod” pattern

  4. Ignoring NUMA placement
    remote memory can dominate queue cost

  5. No overflow policy in incident mode
    queue becomes unbounded latency amplifier


8) Minimal rollout checklist

  1. Start with SPSC reference implementation and correctness tests.
  2. Add stress tests with randomized producer/consumer pacing.
  3. Add architecture diversity tests (different CPU families if possible).
  4. Validate with TSAN/model checks where applicable.
  5. Measure p50/p95/p99 under realistic contention, not microbench only.
  6. Add queue observability (depth/fulls/lag) before production.
  7. Define overflow policy and incident fallback.
  8. Only then consider MPSC/MPMC upgrades.
