Lock-Free Ring Buffer Memory Ordering Playbook (SPSC/MPMC, Practical)

2026-03-19 · software

Why this matters

Ring buffers are the default hot-path queue in low-latency systems (market gateways, telemetry pipelines, streaming engines).

Most production failures are not “algorithm is wrong.” They are:

  1. Ordering/visibility bugs: the index update becomes visible before the payload write.
  2. False sharing: hot counters on one cache line causing cross-core ping-pong.
  3. Missing backpressure policy: queue-full behavior left implicit, so data loss is silent.
Treat ring buffers as a concurrency contract, not just an array with two counters.


1) Topology first: pick the simplest queue that fits

Before touching atomics, lock in producer/consumer topology:

  1. SPSC: one producer, one consumer. Cheapest ordering, simplest correctness proofs.
  2. MPSC / SPMC: contention on one side only; the other side keeps its SPSC discipline.
  3. MPMC: many producers, many consumers; requires per-slot state (see Section 4).

Rule: if architecture allows SPSC lanes per core/shard, do that first.


2) Mental model: ownership vs visibility

Two separate truths:

  1. Ownership: who may mutate which index/state.
  2. Visibility: when another thread is guaranteed to observe those writes.

Typical bug: “index changed” is visible, but payload write is not safely published yet.

Use atomics to publish state transitions (empty -> ready -> consumed), not every field write.


3) SPSC memory-order recipe (the baseline)

Assume:

  1. Exactly one producer thread and one consumer thread.
  2. Power-of-two capacity with monotonically growing head/tail counters.
  3. head is producer-owned and tail is consumer-owned; each is an atomic that the other side only reads.
Producer (enqueue)

  1. Read local head (often relaxed).
  2. Read consumer tail with acquire when checking full.
  3. Write payload into slot.
  4. Publish new head with release.

Consumer (dequeue)

  1. Read local tail (often relaxed).
  2. Read producer head with acquire when checking empty.
  3. Read payload from slot.
  4. Publish new tail with release.

Interpretation:

  1. The producer's release store of head pairs with the consumer's acquire load of head: if the consumer sees the new head, it also sees the payload written before it.
  2. Symmetrically, the consumer's release store of tail tells the producer the slot is safe to reuse.
If you use only relaxed everywhere, it may “work in tests” and fail on different CPUs/compilers.


4) MPMC: use per-slot sequence numbers, not blind index racing

For MPMC, simple head/tail ownership disappears.

Practical pattern (Vyukov-style bounded MPMC queue):

  1. Each slot carries an atomic sequence number encoding its state.
  2. A producer claims a slot by CASing the enqueue position, writes the payload, then release-stores seq = pos + 1.
  3. A consumer claims a slot by CASing the dequeue position, reads the payload, then release-stores seq = pos + capacity.

Why this scales better than naive global locks:

  1. Producers and consumers contend on different counters.
  2. Payload publication is per-slot, so unrelated slots do not serialize each other.
  3. A failed CAS retries locally instead of blocking every participant.

If you need MPMC and cannot prove your protocol on paper, use a battle-tested implementation.


5) Non-negotiable performance hygiene

5.1 Cache-line padding

Pad hot atomics (head, tail, producer/consumer-local counters) to separate cache lines.

Without padding, cross-core ping-pong can dominate latency even when logic is correct.

5.2 Power-of-two capacity + masking

Use index & (capacity - 1) instead of modulo, and size capacity for burst tolerance.

5.3 Avoid mixed ownership writes

Each role should mostly write its own counters. Shared write ownership = coherence tax.

5.4 Batch where acceptable

Publishing in micro-batches can reduce coherence traffic, but increases visibility delay. Tie this to SLO, not “looks faster.”


6) Backpressure is part of correctness

Queue-full behavior must be explicit:

  1. Drop-newest: reject the incoming item and count it.
  2. Drop-oldest: overwrite; acceptable for sampled telemetry, not for transactions.
  3. Block or spin with a bounded wait.
  4. Fail fast to the caller with an error.

No policy means hidden data loss or tail-latency explosions.

Track at minimum:

  1. Queue depth (current and high-water mark).
  2. Full/drop counts per policy.
  3. Consumer lag (publish-to-consume latency).


7) Common footguns

  1. Publishing index before payload write is visible
    (release/acquire contract broken)

  2. Using volatile as concurrency control
    (volatile is not a substitute for atomic ordering)

  3. One benchmark thread pinned, real workload unpinned
    “Fast in lab, unstable in prod” pattern

  4. Ignoring NUMA placement
    remote memory can dominate queue cost

  5. No overflow policy in incident mode
    queue becomes unbounded latency amplifier


8) Minimal rollout checklist

  1. Start with SPSC reference implementation and correctness tests.
  2. Add stress tests with randomized producer/consumer pacing.
  3. Add architecture diversity tests (different CPU families if possible).
  4. Validate with TSAN/model checks where applicable.
  5. Measure p50/p95/p99 under realistic contention, not microbench only.
  6. Add queue observability (depth/fulls/lag) before production.
  7. Define overflow policy and incident fallback.
  8. Only then consider MPSC/MPMC upgrades.
