Linux Restartable Sequences (rseq): Per-CPU Fast Path Playbook

2026-04-11 · software

Category: knowledge
Domain: linux / systems / libc / performance engineering / concurrency

1) Why rseq exists

A lot of hot-path userspace code wants the same thing: update per-CPU data without atomics, locks, or system calls on the fast path, and without racing the scheduler.

That is exactly the niche restartable sequences fill.

rseq lets each thread register a small ABI structure shared with the kernel. User space can then run a tiny assembly critical section that assumes it stays on the current CPU and completes without interruption; if that assumption breaks, the kernel restarts it.

This is useful for things like per-CPU counters, allocator freelists, per-CPU ring buffers, and cheap current-CPU lookup.

The payoff can be large. EfficiOS reported roughly 20x faster current-CPU lookup on x86 and 35x on ARM compared with the traditional paths in their measurements, and Red Hat noted that glibc’s rseq-backed sched_getcpu() was slightly faster than the previous vDSO-based route while also solving an AArch64 portability problem.


2) The mental model: optimistic CPU-local work, abort on interruption

The core idea is simple:

  1. A thread registers a per-thread struct rseq with the kernel.
  2. Right before entering a critical section, user space points rseq_cs at a descriptor (struct rseq_cs).
  3. The critical section runs as a short assembly sequence.
  4. If the thread is preempted, migrated, or receives a signal before the critical section finishes, the kernel redirects execution to the abort handler.
  5. If the sequence reaches its commit point cleanly, the update is considered successful.

So rseq is not “make arbitrary code atomic.” It is a very specific contract: a short instruction sequence either reaches its single committing store without preemption, migration, or signal delivery, or it is aborted and must be retried.

That is why the classic shape is: read the current CPU id, locate that CPU’s shard, prepare the update, and finish with one committing store.
Think of it less like a transaction engine and more like a kernel-assisted “finish on this CPU or start over” primitive.


3) What it is good for

A) Fast sched_getcpu() / CPU-local indexing

This is the easiest win.

The kernel updates CPU information in the shared rseq area, so user space can often find the current CPU without paying syscall overhead. That is why glibc adopted rseq support: even software that never writes its own rseq assembly can benefit indirectly through faster sched_getcpu().

B) Per-CPU counters

Instead of hammering one shared cache line with atomics, each CPU updates its own shard. Aggregation can happen later on a slower path.

Good fit when writes vastly outnumber reads, an exact total can be computed on a slow path by summing shards, and profiling shows real contention on a shared atomic.

C) Allocator fast paths

Per-CPU freelists are one of the canonical use cases. tcmalloc is a real-world example often cited in rseq discussions.

The appeal is obvious: no locks or shared atomics on the malloc/free fast path, and freelists that stay CPU-local and cache-hot.

D) Per-CPU ring buffers / tracing buffers

Tracing and telemetry systems often want CPU-local append paths. rseq is attractive when the “reserve slot then commit” flow can be expressed as a tiny CPU-local critical section.

E) CPU-local data structures in general

Examples: per-CPU object pools, per-CPU statistics, and per-CPU slot reservation in otherwise lock-free structures.

The common theme is always the same:

one thread, one current CPU, one local shard, very short critical section.


4) What it is not good for

rseq is a bad fit when you need atomicity across CPUs, long or complex critical sections, or any blocking inside the sequence.

If your algorithm fundamentally needs cross-CPU consistency, multi-word transactions, or waiting, then rseq is usually not a replacement.

It removes overhead for a very specific kind of CPU-local fast path. It does not repeal the rest of concurrency theory.


5) ABI facts that matter in production

Linux kernel support

The rseq() system call arrived in Linux 4.18.

Only one rseq area per thread

This is the biggest integration gotcha.

The kernel supports only one registered rseq ABI area per thread. That means libraries, runtimes, and applications cannot each independently register their own private area and expect it to work.

This is why glibc coordination matters.

glibc ownership changed the story

The man page notes that glibc handles rseq allocation/registration since 2.35. Red Hat’s explanation is the clearest practical summary: glibc registers the area once per thread, and everything else discovers and shares that registration instead of making its own.

glibc exposes ABI symbols for this coordination: __rseq_offset (location of the area relative to the thread pointer), __rseq_size, and __rseq_flags.

If you are building a library that wants rseq, this matters a lot more than the raw syscall details.

No glibc wrapper for rseq()

The rseq() syscall itself has no glibc wrapper; raw use goes through syscall(2). In practice, that is another reason to prefer glibc-coordinated usage or librseq over open-coding everything yourself.

Structure size and alignment are versioned

struct rseq is extensible. The man page documents that size/alignment can depend on auxiliary vector values such as AT_RSEQ_FEATURE_SIZE and AT_RSEQ_ALIGN.

So if you hand-roll raw rseq support, do not hard-code a simplistic “32 bytes forever” assumption unless you are deliberately targeting the initial ABI subset.

Newer fields exist beyond CPU ID

Modern rseq ABI documentation includes fields such as node_id (the current NUMA node) and mm_cid (a per-mm concurrency ID).

That is a hint that rseq has become more than just “current CPU number plus abortable critical section.”

In particular, mm_cid is interesting: it is a concurrency ID unique within an mm and tends to stay close to zero when concurrency is limited, which can be more memory-efficient than provisioning full nr_possible_cpus-sized sharding for some designs.

Deprecated restart flags

The old no-restart flags (NO_RESTART_ON_PREEMPT, ...SIGNAL, ...MIGRATE) are documented as deprecated since Linux 6.1. Treat them as legacy/debugging details, not as a fresh design surface.


6) The critical-section shape that usually works

A good rseq critical section has this personality:

Good shape: a handful of instructions, no function calls, no syscalls, exactly one committing store, and an abort handler that just retries or falls back.

Bad shape: loops of unbounded length, calls into helpers, multiple stores that must all appear together, or anything that can block.

The kernel docs and man page both reinforce this: the critical section is an assembly instruction sequence, not a place to get ambitious.


7) The rule people forget: rseq is only CPU-local atomicity

rseq protects you against interruption relative to the current CPU-local sequence. It does not magically solve inter-CPU publication or memory ordering for the rest of your program.

So if your CPU-local fast path later interacts with shared global state, you still need the usual discipline: atomics with the right memory ordering, locks on the slow path, or an explicit publication protocol.

A good mental rule is:

rseq is for updating the local shard cheaply; something else still has to define how the rest of the system sees that shard.

That “something else” might be a release store that publishes the shard, a lock around slow-path aggregation, or an epoch/RCU-style scheme.


8) Safety rules that keep rseq from becoming a footgun

A) No syscalls inside the critical section

This one is explicit in the docs.

If a restartable sequence performs a syscall, the process may be terminated with SIGSEGV.

So do not let “tiny helper” abstractions sneak kernel entries into the section.

B) Keep retry logic intentional

An abort is not a freak cosmic event.

It happens on every preemption, every migration to another CPU, and every signal delivered while the section runs.

So the fallback path has to be a real part of the design: bound the retries, and have an atomic- or lock-based slow path ready for when aborts keep happening.

“Just loop forever” is not automatically correct.

C) Read CPU identity carefully

The classic guidance is to read CPU identity in a way the compiler cannot silently duplicate or reorder beyond your design intent. The EfficiOS example uses an ACCESS_ONCE-style approach for exactly this reason.

D) Lifetime matters

If memory containing the rseq_cs descriptor is going away, user space must ensure rseq_cs is cleared before reclaiming that memory. Otherwise the kernel may still treat the descriptor as active.

E) Register/unregister per thread

Each thread owns its own registration lifetime. Do not assume “the process enabled rseq” is enough. Thread creation and teardown are part of the feature’s correctness story.

F) Plan for unsupported kernels

The man page is explicit: user space must have a fallback story when rseq() is unavailable.

That usually means probing for rseq (or for glibc’s registration) at startup and routing the fast path through atomics, locks, or a plain getcpu syscall when it is absent.


9) Prefer librseq or libc-coordinated use over heroic custom assembly

You can hand-write rseq critical sections. You probably should not unless you truly need to.

Reasons to prefer librseq or a mature libc-coordinated implementation: the per-architecture assembly and abort signatures are already written and tested, the registration dance with glibc is handled for you, and ABI extensions arrive without you re-reading kernel headers.

The raw syscall and raw assembly path is best treated like writing your own allocator or lock: occasionally justified infrastructure work, not an application default.

For ordinary application code, a library wrapper is the sane default.


10) Where rseq shines operationally

A) High core counts with hot shared counters

If a workload is wasting cycles on cache-line ping-pong from atomics, rseq-style per-CPU sharding can help a lot.

B) Fast-path memory allocation

Allocator hot paths are allergic to locks and shared-line bouncing. This is the kind of environment where rseq can earn its complexity budget.

C) Tracing/telemetry paths that must stay cheap

Per-CPU trace buffers and append-only local telemetry structures are rseq territory.

D) Architectures where old “fast getcpu” tricks are awkward

Red Hat’s AArch64 explanation is a good reminder that sometimes rseq is not just faster; it is the cleanest portable way to provide a cheap current-CPU lookup across architectures.


11) Where rollout pain usually comes from

Tooling friction

Checkpoint/restore, some instrumentation, and other runtime tooling may need explicit rseq awareness. Red Hat called this out in the context of CRIU compatibility.

Runtime/library ownership confusion

If one component assumes it can directly register rseq while glibc already owns the per-thread area, you get ABI collisions instead of performance wins.

Overengineering the critical section

The temptation is always to “just move a little more work into the rseq path.” That is how a clean fast path becomes a fragile one.

Measuring only microbenchmarks

rseq can look beautiful in an isolated benchmark and underdeliver in a real service if the contended counter was never the real bottleneck, if aborts become frequent under real scheduling pressure, or if the extra code footprint hurts the instruction cache.


12) A practical adoption checklist

Phase 0 — confirm it is the right problem

Use rseq only if the hot path is actually suffering from one of these: cache-line contention on shared atomics, measurable syscall overhead for current-CPU lookup, or lock contention on work that is naturally CPU-local.

If the bottleneck is somewhere else, rseq is just complexity cosplay.

Phase 1 — pick the right abstraction level

Choose in this order:

  1. existing runtime/library support,
  2. librseq,
  3. custom raw rseq only if absolutely necessary.

Phase 2 — design the slow fallback first

Define what happens when the kernel lacks rseq, when another component already owns the registration, or when abort rates spike.

A fallback path is not optional.

Phase 3 — benchmark the real win

Measure end-to-end effects, not just the micro-operation: service throughput, tail latency, and abort/retry rates under production-like load.

Do not stop at “counter increment is 7x faster.” Ask whether the whole service got meaningfully better.

Phase 4 — validate thread lifecycle correctness

Test thread creation and teardown, fork, signal-heavy workloads, and any checkpoint/restore or instrumentation tooling you rely on.

Phase 5 — keep an escape hatch

Have a runtime switch or build-time option to disable rseq-backed fast paths if they misbehave in production or collide with tooling.


13) Advanced note: mm_cid is quietly interesting

Most rseq discussions stop at “per-CPU counters.” The newer ABI field mm_cid deserves more attention.

Why?

Because raw per-CPU sharding scales memory to possible CPUs system-wide, which can be wasteful on big boxes or containers with limited effective concurrency.

mm_cid offers a memory-map-local concurrency ID that stays within CPU range but can remain compact when actual concurrent execution is limited by thread counts, cgroup CPU quotas, or scheduler affinity.

That suggests a useful design split: index by cpu_id when you truly need CPU locality, and index by mm_cid when you mainly need low-contention sharding with compact memory.

It is not a universal replacement, but it is a strong hint that “rseq == only CPU number” is already an outdated mental model.


14) Short version

Use rseq when you need very cheap CPU-local fast paths and are willing to pay some implementation complexity for it.

The five most important truths are:

  1. rseq is about CPU-local optimistic updates, not general transactions.
  2. Only one rseq ABI area exists per thread, so libc/runtime coordination matters.
  3. Critical sections must be tiny, syscall-free, and abort-friendly.
  4. rseq removes some atomic/lookup overhead, but not the need for normal memory-ordering discipline elsewhere.
  5. Prefer mature wrappers (librseq, glibc-coordinated use) unless you are building low-level infrastructure.

When the fit is right, rseq is one of those deeply Linux-flavored primitives that feels almost unfair: per-CPU updates at nearly plain-store cost, with the kernel quietly enforcing the invariants.

When the fit is wrong, it is just fancy assembly wrapped around the wrong bottleneck.

