Linux Restartable Sequences (rseq): Per-CPU Fast Path Playbook
Date: 2026-04-11
Category: knowledge
Domain: linux / systems / libc / performance engineering / concurrency
1) Why rseq exists
A lot of hot-path userspace code wants the same thing:
- touch CPU-local state,
- avoid a syscall,
- avoid a global atomic,
- and still not corrupt data if the thread gets preempted or migrated.
That is exactly the niche restartable sequences fill.
rseq lets each thread register a small ABI structure shared with the kernel. User space can then run a tiny assembly critical section that assumes:
- “I stay on this CPU long enough to finish,”
- and “if I do not stay on this CPU, the kernel will abort me before I commit.”
This is useful for things like:
- per-CPU counters,
- allocator freelists,
- per-CPU ring buffers,
- very fast current-CPU lookup,
- other sharded fast paths where a global atomic would be needless tax.
The payoff can be large. In EfficiOS's measurements, rseq-based current-CPU lookup was roughly 20x faster on x86 and 35x faster on ARM than traditional approaches, and Red Hat noted that glibc's rseq-backed sched_getcpu() was slightly faster than the previous vDSO-based route while also solving an AArch64 portability problem.
2) The mental model: optimistic CPU-local work, abort on interruption
The core idea is simple:
- A thread registers a per-thread struct rseq with the kernel.
- Right before entering a critical section, user space points rseq_cs at a descriptor (struct rseq_cs).
- The critical section runs as a short assembly sequence.
- If the thread is preempted, migrated, or receives a signal before the critical section finishes, the kernel redirects execution to the abort handler.
- If the sequence reaches its commit point cleanly, the update is considered successful.
So rseq is not “make arbitrary code atomic.” It is a very specific contract:
- keep the critical section tiny,
- keep the real externally visible mutation at the end,
- and make abort/retry cheap.
That is why the classic shape is:
- read current CPU,
- compute target slot in per-CPU storage,
- do preparatory work,
- perform one tiny commit-store,
- otherwise jump to fallback or retry.
Think of it less like a transaction engine and more like a kernel-assisted “finish on this CPU or start over” primitive.
3) What it is good for
A) Fast sched_getcpu() / CPU-local indexing
This is the easiest win.
The kernel updates CPU information in the shared rseq area, so user space can often find the current CPU without paying syscall overhead. That is why glibc adopted rseq support: even software that never writes its own rseq assembly can benefit indirectly through faster sched_getcpu().
B) Per-CPU counters
Instead of hammering one shared cache line with atomics, each CPU updates its own shard. Aggregation can happen later on a slower path.
Good fit when:
- exact instant global visibility is unnecessary,
- write frequency is high,
- read/aggregate frequency is low or moderate.
C) Allocator fast paths
Per-CPU freelists are one of the canonical use cases. tcmalloc is a real-world example often cited in rseq discussions.
The appeal is obvious:
- local free-list pop/push,
- no global lock on the hot path,
- less cross-core cache-line bouncing.
D) Per-CPU ring buffers / tracing buffers
Tracing and telemetry systems often want CPU-local append paths. rseq is attractive when the “reserve slot then commit” flow can be expressed as a tiny CPU-local critical section.
E) CPU-local data structures in general
Examples:
- per-CPU linked lists,
- per-CPU lock metadata,
- local batching queues,
- NUMA- or CPU-sharded stats/state.
The common theme is always the same:
one thread, one current CPU, one local shard, very short critical section.
4) What it is not good for
rseq is a bad fit when you need:
- multi-step updates across multiple CPUs,
- general-purpose shared-memory atomicity,
- long critical sections,
- code that might block or call into the kernel,
- “all-or-nothing” semantics across a big object graph,
- easy portable implementation in high-level code without arch-specific help.
If your algorithm fundamentally needs:
- compare-and-swap on global shared state,
- a mutex,
- RCU,
- sequence counters,
- lock-free cross-CPU publication,
then rseq is usually not a replacement.
It removes overhead for a very specific kind of CPU-local fast path. It does not repeal the rest of concurrency theory.
5) ABI facts that matter in production
Linux kernel support
The rseq() system call arrived in Linux 4.18.
Only one rseq area per thread
This is the biggest integration gotcha.
The kernel supports only one registered rseq ABI area per thread. That means libraries, runtimes, and applications cannot all freeload independently by each registering their own private area.
This is why glibc coordination matters.
glibc ownership changed the story
The man page notes that glibc handles rseq allocation/registration since 2.35. Red Hat’s explanation is the clearest practical summary:
- once glibc uses rseq internally,
- an application should not assume it can just register its own separate rseq area,
- instead it should cooperate with the libc-managed ABI.
glibc exposes ABI symbols for this coordination:
- __rseq_size
- __rseq_flags
- __rseq_offset
If you are building a library that wants rseq, this matters a lot more than the raw syscall details.
No glibc wrapper for rseq()
The rseq() syscall itself has no glibc wrapper; raw use goes through syscall(2). In practice, that is another reason to prefer glibc-coordinated usage or librseq over open-coding everything yourself.
Structure size and alignment are versioned
struct rseq is extensible. The man page documents that size/alignment can depend on auxiliary vector values such as:
- AT_RSEQ_ALIGN
- AT_RSEQ_FEATURE_SIZE
So if you hand-roll raw rseq support, do not hard-code a simplistic “32 bytes forever” assumption unless you are deliberately targeting the initial ABI subset.
Newer fields exist beyond CPU ID
Modern rseq ABI documentation includes fields such as:
- node_id
- mm_cid
That is a hint that rseq has become more than just “current CPU number plus abortable critical section.”
In particular, mm_cid is interesting: it is a concurrency ID unique within an mm and tends to stay close to zero when concurrency is limited, which can be more memory-efficient than provisioning full nr_possible_cpus-sized sharding for some designs.
Deprecated restart flags
The old no-restart flags (NO_RESTART_ON_PREEMPT, ...SIGNAL, ...MIGRATE) are documented as deprecated since Linux 6.1. Treat them as legacy/debugging details, not as a fresh design surface.
6) The critical-section shape that usually works
A good rseq critical section has this personality:
- tiny,
- CPU-local,
- no syscalls,
- minimal branches,
- no externally visible partial side effects,
- cheap abort path,
- cheap retry path.
Good shape
- read CPU identity once,
- derive pointer to per-CPU slot,
- validate assumptions,
- do preparatory arithmetic,
- finish with one compact store / update.
Bad shape
- call helper code that might hide a syscall,
- touch lots of unrelated memory,
- allocate memory,
- invoke complex runtime machinery,
- perform multiple externally meaningful writes that are hard to roll back,
- assume retries are rare enough to ignore.
The kernel docs and man page both reinforce this: the critical section is an assembly instruction sequence, not a place to get ambitious.
7) The rule people forget: rseq is only CPU-local atomicity
rseq protects you against interruption relative to the current CPU-local sequence. It does not magically solve inter-CPU publication or memory ordering for the rest of your program.
So if your CPU-local fast path later interacts with shared global state, you still need the usual discipline:
- atomics where required,
- memory barriers where required,
- proper aggregation/publish semantics,
- clear ownership boundaries.
A good mental rule is:
rseq is for updating the local shard cheaply; something else still has to define how the rest of the system sees that shard.
That “something else” might be:
- periodic aggregation,
- RCU publication,
- seqcount-style snapshots,
- per-CPU readers with later merge,
- or just a slower locked path.
8) Safety rules that keep rseq from becoming a footgun
A) No syscalls inside the critical section
This one is explicit in the docs.
If a restartable sequence performs a syscall, the process may be terminated with SIGSEGV.
So do not let “tiny helper” abstractions sneak kernel entries into the section.
B) Keep retry logic intentional
An abort is not a freak cosmic event.
It happens on:
- preemption,
- migration,
- signal delivery.
So the fallback path has to be a real part of the design:
- retry the sequence,
- or use a slower safe path,
- or give up gracefully.
“Just loop forever” is not automatically correct.
C) Read CPU identity carefully
The classic guidance is to read CPU identity in a way the compiler cannot silently duplicate or reorder beyond your design intent. The EfficiOS example uses an ACCESS_ONCE-style approach for exactly this reason.
D) Lifetime matters
If memory containing the rseq_cs descriptor is going away, user space must ensure rseq_cs is cleared before reclaiming that memory. Otherwise the kernel may still treat the descriptor as active.
E) Register/unregister per thread
Each thread owns its own registration lifetime. Do not assume “the process enabled rseq” is enough. Thread creation and teardown are part of the feature’s correctness story.
F) Plan for unsupported kernels
The man page is explicit: user space must have a fallback story when rseq() is unavailable.
That usually means:
- use rseq when supported,
- fall back to atomics / locks / a slower sched_getcpu() path when not.
9) Prefer librseq or libc-coordinated use over heroic custom assembly
You can hand-write rseq critical sections. You probably should not unless you truly need to.
Reasons to prefer librseq or a mature libc-coordinated implementation:
- only one rseq ABI exists per thread,
- the ABI evolves,
- the details are architecture-specific,
- compilers, assemblers, and debuggers all add friction,
- correctness bugs are subtle and expensive.
The raw syscall and raw assembly path is best treated like:
- kernel-adjacent systems work,
- allocator/runtime implementation work,
- tracing-library internals,
- or very specialized latency engineering.
For ordinary application code, a library wrapper is the sane default.
10) Where rseq shines operationally
A) High core counts with hot shared counters
If a workload is wasting cycles on cache-line ping-pong from atomics, rseq-style per-CPU sharding can help a lot.
B) Fast-path memory allocation
Allocator hot paths are allergic to locks and shared-line bouncing. This is the kind of environment where rseq can earn its complexity budget.
C) Tracing/telemetry paths that must stay cheap
Per-CPU trace buffers and append-only local telemetry structures are rseq territory.
D) Architectures where old “fast getcpu” tricks are awkward
Red Hat’s AArch64 explanation is a good reminder that sometimes rseq is not just faster; it is the cleanest portable way to provide a cheap current-CPU lookup across architectures.
11) Where rollout pain usually comes from
Tooling friction
Checkpoint/restore, some instrumentation, and other runtime tooling may need explicit rseq awareness. Red Hat called this out in the context of CRIU compatibility.
Runtime/library ownership confusion
If one component assumes it can directly register rseq while glibc already owns the per-thread area, you get ABI collisions instead of performance wins.
Overengineering the critical section
The temptation is always to “just move a little more work into the rseq path.” That is how a clean fast path becomes a fragile one.
Measuring only microbenchmarks
rseq can look beautiful in an isolated benchmark and underdeliver in a real service if:
- abort frequency is high,
- thread migration is noisy,
- the surrounding system still serializes elsewhere,
- aggregation costs dominate,
- or the hot path was not actually atomic-bound in the first place.
12) A practical adoption checklist
Phase 0 — confirm it is the right problem
Use rseq only if the hot path is actually suffering from one of these:
- atomic contention,
- current-CPU lookup overhead,
- cross-core cache-line bouncing,
- shared fast-path lock pressure.
If the bottleneck is somewhere else, rseq is just complexity cosplay.
Phase 1 — pick the right abstraction level
Choose in this order:
- existing runtime/library support,
- librseq,
- custom raw rseq only if absolutely necessary.
Phase 2 — design the slow fallback first
Define what happens when:
- kernel lacks rseq,
- thread aborts frequently,
- tool compatibility forces rseq off,
- or the feature flag is disabled.
A fallback path is not optional.
Phase 3 — benchmark the real win
Measure:
- throughput,
- p95/p99 latency,
- CPU utilization,
- context switches,
- abort / retry rate,
- migration rate,
- and total end-to-end effect.
Do not stop at “counter increment is 7x faster.” Ask whether the whole service got meaningfully better.
Phase 4 — validate thread lifecycle correctness
Test:
- process start,
- main thread init,
- thread create/destroy churn,
- signal-heavy paths,
- CPU affinity changes,
- cgroup/cpuset changes,
- rolling deploy / restart behavior.
Phase 5 — keep an escape hatch
Have a runtime switch or build-time option to disable rseq-backed fast paths if they misbehave in production or collide with tooling.
13) Advanced note: mm_cid is quietly interesting
Most rseq discussions stop at “per-CPU counters.”
The newer ABI field mm_cid deserves more attention.
Why?
Because raw per-CPU sharding scales memory to possible CPUs system-wide, which can be wasteful on big boxes or containers with limited effective concurrency.
mm_cid offers a memory-map-local concurrency ID that stays within CPU range but can remain compact when actual concurrent execution is limited by:
- cpuset constraints,
- affinity,
- fewer active threads than system CPUs.
That suggests a useful design split:
- use cpu_id when you truly need CPU-local identity,
- consider mm_cid when what you really need is a compact per-concurrency-lane shard inside one mm.
It is not a universal replacement, but it is a strong hint that “rseq == only CPU number” is already an outdated mental model.
14) Short version
Use rseq when you need very cheap CPU-local fast paths and are willing to pay some implementation complexity for it.
The five most important truths are:
- rseq is about CPU-local optimistic updates, not general transactions.
- Only one rseq ABI area exists per thread, so libc/runtime coordination matters.
- Critical sections must be tiny, syscall-free, and abort-friendly.
- rseq removes some atomic/lookup overhead, but not the need for normal memory-ordering discipline elsewhere.
- Prefer mature wrappers (librseq, glibc-coordinated use) unless you are building low-level infrastructure.
When the fit is right, rseq is one of those deeply Linux-flavored primitives that feels almost unfair:
- less lock traffic,
- less atomic overhead,
- cheaper CPU-local indexing,
- and faster hot paths without kernel round trips.
When the fit is wrong, it is just fancy assembly wrapped around the wrong bottleneck.
References
- Linux kernel documentation — Restartable Sequences (docs.kernel.org/userspace-api/rseq.html)
- rseq(2) manual page (manpages.opensuse.org/.../librseq-devel/rseq.2.en.html)
- Red Hat Developer — Why we added restartable sequences support to glibc in RHEL 9
- EfficiOS — The 5-year journey to bring restartable sequences to Linux
- GNU C Library manual — Restartable Sequences ABI symbols and coordination notes