Linux seccomp User Notification Playbook (`SECCOMP_RET_USER_NOTIF`, brokered syscalls, `CONTINUE`, `ADDFD`)

Date: 2026-04-06
Category: knowledge

Why this matters

Classic seccomp is great when the answer is simple:

allow this syscall,
deny it with EPERM / ENOSYS,
or kill / trap / log.

But some production cases are more awkward:

a container should usually be unable to mount(2), except for a very narrow host-approved case;
an unprivileged workload should not mknod(2) freely, but a supervisor may safely materialize a small allowlisted device node on its behalf;
a sandbox wants to broker a restricted openat(2) flow instead of giving the process broad filesystem reach;
you need more context than classic seccomp can see, especially when pointer arguments matter.

That is where seccomp user notification fits.

The core idea:

Let the seccomp filter pause a selected syscall and hand the decision to a userspace supervisor / broker.

This is powerful, but it is also easy to misuse. The kernel documentation is very explicit: user notification is not itself a security policy engine. It is a syscall interception-and-broker mechanism that must sit inside a larger design:

seccomp allow/deny policy,
namespaces / cgroups / capabilities,
LSMs such as AppArmor / SELinux / Landlock where relevant,
and a very careful broker.

If you treat it like a magical authorization layer, you will build something clever and unsafe.

1) Quick mental model

Normal seccomp flow:

task makes syscall,
BPF filter runs,
kernel immediately decides allow / errno / kill / trap / trace / log.

User-notify flow:

task makes syscall,
BPF filter returns SECCOMP_RET_USER_NOTIF for that case,
kernel blocks the target thread in that syscall,
a listener fd receives a notification,
supervisor inspects the syscall and context,
supervisor replies with one of these outcomes:
- spoof success,
- spoof failure (errno),
- tell kernel to continue the original syscall via SECCOMP_USER_NOTIF_FLAG_CONTINUE,
- or inject an fd into the target via SECCOMP_IOCTL_NOTIF_ADDFD and return that.

The right mental model is:

seccomp user notification is a brokered syscall slow path.

That implies three consequences:

use it for narrow, high-value exceptions, not for everything;
design for latency and failure handling, because the target thread is blocked;
assume race conditions matter, especially when reading target memory or using CONTINUE.

2) What it is good for — and what it is not

Good fits

container managers brokering a small set of privileged operations (mount, mknod, etc.)
sandboxed file-open brokers, especially on kernels that support ADDFD (Linux 5.9 and later)
deep inspection before deciding whether a syscall should proceed
host-side emulation / delegation for operations the target cannot perform directly

Bad fits

general syscall authorization for an entire application
hot-path high-frequency syscalls (read, write, futex, common networking fast paths)
broad “open anything through the broker” designs unless you are deliberately building a browser-style or container-style sandbox architecture
security policy by itself without LSMs / namespace / capability boundaries

Kernel docs say this plainly: seccomp filtering reduces kernel attack surface, but it is not a complete sandbox. User notification extends seccomp’s flexibility; it does not change that fact.

3) Feature / kernel-version map to remember

Useful baseline milestones:

Linux 5.0
- SECCOMP_FILTER_FLAG_NEW_LISTENER
- SECCOMP_GET_NOTIF_SIZES
- SECCOMP_IOCTL_NOTIF_RECV
- SECCOMP_IOCTL_NOTIF_SEND
- SECCOMP_IOCTL_NOTIF_ID_VALID
Linux 5.5
- SECCOMP_USER_NOTIF_FLAG_CONTINUE
Linux 5.9
- SECCOMP_IOCTL_NOTIF_ADDFD
Linux 5.14
- SECCOMP_ADDFD_FLAG_SEND
Linux 5.19
- SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV

Operational rule:

check the running kernel, not just headers / build docs;
confirm that the running kernel supports SECCOMP_RET_USER_NOTIF with a SECCOMP_GET_ACTION_AVAIL probe;
always allocate notification structures using SECCOMP_GET_NOTIF_SIZES, not compile-time assumptions.
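The runtime checks above can be sketched as a small probe. This is a minimal sketch, not a definitive implementation: `seccomp_call` and `probe_user_notif` are hypothetical helper names, and the sketch assumes the build headers are recent enough to define `SECCOMP_RET_USER_NOTIF` and `struct seccomp_notif_sizes`.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical thin wrapper: invoke seccomp(2) through syscall(2). */
static int seccomp_call(unsigned int op, unsigned int flags, void *arg)
{
    return (int)syscall(SYS_seccomp, op, flags, arg);
}

/*
 * Returns 0 and fills *sizes if the running kernel reports support for
 * SECCOMP_RET_USER_NOTIF; returns -1 with errno set otherwise.
 */
static int probe_user_notif(struct seccomp_notif_sizes *sizes)
{
    __u32 action = SECCOMP_RET_USER_NOTIF;

    assert(sizes != NULL);
    /* Ask the running kernel, not the headers, whether the action exists. */
    if (seccomp_call(SECCOMP_GET_ACTION_AVAIL, 0, &action) == -1)
        return -1;
    /* Ask the kernel how large its notification structures are. */
    if (seccomp_call(SECCOMP_GET_NOTIF_SIZES, 0, sizes) == -1)
        return -1;
    return 0;
}
```

A supervisor would typically run this once at startup and refuse to enable the brokered path if it fails. Later features such as CONTINUE and ADDFD need their own checks.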

4) Core mechanics

A) Install the filter with a listener

The target uses seccomp filter mode and includes:

SECCOMP_FILTER_FLAG_NEW_LISTENER

This causes a successful install to return a listener file descriptor.

Preconditions still matter:

the task needs no_new_privs=1 or suitable privilege (CAP_SYS_ADMIN in its user namespace),
the filter still needs correct architecture checks,
and a thread can install at most one filter with SECCOMP_FILTER_FLAG_NEW_LISTENER.

B) The filter chooses which syscalls go to the broker

For targeted cases, the BPF program returns:

SECCOMP_RET_USER_NOTIF

Everything else should usually remain ordinary seccomp policy:

ALLOW for normal safe syscalls,
ERRNO / KILL / TRAP for clear denials.

Do not turn user notification into the default outcome for large syscall surfaces unless you intentionally want a very expensive broker design.
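Steps A and B can be sketched together as follows. This is an illustrative sketch under stated assumptions: it targets x86-64 only, routes only mount(2) to the broker, and allows everything else for brevity (a real filter would keep an ordinary allow/deny policy there). The names `notify_filter`, `X32_SYSCALL_BIT`, and `install_notify_filter` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#define X32_SYSCALL_BIT 0x40000000u /* set in nr for x32 ABI syscalls */

static struct sock_filter notify_filter[] = {
    /* [0] load arch; [1] if x86-64 skip the kill at [2] */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* [3] load syscall nr; [4] kill x32 calls, which share this arch */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* [6] mount(2) goes to the broker; everything else is allowed */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_mount, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};
#define NOTIFY_FILTER_LEN (sizeof(notify_filter) / sizeof(notify_filter[0]))

/* Returns the listener fd on success, or -1 with errno set. */
int install_notify_filter(void)
{
    struct sock_fprog prog = {
        .len = (unsigned short)NOTIFY_FILTER_LEN,
        .filter = notify_filter,
    };

    /* Unprivileged installs require no_new_privs. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;
    return (int)syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                        SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}
```

The target would call `install_notify_filter()` before running untrusted code and then hand the returned fd to the supervisor.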

C) Pass the listener fd to the supervisor

The listener fd is useful to the supervisor, not the target itself. Typical transfer paths:

SCM_RIGHTS over a Unix domain socket,
or pidfd_getfd() in suitable designs.

Important nuance from kernel docs:

listener fds correspond to the filter, not a single task;
if the filtered task later forks, notifications from multiple tasks may arrive on the same listener fd;
reads and writes on the listener are synchronized, so multiple readers can share it safely.
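The SCM_RIGHTS transfer path can be sketched like this. It is a minimal sketch assuming a connected Unix domain socket between target and supervisor; `send_fd` and `recv_fd` are hypothetical helper names, and one dummy byte is carried alongside the fd.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

union fd_ctl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align; /* forces correct alignment of buf */
};

/* Send one fd over a connected Unix socket. Returns 0 or -1. */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctl.buf, .msg_controllen = sizeof(ctl.buf),
    };
    struct cmsghdr *cmsg;

    assert(fd >= 0);
    memset(&ctl, 0, sizeof(ctl));
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd (marked close-on-exec). Returns the fd or -1. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctl.buf, .msg_controllen = sizeof(ctl.buf),
    };
    struct cmsghdr *cmsg;
    int fd;

    if (recvmsg(sock, &msg, MSG_CMSG_CLOEXEC) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

After sending, the target would normally close its own copy of the listener fd, since only the supervisor should hold it.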

D) Supervisor waits for notifications

The supervisor should:

call SECCOMP_GET_NOTIF_SIZES,
allocate seccomp_notif / seccomp_notif_resp with kernel-reported sizes,
wait with poll / epoll (the listener fd becomes readable when a notification is pending) or block in SECCOMP_IOCTL_NOTIF_RECV, passing a zeroed buffer.

Each notification includes roughly:

a unique notification ID,
target TID (pid field; this may be 0 if not visible from the listener’s pid namespace),
syscall number / arch / args in seccomp_data.
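The receive side can be sketched as two helpers. This is a sketch under assumptions: `alloc_notif` and `recv_notif` are hypothetical names, and the sizes are taken to come from an earlier SECCOMP_GET_NOTIF_SIZES call. The buffer is sized to the larger of the compile-time struct and the kernel-reported size.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>

/*
 * Allocate a zeroed request buffer large enough for both this build's
 * struct and the size the running kernel reported. Stores the size in *len.
 */
static struct seccomp_notif *alloc_notif(const struct seccomp_notif_sizes *sizes,
                                         size_t *len)
{
    size_t n = sizeof(struct seccomp_notif);

    assert(sizes != NULL && len != NULL);
    if (sizes->seccomp_notif > n)
        n = sizes->seccomp_notif;
    *len = n;
    return calloc(1, n);
}

/*
 * Block until one notification arrives. Returns 0, or -1 with errno set.
 * EINTR and ENOENT (target died or was interrupted before the fetch
 * completed) are treated as "try again".
 */
static int recv_notif(int listener, struct seccomp_notif *req, size_t len)
{
    for (;;) {
        memset(req, 0, len); /* the kernel expects a zeroed buffer */
        if (ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, req) == 0)
            return 0;
        if (errno != EINTR && errno != ENOENT)
            return -1;
    }
}
```

On success, `req->id`, `req->pid`, and `req->data` (syscall number, arch, args) are ready for inspection.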

E) Supervisor inspects and decides

The supervisor can inspect:

syscall number,
raw register arguments,
target memory for pointer arguments (with care),
/proc state and other host-visible metadata,
broker-side policy such as path allowlists, namespace mapping, capability context, mount plan, device allowlist, etc.

F) Supervisor replies

Main reply paths:

spoof success
- resp.error = 0, resp.val = <return value>
spoof failure
- resp.error = a negative errno such as -EPERM, resp.val = 0
continue the original syscall
- set SECCOMP_USER_NOTIF_FLAG_CONTINUE in resp.flags, with resp.error and resp.val both 0
- only when you intentionally want kernel execution after deeper inspection
inject an fd
- via SECCOMP_IOCTL_NOTIF_ADDFD
- optionally with SECCOMP_ADDFD_FLAG_SEND (Linux 5.14 and later), which installs the fd and sends the reply in one atomic step
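The first three reply shapes can be sketched as small helpers that fill `struct seccomp_notif_resp` before SECCOMP_IOCTL_NOTIF_SEND. The helper names are hypothetical; the fallback `#define` only covers builds against headers older than Linux 5.5.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <linux/seccomp.h>

#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0) /* Linux 5.5+ */
#endif

/* Spoof success: the target sees `val` as the syscall return value. */
static void resp_success(struct seccomp_notif_resp *r, __u64 id, __s64 val)
{
    memset(r, 0, sizeof(*r));
    r->id = id;
    r->val = val;
}

/* Spoof failure: pass a positive errno; it is stored negated. */
static void resp_errno(struct seccomp_notif_resp *r, __u64 id, int err)
{
    assert(err > 0);
    memset(r, 0, sizeof(*r));
    r->id = id;
    r->error = -err;
}

/* Let the kernel run the original syscall; error and val stay zero. */
static void resp_continue(struct seccomp_notif_resp *r, __u64 id)
{
    memset(r, 0, sizeof(*r));
    r->id = id;
    r->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}
```

In each case `id` is copied from the received notification so the kernel can match the reply to the blocked syscall.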

5) The three most useful design patterns

Pattern 1: Narrow privileged emulation

Use when:

the target lacks host privilege,
the supervisor can safely perform the action,
and the action volume is low.

Examples:

selected mount(2) cases in container management,
selected mknod(2) cases for a small approved device list.

Good properties:

simple policy boundary,
explicit allowlist,
easy to measure / audit.

Main risks:

doing too much on behalf of the target,
forgetting that host context and target context are not identical.

Pattern 2: Brokered open via `ADDFD`

Use when:

the target should not have direct ambient filesystem access,
but a broker can open a narrow set of files and hand back an fd.

This is where SECCOMP_IOCTL_NOTIF_ADDFD becomes especially valuable.

Why it matters:

pure “pretend open succeeded” is useless if the target actually needs a live fd,
ADDFD lets the supervisor open the resource and install that fd into the target.

This is usually the cleanest way to implement:

path-brokered reads,
narrow runtime asset access,
selective file handoff from a privileged manager.

Operational advice:

prefer allowlisted directory roots plus exact operation classes,
normalize and validate all path policy in the broker,
set O_CLOEXEC on the injected fd (via newfd_flags) unless you deliberately want inheritance across exec,
keep the surface tiny.

Pattern 3: Inspect, then `CONTINUE`

Use when:

the seccomp filter alone lacks enough context,
but after inspection you want the kernel to perform the real syscall.

This is attractive because it avoids incorrect userspace emulation.

But it is also the most dangerous pattern.

Christian Brauner’s write-up and the man page both highlight the same warning:

SECCOMP_USER_NOTIF_FLAG_CONTINUE must be used with extreme caution because of TOCTOU risk.

Why:

you inspect user memory / state,
then the original syscall later continues,
and the target may have changed the relevant inputs or surrounding state.

Use CONTINUE only when:

the inspected inputs are stable enough for your design,
the remaining kernel semantics are exactly what you want,
and you understand the race window.

If you can instead deny or emulate through a narrow broker, that is usually cleaner.

6) Decision matrix

A) “I just need to deny dangerous syscalls.”

Use classic seccomp allowlist / deny behavior.
Do not use user notification.

B) “A container manager should selectively perform host work for a container.”

Use user notification + narrow privileged broker.

C) “I need to inspect a pointer argument before deciding.”

Use user notification, but treat pointer reads and ID validation as first-class design constraints.

D) “The target actually needs a valid file descriptor returned.”

Use ADDFD if kernel support is available.

E) “I want a universal userspace authorization hook for syscalls.”

Usually a bad design.
Re-think the sandbox boundary.

F) “This syscall happens constantly on the hot path.”

Do not broker it unless you truly want to pay the latency and complexity bill.

7) The big footguns

7.1 Treating user notification as a complete security policy

This is the conceptual mistake.

Kernel docs explicitly say seccomp filtering is not a full sandbox and that user notification is not intended as a security policy mechanism by itself.

If the broker is the only thing standing between the target and the host, you need stronger outer walls:

namespaces,
capabilities,
LSM policy,
normal seccomp allow/deny fences,
filesystem / mount design.

Think of user notification as a controlled exception channel, not the whole prison.

7.2 Forgetting architecture / ABI checks

This is old seccomp advice, but it still matters here.

Always validate the syscall architecture in the filter. On x86, x86-64 vs x32 quirks can otherwise produce bypasses or confused policy.

If the filter arch handling is wrong, everything above it becomes fiction.

7.3 Using compile-time structure sizes

Notification structures may evolve. Always query sizes with SECCOMP_GET_NOTIF_SIZES.

If you hardcode sizes, you are betting against the kernel ABI’s documented evolution path.

7.4 Ignoring PID / target-liveness races

A notification includes a target TID, but the target can exit or be interrupted. The kernel provides SECCOMP_IOCTL_NOTIF_ID_VALID for exactly this reason.

Use it whenever your broker needs to:

open /proc/<tid>/mem,
inspect /proc/<tid>/... state,
or do anything where PID reuse or target disappearance would matter.

A practical pattern is:

receive notification,
open the proc resource you need,
validate notification ID,
read target state,
make one atomic-ish policy decision from broker-local copied data,
reply.

And still assume the target may disappear afterward.
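The open, validate, read sequence above can be sketched as one helper. This is a minimal sketch, not a definitive implementation: `notif_id_valid` and `snapshot_target_mem` are hypothetical names, and the second validation after the read is a conservative extra step that guards against the syscall having been interrupted while the copy was in progress.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the notification is still pending, 0 otherwise. */
static int notif_id_valid(int listener, __u64 id)
{
    return ioctl(listener, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0;
}

/*
 * Copy up to len bytes from the target's memory at addr into buf.
 * Returns the number of bytes copied, or -1 if the read failed or the
 * notification was invalidated before or after the copy.
 */
static ssize_t snapshot_target_mem(int listener, const struct seccomp_notif *req,
                                   __u64 addr, void *buf, size_t len)
{
    char path[64];
    ssize_t n;
    int fd;

    assert(buf != NULL);
    if (req->pid == 0) /* target not visible in our pid namespace */
        return -1;
    snprintf(path, sizeof(path), "/proc/%u/mem", req->pid);
    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;
    /* Validate after open: only now is the fd known to be the target's. */
    if (!notif_id_valid(listener, req->id)) {
        close(fd);
        return -1;
    }
    n = pread(fd, buf, len, (off_t)addr);
    close(fd);
    /* Validate again: discard the data if the syscall was interrupted. */
    if (n >= 0 && !notif_id_valid(listener, req->id))
        return -1;
    return n;
}
```

The broker would then decide from `buf` alone and never go back to target memory for the same decision.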

7.5 Reading target memory piecemeal while making policy decisions

Pointer arguments are the sharp edge.

Kernel docs warn about TOCTOU here too. If you need pointer-backed data:

copy what you need into broker memory first,
validate liveness appropriately,
make the policy decision from the copied snapshot,
avoid interleaving “read a little, decide a little, read more” logic.

The more incremental the inspection, the more racy the design.

7.6 Casual use of `CONTINUE`

CONTINUE is tempting because it lets the kernel perform the real syscall after broker inspection.

But if the target can mutate relevant state between inspection and continuation, your broker approved one thing and the kernel executed another.

That is the textbook TOCTOU failure.

If you reach for CONTINUE, ask:

what exact object / state did I inspect?
what can still change before syscall execution completes?
should I instead deny, emulate, or use an fd broker?

Use CONTINUE sparingly and intentionally.

7.7 Building a high-QPS broker without admitting it is a scheduler

Every notified syscall blocks the target thread. That means your broker is now part of:

application latency,
queueing behavior,
backpressure,
fault tolerance,
supervisor restart semantics.

If notification rates are high, you are no longer “just intercepting syscalls”; you are building a scheduler / RPC path on top of the syscall layer.

That can be valid, but then you need:

concurrency design,
latency SLOs,
queue instrumentation,
overload behavior,
broker crash recovery.

7.8 Forgetting that the listener fd can fan in from multiple tasks

A listener is attached to a filter, not a single thread forever. Forked descendants can generate notifications on the same fd.

Consequences:

per-task accounting must be explicit,
worker pools need consistent matching / response logic,
observability should include task identity and broker decision outcome.

7.9 Not planning for signals / interrupted notifications

Both SECCOMP_IOCTL_NOTIF_RECV and SECCOMP_IOCTL_NOTIF_SEND can fail with ENOENT when the target's blocked syscall is interrupted by a signal handler or the target is killed. Treat that invalidation as a normal race, not an impossible bug.

For long-running broker work, kernel docs also describe SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (Linux 5.19 and later). Once the supervisor has received the notification, the target ignores non-fatal signals until the reply arrives, so the syscall is not interrupted underneath the broker. This can matter when the supervisor is doing something slower that is awkward to retry, such as a mount-related operation.

7.10 Assuming a visible PID always exists

Kernel docs note that the notification pid can be 0 if the target task is in a pid namespace not visible from the listener’s pid namespace.

Do not build mandatory logic that assumes a nonzero PID is always available.

8) Minimal safe rollout patterns

Pattern A: Container manager brokering a tiny syscall set

Start with:

a very small allowlist of intercepted syscalls,
a strong default deny path,
simple explicit policy (e.g. approved mount fs types / targets, approved device node classes),
detailed logging of broker decisions.

Ship only after you can answer:

how often does each syscall notify?
what is p50 / p95 / p99 decision latency?
how often are notifications invalidated before the broker replies (target killed or interrupted)?
what happens when the broker restarts?

Pattern B: Brokered file open with `ADDFD`

A good first production use when you want one strong guarantee:

“target may only obtain file descriptors for these resources through the broker.”

Make the policy narrow:

exact directory roots,
explicit open flags/modes,
O_CLOEXEC by default,
clear read-only vs write-capable separation,
audited symlink / path traversal handling.

This is usually easier to reason about than brokering a wide variety of syscall semantics.
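A narrow open policy of this kind can be sketched as a pure function over the broker's snapshot of the path and flags. Everything here is hypothetical (`has_dotdot`, `open_allowed`, the single-root policy, the example root in the usage note). The check is lexical only: it does not resolve symlinks, so a real broker would also need kernel-side resolution limits when it performs the open, for example openat2(2) with RESOLVE_BENEATH on Linux 5.6 and later.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <string.h>

/* True if any '/'-separated component of path is exactly "..". */
static bool has_dotdot(const char *path)
{
    const char *p = path;

    while (*p) {
        const char *slash = strchr(p, '/');
        size_t n = slash ? (size_t)(slash - p) : strlen(p);

        if (n == 2 && p[0] == '.' && p[1] == '.')
            return true;
        if (!slash)
            break;
        p = slash + 1;
    }
    return false;
}

/*
 * Hypothetical policy: read-only opens of absolute paths strictly below
 * `root` (absolute, no trailing slash), with no ".." components and no
 * flags that create or modify files.
 */
static bool open_allowed(const char *root, const char *path, int flags)
{
    size_t rlen = strlen(root);

    assert(rlen > 0 && root[0] == '/' && root[rlen - 1] != '/');
    if ((flags & O_ACCMODE) != O_RDONLY)
        return false;
    if (flags & (O_CREAT | O_TRUNC | O_APPEND))
        return false;
    /* Require "root/" as a prefix so "/srv/assets-evil" cannot match. */
    if (strncmp(path, root, rlen) != 0 || path[rlen] != '/')
        return false;
    return !has_dotdot(path + rlen);
}
```

For example, with root `/srv/assets`, a read-only open of `/srv/assets/a.txt` passes, while `/srv/assets/../etc/passwd`, `/srv/assets-evil/x`, and any write-capable open are refused.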

Pattern C: Deep inspect, mostly deny, rarely continue

If you need CONTINUE, make it the exception rather than the default:

inspect more than plain seccomp can,
deny the suspicious / unsupported majority,
continue only narrow cases you have modeled carefully.

That keeps the TOCTOU budget contained.

9) Observability: what to measure

If you operate a seccomp-notify broker, measure at least:

notification rate by syscall
decision latency (p50, p95, p99)
queue depth / outstanding notifications
decision mix
- spoof-success
- spoof-error
- continue
- addfd
invalidated notification rate (ID_VALID failures, send failures, interrupted targets)
broker restarts / failovers
target-side syscall failure reasons after brokerage
per-tenant / per-container notification concentration

You want early visibility into two failure classes:

policy mistakes — wrong allow / deny decisions;
control-plane pain — the broker itself becoming a latency tax.
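A minimal in-process shape for the outcome mix and latency distribution might look like the sketch below. All names (`outcome`, `broker_metrics`, `bucket_for`, `record`) are hypothetical, and the power-of-two microsecond buckets are an assumption chosen for simplicity; per-syscall and per-tenant breakdowns would add further dimensions.

```c
#include <assert.h>
#include <stdint.h>

enum outcome {
    OUT_SPOOF_SUCCESS,
    OUT_SPOOF_ERROR,
    OUT_CONTINUE,
    OUT_ADDFD,
    OUT_INVALIDATED, /* target gone before the reply landed */
    OUT_COUNT
};

#define LAT_BUCKETS 32

struct broker_metrics {
    uint64_t by_outcome[OUT_COUNT];
    /* Bucket i counts latencies in [2^i, 2^(i+1)) us; bucket 0 also holds 0. */
    uint64_t latency_log2_us[LAT_BUCKETS];
};

/* floor(log2(us)), clamped to the last bucket; 0 and 1 map to bucket 0. */
static unsigned bucket_for(uint64_t us)
{
    unsigned b = 0;

    while (us > 1 && b < LAT_BUCKETS - 1) {
        us >>= 1;
        b++;
    }
    return b;
}

/* Record one finished decision and how long the target was blocked. */
static void record(struct broker_metrics *m, enum outcome o, uint64_t latency_us)
{
    assert(o < OUT_COUNT);
    m->by_outcome[o]++;
    m->latency_log2_us[bucket_for(latency_us)]++;
}
```

Approximate p50 / p95 / p99 values can be read off the histogram by walking the buckets until the cumulative count crosses the desired fraction.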

10) Practical checklist before shipping

Running kernel supports the notification features you plan to use.
Filter validates syscall architecture correctly.
no_new_privs / capability preconditions are understood.
Notification buffers are allocated using SECCOMP_GET_NOTIF_SIZES.
Intercepted syscall set is intentionally tiny.
Broker policy is explicit and allowlist-based.
Pointer-argument reads are snapshot-based, not incremental guesswork.
SECCOMP_IOCTL_NOTIF_ID_VALID is used around /proc/<tid>/... inspection paths.
CONTINUE usage is minimal and TOCTOU-reviewed.
ADDFD path sets sane fd flags (O_CLOEXEC unless intentionally otherwise).
Broker restart / listener-loss behavior is tested.
Metrics and logs exist for rate, latency, invalidation, and outcome mix.
Load tests cover notification bursts and multi-task fan-in.

11) Rule-of-thumb guidance

If a plain seccomp allowlist solves the problem, use that.

If you need a broker, use seccomp user notification only for small, high-leverage exceptions such as:

selective privileged delegation,
narrow file-descriptor brokering,
or extra inspection before a rare decision.

The cleanest production posture is usually:

classic seccomp for the broad safety perimeter, user notification for narrow brokered exceptions.

That keeps the fast path in the kernel, the policy easy to reason about, and the userspace broker small enough that you can actually trust it.

References

Linux man-pages — seccomp_unotify(2)
https://man7.org/linux/man-pages/man2/seccomp_unotify.2.html
Linux man-pages — seccomp(2)
https://man7.org/linux/man-pages/man2/seccomp.2.html
Linux kernel documentation — Seccomp BPF / Userspace Notification
https://www.kernel.org/doc/html/latest/userspace-api/seccomp_filter.html
Christian Brauner — The Seccomp Notifier - New Frontiers in Unprivileged Container Development
https://brauner.io/2020/07/23/seccomp-notify.html
