Linux seccomp User Notification Playbook (SECCOMP_RET_USER_NOTIF, brokered syscalls, CONTINUE, ADDFD)
Date: 2026-04-06
Category: knowledge
Why this matters
Classic seccomp is great when the answer is simple:
- allow this syscall,
- deny it with EPERM / ENOSYS,
- or kill / trap / log.
But some production cases are more awkward:
- a container should usually be unable to mount(2), except for a very narrow host-approved case;
- an unprivileged workload should not mknod(2) freely, but a supervisor may safely materialize a small allowlisted device node on its behalf;
- a sandbox wants to broker a restricted openat(2) flow instead of giving the process broad filesystem reach;
- you need more context than classic seccomp can see, especially when pointer arguments matter.
That is where seccomp user notification fits.
The core idea:
Let the seccomp filter pause a selected syscall and hand the decision to a userspace supervisor / broker.
This is powerful, but it is also easy to misuse. The kernel documentation is very explicit: user notification is not itself a security policy engine. It is a syscall interception-and-broker mechanism that must sit inside a larger design:
- seccomp allow/deny policy,
- namespaces / cgroups / capabilities,
- LSMs such as AppArmor / SELinux / Landlock where relevant,
- and a very careful broker.
If you treat it like a magical authorization layer, you will build something clever and unsafe.
1) Quick mental model
Normal seccomp flow:
- task makes syscall,
- BPF filter runs,
- kernel immediately decides allow / errno / kill / trap / trace / log.
User-notify flow:
- task makes syscall,
- BPF filter returns SECCOMP_RET_USER_NOTIF for that case,
- kernel blocks the target thread in that syscall,
- a listener fd receives a notification,
- supervisor inspects the syscall and context,
- supervisor replies with one of these outcomes:
  - spoof success,
  - spoof failure (errno),
  - tell kernel to continue the original syscall via SECCOMP_USER_NOTIF_FLAG_CONTINUE,
  - or inject an fd into the target via SECCOMP_IOCTL_NOTIF_ADDFD and return that.
The right mental model is:
seccomp user notification is a brokered syscall slow path.
That implies three consequences:
- use it for narrow, high-value exceptions, not for everything;
- design for latency and failure handling, because the target thread is blocked;
- assume race conditions matter, especially when reading target memory or using CONTINUE.
2) What it is good for — and what it is not
Good fits
- container managers brokering a small set of privileged operations (mount, mknod, etc.)
- sandboxed file-open brokers, especially when later kernels with ADDFD are available
- deep inspection before deciding whether a syscall should proceed
- host-side emulation / delegation for operations the target cannot perform directly
Bad fits
- general syscall authorization for an entire application
- hot-path high-frequency syscalls (read, write, futex, common networking fast paths)
- broad “open anything through the broker” designs unless you are deliberately building a browser-style or container-style sandbox architecture
- security policy by itself without LSMs / namespace / capability boundaries
Kernel docs say this plainly: seccomp filtering reduces kernel attack surface, but it is not a complete sandbox. User notification extends seccomp’s flexibility; it does not change that fact.
3) Feature / kernel-version map to remember
Useful baseline milestones:
- Linux 5.0
  - SECCOMP_FILTER_FLAG_NEW_LISTENER
  - SECCOMP_GET_NOTIF_SIZES
  - SECCOMP_IOCTL_NOTIF_RECV
  - SECCOMP_IOCTL_NOTIF_SEND
  - SECCOMP_IOCTL_NOTIF_ID_VALID
- Linux 5.5
  - SECCOMP_USER_NOTIF_FLAG_CONTINUE
- Linux 5.9
  - SECCOMP_IOCTL_NOTIF_ADDFD
Operational rule:
- check the running kernel, not just headers / build docs;
- for action support, use SECCOMP_GET_ACTION_AVAIL style checks where appropriate;
- always allocate notification structures using SECCOMP_GET_NOTIF_SIZES, not compile-time assumptions.
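As a rough first gate, the version map above can be expressed as a lookup against the running kernel's release string. This is only a sketch: the helper names (parse_release, likely_supported) are made up for illustration, and a version comparison is a heuristic, since vendor kernels backport features and a container runtime may block seccomp(2) outright. Treat a positive answer as "worth probing", then confirm by attempting the real operation.

```python
import re

# Minimum upstream release per feature, taken from the map above.
FEATURE_MIN_VERSION = {
    "SECCOMP_FILTER_FLAG_NEW_LISTENER": (5, 0),
    "SECCOMP_USER_NOTIF_FLAG_CONTINUE": (5, 5),
    "SECCOMP_IOCTL_NOTIF_ADDFD": (5, 9),
}


def parse_release(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def likely_supported(release: str) -> set[str]:
    """Features whose upstream minimum version is <= the given release."""
    version = parse_release(release)
    return {
        name
        for name, minimum in FEATURE_MIN_VERSION.items()
        if version >= minimum
    }
```

On the broker host, likely_supported(os.uname().release) gives the set of features to probe at startup.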
4) Core mechanics
A) Install the filter with a listener
The target uses seccomp filter mode and includes:
SECCOMP_FILTER_FLAG_NEW_LISTENER
This causes a successful install to return a listener file descriptor.
Preconditions still matter:
- the task needs no_new_privs=1 or suitable privilege (CAP_SYS_ADMIN in its namespace),
- the filter still needs correct architecture checks,
- and there can be at most one listening seccomp filter per thread.
B) The filter chooses which syscalls go to the broker
For targeted cases, the BPF program returns:
SECCOMP_RET_USER_NOTIF
Everything else should usually remain ordinary seccomp policy:
- ALLOW for normal safe syscalls,
- ERRNO / KILL / TRAP for clear denials.
Do not turn user notification into the default outcome for large syscall surfaces unless you intentionally want a very expensive broker design.
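A minimal sketch of such a filter, built as raw classic-BPF bytes. Assumptions: x86-64 only; the caller passes x86-64 syscall numbers (for example 165 for mount and 133 for mknod); and everything not listed is allowed purely to keep the example short, where a real filter would use an allowlist with explicit denials. The names insn and build_filter are illustrative, and installing the program with SECCOMP_FILTER_FLAG_NEW_LISTENER is not shown.

```python
import struct

# Classic BPF opcodes (values from linux/bpf_common.h).
BPF_LD_W_ABS = 0x20  # BPF_LD | BPF_W | BPF_ABS
BPF_JEQ_K = 0x15     # BPF_JMP | BPF_JEQ | BPF_K
BPF_JGE_K = 0x35     # BPF_JMP | BPF_JGE | BPF_K
BPF_RET_K = 0x06     # BPF_RET | BPF_K

# seccomp return actions (values from linux/seccomp.h).
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_USER_NOTIF = 0x7FC00000
SECCOMP_RET_ALLOW = 0x7FFF0000

AUDIT_ARCH_X86_64 = 0xC000003E
X32_SYSCALL_BIT = 0x40000000
OFF_NR, OFF_ARCH = 0, 4  # field offsets inside struct seccomp_data


def insn(code: int, jt: int, jf: int, k: int) -> bytes:
    """Pack one struct sock_filter { u16 code; u8 jt; u8 jf; u32 k; }."""
    return struct.pack("=HBBI", code, jt, jf, k)


def build_filter(notify_syscalls: list[int]) -> bytes:
    """Notify for the listed syscalls, kill on foreign ABI, allow the rest."""
    n = len(notify_syscalls)
    if not 0 < n <= 200:
        raise ValueError("jump offsets are 8-bit; keep the notify set small")
    prog = [
        insn(BPF_LD_W_ABS, 0, 0, OFF_ARCH),
        insn(BPF_JEQ_K, 0, n + 4, AUDIT_ARCH_X86_64),  # foreign arch -> KILL
        insn(BPF_LD_W_ABS, 0, 0, OFF_NR),
        insn(BPF_JGE_K, n + 2, 0, X32_SYSCALL_BIT),    # x32 ABI -> KILL
    ]
    for i, nr in enumerate(notify_syscalls):
        prog.append(insn(BPF_JEQ_K, n - i, 0, nr))     # match -> USER_NOTIF
    prog.append(insn(BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW))
    prog.append(insn(BPF_RET_K, 0, 0, SECCOMP_RET_USER_NOTIF))
    prog.append(insn(BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS))
    return b"".join(prog)
```

The architecture and x32 checks come first so that the syscall-number comparisons are never evaluated against a different ABI's numbering.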
C) Pass the listener fd to the supervisor
The listener fd is useful to the supervisor, not the target itself. Typical transfer paths:
- SCM_RIGHTS over a Unix domain socket,
- or pidfd_getfd() in suitable designs.
Important nuance from kernel docs:
- listener fds correspond to the filter, not a single task;
- if the filtered task later forks, notifications from multiple tasks may arrive on the same listener fd;
- reads and writes on the listener are synchronized, so multiple readers can share it safely.
D) Supervisor waits for notifications
The supervisor should:
- call SECCOMP_GET_NOTIF_SIZES,
- allocate seccomp_notif / seccomp_notif_resp with kernel-reported sizes,
- use poll / epoll / blocking SECCOMP_IOCTL_NOTIF_RECV to receive events.
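The wait-and-receive step might look roughly like this. Assumptions to note: the ioctl request numbers are computed with the generic asm-generic encoding used on x86-64 and arm64 (some architectures such as powerpc and mips encode ioctls differently); notif_size is the value already obtained from SECCOMP_GET_NOTIF_SIZES; and wait_and_receive is an illustrative name, not an existing API.

```python
import fcntl
import select

# ioctl encoding from asm-generic/ioctl.h.
_IOC_WRITE, _IOC_READ = 1, 2
SECCOMP_IOC_MAGIC = ord("!")


def _ioc(direction: int, nr: int, size: int) -> int:
    return (direction << 30) | (size << 16) | (SECCOMP_IOC_MAGIC << 8) | nr


# The sizes below are the sizeof() values baked into the UAPI request
# numbers; buffers are still sized from SECCOMP_GET_NOTIF_SIZES at runtime.
SECCOMP_IOCTL_NOTIF_RECV = _ioc(_IOC_READ | _IOC_WRITE, 0, 80)
SECCOMP_IOCTL_NOTIF_SEND = _ioc(_IOC_READ | _IOC_WRITE, 1, 24)
SECCOMP_IOCTL_NOTIF_ID_VALID = _ioc(_IOC_WRITE, 2, 8)
SECCOMP_IOCTL_NOTIF_ADDFD = _ioc(_IOC_WRITE, 3, 24)


def wait_and_receive(listener_fd: int, notif_size: int, timeout_ms: int = 1000):
    """Poll the listener, then fetch one notification; None on timeout."""
    poller = select.poll()
    poller.register(listener_fd, select.POLLIN)
    if not poller.poll(timeout_ms):
        return None
    buf = bytearray(notif_size)  # the kernel expects a zeroed buffer
    fcntl.ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_RECV, buf)
    return bytes(buf)
```

Polling with a timeout rather than blocking in the ioctl gives the supervisor a natural place to check for shutdown or to emit queue-depth metrics.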
Each notification includes roughly:
- a unique notification ID,
- target TID (pid field; this may be 0 if not visible from the listener’s pid namespace),
- syscall number / arch / args in seccomp_data.
E) Supervisor inspects and decides
The supervisor can inspect:
- syscall number,
- raw register arguments,
- target memory for pointer arguments (with care),
- /proc state and other host-visible metadata,
- broker-side policy such as path allowlists, namespace mapping, capability context, mount plan, device allowlist, etc.
F) Supervisor replies
Main reply paths:
- spoof success
  - resp.error = 0, resp.val = <return value>
- spoof failure
  - resp.error = -EPERM style negative errno, resp.val = 0
- continue the original syscall
  - set SECCOMP_USER_NOTIF_FLAG_CONTINUE
  - only when you intentionally want kernel execution after deeper inspection
- inject an fd
  - via SECCOMP_IOCTL_NOTIF_ADDFD
  - optionally atomically with SECCOMP_ADDFD_FLAG_SEND
5) The three most useful design patterns
Pattern 1: Narrow privileged emulation
Use when:
- the target lacks host privilege,
- the supervisor can safely perform the action,
- and the action volume is low.
Examples:
- selected mount(2) cases in container management,
- selected mknod(2) cases for a small approved device list.
Good properties:
- simple policy boundary,
- explicit allowlist,
- easy to measure / audit.
Main risks:
- doing too much on behalf of the target,
- forgetting that host context and target context are not identical.
Pattern 2: Brokered open via ADDFD
Use when:
- the target should not have direct ambient filesystem access,
- but a broker can open a narrow set of files and hand back an fd.
This is where SECCOMP_IOCTL_NOTIF_ADDFD becomes especially valuable.
Why it matters:
- pure “pretend open succeeded” is useless if the target actually needs a live fd,
- ADDFD lets the supervisor open the resource and install that fd into the target.
This is usually the cleanest way to implement:
- path-brokered reads,
- narrow runtime asset access,
- selective file handoff from a privileged manager.
Operational advice:
- prefer allowlisted directory roots plus exact operation classes,
- normalize and validate all path policy in the broker,
- add O_CLOEXEC unless you deliberately want inheritance,
- keep the surface tiny.
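One way the broker-side path check could look, as a sketch under stated assumptions: paths are resolved in the broker's own mount namespace and must be absolute (a relative path from the target means nothing relative to the broker's working directory), and the policy is read-only. resolve_if_allowed and open_for_target are hypothetical helper names. Resolving and then opening leaves a window in which intermediate path components can change, so a production broker would likely prefer openat2(2) with RESOLVE_BENEATH where available.

```python
import os


def resolve_if_allowed(requested: str, allowed_roots: list[str]):
    """Return the canonical path if it sits under an allowed root, else None."""
    if not os.path.isabs(requested):
        return None
    real = os.path.realpath(requested)  # collapses '..' and follows symlinks
    for root in allowed_roots:
        real_root = os.path.realpath(root)
        if os.path.commonpath([real, real_root]) == real_root:
            return real
    return None


def open_for_target(requested: str, allowed_roots: list[str]) -> int:
    """Open an approved path read-only; the caller hands the fd to ADDFD."""
    real = resolve_if_allowed(requested, allowed_roots)
    if real is None:
        raise PermissionError(f"path outside broker allowlist: {requested!r}")
    # O_NOFOLLOW rejects a final component swapped for a symlink after the check.
    return os.open(real, os.O_RDONLY | os.O_CLOEXEC | os.O_NOFOLLOW)
```

Comparing canonical paths with commonpath, rather than a string prefix test, avoids treating /srv/assets-private as if it were inside /srv/assets.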
Pattern 3: Inspect, then CONTINUE
Use when:
- the seccomp filter alone lacks enough context,
- but after inspection you want the kernel to perform the real syscall.
This is attractive because it avoids incorrect userspace emulation.
But it is also the most dangerous pattern.
Christian Brauner’s write-up and the man page both highlight the same warning:
SECCOMP_USER_NOTIF_FLAG_CONTINUE must be used with extreme caution because of TOCTOU risk.
Why:
- you inspect user memory / state,
- then the original syscall later continues,
- and the target may have changed the relevant inputs or surrounding state.
Use CONTINUE only when:
- the inspected inputs are stable enough for your design,
- the remaining kernel semantics are exactly what you want,
- and you understand the race window.
If you can instead deny or emulate through a narrow broker, that is usually cleaner.
6) Decision matrix
A) “I just need to deny dangerous syscalls.”
Use classic seccomp allowlist / deny behavior.
Do not use user notification.
B) “A container manager should selectively perform host work for a container.”
Use user notification + narrow privileged broker.
C) “I need to inspect a pointer argument before deciding.”
Use user notification, but treat pointer reads and ID validation as first-class design constraints.
D) “The target actually needs a valid file descriptor returned.”
Use ADDFD if kernel support is available.
E) “I want a universal userspace authorization hook for syscalls.”
Usually a bad design.
Re-think the sandbox boundary.
F) “This syscall happens constantly on the hot path.”
Do not broker it unless you truly want to pay the latency and complexity bill.
7) The big footguns
7.1 Treating user notification as a complete security policy
This is the conceptual mistake.
Kernel docs explicitly say seccomp filtering is not a full sandbox and that user notification is not intended as a security policy mechanism by itself.
If the broker is the only thing standing between the target and the host, you need stronger outer walls:
- namespaces,
- capabilities,
- LSM policy,
- normal seccomp allow/deny fences,
- filesystem / mount design.
Think of user notification as a controlled exception channel, not the whole prison.
7.2 Forgetting architecture / ABI checks
This is old seccomp advice, but it still matters here.
Always validate the syscall architecture in the filter. On x86, x86-64 vs x32 quirks can otherwise produce bypasses or confused policy.
If the filter arch handling is wrong, everything above it becomes fiction.
7.3 Using compile-time structure sizes
Notification structures may evolve.
Always query sizes with SECCOMP_GET_NOTIF_SIZES.
If you hardcode sizes, you are betting against the kernel ABI’s documented evolution path.
7.4 Ignoring PID / target-liveness races
A notification includes a target TID, but the target can exit or be interrupted.
The kernel provides SECCOMP_IOCTL_NOTIF_ID_VALID for exactly this reason.
Use it whenever your broker needs to:
- open /proc/<tid>/mem,
- inspect /proc/<tid>/... state,
- or do anything where PID reuse or target disappearance would matter.
A practical pattern is:
- receive notification,
- open the proc resource you need,
- validate notification ID,
- read target state,
- make one atomic-ish policy decision from broker-local copied data,
- reply.
And still assume the target may disappear afterward.
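The ordering above can be sketched as a single helper that returns one broker-local snapshot of a NUL-terminated string argument. Assumptions: mem_path is the /proc/<tid>/mem path for the notified task, id_still_valid is a hypothetical callable wrapping SECCOMP_IOCTL_NOTIF_ID_VALID for the current notification, and the second validity check after the read is an extra precaution in this sketch rather than part of the list above.

```python
import os


def read_pointer_snapshot(mem_path, addr, id_still_valid, max_len=4096):
    """Open, validate, read once; return bytes before the first NUL, or None."""
    mem_fd = os.open(mem_path, os.O_RDONLY | os.O_CLOEXEC)
    try:
        # The fd now pins whichever task owns this TID; confirm it is still
        # the task blocked in the notified syscall before trusting it.
        if not id_still_valid():
            return None
        data = os.pread(mem_fd, max_len, addr)  # single read, may be short
        # Re-check: the syscall may have been interrupted during the read.
        if not id_still_valid():
            return None
    finally:
        os.close(mem_fd)
    end = data.find(b"\0")
    return data[:end] if end >= 0 else None
```

A None result covers both a vanished target and an unterminated string; either way the broker should deny or drop the request rather than guess. Even a valid snapshot says nothing about what the target's memory contains a moment later, which is why policy should be decided from the copy.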
7.5 Reading target memory piecemeal while making policy decisions
Pointer arguments are the sharp edge.
Kernel docs warn about TOCTOU here too. If you need pointer-backed data:
- copy what you need into broker memory first,
- validate liveness appropriately,
- make the policy decision from the copied snapshot,
- avoid interleaving “read a little, decide a little, read more” logic.
The more incremental the inspection, the more racy the design.
7.6 Casual use of CONTINUE
CONTINUE is tempting because it lets the kernel perform the real syscall after broker inspection.
But if the target can mutate relevant state between inspection and continuation, your broker approved one thing and the kernel executed another.
That is the textbook TOCTOU failure.
If you reach for CONTINUE, ask:
- what exact object / state did I inspect?
- what can still change before syscall execution completes?
- should I instead deny, emulate, or use an fd broker?
Use CONTINUE sparingly and intentionally.
7.7 Building a high-QPS broker without admitting it is a scheduler
Every notified syscall blocks the target thread. That means your broker is now part of:
- application latency,
- queueing behavior,
- backpressure,
- fault tolerance,
- supervisor restart semantics.
If notification rates are high, you are no longer “just intercepting syscalls”; you are building a scheduler / RPC path on top of the syscall layer.
That can be valid, but then you need:
- concurrency design,
- latency SLOs,
- queue instrumentation,
- overload behavior,
- broker crash recovery.
7.8 Forgetting that the listener fd can fan in from multiple tasks
A listener is attached to a filter, not a single thread forever. Forked descendants can generate notifications on the same fd.
Consequences:
- per-task accounting must be explicit,
- worker pools need consistent matching / response logic,
- observability should include task identity and broker decision outcome.
7.9 Not planning for signals / interrupted notifications
SECCOMP_IOCTL_NOTIF_RECV / response flow can hit cases where the target syscall is interrupted or the target vanishes.
You should expect ENOENT-style invalidation behavior and treat it as a normal race, not an impossible bug.
For long-running broker work, kernel docs also describe SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV, which changes how the target handles non-fatal signals after userspace has received the notification.
This can matter when the supervisor is doing something slower and retryable, such as a mount-related operation.
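Treating invalidation as a normal outcome can be as simple as the following sketch. Here send is a hypothetical callable that performs SECCOMP_IOCTL_NOTIF_SEND and raises OSError on failure (as fcntl.ioctl does), and stats is any mutable mapping used for counters.

```python
import errno


def send_or_note_invalidated(send, response, stats) -> bool:
    """Send a reply; count ENOENT as an expected race instead of a failure."""
    try:
        send(response)
    except OSError as exc:
        if exc.errno == errno.ENOENT:
            # Target was killed or its syscall was interrupted: nothing is
            # waiting for this reply any more.
            stats["invalidated"] = stats.get("invalidated", 0) + 1
            return False
        raise  # anything else (EINVAL, EBADF, ...) is a real broker bug
    stats["sent"] = stats.get("sent", 0) + 1
    return True
```

Any side effects the broker already performed for an invalidated notification (a mount, a created device node) may need to be rolled back, since the target never observed the result.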
7.10 Assuming a visible PID always exists
Kernel docs note that the notification pid can be 0 if the target task is in a pid namespace not visible from the listener’s pid namespace.
Do not build mandatory logic that assumes a nonzero PID is always available.
8) Minimal safe rollout patterns
Pattern A: Container manager brokering a tiny syscall set
Start with:
- a very small allowlist of intercepted syscalls,
- a strong default deny path,
- simple explicit policy (e.g. approved mount fs types / targets, approved device node classes),
- detailed logging of broker decisions.
Ship only after you can answer:
- how often does each syscall notify?
- what is p50 / p95 / p99 decision latency?
- how often do notifications expire or get invalidated?
- what happens when the broker restarts?
Pattern B: Brokered file open with ADDFD
Good first production use when you want one strong capability:
- “target may only obtain file descriptors for these resources through the broker.”
Make the policy narrow:
- exact directory roots,
- explicit open flags/modes,
- O_CLOEXEC by default,
- clear read-only vs write-capable separation,
- audited symlink / path traversal handling.
This usually scales better conceptually than trying to broker a wide variety of syscall semantics.
Pattern C: Deep inspect, mostly deny, rarely continue
If you need CONTINUE, make it the exception rather than the default:
- inspect more than plain seccomp can,
- deny the suspicious / unsupported majority,
- continue only narrow cases you have modeled carefully.
That keeps the TOCTOU budget contained.
9) Observability: what to measure
If you operate a seccomp-notify broker, measure at least:
- notification rate by syscall
- decision latency (p50, p95, p99)
- queue depth / outstanding notifications
- decision mix
  - spoof-success
  - spoof-error
  - continue
  - addfd
- invalid / expired notification rate (ID_VALID failures, send failures, interrupted targets)
- broker restarts / failovers
- target-side syscall failure reasons after brokerage
- per-tenant / per-container notification concentration
You want early visibility into two failure classes:
- policy mistakes — wrong allow / deny decisions;
- control-plane pain — the broker itself becoming a latency tax.
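A minimal in-process sketch of the rate, outcome-mix, and latency measurements listed above. BrokerMetrics is an illustrative class, not an existing library; it keeps every latency sample in memory and uses nearest-rank percentiles, which is fine for a test harness but would be replaced by a histogram in a long-running broker.

```python
import math
from collections import Counter


class BrokerMetrics:
    def __init__(self):
        self.by_syscall = Counter()  # notification count per syscall
        self.outcomes = Counter()    # spoof-success / spoof-error / continue / addfd
        self.latencies_ms = []       # receive-to-reply time per decision

    def record(self, syscall: str, outcome: str, latency_ms: float) -> None:
        self.by_syscall[syscall] += 1
        self.outcomes[outcome] += 1
        self.latencies_ms.append(latency_ms)

    def percentile(self, p: int):
        """Nearest-rank percentile of decision latency, or None if empty."""
        if not self.latencies_ms:
            return None
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]
```

Calling record once per reply, with the latency measured from the receive to the send, is enough to answer the rollout questions about per-syscall rate and p50 / p95 / p99 decision time.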
10) Practical checklist before shipping
- Running kernel supports the notification features you plan to use.
- Filter validates syscall architecture correctly.
- no_new_privs / capability preconditions are understood.
- Notification buffers are allocated using SECCOMP_GET_NOTIF_SIZES.
- Intercepted syscall set is intentionally tiny.
- Broker policy is explicit and allowlist-based.
- Pointer-argument reads are snapshot-based, not incremental guesswork.
- SECCOMP_IOCTL_NOTIF_ID_VALID is used around /proc/<tid>/... inspection paths.
- CONTINUE usage is minimal and TOCTOU-reviewed.
- ADDFD path sets sane fd flags (O_CLOEXEC unless intentionally otherwise).
- Broker restart / listener-loss behavior is tested.
- Metrics and logs exist for rate, latency, invalidation, and outcome mix.
- Load tests cover notification bursts and multi-task fan-in.
11) Rule-of-thumb guidance
If a plain seccomp allowlist solves the problem, use that.
If you need a broker, use seccomp user notification only for small, high-leverage exceptions such as:
- selective privileged delegation,
- narrow file-descriptor brokering,
- or extra inspection before a rare decision.
The cleanest production posture is usually:
classic seccomp for the broad safety perimeter, user notification for narrow brokered exceptions.
That keeps the fast path in the kernel, the policy easy to reason about, and the userspace broker small enough that you can actually trust it.
References
- Linux man-pages, seccomp_unotify(2)
  https://man7.org/linux/man-pages/man2/seccomp_unotify.2.html
- Linux man-pages, seccomp(2)
  https://man7.org/linux/man-pages/man2/seccomp.2.html
- Linux kernel documentation, Seccomp BPF / Userspace Notification
  https://www.kernel.org/doc/html/latest/userspace-api/seccomp_filter.html
- Christian Brauner, The Seccomp Notifier - New Frontiers in Unprivileged Container Development
  https://brauner.io/2020/07/23/seccomp-notify.html