Linux seccomp User Notification Playbook (`SECCOMP_RET_USER_NOTIF`, brokered syscalls, `CONTINUE`, `ADDFD`)

2026-04-06 · software

Date: 2026-04-06
Category: knowledge

Why this matters

Classic seccomp is great when the answer is simple:

But some production cases are more awkward:

That is where seccomp user notification fits.

The core idea:

Let the seccomp filter pause a selected syscall and hand the decision to a userspace supervisor / broker.

This is powerful, but it is also easy to misuse. The kernel documentation is very explicit: user notification is not itself a security policy engine. It is a syscall interception-and-broker mechanism that must sit inside a larger design:

If you treat it like a magical authorization layer, you will build something clever and unsafe.


1) Quick mental model

Normal seccomp flow:

  1. task makes syscall,
  2. BPF filter runs,
  3. kernel immediately decides allow / errno / kill / trap / trace / log.

User-notify flow:

  1. task makes syscall,
  2. BPF filter returns SECCOMP_RET_USER_NOTIF for that case,
  3. kernel blocks the target thread in that syscall,
  4. a listener fd receives a notification,
  5. supervisor inspects the syscall and context,
  6. supervisor replies with one of these outcomes:
    • spoof success,
    • spoof failure (errno),
    • tell kernel to continue the original syscall via SECCOMP_USER_NOTIF_FLAG_CONTINUE,
    • or inject an fd into the target via SECCOMP_IOCTL_NOTIF_ADDFD and return that.

The right mental model is:

seccomp user notification is a brokered syscall slow path.

That implies three consequences:


2) What it is good for — and what it is not

Good fits

Bad fits

Kernel docs say this plainly: seccomp filtering reduces kernel attack surface, but it is not a complete sandbox. User notification extends seccomp’s flexibility; it does not change that fact.


3) Feature / kernel-version map to remember

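Useful baseline milestones:

  • Linux 5.0: SECCOMP_RET_USER_NOTIF and SECCOMP_FILTER_FLAG_NEW_LISTENER
  • Linux 5.5: SECCOMP_USER_NOTIF_FLAG_CONTINUE
  • Linux 5.9: SECCOMP_IOCTL_NOTIF_ADDFD
  • Linux 5.14: SECCOMP_ADDFD_FLAG_SEND
  • Linux 5.19: SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV
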
Operational rule:


4) Core mechanics

A) Install the filter with a listener

The target uses seccomp filter mode and includes:

  • SECCOMP_FILTER_FLAG_NEW_LISTENER in the flags passed to seccomp(SECCOMP_SET_MODE_FILTER, ...)

This causes a successful install to return a listener file descriptor.
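A minimal sketch of the install step, assuming an x86-64 host and driving the raw syscall through ctypes. The names `make_fprog` and `install_with_listener` are hypothetical, the numeric constants are copied from the Linux uapi headers and should be checked against your own, and `SYS_seccomp = 317` is valid on x86-64 only.

```python
import ctypes
import os

# Assumed values from the Linux uapi headers; verify against your system.
PR_SET_NO_NEW_PRIVS = 38
SYS_seccomp = 317                       # x86-64 only
SECCOMP_SET_MODE_FILTER = 1
SECCOMP_FILTER_FLAG_NEW_LISTENER = 1 << 3


class SockFprog(ctypes.Structure):
    """Mirror of struct sock_fprog: instruction count plus pointer."""
    _fields_ = [("len", ctypes.c_ushort), ("filter", ctypes.c_void_p)]


def make_fprog(prog_bytes):
    """Wrap packed sock_filter bytes; the caller must keep `buf` alive."""
    if not prog_bytes or len(prog_bytes) % 8:
        raise ValueError("program must be a non-empty multiple of 8 bytes")
    buf = (ctypes.c_char * len(prog_bytes)).from_buffer_copy(prog_bytes)
    fprog = SockFprog(len(prog_bytes) // 8, ctypes.addressof(buf))
    return fprog, buf


def install_with_listener(prog_bytes):
    """Install the filter on the calling thread and return the listener fd."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    zero = ctypes.c_ulong(0)
    # Unprivileged callers must set no_new_privs before installing a filter.
    if libc.prctl(PR_SET_NO_NEW_PRIVS, ctypes.c_ulong(1), zero, zero, zero) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_NO_NEW_PRIVS)")
    fprog, buf = make_fprog(prog_bytes)  # buf stays referenced until return
    fd = libc.syscall(ctypes.c_long(SYS_seccomp),
                      ctypes.c_ulong(SECCOMP_SET_MODE_FILTER),
                      ctypes.c_ulong(SECCOMP_FILTER_FLAG_NEW_LISTENER),
                      ctypes.byref(fprog))
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return fd
```

In practice this would run in the target (for example in a child after fork), which then hands the returned fd to the supervisor. Once the filter is active, any notified syscall made by that thread blocks until something answers on the listener.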

Preconditions still matter:

B) The filter chooses which syscalls go to the broker

For targeted cases, the BPF program returns:

  • SECCOMP_RET_USER_NOTIF

Everything else should usually remain ordinary seccomp policy:

Do not turn user notification into the default outcome for large syscall surfaces unless you intentionally want a very expensive broker design.
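To make the "narrow notify set, ordinary policy for the rest" shape concrete, here is a hedged sketch that packs a classic-BPF program as raw bytes. `build_filter` and `insn` are hypothetical helpers, the constants are copied from the uapi headers for x86-64, and the final allow-everything-else return is a simplification: a real filter would put its usual allowlist there.

```python
import struct

# Constants copied from linux/seccomp.h, linux/filter.h and linux/audit.h;
# verify them against the headers on your target system.
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_USER_NOTIF = 0x7FC00000
SECCOMP_RET_ALLOW = 0x7FFF0000
AUDIT_ARCH_X86_64 = 0xC000003E
X32_SYSCALL_BIT = 0x40000000

BPF_LD_W_ABS = 0x20   # BPF_LD | BPF_W | BPF_ABS
BPF_JEQ_K = 0x15      # BPF_JMP | BPF_JEQ | BPF_K
BPF_JGE_K = 0x35      # BPF_JMP | BPF_JGE | BPF_K
BPF_RET_K = 0x06      # BPF_RET | BPF_K

OFF_NR = 0            # offsetof(struct seccomp_data, nr)
OFF_ARCH = 4          # offsetof(struct seccomp_data, arch)


def insn(code, jt, jf, k):
    """Pack one struct sock_filter: u16 code, u8 jt, u8 jf, u32 k."""
    return struct.pack("=HBBI", code, jt, jf, k)


def build_filter(notify_syscalls):
    """Notify the broker for the listed syscall numbers, allow the rest."""
    prog = [
        insn(BPF_LD_W_ABS, 0, 0, OFF_ARCH),
        insn(BPF_JEQ_K, 1, 0, AUDIT_ARCH_X86_64),   # arch matches: skip the kill
        insn(BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),
        insn(BPF_LD_W_ABS, 0, 0, OFF_NR),
        insn(BPF_JGE_K, 0, 1, X32_SYSCALL_BIT),     # x32 number: fall into kill
        insn(BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),
    ]
    for nr in notify_syscalls:
        prog.append(insn(BPF_JEQ_K, 0, 1, nr))      # no match: skip the return
        prog.append(insn(BPF_RET_K, 0, 0, SECCOMP_RET_USER_NOTIF))
    prog.append(insn(BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW))
    return b"".join(prog)
```

For example, `build_filter([165])` would route only `mount` (number 165 on x86-64) to the broker. Each extra notified syscall costs two instructions here, which is fine for a handful of entries and a hint that long lists belong in a different design.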

C) Pass the listener fd to the supervisor

The listener fd is useful to the supervisor, not the target itself. Typical transfer paths:

Important nuance from kernel docs:

D) Supervisor waits for notifications

The supervisor should:

  1. query structure sizes with seccomp(SECCOMP_GET_NOTIF_SIZES, ...),
  2. allocate seccomp_notif / seccomp_notif_resp with kernel-reported sizes, and zero the seccomp_notif buffer before each receive,
  3. use poll / epoll / blocking SECCOMP_IOCTL_NOTIF_RECV to receive events.
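The receive step can be sketched as follows, assuming the struct layouts from linux/seccomp.h as mirrored here and the common Linux ioctl encoding used on x86-64 and arm64. `parse_notification` and `recv_notification` are hypothetical names, and `notif_size` is expected to come from the kernel-reported sizes.

```python
import ctypes
import fcntl


class SeccompData(ctypes.Structure):
    """Mirror of struct seccomp_data."""
    _fields_ = [("nr", ctypes.c_int32),
                ("arch", ctypes.c_uint32),
                ("instruction_pointer", ctypes.c_uint64),
                ("args", ctypes.c_uint64 * 6)]


class SeccompNotif(ctypes.Structure):
    """Mirror of struct seccomp_notif."""
    _fields_ = [("id", ctypes.c_uint64),
                ("pid", ctypes.c_uint32),
                ("flags", ctypes.c_uint32),
                ("data", SeccompData)]


def iowr(magic, nr, size):
    """_IOWR() for the generic Linux ioctl encoding (read and write bits set)."""
    return (3 << 30) | (size << 16) | (ord(magic) << 8) | nr


SECCOMP_IOCTL_NOTIF_RECV = iowr("!", 0, ctypes.sizeof(SeccompNotif))


def parse_notification(buf):
    """Copy the leading bytes of buf into a broker-local SeccompNotif."""
    return SeccompNotif.from_buffer_copy(buf)


def recv_notification(listener_fd, notif_size):
    """Block until one notification arrives on the listener fd."""
    size = max(notif_size, ctypes.sizeof(SeccompNotif))
    buf = bytearray(size)          # the kernel rejects a non-zeroed buffer
    fcntl.ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_RECV, buf, True)
    return parse_notification(buf)
```

A broker that multiplexes several listeners would typically register each fd with `select.epoll` and call `recv_notification` only when the fd is readable.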

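Each notification includes roughly:

  • id: a unique cookie for this notification, echoed back in the reply
  • pid: the thread ID of the target, as seen from the listener's pid namespace
  • flags: currently unused
  • data: a seccomp_data copy with the syscall number, architecture, instruction pointer, and the six raw argument registers
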
E) Supervisor inspects and decides

The supervisor can inspect:

F) Supervisor replies

Main reply paths:

  1. spoof success
    • resp.error = 0, resp.val = <return value>
  2. spoof failure
    • resp.error = a negative errno such as -EPERM, resp.val = 0
  3. continue the original syscall
    • set SECCOMP_USER_NOTIF_FLAG_CONTINUE in resp.flags, with resp.error and resp.val both 0
    • only when you intentionally want kernel execution after deeper inspection
  4. inject an fd
    • via SECCOMP_IOCTL_NOTIF_ADDFD
    • optionally with SECCOMP_ADDFD_FLAG_SEND, which installs the fd and completes the syscall in one atomic step
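The first three reply paths differ only in how struct seccomp_notif_resp is filled in. A small sketch, assuming the 24-byte layout (u64 id, s64 val, s32 error, u32 flags) from linux/seccomp.h; the helper names are hypothetical, and the resulting bytes would be passed to SECCOMP_IOCTL_NOTIF_SEND.

```python
import errno
import struct

SECCOMP_USER_NOTIF_FLAG_CONTINUE = 1 << 0   # from linux/seccomp.h

# struct seccomp_notif_resp: u64 id, s64 val, s32 error, u32 flags
RESP_FMT = "=QqiI"


def spoof_success(notif_id, value):
    """The target sees `value` as the syscall's return value."""
    return struct.pack(RESP_FMT, notif_id, value, 0, 0)


def spoof_failure(notif_id, err):
    """The target sees -1 with errno set to `err` (pass a positive errno)."""
    return struct.pack(RESP_FMT, notif_id, 0, -err, 0)


def continue_syscall(notif_id):
    """Let the kernel run the original syscall; val and error must be 0."""
    return struct.pack(RESP_FMT, notif_id, 0, 0,
                       SECCOMP_USER_NOTIF_FLAG_CONTINUE)
```

For instance, `spoof_failure(notif.id, errno.EPERM)` builds the deny reply for a notification whose id is `notif.id`.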

5) The three most useful design patterns

Pattern 1: Narrow privileged emulation

Use when:

Examples:

Good properties:

Main risks:


Pattern 2: Brokered open via ADDFD

Use when:

This is where SECCOMP_IOCTL_NOTIF_ADDFD becomes especially valuable.
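As a rough illustration, the broker opens the file itself and then asks the kernel to install that fd in the target. The sketch below assumes the struct seccomp_notif_addfd layout and ioctl number from linux/seccomp.h and a kernel new enough for SECCOMP_ADDFD_FLAG_SEND; `build_addfd` and `inject_fd` are hypothetical names.

```python
import fcntl
import os
import struct

# Values copied from linux/seccomp.h; verify against your headers.
SECCOMP_IOCTL_NOTIF_ADDFD = 0x40182103   # _IOW('!', 3, struct seccomp_notif_addfd)
SECCOMP_ADDFD_FLAG_SEND = 1 << 1

# struct seccomp_notif_addfd: u64 id, u32 flags, u32 srcfd, u32 newfd, u32 newfd_flags
ADDFD_FMT = "=QIIII"


def build_addfd(notif_id, src_fd, cloexec=True):
    """Pack an ADDFD request that also completes the syscall (FLAG_SEND)."""
    newfd_flags = os.O_CLOEXEC if cloexec else 0
    # newfd stays 0 because SECCOMP_ADDFD_FLAG_SETFD is not used here:
    # the kernel picks the lowest free fd number in the target.
    return struct.pack(ADDFD_FMT, notif_id, SECCOMP_ADDFD_FLAG_SEND,
                       src_fd, 0, newfd_flags)


def inject_fd(listener_fd, notif_id, src_fd):
    """Install src_fd in the target; returns the fd number the target sees."""
    req = bytearray(build_addfd(notif_id, src_fd))
    # With a mutable buffer, fcntl.ioctl returns the ioctl's integer result.
    return fcntl.ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_ADDFD, req, True)
```

With the SEND flag, no separate response is needed: the target's syscall returns the newly installed fd. The broker should still close its own copy of `src_fd` afterwards.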

Why it matters:

This is usually the cleanest way to implement:

Operational advice:


Pattern 3: Inspect, then CONTINUE

Use when:

This is attractive because it avoids incorrect userspace emulation.

But it is also the most dangerous pattern.

Christian Brauner’s write-up and the man page both highlight the same warning:

SECCOMP_USER_NOTIF_FLAG_CONTINUE must be used with extreme caution because of TOCTOU risk.

Why:

Use CONTINUE only when:

If you can instead deny or emulate through a narrow broker, that is usually cleaner.


6) Decision matrix

A) “I just need to deny dangerous syscalls.”

Use classic seccomp allowlist / deny behavior.
Do not use user notification.

B) “A container manager should selectively perform host work for a container.”

Use user notification + narrow privileged broker.

C) “I need to inspect a pointer argument before deciding.”

Use user notification, but treat pointer reads and ID validation as first-class design constraints.

D) “The target actually needs a valid file descriptor returned.”

Use ADDFD if kernel support is available.

E) “I want a universal userspace authorization hook for syscalls.”

Usually a bad design.
Re-think the sandbox boundary.

F) “This syscall happens constantly on the hot path.”

Do not broker it unless you truly want to pay the latency and complexity bill.


7) The big footguns

7.1 Treating user notification as a complete security policy

This is the conceptual mistake.

Kernel docs explicitly say seccomp filtering is not a full sandbox and that user notification is not intended as a security policy mechanism by itself.

If the broker is the only thing standing between the target and the host, you need stronger outer walls:

Think of user notification as a controlled exception channel, not the whole prison.


7.2 Forgetting architecture / ABI checks

This is old seccomp advice, but it still matters here.

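Always validate the syscall architecture in the filter. On x86-64, the x32 ABI reports the same AUDIT_ARCH_X86_64 value as native 64-bit calls and differs only by __X32_SYSCALL_BIT in the syscall number, so a filter that checks the architecture but ignores that bit can be bypassed or can apply the wrong policy.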

If the filter arch handling is wrong, everything above it becomes fiction.


7.3 Using compile-time structure sizes

Notification structures may evolve. Always query sizes with SECCOMP_GET_NOTIF_SIZES.

If you hardcode sizes, you are betting against the kernel ABI’s documented evolution path.
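A hedged sketch of the size query, again assuming x86-64 (`SYS_seccomp = 317`) and ctypes. `query_sizes` and `alloc_buffers` are hypothetical names; the defaults 80 and 24 are the struct sizes this broker was written against, used only as a lower bound.

```python
import ctypes

SYS_seccomp = 317              # x86-64 syscall number; differs per architecture
SECCOMP_GET_NOTIF_SIZES = 3    # from linux/seccomp.h


class NotifSizes(ctypes.Structure):
    """Mirror of struct seccomp_notif_sizes."""
    _fields_ = [("seccomp_notif", ctypes.c_uint16),
                ("seccomp_notif_resp", ctypes.c_uint16),
                ("seccomp_data", ctypes.c_uint16)]


def query_sizes():
    """Ask the running kernel how large its notification structs are."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    sizes = NotifSizes()
    rc = libc.syscall(ctypes.c_long(SYS_seccomp),
                      ctypes.c_ulong(SECCOMP_GET_NOTIF_SIZES),
                      ctypes.c_ulong(0), ctypes.byref(sizes))
    if rc != 0:
        raise OSError(ctypes.get_errno(), "seccomp(SECCOMP_GET_NOTIF_SIZES)")
    return sizes


def alloc_buffers(sizes, known_notif=80, known_resp=24):
    """Zero-filled buffers large enough for both the kernel and this broker."""
    notif = bytearray(max(sizes.seccomp_notif, known_notif))
    resp = bytearray(max(sizes.seccomp_notif_resp, known_resp))
    return notif, resp
```

Typical use would be `notif_buf, resp_buf = alloc_buffers(query_sizes())` once at broker startup, re-zeroing `notif_buf` before every receive.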


7.4 Ignoring PID / target-liveness races

A notification includes a target TID, but the target can exit or be interrupted, after which that TID may be recycled by an unrelated task. The kernel provides SECCOMP_IOCTL_NOTIF_ID_VALID for exactly this reason.

Use it whenever your broker needs to:

A practical pattern is:

  1. receive notification,
  2. open the proc resource you need,
  3. validate notification ID,
  4. read target state,
  5. make one atomic-ish policy decision from broker-local copied data,
  6. reply.
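Steps 2 through 4 of this pattern can be sketched as below. The ioctl number is copied from linux/seccomp.h; `id_is_valid` and `read_target_bytes` are hypothetical helpers, and the `mem_path` parameter exists only so the function can be exercised without a live target.

```python
import errno
import fcntl
import os
import struct

SECCOMP_IOCTL_NOTIF_ID_VALID = 0x40082102   # _IOW('!', 2, __u64)


def id_is_valid(listener_fd, notif_id):
    """True while the notification is still pending in the kernel."""
    try:
        fcntl.ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_ID_VALID,
                    struct.pack("=Q", notif_id))
        return True
    except OSError as exc:
        if exc.errno == errno.ENOENT:
            return False        # target exited or its syscall was interrupted
        raise


def read_target_bytes(pid, addr, length, still_valid, mem_path=None):
    """Open the target's memory, validate the notification, then copy once.

    `still_valid` is a zero-argument callable, typically a closure over
    id_is_valid(listener_fd, notif.id). Returns None if validation fails.
    """
    path = mem_path or f"/proc/{pid}/mem"
    fd = os.open(path, os.O_RDONLY)
    try:
        # Checking after open() guards against the pid having been recycled
        # between receiving the notification and opening the proc file.
        if not still_valid():
            return None
        return os.pread(fd, length, addr)
    finally:
        os.close(fd)
```

The returned bytes are a broker-local copy, so the policy decision in step 5 should be made from them alone and never from a second read of target memory.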

And still assume the target may disappear afterward.


7.5 Reading target memory piecemeal while making policy decisions

Pointer arguments are the sharp edge.

Kernel docs warn about TOCTOU here too. If you need pointer-backed data:

The more incremental the inspection, the more racy the design.
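One way to keep the inspection non-incremental is to copy the whole argument in a single read and decide only on that copy. A minimal sketch for a NUL-terminated path argument, assuming `mem_fd` is an already-open, already-validated fd for the target's /proc/<pid>/mem; `copy_c_string` is a hypothetical helper and 4096 stands in for PATH_MAX.

```python
import os


def copy_c_string(mem_fd, addr, max_len=4096):
    """Copy a NUL-terminated string from the target in one pread call."""
    data = os.pread(mem_fd, max_len, addr)   # single snapshot of target memory
    end = data.find(b"\0")
    if end < 0:
        # No terminator within max_len: treat as invalid instead of reading on.
        raise ValueError("string not terminated within max_len bytes")
    return data[:end]
```

The broker then applies policy to the returned bytes and, if it acts on the path, uses its own copy instead of letting the kernel re-read the pointer from the target.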


7.6 Casual use of CONTINUE

CONTINUE is tempting because it lets the kernel perform the real syscall after broker inspection.

But if the target can mutate relevant state between inspection and continuation, your broker approved one thing and the kernel executed another.

That is the textbook TOCTOU failure.

If you reach for CONTINUE, ask:

Use CONTINUE sparingly and intentionally.


7.7 Building a high-QPS broker without admitting it is a scheduler

Every notified syscall blocks the target thread. That means your broker is now part of:

If notification rates are high, you are no longer “just intercepting syscalls”; you are building a scheduler / RPC path on top of the syscall layer.

That can be valid, but then you need:


7.8 Forgetting that the listener fd can fan in from multiple tasks

A listener is attached to a filter, not a single thread forever. Forked descendants can generate notifications on the same fd.

Consequences:


7.9 Not planning for signals / interrupted notifications

SECCOMP_IOCTL_NOTIF_RECV / response flow can hit cases where the target syscall is interrupted or the target vanishes. You should expect ENOENT-style invalidation behavior and treat it as a normal race, not an impossible bug.
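In broker code this usually means catching ENOENT on the send path and moving on. A hedged sketch, with the ioctl number copied from linux/seccomp.h; `send_response` is a hypothetical helper, and the injectable `ioctl` parameter is there only to make the race testable.

```python
import errno
import fcntl

SECCOMP_IOCTL_NOTIF_SEND = 0xC0182101   # _IOWR('!', 1, struct seccomp_notif_resp)


def send_response(listener_fd, resp_bytes, ioctl=fcntl.ioctl):
    """Send a packed seccomp_notif_resp; return False if the target is gone."""
    try:
        ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_SEND, resp_bytes)
        return True
    except OSError as exc:
        if exc.errno == errno.ENOENT:
            # Normal race: the target exited or its syscall was interrupted
            # after the notification was received.
            return False
        raise
```

A False result is worth counting in metrics but should not be logged as an error; any side effects the broker already performed on the target's behalf may need to be undone.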

For long-running broker work, kernel docs also describe SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV, which changes how the target handles non-fatal signals after userspace has received the notification. This can matter when the supervisor is doing something slower and retryable, such as a mount-related operation.


7.10 Assuming a visible PID always exists

Kernel docs note that the notification pid can be 0 if the target task is in a pid namespace not visible from the listener’s pid namespace.

Do not build mandatory logic that assumes a nonzero PID is always available.


8) Minimal safe rollout patterns

Pattern A: Container manager brokering a tiny syscall set

Start with:

Ship only after you can answer:


Pattern B: Brokered file open with ADDFD

Good first production use when you want one strong capability:

Make the policy narrow:

This usually scales better conceptually than trying to broker a wide variety of syscall semantics.


Pattern C: Deep inspect, mostly deny, rarely continue

If you need CONTINUE, make it the exception rather than the default:

That keeps the TOCTOU budget contained.


9) Observability: what to measure

If you operate a seccomp-notify broker, measure at least:

You want early visibility into two failure classes:

  1. policy mistakes — wrong allow / deny decisions;
  2. control-plane pain — the broker itself becoming a latency tax.

10) Practical checklist before shipping


11) Rule-of-thumb guidance

If a plain seccomp allowlist solves the problem, use that.

If you need a broker, use seccomp user notification only for small, high-leverage exceptions such as:

The cleanest production posture is usually:

classic seccomp for the broad safety perimeter, user notification for narrow brokered exceptions.

That keeps the fast path in the kernel, the policy easy to reason about, and the userspace broker small enough that you can actually trust it.


References

  1. Linux man-pages — seccomp_unotify(2)
    https://man7.org/linux/man-pages/man2/seccomp_unotify.2.html

  2. Linux man-pages — seccomp(2)
    https://man7.org/linux/man-pages/man2/seccomp.2.html

  3. Linux kernel documentation — Seccomp BPF / Userspace Notification
    https://www.kernel.org/doc/html/latest/userspace-api/seccomp_filter.html

  4. Christian Brauner — The Seccomp Notifier - New Frontiers in Unprivileged Container Development
    https://brauner.io/2020/07/23/seccomp-notify.html