systemd-oomd Playbook (PSI, cgroup v2, memory protection, and pre-kernel OOM control)
Date: 2026-04-06
Category: knowledge
Why this matters
A lot of Linux memory incidents do not present as a clean, immediate OOM kill. They usually unfold like this instead:
- the box enters reclaim hell,
- swap churn explodes,
- page cache and anonymous memory fight each other,
- user-facing latency turns to sludge,
- the kernel OOM killer fires late and somewhat opaquely,
- by the time a process dies, the machine has already felt half-dead for too long.
systemd-oomd exists to act earlier than the kernel OOM killer by using:
- cgroup v2 hierarchy,
- PSI (Pressure Stall Information),
- swap pressure,
- and systemd unit boundaries.
That combination matters because it lets you kill at the workload / cgroup level instead of hoping a last-second kernel OOM will pick the right victim.
The right mental model is:
systemd-oomd is not a "memory limit" feature. It is a pre-collapse policy engine for choosing which workload should die before the whole machine becomes useless.
1) Quick mental model
There are two major trigger families:
Memory-pressure kill path
Watch a monitored cgroup's memory PSI. If pressure stays above a configured limit for long enough, systemd-oomd kills an eligible descendant cgroup.
Swap-exhaustion kill path
Watch system-wide memory+swap usage. If both are above the configured threshold, systemd-oomd kills eligible descendant cgroups with meaningful swap usage, starting from the biggest swap users.
The key distinction:
- memory pressure mode is about stall / reclaim pain,
- swap mode is about global survival before total exhaustion.
Think of it like this:
- memory.high shapes slowdown,
- systemd-oomd chooses when slowdown has become unacceptable,
- the kernel OOM killer is the last-resort crash barrier.
2) Requirements you should verify first
systemd-oomd is only useful when the host is set up correctly.
Required / strongly recommended
- full unified cgroup v2 hierarchy
- PSI support in the kernel (Linux 4.20+)
- memory accounting enabled for monitored units
- reasonable cgroup boundaries between workloads
The easiest way to avoid a silent misconfiguration is to verify:
stat -fc %T /sys/fs/cgroup
# expect: cgroup2fs
cat /proc/pressure/memory
cat /proc/pressure/io
cat /proc/pressure/cpu
For systemd-managed hosts, make sure memory accounting is on:
# /etc/systemd/system.conf (conceptually)
DefaultMemoryAccounting=yes
Operational truth:
- If everything important runs in one giant cgroup, systemd-oomd cannot make good decisions.
- If memory accounting is off, monitored units may not behave as expected.
- If PSI is missing, the memory-pressure path is dead on arrival.
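The checks above can be wrapped into one helper. A minimal sketch (the function name and pass/fail strings are my own, not part of systemd; inputs are passed in explicitly so the logic is testable anywhere):

```shell
#!/bin/sh
# Sketch: decide whether a host meets the basic systemd-oomd prerequisites.
# On a real host, feed it `stat -fc %T /sys/fs/cgroup` and a check for
# /proc/pressure/memory.
check_oomd_prereqs() {
  fstype="$1"        # expected: cgroup2fs
  psi_present="$2"   # "yes" if /proc/pressure/memory exists
  if [ "$fstype" != "cgroup2fs" ]; then
    echo "fail: unified cgroup v2 hierarchy required"
    return 1
  fi
  if [ "$psi_present" != "yes" ]; then
    echo "fail: kernel PSI support missing"
    return 1
  fi
  echo "ok"
}
```

Usage on a live host: `check_oomd_prereqs "$(stat -fc %T /sys/fs/cgroup)" "$([ -e /proc/pressure/memory ] && echo yes || echo no)"`.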
3) How kill selection actually works
This is the part people often misunderstand.
When a monitored unit crosses a threshold, systemd-oomd does not simply kill that unit itself.
It looks for eligible descendant cgroups under the monitored unit.
Important rules from the upstream behavior:
- only descendant cgroups are kill candidates,
- the monitored unit itself is not picked as the victim; it only becomes a candidate when an ancestor above it is the monitor,
- only leaf cgroups and cgroups with memory.oom.group=1 are eligible kill targets.
That means tree design matters.
Bad shape:
- system.slice monitored,
- one giant service with every worker in the same non-leaf bucket,
- no clean descendants,
- unclear victim choice.
Better shape:
- parent slice or service monitors pressure,
- children represent meaningful kill domains,
- each kill domain is either a leaf or explicitly grouped with memory.oom.group=1 semantics.
In plain English:
systemd-oomd works best when your cgroup tree already matches your operational blast-radius boundaries.
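One way to reach the "better shape" is to give each worker its own unit under a shared slice. A hypothetical sketch (worker@.service, jobs.slice, and the binary path are made-up names, not anything systemd ships):

```ini
# worker@.service (hypothetical): each instance becomes its own leaf cgroup
# under jobs.slice, so systemd-oomd can kill one worker instead of the lot.
[Unit]
Description=Batch worker %i

[Service]
Slice=jobs.slice
ExecStart=/usr/local/bin/worker %i
MemoryAccounting=yes
```

With jobs.slice monitored for pressure, each worker@N.service instance is then a clean leaf kill candidate.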
4) The main knobs
Per-unit knobs
These live on units / slices you want monitored.
- ManagedOOMMemoryPressure=
  Enable pressure-based action, usually kill.
- ManagedOOMMemoryPressureLimit=
  Pressure threshold for that unit.
- ManagedOOMMemoryPressureDurationSec=
  How long the pressure must stay above threshold.
- ManagedOOMSwap=
  Enable swap-based action, usually kill.
- ManagedOOMPreference=
  Candidate preference: avoid or omit for important workloads.
Global knobs (oomd.conf)
- SwapUsedLimit=
  Global swap-usage threshold for swap-based actions. Default: 90%.
- DefaultMemoryPressureLimit=
  Default memory PSI threshold if units do not override it. Default: 60%.
- DefaultMemoryPressureDurationSec=
  How long pressure must exceed threshold before action. Default: 30s.
Operational interpretation:
- lower threshold = earlier kills,
- longer duration = more tolerance for bursts,
- avoid means “de-prioritize this as a victim,” omit means “do not consider this as a victim at all.”
Use avoid/omit sparingly. If everything is critical, nothing is.
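To build intuition for how limit and duration interact, here is a toy model (my own sketch, not the daemon's actual implementation): the action only "fires" when every pressure sample in the window exceeds the limit, so any dip below resets the decision.

```shell
#!/bin/sh
# Toy model of the pressure trigger: given a limit and one sample per
# polling interval, report "yes" only if every sample exceeds the limit.
should_kill() {
  limit="$1"; shift
  for sample in "$@"; do
    # awk handles the floating-point comparison portably
    if ! awk -v s="$sample" -v l="$limit" 'BEGIN { exit !(s > l) }'; then
      echo "no"
      return 1
    fi
  done
  echo "yes"
}
```

Lengthening the duration is equivalent to requiring more consecutive samples above the limit, which is why it is the gentler knob for taming bursty workloads.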
5) Decision matrix
A) Desktop / workstation freezes under browser + IDE + VM pressure
- enable swap-based protection at -.slice,
- enable memory-pressure protection in user slices,
- keep apps separated into meaningful user-service/session scopes.
This is the classic case where systemd-oomd shines.
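As a sketch of that setup, a drop-in on the per-user service manager (the path and the 50% figure are illustrative assumptions, close to what some distributions ship as defaults):

```ini
# /etc/systemd/system/user@.service.d/oomd.conf (illustrative)
[Service]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=50%
```

Apps then need to live in separate scopes/services under the session for the victim choice to be meaningful.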
B) Server host has foreground API + background ETL / compaction / indexing
- put the background jobs in separate slices/scopes,
- use memory-pressure kill policy on the parent slice,
- use cgroup memory controls (memory.high, memory.low) first,
- let systemd-oomd be the “we are now beyond graceful reclaim” enforcement layer.
C) Multi-tenant box where one tenant can consume massive swap
- set ManagedOOMSwap=kill high in the tree,
- keep descendants cleanly separated,
- watch swap users and journal events carefully.
D) Latency-sensitive system with no swap
- pressure mode can still work,
- but reaction time is tighter and failure is steeper,
- thresholds often need more conservative tuning,
- expect less forgiveness before user-visible collapse.
Upstream guidance is clear here: swap is strongly recommended because it buys time for systemd-oomd to react before total livelock.
6) Minimal safe rollout patterns
Pattern 1: Start with swap-based protection at the root slice
This is the cleanest broad safety net.
# /etc/systemd/system/-.slice.d/oomd.conf
[Slice]
ManagedOOMSwap=kill
And in oomd.conf:
[OOM]
SwapUsedLimit=90%
Why this works:
- it watches system-wide swap exhaustion,
- it gives systemd-oomd room to intervene before hard collapse,
- it does not require perfect per-service PSI tuning on day one.
Pattern 2: Add pressure-based monitoring to noisy slices
For example, on a service slice containing mixed workers:
[Slice]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=50%
ManagedOOMMemoryPressureDurationSec=20s
This says:
- if this slice spends too much time stalled on memory for long enough,
- choose an eligible descendant and kill it.
Pattern 3: Protect truly critical units with preference, not immunity everywhere
[Service]
ManagedOOMPreference=avoid
Use omit only for very small, genuinely essential services.
If you overuse omit, you force systemd-oomd to kill less appropriate victims later and harder.
7) How to think about PSI in this context
systemd-oomd uses memory PSI, not just raw RSS.
That is a huge conceptual upgrade.
A workload can be “only” using moderate memory but still cause:
- constant reclaim,
- page fault churn,
- swap-in latency,
- stalled allocators,
- system-wide performance collapse.
PSI sees the wasted time. That is why it aligns much better with real user pain than a simple memory-used number.
For systemd-oomd, the relevant framing is:
- high memory PSI = the workload tree is spending too much time not making progress,
- sustained high PSI = reclaim is no longer a temporary event but a steady-state failure mode.
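The PSI file has two lines, some and full, each carrying rolling averages. A small sketch (the helper name is mine) for extracting the 10-second full average from that format:

```shell
#!/bin/sh
# Extract avg10 from the "full" line of /proc/pressure/memory-style input.
# ("full" means all non-idle tasks were stalled on memory simultaneously.)
psi_full_avg10() {
  awk '$1 == "full" {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^avg10=/) { sub(/^avg10=/, "", $i); print $i }
  }'
}
```

Usage on a live host: `psi_full_avg10 < /proc/pressure/memory`, or point it at a cgroup's memory.pressure file.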
This also explains why systemd-oomd tends to pair well with cgroup memory protections:
- memory.low/memory.min say who deserves protection,
- reclaim behavior reflects those protections,
- PSI captures the resulting stall cost,
- systemd-oomd kills based on the collapse pattern rather than just largest-bytes-wins.
Fedora's rollout notes make this explicit: pressure-based selection better reflects memory protection policy than raw usage does.
8) Practical architecture guidance
Good cgroup shapes
Use separate slices/scopes for:
- foreground API vs background batch,
- interactive desktop apps vs large build/test jobs,
- per-tenant workers,
- transient CLI experiments launched via systemd-run --scope.
Bad cgroup shapes
Avoid:
- everything under one huge session scope,
- one service owning unrelated children with no meaningful kill domains,
- “critical” labels on half the host,
- unbounded helper processes escaping service accounting.
If you want good systemd-oomd behavior, the cgroup tree itself must already express:
- who competes together,
- who can die together,
- who should be spared if possible.
9) A sensible tuning order
Do not start by making systemd-oomd aggressive.
Use this order instead:
1. Fix the tree
   Separate meaningful workloads into slices/scopes.
2. Apply cgroup memory policy
   Use memory.low, memory.high, memory.max, and explicit swap policy first.
3. Enable swap-based protection
   Use it as the host-wide guardrail.
4. Enable pressure-based kills on specific slices
   Start with noisy, mixed-priority areas rather than the whole machine.
5. Tune durations before thresholds
   If kills are too eager, lengthen duration before radically raising limits.
6. Use preferences sparingly
   avoid is often enough. omit should be rare.
This keeps systemd-oomd from becoming a blunt instrument.
10) Observability: what to watch
The first tool to know is:
oomctl
Use it to inspect monitored cgroups and pressure state.
Also watch:
journalctl -u systemd-oomd
cat /proc/pressure/memory
cat /proc/pressure/io
cat /sys/fs/cgroup/<path>/memory.pressure
cat /sys/fs/cgroup/<path>/memory.events
cat /sys/fs/cgroup/<path>/memory.current
cat /sys/fs/cgroup/<path>/memory.swap.current
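Of these, memory.events is what tells you whether the kernel-side OOM killer has also been firing. A sketch (function name is mine) for pulling that counter out of memory.events-format input:

```shell
#!/bin/sh
# Print the oom_kill counter from memory.events-style input
# (fields: low, high, max, oom, oom_kill; newer kernels add oom_group_kill).
oom_kill_count() {
  awk '$1 == "oom_kill" { print $2 }'
}
```

Usage: `oom_kill_count < /sys/fs/cgroup/<path>/memory.events`. A rising oom_kill count on a monitored tree means systemd-oomd is acting too late or not at all.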
Minimum signal set for real operations:
- per-cgroup memory.pressure
- per-cgroup memory.events
- swap usage by cgroup when relevant
- journalctl -u systemd-oomd
- request latency / queue depth / timeouts for user-facing services
What you want to learn from an event:
- which monitored parent crossed threshold,
- which descendant was chosen,
- whether the victim had been reclaim-hot for a long time,
- whether memory protections were working as intended,
- whether swap exhaustion or PSI was the true trigger.
11) Common mistakes
1) Treating systemd-oomd as a substitute for memory controls
It is not.
If memory.high, memory.low, and swap policy are nonsense, systemd-oomd inherits nonsense.
2) Running without swap and expecting graceful behavior
Pressure mode can still help, but the machine reaches unusable states faster. Swap often provides the reaction window userspace OOM control needs.
3) Monitoring a parent with no meaningful children
Then the kill domain is poorly defined, and outcomes become surprising.
4) Marking too many services omit
That just pushes death onto less appropriate victims.
Reserve omit for tiny, high-importance control-plane pieces.
5) Ignoring leaf / memory.oom.group=1 eligibility rules
If you do not shape the tree around killable units, systemd-oomd cannot operate cleanly.
6) Tuning by memory bytes only
Pressure-based systems should be tuned by:
- PSI,
- reclaim behavior,
- user latency,
- and survival behavior after a kill.
12) Example rollout for a mixed server host
Imagine:
- api.slice = user-facing services,
- jobs.slice = indexing/backfill/compaction,
- sandbox.slice = ad-hoc experiments.
A reasonable first pass:
Global
# /etc/systemd/oomd.conf.d/80-defaults.conf
[OOM]
SwapUsedLimit=90%
DefaultMemoryPressureLimit=60%
DefaultMemoryPressureDurationSec=30s
Root slice guardrail
# /etc/systemd/system/-.slice.d/oomd.conf
[Slice]
ManagedOOMSwap=kill
Noisy batch area
# /etc/systemd/system/jobs.slice.d/oomd.conf
[Slice]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=50%
ManagedOOMMemoryPressureDurationSec=20s
Critical API
# /etc/systemd/system/api.slice.d/oomd.conf
[Slice]
ManagedOOMPreference=avoid
This does not make the API immortal. It just tells the host:
if we have to kill something, start with the reclaim-hot junk before the thing users are actively waiting on.
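Since the tuning order above puts cgroup memory policy before oomd aggressiveness, the sandbox area would also get hard limits so systemd-oomd is the backstop rather than the first line of defense. A sketch (the sizes are arbitrary assumptions):

```ini
# /etc/systemd/system/sandbox.slice.d/limits.conf (illustrative sizes)
[Slice]
MemoryHigh=8G
MemoryMax=12G
MemorySwapMax=4G
```

memory.high throttling and reclaim kick in first; pressure-based kills only happen if the experiments still manage to hurt the rest of the host.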
13) When to choose avoid vs omit
Use avoid when:
- the service is important but not absolutely sacred,
- it still should remain killable in a genuine crisis,
- you want de-prioritization, not immunity.
Use omit when:
- killing it would make the host less recoverable,
- it is small and essential,
- there are other reasonable kill candidates.
Examples that might deserve omit in some environments:
- tiny host control-plane helpers,
- service managers or recovery agents,
- a minimal SSH / remote-repair path on certain boxes.
Examples that usually should not get omit:
- large app servers,
- huge browsers/IDEs/VMs,
- bulky caches,
- broad user session managers with everything packed underneath.
14) Incident runbook
When memory collapse starts and you suspect systemd-oomd policy issues:
- Check whether the host is on cgroup v2.
- Inspect PSI (/proc/pressure/memory, memory.pressure on key slices).
- Inspect journal for systemd-oomd decisions.
- Confirm candidate topology:
- are descendants clean?
- are victims leaf cgroups / grouped correctly?
- Check swap reality:
- is there swap?
- is swap nearly exhausted?
- are kills coming from swap mode or pressure mode?
- Review protection settings:
- memory.low/memory.min
- ManagedOOMPreference
- Fix the tree before retuning thresholds.
- Only then adjust pressure duration / thresholds.
A lot of bad systemd-oomd behavior is actually bad cgroup design plus missing memory policy.
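For the swap checks in that runbook, the percentage you compare against SwapUsedLimit= can be computed from /proc/meminfo. A sketch (the function name is mine):

```shell
#!/bin/sh
# Compute swap-used percentage from /proc/meminfo-style input, to compare
# against SwapUsedLimit= during an incident. Prints "no-swap" if SwapTotal
# is zero, which is itself a key runbook finding.
swap_used_pct() {
  awk '/^SwapTotal:/ { t = $2 } /^SwapFree:/ { f = $2 }
       END {
         if (t == 0) { print "no-swap"; exit }
         printf "%d\n", (t - f) * 100 / t
       }'
}
```

Usage: `swap_used_pct < /proc/meminfo`.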
15) One-page starter policy
If you need a safe default mindset:
- Use cgroup v2 everywhere.
- Turn on memory accounting.
- Keep swap enabled unless you have a very strong reason not to.
- Use swap-based kill at -.slice as the host-wide guardrail.
- Use pressure-based kill on mixed-priority child slices.
- Protect critical services with avoid, not blanket omit.
- Model kill domains explicitly with leaf cgroups / grouped children.
- Tune with PSI + reclaim + user latency, not RSS alone.
If the box routinely reaches kernel OOM before systemd-oomd helps, the usual suspects are:
- thresholds too lax,
- durations too long,
- no swap,
- poor cgroup topology,
- or missing memory accounting / cgroup v2 support.
16) Bottom line
systemd-oomd is best understood as a host survivability layer.
It is not trying to make OOM disappear. It is trying to answer a much more operational question:
When memory contention turns into real progress loss, which workload should die first so the rest of the machine can keep breathing?
If you give it:
- a good cgroup tree,
- explicit memory protection policy,
- swap headroom,
- and sane kill preferences,
it becomes far more predictable than waiting for late kernel OOM roulette.
And if you do not give it those things, it will still tell you something valuable:
your workload boundaries are not yet expressed clearly enough for the kernel and service manager to protect the machine on your behalf.
References
- systemd-oomd.service(8) / systemd-oomd(8) upstream man page
- oomd.conf(5) upstream man page
- Fedora Change Proposal: Enable systemd-oomd by default
- Linux PSI and cgroup v2 memory-controller documentation