systemd-oomd Playbook (PSI, cgroup v2, memory protection, and pre-kernel OOM control)
Date: 2026-04-06
Category: knowledge
Why this matters
A lot of Linux memory incidents do not present as a clean, immediate OOM kill. They usually unfold like this instead:
- the box enters reclaim hell,
- swap churn explodes,
- page cache and anonymous memory fight each other,
- user-facing latency turns to sludge,
- the kernel OOM killer fires late and somewhat opaquely,
- by the time a process dies, the machine has already felt half-dead for too long.
systemd-oomd exists to act earlier than the kernel OOM killer by using:
- cgroup v2 hierarchy,
- PSI (Pressure Stall Information),
- swap pressure,
- and systemd unit boundaries.
That combination matters because it lets you kill at the workload / cgroup level instead of hoping a last-second kernel OOM will pick the right victim.
The right mental model is:
systemd-oomd is not a "memory limit" feature. It is a pre-collapse policy engine for choosing which workload should die before the whole machine becomes useless.
1) Quick mental model
There are two major trigger families:
Memory-pressure kill path
Watch a monitored cgroup's memory PSI. If pressure stays above a configured limit for long enough, systemd-oomd kills an eligible descendant cgroup.
Swap-exhaustion kill path
Watch system-wide memory+swap usage. If both are above the configured threshold, systemd-oomd kills eligible descendant cgroups with meaningful swap usage, starting from the biggest swap users.
The key distinction:
- memory pressure mode is about stall / reclaim pain,
- swap mode is about global survival before total exhaustion.
Think of it like this:
- memory.high shapes slowdown,
- systemd-oomd chooses when slowdown has become unacceptable,
- the kernel OOM killer is the last-resort crash barrier.
2) Requirements you should verify first
systemd-oomd is only useful when the host is set up correctly.
Required / strongly recommended
- full unified cgroup v2 hierarchy
- PSI support in the kernel (Linux 4.20+)
- memory accounting enabled for monitored units
- reasonable cgroup boundaries between workloads
The easiest way to avoid a silent misconfiguration is to verify:
stat -fc %T /sys/fs/cgroup
# expect: cgroup2fs
cat /proc/pressure/memory
cat /proc/pressure/io
cat /proc/pressure/cpu
For systemd-managed hosts, make sure memory accounting is on:
# /etc/systemd/system.conf (conceptually)
DefaultMemoryAccounting=yes
Operational truth:
- If everything important runs in one giant cgroup, systemd-oomd cannot make good decisions.
- If memory accounting is off, monitored units may not behave as expected.
- If PSI is missing, the memory-pressure path is dead on arrival.
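The checks above can be wrapped into one helper. A minimal sketch (the function name and pass/fail strings are my own, not part of systemd; inputs are passed in explicitly so the logic is testable anywhere):

```shell
#!/bin/sh
# Sketch: decide whether a host meets the basic systemd-oomd prerequisites.
# On a real host, feed it `stat -fc %T /sys/fs/cgroup` and a check for
# /proc/pressure/memory.
check_oomd_prereqs() {
  fstype="$1"        # expected: cgroup2fs
  psi_present="$2"   # "yes" if /proc/pressure/memory exists
  if [ "$fstype" != "cgroup2fs" ]; then
    echo "fail: unified cgroup v2 hierarchy required"
    return 1
  fi
  if [ "$psi_present" != "yes" ]; then
    echo "fail: kernel PSI support missing"
    return 1
  fi
  echo "ok"
}
```

Usage on a live host: `check_oomd_prereqs "$(stat -fc %T /sys/fs/cgroup)" "$([ -e /proc/pressure/memory ] && echo yes || echo no)"`.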
3) How kill selection actually works
This is the part people often misunderstand.
When a monitored unit crosses a threshold, systemd-oomd does not simply kill that unit itself.
It looks for eligible descendant cgroups under the monitored unit.
Important rules from the upstream behavior:
- only descendant cgroups are kill candidates,
- the monitored unit itself is not picked as the victim; it only becomes a candidate when an ancestor above it is the monitor,
- only leaf cgroups and cgroups with memory.oom.group=1 are eligible kill targets.
That means tree design matters.
Bad shape:
- system.slice monitored,
- one giant service with every worker in the same non-leaf bucket,
- no clean descendants,
- unclear victim choice.
Better shape:
- parent slice or service monitors pressure,
- children represent meaningful kill domains,
- each kill domain is either a leaf or explicitly grouped with memory.oom.group=1 semantics.
In plain English:
systemd-oomd works best when your cgroup tree already matches your operational blast-radius boundaries.
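One way to reach the "better shape" is to give each worker its own unit under a shared slice. A hypothetical sketch (worker@.service, jobs.slice, and the binary path are made-up names, not anything systemd ships):

```ini
# worker@.service (hypothetical): each instance becomes its own leaf cgroup
# under jobs.slice, so systemd-oomd can kill one worker instead of the lot.
[Unit]
Description=Batch worker %i

[Service]
Slice=jobs.slice
ExecStart=/usr/local/bin/worker %i
MemoryAccounting=yes
```

With jobs.slice monitored for pressure, each worker@N.service instance is then a clean leaf kill candidate.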
4) The main knobs
Per-unit knobs
These live on units / slices you want monitored.
- ManagedOOMMemoryPressure=
  Enable pressure-based action, usually kill.
- ManagedOOMMemoryPressureLimit=
  Pressure threshold for that unit.
- ManagedOOMMemoryPressureDurationSec=
  How long the pressure must stay above threshold.
- ManagedOOMSwap=
  Enable swap-based action, usually kill.
- ManagedOOMPreference=
  Candidate preference: avoid or omit for important workloads.
Global knobs (oomd.conf)
- SwapUsedLimit=
  Global swap-usage threshold for swap-based actions. Default: 90%.
- DefaultMemoryPressureLimit=
  Default memory PSI threshold if units do not override it. Default: 60%.
- DefaultMemoryPressureDurationSec=
  How long pressure must exceed threshold before action. Default: 30s.
Operational interpretation:
- lower threshold = earlier kills,
- longer duration = more tolerance for bursts,
- avoid means “de-prioritize this as a victim,” omit means “do not consider this as a victim at all.”
Use avoid/omit sparingly. If everything is critical, nothing is.
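To build intuition for how limit and duration interact, here is a toy model (my own sketch, not the daemon's actual implementation): the action only "fires" when every pressure sample in the window exceeds the limit, so any dip below resets the decision.

```shell
#!/bin/sh
# Toy model of the pressure trigger: given a limit and one sample per
# polling interval, report "yes" only if every sample exceeds the limit.
should_kill() {
  limit="$1"; shift
  for sample in "$@"; do
    # awk handles the floating-point comparison portably
    if ! awk -v s="$sample" -v l="$limit" 'BEGIN { exit !(s > l) }'; then
      echo "no"
      return 1
    fi
  done
  echo "yes"
}
```

Lengthening the duration is equivalent to requiring more consecutive samples above the limit, which is why it is the gentler knob for taming bursty workloads.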
5) Decision matrix
A) Desktop / workstation freezes under browser + IDE + VM pressure
- enable swap-based protection at -.slice,
- enable memory-pressure protection in user slices,
- keep apps separated into meaningful user-service/session scopes.
This is the classic case where systemd-oomd shines.
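As a sketch of that setup, a drop-in on the per-user service manager (the path and the 50% figure are illustrative assumptions, close to what some distributions ship as defaults):

```ini
# /etc/systemd/system/user@.service.d/oomd.conf (illustrative)
[Service]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=50%
```

Apps then need to live in separate scopes/services under the session for the victim choice to be meaningful.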
B) Server host has foreground API + background ETL / compaction / indexing
- put the background jobs in separate slices/scopes,
- use memory-pressure kill policy on the parent slice,
- use cgroup memory controls (memory.high, memory.low) first,
- let systemd-oomd be the “we are now beyond graceful reclaim” enforcement layer.
C) Multi-tenant box where one tenant can consume massive swap
- set ManagedOOMSwap=kill high in the tree,
- keep descendants cleanly separated,
- watch swap users and journal events carefully.
D) Latency-sensitive system with no swap
- pressure mode can still work,
- but reaction time is tighter and failure is steeper,
- thresholds often need more conservative tuning,
- expect less forgiveness before user-visible collapse.
Upstream guidance is clear here: swap is strongly recommended because it buys time for systemd-oomd to react before total livelock.
6) Minimal safe rollout patterns
Pattern 1: Start with swap-based protection at the root slice
This is the cleanest broad safety net.
# /etc/systemd/system/-.slice.d/oomd.conf
[Slice]
ManagedOOMSwap=kill
And in oomd.conf:
[OOM]
SwapUsedLimit=90%
Why this works:
- it watches system-wide swap exhaustion,
- it gives systemd-oomd room to intervene before hard collapse,
- it does not require perfect per-service PSI tuning on day one.
Pattern 2: Add pressure-based monitoring to noisy slices
For example, on a service slice containing mixed workers:
[Slice]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=50%
ManagedOOMMemoryPressureDurationSec=20s
This says:
- if this slice spends too much time stalled on memory for long enough,
- choose an eligible descendant and kill it.
Pattern 3: Protect truly critical units with preference, not immunity everywhere
[Service]
ManagedOOMPreference=avoid
Use omit only for very small, genuinely essential services.
If you overuse omit, you force systemd-oomd to kill less appropriate victims later and harder.
7) How to think about PSI in this context
systemd-oomd uses memory PSI, not just raw RSS.
That is a huge conceptual upgrade.
A workload can be “only” using moderate memory but still cause:
- constant reclaim,
- page fault churn,
- swap-in latency,
- stalled allocators,
- system-wide performance collapse.
PSI sees the wasted time. That is why it aligns much better with real user pain than a simple memory-used number.
For systemd-oomd, the relevant framing is:
- high memory PSI = the workload tree is spending too much time not making progress,
- sustained high PSI = reclaim is no longer a temporary event but a steady-state failure mode.
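The PSI file has two lines, some and full, each carrying rolling averages. A small sketch (the helper name is mine) for extracting the 10-second full average from that format:

```shell
#!/bin/sh
# Extract avg10 from the "full" line of /proc/pressure/memory-style input.
# ("full" means all non-idle tasks were stalled on memory simultaneously.)
psi_full_avg10() {
  awk '$1 == "full" {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^avg10=/) { sub(/^avg10=/, "", $i); print $i }
  }'
}
```

Usage on a live host: `psi_full_avg10 < /proc/pressure/memory`, or point it at a cgroup's memory.pressure file.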
This also explains why systemd-oomd tends to pair well with cgroup memory protections:
- memory.low/memory.min say who deserves protection,
- reclaim behavior reflects those protections,
- PSI captures the resulting stall cost,
- systemd-oomd kills based on the collapse pattern rather than just largest-bytes-wins.
Fedora's rollout notes make this explicit: pressure-based selection better reflects memory protection policy than raw usage does.
8) Practical architecture guidance
Good cgroup shapes
Use separate slices/scopes for:
- foreground API vs background batch,
- interactive desktop apps vs large build/test jobs,
- per-tenant workers,
- transient CLI experiments launched via systemd-run --scope.
Bad cgroup shapes
Avoid:
- everything under one huge session scope,
- one service owning unrelated children with no meaningful kill domains,
- “critical” labels on half the host,
- unbounded helper processes escaping service accounting.
If you want good systemd-oomd behavior, the cgroup tree itself must already express:
- who competes together,
- who can die together,
- who should be spared if possible.
9) A sensible tuning order
Do not start by making systemd-oomd aggressive.
Use this order instead:
1. Fix the tree
   Separate meaningful workloads into slices/scopes.
2. Apply cgroup memory policy
   Use memory.low, memory.high, memory.max, and explicit swap policy first.
3. Enable swap-based protection
   Use it as the host-wide guardrail.
4. Enable pressure-based kills on specific slices
   Start with noisy, mixed-priority areas rather than the whole machine.
5. Tune durations before thresholds
   If kills are too eager, lengthen duration before radically raising limits.
6. Use preferences sparingly
   avoid is often enough. omit should be rare.
This keeps systemd-oomd from becoming a blunt instrument.
10) Observability: what to watch
The first tool to know is:
oomctl
Use it to inspect monitored cgroups and pressure state.
Also watch:
journalctl -u systemd-oomd
cat /proc/pressure/memory
cat /proc/pressure/io
cat /sys/fs/cgroup/<path>/memory.pressure
cat /sys/fs/cgroup/<path>/memory.events
cat /sys/fs/cgroup/<path>/memory.current
cat /sys/fs/cgroup/<path>/memory.swap.current
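Of these, memory.events is what tells you whether the kernel-side OOM killer has also been firing. A sketch (function name is mine) for pulling that counter out of memory.events-format input:

```shell
#!/bin/sh
# Print the oom_kill counter from memory.events-style input
# (fields: low, high, max, oom, oom_kill; newer kernels add oom_group_kill).
oom_kill_count() {
  awk '$1 == "oom_kill" { print $2 }'
}
```

Usage: `oom_kill_count < /sys/fs/cgroup/<path>/memory.events`. A rising oom_kill count on a monitored tree means systemd-oomd is acting too late or not at all.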
Minimum signal set for real operations:
- per-cgroup memory.pressure
- per-cgroup memory.events
- swap usage by cgroup when relevant
- journalctl -u systemd-oomd
- request latency / queue depth / timeouts for user-facing services
What you want to learn from an event:
- which monitored parent crossed threshold,
- which descendant was chosen,
- whether the victim had been reclaim-hot for a long time,
- whether memory protections were working as intended,
- whether swap exhaustion or PSI was the true trigger.
11) Common mistakes
1) Treating systemd-oomd as a substitute for memory controls
It is not.
If memory.high, memory.low, and swap policy are nonsense, systemd-oomd inherits nonsense.
2) Running without swap and expecting graceful behavior
Pressure mode can still help, but the machine reaches unusable states faster. Swap often provides the reaction window userspace OOM control needs.
3) Monitoring a parent with no meaningful children
Then the kill domain is poorly defined, and outcomes become surprising.
4) Marking too many services omit
That just pushes death onto less appropriate victims.
Reserve omit for tiny, high-importance control-plane pieces.
5) Ignoring leaf / memory.oom.group=1 eligibility rules
If you do not shape the tree around killable units, systemd-oomd cannot operate cleanly.
6) Tuning by memory bytes only
Pressure-based systems should be tuned by:
- PSI,
- reclaim behavior,
- user latency,
- and survival behavior after a kill.
12) Example rollout for a mixed server host
Imagine:
- api.slice = user-facing services,
- jobs.slice = indexing/backfill/compaction,
- sandbox.slice = ad-hoc experiments.
A reasonable first pass:
Global
# /etc/systemd/oomd.conf.d/80-defaults.conf
[OOM]
SwapUsedLimit=90%
DefaultMemoryPressureLimit=60%
DefaultMemoryPressureDurationSec=30s
Root slice guardrail
# /etc/systemd/system/-.slice.d/oomd.conf
[Slice]
ManagedOOMSwap=kill
Noisy batch area
# /etc/systemd/system/jobs.slice.d/oomd.conf
[Slice]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=50%
ManagedOOMMemoryPressureDurationSec=20s
Critical API
# /etc/systemd/system/api.slice.d/oomd.conf
[Slice]
ManagedOOMPreference=avoid
This does not make the API immortal. It just tells the host:
if we have to kill something, start with the reclaim-hot junk before the thing users are actively waiting on.
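Since the tuning order above puts cgroup memory policy before oomd aggressiveness, the sandbox area would also get hard limits so systemd-oomd is the backstop rather than the first line of defense. A sketch (the sizes are arbitrary assumptions):

```ini
# /etc/systemd/system/sandbox.slice.d/limits.conf (illustrative sizes)
[Slice]
MemoryHigh=8G
MemoryMax=12G
MemorySwapMax=4G
```

memory.high throttling and reclaim kick in first; pressure-based kills only happen if the experiments still manage to hurt the rest of the host.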
13) When to choose avoid vs omit
Use avoid when:
- the service is important but not absolutely sacred,
- it still should remain killable in a genuine crisis,
- you want de-prioritization, not immunity.
Use omit when:
- killing it would make the host less recoverable,
- it is small and essential,
- there are other reasonable kill candidates.
Examples that might deserve omit in some environments:
- tiny host control-plane helpers,
- service managers or recovery agents,
- a minimal SSH / remote-repair path on certain boxes.
Examples that usually should not get omit:
- large app servers,
- huge browsers/IDEs/VMs,
- bulky caches,
- broad user session managers with everything packed underneath.
14) Incident runbook
When memory collapse starts and you suspect systemd-oomd policy issues:
- Check whether the host is on cgroup v2.
- Inspect PSI (/proc/pressure/memory, memory.pressure on key slices).
- Inspect journal for systemd-oomd decisions.
- Confirm candidate topology:
- are descendants clean?
- are victims leaf cgroups / grouped correctly?
- Check swap reality:
- is there swap?
- is swap nearly exhausted?
- are kills coming from swap mode or pressure mode?
- Review protection settings:
- memory.low/memory.min
- ManagedOOMPreference
- Fix the tree before retuning thresholds.
- Only then adjust pressure duration / thresholds.
A lot of bad systemd-oomd behavior is actually bad cgroup design plus missing memory policy.
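For the swap checks in that runbook, the percentage you compare against SwapUsedLimit= can be computed from /proc/meminfo. A sketch (the function name is mine):

```shell
#!/bin/sh
# Compute swap-used percentage from /proc/meminfo-style input, to compare
# against SwapUsedLimit= during an incident. Prints "no-swap" if SwapTotal
# is zero, which is itself a key runbook finding.
swap_used_pct() {
  awk '/^SwapTotal:/ { t = $2 } /^SwapFree:/ { f = $2 }
       END {
         if (t == 0) { print "no-swap"; exit }
         printf "%d\n", (t - f) * 100 / t
       }'
}
```

Usage: `swap_used_pct < /proc/meminfo`.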
15) One-page starter policy
If you need a safe default mindset:
- Use cgroup v2 everywhere.
- Turn on memory accounting.
- Keep swap enabled unless you have a very strong reason not to.
- Use swap-based kill at -.slice as the host-wide guardrail.
- Use pressure-based kill on mixed-priority child slices.
- Protect critical services with avoid, not blanket omit.
- Model kill domains explicitly with leaf cgroups / grouped children.
- Tune with PSI + reclaim + user latency, not RSS alone.
If the box routinely reaches kernel OOM before systemd-oomd helps, the usual suspects are:
- thresholds too lax,
- durations too long,
- no swap,
- poor cgroup topology,
- or missing memory accounting / cgroup v2 support.
16) Bottom line
systemd-oomd is best understood as a host survivability layer.
It is not trying to make OOM disappear. It is trying to answer a much more operational question:
When memory contention turns into real progress loss, which workload should die first so the rest of the machine can keep breathing?
If you give it:
- a good cgroup tree,
- explicit memory protection policy,
- swap headroom,
- and sane kill preferences,
it becomes far more predictable than waiting for late kernel OOM roulette.
And if you do not give it those things, it will still tell you something valuable:
your workload boundaries are not yet expressed clearly enough for the kernel and service manager to protect the machine on your behalf.
References
- systemd-oomd.service(8) / systemd-oomd(8) upstream man page
- oomd.conf(5) upstream man page
- Fedora Change Proposal: Enable systemd-oomd by default
- Linux PSI and cgroup v2 memory-controller documentation