systemd-oomd Playbook (PSI, cgroup v2, memory protection, and pre-kernel OOM control)

2026-04-06 · software


Why this matters

A lot of Linux memory incidents do not fail as a clean, immediate OOM. They usually fail like this instead: swap traffic ramps, interactive latency collapses, the box spends minutes thrashing in reclaim, and only then, maybe, does the kernel OOM killer fire on a victim you would not have chosen.

systemd-oomd exists to act earlier than the kernel OOM killer by using:

  • PSI (pressure stall information) to measure time actually lost to memory reclaim, and
  • cgroup v2 accounting to track memory and swap use per unit and pick victims at workload granularity.

That combination matters because it lets you kill at the workload / cgroup level instead of hoping a last-second kernel OOM will pick the right victim.

The right mental model is:

systemd-oomd is not a "memory limit" feature. It is a pre-collapse policy engine for choosing which workload should die before the whole machine becomes useless.


1) Quick mental model

There are two major trigger families:

  • swap-based (ManagedOOMSwap=, driven by the global SwapUsedLimit), and
  • pressure-based (ManagedOOMMemoryPressure=, driven by per-cgroup memory PSI).

The key distinction: swap-based kills react to how much swap the host has already consumed (a capacity signal), while pressure-based kills react to how much time a cgroup is spending stalled on memory (a progress signal).

Think of it like this: swap mode is the host-wide fuel gauge, pressure mode is the per-workload stall detector.


2) Requirements you should verify first

systemd-oomd is only useful when the host is set up correctly.

Required / strongly recommended:

  • a unified cgroup v2 hierarchy (systemd-oomd does not operate on cgroup v1),
  • a kernel built with PSI (CONFIG_PSI, visible under /proc/pressure/),
  • memory accounting enabled for the units you want monitored, and
  • swap configured, so pressure builds gradually instead of cliffing straight into livelock.

The easiest way to avoid a silent misconfiguration is to verify:

stat -fc %T /sys/fs/cgroup
# expect: cgroup2fs

cat /proc/pressure/memory
cat /proc/pressure/io
cat /proc/pressure/cpu
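The PSI files all share one line format, so the check can be scripted. A sketch that parses a sample line offline, with an arbitrary illustrative 10% threshold; on a real host, substitute the output of `head -1 /proc/pressure/memory`:

```shell
# Sanity-check memory PSI: extract avg10 from a PSI-format line and
# compare it to an illustrative 10% threshold. Sample line embedded so
# this runs offline; on a host use: head -1 /proc/pressure/memory
psi_line='some avg10=12.50 avg60=8.10 avg300=2.00 total=123456'
avg10=$(printf '%s\n' "$psi_line" | sed -n 's/^some avg10=\([0-9.]*\).*/\1/p')
# Compare as a float via awk (plain shell arithmetic is integer-only).
if awk -v v="$avg10" 'BEGIN { exit !(v > 10) }'; then
  echo "memory pressure high: avg10=${avg10}%"
else
  echo "memory pressure ok: avg10=${avg10}%"
fi
```

The same parse works for the io and cpu files, and for per-cgroup memory.pressure.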

For systemd-managed hosts, make sure memory accounting is on:

# /etc/systemd/system.conf (conceptually)
DefaultMemoryAccounting=yes

Operational truth: if any prerequisite is missing, systemd-oomd fails quietly rather than loudly. The unit can be enabled and running while effectively protecting nothing, so verify these basics before trusting it.


3) How kill selection actually works

This is the part people often misunderstand.

When a monitored unit crosses a threshold, systemd-oomd does not simply kill that unit itself. It looks for eligible descendant cgroups under the monitored unit.

Important rules from the upstream behavior:

  • kill candidates are eligible descendant cgroups of the monitored unit, not the monitored unit itself,
  • eligible means leaf cgroups, or cgroups with memory.oom.group=1 so they die as a unit,
  • when a cgroup is selected, every process in it is killed, and
  • ManagedOOMPreference=avoid / omit on candidates biases selection away from them or excludes them entirely.

That means tree design matters.

Bad shape: one monitored service whose processes all live in a single flat cgroup, so the only possible victim is everything at once.

Better shape: a monitored slice whose workers each run in their own service or scope, so one worker can die while the rest keep going.

In plain English:

systemd-oomd works best when your cgroup tree already matches your operational blast-radius boundaries.
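The "better shape" can be expressed as a drop-in. A sketch, assuming a hypothetical worker@.service template: pinning instances to a shared slice gives each worker its own child cgroup, each a clean kill target.

```ini
# /etc/systemd/system/worker@.service.d/10-slice.conf
# Hypothetical template drop-in: every instance (worker@1, worker@2, ...)
# becomes its own child cgroup under jobs.slice, killable on its own.
[Service]
Slice=jobs.slice
```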


4) The main knobs

Per-unit knobs

These live on the units / slices you want monitored:

  • ManagedOOMSwap=kill: act when host-wide swap use crosses SwapUsedLimit,
  • ManagedOOMMemoryPressure=kill: act on sustained memory PSI in this unit's cgroup,
  • ManagedOOMMemoryPressureLimit=: the stall percentage that counts as too much,
  • ManagedOOMMemoryPressureDurationSec=: how long the stall must persist before acting, and
  • ManagedOOMPreference=: avoid or omit, to bias or exclude a candidate.

Global knobs (oomd.conf)

  • SwapUsedLimit=: host-wide swap threshold for swap-based kills (defaults to 90%),
  • DefaultMemoryPressureLimit=: fallback PSI threshold for units that do not set their own (defaults to 60%), and
  • DefaultMemoryPressureDurationSec=: fallback duration (defaults to 30s).

Operational interpretation: the limit answers "how much stall is too much", the duration answers "for how long before we act", and the preference answers "who should be picked last, or never".

Use avoid/omit sparingly. If everything is critical, nothing is.


5) Decision matrix

A) Desktop / workstation freezes under browser + IDE + VM pressure

This is the classic case where systemd-oomd shines: enable pressure-based kills on the user session slices so a runaway browser tab, IDE index, or VM dies before the whole desktop livelocks.

B) Server host has foreground API + background ETL / compaction / indexing

Put the batch work in its own slice with pressure-based kills enabled, and mark the API avoid, so reclaim-hungry background jobs die before the thing users are waiting on.

C) Multi-tenant box where one tenant can consume massive swap

Keep each tenant in its own slice and lean on swap-based kills: once host swap use crosses SwapUsedLimit, the heaviest swap consumers are the first candidates, which maps cleanly onto per-tenant blame.

D) Latency-sensitive system with no swap

Pressure-based kills still work, but the window between rising pressure and an unusable host shrinks sharply, so favor shorter durations and firm memory.max caps.

Upstream guidance is clear here: swap is strongly recommended because it buys time for systemd-oomd to react before total livelock.
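Since swap-based kills key off a percentage threshold, the arithmetic is worth internalizing. A sketch with synthetic kB values; on a real host they come from /proc/meminfo or `swapon --show`:

```shell
# Compute swap-used percentage the way a SwapUsedLimit=90% comparison
# implies. Synthetic values (kB); on a host, read SwapTotal/SwapFree
# from /proc/meminfo.
swap_total_kb=4194304   # 4 GiB of swap
swap_free_kb=524288     # 512 MiB still free
used_pct=$(awk -v t="$swap_total_kb" -v f="$swap_free_kb" \
  'BEGIN { printf "%d", (t - f) / t * 100 }')
echo "swap used: ${used_pct}%"
```

With these numbers the host sits at 87%, i.e. just under the default 90% trip point.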


6) Minimal safe rollout patterns

Pattern 1: Start with swap-based protection at the root slice

This is the cleanest broad safety net.

# /etc/systemd/system/-.slice.d/oomd.conf
[Slice]
ManagedOOMSwap=kill

And in oomd.conf:

[OOM]
SwapUsedLimit=90%

Why this works: swap exhaustion is a genuinely host-wide emergency, and with the root slice monitored the candidate set covers every eligible workload, so the biggest swap consumers are reclaimed first no matter where they live in the tree.

Pattern 2: Add pressure-based monitoring to noisy slices

For example, on a service slice containing mixed workers:

[Slice]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=50%
ManagedOOMMemoryPressureDurationSec=20s

This says: if this slice's memory PSI stays above 50% for 20 seconds, kill an eligible descendant. Brief spikes are tolerated; sustained stalling is not.

Pattern 3: Protect truly critical units with preference, not immunity everywhere

[Service]
ManagedOOMPreference=avoid

Use omit only for very small, genuinely essential services. If you overuse omit, you force systemd-oomd to kill less appropriate victims later and harder.


7) How to think about PSI in this context

systemd-oomd uses memory PSI, not just raw RSS. That is a huge conceptual upgrade.

A workload can be “only” using moderate memory but still cause:

  • constant reclaim and refault churn,
  • I/O storms from swap traffic and re-read page cache, and
  • long stalls in every other workload sharing the box.

PSI sees the wasted time. That is why it aligns much better with real user pain than a simple memory-used number.
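PSI's total field is cumulative stalled time in microseconds, which means you can derive the stall share over any window you like by sampling it twice. A sketch with synthetic numbers; on a real host the two samples come from reading memory.pressure N seconds apart:

```shell
# Derive memory stall share over a custom window from two PSI samples.
# "total" is cumulative stalled time in microseconds, so the share is
# delta_total / window. Synthetic values; on a host, read
# /sys/fs/cgroup/<path>/memory.pressure twice, window_s seconds apart.
t0=1000000      # total= at first sample (us)
t1=3400000      # total= at second sample (us)
window_s=10
stall_pct=$(awk -v a="$t0" -v b="$t1" -v w="$window_s" \
  'BEGIN { printf "%.1f", (b - a) / (w * 1000000) * 100 }')
echo "stalled ${stall_pct}% of the last ${window_s}s"
```

Here the cgroup lost 2.4 of the last 10 seconds to memory stalls, i.e. 24%.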

For systemd-oomd, the relevant framing is: kill decisions key off time lost to memory, not bytes resident. A cgroup that stalls the machine at 2 GB is a worse citizen than one idling quietly at 20 GB.

This also explains why systemd-oomd tends to pair well with cgroup memory protections: memory.low and memory.min shield the workloads you care about from reclaim, which shifts reclaim, and therefore pressure, onto the unprotected cgroups, steering victim selection in the direction your protections already express.

Fedora's rollout notes make this explicit: pressure-based selection better reflects memory protection policy than raw usage does.


8) Practical architecture guidance

Good cgroup shapes

Use separate slices/scopes for:

  • interactive user sessions,
  • latency-critical services,
  • batch / background / best-effort jobs, and
  • anything you would ever want to kill as a unit.

Bad cgroup shapes

Avoid:

  • one giant slice where everything competes and dies together,
  • processes parked directly in a monitored cgroup with no killable children, and
  • trees organized by deployment convenience instead of blast radius.

If you want good systemd-oomd behavior, the cgroup tree itself must already express:

  1. who competes together,
  2. who can die together,
  3. who should be spared if possible.

9) A sensible tuning order

Do not start by making systemd-oomd aggressive. Use this order instead:

  1. Fix the tree
    Separate meaningful workloads into slices/scopes.

  2. Apply cgroup memory policy
    Use memory.low, memory.high, memory.max, and explicit swap policy first.

  3. Enable swap-based protection
    Use it as the host-wide guardrail.

  4. Enable pressure-based kills on specific slices
    Start with noisy, mixed-priority areas rather than the whole machine.

  5. Tune durations before thresholds
    If kills are too eager, lengthen duration before radically raising limits.

  6. Use preferences sparingly
    avoid is often enough. omit should be rare.

This keeps systemd-oomd from becoming a blunt instrument.


10) Observability: what to watch

The first tool to know is:

oomctl

Use it to inspect monitored cgroups and pressure state.

Also watch:

journalctl -u systemd-oomd
cat /proc/pressure/memory
cat /proc/pressure/io
cat /sys/fs/cgroup/<path>/memory.pressure
cat /sys/fs/cgroup/<path>/memory.events
cat /sys/fs/cgroup/<path>/memory.current
cat /sys/fs/cgroup/<path>/memory.swap.current
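Several of these files are flat key-value lists, easy to scrape for alerting. A sketch that extracts the oom_kill counter from a sample memory.events snapshot (embedded here so it runs offline):

```shell
# Extract the oom_kill counter from a memory.events snapshot.
# Sample text embedded; on a host:
#   cat /sys/fs/cgroup/<path>/memory.events
events='low 0
high 12
max 3
oom 1
oom_kill 1'
oom_kills=$(printf '%s\n' "$events" | awk '$1 == "oom_kill" { print $2 }')
echo "oom_kill events: ${oom_kills}"
```

A rising oom_kill count on a slice you thought was protected is a strong hint the tree or the protections are wrong.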

Minimum signal set for real operations:

  • systemd-oomd kill events in the journal,
  • memory PSI (avg10 / avg60) on each monitored slice,
  • the host swap usage trend, and
  • oom / oom_kill counters in memory.events.

What you want to learn from an event: which mode fired (swap vs pressure), which cgroup died, what pressure and swap looked like in the minutes beforehand, and whether the victim matches what you would have picked by hand.
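The victim cgroup can be pulled straight out of the journal line. A sketch against a sample message whose exact wording is illustrative, not the guaranteed log format; real lines come from journalctl -u systemd-oomd:

```shell
# Extract the killed cgroup path from an oomd journal message.
# The message text below is illustrative; the real format may differ.
msg='systemd-oomd[612]: Killed /system.slice/jobs.slice/worker@3.service due to memory pressure'
victim=$(printf '%s\n' "$msg" | sed -n 's/.*Killed \([^ ]*\).*/\1/p')
echo "victim cgroup: ${victim}"
```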


11) Common mistakes

1) Treating systemd-oomd as a substitute for memory controls

It is not. If memory.high, memory.low, and swap policy are nonsense, systemd-oomd inherits nonsense.

2) Running without swap and expecting graceful behavior

Pressure mode can still help, but the machine reaches unusable states faster. Swap often provides the reaction window userspace OOM control needs.

3) Monitoring a parent with no meaningful children

Then the kill domain is poorly defined, and outcomes become surprising.

4) Marking too many services omit

That just pushes death onto less appropriate victims. Reserve omit for tiny, high-importance control-plane pieces.

5) Ignoring leaf / memory.oom.group=1 eligibility rules

If you do not shape the tree around killable units, systemd-oomd cannot operate cleanly.

6) Tuning by memory bytes only

Pressure-based systems should be tuned by:

  • stall percentage (how much progress is being lost), and
  • stall duration (for how long), with byte limits as a backstop rather than the primary dial.


12) Example rollout for a mixed server host

Imagine:

  • api.slice: a latency-critical API that users are actively waiting on,
  • jobs.slice: batch ETL, compaction, and indexing workers, and
  • assorted supporting services elsewhere under system.slice.

A reasonable first pass:

Global

# /etc/systemd/oomd.conf.d/80-defaults.conf
[OOM]
SwapUsedLimit=90%
DefaultMemoryPressureLimit=60%
DefaultMemoryPressureDurationSec=30s

Root slice guardrail

# /etc/systemd/system/-.slice.d/oomd.conf
[Slice]
ManagedOOMSwap=kill

Noisy batch area

# /etc/systemd/system/jobs.slice.d/oomd.conf
[Slice]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=50%
ManagedOOMMemoryPressureDurationSec=20s

Critical API

# /etc/systemd/system/api.slice.d/oomd.conf
[Slice]
ManagedOOMPreference=avoid

This does not make the API immortal. It just tells the host:

if we have to kill something, start with the reclaim-hot junk before the thing users are actively waiting on.


13) When to choose avoid vs omit

Use avoid when: the service is important enough to be picked last, but host survival still outranks it.

Use omit when: killing the service would itself endanger recovery or operator access, and it is small enough that sparing it cannot sink the host.

Examples that might deserve omit in some environments: sshd, the logging / journal path, and thin control-plane agents an operator needs during an incident.

Examples that usually should not get omit: application servers, databases with big caches, browsers, and build systems; anything large enough that sparing it forces worse kills elsewhere.


14) Incident runbook

When memory collapse starts and you suspect systemd-oomd policy issues:

  1. Check whether the host is on cgroup v2.
  2. Inspect PSI (/proc/pressure/memory, memory.pressure on key slices).
  3. Inspect journal for systemd-oomd decisions.
  4. Confirm candidate topology:
    • are descendants clean?
    • are victims leaf cgroups / grouped correctly?
  5. Check swap reality:
    • is there swap?
    • is swap nearly exhausted?
    • are kills coming from swap mode or pressure mode?
  6. Review protection settings:
    • memory.low / memory.min
    • ManagedOOMPreference
  7. Fix the tree before retuning thresholds.
  8. Only then adjust pressure duration / thresholds.

A lot of bad systemd-oomd behavior is actually bad cgroup design plus missing memory policy.


15) One-page starter policy

If you need a safe default mindset:

  • verify cgroup v2, PSI, and swap first,
  • put ManagedOOMSwap=kill on the root slice with SwapUsedLimit around 90%,
  • add pressure-based kills only on slices you have deliberately shaped, and
  • use avoid on a handful of critical services, omit almost nowhere.

If the box routinely reaches kernel OOM before systemd-oomd helps, the usual suspects are: no swap (no reaction window), limits or durations set too loose, memory accounting disabled, or a tree with no eligible victims under the monitored units.


16) Bottom line

systemd-oomd is best understood as a host survivability layer.

It is not trying to make OOM disappear. It is trying to answer a much more operational question:

When memory contention turns into real progress loss, which workload should die first so the rest of the machine can keep breathing?

If you give it:

  • a cgroup tree shaped around real workload boundaries,
  • sane memory.low / memory.high / memory.max policy, and
  • swap for a reaction window,

it becomes far more predictable than waiting for late kernel OOM roulette.

And if you do not give it those things, it will still tell you something valuable:

your workload boundaries are not yet expressed clearly enough for the kernel and service manager to protect the machine on your behalf.

