Linux cgroup v2 Memory Controller Playbook (memory.min/low/high/max, swap, PSI, systemd mappings)
Date: 2026-04-06
Category: knowledge
Why this matters
A lot of production memory incidents are not really “we ran out of RAM” in the simple sense. They are usually one of these:
- one batch job bloats page cache and pushes a latency-sensitive service into reclaim,
- background daemons and sidecars quietly eat headroom until the main workload starts thrashing,
- the host still looks fine on average, but memory.high reclaim turns p99 into sludge,
- operators jump straight to hard limits and get surprise OOM kills instead of graceful slowdown,
- swap policy is accidental rather than explicit, so the box fails in weird, slow motion.
cgroup v2 gives a much better vocabulary for this than old per-process tuning:
- protection for workloads that must keep breathing,
- throttling for workloads that can slow down before they die,
- hard containment for blast-radius control,
- pressure signals that tell you whether a workload is actually suffering,
- hierarchical policy so the host can decide who yields first.
The key mindset shift is this:
Treat memory as a contested shared resource whose failure modes are reclaim, stall, swap, and finally OOM — not just a single usage number.
1) Quick mental model
Use the knobs for different jobs:
- memory.min: hard protection. “Do not reclaim this working set; if that becomes impossible, kill something else or OOM.”
- memory.low: best-effort protection. “Prefer not to reclaim this working set unless unprotected memory is exhausted.”
- memory.high: throttle + reclaim pressure. “This is the main operating limit. Slow the workload down before catastrophe.”
- memory.max: hard ceiling. “Last line of defense. If reclaim cannot contain usage, OOM inside the cgroup.”
- memory.swap.max: swap ceiling. “How much swap is this workload allowed to consume?”
- memory.oom.group: kill as a unit. “If this workload OOMs, kill the whole thing together rather than leaving a half-dead shard.”
- memory.events / memory.pressure: telemetry. “Are we protected, throttled, thrashing, or approaching OOM?”
- memory.reclaim: proactive reclaim trigger. “Ask the kernel to reclaim from this cgroup now.”
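All of these knobs are plain files in the cgroup's directory under the cgroup v2 mount. A minimal inspection helper, assuming cgroup v2 is mounted at /sys/fs/cgroup (the function name and example path are mine, not standard tooling):

```shell
#!/bin/sh
# Sketch: dump the key memory knobs for one cgroup directory,
# tolerating files that are absent (e.g. on the root cgroup).
dump_memory_knobs() {
    dir="$1"
    for f in memory.min memory.low memory.high memory.max \
             memory.swap.max memory.oom.group memory.current; do
        if [ -r "$dir/$f" ]; then
            printf '%-18s %s\n' "$f" "$(cat "$dir/$f")"
        else
            printf '%-18s (not present)\n' "$f"
        fi
    done
}

# Usage on a real host:
#   dump_memory_knobs /sys/fs/cgroup/api
```

Reading all of them in one shot is usually more informative than eyeballing a single file.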
Rule of thumb:
- start with memory.high as the main control,
- use memory.low / memory.min only for workloads that truly deserve protection,
- keep memory.max as the crash barrier, not the primary tuning knob,
- decide swap policy explicitly,
- read PSI + events + stat together, not one file at a time.
2) Decision matrix
A) User-facing service slows down under host contention
- give the service memory.low, or a modest memory.min if it is truly critical,
- put noisy neighbors in sibling cgroups,
- set a reasonable memory.high on the noisy workloads,
- watch memory.pressure and memory.events.high.
B) You want graceful slowdown before OOM
- set memory.high first,
- alert on sustained memory.events.high growth or rising PSI,
- use memory.max only as the final containment fence.
C) Batch / ETL / compaction jobs must not eat the machine
- isolate them in their own cgroup,
- apply a conservative memory.high,
- optionally add memory.max if host safety matters more than job completion time,
- consider allowing swap if throughput matters more than latency.
D) A workload must preserve integrity on OOM
- set memory.oom.group=1,
- ensure restart / orchestration logic expects whole-group death,
- avoid partial survivors with corrupt or incomplete in-memory state.
E) Latency-sensitive workload should never swap
- set memory.swap.max=0,
- ensure memory.low / memory.high are realistic,
- remember that “no swap” without enough headroom often just turns slow failure into fast OOM.
F) You are tuning protections in a hierarchy
- remember protections are hierarchical,
- set ancestor protection too, not just the leaf,
- avoid protection overcommit unless you understand how siblings will compete.
3) What each knob really does
memory.current: what is charged right now
memory.current shows current memory usage for the cgroup and its descendants.
It is broader than many people expect:
- anonymous memory,
- page cache,
- kernel objects such as inodes/slab,
- network buffers.
So if a service looks “too big”, don’t assume it is heap growth. It may be page cache, socket buffers, or kernel-side accounting.
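To see which of those categories is actually inflating the bill, compare memory.current against the big line items in memory.stat. A sketch, assuming cgroup v2 at /sys/fs/cgroup; the function name is mine:

```shell
#!/bin/sh
# Sketch: summarize where a cgroup's charged memory actually lives.
# memory.stat is "name value" pairs, values in bytes.
memory_breakdown() {
    dir="$1"
    awk '$1 == "anon" || $1 == "file" || $1 == "slab" || $1 == "sock" {
        printf "%-6s %10.1f MiB\n", $1, $2 / 1048576
    }' "$dir/memory.stat"
}

# Usage on a real host:
#   memory_breakdown /sys/fs/cgroup/api
```

If file dwarfs anon, the “bloat” is page cache, and reclaim or memory.high will handle it very differently than heap growth.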
memory.min: hard protection
memory.min is the strongest protection boundary.
If the cgroup is within its effective min boundary, that memory is not reclaimed under normal reclaim logic.
If the system runs out of reclaimable unprotected memory, the result is OOM pressure rather than violating the guarantee.
Use it sparingly:
- control-plane daemons you really cannot afford to starve,
- tiny but essential supporting services,
- workloads where reclaim-induced collapse is worse than earlier OOM elsewhere.
Do not scatter large memory.min values everywhere. Overcommitted hard protection is how you manufacture constant OOMs.
memory.low: best-effort protection
memory.low is the softer, more operationally friendly sibling.
Below this boundary, the cgroup is protected unless the system has no reclaimable memory in unprotected cgroups.
This is usually the better default protection knob for:
- user-facing services,
- databases or caches that benefit from preserving working set,
- support slices needed by the primary workload.
If memory.events.low keeps rising, that usually means your low protection is overcommitted or the host is under deeper pressure than your model assumed.
memory.high: throttle limit, and usually the main control
This is the most important control.
When usage goes above memory.high, the workload is pushed into heavy reclaim and allocation throttling.
Key behavior:
- it does not directly invoke the OOM killer,
- it can be exceeded temporarily,
- if you set it too low, the workload usually degrades rather than dies,
- it is the best knob for finding the smallest memory footprint that still performs acceptably.
This is why kernel docs and systemd guidance both treat memory.high / MemoryHigh= as the primary mechanism.
Think of it as:
“Pay back excess memory by doing reclaim work now.”
That repayment often shows up as latency spikes, CPU burn in reclaim, and PSI growth.
memory.max: hard limit, last line of defense
memory.max is the absolute ceiling.
If usage reaches it and reclaim cannot contain the cgroup, the cgroup enters OOM handling.
Use it for:
- tenant containment,
- protection against buggy or runaway memory growth,
- preventing one workload from consuming all host slack.
Do not use it as your first tuning tool unless you enjoy surprise kills. The usual safer pattern is:
- tune memory.high,
- observe pressure,
- set memory.max above that as the safety backstop.
memory.swap.max: swap policy in one file
memory.swap.max sets the hard cap on swap usage for the cgroup.
Common patterns:
- 0 for low-latency services where swap-induced tail latency is unacceptable,
- a finite value for batch workloads that can trade latency for throughput or survival,
- max only when you intentionally want the workload to compete for swap without a per-cgroup ceiling.
Be deliberate here. Swap is not automatically evil, but accidental swap policy usually is.
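Making it deliberate can be as small as a helper that encodes the policy per workload class. A sketch; the function name, class names, and paths are examples, not a standard layout:

```shell
#!/bin/sh
# Sketch: apply an explicit per-cgroup swap policy.
# Accepts bytes, a K/M/G-suffixed value, or "max" (the kernel parses all three).
set_swap_policy() {
    dir="$1"
    policy="$2"
    echo "$policy" > "$dir/memory.swap.max"
}

# On a real host (as root):
#   set_swap_policy /sys/fs/cgroup/gateway 0    # never swap
#   set_swap_policy /sys/fs/cgroup/batch   4G   # bounded swap
```

The point is that every cgroup gets a value on purpose, so nobody discovers the swap policy during an incident.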
memory.swap.high: emergency warning, not your everyday knob
memory.swap.high is a throttle point for swap usage.
Kernel docs describe it as a point of no return and explicitly not the normal way to manage healthy swap behavior.
Operationally:
- use memory.swap.max for policy,
- treat memory.swap.high as an advanced/rare signal when building custom out-of-memory responses.
memory.oom.group: kill the workload, not a random organ
With memory.oom.group=1, the OOM killer treats the cgroup as an indivisible workload and kills tasks together.
This is often better for:
- multi-process workers,
- JVM + helper process bundles,
- application groups that become useless if one child disappears,
- service units where partial survival creates bad state.
Without it, you can end up with the ugly middle ground: a “still running” workload that has already lost the process that mattered.
memory.reclaim: proactive reclaim
memory.reclaim lets you ask the kernel to reclaim memory from a cgroup manually.
Example:
echo "1G" | sudo tee /sys/fs/cgroup/batch/memory.reclaim
Use it for:
- experiments,
- pre-maintenance cleanup,
- testing how reclaim-sensitive a workload is,
- gentle pre-pressure trimming.
But note the caveat from kernel docs: reclaim triggered this way is not the same as natural pressure-driven reclaim, and some side effects such as socket memory balancing are not exercised in the same way.
memory.events: the scoreboard
memory.events is one of the most useful files in the whole controller.
Key counters:
- low: reclaimed despite being under the low boundary → protection is overcommitted or ineffective,
- high: throttled because the high boundary was exceeded,
- max: tried to go over max,
- oom: an allocation was about to fail at the limit,
- oom_kill: processes killed by OOM,
- oom_group_kill: whole-group OOM event,
- sock_throttled: socket throttling events.
Use memory.events.local if you want local-only signals rather than hierarchical subtree counts.
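The file is plain "name value" pairs, so pulling one counter out for alerting or rate tracking is a one-liner. A sketch; the function name is mine:

```shell
#!/bin/sh
# Sketch: extract a single counter from memory.events (or memory.events.local).
# $1 = path to the events file, $2 = counter name (low, high, oom_kill, ...).
memory_event() {
    awk -v k="$2" '$1 == k { print $2 }' "$1"
}

# Usage on a real host:
#   memory_event /sys/fs/cgroup/api/memory.events high
#   memory_event /sys/fs/cgroup/api/memory.events.local oom_kill
```

Remember these are monotonically increasing counters: the interesting signal is their growth rate between scrapes, not the absolute value.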
memory.pressure: the “is this hurting?” signal
PSI is what turns memory control from guesswork into operations.
memory.pressure exports:
- some: at least some tasks are stalled on memory pressure,
- full: all non-idle tasks are stalled; this is closer to actual thrash / productivity collapse.
Each line carries rolling averages over 10-, 60-, and 300-second windows (avg10, avg60, avg300) plus a cumulative stall total in microseconds.
memory.current tells you how much memory is charged.
memory.pressure tells you whether that memory situation is actually costing productive time.
That distinction matters a lot. A big cache may be healthy. A modest footprint with rising PSI may be dying.
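The PSI file format is stable enough to parse directly. A sketch that extracts the full avg10 value; the function name is mine:

```shell
#!/bin/sh
# Sketch: extract "full" avg10 from a PSI file. The lines look like:
#   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
#   full avg10=0.00 avg60=0.00 avg300=0.00 total=0
psi_full_avg10() {
    awk '$1 == "full" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^avg10=/) { sub(/^avg10=/, "", $i); print $i }
    }' "$1"
}

# Usage on a real host:
#   psi_full_avg10 /sys/fs/cgroup/api/memory.pressure
```

The value is a percentage of wall-clock time recently spent fully stalled, which makes it directly comparable across cgroups of different sizes.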
memory.stat: explain the bill
memory.stat breaks usage down into categories and reclaim-related counters.
Use it to answer questions like:
- is this anonymous memory or page cache?
- are slab / sock / kernel allocations the hidden culprit?
- are workingset refaults suggesting cache churn?
- is file cache getting beaten up by reclaim?
If you only watch memory.current, you are debugging with one eye shut.
4) Hierarchy rules people forget
Protections are hierarchical
memory.min and memory.low are constrained by ancestor protections.
If the parent does not provide enough protection budget, children compete for the effective protection.
So this is wrong:
- leaf cgroup gets a large memory.low,
- ancestor has no corresponding protection,
- operator assumes the leaf is protected.
This is closer to correct:
- parent slice allocates protection,
- children receive explicit or shared portions,
- siblings compete within the parent’s protection budget.
Overcommit is possible — and sometimes useful
Kernel docs explicitly note that overcommitting memory.high can be viable.
That is often true because memory.high produces throttling and reclaim, not instant death.
Overcommitting protection is more dangerous:
- overcommitted memory.low can be okay if you understand the priority tradeoffs,
- overcommitted memory.min is much riskier and can force frequent OOM.
Moving a process does not move old memory charges
This one bites people during debugging and live migrations. Memory is charged to the cgroup that instantiated it and stays charged there until released. Moving the process later does not retroactively move all previously charged memory.
That means:
- post-hoc process migration can produce misleading accounting,
- experiments should create the cgroup shape first, then start the process inside it,
- “why is the old cgroup still fat?” often has a very boring answer.
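The “create the shape first, then start the process inside it” rule can be sketched as a launcher. The function name and example workload are mine; on a real host the directory would live under the cgroup v2 mount and this needs appropriate privileges:

```shell
#!/bin/sh
# Sketch: create the cgroup first, then exec the workload inside it,
# so every allocation is charged to the right cgroup from the first page.
start_in_cgroup() {
    dir="$1"
    shift
    mkdir -p "$dir"
    # Move this shell into the cgroup, then replace it with the workload.
    echo $$ > "$dir/cgroup.procs"
    exec "$@"
}

# On a real host (as root):
#   start_in_cgroup /sys/fs/cgroup/batch ./etl-job --shard 3
```

Compare this with starting the process first and migrating it later: the migration succeeds, but most of its existing charges stay behind in the old cgroup.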
5) Minimal safe patterns
Pattern 1: protect a critical API, throttle background jobs
sudo mkdir -p /sys/fs/cgroup/api /sys/fs/cgroup/batch
sudo sh -c 'echo +memory > /sys/fs/cgroup/cgroup.subtree_control'
echo 2147483648 | sudo tee /sys/fs/cgroup/api/memory.low
echo 3221225472 | sudo tee /sys/fs/cgroup/api/memory.high
echo 17179869184 | sudo tee /sys/fs/cgroup/batch/memory.high
echo 21474836480 | sudo tee /sys/fs/cgroup/batch/memory.max
Interpretation:
- API gets 2 GiB best-effort protection,
- API still has a throttle point to avoid silent growth,
- batch gets slowed first under pressure,
- batch has a hard containment wall.
Pattern 2: no-swap low-latency service
echo 0 | sudo tee /sys/fs/cgroup/gateway/memory.swap.max
echo 2147483648 | sudo tee /sys/fs/cgroup/gateway/memory.low
echo 3221225472 | sudo tee /sys/fs/cgroup/gateway/memory.high
Good for:
- latency-sensitive gateways,
- request-path components where swap would destroy tail latency.
But remember: if the working set truly needs 5 GiB, “no swap” plus memory.high=3G is just a very opinionated failure plan.
Pattern 3: treat worker pool as a single failure unit
echo 1 | sudo tee /sys/fs/cgroup/workerpool/memory.oom.group
echo 8589934592 | sudo tee /sys/fs/cgroup/workerpool/memory.max
Good when a half-killed worker pool is worse than a clean restart.
Pattern 4: proactive reclaim before maintenance
echo "512M" | sudo tee /sys/fs/cgroup/cache-warmers/memory.reclaim
Good for testing or trimming non-critical caches before doing disruptive work.
6) systemd mappings you will actually use
If the machine is managed by systemd, use unit properties instead of hand-writing /sys/fs/cgroup/* unless you are debugging live.
Important mappings:
- MemoryAccounting=yes → enable memory accounting
- MemoryMin= → memory.min
- MemoryLow= → memory.low
- MemoryHigh= → memory.high
- MemoryMax= → memory.max
- MemorySwapMax= → memory.swap.max
- ManagedOOMMemoryPressure=kill → let systemd-oomd watch pressure and act
- ManagedOOMMemoryPressureLimit= → custom PSI threshold for systemd-oomd
Examples:
sudo systemctl set-property api.service MemoryAccounting=yes
sudo systemctl set-property api.service MemoryLow=2G
sudo systemctl set-property api.service MemoryHigh=3G
sudo systemctl set-property api.service MemoryMax=4G
sudo systemctl set-property gateway.service MemorySwapMax=0
sudo systemctl set-property batch.service MemoryHigh=12G
sudo systemctl set-property batch.service MemoryMax=16G
sudo systemctl set-property worker.slice ManagedOOMMemoryPressure=kill
sudo systemctl set-property worker.slice ManagedOOMMemoryPressureLimit=40%
Important systemd guidance worth internalizing:
- use MemoryHigh= as the main control mechanism,
- use MemoryMax= as the last line of defense,
- protection settings usually need meaningful ancestor allocation too.
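For configuration you want reviewable and in version control, the same properties can be shipped as a unit drop-in instead of runtime set-property calls. A sketch; the unit name, file name, and sizes are examples:

```shell
#!/bin/sh
# Sketch: write a systemd drop-in carrying the memory policy for a unit.
# $1 = drop-in directory, e.g. /etc/systemd/system/api.service.d
write_memory_dropin() {
    unit_dir="$1"
    mkdir -p "$unit_dir"
    cat > "$unit_dir/50-memory.conf" <<'EOF'
[Service]
MemoryAccounting=yes
MemoryLow=2G
MemoryHigh=3G
MemoryMax=4G
EOF
}

# On a real host (as root):
#   write_memory_dropin /etc/systemd/system/api.service.d
#   systemctl daemon-reload && systemctl restart api.service
```

Drop-ins survive package upgrades of the unit file and make the memory policy visible in `systemctl cat api.service`.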
7) Observability checklist
At minimum, watch per cgroup:
- memory.current
- memory.peak
- memory.stat
- memory.events
- memory.events.local
- memory.pressure
- memory.swap.current
- app p95 / p99 latency
- timeout / retry / queue depth signals
Basic commands:
cat /sys/fs/cgroup/api/memory.current
cat /sys/fs/cgroup/api/memory.peak
cat /sys/fs/cgroup/api/memory.events
cat /sys/fs/cgroup/api/memory.events.local
cat /sys/fs/cgroup/api/memory.pressure
cat /sys/fs/cgroup/api/memory.swap.current
grep -E 'anon|file|kernel|slab|sock|workingset|pgscan|pgsteal' /sys/fs/cgroup/api/memory.stat
Interpretation hints:
- memory.events.high climbing + latency degradation → memory.high is active and hurting; maybe okay, maybe too tight.
- high PSI some but low full → contention exists, but total collapse has not arrived yet.
- rising PSI full → reclaim thrash / serious productivity loss.
- memory.events.low increasing → protection is overcommitted or insufficient at ancestors.
- memory.events.max / oom / oom_kill non-zero → you are no longer tuning; you are already in containment failure.
- swap usage climbing without PSI relief → swap is likely just stretching the pain, not solving it.
If you can only alert on one thing beyond OOM kills, alert on sustained memory PSI plus memory.events.high.
That catches the bad middle state before the box starts killing things.
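That combined condition is easy to encode as a check your poller can run per scrape. A sketch; the function name, the default threshold, and the path layout are mine:

```shell
#!/bin/sh
# Sketch: alert when PSI "some" avg60 is above a threshold AND the
# memory.events "high" counter has grown since the previous scrape.
# $1 = cgroup dir, $2 = PSI threshold (percent), $3 = previous "high" count.
memory_alert() {
    dir="$1"
    psi_limit="${2:-10.0}"
    prev_high="${3:-0}"
    avg60=$(awk '$1 == "some" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^avg60=/) { sub(/^avg60=/, "", $i); print $i }
    }' "$dir/memory.pressure")
    high=$(awk '$1 == "high" { print $2 }' "$dir/memory.events")
    # Exit 0 (alert) only when both conditions hold.
    awk -v a="$avg60" -v l="$psi_limit" -v h="$high" -v p="$prev_high" \
        'BEGIN { exit !(a + 0 > l + 0 && h + 0 > p + 0) }'
}

# Usage on a real host, remembering the counter between scrapes:
#   memory_alert /sys/fs/cgroup/api 10.0 "$last_high" && page_someone
```

Requiring both signals filters out harmless bursts: PSI alone can spike briefly, and the high counter alone can tick during benign cache trimming.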
8) Rollout sequence that usually works
- Measure first: baseline memory.current, memory.pressure, memory.events, and app latency.
- Separate workloads into sane sibling cgroups.
- Set small, explicit protections only for the workloads that truly deserve them.
- Introduce memory.high as the main operating limit.
- Observe for at least one business cycle:
  - PSI,
  - memory.events.high,
  - application latency,
  - swap behavior.
- Add memory.max as a safety barrier after you understand steady-state behavior.
- Decide swap policy explicitly per workload class.
- For integrity-sensitive groups, enable memory.oom.group=1.
- Revisit settings after major changes in:
  - kernel version,
  - filesystem / I/O behavior,
  - cache shape,
  - traffic mix,
  - resident set profile.
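The “measure first” step can be a simple snapshot you keep and diff after each change. A sketch; the function name and output path are mine:

```shell
#!/bin/sh
# Sketch: capture a baseline of the key memory files for one cgroup.
# $1 = cgroup dir, $2 = output file.
snapshot_memory() {
    dir="$1"
    out="$2"
    {
        date -u
        for f in memory.current memory.pressure memory.events memory.stat; do
            echo "== $f =="
            cat "$dir/$f" 2>/dev/null || echo "(missing)"
        done
    } > "$out"
}

# Usage on a real host:
#   snapshot_memory /sys/fs/cgroup/api /var/tmp/api-memory-baseline.txt
```

Diffing snapshots taken before and after a limit change answers “did this knob actually do anything?” without guessing from dashboards.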
9) Common mistakes
- Using only hard limits: jumping straight to memory.max is the fastest route to surprise OOMs.
- Protecting everything: if every cgroup is “critical”, none of your protection policy is real.
- Forgetting ancestor allocations: leaf memory.low without parent support is fake confidence.
- Reading usage without pressure: high memory usage is not automatically bad; high PSI often is.
- Ignoring page cache and kernel memory: memory.current is not just app heap.
- Treating swap as a moral question: it is a workload-policy question. Some workloads should never swap; others benefit from controlled swap.
- Moving processes after the fact and trusting accounting: old charges stay where they were created.
- Using group OOM blindly: memory.oom.group=1 is great when the workload is truly indivisible, bad when one helper process can be safely sacrificed.
- Ignoring memory.events.low: it is one of the easiest ways to notice that your protection model is fantasy.
10) One-page starter policy
If you need a practical default today:
critical online service
- MemoryAccounting=yes
- modest MemoryLow= based on observed working set
- MemoryHigh= slightly above that working set
- MemoryMax= as last defense
- MemorySwapMax=0 only if tail latency really matters
background / batch / maintenance jobs
- separate cgroup or service slice
- lower or no protection
- firmer MemoryHigh=
- MemoryMax= if host blast radius must be capped
- allow some swap when throughput matters more than latency
small essential host-support services
- tiny MemoryMin= or MemoryLow=
- avoid starving the very daemons needed to keep the workload healthy
multi-process worker bundles
- consider memory.oom.group=1
- prefer clean restart over weird partial survival
If you are unsure, this heuristic is usually sane:
Protect a little, throttle early, cap late, watch PSI, and make swap a conscious choice.
Source notes
Primary references used for this note:
- Linux kernel cgroup v2 documentation (memory.min, memory.low, memory.high, memory.max, memory.reclaim, memory.events, memory.swap.max, memory.oom.group)
- Linux kernel PSI documentation (memory.pressure, some vs full, threshold monitoring)
- systemd resource-control documentation / Debian manpage (MemoryMin, MemoryLow, MemoryHigh, MemoryMax, MemorySwapMax, ManagedOOMMemoryPressure)
- Facebook cgroup2 memory controller guide (practical memory.low, memory.current, swap usage, and fbtax2 operating lessons)