Linux cgroup v2 Memory Controller Playbook (memory.min/low/high/max, swap, PSI, systemd mappings)
Date: 2026-04-06
Category: knowledge
Why this matters
A lot of production memory incidents are not really “we ran out of RAM” in the simple sense. They are usually one of these:
- one batch job bloats page cache and pushes a latency-sensitive service into reclaim,
- background daemons and sidecars quietly eat headroom until the main workload starts thrashing,
- the host still looks fine on average, but memory.high reclaim turns p99 into sludge,
- operators jump straight to hard limits and get surprise OOM kills instead of graceful slowdown,
- swap policy is accidental rather than explicit, so the box fails in weird, slow motion.
cgroup v2 gives a much better vocabulary for this than old per-process tuning:
- protection for workloads that must keep breathing,
- throttling for workloads that can slow down before they die,
- hard containment for blast-radius control,
- pressure signals that tell you whether a workload is actually suffering,
- hierarchical policy so the host can decide who yields first.
The key mindset shift is this:
Treat memory as a contested shared resource whose failure modes are reclaim, stall, swap, and finally OOM — not just a single usage number.
1) Quick mental model
Use the knobs for different jobs:
- memory.min: hard protection. “Do not reclaim this working set; if that becomes impossible, kill something else or OOM.”
- memory.low: best-effort protection. “Prefer not to reclaim this working set unless unprotected memory is exhausted.”
- memory.high: throttle + reclaim pressure. “This is the main operating limit. Slow the workload down before catastrophe.”
- memory.max: hard ceiling. “Last line of defense. If reclaim cannot contain usage, OOM inside the cgroup.”
- memory.swap.max: swap ceiling. “How much swap is this workload allowed to consume?”
- memory.oom.group: kill as a unit. “If this workload OOMs, kill the whole thing together rather than leaving a half-dead shard.”
- memory.events / memory.pressure: telemetry. “Are we protected, throttled, thrashing, or approaching OOM?”
- memory.reclaim: proactive reclaim trigger. “Ask the kernel to reclaim from this cgroup now.”
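All of these knobs are plain files in the cgroup's directory under the cgroup v2 mount. A minimal inspection helper, assuming cgroup v2 is mounted at /sys/fs/cgroup (the function name and example path are mine, not standard tooling):

```shell
#!/bin/sh
# Sketch: dump the key memory knobs for one cgroup directory,
# tolerating files that are absent (e.g. on the root cgroup).
dump_memory_knobs() {
    dir="$1"
    for f in memory.min memory.low memory.high memory.max \
             memory.swap.max memory.oom.group memory.current; do
        if [ -r "$dir/$f" ]; then
            printf '%-18s %s\n' "$f" "$(cat "$dir/$f")"
        else
            printf '%-18s (not present)\n' "$f"
        fi
    done
}

# Usage on a real host:
#   dump_memory_knobs /sys/fs/cgroup/api
```

Reading all of them in one shot is usually more informative than eyeballing a single file.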
Rule of thumb:
- start with memory.high as the main control,
- use memory.low / memory.min only for workloads that truly deserve protection,
- keep memory.max as the crash barrier, not the primary tuning knob,
- decide swap policy explicitly,
- read PSI + events + stat together, not one file at a time.
2) Decision matrix
A) User-facing service slows down under host contention
- give the service memory.low, or a modest memory.min if it is truly critical,
- put noisy neighbors in sibling cgroups,
- set a reasonable memory.high on the noisy workloads,
- watch memory.pressure and memory.events.high.
B) You want graceful slowdown before OOM
- set memory.high first,
- alert on sustained memory.events.high growth or rising PSI,
- use memory.max only as the final containment fence.
C) Batch / ETL / compaction jobs must not eat the machine
- isolate them in their own cgroup,
- apply a conservative memory.high,
- optionally add memory.max if host safety matters more than job completion time,
- consider allowing swap if throughput matters more than latency.
D) A workload must preserve integrity on OOM
- set memory.oom.group=1,
- ensure restart / orchestration logic expects whole-group death,
- avoid partial survivors with corrupt or incomplete in-memory state.
E) Latency-sensitive workload should never swap
- set memory.swap.max=0,
- ensure memory.low / memory.high are realistic,
- remember that “no swap” without enough headroom often just turns slow failure into fast OOM.
F) You are tuning protections in a hierarchy
- remember protections are hierarchical,
- set ancestor protection too, not just the leaf,
- avoid protection overcommit unless you understand how siblings will compete.
3) What each knob really does
memory.current: what is charged right now
memory.current shows current memory usage for the cgroup and its descendants.
It is broader than many people expect:
- anonymous memory,
- page cache,
- kernel objects such as inodes/slab,
- network buffers.
So if a service looks “too big”, don’t assume it is heap growth. It may be page cache, socket buffers, or kernel-side accounting.
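To see which of those categories is actually inflating the bill, compare memory.current against the big line items in memory.stat. A sketch, assuming cgroup v2 at /sys/fs/cgroup; the function name is mine:

```shell
#!/bin/sh
# Sketch: summarize where a cgroup's charged memory actually lives.
# memory.stat is "name value" pairs, values in bytes.
memory_breakdown() {
    dir="$1"
    awk '$1 == "anon" || $1 == "file" || $1 == "slab" || $1 == "sock" {
        printf "%-6s %10.1f MiB\n", $1, $2 / 1048576
    }' "$dir/memory.stat"
}

# Usage on a real host:
#   memory_breakdown /sys/fs/cgroup/api
```

If file dwarfs anon, the “bloat” is page cache, and reclaim or memory.high will handle it very differently than heap growth.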
memory.min: hard protection
memory.min is the strongest protection boundary.
If the cgroup is within its effective min boundary, that memory is not reclaimed under normal reclaim logic.
If the system runs out of reclaimable unprotected memory, the result is OOM pressure rather than violating the guarantee.
Use it sparingly:
- control-plane daemons you really cannot afford to starve,
- tiny but essential supporting services,
- workloads where reclaim-induced collapse is worse than earlier OOM elsewhere.
Do not scatter large memory.min values everywhere. Overcommitted hard protection is how you manufacture constant OOMs.
memory.low: best-effort protection
memory.low is the softer, more operationally friendly sibling.
Below this boundary, the cgroup is protected unless the system has no reclaimable memory in unprotected cgroups.
This is usually the better default protection knob for:
- user-facing services,
- databases or caches that benefit from preserving working set,
- support slices needed by the primary workload.
If memory.events.low keeps rising, that usually means your low protection is overcommitted or the host is under deeper pressure than your model assumed.
memory.high: throttle limit, and usually the main control
This is the most important control.
When usage goes above memory.high, the workload is pushed into heavy reclaim and allocation throttling.
Key behavior:
- it does not directly invoke the OOM killer,
- it can be exceeded temporarily,
- if you set it too low, the workload usually degrades rather than dies,
- it is the best knob for finding the smallest memory footprint that still performs acceptably.
This is why kernel docs and systemd guidance both treat memory.high / MemoryHigh= as the primary mechanism.
Think of it as:
“Pay back excess memory by doing reclaim work now.”
That repayment often shows up as latency spikes, CPU burn in reclaim, and PSI growth.
memory.max: hard limit, last line of defense
memory.max is the absolute ceiling.
If usage reaches it and reclaim cannot contain the cgroup, the cgroup enters OOM handling.
Use it for:
- tenant containment,
- protection against buggy or runaway memory growth,
- preventing one workload from consuming all host slack.
Do not use it as your first tuning tool unless you enjoy surprise kills. The usual safer pattern is:
- tune memory.high,
- observe pressure,
- set memory.max above that as the safety backstop.
memory.swap.max: swap policy in one file
memory.swap.max sets the hard cap on swap usage for the cgroup.
Common patterns:
- 0 for low-latency services where swap-induced tail latency is unacceptable,
- a finite value for batch workloads that can trade latency for throughput or survival,
- max only when you intentionally want the workload to compete for swap without a per-cgroup ceiling.
Be deliberate here. Swap is not automatically evil, but accidental swap policy usually is.
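Making it deliberate can be as small as a helper that encodes the policy per workload class. A sketch; the function name, class names, and paths are examples, not a standard layout:

```shell
#!/bin/sh
# Sketch: apply an explicit per-cgroup swap policy.
# Accepts bytes, a K/M/G-suffixed value, or "max" (the kernel parses all three).
set_swap_policy() {
    dir="$1"
    policy="$2"
    echo "$policy" > "$dir/memory.swap.max"
}

# On a real host (as root):
#   set_swap_policy /sys/fs/cgroup/gateway 0    # never swap
#   set_swap_policy /sys/fs/cgroup/batch   4G   # bounded swap
```

The point is that every cgroup gets a value on purpose, so nobody discovers the swap policy during an incident.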
memory.swap.high: emergency warning, not your everyday knob
memory.swap.high is a throttle point for swap usage.
Kernel docs describe it as a point of no return and explicitly not the normal way to manage healthy swap behavior.
Operationally:
- use memory.swap.max for policy,
- treat memory.swap.high as an advanced/rare signal when building custom out-of-memory responses.
memory.oom.group: kill the workload, not a random organ
With memory.oom.group=1, the OOM killer treats the cgroup as an indivisible workload and kills tasks together.
This is often better for:
- multi-process workers,
- JVM + helper process bundles,
- application groups that become useless if one child disappears,
- service units where partial survival creates bad state.
Without it, you can end up with the ugly middle ground: a “still running” workload that has already lost the process that mattered.
memory.reclaim: proactive reclaim
memory.reclaim lets you ask the kernel to reclaim memory from a cgroup manually.
Example:
echo "1G" | sudo tee /sys/fs/cgroup/batch/memory.reclaim
Use it for:
- experiments,
- pre-maintenance cleanup,
- testing how reclaim-sensitive a workload is,
- gentle pre-pressure trimming.
But note the caveat from kernel docs: reclaim triggered this way is not the same as natural pressure-driven reclaim, and some side effects such as socket memory balancing are not exercised in the same way.
memory.events: the scoreboard
memory.events is one of the most useful files in the whole controller.
Key counters:
- low: reclaimed despite being under the low boundary → protection is overcommitted or ineffective,
- high: throttled because the high boundary was exceeded,
- max: tried to go over max,
- oom: an allocation was about to fail at the limit,
- oom_kill: processes killed by OOM,
- oom_group_kill: whole-group OOM event,
- sock_throttled: socket throttling events.
Use memory.events.local if you want local-only signals rather than hierarchical subtree counts.
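The file is plain "name value" pairs, so pulling one counter out for alerting or rate tracking is a one-liner. A sketch; the function name is mine:

```shell
#!/bin/sh
# Sketch: extract a single counter from memory.events (or memory.events.local).
# $1 = path to the events file, $2 = counter name (low, high, oom_kill, ...).
memory_event() {
    awk -v k="$2" '$1 == k { print $2 }' "$1"
}

# Usage on a real host:
#   memory_event /sys/fs/cgroup/api/memory.events high
#   memory_event /sys/fs/cgroup/api/memory.events.local oom_kill
```

Remember these are monotonically increasing counters: the interesting signal is their growth rate between scrapes, not the absolute value.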
memory.pressure: the “is this hurting?” signal
PSI is what turns memory control from guesswork into operations.
memory.pressure exports:
- some: at least some tasks are stalled on memory pressure,
- full: all non-idle tasks are stalled; this is closer to actual thrash / productivity collapse.
Each line carries rolling averages over 10-, 60-, and 300-second windows (avg10, avg60, avg300) plus a cumulative stall total in microseconds.
memory.current tells you how much memory is charged.
memory.pressure tells you whether that memory situation is actually costing productive time.
That distinction matters a lot. A big cache may be healthy. A modest footprint with rising PSI may be dying.
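The PSI file format is stable enough to parse directly. A sketch that extracts the full avg10 value; the function name is mine:

```shell
#!/bin/sh
# Sketch: extract "full" avg10 from a PSI file. The lines look like:
#   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
#   full avg10=0.00 avg60=0.00 avg300=0.00 total=0
psi_full_avg10() {
    awk '$1 == "full" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^avg10=/) { sub(/^avg10=/, "", $i); print $i }
    }' "$1"
}

# Usage on a real host:
#   psi_full_avg10 /sys/fs/cgroup/api/memory.pressure
```

The value is a percentage of wall-clock time recently spent fully stalled, which makes it directly comparable across cgroups of different sizes.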
memory.stat: explain the bill
memory.stat breaks usage down into categories and reclaim-related counters.
Use it to answer questions like:
- is this anonymous memory or page cache?
- are slab / sock / kernel allocations the hidden culprit?
- are workingset refaults suggesting cache churn?
- is file cache getting beaten up by reclaim?
If you only watch memory.current, you are debugging with one eye shut.
4) Hierarchy rules people forget
Protections are hierarchical
memory.min and memory.low are constrained by ancestor protections.
If the parent does not provide enough protection budget, children compete for the effective protection.
So this is wrong:
- leaf cgroup gets a large memory.low,
- ancestor has no corresponding protection,
- operator assumes the leaf is protected.
This is closer to correct:
- parent slice allocates protection,
- children receive explicit or shared portions,
- siblings compete within the parent’s protection budget.
Overcommit is possible — and sometimes useful
Kernel docs explicitly note that overcommitting memory.high can be viable.
That is often true because memory.high produces throttling and reclaim, not instant death.
Overcommitting protection is more dangerous:
- overcommitted memory.low can be okay if you understand the priority tradeoffs,
- overcommitted memory.min is much riskier and can force frequent OOM.
Moving a process does not move old memory charges
This one bites people during debugging and live migrations. Memory is charged to the cgroup that instantiated it and stays charged there until released. Moving the process later does not retroactively move all previously charged memory.
That means:
- post-hoc process migration can produce misleading accounting,
- experiments should create the cgroup shape first, then start the process inside it,
- “why is the old cgroup still fat?” often has a very boring answer.
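The “create the shape first, then start the process inside it” rule can be sketched as a launcher. The function name and example workload are mine; on a real host the directory would live under the cgroup v2 mount and this needs appropriate privileges:

```shell
#!/bin/sh
# Sketch: create the cgroup first, then exec the workload inside it,
# so every allocation is charged to the right cgroup from the first page.
start_in_cgroup() {
    dir="$1"
    shift
    mkdir -p "$dir"
    # Move this shell into the cgroup, then replace it with the workload.
    echo $$ > "$dir/cgroup.procs"
    exec "$@"
}

# On a real host (as root):
#   start_in_cgroup /sys/fs/cgroup/batch ./etl-job --shard 3
```

Compare this with starting the process first and migrating it later: the migration succeeds, but most of its existing charges stay behind in the old cgroup.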
5) Minimal safe patterns
Pattern 1: protect a critical API, throttle background jobs
sudo mkdir -p /sys/fs/cgroup/api /sys/fs/cgroup/batch
sudo sh -c 'echo +memory > /sys/fs/cgroup/cgroup.subtree_control'
echo 2147483648 | sudo tee /sys/fs/cgroup/api/memory.low
echo 3221225472 | sudo tee /sys/fs/cgroup/api/memory.high
echo 17179869184 | sudo tee /sys/fs/cgroup/batch/memory.high
echo 21474836480 | sudo tee /sys/fs/cgroup/batch/memory.max
Interpretation:
- API gets 2 GiB best-effort protection,
- API still has a throttle point to avoid silent growth,
- batch gets slowed first under pressure,
- batch has a hard containment wall.
Pattern 2: no-swap low-latency service
echo 0 | sudo tee /sys/fs/cgroup/gateway/memory.swap.max
echo 2147483648 | sudo tee /sys/fs/cgroup/gateway/memory.low
echo 3221225472 | sudo tee /sys/fs/cgroup/gateway/memory.high
Good for:
- latency-sensitive gateways,
- request-path components where swap would destroy tail latency.
But remember: if the working set truly needs 5 GiB, “no swap” plus memory.high=3G is just a very opinionated failure plan.
Pattern 3: treat worker pool as a single failure unit
echo 1 | sudo tee /sys/fs/cgroup/workerpool/memory.oom.group
echo 8589934592 | sudo tee /sys/fs/cgroup/workerpool/memory.max
Good when a half-killed worker pool is worse than a clean restart.
Pattern 4: proactive reclaim before maintenance
echo "512M" | sudo tee /sys/fs/cgroup/cache-warmers/memory.reclaim
Good for testing or trimming non-critical caches before doing disruptive work.
6) systemd mappings you will actually use
If the machine is managed by systemd, use unit properties instead of hand-writing /sys/fs/cgroup/* unless you are debugging live.
Important mappings:
- MemoryAccounting=yes → enable memory accounting
- MemoryMin= → memory.min
- MemoryLow= → memory.low
- MemoryHigh= → memory.high
- MemoryMax= → memory.max
- MemorySwapMax= → memory.swap.max
- ManagedOOMMemoryPressure=kill → let systemd-oomd watch pressure and act
- ManagedOOMMemoryPressureLimit= → custom PSI threshold for systemd-oomd
Examples:
sudo systemctl set-property api.service MemoryAccounting=yes
sudo systemctl set-property api.service MemoryLow=2G
sudo systemctl set-property api.service MemoryHigh=3G
sudo systemctl set-property api.service MemoryMax=4G
sudo systemctl set-property gateway.service MemorySwapMax=0
sudo systemctl set-property batch.service MemoryHigh=12G
sudo systemctl set-property batch.service MemoryMax=16G
sudo systemctl set-property worker.slice ManagedOOMMemoryPressure=kill
sudo systemctl set-property worker.slice ManagedOOMMemoryPressureLimit=40%
Important systemd guidance worth internalizing:
- use MemoryHigh= as the main control mechanism,
- use MemoryMax= as the last line of defense,
- protection settings usually need meaningful ancestor allocation too.
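For configuration you want reviewable and in version control, the same properties can be shipped as a unit drop-in instead of runtime set-property calls. A sketch; the unit name, file name, and sizes are examples:

```shell
#!/bin/sh
# Sketch: write a systemd drop-in carrying the memory policy for a unit.
# $1 = drop-in directory, e.g. /etc/systemd/system/api.service.d
write_memory_dropin() {
    unit_dir="$1"
    mkdir -p "$unit_dir"
    cat > "$unit_dir/50-memory.conf" <<'EOF'
[Service]
MemoryAccounting=yes
MemoryLow=2G
MemoryHigh=3G
MemoryMax=4G
EOF
}

# On a real host (as root):
#   write_memory_dropin /etc/systemd/system/api.service.d
#   systemctl daemon-reload && systemctl restart api.service
```

Drop-ins survive package upgrades of the unit file and make the memory policy visible in `systemctl cat api.service`.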
7) Observability checklist
At minimum, watch per cgroup:
- memory.current
- memory.peak
- memory.stat
- memory.events
- memory.events.local
- memory.pressure
- memory.swap.current
- app p95 / p99 latency
- timeout / retry / queue depth signals
Basic commands:
cat /sys/fs/cgroup/api/memory.current
cat /sys/fs/cgroup/api/memory.peak
cat /sys/fs/cgroup/api/memory.events
cat /sys/fs/cgroup/api/memory.events.local
cat /sys/fs/cgroup/api/memory.pressure
cat /sys/fs/cgroup/api/memory.swap.current
grep -E 'anon|file|kernel|slab|sock|workingset|pgscan|pgsteal' /sys/fs/cgroup/api/memory.stat
Interpretation hints:
- memory.events.high climbing + latency degradation → memory.high is active and hurting; maybe okay, maybe too tight.
- high PSI some but low full → contention exists, but total collapse has not arrived yet.
- rising PSI full → reclaim thrash / serious productivity loss.
- memory.events.low increasing → protection is overcommitted or insufficient at ancestors.
- memory.events.max / oom / oom_kill non-zero → you are no longer tuning; you are already in containment failure.
- swap usage climbing without PSI relief → swap is likely just stretching the pain, not solving it.
If you can only alert on one thing beyond OOM kills, alert on sustained memory PSI plus memory.events.high.
That catches the bad middle state before the box starts killing things.
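That combined condition is easy to encode as a check your poller can run per scrape. A sketch; the function name, the default threshold, and the path layout are mine:

```shell
#!/bin/sh
# Sketch: alert when PSI "some" avg60 is above a threshold AND the
# memory.events "high" counter has grown since the previous scrape.
# $1 = cgroup dir, $2 = PSI threshold (percent), $3 = previous "high" count.
memory_alert() {
    dir="$1"
    psi_limit="${2:-10.0}"
    prev_high="${3:-0}"
    avg60=$(awk '$1 == "some" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^avg60=/) { sub(/^avg60=/, "", $i); print $i }
    }' "$dir/memory.pressure")
    high=$(awk '$1 == "high" { print $2 }' "$dir/memory.events")
    # Exit 0 (alert) only when both conditions hold.
    awk -v a="$avg60" -v l="$psi_limit" -v h="$high" -v p="$prev_high" \
        'BEGIN { exit !(a + 0 > l + 0 && h + 0 > p + 0) }'
}

# Usage on a real host, remembering the counter between scrapes:
#   memory_alert /sys/fs/cgroup/api 10.0 "$last_high" && page_someone
```

Requiring both signals filters out harmless bursts: PSI alone can spike briefly, and the high counter alone can tick during benign cache trimming.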
8) Rollout sequence that usually works
- Measure first: baseline memory.current, memory.pressure, memory.events, and app latency.
- Separate workloads into sane sibling cgroups.
- Set small, explicit protections only for the workloads that truly deserve them.
- Introduce memory.high as the main operating limit.
- Observe for at least one business cycle:
  - PSI,
  - memory.events.high,
  - application latency,
  - swap behavior.
- Add memory.max as a safety barrier after you understand steady-state behavior.
- Decide swap policy explicitly per workload class.
- For integrity-sensitive groups, enable memory.oom.group=1.
- Revisit settings after major changes in:
  - kernel version,
  - filesystem / I/O behavior,
  - cache shape,
  - traffic mix,
  - resident set profile.
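The “measure first” step can be a simple snapshot you keep and diff after each change. A sketch; the function name and output path are mine:

```shell
#!/bin/sh
# Sketch: capture a baseline of the key memory files for one cgroup.
# $1 = cgroup dir, $2 = output file.
snapshot_memory() {
    dir="$1"
    out="$2"
    {
        date -u
        for f in memory.current memory.pressure memory.events memory.stat; do
            echo "== $f =="
            cat "$dir/$f" 2>/dev/null || echo "(missing)"
        done
    } > "$out"
}

# Usage on a real host:
#   snapshot_memory /sys/fs/cgroup/api /var/tmp/api-memory-baseline.txt
```

Diffing snapshots taken before and after a limit change answers “did this knob actually do anything?” without guessing from dashboards.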
9) Common mistakes
- Using only hard limits: jumping straight to memory.max is the fastest route to surprise OOMs.
- Protecting everything: if every cgroup is “critical”, none of your protection policy is real.
- Forgetting ancestor allocations: leaf memory.low without parent support is fake confidence.
- Reading usage without pressure: high memory usage is not automatically bad; high PSI often is.
- Ignoring page cache and kernel memory: memory.current is not just app heap.
- Treating swap as a moral question: it is a workload-policy question. Some workloads should never swap; others benefit from controlled swap.
- Moving processes after the fact and trusting accounting: old charges stay where they were created.
- Using group OOM blindly: memory.oom.group=1 is great when the workload is truly indivisible, bad when one helper process can be safely sacrificed.
- Ignoring memory.events.low: it is one of the easiest ways to notice that your protection model is fantasy.
10) One-page starter policy
If you need a practical default today:
critical online service
- MemoryAccounting=yes
- modest MemoryLow= based on observed working set
- MemoryHigh= slightly above that working set
- MemoryMax= as last defense
- MemorySwapMax=0 only if tail latency really matters
background / batch / maintenance jobs
- separate cgroup or service slice
- lower or no protection
- firmer MemoryHigh=
- MemoryMax= if host blast radius must be capped
- allow some swap when throughput matters more than latency
small essential host-support services
- tiny MemoryMin= or MemoryLow=
- avoid starving the very daemons needed to keep the workload healthy
multi-process worker bundles
- consider memory.oom.group=1
- prefer clean restart over weird partial survival
If you are unsure, this heuristic is usually sane:
Protect a little, throttle early, cap late, watch PSI, and make swap a conscious choice.
Source notes
Primary references used for this note:
- Linux kernel cgroup v2 documentation (memory.min, memory.low, memory.high, memory.max, memory.reclaim, memory.events, memory.swap.max, memory.oom.group)
- Linux kernel PSI documentation (memory.pressure, some vs full, threshold monitoring)
- systemd resource-control documentation / Debian manpage (MemoryMin, MemoryLow, MemoryHigh, MemoryMax, MemorySwapMax, ManagedOOMMemoryPressure)
- Facebook cgroup2 memory controller guide (practical memory.low, memory.current, swap usage, and fbtax2 operating lessons)