Linux cgroup v2 Memory Controller Playbook (memory.min/low/high/max, swap, PSI, systemd mappings)

2026-04-06 · software


Why this matters

A lot of production memory incidents are not really “we ran out of RAM” in the simple sense. They are usually one of these:

  • reclaim churn burning CPU while the kernel scrambles for pages,
  • latency stalls while tasks wait on memory,
  • swap thrash that turns a memory problem into an I/O problem,
  • an OOM kill that takes out the wrong process.

cgroup v2 gives a much better vocabulary for this than old per-process tuning:

  • protection (memory.min, memory.low),
  • throttling (memory.high),
  • hard caps (memory.max),
  • swap policy (memory.swap.max, memory.swap.high),
  • pressure signals (memory.pressure, memory.events).

The key mindset shift is this:

Treat memory as a contested shared resource whose failure modes are reclaim, stall, swap, and finally OOM — not just a single usage number.


1) Quick mental model

Use the knobs for different jobs:

  • memory.min / memory.low: protect what must stay resident,
  • memory.high: throttle what grows too much,
  • memory.max: cap what must never exceed a ceiling,
  • memory.swap.max: make swap policy explicit,
  • memory.pressure / memory.events: tell you whether any of it hurts.

Rule of thumb:

Protect a little, throttle early, cap late, and watch PSI.


2) Decision matrix

A) User-facing service slows down under host contention
   → give it memory.low (and only rarely memory.min).

B) You want graceful slowdown before OOM
   → set memory.high below memory.max and watch memory.events.high.

C) Batch / ETL / compaction jobs must not eat the machine
   → memory.high as the working limit, memory.max as the backstop.

D) A workload must preserve integrity on OOM
   → memory.oom.group=1 so it dies whole, not piecemeal.

E) Latency-sensitive workload should never swap
   → memory.swap.max=0 for that cgroup.

F) You are tuning protections in a hierarchy
   → make sure every ancestor carries enough min/low budget for its children.


3) What each knob really does

memory.current: what is charged right now

memory.current shows current memory usage for the cgroup and its descendants. It is broader than many people expect:

  • anonymous memory (heap, stacks),
  • page cache attributed to the group,
  • kernel memory such as slab,
  • socket buffer memory.

So if a service looks “too big”, don’t assume it is heap growth. It may be page cache, socket buffers, or kernel-side accounting.
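A quick way to act on that advice is to read memory.current next to the big categories in memory.stat. A minimal sketch; the helper name is invented here, and the default path is an illustrative cgroup, not a standard one:

```shell
# Sketch: summarize what memory.current is actually made of.
# Takes a cgroup directory; /sys/fs/cgroup/api is only an example name.
summarize_memory() {
    cg="${1:-/sys/fs/cgroup/api}"
    echo "current: $(cat "$cg/memory.current") bytes"
    # anon = heap/stacks, file = page cache, slab = kernel objects,
    # sock = socket buffer memory
    awk '$1 ~ /^(anon|file|slab|sock)$/ { printf "%-5s %s bytes\n", $1, $2 }' \
        "$cg/memory.stat"
}
```

If "file" dominates, the service is big because of cache, which is usually fine; if "anon" dominates, it is genuinely holding that memory.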

memory.min: hard protection

memory.min is the strongest protection boundary. If the cgroup is within its effective min boundary, that memory is not reclaimed under normal reclaim logic. If the system runs out of reclaimable unprotected memory, the result is OOM pressure rather than violating the guarantee.

Use it sparingly:

  • only for workloads whose working set you have actually measured,
  • only where evicting that memory would be worse than OOM pressure elsewhere.

Do not scatter large memory.min values everywhere. Overcommitted hard protection is how you manufacture constant OOMs.
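As a concrete shape for "sparingly": one small, deliberate floor, expressed in bytes. A sketch; the helper name and the example path are illustrative:

```shell
# Sketch: give one cgroup a hard protection floor with memory.min.
# memory.min takes bytes (the kernel parser also accepts K/M/G suffixes).
set_memory_min() {
    cg="$1"; mib="$2"
    echo $(( mib * 1024 * 1024 )) > "$cg/memory.min"
}
# e.g.: set_memory_min /sys/fs/cgroup/critical 512   # 512 MiB floor
```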

memory.low: best-effort protection

memory.low is the softer, more operationally friendly sibling. Below this boundary, the cgroup is protected unless the system has no reclaimable memory in unprotected cgroups.

This is usually the better default protection knob for:

  • latency-sensitive services,
  • important system daemons,
  • anything you want shielded from siblings without manufacturing OOMs.

If memory.events.low keeps rising, that usually means your low protection is overcommitted or the host is under deeper pressure than your model assumed.

memory.high: throttle limit, and usually the main control

This is the most important control. When usage goes above memory.high, the workload is pushed into heavy reclaim and allocation throttling.

Key behavior:

  • exceeding memory.high does not kill anything by itself,
  • allocating tasks are throttled and forced into reclaim,
  • it is a slowdown mechanism, not a guarantee against OOM.

This is why kernel docs and systemd guidance both treat memory.high / MemoryHigh= as the primary mechanism.

Think of it as:

“Pay back excess memory by doing reclaim work now.”

That repayment often shows up as latency spikes, CPU burn in reclaim, and PSI growth.
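One useful number during such an episode is the size of that debt: how far current usage sits above memory.high. A sketch; the helper name is invented:

```shell
# Sketch: measure the "reclaim debt" -- bytes a cgroup currently sits
# above its memory.high throttle limit.
high_excess_bytes() {
    cur=$(cat "$1/memory.current")
    high=$(cat "$1/memory.high")
    # the literal "max" means no throttle limit is configured
    if [ "$high" = "max" ] || [ "$cur" -le "$high" ]; then
        echo 0
    else
        echo $(( cur - high ))
    fi
}
```

A persistently positive value alongside rising PSI means the workload is paying down its excess with latency.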

memory.max: hard limit, last line of defense

memory.max is the absolute ceiling. If usage reaches it and reclaim cannot contain the cgroup, the cgroup enters OOM handling.

Use it for:

Do not use it as your first tuning tool unless you enjoy surprise kills. The usual safer pattern is:

  1. tune memory.high,
  2. observe pressure,
  3. set memory.max above that as the safety backstop.

memory.swap.max: swap policy in one file

memory.swap.max sets the hard cap on swap usage for the cgroup.

Common patterns:

  • 0 for latency-critical services that must never swap,
  • a generous value for batch work that tolerates paging,
  • max (the default) when the host-level swap policy is acceptable.

Be deliberate here. Swap is not automatically evil, but accidental swap policy usually is.

memory.swap.high: emergency warning, not your everyday knob

memory.swap.high is a throttle point for swap usage. Kernel docs describe it as a point of no return and explicitly not the normal way to manage healthy swap behavior.

Operationally:

  • treat a hit on memory.swap.high as an incident signal, not a tuning target,
  • use memory.swap.max for actual swap policy.

memory.oom.group: kill the workload, not a random organ

With memory.oom.group=1, the OOM killer treats the cgroup as an indivisible workload and kills tasks together.

This is often better for:

  • databases and other stateful services,
  • worker pools with shared in-memory state,
  • anything where a half-alive deployment is worse than a clean restart.

Without it, you can end up with the ugly middle ground: a “still running” workload that has already lost the process that mattered.

memory.reclaim: proactive reclaim

memory.reclaim lets you ask the kernel to reclaim memory from a cgroup manually.

Example:

echo "1G" | sudo tee /sys/fs/cgroup/batch/memory.reclaim

Use it for:

  • trimming non-critical caches before disruptive maintenance,
  • testing how a workload behaves with a colder cache.

But note the caveat from kernel docs: reclaim triggered this way is not the same as natural pressure-driven reclaim, and some side effects such as socket memory balancing are not exercised in the same way.

memory.events: the scoreboard

memory.events is one of the most useful files in the whole controller. Key counters:

  • low: times the cgroup was reclaimed despite being below memory.low,
  • high: times usage went over memory.high and reclaim was forced,
  • max: times usage ran into memory.max,
  • oom: times the cgroup hit OOM conditions,
  • oom_kill: processes in the cgroup actually killed by the OOM killer.

Use memory.events.local if you want local-only signals rather than hierarchical subtree counts.
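Turning the scoreboard into an alert is mostly a matter of sampling one counter twice. A minimal sketch; the helper name is invented, and the path in the usage comment is illustrative:

```shell
# Sketch: read one named counter out of memory.events.
events_counter() {
    awk -v k="$2" '$1 == k { print $2 }' "$1/memory.events"
}
# usage sketch (path illustrative):
#   before=$(events_counter /sys/fs/cgroup/api oom_kill)
#   sleep 60
#   after=$(events_counter /sys/fs/cgroup/api oom_kill)
#   [ "$after" -gt "$before" ] && echo "new OOM kills in the last minute"
```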

memory.pressure: the “is this hurting?” signal

PSI is what turns memory control from guesswork into operations.

memory.pressure exports:

  • a some line: share of time at least one task was stalled on memory,
  • a full line: share of time all non-idle tasks were stalled at once.

The rolling windows:

  • avg10, avg60, avg300: stall percentages over 10s, 60s, and 300s windows,
  • total: cumulative stall time in microseconds.

memory.current tells you how much memory is charged. memory.pressure tells you whether that memory situation is actually costing productive time.

That distinction matters a lot. A big cache may be healthy. A modest footprint with rising PSI may be dying.
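Pulling one PSI number out for alerting is a one-liner's worth of awk. A sketch; the helper name is invented:

```shell
# Sketch: extract the 10-second "full" stall average from memory.pressure.
# "full" is time when all non-idle tasks were stalled on memory at once.
psi_full_avg10() {
    awk '$1 == "full" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^avg10=/) { sub(/^avg10=/, "", $i); print $i }
    }' "$1/memory.pressure"
}
```

A full avg10 that stays above a few percent is usually worth an alert regardless of what memory.current says.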

memory.stat: explain the bill

memory.stat breaks usage down into categories and reclaim-related counters. Use it to answer questions like:

  • is the growth anonymous memory or page cache (anon vs file)?
  • is kernel memory (slab, sock) a meaningful part of the bill?
  • is the working set thrashing (workingset_refault counters)?
  • how hard is reclaim working (pgscan vs pgsteal)?

If you only watch memory.current, you are debugging with one eye shut.
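The questions above map to a small set of memory.stat line prefixes. A sketch; the helper name is invented, and the prefix list is a starting point rather than exhaustive:

```shell
# Sketch: pull the memory.stat lines that answer most
# "where did the memory go" questions.
explain_the_bill() {
    grep -E '^(anon|file|slab|sock|workingset_refault|pgscan|pgsteal)' \
        "$1/memory.stat"
}
```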


4) Hierarchy rules people forget

Protections are hierarchical

memory.min and memory.low are constrained by ancestor protections. If the parent does not provide enough protection budget, children compete for the effective protection.

So this is wrong:

This is closer to correct:

Overcommit is possible — and sometimes useful

Kernel docs explicitly note that overcommitting memory.high can be viable. That is often true because memory.high produces throttling and reclaim, not instant death.

Overcommitting protection is more dangerous:

  • overcommitted memory.low turns “protected” into a polite suggestion,
  • overcommitted memory.min can turn normal load into steady-state OOM.

Moving a process does not move old memory charges

This one bites people during debugging and live migrations. Memory is charged to the cgroup that instantiated it and stays charged there until released. Moving the process later does not retroactively move all previously charged memory.

That means:

  • a freshly moved process can look suspiciously small in its new cgroup,
  • the old cgroup keeps being charged for memory it no longer “owns” operationally.


5) Minimal safe patterns

Pattern 1: protect a critical API, throttle background jobs

sudo mkdir -p /sys/fs/cgroup/api /sys/fs/cgroup/batch
sudo sh -c 'echo +memory > /sys/fs/cgroup/cgroup.subtree_control'

echo 2147483648 | sudo tee /sys/fs/cgroup/api/memory.low
echo 3221225472 | sudo tee /sys/fs/cgroup/api/memory.high
echo 17179869184 | sudo tee /sys/fs/cgroup/batch/memory.high
echo 21474836480 | sudo tee /sys/fs/cgroup/batch/memory.max

Interpretation:

  • api is protected up to 2 GiB and throttled above 3 GiB,
  • batch is throttled at 16 GiB and hard-capped (killable) at 20 GiB.

Pattern 2: no-swap low-latency service

echo 0 | sudo tee /sys/fs/cgroup/gateway/memory.swap.max
echo 2147483648 | sudo tee /sys/fs/cgroup/gateway/memory.low
echo 3221225472 | sudo tee /sys/fs/cgroup/gateway/memory.high

Good for:

  • gateways, proxies, and other latency-critical front ends,
  • workloads whose tail latency cannot absorb swap-in stalls.

But remember: if the working set truly needs 5 GiB, “no swap” plus memory.high=3G is just a very opinionated failure plan.

Pattern 3: treat worker pool as a single failure unit

echo 1 | sudo tee /sys/fs/cgroup/workerpool/memory.oom.group
echo 8589934592 | sudo tee /sys/fs/cgroup/workerpool/memory.max

Good when a half-killed worker pool is worse than a clean restart.

Pattern 4: proactive reclaim before maintenance

echo "512M" | sudo tee /sys/fs/cgroup/cache-warmers/memory.reclaim

Good for testing or trimming non-critical caches before doing disruptive work.


6) systemd mappings you will actually use

If the machine is managed by systemd, use unit properties instead of hand-writing /sys/fs/cgroup/* unless you are debugging live.

Important mappings:

  • MemoryMin= → memory.min
  • MemoryLow= → memory.low
  • MemoryHigh= → memory.high
  • MemoryMax= → memory.max
  • MemorySwapMax= → memory.swap.max

Examples:

sudo systemctl set-property api.service MemoryAccounting=yes
sudo systemctl set-property api.service MemoryLow=2G
sudo systemctl set-property api.service MemoryHigh=3G
sudo systemctl set-property api.service MemoryMax=4G

sudo systemctl set-property gateway.service MemorySwapMax=0

sudo systemctl set-property batch.service MemoryHigh=12G
sudo systemctl set-property batch.service MemoryMax=16G

sudo systemctl set-property worker.slice ManagedOOMMemoryPressure=kill
sudo systemctl set-property worker.slice ManagedOOMMemoryPressureLimit=40%

Important systemd guidance worth internalizing:

  • MemoryHigh= is documented as the main mechanism for controlling memory usage; MemoryMax= is the last line of defense,
  • these properties require the unified (cgroup v2) hierarchy,
  • express hierarchy with slices rather than hand-built cgroup trees.
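The set-property calls above can also live as a drop-in next to the unit, which keeps the policy visible in configuration management. A sketch, with the unit name and values taken from the earlier example:

```ini
# /etc/systemd/system/api.service.d/50-memory.conf (unit name illustrative)
[Service]
MemoryAccounting=yes
MemoryLow=2G
MemoryHigh=3G
MemoryMax=4G
```

Run systemctl daemon-reload after adding or editing the drop-in.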


7) Observability checklist

At minimum, watch per cgroup:

  • memory.current (and memory.peak where available),
  • memory.events / memory.events.local,
  • memory.pressure,
  • memory.swap.current,
  • the key memory.stat counters.

Basic commands:

cat /sys/fs/cgroup/api/memory.current
cat /sys/fs/cgroup/api/memory.peak
cat /sys/fs/cgroup/api/memory.events
cat /sys/fs/cgroup/api/memory.events.local
cat /sys/fs/cgroup/api/memory.pressure
cat /sys/fs/cgroup/api/memory.swap.current
grep -E 'anon|file|kernel|slab|sock|workingset|pgscan|pgsteal' /sys/fs/cgroup/api/memory.stat

Interpretation hints:

  • rising high events: the workload is being throttled by memory.high,
  • rising low events: your protection budget is overcommitted,
  • rising full PSI: the whole group is stalling, not just one task,
  • swap.current climbing on a supposedly no-swap service: a policy gap.

If you can only alert on one thing beyond OOM kills, alert on sustained memory PSI plus memory.events.high. That catches the bad middle state before the box starts killing things.


8) Rollout sequence that usually works

  1. Measure first: baseline memory.current, memory.pressure, memory.events, and app latency.
  2. Separate workloads into sane sibling cgroups.
  3. Set small, explicit protections only for the workloads that truly deserve them.
  4. Introduce memory.high as the main operating limit.
  5. Observe for at least one business cycle:
    • PSI,
    • memory.events.high,
    • application latency,
    • swap behavior.
  6. Add memory.max as a safety barrier after you understand steady-state behavior.
  7. Decide swap policy explicitly per workload class.
  8. For integrity-sensitive groups, enable memory.oom.group=1.
  9. Revisit settings after major changes in:
    • kernel version,
    • filesystem / I/O behavior,
    • cache shape,
    • traffic mix,
    • resident set profile.

9) Common mistakes

  1. Using only hard limits
    Jumping straight to memory.max is the fastest route to surprise OOMs.

  2. Protecting everything
    If every cgroup is “critical”, none of your protection policy is real.

  3. Forgetting ancestor allocations
    Leaf memory.low without parent support is fake confidence.

  4. Reading usage without pressure
    High memory usage is not automatically bad; high PSI often is.

  5. Ignoring page cache and kernel memory
    memory.current is not just app heap.

  6. Treating swap as a moral question
    It is a workload-policy question. Some workloads should never swap; others benefit from controlled swap.

  7. Moving processes after the fact and trusting accounting
    Old charges stay where they were created.

  8. Using group OOM blindly
    memory.oom.group=1 is great when the workload is truly indivisible, bad when one helper process can be safely sacrificed.

  9. Ignoring memory.events.low
    It is one of the easiest ways to notice that your protection model is fantasy.


10) One-page starter policy

If you need a practical default today:

  • turn on memory accounting,
  • give genuinely critical services a modest memory.low,
  • run everything else under memory.high,
  • add memory.max only as a validated backstop,
  • set swap policy per workload class,
  • enable memory.oom.group=1 where integrity matters.

If you are unsure, this heuristic is usually sane:

Protect a little, throttle early, cap late, watch PSI, and make swap a conscious choice.
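That heuristic fits in one helper. A sketch; the function name is invented, the values are arguments rather than recommendations, and the cgroup files accept byte counts, K/M/G suffixes, and the literal "max":

```shell
# Sketch of the one-page policy as a single helper.
starter_policy() {
    cg="$1" low="$2" high="$3" max="$4" swap="$5"
    echo "$low"  > "$cg/memory.low"       # protect a little
    echo "$high" > "$cg/memory.high"      # throttle early
    echo "$max"  > "$cg/memory.max"       # cap late
    echo "$swap" > "$cg/memory.swap.max"  # swap as a conscious choice
}
# e.g.: starter_policy /sys/fs/cgroup/api 2G 3G 4G 0
```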


Source notes

Primary references used for this note: