Linux cgroup v2 I/O Controller Playbook (io.max, io.latency, io.cost, systemd mappings)

2026-04-05 · software


Why this matters

A lot of Linux incidents that look like “the disk is slow” are actually contention-shape problems:

- a backup, batch, or compaction job saturating the device queue while an API misses its p99
- several tenants sharing a disk with no fairness policy
- memory pressure turning into writeback storms, and writeback throttling turning back into memory pressure

cgroup v2 gives a cleaner toolbox: io.max for hard ceilings, io.latency for latency protection, io.weight for proportional fairness, and io.cost for model-based control.

The big upgrade over older setups is that v2 accounts for and controls buffered writeback and page-cache-related I/O much more coherently, not just direct I/O.


1) Quick mental model

Use different knobs for different jobs:

- io.max: a hard ceiling, enforced even when the device is idle
- io.latency: a latency target with work-conserving protection for critical services
- io.weight: proportional sharing among sibling cgroups
- io.cost: model-based proportional control across a whole device

Rule of thumb: protect with latency first, cap second, model last.


2) Decision matrix

A) Backup / batch / compaction hurts API p99
   Tighten io.latency on the API group; give the batch group a looser target and a lower io.weight.

B) Several tenants need fair disk sharing
   Use io.weight ratios (reach for io.cost only if you can afford the calibration work).

C) You must guarantee background work never exceeds a ceiling
   Use io.max; it is the only knob here that is not work-conserving.

D) Host is oscillating between memory pressure and writeback pressure
   Debug memory and I/O together: read memory.pressure and memory.stat next to io.stat (see section 4).


3) What each knob really does

io.max: hard cap

io.max limits per-device bandwidth and/or IOPS.

Supported keys:

- rbps / wbps: read / write bytes per second
- riops / wiops: read / write I/O operations per second
- setting a key to max clears that limit

Example:

echo "8:16 rbps=2097152 wiops=120" | sudo tee /sys/fs/cgroup/batch/io.max

Interpretation: on device 8:16 (/dev/sdb), reads are capped at 2 MiB/s and writes at 120 IOPS; unset keys remain unlimited.

Use it when the question is: “What is the most this group is allowed to consume?”

Do not use it as your only latency tool unless you already know the safe ceiling.

io.latency: work-conserving latency protection

io.latency protects a cgroup by setting a target completion latency. If that cgroup starts missing the target, the kernel throttles peer cgroups with looser targets.

Example interface:

echo "259:0 target=5000" | sudo tee /sys/fs/cgroup/api/io.latency

That means a 5 ms target on device 259:0.

Important behaviors:

- It is work-conserving: while targets are being met, nothing is throttled, so idle capacity is never wasted.
- Protection is arbitrated among sibling cgroups; the shape of the tree decides who is throttled for whom.
- Targets must be achievable on the real device under real load (see mistake 2 below).

Use it when the question is: “How do I keep the critical service responsive without wasting idle capacity?”
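The examples above address devices as MAJOR:MINOR (259:0 and so on). That pair can be derived from a device node: stat -c '%t %T' prints the major and minor in hex, and a tiny converter turns them into the decimal form the cgroup files expect. hex_devno_to_decimal is a hypothetical helper name, not a standard tool:

```shell
# stat(1) reports a block device's major/minor in hex via %t and %T;
# cgroup files want them in decimal. hex_devno_to_decimal is a
# hypothetical helper, not a standard utility.
hex_devno_to_decimal() {
  printf '%d:%d\n' "0x$1" "0x$2"
}

# On a live system (device path is an example):
#   devno=$(hex_devno_to_decimal $(stat -c '%t %T' /dev/nvme0n1))
#   echo "$devno target=5000" | sudo tee /sys/fs/cgroup/api/io.latency
```

NVMe devices use major 259 (0x103), which is why 259:0 keeps showing up in the examples.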

io.weight: simple proportional share

io.weight is the intuitive fairness lever. Higher weight means more share relative to siblings.

Example:

echo 400 | sudo tee /sys/fs/cgroup/api/io.weight
echo 100 | sudo tee /sys/fs/cgroup/batch/io.weight

This is good for:

- multi-tenant fairness without hard caps
- keeping the device fully utilized, since weights only bite under contention

io.cost: advanced model-based proportional control

io.cost is more sophisticated and more operationally demanding. It lives at the root cgroup and uses a device cost model plus QoS inputs.

Relevant files:

- /sys/fs/cgroup/io.cost.model (per-device cost model; root cgroup only)
- /sys/fs/cgroup/io.cost.qos (per-device QoS targets and enable switch; root cgroup only)

Kernel docs expose knobs like:

- io.cost.qos: enable, ctrl=auto|user, rpct/rlat and wpct/wlat read/write latency targets, min/max vrate bounds
- io.cost.model: ctrl, model=linear, with rbps/rseqiops/rrandiops and wbps/wseqiops/wrandiops coefficients

Use io.cost only when:

- you can calibrate, and periodically recalibrate, a per-device cost model
- io.weight-style fairness alone has proven insufficient
- you have the operational bandwidth to own the model across hardware and kernel changes

If not, io.latency + io.max is usually the better operational trade.


4) The memory / writeback connection people miss

This is the sneaky part.

Buffered writes, dirty page reclaim, and writeback sit between the memory and I/O domains. That means:

- memory pressure can surface as I/O load (reclaim forcing dirty pages out)
- I/O throttling can surface as memory pressure (dirty pages pile up because writeback is slowed)

Practical consequence: when debugging a “slow disk”, read memory.pressure and the dirty/writeback counters in memory.stat next to io.stat, and tune memory limits (e.g. memory.high) together with the I/O knobs.

This is why cgroup v2 is much more useful than older approaches: page-cache writeback attribution is materially better.


5) Setup and discovery

Check that you are on unified hierarchy:

stat -fc %T /sys/fs/cgroup
# should print: cgroup2fs

See available controllers:

cat /sys/fs/cgroup/cgroup.controllers
cat /sys/fs/cgroup/cgroup.subtree_control

Enable I/O controller on the parent level:

sudo sh -c 'echo "+io" > /sys/fs/cgroup/cgroup.subtree_control'

Create sibling groups:

sudo mkdir -p /sys/fs/cgroup/api
sudo mkdir -p /sys/fs/cgroup/batch

Move processes:

echo <PID> | sudo tee /sys/fs/cgroup/api/cgroup.procs
echo <PID> | sudo tee /sys/fs/cgroup/batch/cgroup.procs

Remember the cgroup v2 rule: controllers distribute resources to children. Shape the tree deliberately.
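The discovery and setup steps above can be wrapped in a small helper. This is a sketch, with the cgroup root passed as a parameter so the function can be exercised against a scratch directory; on a real host the root is /sys/fs/cgroup, root privileges are required, and the group names here are just the examples from this note:

```shell
# Sketch: enable the io controller for a parent's children and create
# sibling groups. The cgroup root is parameterized so this can be
# dry-run against a scratch directory; group names are assumptions.
setup_io_siblings() {
  cgroot=$1; shift
  # Enable the io controller for children. On a real hierarchy a
  # non-root parent must hold no processes of its own ("no internal
  # processes" rule) for this to succeed.
  echo '+io' > "$cgroot/cgroup.subtree_control"
  # Create the sibling groups.
  for name in "$@"; do
    mkdir -p "$cgroot/$name"
  done
}
```

Usage on a live host would be setup_io_siblings /sys/fs/cgroup api batch, run as root.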


6) Minimal safe patterns

Pattern 1: protect API from background jobs

# API gets tighter latency target
printf '259:0 target=5000\n' | sudo tee /sys/fs/cgroup/api/io.latency

# Batch gets looser target and lower weight
printf '259:0 target=20000\n' | sudo tee /sys/fs/cgroup/batch/io.latency
echo 100 | sudo tee /sys/fs/cgroup/batch/io.weight

Interpretation:

Pattern 2: cap a backup job so it never dominates

echo '259:0 rbps=20971520 wbps=20971520 riops=200 wiops=200' | sudo tee /sys/fs/cgroup/backup/io.max

For direct writes to io.max, use raw numeric bytes/IOPS values. Human-friendly suffixes are more appropriate in higher-level tooling such as systemd properties.
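If you still want suffix convenience when scripting raw writes, a small conversion helper keeps the values honest. to_bytes is a hypothetical name, and the power-of-two unit choice (K=1024, and so on) is an assumption of this sketch:

```shell
# Expand human-friendly K/M/G suffixes (powers of 1024) into the raw
# byte counts io.max expects. to_bytes is a hypothetical helper; the
# kernel interface itself only takes plain integers.
to_bytes() {
  awk -v v="$1" 'BEGIN {
    n = v + 0                                  # numeric prefix of the value
    if      (v ~ /[Kk]$/) n *= 1024
    else if (v ~ /[Mm]$/) n *= 1024 * 1024
    else if (v ~ /[Gg]$/) n *= 1024 * 1024 * 1024
    printf "%d\n", n
  }'
}
```

So the backup cap above could be written as echo "259:0 rbps=$(to_bytes 20M) wbps=$(to_bytes 20M)" piped to sudo tee.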

Pattern 3: fair-share between two tenants

echo 300 | sudo tee /sys/fs/cgroup/tenant-a/io.weight
echo 100 | sudo tee /sys/fs/cgroup/tenant-b/io.weight

This gives tenant A roughly 3x the share of contested I/O time versus tenant B.
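The 3x figure is plain proportional arithmetic: each sibling's expected slice of contested device time is its weight divided by the sum of sibling weights. A throwaway helper (weight_share_pct is my name, not a kernel interface) makes the split explicit:

```shell
# Expected share of contested I/O time under proportional weighting:
# weight / sum-of-sibling-weights, as an integer percentage.
# weight_share_pct is a hypothetical helper for back-of-envelope math.
weight_share_pct() {
  awk -v w="$1" -v total="$2" 'BEGIN { printf "%d\n", (w * 100) / total }'
}
```

With the weights above, weight_share_pct 300 400 gives tenant A 75 percent and weight_share_pct 100 400 gives tenant B 25 percent of contested time.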


7) systemd mappings you will actually use

If the host is systemd-managed, don’t hand-build everything in /sys/fs/cgroup unless you need ad-hoc debugging. Use unit properties.

Key mappings:

- IOWeight= → io.weight
- IODeviceLatencyTargetSec= → io.latency
- IOReadBandwidthMax= / IOWriteBandwidthMax= → io.max rbps / wbps
- IOReadIOPSMax= / IOWriteIOPSMax= → io.max riops / wiops

Examples:

sudo systemctl set-property api.service IOAccounting=yes
sudo systemctl set-property api.service IODeviceLatencyTargetSec="/dev/nvme0n1 5ms"

sudo systemctl set-property backup.service IOAccounting=yes
sudo systemctl set-property backup.service IOReadBandwidthMax="/var/backups 20M"
sudo systemctl set-property backup.service IOWriteBandwidthMax="/var/backups 20M"

sudo systemctl set-property tenant-a.service IOWeight=300
sudo systemctl set-property tenant-b.service IOWeight=100

Nice operational detail: systemctl set-property applies immediately and persists as a drop-in under /etc/systemd/system.control/; add --runtime when you want the change to last only until reboot.
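For settings you keep in version control rather than applying via set-property, the same properties can live in a unit drop-in. This is a sketch; the file name is conventional, the device path is an example, and the values echo the earlier api.service settings:

```
# /etc/systemd/system/api.service.d/50-io.conf
[Service]
IOAccounting=yes
IODeviceLatencyTargetSec=/dev/nvme0n1 5ms
IOWeight=400
```

Run systemctl daemon-reload after editing drop-ins so systemd picks up the change.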


8) Observability checklist

At minimum, track per-cgroup:

- per-device throughput and IOPS (io.stat: rbytes, wbytes, rios, wios)
- I/O pressure stall information (io.pressure, both some and full lines)
- memory pressure plus dirty/writeback/workingset counters (memory.pressure, memory.stat)

Basic commands:

cat /sys/fs/cgroup/api/io.stat
cat /sys/fs/cgroup/api/io.pressure
cat /sys/fs/cgroup/api/memory.pressure
grep -E 'file|writeback|dirty|workingset' /sys/fs/cgroup/api/memory.stat

When io.latency is active, watch for extra io.stat fields such as:

- depth: the queue depth the controller currently allows the group
- avg_lat: running average completion latency, in microseconds
- win: the sampling window the average is computed over

Interpretation hints:

- avg_lat hovering at or above the target means protection is engaging, and peers with looser targets are paying for it
- a peer whose depth keeps shrinking is the group being throttled
- rising “full” time in the protected group’s io.pressure means protection is not enough; revisit the target or add io.max caps on the noisy peers
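PSI files are plain text, so the headline number is easy to pull out for alerting. A sketch that extracts the avg10 value from the “some” line of an io.pressure-style input (psi_some_avg10 is a hypothetical helper name):

```shell
# Extract avg10 from the "some" line of PSI output, as found in
# /sys/fs/cgroup/<grp>/io.pressure. Pure text processing, so it can
# be tested against a captured sample. psi_some_avg10 is a
# hypothetical helper, not a standard tool.
psi_some_avg10() {
  awk '/^some/ {
    for (i = 1; i <= NF; i++)
      if ($i ~ /^avg10=/) { sub("avg10=", "", $i); print $i }
  }'
}

# Live usage (path is an example):
#   psi_some_avg10 < /sys/fs/cgroup/api/io.pressure
```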


9) Rollout sequence that usually works

  1. Measure first: 24h baseline with service latency + io.stat / PSI snapshots.
  2. Separate workloads into sibling cgroups before tuning knobs.
  3. For online critical services, start with io.latency.
  4. For obvious noisy jobs, add io.max ceilings.
  5. Use io.weight for ongoing tenant fairness.
  6. Reach for io.cost only after the simple controls prove insufficient.
  7. Roll out canary → partial fleet → full fleet.
  8. Recalibrate after storage hardware, kernel, filesystem, or workload changes.

10) Common mistakes

  1. Using only io.max everywhere
    You get rigid ceilings but poor latency behavior.

  2. Setting impossible io.latency targets
    If the device cannot do 2 ms under real load, pretending it can just causes constant throttling noise.

  3. Ignoring sibling topology
    io.latency works among peers. Wrong tree shape means wrong protection behavior.

  4. Debugging I/O without memory context
    Many “disk issues” are dirty-page / reclaim issues wearing a disk-shaped mask.

  5. Relying on path-to-device resolution blindly in systemd
    Works nicely for simple storage, gets ambiguous on layered setups.

  6. Going straight to io.cost on day one
    Powerful once calibrated, but a poor first-response tool.


11) One-page starter policy

If you need a practical default today:

If you are unsure, protect with latency first, cap second, model last.
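That default can be sketched as a reviewable script. This version prints the commands instead of executing them, so it can be inspected (or piped to sh as root) only on a host where the assumptions carried over from earlier examples hold: device 259:0 and the api/batch/backup group names.

```shell
# Emit (not execute) the starter-policy commands described above.
# Device 259:0 and the cgroup names are assumptions from the
# earlier examples; review before running anywhere.
starter_policy() {
  cat <<'EOF'
printf '259:0 target=5000\n'  | tee /sys/fs/cgroup/api/io.latency
printf '259:0 target=20000\n' | tee /sys/fs/cgroup/batch/io.latency
echo 400 | tee /sys/fs/cgroup/api/io.weight
echo 100 | tee /sys/fs/cgroup/batch/io.weight
echo '259:0 rbps=20971520 wbps=20971520' | tee /sys/fs/cgroup/backup/io.max
EOF
}
```

Keeping the policy as emitted text makes the canary step in section 9 easy: diff what would be applied before applying it.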

