Linux cgroup v2 I/O Controller Playbook (io.max, io.latency, io.cost, systemd mappings)
Date: 2026-04-05
Category: knowledge
Why this matters
A lot of Linux incidents that look like “the disk is slow” are actually contention-shape problems:
- backup or compaction jobs eat queue depth and API latency blows up,
- one noisy writer turns page-cache reclaim into writeback storms,
- everyone looks fine on average, but p99 falls off a cliff,
- operators overreact with crude per-process ionice or global device tuning.
cgroup v2 gives a cleaner toolbox:
- hard caps when a workload must not exceed a ceiling,
- latency protection when a critical service must stay responsive,
- proportional sharing when multiple tenants should compete fairly,
- per-cgroup observability so you can see who is causing pain.
The big upgrade over older setups is that v2 can account and control buffered writeback and page-cache related I/O much more coherently, not just direct I/O.
1) Quick mental model
Use different knobs for different jobs:
- io.max: absolute cap on read/write bandwidth or IOPS.
  "This group may not exceed this ceiling."
- io.latency: protect a latency-sensitive cgroup by throttling lower-priority peers when the protected group misses its latency target.
  "If this service starts missing its latency target, squeeze the others."
- io.weight: proportional sharing knob.
  "When the device is contested, who gets a bigger slice?"
- io.cost.qos / io.cost.model: advanced controller for model-based proportional control.
  "Use a device cost model plus QoS bounds to shape proportional sharing more intelligently."
- io.stat / io.pressure: attribution + pressure signals.
  "What is this cgroup doing, and is it suffering?"
Rule of thumb:
- start with io.latency when protecting an online service,
- use io.max for obvious blast-radius containment,
- use io.weight / io.cost for multi-tenant fairness,
- always observe memory + I/O together, because writeback sits between them.
2) Decision matrix
A) Backup / batch / compaction hurts API p99
- Put batch work in a separate cgroup.
- Protect API with io.latency.
- If batch is still too bursty, add an io.max ceiling.
B) Several tenants need fair disk sharing
- Start with io.weight.
- If you need tighter model-based proportional behavior on a stable fleet, evaluate io.cost.
C) You must guarantee background work never exceeds a ceiling
- Use io.max for read/write BPS and/or IOPS.
- Treat it as a hard governor, not a fairness tool.
D) Host is oscillating between memory pressure and writeback pressure
- Inspect memory.pressure, io.pressure, memory.stat, and io.stat together.
- Fixing only memory or only disk often misses the feedback loop.
3) What each knob really does
io.max: hard cap
io.max limits per-device bandwidth and/or IOPS.
Supported keys:
rbps, wbps, riops, wiops
Example:
echo "8:16 rbps=2097152 wiops=120" | sudo tee /sys/fs/cgroup/batch/io.max
Interpretation:
- read capped at 2 MiB/s,
- write capped at 120 IOPS,
- short bursts may still appear because enforcement is not perfectly instantaneous,
- operationally this behaves as a non-work-conserving ceiling.
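io.max takes raw numeric values, so it helps to compute them instead of hand-typing byte counts. A minimal sketch, using a hypothetical mib_to_bytes helper (not part of any standard tool) to build the io.max line from the example above:

```shell
# Hypothetical helper: convert MiB/s to the raw bytes/s values io.max expects.
mib_to_bytes() {
  # $1 = MiB per second
  echo $(( $1 * 1024 * 1024 ))
}

# 2 MiB/s read cap, as used in the example above:
mib_to_bytes 2    # 2097152

# Build the full io.max line for device 8:16 (write it with tee as root):
printf '8:16 rbps=%s wiops=120\n' "$(mib_to_bytes 2)"
```

Generating the line this way avoids the classic off-by-a-factor-of-1024 mistakes when translating "2 MiB/s" into a cap.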
Use it when the question is: “What is the most this group is allowed to consume?”
Do not use it as your only latency tool unless you already know the safe ceiling.
io.latency: work-conserving latency protection
io.latency protects a cgroup by setting a target completion latency.
If that cgroup starts missing the target, the kernel throttles peer cgroups with looser targets.
Example interface:
echo "259:0 target=5000" | sudo tee /sys/fs/cgroup/api/io.latency
That means a 5 ms target on device 259:0.
Important behaviors:
- peer-level only: only siblings influence each other,
- work-conserving: if everyone is meeting target, the controller stays out of the way,
- throttling can happen through queue depth clamping and artificial delay charging,
- protected groups with the lowest / tightest target get priority under contention.
Use it when the question is: “How do I keep the critical service responsive without wasting idle capacity?”
io.weight: simple proportional share
io.weight is the intuitive fairness lever.
Higher weight means more share relative to siblings.
Example:
echo 400 | sudo tee /sys/fs/cgroup/api/io.weight
echo 100 | sudo tee /sys/fs/cgroup/batch/io.weight
This is good for:
- multiple always-on workloads,
- environments where you want soft competition rather than hard caps,
- first-pass tuning before using heavier machinery.
io.cost: advanced model-based proportional control
io.cost is more sophisticated and more operationally demanding.
It lives at the root cgroup and uses a device cost model plus QoS inputs.
Relevant files:
io.cost.qos, io.cost.model
Kernel docs expose knobs like:
- enable
- ctrl=auto|user
- latency percentiles / thresholds (rpct, rlat, wpct, wlat)
- scaling bounds (min, max)
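As a sketch of the interface shape (parameter values here are illustrative, not a recommendation; verify exact semantics against your kernel's cgroup-v2 documentation), an io.cost.qos line at the root looks roughly like:

```
# /sys/fs/cgroup/io.cost.qos -- one line per device, values illustrative
8:16 enable=1 ctrl=auto rpct=95.00 rlat=10000 wpct=95.00 wlat=10000 min=50.00 max=150.00
```

Here rlat/wlat are read/write latency thresholds in microseconds at the given percentiles, and min/max bound how far the controller may scale device speed estimates.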
Use io.cost only when:
- you have a relatively homogeneous fleet,
- device behavior is understood and stable enough,
- proportional fairness matters more than quick tactical containment,
- you can actually test and recalibrate it.
If not, io.latency + io.max is usually the better operational trade.
4) The memory / writeback connection people miss
This is the sneaky part.
Buffered writes, dirty page reclaim, and writeback sit between the memory and I/O domains. That means:
- memory pressure can become writeback pressure,
- writeback throttling can change application latency even when raw disk throughput looks fine,
- a cgroup flushing lots of dirty pages can hurt siblings unless the hierarchy is shaped correctly.
Practical consequence:
- when you tune I/O policy, also watch memory.pressure and memory.stat,
- when a workload is under memory reclaim, you may be debugging an I/O symptom caused by memory policy.
This is why cgroup v2 is much more useful than older approaches: page-cache writeback attribution is materially better.
5) Setup and discovery
Check that you are on unified hierarchy:
stat -fc %T /sys/fs/cgroup
# should print: cgroup2fs
See available controllers:
cat /sys/fs/cgroup/cgroup.controllers
cat /sys/fs/cgroup/cgroup.subtree_control
Enable I/O controller on the parent level:
sudo sh -c 'echo "+io" > /sys/fs/cgroup/cgroup.subtree_control'
Create sibling groups:
sudo mkdir -p /sys/fs/cgroup/api
sudo mkdir -p /sys/fs/cgroup/batch
Move processes:
echo <PID> | sudo tee /sys/fs/cgroup/api/cgroup.procs
echo <PID> | sudo tee /sys/fs/cgroup/batch/cgroup.procs
Remember the cgroup v2 rule: controllers distribute resources to children. Shape the tree deliberately.
6) Minimal safe patterns
Pattern 1: protect API from background jobs
# API gets tighter latency target
printf '259:0 target=5000\n' | sudo tee /sys/fs/cgroup/api/io.latency
# Batch gets looser target and lower weight
printf '259:0 target=20000\n' | sudo tee /sys/fs/cgroup/batch/io.latency
echo 100 | sudo tee /sys/fs/cgroup/batch/io.weight
Interpretation:
- API aims for 5 ms,
- batch is allowed 20 ms,
- when API misses target, batch gets throttled first.
Pattern 2: cap a backup job so it never dominates
echo '259:0 rbps=20971520 wbps=20971520 riops=200 wiops=200' | sudo tee /sys/fs/cgroup/backup/io.max
For direct writes to io.max, use raw numeric bytes/IOPS values. Human-friendly suffixes are more appropriate in higher-level tooling such as systemd properties.
Pattern 3: fair-share between two tenants
echo 300 | sudo tee /sys/fs/cgroup/tenant-a/io.weight
echo 100 | sudo tee /sys/fs/cgroup/tenant-b/io.weight
This gives tenant A roughly 3x the share of contested I/O time versus tenant B.
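The "roughly 3x" claim follows from how weights compose: each sibling's expected share of contested I/O is its weight divided by the sum of all sibling weights. A small sketch (the share helper is hypothetical, for illustration only):

```shell
# Sketch: expected contested-I/O share implied by sibling io.weight values.
# First argument is "my" weight, the rest are the sibling weights.
share() {
  awk '
    BEGIN {
      total = 0
      for (i = 1; i < ARGC; i++) total += ARGV[i]
      printf "%.0f\n", 100 * ARGV[1] / total
    }' "$@"
}

share 300 100   # tenant A: 75 (% of contested I/O time)
share 100 300   # tenant B: 25
```

Note these percentages only apply under contention; io.weight is work-conserving, so an idle device lets either tenant use everything.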
7) systemd mappings you will actually use
If the host is systemd-managed, don’t hand-build everything in /sys/fs/cgroup unless you need ad-hoc debugging.
Use unit properties.
Key mappings:
- IOAccounting=yes → enable I/O accounting
- IOWeight= / StartupIOWeight= → io.weight
- IODeviceWeight= → per-device io.weight
- IOReadBandwidthMax= / IOWriteBandwidthMax= → io.max
- IOReadIOPSMax= / IOWriteIOPSMax= → io.max
- IODeviceLatencyTargetSec= → io.latency
Examples:
sudo systemctl set-property api.service IOAccounting=yes
sudo systemctl set-property api.service IODeviceLatencyTargetSec="/dev/nvme0n1 5ms"
sudo systemctl set-property backup.service IOAccounting=yes
sudo systemctl set-property backup.service IOReadBandwidthMax="/var/backups 20M"
sudo systemctl set-property backup.service IOWriteBandwidthMax="/var/backups 20M"
sudo systemctl set-property tenant-a.service IOWeight=300
sudo systemctl set-property tenant-b.service IOWeight=100
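The same policy can live in a persistent drop-in instead of imperative set-property calls; a sketch mirroring the backup.service example above (directory and filename are conventional, pick your own):

```ini
# /etc/systemd/system/backup.service.d/io.conf
[Service]
IOAccounting=yes
IOReadBandwidthMax=/var/backups 20M
IOWriteBandwidthMax=/var/backups 20M
```

Run `sudo systemctl daemon-reload` after editing so systemd picks up the drop-in.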
Nice operational detail:
- systemd automatically enables needed controllers through the hierarchy,
- path-based device resolution is convenient, but on complex RAID / LV / exotic storage stacks, verify which underlying block device you are really controlling.
8) Observability checklist
At minimum, track per-cgroup:
- io.stat
- io.pressure
- memory.pressure
- app p95 / p99 latency
- timeout / retry rate
- writeback-related counters from memory.stat
Basic commands:
cat /sys/fs/cgroup/api/io.stat
cat /sys/fs/cgroup/api/io.pressure
cat /sys/fs/cgroup/api/memory.pressure
grep -E 'file|writeback|dirty|workingset' /sys/fs/cgroup/api/memory.stat
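io.stat is a line of key=value pairs per device, which is awkward to eyeball at scale. A small parsing sketch (parse_io_stat is a hypothetical helper; the sample line is illustrative, real files have one line per device):

```shell
# Sketch: extract per-device rbytes/wbytes from io.stat key=value pairs.
parse_io_stat() {
  awk '{
    # Field 1 is MAJOR:MINOR; the rest are key=value pairs.
    for (i = 2; i <= NF; i++) {
      split($i, pair, "=")
      kv[pair[1]] = pair[2]
    }
    printf "%s read=%s write=%s\n", $1, kv["rbytes"], kv["wbytes"]
    delete kv
  }'
}

echo '8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0' \
  | parse_io_stat
# 8:16 read=1459200 write=314773504
```

In practice you would feed it `cat /sys/fs/cgroup/api/io.stat` and diff two snapshots to get rates.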
When io.latency is active, watch for extra io.stat fields such as:
- depth
- avg_lat
- win
- delay-related fields on throttled groups
Interpretation hints:
- high avg_lat on the protected group + peer throttling → target may be too tight, or the device is actually saturated,
- rising io.pressure but modest throughput → queueing/latency problem, not just bandwidth,
- dirty/writeback growth plus memory pressure → reclaim/writeback loop, not purely an I/O scheduler issue,
- good averages but ugly p99 → classic contention-shape problem; io.latency usually beats blind bandwidth caps.
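PSI files like io.pressure share one format ("some" and "full" lines with avg10/avg60/avg300), so a tiny extractor covers all of them. A sketch with a hypothetical psi_some_avg10 helper and an illustrative alert threshold:

```shell
# Sketch: pull the "some avg10" value out of PSI-formatted text on stdin.
psi_some_avg10() {
  awk '/^some/ {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^avg10=/) { sub(/^avg10=/, "", $i); print $i }
  }'
}

# Illustrative sample of what io.pressure looks like:
sample='some avg10=12.50 avg60=3.10 avg300=0.80 total=123456
full avg10=4.00 avg60=1.00 avg300=0.20 total=65432'

avg10=$(printf '%s\n' "$sample" | psi_some_avg10)
echo "$avg10"   # 12.50

# Threshold of 5% stalled time is an illustrative starting point, not a standard.
awk -v v="$avg10" 'BEGIN { exit !(v > 5.0) }' && echo "I/O pressure high"
```

Pointing it at `/sys/fs/cgroup/api/io.pressure` in a loop gives a cheap per-cgroup alert signal.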
9) Rollout sequence that usually works
- Measure first: 24h baseline with service latency + io.stat / PSI snapshots.
- Separate workloads into sibling cgroups before tuning knobs.
- For online critical services, start with io.latency.
- For obvious noisy jobs, add io.max ceilings.
- Use io.weight for ongoing tenant fairness.
- Reach for io.cost only after the simple controls prove insufficient.
- Roll out canary → partial fleet → full fleet.
- Recalibrate after storage hardware, kernel, filesystem, or workload changes.
10) Common mistakes
- Using only io.max everywhere: you get rigid ceilings but poor latency behavior.
- Setting impossible io.latency targets: if the device cannot do 2 ms under real load, pretending it can just causes constant throttling noise.
- Ignoring sibling topology: io.latency works among peers; the wrong tree shape means the wrong protection behavior.
- Debugging I/O without memory context: many "disk issues" are dirty-page / reclaim issues wearing a disk-shaped mask.
- Relying blindly on path-to-device resolution in systemd: works nicely for simple storage, gets ambiguous on layered setups.
- Going straight to io.cost on day one: a powerful precision tool, but a bad first-response tool.
11) One-page starter policy
If you need a practical default today:
latency-sensitive API / DB frontends:
- separate cgroup,
- IOAccounting=yes,
- conservative IODeviceLatencyTargetSec= target based on measured device latency + safety margin.
background maintenance / compaction / backups:
- separate cgroup,
- lower IOWeight=,
- add IOReadBandwidthMax= / IOWriteBandwidthMax= if they can still cause p99 damage.
multi-tenant shared host:
- start with IOWeight= ratios,
- add io.latency only for truly user-facing critical groups,
- evaluate io.cost only for fleets where device model tuning is realistic.
If you are unsure, protect with latency first, cap second, model last.
Source notes
Primary references used for this note:
- Linux kernel cgroup v2 documentation (io.max, io.latency, io.cost, io.stat)
- Meta/Facebook cgroup2 IO controller guide
- systemd resource-control documentation / Arch manpage mapping for IOWeight, IOReadBandwidthMax, IODeviceLatencyTargetSec
- Viacheslav Biriukov's write-up on cgroup v2, page cache, writeback, and PSI