Linux cgroup v2 I/O Controller Playbook (io.max, io.latency, io.cost, systemd mappings)
Date: 2026-04-05
Category: knowledge
Why this matters
A lot of Linux incidents that look like “the disk is slow” are actually contention-shape problems:
- backup or compaction jobs eat queue depth and API latency blows up,
- one noisy writer turns page-cache reclaim into writeback storms,
- everyone looks fine on average, but p99 falls off a cliff,
- operators overreact with crude per-process ionice or global device tuning.
cgroup v2 gives a cleaner toolbox:
- hard caps when a workload must not exceed a ceiling,
- latency protection when a critical service must stay responsive,
- proportional sharing when multiple tenants should compete fairly,
- per-cgroup observability so you can see who is causing pain.
The big upgrade over older setups is that v2 can account and control buffered writeback and page-cache related I/O much more coherently, not just direct I/O.
1) Quick mental model
Use different knobs for different jobs:
- io.max: absolute cap on read/write bandwidth or IOPS.
  "This group may not exceed this ceiling."
- io.latency: protect a latency-sensitive cgroup by throttling lower-priority peers when the protected group misses its latency target.
  "If this service starts missing its latency target, squeeze the others."
- io.weight: proportional sharing knob.
  "When the device is contested, who gets a bigger slice?"
- io.cost.qos / io.cost.model: advanced controller for model-based proportional control.
  "Use a device cost model plus QoS bounds to shape proportional sharing more intelligently."
- io.stat / io.pressure: attribution + pressure signals.
  "What is this cgroup doing, and is it suffering?"
Rule of thumb:
- start with io.latency when protecting an online service,
- use io.max for obvious blast-radius containment,
- use io.weight / io.cost for multi-tenant fairness,
- always observe memory + I/O together, because writeback sits between them.
2) Decision matrix
A) Backup / batch / compaction hurts API p99
- Put batch work in a separate cgroup.
- Protect API with io.latency.
- If batch is still too bursty, add an io.max ceiling.
B) Several tenants need fair disk sharing
- Start with io.weight.
- If you need tighter model-based proportional behavior on a stable fleet, evaluate io.cost.
C) You must guarantee background work never exceeds a ceiling
- Use io.max for read/write BPS and/or IOPS.
- Treat it as a hard governor, not a fairness tool.
D) Host is oscillating between memory pressure and writeback pressure
- Inspect memory.pressure, io.pressure, memory.stat, and io.stat together.
- Fixing only memory or only disk often misses the feedback loop.
3) What each knob really does
io.max: hard cap
io.max limits per-device bandwidth and/or IOPS.
Supported keys:
rbps, wbps, riops, wiops
Example:
echo "8:16 rbps=2097152 wiops=120" | sudo tee /sys/fs/cgroup/batch/io.max
Interpretation:
- read capped at 2 MiB/s,
- write capped at 120 IOPS,
- short bursts may still appear because enforcement is not perfectly instantaneous,
- operationally this behaves as a non-work-conserving ceiling.
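io.max takes raw numeric values, so it helps to compute them instead of hand-typing byte counts. A minimal sketch, using a hypothetical mib_to_bytes helper (not part of any standard tool) to build the io.max line from the example above:

```shell
# Hypothetical helper: convert MiB/s to the raw bytes/s values io.max expects.
mib_to_bytes() {
  # $1 = MiB per second
  echo $(( $1 * 1024 * 1024 ))
}

# 2 MiB/s read cap, as used in the example above:
mib_to_bytes 2    # 2097152

# Build the full io.max line for device 8:16 (write it with tee as root):
printf '8:16 rbps=%s wiops=120\n' "$(mib_to_bytes 2)"
```

Generating the line this way avoids the classic off-by-a-factor-of-1024 mistakes when translating "2 MiB/s" into a cap.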
Use it when the question is: “What is the most this group is allowed to consume?”
Do not use it as your only latency tool unless you already know the safe ceiling.
io.latency: work-conserving latency protection
io.latency protects a cgroup by setting a target completion latency.
If that cgroup starts missing the target, the kernel throttles peer cgroups with looser targets.
Example interface:
echo "259:0 target=5000" | sudo tee /sys/fs/cgroup/api/io.latency
That means a 5 ms target on device 259:0.
Important behaviors:
- peer-level only: only siblings influence each other,
- work-conserving: if everyone is meeting target, the controller stays out of the way,
- throttling can happen through queue depth clamping and artificial delay charging,
- protected groups with the lowest / tightest target get priority under contention.
Use it when the question is: “How do I keep the critical service responsive without wasting idle capacity?”
io.weight: simple proportional share
io.weight is the intuitive fairness lever.
Higher weight means more share relative to siblings.
Example:
echo 400 | sudo tee /sys/fs/cgroup/api/io.weight
echo 100 | sudo tee /sys/fs/cgroup/batch/io.weight
This is good for:
- multiple always-on workloads,
- environments where you want soft competition rather than hard caps,
- first-pass tuning before using heavier machinery.
io.cost: advanced model-based proportional control
io.cost is more sophisticated and more operationally demanding.
It lives at the root cgroup and uses a device cost model plus QoS inputs.
Relevant files:
io.cost.qos, io.cost.model
Kernel docs expose knobs like:
- enable
- ctrl=auto|user
- latency percentiles / thresholds (rpct, rlat, wpct, wlat)
- scaling bounds (min, max)
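As a sketch of the interface shape (parameter values here are illustrative, not a recommendation; verify exact semantics against your kernel's cgroup-v2 documentation), an io.cost.qos line at the root looks roughly like:

```
# /sys/fs/cgroup/io.cost.qos -- one line per device, values illustrative
8:16 enable=1 ctrl=auto rpct=95.00 rlat=10000 wpct=95.00 wlat=10000 min=50.00 max=150.00
```

Here rlat/wlat are read/write latency thresholds in microseconds at the given percentiles, and min/max bound how far the controller may scale device speed estimates.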
Use io.cost only when:
- you have a relatively homogeneous fleet,
- device behavior is understood and stable enough,
- proportional fairness matters more than quick tactical containment,
- you can actually test and recalibrate it.
If not, io.latency + io.max is usually the better operational trade.
4) The memory / writeback connection people miss
This is the sneaky part.
Buffered writes, dirty page reclaim, and writeback sit between the memory and I/O domains. That means:
- memory pressure can become writeback pressure,
- writeback throttling can change application latency even when raw disk throughput looks fine,
- a cgroup flushing lots of dirty pages can hurt siblings unless the hierarchy is shaped correctly.
Practical consequence:
- when you tune I/O policy, also watch memory.pressure and memory.stat,
- when a workload is under memory reclaim, you may be debugging an I/O symptom caused by memory policy.
This is why cgroup v2 is much more useful than older approaches: page-cache writeback attribution is materially better.
5) Setup and discovery
Check that you are on unified hierarchy:
stat -fc %T /sys/fs/cgroup
# should print: cgroup2fs
See available controllers:
cat /sys/fs/cgroup/cgroup.controllers
cat /sys/fs/cgroup/cgroup.subtree_control
Enable I/O controller on the parent level:
sudo sh -c 'echo "+io" > /sys/fs/cgroup/cgroup.subtree_control'
Create sibling groups:
sudo mkdir -p /sys/fs/cgroup/api
sudo mkdir -p /sys/fs/cgroup/batch
Move processes:
echo <PID> | sudo tee /sys/fs/cgroup/api/cgroup.procs
echo <PID> | sudo tee /sys/fs/cgroup/batch/cgroup.procs
Remember the cgroup v2 rule: controllers distribute resources to children. Shape the tree deliberately.
6) Minimal safe patterns
Pattern 1: protect API from background jobs
# API gets tighter latency target
printf '259:0 target=5000\n' | sudo tee /sys/fs/cgroup/api/io.latency
# Batch gets looser target and lower weight
printf '259:0 target=20000\n' | sudo tee /sys/fs/cgroup/batch/io.latency
echo 100 | sudo tee /sys/fs/cgroup/batch/io.weight
Interpretation:
- API aims for 5 ms,
- batch is allowed 20 ms,
- when API misses target, batch gets throttled first.
Pattern 2: cap a backup job so it never dominates
echo '259:0 rbps=20971520 wbps=20971520 riops=200 wiops=200' | sudo tee /sys/fs/cgroup/backup/io.max
For direct writes to io.max, use raw numeric bytes/IOPS values. Human-friendly suffixes are more appropriate in higher-level tooling such as systemd properties.
Pattern 3: fair-share between two tenants
echo 300 | sudo tee /sys/fs/cgroup/tenant-a/io.weight
echo 100 | sudo tee /sys/fs/cgroup/tenant-b/io.weight
This gives tenant A roughly 3x the share of contested I/O time versus tenant B.
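The "roughly 3x" claim follows from how weights compose: each sibling's expected share of contested I/O is its weight divided by the sum of all sibling weights. A small sketch (the share helper is hypothetical, for illustration only):

```shell
# Sketch: expected contested-I/O share implied by sibling io.weight values.
# First argument is "my" weight, the rest are the sibling weights.
share() {
  awk '
    BEGIN {
      total = 0
      for (i = 1; i < ARGC; i++) total += ARGV[i]
      printf "%.0f\n", 100 * ARGV[1] / total
    }' "$@"
}

share 300 100   # tenant A: 75 (% of contested I/O time)
share 100 300   # tenant B: 25
```

Note these percentages only apply under contention; io.weight is work-conserving, so an idle device lets either tenant use everything.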
7) systemd mappings you will actually use
If the host is systemd-managed, don’t hand-build everything in /sys/fs/cgroup unless you need ad-hoc debugging.
Use unit properties.
Key mappings:
- IOAccounting=yes → enable I/O accounting
- IOWeight= / StartupIOWeight= → io.weight
- IODeviceWeight= → per-device io.weight
- IOReadBandwidthMax= / IOWriteBandwidthMax= → io.max
- IOReadIOPSMax= / IOWriteIOPSMax= → io.max
- IODeviceLatencyTargetSec= → io.latency
Examples:
sudo systemctl set-property api.service IOAccounting=yes
sudo systemctl set-property api.service IODeviceLatencyTargetSec="/dev/nvme0n1 5ms"
sudo systemctl set-property backup.service IOAccounting=yes
sudo systemctl set-property backup.service IOReadBandwidthMax="/var/backups 20M"
sudo systemctl set-property backup.service IOWriteBandwidthMax="/var/backups 20M"
sudo systemctl set-property tenant-a.service IOWeight=300
sudo systemctl set-property tenant-b.service IOWeight=100
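The same policy can live in a persistent drop-in instead of imperative set-property calls; a sketch mirroring the backup.service example above (directory and filename are conventional, pick your own):

```ini
# /etc/systemd/system/backup.service.d/io.conf
[Service]
IOAccounting=yes
IOReadBandwidthMax=/var/backups 20M
IOWriteBandwidthMax=/var/backups 20M
```

Run `sudo systemctl daemon-reload` after editing so systemd picks up the drop-in.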
Nice operational detail:
- systemd automatically enables needed controllers through the hierarchy,
- path-based device resolution is convenient, but on complex RAID / LV / exotic storage stacks, verify which underlying block device you are really controlling.
8) Observability checklist
At minimum, track per-cgroup:
- io.stat
- io.pressure
- memory.pressure
- app p95 / p99 latency
- timeout / retry rate
- writeback-related counters from memory.stat
Basic commands:
cat /sys/fs/cgroup/api/io.stat
cat /sys/fs/cgroup/api/io.pressure
cat /sys/fs/cgroup/api/memory.pressure
grep -E 'file|writeback|dirty|workingset' /sys/fs/cgroup/api/memory.stat
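io.stat is a line of key=value pairs per device, which is awkward to eyeball at scale. A small parsing sketch (parse_io_stat is a hypothetical helper; the sample line is illustrative, real files have one line per device):

```shell
# Sketch: extract per-device rbytes/wbytes from io.stat key=value pairs.
parse_io_stat() {
  awk '{
    # Field 1 is MAJOR:MINOR; the rest are key=value pairs.
    for (i = 2; i <= NF; i++) {
      split($i, pair, "=")
      kv[pair[1]] = pair[2]
    }
    printf "%s read=%s write=%s\n", $1, kv["rbytes"], kv["wbytes"]
    delete kv
  }'
}

echo '8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0' \
  | parse_io_stat
# 8:16 read=1459200 write=314773504
```

In practice you would feed it `cat /sys/fs/cgroup/api/io.stat` and diff two snapshots to get rates.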
When io.latency is active, watch for extra io.stat fields such as:
- depth
- avg_lat
- win
- delay-related fields on throttled groups
Interpretation hints:
- high avg_lat on the protected group + peer throttling → target may be too tight, or the device is actually saturated,
- rising io.pressure but modest throughput → queueing/latency problem, not just bandwidth,
- dirty/writeback growth plus memory pressure → reclaim/writeback loop, not purely an I/O scheduler issue,
- good averages but ugly p99 → classic contention-shape problem; io.latency usually beats blind bandwidth caps.
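PSI files like io.pressure share one format ("some" and "full" lines with avg10/avg60/avg300), so a tiny extractor covers all of them. A sketch with a hypothetical psi_some_avg10 helper and an illustrative alert threshold:

```shell
# Sketch: pull the "some avg10" value out of PSI-formatted text on stdin.
psi_some_avg10() {
  awk '/^some/ {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^avg10=/) { sub(/^avg10=/, "", $i); print $i }
  }'
}

# Illustrative sample of what io.pressure looks like:
sample='some avg10=12.50 avg60=3.10 avg300=0.80 total=123456
full avg10=4.00 avg60=1.00 avg300=0.20 total=65432'

avg10=$(printf '%s\n' "$sample" | psi_some_avg10)
echo "$avg10"   # 12.50

# Threshold of 5% stalled time is an illustrative starting point, not a standard.
awk -v v="$avg10" 'BEGIN { exit !(v > 5.0) }' && echo "I/O pressure high"
```

Pointing it at `/sys/fs/cgroup/api/io.pressure` in a loop gives a cheap per-cgroup alert signal.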
9) Rollout sequence that usually works
- Measure first: 24h baseline with service latency + io.stat / PSI snapshots.
- Separate workloads into sibling cgroups before tuning knobs.
- For online critical services, start with io.latency.
- For obvious noisy jobs, add io.max ceilings.
- Use io.weight for ongoing tenant fairness.
- Reach for io.cost only after the simple controls prove insufficient.
- Roll out canary → partial fleet → full fleet.
- Recalibrate after storage hardware, kernel, filesystem, or workload changes.
10) Common mistakes
- Using only io.max everywhere: you get rigid ceilings but poor latency behavior.
- Setting impossible io.latency targets: if the device cannot do 2 ms under real load, pretending it can just causes constant throttling noise.
- Ignoring sibling topology: io.latency works among peers; the wrong tree shape means the wrong protection behavior.
- Debugging I/O without memory context: many "disk issues" are dirty-page / reclaim issues wearing a disk-shaped mask.
- Relying blindly on path-to-device resolution in systemd: works nicely for simple storage, gets ambiguous on layered setups.
- Going straight to io.cost on day one: a powerful precision tool, but a bad first-response tool.
11) One-page starter policy
If you need a practical default today:
latency-sensitive API / DB frontends:
- separate cgroup,
- IOAccounting=yes,
- conservative IODeviceLatencyTargetSec= target based on measured device latency + safety margin.
background maintenance / compaction / backups:
- separate cgroup,
- lower IOWeight=,
- add IOReadBandwidthMax= / IOWriteBandwidthMax= if they can still cause p99 damage.
multi-tenant shared host:
- start with IOWeight= ratios,
- add io.latency only for truly user-facing critical groups,
- evaluate io.cost only for fleets where device model tuning is realistic.
If you are unsure, protect with latency first, cap second, model last.
Source notes
Primary references used for this note:
- Linux kernel cgroup v2 documentation (io.max, io.latency, io.cost, io.stat)
- Meta/Facebook cgroup2 IO controller guide
- systemd resource-control documentation / Arch manpage mapping for IOWeight, IOReadBandwidthMax, IODeviceLatencyTargetSec
- Viacheslav Biriukov's write-up on cgroup v2, page cache, writeback, and PSI