Linux cgroup v2 CPU Latency Protection Playbook (cpu.max, cpu.weight, cpuset, uclamp)
Date: 2026-03-17
Category: knowledge
Why this matters
Most Linux performance incidents are not “CPU ran out” incidents. They are contention-shape incidents:
- one background task monopolizes runqueue time,
- bursty workers create p99 spikes for latency-sensitive services,
- autoscaling looks fine while user-facing latency silently degrades.
cgroup v2 gives practical controls to protect latency-critical workloads without pretending all services are equal.
1) Quick mental model
Use these knobs for different jobs:
- cpu.max: hard cap (quota/period). "Never exceed this much CPU time."
- cpu.weight: proportional share under contention (1–10000, default 100). "When busy, who gets more turns?"
- cpuset.cpus (with the cpuset controller): CPU affinity partitioning. "Which cores may this group run on?"
- cpu.uclamp.min/max (if supported): scheduler utilization clamp hints. "Bias DVFS/scheduling toward a minimum responsiveness floor or a maximum cap."
Rule of thumb:
- cpu.max for blast-radius limits,
- cpu.weight for fairness among always-on services,
- cpuset for hard isolation,
- uclamp for latency-sensitive policy shaping.
2) Fast decision matrix
A) Background batch jobs hurting API p99
- Put batch in a separate cgroup.
- Set a cpu.max cap first.
- Give the API a higher cpu.weight.
B) Multiple online services contending on same node
- Keep all uncapped initially.
- Tune via cpu.weight ratios (e.g., 100/200/400).
- Add caps only if one tenant can still explode.
C) Strict noisy-neighbor isolation required
- Use cpuset.cpus to carve out dedicated cores.
- Optionally pair with cpu.max for safety.
D) Latency-sensitive service with frequency-droop risk
- If the kernel supports it, apply cpu.uclamp.min to the critical group.
- Validate power/thermal side effects before broad rollout.
3) Setup & discovery (10 minutes)
Check cgroup mode:
stat -fc %T /sys/fs/cgroup
# should be cgroup2fs
See available controllers:
cat /sys/fs/cgroup/cgroup.controllers
cat /sys/fs/cgroup/cgroup.subtree_control
Enable controllers at parent (example root-level):
# enable cpu + cpuset for child groups
sudo sh -c 'echo "+cpu +cpuset" > /sys/fs/cgroup/cgroup.subtree_control'
Create groups:
sudo mkdir -p /sys/fs/cgroup/api
sudo mkdir -p /sys/fs/cgroup/batch
Move a PID:
echo <PID> | sudo tee /sys/fs/cgroup/api/cgroup.procs
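To move a whole service rather than one PID, the same write can be looped over a process name. The service name below is a hypothetical placeholder; also note that on systemd hosts it is usually better to let systemd own the hierarchy (slices, Delegate=) than to write cgroup.procs directly, since systemd may migrate processes back.

```shell
# Sketch: move every process of a (hypothetical) service into the api group.
# Writing a PID to cgroup.procs migrates that PID's entire thread group.
for pid in $(pgrep -x my-api-server); do
    echo "$pid" | sudo tee /sys/fs/cgroup/api/cgroup.procs > /dev/null
done
```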
4) Minimal safe baseline policy
Example: protect api, constrain batch.
# API: higher share, no hard cap
echo 400 | sudo tee /sys/fs/cgroup/api/cpu.weight
echo "max 100000" | sudo tee /sys/fs/cgroup/api/cpu.max
# Batch: lower share + 2 CPU cap (period 100ms)
echo 100 | sudo tee /sys/fs/cgroup/batch/cpu.weight
echo "200000 100000" | sudo tee /sys/fs/cgroup/batch/cpu.max
Interpretation:
- API can expand when idle capacity exists.
- Batch is explicitly bounded and deprioritized under contention.
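The cpu.max format is "QUOTA PERIOD" in microseconds, so the effective CPU budget is quota/period. A quick sanity-check sketch, using the batch values from above:

```shell
# Sketch: convert a "quota period" pair (microseconds) to effective CPUs.
quota_us=200000     # from "200000 100000" above
period_us=100000

# POSIX shell has integer arithmetic only, so scale by 100 to keep 2 decimals
cpus_x100=$(( quota_us * 100 / period_us ))
printf 'effective CPU budget: %d.%02d CPUs\n' \
    $(( cpus_x100 / 100 )) $(( cpus_x100 % 100 ))
# prints: effective CPU budget: 2.00 CPUs
```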
5) cpuset isolation pattern (when fairness is not enough)
# check what the parent can actually offer first
# (cpuset.cpus exists only on non-root cgroups; read the .effective files here)
cat /sys/fs/cgroup/cpuset.cpus.effective
cat /sys/fs/cgroup/cpuset.mems.effective
# isolate API to cores 0-3, batch to 4-7 (example)
echo 0-3 | sudo tee /sys/fs/cgroup/api/cpuset.cpus
echo 0 | sudo tee /sys/fs/cgroup/api/cpuset.mems
echo 4-7 | sudo tee /sys/fs/cgroup/batch/cpuset.cpus
echo 0 | sudo tee /sys/fs/cgroup/batch/cpuset.mems
Use cpuset when strict latency SLOs justify lower average utilization efficiency.
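After writing the masks, confirm what the kernel actually granted: requested and effective values can differ (for example when CPUs are offline). The read-only .effective files show the result. A verification sketch, assuming the api/batch groups above:

```shell
# Sketch: verify the cpuset partition took effect for both groups.
for g in api batch; do
    echo "$g cpus: $(cat /sys/fs/cgroup/$g/cpuset.cpus.effective)"
    echo "$g mems: $(cat /sys/fs/cgroup/$g/cpuset.mems.effective)"
done
```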
6) uclamp usage (advanced, kernel-dependent)
If cpu.uclamp.min/cpu.uclamp.max files exist:
# keep critical group from dropping too low (example)
echo 25 | sudo tee /sys/fs/cgroup/api/cpu.uclamp.min
# cap non-critical burst aggressiveness (example)
echo 60 | sudo tee /sys/fs/cgroup/batch/cpu.uclamp.max
Caution: this influences scheduler utilization signals and can increase power draw. Treat as a canary-only feature first.
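Because these files only exist on kernels built with uclamp task-group support, a guard like this sketch keeps rollout scripts portable across hosts:

```shell
# Sketch: apply uclamp only where the interface exists.
# Values are percentages of maximum CPU capacity.
f=/sys/fs/cgroup/api/cpu.uclamp.min
if [ -f "$f" ]; then
    echo 25 | sudo tee "$f" > /dev/null
else
    echo "cpu.uclamp.min not present; kernel lacks uclamp cgroup support" >&2
fi
```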
7) Observability checklist
At minimum, track per-cgroup:
- cpu.stat (usage_usec, nr_periods, nr_throttled, throttled_usec)
- app p95/p99 latency
- timeout/retry rate
- runqueue pressure (/proc/pressure/cpu)
- host power/thermal if using uclamp
Quick read:
cat /sys/fs/cgroup/api/cpu.stat
cat /sys/fs/cgroup/batch/cpu.stat
cat /proc/pressure/cpu
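Raw cpu.stat counters are cumulative, so a ratio is easier to eyeball and alert on. A small awk sketch that summarizes one snapshot (the field names are the standard cpu.stat keys):

```shell
# Sketch: report what fraction of enforcement periods were throttled.
awk '
    $1 == "nr_periods"     { periods   = $2 }
    $1 == "nr_throttled"   { throttled = $2 }
    $1 == "throttled_usec" { tusec     = $2 }
    END {
        if (periods > 0)
            printf "throttled %.1f%% of periods (%.1f s total)\n",
                   100 * throttled / periods, tusec / 1e6
        else
            print "no enforcement periods recorded"
    }
' /sys/fs/cgroup/batch/cpu.stat
```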
Interpretation:
- rising nr_throttled + p99 spikes => cap too aggressive,
- low throttling but bad p99 => contention-shape issue (weights/cpuset),
- good averages + bad tails => likely scheduling burstiness, not raw CPU shortage.
8) Rollout sequence (practical)
- Measure baseline: 24h diurnal p95/p99 + cpu.stat snapshots.
- Apply weights only on canary nodes.
- Add cpu.max caps to noisy batch classes.
- Use cpuset only where SLO still unstable.
- Use uclamp last, and only with power/thermal guardrails.
- Promote gradually (10% → 30% → 100%) with rollback script ready.
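The "rollback script ready" step can be as simple as restoring scheduler defaults. A sketch, assuming the api/batch group names used earlier:

```shell
#!/bin/sh
# Rollback sketch: return both groups to neutral defaults in seconds.
set -eu
for g in api batch; do
    echo 100 | sudo tee /sys/fs/cgroup/$g/cpu.weight > /dev/null  # default weight
    echo max | sudo tee /sys/fs/cgroup/$g/cpu.max > /dev/null     # remove hard cap
done
```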
9) Common mistakes
- Using only quota (cpu.max) for everything: leads to throttle storms and p99 cliffs.
- Skipping weight tuning: misses the easiest contention-control lever.
- cpuset without parent/memory sanity: causes confusing task placement behavior.
- Treating cgroup controls as static: the workload mix changes, so policy should be periodically recalibrated.
- No per-cgroup telemetry: you cannot tune what you cannot attribute.
10) One-page starter policy
If you need a default today:
- user-facing services: cpu.weight=300–500, cpu.max=max
- async/background workers: cpu.weight=50–150, capped cpu.max
- batch/maintenance: low weight + strict cap + optional dedicated cpuset
- monitor nr_throttled + p99 together, not separately
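Applied as a script, the starter policy might look like the sketch below. The group names and the specific async cap are illustrative assumptions, not fixed recommendations:

```shell
#!/bin/sh
# Sketch: apply the one-page starter policy in a single pass.
set -eu
apply() {   # apply <group> <cpu.weight> <cpu.max value>
    d=/sys/fs/cgroup/$1
    sudo mkdir -p "$d"
    echo "$2" | sudo tee "$d/cpu.weight" > /dev/null
    echo "$3" | sudo tee "$d/cpu.max" > /dev/null
}
apply api   400 "max 100000"       # user-facing: high share, no cap
apply async 100 "400000 100000"    # background workers: capped at 4 CPUs
apply batch  50 "200000 100000"    # batch: low share, 2-CPU cap
```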
This alone removes a large fraction of "mystery latency" incidents on shared Linux nodes.
Closing
cgroup v2 CPU control is best treated as a latency-shaping system, not just a resource limiter.
When teams combine cpu.weight (fairness), cpu.max (blast-radius), and selective cpuset/uclamp (hard protection), they usually get:
- fewer tail-latency surprises,
- cleaner noisy-neighbor boundaries,
- and less reactive overprovisioning.