Linux cgroup v2 CPU Latency Protection Playbook (cpu.max, cpu.weight, cpuset, uclamp)
Date: 2026-03-17
Category: knowledge
Why this matters
Most Linux performance incidents are not “CPU ran out” incidents. They are contention-shape incidents:
- one background task monopolizes runqueue time,
- bursty workers create p99 spikes for latency-sensitive services,
- autoscaling looks fine while user-facing latency silently degrades.
cgroup v2 gives practical controls to protect latency-critical workloads without pretending all services are equal.
1) Quick mental model
Use these knobs for different jobs:
- cpu.max: hard cap (quota/period). "Never exceed this much CPU time."
- cpu.weight: proportional share under contention (1–10000, default 100). "When busy, who gets more turns?"
- cpuset.cpus (with the cpuset controller): CPU affinity partitioning. "Which cores may this group run on?"
- cpu.uclamp.min/max (if supported): scheduler utilization clamp hints. "Bias DVFS/scheduling toward a minimum responsiveness floor or a maximum cap."
Rule of thumb:
- cpu.max for blast-radius limits,
- cpu.weight for fairness among always-on services,
- cpuset for hard isolation,
- uclamp for latency-sensitive policy shaping.
2) Fast decision matrix
A) Background batch jobs hurting API p99
- Put batch in a separate cgroup.
- Set a cpu.max cap first.
- Give the API a higher cpu.weight.
B) Multiple online services contending on same node
- Keep all uncapped initially.
- Tune via cpu.weight ratios (e.g., 100/200/400).
- Add caps only if one tenant can still explode.
C) Strict noisy-neighbor isolation required
- Use cpuset.cpus to carve out dedicated cores.
- Optionally pair with cpu.max for safety.
D) Latency-sensitive service with frequency-droop risk
- If the kernel supports it, apply cpu.uclamp.min to the critical group.
- Validate power/thermal side effects before broad rollout.
3) Setup & discovery (10 minutes)
Check cgroup mode:
stat -fc %T /sys/fs/cgroup
# should be cgroup2fs
See available controllers:
cat /sys/fs/cgroup/cgroup.controllers
cat /sys/fs/cgroup/cgroup.subtree_control
Enable controllers at parent (example root-level):
# enable cpu + cpuset for child groups
sudo sh -c 'echo "+cpu +cpuset" > /sys/fs/cgroup/cgroup.subtree_control'
Create groups:
sudo mkdir -p /sys/fs/cgroup/api
sudo mkdir -p /sys/fs/cgroup/batch
Move a PID:
echo <PID> | sudo tee /sys/fs/cgroup/api/cgroup.procs
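To move a whole service rather than one PID, the same write can be looped over a process name. The service name below is a hypothetical placeholder; also note that on systemd hosts it is usually better to let systemd own the hierarchy (slices, Delegate=) than to write cgroup.procs directly, since systemd may migrate processes back.

```shell
# Sketch: move every process of a (hypothetical) service into the api group.
# Writing a PID to cgroup.procs migrates that PID's entire thread group.
for pid in $(pgrep -x my-api-server); do
    echo "$pid" | sudo tee /sys/fs/cgroup/api/cgroup.procs > /dev/null
done
```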
4) Minimal safe baseline policy
Example: protect api, constrain batch.
# API: higher share, no hard cap
echo 400 | sudo tee /sys/fs/cgroup/api/cpu.weight
echo "max 100000" | sudo tee /sys/fs/cgroup/api/cpu.max
# Batch: lower share + 2 CPU cap (period 100ms)
echo 100 | sudo tee /sys/fs/cgroup/batch/cpu.weight
echo "200000 100000" | sudo tee /sys/fs/cgroup/batch/cpu.max
Interpretation:
- API can expand when idle capacity exists.
- Batch is explicitly bounded and deprioritized under contention.
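The cpu.max format is "QUOTA PERIOD" in microseconds, so the effective CPU budget is quota/period. A quick sanity-check sketch, using the batch values from above:

```shell
# Sketch: convert a "quota period" pair (microseconds) to effective CPUs.
quota_us=200000     # from "200000 100000" above
period_us=100000

# POSIX shell has integer arithmetic only, so scale by 100 to keep 2 decimals
cpus_x100=$(( quota_us * 100 / period_us ))
printf 'effective CPU budget: %d.%02d CPUs\n' \
    $(( cpus_x100 / 100 )) $(( cpus_x100 % 100 ))
# prints: effective CPU budget: 2.00 CPUs
```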
5) cpuset isolation pattern (when fairness is not enough)
# check what the parent can actually offer first
# (cpuset.cpus exists only on non-root cgroups; read the .effective files here)
cat /sys/fs/cgroup/cpuset.cpus.effective
cat /sys/fs/cgroup/cpuset.mems.effective
# isolate API to cores 0-3, batch to 4-7 (example)
echo 0-3 | sudo tee /sys/fs/cgroup/api/cpuset.cpus
echo 0 | sudo tee /sys/fs/cgroup/api/cpuset.mems
echo 4-7 | sudo tee /sys/fs/cgroup/batch/cpuset.cpus
echo 0 | sudo tee /sys/fs/cgroup/batch/cpuset.mems
Use cpuset when strict latency SLOs justify lower average utilization efficiency.
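After writing the masks, confirm what the kernel actually granted: requested and effective values can differ (for example when CPUs are offline). The read-only .effective files show the result. A verification sketch, assuming the api/batch groups above:

```shell
# Sketch: verify the cpuset partition took effect for both groups.
for g in api batch; do
    echo "$g cpus: $(cat /sys/fs/cgroup/$g/cpuset.cpus.effective)"
    echo "$g mems: $(cat /sys/fs/cgroup/$g/cpuset.mems.effective)"
done
```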
6) uclamp usage (advanced, kernel-dependent)
If cpu.uclamp.min/cpu.uclamp.max files exist:
# keep critical group from dropping too low (example)
echo 25 | sudo tee /sys/fs/cgroup/api/cpu.uclamp.min
# cap non-critical burst aggressiveness (example)
echo 60 | sudo tee /sys/fs/cgroup/batch/cpu.uclamp.max
Caution: this influences scheduler utilization signals and can increase power draw. Treat as a canary-only feature first.
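Because these files only exist on kernels built with uclamp task-group support, a guard like this sketch keeps rollout scripts portable across hosts:

```shell
# Sketch: apply uclamp only where the interface exists.
# Values are percentages of maximum CPU capacity.
f=/sys/fs/cgroup/api/cpu.uclamp.min
if [ -f "$f" ]; then
    echo 25 | sudo tee "$f" > /dev/null
else
    echo "cpu.uclamp.min not present; kernel lacks uclamp cgroup support" >&2
fi
```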
7) Observability checklist
At minimum, track per-cgroup:
- cpu.stat (usage_usec, nr_periods, nr_throttled, throttled_usec)
- app p95/p99 latency
- timeout/retry rate
- runqueue pressure (/proc/pressure/cpu)
- host power/thermal if using uclamp
Quick read:
cat /sys/fs/cgroup/api/cpu.stat
cat /sys/fs/cgroup/batch/cpu.stat
cat /proc/pressure/cpu
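Raw cpu.stat counters are cumulative, so a ratio is easier to eyeball and alert on. A small awk sketch that summarizes one snapshot (the field names are the standard cpu.stat keys):

```shell
# Sketch: report what fraction of enforcement periods were throttled.
awk '
    $1 == "nr_periods"     { periods   = $2 }
    $1 == "nr_throttled"   { throttled = $2 }
    $1 == "throttled_usec" { tusec     = $2 }
    END {
        if (periods > 0)
            printf "throttled %.1f%% of periods (%.1f s total)\n",
                   100 * throttled / periods, tusec / 1e6
        else
            print "no enforcement periods recorded"
    }
' /sys/fs/cgroup/batch/cpu.stat
```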
Interpretation:
- rising nr_throttled + p99 spikes => cap too aggressive,
- low throttling but bad p99 => contention-shape issue (weights/cpuset),
- good averages + bad tails => likely scheduling burstiness, not raw CPU shortage.
8) Rollout sequence (practical)
- Measure baseline: 24h diurnal p95/p99 + cpu.stat snapshots.
- Apply weights only on canary nodes.
- Add cpu.max caps to noisy batch classes.
- Use cpuset only where SLO still unstable.
- Use uclamp last, and only with power/thermal guardrails.
- Promote gradually (10% → 30% → 100%) with rollback script ready.
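The "rollback script ready" step can be as simple as restoring scheduler defaults. A sketch, assuming the api/batch group names used earlier:

```shell
#!/bin/sh
# Rollback sketch: return both groups to neutral defaults in seconds.
set -eu
for g in api batch; do
    echo 100 | sudo tee /sys/fs/cgroup/$g/cpu.weight > /dev/null  # default weight
    echo max | sudo tee /sys/fs/cgroup/$g/cpu.max > /dev/null     # remove hard cap
done
```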
9) Common mistakes
- Using only quota (cpu.max) for everything: leads to throttle storms and p99 cliffs.
- Skipping weight tuning: misses the easiest contention-control lever.
- cpuset without parent/memory sanity: causes confusing task placement behavior.
- Treating cgroup controls as static: the workload mix changes, so policy should be periodically recalibrated.
- No per-cgroup telemetry: you cannot tune what you cannot attribute.
10) One-page starter policy
If you need a default today:
- user-facing services: cpu.weight=300–500, cpu.max=max
- async/background workers: cpu.weight=50–150, capped cpu.max
- batch/maintenance: low weight + strict cap + optional dedicated cpuset
- monitor nr_throttled + p99 together, not separately
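Applied as a script, the starter policy might look like the sketch below. The group names and the specific async cap are illustrative assumptions, not fixed recommendations:

```shell
#!/bin/sh
# Sketch: apply the one-page starter policy in a single pass.
set -eu
apply() {   # apply <group> <cpu.weight> <cpu.max value>
    d=/sys/fs/cgroup/$1
    sudo mkdir -p "$d"
    echo "$2" | sudo tee "$d/cpu.weight" > /dev/null
    echo "$3" | sudo tee "$d/cpu.max" > /dev/null
}
apply api   400 "max 100000"       # user-facing: high share, no cap
apply async 100 "400000 100000"    # background workers: capped at 4 CPUs
apply batch  50 "200000 100000"    # batch: low share, 2-CPU cap
```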
This alone removes a large fraction of "mystery latency" incidents on shared Linux nodes.
Closing
cgroup v2 CPU control is best treated as a latency-shaping system, not just a resource limiter.
When teams combine cpu.weight (fairness), cpu.max (blast-radius), and selective cpuset/uclamp (hard protection), they usually get:
- fewer tail-latency surprises,
- cleaner noisy-neighbor boundaries,
- and less reactive overprovisioning.