Linux PSI (Pressure Stall Information) Overload-Control Playbook
Date: 2026-03-16
Category: knowledge
Why this matters
Most teams notice overload too late.
- CPU% can look “fine” while runnable queues are already congested.
- Memory usage can look “acceptable” while reclaim/compaction stalls are exploding tail latency.
- I/O utilization can look moderate while requests are waiting forever.
Linux PSI gives a direct signal of time lost to resource pressure.
That makes it a better control input for SLO protection than raw utilization alone.
1) PSI mental model in one paragraph
PSI tells you: “What fraction of wall-clock time were tasks stalled because a resource was unavailable?”
For each resource (cpu, memory, io), the kernel reports moving averages:
- avg10, avg60, avg300 — percent of wall-clock time stalled, over the last 10/60/300 seconds
- total — cumulative stall time in microseconds
And two stall classes:
- some: at least one runnable task was stalled
- full: all non-idle tasks were stalled (system-wide progress collapse)
full is most critical for memory/io.
For CPU, some is usually the main operational signal.
2) Where to read PSI
- /proc/pressure/cpu
- /proc/pressure/memory
- /proc/pressure/io
Example shape:
some avg10=12.45 avg60=8.12 avg300=3.04 total=123456789
full avg10=1.30 avg60=0.77 avg300=0.19 total=9876543
Interpretation:
- memory.some avg10=20 → during the last 10s, tasks lost ~20% of their time to memory pressure.
- io.full avg60 > 0 sustained → dangerous: periods where no task made progress due to I/O pressure.
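A minimal reader for these files (Python sketch; assumes a kernel with PSI enabled, i.e. 4.20+ with CONFIG_PSI=y, and the line format shown above):

```python
def read_psi(resource):
    """Return {'some': {...}, 'full': {...}} for 'cpu', 'memory', or 'io'."""
    out = {}
    with open(f"/proc/pressure/{resource}") as f:
        for line in f:
            kind, *fields = line.split()            # e.g. 'some', 'avg10=12.45', ...
            vals = dict(kv.split("=") for kv in fields)
            out[kind] = {
                "avg10": float(vals["avg10"]),
                "avg60": float(vals["avg60"]),
                "avg300": float(vals["avg300"]),
                "total_us": int(vals["total"]),     # cumulative stall, microseconds
            }
    return out

if __name__ == "__main__":
    for res in ("cpu", "memory", "io"):
        print(res, read_psi(res))
```

Note that older kernels expose only the some line for cpu; the parser above handles either shape.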
3) Why load average and CPU% are not enough
Load average is queue length pressure, not progress loss. CPU% is busy-time, not wait-time quality.
Typical failure pattern:
- Cache miss storm / GC / reclaim starts.
- Threads block on memory+I/O.
- Throughput drops, queue depth grows, retries amplify.
- CPU% may drop while user latency worsens.
PSI catches this because stall time rises exactly when users feel pain.
4) SLO-first PSI signal set
Track this minimal bundle:
- cpu.some avg10/avg60
- memory.some avg10 and memory.full avg10
- io.some avg10 and io.full avg10
- request p95/p99 latency
- queue depth / in-flight count
- error rate / timeout rate
Rule: never alert on PSI alone. Alert on PSI + user impact or PSI + queue growth.
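A sketch of that rule as code. The psi argument is a dict like {"cpu": read_psi("cpu"), ...} from the reader in section 2; p99_ms, p99_slo_ms, queue_depth, and queue_limit are hypothetical hooks into your own metrics pipeline, and the thresholds preview the starting points in section 5:

```python
def should_alert(psi, p99_ms, p99_slo_ms, queue_depth, queue_limit):
    """Alert only when PSI pressure AND user impact (or queue growth) agree."""
    psi_hot = (
        psi["cpu"]["some"]["avg10"] > 20
        or psi["memory"]["full"]["avg10"] > 1
        or psi["io"]["full"]["avg10"] > 1
    )
    user_impact = p99_ms > p99_slo_ms               # latency SLO breached
    queue_growth = queue_depth > 0.8 * queue_limit  # backlog building up
    return psi_hot and (user_impact or queue_growth)
```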
5) Practical thresholds (starting points, tune by service)
These are conservative defaults for online services:
CPU pressure
- cpu.some avg10 > 20 for 60s → CAUTION
- cpu.some avg10 > 35 for 30s + p99 rising → DEFENSIVE
Memory pressure
- memory.some avg10 > 10 for 60s → watch reclaim churn
- memory.full avg10 > 1 for 30s → serious contention
- memory.full avg10 > 5 for 15s → emergency shedding/scale-up
I/O pressure
- io.some avg10 > 10 for 60s → storage path saturation risk
- io.full avg10 > 1 for 30s → system-wide progress collapse risk
Treat sustained non-zero full pressure as "do not ignore."
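One way to keep these tunable per service is to carry them as data rather than hard-coding them (illustrative sketch; the names are not from any standard library):

```python
from dataclasses import dataclass

@dataclass
class PsiRule:
    metric: str       # e.g. "memory.full.avg10"
    threshold: float  # percent stall time
    hold_s: int       # must stay above threshold this long
    action: str

# The conservative defaults above; override per service.
DEFAULT_RULES = [
    PsiRule("cpu.some.avg10",    20, 60, "CAUTION"),
    PsiRule("cpu.some.avg10",    35, 30, "DEFENSIVE (p99 rising)"),
    PsiRule("memory.some.avg10", 10, 60, "watch reclaim churn"),
    PsiRule("memory.full.avg10",  1, 30, "serious contention"),
    PsiRule("memory.full.avg10",  5, 15, "emergency shed/scale-up"),
    PsiRule("io.some.avg10",     10, 60, "storage saturation risk"),
    PsiRule("io.full.avg10",      1, 30, "progress collapse risk"),
]
```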
6) Control policy: GREEN → AMBER → RED
GREEN (normal)
- standard admission limits
- normal background jobs
- normal retry policy
AMBER (rising pressure)
Trigger: PSI thresholds crossed without severe user impact yet.
Actions:
- tighten concurrency caps (10–20%)
- slow/stop low-priority background jobs
- reduce speculative retries / hedging
- enable short-term cache TTL relaxation if safe
RED (impacting users)
Trigger: PSI high + p99/timeout/queue alarm.
Actions:
- strict admission control (protect core endpoints)
- shed non-critical traffic/features
- freeze expensive async pipelines
- force retry-budget throttling
- scale out / rebalance workload if available
Recovery with hysteresis
Exit RED only after:
- PSI below recovery thresholds for N minutes
- queue depth and p99 stabilized
Avoid flap loops by using stricter recovery thresholds than trigger thresholds.
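A minimal state machine with that hysteresis built in (sketch; the thresholds and cooldown are illustrative, and it is driven here by a single cpu.some avg10 signal plus a user-impact flag):

```python
import time

TRIGGER_AMBER, TRIGGER_RED = 20.0, 35.0  # percent stall (cpu.some avg10)
RECOVER_BELOW = 10.0                     # stricter than both triggers
COOLDOWN_S = 300                         # calm time required before exiting

class OverloadState:
    def __init__(self):
        self.state = "GREEN"
        self.calm_since = None           # when PSI last dropped below recovery

    def update(self, psi_pct, user_impact):
        now = time.monotonic()
        if psi_pct >= TRIGGER_RED and user_impact:
            self.state, self.calm_since = "RED", None
        elif psi_pct >= TRIGGER_AMBER and self.state != "RED":
            self.state, self.calm_since = "AMBER", None
        elif psi_pct < RECOVER_BELOW:
            self.calm_since = self.calm_since or now
            if now - self.calm_since >= COOLDOWN_S:
                self.state = "GREEN"     # only after a full calm window
        else:
            self.calm_since = None       # between recovery and trigger: hold state
        return self.state
```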
7) Integration patterns that work in production
A) PSI-driven adaptive concurrency
Use PSI as feedback for concurrency limiters:
- high cpu.some/memory.some → lower the in-flight target
- stable low PSI for a cooldown window → slowly raise the target
Think PI/PID-like control, but with guardrails to avoid oscillation.
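An AIMD-style limiter captures the same idea with fewer knobs than full PID (sketch; the thresholds, step sizes, and 10s sampling cadence are illustrative):

```python
class AdaptiveLimiter:
    """Multiplicative decrease on PSI pressure, slow additive recovery."""
    def __init__(self, target=100, floor=10, ceiling=1000):
        self.target = target              # allowed in-flight requests
        self.floor, self.ceiling = floor, ceiling
        self.calm_ticks = 0

    def on_psi_sample(self, cpu_some_avg10, mem_some_avg10):
        if max(cpu_some_avg10, mem_some_avg10) > 20:
            self.target = max(self.floor, int(self.target * 0.8))  # back off fast
            self.calm_ticks = 0
        else:
            self.calm_ticks += 1
            if self.calm_ticks >= 6:      # ~1 min of calm at 10s samples
                self.target = min(self.ceiling, self.target + 5)   # recover slowly
        return self.target
```

The floor/ceiling bounds and the mandatory calm window before any increase are the anti-oscillation guardrails.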
B) PSI-gated autoscaling
Do not autoscale on CPU% alone. Scale when PSI + latency + queue agree.
Anti-pattern and fix:
- bad: scale only if CPU > 70%
- better: scale if cpu.some avg60 is high OR memory.full > 0 with queue growth
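The same condition as a gate function (sketch; inputs are hypothetical metric hooks, with psi shaped as in section 2):

```python
def should_scale_out(psi, p99_ms, p99_slo_ms, queue_depth, queue_prev):
    """Scale only when PSI, latency, and queue agree; a CPU%-only rule
    misses memory/io stalls entirely."""
    psi_pressure = (
        psi["cpu"]["some"]["avg60"] > 20
        or psi["memory"]["full"]["avg10"] > 0
    )
    return psi_pressure and (queue_depth > queue_prev or p99_ms > p99_slo_ms)
```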
C) PSI-aware background scheduling
Batch jobs, compaction, indexing, backfills:
- run when PSI low
- pause when memory.some or io.some spikes
This prevents “maintenance jobs eating foreground SLOs.”
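A gating loop for such jobs (sketch; reuses the read_psi() reader from section 2, and the limits are illustrative):

```python
import time

def run_batch_when_calm(work_items, some_limit=5.0, check_s=10):
    """Do one unit of batch work at a time, pausing while foreground
    memory/io pressure is elevated."""
    for item in work_items:
        while (read_psi("memory")["some"]["avg10"] > some_limit
               or read_psi("io")["some"]["avg10"] > some_limit):
            time.sleep(check_s)   # foreground under pressure: wait it out
        item()                    # calm enough: run one work item
```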
8) Common mistakes
- Using one global threshold for every service. Different latency budgets need different cutoffs.
- Alerting on every PSI blip. Short bursts are normal; use duration + multi-signal conditions.
- No retry governance. Under pressure, retries can multiply pressure; enforce retry budgets.
- Ignoring memory.full/io.full. These are often the earliest collapse signatures.
- No post-incident PSI timeline review. If you only review app logs, you miss resource-coupling root causes.
9) 30-minute incident runbook (PSI spike)
- Confirm which PSI dimension is dominant (cpu/memory/io); see the triage snippet after this list.
- Check queue depth, p99, timeout trend.
- If memory/io full pressure is sustained:
- shed optional traffic,
- pause background jobs,
- throttle retries immediately.
- Lower concurrency caps temporarily.
- If available, scale out / move noisy neighbors.
- After stabilization, capture:
- PSI timeline,
- control actions,
- latency/throughput response.
- Turn findings into threshold/hysteresis updates.
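For the first step, a quick triage helper (sketch; reuses read_psi() from section 2, and the heavy weighting of full pressure is a judgment call, not a standard formula):

```python
def dominant_pressure():
    """Rank cpu/memory/io by current stall severity."""
    scores = {}
    for res in ("cpu", "memory", "io"):
        p = read_psi(res)
        # Weight 'full' heavily: system-wide stalls hurt most.
        scores[res] = (p.get("some", {}).get("avg10", 0.0)
                       + 10 * p.get("full", {}).get("avg10", 0.0))
    return max(scores, key=scores.get), scores
```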
10) What “good” looks like
A healthy service usually has:
- low baseline PSI,
- brief pressure spikes that self-recover,
- no prolonged memory.full/io.full,
- and predictable p99 under burst.
If PSI keeps oscillating between high and low with recurring p99 spikes, your control loop is underdamped (or your retry/background policy is fighting your own system).
References (operator-facing)
- Linux kernel PSI documentation (Documentation/accounting/psi.rst)
- Facebook PSI overview and production motivation
- systemd-oomd design/use of PSI signals
- SRE overload-control patterns (admission, retry budgets, load shedding)