Linux PSI (Pressure Stall Information) Overload-Control Playbook

2026-03-16 · software


Why this matters

Most teams notice overload too late.

Linux PSI gives a direct signal of time lost to resource pressure.
That makes it a better control input for SLO protection than raw utilization alone.


1) PSI mental model in one paragraph

PSI tells you: “What fraction of wall-clock time were tasks stalled because a resource was unavailable?”

For each resource (cpu, memory, io), the kernel reports moving averages over three windows, plus a cumulative counter:

  • avg10: stall percentage over the last 10 seconds
  • avg60: over the last 60 seconds
  • avg300: over the last 300 seconds
  • total: cumulative stall time in microseconds

And two stall classes:

  • some: at least one task was stalled on the resource
  • full: all non-idle tasks were stalled at the same time

full is most critical for memory/io.
For CPU, some is usually the main operational signal (system-wide cpu full is not meaningful and is reported as zero).


2) Where to read PSI

PSI is exposed system-wide and per cgroup (cgroup v2):

  • /proc/pressure/cpu
  • /proc/pressure/memory
  • /proc/pressure/io
  • per-cgroup: cpu.pressure, memory.pressure, io.pressure in each cgroup directory

Example shape:

some avg10=12.45 avg60=8.12 avg300=3.04 total=123456789
full avg10=1.30 avg60=0.77 avg300=0.19 total=9876543
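
The format is easy to parse; a minimal sketch (the embedded sample stands in for reading a /proc/pressure/* file):

```python
def parse_psi(text):
    """Parse the contents of a /proc/pressure/* file into
    {'some': {...}, 'full': {...}} with float values.
    'total' is cumulative stall time in microseconds."""
    out = {}
    for line in text.strip().splitlines():
        kind, rest = line.split(None, 1)
        out[kind] = {k: float(v) for k, v in
                     (kv.split("=") for kv in rest.split())}
    return out

# Sample stands in for open("/proc/pressure/cpu").read()
sample = (
    "some avg10=12.45 avg60=8.12 avg300=3.04 total=123456789\n"
    "full avg10=1.30 avg60=0.77 avg300=0.19 total=9876543\n"
)
psi = parse_psi(sample)
print(psi["some"]["avg10"])   # 12.45
```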

Interpretation:

  • avg10=12.45 means that over the last 10 seconds, tasks were stalled for 12.45% of wall-clock time.
  • avg60 and avg300 smooth the same signal over 60 s and 300 s windows.
  • total is cumulative stall time in microseconds; its growth rate gives finer-grained detection than the averages.


3) Why load average and CPU% are not enough

Load average measures queue length, not progress loss. CPU% measures busy time; neither tells you how long tasks were stuck waiting.

Typical failure pattern:

  1. Cache miss storm / GC / reclaim starts.
  2. Threads block on memory+I/O.
  3. Throughput drops, queue depth grows, retries amplify.
  4. CPU% may drop while user latency worsens.

PSI catches this because stall time rises exactly when users feel pain.


4) SLO-first PSI signal set

Track this minimal bundle:

  • cpu some avg10/avg60
  • memory some and full avg10
  • io some and full avg10
  • p99 latency and timeout rate (user impact)
  • queue depth and retry rate

Rule: never alert on PSI alone. Alert on PSI + user impact or PSI + queue growth.


5) Practical thresholds (starting points, tune by service)

These are conservative starting points for online services (illustrative numbers; tune per service):

CPU pressure

  • some avg10 > 20-30% sustained for 1-2 min: investigate (AMBER)
  • some avg10 > 50% with user impact: act (RED)

Memory pressure

  • some avg10 > 10% sustained: AMBER
  • full avg10 > 5% sustained: RED territory

I/O pressure

  • some avg10 > 15% sustained: AMBER
  • full avg10 > 5% sustained: RED territory

Treat any sustained non-zero full as “do not ignore.”


6) Control policy: GREEN → AMBER → RED

GREEN (normal)

  • PSI averages at baseline; full near zero for memory/io.
  • No action: record baselines and keep headroom.

AMBER (rising pressure)

Trigger: PSI thresholds crossed without severe user impact yet.

Actions:

  • pause or deprioritize background jobs,
  • tighten retry budgets,
  • reduce optional work (prefetch, speculative calls),
  • pre-warm capacity and prepare scale-out.

RED (impacting users)

Trigger: PSI high + p99/timeout/queue alarm.

Actions:

  • shed optional traffic by priority,
  • lower concurrency caps immediately,
  • stop background jobs entirely,
  • throttle retries,
  • scale out / move noisy neighbors if possible.

Recovery with hysteresis

Exit RED only after:

  • PSI has stayed below the (stricter) recovery thresholds for a sustained window, and
  • p99, timeouts, and queue depth have returned to normal.

Avoid flap loops by using stricter recovery thresholds than trigger thresholds.
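
The trigger/recovery asymmetry can be sketched as a small state machine (thresholds and sample counts below are illustrative, not kernel-defined):

```python
class PressureController:
    """Hysteresis sketch: enter RED at a high threshold, return to GREEN
    only after several consecutive samples under a stricter (lower) bar."""

    def __init__(self, enter_red=25.0, exit_red=10.0, recover_samples=6):
        self.enter_red = enter_red          # avg10 % that triggers RED
        self.exit_red = exit_red            # stricter bar required to leave RED
        self.recover_samples = recover_samples
        self.state = "GREEN"
        self._below = 0                     # consecutive recovered samples

    def update(self, some_avg10):
        if self.state == "GREEN":
            if some_avg10 >= self.enter_red:
                self.state = "RED"
                self._below = 0
        else:
            if some_avg10 < self.exit_red:
                self._below += 1
                if self._below >= self.recover_samples:
                    self.state = "GREEN"
            else:
                self._below = 0             # recovery streak broken
        return self.state

ctl = PressureController(recover_samples=3)
print([ctl.update(x) for x in [5, 30, 12, 9, 9, 9]])
```

Note that a sample of 12 after the spike keeps the controller in RED even though it is below the 25% trigger: recovery demands the stricter 10% bar, which is exactly what prevents flapping.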


7) Integration patterns that work in production

A) PSI-driven adaptive concurrency

Use PSI as feedback for concurrency limiters:

  • pressure above target: decrease the concurrency limit multiplicatively,
  • pressure well below target: increase it additively,
  • clamp between a floor and a ceiling to avoid starvation and runaway.

Think PI/PID-like control, but with guardrails to avoid oscillation.
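
A simpler AIMD variant of that feedback loop, with illustrative constants (target, step sizes, floor/ceiling are all tunables, not canonical values):

```python
def adjust_limit(limit, cpu_some_avg10, target=10.0,
                 floor=8, ceiling=512):
    """AIMD sketch: multiplicative decrease when CPU 'some' pressure is
    above target, additive increase when comfortably below it."""
    if cpu_some_avg10 > target:
        limit = int(limit * 0.8)      # back off fast under pressure
    elif cpu_some_avg10 < target / 2:
        limit += 4                    # probe upward slowly when calm
    return max(floor, min(limit, ceiling))

print(adjust_limit(100, 25.0))   # 80: pressure high, shrink
print(adjust_limit(100, 2.0))    # 104: calm, probe upward
print(adjust_limit(10, 90.0))    # 8: clamped at the floor
```

The asymmetry (fast decrease, slow increase) is the guardrail against oscillation mentioned above.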

B) PSI-gated autoscaling

Do not autoscale on CPU% alone. Scale when PSI + latency + queue agree.

A common anti-pattern, and its fix:

  • Before: scale out when CPU% > 70%.
  • After: scale out when PSI, p99, and queue depth all exceed thresholds for several minutes.
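
The agreement condition can be expressed as a single predicate (threshold values are illustrative placeholders):

```python
def should_scale_out(cpu_some_avg60, p99_ms, queue_depth,
                     psi_thr=15.0, p99_slo_ms=250.0, queue_thr=100):
    """Gate sketch: scale only when pressure, latency,
    and queue growth all agree."""
    return (cpu_some_avg60 > psi_thr
            and p99_ms > p99_slo_ms
            and queue_depth > queue_thr)

print(should_scale_out(20.0, 400.0, 500))   # True: all three agree
print(should_scale_out(20.0, 100.0, 500))   # False: latency is fine
```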

C) PSI-aware background scheduling

Batch jobs, compaction, indexing, backfills:

  • throttle them when foreground PSI rises,
  • pause them entirely on sustained memory/io full pressure,
  • resume gradually once pressure clears, not all at once.

This prevents “maintenance jobs eating foreground SLOs.”
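
One way to implement the throttle, sketched with illustrative cutoffs (the 1% and 5% full-pressure bars are examples, not standards):

```python
def background_admit(mem_full_avg10, io_full_avg10, base=8):
    """How many background work units to admit this tick,
    driven by memory/io 'full' pressure."""
    worst = max(mem_full_avg10, io_full_avg10)
    if worst >= 5.0:
        return 0           # sustained full pressure: stop maintenance work
    if worst >= 1.0:
        return base // 2   # noticeable pressure: slow down
    return base            # calm: run at the base rate

print(background_admit(6.0, 0.0))   # 0
print(background_admit(2.0, 0.0))   # 4
print(background_admit(0.0, 0.0))   # 8
```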


8) Common mistakes

  1. Using one global threshold for every service
    Different latency budgets need different cutoffs.

  2. Alerting on every PSI blip
    Short bursts are normal; use duration + multi-signal conditions.

  3. No retry governance
    Under pressure, retries can multiply pressure. Enforce retry budgets.

  4. Ignoring memory.full/io.full
    These are often the earliest collapse signatures.

  5. No post-incident PSI timeline review
    If you only review app logs, you miss resource-coupling root causes.
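
Retry governance (mistake 3) is often enforced with a retry budget; a minimal sketch, with an illustrative 10% ratio:

```python
class RetryBudget:
    """Retry-budget sketch: retries may consume at most `ratio`
    of recent request volume."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def on_request(self):
        self.requests += 1

    def try_retry(self):
        # Admit a retry only while under budget; otherwise fail fast
        # instead of amplifying pressure.
        if self.retries < self.requests * self.ratio:
            self.retries += 1
            return True
        return False

budget = RetryBudget(ratio=0.1)
for _ in range(100):
    budget.on_request()
print(sum(budget.try_retry() for _ in range(20)))   # 10: budget caps retries
```

In production the counters would decay over a sliding window; the fixed counters here keep the sketch short.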


9) 30-minute incident runbook (PSI spike)

  1. Confirm which PSI dimension is dominant (cpu/memory/io).
  2. Check queue depth, p99, timeout trend.
  3. If memory/io full pressure is sustained:
    • shed optional traffic,
    • pause background jobs,
    • throttle retries immediately.
  4. Lower concurrency caps temporarily.
  5. If available, scale out / move noisy neighbors.
  6. After stabilization, capture:
    • PSI timeline,
    • control actions,
    • latency/throughput response.
  7. Turn findings into threshold/hysteresis updates.

10) What “good” looks like

A healthy service usually has:

  • low, stable avg10 values, with full near zero for memory/io,
  • short PSI blips that recover without intervention,
  • control actions (shedding, throttling) firing rarely,
  • PSI excursions that correlate with known events (deploys, batch windows).

If PSI keeps oscillating between high and low with recurring p99 spikes, your control loop is underdamped (or your retry/background policy is fighting your own system).

