Linux PSI (Pressure Stall Information) Overload-Control Playbook

2026-03-16 · software


Why this matters

Most teams notice overload too late.

Linux PSI gives a direct signal of time lost to resource pressure.
That makes it a better control input for SLO protection than raw utilization alone.


1) PSI mental model in one paragraph

PSI tells you: “What fraction of wall-clock time were tasks stalled because a resource was unavailable?”

For each resource (cpu, memory, io), the kernel reports moving averages over three windows, plus a cumulative counter:

  • avg10: stall percentage over the last 10 seconds
  • avg60: over the last 60 seconds
  • avg300: over the last 300 seconds
  • total: cumulative stall time in microseconds

And two stall classes:

  • some: at least one task was stalled on the resource
  • full: all non-idle tasks were stalled at the same time

full is most critical for memory/io.
For CPU, some is usually the main operational signal (system-wide cpu full is not meaningful and is reported as zero).


2) Where to read PSI

PSI is exposed system-wide and per cgroup (cgroup v2):

  • /proc/pressure/cpu
  • /proc/pressure/memory
  • /proc/pressure/io
  • per-cgroup: cpu.pressure, memory.pressure, io.pressure in each cgroup directory

Example shape:

some avg10=12.45 avg60=8.12 avg300=3.04 total=123456789
full avg10=1.30 avg60=0.77 avg300=0.19 total=9876543
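
The format is easy to parse; a minimal sketch (the embedded sample stands in for reading a /proc/pressure/* file):

```python
def parse_psi(text):
    """Parse the contents of a /proc/pressure/* file into
    {'some': {...}, 'full': {...}} with float values.
    'total' is cumulative stall time in microseconds."""
    out = {}
    for line in text.strip().splitlines():
        kind, rest = line.split(None, 1)
        out[kind] = {k: float(v) for k, v in
                     (kv.split("=") for kv in rest.split())}
    return out

# Sample stands in for open("/proc/pressure/cpu").read()
sample = (
    "some avg10=12.45 avg60=8.12 avg300=3.04 total=123456789\n"
    "full avg10=1.30 avg60=0.77 avg300=0.19 total=9876543\n"
)
psi = parse_psi(sample)
print(psi["some"]["avg10"])   # 12.45
```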

Interpretation:

  • avg10=12.45 means that over the last 10 seconds, tasks were stalled for 12.45% of wall-clock time.
  • avg60 and avg300 smooth the same signal over 60 s and 300 s windows.
  • total is cumulative stall time in microseconds; its growth rate gives finer-grained detection than the averages.


3) Why load average and CPU% are not enough

Load average measures queue length, not progress loss. CPU% measures busy time; neither tells you how long tasks were stuck waiting.

Typical failure pattern:

  1. Cache miss storm / GC / reclaim starts.
  2. Threads block on memory+I/O.
  3. Throughput drops, queue depth grows, retries amplify.
  4. CPU% may drop while user latency worsens.

PSI catches this because stall time rises exactly when users feel pain.


4) SLO-first PSI signal set

Track this minimal bundle:

  • cpu some avg10/avg60
  • memory some and full avg10
  • io some and full avg10
  • p99 latency and timeout rate (user impact)
  • queue depth and retry rate

Rule: never alert on PSI alone. Alert on PSI + user impact or PSI + queue growth.


5) Practical thresholds (starting points, tune by service)

These are conservative starting points for online services (illustrative numbers; tune per service):

CPU pressure

  • some avg10 > 20-30% sustained for 1-2 min: investigate (AMBER)
  • some avg10 > 50% with user impact: act (RED)

Memory pressure

  • some avg10 > 10% sustained: AMBER
  • full avg10 > 5% sustained: RED territory

I/O pressure

  • some avg10 > 15% sustained: AMBER
  • full avg10 > 5% sustained: RED territory

Treat any sustained non-zero full as “do not ignore.”


6) Control policy: GREEN → AMBER → RED

GREEN (normal)

  • PSI averages at baseline; full near zero for memory/io.
  • No action: record baselines and keep headroom.

AMBER (rising pressure)

Trigger: PSI thresholds crossed without severe user impact yet.

Actions:

  • pause or deprioritize background jobs,
  • tighten retry budgets,
  • reduce optional work (prefetch, speculative calls),
  • pre-warm capacity and prepare scale-out.

RED (impacting users)

Trigger: PSI high + p99/timeout/queue alarm.

Actions:

  • shed optional traffic by priority,
  • lower concurrency caps immediately,
  • stop background jobs entirely,
  • throttle retries,
  • scale out / move noisy neighbors if possible.

Recovery with hysteresis

Exit RED only after:

  • PSI has stayed below the (stricter) recovery thresholds for a sustained window, and
  • p99, timeouts, and queue depth have returned to normal.

Avoid flap loops by using stricter recovery thresholds than trigger thresholds.
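
The trigger/recovery asymmetry can be sketched as a small state machine (thresholds and sample counts below are illustrative, not kernel-defined):

```python
class PressureController:
    """Hysteresis sketch: enter RED at a high threshold, return to GREEN
    only after several consecutive samples under a stricter (lower) bar."""

    def __init__(self, enter_red=25.0, exit_red=10.0, recover_samples=6):
        self.enter_red = enter_red          # avg10 % that triggers RED
        self.exit_red = exit_red            # stricter bar required to leave RED
        self.recover_samples = recover_samples
        self.state = "GREEN"
        self._below = 0                     # consecutive recovered samples

    def update(self, some_avg10):
        if self.state == "GREEN":
            if some_avg10 >= self.enter_red:
                self.state = "RED"
                self._below = 0
        else:
            if some_avg10 < self.exit_red:
                self._below += 1
                if self._below >= self.recover_samples:
                    self.state = "GREEN"
            else:
                self._below = 0             # recovery streak broken
        return self.state

ctl = PressureController(recover_samples=3)
print([ctl.update(x) for x in [5, 30, 12, 9, 9, 9]])
```

Note that a sample of 12 after the spike keeps the controller in RED even though it is below the 25% trigger: recovery demands the stricter 10% bar, which is exactly what prevents flapping.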


7) Integration patterns that work in production

A) PSI-driven adaptive concurrency

Use PSI as feedback for concurrency limiters:

  • pressure above target: decrease the concurrency limit multiplicatively,
  • pressure well below target: increase it additively,
  • clamp between a floor and a ceiling to avoid starvation and runaway.

Think PI/PID-like control, but with guardrails to avoid oscillation.
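
A simpler AIMD variant of that feedback loop, with illustrative constants (target, step sizes, floor/ceiling are all tunables, not canonical values):

```python
def adjust_limit(limit, cpu_some_avg10, target=10.0,
                 floor=8, ceiling=512):
    """AIMD sketch: multiplicative decrease when CPU 'some' pressure is
    above target, additive increase when comfortably below it."""
    if cpu_some_avg10 > target:
        limit = int(limit * 0.8)      # back off fast under pressure
    elif cpu_some_avg10 < target / 2:
        limit += 4                    # probe upward slowly when calm
    return max(floor, min(limit, ceiling))

print(adjust_limit(100, 25.0))   # 80: pressure high, shrink
print(adjust_limit(100, 2.0))    # 104: calm, probe upward
print(adjust_limit(10, 90.0))    # 8: clamped at the floor
```

The asymmetry (fast decrease, slow increase) is the guardrail against oscillation mentioned above.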

B) PSI-gated autoscaling

Do not autoscale on CPU% alone. Scale when PSI + latency + queue agree.

A common anti-pattern, and its fix:

  • Before: scale out when CPU% > 70%.
  • After: scale out when PSI, p99, and queue depth all exceed thresholds for several minutes.
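
The agreement condition can be expressed as a single predicate (threshold values are illustrative placeholders):

```python
def should_scale_out(cpu_some_avg60, p99_ms, queue_depth,
                     psi_thr=15.0, p99_slo_ms=250.0, queue_thr=100):
    """Gate sketch: scale only when pressure, latency,
    and queue growth all agree."""
    return (cpu_some_avg60 > psi_thr
            and p99_ms > p99_slo_ms
            and queue_depth > queue_thr)

print(should_scale_out(20.0, 400.0, 500))   # True: all three agree
print(should_scale_out(20.0, 100.0, 500))   # False: latency is fine
```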

C) PSI-aware background scheduling

Batch jobs, compaction, indexing, backfills:

  • throttle them when foreground PSI rises,
  • pause them entirely on sustained memory/io full pressure,
  • resume gradually once pressure clears, not all at once.

This prevents “maintenance jobs eating foreground SLOs.”
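
One way to implement the throttle, sketched with illustrative cutoffs (the 1% and 5% full-pressure bars are examples, not standards):

```python
def background_admit(mem_full_avg10, io_full_avg10, base=8):
    """How many background work units to admit this tick,
    driven by memory/io 'full' pressure."""
    worst = max(mem_full_avg10, io_full_avg10)
    if worst >= 5.0:
        return 0           # sustained full pressure: stop maintenance work
    if worst >= 1.0:
        return base // 2   # noticeable pressure: slow down
    return base            # calm: run at the base rate

print(background_admit(6.0, 0.0))   # 0
print(background_admit(2.0, 0.0))   # 4
print(background_admit(0.0, 0.0))   # 8
```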


8) Common mistakes

  1. Using one global threshold for every service
    Different latency budgets need different cutoffs.

  2. Alerting on every PSI blip
    Short bursts are normal; use duration + multi-signal conditions.

  3. No retry governance
    Under pressure, retries can multiply pressure. Enforce retry budgets.

  4. Ignoring memory.full/io.full
    These are often the earliest collapse signatures.

  5. No post-incident PSI timeline review
    If you only review app logs, you miss resource-coupling root causes.
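
Retry governance (mistake 3) is often enforced with a retry budget; a minimal sketch, with an illustrative 10% ratio:

```python
class RetryBudget:
    """Retry-budget sketch: retries may consume at most `ratio`
    of recent request volume."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def on_request(self):
        self.requests += 1

    def try_retry(self):
        # Admit a retry only while under budget; otherwise fail fast
        # instead of amplifying pressure.
        if self.retries < self.requests * self.ratio:
            self.retries += 1
            return True
        return False

budget = RetryBudget(ratio=0.1)
for _ in range(100):
    budget.on_request()
print(sum(budget.try_retry() for _ in range(20)))   # 10: budget caps retries
```

In production the counters would decay over a sliding window; the fixed counters here keep the sketch short.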


9) 30-minute incident runbook (PSI spike)

  1. Confirm which PSI dimension is dominant (cpu/memory/io).
  2. Check queue depth, p99, timeout trend.
  3. If memory/io full pressure is sustained:
    • shed optional traffic,
    • pause background jobs,
    • throttle retries immediately.
  4. Lower concurrency caps temporarily.
  5. If available, scale out / move noisy neighbors.
  6. After stabilization, capture:
    • PSI timeline,
    • control actions,
    • latency/throughput response.
  7. Turn findings into threshold/hysteresis updates.

10) What “good” looks like

A healthy service usually has:

  • low, stable avg10 values, with full near zero for memory/io,
  • short PSI blips that recover without intervention,
  • control actions (shedding, throttling) firing rarely,
  • PSI excursions that correlate with known events (deploys, batch windows).

If PSI keeps oscillating between high and low with recurring p99 spikes, your control loop is underdamped (or your retry/background policy is fighting your own system).

