PostgreSQL MVCC + Autovacuum Bloat Control Playbook

2026-03-13 · software

Why this matters

PostgreSQL gets its write concurrency from MVCC: updates and deletes leave dead tuple versions behind, and readers keep going without blocking writers (and vice versa).

That convenience becomes a hidden tax if vacuum discipline is weak: tables and indexes bloat, cache and I/O are spent on dead data, p95/p99 latency creeps up, and anti-wraparound risk accumulates in the background.

This playbook treats bloat control as an operations loop, not occasional cleanup.


Core model (what to optimize)

MVCC health is a 3-way balance:

  1. Dead tuple creation rate (writes/updates/deletes)
  2. Cleanup throughput (autovacuum + vacuum cost budget)
  3. Tuple visibility constraints (long transactions, replicas, old snapshots)

If creation > cleanup for long enough, you enter debt mode.

Vacuum trigger math (per table)

Autovacuum vacuums a table roughly when its dead tuple count exceeds:

autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples

The analyze trigger works the same way with autovacuum_analyze_threshold and autovacuum_analyze_scale_factor.

Global defaults are often too loose for very large or very hot tables.
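To see why, plug in PostgreSQL's shipped defaults (threshold 50, scale factor 0.2) for a 100M-row table:

```sql
-- Defaults: autovacuum_vacuum_threshold = 50, autovacuum_vacuum_scale_factor = 0.2
-- On a 100M-row table, cleanup does not even start until ~20M dead tuples:
SELECT 50 + 0.2 * 100000000 AS dead_tuples_before_vacuum;  -- 20000050
```

That is the scale-factor trap in one line: the trigger grows linearly with table size.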


Failure modes to watch

  1. Scale-factor trap on huge tables
    A high scale factor means millions of dead tuples before cleanup starts.

  2. Long-running transactions pin visibility horizon
    Vacuum runs, but cannot reclaim enough dead tuples.

  3. Index bloat dominates table bloat
    Heap looks okay, but indexes become fat and random I/O heavy.

  4. Freeze debt accumulation
    Waiting too long near XID age limits forces aggressive work at bad times.

  5. Autovacuum worker starvation
    Too few workers and too little I/O budget for write volume.
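Failure mode 2 is the easiest to confirm directly. A sketch against pg_stat_activity that lists the backends holding the visibility horizon back, oldest first:

```sql
-- Oldest open transactions first; backend_xmin shows the pinned horizon.
SELECT pid, usename, state, xact_start,
       now() - xact_start AS xact_age,
       backend_xmin
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;
```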


Observability: minimum dashboard

Track these as first-class signals:

  • Dead-tuple counts and ratio per hot table (pg_stat_user_tables)
  • Time since last autovacuum/autoanalyze per table
  • Age of the oldest open transaction and oldest backend_xmin
  • age(relfrozenxid) versus autovacuum_freeze_max_age
  • Autovacuum worker activity and runs that keep getting interrupted

Practical derived metrics

The workhorse is the dead-tuple ratio (DTR = n_dead_tup / (n_live_tup + n_dead_tup)) and its slope over time. You do not need perfect bloat estimation to detect danger early; trend and slope are enough.
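A minimal dashboard query over pg_stat_user_tables covering the signals above:

```sql
-- Top tables by dead-tuple ratio (DTR), with last cleanup timestamps.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(n_dead_tup::numeric / nullif(n_live_tup + n_dead_tup, 0), 3) AS dtr,
       last_autovacuum,
       last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
```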


State machine for operations

Use explicit states so actions are predictable:

GREEN — DTR flat, no pinned horizons, cleanup keeping pace with writes.

Action: monitor only; resist speculative tuning.

AMBER — DTR trending up on hot tables, or cleanup visibly lagging churn.

Action: apply per-table tuning and hunt down long transactions.

RED — bloat is already showing up as query latency or disk growth.

Action: run the incident runbook; stabilize before re-tuning.

SAFE (anti-wraparound) — XID age approaching autovacuum_freeze_max_age.

Action: freeze work takes priority over everything else, including latency.


Tuning playbook (high leverage)

1) Per-table overrides before global aggression

For high-churn large tables, set tighter table-local knobs:
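A hedged sketch (the table name and numbers are illustrative, not recommendations — derive them from observed churn):

```sql
-- Illustrative per-table storage parameters for a hot, large table.
ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor  = 0.01,   -- vacuum at ~1% dead instead of 20%
  autovacuum_vacuum_threshold     = 10000,
  autovacuum_analyze_scale_factor = 0.005
);
```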

This avoids over-vacuuming cold tables while protecting hot ones.

2) Keep long transactions on a leash

Operational policy:

  • Alert on any transaction (or idle-in-transaction session) older than a fixed budget.
  • Enforce idle_in_transaction_session_timeout and statement_timeout for application roles.
  • Audit replication slots and hot_standby_feedback for horizons pinned by replicas.

MVCC hygiene is impossible if visibility horizons remain pinned.
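A sketch of enforcing the policy per role (role name and timeout values are illustrative policy choices, not universal defaults):

```sql
-- Hypothetical application role; pick budgets that match your workload.
ALTER ROLE app_rw SET idle_in_transaction_session_timeout = '5min';
ALTER ROLE app_rw SET statement_timeout = '30s';
```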

3) Budget enough parallel cleanup capacity

Review:

  • autovacuum_max_workers (parallel cleanup capacity)
  • autovacuum_vacuum_cost_limit / autovacuum_vacuum_cost_delay (I/O budget)
  • autovacuum_naptime (scheduling latency)
  • autovacuum_work_mem / maintenance_work_mem (memory per vacuum run)

If write traffic rose but the vacuum budget stayed static, debt is expected.
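To review the whole budget in one place:

```sql
-- Current autovacuum and vacuum-cost settings, with units.
SELECT name, setting, unit
FROM pg_settings
WHERE name LIKE 'autovacuum%'
   OR name IN ('maintenance_work_mem', 'vacuum_cost_limit', 'vacuum_cost_delay')
ORDER BY name;
```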

4) Separate index remediation from heap remediation

If query latency is index-random-I/O dominated:

  • Rank indexes by size and scan counts; rebuild the worst with REINDEX CONCURRENTLY.
  • Drop indexes that are never scanned; they still pay full write and bloat costs.
  • Remember plain VACUUM marks heap space for reuse but rarely shrinks fat indexes.
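A ranking sketch using pg_stat_user_indexes:

```sql
-- Rank indexes by size; never-scanned ones (idx_scan = 0) are drop candidates.
SELECT schemaname, relname, indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20;
```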

5) Freeze-risk governance

Don’t wait for anti-wraparound autovacuum to force the issue; schedule freeze work while you still control the timing.
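To watch the ceiling, compare per-table XID age against the configured trigger:

```sql
-- Tables (and matviews/TOAST) closest to the freeze ceiling.
SELECT c.relname,
       age(c.relfrozenxid) AS xid_age,
       current_setting('autovacuum_freeze_max_age')::bigint AS freeze_max_age
FROM pg_class c
WHERE c.relkind IN ('r', 'm', 't')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 20;
```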


Incident runbook (when latency is already burning)

  1. Triage top offenders

    • rank by hot query impact × relation size growth × DTR.
  2. Remove blockers

    • identify long transactions / pinned snapshots.
  3. Stabilize write pressure

    • rate-limit nonessential writers where possible.
  4. Apply targeted cleanup

    • manual vacuum/analyze on top offenders first.
  5. Handle index bloat specifically

    • schedule REINDEX CONCURRENTLY for worst index offenders.
  6. Post-incident policy update

    • move successful hotfix knobs into persistent per-table settings.
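Steps 4 and 5 can be sketched as targeted, lock-light commands (table and index names are illustrative; REINDEX CONCURRENTLY requires PostgreSQL 12+):

```sql
-- Reclaim dead tuples and refresh stats on one hot offender.
VACUUM (VERBOSE, ANALYZE) orders;

-- Rebuild a bloated index without blocking writes.
REINDEX INDEX CONCURRENTLY orders_status_idx;
```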

30-day rollout template

Week 1: Baseline — stand up the dashboard; record per-table DTR, cleanup lag, and XID age baselines.

Week 2: Targeted tuning — apply per-table overrides to the top offenders; tighten long-transaction policy.

Week 3: Chaos-proofing — rehearse the incident runbook; verify alerts fire on synthetic long transactions and write bursts.

Week 4: Governance — codify per-table settings in migrations, assign ownership, and set a freeze-risk review cadence.


Anti-patterns

  • Disabling autovacuum "temporarily" and forgetting to re-enable it.
  • Tuning only global knobs when two or three tables produce most of the debt.
  • Running VACUUM FULL on hot tables at peak (it takes an ACCESS EXCLUSIVE lock).
  • Declaring victory on heap bloat while ignoring index bloat.
  • Treating wraparound warnings as noise until autovacuum turns aggressive.


Bottom line

Autovacuum is not a background detail. It is a production control loop.

When tuple-debt velocity, cleanup capacity, and visibility blockers are managed together, PostgreSQL stays boring. When they are not, bloat converts quietly into p95/p99 latency tax and surprise maintenance risk.