PostgreSQL MVCC + Autovacuum Bloat Control Playbook
Why this matters
PostgreSQL gets its concurrency from MVCC: updates and deletes leave dead tuples behind, and readers proceed without blocking writers (and vice versa).
That convenience becomes a hidden tax if vacuum discipline is weak:
- table/index bloat grows quietly,
- cache hit quality degrades,
- WAL/checkpoint pressure rises,
- VACUUM debt eventually arrives as latency cliffs.
This playbook treats bloat control as an operations loop, not occasional cleanup.
Core model (what to optimize)
MVCC health is a 3-way balance:
- Dead tuple creation rate (writes/updates/deletes)
- Cleanup throughput (autovacuum + vacuum cost budget)
- Tuple visibility constraints (long transactions, replicas, old snapshots)
If creation > cleanup for long enough, you enter debt mode.
Vacuum trigger math (per table)
Autovacuum vacuum trigger is roughly:
autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples
The analyze trigger follows the same formula, using autovacuum_analyze_threshold and autovacuum_analyze_scale_factor.
Global defaults are often too loose for very large or very hot tables.
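As a sketch, the effective per-table vacuum trigger can be computed from the global settings (per-table reloptions override these, and reltuples is only an estimate; on PostgreSQL 14+ it can be -1 for never-analyzed tables):

```sql
-- Approximate dead-tuple count at which autovacuum will trigger,
-- based on global settings (per-table reloptions take precedence).
SELECT
  c.relname,
  c.reltuples::bigint AS est_live_tuples,
  current_setting('autovacuum_vacuum_threshold')::int
    + current_setting('autovacuum_vacuum_scale_factor')::float * c.reltuples
    AS vacuum_trigger_dead_tuples
FROM pg_class c
WHERE c.relkind = 'r'
ORDER BY vacuum_trigger_dead_tuples DESC
LIMIT 20;
```

With the default 0.2 scale factor, a 100M-row table waits for roughly 20 million dead tuples before cleanup starts.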
Failure modes to watch
Scale-factor trap on huge tables
A high scale factor means millions of dead tuples before cleanup starts.
Long-running transactions pin the visibility horizon
Vacuum runs, but cannot reclaim enough dead tuples.
Index bloat dominates table bloat
Heap looks okay, but indexes become fat and random-I/O heavy.
Freeze debt accumulation
Waiting too long near XID age limits forces aggressive work at bad times.
Autovacuum worker starvation
Too few workers and too little I/O budget for the write volume.
Observability: minimum dashboard
Track these as first-class signals:
- n_dead_tup, n_live_tup by table (pg_stat_all_tables)
- last_autovacuum, last_autoanalyze
- transaction age and oldest active transaction
- autovacuum throughput and duration
- pg_stat_progress_vacuum for live activity
- table/index size growth slope (not only absolute size)
- query p95/p99 for hot relations
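A minimal sketch of the live-activity signal, joining pg_stat_progress_vacuum to relation names so in-flight vacuums are visible at a glance:

```sql
-- Vacuums currently running, with progress through the heap.
SELECT
  p.pid,
  c.relname,
  p.phase,
  p.heap_blks_scanned,
  p.heap_blks_total
FROM pg_stat_progress_vacuum p
JOIN pg_class c ON c.oid = p.relid;
```

An empty result during sustained write load is itself a signal: either nothing has crossed its trigger yet, or workers are starved.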
Practical derived metrics
- Dead Tuple Ratio (DTR) = n_dead_tup / (n_live_tup + 1)
- Vacuum Lag Age (VLA) = now - last_autovacuum
- Bloat Velocity (BV) = d(size)/dt under stable data cardinality
You do not need perfect bloat estimation to detect danger early; trend and slope are enough.
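The first two derived metrics can be computed directly from pg_stat_user_tables; a sketch:

```sql
-- Dead Tuple Ratio and Vacuum Lag Age per table (statistics are approximate).
SELECT
  relname,
  round(n_dead_tup::numeric / (n_live_tup + 1), 3) AS dead_tuple_ratio,
  now() - last_autovacuum AS vacuum_lag_age
FROM pg_stat_user_tables
ORDER BY dead_tuple_ratio DESC
LIMIT 20;
```

Bloat Velocity needs two size samples over time (e.g. periodic pg_relation_size snapshots), so it is best computed in whatever system stores your metrics.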
State machine for operations
Use explicit states so actions are predictable:
GREEN
- DTR low/stable
- vacuum cadence steady
- latency stable
Action:
- Keep baseline settings.
- Weekly review only.
AMBER
- DTR rising for hot tables
- VLA widening
- index/table size slope accelerating
Action:
- Lower per-table vacuum scale factor on offenders.
- Increase autovacuum workers/cost budget.
- Check long transactions.
RED
- sustained DTR elevation + latency degradation
- autovacuum cannot catch up
- freeze age trending dangerously high
Action:
- Targeted manual VACUUM (ANALYZE) windows.
- Throttle noisy writers if possible.
- Kill or fix runaway long transactions.
- Prioritize hottest tables first.
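A sketch of the RED-state actions in SQL; "orders" and the pid 12345 are placeholders for your hottest offender and the runaway backend you identified:

```sql
-- Targeted cleanup on the hottest table first.
VACUUM (ANALYZE, VERBOSE) orders;

-- Deal with a runaway transaction by pid:
SELECT pg_cancel_backend(12345);      -- cancel the current query first
SELECT pg_terminate_backend(12345);   -- terminate the session if cancel is not enough
```

Cancel before terminate: terminating drops the whole session and may surprise connection pools.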
SAFE
- emergency containment mode
Action:
- Protect availability first.
- Defer non-critical write workloads.
- Execute controlled remediation (including REINDEX CONCURRENTLY when index bloat dominates).
Tuning playbook (high leverage)
1) Per-table overrides before global aggression
For high-churn large tables, set tighter table-local knobs:
- lower autovacuum_vacuum_scale_factor
- lower autovacuum_analyze_scale_factor
- optionally raise the per-table autovacuum cost limit
This avoids over-vacuuming cold tables while protecting hot ones.
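A sketch of a table-local override; "orders" is a placeholder and the values are starting points for a hot, large table, not prescriptions:

```sql
-- Tighter autovacuum knobs for one high-churn table only.
ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor  = 0.01,   -- trigger at ~1% dead tuples
  autovacuum_analyze_scale_factor = 0.005,  -- keep planner stats fresh
  autovacuum_vacuum_cost_limit    = 2000    -- larger I/O budget for this table
);
```

Current overrides are visible in pg_class.reloptions, so they can be audited alongside the global settings.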
2) Keep long transactions on a leash
Operational policy:
- alert on transaction age above threshold,
- enforce timeout policies for app sessions,
- treat orphaned idle-in-transaction sessions as incidents.
MVCC hygiene is impossible if horizons remain pinned.
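Both policies can be sketched in SQL; the 15-minute and 10-minute thresholds are illustrative, not recommendations:

```sql
-- Transactions open longer than the alert threshold.
SELECT pid, state, xact_start, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '15 minutes'
ORDER BY xact_start;

-- Evict sessions that sit idle inside an open transaction.
ALTER SYSTEM SET idle_in_transaction_session_timeout = '10min';
SELECT pg_reload_conf();
```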
3) Budget enough parallel cleanup capacity
Review:
- autovacuum_max_workers
- autovacuum_naptime
- vacuum cost settings (vacuum_cost_limit, vacuum_cost_delay)
If write traffic rose but vacuum budget stayed static, debt is expected.
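A sketch of raising the cleanup budget cluster-wide; the values are illustrative and should track your measured write volume:

```sql
-- More parallel cleanup capacity (requires a server restart to take effect).
ALTER SYSTEM SET autovacuum_max_workers = 6;

-- Larger I/O budget per vacuum cycle (picked up on reload).
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';
SELECT pg_reload_conf();
```

Note the asymmetry: worker count changes need a restart, while the cost settings apply on reload.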
4) Separate index remediation from heap remediation
If query latency is index-random-I/O dominated:
- use REINDEX CONCURRENTLY in controlled batches,
- keep fallback plans (cancel windows, rollback sequencing),
- verify post-remediation with both size and p95 latency deltas.
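One batch step might look like the following sketch (PostgreSQL 12+; the index name is a placeholder):

```sql
-- Rebuild one bloated index without taking a long write lock.
REINDEX INDEX CONCURRENTLY orders_created_at_idx;

-- Verify the size delta afterward (pair this with the p95 latency check).
SELECT pg_size_pretty(pg_relation_size('orders_created_at_idx'));
```

If a concurrent reindex is cancelled, it can leave an invalid index behind (suffixed _ccnew); part of the fallback plan is detecting and dropping those.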
5) Freeze-risk governance
Don’t wait for anti-wraparound panic mode.
- monitor relation age continuously,
- schedule freeze-friendly maintenance windows,
- treat rising freeze debt as capacity risk, not housekeeping.
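Relation freeze age can be monitored continuously with a query along these lines; the ~2 billion figure is the XID wraparound ceiling, and autovacuum turns aggressive well before it (at autovacuum_freeze_max_age, 200 million by default):

```sql
-- Oldest unfrozen XIDs per table, as distance toward wraparound.
SELECT
  c.relname,
  age(c.relfrozenxid) AS xid_age,
  round(age(c.relfrozenxid) / 2000000000.0 * 100, 1) AS pct_toward_wraparound
FROM pg_class c
WHERE c.relkind = 'r'
ORDER BY xid_age DESC
LIMIT 20;
```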
Incident runbook (when latency is already burning)
Triage top offenders
- rank by hot query impact × relation size growth × DTR.
Remove blockers
- identify long transactions / pinned snapshots.
Stabilize write pressure
- rate-limit nonessential writers where possible.
Apply targeted cleanup
- manual vacuum/analyze on top offenders first.
Handle index bloat specifically
- schedule REINDEX CONCURRENTLY for the worst index offenders.
Post-incident policy update
- move successful hotfix knobs into persistent per-table settings.
30-day rollout template
Week 1: Baseline
- classify top 20 write-heavy tables,
- capture DTR/VLA/BV baseline,
- map long-transaction sources by service.
Week 2: Targeted tuning
- apply per-table autovacuum overrides to top offenders,
- raise cleanup capacity only as needed,
- add transaction-age alerts.
Week 3: Chaos-proofing
- run stress rehearsal with elevated write rate,
- verify vacuum catch-up behavior,
- test incident runbook timing.
Week 4: Governance
- lock SLOs for MVCC health,
- codify thresholds/states/actions,
- schedule periodic bloat and freeze debt reviews.
Anti-patterns
- one-size-fits-all global scale factor
- waiting for visible disk pain before acting
- treating long transactions as app-team-only issue
- running heavy blocking maintenance without rollback plan
- celebrating one-off cleanup without feedback-loop tuning
Bottom line
Autovacuum is not a background detail. It is a production control loop.
When tuple-debt velocity, cleanup capacity, and visibility blockers are managed together, PostgreSQL stays boring. When they are not, bloat converts quietly into p95/p99 latency tax and surprise maintenance risk.