Raft Snapshot + Log-Compaction Operations Playbook

2026-03-16 · software


Why this matters

Raft clusters do not fail only because of elections. In production, many incidents come from state-growth mechanics:

  • unbounded log growth filling disks,
  • oversized snapshots that are slow (or time out) on install,
  • IO contention when snapshot/compaction work collides with foreground writes.

If you run etcd/Consul/Vault-style Raft storage, snapshot + compaction tuning is a first-class reliability task, not housekeeping.


1) Mental model in one paragraph

Raft gives followers two ways to catch up:

  1. Replay missing log entries (cheap when lag is small)
  2. Install a snapshot (expensive but bounded when lag is huge)

Compaction trims old logs behind snapshot boundaries. The operator problem is balancing:

  • replay window size (how far a follower can lag before it needs a snapshot),
  • disk and IO cost of retaining logs and writing snapshots,
  • snapshot size and install time when a full transfer is unavoidable.

Think of it as a recovery-latency budget problem.
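
The replay-vs-snapshot decision reduces to a single comparison against the compaction point. A minimal sketch (all names here are hypothetical, not from any specific library):

```go
package main

import "fmt"

// catchUpPath sketches the leader's choice when a follower asks for entries.
// followerNextIndex is the next entry the follower needs; firstRetainedIndex
// is the oldest log entry still on the leader after compaction.
func catchUpPath(followerNextIndex, firstRetainedIndex uint64) string {
	if followerNextIndex >= firstRetainedIndex {
		return "replay" // entries still on disk: cheap incremental AppendEntries
	}
	return "snapshot" // entries compacted away: full InstallSnapshot required
}

func main() {
	fmt.Println(catchUpPath(900, 500)) // follower within retained window -> replay
	fmt.Println(catchUpPath(100, 500)) // follower behind compaction point -> snapshot
}
```

Everything in this playbook is about keeping followers on the cheap branch of that comparison.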


2) The three knobs that matter most

Names differ by implementation, but the control surface is usually:

  1. Snapshot trigger threshold
    (e.g., entries between snapshots)
  2. Trailing log retention after snapshot
    (how much replay window remains)
  3. Snapshot transfer/write behavior
    (chunking, IO pacing, install timeout pressure)

Practical effects:

  • Lower snapshot threshold → smaller, more frequent snapshots and more maintenance IO.
  • More trailing logs → wider replay window, but more disk and longer local replay on restart.
  • Aggressive transfer pacing → faster installs, but sharper IO/network spikes on leader and follower.

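The three knobs map naturally onto a small config object. The sketch below uses field names in the style of hashicorp/raft (SnapshotThreshold, TrailingLogs), but the struct and both methods are illustrative, not a real library type:

```go
package main

import "fmt"

// RaftMaintenanceConfig mirrors the usual control surface. Field names echo
// hashicorp/raft conventions, but this is a sketch, not the library's API.
type RaftMaintenanceConfig struct {
	SnapshotThreshold uint64 // take a snapshot after this many new log entries
	SnapshotChunkSize uint64 // bytes per snapshot chunk (transfer pacing)
	TrailingLogs      uint64 // entries kept after a snapshot as the replay window
}

// shouldSnapshot is the typical trigger test: entries applied since last snapshot.
func (c RaftMaintenanceConfig) shouldSnapshot(lastApplied, lastSnapshotIndex uint64) bool {
	return lastApplied-lastSnapshotIndex >= c.SnapshotThreshold
}

// compactTo returns the first log index to keep after snapshotting at snapIndex.
func (c RaftMaintenanceConfig) compactTo(snapIndex uint64) uint64 {
	if snapIndex <= c.TrailingLogs {
		return 1
	}
	return snapIndex - c.TrailingLogs
}

func main() {
	cfg := RaftMaintenanceConfig{SnapshotThreshold: 8192, TrailingLogs: 10240}
	fmt.Println(cfg.shouldSnapshot(20000, 10000)) // 10000 new entries >= 8192 -> true
	fmt.Println(cfg.compactTo(20000))             // keep entries from index 9760 onward
}
```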

3) SLO-first signals to watch

Do not tune by one metric (like disk size) alone. Track a small bundle:

  • p99 follower lag, in log entries (or bytes),
  • snapshot size, frequency, and install duration,
  • snapshot-install events per follower per day,
  • disk usage and IO latency during snapshot/compaction windows.

If p99 follower lag regularly crosses retained trailing logs, you are in a dangerous zone where normal recovery degenerates into repeated snapshot installs.
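
That danger zone can be expressed as a simple guardrail predicate; the 0.8 margin below is an illustrative choice, not a standard constant:

```go
package main

import "fmt"

// inDangerZone flags the regime described above: when p99 follower lag (in
// log entries) approaches the retained trailing logs, ordinary replay
// recovery starts degenerating into repeated snapshot installs.
func inDangerZone(p99LagEntries, trailingLogs uint64) bool {
	return float64(p99LagEntries) >= 0.8*float64(trailingLogs)
}

func main() {
	fmt.Println(inDangerZone(2000, 10000)) // comfortable headroom -> false
	fmt.Println(inDangerZone(9000, 10000)) // lag eating the replay window -> true
}
```

The same predicate works as an alert condition in section 5's guardrail step.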


4) Failure modes you will actually see

A) Snapshot ping-pong

Symptom:

  A follower repeatedly receives snapshot installs, briefly catches up, falls behind the compacted log again, and gets another snapshot, never settling into normal log replay.

Typical causes:

  • trailing log retention smaller than routine lag spikes,
  • snapshot install slower than the rate at which new entries accumulate,
  • a chronically slow follower disk or network path.

B) Compaction thrash

Symptom:

  Periodic latency and IO spikes that line up with snapshot/compaction cycles, often on the leader.

Typical causes:

  • snapshot threshold set so low that maintenance runs near-continuously,
  • snapshot writes and compaction competing with foreground writes on the same disk,
  • several nodes snapshotting at the same time.

C) Log hoarding drift

Symptom:

  Disk usage grows steadily between snapshots, and restarts get slower because local log replay keeps lengthening.

Typical causes:

  • snapshot threshold or trailing-log retention raised during an incident and never reverted,
  • compaction disabled or effectively never triggered,
  • write rate that grew past what the original settings assumed.


5) A practical tuning workflow (safe sequence)

Step 1 — Measure your recovery envelope

Estimate:

  • sustained and peak write rate (entries/sec and bytes/sec),
  • typical and p99 follower lag during normal operation,
  • current snapshot size and how long a full install takes,
  • how long a routine follower restart keeps it out of replication.

From this, derive whether your trailing log window can cover normal lag spikes.

Step 2 — Size replay window first

Start by ensuring trailing logs are large enough that typical lag events recover by replay, not snapshot.

Rule of thumb:

  Retained trailing logs should cover the larger of your p99 lag spike and the entries written during a routine follower restart, with a comfortable multiple (2-3x) of margin.

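One way to turn that rule of thumb into arithmetic. The safety factor and every input are assumptions to calibrate per cluster:

```go
package main

import "fmt"

// trailingLogsTarget sizes the replay window: cover both the worst routine
// lag spike and the entries written while a follower restarts, then apply a
// safety multiple.
func trailingLogsTarget(p99LagEntries uint64, writeRatePerSec, restartSeconds, safetyFactor float64) uint64 {
	restartBacklog := uint64(writeRatePerSec * restartSeconds)
	need := p99LagEntries
	if restartBacklog > need {
		need = restartBacklog
	}
	return uint64(float64(need) * safetyFactor)
}

func main() {
	// 5k-entry p99 lag, 1k writes/sec, 30s restart window, 3x margin.
	// The 30s restart backlog (30k entries) dominates -> target 90k entries.
	fmt.Println(trailingLogsTarget(5_000, 1_000, 30, 3))
}
```
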
Step 3 — Tune snapshot threshold second

Set threshold to avoid excessive snapshot churn while keeping snapshot size installable within your recovery SLO.

If snapshots are frequent and tiny: raise threshold. If snapshots are huge/slow and often time out: lower threshold and/or reduce state churn.

Step 4 — Validate under burst + degraded node test

Run controlled tests:

  • burst writes at peak-plus rate while one follower is restarted,
  • a follower held offline long enough to approach the edge of the replay window,
  • a full snapshot install onto the slowest follower hardware.

Pass criteria:

  • lagging followers recover by log replay, not snapshot install,
  • a forced snapshot install completes well inside your recovery SLO,
  • foreground latency stays within SLO during maintenance windows.

Step 5 — Add guardrail alerts

Alert on:

  • p99 follower lag exceeding a set fraction (e.g. 50-80%) of trailing logs,
  • more than one snapshot install per follower per day (tune to your baseline),
  • snapshot install duration trending toward its timeout,
  • disk growth that will exhaust headroom before the next capacity review.


6) Operator decision table

Observed pattern → Likely issue → First action

  • Frequent snapshot installs for mildly lagging followers → replay window too short → increase trailing logs (carefully).
  • Very large, slow snapshots → snapshot threshold too high and/or state too large → lower threshold; reduce write amplification.
  • High IO spikes during snapshot/compaction → maintenance contention → stagger/smooth maintenance; revisit threshold.
  • Disk growth without bounds → compaction too conservative → tighten compaction/snapshot policy.
  • Followers never fully catch up after restart → snapshot install path bottleneck → improve disk/network path; reduce snapshot size.

7) Implementation notes across common stacks

The knob names vary: etcd exposes --snapshot-count, while hashicorp/raft (the engine under Consul, Vault, and Nomad) exposes SnapshotThreshold, SnapshotInterval, and TrailingLogs, surfaced in Consul as raft_snapshot_threshold and raft_trailing_logs. Whatever the spelling, do not cargo-cult raw defaults across clusters with very different write rates or follower hardware.


8) 30-minute incident runbook (snapshot/compaction instability)

  1. Confirm leader health and quorum first.
  2. Identify whether lagging followers are replaying logs or stuck in repeated snapshot installs.
  3. Check snapshot size + install duration trend.
  4. If ping-pong is active:
    • temporarily reduce write pressure if possible,
    • increase retained trailing logs (short-term relief),
    • then tune snapshot threshold for sustainable size/frequency.
  5. If disk/IO pressure dominates:
    • avoid blind threshold increases,
    • first reduce concurrent maintenance contention.
  6. After stabilization, capture pre/post metrics and codify new guardrails.
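
Step 2's replay-vs-ping-pong distinction can be mechanized from basic follower counters. The struct, field names, and thresholds here are all hypothetical:

```go
package main

import "fmt"

// followerStats: the counters you would pull from metrics during triage.
type followerStats struct {
	snapshotInstallsLastHour int
	lagEntries               uint64
	trailingLogs             uint64
}

// triage separates a follower that is replaying normally from one stuck in
// repeated snapshot installs, and flags the about-to-fall-off-the-log case.
func triage(s followerStats) string {
	switch {
	case s.snapshotInstallsLastHour >= 2:
		return "ping-pong: raise trailing logs short-term, then fix snapshot economics"
	case s.lagEntries > s.trailingLogs:
		return "about to need a snapshot: reduce write pressure or widen replay window"
	default:
		return "replaying normally: keep watching the lag trend"
	}
}

func main() {
	fmt.Println(triage(followerStats{snapshotInstallsLastHour: 3, lagEntries: 50000, trailingLogs: 10000}))
	fmt.Println(triage(followerStats{lagEntries: 500, trailingLogs: 10000}))
}
```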

9) What “good” looks like

A healthy Raft cluster usually shows:

  • follower lag that stays well inside the trailing-log window,
  • snapshot installs that are rare events (restores and new nodes, not routine catch-up),
  • snapshot size and install duration fitting comfortably inside the recovery SLO,
  • flat, predictable disk usage between snapshots.

If snapshot install becomes routine, your replay window and snapshot economics are misaligned with workload reality.
