Raft Snapshot + Log-Compaction Operations Playbook
Date: 2026-03-16
Category: knowledge
Why this matters
Raft clusters do not fail only because of elections. In production, many incidents come from state growth mechanics:
- logs grow faster than followers can replay,
- snapshots are too large/slow to install,
- compaction is too aggressive (followers fall off the replay window),
- or too conservative (disk/IO/GC pressure explodes).
If you run etcd/Consul/Vault-style Raft storage, snapshot + compaction tuning is a first-class reliability task, not housekeeping.
1) Mental model in one paragraph
Raft gives followers two ways to catch up:
- Replay missing log entries (cheap when lag is small)
- Install a snapshot (expensive but bounded when lag is huge)
Compaction trims old logs after snapshot boundaries. The operator problem is balancing:
- enough retained logs for normal lag recovery,
- small enough snapshots to install quickly,
- and bounded memory/disk overhead.
Think of it as a recovery-latency budget problem.
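The budget framing can be sketched numerically. A minimal illustration, with all rates and sizes invented for the example (these are not defaults from any implementation):

```python
# Illustrative recovery-latency budget: a lagging follower recovers either by
# replaying retained log entries or by installing a full snapshot.

def replay_time_s(lag_entries: int, replay_rate_eps: float) -> float:
    """Time to catch up by replaying lag_entries at replay_rate_eps entries/s."""
    return lag_entries / replay_rate_eps

def snapshot_install_time_s(snapshot_bytes: int, install_bw_bps: float) -> float:
    """Time to transfer and apply a snapshot at install_bw_bps bytes/s."""
    return snapshot_bytes / install_bw_bps

# 200k entries behind, replaying at 50k entries/s -> 4 s by replay.
print(replay_time_s(200_000, 50_000))                       # 4.0
# A 2 GiB snapshot over an effective 100 MiB/s install path -> 20.48 s.
print(snapshot_install_time_s(2 * 1024**3, 100 * 1024**2))  # 20.48
```

The asymmetry is the whole game: replay cost scales with lag, snapshot cost scales with state size. Tuning decides where the crossover sits.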
2) The three knobs that matter most
Names differ by implementation, but the control surface is usually:
- Snapshot trigger threshold (e.g., entries between snapshots)
- Trailing log retention after snapshot (how much replay window remains)
- Snapshot transfer/write behavior (chunking, IO pacing, install timeout pressure)
Practical effects:
- Higher snapshot threshold:
- fewer snapshots, less snapshot churn,
- but larger log growth and longer replay tail.
- Higher trailing logs:
- fewer forced snapshot installs for moderately slow followers,
- but more disk/memory footprint.
- Faster snapshot install path:
- safer recovery from large lag,
- but can contend with foreground latency if unthrottled.
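As a concrete sketch, the control surface could be captured like this. The knob names follow the HashiCorp-style `snapshot_threshold`/`trailing_logs` naming referenced later in this playbook; the values are invented for illustration, not recommendations:

```python
# Illustrative Raft maintenance tuning (names per HashiCorp-style configs;
# etcd's analogue of the first knob is --snapshot-count). Values are examples.
raft_tuning = {
    "snapshot_threshold": 16_384,     # entries appended before taking a snapshot
    "trailing_logs": 65_536,          # entries retained after a snapshot (replay window)
    "snapshot_chunk_bytes": 1 << 20,  # install transfer chunk size (IO pacing knob)
}

# Sanity check: the replay window should comfortably exceed the snapshot
# cadence, or every snapshot immediately strands moderately lagged followers.
assert raft_tuning["trailing_logs"] > raft_tuning["snapshot_threshold"]
```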
3) SLO-first signals to watch
Do not tune by one metric (like disk size) alone. Track a small bundle:
- Follower lag distribution (p50/p95/p99)
- Snapshot install frequency + duration
- Replay catch-up success ratio (log replay vs snapshot fallback)
- Compaction cadence (too frequent / too sparse)
- Leader append latency under snapshot load
- Storage pressure (raft log size, fs latency, fragmentation)
If p99 follower lag regularly exceeds the retained trailing-log window, you are in a dangerous zone where normal recovery degenerates into repeated snapshot installs.
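That danger zone can be expressed as a simple classifier over two of the metrics above. The headroom fraction is an operator judgment call, shown here as an assumption:

```python
def replay_window_risk(p99_lag_entries: int, trailing_logs: int,
                       headroom: float = 0.5) -> str:
    """Classify follower lag against the retained replay window.

    headroom is the fraction of the window we want to keep free (assumed 50%).
    """
    if p99_lag_entries >= trailing_logs:
        return "critical"  # normal recovery degenerates into snapshot installs
    if p99_lag_entries >= trailing_logs * (1 - headroom):
        return "warning"   # lag is eating into the replay window
    return "ok"

print(replay_window_risk(70_000, 65_536))  # critical
print(replay_window_risk(40_000, 65_536))  # warning
print(replay_window_risk(10_000, 65_536))  # ok
```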
4) Failure modes you will actually see
A) Snapshot ping-pong
Symptom:
- follower repeatedly starts snapshot install but falls behind again before completion.
Typical causes:
- snapshot too large for network/disk throughput,
- write throughput too high versus install bandwidth,
- trailing logs too short to bridge install window.
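The ping-pong condition follows directly from the causes above: the leader keeps appending during the install, so if the entries written during one install exceed the retained trailing logs, the follower lands outside the replay window again and needs another snapshot. A back-of-envelope check (simplified model; assumes a steady write rate):

```python
def will_ping_pong(write_rate_eps: float, install_time_s: float,
                   trailing_logs: int) -> bool:
    """True if entries appended during one snapshot install exceed the
    replay window, forcing the follower straight into another install."""
    return write_rate_eps * install_time_s > trailing_logs

# 2k entries/s over a 20 s install stays inside a 65,536-entry window.
print(will_ping_pong(2_000, 20, 65_536))   # False
# 10k entries/s over the same install overruns it: ping-pong.
print(will_ping_pong(10_000, 20, 65_536))  # True
```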
B) Compaction thrash
Symptom:
- frequent snapshot/compaction cycles, rising latency variance, little net stability gain.
Typical causes:
- snapshot threshold too low for workload burstiness,
- disk IO saturation causing overlapping maintenance pressure.
C) Log hoarding drift
Symptom:
- “looks stable” until disk/GC/IO cliffs appear.
Typical causes:
- thresholds set high and never revisited after traffic growth,
- no alerting on replay-window exhaustion risk.
5) A practical tuning workflow (safe sequence)
Step 1 — Measure your recovery envelope
Estimate:
- worst-case sustained write rate (entries/s),
- follower catch-up throughput (entries/s replay, MB/s snapshot install),
- maximum acceptable recovery time.
From this, derive whether your trailing log window can cover normal lag spikes.
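The derivation can be sketched as a small model. It deliberately simplifies (assumption: compaction keeps exactly `trailing_logs` entries behind the leader's tip, and write/replay rates are steady):

```python
def recovery_envelope(write_eps: float, replay_eps: float,
                      trailing_logs: int, outage_s: float):
    """Can a follower that was down for outage_s seconds recover by replay?

    Returns a status string and, for replay, the estimated catch-up time.
    Simplified model: steady rates, fixed trailing-log window.
    """
    lag = write_eps * outage_s
    if lag >= trailing_logs:
        return ("snapshot required", None)   # fell off the replay window
    if replay_eps <= write_eps:
        return ("never catches up by replay", None)
    catch_up_s = lag / (replay_eps - write_eps)  # net closing rate
    return ("replay ok", catch_up_s)

# 30 s outage at 5k entries/s writes, 50k entries/s replay, 200k window.
print(recovery_envelope(5_000, 50_000, 200_000, 30))  # replay ok, ~3.3 s
# A 60 s outage overruns the same window.
print(recovery_envelope(5_000, 50_000, 200_000, 60))  # snapshot required
```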
Step 2 — Size replay window first
Start by ensuring trailing logs are large enough that typical lag events recover by replay, not snapshot.
Rule of thumb:
- target replay coverage for at least a p99 lag burst window,
- reserve snapshot path for true outliers or node replacement.
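The rule of thumb reduces to a sizing formula. The safety factor is an assumption to be revisited per cluster, not a standard value:

```python
def size_trailing_logs(p99_burst_eps: float, burst_duration_s: float,
                       safety_factor: float = 2.0) -> int:
    """Trailing logs sized to absorb a p99 write burst, with headroom.

    safety_factor (assumed 2x) covers model error and concurrent slowness.
    """
    return int(p99_burst_eps * burst_duration_s * safety_factor)

# A p99 burst of 8k entries/s lasting 15 s -> retain at least 240k entries.
print(size_trailing_logs(8_000, 15))  # 240000
```

Remember the cost side from section 2: a larger window buys replay coverage at the price of disk and memory footprint, so size to the burst, not to infinity.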
Step 3 — Tune snapshot threshold second
Set threshold to avoid excessive snapshot churn while keeping snapshot size installable within your recovery SLO.
If snapshots are frequent and tiny: raise threshold. If snapshots are huge/slow and often time out: lower threshold and/or reduce state churn.
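Both sides of that tradeoff are easy to estimate up front. A sketch, with illustrative numbers:

```python
def snapshot_cadence_s(snapshot_threshold: int, write_eps: float) -> float:
    """Average seconds between snapshots at a sustained write rate."""
    return snapshot_threshold / write_eps

def install_within_slo(snapshot_bytes: int, install_bw_bps: float,
                       slo_s: float) -> bool:
    """Would a snapshot of this size install inside the recovery SLO?"""
    return snapshot_bytes / install_bw_bps <= slo_s

# A 16,384-entry threshold at 2k entries/s means a snapshot every ~8 s:
# that is churn territory, so raise the threshold.
print(snapshot_cadence_s(16_384, 2_000))                    # 8.192
# A 2 GiB snapshot over 100 MiB/s installs in 20.48 s, inside a 30 s SLO.
print(install_within_slo(2 * 1024**3, 100 * 1024**2, 30))   # True
```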
Step 4 — Validate under burst + degraded node test
Run controlled tests:
- one slow follower (CPU throttled or IO-limited),
- one network-impaired follower,
- realistic write burst.
Pass criteria:
- follower rejoins without repeated snapshot ping-pong,
- leader latency remains within SLO during install,
- no compaction thrash loop.
Step 5 — Add guardrail alerts
Alert on:
- rising ratio of snapshot-based catch-ups,
- snapshot install p95 duration crossing budget,
- follower lag approaching replay-window ceiling,
- compaction frequency spikes.
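The four guardrails above can be expressed as one evaluation pass over a metrics snapshot. Every threshold below (10% snapshot ratio, 80% window usage, 3x compaction baseline) is an illustrative assumption; set yours from measured baselines:

```python
def guardrail_alerts(m: dict) -> list:
    """Evaluate the four guardrails; all thresholds are illustrative."""
    alerts = []
    if m["snapshot_catchup_ratio"] > 0.10:
        alerts.append("snapshot-based catch-up ratio above 10%")
    if m["install_p95_s"] > m["install_budget_s"]:
        alerts.append("snapshot install p95 over budget")
    if m["p99_lag_entries"] > 0.8 * m["trailing_logs"]:
        alerts.append("follower lag near replay-window ceiling")
    if m["compactions_per_hour"] > 3 * m["compaction_baseline_per_hour"]:
        alerts.append("compaction frequency spike")
    return alerts

sample = {
    "snapshot_catchup_ratio": 0.02, "install_p95_s": 18, "install_budget_s": 30,
    "p99_lag_entries": 60_000, "trailing_logs": 65_536,
    "compactions_per_hour": 4, "compaction_baseline_per_hour": 4,
}
print(guardrail_alerts(sample))  # ['follower lag near replay-window ceiling']
```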
6) Operator decision table
| Observed pattern | Likely issue | First action |
|---|---|---|
| Frequent snapshot installs for mildly lagging followers | Replay window too short | Increase trailing logs (carefully) |
| Very large, slow snapshots | Snapshot threshold too high and/or state too large | Lower threshold; reduce write amplification |
| High IO spikes during snapshot/compaction | Maintenance contention | Stagger/smooth maintenance; revisit threshold |
| Disk growth without bounds | Compaction too conservative | Tighten compaction/snapshot policy |
| Followers never fully catch up after restart | Snapshot install path bottleneck | Improve disk/network path; reduce snapshot size |
7) Implementation notes across common stacks
- Raft algorithm baseline: InstallSnapshot is the safety valve when logs are unavailable for replay; snapshot contains a state baseline + log position metadata.
- etcd ops guidance: compaction and snapshot retention settings are explicit tradeoffs between memory/throughput and slow-follower recoverability.
- HashiCorp Raft-backed systems (e.g., Vault/Consul): knobs like `snapshot_threshold` and `trailing_logs` map directly to this replay-vs-snapshot balance.
Do not cargo-cult raw defaults across clusters with very different write rates or follower hardware.
8) 30-minute incident runbook (snapshot/compaction instability)
- Confirm leader health and quorum first.
- Identify whether lagging followers are replaying logs or stuck in repeated snapshot installs.
- Check snapshot size + install duration trend.
- If ping-pong is active:
- temporarily reduce write pressure if possible,
- increase retained trailing logs (short-term relief),
- then tune snapshot threshold for sustainable size/frequency.
- If disk/IO pressure dominates:
- avoid blind threshold increases,
- first reduce concurrent maintenance contention.
- After stabilization, capture pre/post metrics and codify new guardrails.
9) What “good” looks like
A healthy Raft cluster usually shows:
- replay-based follower catch-up as the normal path,
- snapshot install as occasional exception,
- bounded log growth,
- predictable snapshot durations,
- and no strong coupling between maintenance events and tail-latency spikes.
If snapshot install becomes routine, your replay window and snapshot economics are misaligned with workload reality.
References
- Ongaro, D., Ousterhout, J. In Search of an Understandable Consensus Algorithm (Raft, extended version).
- etcd docs: Maintenance (compaction, snapshot-count tradeoffs).
- HashiCorp Vault docs: Integrated storage (Raft) configuration (`snapshot_threshold`, `trailing_logs`).
- HashiCorp Consul docs: Raft configuration / LogStore operational considerations.