Raft Snapshot + Log-Compaction Operations Playbook
Date: 2026-03-16
Category: knowledge
Why this matters
Raft clusters do not fail only because of elections. In production, many incidents come from state growth mechanics:
- logs grow faster than followers can replay,
- snapshots are too large/slow to install,
- compaction is too aggressive (followers fall off the replay window),
- or too conservative (disk/IO/GC pressure explodes).
If you run etcd/Consul/Vault-style Raft storage, snapshot + compaction tuning is a first-class reliability task, not housekeeping.
1) Mental model in one paragraph
Raft gives followers two ways to catch up:
- Replay missing log entries (cheap when lag is small)
- Install a snapshot (expensive but bounded when lag is huge)
Compaction trims old logs after snapshot boundaries. The operator problem is balancing:
- enough retained logs for normal lag recovery,
- small enough snapshots to install quickly,
- and bounded memory/disk overhead.
Think of it as a recovery-latency budget problem.
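The budget framing can be sketched numerically. A minimal illustration, with all rates and sizes invented for the example (these are not defaults from any implementation):

```python
# Illustrative recovery-latency budget: a lagging follower recovers either by
# replaying retained log entries or by installing a full snapshot.

def replay_time_s(lag_entries: int, replay_rate_eps: float) -> float:
    """Time to catch up by replaying lag_entries at replay_rate_eps entries/s."""
    return lag_entries / replay_rate_eps

def snapshot_install_time_s(snapshot_bytes: int, install_bw_bps: float) -> float:
    """Time to transfer and apply a snapshot at install_bw_bps bytes/s."""
    return snapshot_bytes / install_bw_bps

# 200k entries behind, replaying at 50k entries/s -> 4 s by replay.
print(replay_time_s(200_000, 50_000))                       # 4.0
# A 2 GiB snapshot over an effective 100 MiB/s install path -> 20.48 s.
print(snapshot_install_time_s(2 * 1024**3, 100 * 1024**2))  # 20.48
```

The asymmetry is the whole game: replay cost scales with lag, snapshot cost scales with state size. Tuning decides where the crossover sits.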
2) The three knobs that matter most
Names differ by implementation, but the control surface is usually:
- Snapshot trigger threshold (e.g., entries between snapshots)
- Trailing log retention after snapshot (how much replay window remains)
- Snapshot transfer/write behavior (chunking, IO pacing, install timeout pressure)
Practical effects:
- Higher snapshot threshold:
- fewer snapshots, less snapshot churn,
- but larger log growth and longer replay tail.
- Higher trailing logs:
- fewer forced snapshot installs for moderately slow followers,
- but more disk/memory footprint.
- Faster snapshot install path:
- safer recovery from large lag,
- but can contend with foreground latency if unthrottled.
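As a concrete sketch, the control surface could be captured like this. The knob names follow the HashiCorp-style `snapshot_threshold`/`trailing_logs` naming referenced later in this playbook; the values are invented for illustration, not recommendations:

```python
# Illustrative Raft maintenance tuning (names per HashiCorp-style configs;
# etcd's analogue of the first knob is --snapshot-count). Values are examples.
raft_tuning = {
    "snapshot_threshold": 16_384,     # entries appended before taking a snapshot
    "trailing_logs": 65_536,          # entries retained after a snapshot (replay window)
    "snapshot_chunk_bytes": 1 << 20,  # install transfer chunk size (IO pacing knob)
}

# Sanity check: the replay window should comfortably exceed the snapshot
# cadence, or every snapshot immediately strands moderately lagged followers.
assert raft_tuning["trailing_logs"] > raft_tuning["snapshot_threshold"]
```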
3) SLO-first signals to watch
Do not tune by one metric (like disk size) alone. Track a small bundle:
- Follower lag distribution (p50/p95/p99)
- Snapshot install frequency + duration
- Replay catch-up success ratio (log replay vs snapshot fallback)
- Compaction cadence (too frequent / too sparse)
- Leader append latency under snapshot load
- Storage pressure (raft log size, fs latency, fragmentation)
If p99 follower lag regularly exceeds the retained trailing-log window, you are in a dangerous zone where normal recovery degenerates into repeated snapshot installs.
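That danger zone can be expressed as a simple classifier over two of the metrics above. The headroom fraction is an operator judgment call, shown here as an assumption:

```python
def replay_window_risk(p99_lag_entries: int, trailing_logs: int,
                       headroom: float = 0.5) -> str:
    """Classify follower lag against the retained replay window.

    headroom is the fraction of the window we want to keep free (assumed 50%).
    """
    if p99_lag_entries >= trailing_logs:
        return "critical"  # normal recovery degenerates into snapshot installs
    if p99_lag_entries >= trailing_logs * (1 - headroom):
        return "warning"   # lag is eating into the replay window
    return "ok"

print(replay_window_risk(70_000, 65_536))  # critical
print(replay_window_risk(40_000, 65_536))  # warning
print(replay_window_risk(10_000, 65_536))  # ok
```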
4) Failure modes you will actually see
A) Snapshot ping-pong
Symptom:
- follower repeatedly starts snapshot install but falls behind again before completion.
Typical causes:
- snapshot too large for network/disk throughput,
- write throughput too high versus install bandwidth,
- trailing logs too short to bridge install window.
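The ping-pong condition follows directly from the causes above: the leader keeps appending during the install, so if the entries written during one install exceed the retained trailing logs, the follower lands outside the replay window again and needs another snapshot. A back-of-envelope check (simplified model; assumes a steady write rate):

```python
def will_ping_pong(write_rate_eps: float, install_time_s: float,
                   trailing_logs: int) -> bool:
    """True if entries appended during one snapshot install exceed the
    replay window, forcing the follower straight into another install."""
    return write_rate_eps * install_time_s > trailing_logs

# 2k entries/s over a 20 s install stays inside a 65,536-entry window.
print(will_ping_pong(2_000, 20, 65_536))   # False
# 10k entries/s over the same install overruns it: ping-pong.
print(will_ping_pong(10_000, 20, 65_536))  # True
```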
B) Compaction thrash
Symptom:
- frequent snapshot/compaction cycles, rising latency variance, little net stability gain.
Typical causes:
- snapshot threshold too low for workload burstiness,
- disk IO saturation causing overlapping maintenance pressure.
C) Log hoarding drift
Symptom:
- “looks stable” until disk/GC/IO cliffs appear.
Typical causes:
- thresholds set high and never revisited after traffic growth,
- no alerting on replay-window exhaustion risk.
5) A practical tuning workflow (safe sequence)
Step 1 — Measure your recovery envelope
Estimate:
- worst-case sustained write rate (entries/s),
- follower catch-up throughput (entries/s replay, MB/s snapshot install),
- maximum acceptable recovery time.
From this, derive whether your trailing log window can cover normal lag spikes.
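The derivation can be sketched as a small model. It deliberately simplifies (assumption: compaction keeps exactly `trailing_logs` entries behind the leader's tip, and write/replay rates are steady):

```python
def recovery_envelope(write_eps: float, replay_eps: float,
                      trailing_logs: int, outage_s: float):
    """Can a follower that was down for outage_s seconds recover by replay?

    Returns a status string and, for replay, the estimated catch-up time.
    Simplified model: steady rates, fixed trailing-log window.
    """
    lag = write_eps * outage_s
    if lag >= trailing_logs:
        return ("snapshot required", None)   # fell off the replay window
    if replay_eps <= write_eps:
        return ("never catches up by replay", None)
    catch_up_s = lag / (replay_eps - write_eps)  # net closing rate
    return ("replay ok", catch_up_s)

# 30 s outage at 5k entries/s writes, 50k entries/s replay, 200k window.
print(recovery_envelope(5_000, 50_000, 200_000, 30))  # replay ok, ~3.3 s
# A 60 s outage overruns the same window.
print(recovery_envelope(5_000, 50_000, 200_000, 60))  # snapshot required
```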
Step 2 — Size replay window first
Start by ensuring trailing logs are large enough that typical lag events recover by replay, not snapshot.
Rule of thumb:
- target replay coverage for at least a p99 lag burst window,
- reserve snapshot path for true outliers or node replacement.
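The rule of thumb reduces to a sizing formula. The safety factor is an assumption to be revisited per cluster, not a standard value:

```python
def size_trailing_logs(p99_burst_eps: float, burst_duration_s: float,
                       safety_factor: float = 2.0) -> int:
    """Trailing logs sized to absorb a p99 write burst, with headroom.

    safety_factor (assumed 2x) covers model error and concurrent slowness.
    """
    return int(p99_burst_eps * burst_duration_s * safety_factor)

# A p99 burst of 8k entries/s lasting 15 s -> retain at least 240k entries.
print(size_trailing_logs(8_000, 15))  # 240000
```

Remember the cost side from section 2: a larger window buys replay coverage at the price of disk and memory footprint, so size to the burst, not to infinity.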
Step 3 — Tune snapshot threshold second
Set threshold to avoid excessive snapshot churn while keeping snapshot size installable within your recovery SLO.
If snapshots are frequent and tiny: raise threshold. If snapshots are huge/slow and often time out: lower threshold and/or reduce state churn.
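Both sides of that tradeoff are easy to estimate up front. A sketch, with illustrative numbers:

```python
def snapshot_cadence_s(snapshot_threshold: int, write_eps: float) -> float:
    """Average seconds between snapshots at a sustained write rate."""
    return snapshot_threshold / write_eps

def install_within_slo(snapshot_bytes: int, install_bw_bps: float,
                       slo_s: float) -> bool:
    """Would a snapshot of this size install inside the recovery SLO?"""
    return snapshot_bytes / install_bw_bps <= slo_s

# A 16,384-entry threshold at 2k entries/s means a snapshot every ~8 s:
# that is churn territory, so raise the threshold.
print(snapshot_cadence_s(16_384, 2_000))                    # 8.192
# A 2 GiB snapshot over 100 MiB/s installs in 20.48 s, inside a 30 s SLO.
print(install_within_slo(2 * 1024**3, 100 * 1024**2, 30))   # True
```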
Step 4 — Validate under burst + degraded node test
Run controlled tests:
- one slow follower (CPU throttled or IO-limited),
- one network-impaired follower,
- realistic write burst.
Pass criteria:
- follower rejoins without repeated snapshot ping-pong,
- leader latency remains within SLO during install,
- no compaction thrash loop.
Step 5 — Add guardrail alerts
Alert on:
- rising ratio of snapshot-based catch-ups,
- snapshot install p95 duration crossing budget,
- follower lag approaching replay-window ceiling,
- compaction frequency spikes.
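The four guardrails above can be expressed as one evaluation pass over a metrics snapshot. Every threshold below (10% snapshot ratio, 80% window usage, 3x compaction baseline) is an illustrative assumption; set yours from measured baselines:

```python
def guardrail_alerts(m: dict) -> list:
    """Evaluate the four guardrails; all thresholds are illustrative."""
    alerts = []
    if m["snapshot_catchup_ratio"] > 0.10:
        alerts.append("snapshot-based catch-up ratio above 10%")
    if m["install_p95_s"] > m["install_budget_s"]:
        alerts.append("snapshot install p95 over budget")
    if m["p99_lag_entries"] > 0.8 * m["trailing_logs"]:
        alerts.append("follower lag near replay-window ceiling")
    if m["compactions_per_hour"] > 3 * m["compaction_baseline_per_hour"]:
        alerts.append("compaction frequency spike")
    return alerts

sample = {
    "snapshot_catchup_ratio": 0.02, "install_p95_s": 18, "install_budget_s": 30,
    "p99_lag_entries": 60_000, "trailing_logs": 65_536,
    "compactions_per_hour": 4, "compaction_baseline_per_hour": 4,
}
print(guardrail_alerts(sample))  # ['follower lag near replay-window ceiling']
```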
6) Operator decision table
| Observed pattern | Likely issue | First action |
|---|---|---|
| Frequent snapshot installs for mildly lagging followers | Replay window too short | Increase trailing logs (carefully) |
| Very large, slow snapshots | Snapshot threshold too high and/or state too large | Lower threshold; reduce write amplification |
| High IO spikes during snapshot/compaction | Maintenance contention | Stagger/smooth maintenance; revisit threshold |
| Disk growth without bounds | Compaction too conservative | Tighten compaction/snapshot policy |
| Followers never fully catch up after restart | Snapshot install path bottleneck | Improve disk/network path; reduce snapshot size |
7) Implementation notes across common stacks
- Raft algorithm baseline: InstallSnapshot is the safety valve when logs are unavailable for replay; snapshot contains a state baseline + log position metadata.
- etcd ops guidance: compaction and snapshot retention settings are explicit tradeoffs between memory/throughput and slow-follower recoverability.
- HashiCorp Raft-backed systems (e.g., Vault/Consul): knobs like `snapshot_threshold` and `trailing_logs` map directly to this replay-vs-snapshot balance.
Do not cargo-cult raw defaults across clusters with very different write rates or follower hardware.
8) 30-minute incident runbook (snapshot/compaction instability)
- Confirm leader health and quorum first.
- Identify whether lagging followers are replaying logs or stuck in repeated snapshot installs.
- Check snapshot size + install duration trend.
- If ping-pong is active:
- temporarily reduce write pressure if possible,
- increase retained trailing logs (short-term relief),
- then tune snapshot threshold for sustainable size/frequency.
- If disk/IO pressure dominates:
- avoid blind threshold increases,
- first reduce concurrent maintenance contention.
- After stabilization, capture pre/post metrics and codify new guardrails.
9) What “good” looks like
A healthy Raft cluster usually shows:
- replay-based follower catch-up as the normal path,
- snapshot install as occasional exception,
- bounded log growth,
- predictable snapshot durations,
- and no strong coupling between maintenance events and tail-latency spikes.
If snapshot install becomes routine, your replay window and snapshot economics are misaligned with workload reality.
References
- Ongaro, D., Ousterhout, J. In Search of an Understandable Consensus Algorithm (Raft, extended version).
- etcd docs: Maintenance (compaction, snapshot-count tradeoffs).
- HashiCorp Vault docs: Integrated storage (Raft) configuration (`snapshot_threshold`, `trailing_logs`).
- HashiCorp Consul docs: Raft configuration / LogStore operational considerations.