Merkle-Tree Anti-Entropy Repair Operations Playbook
How to keep eventually consistent replicas converged without melting disk, network, and compaction.
Why this matters
If you run an eventually consistent replicated store, entropy is not an edge case—it is normal operation:
- transient node/network failures,
- dropped writes/timeouts,
- replica lag,
- divergent versions,
- occasional corruption.
Read repair and hinted handoff help, but they are opportunistic. If a key is cold, drift can live for a long time. That is why production systems (Dynamo lineage, Cassandra, Riak, ScyllaDB) rely on anti-entropy repair.
The practical challenge: repair can be one of the most expensive maintenance jobs in your cluster.
This playbook focuses on running Merkle-tree-based repair safely and predictably.
1) Mental model: three convergence loops
Treat replica convergence as three loops with distinct latency/cost profiles:
- Hinted handoff (fast path for short outages)
- Read repair (on-demand, only for hot keys)
- Active anti-entropy repair (background, covers cold data)
Anti-entropy is the only loop that gives you broad, periodic coverage across the full keyspace.
2) How Merkle anti-entropy works (operator view)
For each repair range (typically token/vnode range):
- Each replica builds a hash tree over data in that range.
- Coordinator compares trees replica-to-replica.
- If root hashes match, the whole range is assumed equal.
- If not, descend recursively into child nodes.
- Stream only mismatching subranges/rows and reconcile.
Why it scales: you compare compact hashes first and only stream where trees disagree.
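The comparison the coordinator runs can be sketched as follows. This is a minimal in-memory model with binary fanout; real systems hash token subranges, persist or cache the trees, and stream rows rather than returning leaf indices. The row format and digest size are illustrative assumptions:

```python
import hashlib

def leaf_hash(rows):
    """Hash one leaf subrange. Rows are hypothetical (key, value) pairs."""
    h = hashlib.blake2b(digest_size=16)
    for key, value in sorted(rows):
        h.update(f"{key}={value};".encode())
    return h.digest()

def build_tree(leaf_hashes):
    """Build the tree bottom-up; returns a list of levels, leaves first, root last."""
    levels = [leaf_hashes[:]]
    while len(levels[-1]) > 1:
        prev, nxt = levels[-1], []
        for i in range(0, len(prev), 2):
            pair = prev[i] + (prev[i + 1] if i + 1 < len(prev) else prev[i])
            nxt.append(hashlib.blake2b(pair, digest_size=16).digest())
        levels.append(nxt)
    return levels

def diff(a, b, depth=None, i=0, out=None):
    """Descend from the roots; recurse only into mismatching subtrees and
    return the indices of leaf subranges that must be streamed."""
    if out is None:
        out, depth = [], len(a) - 1          # start at the root level
    if a[depth][i] == b[depth][i]:
        return out                           # hashes match: skip whole subtree
    if depth == 0:
        out.append(i)                        # mismatching leaf -> stream this subrange
        return out
    for child in (2 * i, 2 * i + 1):
        if child < len(a[depth - 1]):
            diff(a, b, depth - 1, child, out)
    return out
```

With matching roots the comparison stops immediately; with one drifted row only the leaf containing it is flagged. That is why streamed bytes scale with drift, not with data size.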
3) Design knobs that decide whether repair is cheap or painful
A) Hashing granularity
- Coarse (partition/subrange): fewer hashes, less metadata, but can over-stream on tiny mismatches.
- Fine (row-level): far less over-streaming and better network efficiency, but more hash bookkeeping.
Rule of thumb:
- If your partitions can get large or skewed, favor finer granularity.
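A back-of-the-envelope model shows why coarse granularity hurts: every drifted row drags its whole hash range along when streamed. The numbers below are illustrative assumptions, not measurements:

```python
def coarse_stream_bytes(range_bytes, num_hash_ranges, drifted_rows, row_bytes):
    """Worst case for coarse hashing: each drifted row lands in a distinct
    hash range, and each mismatching range is streamed in full."""
    bytes_per_hash_range = range_bytes // num_hash_ranges
    streamed = min(drifted_rows, num_hash_ranges) * bytes_per_hash_range
    actual_drift = drifted_rows * row_bytes
    return streamed, streamed // actual_drift   # streamed bytes, amplification

# Illustrative: 100 GiB range, 1024 hash ranges, 50 drifted 1 KiB rows
# -> roughly 4.9 GiB streamed to fix about 50 KiB of real drift.
streamed, amplification = coarse_stream_bytes(100 * 2**30, 1024, 50, 1024)
```

Doubling the number of hash ranges roughly halves the worst-case amplification, which is the trade the granularity knob is buying.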
B) Tree shape (depth/fanout)
- Shallow trees: lighter memory/CPU overhead but less precise mismatch localization.
- Deeper trees: better pinpointing but more hash computation/state.
No single static depth is right everywhere; tune it against your partition-count distribution and repair window.
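One way to make that tuning concrete: pick a depth so each leaf covers a target number of partitions. The target-per-leaf and depth cap below are assumed starting points, not universal constants:

```python
import math

def merkle_depth(partitions_in_range, target_partitions_per_leaf=32, max_depth=20):
    """Depth for a binary tree whose leaves each cover roughly the target
    number of partitions; capped to bound hash computation and memory."""
    leaves_needed = max(1, partitions_in_range / target_partitions_per_leaf)
    return min(max_depth, math.ceil(math.log2(leaves_needed)))
```

A range with 5M partitions lands at depth 18 under these defaults, while a billion-partition range hits the cap, signaling that the range itself should be split before repair.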
C) Tree storage model
- In-memory trees: simple but expensive to rebuild after restart.
- Persistent/on-disk trees: better restart behavior and continuous background comparison.
If your workload has lots of cold data + frequent restarts, persistent trees usually pay off operationally.
D) Hash function choice
Repair hashes are integrity detectors for synchronization, not cryptographic trust boundaries. Use fast, low-collision non-crypto hashes where supported; reserve cryptographic hashes only where required.
4) Repair scheduling strategy (the part most teams get wrong)
A) Define a hard max interval per range
Set a maximum time between successful repairs for every replica range (e.g., daily/weekly depending on workload and tombstone policy).
Don’t schedule “best effort”; schedule to a coverage SLO.
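SLO-driven scheduling can be sketched as a pure function over repair history. The seven-day SLO and the range-ID map shape are assumptions for illustration:

```python
import time

MAX_REPAIR_AGE_S = 7 * 24 * 3600   # assumed coverage SLO: 7 days per range

def plan_repairs(last_success, now=None):
    """Given {range_id: last_successful_repair_timestamp}, return the
    ranges that violate the coverage SLO, most overdue first."""
    now = time.time() if now is None else now
    ages = {r: now - ts for r, ts in last_success.items()}
    overdue = [r for r, age in ages.items() if age > MAX_REPAIR_AGE_S]
    return sorted(overdue, key=lambda r: ages[r], reverse=True)
```

The point of the sketch: the scheduler's input is "age since last success per range," not a fixed calendar slot, so a range that keeps failing naturally rises to the front of the queue.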
B) Prioritize by risk, not by alphabetical keyspace order
Repair first:
- high-write / high-delete tables,
- business-critical keyspaces,
- ranges recently impacted by node instability.
C) Keep jobs topology-aware
- Prefer local-DC repairs when possible before cross-DC operations.
- Avoid large all-cluster repair blasts during peak traffic.
- Stagger ranges to flatten I/O and network usage.
D) Pair repair cadence with tombstone/GC policy
If ranges can go unrepaired for longer than your tombstone GC grace window, a replica that missed a delete can resurrect the data after tombstones are purged, and divergence risk rises with it.
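That invariant reduces to a one-line check worth wiring into config validation. The 25% slack is an assumed safety margin:

```python
def repair_cadence_safe(max_repair_age_s, gc_grace_s, slack=0.25):
    """Every replica must be repaired well inside the tombstone GC grace
    window; otherwise a replica that missed the delete can resurrect it
    once tombstones are purged. Slack absorbs slow or retried repair runs."""
    return max_repair_age_s <= gc_grace_s * (1 - slack)

# Weekly repair against a 10-day grace window passes; 10-day repair does not.
```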
5) Throughput controls: avoid repair-induced incidents
Repair consumes the same resources your production traffic needs.
Use explicit guardrails:
- cap concurrent repair ranges,
- cap streaming bandwidth,
- cap validation compaction concurrency,
- pause/slow repair on cluster stress signals (latency, disk queue depth, pending compactions).
Simple control loop:
- Green: normal concurrency
- Yellow: halve concurrency/bandwidth
- Red: pause repair, preserve SLO traffic
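The green/yellow/red loop maps naturally onto a small budget function. The SLO thresholds and base budgets below are placeholders; derive yours from baseline load:

```python
def repair_budget(p99_ms, pending_compactions,
                  base_concurrency=4, base_bw_mbps=100,
                  latency_slo_ms=50, compaction_limit=20):
    """Translate cluster stress signals into a (concurrency, bandwidth)
    budget. Red pauses repair entirely to preserve SLO traffic."""
    if p99_ms > 2 * latency_slo_ms or pending_compactions > 2 * compaction_limit:
        return 0, 0                                        # Red: pause
    if p99_ms > latency_slo_ms or pending_compactions > compaction_limit:
        return base_concurrency // 2, base_bw_mbps // 2    # Yellow: halve
    return base_concurrency, base_bw_mbps                  # Green: normal
```

Evaluating this on each scheduler tick (rather than once at job start) is what turns a static throttle into the adaptive pause/slow behavior the guardrails call for.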
6) Observability: what to graph and alert on
Minimum repair dashboard:
- Coverage age: max/avg time since last successful repair per range
- Repair duration: p50/p95 per keyspace/table
- Bytes streamed per repaired GB (efficiency signal)
- Mismatch ratio: mismatched ranges / checked ranges
- Compaction pressure during repair windows
- Read/write latency shift during repair windows
- Repair failure rate by reason (timeout, stream error, topology change)
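Two of the signals above reduce to simple ratios; a sketch, with field names that are assumptions about your metrics pipeline:

```python
def repair_signals(bytes_streamed, range_size_bytes, mismatched_ranges, checked_ranges):
    """Stream efficiency (bytes streamed per repaired GiB) and mismatch
    ratio. A rising streamed-per-GiB against a flat mismatch ratio usually
    means granularity is too coarse, not that drift is growing."""
    repaired_gib = max(1, range_size_bytes // 2**30)
    return {
        "streamed_per_gib": bytes_streamed / repaired_gib,
        "mismatch_ratio": mismatched_ranges / max(1, checked_ranges),
    }
```

Graphing the two together is the useful part: each one alone is ambiguous, but their divergence points at the granularity knobs in section 3.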
High-value alerts:
- any range exceeds max repair age,
- repeated aborts on same range,
- streaming bytes spike vs baseline,
- repair jobs overlapping with incident-level latency.
7) Common failure patterns and fixes
Symptom: repair “completes” but drift returns quickly
Likely causes:
- repair interval too long,
- ongoing replica instability,
- workload skew concentrating entropy on hot ranges.
Fix: tighten interval for hot keyspaces + stabilize failing nodes + bias scheduler to problematic ranges.
Symptom: huge streaming for tiny logical drift
Likely causes:
- overly coarse hash granularity,
- skewed large partitions,
- poor range split strategy.
Fix: finer-grain repair mode where supported, rebalance/split hot partitions, tune range chunking.
Symptom: repair causes user-facing latency spikes
Likely causes:
- no throttling,
- compaction + repair contention,
- excessive parallelism.
Fix: add adaptive throttling + move repair windows + separate high-risk tables into smaller batches.
Symptom: perpetual re-repair after topology churn
Likely causes:
- token/vnode movement invalidating previous tree assumptions,
- interrupted jobs during bootstrap/decommission.
Fix: trigger post-topology-change targeted repair plans and reset stale range schedules.
8) Recommended rollout plan (for an existing cluster)
- Inventory keyspaces/tables by write+delete intensity.
- Set coverage SLO (max repair age per range).
- Start conservative: low concurrency + strict throttles.
- Measure efficiency: bytes streamed per repaired GB.
- Tune granularity/splits for outlier tables.
- Automate recurring schedule + failure retries with jitter.
- Bake into incident runbooks (pause/resume policy on degradation).
9) Compact operator checklist
- Every replica range has a max allowed repair age
- Repair scheduler is topology-aware (not cluster-wide blast-by-default)
- Explicit I/O/network/parallelism caps are configured
- Repair can auto-throttle/pause on latency/compaction stress
- Post-topology-change targeted repair is defined
- Dashboard includes coverage age + stream efficiency + failure reasons
- Cold-data keyspaces are included (not only hot-read sets)
References
- Dynamo paper/blog (anti-entropy, key-range-oriented replica sync)
- Cassandra anti-entropy repair (Merkle trees, operational modes)
- Riak Active Anti-Entropy (AAE, persistent on-disk hash trees)
- ScyllaDB row-level repair overview (network and runtime efficiency)