Raft Joint Consensus Membership Change Operations Playbook
Why this exists
Changing cluster membership is one of the easiest ways to accidentally break a healthy Raft system. The dangerous pattern is simple: remove/add nodes too aggressively, lose quorum overlap, and trigger election chaos right when you need stability.
This note is an operator-first playbook for safe membership changes using joint consensus (or equivalent staged approaches with learners).
Core idea in one paragraph
Raft membership changes must preserve quorum intersection across configuration transitions. Joint consensus does this by temporarily requiring agreement from both old and new configs (C_old,new), then finalizing to C_new. In practice, modern implementations often use a learner/non-voter stage before promoting a node to voter, because catching up logs before voting reduces instability and risk.
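The quorum rule during C_old,new can be sketched in a few lines. This is an illustrative sketch only; the node names and helper functions are made up, not any library's API:

```python
# Sketch of the joint-consensus commit rule: an entry (or election) only
# succeeds with a majority of BOTH the old and the new voter sets.

def majority(voters):
    """Smallest number of nodes that forms a majority of `voters`."""
    return len(voters) // 2 + 1

def joint_quorum(acks, c_old, c_new):
    """True if the acknowledging set satisfies a quorum in both configs."""
    return (len(acks & c_old) >= majority(c_old)
            and len(acks & c_new) >= majority(c_new))

# Example: replacing n3 with n4 in a 3-voter cluster.
c_old = {"n1", "n2", "n3"}
c_new = {"n1", "n2", "n4"}

print(joint_quorum({"n1", "n2"}, c_old, c_new))  # True: majority of both
print(joint_quorum({"n1", "n3"}, c_old, c_new))  # False: no C_new majority
```

The second call fails because {n1, n3} has no majority in C_new; that overlap requirement is exactly what prevents two disjoint majorities from committing conflicting entries during the transition.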
Failure modes to prevent
No quorum overlap during transition
- Removing a voter before replacement is caught up can strand the cluster.
Leader on the wrong side of topology
- Membership change completes, then leader loses reachability to new majority.
Catch-up debt promoted too early
- New node becomes voter while far behind, increasing commit latency and election risk.
Multiple concurrent reconfigs
- Overlapping changes create ambiguous operator state and rollback confusion.
Latency cliff from cross-AZ/region voting path
- New quorum geometry raises commit RTT enough to destabilize client timeouts.
Safety invariants (print this in your runbook)
- Only one membership change in flight at a time.
- Never remove a healthy voter until replacement is caught up (or capacity is proven).
- Maintain quorum overlap from start to finish.
- Track leader placement before and after change.
- Abort transition if lag/election churn breaches predefined thresholds.
Preflight checklist
Before touching membership:
- Cluster has stable leader (no recent term churn).
- Commit latency and apply lag are within normal envelope.
- No snapshot backlog / disk saturation / compaction stall.
- New node passes health checks (disk, network, clock sync, TLS).
- New node can replicate at near-line rate (not permanently lagging).
- Alerting is active for: leader changes, replication lag, failed proposals.
- Clear rollback decision and operator owner are assigned.
Recommended freeze: avoid concurrent high-risk changes (schema migrations, network policy rollout, storage maintenance).
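The checklist above lends itself to automation. A minimal sketch of a preflight gate, assuming the metric names below are wired up to your own monitoring stack (they are illustrative, not a real API):

```python
# Hypothetical preflight gate: evaluate every check, report everything
# that fails, and refuse to start the membership change on any failure.

def preflight(m):
    """`m` is a dict of cluster metrics gathered by your monitoring stack."""
    checks = {
        "stable_leader":    m["term_changes_last_hour"] == 0,
        "apply_lag":        m["p99_apply_lag_ms"] <= m["apply_lag_budget_ms"],
        "snapshot_backlog": m["pending_snapshots"] == 0,
        "disk_headroom":    m["disk_used_pct"] < 80,
        "new_node_healthy": m["new_node_health"] == "ok",
        "alerting_live":    m["alerting_enabled"],
    }
    failed = sorted(name for name, ok in checks.items() if not ok)
    return failed  # empty list means clear to proceed

metrics = {
    "term_changes_last_hour": 0,
    "p99_apply_lag_ms": 40, "apply_lag_budget_ms": 100,
    "pending_snapshots": 0, "disk_used_pct": 55,
    "new_node_health": "ok", "alerting_enabled": True,
}
print(preflight(metrics))  # [] -> clear to start
```

Returning the full list of failures (rather than stopping at the first) gives the operator one complete picture before deciding whether to fix or postpone.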
Operational sequence (safe default)
Phase 1) Add as learner/non-voter
- Add target node as learner.
- Verify sustained catch-up:
- replication lag trending to ~0
- snapshot transfer completed (if needed)
- no repeated disconnect/restart cycles
Gate to continue: learner remains near-tail for N minutes under realistic write load.
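The "near-tail for N minutes" gate can be made mechanical. A sketch, assuming a `get_lag()` callable that you wire to your cluster's replication metrics (the thresholds are examples, not recommendations):

```python
import time

# Hypothetical Phase-1 gate: pass only once the learner has stayed within
# `max_lag` entries of the leader for a continuous `hold_secs` window.
def sustained_catchup(get_lag, max_lag=64, hold_secs=600,
                      poll_secs=10, timeout_secs=3600):
    deadline = time.monotonic() + timeout_secs
    window_start = None
    while time.monotonic() < deadline:
        if get_lag() <= max_lag:
            if window_start is None:
                window_start = time.monotonic()
            if time.monotonic() - window_start >= hold_secs:
                return True          # near-tail for the whole hold window
        else:
            window_start = None      # lag spiked: restart the hold window
        time.sleep(poll_secs)
    return False                     # never stabilized: do not promote
```

Resetting the window on any lag spike is deliberate: a learner that briefly catches up and then falls behind under write load has not earned promotion.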
Phase 2) Enter joint config / promote to voter
- Promote learner via joint-consensus transition (implementation-specific command).
- During this window, monitor:
- proposal commit latency
- leader heartbeat stability
- election timeout events
Abort conditions: sudden term bump, persistent high lag, client timeout surge.
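These abort conditions are easier to enforce when polled by code than eyeballed on a dashboard. A sketch, with invented field names standing in for your real metrics:

```python
# Hypothetical abort check, polled throughout the joint-config window.
def abort_reasons(m, baseline, lag_limit=5000, timeout_ratio_limit=0.01):
    reasons = []
    if m["term"] > baseline["term"]:
        reasons.append("term bump: unexpected election during transition")
    if m["max_follower_lag"] > lag_limit:
        reasons.append("persistent high replication lag")
    if m["client_timeout_ratio"] > timeout_ratio_limit:
        reasons.append("client timeout surge")
    return reasons  # non-empty -> stop and execute the rollback plan
```

Capture `baseline` (term, latency, lag) immediately before entering the joint window so that "sudden term bump" is measured against a known-stable reference.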
Phase 3) Finalize to new config
- Complete transition to C_new.
- Confirm final voter set from cluster metadata, not just CLI success text.
Phase 4) Optional removal of old voter (if replacing)
- Remove old voter only after C_new is stable.
- Re-check quorum math after removal.
Phase 5) Post-change soak
- Run soak (e.g., 15–60 min) with normal write traffic.
- Validate no hidden instability (leader churn, long-tail commit spikes).
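The five phases above can be tied together in one automation driver. This is a shape sketch only: every method on `cluster` stands in for an implementation-specific command, and none of them is a real library API:

```python
# Hypothetical end-to-end driver for the safe default sequence.
def replace_voter(cluster, new_node, old_node=None):
    cluster.add_learner(new_node)                    # Phase 1
    if not cluster.sustained_catchup(new_node):
        cluster.remove_learner(new_node)             # never promote a laggard
        raise RuntimeError(f"{new_node} failed to catch up; aborting")
    cluster.promote_voter(new_node)                  # Phase 2: joint transition
    cluster.wait_config_committed()                  # Phase 3: finalize C_new
    if new_node not in cluster.voter_set():          # verify metadata, not CLI text
        raise RuntimeError(f"{new_node} missing from committed voter set")
    if old_node is not None:
        cluster.remove_voter(old_node)               # Phase 4: only after C_new stable
    cluster.soak(minutes=30)                         # Phase 5
```

Note the ordering constraints the code enforces: promotion is unreachable without a passed catch-up gate, and removal of the old voter is unreachable until the new configuration is verified from cluster metadata.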
Quorum math quick reference
For N voters, majority = floor(N/2) + 1.
- 3 voters -> majority 2
- 5 voters -> majority 3
- 7 voters -> majority 4
Operationally:
- Prefer odd voter counts.
- Do not “temporarily shrink then expand” under load unless explicitly planned.
- In 3-node clusters, every node is precious; replacement sequencing matters a lot.
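The arithmetic behind this section, including why even voter counts buy nothing, fits in a few lines of Python:

```python
# Majority and fault-tolerance math for an N-voter Raft cluster.
def majority(n):
    return n // 2 + 1

def tolerated_failures(n):
    return n - majority(n)       # equals (n - 1) // 2

for n in (3, 4, 5, 6, 7):
    print(f"{n} voters: majority {majority(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
# 4 voters tolerate the same 1 failure as 3, and 6 the same 2 as 5:
# the even extra node adds quorum cost without adding fault tolerance.
```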
SLO-oriented guardrails
Define hard gates before the change:
- Leader changes: <= 0 unexpected during transition window
- P95 commit latency: <= baseline * 1.5 (example gate)
- Max follower lag: bounded (e.g., < a few thousand entries, cluster-specific)
- Failed proposal ratio: near zero
If any guardrail breaks, pause and either rollback or hold at learner stage.
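Encoded as code, the gates look like this. The thresholds mirror the example values above and must be tuned per cluster; the metric names are assumptions:

```python
# Hypothetical guardrail check evaluated during the transition window.
def broken_guardrails(m, baseline, lag_limit=5000):
    gates = {
        "leader_changes":   m["unexpected_leader_changes"] == 0,
        "p95_commit":       m["p95_commit_ms"] <= baseline["p95_commit_ms"] * 1.5,
        "follower_lag":     m["max_follower_lag_entries"] < lag_limit,
        "failed_proposals": m["failed_proposal_ratio"] < 1e-3,
    }
    return sorted(name for name, ok in gates.items() if not ok)
```

An empty return means all gates hold; anything else is the signal to pause and run the rollback decision, not to "wait and see".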
Rollback patterns
Case A: learner never catches up
- Keep as learner or remove learner.
- Do not promote.
- Diagnose bottleneck (disk/network/snapshot throttle).
Case B: instability starts during joint transition
- If implementation supports safe revert, return to old config.
- If not, stop further changes and stabilize leader/quorum first.
Case C: post-finalization latency regression
- Re-evaluate placement (AZ/region path).
- Consider planned reconfiguration back to previous topology after stabilization.
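Cases A-C can be captured as a small decision table so the on-call operator is not inventing policy mid-incident. The stage/symptom labels below are invented for illustration:

```python
# Hypothetical rollback decision table mirroring cases A-C above.
ROLLBACK = {
    ("learner",   "no_catchup"):
        "hold or remove learner; do not promote; diagnose bottleneck",
    ("joint",     "instability"):
        "revert to old config if supported, else freeze changes and stabilize",
    ("finalized", "latency_regression"):
        "re-evaluate placement; plan reconfiguration back after stabilization",
}

def rollback_action(stage, symptom):
    return ROLLBACK.get((stage, symptom), "escalate: unmapped failure mode")
```

The default branch matters: a failure mode you did not anticipate should page a human, not trigger an automated guess.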
Chaos tests to run in staging
- Promote learner while injecting moderate packet loss.
- Drop leader mid-transition and ensure safe re-election.
- Delay one follower heavily and verify rollback decision logic.
- Simulate snapshot-heavy node join with sustained write load.
- Run client timeout-sensitive workload to catch tail-latency regressions.
Goal: prove your runbook survives realistic faults, not just happy-path demos.
Implementation notes (vendor-agnostic)
- etcd/Consul/Raft libraries expose different commands, but the safety model is the same.
- A “server added” message is not enough; verify voter role, health, and lag over time.
- Prefer automation that enforces preflight/guardrails over manual one-off CLI edits.
60-second operator summary
- Add replacement as learner first.
- Wait for sustained catch-up.
- Promote via joint consensus.
- Finalize, then (optionally) remove old voter.
- Enforce strict latency/leader-churn guardrails.
- Never stack concurrent membership changes.
Membership change is not a metadata edit; it is a quorum-risk operation. Treat it like a controlled migration with rollback discipline.