Raft Joint Consensus Membership Change Operations Playbook
Why this exists
Changing cluster membership is one of the easiest ways to accidentally break a healthy Raft system. The dangerous pattern is simple: remove/add nodes too aggressively, lose quorum overlap, and trigger election chaos right when you need stability.
This note is an operator-first playbook for safe membership changes using joint consensus (or equivalent staged approaches with learners).
Core idea in one paragraph
Raft membership changes must preserve quorum intersection across configuration transitions. Joint consensus does this by temporarily requiring agreement from both old and new configs (C_old,new), then finalizing to C_new. In practice, modern implementations often use a learner/non-voter stage before promoting a node to voter, because catching up logs before voting reduces instability and risk.
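The quorum rule during C_old,new can be sketched in a few lines. This is an illustrative sketch only; the node names and helper functions are made up, not any library's API:

```python
# Sketch of the joint-consensus commit rule: an entry (or election) only
# succeeds with a majority of BOTH the old and the new voter sets.

def majority(voters):
    """Smallest number of nodes that forms a majority of `voters`."""
    return len(voters) // 2 + 1

def joint_quorum(acks, c_old, c_new):
    """True if the acknowledging set satisfies a quorum in both configs."""
    return (len(acks & c_old) >= majority(c_old)
            and len(acks & c_new) >= majority(c_new))

# Example: replacing n3 with n4 in a 3-voter cluster.
c_old = {"n1", "n2", "n3"}
c_new = {"n1", "n2", "n4"}

print(joint_quorum({"n1", "n2"}, c_old, c_new))  # True: majority of both
print(joint_quorum({"n1", "n3"}, c_old, c_new))  # False: no C_new majority
```

The second call fails because {n1, n3} has no majority in C_new; that overlap requirement is exactly what prevents two disjoint majorities from committing conflicting entries during the transition.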
Failure modes to prevent
No quorum overlap during transition
- Removing a voter before replacement is caught up can strand the cluster.
Leader on the wrong side of topology
- Membership change completes, then leader loses reachability to new majority.
Catch-up debt promoted too early
- New node becomes voter while far behind, increasing commit latency and election risk.
Multiple concurrent reconfigs
- Overlapping changes create ambiguous operator state and rollback confusion.
Latency cliff from cross-AZ/region voting path
- New quorum geometry raises commit RTT enough to destabilize client timeouts.
Safety invariants (print this in your runbook)
- Only one membership change in flight at a time.
- Never remove a healthy voter until replacement is caught up (or capacity is proven).
- Maintain quorum overlap from start to finish.
- Track leader placement before and after change.
- Abort transition if lag/election churn breaches predefined thresholds.
Preflight checklist
Before touching membership:
- Cluster has stable leader (no recent term churn).
- Commit latency and apply lag are within normal envelope.
- No snapshot backlog / disk saturation / compaction stall.
- New node passes health checks (disk, network, clock sync, TLS).
- New node can replicate at near-line rate (not permanently lagging).
- Alerting is active for: leader changes, replication lag, failed proposals.
- Clear rollback decision and operator owner are assigned.
Recommended freeze: avoid concurrent high-risk changes (schema migrations, network policy rollout, storage maintenance).
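The checklist above lends itself to automation. A minimal sketch of a preflight gate, assuming the metric names below are wired up to your own monitoring stack (they are illustrative, not a real API):

```python
# Hypothetical preflight gate: evaluate every check, report everything
# that fails, and refuse to start the membership change on any failure.

def preflight(m):
    """`m` is a dict of cluster metrics gathered by your monitoring stack."""
    checks = {
        "stable_leader":    m["term_changes_last_hour"] == 0,
        "apply_lag":        m["p99_apply_lag_ms"] <= m["apply_lag_budget_ms"],
        "snapshot_backlog": m["pending_snapshots"] == 0,
        "disk_headroom":    m["disk_used_pct"] < 80,
        "new_node_healthy": m["new_node_health"] == "ok",
        "alerting_live":    m["alerting_enabled"],
    }
    failed = sorted(name for name, ok in checks.items() if not ok)
    return failed  # empty list means clear to proceed

metrics = {
    "term_changes_last_hour": 0,
    "p99_apply_lag_ms": 40, "apply_lag_budget_ms": 100,
    "pending_snapshots": 0, "disk_used_pct": 55,
    "new_node_health": "ok", "alerting_enabled": True,
}
print(preflight(metrics))  # [] -> clear to start
```

Returning the full list of failures (rather than stopping at the first) gives the operator one complete picture before deciding whether to fix or postpone.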
Operational sequence (safe default)
Phase 1) Add as learner/non-voter
- Add target node as learner.
- Verify sustained catch-up:
- replication lag trending to ~0
- snapshot transfer completed (if needed)
- no repeated disconnect/restart cycles
Gate to continue: learner remains near-tail for N minutes under realistic write load.
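The "near-tail for N minutes" gate can be made mechanical. A sketch, assuming a `get_lag()` callable that you wire to your cluster's replication metrics (the thresholds are examples, not recommendations):

```python
import time

# Hypothetical Phase-1 gate: pass only once the learner has stayed within
# `max_lag` entries of the leader for a continuous `hold_secs` window.
def sustained_catchup(get_lag, max_lag=64, hold_secs=600,
                      poll_secs=10, timeout_secs=3600):
    deadline = time.monotonic() + timeout_secs
    window_start = None
    while time.monotonic() < deadline:
        if get_lag() <= max_lag:
            if window_start is None:
                window_start = time.monotonic()
            if time.monotonic() - window_start >= hold_secs:
                return True          # near-tail for the whole hold window
        else:
            window_start = None      # lag spiked: restart the hold window
        time.sleep(poll_secs)
    return False                     # never stabilized: do not promote
```

Resetting the window on any lag spike is deliberate: a learner that briefly catches up and then falls behind under write load has not earned promotion.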
Phase 2) Enter joint config / promote to voter
- Promote learner via joint-consensus transition (implementation-specific command).
- During this window, monitor:
- proposal commit latency
- leader heartbeat stability
- election timeout events
Abort conditions: sudden term bump, persistent high lag, client timeout surge.
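These abort conditions are easier to enforce when polled by code than eyeballed on a dashboard. A sketch, with invented field names standing in for your real metrics:

```python
# Hypothetical abort check, polled throughout the joint-config window.
def abort_reasons(m, baseline, lag_limit=5000, timeout_ratio_limit=0.01):
    reasons = []
    if m["term"] > baseline["term"]:
        reasons.append("term bump: unexpected election during transition")
    if m["max_follower_lag"] > lag_limit:
        reasons.append("persistent high replication lag")
    if m["client_timeout_ratio"] > timeout_ratio_limit:
        reasons.append("client timeout surge")
    return reasons  # non-empty -> stop and execute the rollback plan
```

Capture `baseline` (term, latency, lag) immediately before entering the joint window so that "sudden term bump" is measured against a known-stable reference.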
Phase 3) Finalize to new config
- Complete transition to C_new.
- Confirm final voter set from cluster metadata, not just CLI success text.
Phase 4) Optional removal of old voter (if replacing)
- Remove old voter only after C_new is stable.
- Re-check quorum math after removal.
Phase 5) Post-change soak
- Run soak (e.g., 15–60 min) with normal write traffic.
- Validate no hidden instability (leader churn, long-tail commit spikes).
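The five phases above can be tied together in one automation driver. This is a shape sketch only: every method on `cluster` stands in for an implementation-specific command, and none of them is a real library API:

```python
# Hypothetical end-to-end driver for the safe default sequence.
def replace_voter(cluster, new_node, old_node=None):
    cluster.add_learner(new_node)                    # Phase 1
    if not cluster.sustained_catchup(new_node):
        cluster.remove_learner(new_node)             # never promote a laggard
        raise RuntimeError(f"{new_node} failed to catch up; aborting")
    cluster.promote_voter(new_node)                  # Phase 2: joint transition
    cluster.wait_config_committed()                  # Phase 3: finalize C_new
    if new_node not in cluster.voter_set():          # verify metadata, not CLI text
        raise RuntimeError(f"{new_node} missing from committed voter set")
    if old_node is not None:
        cluster.remove_voter(old_node)               # Phase 4: only after C_new stable
    cluster.soak(minutes=30)                         # Phase 5
```

Note the ordering constraints the code enforces: promotion is unreachable without a passed catch-up gate, and removal of the old voter is unreachable until the new configuration is verified from cluster metadata.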
Quorum math quick reference
For N voters, majority = floor(N/2) + 1.
- 3 voters -> majority 2
- 5 voters -> majority 3
- 7 voters -> majority 4
Operationally:
- Prefer odd voter counts.
- Do not “temporarily shrink then expand” under load unless explicitly planned.
- In 3-node clusters, every node is precious; replacement sequencing matters a lot.
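The arithmetic behind this section, including why even voter counts buy nothing, fits in a few lines of Python:

```python
# Majority and fault-tolerance math for an N-voter Raft cluster.
def majority(n):
    return n // 2 + 1

def tolerated_failures(n):
    return n - majority(n)       # equals (n - 1) // 2

for n in (3, 4, 5, 6, 7):
    print(f"{n} voters: majority {majority(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
# 4 voters tolerate the same 1 failure as 3, and 6 the same 2 as 5:
# the even extra node adds quorum cost without adding fault tolerance.
```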
SLO-oriented guardrails
Define hard gates before the change:
- Leader changes: <= 0 unexpected during transition window
- P95 commit latency: <= baseline * 1.5 (example gate)
- Max follower lag: bounded (e.g., < a few thousand entries, cluster-specific)
- Failed proposal ratio: near zero
If any guardrail breaks, pause and either rollback or hold at learner stage.
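Encoded as code, the gates look like this. The thresholds mirror the example values above and must be tuned per cluster; the metric names are assumptions:

```python
# Hypothetical guardrail check evaluated during the transition window.
def broken_guardrails(m, baseline, lag_limit=5000):
    gates = {
        "leader_changes":   m["unexpected_leader_changes"] == 0,
        "p95_commit":       m["p95_commit_ms"] <= baseline["p95_commit_ms"] * 1.5,
        "follower_lag":     m["max_follower_lag_entries"] < lag_limit,
        "failed_proposals": m["failed_proposal_ratio"] < 1e-3,
    }
    return sorted(name for name, ok in gates.items() if not ok)
```

An empty return means all gates hold; anything else is the signal to pause and run the rollback decision, not to "wait and see".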
Rollback patterns
Case A: learner never catches up
- Keep as learner or remove learner.
- Do not promote.
- Diagnose bottleneck (disk/network/snapshot throttle).
Case B: instability starts during joint transition
- If implementation supports safe revert, return to old config.
- If not, stop further changes and stabilize leader/quorum first.
Case C: post-finalization latency regression
- Re-evaluate placement (AZ/region path).
- Consider planned reconfiguration back to previous topology after stabilization.
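Cases A-C can be captured as a small decision table so the on-call operator is not inventing policy mid-incident. The stage/symptom labels below are invented for illustration:

```python
# Hypothetical rollback decision table mirroring cases A-C above.
ROLLBACK = {
    ("learner",   "no_catchup"):
        "hold or remove learner; do not promote; diagnose bottleneck",
    ("joint",     "instability"):
        "revert to old config if supported, else freeze changes and stabilize",
    ("finalized", "latency_regression"):
        "re-evaluate placement; plan reconfiguration back after stabilization",
}

def rollback_action(stage, symptom):
    return ROLLBACK.get((stage, symptom), "escalate: unmapped failure mode")
```

The default branch matters: a failure mode you did not anticipate should page a human, not trigger an automated guess.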
Chaos tests to run in staging
- Promote learner while injecting moderate packet loss.
- Drop leader mid-transition and ensure safe re-election.
- Delay one follower heavily and verify rollback decision logic.
- Simulate snapshot-heavy node join with sustained write load.
- Run client timeout-sensitive workload to catch tail-latency regressions.
Goal: prove your runbook survives realistic faults, not just happy-path demos.
Implementation notes (vendor-agnostic)
- etcd/Consul/Raft libraries expose different commands, but the safety model is the same.
- A “server added” message is not enough; verify voter role, health, and lag over time.
- Prefer automation that enforces preflight/guardrails over manual one-off CLI edits.
60-second operator summary
- Add replacement as learner first.
- Wait for sustained catch-up.
- Promote via joint consensus.
- Finalize, then (optionally) remove old voter.
- Enforce strict latency/leader-churn guardrails.
- Never stack concurrent membership changes.
Membership change is not a metadata edit; it is a quorum-risk operation. Treat it like a controlled migration with rollback discipline.