Jepsen in Production CI: A Practical Consistency Verification Playbook

2026-03-09 · software

Category: knowledge
Domain: software / distributed systems / reliability

Why this matters

Many distributed systems fail not because throughput is low, but because correctness claims are wrong under faults.

A Jepsen-style program gives you a repeatable way to test those claims before production incidents do it for you.


Core idea

Jepsen treats your system as a black box and repeatedly runs this loop:

  1. Generate concurrent client operations (reads/writes/txns)
  2. Inject failures via a nemesis (e.g., partitions)
  3. Record full operation histories (invocations + completions)
  4. Check if the observed history is legal under your target consistency model

If the checker cannot explain an observed history under your claimed model, you have found either a correctness bug or a mismatch between claim and implementation.
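
Concretely, a history is just an ordered log of invocations and completions. A minimal sketch in Python (field names mirror Jepsen's :process/:type/:f/:value convention):

```python
# A minimal Jepsen-style history: each client operation appears twice,
# once as an invocation and once as a completion (ok/fail/info).
history = [
    {"process": 0, "type": "invoke", "f": "write", "value": 3},
    {"process": 1, "type": "invoke", "f": "read",  "value": None},
    {"process": 0, "type": "ok",     "f": "write", "value": 3},
    {"process": 1, "type": "ok",     "f": "read",  "value": 3},  # read saw the concurrent write
]

def completions(history):
    """Pair each invocation with its completion by process id."""
    pending, pairs = {}, []
    for op in history:
        if op["type"] == "invoke":
            pending[op["process"]] = op
        else:
            pairs.append((pending.pop(op["process"]), op))
    return pairs
```

The checker's input is exactly this pairing: which operations overlapped in time, and what each one observed.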


The four contracts you must define first

1) Claim contract (what you think you provide)

Write an explicit claim matrix: one row per operation surface (single-key ops, multi-key transactions, index reads), recording the consistency model promised and the conditions under which the promise holds.

If this matrix is vague, test outcomes become argument theater.
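
For illustration (the operation surfaces and promises below are hypothetical), such a matrix can start life as a literal data structure checked into the repo, where it is diffable and reviewable:

```python
# Hypothetical claim matrix: operation surface -> (claimed model, conditions).
claim_matrix = {
    "single-key get/put":   ("linearizable",           "always, including partitions"),
    "multi-key txn":        ("snapshot isolation",     "within one region"),
    "secondary-index read": ("eventually consistent",  "no real-time bound"),
}
```

Each row is a testable hypothesis: the workload and checker for a campaign should be derivable from it.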

2) Workload contract (what clients actually do)

Use workloads that preserve forensic signal: unique write values, list-append operations, and bank-style transfers with conserved totals, so every anomaly can be attributed to specific operations.
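
One standard trick is to make every written value globally unique, so any value observed by a read traces back to exactly one write. A minimal generator sketch:

```python
import itertools
import random

def register_workload(keys, n_ops, seed=0):
    """Yield read/write operations; writes carry globally unique values,
    so any read result identifies exactly one originating write."""
    rng = random.Random(seed)
    uid = itertools.count()
    for _ in range(n_ops):
        key = rng.choice(keys)
        if rng.random() < 0.5:
            yield {"f": "write", "key": key, "value": next(uid)}
        else:
            yield {"f": "read", "key": key}

ops = list(register_workload(["k1", "k2"], 100))
writes = [op["value"] for op in ops if op["f"] == "write"]
assert len(writes) == len(set(writes))  # forensic signal preserved
```

The seed also makes the workload replayable, which matters later for deterministic reruns.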

3) Fault contract (what can go wrong)

At minimum, test network partitions, process crashes and restarts, added latency and packet loss, and aggressive client timeouts (the fault menu below expands on these).

On real VM environments, include clock-fault scenarios (skew, jumps, stalls) whenever the system depends on wall-clock time, e.g., for leases or TTLs.

4) Verdict contract (how you decide pass/fail)

Decide in advance which checker verdicts block a release, which anomaly counts are tolerable, and who adjudicates ambiguous ("unknown") results.


Checker selection: when to use what

Knossos (linearizability-focused)

Use when validating single-object/register-style semantics, CAS behavior, lock-like APIs, or per-key strict real-time claims.

Practical note: treat checker output as evidence to inspect, not blind truth. Keep minimal failing histories and visualization artifacts for human review.
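
To make the checker's job concrete, here is a brute-force linearizability check for a single register (exponential in history length, so only usable on tiny histories; Knossos uses far smarter search, this is purely illustrative):

```python
from itertools import permutations

def linearizable(ops, init=None):
    """Brute-force linearizability check for one register.
    Each op: {"f": "read"|"write", "value": v, "inv": t0, "res": t1},
    where inv/res are invocation and response timestamps."""
    def legal(order):
        pos = {id(op): i for i, op in enumerate(order)}
        # Real-time constraint: if a completed before b was invoked,
        # a must appear before b in the linearization.
        for a in ops:
            for b in ops:
                if a["res"] < b["inv"] and pos[id(a)] > pos[id(b)]:
                    return False
        # Sequential constraint: every read sees the latest write.
        state = init
        for op in order:
            if op["f"] == "write":
                state = op["value"]
            elif op["value"] != state:
                return False
        return True
    return any(legal(p) for p in permutations(ops))

w     = {"f": "write", "value": 1, "inv": 0, "res": 1}
stale = {"f": "read",  "value": 0, "inv": 2, "res": 3}  # invoked after the write completed
fresh = {"f": "read",  "value": 1, "inv": 2, "res": 3}
assert not linearizable([w, stale], init=0)  # stale read: no legal ordering exists
assert linearizable([w, fresh], init=0)
```

When no permutation is legal, the history itself is the minimal evidence you archive for human review.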

Elle (transaction isolation-focused)

Use for multi-key transactional stores where you need anomaly-level diagnosis (e.g., G1/G2 class anomalies, read skew classes, dependency-cycle evidence).

Elle is especially useful because it explains why a history is impossible under the promised isolation level by showing dependency cycles/minimal witnesses.
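
A sketch of the core mechanism: given a dependency graph built from ww/wr/rw edges, any cycle is a witness that no serial order can explain the history (illustrative only; Elle's actual inference and anomaly taxonomy are much richer):

```python
def has_cycle(edges):
    """Detect a cycle in a transaction dependency graph.
    edges: {txn: set of txns it must precede}, built from ww/wr/rw deps."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    def dfs(n):
        color[n] = GRAY
        for m in edges.get(n, ()):
            c = color.get(m, WHITE)
            if c == GRAY:
                return True          # back edge: a dependency cycle
            if c == WHITE and dfs(m):
                return True
        color[n] = BLACK
        return False
    return any(color.get(n, WHITE) == WHITE and dfs(n) for n in edges)

# T1 must precede T2 (wr) and T2 must precede T1 (rw):
# impossible under serializability.
assert has_cycle({"T1": {"T2"}, "T2": {"T1"}})
assert not has_cycle({"T1": {"T2"}, "T2": set()})
```

The cycle, together with the operations that induced each edge, is the "minimal witness" that makes the verdict explainable.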


Campaign design (what to run every week)

Phase A: fast smoke (5–15 min). One workload, one fault type, modest concurrency; run on every merge to catch obvious regressions.

Phase B: stress sweep (30–90 min). Nightly sweeps over concurrency, fault frequency, and workload mix.

Phase C: pre-release soak (2–12 h). Long combined-fault runs against release candidates.

Store every failing seed/history/checker report for deterministic reruns.
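
A sketch of the deterministic-rerun idea: derive every random decision from one stored seed, so a persisted artifact replays the exact campaign (the schedule contents here are illustrative):

```python
import json
import random

def run_campaign(seed):
    """Drive all randomized choices from a single RNG so that a stored
    seed reproduces the run exactly."""
    rng = random.Random(seed)
    return [rng.choice(["partition", "kill", "latency"]) for _ in range(5)]

seed = 42
first = run_campaign(seed)
# On failure, persist enough to rerun deterministically:
artifact = json.dumps({"seed": seed, "schedule": first})
assert run_campaign(json.loads(artifact)["seed"]) == first
```

In practice the artifact would also bundle the full history and the checker report, but the seed is what makes the failure a test case rather than an anecdote.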


Fault menu that gives high signal quickly

Start simple, then compose:

  1. Partition random halves ↔ heal
  2. Leader/node kill loops
  3. Latency + packet loss jitter
  4. Client timeout tightening (ambiguous ops pressure)
  5. Combined faults (partition + restart + high concurrency)

Most teams under-test combined faults; many real incidents are cross-fault interactions.
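
The composition step can be automated. A sketch of a schedule generator that occasionally stacks faults in the same window (window lengths and rates here are illustrative, not tuned values):

```python
import random

def fault_schedule(duration_s, rng):
    """Generate fault windows from the menu above, sometimes composing
    two faults in the same window to probe cross-fault interactions."""
    menu = ["partition-halves", "kill-leader", "latency-jitter", "tight-timeouts"]
    events, t = [], 0
    while t < duration_s:
        start = t + rng.randint(1, 30)
        length = rng.randint(5, 60)
        n = 2 if rng.random() < 0.3 else 1   # 30% of windows stack two faults
        for f in rng.sample(menu, n):
            events.append({"fault": f, "start": start, "end": start + length})
        t = start + length
    return events
```

Passing in a seeded rng keeps the schedule reproducible alongside the workload.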


What to look for in failures

Don’t just ask “did it fail?” Ask “what class of failure?” Was it a stale read, a lost update, a duplicated retry effect, or a documented claim that was never accurate?

Each class maps to different remediation owners (storage engine, replication, retry semantics, docs/API claim mismatch).
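
A lightweight way to make that routing explicit (the class names here are illustrative):

```python
# Route each failure class to the team that owns the fix.
remediation_owner = {
    "lost-update":      "storage engine",
    "stale-read":       "replication",
    "duplicate-effect": "client retry semantics",
    "claim-mismatch":   "docs / API contract",
}
```

Filing the bug against the owning class, not just "consistency broke", is what keeps triage fast.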


Promotion policy (recommended)

Ship gates before production: for example, zero consistency violations in smoke and stress phases, and anomaly counts within an explicit budget for long soak runs.

If your release process has no anomaly budget, it effectively accepts unlimited consistency regressions.
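
An anomaly budget can be enforced mechanically. A sketch of a CI gate (the class names and budget shape are assumptions, not a standard format):

```python
def release_gate(report, budget):
    """CI gate: fail the build when any anomaly class exceeds its budget.
    Classes not listed in the budget default to zero tolerance."""
    violations = {
        cls: count for cls, count in report.items()
        if count > budget.get(cls, 0)
    }
    return (len(violations) == 0, violations)

ok, why = release_gate({"stale-read": 2, "lost-update": 0},
                       {"stale-read": 0})
assert not ok and "stale-read" in why
```

Defaulting unlisted classes to zero is the important design choice: new anomaly classes block the release until someone explicitly budgets for them.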


Operationalizing this beyond one-off tests

Correctness is a product contract, not just a database-internals concern.


Maelstrom as a feeder system

Before full Jepsen campaigns on real clusters, use Maelstrom to prototype algorithms and failure handling cheaply: it runs your node as an ordinary binary speaking JSON over stdin/stdout, simulates the network (including partitions) for you, and ships standard workloads with checkers built in.

Maelstrom is excellent for early design iteration; Jepsen is where product claims meet real deployment conditions.
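
A Maelstrom node is a plain process exchanging one JSON message per line over stdin/stdout. A minimal echo-handler sketch (omitting the init handshake a real node must also answer):

```python
import itertools
import json
import sys

_msg_ids = itertools.count(1)

def handle(msg):
    """Build a reply to a Maelstrom echo message: swap src/dest and
    acknowledge with echo_ok, referencing the original msg_id."""
    body = msg["body"]
    return {
        "src": msg["dest"],
        "dest": msg["src"],
        "body": {
            "type": "echo_ok",
            "msg_id": next(_msg_ids),
            "in_reply_to": body["msg_id"],
            "echo": body["echo"],
        },
    }

def serve():
    """Run the node: one JSON message per line on stdin, replies on stdout."""
    for line in sys.stdin:
        print(json.dumps(handle(json.loads(line))), flush=True)
```

Because the transport is just line-delimited JSON, the same handler logic can later be lifted into a real networked server for full Jepsen runs.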


12-point readiness checklist



One-line takeaway

If your distributed system’s consistency claim isn’t backed by recurring fault-injection history checks, it’s marketing, not a guarantee.