Jepsen in Production CI: A Practical Consistency Verification Playbook
Date: 2026-03-09
Category: knowledge
Domain: software / distributed systems / reliability
Why this matters
Many distributed systems fail not because throughput is low, but because correctness claims are wrong under faults.
- “Serializable” might behave like snapshot isolation in edge cases.
- “Strongly consistent” may degrade to stale reads during partitions.
- Timeout-heavy periods can hide ambiguous operations your app treats as committed.
A Jepsen-style program gives you a repeatable way to test those claims before production incidents do it for you.
Core idea
Jepsen treats your system as a black box and repeatedly runs this loop:
- Generate concurrent client operations (reads/writes/txns)
- Inject failures via a nemesis (e.g., partitions)
- Record full operation histories (invocations + completions)
- Check if the observed history is legal under your target consistency model
If the checker cannot find any legal explanation for an observed history under your claimed model, that's a correctness bug (or a claim mismatch).
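The loop above can be sketched in a few dozen lines. This is a toy driver, not the real Jepsen framework (which is written in Clojure): `do_op` and `nemesis_step` are hypothetical hooks into your system under test, and ambiguous outcomes are recorded as `info` entries, mirroring Jepsen's convention for indeterminate operations.

```python
import random
import threading
import time

def run_campaign(do_op, nemesis_step, duration_s=1.0, workers=4, seed=0):
    """Toy Jepsen-style loop: concurrent workers invoke operations, a nemesis
    thread injects faults, and every invocation/completion lands in one shared
    history for offline checking."""
    history, lock = [], threading.Lock()
    stop = threading.Event()

    def record(entry):
        with lock:
            history.append(entry)

    def worker(wid):
        rng = random.Random(seed * 1000 + wid)   # deterministic per-seed reruns
        while not stop.is_set():
            if rng.random() < 0.5:
                op = {"f": "write", "value": rng.randint(0, 9)}
            else:
                op = {"f": "read", "value": None}
            record({"process": wid, "type": "invoke", **op})
            try:
                result = do_op(op)               # talk to the system under test
                record({"process": wid, "type": "ok", "f": op["f"], "value": result})
            except Exception:
                # Timeout/error: the outcome is ambiguous, not a clean failure.
                record({"process": wid, "type": "info", **op})

    def nemesis():
        rng = random.Random(seed)
        while not stop.is_set():
            nemesis_step(rng)                    # e.g. partition or heal
            time.sleep(0.05)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    threads.append(threading.Thread(target=nemesis))
    for t in threads:
        t.start()
    time.sleep(duration_s)
    stop.set()
    for t in threads:
        t.join()
    return history
```

The key property is that the history captures both invocations and completions, so a checker can later reason about which operations overlapped in real time.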
The four contracts you must define first
1) Claim contract (what you think you provide)
Write an explicit claim matrix:
- single-key reads/writes: linearizable?
- multi-key transactions: serializable / snapshot isolation?
- client session guarantees: monotonic reads/writes?
- real-time ordering required, or only logical ordering?
If this matrix is vague, test outcomes become argument theater.
2) Workload contract (what clients actually do)
Use workloads that preserve forensic signal:
- register/CAS style ops for linearizability checks
- append-list/rw-register transaction patterns for isolation anomaly detection
- bounded key spaces to force contention (otherwise anomalies hide)
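A minimal generator for the register/CAS shape, sketched under the assumption that operations are plain dicts; the small `keys` default is deliberate, per the bounded-key-space point above.

```python
import random

def gen_register_ops(seed, keys=5, n=100):
    """Register/CAS workload over a deliberately small key space, so that
    concurrent operations collide often enough to expose anomalies."""
    rng = random.Random(seed)
    ops = []
    for _ in range(n):
        k = rng.randrange(keys)
        f = rng.choice(["read", "write", "cas"])
        if f == "read":
            ops.append({"f": "read", "key": k})
        elif f == "write":
            ops.append({"f": "write", "key": k, "value": rng.randrange(10)})
        else:  # compare-and-set: succeeds only if the current value equals old
            ops.append({"f": "cas", "key": k,
                        "old": rng.randrange(10), "new": rng.randrange(10)})
    return ops
```

CAS operations matter for forensic signal: a successful CAS pins down exactly which prior value the system observed, which reads alone cannot do.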
3) Fault contract (what can go wrong)
At minimum, test:
- network partitions/heals
- process crashes/restarts
- delayed/lossy networking
- sustained high concurrency + timeout pressure
In real VM environments, also include clock-fault scenarios (skew, jumps) when your system's guarantees depend on timestamps.
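One way to turn the fault list above into a runnable plan is an alternating inject/heal schedule with randomized hold times. Fault names and timing parameters here are illustrative, not a Jepsen API.

```python
import random

def nemesis_schedule(seed, duration_s, faults=("partition", "kill", "delay"),
                     min_hold=5.0, max_hold=30.0):
    """Sketch of a fault schedule: pick a fault, hold it for a random window,
    heal it, wait a random recovery window, repeat until time runs out."""
    rng = random.Random(seed)
    t, plan = 0.0, []
    while t < duration_s:
        fault = rng.choice(faults)
        hold = rng.uniform(min_hold, max_hold)
        plan.append((round(t, 2), "inject", fault))
        plan.append((round(t + hold, 2), "heal", fault))
        t += hold + rng.uniform(min_hold, max_hold)   # quiet recovery window
    return plan
```

Keeping the schedule seeded and precomputed makes failing runs reproducible, which matters once you start archiving seeds.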
4) Verdict contract (how you decide pass/fail)
- Pass: checker says valid for the claimed model
- Fail: checker returns concrete anomaly/counterexample
- Unknown: analysis could not conclude (resource/complexity limit); treat as non-pass for promotions
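The three-way verdict can be encoded directly. The input shape here loosely mirrors Jepsen's checker result maps (where `:valid?` can be true, false, or `:unknown`), translated into a Python dict for illustration.

```python
from enum import Enum

class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    UNKNOWN = "unknown"

def verdict(checker_result):
    """Map a checker result ({'valid': True | False | 'unknown', ...}) to a
    promotion verdict. Anything that is not a definitive True is a non-pass."""
    valid = checker_result.get("valid")
    if valid is True:
        return Verdict.PASS
    if valid is False:
        return Verdict.FAIL
    # Resource/complexity limit, crash, or missing result: never promote on it.
    return Verdict.UNKNOWN
```

The point of the explicit three-valued result is that "the checker ran out of memory" must never be laundered into "pass".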
Checker selection: when to use what
Knossos (linearizability-focused)
Use when validating single-object/register-style semantics, CAS behavior, lock-like APIs, or per-key strict real-time claims.
Practical note: treat checker output as evidence to inspect, not blind truth. Keep minimal failing histories and visualization artifacts for human review.
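For intuition, a brute-force single-register check fits on a page. Knossos uses far smarter search than this, but the contract is the same: find some total order of operations that respects real-time precedence and register semantics, or report that none exists. History entries here use a simplified dict shape and assume every invocation completes (no ambiguous ops).

```python
def linearizable(history):
    """Brute-force linearizability check for one register. History entries:
    {'process', 'type': 'invoke' | 'ok', 'f': 'read' | 'write', 'value'}."""
    # Pair each invocation with its completion to get real-time intervals.
    ops, pending = [], {}
    for i, e in enumerate(history):
        if e["type"] == "invoke":
            pending[e["process"]] = (i, e)
        else:
            start, inv = pending.pop(e["process"])
            ops.append({"start": start, "end": i, "f": inv["f"],
                        "value": e["value"] if inv["f"] == "read" else inv["value"]})

    def search(remaining, reg):
        if not remaining:
            return True
        # An op may linearize next only if no other remaining op completed
        # strictly before it was invoked (real-time order must be respected).
        min_end = min(o["end"] for o in remaining)
        for o in remaining:
            if o["start"] > min_end:
                continue
            if o["f"] == "read" and o["value"] != reg:
                continue
            rest = [r for r in remaining if r is not o]
            if search(rest, o["value"] if o["f"] == "write" else reg):
                return True
        return False

    return search(ops, None)  # register starts uninitialized (None)
```

A history where a read returns a stale value after an acknowledged newer write has no valid ordering, and the search correctly reports it as non-linearizable.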
Elle (transaction isolation-focused)
Use for multi-key transactional stores where you need anomaly-level diagnosis (e.g., G1/G2 class anomalies, read skew classes, dependency-cycle evidence).
Elle is especially useful because it explains why a history is impossible under the promised isolation level by showing dependency cycles/minimal witnesses.
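Elle's end product, a dependency cycle, can be illustrated with a small DFS over transaction-dependency edges. The edge kinds (`ww`, `wr`, `rw`) follow Adya's formalism; inferring those edges from observed histories is the hard part Elle actually solves, and is omitted here.

```python
def find_cycle(edges):
    """Find one cycle in a transaction dependency graph, as a witness that no
    serial order exists. Edges are (from_txn, to_txn, kind) triples, e.g.
    ('T1', 'T2', 'wr') meaning T2 read something T1 wrote."""
    graph = {}
    for a, b, _kind in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])
    state = {n: "unvisited" for n in graph}

    def dfs(n, path):
        state[n] = "in_progress"
        path.append(n)
        for m in graph[n]:
            if state[m] == "in_progress":          # back edge: cycle found
                return path[path.index(m):] + [m]
            if state[m] == "unvisited":
                cycle = dfs(m, path)
                if cycle:
                    return cycle
        path.pop()
        state[n] = "done"
        return None

    for n in list(graph):
        if state[n] == "unvisited":
            cycle = dfs(n, [])
            if cycle:
                return cycle
    return None
```

A cycle like T1 → T2 (wr) → T1 (rw) is exactly the kind of minimal witness Elle reports: it proves the two transactions cannot be placed in any serial order.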
Campaign design (what to run every week)
Phase A: fast smoke (5–15 min)
- low node count
- one fault family per run
- goal: catch obvious regressions in CI
Phase B: stress sweep (30–90 min)
- varied contention (key count, txn length)
- varied nemesis intervals
- multiple seeds
- goal: probabilistic bug surfacing
Phase C: pre-release soak (2–12 h)
- long run, mixed faults
- failover and reconnect churn
- goal: expose low-frequency pathologies
Store every failing seed/history/checker report for deterministic reruns.
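Archiving can be as simple as one directory per seed. The layout below is illustrative, not a Jepsen convention; the essential property is that seed, history, and checker report live together so any failing run can be replayed deterministically.

```python
import json
import pathlib

def archive_run(run_dir, seed, history, checker_report):
    """Persist everything needed for a deterministic rerun: the seed (encoded
    in the directory name), the full operation history, and the checker report."""
    d = pathlib.Path(run_dir) / f"seed-{seed}"
    d.mkdir(parents=True, exist_ok=True)
    (d / "history.json").write_text(json.dumps(history, indent=2))
    (d / "report.json").write_text(json.dumps(checker_report, indent=2))
    return d
```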
Fault menu that gives high signal quickly
Start simple, then compose:
- Partition random halves ↔ heal
- Leader/node kill loops
- Latency + packet loss jitter
- Client timeout tightening (ambiguous ops pressure)
- Combined faults (partition + restart + high concurrency)
Most teams under-test combined faults; many real incidents are cross-fault interactions.
What to look for in failures
Don’t just ask “did it fail?” Ask “what class of failure?”
- stale read after acknowledged newer write
- aborted/failed txn effects becoming visible
- cyclic dependency anomalies in claimed serializable mode
- contradictory version orders (impossible histories)
- large timeout bands with hidden side effects
Each class maps to different remediation owners (storage engine, replication, retry semantics, docs/API claim mismatch).
Promotion policy (recommended)
Ship gates before production:
- no new anomaly class vs baseline
- no increase in anomaly frequency over rolling N runs
- unknown-rate below threshold (or blocked)
- all known anomalies either fixed or explicitly accepted with documented customer impact
If your release process has no anomaly budget, it effectively accepts unlimited consistency regressions.
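The gate conditions above can be checked mechanically. The run/result shapes and the unknown-rate threshold here are assumptions for illustration, not a standard format.

```python
def promotion_gate(runs, baseline_classes, max_unknown_rate=0.05):
    """Decide promotion from a batch of run results. Each run is a dict:
    {'verdict': 'pass' | 'fail' | 'unknown', 'anomaly_classes': [names]}."""
    seen = set()
    for run in runs:
        seen |= set(run["anomaly_classes"])
    new_classes = seen - set(baseline_classes)
    unknown_rate = sum(r["verdict"] == "unknown" for r in runs) / max(len(runs), 1)
    promote = (not new_classes
               and unknown_rate <= max_unknown_rate
               and all(r["verdict"] != "fail" for r in runs))
    return {"promote": promote,
            "new_classes": sorted(new_classes),
            "unknown_rate": unknown_rate}
```

`baseline_classes` is the explicitly-accepted anomaly budget; anything outside it blocks the release until fixed or consciously added to the budget.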
Operationalizing this beyond one-off tests
- Put Jepsen runs on a release cadence, not ad-hoc heroics.
- Keep a historical anomaly registry (first seen, fixed in, reproducibility seed).
- Version your claim matrix with product docs and SLAs.
- Feed findings into client SDK guidance (timeouts, retries, quorum/consistency flags).
Correctness is a product contract, not just a database-internals concern.
Maelstrom as a feeder system
Before full Jepsen campaigns on real clusters, use Maelstrom to prototype algorithms and failure handling cheaply:
- local, simulated network
- language-agnostic node binaries via stdin/stdout JSON protocol
- built-in workloads, fault injection, and checker outputs
Maelstrom is excellent for early design iteration; Jepsen is where product claims meet real deployment conditions.
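A complete Maelstrom node for the built-in echo workload fits in about thirty lines of any language. This Python sketch follows the documented stdin/stdout protocol: one JSON message per line, an `init`/`init_ok` handshake, and `in_reply_to` correlation on every reply.

```python
import json
import sys

def handle(msg, node):
    """Build a reply for one Maelstrom message, or None if the type is unhandled."""
    body = msg["body"]
    if body["type"] == "init":
        node["id"] = body["node_id"]        # Maelstrom assigns our node id
        reply = {"type": "init_ok"}
    elif body["type"] == "echo":
        reply = {"type": "echo_ok", "echo": body["echo"]}
    else:
        return None
    node["next_id"] += 1
    reply["msg_id"] = node["next_id"]
    reply["in_reply_to"] = body["msg_id"]
    return {"src": msg["dest"], "dest": msg["src"], "body": reply}

def main():
    # Maelstrom sends one JSON message per stdin line; replies go to stdout.
    node = {"id": None, "next_id": 0}
    for line in sys.stdin:
        out = handle(json.loads(line), node)
        if out:
            print(json.dumps(out), flush=True)

# Under Maelstrom, invoke main(); the loop exits when stdin closes.
```

Because the node is just a binary speaking JSON over pipes, you can prototype replication or transaction logic in whatever language your team ships, then graduate the same design to a real-cluster Jepsen campaign.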
12-point readiness checklist
- Claim matrix written (model by API/path)
- Workloads mapped to real production access patterns
- Contention intentionally forced in at least one suite
- Nemesis set includes partition + crash + delay/loss
- Checker choice mapped to claim type (Knossos/Elle)
- Seeds and full histories archived
- Unknown outcomes treated as non-pass
- Artifact triage owner assigned
- Regression baseline tracked over time
- Release gate wired to anomaly policy
- Documentation updated when claim scope changes
- Postmortems link incidents back to missing/weak tests
References
- Jepsen framework repository (architecture overview): https://github.com/jepsen-io/jepsen
- Jepsen nemesis tutorial (fault injection flow): https://github.com/jepsen-io/jepsen/blob/main/doc/tutorial/05-nemesis.md
- Jepsen consistency model map: https://jepsen.io/consistency/models
- Knossos linearizability checker: https://github.com/jepsen-io/knossos
- Elle transactional anomaly checker: https://github.com/jepsen-io/elle
- Elle paper (VLDB 2021 preprint): https://arxiv.org/pdf/2003.10554
- Example Jepsen transactional analysis (PostgreSQL 12.3): https://jepsen.io/analyses/postgresql-12.3
- Adya, Liskov, O'Neil: Generalized Isolation Level Definitions: http://pmg.csail.mit.edu/papers/icde00.pdf
- Maelstrom workbench (Jepsen-powered learning/test harness): https://github.com/jepsen-io/maelstrom
One-line takeaway
If your distributed system’s consistency claim isn’t backed by recurring fault-injection history checks, it’s marketing, not a guarantee.