CAP vs PACELC: Distributed Database Selection Playbook
Date: 2026-03-05
Category: software
Purpose: A practical guide for selecting consistency/latency tradeoffs in distributed databases, and turning abstract CAP/PACELC theory into production operating rules.
Why this matters
Teams still make expensive architecture mistakes by asking only one question:
- “Is this AP or CP?”
In production, most pain happens outside hard partitions:
- p99 latency blowups from cross-region quorum paths
- stale reads causing business invariants to break quietly
- retry storms when timeout policies are mismatched to the consistency mode
- accidental global consistency on low-value reads
PACELC is useful because it forces the second question:
- If there is a Partition (P), choose Availability (A) or Consistency (C). Else (E), choose Latency (L) or Consistency (C).
This is usually the real day-to-day tradeoff.
Mental model (operator version)
CAP (failure mode lens)
During a partition or severe communication failure, you cannot guarantee both:
- immediate consistency across replicas
- and full availability for every request
So you choose where to fail:
- fail reads/writes (protect correctness)
- or accept divergence temporarily (protect availability)
PACELC (steady-state lens)
Even when the network is healthy:
- stronger consistency often puts extra round trips or quorum coordination on the critical path
- lower latency often relaxes freshness/ordering guarantees
CAP is about emergency behavior.
PACELC is about normal behavior.
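The steady-state cost of coordination can be made concrete with a toy simulation. All numbers and distributions below are illustrative assumptions, not measurements: a local read touches one nearby replica, while a quorum read must wait for the second-fastest of three replicas, two of them cross-region.

```python
import random

def sample_rtt_ms(distance_ms):
    # One request/response: network distance plus jittered service time.
    return distance_ms + random.expovariate(1 / 5.0)

def local_read_ms():
    # EL path: answer from the nearest replica, no coordination.
    return sample_rtt_ms(1)

def quorum_read_ms(replica_distances_ms, quorum):
    # EC path: wait for the slowest of the fastest `quorum` replies.
    rtts = sorted(sample_rtt_ms(d) for d in replica_distances_ms)
    return rtts[quorum - 1]

def p99(samples):
    return sorted(samples)[int(len(samples) * 0.99)]

random.seed(42)
REPLICAS_MS = [1, 40, 80]  # one local replica, two remote regions
local_lat = [local_read_ms() for _ in range(10_000)]
quorum_lat = [quorum_read_ms(REPLICAS_MS, quorum=2) for _ in range(10_000)]
print(f"local  read p99: {p99(local_lat):6.1f} ms")
print(f"quorum read p99: {p99(quorum_lat):6.1f} ms")
```

The quorum p99 is dominated by the nearest remote region's round trip, and that gap exists with a perfectly healthy network. That is exactly PACELC's E branch.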
Practical system archetypes
1) PA/EL style systems (availability + low latency by default)
Typical behavior:
- stay available through partitions
- prefer local/fast reads in healthy operation
- expose tunable consistency levels per request/workload
Common examples in practice:
- Dynamo-style systems
- Cassandra-family systems
- eventually consistent global KV/document setups
Good fit when:
- user-facing latency budget is strict
- stale reads are tolerable for many paths
- you can design with idempotency + reconciliation
Main risk:
- correctness bugs move from infra into app/business logic
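One way to make "tunable consistency per request" explicit in application code is a thin wrapper that forces callers to name a level. The client and store below are hypothetical stand-ins, not a real driver API; in Cassandra-family drivers the same idea maps to a per-query consistency setting.

```python
from enum import Enum

class Consistency(Enum):
    ONE = "one"        # fastest: nearest replica, may be stale
    QUORUM = "quorum"  # majority coordination on the read path

class FakeStore:
    """In-memory stand-in so the sketch runs; records requested levels."""
    def __init__(self):
        self.data = {}
        self.calls = []

    def read(self, key, level):
        self.calls.append((key, level))
        return self.data.get(key)

class LaneAwareClient:
    """Hypothetical wrapper: every read carries an explicit level,
    defaulting to the fast lane rather than a silent global policy."""
    def __init__(self, store, default=Consistency.ONE):
        self.store = store
        self.default = default

    def read(self, key, consistency=None):
        return self.store.read(key, consistency or self.default)

store = FakeStore()
store.data["profile:42"] = {"name": "Ada"}
client = LaneAwareClient(store)
client.read("profile:42")                      # experience lane: fast local read
client.read("balance:42", Consistency.QUORUM)  # correctness lane: stronger read
```

The point of the wrapper is auditability: every call site states which lane it is in, so a grep answers "which endpoints read stale?".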
2) PC/EC style systems (consistency first, even at latency cost)
Typical behavior:
- preserve strong semantics via coordinated writes/reads
- may reject or delay operations under partition/failure
- steady-state critical path includes stronger coordination
Common examples in practice:
- externally consistent globally replicated SQL designs
- strict serializable transaction-first systems
Good fit when:
- correctness violations are expensive/unsafe
- cross-entity invariants dominate over tail latency
- explicit unavailability is preferable to silent divergence
Main risk:
- latency and availability surprises if SLOs were built on optimistic assumptions
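The "reject or delay under partition" behavior can be sketched as a fail-closed quorum write. Replicas are modeled as plain callables, and the names and quorum logic are illustrative, not any specific system's protocol.

```python
class Unavailable(Exception):
    """Raised instead of accepting divergence when quorum is unreachable."""

def quorum_write(value, replicas, quorum):
    # Count acks; a replica that times out contributes nothing.
    acks = 0
    for replica in replicas:
        try:
            replica(value)
            acks += 1
        except TimeoutError:
            pass
    if acks < quorum:
        # PC/EC choice: explicit unavailability over silent split-brain.
        raise Unavailable(f"only {acks}/{quorum} acks, refusing to diverge")
    return acks

def healthy(value):
    pass  # immediate ack

def partitioned(value):
    raise TimeoutError("replica unreachable")

# Majority reachable: the write commits.
print(quorum_write("debit:42", [healthy, healthy, partitioned], quorum=2))
```

With two of three replicas partitioned, the same call raises `Unavailable` instead of committing locally, which is the tradeoff this archetype makes on purpose.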
Decision matrix (copy this into design docs)
Use per workload, not per company.
Can stale reads cause financial/legal/safety issues?
- yes → bias toward C in E branch (EC)
- no → EL may be acceptable
Is write availability mandatory during regional impairment?
- yes → bias toward A in P branch (PA)
- no → CP/PC behavior acceptable
What is your p99 budget at peak?
- a tight budget (under ~100ms on a global path) often pushes many read paths toward EL
Do you have mature compensation/reconciliation flows?
- if no, do not over-index on AP; hidden inconsistency debt will accumulate
Can the product tolerate explicit “try again” errors?
- if yes, stronger consistency + controlled fail-fast may be safer overall
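The matrix can be encoded directly in a design-doc lint script. The function below is one illustrative encoding of the questions above; the exact branching is a judgment call, not a standard algorithm.

```python
def recommend_mode(stale_reads_risky, write_availability_required,
                   tight_p99, mature_reconciliation):
    """Return a PACELC-style label such as "PA/EL" for one workload."""
    p = "A" if write_availability_required else "C"
    if stale_reads_risky or not mature_reconciliation:
        e = "C"  # financial/legal risk, or no repair flows: pay for EC
    elif tight_p99:
        e = "L"  # explicit freshness tolerance + tight budget: take EL
    else:
        e = "C"  # latency budget allows it: default to correctness
    return f"P{p}/E{e}"
```

Applied per workload: a payments path with risky stale reads lands on PC/EC, while a feed read with a tight p99 budget and mature reconciliation lands on PA/EL.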
Pattern: consistency by lane, not one-size-fits-all
Most robust deployments use mixed policy lanes:
Lane A — correctness-critical
Examples:
- money movement
- inventory hard reservations
- permission/entitlement mutations
Policy:
- strong/transactional semantics
- strict retry constraints
- explicit unavailability preferred over silent split-brain behavior
Lane B — user experience critical
Examples:
- feed reads
- counters, non-critical profile views
- recommendation materialization
Policy:
- low-latency local reads
- bounded staleness allowed
- async repair and conflict policy
Lane C — analytic/background
Examples:
- dashboards
- backfills
- nearline aggregation
Policy:
- throughput and cost optimized
- eventual consistency acceptable
- freshness SLO explicit (minutes/hours)
If every lane uses “strongest possible consistency,” you overpay in latency and cost.
If every lane uses “fastest possible reads,” you accumulate correctness incidents.
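One lightweight way to keep lanes explicit is a checked-in policy map that code review can see. Lane names, fields, and values below are illustrative, not a standard schema.

```python
LANE_POLICY = {
    "A-correctness": {"consistency": "strong", "on_partition": "fail-closed",
                      "retries": "idempotency key, low cap",
                      "freshness_slo": "read-your-writes"},
    "B-experience":  {"consistency": "bounded-staleness", "on_partition": "serve-stale",
                      "retries": "hedged, safe",
                      "freshness_slo": "<= 5 s"},
    "C-background":  {"consistency": "eventual", "on_partition": "queue-and-catch-up",
                      "retries": "bulk, resumable",
                      "freshness_slo": "<= 1 h"},
}

def policy_for(endpoint, lane_of):
    # Unknown endpoints default to the strict lane, so a missing
    # classification fails safe instead of fast-and-stale.
    return LANE_POLICY[lane_of.get(endpoint, "A-correctness")]

LANE_OF = {"GET /feed": "B-experience", "POST /transfers": "A-correctness"}
```

Defaulting unclassified endpoints to the strict lane is a deliberate design choice: it makes forgetting the classification an overpayment in latency rather than a correctness incident.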
Failure policy table (must be explicit)
For each API/transaction class, define:
- partition behavior: fail-closed vs accept-divergence
- timeout budget: end-to-end and per-hop
- retry contract: idempotency key required? yes/no
- conflict handling: last-write-wins / compare-and-set / merge function
- freshness contract: strong / bounded-staleness / eventual
Without this table, teams unintentionally run a random PACELC policy.
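The failure policy table can live in code as a typed record per API class, so a missing field is a type error rather than a meeting. The fields mirror the list above; the endpoint entries are made-up examples.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FailurePolicy:
    partition_behavior: str        # "fail-closed" | "accept-divergence"
    timeout_ms: int                # end-to-end budget
    per_hop_timeout_ms: int
    idempotency_key_required: bool
    conflict_handling: str         # "lww" | "cas" | "merge"
    freshness: str                 # "strong" | "bounded" | "eventual"

POLICIES = {
    "POST /transfers": FailurePolicy("fail-closed", 2000, 500, True, "cas", "strong"),
    "GET /feed":       FailurePolicy("accept-divergence", 300, 100, False, "lww", "bounded"),
}
```

Because the dataclass is frozen and every field is positional-required, a new endpoint cannot be added without answering every question in the table.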
Anti-patterns seen in real incidents
“AP infra, CP assumptions”
- App logic assumes read-your-writes; the infrastructure only provides eventual consistency.
Global strong reads for all endpoints
- p99 explodes and retry storms amplify load.
No staleness observability
- Team cannot see replica lag until users report anomalies.
Retry without idempotency
- transient failures become duplicate side effects.
Region failover drills without consistency drills
- availability tested; correctness under failover untested.
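The "retry without idempotency" failure is the applied-but-ack-lost case. The sketch below uses a hypothetical client and an in-memory server that dedupes on the key; it shows why the same idempotency key must be reused across attempts of one operation.

```python
import uuid

class FakeServer:
    """Hypothetical server: dedupes on the idempotency key, and drops
    the ack for the first attempt (op applied, response lost)."""
    def __init__(self):
        self.applied = {}
        self._drop_next_ack = True

    def apply(self, key, op):
        if key not in self.applied:
            self.applied[key] = op  # side effect happens once per key
        if self._drop_next_ack:
            self._drop_next_ack = False
            raise TimeoutError("ack lost in transit")
        return self.applied[key]

def submit_with_retries(server, op, max_attempts=3):
    key = str(uuid.uuid4())  # ONE key for every attempt of this op
    last_err = None
    for _ in range(max_attempts):
        try:
            return server.apply(key, op)
        except TimeoutError as err:
            last_err = err  # retry with the SAME key, never a fresh one
    raise last_err

server = FakeServer()
result = submit_with_retries(server, "debit account 42 by $10")
```

Minting a fresh key per attempt would reproduce the anti-pattern exactly: the first timed-out attempt already applied the debit, and the retry would apply it again.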
Metrics that make PACELC concrete
Track these by workload lane:
- p50/p95/p99 read latency (local vs cross-region)
- write commit latency and quorum path distribution
- replica lag / staleness age histogram
- stale-read incidence on key invariants
- conflict/repair rate (and mean repair lag)
- fail-open vs fail-closed request ratio during incidents
If you cannot measure stale-read or repair lag, you are flying blind on the EL side.
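Staleness only becomes an SLO once it is bucketed. A minimal staleness-age histogram might look like this; the bucket edges are illustrative and should come from each lane's freshness contract.

```python
def staleness_histogram(ages_s, edges=(0.1, 1.0, 5.0, 30.0)):
    """Bucket replica staleness ages (seconds behind the primary's
    commit) so 'how stale are reads?' is a chart, not a guess."""
    labels = [f"<={e}s" for e in edges] + [f">{edges[-1]}s"]
    counts = [0] * len(labels)
    for age in ages_s:
        for i, edge in enumerate(edges):
            if age <= edge:
                counts[i] += 1
                break
        else:
            counts[-1] += 1  # overflow bucket: the one to alert on
    return dict(zip(labels, counts))

sample_ages = [0.05, 0.5, 2.0, 10.0, 60.0]
print(staleness_histogram(sample_ages))
```

An alert on the overflow bucket turns "users report anomalies" into "replication lag breached the lane's freshness SLO".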
Rollout plan (safe migration)
- Classify endpoints into A/B/C lanes.
- Add idempotency and retry policy first.
- Introduce per-request consistency controls behind flags.
- Shadow-measure latency + staleness before policy switch.
- Promote one lane at a time with rollback criteria.
- Game-day partition and region-isolation drills quarterly.
Rule-of-thumb cheatsheet
- Use strong consistency where business invariants live.
- Use low-latency eventual paths where freshness tolerance is explicit.
- Treat CAP as incident semantics and PACELC as daily SLO economics.
- Prefer intentional partial unavailability over invisible correctness drift in critical flows.
References
- CAP formalization perspective (Gilbert & Lynch):
https://groups.csail.mit.edu/tds/papers/Gilbert/Brewer2.pdf
- Brewer, “CAP Twelve Years Later”:
https://sites.cs.ucsb.edu/~rich/class/cs293b-cloud/papers/brewer-cap.pdf
- PACELC overview:
https://en.wikipedia.org/wiki/PACELC_theorem
- Azure Cosmos DB consistency levels (PACELC context):
https://learn.microsoft.com/en-us/azure/cosmos-db/consistency-levels
- Amazon DynamoDB read consistency docs:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadConsistency.html
- Apache Cassandra architecture/consistency concepts:
https://cassandra.apache.org/doc/latest/cassandra/architecture/dynamo.html
- Google Spanner TrueTime + external consistency docs:
https://docs.cloud.google.com/spanner/docs/true-time-external-consistency
One-line takeaway
Don’t choose a database by CAP label alone; choose workload-specific failure and latency semantics, then enforce them with metrics, retries, and lane-specific consistency contracts.