WAL + Checkpoint Durability-Latency Tradeoff: Production Playbook
Date: 2026-03-03
Category: knowledge
Domain: software / databases / reliability engineering
Why this matters
Most teams tune query plans and cache hit-rates, but ignore the write path durability contract.
That contract is where outages become either:
- a clean restart, or
- silent data loss / long recovery / corrupted confidence.
WAL and checkpoint behavior determine your real guarantees under crash, not your happy-path benchmark.
Core principle
Separate three concerns explicitly:
- Commit acknowledgment policy (when client sees success)
- Durability boundary (what survives OS/power crash)
- Recovery strategy (how strictly you replay or reject uncertain tail data)
If these are not documented, your system is running on folklore.
1) Mental model (portable across engines)
Every WAL engine has the same loop:
- Append change intent to log
- Optionally flush/sync log to durable media
- Acknowledge commit
- Later checkpoint/flush data files
- On crash, replay WAL from last safe point
What actually changes by configuration
- Latency: how often you wait on fsync/barriers
- Loss window: how much acknowledged work can disappear
- Recovery behavior: fail fast vs salvage partial tail
2) Durable mode taxonomy (use this in design docs)
Mode A — Strict durability
- Commit waits for durable WAL sync.
- Best for balances, orders, irreversible state transitions.
- Cost: higher p99 commit latency.
Mode B — Bounded-loss async durability
- Commit can return before durable sync.
- Useful for telemetry, derived views, cache-like state.
- Cost: crash can lose recent acknowledged writes.
Mode C — Best-effort / salvage-oriented
- Recovery may ignore damaged tail or recover to last consistent point.
- Useful when replicas/rebuild pipelines exist.
- Cost: explicit acceptance of partial recent loss.
Do not mix modes implicitly. Label tables/topics/streams by required mode.
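One way to make that labeling explicit is a registry that fails loudly on unlabeled data instead of falling back to an implicit default. A sketch — the dataset names and `DURABILITY_REGISTRY` are hypothetical:

```python
from enum import Enum

class DurabilityMode(Enum):
    STRICT = "A"        # commit waits for durable WAL sync
    BOUNDED_LOSS = "B"  # ack may precede sync; loss window is bounded
    BEST_EFFORT = "C"   # salvage-oriented recovery is accepted

# Every table/topic/stream must appear here; absence is an error, not a default.
DURABILITY_REGISTRY = {
    "orders": DurabilityMode.STRICT,
    "account_balances": DurabilityMode.STRICT,
    "click_telemetry": DurabilityMode.BOUNDED_LOSS,
    "feature_cache": DurabilityMode.BEST_EFFORT,
}

def durability_mode(dataset: str) -> DurabilityMode:
    """Look up the documented mode; refuse to guess for unlabeled datasets."""
    try:
        return DURABILITY_REGISTRY[dataset]
    except KeyError:
        raise ValueError(f"{dataset!r} has no documented durability mode")
```

Wiring this check into schema reviews or topic provisioning is what turns "do not mix modes implicitly" from advice into policy.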
3) PostgreSQL playbook
From PostgreSQL docs:
- Checkpoints start every `checkpoint_timeout` or when `max_wal_size` is near its limit (defaults: 5 minutes / 1 GB).
- `fsync=on` protects consistency after an OS/hardware crash.
- `synchronous_commit=off` may lose recent commits, but the docs state it does not risk database inconsistency.
- `full_page_writes=on` protects against torn-page scenarios after a crash.
Practical tuning pattern
- Keep
fsync=on,full_page_writes=onin production unless DB is explicitly disposable. - Use
synchronous_commit=offonly for explicitly noncritical write paths. - Increase
max_wal_sizeand tunecheckpoint_timeoutto avoid checkpoint thrash. - Keep
checkpoint_completion_targetnear default (0.9) to smooth checkpoint I/O. - Watch if checkpoints are requested too frequently; that usually means WAL budget is too small.
Failure semantics to document
- Which transactions are allowed async commit?
- Maximum acceptable lost-ack window for those transactions?
- What runbook marks suspect time interval after crash?
4) SQLite WAL playbook
From SQLite WAL docs:
- WAL adds a third primitive: checkpointing.
- Default auto-checkpoint triggers at ~1000 pages.
- Readers and writer can run concurrently, but long-lived readers can block checkpoint progress.
- `PRAGMA synchronous=FULL` syncs the WAL on every commit; `NORMAL` defers durability to checkpoint time (a power loss can roll back the most recent commits without corrupting the database) and avoids most commit-time sync cost.
Practical tuning pattern
- Use WAL mode when read/write concurrency matters.
- Keep read transactions short; long readers can inflate WAL and degrade reads.
- Move checkpoints to controlled background/idle windows if commit-latency spikes are visible.
- Define upper WAL size SLO and alert when checkpoint starvation happens.
Common production trap
Teams enable WAL for speed, but never add checkpoint observability. Then one sticky reader causes WAL bloat and sudden latency cliffs.
5) RocksDB playbook
From RocksDB wiki:
- WAL reconstructs memtable state on failure.
- `WriteOptions.sync=true` fsyncs the WAL before the write returns.
- `WriteOptions.sync=false` (the default) is faster, but the latest writes are not crash-safe.
- Recovery modes encode different tradeoffs (strict, point-in-time recovery, tail-tolerant salvage).
Practical tuning pattern
- Assign CF/workload class to sync policy (strict vs noncritical).
- Budget p99 with group commit in mind: batched WAL syncs improve throughput under concurrent writers, but commit-latency tails still matter.
- Pick WAL recovery mode intentionally:
- strict corruption intolerance for critical state,
- point-in-time fallback when replica backfill exists,
- salvage mode only for disaster-recovery style use.
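That choice can be captured as a small policy helper. The enum values mirror RocksDB's documented `WALRecoveryMode` names from the wiki; `pick_recovery_mode` and its inputs are a hypothetical encoding of the bullets above:

```python
from enum import Enum

class WALRecoveryMode(Enum):
    # Value strings mirror RocksDB's WALRecoveryMode enum constants.
    ABSOLUTE_CONSISTENCY = "kAbsoluteConsistency"   # any corruption fails recovery
    POINT_IN_TIME = "kPointInTimeRecovery"          # stop replay at first bad record
    TOLERATE_CORRUPTED_TAIL = "kTolerateCorruptedTailRecords"
    SKIP_ANY_CORRUPTED = "kSkipAnyCorruptedRecords" # salvage everything readable

def pick_recovery_mode(critical: bool, has_replica_backfill: bool) -> WALRecoveryMode:
    """Hypothetical policy: strictness for critical state, point-in-time when
    a replica can backfill the lost tail, salvage only as a last resort."""
    if critical and not has_replica_backfill:
        return WALRecoveryMode.ABSOLUTE_CONSISTENCY
    if has_replica_backfill:
        return WALRecoveryMode.POINT_IN_TIME
    return WALRecoveryMode.SKIP_ANY_CORRUPTED
```

The point is not this exact decision table — it is that the decision exists in reviewed code rather than in whatever default the binary shipped with.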
6) Observability: minimum dashboard
Track per engine/workload:
- commit latency p50/p95/p99 split by sync policy
- WAL bytes/sec and segment churn rate
- checkpoint frequency/duration and backlog
- WAL size growth vs checkpoint progress
- recovery replay duration (from incident drills)
- count of writes by durability class (strict vs async)
If you cannot answer “what was our effective loss window at crash time?”, you are not operating durability.
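Answering that question requires recording both the acknowledgment stream and the last durable WAL sync. A minimal sketch of the arithmetic (function and inputs hypothetical — in practice the timestamps come from your commit log and sync instrumentation):

```python
from datetime import datetime, timedelta

def effective_loss_window(ack_times: list, last_durable_sync: datetime) -> timedelta:
    """Commits acknowledged after the last durable WAL sync are the writes a
    crash at that moment could lose; the window is how far past the sync
    point the newest such ack reaches."""
    lost = [t for t in ack_times if t > last_durable_sync]
    if not lost:
        return timedelta(0)
    return max(lost) - last_durable_sync
```

Computing this continuously (and alerting when it exceeds the agreed RPO for a durability class) is what the dashboard above is ultimately for.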
7) Game-day tests (must run before launch)
- Kill -9 / process crash during write load
- OS reboot / power-cut simulation on staging
- Long-reader checkpoint starvation (SQLite-like pattern)
- WAL tail corruption drill (RocksDB mode behavior)
- Large checkpoint pressure (Postgres max_wal_size breach scenarios)
For each drill, record:
- lost acknowledged writes (count/time window)
- recovery duration
- operator decision points
- whether alerts fired early enough
8) 12-point production checklist
- Durability modes documented per data class
- Commit acknowledgment policy documented per write path
- WAL/checkpoint knobs checked into infra-as-code
- Explicit policy for async commits (where allowed / forbidden)
- Crash drill evidence captured in runbook
- WAL growth and checkpoint lag alerts configured
- Long-reader detection added (where applicable)
- Recovery mode choice documented (e.g., RocksDB)
- Incident template includes “acknowledged-but-lost window”
- Recovery SLO (RTO) and data-loss SLO (RPO) agreed
- Non-durable settings prohibited by default policy
- Quarterly re-validation under realistic write burst
Common anti-patterns
- “fsync off only for now” becoming permanent in prod
- using async commit without tagging noncritical data classes
- treating checkpoint storms as infra noise, not durability debt
- no fault-injection rehearsal for crash-recovery path
- assuming replication removes need for local durability policy clarity
References
- PostgreSQL docs — WAL Configuration
  https://www.postgresql.org/docs/current/wal-configuration.html
- PostgreSQL docs — Write Ahead Log parameters (`fsync`, `synchronous_commit`, `full_page_writes`, etc.)
  https://www.postgresql.org/docs/current/runtime-config-wal.html
- PostgreSQL docs — Non-Durable Settings
  https://www.postgresql.org/docs/current/non-durability.html
- SQLite docs — Write-Ahead Logging
  https://sqlite.org/wal.html
- RocksDB wiki — WAL File Format
  https://github.com/facebook/rocksdb/wiki/Write-Ahead-Log-File-Format
- RocksDB wiki — WAL Recovery Modes
  https://github.com/facebook/rocksdb/wiki/WAL-Recovery-Modes
- RocksDB wiki — WAL Performance
  https://github.com/facebook/rocksdb/wiki/WAL-Performance
One-line takeaway
WAL tuning is not a performance tweak; it is your explicit contract for how truth survives a crash.