WAL + Checkpoint Durability-Latency Tradeoff: Production Playbook
Date: 2026-03-03
Category: knowledge
Domain: software / databases / reliability engineering
Why this matters
Most teams tune query plans and cache hit-rates, but ignore the write path durability contract.
That contract is where outages become either:
- a clean restart, or
- silent data loss / long recovery / corrupted confidence.
WAL and checkpoint behavior determine your real guarantees under crash, not your happy-path benchmark.
Core principle
Separate three concerns explicitly:
- Commit acknowledgment policy (when client sees success)
- Durability boundary (what survives OS/power crash)
- Recovery strategy (how strictly you replay or reject uncertain tail data)
If these are not documented, your system is running on folklore.
1) Mental model (portable across engines)
Every WAL engine has the same loop:
- Append change intent to log
- Optionally flush/sync log to durable media
- Acknowledge commit
- Later checkpoint/flush data files
- On crash, replay WAL from last safe point
What actually changes by configuration
- Latency: how often you wait on fsync/barriers
- Loss window: how much acknowledged work can disappear
- Recovery behavior: fail fast vs salvage partial tail
2) Durable mode taxonomy (use this in design docs)
Mode A — Strict durability
- Commit waits for durable WAL sync.
- Best for balances, orders, irreversible state transitions.
- Cost: higher p99 commit latency.
Mode B — Bounded-loss async durability
- Commit can return before durable sync.
- Useful for telemetry, derived views, cache-like state.
- Cost: crash can lose recent acknowledged writes.
Mode C — Best-effort / salvage-oriented
- Recovery may ignore damaged tail or recover to last consistent point.
- Useful when replicas/rebuild pipelines exist.
- Cost: explicit acceptance of partial recent loss.
Do not mix modes implicitly. Label tables/topics/streams by required mode.
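One way to make that labeling explicit is a registry that fails loudly on unlabeled data instead of falling back to an implicit default. A sketch — the dataset names and `DURABILITY_REGISTRY` are hypothetical:

```python
from enum import Enum

class DurabilityMode(Enum):
    STRICT = "A"        # commit waits for durable WAL sync
    BOUNDED_LOSS = "B"  # ack may precede sync; loss window is bounded
    BEST_EFFORT = "C"   # salvage-oriented recovery is accepted

# Every table/topic/stream must appear here; absence is an error, not a default.
DURABILITY_REGISTRY = {
    "orders": DurabilityMode.STRICT,
    "account_balances": DurabilityMode.STRICT,
    "click_telemetry": DurabilityMode.BOUNDED_LOSS,
    "feature_cache": DurabilityMode.BEST_EFFORT,
}

def durability_mode(dataset: str) -> DurabilityMode:
    """Look up the documented mode; refuse to guess for unlabeled datasets."""
    try:
        return DURABILITY_REGISTRY[dataset]
    except KeyError:
        raise ValueError(f"{dataset!r} has no documented durability mode")
```

Wiring this check into schema reviews or topic provisioning is what turns "do not mix modes implicitly" from advice into policy.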
3) PostgreSQL playbook
From PostgreSQL docs:
- Checkpoints start every `checkpoint_timeout` or when `max_wal_size` is near its limit (defaults: 5 minutes / 1 GB).
- `fsync=on` protects consistency after an OS/hardware crash.
- `synchronous_commit=off` may lose recent commits, but the docs state it does not risk database inconsistency.
- `full_page_writes=on` protects against torn-page scenarios after a crash.
Practical tuning pattern
- Keep
fsync=on,full_page_writes=onin production unless DB is explicitly disposable. - Use
synchronous_commit=offonly for explicitly noncritical write paths. - Increase
max_wal_sizeand tunecheckpoint_timeoutto avoid checkpoint thrash. - Keep
checkpoint_completion_targetnear default (0.9) to smooth checkpoint I/O. - Watch if checkpoints are requested too frequently; that usually means WAL budget is too small.
Failure semantics to document
- Which transactions are allowed async commit?
- Maximum acceptable lost-ack window for those transactions?
- What runbook marks suspect time interval after crash?
4) SQLite WAL playbook
From SQLite WAL docs:
- WAL adds a third primitive: checkpointing.
- Default auto-checkpoint triggers at ~1000 pages.
- Readers and writer can run concurrently, but long-lived readers can block checkpoint progress.
- `PRAGMA synchronous=FULL` syncs the WAL on every commit; `NORMAL` defers durability to checkpoint time (a power loss can roll back the most recent commits without corrupting the database) and avoids most commit-time sync cost.
Practical tuning pattern
- Use WAL mode when read/write concurrency matters.
- Keep read transactions short; long readers can inflate WAL and degrade reads.
- Move checkpoints to controlled background/idle windows if commit-latency spikes are visible.
- Define upper WAL size SLO and alert when checkpoint starvation happens.
Common production trap
Teams enable WAL for speed, but never add checkpoint observability. Then one sticky reader causes WAL bloat and sudden latency cliffs.
5) RocksDB playbook
From RocksDB wiki:
- WAL reconstructs memtable state on failure.
- `WriteOptions.sync=true` fsyncs the WAL before the write returns.
- `WriteOptions.sync=false` (the default) is faster, but the latest writes are not crash-safe.
- Recovery modes encode different tradeoffs (strict, point-in-time recovery, tail-tolerant salvage).
Practical tuning pattern
- Assign CF/workload class to sync policy (strict vs noncritical).
- Budget p99 with group commit in mind: batched WAL syncs improve throughput under concurrent writers, but commit-latency tails still matter.
- Pick WAL recovery mode intentionally:
- strict corruption intolerance for critical state,
- point-in-time fallback when replica backfill exists,
- salvage mode only for disaster-recovery style use.
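That choice can be captured as a small policy helper. The enum values mirror RocksDB's documented `WALRecoveryMode` names from the wiki; `pick_recovery_mode` and its inputs are a hypothetical encoding of the bullets above:

```python
from enum import Enum

class WALRecoveryMode(Enum):
    # Value strings mirror RocksDB's WALRecoveryMode enum constants.
    ABSOLUTE_CONSISTENCY = "kAbsoluteConsistency"   # any corruption fails recovery
    POINT_IN_TIME = "kPointInTimeRecovery"          # stop replay at first bad record
    TOLERATE_CORRUPTED_TAIL = "kTolerateCorruptedTailRecords"
    SKIP_ANY_CORRUPTED = "kSkipAnyCorruptedRecords" # salvage everything readable

def pick_recovery_mode(critical: bool, has_replica_backfill: bool) -> WALRecoveryMode:
    """Hypothetical policy: strictness for critical state, point-in-time when
    a replica can backfill the lost tail, salvage only as a last resort."""
    if critical and not has_replica_backfill:
        return WALRecoveryMode.ABSOLUTE_CONSISTENCY
    if has_replica_backfill:
        return WALRecoveryMode.POINT_IN_TIME
    return WALRecoveryMode.SKIP_ANY_CORRUPTED
```

The point is not this exact decision table — it is that the decision exists in reviewed code rather than in whatever default the binary shipped with.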
6) Observability: minimum dashboard
Track per engine/workload:
- commit latency p50/p95/p99 split by sync policy
- WAL bytes/sec and segment churn rate
- checkpoint frequency/duration and backlog
- WAL size growth vs checkpoint progress
- recovery replay duration (from incident drills)
- count of writes by durability class (strict vs async)
If you cannot answer “what was our effective loss window at crash time?”, you are not operating durability.
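Answering that question requires recording both the acknowledgment stream and the last durable WAL sync. A minimal sketch of the arithmetic (function and inputs hypothetical — in practice the timestamps come from your commit log and sync instrumentation):

```python
from datetime import datetime, timedelta

def effective_loss_window(ack_times: list, last_durable_sync: datetime) -> timedelta:
    """Commits acknowledged after the last durable WAL sync are the writes a
    crash at that moment could lose; the window is how far past the sync
    point the newest such ack reaches."""
    lost = [t for t in ack_times if t > last_durable_sync]
    if not lost:
        return timedelta(0)
    return max(lost) - last_durable_sync
```

Computing this continuously (and alerting when it exceeds the agreed RPO for a durability class) is what the dashboard above is ultimately for.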
7) Game-day tests (must run before launch)
- Kill -9 / process crash during write load
- OS reboot / power-cut simulation on staging
- Long-reader checkpoint starvation (SQLite-like pattern)
- WAL tail corruption drill (RocksDB mode behavior)
- Large checkpoint pressure (Postgres max_wal_size breach scenarios)
For each drill, record:
- lost acknowledged writes (count/time window)
- recovery duration
- operator decision points
- whether alerts fired early enough
8) 12-point production checklist
- Durability modes documented per data class
- Commit acknowledgment policy documented per write path
- WAL/checkpoint knobs checked into infra-as-code
- Explicit policy for async commits (where allowed / forbidden)
- Crash drill evidence captured in runbook
- WAL growth and checkpoint lag alerts configured
- Long-reader detection added (where applicable)
- Recovery mode choice documented (e.g., RocksDB)
- Incident template includes “acknowledged-but-lost window”
- Recovery SLO (RTO) and data-loss SLO (RPO) agreed
- Non-durable settings prohibited by default policy
- Quarterly re-validation under realistic write burst
Common anti-patterns
- “fsync off only for now” becoming permanent in prod
- using async commit without tagging noncritical data classes
- treating checkpoint storms as infra noise, not durability debt
- no fault-injection rehearsal for crash-recovery path
- assuming replication removes need for local durability policy clarity
References
- PostgreSQL docs — WAL Configuration
  https://www.postgresql.org/docs/current/wal-configuration.html
- PostgreSQL docs — Write Ahead Log parameters (`fsync`, `synchronous_commit`, `full_page_writes`, etc.)
  https://www.postgresql.org/docs/current/runtime-config-wal.html
- PostgreSQL docs — Non-Durable Settings
  https://www.postgresql.org/docs/current/non-durability.html
- SQLite docs — Write-Ahead Logging
  https://sqlite.org/wal.html
- RocksDB wiki — WAL File Format
  https://github.com/facebook/rocksdb/wiki/Write-Ahead-Log-File-Format
- RocksDB wiki — WAL Recovery Modes
  https://github.com/facebook/rocksdb/wiki/WAL-Recovery-Modes
- RocksDB wiki — WAL Performance
  https://github.com/facebook/rocksdb/wiki/WAL-Performance
One-line takeaway
WAL tuning is not a performance tweak; it is your explicit contract for how truth survives a crash.