SQLite WAL Concurrency + Checkpoint-Starvation Operations Playbook

2026-03-30 · software

Stop "database is locked" incidents by controlling write duration, reader lifetime, and checkpoint policy

Why this note: SQLite in WAL mode is excellent for local-first, edge, and embedded systems—but many teams hit the same wall: latency spikes, growing -wal files, and intermittent SQLITE_BUSY errors under mixed read/write load.


1) Failure Mode in One Sentence

Most SQLite production incidents are not "too much write volume"; they are long-lived transactions + unmanaged checkpointing causing lock pressure and WAL growth.


2) Ground Truth You Must Design Around

  1. Many readers, single writer: SQLite allows concurrent readers, but only one writer at a time.
  2. WAL mode improves read/write concurrency: readers do not block writer append, but writer serialization still exists.
  3. Checkpointing is a third operation: read, write, and checkpoint all matter for p99 latency.
  4. Long readers can stall checkpoint progress: if readers pin older snapshots, checkpoint cannot advance beyond them.

If you ignore (4), the WAL file can grow without bound and read performance decays over time, because every read must check the growing WAL for newer page versions.


3) Minimal Mental Model

Let:

  - W = duration of the longest write transaction (seconds)
  - R = age of the oldest open read transaction (seconds)
  - G = pages appended to the WAL per second under write load
  - page_size = the database page size in bytes (default 4096)

Operationally:

  - Writers queue behind W: with a single writer, every other writer waits at least W under contention.
  - Checkpoints cannot advance past the snapshot pinned by the oldest reader, so WAL reclamation lags by at least R.

A simple estimate:

  wal_bytes_per_sec ≈ G × page_size

Use this to set alarms before files become huge.
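Plugging hypothetical numbers into the estimate makes the alarm math concrete. G and the stall window below are illustrative assumptions, not measurements; derive your own G from telemetry.

```python
# Hypothetical workload numbers; measure your own G from telemetry.
G = 500           # pages appended to the WAL per second at peak
page_size = 4096  # bytes (SQLite's default page size)

wal_bytes_per_sec = G * page_size
print(wal_bytes_per_sec)                    # 2048000 (~2 MB/s)

# If checkpoints stall for 10 minutes, expect roughly this much WAL:
stalled_wal_bytes = wal_bytes_per_sec * 600
print(stalled_wal_bytes // (1024 * 1024))   # 1171 (MiB)
```

Even a modest write rate turns a ten-minute checkpoint stall into gigabyte-scale WAL, which is why the alert should fire on trend, not just absolute size.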


4) Baseline Configuration (Safe Starting Point)

Run once per DB (and per connection where noted):

PRAGMA journal_mode=WAL;
PRAGMA synchronous=NORMAL;          -- FULL if you need stronger durability guarantees
PRAGMA wal_autocheckpoint=1000;     -- tune later from telemetry
PRAGMA busy_timeout=5000;           -- per-connection policy
PRAGMA foreign_keys=ON;

Notes:

  - journal_mode=WAL is persistent: once set, the database file stays in WAL mode for all future connections.
  - busy_timeout, foreign_keys, and wal_autocheckpoint are per-connection settings; reapply them on every connection you open.
  - synchronous=NORMAL in WAL mode can lose the most recent commits on power loss but does not risk corruption; use FULL if that window is unacceptable.

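As a sketch of how to enforce the per-connection settings, a small factory in Python's stdlib sqlite3 (the function name and path are illustrative; isolation_level=None puts the connection in autocommit so transactions can be managed explicitly):

```python
import sqlite3

def open_conn(path: str) -> sqlite3.Connection:
    """Open a connection with the baseline settings above.

    journal_mode=WAL persists in the database file, but the other
    pragmas are per-connection and must be reapplied on every open.
    """
    conn = sqlite3.connect(path, isolation_level=None)  # autocommit; explicit txns
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=NORMAL")
    conn.execute("PRAGMA wal_autocheckpoint=1000")
    conn.execute("PRAGMA busy_timeout=5000")
    conn.execute("PRAGMA foreign_keys=ON")
    return conn
```

Note that in-memory databases cannot use WAL; the journal_mode pragma only reports "wal" for file-backed databases.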

5) Transaction Design Rules (Biggest ROI)

Writer rules

  - Keep write transactions short: batch rows, but never hold a transaction open across network calls, file I/O, or user interaction.
  - Use BEGIN IMMEDIATE for write transactions so lock contention surfaces at BEGIN rather than mid-transaction.
  - Prepare all data before BEGIN; do not interleave unrelated work between BEGIN and COMMIT.

Reader rules

  - Bound read transaction lifetime: a paused iterator or unfinalized statement pins a snapshot and stalls checkpoints.
  - Finalize/reset statements and close cursors as soon as results are consumed.
If you fix only one thing, fix transaction lifetime.
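A minimal sketch of the writer discipline, assuming a connection opened in autocommit mode (isolation_level=None in Python's sqlite3) so the explicit BEGIN/COMMIT is honored; table and function names are illustrative:

```python
import sqlite3

def apply_batch(conn: sqlite3.Connection, rows) -> None:
    # All data is prepared BEFORE the transaction starts. BEGIN IMMEDIATE
    # takes the write lock up front, so contention surfaces here (as a
    # "database is locked" error) instead of at COMMIT.
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.executemany("INSERT INTO events(payload) VALUES (?)", rows)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # never leave the write lock held
        raise
```

The transaction's lifetime is exactly the duration of the inserts; nothing slow can sneak inside it.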


6) Checkpoint Strategy by Workload

A) Low/medium write apps

  - Rely on wal_autocheckpoint (default 1000 pages); tune the threshold from WAL-size telemetry.

B) Bursty write apps

  - Lower or disable autocheckpoint on writer connections and run explicit PASSIVE checkpoints from a background task during quiet periods, so bursts are not taxed with inline checkpoint work.

C) Strict disk-footprint apps

  - Schedule periodic wal_checkpoint(TRUNCATE) in a maintenance window to return the -wal file to zero bytes.

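For strategies B/C, a sketch of the explicit-checkpoint path (function name is illustrative; call it from your scheduler during quiet periods, with writers setting wal_autocheckpoint=0 themselves, since that pragma is per-connection):

```python
import sqlite3

def quiet_period_truncate(conn: sqlite3.Connection):
    """Explicit checkpoint for bursty / disk-constrained workloads.

    TRUNCATE checkpoints everything it can and resets the -wal file to
    zero bytes. Returns (busy, wal_frames, frames_checkpointed); busy=1
    means readers or writers prevented the checkpoint from completing.
    """
    return conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
```

After a successful run the -wal file is literally zero bytes, which makes strategy C easy to verify from disk-usage monitoring alone.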

7) Choosing Checkpoint Modes (Practical)

Rule of thumb:

  - PASSIVE: safe default for background use; never blocks readers or writers, but may leave frames behind.
  - FULL: waits for readers to clear so the entire WAL is transferred; run it in controlled windows.
  - RESTART/TRUNCATE: like FULL, plus restarting (or truncating) the WAL file; reserve for maintenance windows when disk must be reclaimed.

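A small wrapper makes the mode choice explicit and interprets the result row; this is a sketch (names are illustrative), and the mode must be validated before interpolation because checkpoint modes are keywords, not bindable parameters:

```python
import sqlite3

CHECKPOINT_MODES = {"PASSIVE", "FULL", "RESTART", "TRUNCATE"}

def checkpoint(conn: sqlite3.Connection, mode: str = "PASSIVE"):
    # Returns (busy, wal_frames, frames_checkpointed):
    #   busy=1                -> checkpoint could not finish
    #   checkpointed < frames -> a reader snapshot is pinning the tail
    if mode not in CHECKPOINT_MODES:
        raise ValueError(f"unknown checkpoint mode: {mode}")
    return conn.execute(f"PRAGMA wal_checkpoint({mode})").fetchone()
```

Logging all three result columns on every call is what makes the "checkpoint failing to advance" alerts in section 9 possible.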

8) Busy Handling Policy (Avoid Retry Storms)

A common anti-pattern is app-level instant retries on SQLITE_BUSY.

Use layered policy:

  1. Connection-level busy_timeout (e.g., 2-5s)
  2. Small, bounded app retries with jitter
  3. Escalate/observe after retry budget exhausted

Do not spin-retry with zero sleep; it amplifies lock contention.
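The layered policy above can be sketched as a bounded retry helper with jittered backoff; the function name and budget values are illustrative assumptions:

```python
import random
import sqlite3
import time

def with_busy_retry(fn, attempts: int = 3, base_sleep: float = 0.05):
    # Layer 2 of the policy: runs AFTER the connection's busy_timeout has
    # already waited. Bounded attempts with jittered exponential backoff;
    # never retry with zero sleep, which amplifies the contention.
    for attempt in range(attempts):
        try:
            return fn()
        except sqlite3.OperationalError as exc:
            msg = str(exc)
            if "locked" not in msg and "busy" not in msg:
                raise                  # not a contention error
            if attempt == attempts - 1:
                raise                  # budget exhausted: escalate/observe
            time.sleep(base_sleep * (2 ** attempt) + random.uniform(0, base_sleep))
```

The jitter term matters: without it, retrying clients re-collide in lockstep after every backoff interval.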


9) Observability: What to Track

At minimum, export:

  - WAL file size (bytes) and its trend
  - Checkpoint outcomes: busy flag, WAL frames, frames checkpointed
  - Age of the oldest open read transaction
  - SQLITE_BUSY error rate and write transaction duration (p50/p99)

Alert examples:

  - WAL size above N× its steady-state value for more than M minutes
  - Checkpoints repeatedly returning busy or failing to advance
  - Oldest reader age exceeding your checkpoint interval


10) Incident Runbook

When latency spikes + busy errors appear:

  1. Check active long-lived readers first.
  2. Inspect WAL size trend and recent checkpoint outcomes.
  3. Trigger PRAGMA wal_checkpoint(PASSIVE); if the WAL still cannot advance, run FULL in a controlled window.
  4. Reduce concurrent writer pressure temporarily (admission control).
  5. Patch transaction scope leaks (missing finalize/reset, long-lived cursors).

Post-incident, adjust:

  - wal_autocheckpoint threshold and the explicit checkpoint schedule
  - busy_timeout and the app-level retry budget
  - transaction-scope guidelines (and lints/tests that enforce them)
  - alert thresholds so the next incident pages earlier


11) Common Anti-Patterns

  - Holding a write transaction open across network or disk I/O
  - Spin-retrying SQLITE_BUSY with zero sleep
  - Opening connections without busy_timeout
  - Long-lived cursors and unfinalized statements pinning old snapshots
  - Ignoring -wal file growth until the disk fills

12) Fast Hardening Checklist

[ ] journal_mode=WAL enabled and verified
[ ] busy_timeout set on every connection
[ ] write transaction scope audited (no I/O inside)
[ ] long-read transaction age metric added
[ ] checkpoint mode policy documented (PASSIVE vs FULL/RESTART/TRUNCATE)
[ ] WAL-size and checkpoint-success alerts wired
[ ] incident runbook tested in staging with synthetic lock contention


TL;DR

SQLite WAL scaling is mostly an operations discipline problem: control transaction lifetimes, pick checkpoint modes intentionally, and monitor WAL growth + reader age. Do that, and most "database is locked" pain disappears.