SQLite WAL Concurrency + Checkpoint-Starvation Operations Playbook
Stop "database is locked" incidents by controlling write duration, reader lifetime, and checkpoint policy
Why this note: SQLite in WAL mode is excellent for local-first, edge, and embedded systems—but many teams hit the same wall: latency spikes, growing
-walfiles, and intermittentSQLITE_BUSYerrors under mixed read/write load.
1) Failure Mode in One Sentence
Most SQLite production incidents are not "too much write volume"; they are long-lived transactions + unmanaged checkpointing causing lock pressure and WAL growth.
2) Ground Truth You Must Design Around
- Many readers, single writer: SQLite allows concurrent readers, but only one writer at a time.
- WAL mode improves read/write concurrency: readers do not block writer append, but writer serialization still exists.
- Checkpointing is a third operation: read, write, and checkpoint all matter for p99 latency.
- Long readers can stall checkpoint progress: if readers pin older snapshots, checkpoint cannot advance beyond them.
If you ignore (4), WAL can grow and read performance decays over time.
3) Minimal Mental Model
Let:
G= WAL growth rate (frames/sec)C= effective checkpoint drain rate (frames/sec)R_long= presence of long-lived readers
Operationally:
- Stable:
C >= Gand no persistentR_long - Risky:
C < Gintermittently - Pathological:
R_longpersists, checkpoint repeatedly partial, WAL grows unbounded
A simple estimate:
[ \text{wal_bytes_per_sec} \approx G \times \text{page_size} ]
Use this to set alarms before files become huge.
4) Baseline Configuration (Safe Starting Point)
Run once per DB (and per connection where noted):
PRAGMA journal_mode=WAL;
PRAGMA synchronous=NORMAL; -- FULL if you need stronger durability guarantees
PRAGMA wal_autocheckpoint=1000; -- tune later from telemetry
PRAGMA busy_timeout=5000; -- per-connection policy
PRAGMA foreign_keys=ON;
Notes:
journal_mode=WALis persistent at DB level.busy_timeoutis connection-local (set it on every pool connection).synchronous=NORMALoften gives best latency in app-managed checkpoint setups.
5) Transaction Design Rules (Biggest ROI)
Writer rules
- Keep write transactions short (no network calls inside transaction scope).
- Batch predictably (many tiny commits can over-trigger checkpoint churn).
- Prefer
BEGIN IMMEDIATEwhen you want clear fail-fast lock acquisition semantics.
Reader rules
- Do not keep read transactions open across UI/event-loop idle periods.
- Avoid streaming huge result sets without chunking/pagination.
- Always finalize/reset statements promptly.
If you fix only one thing, fix transaction lifetime.
6) Checkpoint Strategy by Workload
A) Low/medium write apps
- Keep
wal_autocheckpointenabled. - Schedule periodic
wal_checkpoint(PASSIVE)during normal operation.
B) Bursty write apps
- Raise autockpt threshold moderately.
- Run explicit checkpoints from a background worker:
PASSIVEfrequentlyFULLorRESTARTduring controlled idle windows
C) Strict disk-footprint apps
- Use periodic
TRUNCATEcheckpoints during known quiet periods. - Never run aggressive truncate loops during peak traffic.
7) Choosing Checkpoint Modes (Practical)
- PASSIVE: non-blocking best effort; safest default in hot paths.
- FULL: wait for readers/writer conditions; good maintenance tool.
- RESTART: like FULL + ensures next writer restarts WAL from beginning.
- TRUNCATE: like RESTART + truncates WAL file to zero bytes.
Rule of thumb:
- request path ->
PASSIVE - maintenance window ->
FULL/RESTART/TRUNCATE
8) Busy Handling Policy (Avoid Retry Storms)
A common anti-pattern is app-level instant retries on SQLITE_BUSY.
Use layered policy:
- Connection-level
busy_timeout(e.g., 2-5s) - Small, bounded app retries with jitter
- Escalate/observe after retry budget exhausted
Do not spin-retry with zero sleep; it amplifies lock contention.
9) Observability: What to Track
At minimum, export:
sqlite_busy_errors_totalsqlite_commit_latency_ms(p50/p95/p99)sqlite_wal_size_bytessqlite_checkpoint_duration_mssqlite_checkpoint_frames_log/frames_checkpointedlong_read_txn_countand max read txn age
Alert examples:
- WAL size monotonic growth for >N minutes
- Busy error rate > baseline + X sigma
- Checkpoint success ratio drops below threshold
10) Incident Runbook
When latency spikes + busy errors appear:
- Check active long-lived readers first.
- Inspect WAL size trend and recent checkpoint outcomes.
- Trigger
wal_checkpoint(PASSIVE); if still stuck, runFULLin controlled window. - Reduce concurrent writer pressure temporarily (admission control).
- Patch transaction scope leaks (missing finalize/reset, long-lived cursors).
Post-incident, adjust:
- transaction boundaries
- autockpt threshold
- checkpoint scheduling policy
11) Common Anti-Patterns
- Holding transaction open while waiting on network I/O
- Assuming WAL removes single-writer limitation
- Treating
busy_timeoutas global (it is per connection) - Running
TRUNCATEcheckpoints aggressively during peak - No telemetry on reader age and WAL growth
12) Fast Hardening Checklist
[ ] journal_mode=WAL enabled and verified
[ ] busy_timeout set on every connection
[ ] write transaction scope audited (no I/O inside)
[ ] long-read transaction age metric added
[ ] checkpoint mode policy documented (PASSIVE vs FULL/RESTART/TRUNCATE)
[ ] WAL-size and checkpoint-success alerts wired
[ ] incident runbook tested in staging with synthetic lock contention
References
- SQLite WAL documentation: https://www.sqlite.org/wal.html
- SQLite transaction semantics: https://www.sqlite.org/lang_transaction.html
- SQLite checkpoint API (
sqlite3_wal_checkpoint_v2): https://www.sqlite.org/c3ref/wal_checkpoint_v2.html - SQLite busy timeout API: https://www.sqlite.org/c3ref/busy_timeout.html
TL;DR
SQLite WAL scaling is mostly an operations discipline problem: control transaction lifetimes, pick checkpoint modes intentionally, and monitor WAL growth + reader age. Do that, and most "database is locked" pain disappears.