SQLite WAL Concurrency + Checkpoint-Starvation Operations Playbook

Stop "database is locked" incidents by controlling write duration, reader lifetime, and checkpoint policy

Why this note: SQLite in WAL mode is excellent for local-first, edge, and embedded systems—but many teams hit the same wall: latency spikes, growing -wal files, and intermittent SQLITE_BUSY errors under mixed read/write load.

1) Failure Mode in One Sentence

Most SQLite production incidents are not "too much write volume"; they are long-lived transactions + unmanaged checkpointing causing lock pressure and WAL growth.

2) Ground Truth You Must Design Around

Many readers, single writer: SQLite allows concurrent readers, but only one writer at a time.
WAL mode improves read/write concurrency: readers do not block writer append, but writer serialization still exists.
Checkpointing is a third operation: read, write, and checkpoint all matter for p99 latency.
Long readers can stall checkpoint progress: if readers pin older snapshots, checkpoint cannot advance beyond them.

If you ignore (4), WAL can grow and read performance decays over time.

3) Minimal Mental Model

Let:

G = WAL growth rate (frames/sec)
C = effective checkpoint drain rate (frames/sec)
R_long = presence of long-lived readers

Operationally:

Stable: C >= G and no persistent R_long
Risky: C < G intermittently
Pathological: R_long persists, checkpoint repeatedly partial, WAL grows unbounded

A simple estimate:

[ \text{wal_bytes_per_sec} \approx G \times \text{page_size} ]

Use this to set alarms before files become huge.

4) Baseline Configuration (Safe Starting Point)

Run once per DB (and per connection where noted):

PRAGMA journal_mode=WAL;
PRAGMA synchronous=NORMAL;          -- FULL if you need stronger durability guarantees
PRAGMA wal_autocheckpoint=1000;     -- tune later from telemetry
PRAGMA busy_timeout=5000;           -- per-connection policy
PRAGMA foreign_keys=ON;

Notes:

journal_mode=WAL is persistent at DB level.
busy_timeout is connection-local (set it on every pool connection).
synchronous=NORMAL often gives best latency in app-managed checkpoint setups.

5) Transaction Design Rules (Biggest ROI)

Writer rules

Keep write transactions short (no network calls inside transaction scope).
Batch predictably (many tiny commits can over-trigger checkpoint churn).
Prefer BEGIN IMMEDIATE when you want clear fail-fast lock acquisition semantics.

Reader rules

Do not keep read transactions open across UI/event-loop idle periods.
Avoid streaming huge result sets without chunking/pagination.
Always finalize/reset statements promptly.

If you fix only one thing, fix transaction lifetime.

6) Checkpoint Strategy by Workload

A) Low/medium write apps

Keep wal_autocheckpoint enabled.
Schedule periodic wal_checkpoint(PASSIVE) during normal operation.

B) Bursty write apps

Raise autockpt threshold moderately.
Run explicit checkpoints from a background worker:
- PASSIVE frequently
- FULL or RESTART during controlled idle windows

C) Strict disk-footprint apps

Use periodic TRUNCATE checkpoints during known quiet periods.
Never run aggressive truncate loops during peak traffic.

7) Choosing Checkpoint Modes (Practical)

PASSIVE: non-blocking best effort; safest default in hot paths.
FULL: wait for readers/writer conditions; good maintenance tool.
RESTART: like FULL + ensures next writer restarts WAL from beginning.
TRUNCATE: like RESTART + truncates WAL file to zero bytes.

Rule of thumb:

request path -> PASSIVE
maintenance window -> FULL/RESTART/TRUNCATE

8) Busy Handling Policy (Avoid Retry Storms)

A common anti-pattern is app-level instant retries on SQLITE_BUSY.

Use layered policy:

Connection-level busy_timeout (e.g., 2-5s)
Small, bounded app retries with jitter
Escalate/observe after retry budget exhausted

Do not spin-retry with zero sleep; it amplifies lock contention.

9) Observability: What to Track

At minimum, export:

sqlite_busy_errors_total
sqlite_commit_latency_ms (p50/p95/p99)
sqlite_wal_size_bytes
sqlite_checkpoint_duration_ms
sqlite_checkpoint_frames_log / frames_checkpointed
long_read_txn_count and max read txn age

Alert examples:

WAL size monotonic growth for >N minutes
Busy error rate > baseline + X sigma
Checkpoint success ratio drops below threshold

10) Incident Runbook

When latency spikes + busy errors appear:

Check active long-lived readers first.
Inspect WAL size trend and recent checkpoint outcomes.
Trigger wal_checkpoint(PASSIVE); if still stuck, run FULL in controlled window.
Reduce concurrent writer pressure temporarily (admission control).
Patch transaction scope leaks (missing finalize/reset, long-lived cursors).

Post-incident, adjust:

transaction boundaries
autockpt threshold
checkpoint scheduling policy

11) Common Anti-Patterns

Holding transaction open while waiting on network I/O
Assuming WAL removes single-writer limitation
Treating busy_timeout as global (it is per connection)
Running TRUNCATE checkpoints aggressively during peak
No telemetry on reader age and WAL growth

12) Fast Hardening Checklist

[ ] journal_mode=WAL enabled and verified
[ ] busy_timeout set on every connection
[ ] write transaction scope audited (no I/O inside)
[ ] long-read transaction age metric added
[ ] checkpoint mode policy documented (PASSIVE vs FULL/RESTART/TRUNCATE)
[ ] WAL-size and checkpoint-success alerts wired
[ ] incident runbook tested in staging with synthetic lock contention

References

SQLite WAL documentation: https://www.sqlite.org/wal.html
SQLite transaction semantics: https://www.sqlite.org/lang_transaction.html
SQLite checkpoint API (sqlite3_wal_checkpoint_v2): https://www.sqlite.org/c3ref/wal_checkpoint_v2.html
SQLite busy timeout API: https://www.sqlite.org/c3ref/busy_timeout.html

TL;DR

SQLite WAL scaling is mostly an operations discipline problem: control transaction lifetimes, pick checkpoint modes intentionally, and monitor WAL growth + reader age. Do that, and most "database is locked" pain disappears.