RocksDB Compaction Style & Write-Stall Control Playbook
Date: 2026-03-30
Category: knowledge
Audience: Operators running write-heavy or mixed read/write RocksDB workloads in production.
Why this matters
RocksDB performance incidents are rarely “random.” They usually come from one of three amplification failures:
- Write amplification spikes → compaction can’t keep up.
- Read amplification rises → too many files/runs touched per read.
- Space amplification grows → tombstones/overlap/backlog consume disk.
When flush and compaction fall behind ingest, stalls begin: first a slowdown, then a hard stop.
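The three amplification factors can be sanity-checked with back-of-envelope math. This is an illustrative sketch with hypothetical counter names; in production you would derive the inputs from rocksdb.stats or the statistics object, not from these parameters.

```python
def amplification(user_bytes_written, disk_bytes_written,
                  user_bytes_read, disk_bytes_read,
                  logical_data_size, disk_space_used):
    """Back-of-envelope amplification factors for an LSM store.

    All inputs are cumulative byte counts over the same window;
    the names are illustrative, not RocksDB ticker names.
    """
    write_amp = disk_bytes_written / user_bytes_written   # flush+compaction I/O per user byte
    read_amp = disk_bytes_read / user_bytes_read          # disk bytes touched per user byte read
    space_amp = disk_space_used / logical_data_size       # on-disk footprint vs live data
    return write_amp, read_amp, space_amp

# Example: 10 GB of user writes drove 120 GB of flush+compaction I/O.
wa, ra, sa = amplification(10e9, 120e9, 5e9, 40e9, 50e9, 90e9)
print(wa, ra, sa)  # 12.0 8.0 1.8
```

Tracking these three ratios over time is usually enough to tell which failure mode below you are in.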
1) Pick compaction style by failure mode (not by habit)
A. Leveled (the RocksDB default; internally L0 behaves tiered, deeper levels are leveled)
Use when:
- You need stable read latency and predictable point lookups.
- Your workload is read-sensitive and can tolerate somewhat higher write amp.
Tradeoff:
- Lower read amp / lower space amp, but usually higher write amp.
Operational note:
- Can still be very good for skewed updates and key-order inserts.
B. Universal (tiered family)
Use when:
- You are write-throughput bound and leveled compaction can’t keep up.
- You can tolerate more read amp + more space amp volatility.
Tradeoff:
- Lower write amp, but read/space amplification and performance variance usually increase.
Operational note:
- Plan extra free disk headroom: full compactions can need near-double temporary space.
C. FIFO
Use when:
- Data is TTL-like and old data can be dropped by age/size.
- You optimize for bounded storage and simple retention over query flexibility.
Tradeoff:
- Not a general-purpose replacement for leveled/universal for arbitrary queries.
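The style choice is a per-column-family setting, which RocksDB can load from an OPTIONS file. The fragment below is a sketch only: the section format and kCompactionStyle* values are real, but the CF names are hypothetical and every choice must be validated against your workload and RocksDB version.

```ini
# Hypothetical column families, one per failure mode above.
[CFOptions "read_serving"]
  compaction_style=kCompactionStyleLevel

[CFOptions "hot_writes"]
  compaction_style=kCompactionStyleUniversal
  # Remember the headroom note: full compactions can transiently
  # need close to double this CF's data size on disk.

[CFOptions "ttl_events"]
  compaction_style=kCompactionStyleFIFO
```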
2) Write stall triggers you must monitor
Stall triggers are evaluated per column family, but a triggered stall throttles writes for the whole DB.
Primary triggers:
- Too many immutable memtables (max_write_buffer_number pressure)
- Too many L0 files (level0_slowdown_writes_trigger, level0_stop_writes_trigger)
- Too many pending compaction bytes (soft_pending_compaction_bytes_limit, hard_pending_compaction_bytes_limit)
When triggered, RocksDB delays writes (delayed_write_rate), then can hard-stop writers.
If some writes must never block, use WriteOptions.no_slowdown=true and handle Status::Incomplete() in application logic.
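The trigger logic above can be approximated as a monitoring-side classifier. The threshold parameter names mirror the RocksDB options, but the function itself is an illustrative sketch for dashboards/alerting, not RocksDB's internal stall code.

```python
def stall_state(imm_memtables, max_write_buffer_number,
                l0_files, l0_slowdown, l0_stop,
                pending_bytes, soft_limit, hard_limit):
    """Classify write-stall pressure for one column family.

    Covers the three trigger families: immutable memtable count,
    L0 file count, and pending compaction bytes. Illustrative only;
    the real decision lives inside RocksDB's write controller.
    """
    if (imm_memtables >= max_write_buffer_number
            or l0_files >= l0_stop
            or (hard_limit and pending_bytes >= hard_limit)):
        return "stop"      # writers block (or get Incomplete with no_slowdown)
    if (l0_files >= l0_slowdown
            or (soft_limit and pending_bytes >= soft_limit)):
        return "slowdown"  # writes delayed at delayed_write_rate
    return "ok"

# 25 L0 files with a slowdown trigger of 20 -> delayed writes.
print(stall_state(1, 4, 25, 20, 36, 0, 64 << 30, 256 << 30))  # slowdown
```

Wiring this to exported RocksDB properties gives early warning before the hard-stop threshold is reached.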
3) Tuning ladder (safe order)
Measure first
- rocksdb.stats, compaction stats, perf context, host disk/CPU metrics.
- Confirm whether the bottleneck is write BW, read IOPS, CPU, or cache pressure.
Increase compaction/flush concurrency
- Size max_background_jobs from actual host capacity.
- Keep flush and compaction pools balanced; avoid compaction starvation.
Reduce avoidable write amp
- Revisit write_buffer_size, min_write_buffer_number_to_merge, and level sizing.
- Only then consider switching compaction style.
Control metadata-cache thrash (large DB:RAM ratios)
- Enable partitioned indexes/filters (kTwoLevelIndexSearch, partition_filters).
- Use cache policies that prioritize/pin top-level metadata where supported.
Only then move stall thresholds
- Raising stop/slowdown limits without fixing compaction debt just delays failure.
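"Size max_background_jobs from actual host capacity" can be made concrete with a starting-point heuristic. This is a hypothetical formula, not an official sizing rule: the per-job compaction throughput is a placeholder you must measure on your own hardware, and the result is only a first guess to replay-test.

```python
import os

def suggest_background_jobs(cpu_cores=None, disk_mb_per_s=500,
                            compaction_mb_per_s_per_job=100):
    """Hypothetical starting point for max_background_jobs.

    Caps parallelism by both CPU (leave cores for foreground reads)
    and by sequential disk bandwidth divided by measured per-job
    compaction throughput. All defaults are placeholders.
    """
    cpu_cores = cpu_cores or os.cpu_count() or 4
    by_cpu = max(2, cpu_cores // 2)
    by_disk = max(2, disk_mb_per_s // compaction_mb_per_s_per_job)
    return min(by_cpu, by_disk)

# 16 cores, ~800 MB/s disk, ~100 MB/s observed per compaction job.
print(suggest_background_jobs(cpu_cores=16, disk_mb_per_s=800))  # 8
```

Whatever value this yields, verify with compaction stats that flushes are never starved before raising it further.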
4) Practical baseline profiles
Read-latency-first profile
- Leveled compaction
- Conservative L0 triggers
- Strong block cache discipline
- Partitioned index/filter for large DB:RAM
Write-throughput-first profile
- Universal compaction candidate
- More compaction parallelism
- Explicit free-space guardrails for compaction bursts
- SLO accepts wider read-latency variance
Bulk-ingest profile (temporary)
- Relax stall sensitivity only during controlled ingest windows
- Separate ingest CF/DB when possible
- Post-ingest compaction + validation checkpoint before normal traffic
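The two steady-state profiles can be sketched as OPTIONS-file fragments. Section syntax and option names follow RocksDB conventions, but every number is a placeholder to replay-test, and the CF names are hypothetical.

```ini
# Read-latency-first: leveled, conservative L0 triggers, partitioned metadata.
[CFOptions "read_first"]
  compaction_style=kCompactionStyleLevel
  level0_slowdown_writes_trigger=20
  level0_stop_writes_trigger=36
[TableOptions/BlockBasedTable "read_first"]
  index_type=kTwoLevelIndexSearch
  partition_filters=true
  cache_index_and_filter_blocks=true

# Write-throughput-first: universal candidate, more background parallelism.
[DBOptions]
  max_background_jobs=8
[CFOptions "write_first"]
  compaction_style=kCompactionStyleUniversal
```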
5) Incident runbook (fast triage)
- Symptom: write p99/p999 explodes or writers block.
- Check LOG for: memtable stalls, L0 file stalls, pending compaction byte stalls.
- Check compaction backlog + disk write BW saturation.
- If L0 runaway: prioritize L0→L1 catch-up, reduce incoming write burst.
- If pending bytes runaway: increase compaction capacity or reduce write amp path.
- If cache-thrash driven read collapse: verify index/filter memory policy + partitions.
- Capture before/after stats snapshot; never “tune blind” during incident.
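Checking LOG for stall causes can be scripted. The regexes below approximate RocksDB's "Stalling/Stopping writes" messages; exact wording varies across versions, so verify the patterns against your build's LOG before putting this on an alerting path.

```python
import re

# Approximations of RocksDB LOG stall messages; verify per version.
STALL_PATTERNS = {
    "l0_files": re.compile(r"(Stalling|Stopping) writes because we have \d+ level-0 files"),
    "memtables": re.compile(r"(Stalling|Stopping) writes because we have \d+ immutable memtables"),
    "pending_bytes": re.compile(r"(Stalling|Stopping) writes because of estimated pending compaction bytes"),
}

def classify_stall_lines(log_lines):
    """Count stall events in LOG lines by trigger family."""
    counts = {k: 0 for k in STALL_PATTERNS}
    for line in log_lines:
        for reason, pat in STALL_PATTERNS.items():
            if pat.search(line):
                counts[reason] += 1
    return counts

sample = [
    "2026/03/30-12:00:01 ... Stalling writes because we have 21 level-0 files",
    "2026/03/30-12:00:05 ... Stopping writes because we have 36 level-0 files",
]
print(classify_stall_lines(sample))  # {'l0_files': 2, 'memtables': 0, 'pending_bytes': 0}
```

The dominant counter points you to the matching branch of the triage list above (L0 catch-up, compaction capacity, or flush/memtable pressure).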
6) Common mistakes
- “Fixing” stalls only by increasing trigger thresholds.
- Compaction style flips without workload replay validation.
- Ignoring DB:RAM ratio and metadata-cache behavior.
- Treating one benchmark result as universal truth.
- Forgetting per-CF triggers can stall the whole DB.
7) Operator checklist
- Compaction style chosen by workload/SLO tradeoff.
- Stall counters and log patterns on alerting dashboard.
- Background job sizing validated against host BW/CPU.
- Partitioned index/filter evaluated for large datasets.
- Runbook tested with synthetic write bursts.
- Version-specific option names/behavior verified before rollout.
References
- RocksDB Compaction Wiki: https://github.com/facebook/rocksdb/wiki/Compaction
- RocksDB Tuning Guide: https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide
- RocksDB Universal Compaction: https://github.com/facebook/rocksdb/wiki/universal-compaction
- RocksDB Write Stalls: https://github.com/facebook/rocksdb/wiki/Write-Stalls
- RocksDB Partitioned Index/Filters: https://github.com/facebook/rocksdb/wiki/Partitioned-Index-Filters