RocksDB Compaction Style & Write-Stall Control Playbook

2026-03-30 · software

RocksDB Compaction Style & Write-Stall Control Playbook

Date: 2026-03-30
Category: knowledge
Audience: Operators running write-heavy or mixed read/write RocksDB workloads in production.


Why this matters

RocksDB performance incidents are rarely “random.” They usually come from one of three amplification failures:

  1. Write amplification spikes → compaction can’t keep up.
  2. Read amplification rises → too many files/runs touched per read.
  3. Space amplification grows → tombstones/overlap/backlog consume disk.

When flush+compaction lag, stalls begin: first slowdown, then hard stop.


1) Pick compaction style by failure mode (not by habit)

A. Leveled (default / tiered+leveled in RocksDB internals)

Use when:

Tradeoff:

Operational note:

B. Universal (tiered family)

Use when:

Tradeoff:

Operational note:

C. FIFO

Use when:

Tradeoff:


2) Write stall triggers you must monitor

Stalls are per-column-family triggers but throttle the whole DB process.

Primary triggers:

When triggered, RocksDB delays writes (delayed_write_rate), then can hard-stop writers.

If some writes must never block, use WriteOptions.no_slowdown=true and handle Status::Incomplete() in application logic.


3) Tuning ladder (safe order)

  1. Measure first

    • rocksdb.stats, compaction stats, perf context, host disk/CPU metrics.
    • Confirm whether bottleneck is write BW, read IOPS, CPU, or cache pressure.
  2. Increase compaction/flush concurrency

    • Prefer max_background_jobs sizing from actual host capacity.
    • Keep flush and compaction pools balanced; avoid compaction starvation.
  3. Reduce avoidable write amp

    • Revisit write_buffer_size, min_write_buffer_number_to_merge, level sizing.
    • Only then consider switching compaction style.
  4. Control metadata-cache thrash (large DB:RAM ratios)

    • Enable partitioned indexes/filters (kTwoLevelIndexSearch, partition_filters).
    • Use cache policies that prioritize/pin top-level metadata where supported.
  5. Only then move stall thresholds

    • Raising stop/slowdown limits without fixing compaction debt just delays failure.

4) Practical baseline profiles

Read-latency-first profile

Write-throughput-first profile

Bulk-ingest profile (temporary)


5) Incident runbook (fast triage)

  1. Symptom: write p99/p999 explodes or writers block.
  2. Check LOG for: memtable stalls, L0 file stalls, pending compaction byte stalls.
  3. Check compaction backlog + disk write BW saturation.
  4. If L0 runaway: prioritize L0→L1 catch-up, reduce incoming write burst.
  5. If pending bytes runaway: increase compaction capacity or reduce write amp path.
  6. If cache-thrash driven read collapse: verify index/filter memory policy + partitions.
  7. Capture before/after stats snapshot; never “tune blind” during incident.

6) Common mistakes


7) Operator checklist


References