B-tree vs LSM-tree Storage Engine Selection Playbook
Why this matters
Choosing a storage engine is less about benchmark screenshots and more about which pain you want to pay:
- B-tree families tend to pay more on random writes, but keep reads simpler.
- LSM families absorb writes efficiently, but push complexity into compaction, read amplification, and operational tuning.
If this choice is wrong, teams usually discover it late: under production write pressure, p99 latency incidents, and disk-cost surprises.
The core trade-off triangle
A practical way to reason about both engines:
- Write amplification (WA): how many physical writes per logical write
- Read amplification (RA): how many structures/pages/SSTables touched per read
- Space amplification (SA): how much extra storage overhead you carry
Improving one usually worsens at least one of the others.
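The three quantities are just ratios over counters most engines already expose. A minimal sketch (the counter names here are hypothetical; real engines expose equivalents, e.g. RocksDB statistics or InnoDB status variables):

```python
def write_amplification(physical_bytes_written: int, logical_bytes_written: int) -> float:
    """WA: physical bytes the device absorbs per logical byte the app wrote."""
    return physical_bytes_written / logical_bytes_written

def space_amplification(bytes_on_disk: int, live_data_bytes: int) -> float:
    """SA: total storage consumed divided by live (non-obsolete) data."""
    return bytes_on_disk / live_data_bytes

# RA is usually tracked directly as a count: structures/pages/SSTables
# consulted per read, so no ratio is needed.

# Example: the app wrote 10 GB but the device absorbed 60 GB -> WA = 6.0
print(write_amplification(60 * 2**30, 10 * 2**30))  # 6.0
```

Tracking these as trends, not snapshots, is what makes the trade-off triangle actionable.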
B-tree profile (InnoDB-like mental model)
Strengths
- Strong point-read and short range-scan behavior when the working set is well indexed.
- Simpler read path under steady state.
- Operationally easier to reason about for many OLTP workloads.
Costs
- Random-write heavy workloads cause page splits and frequent rewrites.
- Write path can become expensive under very high ingest.
- Hot pages and index maintenance pressure can dominate tail latency.
Typical fit
- Moderate write rate + strict point lookup latency
- Transaction-heavy workloads with strong secondary index usage
- Teams that want predictable operational behavior over max ingest throughput
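The random-write cost above can be made concrete with a toy dirty-page model (page and row sizes are illustrative, not tuned to any real engine): a dirtied 16 KB page must eventually be flushed whole, so scattered single-row updates pay a full page each, while append-like writes coalesce many rows into one page flush.

```python
import random

PAGE_SIZE = 16 * 1024            # InnoDB-like page size
ROW_SIZE = 100                   # hypothetical row size
ROWS_PER_PAGE = PAGE_SIZE // ROW_SIZE
N_ROWS = 10_000
TABLE_PAGES = 100_000            # pages the table spans

def flushed_bytes(page_ids):
    """A dirtied page costs one full-page flush, however few rows changed."""
    return len(set(page_ids)) * PAGE_SIZE

seq = [i // ROWS_PER_PAGE for i in range(N_ROWS)]              # append-like inserts
random.seed(0)
rand = [random.randrange(TABLE_PAGES) for _ in range(N_ROWS)]  # scattered updates

logical = N_ROWS * ROW_SIZE
print(flushed_bytes(seq) / logical)   # ~1x: rows coalesce within pages
print(flushed_bytes(rand) / logical)  # ~150x: nearly every row dirties its own page
```

The model ignores redo logging and page splits, but it shows why the same logical write rate can cost two orders of magnitude more physical I/O when the key pattern is random.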
LSM profile (RocksDB/LevelDB-like mental model)
Strengths
- High write ingest: writes land as sequential appends (WAL + memtable), with flush and compaction deferred to the background.
- Good for write-heavy or bursty ingestion workloads.
- Flexible tuning space for throughput vs latency vs cost.
Costs
- Compaction debt can create p99/p999 latency spikes.
- Read path may consult multiple levels/SSTables (especially under stale tombstones or poor compaction state).
- Operational complexity is materially higher (many knobs, many failure modes).
Typical fit
- High-ingest event streams/time-series/log-style workloads
- Systems where absorbing writes under heavy load is the primary requirement
- Teams ready to own compaction observability and tuning discipline
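LSM write amplification itself can be estimated up front. A common back-of-envelope for leveled compaction is WA ≈ (number of levels) × fanout, plus small constants for the WAL and the memtable flush. The function below is that approximation only, not a measurement, and the default L1 size and fanout are placeholder values:

```python
import math

def estimated_write_amp(db_bytes: int, l1_bytes: int = 256 * 2**20, fanout: int = 10) -> int:
    """Back-of-envelope WA for leveled compaction: a byte is rewritten roughly
    `fanout` times per level it passes through, plus 1x WAL and 1x flush."""
    levels = max(1, math.ceil(math.log(db_bytes / l1_bytes, fanout)) + 1)
    return 2 + levels * fanout

print(estimated_write_amp(1 * 2**40))  # ~52 for a 1 TB database
```

Comparing that number against the B-tree's dirty-page cost for your actual key pattern is often more informative than any generic benchmark.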
Decision matrix (quick version)
Choose B-tree-first when:
- Point reads dominate and must stay stable
- Write rate is significant but not extreme
- Operational simplicity is a top requirement
Choose LSM-first when:
- Write amplification cost in B-tree becomes unacceptable
- Ingest spikes are frequent and large
- You can invest in compaction tuning + SLO-aware runbooks
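The quick matrix can be encoded as a starting-point heuristic. This is deliberately coarse (three yes/no judgments, all of them team-specific assumptions) and yields a candidate to prototype first, not a verdict:

```python
def first_candidate(point_read_slo_strict: bool,
                    extreme_or_bursty_ingest: bool,
                    can_own_compaction_ops: bool) -> str:
    """Starting point derived from the quick matrix, not a final decision."""
    if extreme_or_bursty_ingest and can_own_compaction_ops:
        return "lsm-first"
    if point_read_slo_strict or not can_own_compaction_ops:
        return "btree-first"
    return "prototype-both"

print(first_candidate(point_read_slo_strict=True,
                      extreme_or_bursty_ingest=False,
                      can_own_compaction_ops=False))  # btree-first
```

Note the asymmetry: lacking compaction ownership vetoes LSM-first even under high ingest, which matches the operational emphasis above.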
Production signals to watch
If you run B-tree
- Page split rate
- Buffer pool hit ratio by table/index
- Secondary index maintenance cost
- Checkpoint and flush pressure
If you run LSM
- Pending compaction bytes / compaction backlog
- Read amplification trend by key access pattern
- Tombstone density and stale-data retention
- Write stall events and stall duration
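These signals only help if they are wired into SLO-aware checks. A sketch for the LSM side (the metric names and thresholds are placeholders to replace with your exporter's real counters, e.g. RocksDB's pending-compaction-bytes statistic):

```python
def lsm_health(metrics: dict) -> list:
    """Return alert strings for an LSM metrics snapshot. All thresholds
    are placeholder budgets; calibrate them against your own SLOs."""
    alerts = []
    if metrics["pending_compaction_bytes"] > 64 * 2**30:
        alerts.append("compaction backlog: risk of write stalls")
    if metrics["write_stall_seconds_1h"] > 0:
        alerts.append("write stalls already occurring")
    if metrics["sstables_per_read_p99"] > 8:
        alerts.append("read amplification drifting up")
    return alerts

print(lsm_health({"pending_compaction_bytes": 80 * 2**30,
                  "write_stall_seconds_1h": 0,
                  "sstables_per_read_p99": 4}))
```

The B-tree side deserves the same treatment with page-split rate and checkpoint pressure in place of compaction backlog.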
Practical anti-patterns
“LSM is always faster for writes”
- True only until compaction debt catches up.
“B-tree is old, so it’s slower”
- For many real OLTP shapes, B-tree gives cleaner latency.
Benchmarking only average throughput
- p95/p99 under mixed read/write + maintenance windows decides production reality.
Ignoring deletion semantics in LSM
- Tombstones are not free; poor lifecycle handling quietly taxes reads and storage.
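The tombstone tax is easiest to see in a toy merged-iterator view, where each entry is a (key, value) pair and value None marks a tombstone. The classic queue-like anti-pattern (consume from the head, delete, repeat) piles deletes in front of live data:

```python
def entries_examined_for_first_live(entries):
    """How many entries a scan must skim before it can return one live row."""
    for examined, (_key, value) in enumerate(entries, start=1):
        if value is not None:
            return examined
    return len(entries)

# 10,000 tombstones queued ahead of a single live row:
log = [(k, None) for k in range(10_000)] + [(10_000, b"live")]
print(entries_examined_for_first_live(log))  # 10001
```

Until compaction drops those tombstones (which itself waits on grace periods in some systems), every head-of-range read pays this skim cost.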
A pragmatic evaluation protocol
Before final selection, test both engines with a workload harness that includes:
- Real key distribution (not purely uniform)
- Realistic read/write mix and burst profile
- TTL/delete patterns if applicable
- Tail-latency SLO checks (p95/p99/p999)
- Long-run test horizon to expose compaction/checkpoint cycles
Run each candidate long enough for background maintenance behavior to appear. Short tests mostly measure cache warmth, not operational truth.
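A harness can start from a skeleton like this. The engine interface (get/put), key-space size, and Zipf skew are all assumptions to replace with an adapter around each real candidate and your measured traffic model; the long-run requirement still means running it for hours, not the seconds shown here:

```python
import random
import time

def zipf_like_keys(n_keys: int, skew: float, count: int, seed: int = 0) -> list:
    """Skewed key stream via explicit Zipf-style weights (illustrative only)."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** skew) for rank in range(1, n_keys + 1)]
    return rng.choices(range(n_keys), weights=weights, k=count)

def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile over a sample list."""
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(len(ordered) * p))]

def run_mixed_workload(engine, ops: int = 10_000, read_ratio: float = 0.8,
                       seed: int = 0) -> dict:
    """`engine` is any adapter exposing get(key) and put(key, value);
    wrap each candidate storage engine behind this interface."""
    rng = random.Random(seed)
    latencies = []
    for key in zipf_like_keys(1_000, skew=1.1, count=ops, seed=seed):
        start = time.perf_counter()
        if rng.random() < read_ratio:
            engine.get(key)
        else:
            engine.put(key, b"x" * 100)
        latencies.append(time.perf_counter() - start)
    return {"p95": percentile(latencies, 0.95),
            "p99": percentile(latencies, 0.99),
            "p999": percentile(latencies, 0.999)}
```

Add delete/TTL phases and forced maintenance windows (compaction triggers, checkpoints) before trusting any tail-latency numbers it produces.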
Bottom line
- B-tree: usually simpler and stronger for stable read-latency OLTP.
- LSM: usually better for sustained high write ingest, but only with mature compaction operations.
The right question is not “which engine is better?”
The right question is: “Which failure mode can we detect early and operate reliably at 2 a.m.?”