B-tree vs LSM-tree Storage Engine Selection Playbook

2026-03-13 · software


Why this matters

Choosing a storage engine is less about benchmark screenshots and more about which pain you want to pay: write amplification, read amplification, or space and operational overhead.

If this choice is wrong, teams usually discover it late—under production write pressure, p99 latency incidents, and disk-cost surprises.


The core trade-off triangle

A practical way to reason about both engines:

  1. Write amplification (WA): how many physical writes per logical write
  2. Read amplification (RA): how many structures/pages/SSTables touched per read
  3. Space amplification (SA): how much extra storage overhead you carry

You can usually improve one only by worsening at least one of the others.
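As a back-of-envelope illustration, the three factors for a leveled LSM can be estimated from level count and fanout (all numbers here are illustrative assumptions, not measurements of any real engine):

```python
# Back-of-envelope amplification estimates for a leveled LSM-tree.
# Parameters and formulas are illustrative assumptions, not measurements.

def lsm_estimates(levels: int, fanout: int) -> dict:
    # Write amplification: each key is rewritten roughly `fanout` times
    # per level as compaction merges it downward through `levels` levels.
    wa = levels * fanout
    # Read amplification (worst case, no bloom filters): one memtable
    # probe plus one SSTable probe per level.
    ra = 1 + levels
    # Space amplification: a rough upper bound allowing one obsolete
    # copy per level above the bottom.
    sa = 1 + levels / fanout
    return {"write_amp": wa, "read_amp": ra, "space_amp": round(sa, 2)}

print(lsm_estimates(levels=6, fanout=10))
# → {'write_amp': 60, 'read_amp': 7, 'space_amp': 1.6}
```

The point is not the exact numbers but the shape: raising fanout lowers read and space amplification while raising write amplification, which is the triangle in miniature.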


B-tree profile (InnoDB-like mental model)

Strengths

  • Predictable point-read latency: one root-to-leaf path, usually mostly cached
  • In-place updates keep read and space amplification low
  • Mature transactional machinery (locking, MVCC, crash recovery) and tooling
  • Efficient ordered range scans over clustered keys

Costs

  • Random-I/O write pattern; page splits fragment the tree over time
  • Write amplification from flushing whole dirty pages plus redo logging
  • Checkpoint and flush pressure can stall foreground work under heavy writes

Typical fit

  • Read-heavy or mixed OLTP with strict read-latency SLOs and moderate write rates


LSM profile (RocksDB/LevelDB-like mental model)

Strengths

  • Sequential write path: append to the WAL, buffer in the memtable, flush sorted runs
  • High sustained ingest throughput
  • SSTables compress well, often lowering storage cost per byte
  • Background compaction moves reorganization off the foreground write path

Costs

  • Reads may touch the memtable plus several SSTables (bloom filters mitigate this, not eliminate it)
  • Compaction competes for CPU and disk bandwidth; falling behind causes write stalls
  • Deletes are tombstones that consume space and slow reads until compacted away

Typical fit

  • Write-heavy or ingest-heavy workloads (events, time series, queues) that tolerate somewhat higher read-tail latency


Decision matrix (quick version)

Choose B-tree-first when:

  • Reads dominate and read p99 is the binding SLO
  • The working set largely fits in cache
  • You rely on rich transactional semantics and mature operational tooling

Choose LSM-first when:

  • Writes dominate or ingest rate is the binding constraint
  • Storage cost matters and the data compresses well
  • You can manage deletion/TTL lifecycle and monitor compaction deliberately
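The matrix above can be condensed into a rough first-pass heuristic; the scoring and thresholds here are illustrative, not prescriptive, and should never replace the evaluation protocol below:

```python
# Rough first-pass engine heuristic; weights and thresholds are
# illustrative assumptions, not prescriptive rules.
def first_pass_choice(write_ratio: float, strict_read_p99: bool,
                      storage_cost_sensitive: bool) -> str:
    score = 0
    score += 2 if write_ratio > 0.5 else -2      # write-heavy favors LSM
    score += -1 if strict_read_p99 else 0        # tight read tails favor B-tree
    score += 1 if storage_cost_sensitive else 0  # compression favors LSM
    return "LSM-first" if score > 0 else "B-tree-first"

print(first_pass_choice(0.7, strict_read_p99=False,
                        storage_cost_sensitive=True))   # → LSM-first
print(first_pass_choice(0.1, strict_read_p99=True,
                        storage_cost_sensitive=False))  # → B-tree-first
```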


Production signals to watch

If you run B-tree

  • Buffer-pool hit rate and working-set growth relative to cache
  • Checkpoint/flush stalls and redo-log pressure
  • Page-split rate and index fragmentation

If you run LSM

  • Compaction backlog (pending bytes) and write-stall counters
  • SSTable count per level, especially L0 (read amplification)
  • Tombstone density in delete-heavy key ranges
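Such signals can be wired into a simple alerting check. The metric names and thresholds below are hypothetical placeholders, not any real exporter's schema; map them to whatever your engine actually exposes:

```python
# Hypothetical LSM health check; metric names and thresholds are
# placeholders to be mapped onto your engine's real counters.
def lsm_health(metrics: dict) -> list:
    alerts = []
    if metrics.get("compaction_pending_bytes", 0) > 50 * 2**30:
        alerts.append("compaction backlog > 50 GiB")
    if metrics.get("write_stall_seconds", 0) > 0:
        alerts.append("write stalls observed")
    if metrics.get("l0_sstable_count", 0) > 20:
        alerts.append("L0 file count high: read amplification risk")
    return alerts

print(lsm_health({"compaction_pending_bytes": 60 * 2**30,
                  "write_stall_seconds": 0,
                  "l0_sstable_count": 25}))
```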


Practical anti-patterns

  1. “LSM is always faster for writes”

    • True only until compaction debt catches up.
  2. “B-tree is old, so it’s slower”

    • For many real OLTP shapes, B-tree gives cleaner latency.
  3. Benchmarking only average throughput

    • p95/p99 under mixed read/write + maintenance windows decides production reality.
  4. Ignoring deletion semantics in LSM

    • Tombstones are not free; poor lifecycle handling quietly taxes reads and storage.
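The tombstone point deserves a concrete picture. In the toy model below (a list of dicts standing in for SSTables, newest first), a point read scans runs newest-to-oldest; a deleted key still costs a probe, and both the tombstone and the shadowed old value occupy storage until compaction drops them:

```python
# Toy model of LSM point reads over tombstones (illustrative only).
TOMBSTONE = object()

def read(sstables, key):
    """Scan SSTables newest-to-oldest; return (value, probes made)."""
    probes = 0
    for table in sstables:
        probes += 1
        if key in table:
            value = table[key]
            # A tombstone resolves the read as "deleted" (None).
            return (None if value is TOMBSTONE else value), probes
    return None, probes

# Newest first: key "a" was deleted, yet its tombstone and the old
# value in the oldest run both persist until compaction merges them.
sstables = [{"a": TOMBSTONE}, {"b": 2}, {"a": 1}]
print(read(sstables, "a"))  # → (None, 1)
print(read(sstables, "b"))  # → (2, 2)
```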

A pragmatic evaluation protocol

Before final selection, test both engines with a workload harness that includes your real read/write mix, realistic key and value size distributions, your delete/TTL patterns, and concurrent background maintenance (compaction or checkpointing).

Run each candidate long enough for background maintenance behavior to appear. Short tests mostly measure cache warmth, not operational truth.
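A minimal harness sketch, measuring the tail percentiles the anti-patterns section warns about; the `engine` object with `get`/`put` methods is a hypothetical stand-in for whichever candidate you wrap, and `DictEngine` exists only so the sketch runs end-to-end:

```python
# Minimal mixed-workload harness sketch. The get/put engine interface
# is a hypothetical stand-in for either candidate storage engine.
import random
import time
from statistics import quantiles

def run_mixed_workload(engine, ops=10_000, write_ratio=0.3, seed=42):
    rng = random.Random(seed)
    latencies = []
    for i in range(ops):
        key = f"k{rng.randrange(1_000)}"
        start = time.perf_counter()
        if rng.random() < write_ratio:
            engine.put(key, i)
        else:
            engine.get(key)
        latencies.append(time.perf_counter() - start)
    p = quantiles(latencies, n=100)  # 99 cut points: p[49]=p50, p[98]=p99
    return {"p50": p[49], "p95": p[94], "p99": p[98]}

class DictEngine:  # trivial stand-in so the harness runs end-to-end
    def __init__(self): self.data = {}
    def put(self, key, value): self.data[key] = value
    def get(self, key): return self.data.get(key)

print(run_mixed_workload(DictEngine()))
```

For a real evaluation, extend the loop to run for hours, not seconds, so compaction or checkpointing activity lands inside the measurement window.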


Bottom line

The right question is not “which engine is better?”

The right question is: “Which failure mode can we detect early and operate reliably at 2 a.m.?”