Probabilistic Data Structures Playbook (Bloom Filter, Count-Min Sketch, HyperLogLog)

2026-02-26 · software

Category: knowledge
Domain: data infrastructure / high-scale backend

Why this matters

At scale, exact data structures become expensive fast: exact sets and counters grow linearly with the number of distinct keys, so a hash set of a billion 64-bit IDs costs tens of gigabytes once per-entry overhead is counted.

Probabilistic structures trade a small, controlled error for massive gains in memory and speed.

If you pick the right sketch for the right question, you can ship systems that are both cheaper and more responsive.


One-line rule of thumb

Bloom answers “have I seen this key?”; Count-Min Sketch answers “roughly how often does it occur?”; HyperLogLog answers “roughly how many distinct values are there?”

1) Bloom Filter — fast approximate membership

Best use cases

  - Skipping expensive lookups for keys that are almost certainly absent (cache and storage pre-checks, as in LSM stores such as RocksDB and Cassandra).
  - Deduplicating already-seen IDs or URLs in pipelines and crawlers.

Guarantees

  - No false negatives: if the filter says “absent,” the item was never inserted.
  - False positives occur at a tunable rate that rises as the filter fills.
  - The standard variant supports inserts and lookups only; deletes need a counting variant.

Core sizing math

For expected inserts n, bit-array size m, hash count k:

  - false positive rate: p ≈ (1 − e^(−kn/m))^k
  - optimal hash count: k = (m/n) · ln 2
  - bits for a target rate: m = −n · ln(p) / (ln 2)^2

Useful intuition: about 9.6 bits per element gives p ≈ 1%, and every additional ~4.8 bits per element cuts the false positive rate by another 10×.
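The sizing formulas can be exercised in a minimal Python sketch (an illustration, not a production filter; the sha1-derived double hashing and the use of a Python int as the bit array are assumptions made here for brevity):

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter sized from n and a target false-positive rate."""

    def __init__(self, n, fp_rate=0.01):
        # m = -n ln(p) / (ln 2)^2, k = (m/n) ln 2
        self.m = math.ceil(-n * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / n * math.log(2)))
        self.bits = 0  # big int used as a bit array

    def _positions(self, item):
        # Double hashing (Kirsch-Mitzenmacher): g_i(x) = h1(x) + i * h2(x)
        digest = hashlib.sha1(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))
```

For n = 1000 and a 1% target this sizes to m = 9586 bits and k = 7 hashes, matching the ~9.6-bits-per-element intuition above.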

Practical pitfalls

  - Undersizing: once inserts exceed the planned n, the false positive rate climbs quickly.
  - No deletes in the basic variant; use a counting Bloom filter if removal is required.
  - Hash quality matters; two independent hashes combined via double hashing are usually enough.


2) Count-Min Sketch — approximate frequency at stream scale

Best use cases

  - Heavy-hitter / top-k detection over high-volume streams.
  - Approximate per-key counters for rate limiting, trending, and abuse detection.

Guarantees

Let total count mass be N (the sum of all counts inserted). For error factor ε and failure probability δ:

  - the estimate never undercounts: estimate ≥ true count,
  - with probability at least 1 − δ: estimate ≤ true count + εN.

Sizing math

  - width: w = ⌈e / ε⌉
  - depth: d = ⌈ln(1/δ)⌉
  - memory: w × d counters

Interpretation: width controls the size of the overcount (the εN term), depth controls how often that bound fails (δ). Doubling the width halves ε; each extra row cuts the failure probability by roughly a factor of e.
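A compact Count-Min Sketch in Python makes the width/depth split concrete (illustrative only; the per-row sha1 hashing is an assumption, and real implementations use faster pairwise-independent hash families):

```python
import hashlib
import math

class CountMinSketch:
    """d rows of w counters; query returns the minimum across rows."""

    def __init__(self, eps=0.001, delta=0.01):
        self.w = math.ceil(math.e / eps)         # width: controls the eps*N error
        self.d = math.ceil(math.log(1 / delta))  # depth: controls failure prob delta
        self.table = [[0] * self.w for _ in range(self.d)]

    def _indexes(self, key):
        # One hash per row, derived by salting the key with the row number.
        for row in range(self.d):
            h = hashlib.sha1(f"{row}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.w

    def add(self, key, count=1):
        for row, col in enumerate(self._indexes(key)):
            self.table[row][col] += count

    def query(self, key):
        # Never undercounts; overcounts by at most eps*N with prob >= 1 - delta.
        return min(self.table[row][col] for row, col in enumerate(self._indexes(key)))
```

With ε = 0.001 and δ = 0.01 this allocates w = 2719 × d = 5 counters, a few tens of KB regardless of how many distinct keys the stream contains.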

Practical pitfalls

  - Error is additive in N: heavy hitters are estimated well, but rare keys can be swamped by noise from the rest of the stream.
  - It only overcounts; on skewed streams, the conservative-update variant reduces inflation for low-frequency keys.


3) HyperLogLog — approximate distinct counting

Best use cases

  - Distinct counts at scale: unique visitors, distinct IPs, unique search terms.
  - Per-partition cardinalities that must merge into one global number.

Guarantees

For m = 2^p registers:

  - standard error ≈ 1.04 / √m
  - memory ≈ m registers of 5–6 bits each

Examples: p = 12 gives m = 4096 and ~1.6% error in a few KB; p = 14 gives m = 16384 and ~0.81% error in about 12 KB with packed 6-bit registers (the precision Redis uses).
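The estimator fits in a few dozen lines of Python. This is a simplified illustration under stated assumptions: sha1 stands in for a well-mixed 64-bit hash, and only the linear-counting small-range correction is included (production HLL++ adds further bias correction):

```python
import hashlib
import math

class HyperLogLog:
    """Simplified HLL: top p hash bits pick a register, the rest give the rank."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                             # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant for m >= 128

    def _hash64(self, item):
        # Any well-mixed 64-bit hash works; sha1 is an illustrative choice.
        return int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")

    def add(self, item):
        x = self._hash64(item)
        idx = x >> (64 - self.p)                      # top p bits: register index
        rest = x & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # linear counting, small n
        return raw
```

Re-adding an already-seen value leaves the registers unchanged, which is what makes the structure a distinct counter rather than a total counter.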

Operational strengths

  - Fixed, small memory regardless of how many distinct values arrive.
  - Lossless merging: combining two HLLs by register-wise max yields the HLL of the union, which makes distributed and windowed counting straightforward.

Practical pitfalls

  - The raw estimator is biased at small cardinalities; use an implementation with small-range correction (linear counting or HLL++).
  - HLL answers “how many distinct?” only; it cannot tell you whether a specific value was present.


Choosing quickly (production decision tree)

  1. Need approximate set membership with fast negative checks? → Bloom
  2. Need approximate per-key frequency in a stream? → CMS
  3. Need approximate unique count and easy merge across partitions? → HLL

If the answer is “I need two of these,” that is normal. Most mature systems run multiple sketches side-by-side.


Common architecture pattern (works well in practice)

On the hot path, each event updates the relevant sketches in a single pass: Bloom for dedup and pre-checks, CMS for per-key counts, HLL for distinct counts, each kept per partition alongside the data.

Then periodically:

  1. rotate sketches on a time window so saturation and error do not accumulate,
  2. merge per-partition sketches into global views,
  3. validate estimates against an exact count on a small sample.
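Periodic merging works because the sketches are composable. For HLL the merge is just a register-wise max, and the result is exactly the sketch of the union of the two input streams; a minimal helper (hypothetical name) shows the whole operation:

```python
def merge_hll_registers(a, b):
    """Merge two HLL register arrays of the same precision.

    Register-wise max gives exactly the HLL that would have been built
    from the union of both input streams, with no additional error.
    """
    if len(a) != len(b):
        raise ValueError("can only merge HLLs with the same precision")
    return [max(x, y) for x, y in zip(a, b)]
```

Bloom filters merge the same way with bitwise OR, and CMS tables merge by element-wise addition, so per-partition sketches roll up into global views without reprocessing events.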


Error budgeting (the part teams skip)

Treat approximation as an explicit SLO input.

Example policy:

  - HLL distinct counts: relative error ≤ 1%
  - CMS frequencies: overcount ≤ 0.1% of total mass N
  - Bloom membership: false positive rate ≤ 1% at planned capacity

When error exceeds budget:

  1. alert,
  2. rotate/rebuild sketch,
  3. re-baseline with exact sample,
  4. publish impact note (which decisions were affected).

Implementation checklist

  1. Use a well-tested library implementation rather than hand-rolled hashing.
  2. Size each sketch from measured n, N, and cardinality, with headroom.
  3. Monitor fill ratio (Bloom), total mass N (CMS), and small-range bias (HLL).
  4. Plan rotation and merging from day one.
  5. Document the error budget where consumers of the numbers can see it.

