Probabilistic Data Structures Playbook (Bloom Filter, Count-Min Sketch, HyperLogLog)
Date: 2026-02-26
Category: knowledge
Domain: data infrastructure / high-scale backend
Why this matters
At scale, exact data structures become expensive fast:
- exact sets grow too large for hot memory,
- exact counters are too heavy for high-cardinality streams,
- exact distinct counting is slow in distributed pipelines.
Probabilistic structures trade a small, controlled error for massive gains in memory and speed.
If you pick the right sketch for the right question, you can ship systems that are both cheaper and more responsive.
One-line rule of thumb
- Bloom Filter: “Have I probably seen this key before?” (membership)
- Count-Min Sketch (CMS): “How often have I seen this key?” (frequency)
- HyperLogLog (HLL): “How many unique keys did I see?” (cardinality)
1) Bloom Filter — fast approximate membership
Best use cases
- pre-check before DB/cache lookup,
- duplicate suppression in ingest pipelines,
- anti-abuse prefiltering,
- “probably not present” fast path.
Guarantees
- False negatives: none (for standard insert-only Bloom filter)
- False positives: possible
Core sizing math
For expected inserts n, bit-array size m, hash count k:
- false-positive rate: p ≈ (1 - e^{-kn/m})^k
- optimal hash count: k* = (m/n) ln 2
- bits-per-item target: m/n = -ln(p) / (ln 2)^2
Useful intuition:
- ~10 bits/item gives roughly a ~1% FP rate (with near-optimal k).
- A lower FP target costs memory quickly.
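The sizing formulas above translate directly into code. A minimal sketch in Python (the function name is mine, chosen for illustration):

```python
import math

def bloom_params(n: int, p: float) -> tuple[int, int]:
    """Bit-array size m and hash count k for n expected inserts at FP rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # m/n = -ln(p) / (ln 2)^2
    k = max(1, round((m / n) * math.log(2)))              # k* = (m/n) ln 2
    return m, k

m, k = bloom_params(1_000_000, 0.01)
# ~9.59 bits/item and k = 7, matching the "~10 bits/item for ~1% FP" intuition
```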
Practical pitfalls
- Capacity drift: if the real n exceeds planned capacity, the FP rate explodes.
- No deletions in a plain Bloom filter (use counting/cuckoo alternatives when delete is required).
- Poor hash quality creates correlated collisions and worse real error.
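To make the membership semantics concrete, here is a minimal insert-only Bloom filter in Python. Deriving k indexes from one SHA-256 digest via double hashing is an illustrative choice, not a recommendation for production:

```python
import hashlib

class BloomFilter:
    """Minimal insert-only Bloom filter (illustrative sketch, not production code)."""
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _indexes(self, key: str):
        # Double hashing: derive k indexes from two halves of one digest.
        d = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1  # odd step avoids short cycles
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, key: str) -> None:
        for i in self._indexes(key):
            self.bits[i // 8] |= 1 << (i % 8)

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True means "probably present".
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indexes(key))
```

Note that `might_contain` returning False is the guaranteed fast path; True still needs the authoritative lookup behind it.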
2) Count-Min Sketch — approximate frequency at stream scale
Best use cases
- heavy-hitter detection,
- abuse/rate hot-key monitoring,
- stream telemetry summaries,
- approximate top talker tracking.
Guarantees
Let total count mass be N.
- estimate is never below truth (overestimation bias),
- with probability 1 - δ: estimate(x) ≤ true(x) + εN.
Sizing math
- width: w = ceil(e / ε)
- depth: d = ceil(ln(1/δ))
- memory: w * d * counter_bytes
Interpretation:
- smaller ε → tighter error, more width,
- smaller δ → higher confidence, more depth.
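The sizing math and the one-sided guarantee are both visible in a compact implementation. A minimal CMS sketch in Python (again using double hashing from one digest as an illustrative hash scheme):

```python
import hashlib
import math

class CountMinSketch:
    """Minimal CMS: w = ceil(e/eps) columns, d = ceil(ln(1/delta)) rows."""
    def __init__(self, eps: float, delta: float):
        self.w = math.ceil(math.e / eps)
        self.d = math.ceil(math.log(1 / delta))
        self.table = [[0] * self.w for _ in range(self.d)]

    def _cols(self, key: str):
        # One column per row, derived from two halves of one digest.
        dig = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(dig[:8], "big")
        h2 = int.from_bytes(dig[8:16], "big") | 1
        return [(h1 + i * h2) % self.w for i in range(self.d)]

    def add(self, key: str, count: int = 1) -> None:
        for row, col in enumerate(self._cols(key)):
            self.table[row][col] += count

    def estimate(self, key: str) -> int:
        # Min over rows: never below the true count, may overestimate.
        return min(self.table[row][col] for row, col in enumerate(self._cols(key)))
```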
Practical pitfalls
- Collision inflation hurts low-frequency keys most.
- If you need decrements/deletes, pick a variant carefully.
- CMS alone does not solve “top-K” perfectly; combine with heap/top-k sketch for robust ranking.
3) HyperLogLog — approximate distinct counting
Best use cases
- daily/monthly active users,
- unique sessions/events/IPs at scale,
- cross-shard distinct counting with merge,
- dashboard-level cardinality metrics.
Guarantees
For m = 2^p registers:
- relative standard error is about 1.04 / sqrt(m).
Examples:
- p=12 (m=4096) → ~1.6% error,
- p=14 (m=16384) → ~0.8% error.
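The error formula is easy to check numerically (function name is mine):

```python
import math

def hll_std_error(p: int) -> float:
    """Relative standard error for an HLL with m = 2**p registers."""
    return 1.04 / math.sqrt(2 ** p)

# hll_std_error(12) = 0.01625 (~1.6%); hll_std_error(14) = 0.008125 (~0.8%)
```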
Operational strengths
- Fixed memory footprint,
- Merge-friendly (register-wise max),
- Excellent for distributed pipelines.
Practical pitfalls
- Small-cardinality ranges need bias correction logic (modern implementations handle this).
- Not suitable when exact legal/accounting numbers are mandatory.
- Hash uniformity quality directly affects estimator stability.
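Pulling the pieces together, here is a compact toy HLL in Python with the register-wise-max merge and the small-range (linear counting) correction. This is a sketch of the idea only; production libraries such as Apache DataSketches apply fuller bias correction:

```python
import hashlib
import math

class HyperLogLog:
    """Toy HLL for illustration; not a substitute for a hardened library."""
    def __init__(self, p: int = 12):
        self.p, self.m = p, 1 << p
        self.regs = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # estimator constant, m >= 128

    def add(self, key: str) -> None:
        h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining bits give the rank
        rank = (64 - self.p) - rest.bit_length() + 1
        self.regs[idx] = max(self.regs[idx], rank)

    def merge(self, other: "HyperLogLog") -> None:
        # Register-wise max: the result estimates the cardinality of the union.
        self.regs = [max(a, b) for a, b in zip(self.regs, other.regs)]

    def count(self) -> float:
        z = sum(2.0 ** -r for r in self.regs)
        e = self.alpha * self.m * self.m / z
        if e <= 2.5 * self.m:                       # small-range correction
            zeros = self.regs.count(0)
            if zeros:
                e = self.m * math.log(self.m / zeros)
        return e
```

The `merge` method is what makes HLL friendly to sharded pipelines: each shard ships its register array, and the union's estimate needs only an element-wise max.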
Choosing quickly (production decision tree)
- Need approximate set membership with fast negative checks? → Bloom
- Need approximate per-key frequency in a stream? → CMS
- Need approximate unique count and easy merge across partitions? → HLL
If the answer is “I need two of these,” that is normal. Most mature systems run multiple sketches side-by-side.
Common architecture pattern (works well in practice)
- Edge ingest:
- Bloom filter for duplicate/noise prefilter.
- Stream analytics:
- CMS for hot-key and abuse pressure signals.
- Product/business metrics:
- HLL for DAU/WAU/MAU and unique dimensions.
Then periodically:
- sample exact ground truth on a small slice,
- compare sketch error vs SLO,
- adjust parameters (m, k, ε, δ, p) before drift hurts decisions.
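The "compare sketch error vs SLO" step can be a few lines. A hypothetical helper for the CMS case (name and shape are mine; `estimate` is any callable such as a sketch's estimate method):

```python
from collections import Counter

def cms_within_budget(estimate, exact: Counter, n_total: int, eps: float) -> bool:
    """True if the worst additive error on a sampled slice stays under eps * N."""
    worst = max(abs(estimate(key) - true_count) for key, true_count in exact.items())
    return worst <= eps * n_total
```

Run it periodically against an exact count on a small sampled slice; a False result is the trigger for the alert/rotate/re-baseline steps above.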
Error budgeting (the part teams skip)
Treat approximation as an explicit SLO input.
Example policy:
- Bloom FP ≤ 1.0% for online gating,
- CMS additive error ≤ 0.5% of stream mass at 99% confidence,
- HLL relative error ≤ 1.0% for executive dashboards.
When error exceeds budget:
- alert,
- rotate/rebuild sketch,
- re-baseline with exact sample,
- publish impact note (which decisions were affected).
Implementation checklist
- Standardize hashing primitives across services.
- Version sketch parameters in metadata.
- Log sketch saturation metrics (fill ratio, collision pressure, register distribution).
- Keep “exact fallback path” for audits and incident replay.
- Document where probabilistic outputs are allowed (and where forbidden).
References (researched)
- Bloom, B. H. (1970), Space/time trade-offs in hash coding with allowable errors (Communications of the ACM)
https://dl.acm.org/doi/10.1145/362686.362692 - Cormode, G., Muthukrishnan, S., An Improved Data Stream Summary: The Count-Min Sketch and its Applications
https://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf - Flajolet, P. et al. (2007), HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf - Redis probabilistic data structures docs (Bloom/CMS/HLL ecosystem in practice)
https://redis.io/probabilistic/ - Apache DataSketches HLL docs
https://datasketches.apache.org/docs/HLL/HllSketches.html