Probabilistic Data Structures Playbook (Bloom Filter, Count-Min Sketch, HyperLogLog)
Date: 2026-02-26
Category: knowledge
Domain: data infrastructure / high-scale backend
Why this matters
At scale, exact data structures become expensive fast:
- exact sets grow too large for hot memory,
- exact counters are too heavy for high-cardinality streams,
- exact distinct counting is slow in distributed pipelines.
Probabilistic structures trade a small, controlled error for massive gains in memory and speed.
If you pick the right sketch for the right question, you can ship systems that are both cheaper and more responsive.
One-line rule of thumb
- Bloom Filter: “Have I probably seen this key before?” (membership)
- Count-Min Sketch (CMS): “How often have I seen this key?” (frequency)
- HyperLogLog (HLL): “How many unique keys did I see?” (cardinality)
1) Bloom Filter — fast approximate membership
Best use cases
- pre-check before DB/cache lookup,
- duplicate suppression in ingest pipelines,
- anti-abuse prefiltering,
- “probably not present” fast path.
Guarantees
- False negatives: none (for standard insert-only Bloom filter)
- False positives: possible
Core sizing math
For expected inserts n, bit-array size m, hash count k:
- false-positive rate: p ≈ (1 - e^{-kn/m})^k
- optimal hash count: k* = (m/n) ln 2
- bits-per-item target: m/n = -ln(p) / (ln 2)^2
Useful intuition:
- ~10 bits/item gives roughly a ~1% FP rate (with near-optimal k).
- A lower FP target costs memory quickly.
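The sizing formulas above translate directly into code. A minimal sketch in Python (the function name is mine, chosen for illustration):

```python
import math

def bloom_params(n: int, p: float) -> tuple[int, int]:
    """Bit-array size m and hash count k for n expected inserts at FP rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # m/n = -ln(p) / (ln 2)^2
    k = max(1, round((m / n) * math.log(2)))              # k* = (m/n) ln 2
    return m, k

m, k = bloom_params(1_000_000, 0.01)
# ~9.59 bits/item and k = 7, matching the "~10 bits/item for ~1% FP" intuition
```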
Practical pitfalls
- Capacity drift: if the real n exceeds planned capacity, the FP rate explodes.
- No deletions in a plain Bloom filter (use counting/cuckoo alternatives when delete is required).
- Poor hash quality creates correlated collisions and worse real error.
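To make the membership semantics concrete, here is a minimal insert-only Bloom filter in Python. Deriving k indexes from one SHA-256 digest via double hashing is an illustrative choice, not a recommendation for production:

```python
import hashlib

class BloomFilter:
    """Minimal insert-only Bloom filter (illustrative sketch, not production code)."""
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _indexes(self, key: str):
        # Double hashing: derive k indexes from two halves of one digest.
        d = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1  # odd step avoids short cycles
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, key: str) -> None:
        for i in self._indexes(key):
            self.bits[i // 8] |= 1 << (i % 8)

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True means "probably present".
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indexes(key))
```

Note that `might_contain` returning False is the guaranteed fast path; True still needs the authoritative lookup behind it.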
2) Count-Min Sketch — approximate frequency at stream scale
Best use cases
- heavy-hitter detection,
- abuse/rate hot-key monitoring,
- stream telemetry summaries,
- approximate top talker tracking.
Guarantees
Let total count mass be N.
- estimate is never below truth (overestimation bias),
- with probability 1 - δ: estimate(x) ≤ true(x) + εN.
Sizing math
- width: w = ceil(e / ε)
- depth: d = ceil(ln(1/δ))
- memory: w * d * counter_bytes
Interpretation:
- smaller ε → tighter error, more width,
- smaller δ → higher confidence, more depth.
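The sizing math and the one-sided guarantee are both visible in a compact implementation. A minimal CMS sketch in Python (again using double hashing from one digest as an illustrative hash scheme):

```python
import hashlib
import math

class CountMinSketch:
    """Minimal CMS: w = ceil(e/eps) columns, d = ceil(ln(1/delta)) rows."""
    def __init__(self, eps: float, delta: float):
        self.w = math.ceil(math.e / eps)
        self.d = math.ceil(math.log(1 / delta))
        self.table = [[0] * self.w for _ in range(self.d)]

    def _cols(self, key: str):
        # One column per row, derived from two halves of one digest.
        dig = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(dig[:8], "big")
        h2 = int.from_bytes(dig[8:16], "big") | 1
        return [(h1 + i * h2) % self.w for i in range(self.d)]

    def add(self, key: str, count: int = 1) -> None:
        for row, col in enumerate(self._cols(key)):
            self.table[row][col] += count

    def estimate(self, key: str) -> int:
        # Min over rows: never below the true count, may overestimate.
        return min(self.table[row][col] for row, col in enumerate(self._cols(key)))
```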
Practical pitfalls
- Collision inflation hurts low-frequency keys most.
- If you need decrements/deletes, pick a variant carefully.
- CMS alone does not solve “top-K” perfectly; combine with heap/top-k sketch for robust ranking.
3) HyperLogLog — approximate distinct counting
Best use cases
- daily/monthly active users,
- unique sessions/events/IPs at scale,
- cross-shard distinct counting with merge,
- dashboard-level cardinality metrics.
Guarantees
For m = 2^p registers:
- relative standard error is about 1.04 / sqrt(m).
Examples:
- p=12 (m=4096) → ~1.6% error,
- p=14 (m=16384) → ~0.8% error.
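The error formula is easy to check numerically (function name is mine):

```python
import math

def hll_std_error(p: int) -> float:
    """Relative standard error for an HLL with m = 2**p registers."""
    return 1.04 / math.sqrt(2 ** p)

# hll_std_error(12) = 0.01625 (~1.6%); hll_std_error(14) = 0.008125 (~0.8%)
```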
Operational strengths
- Fixed memory footprint,
- Merge-friendly (register-wise max),
- Excellent for distributed pipelines.
Practical pitfalls
- Small-cardinality ranges need bias correction logic (modern implementations handle this).
- Not suitable when exact legal/accounting numbers are mandatory.
- Hash uniformity quality directly affects estimator stability.
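Pulling the pieces together, here is a compact toy HLL in Python with the register-wise-max merge and the small-range (linear counting) correction. This is a sketch of the idea only; production libraries such as Apache DataSketches apply fuller bias correction:

```python
import hashlib
import math

class HyperLogLog:
    """Toy HLL for illustration; not a substitute for a hardened library."""
    def __init__(self, p: int = 12):
        self.p, self.m = p, 1 << p
        self.regs = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # estimator constant, m >= 128

    def add(self, key: str) -> None:
        h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining bits give the rank
        rank = (64 - self.p) - rest.bit_length() + 1
        self.regs[idx] = max(self.regs[idx], rank)

    def merge(self, other: "HyperLogLog") -> None:
        # Register-wise max: the result estimates the cardinality of the union.
        self.regs = [max(a, b) for a, b in zip(self.regs, other.regs)]

    def count(self) -> float:
        z = sum(2.0 ** -r for r in self.regs)
        e = self.alpha * self.m * self.m / z
        if e <= 2.5 * self.m:                       # small-range correction
            zeros = self.regs.count(0)
            if zeros:
                e = self.m * math.log(self.m / zeros)
        return e
```

The `merge` method is what makes HLL friendly to sharded pipelines: each shard ships its register array, and the union's estimate needs only an element-wise max.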
Choosing quickly (production decision tree)
- Need approximate set membership with fast negative checks? → Bloom
- Need approximate per-key frequency in a stream? → CMS
- Need approximate unique count and easy merge across partitions? → HLL
If the answer is “I need two of these,” that is normal. Most mature systems run multiple sketches side-by-side.
Common architecture pattern (works well in practice)
- Edge ingest:
- Bloom filter for duplicate/noise prefilter.
- Stream analytics:
- CMS for hot-key and abuse pressure signals.
- Product/business metrics:
- HLL for DAU/WAU/MAU and unique dimensions.
Then periodically:
- sample exact ground truth on a small slice,
- compare sketch error vs SLO,
- adjust parameters (m, k, ε, δ, p) before drift hurts decisions.
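The "compare sketch error vs SLO" step can be a few lines. A hypothetical helper for the CMS case (name and shape are mine; `estimate` is any callable such as a sketch's estimate method):

```python
from collections import Counter

def cms_within_budget(estimate, exact: Counter, n_total: int, eps: float) -> bool:
    """True if the worst additive error on a sampled slice stays under eps * N."""
    worst = max(abs(estimate(key) - true_count) for key, true_count in exact.items())
    return worst <= eps * n_total
```

Run it periodically against an exact count on a small sampled slice; a False result is the trigger for the alert/rotate/re-baseline steps above.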
Error budgeting (the part teams skip)
Treat approximation as an explicit SLO input.
Example policy:
- Bloom FP ≤ 1.0% for online gating,
- CMS additive error ≤ 0.5% of stream mass at 99% confidence,
- HLL relative error ≤ 1.0% for executive dashboards.
When error exceeds budget:
- alert,
- rotate/rebuild sketch,
- re-baseline with exact sample,
- publish impact note (which decisions were affected).
Implementation checklist
- Standardize hashing primitives across services.
- Version sketch parameters in metadata.
- Log sketch saturation metrics (fill ratio, collision pressure, register distribution).
- Keep “exact fallback path” for audits and incident replay.
- Document where probabilistic outputs are allowed (and where forbidden).
References (researched)
- Bloom, B. H. (1970), Space/time trade-offs in hash coding with allowable errors (Communications of the ACM)
https://dl.acm.org/doi/10.1145/362686.362692 - Cormode, G., Muthukrishnan, S., An Improved Data Stream Summary: The Count-Min Sketch and its Applications
https://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf - Flajolet, P. et al. (2007), HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf - Redis probabilistic data structures docs (Bloom/CMS/HLL ecosystem in practice)
https://redis.io/probabilistic/ - Apache DataSketches HLL docs
https://datasketches.apache.org/docs/HLL/HllSketches.html