Birthday Bound for Random IDs: Collision Risk That Sneaks Up at Scale

2026-02-28 · computation

Birthday Bound for Random IDs: Collision Risk That Sneaks Up at Scale

TL;DR

If you generate IDs uniformly at random from a space of size (N), collision risk grows with the square of draws ((n^2)), not linearly with (n).

That’s why 64-bit random IDs can look “huge” but still become collision-relevant in large systems.


1) Why this feels wrong

Most people intuitively think:

But collisions are pairwise events. The number of pairs among (n) samples is:

[ \binom{n}{2} = \frac{n(n-1)}{2} ]

So risk scales with pair count, i.e., roughly (n^2).

This is the same core idea behind the classic birthday problem (23 people (\rightarrow) >50% chance of a shared birthday).


2) Practical formulas you can actually use

For uniform random draws with replacement from (N) buckets:

[ p \approx 1 - \exp\left(-\frac{n(n-1)}{2N}\right) ]

For very small (p):

[ p \approx \frac{n(n-1)}{2N} ]

Invert to estimate sample size for a target collision probability (p):

[ n \approx \sqrt{2N\ln\frac{1}{1-p}} ]

Special case at (p=0.5):

[ n_{50} \approx \sqrt{2N\ln 2} \approx 1.177\sqrt{N} ]


3) How many draws until trouble?

Using the approximation above:

Random bits Space (N) ~1% collision risk ~50% collision risk
32 (2^{32}) 9,291 77,163
48 (2^{48}) 2,378,621 19,753,662
64 (2^{64}) 608,926,881 5,056,937,541
96 (2^{96}) 39,906,632,088,518 331,411,458,666,436
122 (UUIDv4 random bits) (2^{122}) 326,915,130,069,136,000 2,714,922,669,395,445,248

Takeaway:


4) Three concrete scale checks

Approximate collision probability after generating random IDs:

Draw count (n) 64-bit 96-bit 122-bit
1,000,000 (2.71\times10^{-8}) (6.31\times10^{-18}) (9.40\times10^{-26})
100,000,000 (2.71\times10^{-4}) (~0.027%) (6.31\times10^{-14}) (9.40\times10^{-22})
1,000,000,000 (2.67\times10^{-2}) (~2.67%) (6.31\times10^{-12}) (9.40\times10^{-20})

The billion-row line is the one that surprises teams: 64-bit random IDs are no longer “astronomically safe.”


5) System design implications

A) Don’t confuse “large keyspace” with “small collision probability”

Always reason in terms of (expected draws per collision domain), not just bit width.

B) Define collision domain explicitly

Your true (n) is per domain (e.g., global table, per-tenant table, per-day partition).

If you partition the domain, you reduce effective (n), often dramatically reducing risk.

C) Keep uniqueness enforcement anyway

Even with very low probabilities:

Low-probability events still happen in long-running systems.

D) Use enough entropy for your growth horizon

For long-lived, high-throughput, multi-region systems, prefer ID schemes with strong entropy budgets (commonly 96+ effective random bits, or robust non-random uniqueness construction with strict coordination semantics).


6) Quick estimator snippet

import math

def collision_prob(n: int, bits: int) -> float:
    N = 2 ** bits
    x = n * (n - 1) / (2 * N)
    return 1 - math.exp(-x)

# Example: 1B draws in 64-bit space
print(collision_prob(1_000_000_000, 64))  # ~0.0267

Use this in capacity planning docs before locking an ID format.


7) A simple policy template

For each ID type:

This turns “UUID vibes” into an explicit risk budget.


Closing

The birthday bound is one of those rare ideas that is both mathematically simple and operationally expensive to ignore.

If your system is scaling fast, collision math should be part of architecture reviews—not an afterthought after the first duplicate key incident.


References