Birthday Bound for Random IDs: Collision Risk That Sneaks Up at Scale

TL;DR

If you generate IDs uniformly at random from a space of size (N), collision risk grows with the square of draws ((n^2)), not linearly with (n).

Approximation: (p(\text{collision}) \approx 1 - e^{-n(n-1)/(2N)})
Rule of thumb for 50% risk: (n_{50} \approx 1.177\sqrt{N})

That’s why 64-bit random IDs can look “huge” but still become collision-relevant in large systems.

1) Why this feels wrong

Most people intuitively think:

“I have (2^{64}) possibilities, so I need near (2^{64}) samples before collisions matter.”

But collisions are pairwise events. The number of pairs among (n) samples is:

[ \binom{n}{2} = \frac{n(n-1)}{2} ]

So risk scales with pair count, i.e., roughly (n^2).

This is the same core idea behind the classic birthday problem (23 people (\rightarrow) >50% chance of a shared birthday).

2) Practical formulas you can actually use

For uniform random draws with replacement from (N) buckets:

[ p \approx 1 - \exp\left(-\frac{n(n-1)}{2N}\right) ]

For very small (p):

[ p \approx \frac{n(n-1)}{2N} ]

Invert to estimate sample size for a target collision probability (p):

[ n \approx \sqrt{2N\ln\frac{1}{1-p}} ]

Special case at (p=0.5):

[ n_{50} \approx \sqrt{2N\ln 2} \approx 1.177\sqrt{N} ]

3) How many draws until trouble?

Using the approximation above:

Random bits	Space (N)	~1% collision risk	~50% collision risk
32	(2^{32})	9,291	77,163
48	(2^{48})	2,378,621	19,753,662
64	(2^{64})	608,926,881	5,056,937,541
96	(2^{96})	39,906,632,088,518	331,411,458,666,436
122 (UUIDv4 random bits)	(2^{122})	326,915,130,069,136,000	2,714,922,669,395,445,248

Takeaway:

32-bit and 48-bit spaces are tiny for modern systems.
64-bit may be okay for small scopes, but can become risky at global/high-throughput scale.
96+ bits is usually effectively collision-negligible for ordinary production workloads.

4) Three concrete scale checks

Approximate collision probability after generating random IDs:

Draw count (n)	64-bit	96-bit	122-bit
1,000,000	(2.71\times10^{-8})	(6.31\times10^{-18})	(9.40\times10^{-26})
100,000,000	(2.71\times10^{-4}) (~0.027%)	(6.31\times10^{-14})	(9.40\times10^{-22})
1,000,000,000	(2.67\times10^{-2}) (~2.67%)	(6.31\times10^{-12})	(9.40\times10^{-20})

The billion-row line is the one that surprises teams: 64-bit random IDs are no longer “astronomically safe.”

5) System design implications

A) Don’t confuse “large keyspace” with “small collision probability”

Always reason in terms of (expected draws per collision domain), not just bit width.

B) Define collision domain explicitly

Your true (n) is per domain (e.g., global table, per-tenant table, per-day partition).

If you partition the domain, you reduce effective (n), often dramatically reducing risk.

C) Keep uniqueness enforcement anyway

Even with very low probabilities:

keep a UNIQUE constraint/index,
handle insert retry on conflict,
instrument collision counters.

Low-probability events still happen in long-running systems.

D) Use enough entropy for your growth horizon

For long-lived, high-throughput, multi-region systems, prefer ID schemes with strong entropy budgets (commonly 96+ effective random bits, or robust non-random uniqueness construction with strict coordination semantics).

6) Quick estimator snippet

import math

def collision_prob(n: int, bits: int) -> float:
    N = 2 ** bits
    x = n * (n - 1) / (2 * N)
    return 1 - math.exp(-x)

# Example: 1B draws in 64-bit space
print(collision_prob(1_000_000_000, 64))  # ~0.0267

Use this in capacity planning docs before locking an ID format.

7) A simple policy template

For each ID type:

Define collision domain (where uniqueness must hold).
Estimate max cumulative draws over 1y / 3y / system lifetime.
Compute collision probability at those horizons.
Set acceptable risk threshold (e.g., < (10^{-9}) per domain-year).
Add DB uniqueness + retry policy + monitoring.
Re-evaluate when throughput assumptions change.

This turns “UUID vibes” into an explicit risk budget.

Closing

The birthday bound is one of those rare ideas that is both mathematically simple and operationally expensive to ignore.

If your system is scaling fast, collision math should be part of architecture reviews—not an afterthought after the first duplicate key incident.

References

Wikipedia: Birthday problem — derivations and approximations for collision probabilities. https://en.wikipedia.org/wiki/Birthday_problem
Wikipedia: Birthday attack — cryptographic collision search intuition and birthday bound. https://en.wikipedia.org/wiki/Birthday_attack
IETF RFC 9562 (May 2024): Universally Unique IDentifiers (UUIDs) — UUID format and v4 random-bit construction (122 random bits after version/variant). https://www.rfc-editor.org/rfc/rfc9562
NIST SP 800-107 Rev.1: Recommendation for Applications Using Approved Hash Algorithms — collision resistance strength guidance (birthday-bound context). https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpublication800-107r1.pdf