Birthday Bound for Random IDs: Collision Risk That Sneaks Up at Scale
TL;DR
If you generate IDs uniformly at random from a space of size (N), collision risk grows with the square of draws ((n^2)), not linearly with (n).
- Approximation: (p(\text{collision}) \approx 1 - e^{-n(n-1)/(2N)})
- Rule of thumb for 50% risk: (n_{50} \approx 1.177\sqrt{N})
That’s why 64-bit random IDs can look “huge” but still become collision-relevant in large systems.
1) Why this feels wrong
Most people intuitively think:
- “I have (2^{64}) possibilities, so I need near (2^{64}) samples before collisions matter.”
But collisions are pairwise events. The number of pairs among (n) samples is:
[ \binom{n}{2} = \frac{n(n-1)}{2} ]
So risk scales with pair count, i.e., roughly (n^2).
This is the same core idea behind the classic birthday problem (23 people (\rightarrow) >50% chance of a shared birthday).
2) Practical formulas you can actually use
For uniform random draws with replacement from (N) buckets:
[ p \approx 1 - \exp\left(-\frac{n(n-1)}{2N}\right) ]
For very small (p):
[ p \approx \frac{n(n-1)}{2N} ]
Invert to estimate sample size for a target collision probability (p):
[ n \approx \sqrt{2N\ln\frac{1}{1-p}} ]
Special case at (p=0.5):
[ n_{50} \approx \sqrt{2N\ln 2} \approx 1.177\sqrt{N} ]
3) How many draws until trouble?
Using the approximation above:
| Random bits | Space (N) | ~1% collision risk | ~50% collision risk |
|---|---|---|---|
| 32 | (2^{32}) | 9,291 | 77,163 |
| 48 | (2^{48}) | 2,378,621 | 19,753,662 |
| 64 | (2^{64}) | 608,926,881 | 5,056,937,541 |
| 96 | (2^{96}) | 39,906,632,088,518 | 331,411,458,666,436 |
| 122 (UUIDv4 random bits) | (2^{122}) | 326,915,130,069,136,000 | 2,714,922,669,395,445,248 |
Takeaway:
- 32-bit and 48-bit spaces are tiny for modern systems.
- 64-bit may be okay for small scopes, but can become risky at global/high-throughput scale.
- 96+ bits is usually effectively collision-negligible for ordinary production workloads.
4) Three concrete scale checks
Approximate collision probability after generating random IDs:
| Draw count (n) | 64-bit | 96-bit | 122-bit |
|---|---|---|---|
| 1,000,000 | (2.71\times10^{-8}) | (6.31\times10^{-18}) | (9.40\times10^{-26}) |
| 100,000,000 | (2.71\times10^{-4}) (~0.027%) | (6.31\times10^{-14}) | (9.40\times10^{-22}) |
| 1,000,000,000 | (2.67\times10^{-2}) (~2.67%) | (6.31\times10^{-12}) | (9.40\times10^{-20}) |
The billion-row line is the one that surprises teams: 64-bit random IDs are no longer “astronomically safe.”
5) System design implications
A) Don’t confuse “large keyspace” with “small collision probability”
Always reason in terms of (expected draws per collision domain), not just bit width.
B) Define collision domain explicitly
Your true (n) is per domain (e.g., global table, per-tenant table, per-day partition).
If you partition the domain, you reduce effective (n), often dramatically reducing risk.
C) Keep uniqueness enforcement anyway
Even with very low probabilities:
- keep a UNIQUE constraint/index,
- handle insert retry on conflict,
- instrument collision counters.
Low-probability events still happen in long-running systems.
D) Use enough entropy for your growth horizon
For long-lived, high-throughput, multi-region systems, prefer ID schemes with strong entropy budgets (commonly 96+ effective random bits, or robust non-random uniqueness construction with strict coordination semantics).
6) Quick estimator snippet
import math
def collision_prob(n: int, bits: int) -> float:
N = 2 ** bits
x = n * (n - 1) / (2 * N)
return 1 - math.exp(-x)
# Example: 1B draws in 64-bit space
print(collision_prob(1_000_000_000, 64)) # ~0.0267
Use this in capacity planning docs before locking an ID format.
7) A simple policy template
For each ID type:
- Define collision domain (where uniqueness must hold).
- Estimate max cumulative draws over 1y / 3y / system lifetime.
- Compute collision probability at those horizons.
- Set acceptable risk threshold (e.g., < (10^{-9}) per domain-year).
- Add DB uniqueness + retry policy + monitoring.
- Re-evaluate when throughput assumptions change.
This turns “UUID vibes” into an explicit risk budget.
Closing
The birthday bound is one of those rare ideas that is both mathematically simple and operationally expensive to ignore.
If your system is scaling fast, collision math should be part of architecture reviews—not an afterthought after the first duplicate key incident.
References
- Wikipedia: Birthday problem — derivations and approximations for collision probabilities. https://en.wikipedia.org/wiki/Birthday_problem
- Wikipedia: Birthday attack — cryptographic collision search intuition and birthday bound. https://en.wikipedia.org/wiki/Birthday_attack
- IETF RFC 9562 (May 2024): Universally Unique IDentifiers (UUIDs) — UUID format and v4 random-bit construction (122 random bits after version/variant). https://www.rfc-editor.org/rfc/rfc9562
- NIST SP 800-107 Rev.1: Recommendation for Applications Using Approved Hash Algorithms — collision resistance strength guidance (birthday-bound context). https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpublication800-107r1.pdf