Retry-Safe API Playbook: Idempotency Keys in Real Systems
Date: 2026-02-22 Category: knowledge
Why this matters
Most production incidents in API workflows are not caused by a single failed request; they come from the same intent being applied twice:
- a user double-clicked a payment button
- a client retried after a timeout
- a queue message was redelivered to a worker after an ack timeout
- a gateway retried on a 502 while the upstream request actually succeeded
If you cannot prove "same intent => same effect", your system leaks money, inventory, and trust.
Core model
Idempotency is not "ignore duplicates". It is:
Given the same idempotency key and materially same payload, return the same semantic result exactly once.
Minimum contract
- Client sends Idempotency-Key (a high-entropy UUID/ULID).
- Server stores (scope, key) -> request_fingerprint + outcome.
- First request executes side effects and stores the canonical response.
- Replays return the stored response (status/body/headers where needed).
- If the same key arrives with a different fingerprint: reject (409/422).
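The contract above can be sketched as a minimal handler. This is illustrative only: the in-memory `records` dict stands in for the durable store, and `charge`, `handle_charge`, and the endpoint name are hypothetical.

```python
import hashlib
import json

# In-memory stand-in for the durable (scope, key) -> record store.
records = {}

def fingerprint(payload: dict) -> str:
    # Canonical JSON (sorted keys, compact separators) so equivalent
    # payloads hash identically regardless of key order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def charge(payload: dict) -> dict:
    # Placeholder for the real side effect (e.g., creating a payment).
    return {"charged": payload["amount"], "currency": payload["currency"]}

def handle_charge(tenant_id: str, idem_key: str, payload: dict):
    scope = (tenant_id, "POST /charges", idem_key)
    fp = fingerprint(payload)
    record = records.get(scope)
    if record is not None:
        if record["req_hash"] != fp:
            # Same key, materially different payload: refuse.
            return 409, {"error": "key reused with different payload"}
        # Replay: return the stored canonical outcome.
        return record["status_code"], record["response"]
    # First request: execute the side effect, then persist the outcome.
    response = charge(payload)
    records[scope] = {"req_hash": fp, "status_code": 201, "response": response}
    return 201, response
```

Note that a real implementation must persist the record and the side effect atomically; the race-safe flow below covers that part.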
Scoping rules (where people get burned)
A key must be unique within a scope, not globally forever. Recommended scope:
tenant_id + endpoint + key
Why include endpoint? To avoid accidental collisions across operations. Why include tenant/user? To avoid cross-account replay confusion.
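A tiny sketch of the scoping rule, with a hypothetical `scope` helper: because the stored key is the full triple, reusing the same client-supplied key on a different endpoint or tenant cannot collide.

```python
def scope(tenant_id: str, endpoint: str, idem_key: str) -> tuple:
    # The storage key is the triple, not the raw client key alone.
    return (tenant_id, endpoint, idem_key)
```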
Storage pattern
Table sketch
create table idempotency_records (
tenant_id text not null,
endpoint text not null,
idem_key text not null,
req_hash text not null,
status_code int,
response_json text,
resource_type text,
resource_id text,
state text not null, -- processing | completed | failed
created_at timestamptz not null,
expires_at timestamptz not null,
primary key (tenant_id, endpoint, idem_key)
);
State machine
- processing: lock acquired, business logic running
- completed: canonical outcome persisted
- failed: optional, for deterministic business failures; transient infra failures are usually retryable
Race-safe execution flow
- Attempt to insert a processing row under the unique key.
- If the insert wins -> execute the business transaction.
- Persist the outcome and mark the row completed atomically.
- If the insert loses -> read the row:
  - completed: return the stored outcome
  - processing: return 409/425 + Retry-After, or wait with a bounded poll
  - failed: policy-based (replay, or require a new key)
Use DB uniqueness as the lock. App-level mutex alone is not enough in distributed workers.
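A sketch of "DB uniqueness as the lock", using SQLite so the example is self-contained (a production system would use your durable database, e.g. Postgres with `INSERT ... ON CONFLICT DO NOTHING`). The `try_acquire` name and the trimmed-down table are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table idempotency_records (
        tenant_id text not null,
        endpoint  text not null,
        idem_key  text not null,
        state     text not null,
        primary key (tenant_id, endpoint, idem_key)
    )
""")

def try_acquire(tenant_id: str, endpoint: str, idem_key: str) -> bool:
    # INSERT OR IGNORE succeeds only for the first caller; every later
    # caller sees rowcount == 0 and must read the existing row instead.
    cur = conn.execute(
        "insert or ignore into idempotency_records "
        "(tenant_id, endpoint, idem_key, state) "
        "values (?, ?, ?, 'processing')",
        (tenant_id, endpoint, idem_key),
    )
    conn.commit()
    return cur.rowcount == 1
```

The point is that the primary-key constraint arbitrates the race, so the decision holds across processes and hosts, which no in-process mutex can guarantee.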
Fingerprint design
Hash only fields that define the user intent, e.g.:
- payer/payee
- amount + currency
- order line items
Exclude volatile fields:
- timestamp generated by client
- tracing IDs
- random nonce unrelated to intent
Canonicalize JSON before hashing (sorted keys, normalized numbers/strings) to prevent false mismatch.
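A sketch of an intent fingerprint under these rules; `INTENT_FIELDS` and the field names are illustrative, not a fixed schema. Only intent-defining fields are hashed, and the JSON is canonicalized first.

```python
import hashlib
import json

# Fields that define the user's intent; everything else is ignored.
INTENT_FIELDS = ("payer", "payee", "amount", "currency", "line_items")

def intent_fingerprint(payload: dict) -> str:
    intent = {k: payload[k] for k in INTENT_FIELDS if k in payload}
    # Sorted keys + compact separators: equivalent payloads hash equal.
    canonical = json.dumps(intent, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```

Because volatile fields like client timestamps and trace IDs are excluded, a legitimate retry with a fresh trace ID still matches the stored fingerprint instead of being rejected as a conflict.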
TTL policy
Typical TTL ranges:
- payments/orders: 24h–72h
- internal command APIs: 1h–24h
Too short => late retries slip past the window and duplicate side effects. Too long => storage bloat and surprises when clients reuse keys.
Use background cleanup and metrics:
- replay hit rate
- hash mismatch rate
- processing timeout count
- median/p95 key age at replay
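The cleanup side can be a small periodic job. A self-contained sketch against SQLite, assuming expires_at is stored as a unix timestamp; `purge_expired` and the trimmed table are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table idempotency_records ("
    "  idem_key text primary key,"
    "  expires_at real not null)"
)

def purge_expired(now: float) -> int:
    # Delete expired rows; return the count so it can be emitted as a
    # metric alongside replay/mismatch/timeout counters.
    cur = conn.execute(
        "delete from idempotency_records where expires_at <= ?", (now,)
    )
    conn.commit()
    return cur.rowcount
```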
Multi-step workflow note
Idempotency key protects one command boundary, not your whole saga. For multi-service flows:
- make each step idempotent
- carry a stable operation ID through events
- design compensations for partially completed steps
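Carrying a stable operation ID can look like the following sketch: the ID is minted once at the saga entry point, copied into every downstream event, and each step derives its own idempotency key from it. The helper names are hypothetical.

```python
import uuid

def new_operation_id() -> str:
    # Minted exactly once, at the saga entry point.
    return str(uuid.uuid4())

def step_key(operation_id: str, step: str) -> str:
    # Each step's idempotency key is stable for a given operation,
    # so redelivered events replay instead of re-executing.
    return f"{operation_id}:{step}"
```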
Exactly-once delivery is a fantasy at scale; exactly-once effect is the target.
Practical anti-footgun checklist
- Require key on side-effecting POST endpoints
- Reject key reuse with different fingerprint
- Store canonical response and return it on replay
- Put unique constraint in durable DB (not cache-only)
- Add processing timeout recovery path
- Instrument replay/mismatch/timeout metrics
- Document TTL and client retry behavior
Bottom line
Retries are guaranteed in production. Duplicate side effects are optional.
Idempotency keys are one of the cheapest reliability upgrades you can ship: small schema + clear contract + strict scope discipline.