Retry-Safe API Playbook: Idempotency Keys in Real Systems
Date: 2026-02-22 Category: knowledge
Why this matters
Most production incidents in API workflows are not caused by a single failed request; they come from the same intent being applied twice:
- a user double-clicked a payment button
- a client retried after a timeout
- a queue message was redelivered to a worker after an ack timeout
- a gateway retried on a 502 while the upstream request actually succeeded
If you cannot prove "same intent => same effect", your system leaks money, inventory, and trust.
Core model
Idempotency is not "ignore duplicates". It is:
Given the same idempotency key and materially same payload, return the same semantic result exactly once.
Minimum contract
- Client sends Idempotency-Key (a high-entropy UUID/ULID).
- Server stores (scope, key) -> request_fingerprint + outcome.
- First request executes side effects and stores the canonical response.
- Replays return the stored response (status/body/headers where needed).
- If the same key arrives with a different fingerprint: reject (409/422).
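The contract above can be sketched as a minimal handler. This is illustrative only: the in-memory `records` dict stands in for the durable store, and `charge`, `handle_charge`, and the endpoint name are hypothetical.

```python
import hashlib
import json

# In-memory stand-in for the durable (scope, key) -> record store.
records = {}

def fingerprint(payload: dict) -> str:
    # Canonical JSON (sorted keys, compact separators) so equivalent
    # payloads hash identically regardless of key order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def charge(payload: dict) -> dict:
    # Placeholder for the real side effect (e.g., creating a payment).
    return {"charged": payload["amount"], "currency": payload["currency"]}

def handle_charge(tenant_id: str, idem_key: str, payload: dict):
    scope = (tenant_id, "POST /charges", idem_key)
    fp = fingerprint(payload)
    record = records.get(scope)
    if record is not None:
        if record["req_hash"] != fp:
            # Same key, materially different payload: refuse.
            return 409, {"error": "key reused with different payload"}
        # Replay: return the stored canonical outcome.
        return record["status_code"], record["response"]
    # First request: execute the side effect, then persist the outcome.
    response = charge(payload)
    records[scope] = {"req_hash": fp, "status_code": 201, "response": response}
    return 201, response
```

Note that a real implementation must persist the record and the side effect atomically; the race-safe flow below covers that part.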
Scoping rules (where people get burned)
A key must be unique within a scope, not globally forever. Recommended scope:
tenant_id + endpoint + key
Why include endpoint? To avoid accidental collisions across operations. Why include tenant/user? To avoid cross-account replay confusion.
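A tiny sketch of the scoping rule, with a hypothetical `scope` helper: because the stored key is the full triple, reusing the same client-supplied key on a different endpoint or tenant cannot collide.

```python
def scope(tenant_id: str, endpoint: str, idem_key: str) -> tuple:
    # The storage key is the triple, not the raw client key alone.
    return (tenant_id, endpoint, idem_key)
```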
Storage pattern
Table sketch
create table idempotency_records (
tenant_id text not null,
endpoint text not null,
idem_key text not null,
req_hash text not null,
status_code int,
response_json text,
resource_type text,
resource_id text,
state text not null, -- processing | completed | failed
created_at timestamptz not null,
expires_at timestamptz not null,
primary key (tenant_id, endpoint, idem_key)
);
State machine
- processing: lock acquired, business logic running
- completed: canonical outcome persisted
- failed: optional, for deterministic business failures; transient infra failures are usually retryable
Race-safe execution flow
- Attempt to insert a processing row under the unique key.
- If the insert wins -> execute the business transaction.
- Persist the outcome and mark the row completed atomically.
- If the insert loses -> read the row:
  - completed: return the stored outcome
  - processing: return 409/425 + Retry-After, or wait with a bounded poll
  - failed: policy-based (replay, or require a new key)
Use DB uniqueness as the lock. App-level mutex alone is not enough in distributed workers.
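A sketch of "DB uniqueness as the lock", using SQLite so the example is self-contained (a production system would use your durable database, e.g. Postgres with `INSERT ... ON CONFLICT DO NOTHING`). The `try_acquire` name and the trimmed-down table are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table idempotency_records (
        tenant_id text not null,
        endpoint  text not null,
        idem_key  text not null,
        state     text not null,
        primary key (tenant_id, endpoint, idem_key)
    )
""")

def try_acquire(tenant_id: str, endpoint: str, idem_key: str) -> bool:
    # INSERT OR IGNORE succeeds only for the first caller; every later
    # caller sees rowcount == 0 and must read the existing row instead.
    cur = conn.execute(
        "insert or ignore into idempotency_records "
        "(tenant_id, endpoint, idem_key, state) "
        "values (?, ?, ?, 'processing')",
        (tenant_id, endpoint, idem_key),
    )
    conn.commit()
    return cur.rowcount == 1
```

The point is that the primary-key constraint arbitrates the race, so the decision holds across processes and hosts, which no in-process mutex can guarantee.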
Fingerprint design
Hash only fields that define the user intent, e.g.:
- payer/payee
- amount + currency
- order line items
Exclude volatile fields:
- timestamp generated by client
- tracing IDs
- random nonce unrelated to intent
Canonicalize JSON before hashing (sorted keys, normalized numbers/strings) to prevent false mismatch.
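A sketch of an intent fingerprint under these rules; `INTENT_FIELDS` and the field names are illustrative, not a fixed schema. Only intent-defining fields are hashed, and the JSON is canonicalized first.

```python
import hashlib
import json

# Fields that define the user's intent; everything else is ignored.
INTENT_FIELDS = ("payer", "payee", "amount", "currency", "line_items")

def intent_fingerprint(payload: dict) -> str:
    intent = {k: payload[k] for k in INTENT_FIELDS if k in payload}
    # Sorted keys + compact separators: equivalent payloads hash equal.
    canonical = json.dumps(intent, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```

Because volatile fields like client timestamps and trace IDs are excluded, a legitimate retry with a fresh trace ID still matches the stored fingerprint instead of being rejected as a conflict.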
TTL policy
Typical TTL ranges:
- payments/orders: 24h–72h
- internal command APIs: 1h–24h
Too short => late retries slip past the window and duplicate side effects. Too long => storage bloat and surprises when clients reuse keys.
Use background cleanup and metrics:
- replay hit rate
- hash mismatch rate
- processing timeout count
- median/p95 key age at replay
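The cleanup side can be a small periodic job. A self-contained sketch against SQLite, assuming expires_at is stored as a unix timestamp; `purge_expired` and the trimmed table are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table idempotency_records ("
    "  idem_key text primary key,"
    "  expires_at real not null)"
)

def purge_expired(now: float) -> int:
    # Delete expired rows; return the count so it can be emitted as a
    # metric alongside replay/mismatch/timeout counters.
    cur = conn.execute(
        "delete from idempotency_records where expires_at <= ?", (now,)
    )
    conn.commit()
    return cur.rowcount
```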
Multi-step workflow note
Idempotency key protects one command boundary, not your whole saga. For multi-service flows:
- make each step idempotent
- carry a stable operation ID through events
- design compensations for partially completed steps
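Carrying a stable operation ID can look like the following sketch: the ID is minted once at the saga entry point, copied into every downstream event, and each step derives its own idempotency key from it. The helper names are hypothetical.

```python
import uuid

def new_operation_id() -> str:
    # Minted exactly once, at the saga entry point.
    return str(uuid.uuid4())

def step_key(operation_id: str, step: str) -> str:
    # Each step's idempotency key is stable for a given operation,
    # so redelivered events replay instead of re-executing.
    return f"{operation_id}:{step}"
```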
Exactly-once delivery is a fantasy at scale; exactly-once effect is the target.
Practical anti-footgun checklist
- Require key on side-effecting POST endpoints
- Reject key reuse with different fingerprint
- Store canonical response and return it on replay
- Put unique constraint in durable DB (not cache-only)
- Add processing timeout recovery path
- Instrument replay/mismatch/timeout metrics
- Document TTL and client retry behavior
Bottom line
Retries are guaranteed in production. Duplicate side effects are optional.
Idempotency keys are one of the cheapest reliability upgrades you can ship: small schema + clear contract + strict scope discipline.