Retry-Safe API Playbook: Idempotency Keys in Real Systems

2026-02-22 · software


Why this matters

Most production incidents in API workflows are not caused by a single request failing; they come from the same intent being applied twice: a client retrying after a timeout, a queue redelivering a message, a user double-clicking submit.

If you cannot prove "same intent => same effect", your system leaks money, inventory, and trust.


Core model

Idempotency is not "ignore duplicates". It is:

Given the same idempotency key and materially same payload, return the same semantic result exactly once.

Minimum contract

  1. Client sends Idempotency-Key (high-entropy UUID/ULID).
  2. Server stores (scope, key) -> request_fingerprint + outcome.
  3. First request executes side effects and stores canonical response.
  4. Replays return stored response (status/body/headers where needed).
  5. If same key with different fingerprint: reject (409/422).
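The contract above can be sketched with a hypothetical in-memory store; names like `handle` and `create_charge` are illustrative, not a real framework API:

```python
import hashlib
import json
import uuid

# Hypothetical in-memory store: (scope, key) -> (fingerprint, response).
_records: dict[tuple[str, str], tuple[str, dict]] = {}

def fingerprint(payload: dict) -> str:
    # Canonical JSON so semantically equal payloads hash identically.
    canon = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canon.encode()).hexdigest()

def handle(scope: str, idem_key: str, payload: dict, execute) -> tuple[int, dict]:
    fp = fingerprint(payload)
    record = _records.get((scope, idem_key))
    if record is not None:
        stored_fp, stored_response = record
        if stored_fp != fp:
            return 409, {"error": "idempotency key reused with different payload"}
        return 200, stored_response          # replay: same semantic result
    response = execute(payload)              # first request: run side effects once
    _records[(scope, idem_key)] = (fp, response)
    return 201, response

# Usage: two identical requests cause exactly one side effect.
calls = []
def create_charge(p):
    calls.append(p)
    return {"charge_id": "ch_1", "amount": p["amount"]}

key = str(uuid.uuid4())
first = handle("tenant-a:/charges", key, {"amount": 500}, create_charge)
replay = handle("tenant-a:/charges", key, {"amount": 500}, create_charge)
conflict = handle("tenant-a:/charges", key, {"amount": 900}, create_charge)
```

Note that `execute` runs before the record is stored here; the race between concurrent first requests is handled later by the DB-uniqueness pattern.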

Scoping rules (where people get burned)

A key must be unique within a scope, not globally forever. Recommended scope: (tenant_id, endpoint, idempotency_key), matching the primary key in the storage sketch below.

Why include the endpoint? To avoid accidental collisions across operations. Why include the tenant/user? To avoid cross-account replay confusion.


Storage pattern

Table sketch

create table idempotency_records (
  tenant_id text not null,
  endpoint text not null,
  idem_key text not null,
  req_hash text not null,
  status_code int,
  response_json text,
  resource_type text,
  resource_id text,
  state text not null, -- processing | completed | failed
  created_at timestamptz not null,
  expires_at timestamptz not null,
  primary key (tenant_id, endpoint, idem_key)
);

State machine

A record moves processing -> completed or processing -> failed; completed is terminal. Whether failed allows a retry under the same key is a policy decision (see the flow below).

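A minimal sketch of the record lifecycle, assuming the policy that a failed record may be retried under the same key:

```python
# Allowed transitions for an idempotency record. The failed -> processing
# edge is an assumed policy choice: retry allowed under the same key.
TRANSITIONS = {
    "processing": {"completed", "failed"},
    "completed": set(),              # terminal: replays only read, never write
    "failed": {"processing"},        # only if your retry policy permits key reuse
}

def advance(state: str, new_state: str) -> str:
    """Validate a transition before persisting it."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```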
Race-safe execution flow

  1. Attempt insert processing row with unique key.
  2. If insert wins -> execute business transaction.
  3. Persist outcome + mark completed atomically.
  4. If insert loses -> read row:
    • completed: return stored outcome
    • processing: return 409/425 + Retry-After, or wait with bounded poll
    • failed: policy-based (replay or require new key)

Use DB uniqueness as the lock. App-level mutex alone is not enough in distributed workers.
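A sketch of step 1's insert-as-lock, using SQLite purely for illustration; `try_claim` is a hypothetical helper:

```python
import sqlite3

# "DB uniqueness as the lock": the primary key arbitrates concurrent workers.
db = sqlite3.connect(":memory:")
db.execute("""
    create table idempotency_records (
        tenant_id text not null,
        endpoint text not null,
        idem_key text not null,
        state text not null,
        response_json text,
        primary key (tenant_id, endpoint, idem_key)
    )
""")

def try_claim(tenant: str, endpoint: str, key: str) -> bool:
    """Insert a 'processing' row; only one caller can win the unique key."""
    try:
        db.execute(
            "insert into idempotency_records values (?, ?, ?, 'processing', null)",
            (tenant, endpoint, key),
        )
        db.commit()
        return True       # we own the key: run the business transaction
    except sqlite3.IntegrityError:
        return False      # another worker got there first: read its row instead

won = try_claim("t1", "/charges", "k1")
lost = try_claim("t1", "/charges", "k1")
```

In Postgres the same shape works with `insert ... on conflict do nothing` and checking the affected row count.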


Fingerprint design

Hash only the fields that define the user's intent, e.g. amount, currency, and destination account for a payment.

Exclude volatile fields such as client timestamps, trace/request IDs, and retry counters.

Canonicalize the JSON before hashing (sorted keys, normalized numbers and strings) to prevent false mismatches.
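A possible fingerprint function, assuming a hypothetical payment payload whose intent fields are `amount`, `currency`, and `destination_account` (illustrative names, not a real schema):

```python
import hashlib
import json
from decimal import Decimal

# Assumed intent fields for a hypothetical payment request.
INTENT_FIELDS = ("amount", "currency", "destination_account")

def request_fingerprint(payload: dict) -> str:
    # Keep only intent fields; volatile fields never reach the hash.
    intent = {k: payload[k] for k in INTENT_FIELDS if k in payload}
    # Normalize the number so 10 and 10.0 hash identically.
    if "amount" in intent:
        intent["amount"] = str(Decimal(str(intent["amount"])).normalize())
    # Canonical form: sorted keys, no insignificant whitespace.
    canon = json.dumps(intent, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canon.encode()).hexdigest()

a = request_fingerprint({"amount": 10, "currency": "EUR",
                         "destination_account": "acct_1",
                         "request_id": "trace-123"})   # volatile field ignored
b = request_fingerprint({"amount": 10.0, "currency": "EUR",
                         "destination_account": "acct_1"})
```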


TTL policy

Typical TTL ranges run from roughly 24 hours to a few days, long enough to cover your clients' worst-case retry window.

Too short => late retries duplicate side effects. Too long => storage bloat and key-reuse surprises.

Use background cleanup and metrics: purge expired rows on a schedule, and track replay hit rate, fingerprint-conflict rate, and rows purged per run.

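A minimal cleanup sketch (SQLite for illustration); the returned row count is what you would emit as a purge metric:

```python
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("""
    create table idempotency_records (
        idem_key text primary key,
        expires_at real not null
    )
""")
now = time.time()
db.executemany("insert into idempotency_records values (?, ?)",
               [("old", now - 60), ("fresh", now + 3600)])

def purge_expired(db: sqlite3.Connection, now: float) -> int:
    """Delete expired rows; return the count for a metric like purged_rows."""
    cur = db.execute("delete from idempotency_records where expires_at <= ?",
                     (now,))
    db.commit()
    return cur.rowcount

purged = purge_expired(db, now)
```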

Multi-step workflow note

An idempotency key protects one command boundary, not your whole saga. For multi-service flows: give each step its own key, publish events via a transactional outbox, and deduplicate again on the consumer side.
Exactly-once delivery is fantasy at scale; exactly-once effect is the target.


Practical anti-footgun checklist

  • Scope keys by tenant and endpoint, never globally.
  • Reject same key + different fingerprint with 409/422; never silently replay.
  • Persist the outcome and the completed state atomically with the business transaction.
  • Canonicalize payloads before hashing intent fields.
  • Set a TTL, clean up expired rows, and watch replay and conflict metrics.

Bottom line

Retries are guaranteed in production. Duplicate side effects are optional.

Idempotency keys are one of the cheapest reliability upgrades you can ship: small schema + clear contract + strict scope discipline.