Idempotency-Window Expiry Duplicate Slippage Playbook

2026-03-29 · finance

Pricing Retry-Time Uncertainty as a First-Class Execution Risk

Why this note: Many routers implement idempotency keys with finite dedupe windows (TTL). Under ACK-tail inflation, retries can cross that boundary and be accepted as fresh intent. The result is not a simple “tech bug” but a branch-risk problem: overfill/unwind, phantom underfill, and late catch-up convexity.


1) Failure Mode in One Sentence

When ACK/fill finality arrives after the idempotency dedupe window, a retry may become a second live order, creating hidden overfill and unwind slippage.
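The mechanism can be reproduced with a toy dedupe store. This is a minimal sketch, not any particular router's implementation: the `DedupeCache` class, the key `intent-42`, and the 5-second TTL are all hypothetical, and the window is assumed to start at first sight of the key.

```python
from dataclasses import dataclass, field


@dataclass
class DedupeCache:
    """Toy idempotency store: remembers a key for ttl_s seconds after first sight."""
    ttl_s: float
    _first_seen: dict = field(default_factory=dict)

    def accept(self, key: str, now_s: float) -> bool:
        """Return True if the request is treated as fresh intent."""
        seen = self._first_seen.get(key)
        if seen is not None and now_s - seen < self.ttl_s:
            return False  # inside the window: the retry is deduplicated
        self._first_seen[key] = now_s  # unseen or expired: accepted as new
        return True


cache = DedupeCache(ttl_s=5.0)
original = cache.accept("intent-42", now_s=0.0)     # original order
early_retry = cache.accept("intent-42", now_s=3.0)  # retry inside the window
late_retry = cache.accept("intent-42", now_s=6.5)   # retry after expiry
print(original, early_retry, late_retry)  # True False True
```

The late retry carries the same key as the original, yet it is accepted because the window has expired. If the original is still live at that moment, the book now holds two orders for one intent.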


2) Extend the Action Objective with Duplicate-Branch Risk

For action \(a\) in context \(x\):

\[ J(a|x)=\mathbb{E}[IS|x,a] + \lambda\,\mathrm{CVaR}_{q}(IS|x,a) + \eta\,\mathrm{MissRisk}(x,a) + \rho\,\mathrm{DuplicateRisk}(x,a) \]

Where \(\mathrm{DuplicateRisk}\) is the expected incremental loss from timeout/retry branch ambiguity.

Without this term, “retry for reliability” can silently mutate into tail slippage.
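A small numerical sketch shows how the extra term can change the ranking of two actions. Everything here is illustrative: the IS samples, risk values, weights, and the two action names are made up, and CVaR is computed as the mean of samples at or above the empirical q-quantile, which is one common estimator among several.

```python
import numpy as np


def cvar(samples: np.ndarray, q: float) -> float:
    """Mean of the samples at or above the q-quantile (higher IS = worse)."""
    var = np.quantile(samples, q)
    return float(samples[samples >= var].mean())


def objective(is_samples, miss_risk, duplicate_risk, lam, eta, rho, q=0.8):
    """J(a|x) = E[IS] + lam*CVaR_q(IS) + eta*MissRisk + rho*DuplicateRisk."""
    s = np.asarray(is_samples, dtype=float)
    return float(s.mean() + lam * cvar(s, q)
                 + eta * miss_risk + rho * duplicate_risk)


# Hypothetical actions, all costs in bps.
fast_retry = dict(is_samples=[1, 2, 3, 4, 10], miss_risk=0.5, duplicate_risk=4.0)
wait_for_ack = dict(is_samples=[2, 3, 3, 4, 8], miss_risk=2.0, duplicate_risk=0.2)

scores = {}
for rho in (0.0, 1.0):
    scores[rho] = {
        "fast_retry": objective(**fast_retry, lam=0.5, eta=1.0, rho=rho),
        "wait_for_ack": objective(**wait_for_ack, lam=0.5, eta=1.0, rho=rho),
    }
    print(rho, scores[rho])
```

With `rho=0` the fast retry scores lower (better) because it reduces miss risk. Once the duplicate term is priced in, waiting for the ACK wins.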


3) Minimal Dynamics Model

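Let \(A_t\) be the ACK/finality latency of the original order, \(W_t\) the effective dedupe window (TTL), \(R_t\in\{0,1\}\) indicate that a retry is sent, and \(L_t\in\{0,1\}\) indicate that the original is still live, all in context \(x_t\).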

Define window-expiry collision probability:

\[ p^{dup}_t = P(A_t > W_t, R_t=1, L_t=1 \mid x_t) \]

Expected duplicate branch cost:

\[ \mathrm{DuplicateRisk}_t = p^{dup}_t\cdot C^{overfill}_t + p^{miss}_t\cdot C^{latecatch}_t + p^{unc}_t\cdot C^{reconcile}_t \]

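Where \(p^{miss}_t\) is the probability that the original was truly missed and must be caught up late, \(p^{unc}_t\) is the probability that finality stays unresolved, and each \(C_t\) term is the expected cost of its branch.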

Use a latent regime \(S_t\in\{\text{CLEAN},\text{WATCH},\text{COLLISION\_RISK},\text{SAFE\_SINGLE\_INTENT}\}\).
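As a rough sketch, the collision probability can be estimated as an empirical frequency over logged timeout events and then plugged into the cost formula. The symbol reading is an assumption: \(A_t\) is taken as ACK/finality latency, \(W_t\) as the dedupe window, \(R_t\) as "retry sent", and \(L_t\) as "original still live". The event data, branch probabilities, and bps costs are invented for illustration.

```python
import numpy as np


def estimate_p_dup(ack_latency_s, window_s, retried, original_live) -> float:
    """Empirical P(A > W, R = 1, L = 1) over a cohort of timeout events."""
    a = np.asarray(ack_latency_s, dtype=float)
    w = np.asarray(window_s, dtype=float)
    r = np.asarray(retried, dtype=bool)
    live = np.asarray(original_live, dtype=bool)
    return float(np.mean((a > w) & r & live))


def duplicate_risk(p_dup, c_overfill, p_miss, c_latecatch, p_unc, c_reconcile):
    """Expected duplicate-branch cost, in the same units as the C terms."""
    return p_dup * c_overfill + p_miss * c_latecatch + p_unc * c_reconcile


# Eight hypothetical timeout events with a fixed 5-second window.
ack = [1, 2, 6, 7, 3, 9, 4, 8]
retried = [0, 0, 1, 1, 1, 1, 0, 0]
live = [1, 1, 1, 0, 1, 1, 1, 1]

p = estimate_p_dup(ack, 5.0, retried, live)
risk_bps = duplicate_risk(p, c_overfill=12.0,
                          p_miss=0.05, c_latecatch=6.0,
                          p_unc=0.10, c_reconcile=2.0)
print(p, risk_bps)
```

Only two of the eight events satisfy all three conditions at once, which is why the joint probability matters more than the raw count of slow ACKs.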


4) Branch Taxonomy (Model What Actually Happens)

For each timeout event, classify outcome:

  1. Single-Live Recover
    Original accepted, retry blocked by dedupe (good).
  2. Window-Expiry Duplicate
    Original accepted, retry also accepted as new intent (bad overfill branch).
  3. True Drop + Valid Retry
    Original genuinely lost/rejected, retry required (good rescue branch).
  4. Ambiguous Pending
    Neither leg final for too long; exposure uncertain (control-risk branch).

Slippage comes from mispricing branch probabilities, not from average timeout count alone.
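The taxonomy above maps naturally onto a small classifier over the final state of each leg. This is a sketch under assumptions: the `TimeoutEvent` fields are hypothetical, `None` stands for "no finality yet", and any unresolved leg is treated as ambiguous, which is slightly broader than "neither leg final".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Branch(Enum):
    SINGLE_LIVE_RECOVER = "single"
    WINDOW_EXPIRY_DUPLICATE = "duplicate"
    TRUE_DROP_VALID_RETRY = "rescue"
    AMBIGUOUS_PENDING = "ambiguous"


@dataclass
class TimeoutEvent:
    original_accepted: Optional[bool]  # None = finality not yet known
    retry_accepted: Optional[bool]     # None = finality not yet known


def classify(ev: TimeoutEvent) -> Optional[Branch]:
    """Map a timeout event to one of the four branches, or None if it fits none."""
    if ev.original_accepted is None or ev.retry_accepted is None:
        return Branch.AMBIGUOUS_PENDING
    if ev.original_accepted and not ev.retry_accepted:
        return Branch.SINGLE_LIVE_RECOVER
    if ev.original_accepted and ev.retry_accepted:
        return Branch.WINDOW_EXPIRY_DUPLICATE
    if ev.retry_accepted:
        return Branch.TRUE_DROP_VALID_RETRY
    return None  # both legs rejected: outside the four branches listed


print(classify(TimeoutEvent(True, True)))
```

Counting events per branch, rather than timeouts overall, gives the branch frequencies that the cost model needs.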


5) Telemetry Contract (Required)

A) Intent / Idempotency

B) Gateway / Venue Finality

C) Execution Consequences

D) Context


6) Label Design

Create three event labels:

  1. WindowExpiryDuplicateEvent
    Retry accepted after dedupe expiry while original remained live.
  2. RetryRescueEvent
    Original failed, retry prevented miss (positive branch).
  3. AmbiguousFinalityEvent
    Finality unresolved beyond threshold; temporary exposure uncertainty.

Training only on generic retry success rate hides asymmetric tail damage.


7) Modeling Stack (Practical)

Layer A — Finality Survival Model

Estimate \(P(A_t > \tau\mid x_t)\) for ACK/finality tails (quantile-aware).
This gives the dynamic window pressure relative to the configured TTL.
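A crude starting point, before fitting any conditional model, is an empirical exceedance estimate per context bucket. The sketch below assumes latencies are fully observed (it ignores censoring from orders that never reached finality), and the bucket names, latency values, and 5-second TTL are hypothetical.

```python
import numpy as np


def exceedance_prob(latencies_s, tau_s: float) -> float:
    """Empirical P(A > tau) for one context bucket."""
    lat = np.asarray(latencies_s, dtype=float)
    return float(np.mean(lat > tau_s))


def window_pressure(latencies_s, ttl_s: float, q: float = 0.99) -> float:
    """Ratio of the q-quantile finality latency to the configured TTL."""
    lat = np.asarray(latencies_s, dtype=float)
    return float(np.quantile(lat, q) / ttl_s)


ttl_s = 5.0
buckets = {
    "calm": [0.1, 0.2, 0.2, 0.3, 0.5],
    "stressed": [0.2, 0.5, 1.0, 4.0, 6.0, 7.5],
}
for name, lat in buckets.items():
    print(name, exceedance_prob(lat, ttl_s), window_pressure(lat, ttl_s))
```

A pressure ratio above 1 means the tail of finality latency already extends past the dedupe window in that bucket, so retries there can land after expiry.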

Layer B — Competing-Risks Branch Model

Estimate branch probabilities:

\[ P(B=b\mid x_t,a_t),\; b\in\{\text{single},\text{duplicate},\text{rescue},\text{ambiguous}\} \]

Layer C — Branch-Conditional Cost Model

For each branch, model the \(IS\) distribution (p50/p90/p99).
Then aggregate:

\[ \mathbb{E}[IS|x,a]=\sum_b P(B=b|x,a)\cdot \mathbb{E}[IS|x,a,B=b] \]
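The aggregation is a plain probability-weighted sum, which makes it easy to see how a rare branch can dominate cost. The branch probabilities and per-branch mean IS values below are invented for illustration.

```python
def expected_is(branch_probs: dict, branch_mean_is: dict) -> float:
    """E[IS] = sum over branches of P(B=b) * E[IS | B=b]."""
    if set(branch_probs) != set(branch_mean_is):
        raise ValueError("branch sets differ")
    if abs(sum(branch_probs.values()) - 1.0) > 1e-9:
        raise ValueError("branch probabilities must sum to 1")
    return sum(p * branch_mean_is[b] for b, p in branch_probs.items())


# Hypothetical values: probabilities per branch and mean IS in bps.
probs = {"single": 0.90, "duplicate": 0.02, "rescue": 0.06, "ambiguous": 0.02}
mean_is = {"single": 3.0, "duplicate": 40.0, "rescue": 5.0, "ambiguous": 12.0}

total = expected_is(probs, mean_is)
dup_share = probs["duplicate"] * mean_is["duplicate"] / total
print(total, dup_share)
```

In this made-up case the duplicate branch is 2% of events but roughly a fifth of expected IS.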

Layer D — Policy Simulation

Run offline replay with alternative policy settings to find lower tail-cost operating points.


8) KPIs That Reveal Hidden Duplicate Tax

  1. Window-Expiry Collision Rate (WECR)
    \[ WECR=\frac{N_{\text{window\_expiry\_duplicate}}}{N_{\text{timeout\_retries}}+\epsilon} \]
  2. Duplicate Overshoot Cost (DOC)
    \[ DOC=IS_{\text{duplicate\_branch}}-IS_{\text{matched\_single\_branch}} \]
  3. Retry Rescue Precision (RRP)
    Fraction of retries that truly rescued failed originals (higher is better).
  4. Intent Finality Lag p95 (IFL95)
    p95 of send→finality latency for timeout cohort.
  5. Reconciliation Half-Life (RHL)
    Median time to resolve ambiguous exposure after timeout.

If WECR rises while overall fill/completion stays “normal,” you are likely paying hidden unwind tax.
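The KPIs above reduce to a few one-line computations once events are labeled. In this sketch the function names, counts, and sample values are hypothetical, and DOC is computed as a difference of cohort means, which is one possible reading of the formula.

```python
import numpy as np


def wecr(n_window_expiry_duplicate: int, n_timeout_retries: int,
         eps: float = 1e-9) -> float:
    """Window-Expiry Collision Rate."""
    return n_window_expiry_duplicate / (n_timeout_retries + eps)


def doc(is_duplicate_bps, is_matched_single_bps) -> float:
    """Duplicate Overshoot Cost as a difference of cohort mean IS."""
    return float(np.mean(is_duplicate_bps) - np.mean(is_matched_single_bps))


def rrp(n_rescues: int, n_retries: int, eps: float = 1e-9) -> float:
    """Retry Rescue Precision: share of retries that rescued a failed original."""
    return n_rescues / (n_retries + eps)


def ifl95(send_to_finality_s) -> float:
    """p95 of send-to-finality latency for the timeout cohort."""
    return float(np.quantile(np.asarray(send_to_finality_s, dtype=float), 0.95))


def rhl(resolve_times_s) -> float:
    """Median time to resolve ambiguous exposure."""
    return float(np.median(np.asarray(resolve_times_s, dtype=float)))


kpis = {
    "WECR": wecr(3, 200),
    "DOC_bps": doc([35.0, 42.0, 50.0], [4.0, 5.0, 6.0]),
    "RRP": rrp(120, 200),
    "RHL_s": rhl([2.0, 4.0, 9.0]),
}
print(kpis)
```

The matched single-branch cohort for DOC should be comparable to the duplicate cohort in symbol, size, and market conditions, or the difference will mix in unrelated effects.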


9) Control Policy (CLEAN → SAFE_SINGLE_INTENT)

Use hysteresis + dwell time to avoid oscillation between retry modes.
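One way to combine hysteresis and dwell time is a ladder of states with separate enter and exit thresholds, where escalation is immediate and de-escalation waits out a minimum dwell. This is a sketch only: the risk score is assumed to be something like an estimated collision probability, and the threshold values, the 30-second dwell, and the `RetryModeController` name are hypothetical.

```python
from dataclasses import dataclass

STATES = ["CLEAN", "WATCH", "COLLISION_RISK", "SAFE_SINGLE_INTENT"]
ENTER = [0.01, 0.03, 0.08]   # score >= ENTER[i] escalates state i -> i+1
EXIT = [0.005, 0.015, 0.04]  # score <= EXIT[i] de-escalates state i+1 -> i


@dataclass
class RetryModeController:
    min_dwell_s: float = 30.0
    level: int = 0
    entered_at_s: float = 0.0

    @property
    def state(self) -> str:
        return STATES[self.level]

    def update(self, score: float, now_s: float) -> str:
        """Move at most one level per update and return the new state."""
        if self.level < len(ENTER) and score >= ENTER[self.level]:
            self.level += 1           # escalate immediately
            self.entered_at_s = now_s
        elif (self.level > 0
              and score <= EXIT[self.level - 1]
              and now_s - self.entered_at_s >= self.min_dwell_s):
            self.level -= 1           # relax only after the dwell time
            self.entered_at_s = now_s
        return self.state


ctl = RetryModeController(min_dwell_s=30.0)
trace = [ctl.update(s, t) for s, t in
         [(0.02, 0.0), (0.008, 10.0), (0.004, 20.0), (0.004, 35.0)]]
print(trace)  # ['WATCH', 'WATCH', 'WATCH', 'CLEAN']
```

The second update stays in WATCH because the score sits between the exit and enter thresholds, and the third stays because the dwell time has not elapsed. Together the gap and the dwell prevent flapping around a single threshold.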


10) Rollout Blueprint

  1. Shadow week: compute WECR/DOC/RRP/IFL95 from current logs.
  2. Counterfactual replay: test adaptive TTL + retry-lock policy on recent stress windows.
  3. Canary: symbols/notional subset with strict rollback triggers.
  4. Promotion gates: lower DOC and WECR without degrading completion beyond budget.
  5. Chaos drill: inject ACK-tail delays and confirm controller enters/exits SAFE correctly.

11) Common Mistakes


12) Fast Implementation Checklist

[ ] Log dedupe-window lifecycle per intent (first seen, expiry, retry attempt)
[ ] Build duplicate/rescue/ambiguous branch labels
[ ] Add DuplicateRisk term to routing objective
[ ] Train branch-probability + branch-cost models (quantile heads)
[ ] Deploy CLEAN→WATCH→COLLISION_RISK→SAFE controller
[ ] Gate rollout on WECR + DOC + completion reliability


TL;DR

Timeout retries are a branching execution decision, not a transport footnote. Model idempotency-window expiry explicitly, price duplicate-branch risk in action selection, and enforce single-intent SAFE controls before hidden overfill/unwind cost leaks into tail slippage.