Idempotency-Window Expiry Duplicate Slippage Playbook

Pricing Retry-Time Uncertainty as a First-Class Execution Risk

Why this note: Many routers implement idempotency keys with finite dedupe windows (TTL). Under ACK-tail inflation, retries can cross that boundary and be accepted as fresh intent. The result is not a simple “tech bug” but a branch-risk problem: overfill/unwind, phantom underfill, and late catch-up convexity.

1) Failure Mode in One Sentence

When ACK/fill finality arrives after the idempotency dedupe window has expired, a retry may be accepted as a second live order, creating hidden overfill and unwind slippage.

2) Extend the Action Objective with Duplicate-Branch Risk

For action \(a\) in context \(x\):

\[ J(a\mid x)=\mathbb{E}[IS\mid x,a] + \lambda\,\mathrm{CVaR}_{q}(IS\mid x,a) + \eta\,\mathrm{MissRisk}(x,a) + \rho\,\mathrm{DuplicateRisk}(x,a) \]

Where \(\mathrm{DuplicateRisk}\) is expected incremental loss from timeout/retry branch ambiguity:

retry before true finality,
dedupe-window expiry,
dual-live exposure,
forced unwind after late reconciliation.

Without this term, “retry for reliability” can silently mutate into tail slippage.
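
As a rough illustration of how the extra term changes action ranking, the sketch below scores two hypothetical actions with the objective above. The function names, the weights (lam, eta, rho), the sample values, and the simple tail-average CVaR estimator are all assumptions made for the example, not a description of any particular router.

```python
from statistics import mean

def cvar(losses, q):
    """Mean of the worst (1 - q) fraction of losses (higher loss = worse)."""
    ordered = sorted(losses)
    k = max(1, round(len(ordered) * (1 - q)))  # number of tail samples, at least one
    return mean(ordered[-k:])

def action_score(is_samples_bps, miss_risk, duplicate_risk,
                 lam=0.5, eta=1.0, rho=1.0, q=0.95):
    """J(a|x) = E[IS] + lam * CVaR_q(IS) + eta * MissRisk + rho * DuplicateRisk."""
    return (mean(is_samples_bps)
            + lam * cvar(is_samples_bps, q)
            + eta * miss_risk
            + rho * duplicate_risk)

# Hypothetical IS samples (bps), shared by both actions so only the risk terms differ.
samples = [1.0, 2.0, 3.0, 4.0, 10.0]
patient = action_score(samples, miss_risk=0.4, duplicate_risk=0.1)      # 9.5
eager_retry = action_score(samples, miss_risk=0.1, duplicate_risk=1.5)  # 10.6
```

With these made-up numbers the eager-retry action has the lower miss risk but the higher total score, which is the effect the duplicate term is meant to expose.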

3) Minimal Dynamics Model

Let:

\(W_t\): active idempotency dedupe window (ms)
\(A_t\): ACK/finality latency random variable
\(R_t\in\{0,1\}\): retry fired before finality
\(L_t\in\{0,1\}\): original order still live at retry time

Define window-expiry collision probability:

\[ p^{dup}_t = P(A_t > W_t,\ R_t=1,\ L_t=1 \mid x_t) \]

Expected duplicate branch cost:

\[ \mathrm{DuplicateRisk}_t = p^{dup}_t\cdot C^{overfill}_t + p^{miss}_t\cdot C^{latecatch}_t + p^{unc}_t\cdot C^{reconcile}_t \]

Where:

\(p^{miss}_t\), \(p^{unc}_t\): probabilities of the late catch-up and unresolved-finality branches, respectively,
\(C^{overfill}\): extra impact + unwind cost when both intents execute,
\(C^{latecatch}\): urgency convexity when retry suppressed too long,
\(C^{reconcile}\): temporary hedge/position noise while truth is unresolved.
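
A minimal sketch of how the collision probability and the branch-cost sum could be estimated from logged timeout events. It is an unconditional frequency count (no conditioning on context), and the dictionary keys, probabilities, and cost values are hypothetical placeholders.

```python
def estimate_p_dup(events):
    """Empirical P(A > W, R = 1, L = 1) over observed timeout events.

    Each event is a dict with hypothetical keys:
      finality_latency_ms (A), dedupe_ttl_ms (W),
      retry_fired (R), original_live (L).
    """
    if not events:
        return 0.0
    hits = sum(
        1 for e in events
        if e["finality_latency_ms"] > e["dedupe_ttl_ms"]
        and e["retry_fired"]
        and e["original_live"]
    )
    return hits / len(events)

def duplicate_risk(p_dup, c_overfill, p_miss, c_latecatch, p_unc, c_reconcile):
    """Probability-weighted sum of the three branch costs (same units as the costs)."""
    return p_dup * c_overfill + p_miss * c_latecatch + p_unc * c_reconcile

events = [
    {"finality_latency_ms": 900, "dedupe_ttl_ms": 500, "retry_fired": True,  "original_live": True},
    {"finality_latency_ms": 900, "dedupe_ttl_ms": 500, "retry_fired": False, "original_live": True},
    {"finality_latency_ms": 120, "dedupe_ttl_ms": 500, "retry_fired": True,  "original_live": True},
    {"finality_latency_ms": 700, "dedupe_ttl_ms": 500, "retry_fired": True,  "original_live": False},
]
p_dup = estimate_p_dup(events)  # only the first event satisfies all three conditions
risk_bps = duplicate_risk(p_dup, c_overfill=8.0,
                          p_miss=0.10, c_latecatch=5.0,
                          p_unc=0.20, c_reconcile=1.0)
```

A production estimate would condition on context features; the frequency count here only shows which three conditions must hold jointly.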

Use latent regime \(S_t\in\{\text{CLEAN},\text{WATCH},\text{COLLISION\_RISK},\text{SAFE\_SINGLE\_INTENT}\}\).

4) Branch Taxonomy (Model What Actually Happens)

For each timeout event, classify outcome:

Single-Live Recover
- Original accepted, retry blocked by dedupe (good).
Window-Expiry Duplicate
- Original accepted, retry also accepted as new intent (bad overfill branch).
True Drop + Valid Retry
- Original genuinely lost/rejected, retry required (good rescue branch).
Ambiguous Pending
- Neither leg final for too long; exposure uncertain (control-risk branch).

Slippage comes from mispricing branch probabilities, not from average timeout count alone.
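
The taxonomy can be sketched as a small classifier over the reconciled state of the two legs. The names below are hypothetical, and two choices are assumptions rather than rules from this note: any leg that is still unresolved is treated as ambiguous, and the case where both legs failed is left outside the four branches.

```python
from enum import Enum

class Branch(Enum):
    SINGLE_LIVE_RECOVER = "single"
    WINDOW_EXPIRY_DUPLICATE = "duplicate"
    TRUE_DROP_VALID_RETRY = "rescue"
    AMBIGUOUS_PENDING = "ambiguous"

def classify_timeout(original_accepted, retry_accepted):
    """Map reconciled leg outcomes to a branch.

    Each argument is True (accepted), False (rejected, lost, or blocked),
    or None (not yet final).
    """
    if original_accepted is None or retry_accepted is None:
        return Branch.AMBIGUOUS_PENDING        # assumption: any open leg is ambiguous
    if original_accepted and retry_accepted:
        return Branch.WINDOW_EXPIRY_DUPLICATE  # both intents live: overfill branch
    if original_accepted:
        return Branch.SINGLE_LIVE_RECOVER      # retry was blocked by dedupe
    if retry_accepted:
        return Branch.TRUE_DROP_VALID_RETRY    # retry rescued a genuine drop
    return None                                # both legs failed: outside the taxonomy
```

For example, `classify_timeout(True, True)` returns `Branch.WINDOW_EXPIRY_DUPLICATE`, while `classify_timeout(True, None)` stays `Branch.AMBIGUOUS_PENDING` until the retry leg is reconciled.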

5) Telemetry Contract (Required)

A) Intent / Idempotency

intent_id, idempotency_key, parent_id, child_seq
key_first_seen_at, dedupe_ttl_ms, dedupe_expire_at
retry_attempt, retry_reason, retry_backoff_ms

B) Gateway / Venue Finality

send_ts, ack_ts, fill_ts, cancel_ack_ts
ack_latency_ms, finality_latency_ms
venue_order_id mapping per attempt
duplicate_accept_detected (post-reconcile)

C) Execution Consequences

position_overshoot_qty
forced_unwind_qty, unwind_bps
deadline_residual_sec
markout_1s/5s/30s

D) Context

spread/depth/vol regime, urgency bucket, time-to-close
venue-specific reject/ack behavior
network/load indicators (to explain ACK-tail inflation)
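
One possible shape for a per-attempt record covering parts A and B of the contract is sketched below. The class name, the `_ms` epoch-millisecond convention, and the choice to derive `dedupe_expire_at` as first-seen time plus TTL are assumptions for illustration; a real gateway may report the expiry directly.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AttemptRecord:
    intent_id: str
    idempotency_key: str
    retry_attempt: int
    key_first_seen_at_ms: int
    dedupe_ttl_ms: int
    send_ts_ms: int
    ack_ts_ms: Optional[int] = None       # None until the ACK is observed
    venue_order_id: Optional[str] = None  # filled in once the venue maps the attempt

    @property
    def dedupe_expire_at_ms(self) -> int:
        # Assumption: the window starts when the key is first seen.
        return self.key_first_seen_at_ms + self.dedupe_ttl_ms

    @property
    def ack_latency_ms(self) -> Optional[int]:
        if self.ack_ts_ms is None:
            return None
        return self.ack_ts_ms - self.send_ts_ms

    @property
    def sent_after_dedupe_expiry(self) -> bool:
        # True when this attempt could no longer be blocked by the dedupe window.
        return self.send_ts_ms >= self.dedupe_expire_at_ms

retry = AttemptRecord(intent_id="intent-1", idempotency_key="key-1", retry_attempt=1,
                      key_first_seen_at_ms=1_000, dedupe_ttl_ms=500,
                      send_ts_ms=1_600, ack_ts_ms=1_650)
```

Keeping the expiry as a derived property makes it cheap to flag, at send time, any retry that is about to leave the protection of the dedupe window.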

6) Label Design

Create three event labels:

WindowExpiryDuplicateEvent
- Retry accepted after dedupe expiry while original remained live.
RetryRescueEvent
- Original failed, retry prevented miss (positive branch).
AmbiguousFinalityEvent
- Finality unresolved beyond threshold; temporary exposure uncertainty.

Training only on generic retry success rate hides asymmetric tail damage.

7) Modeling Stack (Practical)

Layer A — Finality Survival Model

Estimate \(P(A_t > \tau\mid x_t)\) for ACK/finality tails (quantile-aware).
Evaluated at \(\tau = W_t\), this gives the dynamic pressure on the dedupe window relative to the configured TTL.
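
As a simplified stand-in for Layer A, the sketch below computes an empirical tail probability from fully observed latencies and picks the smallest observed latency that meets a target exceedance rate. It ignores context and censoring (orders still pending at measurement time); a proper survival estimator would handle both. Function names and sample values are hypothetical.

```python
def survival_prob(latencies_ms, tau_ms):
    """Empirical P(A > tau) from fully observed (uncensored) latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    return sum(1 for a in latencies_ms if a > tau_ms) / len(latencies_ms)

def ttl_for_target(latencies_ms, max_exceed_prob):
    """Smallest observed latency tau with empirical P(A > tau) <= max_exceed_prob."""
    if max_exceed_prob < 0:
        raise ValueError("max_exceed_prob must be non-negative")
    for tau in sorted(latencies_ms):
        if survival_prob(latencies_ms, tau) <= max_exceed_prob:
            return tau
    raise ValueError("need at least one latency sample")

# Hypothetical finality latencies (ms) with a heavy right tail.
sample = [20, 25, 30, 40, 60, 90, 150, 400, 900, 2500]
pressure = survival_prob(sample, tau_ms=500)             # 2 of 10 samples exceed 500 ms
ttl_ms = ttl_for_target(sample, max_exceed_prob=0.10)    # 900: only 2500 ms exceeds it
```

In this made-up sample a 500 ms TTL is exceeded by 20% of finalities, so a TTL sized to the median or mean latency would leave a large collision surface.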

Layer B — Competing-Risks Branch Model

Estimate branch probabilities:

\[ P(B=b\mid x_t,a_t),\; b\in\{\text{single},\text{duplicate},\text{rescue},\text{ambiguous}\} \]

Layer C — Branch-Conditional Cost Model

For each branch, model the \(IS\) distribution (p50/p90/p99).
Then aggregate:

\[ \mathbb{E}[IS\mid x,a]=\sum_b P(B=b\mid x,a)\cdot \mathbb{E}[IS\mid x,a,B=b] \]
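
A small worked instance of the aggregation, using made-up branch probabilities and branch-conditional mean IS values (in bps) purely for illustration:

```python
def expected_is(branch_probs, branch_mean_is):
    """E[IS | x, a] as the probability-weighted sum of branch-conditional means."""
    if abs(sum(branch_probs.values()) - 1.0) > 1e-9:
        raise ValueError("branch probabilities must sum to 1")
    return sum(p * branch_mean_is[b] for b, p in branch_probs.items())

# Hypothetical inputs for one (context, action) pair.
probs = {"single": 0.90, "duplicate": 0.02, "rescue": 0.05, "ambiguous": 0.03}
mean_is_bps = {"single": 2.0, "duplicate": 25.0, "rescue": 4.0, "ambiguous": 8.0}

total_bps = expected_is(probs, mean_is_bps)
# 0.90*2.0 + 0.02*25.0 + 0.05*4.0 + 0.03*8.0 = 1.8 + 0.5 + 0.2 + 0.24 = 2.74
duplicate_share = probs["duplicate"] * mean_is_bps["duplicate"] / total_bps
```

With these numbers a branch that occurs 2% of the time contributes roughly 18% of expected cost, which is why branch probabilities need to be priced rather than averaged away.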

Layer D — Policy Simulation

Offline replay with alternative:

dedupe TTL percentiles,
retry backoff ladders,
single-intent lock rules,

to find lower tail-cost operating points.
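
A toy version of one replay dimension (the dedupe TTL) is sketched below. It assumes the dedupe window starts at the original send, that each logged event carries the hypothetical fields `retry_delay_ms` and `original_live_at_retry`, and that retry timing would not itself change under a different TTL, which a real replay would need to check.

```python
def count_window_expiry_duplicates(events, ttl_ms):
    """Count retries that would land after the window while the original was live."""
    duplicates = 0
    for e in events:
        delay = e["retry_delay_ms"]  # retry send time minus original send time
        if delay is None:
            continue                 # no retry was sent for this intent
        if e["original_live_at_retry"] and delay >= ttl_ms:
            duplicates += 1
    return duplicates

# Hypothetical logged timeout events.
events = [
    {"retry_delay_ms": 300,  "original_live_at_retry": True},
    {"retry_delay_ms": 600,  "original_live_at_retry": True},
    {"retry_delay_ms": 800,  "original_live_at_retry": False},  # true drop, retry is valid
    {"retry_delay_ms": 1200, "original_live_at_retry": True},
    {"retry_delay_ms": None, "original_live_at_retry": True},
]
sweep = {ttl: count_window_expiry_duplicates(events, ttl)
         for ttl in (250, 500, 1000, 2000)}
```

The same loop shape extends to backoff ladders and lock rules by rewriting `retry_delay_ms` per candidate policy before counting, and by attaching branch costs instead of raw counts.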

8) KPIs That Reveal Hidden Duplicate Tax

Window-Expiry Collision Rate (WECR): \[ \mathrm{WECR}=\frac{N_{\text{window\_expiry\_duplicate}}}{N_{\text{timeout\_retries}}+\epsilon} \]
Duplicate Overshoot Cost (DOC): \[ \mathrm{DOC}=IS_{\text{duplicate\_branch}}-IS_{\text{matched\_single\_branch}} \]
Retry Rescue Precision (RRP): fraction of retries that truly rescued failed originals (higher is better).
Intent Finality Lag p95 (IFL95): p95 of send→finality latency for the timeout cohort.
Reconciliation Half-Life (RHL): median time to resolve ambiguous exposure after timeout.

If WECR rises while overall fill/completion stays “normal,” you are likely paying hidden unwind tax.
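
The KPIs above could be computed along these lines. The function signatures are hypothetical, DOC is taken here as a difference of cohort means (one possible reading of the definition), and p95 uses the nearest-rank convention.

```python
from statistics import mean, median

def wecr(n_window_expiry_duplicate, n_timeout_retries, eps=1e-9):
    """Window-Expiry Collision Rate; eps guards against a zero denominator."""
    return n_window_expiry_duplicate / (n_timeout_retries + eps)

def doc(is_duplicate_bps, is_matched_single_bps):
    """Duplicate Overshoot Cost as mean IS of duplicates minus matched singles."""
    return mean(is_duplicate_bps) - mean(is_matched_single_bps)

def rrp(n_rescue_retries, n_retries):
    """Retry Rescue Precision: share of retries that rescued a failed original."""
    return n_rescue_retries / n_retries if n_retries else 0.0

def ifl95(finality_latencies_ms):
    """Nearest-rank 95th percentile of send-to-finality latency."""
    ordered = sorted(finality_latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]

def rhl(resolve_times_ms):
    """Reconciliation Half-Life: median time to resolve ambiguous exposure."""
    return median(resolve_times_ms)

# Hypothetical cohort values.
collision_rate = wecr(3, 100)
overshoot_bps = doc([20.0, 30.0], [2.0, 4.0])
lag_p95 = ifl95(list(range(1, 21)))
```

The matched-single cohort for DOC should be comparable in symbol, size, and regime; the matching step itself is outside this sketch.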

9) Control Policy (CLEAN → SAFE_SINGLE_INTENT)

CLEAN
- standard retry/backoff and static TTL.
WATCH
- raise dedupe TTL toward predicted \(A_t\) quantiles,
- widen retry spacing for noisy links.
COLLISION_RISK
- enforce single-intent lock (no second live child until prior finality state is provably terminal),
- lower aggressive catch-up to avoid overshoot unwind loops.
SAFE_SINGLE_INTENT
- completion-first conservative mode,
- strict intent ledger checks and branch-safe throttles until WECR/IFL95 normalize.

Use hysteresis + dwell time to avoid oscillation between retry modes.

10) Rollout Blueprint

Shadow week: compute WECR/DOC/RRP/IFL95 from current logs.
Counterfactual replay: test adaptive TTL + retry-lock policy on recent stress windows.
Canary: symbols/notional subset with strict rollback triggers.
Promotion gates: lower DOC and WECR without degrading completion beyond budget.
Chaos drill: inject ACK-tail delays and confirm controller enters/exits SAFE_SINGLE_INTENT correctly.

11) Common Mistakes

Treating idempotency TTL as static config instead of a latency-quantile control variable.
Counting retries, but not classifying retry outcome branches.
Measuring average ACK latency only (tail blindness).
Allowing dual-live exposure under uncertain finality.
Ignoring temporary position uncertainty as a real risk cost.

12) Fast Implementation Checklist

[ ] Log dedupe-window lifecycle per intent (first seen, expiry, retry attempt)
[ ] Build duplicate/rescue/ambiguous branch labels
[ ] Add DuplicateRisk term to routing objective
[ ] Train branch-probability + branch-cost models (quantile heads)
[ ] Deploy CLEAN→WATCH→COLLISION_RISK→SAFE_SINGLE_INTENT controller
[ ] Gate rollout on WECR + DOC + completion reliability

References

RFC 7231 / RFC 9110 idempotency semantics (HTTP method-level background; useful but insufficient for trading intent guarantees).
FIX Protocol session and application-level sequencing guidance (for practical finality/replay handling).
Almgren, R. & Chriss, N. (2000), Optimal Execution of Portfolio Transactions.
Cartea, Á., Jaimungal, S., Penalva, J. (2015), Algorithmic and High-Frequency Trading.

TL;DR

Timeout retries are a branching execution decision, not a transport footnote. Model idempotency-window expiry explicitly, price duplicate-branch risk in action selection, and enforce single-intent SAFE controls before hidden overfill/unwind cost leaks into tail slippage.