Idempotency-Window Expiry Duplicate Slippage Playbook
Pricing Retry-Time Uncertainty as a First-Class Execution Risk
Why this note: Many routers implement idempotency keys with finite dedupe windows (TTL). Under ACK-tail inflation, retries can cross that boundary and be accepted as fresh intent. The result is not a simple “tech bug” but a branch-risk problem: overfill/unwind, phantom underfill, and late catch-up convexity.
1) Failure Mode in One Sentence
When ACK/fill finality arrives after the idempotency dedupe window, a retry may become a second live order, creating hidden overfill and unwind slippage.
2) Extend the Action Objective with Duplicate-Branch Risk
For action (a) in context (x):
[ J(a|x)=\mathbb{E}[IS|x,a] + \lambda\,\mathrm{CVaR}_{q}(IS|x,a) + \eta\,\mathrm{MissRisk}(x,a) + \rho\,\mathrm{DuplicateRisk}(x,a) ]
Where (\mathrm{DuplicateRisk}) is expected incremental loss from timeout/retry branch ambiguity:
- retry before true finality,
- dedupe-window expiry,
- dual-live exposure,
- forced unwind after late reconciliation.
Without this term, “retry for reliability” can silently mutate into tail slippage.
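As a rough illustration of how the extended objective could be scored for one candidate action, the sketch below combines sampled implementation shortfall with the two risk terms. The function names, the default weights, and the tail-average CVaR estimator are assumptions made for this example, not part of the note.

```python
import math
from typing import Sequence


def cvar(losses: Sequence[float], q: float) -> float:
    """Average of the worst (1 - q) share of losses; larger loss = worse."""
    ordered = sorted(losses)
    # round() guards against float noise such as 20 * (1 - 0.95) = 1.0000000000000009
    k = max(1, math.ceil(round(len(ordered) * (1.0 - q), 9)))
    tail = ordered[-k:]
    return sum(tail) / len(tail)


def objective(is_samples: Sequence[float],
              miss_risk: float,
              duplicate_risk: float,
              lam: float = 0.5,   # lambda: weight on tail shortfall (hypothetical default)
              eta: float = 1.0,   # eta: weight on MissRisk (hypothetical default)
              rho: float = 1.0,   # rho: weight on DuplicateRisk (hypothetical default)
              q: float = 0.95) -> float:
    """J(a|x) = E[IS] + lam * CVaR_q(IS) + eta * MissRisk + rho * DuplicateRisk."""
    mean_is = sum(is_samples) / len(is_samples)
    return mean_is + lam * cvar(is_samples, q) + eta * miss_risk + rho * duplicate_risk
```

Setting `rho` to zero recovers an objective that ignores the duplicate branch, which makes it easy to compare action rankings with and without the new term.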
3) Minimal Dynamics Model
Let:
- (W_t): active idempotency dedupe window (ms)
- (A_t): ACK/finality latency random variable
- (R_t\in\{0,1\}): retry fired before finality
- (L_t\in\{0,1\}): original order still live at retry time
Define window-expiry collision probability:
[ p^{dup}_t = P(A_t > W_t, R_t=1, L_t=1 \mid x_t) ]
Expected duplicate branch cost:
[ \mathrm{DuplicateRisk}_t = p^{dup}_t\cdot C^{overfill}_t + p^{miss}_t\cdot C^{latecatch}_t + p^{unc}_t\cdot C^{reconcile}_t ]
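Where (p^{miss}_t) and (p^{unc}_t) are the corresponding probabilities of the suppressed-retry (late catch-up) and unresolved-finality branches, and: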
- (C^{overfill}): extra impact + unwind cost when both intents execute,
- (C^{latecatch}): urgency convexity when retry suppressed too long,
- (C^{reconcile}): temporary hedge/position noise while truth is unresolved.
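A minimal sketch of how the collision probability and the expected duplicate-branch cost might be estimated from logged timeout events. It uses a plain empirical frequency over one cohort; conditioning on (x_t) is assumed to happen upstream by bucketing events. The `TimeoutObs` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TimeoutObs:
    finality_latency_ms: float  # A_t
    dedupe_ttl_ms: float        # W_t
    retry_fired: bool           # R_t
    original_live: bool         # L_t


def p_dup(obs: Sequence[TimeoutObs]) -> float:
    """Empirical P(A_t > W_t, R_t = 1, L_t = 1) over one context bucket."""
    if not obs:
        return 0.0
    hits = sum(
        1 for o in obs
        if o.finality_latency_ms > o.dedupe_ttl_ms and o.retry_fired and o.original_live
    )
    return hits / len(obs)


def duplicate_risk(p_dup_: float, c_overfill: float,
                   p_miss: float, c_latecatch: float,
                   p_unc: float, c_reconcile: float) -> float:
    """Probability-weighted sum of the three branch costs."""
    return p_dup_ * c_overfill + p_miss * c_latecatch + p_unc * c_reconcile
```

In practice the three probabilities would more likely come from a fitted model than from raw counts, but the raw frequency is a useful baseline to check that model against.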
Use latent regime (S_t\in\{\text{CLEAN},\text{WATCH},\text{COLLISION\_RISK},\text{SAFE\_SINGLE\_INTENT}\}).
4) Branch Taxonomy (Model What Actually Happens)
For each timeout event, classify outcome:
- Single-Live Recover: original accepted, retry blocked by dedupe (good).
- Window-Expiry Duplicate: original accepted, retry also accepted as new intent (bad overfill branch).
- True Drop + Valid Retry: original genuinely lost/rejected, retry required (good rescue branch).
- Ambiguous Pending: neither leg final for too long; exposure uncertain (control-risk branch).
Slippage comes from mispricing branch probabilities, not from average timeout count alone.
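The taxonomy above can be sketched as a small classifier over the reconciled state of the two legs. This is a simplified illustration: it assumes each leg is reported as accepted, rejected, or still unresolved (`None`) at a classification deadline, and it treats any unresolved leg as ambiguous. The names are hypothetical.

```python
from enum import Enum
from typing import Optional


class Branch(Enum):
    SINGLE_LIVE_RECOVER = "single"
    WINDOW_EXPIRY_DUPLICATE = "duplicate"
    TRUE_DROP_VALID_RETRY = "rescue"
    AMBIGUOUS_PENDING = "ambiguous"


def classify_timeout(original_accepted: Optional[bool],
                     retry_accepted: Optional[bool]) -> Optional[Branch]:
    """Map the reconciled state of both legs to a branch.

    None for an input means that leg has not reached finality yet.
    """
    if original_accepted is None or retry_accepted is None:
        return Branch.AMBIGUOUS_PENDING
    if original_accepted and not retry_accepted:
        return Branch.SINGLE_LIVE_RECOVER
    if original_accepted and retry_accepted:
        return Branch.WINDOW_EXPIRY_DUPLICATE
    if retry_accepted:
        return Branch.TRUE_DROP_VALID_RETRY
    return None  # both legs rejected: outside the four branches listed above
```

Returning `None` for the both-rejected case keeps it out of the branch counts rather than forcing it into a category the taxonomy does not define.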
5) Telemetry Contract (Required)
A) Intent / Idempotency
- intent_id, idempotency_key, parent_id, child_seq
- key_first_seen_at, dedupe_ttl_ms, dedupe_expire_at
- retry_attempt, retry_reason, retry_backoff_ms
B) Gateway / Venue Finality
- send_ts, ack_ts, fill_ts, cancel_ack_ts
- ack_latency_ms, finality_latency_ms
- venue_order_id mapping per attempt
- duplicate_accept_detected (post-reconcile)
C) Execution Consequences
- position_overshoot_qty
- forced_unwind_qty, unwind_bps
- deadline_residual_sec
- markout_1s/5s/30s
D) Context
- spread/depth/vol regime, urgency bucket, time-to-close
- venue-specific reject/ack behavior
- network/load indicators (to explain ACK-tail inflation)
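One possible shape for a per-attempt telemetry record, using a subset of the field names listed above. The units (epoch milliseconds), the choice to derive `dedupe_expire_at` and `ack_latency_ms` rather than log them, and the `sent_after_window` helper are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttemptTelemetry:
    intent_id: str
    idempotency_key: str
    retry_attempt: int            # 0 for the original send
    key_first_seen_at: float      # epoch ms (assumed unit)
    dedupe_ttl_ms: float
    send_ts: float                # epoch ms (assumed unit)
    ack_ts: Optional[float] = None
    venue_order_id: Optional[str] = None

    @property
    def dedupe_expire_at(self) -> float:
        """Assumes the window runs from first sighting of the key for ttl ms."""
        return self.key_first_seen_at + self.dedupe_ttl_ms

    @property
    def ack_latency_ms(self) -> Optional[float]:
        """None until an ACK timestamp has been recorded."""
        return None if self.ack_ts is None else self.ack_ts - self.send_ts

    @property
    def sent_after_window(self) -> bool:
        """True when this attempt left after the dedupe window had closed."""
        return self.send_ts > self.dedupe_expire_at
```

A retry with `sent_after_window` set and a distinct `venue_order_id` from the original attempt is the raw signature that post-reconcile duplicate detection would look for.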
6) Label Design
Create three event labels:
- WindowExpiryDuplicateEvent: retry accepted after dedupe expiry while original remained live.
- RetryRescueEvent: original failed, retry prevented miss (positive branch).
- AmbiguousFinalityEvent: finality unresolved beyond threshold; temporary exposure uncertainty.
Training only on generic retry success rate hides asymmetric tail damage.
7) Modeling Stack (Practical)
Layer A — Finality Survival Model
Estimate (P(A_t > \tau\mid x_t)) for ACK/finality tails (quantile-aware).
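Comparing these tail estimates with the configured TTL gives a dynamic measure of window pressure: how likely finality is to arrive after the dedupe window has closed.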
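As a baseline for Layer A, an empirical survival estimate and a nearest-rank quantile can be read straight from observed latencies before any model is fitted. This sketch ignores conditioning on (x_t) and censoring of still-pending orders, both of which a real survival model would need to handle. Function names are hypothetical.

```python
import math
from typing import Sequence


def survival_prob(latencies_ms: Sequence[float], tau_ms: float) -> float:
    """Empirical P(A > tau): share of observed latencies exceeding tau."""
    return sum(1 for a in latencies_ms if a > tau_ms) / len(latencies_ms)


def latency_quantile(latencies_ms: Sequence[float], p: float) -> float:
    """Nearest-rank quantile, e.g. p = 0.99 for the p99 latency."""
    ordered = sorted(latencies_ms)
    # round() guards against float noise before taking the ceiling
    rank = max(1, math.ceil(round(p * len(ordered), 9)))
    return ordered[rank - 1]


def window_pressure(latencies_ms: Sequence[float], ttl_ms: float) -> float:
    """Share of observed finality latencies that outlast the configured TTL."""
    return survival_prob(latencies_ms, ttl_ms)
```

Tracking `window_pressure` per venue and load regime shows when a static TTL has drifted inside the latency tail.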
Layer B — Competing-Risks Branch Model
Estimate branch probabilities:
[ P(B=b\mid x_t,a_t),\; b\in\{\text{single},\text{duplicate},\text{rescue},\text{ambiguous}\} ]
Layer C — Branch-Conditional Cost Model
For each branch, model (IS) distribution (p50/p90/p99).
Then aggregate:
[ \mathbb{E}[IS|x,a]=\sum_b P(B=b|x,a)\cdot \mathbb{E}[IS|x,a,B=b] ]
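The aggregation is the law of total expectation over branches. The short sketch below works it through with made-up branch probabilities and branch-conditional mean costs in basis points; all numbers are illustrative assumptions, not measurements.

```python
from typing import Dict


def expected_is(branch_probs: Dict[str, float], branch_mean_is: Dict[str, float]) -> float:
    """E[IS | x, a] = sum over b of P(B = b | x, a) * E[IS | x, a, B = b]."""
    total_p = sum(branch_probs.values())
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("branch probabilities must sum to 1")
    return sum(p * branch_mean_is[b] for b, p in branch_probs.items())


# Illustrative inputs only (hypothetical values, in bps).
probs = {"single": 0.90, "duplicate": 0.02, "rescue": 0.06, "ambiguous": 0.02}
costs = {"single": 3.0, "duplicate": 40.0, "rescue": 5.0, "ambiguous": 12.0}
total = expected_is(probs, costs)
duplicate_share = probs["duplicate"] * costs["duplicate"] / total
```

With these inputs the total is about 4.04 bps, and the duplicate branch contributes roughly a fifth of it despite a 2% probability, which is the asymmetry the branch model is meant to expose.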
Layer D — Policy Simulation
Offline replay with alternative:
- dedupe TTL percentiles,
- retry backoff ladders,
- single-intent lock rules,
to find lower tail-cost operating points.
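A minimal sketch of one replay axis: sweeping candidate TTLs over historical retry events and counting how many would have crossed the window while the original was still live. The event tuple layout is hypothetical, and the replay holds retry timing fixed, so it does not capture how a different TTL or backoff ladder would change behavior.

```python
from typing import Dict, Iterable, Sequence, Tuple

# (ms from key first seen to retry arrival, original still live at retry time)
RetryEvent = Tuple[float, bool]


def replay_collisions(events: Sequence[RetryEvent], ttl_ms: float) -> int:
    """Count retries that would arrive after the window while the original is live."""
    return sum(1 for delay_ms, live in events if live and delay_ms > ttl_ms)


def sweep_ttls(events: Sequence[RetryEvent], candidate_ttls_ms: Iterable[float]) -> Dict[float, int]:
    """Collision count per candidate TTL, for comparing operating points."""
    return {ttl: replay_collisions(events, ttl) for ttl in candidate_ttls_ms}
```

A longer TTL always lowers this count, so the sweep only becomes a decision tool once it is paired with the cost of the longer window, such as delayed rescue retries.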
8) KPIs That Reveal Hidden Duplicate Tax
- Window-Expiry Collision Rate (WECR): [ WECR=\frac{N_{\text{window\_expiry\_duplicate}}}{N_{\text{timeout\_retries}}+\epsilon} ]
- Duplicate Overshoot Cost (DOC): [ DOC=IS_{\text{duplicate\_branch}}-IS_{\text{matched\_single\_branch}} ]
- Retry Rescue Precision (RRP): fraction of retries that truly rescued failed originals (higher is better).
- Intent Finality Lag p95 (IFL95): p95 of send→finality latency for the timeout cohort.
- Reconciliation Half-Life (RHL): median time to resolve ambiguous exposure after timeout.
If WECR rises while overall fill/completion stays “normal,” you are likely paying hidden unwind tax.
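The first three KPIs reduce to simple ratios and differences once branch labels exist. The sketch below assumes DOC is taken as a difference of cohort means over matched samples and that RRP uses the same epsilon guard as WECR; both are assumptions, as are the function names.

```python
from statistics import mean
from typing import Sequence


def wecr(n_window_expiry_duplicate: int, n_timeout_retries: int, eps: float = 1e-9) -> float:
    """Window-Expiry Collision Rate: duplicates per timeout retry."""
    return n_window_expiry_duplicate / (n_timeout_retries + eps)


def doc(is_duplicate_bps: Sequence[float], is_matched_single_bps: Sequence[float]) -> float:
    """Duplicate Overshoot Cost: mean IS on duplicate branch minus matched single branch."""
    return mean(is_duplicate_bps) - mean(is_matched_single_bps)


def rrp(n_rescue: int, n_retries: int, eps: float = 1e-9) -> float:
    """Retry Rescue Precision: share of retries that rescued a failed original."""
    return n_rescue / (n_retries + eps)
```

Reporting WECR and RRP side by side helps separate retries that added exposure from retries that restored it.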
9) Control Policy (CLEAN → SAFE_SINGLE_INTENT)
- CLEAN
  - standard retry/backoff and static TTL.
- WATCH
  - raise dedupe TTL toward predicted (A_t) quantiles,
  - widen retry spacing for noisy links.
- COLLISION_RISK
  - enforce single-intent lock (no second live child until prior finality state is provably terminal),
  - lower aggressive catch-up to avoid overshoot unwind loops.
- SAFE_SINGLE_INTENT
  - completion-first conservative mode,
  - strict intent ledger checks and branch-safe throttles until WECR/IFL95 normalize.
Use hysteresis + dwell time to avoid oscillation between retry modes.
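Hysteresis and dwell time can be sketched as a small state machine: escalation happens as soon as a signal crosses an entry threshold, while de-escalation requires both a minimum dwell and a lower exit threshold. Driving the controller from WECR alone, the threshold values, and the class name are all assumptions for illustration.

```python
from typing import Sequence

STATES = ["CLEAN", "WATCH", "COLLISION_RISK", "SAFE_SINGLE_INTENT"]


class RegimeController:
    def __init__(self,
                 enter: Sequence[float] = (0.005, 0.02, 0.05),  # hypothetical WECR entry thresholds
                 exit_ratio: float = 0.5,   # exit threshold = entry threshold * exit_ratio
                 min_dwell: int = 3):       # ticks to stay before stepping down
        self.enter = tuple(enter)
        self.exit_ratio = exit_ratio
        self.min_dwell = min_dwell
        self.level = 0
        self.ticks_in_state = 0

    def step(self, wecr: float) -> str:
        """Advance one tick with the latest WECR reading and return the state."""
        self.ticks_in_state += 1
        target = sum(1 for th in self.enter if wecr >= th)
        if target > self.level:
            # Escalate immediately, skipping levels if the signal warrants it.
            self.level = target
            self.ticks_in_state = 0
        elif (self.level > 0
              and self.ticks_in_state >= self.min_dwell
              and wecr < self.enter[self.level - 1] * self.exit_ratio):
            # Step down one level only after the dwell and below the exit threshold.
            self.level -= 1
            self.ticks_in_state = 0
        return STATES[self.level]
```

Stepping down one level at a time means a controller that entered SAFE_SINGLE_INTENT needs several quiet dwell periods to return to CLEAN, which is the asymmetry that prevents oscillation.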
10) Rollout Blueprint
- Shadow week: compute WECR/DOC/RRP/IFL95 from current logs.
- Counterfactual replay: test adaptive TTL + retry-lock policy on recent stress windows.
- Canary: symbols/notional subset with strict rollback triggers.
- Promotion gates: lower DOC and WECR without degrading completion beyond budget.
- Chaos drill: inject ACK-tail delays and confirm controller enters/exits SAFE correctly.
11) Common Mistakes
- Treating idempotency TTL as static config instead of as a latency-quantile control variable.
- Counting retries, but not classifying retry outcome branches.
- Measuring average ACK latency only (tail blindness).
- Allowing dual-live exposure under uncertain finality.
- Ignoring temporary position uncertainty as a real risk cost.
12) Fast Implementation Checklist
[ ] Log dedupe-window lifecycle per intent (first seen, expiry, retry attempt)
[ ] Build duplicate/rescue/ambiguous branch labels
[ ] Add DuplicateRisk term to routing objective
[ ] Train branch-probability + branch-cost models (quantile heads)
[ ] Deploy CLEAN→WATCH→COLLISION_RISK→SAFE controller
[ ] Gate rollout on WECR + DOC + completion reliability
TL;DR
Timeout retries are a branching execution decision, not a transport footnote. Model idempotency-window expiry explicitly, price duplicate-branch risk in action selection, and enforce single-intent SAFE controls before hidden overfill/unwind cost leaks into tail slippage.