Cancel-on-Disconnect Flapping Session Slippage Playbook
Date: 2026-03-14
Category: research
Focus: Modeling hidden execution cost when connectivity flaps trigger venue-level Cancel-on-Disconnect (CoD) order purges.
1) Why this failure mode deserves first-class treatment
Most slippage models assume a clean order lifecycle:
submit -> rest -> partial fill -> amend/cancel -> complete
But in real production, session-level controls (FIX disconnect rules, gateway heartbeat timeout, venue CoD settings) can force a different lifecycle:
disconnect -> venue purges resting orders -> reconnect -> strategy rebuilds inventory with fresh child orders
That creates a control-plane slippage tax that often gets misattributed to “market volatility”:
- queue priority disappears instantly,
- live order inventory becomes uncertain for a short interval,
- residual schedule bunches into re-entry bursts,
- urgency gets recalculated on stale assumptions.
If this is not explicitly modeled, p95/p99 cost tails look random and “unfixable.”
2) Mechanism map (what actually causes the leak)
2.1 CoD purge branch
When a session drops long enough to violate the venue/gateway timeout, resting passive orders are canceled by venue policy.
Direct consequences
- Queue Age Reset: passive queue option value is destroyed.
- Re-entry Competition: replacement orders fight from the back of queue.
- Residual Compression: missed passive fills become future catch-up pressure.
- Asymmetric Replay: fills/acks/cancels can replay in mixed order after reconnect.
2.2 Flapping amplification
A single disconnect is manageable. Flapping (repeated disconnect/reconnect cycles) is dangerous:
- repeated CoD waves,
- repeated queue resets,
- strategy repeatedly re-prices and re-sends,
- throttle/reject risk rises during recovery bursts.
So the tail is not linear in outage duration; it is convex in flap count × rebuild pressure.
3) Cost decomposition
Model total execution shortfall as:
\[ C_{total} = C_{base} + C_{CoD} + C_{rebuild} + C_{deadline} \]
Where:
- \(C_{base}\): normal microstructure cost (spread + impact + timing)
- \(C_{CoD}\): queue-option loss from forced cancels
- \(C_{rebuild}\): adverse re-entry cost + reject/retry overhead
- \(C_{deadline}\): catch-up convexity if schedule deficit must be repaid fast
A practical branch expectation form:
\[ \mathbb{E}[C] = p_{stable} C_{stable} + p_{single} C_{single\_CoD} + p_{flap} C_{flap\_cascade} \]
with \(p_{flap}\) estimated from session-health features, not market features alone.
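As a rough illustration of the branch expectation, the sketch below evaluates it with made-up numbers. The function name, the convention that the stable probability is the remainder, and every probability and cost value are assumptions for illustration, not calibrated estimates.

```python
def expected_cost_bps(p_single, p_flap, c_stable, c_single_cod, c_flap_cascade):
    """Branch-weighted expected shortfall in bps.

    Assumption: the three branches are exhaustive, so
    p_stable = 1 - p_single - p_flap.
    """
    if p_single < 0 or p_flap < 0 or p_single + p_flap > 1:
        raise ValueError("branch probabilities must be non-negative and sum to at most 1")
    p_stable = 1.0 - p_single - p_flap
    return p_stable * c_stable + p_single * c_single_cod + p_flap * c_flap_cascade


# Illustrative inputs only: 4% single-CoD episodes, 1% flap cascades.
example_cost = expected_cost_bps(
    p_single=0.04, p_flap=0.01,
    c_stable=2.0, c_single_cod=6.0, c_flap_cascade=25.0,
)
flap_share = (0.01 * 25.0) / example_cost  # share of expected cost from the flap branch
```

With these hypothetical inputs, the flap branch is 1% of episodes but roughly a tenth of expected cost, which is why a small error in the flap probability moves the estimate disproportionately.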
4) Feature set for production modeling
4.1 Session-health features (must-have)
- disconnect_count_1m, disconnect_count_5m
- session_uptime_seconds
- heartbeat_rtt_p50/p95, heartbeat_miss_streak
- reconnect_duration_ms
- time_since_last_disconnect_ms
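The rolling disconnect counts can be derived from a sorted list of disconnect timestamps. The following is a minimal sketch; the function name, the input shape (epoch milliseconds, ascending), and the half-open window convention are assumptions, since the playbook does not specify how the telemetry is stored.

```python
from bisect import bisect_right


def session_health_features(disconnect_ts_ms, now_ms):
    """Compute rolling disconnect features.

    disconnect_ts_ms: disconnect times in epoch ms, sorted ascending (assumed).
    Windows are (now - window, now], so an event exactly one window old is excluded.
    """
    # Ignore any timestamps after `now_ms` (useful when replaying history).
    past = disconnect_ts_ms[:bisect_right(disconnect_ts_ms, now_ms)]

    def count_within(window_ms):
        return len(past) - bisect_right(past, now_ms - window_ms)

    return {
        "disconnect_count_1m": count_within(60_000),
        "disconnect_count_5m": count_within(300_000),
        # None when the session has never disconnected.
        "time_since_last_disconnect_ms": (now_ms - past[-1]) if past else None,
    }


features = session_health_features([0, 200_000, 290_000, 299_000], now_ms=300_000)
```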
4.2 CoD/rebuild features
- cod_cancel_count
- cod_cancel_notional
- rebuild_order_count_30s
- rebuild_notional_ratio = rebuild_notional / remaining_notional
- post_reconnect_reject_rate
4.3 Queue-loss proxies
- forced_cancel_queue_age_sum (if queue age tracked)
- passive_fill_miss_delta_30s (expected minus realized)
- touch_rejoin_latency_ms
4.4 Coupling features with market stress
- spread z-score
- microvolatility burst
- top-of-book depth decay
- short-horizon OFI skew
Important: CoD cost is worst when infra fragility and liquidity fragility co-occur.
5) Core metrics (dashboard + alarms)
5.1 DFI — Disconnect Flap Index
\[ \mathrm{DFI} = w_1 \cdot \text{disconnect\_count\_1m} + w_2 \cdot \text{heartbeat\_miss\_streak} + w_3 \cdot \text{reconnect\_duration\_z} \]
Measures session instability intensity.
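A minimal sketch of the DFI calculation follows. The weights are placeholders, and the z-score baseline (mean and standard deviation of historical reconnect durations) is an assumed input; neither is specified in this playbook.

```python
def reconnect_duration_z(duration_ms, baseline_mean_ms, baseline_std_ms):
    """Z-score of the latest reconnect duration against a historical baseline."""
    if baseline_std_ms <= 0:
        # Assumption: with no usable baseline, the term contributes nothing.
        return 0.0
    return (duration_ms - baseline_mean_ms) / baseline_std_ms


def disconnect_flap_index(disconnect_count_1m, heartbeat_miss_streak,
                          reconnect_z, weights=(1.0, 0.5, 0.5)):
    """Weighted sum of the three instability terms (weights are hypothetical)."""
    w1, w2, w3 = weights
    return w1 * disconnect_count_1m + w2 * heartbeat_miss_streak + w3 * reconnect_z


z = reconnect_duration_z(250.0, baseline_mean_ms=150.0, baseline_std_ms=50.0)
dfi = disconnect_flap_index(disconnect_count_1m=2, heartbeat_miss_streak=3, reconnect_z=z)
```

In practice the weights would be fit or tuned so that the watch threshold fires ahead of CoD events rather than after them.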
5.2 QLT — Queue Loss Tax
\[ \mathrm{QLT} = \frac{\text{post-CoD realized cost} - \text{counterfactual no-CoD cost}}{\text{executed notional}} \]
Primary KPI for this failure mode.
5.3 RBS — Rebuild Burst Stress
\[ \mathrm{RBS} = \frac{\text{rebuild notional in 30 s}}{\text{remaining schedule notional}} \]
Captures urgency injection from forced restarts.
5.4 FRR — Flap Recovery Reject Rate
\[ \mathrm{FRR} = \frac{\text{rejects during } T_{recovery}}{\text{orders sent during } T_{recovery}} \]
Detects control-plane saturation during re-entry.
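The three ratio metrics (QLT, RBS, FRR) can be sketched as below. The function names, the conversion of QLT to basis points, and the choice to return None on a zero denominator are assumptions made for illustration.

```python
def _safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def queue_loss_tax_bps(realized_cost, counterfactual_cost, executed_notional):
    """QLT in basis points; costs and notional must share a currency (assumed)."""
    ratio = _safe_ratio(realized_cost - counterfactual_cost, executed_notional)
    return None if ratio is None else ratio * 1e4


def rebuild_burst_stress(rebuild_notional_30s, remaining_schedule_notional):
    """RBS: fraction of the remaining schedule re-sent within 30 s of reconnect."""
    return _safe_ratio(rebuild_notional_30s, remaining_schedule_notional)


def flap_recovery_reject_rate(rejects, orders_sent):
    """FRR over the recovery window."""
    return _safe_ratio(rejects, orders_sent)


qlt = queue_loss_tax_bps(1_500.0, 900.0, 2_000_000.0)   # illustrative inputs
rbs = rebuild_burst_stress(400_000.0, 1_000_000.0)
frr = flap_recovery_reject_rate(3, 60)
```

Returning None rather than 0 keeps "no activity in the window" distinguishable from "activity with zero rejects" on a dashboard.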
6) State machine for execution controls
STABLE
- Normal policy.
- Passive/midpoint behavior governed by standard toxicity model.
FLAP_WATCH (DFI above watch threshold)
- Reduce replace/cancel churn.
- Widen minimum dwell before repricing.
- Slightly lower venue fan-out to reduce control load.
COD_RECOVERY (confirmed CoD event)
- Freeze non-essential strategy mutations for short cool-down window.
- Rebuild in paced ladder (chunked resubmission), not burst dump.
- Temporarily cap aggression step size.
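The paced-ladder rebuild in COD_RECOVERY can be sketched as a simple chunked schedule. The function name, fixed chunk size, and fixed spacing are assumptions; a production version would likely adapt both to reject pressure and liquidity.

```python
def paced_rebuild_schedule(residual_qty, chunk_qty, interval_ms, start_ms=0):
    """Split the purged residual into evenly spaced chunks.

    Returns a list of (send_time_ms, qty) pairs. The last chunk carries
    whatever remains, so the quantities always sum to residual_qty.
    """
    if residual_qty < 0 or chunk_qty <= 0 or interval_ms < 0:
        raise ValueError("residual_qty >= 0, chunk_qty > 0, interval_ms >= 0 required")
    schedule = []
    remaining = residual_qty
    send_time = start_ms
    while remaining > 0:
        qty = min(chunk_qty, remaining)
        schedule.append((send_time, qty))
        remaining -= qty
        send_time += interval_ms
    return schedule


ladder = paced_rebuild_schedule(residual_qty=1000, chunk_qty=400, interval_ms=250)
```

Here a 1,000-share residual is re-sent as three orders over half a second instead of one burst at reconnect.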
SAFE_STABILIZE (repeated flaps or FRR spike)
- Enter conservative completion mode:
  - smaller slices,
  - stricter reject budget,
  - venue subset with healthiest session path.
- Optional temporary participation cap reduction until session health recovers.
Recovery requires hysteresis (time + health thresholds), not one-tick oscillation.
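One way to sketch the four states with hysteresis is shown below. All threshold values, the class and field names, and the rule that recovery steps down one state at a time are assumptions for illustration; the playbook only requires that escalation be fast and recovery be gated on time plus health.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    STABLE = 0
    FLAP_WATCH = 1
    COD_RECOVERY = 2
    SAFE_STABILIZE = 3


@dataclass
class Thresholds:
    dfi_watch: float = 2.0       # enter FLAP_WATCH at or above this DFI
    dfi_clear: float = 1.0       # must fall below this to count as healthy
    frr_spike: float = 0.10      # reject rate that forces SAFE_STABILIZE
    flap_limit: int = 2          # CoD events in 5 min treated as flapping
    min_healthy_ms: int = 30_000 # sustained health required per step down


class ExecutionStateMachine:
    def __init__(self, thresholds=None):
        self.th = thresholds or Thresholds()
        self.state = State.STABLE
        self.healthy_since_ms = None

    def update(self, now_ms, dfi, frr, cod_events_5m, cod_just_fired):
        th = self.th
        # Escalation is immediate: pick the most severe state the inputs justify.
        target = State.STABLE
        if dfi >= th.dfi_watch:
            target = State.FLAP_WATCH
        if cod_just_fired:
            target = State.COD_RECOVERY
        if cod_events_5m >= th.flap_limit or frr >= th.frr_spike:
            target = State.SAFE_STABILIZE
        if target.value > self.state.value:
            self.state = target
            self.healthy_since_ms = None
            return self.state

        # De-escalation needs both health (below the clear thresholds)
        # and time (min_healthy_ms of uninterrupted health).
        healthy = (dfi < th.dfi_clear and frr < th.frr_spike
                   and cod_events_5m < th.flap_limit and not cod_just_fired)
        if not healthy:
            self.healthy_since_ms = None
        elif self.state is not State.STABLE:
            if self.healthy_since_ms is None:
                self.healthy_since_ms = now_ms
            if now_ms - self.healthy_since_ms >= th.min_healthy_ms:
                self.state = State(self.state.value - 1)  # step down one level
                self.healthy_since_ms = now_ms
        return self.state
```

The gap between `dfi_watch` and `dfi_clear` is the hysteresis band: a DFI that hovers between the two neither escalates nor starts the recovery clock.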
7) Backtest/replay methodology
Episode labeling
- Partition historical data into stable, single_CoD, and flap_cascade episodes.
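A minimal episode tagger might look like the sketch below. The labeling rule (two CoD events closer than a flap window make a cascade, otherwise any CoD event makes the episode single_CoD) and the 60-second default are assumptions; the playbook names the classes but does not define their boundaries.

```python
def label_episode(cod_event_ts_ms, flap_window_ms=60_000):
    """Classify one order episode from the CoD purge timestamps inside it.

    - no purges                              -> "stable"
    - any two purges within flap_window_ms   -> "flap_cascade"
    - otherwise (isolated purges only)       -> "single_CoD"
    """
    ts = sorted(cod_event_ts_ms)
    if not ts:
        return "stable"
    for earlier, later in zip(ts, ts[1:]):
        if later - earlier <= flap_window_ms:
            return "flap_cascade"
    return "single_CoD"


labels = [
    label_episode([]),
    label_episode([5_000]),
    label_episode([5_000, 30_000]),
    label_episode([0, 600_000]),
]
```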
Counterfactual reconstruction
- Replay with observed market path but synthetic no-CoD order continuity.
Tail-centric evaluation
- Track q50/q90/q95 shortfall delta by episode class.
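The tail-centric comparison can be sketched with a nearest-rank quantile over per-episode shortfall deltas. The record layout (class label, realized bps, counterfactual bps) and the nearest-rank convention are assumptions; any consistent quantile definition would serve.

```python
from collections import defaultdict


def nearest_rank_quantile(values, pct):
    """Nearest-rank quantile; pct is an integer percentage in 1..100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def shortfall_delta_by_class(records, pcts=(50, 90, 95)):
    """records: iterable of (episode_class, realized_bps, counterfactual_bps)."""
    deltas = defaultdict(list)
    for episode_class, realized_bps, counterfactual_bps in records:
        deltas[episode_class].append(realized_bps - counterfactual_bps)
    return {
        cls: {f"q{p}": nearest_rank_quantile(vals, p) for p in pcts}
        for cls, vals in deltas.items()
    }


# Synthetic data for illustration only: deltas of 1..20 bps in the flap class.
sample = [("flap_cascade", float(i), 0.0) for i in range(1, 21)]
sample.append(("stable", 2.0, 2.0))
summary = shortfall_delta_by_class(sample)
```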
Completion guardrail
- Report cost improvement jointly with completion reliability and deadline misses.
Stress slicing
- Evaluate separately for open/close windows where queue reset is most expensive.
8) Practical rollout plan (30 days)
Week 1: Instrumentation
- Ensure CoD/cancel reason codes are normalized across venues.
- Add session-health telemetry to execution event log.
- Build episode tagger.
Week 2: Shadow model
- Compute DFI/QLT/RBS/FRR without changing live policy.
- Validate that CoD episodes explain tail residuals currently blamed on “vol.”
Week 3: Guarded activation
- Enable FLAP_WATCH + COD_RECOVERY controls for 5–10% of traffic.
- Hard rollback trigger: completion drop > preset threshold.
Week 4: Scale + stabilize
- Expand traffic if q95 QLT improves and completion stays inside SLA.
- Tune hysteresis to avoid state flapping.
9) Anti-patterns
- Treating disconnects as pure infra incidents unrelated to trading cost.
- Re-submitting full residual immediately after reconnect.
- Using identical post-reconnect behavior regardless of reject pressure.
- Evaluating success only on mean bps (ignoring p95 and completion reliability).
10) Bottom line
Cancel-on-Disconnect is a slippage regime, not an operational footnote.
If you do not model session-health and CoD-induced queue resets explicitly, your execution stack will keep paying hidden tail tax during connectivity turbulence. The winning setup is:
- session-health-aware slippage features,
- CoD episode decomposition,
- paced recovery controls with hysteresis,
- q95 + completion joint governance.
That turns reconnect chaos from “market bad luck” into an engineerable control problem.