Cancel-on-Disconnect Flapping Session Slippage Playbook

Date: 2026-03-14
Category: research
Focus: Modeling hidden execution cost when connectivity flaps trigger venue-level Cancel-on-Disconnect (CoD) order purges.

1) Why this failure mode deserves first-class treatment

Most slippage models assume a clean order lifecycle:

submit -> rest -> partial fill -> amend/cancel -> complete

In production, however, session-level controls (FIX disconnect rules, gateway heartbeat timeouts, venue CoD settings) can force a different lifecycle:

disconnect -> venue purges resting orders -> reconnect -> strategy rebuilds inventory with fresh child orders

That creates a control-plane slippage tax that often gets misattributed to “market volatility”:

queue priority disappears instantly,
live order inventory becomes uncertain for a short interval,
residual schedule bunches into re-entry bursts,
urgency gets recalculated on stale assumptions.

If this is not explicitly modeled, p95/p99 cost tails look random and “unfixable.”

2) Mechanism map (what actually causes the leak)

2.1 CoD purge branch

When a session drops for long enough to exceed the venue or gateway timeout, the venue cancels resting passive orders under its CoD policy.

Direct consequences

Queue Age Reset: passive queue option value is destroyed.
Re-entry Competition: replacement orders fight from the back of queue.
Residual Compression: missed passive fills become future catch-up pressure.
Asymmetric Replay: fills/acks/cancels can replay in mixed order after reconnect.

2.2 Flapping amplification

A single disconnect is manageable. Flapping (repeated disconnect/reconnect cycles) is dangerous:

repeated CoD waves,
repeated queue resets,
strategy repeatedly re-prices and re-sends,
throttle/reject risk rises during recovery bursts.

So the tail is not linear in outage duration; it is convex in flap count × rebuild pressure.

3) Cost decomposition

Model total execution shortfall as:

[ C_{total} = C_{base} + C_{CoD} + C_{rebuild} + C_{deadline} ]

Where:

(C_{base}): normal microstructure cost (spread + impact + timing)
(C_{CoD}): queue-option loss from forced cancels
(C_{rebuild}): adverse re-entry cost + reject/retry overhead
(C_{deadline}): catch-up convexity if schedule deficit must be repaid fast

A practical branch expectation form:

[ \mathbb{E}[C] = p_{stable}C_{stable} + p_{single}C_{single\_CoD} + p_{flap}C_{flap\_cascade} ]

with (p_{flap}) estimated from session-health features, not market features alone.
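As a worked illustration of the branch expectation above, here is a minimal sketch in Python. The function name, the dictionary keys, and every number are hypothetical; real branch probabilities and costs would come from your own episode data.

```python
def expected_cost_bps(probs, costs):
    """Probability-weighted cost across the three session branches."""
    branches = ("stable", "single_cod", "flap_cascade")
    total_p = sum(probs[b] for b in branches)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("branch probabilities must sum to 1")
    return sum(probs[b] * costs[b] for b in branches)


# Hypothetical inputs, for illustration only (costs in bps).
probs = {"stable": 0.97, "single_cod": 0.02, "flap_cascade": 0.01}
costs = {"stable": 4.0, "single_cod": 12.0, "flap_cascade": 40.0}

total = expected_cost_bps(probs, costs)
flap_share = probs["flap_cascade"] * costs["flap_cascade"] / total
print(round(total, 2), round(flap_share, 3))
```

With these made-up inputs the flap branch is 1% of episodes but contributes 0.4 bps of a total of about 4.52 bps, which is why a small error in p_{flap} moves the expectation more than its probability suggests.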

4) Feature set for production modeling

4.1 Session-health features (must-have)

disconnect_count_1m, disconnect_count_5m
session_uptime_seconds
heartbeat_rtt_p50/p95, heartbeat_miss_streak
reconnect_duration_ms
time_since_last_disconnect_ms
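
A minimal sketch of how the count and recency features above could be derived from a log of disconnect timestamps. It assumes timestamps are epoch milliseconds and that windows are half-open, (now - window, now]; both conventions and the function name are assumptions, not something the venue or gateway dictates.

```python
def session_health_features(disconnect_ts_ms, now_ms):
    """Derive windowed disconnect counts and recency from raw timestamps."""
    ts = sorted(disconnect_ts_ms)

    def count_within(window_ms):
        # Half-open window: (now - window, now]
        return sum(1 for t in ts if now_ms - window_ms < t <= now_ms)

    past = [t for t in ts if t <= now_ms]
    return {
        "disconnect_count_1m": count_within(60_000),
        "disconnect_count_5m": count_within(300_000),
        # None when no disconnect has been observed yet.
        "time_since_last_disconnect_ms": (now_ms - past[-1]) if past else None,
    }


# Hypothetical session with four disconnects.
features = session_health_features([0, 250_000, 290_000, 299_000], now_ms=299_500)
print(features)
```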

4.2 CoD/rebuild features

cod_cancel_count
cod_cancel_notional
rebuild_order_count_30s
rebuild_notional_ratio = rebuild_notional / remaining_notional
post_reconnect_reject_rate

4.3 Queue-loss proxies

forced_cancel_queue_age_sum (if queue age tracked)
passive_fill_miss_delta_30s (expected minus realized)
touch_rejoin_latency_ms

4.4 Coupling features with market stress

spread z-score
microvolatility burst
top-of-book depth decay
short-horizon OFI skew

Important: CoD cost is worst when infra fragility and liquidity fragility co-occur.

5) Core metrics (dashboard + alarms)

5.1 DFI — Disconnect Flap Index

[ DFI = w_1 \cdot \text{disconnect\_count\_1m} + w_2 \cdot \text{heartbeat\_miss\_streak} + w_3 \cdot \text{reconnect\_duration\_z} ]

Measures session instability intensity.
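
The DFI formula can be sketched as follows. The default weights are placeholders rather than calibrated values, and computing reconnect_duration_z as a population z-score against a baseline history is an assumption about how that input is standardized.

```python
from statistics import mean, pstdev


def zscore(value, history):
    """Population z-score of value against a baseline history."""
    sigma = pstdev(history)
    return 0.0 if sigma == 0 else (value - mean(history)) / sigma


def dfi(disconnect_count_1m, heartbeat_miss_streak, reconnect_duration_z,
        weights=(1.0, 0.5, 0.25)):
    """Disconnect Flap Index as a weighted sum; weights are hypothetical."""
    w1, w2, w3 = weights
    return (w1 * disconnect_count_1m
            + w2 * heartbeat_miss_streak
            + w3 * reconnect_duration_z)


# Hypothetical baseline of past reconnect durations (ms) and a slow reconnect.
z = zscore(300, [100, 200])
print(dfi(disconnect_count_1m=2, heartbeat_miss_streak=3, reconnect_duration_z=z))
```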

5.2 QLT — Queue Loss Tax

[ QLT = \frac{\text{post-CoD realized cost} - \text{counterfactual no-CoD cost}}{\text{executed notional}} ]

Primary KPI for this failure mode.

5.3 RBS — Rebuild Burst Stress

[ RBS = \frac{\text{rebuild notional in }30s}{\text{remaining schedule notional}} ]

Captures urgency injection from forced restarts.

5.4 FRR — Flap Recovery Reject rate

[ FRR = \frac{\text{rejects during }T_{recovery}}{\text{orders sent during }T_{recovery}} ]

Detects control-plane saturation during re-entry. Here (T_{recovery}) is the recovery window that follows a reconnect.
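
The three ratio metrics (QLT, RBS, FRR) reduce to simple divisions once the inputs exist. A minimal sketch, assuming costs and notionals are in the same currency and that an empty denominator means the metric is undefined (returned as None); the function names are illustrative.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or None when the metric is undefined."""
    return None if denominator == 0 else numerator / denominator


def qlt(post_cod_cost, counterfactual_cost, executed_notional):
    """Queue Loss Tax as a fraction of executed notional (x 1e4 for bps)."""
    return _ratio(post_cod_cost - counterfactual_cost, executed_notional)


def rbs(rebuild_notional_30s, remaining_notional):
    """Rebuild Burst Stress: share of remaining schedule re-sent within 30s."""
    return _ratio(rebuild_notional_30s, remaining_notional)


def frr(rejects, orders_sent):
    """Flap Recovery Reject rate over the recovery window."""
    return _ratio(rejects, orders_sent)


# Hypothetical episode values.
print(qlt(1_500.0, 900.0, 1_000_000.0), rbs(400_000.0, 1_000_000.0), frr(3, 60))
```

The hard part of QLT is the counterfactual cost, which has to come from a replay; the arithmetic here only makes the units explicit.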

6) State machine for execution controls

`STABLE`

Normal policy.
Passive/midpoint behavior governed by standard toxicity model.

`FLAP_WATCH` (DFI above watch threshold)

Reduce replace/cancel churn.
Widen minimum dwell before repricing.
Slightly lower venue fan-out to reduce control load.

`COD_RECOVERY` (confirmed CoD event)

Freeze non-essential strategy mutations for a short cool-down window.
Rebuild in a paced ladder (chunked resubmission), not a burst dump.
Temporarily cap aggression step size.
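
A minimal sketch of the paced ladder idea: the residual is re-sent in fixed-size chunks at fixed spacing after a cool-down, rather than in one burst. The chunk size, spacing, and cool-down values are hypothetical, and a real scheduler would also react to fills and rejects between chunks.

```python
def paced_rebuild_schedule(residual_qty, chunk_qty, spacing_ms, cooldown_ms=0):
    """Return (send_offset_ms, qty) pairs that re-send residual_qty in chunks."""
    if chunk_qty <= 0 or spacing_ms < 0 or cooldown_ms < 0:
        raise ValueError("chunk_qty must be positive; times must be non-negative")
    schedule = []
    remaining = residual_qty
    offset_ms = cooldown_ms  # first chunk waits out the cool-down window
    while remaining > 0:
        qty = min(chunk_qty, remaining)
        schedule.append((offset_ms, qty))
        remaining -= qty
        offset_ms += spacing_ms
    return schedule


# Hypothetical: 1,000 shares left, 400-share chunks, 250 ms apart, 500 ms cool-down.
print(paced_rebuild_schedule(1_000, 400, 250, cooldown_ms=500))
```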

`SAFE_STABILIZE` (repeated flaps or FRR spike)

Enter conservative completion mode:
- smaller slices,
- stricter reject budget,
- venue subset with healthiest session path.
Optionally reduce the participation cap until session health recovers.

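Stepping back down to a calmer state requires hysteresis: health metrics must clear their thresholds and stay clear for a minimum time, so the state does not oscillate from tick to tick.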
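
The four states and the hysteresis rule can be sketched as a small state machine. Everything numeric here is a hypothetical placeholder, and two simplifications are assumed: escalation is immediate, and de-escalation goes straight back to STABLE once health has stayed below a lower "clear" threshold for a minimum time.

```python
from dataclasses import dataclass

STATES = ("STABLE", "FLAP_WATCH", "COD_RECOVERY", "SAFE_STABILIZE")


@dataclass
class Thresholds:
    # All values are hypothetical placeholders.
    dfi_watch: float = 2.0      # enter FLAP_WATCH at or above this DFI
    dfi_clear: float = 1.0      # must fall below this to count as healthy
    frr_spike: float = 0.10
    flap_count_safe: int = 3
    min_healthy_ms: int = 30_000


class SessionStateMachine:
    def __init__(self, thresholds=None):
        self.th = thresholds or Thresholds()
        self.state = "STABLE"
        self.healthy_since_ms = None

    def step(self, now_ms, dfi, cod_event, disconnect_count_5m, frr):
        th = self.th
        unstable = disconnect_count_5m >= th.flap_count_safe or frr >= th.frr_spike
        if unstable:
            target = "SAFE_STABILIZE"
        elif cod_event:
            target = "COD_RECOVERY"
        elif dfi >= th.dfi_watch:
            target = "FLAP_WATCH"
        else:
            target = "STABLE"

        # Escalate immediately when the target state is more severe.
        if STATES.index(target) > STATES.index(self.state):
            self.state = target
            self.healthy_since_ms = None
            return self.state

        # De-escalate only after sustained health (time + threshold hysteresis).
        healthy = dfi < th.dfi_clear and not cod_event and not unstable
        if not healthy:
            self.healthy_since_ms = None
        else:
            if self.healthy_since_ms is None:
                self.healthy_since_ms = now_ms
            if now_ms - self.healthy_since_ms >= th.min_healthy_ms:
                self.state = "STABLE"
        return self.state
```

The gap between dfi_watch and dfi_clear is what prevents flip-flopping: a DFI of 1.5 neither escalates a stable session nor lets a watched session stand down.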

7) Backtest/replay methodology

Episode labeling
- Partition historical data into stable, single_CoD, flap_cascade episodes.
Counterfactual reconstruction
- Replay with observed market path but synthetic no-CoD order continuity.
Tail-centric evaluation
- Track q50/q90/q95 shortfall delta by episode class.
Completion guardrail
- Report cost improvement jointly with completion reliability and deadline misses.
Stress slicing
- Evaluate separately for open/close windows where queue reset is most expensive.
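
A minimal sketch of the labeling and tail-evaluation steps above. Labeling an episode by its count of CoD events (0, 1, 2 or more) is an assumed rule, the nearest-rank quantile is one of several reasonable definitions, and the sample data is made up. The counterfactual replay itself is out of scope here.

```python
import math
from collections import defaultdict


def label_episode(cod_event_count):
    """Assumed rule: 0 CoD events = stable, 1 = single_CoD, 2+ = flap_cascade."""
    if cod_event_count == 0:
        return "stable"
    if cod_event_count == 1:
        return "single_CoD"
    return "flap_cascade"


def quantile(values, q):
    """Nearest-rank quantile for 0 < q <= 1."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def tail_by_class(episodes):
    """episodes: iterable of (cod_event_count, shortfall_bps) pairs."""
    buckets = defaultdict(list)
    for cod_event_count, shortfall_bps in episodes:
        buckets[label_episode(cod_event_count)].append(shortfall_bps)
    return {
        label: {name: quantile(vals, q)
                for name, q in (("q50", 0.50), ("q90", 0.90), ("q95", 0.95))}
        for label, vals in buckets.items()
    }


# Hypothetical episodes: (CoD events, shortfall in bps).
sample = [(0, 3.0), (0, 5.0), (1, 10.0), (2, 30.0), (3, 50.0)]
report = tail_by_class(sample)
print(report)
```

With small samples the upper quantiles collapse to the maximum, so flap_cascade buckets usually need pooling across days before q95 is meaningful.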

8) Practical rollout plan (30 days)

Week 1: Instrumentation

Ensure CoD/cancel reason codes are normalized across venues.
Add session-health telemetry to execution event log.
Build episode tagger.

Week 2: Shadow model

Compute DFI/QLT/RBS/FRR without changing live policy.
Validate that CoD episodes explain tail residuals currently blamed on “vol.”

Week 3: Guarded activation

Enable FLAP_WATCH + COD_RECOVERY controls for 5–10% of traffic.
Hard rollback trigger: completion drop > preset threshold.

Week 4: Scale + stabilize

Expand traffic if q95 QLT improves and completion stays inside SLA.
Tune hysteresis to avoid state flapping.

9) Anti-patterns

Treating disconnects as pure infra incidents unrelated to trading cost.
Re-submitting full residual immediately after reconnect.
Using identical post-reconnect behavior regardless of reject pressure.
Evaluating success only on mean bps (ignoring p95 and completion reliability).

10) Bottom line

Cancel-on-Disconnect is a slippage regime, not an operational footnote.

If you do not model session health and CoD-induced queue resets explicitly, your execution stack will keep paying a hidden tail tax during connectivity turbulence. The winning setup is:

session-health-aware slippage features,
CoD episode decomposition,
paced recovery controls with hysteresis,
q95 + completion joint governance.

That turns reconnect chaos from “market bad luck” into an engineerable control problem.
