Gateway Failover Order-State Divergence Slippage Playbook

Scope: How order-entry gateway failover, sequence recovery, and temporary live-order ambiguity turn infra incidents into measurable execution slippage.

1) Why this matters

Most teams classify failover as "platform reliability" and move on. Execution desks pay for it later.

When the primary order-entry path fails, the backup path often comes online before order truth is fully converged (acks, cancels, fills, rejects, drop-copy). During that uncertainty window, routers either:

under-trade (schedule deficit), or
over-trade (duplicate intent / panic catch-up),

and both inflate implementation shortfall tails.

2) Failure anatomy (where cost appears)

Common chain:

Primary path degrades or disconnects.
Backup session establishes, but sequence/replay is still in progress.
Local view of "working orders" diverges from venue truth.
Strategy decisions are made on uncertain state.
Re-entry/cancel behavior becomes too conservative or too aggressive.
Queue position is lost and catch-up urgency rises near horizon end.

Key asymmetry: transport recovery can be faster than state recovery.

3) Slippage decomposition with failover component

Let:

IS_total: total implementation shortfall (bps)
IS_base: baseline spread/impact/timing cost
IS_FO: failover-driven incremental cost (bps)
ε: residual not explained by either term

IS_total = IS_base + IS_FO + ε

Decompose failover term:

IS_FO = C_divergence + C_queue_reset + C_throttle + C_catchup

Where:

C_divergence: wrong decisions under live-set ambiguity
C_queue_reset: queue-priority loss after defensive cancels/reposts
C_throttle: temporary under-participation during uncertainty gating
C_catchup: convex urgency premium when deficit is repaid late
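
As a minimal bookkeeping sketch of the decomposition above, assuming each component has already been estimated in bps for one episode. The class name FailoverCost, the helper residual, and all numbers are hypothetical and only illustrate the arithmetic:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverCost:
    """Per-episode failover cost components, all in bps (illustrative)."""
    c_divergence: float
    c_queue_reset: float
    c_throttle: float
    c_catchup: float

    @property
    def is_fo(self) -> float:
        # IS_FO = C_divergence + C_queue_reset + C_throttle + C_catchup
        return (self.c_divergence + self.c_queue_reset
                + self.c_throttle + self.c_catchup)


def residual(is_total: float, is_base: float, cost: FailoverCost) -> float:
    """Residual implied by IS_total = IS_base + IS_FO + residual."""
    return is_total - is_base - cost.is_fo


# Made-up numbers for a single episode.
cost = FailoverCost(c_divergence=1.5, c_queue_reset=0.75,
                    c_throttle=0.5, c_catchup=2.25)
eps = residual(is_total=12.0, is_base=6.5, cost=cost)
print(cost.is_fo, eps)
```

A persistently large residual would suggest that one of the four components is being under-attributed.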

4) Core metrics (production KPIs)

1) Failover Detection Skew (FDS)

Elapsed time between the first hard path-failure signal and entry into a controlled failover state.

2) Live-Set Similarity (LSS)

Similarity between local working-order set and reconstructed venue truth over time.

LSS_t = |Local_t ∩ Venue_t| / |Local_t ∪ Venue_t|
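
The ratio above is the Jaccard similarity of two sets. A minimal sketch, assuming working orders are identified by string IDs shared between the local book and the reconstructed venue view; the function name and the empty-set convention are assumptions:

```python
def live_set_similarity(local_ids: set[str], venue_ids: set[str]) -> float:
    """Jaccard similarity between local and venue working-order ID sets."""
    union = local_ids | venue_ids
    if not union:
        # Assumption: two empty sets are treated as fully converged.
        return 1.0
    return len(local_ids & venue_ids) / len(union)


local = {"A1", "A2", "A3"}   # orders the router believes are working
venue = {"A2", "A3", "A4"}   # orders the venue reports as working
lss = live_set_similarity(local, venue)  # 2 shared IDs out of 4 distinct
print(lss)
```

Note that this treats an order as either present or absent. It does not penalize an order that exists on both sides with different remaining quantity or price, which a production version would probably need to handle.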

3) Replay Watermark Horizon (RWH)

Outstanding sequence/replay distance: how many messages still have to be replayed or gap-filled before local order state can be trusted.
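
One simple way to express this distance, assuming it is measured in messages and that both the highest sequence number known to exist and the highest one applied locally are available. The function name and parameters are hypothetical:

```python
def replay_watermark_horizon(venue_high_seq: int, applied_seq: int) -> int:
    """Messages still to be applied before local state reaches the venue watermark."""
    if applied_seq > venue_high_seq:
        raise ValueError("applied sequence is ahead of the venue watermark")
    return venue_high_seq - applied_seq


rwh = replay_watermark_horizon(venue_high_seq=1200, applied_seq=1150)
print(rwh)
```

A time-based variant (estimated seconds to drain at the current replay rate) may be more useful for gating decisions than a raw message count.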

4) State Decision Lag (SDL)

Latency from the moment state confidence crosses its threshold to the moment the router actually shifts policy back to its normal participation profile.

5) Backlog Catch-up Uplift (BCU)

Extra bps paid per unit of backlog cleared during the post-failover recovery window.

6) Failover Episode Tail Lift (FETL)

FETL = p95(IS | failover episodes) - p95(IS | matched controls)
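
A minimal sketch of the tail-lift calculation, assuming per-parent IS values in bps are available for both cohorts. The nearest-rank percentile used here is one of several common conventions and is an assumption, as are the function names and sample values:

```python
import math


def percentile_nearest_rank(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def fetl(failover_is_bps: list[float], control_is_bps: list[float],
         pct: float = 95.0) -> float:
    """Tail percentile of failover episodes minus the same percentile of controls."""
    return (percentile_nearest_rank(failover_is_bps, pct)
            - percentile_nearest_rank(control_is_bps, pct))


# Made-up IS samples in bps.
failover = [4.0, 6.0, 9.0, 30.0]
control = [3.0, 5.0, 7.0, 12.0]
print(fetl(failover, control))
```

With samples this small the p95 is simply the maximum, so real use needs enough episodes per bucket for the tail estimate to be stable.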

5) State machine (explicit, not implicit)

HEALTHY
FAILOVER_DETECTED
STATE_DIVERGED
REPLAY_SYNCING
CONTROLLED_REENTRY
STABLE

Transition on objective signals only: disconnect/failure events, sequence-gap closure, drop-copy parity, LSS threshold, and confidence SLA.
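
The states and signals above can be sketched as an explicit transition function. The state names come from the list above; the Signals fields, the LSS threshold value, and the specific transition ordering are assumptions made for illustration:

```python
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    HEALTHY = auto()
    FAILOVER_DETECTED = auto()
    STATE_DIVERGED = auto()
    REPLAY_SYNCING = auto()
    CONTROLLED_REENTRY = auto()
    STABLE = auto()


@dataclass(frozen=True)
class Signals:
    """Objective inputs sampled on each evaluation tick (names are illustrative)."""
    path_failed: bool = False         # disconnect/failure event seen this tick
    backup_session_up: bool = False   # backup session established
    gap_closed: bool = False          # sequence gap fully replayed
    drop_copy_parity: bool = False    # entry and drop-copy views agree
    lss: float = 0.0                  # Live-Set Similarity in [0, 1]
    confidence_sla_met: bool = False  # confidence held for the required window


LSS_THRESHOLD = 0.99  # hypothetical value, to be calibrated


def next_state(state: State, s: Signals) -> State:
    """Advance at most one step per tick; otherwise hold the current state."""
    if s.path_failed:
        # Assumption: any new failure event restarts the episode.
        return State.FAILOVER_DETECTED
    if state is State.FAILOVER_DETECTED:
        # Assumption: local state is treated as untrusted immediately.
        return State.STATE_DIVERGED
    if state is State.STATE_DIVERGED and s.backup_session_up:
        return State.REPLAY_SYNCING
    if (state is State.REPLAY_SYNCING and s.gap_closed
            and s.drop_copy_parity and s.lss >= LSS_THRESHOLD):
        return State.CONTROLLED_REENTRY
    if state is State.CONTROLLED_REENTRY and s.confidence_sla_met:
        return State.STABLE
    return state
```

Keeping the function pure (state and signals in, state out) makes it easy to replay recorded incidents through it and to version the policy.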

6) Data requirements (minimum schema)

Per child order:

parent/child IDs, symbol, side, qty, price
decision/send timestamps
ack/reject/cancel/fill timestamps (entry + drop-copy)
session/gateway identifiers
failover cycle ID
sequence numbers and replay/gap-fill markers
local-state confidence score at decision time

Per episode:

failover start/end
sequence-gap trajectory
LSS time series
backlog trajectory vs target schedule
slippage attribution (C_divergence, C_queue_reset, C_throttle, C_catchup)

Without this lineage, failover slippage gets mislabeled as generic volatility.
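
The per-child-order fields listed above could be captured in a record like the following. This is a sketch only: every field name, the nanosecond timestamp convention, and the ack_skew_ns helper are assumptions, and only the ack timestamps are shown for the entry/drop-copy pair:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChildOrderRecord:
    """One row per child order; field names are illustrative, not a standard."""
    parent_id: str
    child_id: str
    symbol: str
    side: str                                   # "BUY" or "SELL"
    qty: int
    price: float
    decision_ts_ns: int
    send_ts_ns: int
    session_id: str
    gateway_id: str
    failover_cycle_id: Optional[str] = None     # None outside failover episodes
    seq_num: Optional[int] = None
    is_replayed: bool = False                   # arrived via resend / gap fill
    entry_ack_ts_ns: Optional[int] = None       # ack seen on the order-entry path
    drop_copy_ack_ts_ns: Optional[int] = None   # same ack seen on drop-copy
    confidence_at_decision: Optional[float] = None

    def ack_skew_ns(self) -> Optional[int]:
        """Drop-copy ack time minus order-entry ack time, if both are known."""
        if self.entry_ack_ts_ns is None or self.drop_copy_ack_ts_ns is None:
            return None
        return self.drop_copy_ack_ts_ns - self.entry_ack_ts_ns
```

Storing both the entry-path and drop-copy timestamps per event is what makes divergence measurable after the fact rather than inferred.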

7) Identification and calibration

Episode labeling
- Label failover windows from gateway events + sequence recovery markers.
Matched controls
- Match by symbol bucket, spread/vol regime, parent urgency, and time-of-day.
Event study
- Measure pre/post drift in slippage and completion path around failover boundaries.
Mediation analysis
- Quantify how much of the tail lift is explained by backlog repayment versus raw market moves.
Policy heterogeneity
- Slice by venue/session type, recovery speed, and router confidence threshold.
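
For the matched-controls step, one simple approach is exact matching on pre-bucketed covariates. The sketch below assumes each parent order is a dict that already carries bucket labels; the key names and the choice of exact matching (rather than, say, nearest-neighbor or propensity matching) are assumptions:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical bucket columns mirroring the matching criteria above.
MATCH_KEYS: Tuple[str, ...] = ("symbol_bucket", "vol_regime", "urgency", "tod_bucket")


def match_controls(episodes: List[dict], candidates: List[dict]) -> Dict[str, List[str]]:
    """Map each failover parent ID to control parent IDs sharing every match key."""
    index = defaultdict(list)
    for c in candidates:
        index[tuple(c[k] for k in MATCH_KEYS)].append(c["parent_id"])
    return {
        e["parent_id"]: list(index.get(tuple(e[k] for k in MATCH_KEYS), []))
        for e in episodes
    }
```

Episodes that come back with an empty control list should be reported separately instead of silently dropped, since thin buckets can bias the tail comparison.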

8) Control policy design

A) Pre-incident hardening

Keep failover runbooks machine-testable (chaos drills + replay audits).
Enforce dual-truth ingestion (order-entry + independent drop-copy).
Track sequence-gap behavior and replay saturation limits.

B) Incident mode

Enter STATE_DIVERGED quickly.
Freeze strategy expansions that depend on precise live-order truth.
Use confidence-weighted participation caps, not binary all-on/all-off toggles.
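
A confidence-weighted cap can be as simple as scaling the normal participation rate by the current state-confidence score. The linear form, the floor parameter, and the function name below are assumptions; the actual mapping would need calibration:

```python
def participation_cap(base_rate: float, confidence: float, floor: float = 0.0) -> float:
    """Scale the participation cap linearly between floor and base_rate.

    base_rate:  normal participation cap (e.g. 0.10 for 10% of volume)
    confidence: state-confidence score, clamped to [0, 1]
    floor:      minimum cap kept even at zero confidence (assumed parameter)
    """
    if not 0.0 <= floor <= base_rate:
        raise ValueError("floor must lie between 0 and base_rate")
    c = min(1.0, max(0.0, confidence))
    return floor + (base_rate - floor) * c


print(participation_cap(0.10, 0.5))
```

Compared with a binary toggle, this keeps some participation during partial confidence, which limits the schedule deficit that later has to be repaid.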

C) Recovery mode

Increase participation as LSS and replay confidence improve.
Cap catch-up acceleration with convexity-aware limits.
Drop to safe mode automatically if FETL or BCU breaches a hard guardrail.
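
One way to cap catch-up acceleration, assuming cost is convex in trading rate so that repaying the deficit evenly over the remaining horizon is preferable to a burst. The uplift multiple, units, and function name are hypothetical:

```python
def catchup_rate(schedule_rate: float, deficit: float, time_left_s: float,
                 max_uplift: float = 1.5) -> float:
    """Target trading rate (shares/s) during recovery.

    schedule_rate: rate the original schedule calls for now
    deficit:       shares behind schedule
    time_left_s:   seconds remaining in the order horizon
    max_uplift:    hard cap as a multiple of schedule_rate (assumed guardrail)
    """
    if time_left_s <= 0:
        raise ValueError("time_left_s must be positive")
    # Spread the deficit evenly over the remaining horizon.
    needed = schedule_rate + deficit / time_left_s
    # Never exceed the uplift cap, even if that leaves some deficit unpaid.
    return min(needed, schedule_rate * max_uplift)


print(catchup_rate(100.0, 600.0, 60.0))   # small deficit: modest uplift
print(catchup_rate(100.0, 6000.0, 60.0))  # large deficit: capped
```

When the cap binds, the order will not complete on the original schedule. That trade-off should be surfaced to the desk, not hidden inside the router.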

9) 2-week rollout plan

Week 1

Add failover episode labels and sequence/replay telemetry.
Build an LSS/RWH/FETL dashboard.
Wire the state machine to a safe router profile.

Week 2

Add matched-control analytics.
Canary confidence-gated recovery on a limited set of symbols and capped notional.
Promotion gates: lower FETL, no completion-SLA regression, no unresolved state divergence beyond SLA.
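
The three promotion gates above can be checked mechanically. A sketch, assuming cohort-level statistics have already been computed for the baseline and canary policies; the CohortStats fields and the use of completion rate as the completion-SLA proxy are assumptions:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CohortStats:
    fetl_bps: float          # Failover Episode Tail Lift
    completion_rate: float   # fraction of parents meeting the completion SLA
    max_divergence_s: float  # longest unresolved state divergence, seconds


def promotion_failures(baseline: CohortStats, canary: CohortStats,
                       divergence_sla_s: float) -> list[str]:
    """Return the list of failed gates; an empty list means promote."""
    failures = []
    if not canary.fetl_bps < baseline.fetl_bps:
        failures.append("FETL not lower than baseline")
    if canary.completion_rate < baseline.completion_rate:
        failures.append("completion-SLA regression")
    if canary.max_divergence_s > divergence_sla_s:
        failures.append("state divergence beyond SLA")
    return failures
```

Returning the failed gates instead of a bare boolean makes the canary review easier to audit.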

10) Common mistakes

Treating "socket reconnected" as equivalent to "state recovered".
Ignoring duplicate-intent risk during ambiguous windows.
Running panic catch-up without backlog convexity limits.
Averaging away tails (mean IS only, no p95/p99 failover view).
Failing to version failover policies and compare before/after.

11) What good looks like

For each failover incident, you can answer:

How long order-state divergence persisted (not just link downtime).
How much slippage came from divergence vs queue reset vs catch-up.
Whether confidence-gated recovery reduced tails versus matched controls.
Which policy threshold should be tightened or relaxed next.

If those answers are unavailable, failover is still a hidden slippage tax.

References

CME Group Client Systems Wiki — Fault Tolerance (iLink session failover guidance; wait for in-flight resend completion before failover):
https://cmegroupclientsite.atlassian.net/wiki/spaces/EPICSANDBOX/pages/457671413/Fault+Tolerance
OnixS FIX 4.2 Dictionary — Resend Request (MsgType=2) (gap recovery mechanics):
https://www.onixs.biz/fix-dictionary/4.2/msgtype_2_2.html
OnixS FIX 4.2 Dictionary — Sequence Reset (MsgType=4) (GapFill vs Reset disaster mode):
https://www.onixs.biz/fix-dictionary/4.2/msgtype_4_4.html
CME Group — FAQ: Port Closure (repeated resend/disconnect behavior risk and operational controls):
https://www.cmegroup.com/solutions/market-access/globex/trade-on-globex/about-the-global-command-center/faq-port-closure.html