Gateway Failover Order-State Divergence Slippage Playbook
Scope: How order-entry gateway failover, sequence recovery, and temporary live-order ambiguity turn infra incidents into measurable execution slippage.
1) Why this matters
Most teams classify failover as "platform reliability" and move on. Execution desks pay for it later.
When the primary order-entry path fails, the backup path often comes online before order truth is fully converged (acks, cancels, fills, rejects, drop-copy). During that uncertainty window, routers either:
- under-trade (schedule deficit), or
- over-trade (duplicate intent / panic catch-up),
and both inflate implementation shortfall tails.
2) Failure anatomy (where cost appears)
Common chain:
- Primary path degrades or disconnects.
- Backup session establishes, but sequence/replay is still in progress.
- Local view of "working orders" diverges from venue truth.
- Strategy decisions are made on uncertain state.
- Re-entry/cancel behavior becomes too conservative or too aggressive.
- Queue position is lost and catch-up urgency rises near horizon end.
Key asymmetry: transport recovery can be faster than state recovery.
3) Slippage decomposition with failover component
Let:
- IS_total: total implementation shortfall (bps)
- IS_base: baseline spread/impact/timing cost
- IS_FO: failover-driven incremental cost
IS_total = IS_base + IS_FO + ε
Decompose failover term:
IS_FO = C_divergence + C_queue_reset + C_throttle + C_catchup
Where:
- C_divergence: wrong decisions under live-set ambiguity
- C_queue_reset: queue-priority loss after defensive cancels/reposts
- C_throttle: temporary under-participation during uncertainty gating
- C_catchup: convex urgency premium when deficit is repaid late
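The decomposition above can be sketched as a small bookkeeping structure. This is a minimal illustration, not a prescribed attribution engine: the class name `FailoverCost`, the helper `residual_bps`, and the sample numbers are all hypothetical, and every component is assumed to be expressed in bps already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverCost:
    """Failover-driven cost components, all in basis points."""
    c_divergence: float
    c_queue_reset: float
    c_throttle: float
    c_catchup: float

    @property
    def is_fo(self) -> float:
        # IS_FO = C_divergence + C_queue_reset + C_throttle + C_catchup
        return (self.c_divergence + self.c_queue_reset
                + self.c_throttle + self.c_catchup)


def residual_bps(is_total: float, is_base: float, cost: FailoverCost) -> float:
    """Unexplained remainder: epsilon = IS_total - IS_base - IS_FO."""
    return is_total - is_base - cost.is_fo


# Hypothetical numbers for one parent order (bps).
example = FailoverCost(c_divergence=1.5, c_queue_reset=0.5,
                       c_throttle=1.0, c_catchup=2.0)
print(example.is_fo)                     # 5.0
print(residual_bps(12.0, 6.5, example))  # 0.5
```

A persistently large residual would suggest that either the baseline model or one of the four components is mis-specified.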
4) Core metrics (production KPIs)
1) Failover Detection Skew (FDS)
Time between first hard path-failure signal and controlled failover-state entry.
2) Live-Set Similarity (LSS)
Similarity between local working-order set and reconstructed venue truth over time.
LSS_t = |Local_t ∩ Venue_t| / |Local_t ∪ Venue_t|
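The LSS formula is a Jaccard index over working-order sets. A minimal sketch, assuming orders are compared by ID only (the function name and the empty-set convention are assumptions, not part of the definition above):

```python
from __future__ import annotations


def live_set_similarity(local_ids: set[str], venue_ids: set[str]) -> float:
    """Jaccard similarity between local and venue working-order ID sets."""
    union = local_ids | venue_ids
    if not union:
        # Assumption: two empty sets agree perfectly (nothing is working).
        return 1.0
    return len(local_ids & venue_ids) / len(union)


# Local view still holds A1 (already gone at the venue) and misses A4.
local = {"A1", "A2", "A3"}
venue = {"A2", "A3", "A4"}
print(live_set_similarity(local, venue))  # 0.5
```

Comparing IDs alone ignores quantity or price mismatches on orders present in both sets; a stricter variant could compare (ID, remaining quantity, price) tuples instead.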
3) Replay Watermark Horizon (RWH)
Outstanding sequence/replay distance remaining before local order state can be trusted.
4) State Decision Lag (SDL)
Latency from confidence threshold crossing to router policy shift (back to normal participation profile).
5) Backlog Catch-up Uplift (BCU)
Extra bps paid per unit backlog cleared in post-failover recovery window.
6) Failover Episode Tail Lift (FETL)
p95(IS | failover episodes) - p95(IS | matched controls)
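FETL can be computed directly once failover episodes and matched controls each have an IS value. The sketch below assumes a nearest-rank percentile (other percentile conventions would give slightly different values), and the function names and sample data are hypothetical.

```python
import math
from typing import List


def percentile(values: List[float], q: float) -> float:
    """Nearest-rank percentile of a non-empty list, with q in (0, 100]."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100.0))
    return ordered[rank - 1]


def fetl(failover_is_bps: List[float], control_is_bps: List[float],
         q: float = 95.0) -> float:
    """Tail lift: p_q(IS | failover episodes) - p_q(IS | matched controls)."""
    return percentile(failover_is_bps, q) - percentile(control_is_bps, q)


# Hypothetical IS samples in bps: 20 failover parents, 20 matched controls.
failover = [float(x) for x in range(1, 21)]   # 1.0 .. 20.0
controls = [x / 2 for x in range(1, 21)]      # 0.5 .. 10.0
print(fetl(failover, controls))               # 9.5
```

With few failover episodes the p95 estimate is noisy, so it may be worth reporting the sample sizes next to the metric.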
5) State machine (explicit, not implicit)
- HEALTHY
- FAILOVER_DETECTED
- STATE_DIVERGED
- REPLAY_SYNCING
- CONTROLLED_REENTRY
- STABLE
Transition on objective signals only: disconnect/failure events, sequence-gap closure, drop-copy parity, LSS threshold, and confidence SLA.
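One way to make the state machine explicit is an enum plus a pure transition function driven only by the signals listed above. The transition rules, the `Signals` fields, and the `LSS_REENTRY` threshold below are assumptions chosen for illustration; the playbook names the states and signal types but does not fix the exact edges.

```python
from dataclasses import dataclass
from enum import Enum, auto


class GatewayState(Enum):
    HEALTHY = auto()
    FAILOVER_DETECTED = auto()
    STATE_DIVERGED = auto()
    REPLAY_SYNCING = auto()
    CONTROLLED_REENTRY = auto()
    STABLE = auto()


@dataclass(frozen=True)
class Signals:
    path_failed: bool         # hard disconnect/failure event on the active path
    backup_session_up: bool   # backup session established
    sequence_gap: int         # messages still outstanding in replay
    drop_copy_parity: bool    # entry and drop-copy views agree
    lss: float                # live-set similarity in [0, 1]
    confidence_sla_met: bool  # confidence held long enough to resume normally


LSS_REENTRY = 0.98  # hypothetical threshold


def next_state(state: GatewayState, s: Signals) -> GatewayState:
    """Return the next state given the current state and objective signals."""
    if state in (GatewayState.HEALTHY, GatewayState.STABLE):
        return GatewayState.FAILOVER_DETECTED if s.path_failed else state
    if state is GatewayState.FAILOVER_DETECTED:
        # Assumption: local state is untrusted as soon as failover is detected.
        return GatewayState.STATE_DIVERGED
    if state is GatewayState.STATE_DIVERGED:
        return GatewayState.REPLAY_SYNCING if s.backup_session_up else state
    if state is GatewayState.REPLAY_SYNCING:
        ready = (s.sequence_gap == 0 and s.drop_copy_parity
                 and s.lss >= LSS_REENTRY)
        return GatewayState.CONTROLLED_REENTRY if ready else state
    # CONTROLLED_REENTRY: a new failure restarts the cycle.
    if s.path_failed:
        return GatewayState.FAILOVER_DETECTED
    return GatewayState.STABLE if s.confidence_sla_met else state


state = GatewayState.HEALTHY
timeline = [
    Signals(True, False, 120, False, 0.60, False),  # primary path drops
    Signals(True, False, 120, False, 0.60, False),  # mark state as diverged
    Signals(False, True, 80, False, 0.75, False),   # backup up, replay starts
    Signals(False, True, 0, True, 0.99, False),     # gap closed, parity reached
    Signals(False, True, 0, True, 1.00, True),      # confidence SLA met
]
visited = []
for sig in timeline:
    state = next_state(state, sig)
    visited.append(state.name)
print(visited)
```

Keeping `next_state` free of side effects makes it straightforward to replay recorded incident signals through it in a drill.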
6) Data requirements (minimum schema)
Per child order:
- parent/child IDs, symbol, side, qty, price
- decision/send timestamps
- ack/reject/cancel/fill timestamps (entry + drop-copy)
- session/gateway identifiers
- failover cycle ID
- sequence numbers and replay/gap-fill markers
- local-state confidence score at decision time
Per episode:
- failover start/end
- sequence-gap trajectory
- LSS time series
- backlog trajectory vs target schedule
- slippage attribution (C_divergence, C_queue_reset, C_throttle, C_catchup)
Without this lineage, failover slippage gets mislabeled as generic volatility.
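The per-child-order fields could be captured in a record like the one below. This is a sketch under stated assumptions: the field names, nanosecond timestamps, and the choice to key event timestamps by event type are hypothetical, and multiple partial fills per order are deliberately not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class ChildOrderRecord:
    parent_id: str
    child_id: str
    symbol: str
    side: str                                # "BUY" or "SELL"
    qty: int
    price: float
    decision_ts_ns: int
    send_ts_ns: int
    session_id: str
    gateway_id: str
    state_confidence: float                  # local-state confidence at decision time
    failover_cycle_id: Optional[str] = None  # None outside failover episodes
    seq_num: Optional[int] = None
    via_replay: bool = False                 # True if learned through resend/gap-fill
    # Event timestamps keyed by event type: "ack", "reject", "cancel", "fill".
    entry_ts_ns: Dict[str, int] = field(default_factory=dict)
    dropcopy_ts_ns: Dict[str, int] = field(default_factory=dict)


def unconfirmed_events(rec: ChildOrderRecord) -> Set[str]:
    """Event types seen on exactly one of the two feeds (entry vs drop-copy)."""
    return set(rec.entry_ts_ns) ^ set(rec.dropcopy_ts_ns)


rec = ChildOrderRecord(
    parent_id="P1", child_id="C1", symbol="XYZ", side="BUY", qty=100,
    price=10.25, decision_ts_ns=1_000, send_ts_ns=1_200,
    session_id="S-backup", gateway_id="GW2", state_confidence=0.7,
    failover_cycle_id="FO-001", seq_num=4312, via_replay=True,
    entry_ts_ns={"ack": 1_900},
    dropcopy_ts_ns={"ack": 2_100, "fill": 2_500},
)
print(unconfirmed_events(rec))  # {'fill'}
```

Here the fill is visible on drop-copy but not yet on the entry path, which is exactly the kind of disagreement the dual-feed timestamps are meant to expose.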
7) Identification and calibration
- Episode labeling: label failover windows from gateway events + sequence recovery markers.
- Matched controls: match by symbol bucket, spread/vol regime, parent urgency, and time-of-day.
- Event study: measure pre/post drift in slippage and completion path around failover boundaries.
- Mediation analysis: quantify how much tail lift is explained by backlog repayment vs raw market move.
- Policy heterogeneity: slice by venue/session type, recovery speed, and router confidence threshold.
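The matched-controls step can be sketched as exact matching on bucketed keys. This is the simplest possible scheme, used here only for illustration; the field names, the bucket labels, and the per-episode cap are hypothetical, and a production analysis might prefer nearest-neighbor or propensity-style matching.

```python
from collections import defaultdict
from typing import Dict, List

KEY_FIELDS = ("symbol_bucket", "regime", "urgency", "tod_bucket")


def _key(row: dict) -> tuple:
    """Matching key: symbol bucket, spread/vol regime, urgency, time-of-day."""
    return tuple(row[f] for f in KEY_FIELDS)


def match_controls(episodes: List[dict], controls: List[dict],
                   max_per_episode: int = 5) -> Dict[str, List[str]]:
    """Map each failover parent ID to control parent IDs with the same key."""
    pool = defaultdict(list)
    for c in controls:
        pool[_key(c)].append(c["id"])
    return {e["id"]: pool.get(_key(e), [])[:max_per_episode] for e in episodes}


episodes = [
    {"id": "F1", "symbol_bucket": "large", "regime": "wide",
     "urgency": "high", "tod_bucket": "open"},
    {"id": "F2", "symbol_bucket": "small", "regime": "tight",
     "urgency": "low", "tod_bucket": "mid"},
]
controls = [
    {"id": "N1", "symbol_bucket": "large", "regime": "wide",
     "urgency": "high", "tod_bucket": "open"},
    {"id": "N2", "symbol_bucket": "large", "regime": "wide",
     "urgency": "high", "tod_bucket": "open"},
    {"id": "N3", "symbol_bucket": "large", "regime": "tight",
     "urgency": "high", "tod_bucket": "open"},
]
matches = match_controls(episodes, controls)
print(matches)  # {'F1': ['N1', 'N2'], 'F2': []}
```

Episodes with no match (F2 here) should be reported rather than silently dropped, since unmatched episodes can bias the tail comparison.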
8) Control policy design
A) Pre-incident hardening
- Keep failover runbooks machine-testable (chaos drills + replay audits).
- Enforce dual-truth ingestion (order-entry + independent drop-copy).
- Track sequence-gap behavior and replay saturation limits.
B) Incident mode
- Enter STATE_DIVERGED quickly.
- Freeze strategy expansions that depend on precise live-order truth.
- Use confidence-weighted participation caps, not binary all-on/all-off toggles.
C) Recovery mode
- Increase participation as LSS and replay confidence improve.
- Cap catch-up acceleration with convexity-aware limits.
- Auto-safe-mode if FETL or BCU breaches hard guardrails.
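The incident-mode and recovery-mode rules can be combined into one rate function. The sketch below is an assumption-heavy stand-in: confidence is taken as the weaker of LSS and replay progress, catch-up is spread evenly over the remaining horizon and scaled by confidence, and the convexity-aware limit is approximated by a hard multiple of the base rate. All names and default parameters are hypothetical.

```python
def confidence(lss: float, replay_done_frac: float) -> float:
    """Conservative combined confidence: the weaker of the two signals."""
    return max(0.0, min(1.0, lss, replay_done_frac))


def capped_rate(base_rate: float, conf: float, deficit_qty: float,
                remaining_s: float, max_catchup_mult: float = 1.5,
                min_frac: float = 0.1) -> float:
    """Target trading rate (qty per second) during and after failover.

    base_rate        : normal schedule rate
    conf             : state confidence in [0, 1]
    deficit_qty      : backlog versus target schedule
    remaining_s      : seconds left in the parent's horizon
    max_catchup_mult : hard ceiling as a multiple of base_rate
    min_frac         : floor so the router never goes fully dark
    """
    gated = base_rate * max(min_frac, conf)
    if remaining_s <= 0:
        return gated
    # Repay the deficit evenly, and only in proportion to confidence.
    catchup = conf * deficit_qty / remaining_s
    return min(gated + catchup, max_catchup_mult * base_rate)


# Mid-recovery: half confidence, modest backlog.
print(capped_rate(100.0, confidence(0.5, 0.8), 600.0, 60.0))    # 55.0
# Fully recovered but large backlog: the ceiling binds.
print(capped_rate(100.0, confidence(1.0, 1.0), 6000.0, 60.0))   # 150.0
```

The ceiling is what prevents panic catch-up: any deficit that cannot be repaid under it is left as an explicit shortfall instead of being forced through at a convex cost.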
9) 2-week rollout plan
Week 1
- Add failover episode labels and sequence/replay telemetry.
- Implement LSS/RWH/FETL dashboard.
- Wire state machine with safe router profile.
Week 2
- Add matched-control analytics.
- Canary confidence-gated recovery on limited symbols/notional.
- Promotion gates: lower FETL, no completion-SLA regression, no unresolved state divergence beyond SLA.
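The promotion gates can be encoded as an explicit check so that canary decisions are reproducible. This is a minimal sketch; the `CanaryResult` fields, the way completion and divergence are measured, and the sample values are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class CanaryResult:
    fetl_bps: float          # failover episode tail lift
    completion_rate: float   # fraction of parents completed within SLA
    max_divergence_s: float  # longest unresolved state divergence, seconds


def passes_promotion(baseline: CanaryResult, canary: CanaryResult,
                     divergence_sla_s: float) -> Tuple[bool, Dict[str, bool]]:
    """Evaluate the three gates and return (overall, per-gate results)."""
    checks = {
        "fetl_lower": canary.fetl_bps < baseline.fetl_bps,
        "completion_not_regressed":
            canary.completion_rate >= baseline.completion_rate,
        "divergence_within_sla": canary.max_divergence_s <= divergence_sla_s,
    }
    return all(checks.values()), checks


baseline = CanaryResult(fetl_bps=9.5, completion_rate=0.97, max_divergence_s=4.0)
canary = CanaryResult(fetl_bps=6.0, completion_rate=0.97, max_divergence_s=2.5)
ok, detail = passes_promotion(baseline, canary, divergence_sla_s=3.0)
print(ok, detail)
```

Returning the per-gate results alongside the overall verdict makes it clear which gate blocked a promotion.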
10) Common mistakes
- Treating "socket reconnected" as equivalent to "state recovered".
- Ignoring duplicate-intent risk during ambiguous windows.
- Running panic catch-up without backlog convexity limits.
- Averaging away tails (mean IS only, no p95/p99 failover view).
- Failing to version failover policies and compare before/after.
11) What good looks like
For each failover incident, you can answer:
- How long order-state divergence persisted (not just link downtime).
- How much slippage came from divergence vs queue reset vs catch-up.
- Whether confidence-gated recovery reduced tails versus matched controls.
- Which policy threshold should be tightened or relaxed next.
If those answers are unavailable, failover is still a hidden slippage tax.
References
- CME Group Client Systems Wiki, Fault Tolerance (iLink session failover guidance; wait for in-flight resend completion before failover):
  https://cmegroupclientsite.atlassian.net/wiki/spaces/EPICSANDBOX/pages/457671413/Fault+Tolerance
- OnixS FIX 4.2 Dictionary, Resend Request (MsgType=2) (gap recovery mechanics):
  https://www.onixs.biz/fix-dictionary/4.2/msgtype_2_2.html
- OnixS FIX 4.2 Dictionary, Sequence Reset (MsgType=4) (GapFill vs Reset disaster mode):
  https://www.onixs.biz/fix-dictionary/4.2/msgtype_4_4.html
- CME Group, FAQ: Port Closure (repeated resend/disconnect behavior risk and operational controls):
  https://www.cmegroup.com/solutions/market-access/globex/trade-on-globex/about-the-global-command-center/faq-port-closure.html