Gateway Failover Order-State Divergence Slippage Playbook

2026-03-30 · finance

Scope: How order-entry gateway failover, sequence recovery, and temporary live-order ambiguity turn infra incidents into measurable execution slippage.

1) Why this matters

Most teams classify failover as "platform reliability" and move on. Execution desks pay for it later.

When the primary order-entry path fails, the backup path often comes online before order state (acks, cancels, fills, rejects, drop-copy) has fully converged. During that uncertainty window, routers either:

  1. under-trade (schedule deficit), or
  2. over-trade (duplicate intent / panic catch-up),

and both inflate implementation shortfall tails.

2) Failure anatomy (where cost appears)

Common chain:

  1. Primary path degrades or disconnects.
  2. Backup session establishes, but sequence/replay is still in progress.
  3. Local view of "working orders" diverges from venue truth.
  4. Strategy decisions are made on uncertain state.
  5. Re-entry/cancel behavior becomes too conservative or too aggressive.
  6. Queue position is lost and catch-up urgency rises near horizon end.

Key asymmetry: transport recovery can be faster than state recovery.

3) Slippage decomposition with failover component

Write total implementation shortfall (IS) as a baseline term, a failover term, and a residual:

IS_total = IS_base + IS_FO + ε

Decompose the failover term:

IS_FO = C_divergence + C_queue_reset + C_throttle + C_catchup

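Where:

  • IS_base: shortfall the order would have incurred without the failover episode.
  • C_divergence: cost of decisions made while the local working-order view disagrees with venue truth.
  • C_queue_reset: cost of queue position lost when orders are cancelled and re-entered.
  • C_throttle: cost of reduced participation while order flow is throttled during the uncertainty window.
  • C_catchup: cost of the urgency paid to clear the resulting backlog before horizon end.
  • ε: residual not attributed to the baseline or failover terms.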

4) Core metrics (production KPIs)

1) Failover Detection Skew (FDS)

Time between first hard path-failure signal and controlled failover-state entry.

2) Live-Set Similarity (LSS)

Similarity between local working-order set and reconstructed venue truth over time.

LSS_t = |Local_t ∩ Venue_t| / |Local_t ∪ Venue_t|
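
The LSS formula is a Jaccard similarity over order identifiers. A minimal sketch, assuming working orders are keyed by a string ID and that two empty sets count as fully converged (both are assumptions, not something the playbook specifies):

```python
def live_set_similarity(local_ids: set, venue_ids: set) -> float:
    """Jaccard similarity between the local and venue working-order ID sets."""
    union = local_ids | venue_ids
    if not union:
        # Assumption: nothing working on either side means no divergence.
        return 1.0
    return len(local_ids & venue_ids) / len(union)


# Hypothetical snapshot: the router believes A1 is live, the venue reports A4 instead.
local = {"A1", "A2", "A3"}
venue = {"A2", "A3", "A4"}
lss = live_set_similarity(local, venue)  # 2 shared IDs / 4 distinct IDs = 0.5
```

Comparing IDs alone ignores quantity and price mismatches on orders that exist on both sides; a stricter variant could compare (ID, leaves quantity, price) tuples instead.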

3) Replay Watermark Horizon (RWH)

Outstanding sequence/replay distance that must still be processed before local order state can be trusted.
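
One simple way to express RWH is as a message count. The sketch below assumes the venue advertises its last sequence number at logon or resend time and that the gateway tracks the last sequence it has applied to local state; both names are hypothetical:

```python
def replay_watermark_horizon(venue_last_seq: int, applied_seq: int) -> int:
    """Messages still to be replayed before local state reflects venue_last_seq."""
    # Clamp at zero: once replay has caught up, the horizon cannot go negative.
    return max(venue_last_seq - applied_seq, 0)


# Hypothetical values: venue is at 1500, local state has applied up to 1420.
remaining = replay_watermark_horizon(venue_last_seq=1500, applied_seq=1420)  # 80
```

A time-based variant (remaining messages divided by observed replay rate) may be more useful for router gating, but it needs a replay-rate estimate that is not defined here.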

4) State Decision Lag (SDL)

Latency from confidence threshold crossing to router policy shift (back to normal participation profile).

5) Backlog Catch-up Uplift (BCU)

Extra bps paid per unit backlog cleared in post-failover recovery window.

6) Failover Episode Tail Lift (FETL)

p95(IS | failover episodes) - p95(IS | matched controls)
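
A minimal sketch of the FETL calculation, assuming per-parent IS values in bps and a nearest-rank percentile (the percentile method is an assumption; any consistent method works as long as both samples use the same one):

```python
def p95(values):
    """Nearest-rank 95th percentile of a non-empty sequence."""
    if not values:
        raise ValueError("p95 requires at least one value")
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def failover_episode_tail_lift(failover_is_bps, control_is_bps):
    """p95(IS | failover episodes) minus p95(IS | matched controls), in bps."""
    return p95(failover_is_bps) - p95(control_is_bps)


# Hypothetical samples: 20 failover parents and 20 matched controls.
failover = [float(x) for x in range(1, 21)]   # 1.0 .. 20.0 bps
controls = [x / 2 for x in range(1, 21)]      # 0.5 .. 10.0 bps
fetl = failover_episode_tail_lift(failover, controls)  # 19.0 - 9.5 = 9.5
```

With few episodes the p95 estimate is noisy, so it is worth reporting the sample size alongside the number.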

5) State machine (explicit, not implicit)

Transition on objective signals only: disconnect/failure events, sequence-gap closure, drop-copy parity, LSS threshold, and confidence SLA.
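
A minimal sketch of such a state machine follows. The three mode names, the threshold values, and the reading of "confidence SLA" as a minimum hold time at full parity are all assumptions made for illustration:

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    INCIDENT = "incident"    # path failed or order state untrusted
    RECOVERY = "recovery"    # state converged, confidence still accruing


@dataclass(frozen=True)
class Signals:
    path_failed: bool                # disconnect/failure event currently active
    sequence_gap_closed: bool        # replay has caught up
    drop_copy_parity: bool           # drop-copy agrees with local state
    lss: float                       # Live-Set Similarity in [0, 1]
    confidence_held_seconds: float   # time spent continuously converged


LSS_THRESHOLD = 0.99            # hypothetical value
CONFIDENCE_HOLD_SECONDS = 5.0   # hypothetical value


def next_mode(mode: Mode, s: Signals) -> Mode:
    """Return the next mode from the current mode and objective signals only."""
    if s.path_failed:
        return Mode.INCIDENT
    converged = (
        s.sequence_gap_closed and s.drop_copy_parity and s.lss >= LSS_THRESHOLD
    )
    if mode is Mode.INCIDENT:
        return Mode.RECOVERY if converged else Mode.INCIDENT
    if mode is Mode.RECOVERY:
        if not converged:
            return Mode.INCIDENT  # divergence reappeared; fall back
        held = s.confidence_held_seconds >= CONFIDENCE_HOLD_SECONDS
        return Mode.NORMAL if held else Mode.RECOVERY
    return Mode.NORMAL
```

Keeping the transition function pure (no clocks or I/O inside it) makes each transition replayable from logged signals when reviewing an episode.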

6) Data requirements (minimum schema)

Per child order:

Per episode:

Without this lineage, failover slippage gets mislabeled as generic volatility.
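
As an illustration only, lineage records of this kind might look like the sketch below. Every field name here is hypothetical and chosen for the example; none of them comes from the schema this section refers to:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChildOrderLineage:
    # All fields are illustrative assumptions.
    child_order_id: str
    parent_order_id: str
    gateway_session_id: str       # session that carried the order
    sent_ts: float                # epoch seconds
    ack_ts: Optional[float]       # None if no ack was observed
    state_source: str             # e.g. "live", "replay", "drop_copy"
    episode_id: Optional[str]     # set when the order overlaps a failover episode


@dataclass(frozen=True)
class FailoverEpisode:
    # All fields are illustrative assumptions.
    episode_id: str
    failure_detected_ts: float
    backup_session_up_ts: float
    state_converged_ts: float

    @property
    def transport_downtime_s(self) -> float:
        """Seconds until the backup session was established."""
        return self.backup_session_up_ts - self.failure_detected_ts

    @property
    def divergence_s(self) -> float:
        """Seconds until order state was considered converged."""
        return self.state_converged_ts - self.failure_detected_ts


episode = FailoverEpisode("ep-1", 100.0, 102.0, 130.0)
# Transport was back in 2 s, but state stayed uncertain for 30 s.
```

Recording the two durations separately is what lets link downtime be distinguished from order-state divergence after the fact.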

7) Identification and calibration

  1. Episode labeling
    • Label failover windows from gateway events + sequence recovery markers.
  2. Matched controls
    • Match by symbol bucket, spread/vol regime, parent urgency, and time-of-day.
  3. Event study
    • Measure pre/post drift in slippage and completion path around failover boundaries.
  4. Mediation analysis
    • Quantify how much tail lift is explained by backlog repayment vs raw market move.
  5. Policy heterogeneity
    • Slice by venue/session type, recovery speed, and router confidence threshold.
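
The matched-controls step can be sketched as exact matching on bucketed keys. The record layout and key names below are assumptions for illustration; a production version would likely add caliper or nearest-neighbor matching on continuous covariates:

```python
from collections import defaultdict

# Hypothetical bucketed covariates mirroring the matching dimensions above.
MATCH_KEYS = ("symbol_bucket", "spread_vol_regime", "urgency", "tod_bucket")


def match_controls(episodes, candidates):
    """Map each failover parent ID to non-failover parents sharing all match keys."""
    pool = defaultdict(list)
    for c in candidates:
        pool[tuple(c[k] for k in MATCH_KEYS)].append(c["parent_id"])
    return {
        e["parent_id"]: pool.get(tuple(e[k] for k in MATCH_KEYS), [])
        for e in episodes
    }


episodes = [
    {"parent_id": "F1", "symbol_bucket": "large", "spread_vol_regime": "calm",
     "urgency": "high", "tod_bucket": "open"},
]
candidates = [
    {"parent_id": "C1", "symbol_bucket": "large", "spread_vol_regime": "calm",
     "urgency": "high", "tod_bucket": "open"},
    {"parent_id": "C2", "symbol_bucket": "small", "spread_vol_regime": "calm",
     "urgency": "high", "tod_bucket": "open"},
]
matches = match_controls(episodes, candidates)  # {"F1": ["C1"]}
```

Episodes that come back with an empty list have no valid control and should be reported rather than silently dropped, since they can bias the tail comparison.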

8) Control policy design

A) Pre-incident hardening

B) Incident mode

C) Recovery mode

9) 2-week rollout plan

Week 1

  1. Add failover episode labels and sequence/replay telemetry.
  2. Implement LSS/RWH/FETL dashboard.
  3. Wire state machine with safe router profile.

Week 2

  1. Add matched-control analytics.
  2. Canary confidence-gated recovery on limited symbols/notional.
  3. Promotion gates: lower FETL, no completion-SLA regression, no unresolved state divergence beyond SLA.
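
The promotion gates can be checked mechanically. A minimal sketch, assuming the canary and baseline are each summarized by three numbers (the field names and the completion tolerance parameter are hypothetical):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyStats:
    fetl_bps: float                # Failover Episode Tail Lift
    completion_rate: float         # fraction of parents meeting completion SLA
    divergence_sla_breaches: int   # episodes with unresolved divergence beyond SLA


def promotion_failures(baseline, canary, completion_tolerance=0.0):
    """Return the list of failed gates; an empty list means promote."""
    failures = []
    if canary.fetl_bps >= baseline.fetl_bps:
        failures.append("FETL not lower than baseline")
    if canary.completion_rate < baseline.completion_rate - completion_tolerance:
        failures.append("completion-SLA regression")
    if canary.divergence_sla_breaches > 0:
        failures.append("unresolved state divergence beyond SLA")
    return failures


baseline = PolicyStats(fetl_bps=9.5, completion_rate=0.97, divergence_sla_breaches=2)
canary = PolicyStats(fetl_bps=6.0, completion_rate=0.97, divergence_sla_breaches=0)
result = promotion_failures(baseline, canary)  # [] -> all gates pass
```

Returning the failed gates by name, rather than a single boolean, keeps the promotion decision auditable.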

10) Common mistakes

  1. Treating "socket reconnected" as equivalent to "state recovered".
  2. Ignoring duplicate-intent risk during ambiguous windows.
  3. Running panic catch-up without backlog convexity limits.
  4. Averaging away tails (mean IS only, no p95/p99 failover view).
  5. Failing to version failover policies and compare before/after.

11) What good looks like

For each failover incident, you can answer:

  1. How long order-state divergence persisted (not just link downtime).
  2. How much slippage came from divergence vs queue reset vs catch-up.
  3. Whether confidence-gated recovery reduced tails versus matched controls.
  4. Which policy threshold should be tightened or relaxed next.

If those answers are unavailable, failover is still a hidden slippage tax.
