Direct-Feed Failover Flap Slippage Playbook

2026-03-30 · finance

Direct-Feed Failover Flap Slippage Playbook

When Primary/Backup Feed Arbitration Becomes a Hidden Execution Tax

Why this note: Many execution stacks run dual market-data paths (primary + backup direct feeds) and switch on packet-loss/latency alarms. The failure is not one clean failover — it is flapping (rapid switch-back/switch-forth), which injects state discontinuities, quote-age spikes, and decision jitter that standard slippage models miss.


1) Failure Mode in One Sentence

If feed arbitration can flap faster than your execution controller can stabilize, you will overtrade stale state, reset queue edge, and pay convex catch-up slippage.


2) Cost Decomposition with Feed-Switch Risk

For child action (a_t):

[ \mathbb{E}[IS_t(a_t)] = C_{spread} + C_{impact} + C_{queue} + C_{stale} + C_{switch} + C_{opportunity} ]

Where new hidden terms are:

A practical branch form:

[ \mathbb{E}[IS_t] = (1-p_{sw}),\mu_{stable}(x_t,a_t) + p_{sw},\mu_{switch}(x_t,a_t) ]

with (p_{sw}=\Pr(\text{feed switch in next }\Delta\text{ ms})).


3) Model the Hazard, Not Just the Outcome

Let (z_t) include feed health state:

Hazard model example:

[ \text{logit}(p_{sw}) = \beta_0 + \beta_1,gapRate_t + \beta_2,lossBurst_t + \beta_3,latencySkew_t + \beta_4,switchCount_{1s} ]

Then use a mixture slippage estimator for routing decisions.


4) Regime States (Encode Explicitly)

Use hysteresis + minimum dwell times to avoid control thrash.


5) Features Worth Adding to the Slippage Stack

Feed integrity

Book continuity

Execution sensitivity

Missing switch_count and time_since_last_switch at decision time should fail promotion for this model family.


6) New Diagnostics / KPIs

  1. FSR (Feed Switch Rate): switches per second per symbol bucket
  2. FFR (Feed Flap Rate): switch-back within (<T) ms after failover
  3. SAS (Switch-Age Spike): quote-age increase around switch events
  4. SMD (Switch Markout Delta): markout(high-switch-hazard) - markout(low-switch-hazard)
  5. QLRS (Queue-Loss-at-Resync Share): % IS explained by queue deterioration during F2/F3

If average IS is flat but SMD and QLRS are rising, the desk is silently paying control-plane slippage.


7) Control Policy (Hazard-Aware)

STABLE_PRIMARY ((p_{sw}<\tau_1))

DEGRADED_PRIMARY ((\tau_1\le p_{sw}<\tau_2))

FAILOVER_TRANSITION ((\tau_2\le p_{sw}<\tau_3))

FLAP_STORM ((p_{sw}\ge\tau_3) or FFR breach)


8) Labeling & Training Contract

For each child decision, store:

  1. Regime label (F0..F4)
  2. Switch outcome in next 100/250/1000ms
  3. Realized IS + multi-horizon markouts
  4. Fill path (fill/cancel/replace/reject)
  5. Feed-health snapshot (gaps/loss/skew/age)

Train three heads:


9) Rollout Blueprint

  1. Shadow switch-hazard + mixture predictions for 10–14 days
  2. Compare by symbol liquidity bucket + time-of-day
  3. Canary only for F2/F3 logic first
  4. Promote when all hold:
    • better tail coverage in switch windows
    • lower QLRS and SMD
    • no completion collapse under flap episodes

Rollback triggers:


10) Common Anti-Patterns


11) Fast Implementation Checklist

[ ] Add feed-switch telemetry to decision-time feature store
[ ] Train short-horizon switch hazard model
[ ] Upgrade cost model to stable/switch mixture form
[ ] Add F0..F4 state machine with hysteresis + dwell timers
[ ] Gate deployment on switch-window tail coverage + QLRS/SMD
[ ] Canary with strict notional caps in F2/F3 only

References


TL;DR

Dual-feed redundancy is not automatically safer for slippage. Without flap-aware hazard modeling and state-dependent controls, feed arbitration itself becomes a repeatable hidden cost center.