Direct-Feed Failover Flap Slippage Playbook

When Primary/Backup Feed Arbitration Becomes a Hidden Execution Tax

Why this note: Many execution stacks run dual market-data paths (primary and backup direct feeds) and switch between them on packet-loss or latency alarms. The costly failure is rarely one clean failover; it is flapping (rapid switching back and forth), which injects state discontinuities, quote-age spikes, and decision jitter that standard slippage models miss.

1) Failure Mode in One Sentence

If feed arbitration can flap faster than your execution controller can stabilize, you will overtrade stale state, reset queue edge, and pay convex catch-up slippage.

2) Cost Decomposition with Feed-Switch Risk

For child action (a_t):

[ \mathbb{E}[IS_t(a_t)] = C_{spread} + C_{impact} + C_{queue} + C_{stale} + C_{switch} + C_{opportunity} ]

Where new hidden terms are:

(C_{stale}): executable-truth drift from quote-age spikes during switch windows
(C_{switch}): policy error introduced by arbitration transitions (book discontinuity, reset, delayed confidence)

A practical branch form:

[ \mathbb{E}[IS_t] = (1-p_{sw})\,\mu_{stable}(x_t,a_t) + p_{sw}\,\mu_{switch}(x_t,a_t) ]

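with (p_{sw}=\Pr(\text{feed switch in next }\Delta\text{ ms})), where (x_t) is the decision-time state and (\mu_{stable}), (\mu_{switch}) are the expected costs conditional on no switch and on a switch, respectively.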
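
The branch form can be sketched in a few lines. This is a minimal illustration, not a production estimator: the function names and the branch costs in COSTS are hypothetical, chosen only to show how the preferred action changes as the switch probability rises.

```python
def expected_is_bps(p_sw: float, mu_stable_bps: float, mu_switch_bps: float) -> float:
    """Branch form: (1 - p_sw) * mu_stable + p_sw * mu_switch."""
    if not 0.0 <= p_sw <= 1.0:
        raise ValueError("p_sw must be a probability in [0, 1]")
    return (1.0 - p_sw) * mu_stable_bps + p_sw * mu_switch_bps


def cheapest_action(p_sw: float, branch_costs: dict) -> str:
    """Return the action with the lowest mixture cost.

    branch_costs maps action name -> (mu_stable_bps, mu_switch_bps).
    """
    return min(branch_costs, key=lambda a: expected_is_bps(p_sw, *branch_costs[a]))


# Hypothetical branch costs in bps, for illustration only.
COSTS = {"passive": (1.0, 6.0), "aggressive": (2.5, 3.0)}

for p in (0.10, 0.50):
    print(p, cheapest_action(p, COSTS))
```

With these made-up numbers the passive action is cheaper until (p_{sw}) reaches 1/3, after which its switch-branch cost dominates and the aggressive action wins.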

3) Model the Hazard, Not Just the Outcome

Let (z_t) be the feed-health state vector, including:

sequence-gap pressure
packet-loss burst score
feed-latency spread (primary vs backup)
recent switch count
book divergence signal

Hazard model example:

[ \text{logit}(p_{sw}) = \beta_0 + \beta_1\,\text{gapRate}_t + \beta_2\,\text{lossBurst}_t + \beta_3\,\text{latencySkew}_t + \beta_4\,\text{switchCount}_{1s} ]

Then plug the estimated (p_{sw}) into the mixture form from Section 2 to score routing decisions.
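
A minimal sketch of evaluating the logistic hazard above. The coefficients in BETA are hypothetical placeholders (real values would come from fitting on labeled switch outcomes), and the inputs are assumed to be standardized features.

```python
import math

# Hypothetical coefficients, for illustration only.
BETA = {
    "intercept": -4.0,
    "gap_rate": 0.8,
    "loss_burst": 1.2,
    "latency_skew": 0.5,
    "switch_count_1s": 0.9,
}


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def switch_hazard(gap_rate: float, loss_burst: float, latency_skew: float,
                  switch_count_1s: float, beta: dict = BETA) -> float:
    """p_sw from the linear logit in the four feed-health features."""
    logit = (beta["intercept"]
             + beta["gap_rate"] * gap_rate
             + beta["loss_burst"] * loss_burst
             + beta["latency_skew"] * latency_skew
             + beta["switch_count_1s"] * switch_count_1s)
    return _sigmoid(logit)


calm = switch_hazard(0.0, 0.0, 0.0, 0.0)
stressed = switch_hazard(2.0, 1.5, 1.0, 3.0)
print(f"calm={calm:.4f} stressed={stressed:.4f}")
```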

4) Regime States (Encode Explicitly)

F0 STABLE_PRIMARY: low hazard, consistent primary feed
F1 DEGRADED_PRIMARY: rising loss/gap, no switch yet
F2 FAILOVER_TRANSITION: active switch in progress
F3 FLAP_STORM: repeated switch-back events in short window
F4 SAFE_CONTAIN: confidence floor breached; conservative execution only

Use hysteresis + minimum dwell times to avoid control thrash.
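
One way to get this behavior is an asymmetric dwell gate: escalate immediately, but de-escalate only after a minimum time in the current state. The sketch below assumes F0..F4 are ordered by severity; the class name DwellGate and the 500 ms default are hypothetical. Threshold hysteresis (separate enter and exit levels on the hazard) would be layered on top of this.

```python
from dataclasses import dataclass
from enum import IntEnum


class Regime(IntEnum):
    STABLE_PRIMARY = 0
    DEGRADED_PRIMARY = 1
    FAILOVER_TRANSITION = 2
    FLAP_STORM = 3
    SAFE_CONTAIN = 4


@dataclass
class DwellGate:
    min_dwell_ms: int = 500  # hypothetical default
    current: Regime = Regime.STABLE_PRIMARY
    entered_at_ms: int = 0

    def update(self, proposed: Regime, now_ms: int) -> Regime:
        """Apply a proposed regime, holding de-escalations until the dwell elapses."""
        if proposed > self.current:
            # Escalations take effect immediately.
            self.current, self.entered_at_ms = proposed, now_ms
        elif proposed < self.current:
            # De-escalate only after min_dwell_ms since entering the current regime.
            if now_ms - self.entered_at_ms >= self.min_dwell_ms:
                self.current, self.entered_at_ms = proposed, now_ms
        return self.current


gate = DwellGate(min_dwell_ms=500)
print(gate.update(Regime.FLAP_STORM, now_ms=1000).name)      # escalates at once
print(gate.update(Regime.STABLE_PRIMARY, now_ms=1200).name)  # held: only 200 ms elapsed
print(gate.update(Regime.STABLE_PRIMARY, now_ms=1500).name)  # released: 500 ms elapsed
```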

5) Features Worth Adding to the Slippage Stack

Feed integrity

primary_gap_rate_100ms
backup_gap_rate_100ms
primary_backup_latency_skew_us
feed_switch_count_1s
time_since_last_switch_ms

Book continuity

top_of_book_discontinuity_score
microprice_jump_at_switch
depth_rebuild_time_ms
quote_age_p95_ms (post-switch window)

Execution sensitivity

queue_age_ms
remaining_deadline_slack_ms
child_retry_burst_1s
venue_ack_jitter_p95

If either feed_switch_count_1s or time_since_last_switch_ms is missing at decision time, the model should fail promotion; this applies to the whole model family.
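
A promotion check for the required switch features might look like the sketch below. It assumes the decision-time snapshot is a plain dict keyed by the feature names above; the helper name missing_required is hypothetical.

```python
import math

REQUIRED_SWITCH_FEATURES = ("feed_switch_count_1s", "time_since_last_switch_ms")


def missing_required(features: dict) -> list:
    """Return required switch features that are absent, None, or NaN."""
    missing = []
    for name in REQUIRED_SWITCH_FEATURES:
        value = features.get(name)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            missing.append(name)
    return missing


snapshot = {"feed_switch_count_1s": 2, "quote_age_p95_ms": 3.4}
gaps = missing_required(snapshot)
if gaps:
    print("fail promotion, missing:", gaps)
```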

6) New Diagnostics / KPIs

FSR (Feed Switch Rate): switches per second per symbol bucket
FFR (Feed Flap Rate): rate of switch-backs occurring within (T) ms of a failover
SAS (Switch-Age Spike): quote-age increase around switch events
SMD (Switch Markout Delta): markout(high-switch-hazard) - markout(low-switch-hazard)
QLRS (Queue-Loss-at-Resync Share): % IS explained by queue deterioration during F2/F3

If average IS is flat but SMD and QLRS are rising, the desk is silently paying control-plane slippage.
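
FSR and the flap count behind FFR can be computed from a time-ordered switch log. The sketch below is one possible reading of the definitions above, for a single symbol bucket; the Switch record and function names are hypothetical, and a flap is taken to be a switch that reverses the previous one within the flap window.

```python
from typing import List, NamedTuple


class Switch(NamedTuple):
    ts_ms: int
    from_feed: str
    to_feed: str


def feed_switch_rate(switches: List[Switch], window_ms: int) -> float:
    """FSR: switches per second over an observation window."""
    return len(switches) / (window_ms / 1000.0)


def count_flaps(switches: List[Switch], flap_window_ms: int) -> int:
    """Count switches that undo the previous switch within flap_window_ms."""
    flaps = 0
    for prev, cur in zip(switches, switches[1:]):
        is_reversal = cur.to_feed == prev.from_feed
        if is_reversal and cur.ts_ms - prev.ts_ms < flap_window_ms:
            flaps += 1
    return flaps


# Hypothetical switch log over a 10-second window.
log = [
    Switch(0, "primary", "backup"),
    Switch(40, "backup", "primary"),    # switch-back after 40 ms
    Switch(900, "primary", "backup"),
    Switch(5000, "backup", "primary"),  # switch-back, but after 4.1 s
]
print(feed_switch_rate(log, window_ms=10_000), count_flaps(log, flap_window_ms=100))
```

Dividing the flap count by the number of failovers, or by the window length, gives a rate; which normalization to use is a desk-level choice.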

7) Control Policy (Hazard-Aware)

STABLE_PRIMARY ((p_{sw}<\tau_1))

normal passive/neutral routing

DEGRADED_PRIMARY ((\tau_1\le p_{sw}<\tau_2))

reduce passive timeout
downweight fragile queues
tighten cancel/replace budget

FAILOVER_TRANSITION ((\tau_2\le p_{sw}<\tau_3))

avoid tactics requiring ultra-fresh L1 certainty
prefer smaller child slices with bounded urgency
require stronger confirmation for aggressive takes

FLAP_STORM ((p_{sw}\ge\tau_3) or FFR breach)

freeze high-churn repricing logic
cap participation + widen safety margins
escalate to SAFE_CONTAIN (completion-preserving, conservative profile) when the confidence floor is breached
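
The threshold mapping above can be sketched as a pure function. The threshold values in the default tau are hypothetical, and treating a confidence-floor breach as overriding every other condition is an assumption of this sketch rather than something the policy above fixes.

```python
def classify_regime(p_sw: float, ffr_breach: bool, confidence_floor_breached: bool,
                    tau=(0.05, 0.20, 0.50)) -> str:
    """Map hazard and flap signals to a regime name (before hysteresis or dwell)."""
    tau1, tau2, tau3 = tau
    if confidence_floor_breached:
        # Assumption: a breached confidence floor overrides everything else.
        return "SAFE_CONTAIN"
    if ffr_breach or p_sw >= tau3:
        return "FLAP_STORM"
    if p_sw >= tau2:
        return "FAILOVER_TRANSITION"
    if p_sw >= tau1:
        return "DEGRADED_PRIMARY"
    return "STABLE_PRIMARY"


for p in (0.01, 0.10, 0.30, 0.70):
    print(p, classify_regime(p, ffr_breach=False, confidence_floor_breached=False))
```

The output of a function like this is a raw proposal; it would normally pass through hysteresis and dwell logic before changing the active execution profile.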

8) Labeling & Training Contract

For each child decision, store:

Regime label (F0..F4)
Switch outcome in next 100/250/1000ms
Realized IS + multi-horizon markouts
Fill path (fill/cancel/replace/reject)
Feed-health snapshot (gaps/loss/skew/age)
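
The stored fields above can be pinned down as a validated record. This is a sketch only: the class name ChildDecisionLabel, the field names, and the bps units are hypothetical choices, not a fixed schema.

```python
from dataclasses import dataclass
from typing import Dict

REGIMES = {"F0", "F1", "F2", "F3", "F4"}
FILL_PATHS = {"fill", "cancel", "replace", "reject"}
SWITCH_HORIZONS_MS = (100, 250, 1000)


@dataclass(frozen=True)
class ChildDecisionLabel:
    regime: str                       # "F0".."F4"
    switch_within_ms: Dict[int, bool] # horizon in ms -> did a feed switch occur
    realized_is_bps: float
    markouts_bps: Dict[int, float]    # horizon in ms -> markout
    fill_path: str                    # fill / cancel / replace / reject
    feed_health: Dict[str, float]     # gaps, loss, skew, quote age

    def __post_init__(self):
        if self.regime not in REGIMES:
            raise ValueError(f"unknown regime: {self.regime}")
        if self.fill_path not in FILL_PATHS:
            raise ValueError(f"unknown fill path: {self.fill_path}")
        missing = [h for h in SWITCH_HORIZONS_MS if h not in self.switch_within_ms]
        if missing:
            raise ValueError(f"missing switch-outcome horizons: {missing}")


# Hypothetical record, for illustration only.
label = ChildDecisionLabel(
    regime="F2",
    switch_within_ms={100: False, 250: True, 1000: True},
    realized_is_bps=3.1,
    markouts_bps={1000: -0.4, 5000: -1.2},
    fill_path="replace",
    feed_health={"primary_gap_rate_100ms": 0.02, "quote_age_p95_ms": 4.5},
)
print(label.regime, label.switch_within_ms[250])
```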

Train three heads:

median cost (P50)
tail cost (P90/P97.5)
switch hazard (short horizon)
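
A common way to train quantile heads is the pinball (quantile) loss, with log loss for the binary hazard head. The source does not prescribe these losses, so the sketch below is an assumption about how the three heads could be scored, shown per sample.

```python
import math


def pinball_loss(y_true: float, y_pred: float, q: float) -> float:
    """Quantile loss: under-prediction is weighted by q, over-prediction by 1 - q."""
    diff = y_true - y_pred
    return max(q * diff, (q - 1.0) * diff)


def hazard_log_loss(switched: bool, p_sw: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy for the switch-hazard head, with clipping for stability."""
    p = min(max(p_sw, eps), 1.0 - eps)
    return -math.log(p) if switched else -math.log(1.0 - p)


# Realized cost of 10 bps against a prediction of 4 bps:
print(pinball_loss(10.0, 4.0, q=0.5))  # median head
print(pinball_loss(10.0, 4.0, q=0.9))  # tail head penalizes the miss more
print(hazard_log_loss(True, 0.2))
```

The tail head's asymmetric penalty is what keeps it from being pulled toward the median when F3 episodes are rare in the training set.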

9) Rollout Blueprint

Shadow switch-hazard + mixture predictions for 10–14 days
Compare by symbol liquidity bucket + time-of-day
Canary only for F2/F3 logic first
Promote when all hold:
- better tail coverage in switch windows
- lower QLRS and SMD
- no completion collapse under flap episodes
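
The promotion criteria can be encoded as an all-must-hold gate. In this sketch the metric names, the reading of "better tail coverage" as a smaller gap between empirical and nominal coverage, and the 0.95 completion floor are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class GateMetrics:
    tail_coverage_error: float   # |empirical - nominal| coverage in switch windows
    qlrs: float                  # Queue-Loss-at-Resync Share
    smd: float                   # Switch Markout Delta
    flap_completion_rate: float  # completion rate during flap episodes


def promotion_ok(baseline: GateMetrics, candidate: GateMetrics,
                 min_flap_completion: float = 0.95) -> bool:
    """True only if every promotion criterion holds."""
    return (candidate.tail_coverage_error < baseline.tail_coverage_error
            and candidate.qlrs < baseline.qlrs
            and candidate.smd < baseline.smd
            and candidate.flap_completion_rate >= min_flap_completion)


# Hypothetical shadow-period results.
base = GateMetrics(tail_coverage_error=0.08, qlrs=0.22, smd=1.4, flap_completion_rate=0.97)
cand = GateMetrics(tail_coverage_error=0.03, qlrs=0.15, smd=0.9, flap_completion_rate=0.96)
print(promotion_ok(base, cand))
```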

Rollback triggers:

repeated false-positive SAFE transitions
underfill spike above threshold
unresolved feed-clock integrity alerts

10) Common Anti-Patterns

Treating failover as one-off event instead of recurrent flap process
Optimizing only p50 while F3 tails deteriorate
Switching feeds without switching execution policy profile
Ignoring quote-age spikes right after feed resync
No hysteresis, causing policy and feed-control oscillation together

11) Fast Implementation Checklist

[ ] Add feed-switch telemetry to decision-time feature store
[ ] Train short-horizon switch hazard model
[ ] Upgrade cost model to stable/switch mixture form
[ ] Add F0..F4 state machine with hysteresis + dwell timers
[ ] Gate deployment on switch-window tail coverage + QLRS/SMD
[ ] Canary with strict notional caps in F2/F3 only

TL;DR

Dual-feed redundancy is not automatically safer for slippage. Without flap-aware hazard modeling and state-dependent controls, feed arbitration itself becomes a repeatable hidden cost center.