BGP Convergence & Anycast Path-Flap RTT-Bimodality Slippage Playbook
Date: 2026-03-25
Category: research
Audience: execution operators running multi-region / anycast-connected trading infrastructure
Why this note
Many execution stacks assume network latency is a single noisy distribution around a stable mean.
During inter-domain routing churn (BGP withdrawal/re-announcement, maintenance misstep, upstream instability, anycast path shift), latency often becomes bimodal rather than merely “higher.”
That regime change can create hidden slippage through:
- stale-signal admissions (alpha half-life violated before order reaches venue),
- bursty catch-up behavior (controller reacts late, then over-trades),
- false venue attribution (the venue is blamed when the transport path actually changed),
- and completion-risk convexity near deadlines.
The key mistake: treating route instability as generic latency noise instead of a separate execution state.
1) Cost decomposition with route-instability term
For child order decision at time (t):
[ \mathbb{E}[C_t] = \mathbb{E}[C_{base} \mid a_t, s_t] + \lambda_r \cdot \mathbb{E}[C_{route} \mid z_t] ]
Where:
- (C_{base}): spread + impact + fees + residual opportunity cost
- (z_t): route-integrity state (stable / shifting / unstable)
- (C_{route}): excess cost from path churn
- (\lambda_r): scaling weight applied to the route-cost term
Practical branch model:
[ C_{route} \approx p_{stale}L_{stale} + p_{retry}L_{retry} + p_{deadline}L_{deadline} ]
- (p_{stale}): probability the decision arrives after signal validity window
- (p_{retry}): probability of reject/timeout/retry loop under transient path degradation
- (p_{deadline}): probability of forced urgency escalation near schedule end
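The branch model above is simple enough to evaluate directly. The following is a minimal sketch, assuming losses are expressed in basis points; the names `RouteBranch`, `expected_route_cost`, and `expected_total_cost`, and all numbers, are illustrative and not taken from any production system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteBranch:
    """One branch of the route-cost model: a probability and a loss (bps)."""
    prob: float
    loss_bps: float


def expected_route_cost(stale, retry, deadline):
    """C_route ~= p_stale*L_stale + p_retry*L_retry + p_deadline*L_deadline."""
    return sum(b.prob * b.loss_bps for b in (stale, retry, deadline))


def expected_total_cost(base_bps, route_bps, lambda_r=1.0):
    """E[C_t] = E[C_base] + lambda_r * E[C_route]."""
    return base_bps + lambda_r * route_bps


# Illustrative inputs only (assumed, not measured).
c_route = expected_route_cost(
    stale=RouteBranch(prob=0.10, loss_bps=4.0),
    retry=RouteBranch(prob=0.05, loss_bps=2.0),
    deadline=RouteBranch(prob=0.02, loss_bps=10.0),
)
total = expected_total_cost(base_bps=3.0, route_bps=c_route, lambda_r=1.0)
print(f"C_route={c_route:.2f} bps, total={total:.2f} bps")
```

Keeping the three branches separate makes it easier to see which one dominates in a given route state, rather than fitting a single opaque latency penalty.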
2) Observability contract (point-in-time)
A) Network path telemetry
- per-destination RTT distribution (p50/p90/p99, not just mean)
- jitter and packet reordering indicators
- TCP retransmissions / QUIC loss-recovery counts
- path identity hints (upstream ASN, next-hop POP/region when available)
- route-change event markers (BGP update timestamps from collectors or providers)
B) Execution timeline telemetry
- decision timestamp
- send timestamp
- ACK/reject timestamp
- fill timestamp
- feature-age-at-decision and estimated feature-age-at-arrival
C) Session-level context
- open/close/auction/news windows
- strategy urgency bucket (low / medium / high)
- venue-specific timeout/retry behavior
Without path-event markers, the model confuses structural regime shifts with random latency noise.
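One way to make the contract concrete is a per-child-order record that carries the timeline fields and a path hint together. This is a sketch under assumptions: the class name `ChildOrderTimeline`, the nanosecond integer timestamps, the `path_tag` format, and the half-RTT approximation of one-way delay are all hypothetical choices, not requirements from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChildOrderTimeline:
    feature_ts_ns: int                 # when the newest input feature was observed
    decision_ts_ns: int                # when the child-order decision was made
    send_ts_ns: int                    # when the order left the process
    ack_ts_ns: Optional[int] = None    # ACK or reject timestamp, if received
    fill_ts_ns: Optional[int] = None   # fill timestamp, if any
    path_tag: Optional[str] = None     # path identity hint, e.g. upstream ASN / POP

    def feature_age_at_decision_ns(self) -> int:
        return self.decision_ts_ns - self.feature_ts_ns

    def est_feature_age_at_arrival_ns(self, rtt_ns: int) -> int:
        # Assumption: one-way delay is roughly half the current RTT estimate.
        return (self.send_ts_ns - self.feature_ts_ns) + rtt_ns // 2


t = ChildOrderTimeline(feature_ts_ns=1_000, decision_ts_ns=1_400,
                       send_ts_ns=1_500, path_tag="AS64500/pop-a")
age_at_decision = t.feature_age_at_decision_ns()
age_at_arrival = t.est_feature_age_at_arrival_ns(rtt_ns=2_000)
```

Storing the path tag on the same record as the timestamps is what allows slippage to be joined against route state later without a lossy time-window join.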
3) Modeling stack
A) Route state classifier
Define discrete states:
- PATH_STABLE — unimodal latency, low update churn
- PATH_SHIFTING — emerging second mode, rising jitter/retransmits
- PATH_UNSTABLE — persistent bimodality or rapid path flips
- PATH_RECOVERING — variance decays but tail still elevated
Implementation options:
- rules + hysteresis (fast to deploy), or
- HMM/state-space filter using (RTT_{p50,p95,p99}), retransmits, path tags, update counts.
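The rules-plus-hysteresis option can be sketched in a few lines. The version below is an assumption-heavy illustration: it uses only a tail ratio (p99/p50) and a per-window path-flip count as inputs, and every threshold (`enter_shift`, `enter_unstable`, `exit_ratio`, `flip_limit`, `dwell`) is a hypothetical placeholder that would need calibration per destination.

```python
from enum import Enum


class PathState(Enum):
    STABLE = "PATH_STABLE"
    SHIFTING = "PATH_SHIFTING"
    UNSTABLE = "PATH_UNSTABLE"
    RECOVERING = "PATH_RECOVERING"


class RouteStateClassifier:
    """Rule-based classifier; entry thresholds sit above the exit threshold."""

    def __init__(self, enter_shift=3.0, enter_unstable=6.0, exit_ratio=2.0,
                 flip_limit=3, dwell=3):
        self.enter_shift = enter_shift
        self.enter_unstable = enter_unstable
        self.exit_ratio = exit_ratio
        self.flip_limit = flip_limit
        self.dwell = dwell
        self.state = PathState.STABLE
        self.calm = 0  # consecutive calm windows seen so far

    def update(self, tail_ratio, path_flips):
        calm_now = tail_ratio < self.exit_ratio and path_flips == 0
        self.calm = self.calm + 1 if calm_now else 0
        if tail_ratio >= self.enter_unstable or path_flips >= self.flip_limit:
            self.state = PathState.UNSTABLE
        elif self.state is PathState.STABLE:
            if tail_ratio >= self.enter_shift or path_flips > 0:
                self.state = PathState.SHIFTING
        elif self.state is PathState.UNSTABLE:
            # Leave UNSTABLE only into RECOVERING, never straight to STABLE.
            if tail_ratio < self.enter_shift and path_flips == 0:
                self.state = PathState.RECOVERING
        elif self.calm >= self.dwell:
            # SHIFTING or RECOVERING: require a full calm dwell before STABLE.
            self.state = PathState.STABLE
        return self.state


clf = RouteStateClassifier()
windows = [(1.5, 0), (3.5, 0), (7.0, 1), (2.5, 0), (1.5, 0), (1.5, 0), (1.5, 0)]
trace = [clf.update(ratio, flips).name for ratio, flips in windows]
print(trace)
```

The gap between `enter_shift` and `exit_ratio` is the hysteresis band: a window with a ratio of 2.5 does not trigger SHIFTING from STABLE, but it also does not count toward the calm dwell needed to return.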
B) Bimodality detector
Use simple online indicators:
- Hartigan-style dip proxy (or Gaussian mixture BIC delta),
- tail-ratio jump ((p99/p50)),
- short-window mode-split score.
Trigger only if bimodality persists beyond a minimum dwell time, to avoid false positives from microbursts.
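A crude short-window mode-split score with a dwell requirement might look like the sketch below. It is not a Hartigan dip test or a mixture fit; it splits the sorted window at its largest gap and measures how well separated the two sides are. The function and class names and the thresholds (`min_separation`, `min_minority`, `dwell_windows`) are assumptions for illustration.

```python
import statistics


def mode_split_score(rtts_ms):
    """Split sorted samples at the largest gap.

    Returns (separation, minority_fraction): the distance between the two
    group means in units of their summed spreads, and the share of samples
    in the smaller group.
    """
    xs = sorted(rtts_ms)
    if len(xs) < 4:
        return 0.0, 0.0
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    k = gaps.index(max(gaps)) + 1
    lo, hi = xs[:k], xs[k:]
    spread = statistics.pstdev(lo) + statistics.pstdev(hi)
    separation = (statistics.fmean(hi) - statistics.fmean(lo)) / (spread + 1e-9)
    minority = min(len(lo), len(hi)) / len(xs)
    return separation, minority


class BimodalityDetector:
    """Fires only after `dwell_windows` consecutive bimodal-looking windows."""

    def __init__(self, min_separation=3.0, min_minority=0.15, dwell_windows=3):
        self.min_separation = min_separation
        self.min_minority = min_minority
        self.dwell_windows = dwell_windows
        self.streak = 0

    def update(self, window_rtts_ms):
        separation, minority = mode_split_score(window_rtts_ms)
        hit = separation >= self.min_separation and minority >= self.min_minority
        self.streak = self.streak + 1 if hit else 0
        return self.streak >= self.dwell_windows


bimodal = [10.1, 10.3, 9.9, 10.0, 30.2, 29.8, 30.1, 10.2, 30.0, 9.8]
unimodal = [10.0, 10.1, 9.9, 10.2, 10.05, 9.95, 10.15, 9.85, 10.1, 10.0]
det = BimodalityDetector()
flags = [det.update(w) for w in
         (bimodal, bimodal, unimodal, bimodal, bimodal, bimodal)]
```

The minority-fraction check is what rejects a single outlier: one slow sample produces a large gap but a tiny second group. A known limitation of the largest-gap split is that an extreme outlier can mask a genuine second mode, so clipping the window tail first may be worthwhile.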
C) Stale-arrival model
Estimate:
[ P(\text{arrival age} > \tau_{alpha} \mid z_t, venue, urgency) ]
This probability should directly influence passive-vs-aggressive decisioning.
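As a starting point before any fitted model, the conditional probability can be estimated empirically per (route state, venue, urgency) bucket. The sketch below assumes arrival ages in milliseconds and uses a simple additive prior so sparse buckets do not return 0 or 1; `StaleArrivalEstimator` and its prior parameters are hypothetical.

```python
from collections import defaultdict


class StaleArrivalEstimator:
    """Smoothed empirical P(arrival age > tau_alpha | state, venue, urgency)."""

    def __init__(self, tau_alpha_ms, prior_stale=1.0, prior_total=2.0):
        self.tau_alpha_ms = tau_alpha_ms
        self.prior_stale = prior_stale    # pseudo-count of stale arrivals
        self.prior_total = prior_total    # pseudo-count of all arrivals
        self.counts = defaultdict(lambda: [0, 0])  # key -> [stale, total]

    def observe(self, state, venue, urgency, arrival_age_ms):
        bucket = self.counts[(state, venue, urgency)]
        bucket[0] += int(arrival_age_ms > self.tau_alpha_ms)
        bucket[1] += 1

    def p_stale(self, state, venue, urgency):
        stale, total = self.counts.get((state, venue, urgency), (0, 0))
        return (stale + self.prior_stale) / (total + self.prior_total)


est = StaleArrivalEstimator(tau_alpha_ms=5.0)
for age in (2.0, 8.0, 9.0, 12.0):
    est.observe("PATH_UNSTABLE", "VENUE_A", "high", age)
for age in (1.0, 2.0, 3.0, 4.0):
    est.observe("PATH_STABLE", "VENUE_A", "high", age)

p_unstable = est.p_stale("PATH_UNSTABLE", "VENUE_A", "high")
p_stable = est.p_stale("PATH_STABLE", "VENUE_A", "high")
```

A decision rule could then, for example, skip a passive placement when `p_stale` for the current bucket exceeds a strategy-specific limit; that limit is not specified here.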
D) Deadline convexity model
Model expected incremental cost of waiting near schedule end:
[ \Delta C_{deadline}(u) = C_{forced}(u) - C_{smooth}(u) ]
where (u) is residual urgency budget. Under PATH_UNSTABLE, (\Delta C_{deadline}) usually steepens.
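To illustrate what "steepens" means numerically, the sketch below evaluates the deadline delta and its slope by central difference. The cost curves are toy functions invented for this example (a flat smooth cost and a forced cost that grows as 1/u); real curves would come from fitted impact models.

```python
def deadline_delta(c_forced, c_smooth, u):
    """Delta C_deadline(u) = C_forced(u) - C_smooth(u)."""
    return c_forced(u) - c_smooth(u)


def convexity_slope(c_forced, c_smooth, u, h=1e-3):
    """Central-difference slope of the deadline delta with respect to u."""
    up = deadline_delta(c_forced, c_smooth, u + h)
    down = deadline_delta(c_forced, c_smooth, u - h)
    return (up - down) / (2.0 * h)


# Toy curves (assumed): forced cost rises as the urgency budget u shrinks.
def smooth(u):
    return 2.0


def forced_stable(u):
    return 2.0 + 1.0 / u


def forced_unstable(u):
    return 2.0 + 3.0 / u


u = 0.5
delta_stable = deadline_delta(forced_stable, smooth, u)
delta_unstable = deadline_delta(forced_unstable, smooth, u)
slope_stable = convexity_slope(forced_stable, smooth, u)
slope_unstable = convexity_slope(forced_unstable, smooth, u)
```

With these toy curves, both the delta and the magnitude of its slope are three times larger in the unstable case at the same remaining budget, which is the shape the policy controller is meant to anticipate.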
4) Live policy controller (state-aware)
GREEN — PATH_STABLE
- Normal routing/sizing policy.
AMBER — PATH_SHIFTING
- tighten signal-age admission,
- reduce child slice size,
- increase retry spacing (avoid self-induced bursts),
- slightly favor venues with lower tail-latency sensitivity.
RED — PATH_UNSTABLE
- disable fragile alpha branches with short half-life,
- cap participation and prevent panic catch-up bursts,
- prefer robust completion schedule over opportunistic micro-alpha,
- activate stricter kill/containment limits for timeout cascades.
RECOVERY — PATH_RECOVERING
- keep conservative caps until dwell-based stability is confirmed,
- gradual ramp-up, not instant GREEN re-entry.
No direct RED→GREEN jumps; require recovery dwell and KPI normalization.
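The transition rule can be enforced in a small guard that sits between the classifier and the policy tables. This is a minimal sketch: the `PolicyLimits` fields and every numeric value in `LIMITS` are hypothetical placeholders, and the only constraint encoded is the one stated above (GREEN is reachable from RED only via RECOVERY, and only after dwell and KPI checks pass).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyLimits:
    max_signal_age_ms: float   # signal-age admission threshold
    slice_scale: float         # multiplier on normal child slice size
    retry_spacing_ms: float    # minimum spacing between retries
    allow_short_alpha: bool    # whether short-half-life alpha branches run


# Placeholder values for illustration only.
LIMITS = {
    "GREEN": PolicyLimits(5.0, 1.0, 10.0, True),
    "AMBER": PolicyLimits(3.0, 0.5, 25.0, True),
    "RED": PolicyLimits(2.0, 0.25, 50.0, False),
    "RECOVERY": PolicyLimits(3.0, 0.5, 25.0, False),
}


def next_mode(current, proposed, dwell_ok, kpis_normal):
    """Apply the proposed mode unless it would re-enter GREEN too early."""
    if proposed == "GREEN" and current in ("RED", "RECOVERY"):
        if current == "RED" or not (dwell_ok and kpis_normal):
            return "RECOVERY"
    return proposed


mode = "RED"
mode = next_mode(mode, "GREEN", dwell_ok=True, kpis_normal=True)   # held in RECOVERY
held = mode
mode = next_mode(mode, "GREEN", dwell_ok=False, kpis_normal=True)  # still RECOVERY
mode = next_mode(mode, "GREEN", dwell_ok=True, kpis_normal=True)   # now GREEN
limits = LIMITS[mode]
```

Escalations (toward AMBER or RED) pass through unchanged, so the guard only ever slows down de-escalation.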
5) Production KPIs
- RBS (Route Bimodality Score): online bimodality intensity metric
- PSR (Path Shift Rate): route/path identity changes per minute
- SAF (Stale Arrival Fraction): arrivals beyond alpha-age budget
- RLC (Route-Latency Cost): incremental slippage vs PATH_STABLE baseline
- DCS (Deadline Convexity Slope): urgency cost steepness under current state
Useful alerts:
- RBS↑ + SAF↑ for N minutes ⇒ escalate to RED
- PSR spike during auction/open windows ⇒ restrict opportunistic policies
- RLC above weekly p95 while base spread unchanged ⇒ likely transport-driven cost leak
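Two of the KPIs and the first alert are easy to prototype. The sketch below assumes path identity is available as one tag per sample and that RBS and SAF are reported once per minute; it also interprets "RBS↑ + SAF↑ for N minutes" as both series strictly rising across N consecutive one-minute steps, which is one reading among several. All function names are hypothetical.

```python
def path_shift_rate(path_tags, window_minutes):
    """PSR: path identity changes per minute over the window."""
    changes = sum(1 for a, b in zip(path_tags, path_tags[1:]) if a != b)
    return changes / window_minutes


def stale_arrival_fraction(arrival_ages_ms, budget_ms):
    """SAF: share of arrivals whose age exceeds the alpha-age budget."""
    if not arrival_ages_ms:
        return 0.0
    return sum(age > budget_ms for age in arrival_ages_ms) / len(arrival_ages_ms)


def sustained_rise(series, n):
    """True if the last n steps of the series are all strict increases."""
    tail = series[-(n + 1):]
    return len(tail) == n + 1 and all(b > a for a, b in zip(tail, tail[1:]))


def should_escalate_to_red(rbs_per_min, saf_per_min, n):
    return sustained_rise(rbs_per_min, n) and sustained_rise(saf_per_min, n)


psr = path_shift_rate(["A", "A", "B", "B", "A"], window_minutes=2)
saf = stale_arrival_fraction([1.0, 6.0, 7.0, 2.0], budget_ms=5.0)
escalate = should_escalate_to_red([1.0, 2.0, 3.0, 4.0], [0.1, 0.2, 0.3, 0.4], n=3)
```

A threshold-based variant (both KPIs above a level for N minutes) is equally plausible; the strict-rise version is more sensitive to noise and may need smoothing first.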
6) Validation ladder
Historical relabeling
Join execution logs with path/route event markers; compare state-conditional slippage.
Counterfactual replay
Re-run decisions with AMBER/RED controls and estimate avoided tail cost.
Fault injection drills
Inject synthetic latency bimodality/retransmit bursts in staging.
Shadow then canary
Shadow state classifier first, then enable controls on small flow slice.
Primary success criterion: reduced q95/q99 slippage with bounded completion-risk degradation.
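For the historical-relabeling step, the comparison reduces to tail quantiles of slippage grouped by route state. The sketch below assumes records are already joined into `(state, slippage_bps)` pairs; the function name and the synthetic data are illustrative only.

```python
import statistics
from collections import defaultdict


def tail_slippage_by_state(records):
    """Return {state: (q95, q99)} of slippage in bps for each route state."""
    by_state = defaultdict(list)
    for state, slippage_bps in records:
        by_state[state].append(slippage_bps)
    result = {}
    for state, values in by_state.items():
        if len(values) < 2:
            continue  # statistics.quantiles needs at least two points
        cuts = statistics.quantiles(values, n=100, method="inclusive")
        result[state] = (cuts[94], cuts[98])  # 95th and 99th percentiles
    return result


# Synthetic data: the unstable state has exactly twice the slippage.
records = [("PATH_STABLE", float(i)) for i in range(101)]
records += [("PATH_UNSTABLE", 2.0 * i) for i in range(101)]
tails = tail_slippage_by_state(records)
print(tails)
```

The same function can be run on baseline and replayed (AMBER/RED-controlled) logs, so the success criterion becomes a direct difference of the q95/q99 pairs per state.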
7) 14-day implementation plan
Days 1-2
Instrument path-event + execution-age fields end-to-end.
Days 3-4
Implement RBS/PSR/SAF calculators and dashboards.
Days 5-7
Build initial state classifier with hysteresis.
Days 8-9
Train stale-arrival and deadline-convexity components.
Days 10-11
Integrate AMBER/RED policy guards in shadow mode.
Days 12-13
Canary enablement with strict rollback gates.
Day 14
Finalize incident runbook and postmortem template.
Common mistakes
Using mean RTT as primary health signal
Mean hides the second mode that actually drives tail slippage.
Immediate aggression ramp on partial recovery
Recovery without dwell causes repeated oscillation and cost spikes.
No separation between market toxicity and path toxicity
Different causes require different controls.
Retry storms without pacing
Naive retries convert transient path noise into self-inflicted burst impact.
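Retry pacing is commonly done with capped exponential backoff plus random jitter, so that retries from many child orders do not land in the same instant. The sketch below uses the "full jitter" variant (each delay drawn uniformly between zero and the current ceiling); the function name, parameters, and the idea of a state-dependent `spacing_mult` are assumptions for illustration.

```python
import random


def retry_delays_ms(base_ms, attempts, cap_ms, spacing_mult=1.0, rng=None):
    """Capped exponential backoff with full jitter.

    spacing_mult widens the spacing, for example when the route state is
    AMBER or RED (hypothetical usage).
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_ms, base_ms * spacing_mult * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


# Seeded generator only so the example is repeatable.
delays = retry_delays_ms(base_ms=5.0, attempts=4, cap_ms=30.0,
                         spacing_mult=2.0, rng=random.Random(7))
```

The cap matters near deadlines: without it, late attempts can back off past the schedule end and turn a transient path problem into a forced-urgency fill.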
Bottom line
BGP/anycast path churn is not just “infra noise.”
In execution systems it behaves like a distinct slippage regime with:
- stale-arrival risk,
- retry/catch-up convexity,
- and deadline-cost amplification.
Model it explicitly, gate policy by route state, and enforce conservative recovery transitions. That is how you stop network path chaos from quietly becoming basis-point leakage.
References
RFC 4271: A Border Gateway Protocol 4 (BGP-4)
https://www.rfc-editor.org/rfc/rfc4271
RFC 8326: Graceful BGP Session Shutdown
https://www.rfc-editor.org/rfc/rfc8326
RIPE Labs: BGP Routing Churn and Stability Analyses
https://labs.ripe.net/
Cloudflare: How Anycast Works (Operational Background)
https://www.cloudflare.com/learning/cdn/glossary/anycast-network/
Perold, A. F. (1988), The Implementation Shortfall: Paper vs Reality
https://www.hbs.edu/faculty/Pages/item.aspx?num=2083