BGP Convergence & Anycast Path-Flap RTT-Bimodality Slippage Playbook

Date: 2026-03-25
Category: research
Audience: execution operators running multi-region / anycast-connected trading infrastructure

Why this note

Many execution stacks assume network latency is a single noisy distribution around a stable mean.

During inter-domain routing churn (BGP withdrawal/re-announcement, maintenance misstep, upstream instability, anycast path shift), latency often becomes bimodal rather than merely “higher.”

That regime change can create hidden slippage through:

stale-signal admissions (alpha half-life violated before order reaches venue),
bursty catch-up behavior (controller reacts late, then over-trades),
false venue attribution (a venue is blamed when the transport path actually changed),
and completion-risk convexity near deadlines.

The key mistake: treating route instability as generic latency noise instead of a separate execution state.

1) Cost decomposition with route-instability term

For a child-order decision at time (t):

[ \mathbb{E}[C_t] = \mathbb{E}[C_{base} \mid a_t, s_t] + \lambda_r \cdot \mathbb{E}[C_{route} \mid z_t] ]

Where:

(C_{base}): spread + impact + fees + residual opportunity cost
(z_t): route-integrity state (stable / shifting / unstable / recovering)
(C_{route}): excess cost from path churn
(\lambda_r): weight applied to the route-instability term

Practical branch model:

[ C_{route} \approx p_{stale}L_{stale} + p_{retry}L_{retry} + p_{deadline}L_{deadline} ]

(p_{stale}): probability that the order arrives after the signal validity window
(p_{retry}): probability of reject/timeout/retry loop under transient path degradation
(p_{deadline}): probability of forced urgency escalation near schedule end
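
The branch model can be made concrete in a few lines of Python. This is a minimal sketch, not a calibrated model: the names RouteBranch, expected_route_cost, and expected_total_cost are hypothetical, losses are assumed to be expressed in basis points, and the probabilities and losses in the example are made-up numbers for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteBranch:
    """One branch of the route-cost model: a probability and a loss in bps."""
    prob: float
    loss_bps: float


def expected_route_cost(stale: RouteBranch, retry: RouteBranch,
                        deadline: RouteBranch) -> float:
    """C_route ~= p_stale*L_stale + p_retry*L_retry + p_deadline*L_deadline."""
    branches = (stale, retry, deadline)
    for b in branches:
        if not 0.0 <= b.prob <= 1.0:
            raise ValueError("branch probability must lie in [0, 1]")
    return sum(b.prob * b.loss_bps for b in branches)


def expected_total_cost(base_bps: float, route_bps: float,
                        lambda_r: float = 1.0) -> float:
    """E[C_t] = E[C_base | a_t, s_t] + lambda_r * E[C_route | z_t]."""
    return base_bps + lambda_r * route_bps


# Illustrative inputs only (not calibrated): an unstable-path scenario.
route = expected_route_cost(
    stale=RouteBranch(prob=0.20, loss_bps=3.0),
    retry=RouteBranch(prob=0.10, loss_bps=5.0),
    deadline=RouteBranch(prob=0.05, loss_bps=12.0),
)
total = expected_total_cost(base_bps=4.0, route_bps=route, lambda_r=1.0)
```

With these illustrative inputs the route term comes to about 1.7 bps on top of a 4.0 bps base cost. In practice each branch probability would presumably be conditioned on the route state (z_t) rather than held fixed.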

2) Observability contract (point-in-time)

A) Network path telemetry

per-destination RTT distribution (p50/p95/p99, not just mean)
jitter and packet reordering indicators
TCP retransmissions / QUIC loss-recovery counts
path identity hints (upstream ASN, next-hop POP/region when available)
route-change event markers (BGP update timestamps from collectors or providers)

B) Execution timeline telemetry

decision timestamp
send timestamp
ACK/reject timestamp
fill timestamp
feature-age-at-decision and estimated feature-age-at-arrival

C) Session-level context

open/close/auction/news windows
strategy urgency bucket (low / medium / high)
venue-specific timeout/retry behavior

Without path-event markers, the model confuses structural regime shifts with random latency noise.
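
One way to make this contract concrete is a single point-in-time record per child order. The sketch below is an assumption about shape, not a prescribed schema: ExecEvent and its field names are hypothetical, timestamps are assumed to be nanoseconds from one synchronized clock, and the arrival-age estimate falls back to half of the send-to-ACK round trip when no one-way delay is measured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExecEvent:
    """Point-in-time record for one child order; timestamps in nanoseconds."""
    feature_ts_ns: int            # when the newest input feature was observed
    decision_ts_ns: int
    send_ts_ns: int
    ack_ts_ns: Optional[int]      # None if no ACK/reject was received
    fill_ts_ns: Optional[int]     # None if the order never filled
    route_state: str              # e.g. "PATH_STABLE", as known at decision time
    path_tag: Optional[str]       # upstream ASN / POP hint when available

    def feature_age_at_decision_ns(self) -> int:
        return self.decision_ts_ns - self.feature_ts_ns

    def est_feature_age_at_arrival_ns(
            self, one_way_ns: Optional[int] = None) -> Optional[int]:
        """Estimated signal age when the order reaches the venue.

        Without a measured one-way delay this uses half of send->ACK time.
        That assumes a symmetric path and ignores venue processing time,
        both of which can be wrong during route churn.
        """
        if one_way_ns is None:
            if self.ack_ts_ns is None:
                return None
            one_way_ns = (self.ack_ts_ns - self.send_ts_ns) // 2
        return (self.send_ts_ns - self.feature_ts_ns) + one_way_ns
```

Storing the route state and path tag as they were known at decision time, rather than joining them in later, is what keeps the record usable for lookahead-free backtests.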

3) Modeling stack

A) Route state classifier

Define discrete states:

PATH_STABLE — unimodal latency, low update churn
PATH_SHIFTING — emerging second mode, rising jitter/retransmits
PATH_UNSTABLE — persistent bimodality or rapid path flips
PATH_RECOVERING — variance decays but tail still elevated

Implementation options:

rules + hysteresis (fast to deploy), or
HMM/state-space filter using (RTT_{p50,p95,p99}), retransmits, path tags, update counts.
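
The rules-plus-hysteresis option might look like the sketch below. Everything specific here is an assumption for illustration: the Window fields, the thresholds in raw_level, and the dwell counts are placeholders rather than calibrated values, and the rule that PATH_UNSTABLE can only be left through PATH_RECOVERING is one possible reading of the four states above.

```python
from dataclasses import dataclass

STABLE, SHIFTING, UNSTABLE, RECOVERING = (
    "PATH_STABLE", "PATH_SHIFTING", "PATH_UNSTABLE", "PATH_RECOVERING")


@dataclass
class Window:
    tail_ratio: float     # p99 / p50 RTT within the window
    retrans_rate: float   # retransmitted segments / segments sent
    path_flips: int       # path-identity changes seen in the window


def raw_level(w: Window) -> int:
    """Map one window to severity 0 (calm), 1 (suspicious), 2 (bad).
    Thresholds are placeholders, not calibrated values."""
    if w.tail_ratio >= 4.0 or w.path_flips >= 2 or w.retrans_rate >= 0.05:
        return 2
    if w.tail_ratio >= 2.0 or w.path_flips >= 1 or w.retrans_rate >= 0.01:
        return 1
    return 0


class RouteStateClassifier:
    """Escalate after a short dwell, de-escalate after a longer one."""

    def __init__(self, up_dwell: int = 2, down_dwell: int = 5):
        self.state = STABLE
        self.up_dwell, self.down_dwell = up_dwell, down_dwell
        self._bad = 0    # consecutive windows at level 2
        self._warn = 0   # consecutive windows at level >= 1
        self._calm = 0   # consecutive windows at level 0

    def update(self, w: Window) -> str:
        lvl = raw_level(w)
        self._bad = self._bad + 1 if lvl == 2 else 0
        self._warn = self._warn + 1 if lvl >= 1 else 0
        self._calm = self._calm + 1 if lvl == 0 else 0

        if self._bad >= self.up_dwell:
            self.state = UNSTABLE
        elif self.state in (STABLE, RECOVERING) and self._warn >= self.up_dwell:
            self.state = SHIFTING
        elif self.state == UNSTABLE and self._calm >= self.down_dwell:
            # Leave UNSTABLE only via RECOVERING; restart the calm counter
            # so RECOVERING needs its own full dwell before STABLE.
            self.state, self._calm = RECOVERING, 0
        elif self.state in (SHIFTING, RECOVERING) and self._calm >= self.down_dwell:
            self.state = STABLE
        return self.state
```

The asymmetric dwell (short on the way up, long on the way down) is a design choice: a single noisy window changes nothing, while leaving a degraded state requires sustained calm.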

B) Bimodality detector

Use simple online indicators:

Hartigan-style dip proxy (or Gaussian mixture BIC delta),
tail-ratio jump ((p99/p50)),
short-window mode-split score.

Trigger only if bimodality persists beyond a minimum dwell time (to avoid false positives from microbursts).
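
A minimal sketch of two of these indicators plus the dwell gate follows. It is not a Hartigan dip test or a mixture-model fit: mode_split_score here is a swapped-in, simpler stand-in (the best two-cluster split of the sorted window, scored as between-cluster variance over total variance), and the function names, the 0.9 threshold, and the dwell of 3 windows are assumptions for illustration.

```python
from typing import Sequence


def percentile(sorted_xs: Sequence[float], q: float) -> float:
    """Nearest-rank percentile on pre-sorted, non-empty data; q in [0, 1]."""
    idx = min(len(sorted_xs) - 1, max(0, round(q * (len(sorted_xs) - 1))))
    return sorted_xs[idx]


def tail_ratio(rtts: Sequence[float]) -> float:
    """p99 / p50; assumes a non-empty window of positive RTTs."""
    xs = sorted(rtts)
    return percentile(xs, 0.99) / percentile(xs, 0.50)


def mode_split_score(rtts: Sequence[float], min_mass: float = 0.05) -> float:
    """Between-cluster variance / total variance for the best 1-D two-way split.

    Roughly 0.64 for a Gaussian and 0.75 for a uniform window; it approaches
    1.0 only when two well-separated clusters exist. Splits that leave less
    than `min_mass` of the samples on one side are ignored.
    """
    xs = sorted(rtts)
    n = len(xs)
    if n < 20:
        return 0.0
    total = sum(xs)
    mean = total / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0.0:
        return 0.0
    best, left_sum = 0.0, 0.0
    for k in range(1, n):                 # left cluster is xs[:k]
        left_sum += xs[k - 1]
        w = k / n
        if min(w, 1.0 - w) < min_mass:
            continue
        mu_l = left_sum / k
        mu_r = (total - left_sum) / (n - k)
        best = max(best, w * (1.0 - w) * (mu_r - mu_l) ** 2 / var)
    return best


class DwellTrigger:
    """Fire only after `dwell` consecutive windows at or above `threshold`."""

    def __init__(self, threshold: float = 0.9, dwell: int = 3):
        self.threshold, self.dwell, self._run = threshold, dwell, 0

    def update(self, score: float) -> bool:
        self._run = self._run + 1 if score >= self.threshold else 0
        return self._run >= self.dwell
```

Because a healthy unimodal window already scores well above zero on the split metric, the threshold has to sit close to 1.0; comparing against a rolling PATH_STABLE baseline is likely more robust than a fixed cutoff.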

C) Stale-arrival model

Estimate:

[ P(\text{arrival age} > \tau_{alpha} \mid z_t, venue, urgency) ]

This probability should feed directly into the passive-versus-aggressive decision.
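
As a starting point, this conditional probability can be estimated as a smoothed empirical frequency per (route state, venue, urgency) cell. The sketch below is one simple option under stated assumptions, not the required model: StaleArrivalEstimator is a hypothetical name, ages are assumed to be in microseconds, and the prior (5% stale, weighted as 20 pseudo-observations) is a placeholder that only stops sparse cells from reporting 0 or 1.

```python
from collections import defaultdict
from typing import Dict, Tuple

Key = Tuple[str, str, str]  # (route_state, venue, urgency)


class StaleArrivalEstimator:
    """Smoothed estimate of P(arrival age > tau_alpha | state, venue, urgency)."""

    def __init__(self, prior_stale: float = 0.05, prior_weight: float = 20.0):
        # The prior behaves like `prior_weight` pseudo-observations with a
        # stale fraction of `prior_stale`.
        self.prior_stale = prior_stale
        self.prior_weight = prior_weight
        self._n: Dict[Key, int] = defaultdict(int)
        self._stale: Dict[Key, int] = defaultdict(int)

    def observe(self, key: Key, arrival_age_us: float,
                tau_alpha_us: float) -> None:
        """Record one arrival and whether it exceeded the alpha-age budget."""
        self._n[key] += 1
        if arrival_age_us > tau_alpha_us:
            self._stale[key] += 1

    def p_stale(self, key: Key) -> float:
        num = self._stale.get(key, 0) + self.prior_stale * self.prior_weight
        den = self._n.get(key, 0) + self.prior_weight
        return num / den


est = StaleArrivalEstimator()
key = ("PATH_UNSTABLE", "VENUE_A", "high")   # hypothetical cell
for age_us in [120.0, 900.0] * 40:           # 80 arrivals, half over budget
    est.observe(key, arrival_age_us=age_us, tau_alpha_us=500.0)
p = est.p_stale(key)
```

A cell with no observations simply returns the prior, so the estimate degrades gracefully for rarely used venues; a regression or hierarchical model could replace the lookup once enough labeled arrivals exist.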

D) Deadline convexity model

Model expected incremental cost of waiting near schedule end:

[ \Delta C_{deadline}(u) = C_{forced}(u) - C_{smooth}(u) ]

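where (u) is the residual urgency budget, (C_{forced}) is the cost of forced completion near the deadline, and (C_{smooth}) is the cost of completing on the smooth schedule. Under PATH_UNSTABLE, (\Delta C_{deadline}) usually steepens.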
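
One rough way to quantify "steepens" is to fit a slope of the cost gap against (u) separately for each route state and compare magnitudes. The sketch below assumes hypothetical inputs: per-sample tuples of (u, C_forced, C_smooth) in basis points, which would have to come from a cost model or replay, and a plain least-squares line as a first-order summary rather than a full convexity model.

```python
from typing import Iterable, Tuple


def deadline_delta(c_forced_bps: float, c_smooth_bps: float) -> float:
    """Delta C_deadline(u) = C_forced(u) - C_smooth(u)."""
    return c_forced_bps - c_smooth_bps


def deadline_slope(samples: Iterable[Tuple[float, float, float]]) -> float:
    """Least-squares slope of Delta C_deadline against u.

    Each sample is (u, c_forced_bps, c_smooth_bps). A larger absolute slope
    means the cost gap reacts more sharply to the remaining urgency budget.
    """
    pts = [(u, deadline_delta(cf, cs)) for u, cf, cs in samples]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_u = sum(u for u, _ in pts) / n
    mean_d = sum(d for _, d in pts) / n
    sxx = sum((u - mean_u) ** 2 for u, _ in pts)
    if sxx == 0.0:
        raise ValueError("all samples share the same u")
    sxy = sum((u - mean_u) * (d - mean_d) for u, d in pts)
    return sxy / sxx


# Made-up samples for one route state: the gap shrinks as budget u grows.
unstable_samples = [(0.1, 12.2, 3.0), (0.5, 9.0, 3.0), (0.9, 5.8, 3.0)]
slope_unstable = deadline_slope(unstable_samples)
```

Running the same fit on PATH_STABLE samples and comparing absolute slopes gives a simple check of whether the deadline penalty really does steepen under route instability.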

4) Live policy controller (state-aware)

GREEN — PATH_STABLE

Normal routing/sizing policy.

AMBER — PATH_SHIFTING

tighten signal-age admission,
reduce child slice size,
increase retry spacing (avoid self-induced bursts),
slightly favor venues with lower tail-latency sensitivity.

RED — PATH_UNSTABLE

disable fragile alpha branches with short half-life,
cap participation and prevent panic catch-up bursts,
prefer robust completion schedule over opportunistic micro-alpha,
activate stricter kill/containment limits for timeout cascades.

RECOVERY — PATH_RECOVERING

keep conservative caps until dwell-based stability is confirmed,
ramp up gradually rather than re-entering GREEN instantly.

No direct RED→GREEN jumps; require recovery dwell and KPI normalization.
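
The transition rule can be enforced in the controller itself rather than left to the classifier. The following is a minimal sketch under assumptions: the Policy fields and every numeric limit are placeholders, PolicyController is a hypothetical name, kpis_normal stands in for whatever KPI-normalization check is used, and the dwell is counted in controller ticks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    max_signal_age_us: int        # signal-age admission limit
    slice_scale: float            # multiplier on the normal child slice size
    retry_spacing_ms: int         # minimum gap between retries
    max_participation: float      # participation-rate cap
    fragile_alpha_enabled: bool   # short-half-life alpha branches on/off


# Placeholder values for illustration only; real limits need calibration.
POLICIES = {
    "GREEN":    Policy(500, 1.00, 5, 0.15, True),
    "AMBER":    Policy(300, 0.60, 20, 0.10, True),
    "RED":      Policy(150, 0.30, 50, 0.05, False),
    "RECOVERY": Policy(300, 0.50, 30, 0.08, False),
}

MODE_FOR_STATE = {
    "PATH_STABLE": "GREEN",
    "PATH_SHIFTING": "AMBER",
    "PATH_UNSTABLE": "RED",
    "PATH_RECOVERING": "RECOVERY",
}


class PolicyController:
    """Maps route state to a policy mode and blocks direct RED -> GREEN."""

    def __init__(self, recovery_dwell: int = 10):
        self.mode = "GREEN"
        self.recovery_dwell = recovery_dwell
        self._recovery_ticks = 0   # consecutive ticks spent in RECOVERY

    def step(self, route_state: str, kpis_normal: bool) -> Policy:
        target = MODE_FOR_STATE[route_state]
        if self.mode == "RED" and target == "GREEN":
            target = "RECOVERY"    # RED may not jump straight to GREEN
        elif self.mode == "RECOVERY" and target == "GREEN":
            dwell_done = self._recovery_ticks >= self.recovery_dwell
            if not (kpis_normal and dwell_done):
                target = "RECOVERY"
        self._recovery_ticks = (
            self._recovery_ticks + 1 if target == "RECOVERY" else 0)
        self.mode = target
        return POLICIES[target]
```

Escalation is never delayed in this sketch; only the return to GREEN is gated, which matches the asymmetry the policy describes.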

5) Production KPIs

RBS (Route Bimodality Score): online bimodality intensity metric
PSR (Path Shift Rate): route/path identity changes per minute
SAF (Stale Arrival Fraction): arrivals beyond alpha-age budget
RLC (Route-Latency Cost): incremental slippage vs PATH_STABLE baseline
DCS (Deadline Convexity Slope): urgency cost steepness under current state

Useful alerts:

RBS↑ + SAF↑ for N minutes ⇒ escalate to RED
PSR spike during auction/open windows ⇒ restrict opportunistic policies
RLC above weekly p95 while base spread unchanged ⇒ likely transport-driven cost leak
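
Three of these KPIs and the first alert reduce to short calculations. The sketch below is illustrative only: function names are hypothetical, RLC is taken as a difference of mean slippage (a quantile difference would be an equally valid choice), and the "RBS↑ + SAF↑" alert is interpreted as both series staying above fixed thresholds for the last N one-minute samples, which is one reading among several.

```python
from statistics import mean
from typing import Sequence


def path_shift_rate(path_tags: Sequence[str], window_minutes: float) -> float:
    """PSR: path-identity changes per minute over time-ordered observations."""
    changes = sum(1 for a, b in zip(path_tags, path_tags[1:]) if a != b)
    return changes / window_minutes


def stale_arrival_fraction(arrival_ages_us: Sequence[float],
                           tau_alpha_us: float) -> float:
    """SAF: share of arrivals whose age exceeds the alpha-age budget."""
    if not arrival_ages_us:
        return 0.0
    stale = sum(1 for age in arrival_ages_us if age > tau_alpha_us)
    return stale / len(arrival_ages_us)


def route_latency_cost(slippage_bps_now: Sequence[float],
                       slippage_bps_stable: Sequence[float]) -> float:
    """RLC: mean slippage now minus the PATH_STABLE baseline mean."""
    return mean(slippage_bps_now) - mean(slippage_bps_stable)


def should_escalate_to_red(rbs_by_minute: Sequence[float],
                           saf_by_minute: Sequence[float],
                           rbs_threshold: float, saf_threshold: float,
                           n_minutes: int) -> bool:
    """True when RBS and SAF both exceeded their thresholds in each of the
    last `n_minutes` samples."""
    if len(rbs_by_minute) < n_minutes or len(saf_by_minute) < n_minutes:
        return False
    recent = zip(rbs_by_minute[-n_minutes:], saf_by_minute[-n_minutes:])
    return all(r > rbs_threshold and s > saf_threshold for r, s in recent)
```

Requiring both signals together keeps a pure latency wobble (RBS up, SAF flat) or a pure strategy-side slowdown (SAF up, RBS flat) from forcing RED on its own.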

6) Validation ladder

Historical relabeling: Join execution logs with path/route event markers; compare state-conditional slippage.
Counterfactual replay: Re-run decisions with AMBER/RED controls and estimate avoided tail cost.
Fault injection drills: Inject synthetic latency bimodality/retransmit bursts in staging.
Shadow then canary: Shadow state classifier first, then enable controls on small flow slice.

Primary success criterion: reduced q95/q99 slippage with bounded completion-risk degradation.
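
The historical-relabeling step is essentially an as-of join followed by a grouped quantile. The sketch below assumes hypothetical inputs: a sorted list of state-change timestamps with the state that takes effect at each one, and fills given as (decision timestamp, slippage in bps) pairs. The quantile is a simple upper empirical quantile, not an interpolated one.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def label_state(ts: float, change_times: Sequence[float],
                states: Sequence[str]) -> str:
    """Route state in effect at `ts`; states[i] applies from change_times[i].
    `change_times` must be sorted ascending."""
    i = bisect_right(change_times, ts) - 1
    return states[i] if i >= 0 else "UNKNOWN"


def slippage_quantile_by_state(fills: Sequence[Tuple[float, float]],
                               change_times: Sequence[float],
                               states: Sequence[str],
                               q: float = 0.95) -> Dict[str, float]:
    """Group slippage by the state known at decision time, then take the
    q-quantile within each group."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for decision_ts, slippage_bps in fills:
        state = label_state(decision_ts, change_times, states)
        groups[state].append(slippage_bps)
    out: Dict[str, float] = {}
    for state, xs in groups.items():
        xs.sort()
        out[state] = xs[min(len(xs) - 1, int(q * len(xs)))]
    return out


# Made-up example: one unstable episode between t=100 and t=200.
change_times = [0.0, 100.0, 200.0]
states = ["PATH_STABLE", "PATH_UNSTABLE", "PATH_STABLE"]
fills = [(10.0, 1.0), (50.0, 2.0), (150.0, 8.0), (160.0, 9.0), (250.0, 1.5)]
by_state = slippage_quantile_by_state(fills, change_times, states, q=0.5)
```

Labeling by decision timestamp rather than fill timestamp keeps the join point-in-time, so the comparison reflects what the controller could have known when it acted.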

7) 14-day implementation plan

Days 1-2
Instrument path-event + execution-age fields end-to-end.

Days 3-4
Implement RBS/PSR/SAF calculators and dashboards.

Days 5-7
Build initial state classifier with hysteresis.

Days 8-9
Train stale-arrival and deadline-convexity components.

Days 10-11
Integrate AMBER/RED policy guards in shadow mode.

Days 12-13
Canary enablement with strict rollback gates.

Day 14
Finalize incident runbook and postmortem template.

Common mistakes

Using mean RTT as primary health signal: Mean hides the second mode that actually drives tail slippage.
Immediate aggression ramp on partial recovery: Recovery without dwell causes repeated oscillation and cost spikes.
No separation between market toxicity and path toxicity: Different causes require different controls.
Retry storms without pacing: Naive retries convert transient path noise into self-inflicted burst impact.
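
For the retry-pacing point, a common remedy is capped exponential backoff with jitter and a hard retry budget. The sketch below is a generic illustration, not a venue-specific recommendation: paced_retry_delays is a hypothetical helper, and the base, factor, cap, and budget values would need to respect each venue's timeout and retry behavior.

```python
import random
from typing import List, Optional


def paced_retry_delays(base_ms: float, factor: float, cap_ms: float,
                       max_retries: int,
                       rng: Optional[random.Random] = None) -> List[float]:
    """Delays (ms) before each retry: capped exponential backoff with jitter.

    Retry k waits a random time in [0.5 * c, c], where
    c = min(cap_ms, base_ms * factor ** k). The jitter spreads retries from
    many child orders so they do not all fire in the same instant.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap_ms, base_ms * factor ** attempt)
        delays.append(rng.uniform(0.5 * ceiling, ceiling))
    return delays


# Example: at most 5 retries, starting near 10 ms and never above 100 ms.
delays = paced_retry_delays(base_ms=10.0, factor=2.0, cap_ms=100.0,
                            max_retries=5, rng=random.Random(0))
```

The fixed retry budget matters as much as the spacing: once it is exhausted the order should be handed back to the controller, which can decide under the current route state whether to resend at all.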

Bottom line

BGP/anycast path churn is not just “infra noise.”

In execution systems it behaves like a distinct slippage regime with:

stale-arrival risk,
retry/catch-up convexity,
and deadline-cost amplification.

Model it explicitly, gate policy by route state, and enforce conservative recovery transitions. That is how you stop network path chaos from quietly becoming basis-point leakage.
