BBR ProbeRTT Drain/Recovery Burst Slippage Playbook


Date: 2026-03-26
Category: research
Audience: quant execution operators running low-latency gateways over mixed WAN paths


Why this note

This note documents a hidden execution failure mode in production stacks that depend on TCP transport quality:

  1. Session is stable under BBR pacing.
  2. Congestion-control state enters a ProbeRTT/low-inflight refresh phase.
  3. Control-plane/data-plane delivery cadence briefly thins (or becomes lumpy).
  4. The order router sees stale/uneven acknowledgments, then over-corrects with catch-up bursts.
  5. Child-order timing dephases from liquidity and tail slippage jumps.

Most slippage models treat transport as a stationary latency distribution. BBR-style, model-based congestion control introduces stateful cadence regimes. This playbook models those regimes explicitly.


1) Cost decomposition with transport-state penalty

For residual inventory (Q_t):

[ \mathbb{E}[C_t] = \mathbb{E}[C_{base}\mid x_t] + \mathbb{E}[C_{impact}\mid a_t] + \underbrace{P(R_t=1\mid s_t)\cdot \mathbb{E}[C_{burst}\mid R_t]}_{\text{ProbeRTT/recovery burst tax}} ]

Where:

  - (C_{base}): baseline schedule cost given execution state (x_t)
  - (C_{impact}): market-impact cost of the chosen action (a_t)
  - (R_t): indicator that transport is in a ProbeRTT/recovery drain window
  - (P(R_t=1\mid s_t)): calibrated regime probability given transport state (s_t)
  - (C_{burst}): slippage from catch-up bursts, conditional on (R_t)

Operator intuition: mean latency can look fine while cadence distortion still destroys queue quality.
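The decomposition reads as a one-line scoring rule. A minimal sketch, with illustrative names and numbers that do not come from any real stack:

```python
def expected_cost(c_base, c_impact, p_regime, c_burst_tail):
    """E[C_t] = E[C_base | x_t] + E[C_impact | a_t] + P(R_t=1 | s_t) * E[C_burst | R_t].

    c_base       -- expected baseline cost given the schedule state x_t
    c_impact     -- expected impact cost given the action a_t
    p_regime     -- calibrated P(R_t = 1 | s_t) for a drain/recovery window
    c_burst_tail -- expected burst slippage conditional on R_t = 1
    """
    return c_base + c_impact + p_regime * c_burst_tail

# Mean latency can look fine while a 30% drain-risk window still levies a tax:
cost = expected_cost(c_base=1.0, c_impact=0.4, p_regime=0.3, c_burst_tail=2.5)
```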


2) Data contract (point-in-time only)

A) Transport state telemetry

B) Execution-path timing

C) Market context

If transport-state freshness is missing, hard-cap participation and disable aggressive catch-up logic.
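A minimal sketch of that freshness guardrail, assuming a hypothetical snapshot schema and a 500 ms staleness budget (both are placeholders to tune per venue):

```python
from dataclasses import dataclass

@dataclass
class TransportSnapshot:
    """Point-in-time transport-state telemetry (hypothetical field names)."""
    ts: float            # wall-clock capture time, seconds
    cc_state: str        # e.g. "probe_bw", "probe_rtt", "recovery"
    inflight_bytes: int
    ack_gap_ms: float    # gap between consecutive ACK arrivals

MAX_STALENESS_S = 0.5    # assumed freshness budget, not a measured value

def participation_cap(snap, now, normal_cap=0.10, safe_cap=0.02):
    """Hard-cap participation and disable catch-up when telemetry is stale."""
    stale = snap is None or (now - snap.ts) > MAX_STALENESS_S
    return {"cap": safe_cap if stale else normal_cap,
            "allow_catch_up": not stale}
```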


3) Modeling stack

A) Regime classifier: (P(R_t=1\mid s_t))

Train a calibrated classifier to detect drain/recovery-risk windows from:

Use reliability calibration (isotonic/Platt) because control decisions consume probabilities directly.
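A sketch of the calibration step using scikit-learn's `CalibratedClassifierCV`; the features here are synthetic stand-ins, where real inputs would be the transport telemetry above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV

rng = np.random.default_rng(0)

# Synthetic stand-ins for transport features (e.g. ACK-gap variance, inflight drop).
n = 2000
X = rng.normal(size=(n, 2))
# Drain/recovery label correlated with both features -- illustrative only.
logit = 1.5 * X[:, 0] + 0.8 * X[:, 1] - 0.5
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Isotonic calibration, so downstream control can consume P(R_t=1|s_t) directly.
clf = CalibratedClassifierCV(LogisticRegression(), method="isotonic", cv=3)
clf.fit(X, y)
p = clf.predict_proba(X)[:, 1]   # calibrated regime probabilities
```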

B) Recovery-horizon survival model

Estimate time until cadence re-normalization:

[ T_{recover} \sim h(t\mid z_t) ]

Key outputs:

This prevents waiting too long in degraded transport states when deadline convexity is rising.
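Under a simple constant-hazard (exponential) assumption the recovery horizon has a closed form; this is a sketch of the idea, not the production survival model:

```python
import math

def fit_exponential_recovery(durations, observed):
    """MLE hazard for time-to-cadence-renormalization under an exponential model.

    durations -- waiting time (seconds) per degraded-transport episode
    observed  -- 1 if recovery was seen, 0 if the episode was censored
    """
    events = sum(observed)
    exposure = sum(durations)
    return events / exposure          # constant hazard estimate

def recovery_quantile(lam, q):
    """Time by which a fraction q of episodes have re-normalized."""
    return -math.log(1.0 - q) / lam

# Illustrative episode data (seconds); the last episode is censored.
lam = fit_exponential_recovery([0.2, 0.5, 1.0, 0.3, 2.0], [1, 1, 1, 1, 0])
t90 = recovery_quantile(lam, 0.90)    # candidate deadline-protect switch point
```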

C) Burst-cost quantile model

Conditioned on (R_t), estimate q50/q90/q95 slippage for:

Quantile-first modeling is mandatory; burst damage is tail-dominant.
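Empirical conditional quantiles are a reasonable baseline before fitting a full quantile regressor; the data below is synthetic and only illustrates the tail asymmetry:

```python
import numpy as np

def burst_quantiles(slippage_bps, regime, qs=(0.50, 0.90, 0.95)):
    """Empirical slippage quantiles conditioned on the drain/recovery flag R_t."""
    s = np.asarray(slippage_bps, dtype=float)
    r = np.asarray(regime, dtype=bool)
    return {
        "stable": dict(zip(qs, np.quantile(s[~r], qs))),
        "drain": dict(zip(qs, np.quantile(s[r], qs))),
    }

rng = np.random.default_rng(1)
slip = np.concatenate([rng.normal(1.0, 0.5, 900),    # calm sessions
                       rng.normal(3.0, 2.0, 100)])   # burst-affected sessions
flag = np.concatenate([np.zeros(900), np.ones(100)])
q = burst_quantiles(slip, flag)
```

Note how the stable/drain gap widens from q50 to q95, which is why the mean hides most of the damage.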


4) Control policy (state machine)

  1. FLOW_STABLE — normal schedule
  2. FLOW_DRAIN_RISK — transport fragility detected, reduce aggressiveness
  3. FLOW_RECOVERY_LIMITED — bounded catch-up with hard burst caps
  4. SAFE_EXECUTION — deadline-protect mode, prioritize completion reliability

Action score:

[ Score(a)=\mathbb{E}[C\mid a] + \lambda_r P(R_t=1)\mathbb{E}[C_{burst}] + \lambda_d P(\text{unfinished at horizon}) ]
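A compact sketch of the four-state machine and the action score; all thresholds, dwell logic, and λ weights are illustrative assumptions (a real implementation would add dwell timers and hysteresis tuning):

```python
from enum import Enum

class FlowState(Enum):
    FLOW_STABLE = 1
    FLOW_DRAIN_RISK = 2
    FLOW_RECOVERY_LIMITED = 3
    SAFE_EXECUTION = 4

# Illustrative thresholds -- real values come from the calibrated models above.
P_ENTER, P_EXIT, DEADLINE_FRAC = 0.30, 0.10, 0.80

def next_state(state, p_drain, horizon_used):
    """One transition step; horizon_used is the fraction of the deadline spent."""
    if horizon_used >= DEADLINE_FRAC:
        return FlowState.SAFE_EXECUTION            # completion reliability first
    if state is FlowState.FLOW_STABLE and p_drain >= P_ENTER:
        return FlowState.FLOW_DRAIN_RISK
    if state is FlowState.FLOW_DRAIN_RISK and p_drain < P_EXIT:
        return FlowState.FLOW_RECOVERY_LIMITED     # bounded catch-up only
    if state is FlowState.FLOW_RECOVERY_LIMITED and p_drain < P_EXIT:
        return FlowState.FLOW_STABLE
    return state

def action_score(exp_cost, p_drain, exp_burst, p_unfinished,
                 lam_r=1.0, lam_d=5.0):
    """Score(a) from the equation above; lower is better."""
    return exp_cost + lam_r * p_drain * exp_burst + lam_d * p_unfinished
```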

Hard guardrails:


5) Production KPIs

Alerting rule: if TBE exceeds its threshold in two or more recent sessions, auto-tighten burst caps by one regime notch.
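The auto-tightening rule might look like the sketch below; the cap schedule and TBE history values are placeholders, not production numbers:

```python
def tighten_notch(cap_schedule, current_cap, tbe_history, threshold, window=5):
    """Tighten the burst cap one notch if TBE breached in 2+ recent sessions.

    cap_schedule -- ordered burst caps, loosest to tightest
    tbe_history  -- per-session TBE metric, most recent last
    """
    breaches = sum(1 for v in tbe_history[-window:] if v > threshold)
    if breaches >= 2:
        i = cap_schedule.index(current_cap)
        return cap_schedule[min(i + 1, len(cap_schedule) - 1)]
    return current_cap

caps = [0.08, 0.05, 0.03, 0.01]   # fraction of ADV per burst, illustrative
new_cap = tighten_notch(caps, 0.05, [0.1, 0.9, 0.8], threshold=0.5)
```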


6) Validation ladder

  1. Historical replay with strict point-in-time transport features
  2. Regime-sliced backtests (calm vs high-volatility vs event windows)
  3. Shadow deployment (score + intended action logs only)
  4. Canary capital with automatic rollback triggers

Anti-pattern: deriving regime labels from future ACK stabilization (look-ahead leakage).
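A backward as-of join is the standard way to enforce point-in-time transport features in replay; `pandas.merge_asof` does this directly (timestamps below are illustrative):

```python
import pandas as pd

orders = pd.DataFrame({
    "ts": pd.to_datetime(["2026-03-26 10:00:00.100",
                          "2026-03-26 10:00:00.900"]),
    "child_id": [1, 2],
})
transport = pd.DataFrame({
    "ts": pd.to_datetime(["2026-03-26 10:00:00.000",
                          "2026-03-26 10:00:00.500"]),
    "cc_state": ["probe_bw", "probe_rtt"],
})

# Backward as-of join: each decision sees only the latest transport snapshot
# at or before its own timestamp -- no future ACK-stabilization leakage.
replay = pd.merge_asof(orders, transport, on="ts", direction="backward")
```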


7) Two-week implementation plan

Days 1-3
Instrument transport-state snapshots and ACK-gap telemetry; define regime labels.

Days 4-6
Train/calibrate regime classifier + recovery survival model.

Days 7-9
Build burst-cost quantile model and compare fallback policies in replay.

Days 10-11
Implement FLOW_STABLE -> SAFE_EXECUTION state machine and hard guardrails.

Days 12-13
Run shadow mode; evaluate calibration, burst severity, and tail budget metrics.

Day 14
Canary rollout with strict rollback on q95 exceedance and completion-risk drift.


Common mistakes

  1. Monitoring only average RTT
    Cadence instability (ACK timing shape) is often more damaging than mean RTT moves.

  2. Unbounded catch-up after transient transport stalls
    Recovery bursts can create self-inflicted impact spikes.

  3. Ignoring deadline convexity in degraded transport states
    Waiting for perfect recovery can force toxic late urgency.

  4. No explicit transport-aware execution regimes
    Without state transitions, fallback behavior becomes ad-hoc and non-auditable.


Bottom line

BBR-style congestion control is not just a network detail; it can become a first-class slippage state variable.

Modeling drain/recovery regimes explicitly gives you calibrated transport-risk probabilities the router can act on, bounded catch-up instead of self-inflicted impact spikes, and auditable transport-aware fallback states instead of ad-hoc behavior.

