BBR ProbeRTT Drain/Recovery Burst Slippage Playbook
Date: 2026-03-26
Category: research
Audience: quant execution operators running low-latency gateways over mixed WAN paths
Why this note
This note covers a hidden execution failure mode in production stacks that depend on TCP transport quality:
- Session is stable under BBR pacing.
- Congestion-control state enters a ProbeRTT/low-inflight refresh phase.
- Control-plane/data-plane delivery cadence briefly thins (or becomes lumpy).
- The order router sees stale or uneven acknowledgments, then over-corrects with catch-up bursts.
- Child-order timing dephases from liquidity, and tail slippage jumps.
Most slippage models treat transport as a stationary latency distribution. BBR-style model-based control introduces stateful cadence regimes. This playbook models that regime explicitly.
1) Cost decomposition with transport-state penalty
For residual inventory (Q_t):
[ \mathbb{E}[C_t] = \mathbb{E}[C_{base}\mid x_t] + \mathbb{E}[C_{impact}\mid a_t] + \underbrace{P(R_t=1\mid s_t)\cdot \mathbb{E}[C_{burst}\mid R_t]}_{\text{ProbeRTT/recovery burst tax}} ]
Where:
- (R_t=1): transport in drain/recovery-risk regime
- (C_{burst}): extra slippage from delayed-then-bunched child scheduling
- (s_t): transport + microstructure + urgency state
Operator intuition: mean latency can look fine while cadence distortion still destroys queue quality.
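As a worked illustration of the decomposition, the sketch below plugs in made-up numbers (all in basis points; none come from production data). The function name and inputs are hypothetical. It shows how the burst tax can more than double expected cost even when the base and impact terms do not move.

```python
def expected_cost_bps(base_bps, impact_bps, p_regime, burst_bps):
    """E[C_t] = E[C_base] + E[C_impact] + P(R_t=1 | s_t) * E[C_burst | R_t]."""
    if not 0.0 <= p_regime <= 1.0:
        raise ValueError("p_regime must be a probability")
    return base_bps + impact_bps + p_regime * burst_bps

# Illustrative inputs only: same base and impact cost, different regime risk.
calm = expected_cost_bps(base_bps=1.0, impact_bps=2.0, p_regime=0.02, burst_bps=10.0)
stressed = expected_cost_bps(base_bps=1.0, impact_bps=2.0, p_regime=0.40, burst_bps=10.0)
print(f"calm={calm:.2f} bps, stressed={stressed:.2f} bps")
```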
2) Data contract (point-in-time only)
A) Transport state telemetry
- Per-socket congestion algorithm and state snapshots (ss -ti, tcp_info where available)
- RTT (min/smoothed), delivery-rate samples, retrans/reorder indicators
- Inflight/cwnd evolution and pacing-rate shifts
- ACK inter-arrival jitter (short rolling windows)
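One possible way to track ACK inter-arrival jitter over a short rolling window is sketched below. The class name, the window length, and the choice of coefficient of variation as the jitter statistic are assumptions for illustration; timestamps are taken to be integer microseconds from whatever capture point the stack already has.

```python
from collections import deque
from statistics import mean, pstdev

class AckGapJitter:
    """Rolling coefficient of variation of ACK inter-arrival gaps."""

    def __init__(self, window=32):
        self.gaps = deque(maxlen=window)  # most recent gaps only
        self.last_ts = None

    def observe(self, ack_ts_us):
        # Record the gap to the previous ACK; the first ACK has no gap.
        if self.last_ts is not None:
            self.gaps.append(ack_ts_us - self.last_ts)
        self.last_ts = ack_ts_us

    def cv(self):
        # Returns None until at least two gaps exist or if the mean gap is zero.
        if len(self.gaps) < 2:
            return None
        m = mean(self.gaps)
        return pstdev(self.gaps) / m if m > 0 else None

steady = AckGapJitter()
for ts in range(0, 10_000, 1_000):          # evenly spaced ACKs
    steady.observe(ts)

lumpy = AckGapJitter()
ts = 0
for gap in [100, 100, 100, 5_000] * 2:      # thin stretch followed by a bunch
    ts += gap
    lumpy.observe(ts)
```

Both streams can have an unremarkable mean gap; the coefficient of variation is what separates them.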
B) Execution-path timing
- Decision -> send timestamp
- Send -> venue ACK timestamp
- ACK -> next-child dispatch gap
- Retry bursts and fallback-switch timestamps
C) Market context
- Touch depth resiliency and spread state
- Microprice drift during transport disturbances
- Session phase (open/close/news windows)
If transport-state freshness is missing, hard-cap participation and disable aggressive catch-up logic.
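A minimal sketch of that freshness guard follows. The `Limits` type, the 0.5 s staleness bound, and the 2% participation cap are hypothetical placeholders, not recommended values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Limits:
    max_participation: float  # fraction of market volume
    allow_catch_up: bool

def apply_freshness_guard(limits, snapshot_age_s, max_age_s=0.5, capped_participation=0.02):
    """Hard-cap participation and disable catch-up when transport state is missing or stale."""
    if snapshot_age_s is None or snapshot_age_s > max_age_s:
        return Limits(min(limits.max_participation, capped_participation), False)
    return limits

normal = Limits(max_participation=0.10, allow_catch_up=True)
fresh = apply_freshness_guard(normal, snapshot_age_s=0.1)
stale = apply_freshness_guard(normal, snapshot_age_s=3.0)
missing = apply_freshness_guard(normal, snapshot_age_s=None)
```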
3) Modeling stack
A) Regime classifier: (P(R_t=1\mid s_t))
Train a calibrated classifier to detect drain/recovery-risk windows from:
- inflight contraction signatures
- ACK-gap instability
- pacing-rate oscillation
- dispatch-gap expansion
Use reliability calibration (isotonic/Platt) because control decisions consume probabilities directly.
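To make the calibration step concrete, here is a small pure-Python sketch of isotonic calibration using the pool-adjacent-violators algorithm. It maps raw classifier scores to monotone probabilities as a step function, with no interpolation between steps. The function names and the toy scores/labels are illustrative assumptions; a production stack would more likely use a library implementation.

```python
from collections import defaultdict

def fit_isotonic(scores, labels):
    """Pool-adjacent-violators fit.

    Returns [(upper_score, probability), ...] sorted by score, with
    probabilities non-decreasing in score.
    """
    grouped = defaultdict(lambda: [0.0, 0])  # score -> [positives, count]
    for s, y in zip(scores, labels):
        grouped[s][0] += y
        grouped[s][1] += 1
    blocks = []  # each block: [positives, count, highest score in block]
    for s in sorted(grouped):
        pos, cnt = grouped[s]
        blocks.append([pos, cnt, s])
        # Merge backwards while the previous block's mean exceeds this one's.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            pos, cnt, hi = blocks.pop()
            blocks[-1][0] += pos
            blocks[-1][1] += cnt
            blocks[-1][2] = hi
    return [(hi, pos / cnt) for pos, cnt, hi in blocks]

def calibrate(model, score):
    """Look up the calibrated probability for a raw score."""
    for hi, p in model:
        if score <= hi:
            return p
    return model[-1][1]  # above the highest training score

raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
regime_labels = [0, 0, 1, 0, 0, 1, 1, 1]   # toy realized R_t values
model = fit_isotonic(raw_scores, regime_labels)
```

The middle scores (0.3 to 0.5) are pooled into one block because their labels violate monotonicity, so they share one calibrated probability.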
B) Recovery-horizon survival model
Estimate time until cadence re-normalization:
[ T_{recover} \sim h(t\mid z_t) ]
Key outputs:
- expected recovery time
- (P(T_{recover} > H)) for remaining execution horizon (H)
This prevents waiting too long in degraded transport states when deadline convexity is rising.
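As a simple stand-in for the survival model, the sketch below uses a Kaplan-Meier estimator on past recovery durations. This is a swapped-in nonparametric technique, not the conditional hazard h(t | z_t) itself; one option is to fit it separately per stratum of z_t. Durations and the censoring flags are toy values, and the function names are hypothetical.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct recovery time.

    observed[i] is False when episode i was censored (for example, the
    parent order finished before cadence re-normalized).
    """
    events = sorted(zip(durations, observed))
    at_risk = len(events)
    surv = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        recovered = 0
        removed = 0
        # Consume all episodes that end at the same time t.
        while i < len(events) and events[i][0] == t:
            recovered += 1 if events[i][1] else 0
            removed += 1
            i += 1
        if recovered:
            surv *= 1.0 - recovered / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve

def prob_exceeds(curve, horizon):
    """P(T_recover > horizon): survival value at the last event time <= horizon."""
    s = 1.0
    for t, v in curve:
        if t > horizon:
            break
        s = v
    return s

durations_ms = [100, 150, 150, 200, 300, 400]
was_observed = [True, True, True, False, True, True]
curve = kaplan_meier(durations_ms, was_observed)
p_still_degraded = prob_exceeds(curve, horizon=250)
```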
C) Burst-cost quantile model
Conditioned on (R_t), estimate q50/q90/q95 slippage for:
- keep schedule
- bounded catch-up
- immediate safe fallback
Quantile-first modeling is mandatory; burst damage is tail-dominant.
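The comparison across the three actions can be sketched with empirical nearest-rank quantiles, as below. This is a deliberately simple substitute for a fitted quantile model; the slippage samples (in bps) are invented, and the rule of ranking by q95 and then q50 is an assumption for illustration.

```python
import math

def quantile(samples, q):
    """Nearest-rank empirical quantile for 0 < q <= 1."""
    xs = sorted(samples)
    rank = max(1, math.ceil(q * len(xs)))
    return xs[rank - 1]

# Toy slippage samples (bps) observed in drain/recovery windows, by action.
slippage_by_action = {
    "keep_schedule":    [1, 1, 2, 2, 2, 3, 3, 4, 9, 15],
    "bounded_catch_up": [2, 2, 2, 3, 3, 3, 4, 4, 5, 6],
    "safe_fallback":    [4, 4, 4, 4, 5, 5, 5, 5, 6, 6],
}

summary = {
    action: {q: quantile(xs, q) for q in (0.50, 0.90, 0.95)}
    for action, xs in slippage_by_action.items()
}

# Rank by tail first (q95), then by median as a tie-break.
preferred = min(summary, key=lambda a: (summary[a][0.95], summary[a][0.50]))
```

In this toy data, keeping the schedule has the best median but the worst tail, which is the pattern a mean-based comparison would hide.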
4) Control policy (state machine)
- FLOW_STABLE — normal schedule
- FLOW_DRAIN_RISK — transport fragility detected, reduce aggressiveness
- FLOW_RECOVERY_LIMITED — bounded catch-up with hard burst caps
- SAFE_EXECUTION — deadline-protect mode, prioritize completion reliability
Action score:
[ Score(a)=\mathbb{E}[C\mid a] + \lambda_r P(R_t=1)\mathbb{E}[C_{burst}] + \lambda_d P(unfinished\ at\ horizon) ]
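A minimal sketch of the score in use, assuming the burst cost and the unfinished probability are estimated per action (the formula above leaves that implicit). The action table, the penalty weights, and all numbers are hypothetical.

```python
def action_score(exp_cost, p_regime, exp_burst, p_unfinished, lam_r, lam_d):
    """Score(a) = E[C|a] + lam_r * P(R_t=1) * E[C_burst] + lam_d * P(unfinished)."""
    return exp_cost + lam_r * p_regime * exp_burst + lam_d * p_unfinished

# action -> (E[C | a], E[C_burst] under a, P(unfinished at horizon) under a)
ACTIONS = {
    "keep_schedule":    (3.0, 10.0, 0.02),
    "bounded_catch_up": (3.5, 4.0, 0.02),
    "safe_fallback":    (5.0, 1.0, 0.01),
}

def best_action(p_regime, lam_r=1.0, lam_d=40.0):
    """Pick the action with the lowest score for the current regime probability."""
    return min(
        ACTIONS,
        key=lambda a: action_score(ACTIONS[a][0], p_regime, ACTIONS[a][1],
                                   ACTIONS[a][2], lam_r, lam_d),
    )

calm_choice = best_action(p_regime=0.0)
stressed_choice = best_action(p_regime=0.4)
```

With these inputs the preferred action flips as regime risk rises, which is the behavior the penalty terms are meant to produce.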
Hard guardrails:
- max child-size multiplier during recovery
- max dispatch acceleration per time bucket
- q95 slippage budget breaker
- automatic rollback to SAFE_EXECUTION on repeated tail breaches
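The four states and two of the guardrails could be wired together roughly as follows. The transition thresholds (with hysteresis between entry and clearing), the breach limit, and the size multipliers are all hypothetical; the transition rules are one plausible reading of the state list, not a specification.

```python
from enum import Enum

class Regime(Enum):
    FLOW_STABLE = 0
    FLOW_DRAIN_RISK = 1
    FLOW_RECOVERY_LIMITED = 2
    SAFE_EXECUTION = 3

# Hypothetical max child-size multiplier per regime.
MAX_SIZE_MULT = {
    Regime.FLOW_STABLE: 2.0,
    Regime.FLOW_DRAIN_RISK: 1.0,
    Regime.FLOW_RECOVERY_LIMITED: 1.25,
    Regime.SAFE_EXECUTION: 1.0,
}

def next_regime(current, p_regime, tail_breaches,
                enter_at=0.6, clear_below=0.3, max_breaches=2):
    """One transition step driven by regime probability and recent q95 breaches."""
    if tail_breaches >= max_breaches:
        return Regime.SAFE_EXECUTION          # automatic rollback
    if current is Regime.SAFE_EXECUTION:
        # Sticky: leave only once risk has clearly receded.
        return Regime.FLOW_STABLE if p_regime < clear_below else current
    if p_regime >= enter_at:
        return Regime.FLOW_DRAIN_RISK
    degraded = (Regime.FLOW_DRAIN_RISK, Regime.FLOW_RECOVERY_LIMITED)
    if current in degraded and p_regime >= clear_below:
        return Regime.FLOW_RECOVERY_LIMITED   # bounded catch-up only
    return Regime.FLOW_STABLE

def capped_child_size(base_size, requested_size, regime):
    """Clamp a requested child size to the regime's multiplier of the base size."""
    return min(requested_size, base_size * MAX_SIZE_MULT[regime])
```

Keeping the transition function pure (state in, state out) makes each decision easy to log and replay.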
5) Production KPIs
- PRH (Probe/Recovery Hazard calibration): predicted vs realized risk reliability
- DG95 (Dispatch-Gap q95): q95 child dispatch gap during stressed windows
- RBS (Recovery Burst Severity): peak catch-up burst / baseline child cadence
- TCD (Transport Cadence Drift): normalized ACK-gap variance shift
- TBE (Tail Budget Exceedance): realized slippage beyond predicted q95 budget
Alerting rule: if TBE exceeds its threshold in two or more recent sessions, automatically tighten burst caps by one regime notch.
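Three of these KPIs and the alerting rule are easy to sketch. The definitions below are assumptions where the text leaves room: TBE is measured here as the fraction of fills whose realized slippage exceeded the predicted q95, and the 10% session threshold is a placeholder (a well-calibrated q95 would be exceeded about 5% of the time).

```python
import math

def dispatch_gap_q95(gaps_ms):
    """DG95: nearest-rank q95 of child dispatch gaps in a stressed window."""
    xs = sorted(gaps_ms)
    return xs[max(1, math.ceil(0.95 * len(xs))) - 1]

def recovery_burst_severity(peak_children_per_s, baseline_children_per_s):
    """RBS: peak catch-up dispatch rate relative to baseline cadence."""
    return peak_children_per_s / baseline_children_per_s

def tail_budget_exceedance(realized_bps, predicted_q95_bps):
    """TBE (assumed definition): share of fills that breached their q95 budget."""
    breaches = sum(r > p for r, p in zip(realized_bps, predicted_q95_bps))
    return breaches / len(realized_bps)

def should_tighten(tbe_by_session, threshold=0.10, min_sessions=2):
    """Alerting rule: TBE above threshold in at least min_sessions recent sessions."""
    return sum(t > threshold for t in tbe_by_session) >= min_sessions

dg95 = dispatch_gap_q95([5, 6, 6, 7, 7, 8, 9, 12, 40, 95])
rbs = recovery_burst_severity(peak_children_per_s=12.0, baseline_children_per_s=4.0)
tbe = tail_budget_exceedance([1.0, 2.0, 3.0, 10.0], [5.0, 5.0, 5.0, 5.0])
```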
6) Validation ladder
- Historical replay with strict point-in-time transport features
- Regime-sliced backtests (calm vs high-volatility vs event windows)
- Shadow deployment (score + intended action logs only)
- Canary capital with automatic rollback triggers
Anti-pattern: deriving regime labels from future ACK stabilization (look-ahead leakage).
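One way to keep replay features strictly point-in-time is an as-of lookup that can only return a transport snapshot captured at or before the decision timestamp. The helper below is a hypothetical sketch using the standard-library `bisect` module; snapshot timestamps are assumed to be sorted ascending.

```python
from bisect import bisect_right

def as_of(snapshot_ts, snapshot_vals, decision_ts):
    """Latest snapshot value with timestamp <= decision_ts, or None if none exists."""
    i = bisect_right(snapshot_ts, decision_ts)
    return None if i == 0 else snapshot_vals[i - 1]

# Toy snapshots: (timestamp in ms, smoothed RTT in ms).
ts = [10, 20, 30]
srtt = [1.2, 4.8, 1.3]

seen_at_25 = as_of(ts, srtt, decision_ts=25)   # uses the t=20 snapshot
seen_at_5 = as_of(ts, srtt, decision_ts=5)     # nothing known yet
seen_at_30 = as_of(ts, srtt, decision_ts=30)   # snapshot at exactly t counts
```

The same rule applies to labels: a regime label for time t should be computable from data available at t, not from when ACK cadence later stabilized.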
7) Two-week implementation plan
Days 1-3
Instrument transport-state snapshots and ACK-gap telemetry; define regime labels.
Days 4-6
Train/calibrate regime classifier + recovery survival model.
Days 7-9
Build burst-cost quantile model and compare fallback policies in replay.
Days 10-11
Implement the four-state machine (FLOW_STABLE through SAFE_EXECUTION) and hard guardrails.
Days 12-13
Run shadow mode; evaluate calibration, burst severity, and tail budget metrics.
Day 14
Canary rollout with strict rollback on q95 exceedance and completion-risk drift.
Common mistakes
- Monitoring only average RTT: cadence instability (ACK timing shape) is often more damaging than mean RTT moves.
- Unbounded catch-up after transient transport stalls: recovery bursts can create self-inflicted impact spikes.
- Ignoring deadline convexity in degraded transport states: waiting for perfect recovery can force toxic late urgency.
- No explicit transport-aware execution regimes: without state transitions, fallback behavior becomes ad-hoc and non-auditable.
Bottom line
BBR-style congestion control is not just a network detail; it can become a first-class slippage state variable.
Modeling drain/recovery regimes explicitly gives:
- earlier detection of cadence risk,
- bounded recovery behavior,
- auditable tail-risk controls.
References
- Cardwell et al., BBR Congestion Control (IETF Internet-Draft)
  https://datatracker.ietf.org/doc/html/draft-cardwell-iccrg-bbr-congestion-control
- Cardwell et al., BBR: Congestion-Based Congestion Control (ACM Queue)
  https://queue.acm.org/detail.cfm?id=3022184
- Linux kernel tcp_bbr.c (implementation notes and state behavior)
  https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_bbr.c
- Perold, A. F. (1988), The Implementation Shortfall: Paper versus Reality
  https://www.hbs.edu/faculty/Pages/item.aspx?num=2083
- Almgren, R., Chriss, N. (2000), Optimal Execution of Portfolio Transactions
  https://www.smallake.kr/wp-content/uploads/2016/03/optliq.pdf