BBR ProbeRTT Drain/Recovery Burst Slippage Playbook


Date: 2026-03-26
Category: research
Audience: quant execution operators running low-latency gateways over mixed WAN paths


Why this note

This note documents a hidden execution failure mode in production stacks that depend on TCP transport quality:

  1. Session is stable under BBR pacing.
  2. Congestion-control state enters a ProbeRTT/low-inflight refresh phase.
  3. Control-plane/data-plane delivery cadence briefly thins (or becomes lumpy).
  4. The order router sees stale/uneven acknowledgments, then over-corrects with catch-up bursts.
  5. Child-order timing dephases from liquidity and tail slippage jumps.

Most slippage models treat transport as a stationary latency distribution. BBR-style, model-based congestion control introduces stateful cadence regimes. This playbook models those regimes explicitly.


1) Cost decomposition with transport-state penalty

For residual inventory (Q_t):

[ \mathbb{E}[C_t] = \mathbb{E}[C_{base}\mid x_t] + \mathbb{E}[C_{impact}\mid a_t] + \underbrace{P(R_t=1\mid s_t)\cdot \mathbb{E}[C_{burst}\mid R_t]}_{\text{ProbeRTT/recovery burst tax}} ]

Where:

  - (C_{base}): baseline schedule cost given execution state (x_t)
  - (C_{impact}): market-impact cost of the chosen action (a_t)
  - (R_t): indicator that transport is in a ProbeRTT/recovery drain window
  - (P(R_t=1\mid s_t)): calibrated regime probability given transport state (s_t)
  - (C_{burst}): slippage from catch-up bursts, conditional on (R_t)

Operator intuition: mean latency can look fine while cadence distortion still destroys queue quality.
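The decomposition reads as a one-line scoring rule. A minimal sketch, with illustrative names and numbers that do not come from any real stack:

```python
def expected_cost(c_base, c_impact, p_regime, c_burst_tail):
    """E[C_t] = E[C_base | x_t] + E[C_impact | a_t] + P(R_t=1 | s_t) * E[C_burst | R_t].

    c_base       -- expected baseline cost given the schedule state x_t
    c_impact     -- expected impact cost given the action a_t
    p_regime     -- calibrated P(R_t = 1 | s_t) for a drain/recovery window
    c_burst_tail -- expected burst slippage conditional on R_t = 1
    """
    return c_base + c_impact + p_regime * c_burst_tail

# Mean latency can look fine while a 30% drain-risk window still levies a tax:
cost = expected_cost(c_base=1.0, c_impact=0.4, p_regime=0.3, c_burst_tail=2.5)
```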


2) Data contract (point-in-time only)

A) Transport state telemetry

B) Execution-path timing

C) Market context

If transport-state freshness is missing, hard-cap participation and disable aggressive catch-up logic.
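A minimal sketch of that freshness guardrail, assuming a hypothetical snapshot schema and a 500 ms staleness budget (both are placeholders to tune per venue):

```python
from dataclasses import dataclass

@dataclass
class TransportSnapshot:
    """Point-in-time transport-state telemetry (hypothetical field names)."""
    ts: float            # wall-clock capture time, seconds
    cc_state: str        # e.g. "probe_bw", "probe_rtt", "recovery"
    inflight_bytes: int
    ack_gap_ms: float    # gap between consecutive ACK arrivals

MAX_STALENESS_S = 0.5    # assumed freshness budget, not a measured value

def participation_cap(snap, now, normal_cap=0.10, safe_cap=0.02):
    """Hard-cap participation and disable catch-up when telemetry is stale."""
    stale = snap is None or (now - snap.ts) > MAX_STALENESS_S
    return {"cap": safe_cap if stale else normal_cap,
            "allow_catch_up": not stale}
```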


3) Modeling stack

A) Regime classifier: (P(R_t=1\mid s_t))

Train a calibrated classifier to detect drain/recovery-risk windows from:

Use reliability calibration (isotonic/Platt) because control decisions consume probabilities directly.
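A sketch of the calibration step using scikit-learn's `CalibratedClassifierCV`; the features here are synthetic stand-ins, where real inputs would be the transport telemetry above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV

rng = np.random.default_rng(0)

# Synthetic stand-ins for transport features (e.g. ACK-gap variance, inflight drop).
n = 2000
X = rng.normal(size=(n, 2))
# Drain/recovery label correlated with both features -- illustrative only.
logit = 1.5 * X[:, 0] + 0.8 * X[:, 1] - 0.5
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Isotonic calibration, so downstream control can consume P(R_t=1|s_t) directly.
clf = CalibratedClassifierCV(LogisticRegression(), method="isotonic", cv=3)
clf.fit(X, y)
p = clf.predict_proba(X)[:, 1]   # calibrated regime probabilities
```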

B) Recovery-horizon survival model

Estimate time until cadence re-normalization:

[ T_{recover} \sim h(t\mid z_t) ]

Key outputs:

This prevents waiting too long in degraded transport states when deadline convexity is rising.
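Under a simple constant-hazard (exponential) assumption the recovery horizon has a closed form; this is a sketch of the idea, not the production survival model:

```python
import math

def fit_exponential_recovery(durations, observed):
    """MLE hazard for time-to-cadence-renormalization under an exponential model.

    durations -- waiting time (seconds) per degraded-transport episode
    observed  -- 1 if recovery was seen, 0 if the episode was censored
    """
    events = sum(observed)
    exposure = sum(durations)
    return events / exposure          # constant hazard estimate

def recovery_quantile(lam, q):
    """Time by which a fraction q of episodes have re-normalized."""
    return -math.log(1.0 - q) / lam

# Illustrative episode data (seconds); the last episode is censored.
lam = fit_exponential_recovery([0.2, 0.5, 1.0, 0.3, 2.0], [1, 1, 1, 1, 0])
t90 = recovery_quantile(lam, 0.90)    # candidate deadline-protect switch point
```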

C) Burst-cost quantile model

Conditioned on (R_t), estimate q50/q90/q95 slippage for:

Quantile-first modeling is mandatory; burst damage is tail-dominant.
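Empirical conditional quantiles are a reasonable baseline before fitting a full quantile regressor; the data below is synthetic and only illustrates the tail asymmetry:

```python
import numpy as np

def burst_quantiles(slippage_bps, regime, qs=(0.50, 0.90, 0.95)):
    """Empirical slippage quantiles conditioned on the drain/recovery flag R_t."""
    s = np.asarray(slippage_bps, dtype=float)
    r = np.asarray(regime, dtype=bool)
    return {
        "stable": dict(zip(qs, np.quantile(s[~r], qs))),
        "drain": dict(zip(qs, np.quantile(s[r], qs))),
    }

rng = np.random.default_rng(1)
slip = np.concatenate([rng.normal(1.0, 0.5, 900),    # calm sessions
                       rng.normal(3.0, 2.0, 100)])   # burst-affected sessions
flag = np.concatenate([np.zeros(900), np.ones(100)])
q = burst_quantiles(slip, flag)
```

Note how the stable/drain gap widens from q50 to q95, which is why the mean hides most of the damage.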


4) Control policy (state machine)

  1. FLOW_STABLE — normal schedule
  2. FLOW_DRAIN_RISK — transport fragility detected, reduce aggressiveness
  3. FLOW_RECOVERY_LIMITED — bounded catch-up with hard burst caps
  4. SAFE_EXECUTION — deadline-protect mode, prioritize completion reliability

Action score:

[ Score(a)=\mathbb{E}[C\mid a] + \lambda_r P(R_t=1)\mathbb{E}[C_{burst}] + \lambda_d P(\text{unfinished at horizon}) ]
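A compact sketch of the four-state machine and the action score; all thresholds, dwell logic, and λ weights are illustrative assumptions (a real implementation would add dwell timers and hysteresis tuning):

```python
from enum import Enum

class FlowState(Enum):
    FLOW_STABLE = 1
    FLOW_DRAIN_RISK = 2
    FLOW_RECOVERY_LIMITED = 3
    SAFE_EXECUTION = 4

# Illustrative thresholds -- real values come from the calibrated models above.
P_ENTER, P_EXIT, DEADLINE_FRAC = 0.30, 0.10, 0.80

def next_state(state, p_drain, horizon_used):
    """One transition step; horizon_used is the fraction of the deadline spent."""
    if horizon_used >= DEADLINE_FRAC:
        return FlowState.SAFE_EXECUTION            # completion reliability first
    if state is FlowState.FLOW_STABLE and p_drain >= P_ENTER:
        return FlowState.FLOW_DRAIN_RISK
    if state is FlowState.FLOW_DRAIN_RISK and p_drain < P_EXIT:
        return FlowState.FLOW_RECOVERY_LIMITED     # bounded catch-up only
    if state is FlowState.FLOW_RECOVERY_LIMITED and p_drain < P_EXIT:
        return FlowState.FLOW_STABLE
    return state

def action_score(exp_cost, p_drain, exp_burst, p_unfinished,
                 lam_r=1.0, lam_d=5.0):
    """Score(a) from the equation above; lower is better."""
    return exp_cost + lam_r * p_drain * exp_burst + lam_d * p_unfinished
```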

Hard guardrails:


5) Production KPIs

Alerting rule: if TBE exceeds its threshold in two or more recent sessions, auto-tighten burst caps by one regime notch.
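The auto-tightening rule might look like the sketch below; the cap schedule and TBE history values are placeholders, not production numbers:

```python
def tighten_notch(cap_schedule, current_cap, tbe_history, threshold, window=5):
    """Tighten the burst cap one notch if TBE breached in 2+ recent sessions.

    cap_schedule -- ordered burst caps, loosest to tightest
    tbe_history  -- per-session TBE metric, most recent last
    """
    breaches = sum(1 for v in tbe_history[-window:] if v > threshold)
    if breaches >= 2:
        i = cap_schedule.index(current_cap)
        return cap_schedule[min(i + 1, len(cap_schedule) - 1)]
    return current_cap

caps = [0.08, 0.05, 0.03, 0.01]   # fraction of ADV per burst, illustrative
new_cap = tighten_notch(caps, 0.05, [0.1, 0.9, 0.8], threshold=0.5)
```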


6) Validation ladder

  1. Historical replay with strict point-in-time transport features
  2. Regime-sliced backtests (calm vs high-volatility vs event windows)
  3. Shadow deployment (score + intended action logs only)
  4. Canary capital with automatic rollback triggers

Anti-pattern: deriving regime labels from future ACK stabilization (look-ahead leakage).
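A backward as-of join is the standard way to enforce point-in-time transport features in replay; `pandas.merge_asof` does this directly (timestamps below are illustrative):

```python
import pandas as pd

orders = pd.DataFrame({
    "ts": pd.to_datetime(["2026-03-26 10:00:00.100",
                          "2026-03-26 10:00:00.900"]),
    "child_id": [1, 2],
})
transport = pd.DataFrame({
    "ts": pd.to_datetime(["2026-03-26 10:00:00.000",
                          "2026-03-26 10:00:00.500"]),
    "cc_state": ["probe_bw", "probe_rtt"],
})

# Backward as-of join: each decision sees only the latest transport snapshot
# at or before its own timestamp -- no future ACK-stabilization leakage.
replay = pd.merge_asof(orders, transport, on="ts", direction="backward")
```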


7) Two-week implementation plan

Days 1-3
Instrument transport-state snapshots and ACK-gap telemetry; define regime labels.

Days 4-6
Train/calibrate regime classifier + recovery survival model.

Days 7-9
Build burst-cost quantile model and compare fallback policies in replay.

Days 10-11
Implement FLOW_STABLE -> SAFE_EXECUTION state machine and hard guardrails.

Days 12-13
Run shadow mode; evaluate calibration, burst severity, and tail budget metrics.

Day 14
Canary rollout with strict rollback on q95 exceedance and completion-risk drift.


Common mistakes

  1. Monitoring only average RTT
    Cadence instability (ACK timing shape) is often more damaging than mean RTT moves.

  2. Unbounded catch-up after transient transport stalls
    Recovery bursts can create self-inflicted impact spikes.

  3. Ignoring deadline convexity in degraded transport states
    Waiting for perfect recovery can force toxic late urgency.

  4. No explicit transport-aware execution regimes
    Without state transitions, fallback behavior becomes ad-hoc and non-auditable.


Bottom line

BBR-style congestion control is not just a network detail; it can become a first-class slippage state variable.

Modeling drain/recovery regimes explicitly gives you calibrated transport-risk probabilities the router can act on, bounded catch-up instead of self-inflicted impact spikes, and auditable transport-aware fallback states instead of ad-hoc behavior.

