QUIC PTO Probe-Burst Recovery Slippage Playbook

Date: 2026-03-23
Category: research
Scope: How QUIC PTO/loss-recovery dynamics create decision→wire cadence distortion and tail slippage in low-latency execution paths

Why this matters

Many execution teams moved market gateways from TCP/TLS stacks to QUIC for better handshake behavior, cleaner user-space control, and migration resilience.

But QUIC introduces a distinct failure mode under stress:

ACK delay inflation or ACK sparsity,
PTO (Probe Timeout) firing,
probe packets + retransmission attempts clustering,
temporary under-send followed by catch-up bursts,
urgency escalation near schedule deadlines.

Result: slippage tails widen even when median latency looks fine.

Failure mechanism (operator timeline)

Child-order stream is paced smoothly under normal RTT/ACK cadence.
Path jitter, queueing, or receiver ACK policy shifts observed ACK timing.
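No ACK arrives for in-flight ack-eliciting packets before the probe timer expires, so PTO fires; this signals delayed or lost progress but does not by itself mark packets as lost.
Sender emits one or two ack-eliciting probe packets to re-establish the ACK clock, and backs off the PTO interval on repeated expiry.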
Strategy side sees temporary delivery stall, then clustered completions.
Schedule deficit accumulates; controller increases aggression to “catch up.”
Re-entry bursts hit thinner queue states and pay convex impact + queue-reset tax.

Key point: this is not pure market microstructure drift; it is transport-recovery coupling leaking into execution control.

Extend slippage decomposition with QUIC recovery term

[ IS = IS_{market} + IS_{impact} + IS_{timing} + IS_{fees} + \underbrace{IS_{quic}}_{\text{PTO/recovery cadence tax}} ]

Practical approximation:

[ IS_{quic,t} \approx a\cdot PTOF_t + b\cdot PRB_t + c\cdot ADR_t + d\cdot RBR_t + e\cdot DPE_t ]

Where a through e are coefficients fitted on historical data, and:

(PTOF): PTO fire rate,
(PRB): probe/recovery burst intensity,
(ADR): ACK-delay distortion ratio,
(RBR): recovery burst rebound,
(DPE): dispatch phase error.
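
As a worked instance of the approximation above, the sketch below evaluates the linear term for a single window. The coefficient values, the metric readings, and the names `QuicMetrics`, `COEFFS`, and `is_quic_bps` are all made up for illustration; in practice the coefficients would be fitted, not hand-picked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuicMetrics:
    ptof: float  # PTO events per minute
    prb: float   # p95 recovery packets per 50 ms bin
    adr: float   # p95/p50 ack_delay ratio
    rbr: float   # p95/median child-order rate ratio
    dpe: float   # p95 |actual - target| dispatch error, in ms


# Hypothetical coefficients a..e, in bps of slippage per unit of each metric.
COEFFS = {"ptof": 0.10, "prb": 0.05, "adr": 0.02, "rbr": 0.03, "dpe": 0.01}


def is_quic_bps(m: QuicMetrics, coeffs=COEFFS) -> float:
    """Linear IS_quic estimate: a*PTOF + b*PRB + c*ADR + d*RBR + e*DPE."""
    return sum(coeffs[name] * getattr(m, name) for name in coeffs)


window = QuicMetrics(ptof=4.0, prb=6.0, adr=3.0, rbr=2.5, dpe=12.0)
print(round(is_quic_bps(window), 3))  # 0.4 + 0.3 + 0.06 + 0.075 + 0.12
```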

Production metrics to add

1) PTO Fire Rate (PTOF)

[ PTOF = \frac{\#\,\text{PTO events}}{\text{minute}} ]

Track by venue/session class and symbol bucket.
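
A minimal sketch of the per-bucket count, assuming PTO events are exported as `(timestamp_seconds, bucket)` pairs. That record shape and the function name `pto_fire_rate` are assumptions, not part of any QUIC library.

```python
from collections import Counter


def pto_fire_rate(events, window_s=60.0):
    """Count PTO events per (bucket, window index).

    events: iterable of (timestamp_s, bucket), where bucket might be a
    venue/session-class/symbol-bucket label. Windows without any PTO
    are simply absent from the result.
    """
    counts = Counter()
    for ts, bucket in events:
        counts[(bucket, int(ts // window_s))] += 1
    return counts


events = [(1.2, "venueA"), (5.0, "venueA"), (61.0, "venueA"), (10.0, "venueB")]
for key, n in sorted(pto_fire_rate(events).items()):
    print(key, n)
```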

2) Probe-Recovery Burst Intensity (PRB)

[ PRB = p95\left(\text{recovery packets in 50ms bins}\right) ]

Higher PRB indicates clustered loss-recovery pressure.
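
One way to compute PRB, assuming recovery packets (PTO probes and retransmissions) are available as send timestamps in milliseconds. The nearest-rank percentile and the choice to count empty bins as zero are design assumptions of this sketch; all names are hypothetical.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def probe_recovery_burst(recovery_ts_ms, start_ms, end_ms, bin_ms=50):
    """p95 of recovery-packet counts over fixed bins in [start_ms, end_ms)."""
    n_bins = math.ceil((end_ms - start_ms) / bin_ms)
    counts = [0] * n_bins  # empty bins stay at zero and count toward the p95
    for ts in recovery_ts_ms:
        if start_ms <= ts < end_ms:
            counts[int((ts - start_ms) // bin_ms)] += 1
    return percentile(counts, 95)


# One second of telemetry: a cluster of six packets, then a few stragglers.
recovery_ts = [160, 161, 162, 163, 164, 165, 510, 520, 555]
print(probe_recovery_burst(recovery_ts, 0, 1000))
```

Because most bins are empty, a single isolated cluster barely moves the p95; the metric rises only when clusters recur within the window.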

3) ACK Delay Distortion Ratio (ADR)

[ ADR = \frac{p95(\text{ack\_delay})}{p50(\text{ack\_delay})+\epsilon} ]

Flags when the ACK-delay tail pulls away from its median, which points to a regime shift rather than ordinary noise.

4) Recovery Burst Rebound (RBR)

[ RBR = \frac{p95(\text{child orders/sec over 100 ms windows})}{median(\text{child orders/sec})+\epsilon} ]

Measures under-send then catch-up behavior from recovery episodes.
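
A sketch of RBR on child-order send timestamps, assuming the numerator and the denominator are both taken over the same 100 ms windows (the formula only fixes the window for the numerator). Function names and the sample data are hypothetical.

```python
import math
import statistics


def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def recovery_burst_rebound(child_ts_ms, start_ms, end_ms, bin_ms=100, eps=1e-9):
    """p95 windowed child-order rate divided by the median windowed rate."""
    n_bins = math.ceil((end_ms - start_ms) / bin_ms)
    counts = [0] * n_bins
    for ts in child_ts_ms:
        if start_ms <= ts < end_ms:
            counts[int((ts - start_ms) // bin_ms)] += 1
    rates = [c * 1000.0 / bin_ms for c in counts]  # child orders per second
    return percentile(rates, 95) / (statistics.median(rates) + eps)


# Smooth pacing: one child order every 100 ms for two seconds.
smooth = [i * 100 + 10 for i in range(20)]
# Stall for 500 ms, then repay the deficit over two windows.
stalled = (
    [i * 100 + 10 for i in range(10)]
    + [1500 + k * 20 for k in range(3)]
    + [1600 + k * 20 for k in range(3)]
    + [i * 100 + 10 for i in range(17, 20)]
)
print(round(recovery_burst_rebound(smooth, 0, 2000), 2))
print(round(recovery_burst_rebound(stalled, 0, 2000), 2))
```

Both streams send 20 and 19 child orders respectively, so totals are similar; only the stalled stream shows a ratio well above 1.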

5) Dispatch Phase Error (DPE)

[ DPE = p95\left(\left|t_{\text{actual\_child}} - t_{\text{target\_child}}\right|\right) ]

Most direct bridge from transport recovery to execution damage.

6) PTO-Near-Deadline Pressure (PND)

[ PND = P(\text{PTO event within } \Delta \text{ of schedule cutoff}) ]

This interaction often drives worst-tail slippage.
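
The formula leaves the denominator open. The sketch below takes one possible reading, stated here as an assumption: PND is the share of schedule cutoffs that had at least one PTO in the Δ seconds before them. Names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def pto_near_deadline(pto_ts_s, cutoffs_s, delta_s):
    """Fraction of cutoffs with at least one PTO in [cutoff - delta_s, cutoff]."""
    if not cutoffs_s:
        return 0.0
    ptos = sorted(pto_ts_s)
    hits = 0
    for cutoff in cutoffs_s:
        lo = bisect_left(ptos, cutoff - delta_s)
        hi = bisect_right(ptos, cutoff)
        if hi > lo:  # some PTO timestamp falls inside the lookback window
            hits += 1
    return hits / len(cutoffs_s)


ptos = [9.5, 42.0, 58.0]
cutoffs = [10.0, 30.0, 60.0, 90.0]
print(pto_near_deadline(ptos, cutoffs, delta_s=3.0))  # 0.5
```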

Modeling architecture

Stage 1: QUIC recovery-regime detector

Features:

PTO count/time since last PTO,
smoothed RTT vs RTT variance,
ack_delay and ACK inter-arrival variance,
loss/retransmission markers from QUIC stats,
PRB/RBR/DPE short-horizon trends.

Output:

(P(\text{RECOVERY\_REGIME}))

Stage 2: conditional slippage uplift model

Estimate uplift in mean and q95/q99 slippage conditioned on urgency and recovery probability.

Useful interaction:

[ \Delta IS \sim \beta_1\,urgency + \beta_2\,p_{recovery} + \beta_3\,(urgency\times p_{recovery}) ]

Urgent schedules are usually most fragile when PTO/recovery is active.
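
For a quick look at the interaction term without a regression library, one simplification (introduced here, not part of the model above) is to threshold urgency and recovery probability into binary flags. In that saturated 2x2 case the least-squares coefficients equal differences of cell means, and the interaction coefficient is the difference-in-differences. This covers the mean only; the q95/q99 uplift would need a quantile method.

```python
from statistics import mean


def interaction_uplift(rows):
    """rows: iterable of (urgent, in_recovery, slippage_bps).

    Returns (b1, b2, b3) for
    slippage ~ b0 + b1*urgent + b2*recovery + b3*urgent*recovery,
    computed from the four cell means. Raises KeyError if a cell is empty.
    """
    cells = {}
    for urgent, in_recovery, slip in rows:
        cells.setdefault((bool(urgent), bool(in_recovery)), []).append(slip)
    m = {key: mean(vals) for key, vals in cells.items()}
    base = m[(False, False)]
    b1 = m[(True, False)] - base
    b2 = m[(False, True)] - base
    b3 = m[(True, True)] - m[(True, False)] - m[(False, True)] + base
    return b1, b2, b3


# Made-up slippage samples in bps.
rows = [
    (False, False, 1.0), (False, False, 1.0),
    (True, False, 2.0), (True, False, 2.0),
    (False, True, 1.5), (False, True, 1.5),
    (True, True, 5.0), (True, True, 5.0),
]
print(interaction_uplift(rows))  # (1.0, 0.5, 2.5)
```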

Controller state machine

GREEN — CLOCK_STABLE

Low PTOF/PRB, stable ADR/DPE.
Baseline schedule and routing.

YELLOW — ACK_DRIFT

ADR and ACK-spacing variance rising.
Actions:
- raise telemetry sampling,
- reduce optional burst fanout,
- tighten DPE alerts.

ORANGE — RECOVERY_ACTIVE

PTOF and PRB elevated; RBR visible.
Actions:
- cap deficit catch-up slope,
- shift toward smoother participation template,
- reduce discretionary venue hopping.
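
Capping the catch-up slope could look like the sketch below: the schedule deficit is repaid over a horizon, but the extra rate is limited to a fraction of the target rate. The function name, the `max_uplift` parameter, and its default are hypothetical.

```python
def capped_child_rate(target_rate, deficit_qty, horizon_s, max_uplift=0.25):
    """Child-order rate (qty/s) that repays a deficit, with a capped uplift.

    The catch-up component is limited to max_uplift * target_rate, so a
    post-recovery deficit is worked off gradually rather than in one burst.
    """
    if horizon_s <= 0:
        raise ValueError("horizon_s must be positive")
    catch_up = max(0.0, deficit_qty) / horizon_s
    return target_rate + min(catch_up, max_uplift * target_rate)


print(capped_child_rate(100.0, deficit_qty=300.0, horizon_s=2.0))  # 125.0
print(capped_child_rate(100.0, deficit_qty=20.0, horizon_s=2.0))   # 110.0
```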

RED — TAIL_CONTAINMENT

Sustained recovery regime + tail budget breach.
Actions:
- hard-limit urgency escalation,
- switch to conservative completion policy,
- fail over to known-stable path/profile when available.

Use hysteresis and minimum dwell times to avoid oscillation.
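
A minimal sketch of hysteresis plus minimum dwell, driven only by the recovery probability. This is a simplification: the RED condition above also involves a tail-budget breach, which is omitted here. The thresholds, the class name, and the choice to escalate immediately while de-escalating one level at a time are all assumptions.

```python
from dataclasses import dataclass

STATES = ["GREEN", "YELLOW", "ORANGE", "RED"]
# Hypothetical thresholds on P(RECOVERY_REGIME): a level is entered at
# ENTER[i] and left only once the score drops below EXIT[i] < ENTER[i].
ENTER = [0.0, 0.3, 0.6, 0.85]
EXIT = [0.0, 0.2, 0.5, 0.75]


@dataclass
class RegimeController:
    min_dwell_s: float = 5.0
    level: int = 0
    since_s: float = 0.0

    def update(self, p_recovery: float, now_s: float) -> str:
        # Escalate immediately, possibly by several levels at once.
        target = self.level
        while target + 1 < len(STATES) and p_recovery >= ENTER[target + 1]:
            target += 1
        # De-escalate one level at a time, and only after the dwell time has
        # elapsed and the score is below the current level's exit threshold.
        if target == self.level and self.level > 0:
            dwelled = now_s - self.since_s >= self.min_dwell_s
            if dwelled and p_recovery < EXIT[self.level]:
                target = self.level - 1
        if target != self.level:
            self.level, self.since_s = target, now_s
        return STATES[self.level]


ctl = RegimeController(min_dwell_s=5.0)
samples = [(0.1, 0), (0.65, 1), (0.55, 2), (0.4, 3), (0.4, 7), (0.1, 8), (0.1, 13)]
trace = [ctl.update(p, t) for p, t in samples]
print(trace)
```

In the trace, the score of 0.4 at t=3 is already below the ORANGE exit threshold, but the controller holds ORANGE until the dwell time has passed at t=7.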

Engineering mitigations (high ROI first)

- Warm persistent QUIC sessions for critical routes: avoid cold-path recovery risk during active slices.
- Tune ACK/RTT assumptions from observed venue-path reality: incorrect ACK-delay expectations are PTO factories.
- Separate control and bulk channels: keep critical order-control loops isolated from noisy traffic.
- Pacing-aware recovery guardrails in the execution layer: do not let schedule deficit trigger uncontrolled burst rebound.
- Tail-first canary gates: promote only when q95/q99 slippage improves without completion fallout.
- Integrate QUIC transport counters into TCA: without PTO/ACK telemetry, attribution over-blames market conditions.
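
Why ACK-delay assumptions matter can be read off the PTO formula in RFC 9002 (section 6.2.1): PTO = smoothed_rtt + max(4 * rttvar, kGranularity) + max_ack_delay, doubled for each consecutive PTO. The sketch below evaluates it for the application data space. The RTT figures are invented for a low-latency path; 25 ms is the default max_ack_delay from RFC 9000, and the function name is hypothetical.

```python
K_GRANULARITY_MS = 1.0  # timer granularity recommended by RFC 9002


def pto_ms(smoothed_rtt_ms, rttvar_ms, max_ack_delay_ms, pto_count=0):
    """Probe timeout in ms, including backoff after consecutive PTOs."""
    base = (
        smoothed_rtt_ms
        + max(4 * rttvar_ms, K_GRANULARITY_MS)
        + max_ack_delay_ms
    )
    return base * (2 ** pto_count)


# Hypothetical sub-millisecond path with the default max_ack_delay of 25 ms.
print(round(pto_ms(0.4, 0.05, 25.0), 3))               # 26.4
# Same path if the peer advertised a 1 ms max_ack_delay.
print(round(pto_ms(0.4, 0.05, 1.0), 3))                # 2.4
# Second consecutive PTO on the first configuration.
print(round(pto_ms(0.4, 0.05, 25.0, pto_count=2), 3))  # 105.6
```

On a path like this the max_ack_delay term dominates the timer. If the receiver in practice holds ACKs longer than the value the sender is using, the timer can expire while the ACK is still pending, which is one plausible source of the PTO clusters described earlier.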

Validation protocol

Label windows with PTO/recovery activity from transport telemetry.
Match cohorts by symbol, spread, volatility, urgency, and participation.
Compare mean and q95/q99 slippage in recovery vs stable windows.
Canary the mitigations (ACK policy tuning, pacing guardrails, path/profile pinning).
Promote only after persistent tail improvement and stable completion reliability.
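
The comparison step can be sketched as follows, assuming each fill already carries a cohort key (bucketed symbol, spread, volatility, urgency, participation) and a recovery label from transport telemetry. The record shape, the function names, and the sample values are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean


def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def tail_uplift_by_cohort(fills, q=95):
    """fills: iterable of (cohort_key, in_recovery, slippage_bps).

    Returns per-cohort mean and q-th percentile uplift (recovery minus
    stable). Cohorts missing either arm are skipped.
    """
    groups = defaultdict(lambda: {True: [], False: []})
    for key, in_recovery, slip in fills:
        groups[key][bool(in_recovery)].append(slip)
    out = {}
    for key, arms in groups.items():
        if arms[True] and arms[False]:
            out[key] = {
                "mean_uplift": mean(arms[True]) - mean(arms[False]),
                "tail_uplift": percentile(arms[True], q) - percentile(arms[False], q),
            }
    return out


cohort = ("SYM1", "urgent")
fills = (
    [(cohort, False, s) for s in (1.0, 2.0, 3.0, 4.0)]
    + [(cohort, True, s) for s in (2.0, 3.0, 4.0, 12.0)]
    + [(("SYM2", "calm"), False, 1.0)]  # no recovery arm, so it is skipped
)
print(tail_uplift_by_cohort(fills))
```

In the sample, the mean uplift is 2.75 bps while the p95 uplift is 8 bps, which is the pattern the playbook warns about: tails widen more than averages.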

Practical observability checklist

QUIC PTO/loss/retransmission counters per session
ACK delay + inter-ACK spacing distribution
RTT level/variance and path-change indicators
decision→wire latency and target→actual child phase error
burstiness metrics (RBR/PRB) around deadline windows
matched-cohort markout deltas (recovery vs stable)

Success criterion: lower tail slippage during recovery episodes, not just prettier average latency charts.

Pseudocode sketch

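# Sketch only: every helper below (collect_quic_features, recovery_detector,
# decode_state, the *_policy builders, execute_with, log) is a placeholder
# for whatever the execution stack provides.
q = collect_quic_features()  # PTOF, PRB, ADR, RBR, DPE, PND
p_recovery = recovery_detector.predict_proba(q)  # Stage 1: P(RECOVERY_REGIME)
state = decode_state(p_recovery, q)  # should apply hysteresis and minimum dwell

if state == "GREEN":  # CLOCK_STABLE
    params = baseline_policy()
elif state == "YELLOW":  # ACK_DRIFT
    params = guarded_policy()
elif state == "ORANGE":  # RECOVERY_ACTIVE
    params = smooth_catchup_policy()
else:  # RED, TAIL_CONTAINMENT (also the fallback for any unknown state)
    params = containment_policy()

execute_with(params)
log(state=state, p_recovery=p_recovery)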

Bottom line

QUIC usually improves transport ergonomics, but under ACK/PTO stress it can create cadence distortion that directly leaks into execution costs.

If your slippage stack ignores PTO/recovery regime variables, your tail attribution is incomplete.

References

RFC 9000 — QUIC: A UDP-Based Multiplexed and Secure Transport:
https://www.rfc-editor.org/rfc/rfc9000
RFC 9002 — QUIC Loss Detection and Congestion Control:
https://www.rfc-editor.org/rfc/rfc9002
RFC 9001 — Using TLS to Secure QUIC:
https://www.rfc-editor.org/rfc/rfc9001
QUIC ACK Frequency (IETF draft):
https://datatracker.ietf.org/doc/draft-ietf-quic-ack-frequency/
quiche (Cloudflare QUIC implementation, stats/ops reference):
https://github.com/cloudflare/quiche