ECN CE-Mark Burst & CWND Shock Slippage Playbook
Date: 2026-03-23
Category: research
Scope: How bursty ECN Congestion Experienced (CE) marking can create sender-side congestion-window shocks, pacing discontinuities, and measurable execution slippage
Why this matters
ECN is usually treated as “better than drops” (which is often true). But in live trading paths, CE marks arriving in bursts can still create a hidden execution tax:
- sender reacts with abrupt congestion-window reduction,
- packet pacing briefly under-fills the wire,
- child-order dispatch cadence deforms,
- catch-up bursts happen later,
- queue priority and markout worsen.
In post-trade TCA this is often misread as random venue microstructure noise, when the root cause is transport-control dynamics.
Failure mechanism (operator timeline)
- Queue pressure builds on one network segment (TOR, host qdisc, middlebox, or egress bottleneck).
- AQM/ECN marks packets with CE at elevated frequency (sometimes in concentrated clusters).
- Receiver echoes congestion via ECE; sender cuts its congestion window (setting CWR to confirm the reduction) and sends less aggressively.
- Decision→wire latency stretches and packet spacing gets less uniform.
- Execution engine misses intended micro-timing slots; child orders land late or lumped.
- When congestion eases, sender ramps up again, often creating a cadence rebound burst.
- Net result: worse queue age, worse fill quality, worse post-fill markout.
Key point: lossless does not mean frictionless. CE bursts can still cause slippage convexity.
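The signals in the timeline above live in two header fields: the CE codepoint is carried in the low two bits of the IP TOS / Traffic Class byte, and ECE/CWR are TCP flags (RFC 3168). The following is a minimal decoding sketch for tagging captured packets; the function names are hypothetical, and how the raw bytes are captured is left out.

```python
from enum import IntEnum


class EcnCodepoint(IntEnum):
    """ECN field values from RFC 3168 (low two bits of the TOS byte)."""
    NOT_ECT = 0b00
    ECT_1 = 0b01
    ECT_0 = 0b10
    CE = 0b11


# TCP header flag bits used by ECN (RFC 3168).
TCP_FLAG_ECE = 0x40
TCP_FLAG_CWR = 0x80


def ecn_codepoint(tos_byte: int) -> EcnCodepoint:
    """Extract the ECN codepoint, ignoring the six DSCP bits."""
    return EcnCodepoint(tos_byte & 0b11)


def tcp_ecn_flags(tcp_flags: int) -> dict:
    """Report whether ECE and CWR are set in a TCP flags byte."""
    return {
        "ece": bool(tcp_flags & TCP_FLAG_ECE),
        "cwr": bool(tcp_flags & TCP_FLAG_CWR),
    }
```

Tagging each packet this way is what makes it possible to line up CE arrivals, ECE echoes, and CWR responses against child-order timestamps.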
Extend slippage decomposition with ECN-shock term
[ IS = IS_{market} + IS_{impact} + IS_{timing} + IS_{fees} + \underbrace{IS_{ecn}}_{\text{CE-burst transport tax}} ]
Operational approximation:
[ IS_{ecn,t} \approx a\cdot CMR_t + b\cdot CBI_t + c\cdot CRL_t + d\cdot CCD_t + e\cdot MCE_t ]
Where:
- (CMR): CE mark rate,
- (CBI): CE burstiness index,
- (CRL): congestion-window recovery lag,
- (CCD): child-cadence discontinuity,
- (MCE): CE-conditioned markout effect.
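As a minimal sketch of the operational approximation above, the term is a weighted sum of the five features. The class and function names are hypothetical, and the coefficients a through e are placeholders that would have to be fitted on your own data; nothing here implies particular values or units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EcnFeatures:
    cmr: float  # CE mark rate
    cbi: float  # CE burstiness index
    crl: float  # congestion-window recovery lag
    ccd: float  # child-cadence discontinuity
    mce: float  # CE-conditioned markout effect


@dataclass(frozen=True)
class EcnCoefficients:
    a: float
    b: float
    c: float
    d: float
    e: float


def is_ecn(f: EcnFeatures, k: EcnCoefficients) -> float:
    """Linear approximation: a*CMR + b*CBI + c*CRL + d*CCD + e*MCE."""
    return k.a * f.cmr + k.b * f.cbi + k.c * f.crl + k.d * f.ccd + k.e * f.mce
```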
Production metrics to add
1) CE Mark Rate (CMR)
[ CMR = \frac{\#\,\text{CE-marked packets}}{\#\,\text{ECN-capable packets}} ]
Compute per path/host/venue/session slice.
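A minimal sketch of the per-slice computation follows. It assumes packets arrive as (slice_key, ecn_bits) pairs, where slice_key is whatever path/host/venue/session label you use, and it counts CE-marked packets in the denominator too, since a packet must have been ECN-capable to be marked. Both choices are assumptions, not something the definition above fixes.

```python
from collections import defaultdict

NOT_ECT = 0b00  # packet was not ECN-capable
CE = 0b11       # Congestion Experienced


def ce_mark_rate(packets):
    """Return {slice_key: CMR} from an iterable of (slice_key, ecn_bits)."""
    ce_count = defaultdict(int)
    capable_count = defaultdict(int)
    for key, ecn_bits in packets:
        if ecn_bits == NOT_ECT:
            continue  # excluded from the denominator
        capable_count[key] += 1
        if ecn_bits == CE:
            ce_count[key] += 1
    # Slices with no ECN-capable packets are omitted rather than reported as 0.
    return {key: ce_count[key] / n for key, n in capable_count.items()}
```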
2) CE Burstiness Index (CBI)
[ CBI = \frac{p99(CE\ marks\ per\ RTT\ window)}{mean(CE\ marks\ per\ RTT\ window)+\epsilon} ]
High CBI indicates clustered congestion signaling (more harmful than smooth low-rate signaling).
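The index can be sketched with NumPy by bucketing CE timestamps into fixed RTT-sized windows. The function name and the use of a single fixed RTT per slice are assumptions; a production version might use a per-flow smoothed RTT instead.

```python
import numpy as np


def ce_burstiness_index(ce_times, rtt_s, t_start, t_end, eps=1e-9):
    """p99 over mean of CE marks per RTT-sized window in [t_start, t_end]."""
    n_windows = max(1, int(np.ceil((t_end - t_start) / rtt_s)))
    edges = t_start + rtt_s * np.arange(n_windows + 1)
    # Timestamps outside the edges are ignored by np.histogram.
    counts, _ = np.histogram(np.asarray(ce_times, dtype=float), bins=edges)
    return float(np.percentile(counts, 99) / (counts.mean() + eps))
```

One caveat with this form: p99 only reacts when bursts occupy more than about 1% of the windows in the slice, so a single burst window in a short slice barely moves the index. Choosing a slice long enough to contain several bursts, or tracking the maximum alongside p99, avoids reading that case as smooth.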
3) Congestion Recovery Lag (CRL)
Time from first ECE/CE burst onset to restoration of pre-shock send cadence (or cwnd proxy baseline).
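One way to make "restoration" concrete is to require the cadence proxy to stay within a tolerance band around its pre-shock baseline for several consecutive samples. The sketch below does that; the tolerance, the hold count, and the function name are all assumptions to be tuned.

```python
def congestion_recovery_lag(samples, burst_onset, baseline,
                            tolerance=0.10, hold=3):
    """Seconds from burst onset until the proxy is back near baseline.

    samples: time-sorted (timestamp_s, proxy_value) pairs, e.g. send rate
    or a cwnd proxy. Recovery means `hold` consecutive samples within
    `tolerance` (relative) of `baseline`. Returns None if never recovered.
    """
    run_start = None
    run_len = 0
    for ts, value in samples:
        if ts < burst_onset:
            continue
        if abs(value - baseline) <= tolerance * abs(baseline):
            if run_len == 0:
                run_start = ts
            run_len += 1
            if run_len >= hold:
                return run_start - burst_onset
        else:
            run_len = 0  # a relapse resets the recovery run
    return None
```

Requiring a hold period keeps a single lucky sample during the shock from being counted as recovery.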
4) Child Cadence Discontinuity (CCD)
[ CCD = \frac{p95(\Delta t_{child})}{p50(\Delta t_{child})} ]
Measure before/within/after CE bursts.
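A minimal NumPy sketch of the ratio, computed from child-order dispatch timestamps for one window (before, within, or after a burst). The function name and the handling of degenerate inputs are assumptions.

```python
import numpy as np


def child_cadence_discontinuity(dispatch_times):
    """p95 / p50 of inter-dispatch gaps for one window of child orders."""
    gaps = np.diff(np.sort(np.asarray(dispatch_times, dtype=float)))
    if gaps.size == 0:
        return float("nan")  # fewer than two dispatches: undefined
    p50 = np.percentile(gaps, 50)
    if p50 == 0:
        return float("inf")  # median gap of zero: fully lumped dispatch
    return float(np.percentile(gaps, 95) / p50)
```

A perfectly even cadence gives a value of 1; comparing the within-burst value to the before-burst value for the same strategy isolates the deformation from the strategy's normal spacing.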
5) CE-Conditioned Markout Effect (MCE)
Matched-cohort markout delta between CE_BURST and NO_CE_BURST windows.
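A simple way to sketch the matched-cohort delta is exact matching on a cohort key, then averaging the within-cohort differences weighted by the number of CE_BURST fills. The cohort key construction (for example bucketed symbol, spread, and urgency) and the weighting are assumptions; other matching schemes would work as well.

```python
from collections import defaultdict
from statistics import mean


def ce_conditioned_markout_effect(fills):
    """Weighted mean of (CE_BURST markout - NO_CE_BURST markout) per cohort.

    fills: iterable of (cohort_key, in_ce_burst, markout) tuples.
    Cohorts lacking either arm are dropped. Returns None if none match.
    """
    cohorts = defaultdict(lambda: {True: [], False: []})
    for key, in_burst, markout in fills:
        cohorts[key][bool(in_burst)].append(markout)

    weighted_sum = 0.0
    weight = 0
    for arms in cohorts.values():
        if arms[True] and arms[False]:
            n = len(arms[True])
            weighted_sum += n * (mean(arms[True]) - mean(arms[False]))
            weight += n
    return weighted_sum / weight if weight else None
```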
6) Dispatch Underrun Ratio (DUR)
Fraction of intended dispatch slots that miss timing budget while CE bursts are active.
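The ratio can be sketched as follows, assuming each slot records an intended and an actual dispatch time and that burst windows have already been labeled as (start, end) intervals. Treating a slot that never dispatched as a miss is an assumption.

```python
def dispatch_underrun_ratio(slots, burst_windows, budget_s):
    """Share of slots intended during a CE burst that missed the budget.

    slots: iterable of (intended_ts, actual_ts or None).
    burst_windows: list of (start_ts, end_ts), end exclusive.
    Returns None when no slot falls inside a burst window.
    """
    def in_burst(ts):
        return any(start <= ts < end for start, end in burst_windows)

    active = [(i, a) for i, a in slots if in_burst(i)]
    if not active:
        return None
    missed = sum(1 for i, a in active if a is None or (a - i) > budget_s)
    return missed / len(active)
```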
Modeling architecture
Stage 1: CE-regime detector
Inputs:
- CE/ECE event density and burstiness,
- RTT and RTT-variance drift,
- pacing/send-queue depth proxies,
- decision→wire latency quantiles,
- host qdisc and interface queue telemetry.
Output:
- (P(\text{CE\_BURST\_REGIME}))
Stage 2: conditional slippage model
Estimate expected mean and tail slippage conditioned on CE regime probability.
Useful interaction:
[ \Delta IS \sim \beta_1\,urgency + \beta_2\,ce + \beta_3\,(urgency \times ce) ]
Urgent tactics usually pay a larger penalty under CE-burst regimes.
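The interaction specification above can be fitted with ordinary least squares as a first pass. The sketch below uses `numpy.linalg.lstsq`; the added intercept column, the function name, and the choice of plain OLS (rather than a tail or quantile model) are assumptions for illustration.

```python
import numpy as np


def fit_interaction_model(urgency, ce, delta_is):
    """OLS fit of delta_is on urgency, ce, and their product.

    `ce` can be a 0/1 regime flag or the Stage 1 probability.
    Returns a dict of fitted coefficients.
    """
    u = np.asarray(urgency, dtype=float)
    c = np.asarray(ce, dtype=float)
    y = np.asarray(delta_is, dtype=float)
    # Columns: intercept (an addition), urgency, ce, urgency x ce.
    X = np.column_stack([np.ones_like(u), u, c, u * c])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    names = ("intercept", "urgency", "ce", "urgency_x_ce")
    return {name: float(b) for name, b in zip(names, beta)}
```

A positive `urgency_x_ce` coefficient is what the claim about urgent tactics would look like in the fit: the CE penalty grows with urgency.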
Controller state machine
GREEN — STABLE_ECN
- Low CMR, low CBI, stable cadence.
- Baseline policy.
YELLOW — CE_RISING
- CMR rising, CBI still moderate.
- Actions:
- increase telemetry sampling,
- reduce optional background traffic on shared path,
- tighten pacing/cadence guardrails.
ORANGE — CE_BURST_ACTIVE
- High CBI + visible cadence deformation.
- Actions:
- cap child fanout per interval,
- prefer steadier participation over bursty catch-up,
- reduce non-essential control-plane chatter on same NIC/path.
RED — TRANSPORT_CONTAINMENT
- Persistent CE bursts with slippage uplift.
- Actions:
- switch to conservative execution template,
- enforce hard tail-latency budget,
- route/schedule failover where available.
Use hysteresis + minimum dwell time to avoid control oscillation.
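A minimal sketch of hysteresis plus dwell time follows. It simplifies the four states to thresholds on a single CE-regime score, escalates immediately, and steps down one level at a time only after the score falls below the entry threshold minus a margin and the minimum dwell has elapsed. The threshold values, margin, dwell time, and class name are all placeholder assumptions.

```python
from dataclasses import dataclass

STATES = ("GREEN", "YELLOW", "ORANGE", "RED")
ENTER = {"YELLOW": 0.30, "ORANGE": 0.60, "RED": 0.85}  # placeholder thresholds
EXIT_MARGIN = 0.10   # score must drop this far below ENTER to step down
MIN_DWELL_S = 5.0    # minimum time in a state before stepping down


@dataclass
class TransportStateMachine:
    state: str = "GREEN"
    entered_at: float = 0.0

    def update(self, p_ce: float, now: float) -> str:
        idx = STATES.index(self.state)
        target = idx
        # Escalate straight to the highest state whose threshold is met.
        for i in range(len(STATES) - 1, idx, -1):
            if p_ce >= ENTER[STATES[i]]:
                target = i
                break
        # De-escalate one level, gated by hysteresis margin and dwell time.
        if target == idx and idx > 0:
            below = p_ce < ENTER[self.state] - EXIT_MARGIN
            dwelled = (now - self.entered_at) >= MIN_DWELL_S
            if below and dwelled:
                target = idx - 1
        if target != idx:
            self.state = STATES[target]
            self.entered_at = now
        return self.state
```

Making escalation fast and de-escalation slow is a deliberate asymmetry: reacting late to a burst costs slippage, while lingering in a cautious state for a few extra seconds usually costs little.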
Engineering mitigations (high ROI first)
1) Measure CE explicitly, not just drops
Add packet-level CE/ECE observability to the same timeline as child-order decisions.
2) Tune queue disciplines deliberately
Audit fq/fq_codel/codel parameters (target, interval, ce_threshold) for the live execution traffic profile.
3) Traffic-class isolation
Separate the execution path from batch/replication/analytics traffic (qdisc classing, DSCP policy, host isolation).
4) Cadence-aware execution fallback
During CE bursts, avoid aggressive catch-up bursts that worsen queue position decay.
5) Path-specific runbooks
Maintain per-venue/per-POP CE baselines; trigger alerts on CBI excursions rather than on raw averages only.
6) Canary policy rollouts
Deploy CE-aware controls on a subset of symbols/hosts first; require stable tail improvement before promotion.
Validation protocol
- Label CE_BURST windows from CE burstiness thresholds.
- Match cohorts by symbol, spread, volatility, urgency, participation, and venue.
- Estimate uplift in mean/q95 slippage and completion-miss risk.
- Apply mitigations (queue tuning, path isolation, cadence cap) in canary.
- Promote only if tail improvements persist without unacceptable fill-rate loss.
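The final promotion step can be expressed as an explicit gate. The sketch below compares q95 slippage between control and canary cohorts and checks fill-rate loss against a cap; the threshold defaults and function name are hypothetical, and it evaluates a single period, so "persist" would mean requiring it to pass over several consecutive periods.

```python
import numpy as np


def promotion_gate(control_slippage, canary_slippage,
                   control_fill_rate, canary_fill_rate,
                   min_q95_gain=0.5, max_fill_loss=0.01):
    """True if the canary improves q95 slippage without excess fill loss.

    Slippage arrays are costs (higher is worse), in whatever unit the
    desk uses; min_q95_gain is in that same unit.
    """
    q95_gain = (np.percentile(control_slippage, 95)
                - np.percentile(canary_slippage, 95))
    fill_loss = control_fill_rate - canary_fill_rate
    return bool(q95_gain >= min_q95_gain and fill_loss <= max_fill_loss)
```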
Practical observability checklist
- CE/ECE packet counters with time buckets
- per-flow RTT and jitter around CE bursts
- decision→wire latency quantiles split by path/host
- child dispatch interval distribution and burst ratio
- qdisc stats (tc -s) and interface queue pressure
- matched-cohort markout deltas (CE_BURST vs baseline)
Success criterion: tail slippage stability under congestion-signaled regimes, not just low packet-drop rates.
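For the CE counter item in the checklist, one coarse host-wide source on Linux is the `IpExt` section of `/proc/net/netstat`, which exposes received-packet counts by ECN codepoint (`InNoECTPkts`, `InECT1Pkts`, `InECT0Pkts`, `InCEPkts`). The sketch below parses a snapshot and computes the CE share between two snapshots. The helper names are hypothetical, the counters are not per-flow, and the field names should be checked against your kernel before relying on them.

```python
ECN_COUNTERS = ("InNoECTPkts", "InECT1Pkts", "InECT0Pkts", "InCEPkts")


def parse_ipext_ecn(netstat_text):
    """Extract ECN counters from the text of /proc/net/netstat.

    The file lists each section as a header line of names followed by
    a line of values, both prefixed with the section name.
    """
    lines = netstat_text.splitlines()
    for header, values in zip(lines, lines[1:]):
        if header.startswith("IpExt:") and values.startswith("IpExt:"):
            table = dict(zip(header.split()[1:], values.split()[1:]))
            return {name: int(table.get(name, 0)) for name in ECN_COUNTERS}
    return {}


def bucket_ce_rate(prev, curr):
    """CE share of ECN-capable packets received between two snapshots."""
    delta = {name: curr[name] - prev[name] for name in ECN_COUNTERS}
    capable = delta["InECT0Pkts"] + delta["InECT1Pkts"] + delta["InCEPkts"]
    return delta["InCEPkts"] / capable if capable > 0 else None


# Usage sketch: snapshot once per time bucket and difference neighbors.
# with open("/proc/net/netstat") as f:
#     snapshot = parse_ipext_ecn(f.read())
```

Because these are received-side counters, they see CE marks on inbound traffic; sender-side reaction is better observed through ECE flags on returning ACKs or per-socket statistics.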
Pseudocode sketch
features = collect_ecn_features()  # CMR, CBI, CRL, CCD, DUR
p_ce = ce_burst_detector.predict_proba(features)
state = decode_transport_state(p_ce, features)

if state == "GREEN":
    params = baseline_policy()
elif state == "YELLOW":
    params = guarded_policy()
elif state == "ORANGE":
    params = cadence_capped_policy()
else:  # RED
    params = containment_policy()

execute_with(params)
log(state=state, p_ce=p_ce)
Bottom line
ECN is valuable, but bursty CE signaling can still damage execution quality through timing-channel distortion. If your slippage model ignores transport-control regimes, you’ll keep attributing predictable infrastructure tax to “market randomness.”
References
- RFC 3168, The Addition of Explicit Congestion Notification (ECN) to IP:
https://datatracker.ietf.org/doc/html/rfc3168
- Linux IP sysctl networking reference (tcp_ecn and related controls):
https://docs.kernel.org/networking/ip-sysctl.html
- tc-fq_codel(8) manual (ECN/ce_threshold behavior):
https://man7.org/linux/man-pages/man8/tc-fq_codel.8.html
- tc-codel(8) manual:
https://man7.org/linux/man-pages/man8/tc-codel.8.html
- tc-fq(8) manual:
https://man7.org/linux/man-pages/man8/tc-fq.8.html