PCIe AER Correctable-Error Storm Slippage Playbook
Why this exists
Execution teams often explain sudden tail slippage as "market microstructure noise" when venue-side telemetry looks normal.
A recurring hidden cause is PCIe link-level instability (especially AER-correctable error storms and replay bursts) between NIC and host. Throughput can look fine while latency tails quietly degrade.
If the NIC can still push packets but DMA/completion timing becomes jittery, child-order timing quality collapses before standard dashboards turn red.
Core failure mode
When PCIe electrical integrity degrades (marginal lane quality, retimer issues, connector aging, thermals, power transients):
- correctable AER counters climb quickly,
- data-link replays increase,
- NIC DMA completion latency becomes bursty,
- TX/RX service cadence desynchronizes from strategy loop timing,
- child intents bunch after delayed completion windows,
- queue-age quality degrades and fallback aggression rises.
Result: q95/q99 implementation shortfall rises even when average wire RTT and fill rates look "acceptable."
Slippage decomposition with PCIe integrity terms
For parent order (i):
[ IS_i = C_{delay} + C_{impact} + C_{miss} + C_{pcie} ]
Where:
[ C_{pcie} = C_{replay-latency} + C_{dma-jitter} + C_{burst-recovery} ]
- Replay latency: link replay events stretch completion paths
- DMA jitter: host-visible send/receive readiness becomes uneven
- Burst recovery: delayed processing triggers clustered child actions and queue resets
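The decomposition above can be sketched as a small accounting structure. This is an illustrative sketch, not a fitted attribution model: the dataclass, field names, and the example basis-point values are all assumptions for demonstration.

```python
# Hypothetical sketch of the IS decomposition above, in bps.
# All field names and values are illustrative, not a real attribution run.
from dataclasses import dataclass

@dataclass
class SlippageBreakdown:
    c_delay: float           # decision-to-arrival delay cost
    c_impact: float          # market impact cost
    c_miss: float            # opportunity cost of unfilled quantity
    c_replay_latency: float  # link-replay completion stretch
    c_dma_jitter: float      # uneven host-visible send/receive readiness
    c_burst_recovery: float  # clustered catch-up child actions

    @property
    def c_pcie(self) -> float:
        # C_pcie = C_replay-latency + C_dma-jitter + C_burst-recovery
        return self.c_replay_latency + self.c_dma_jitter + self.c_burst_recovery

    @property
    def total_is(self) -> float:
        # IS_i = C_delay + C_impact + C_miss + C_pcie
        return self.c_delay + self.c_impact + self.c_miss + self.c_pcie

bd = SlippageBreakdown(1.2, 2.5, 0.8, 0.3, 0.2, 0.4)
```

Keeping the three PCIe sub-terms separate (rather than a single hardware bucket) is what later lets the overlay model and the desk metrics attribute uplift to replay stretch vs. jitter vs. catch-up clustering.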
Feature set (production-ready)
1) PCIe + device integrity features
From AER / driver / system counters:
- Correctable error rate per minute (device + upstream port)
- replay-related link events (where exposed)
- receiver error / bad DLLP/TLP deltas
- corrected-error burst length (max consecutive stressed windows)
- NIC PCIe link speed/width downgrade events
From host health context:
- PCIe device temperature and thermal-throttle flags
- power-state transitions impacting NIC latency path
- IRQ service delay drift under equal traffic load
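A minimal collector for the first feature (correctable error rate per minute) can be sketched as below. The sysfs path and counter names follow the Linux kernel's `aer_dev_correctable` format, but availability and exact field names vary by kernel version and platform, so treat both as assumptions to verify on your hosts.

```python
# Sketch: parse Linux AER correctable-error counters and turn successive
# snapshots into per-minute rates. Path/field names assume the kernel's
# aer_dev_correctable sysfs format; verify on the target platform.
from pathlib import Path

def parse_aer_correctable(text: str) -> dict[str, int]:
    """Parse 'CounterName value' lines (e.g. 'BadTLP 12')."""
    counters = {}
    for line in text.strip().splitlines():
        name, _, value = line.rpartition(" ")
        if name and value.isdigit():
            counters[name.strip()] = int(value)
    return counters

def per_minute_rate(prev: dict[str, int], curr: dict[str, int],
                    elapsed_s: float) -> dict[str, float]:
    """Delta each counter between snapshots and scale to errors/minute."""
    return {k: max(curr.get(k, 0) - prev.get(k, 0), 0) * 60.0 / elapsed_s
            for k in curr}

def read_device_counters(bdf: str) -> dict[str, int]:
    # bdf like "0000:3b:00.0"; the file exists only when AER is enabled
    path = Path(f"/sys/bus/pci/devices/{bdf}/aer_dev_correctable")
    return parse_aer_correctable(path.read_text())

snap0 = parse_aer_correctable("RxErr 10\nBadTLP 4\nBadDLLP 0\nTimeout 2")
snap1 = parse_aer_correctable("RxErr 70\nBadTLP 10\nBadDLLP 0\nTimeout 8")
rates = per_minute_rate(snap0, snap1, elapsed_s=30.0)
```

Sampling both the endpoint and its upstream port (as the feature list suggests) usually requires walking the device's parent in the sysfs hierarchy; the sketch covers a single function only.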
2) Execution-path timing features
- strategy decision-to-send gap quantiles
- send-to-NIC-completion latency (p50/p95/p99)
- completion inter-arrival burstiness index
- cancel/replace clustering after completion stalls
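One workable definition of the burstiness index above is the coefficient of variation of completion inter-arrival gaps; this is one common choice, not necessarily the desk's canonical one. A CV near 1 is Poisson-like, while values well above 1 flag stall-then-flush completion patterns.

```python
# Sketch: completion inter-arrival burstiness as the coefficient of
# variation (CV) of gaps between completion timestamps (seconds).
# CV ~ 1 is Poisson-like; CV >> 1 indicates stalled/clustered completions.
import statistics

def burstiness_index(completion_ts: list[float]) -> float:
    gaps = [b - a for a, b in zip(completion_ts, completion_ts[1:])]
    if len(gaps) < 2:
        return 0.0
    mean = statistics.fmean(gaps)
    if mean == 0:
        return 0.0
    return statistics.pstdev(gaps) / mean

# even cadence -> near-zero index
even = [i * 0.001 for i in range(100)]
# 50 ms stall followed by a tight flush -> high index
stalled = even[:50] + [even[49] + 0.050 + i * 0.0001 for i in range(50)]
```

The same gap series feeds the DCJ metric later in this playbook, so computing it once per window and reusing it keeps the feature and the desk metric consistent.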
3) Microstructure outcome features
- passive fill ratio by completion-latency bucket
- markout ladders (10ms / 100ms / 1s / 5s)
- completion deficit vs schedule/deadline
- branch labels: clean, electrical-watch, replay-storm, panic-catchup
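A labeling rule for the four branches might look like the sketch below. The feature names and every threshold are placeholders to be calibrated per host/NIC fleet, not values taken from this playbook.

```python
# Sketch: assign the four branch labels from window-level features.
# All thresholds are illustrative placeholders; calibrate per fleet.
def label_window(corr_err_per_min: float, replay_burst_len: int,
                 completion_p99_ratio: float, catchup_share: float) -> str:
    """completion_p99_ratio = window p99 / clean-baseline p99."""
    if catchup_share > 0.30:
        return "panic-catchup"
    if replay_burst_len >= 3 and completion_p99_ratio > 2.0:
        return "replay-storm"
    if corr_err_per_min > 10.0 or completion_p99_ratio > 1.3:
        return "electrical-watch"
    return "clean"
```

Checking the most severe branch first keeps the labels mutually exclusive, which matters when they are used as training targets for the overlay model.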
Model architecture
Use baseline + hardware-integrity overlay.
- Baseline slippage model
  - existing impact/fill/deadline model
- PCIe integrity overlay
  - predicts incremental uplift: delta_is_mean, delta_is_q95
Final estimate:
[ \hat{IS}_{final} = \hat{IS}_{baseline} + \Delta\hat{IS}_{pcie} ]
Train with episode windows around AER/replay bursts and matched controls (same symbol/session/volatility bucket) so hardware-path uplift is separated from market-state effects.
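The baseline + overlay composition can be sketched as a thin wrapper over two fitted predictors. The gating threshold, the `esi` feature key, and the stand-in predictors below are all assumptions for illustration; any fitted regressors would slot in.

```python
# Sketch: compose the final IS estimate from baseline + PCIe overlay,
# gated so the overlay contributes only in stressed windows and clean
# windows fall back to the baseline exactly. Names are illustrative.
def predict_final_is(baseline_predict, overlay_predict,
                     market_features: dict, integrity_features: dict,
                     esi_threshold: float = 0.2) -> float:
    is_baseline = baseline_predict(market_features)
    if integrity_features.get("esi", 0.0) < esi_threshold:
        return is_baseline  # clean window: no hardware uplift term
    return is_baseline + overlay_predict(integrity_features)

# Stand-in predictors (bps); replace with fitted models.
base = lambda features: 3.0
overlay = lambda features: 1.5 * features["esi"]
```

Gating on an electrical-stress score mirrors the matched-control training design: the overlay is only ever asked to explain uplift in windows where the hardware path is actually stressed.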
Regime controller
State A: PCIe_CLEAN
- low corrected-error flow, stable completion cadence
- normal tactic policy
State B: PCIe_WATCH
- rising corrected-error slope, early completion-tail widening
- reduce replace churn, cap child burst size
State C: REPLAY_STORM
- sustained error/replay bursts with timing distortion
- avoid fragile passive queue-chasing; shift to smoother pacing template
State D: SAFE_LINK_INTEGRITY
- repeated storm episodes + deadline risk
- conservative completion mode with strict anti-bunch guardrails
Apply hysteresis + minimum dwell time to prevent controller flapping.
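The four states with hysteresis and minimum dwell can be sketched as a small state machine. The ESI thresholds and dwell length are illustrative; the key property is that entry thresholds sit above exit thresholds, and downgrades wait out the dwell, so the controller cannot flap around a single cutoff.

```python
# Sketch: four-state controller with hysteresis + minimum dwell.
# Thresholds and dwell are placeholders; escalation is immediate,
# de-escalation is one state at a time and dwell-gated.
STATES = ["PCIe_CLEAN", "PCIe_WATCH", "REPLAY_STORM", "SAFE_LINK_INTEGRITY"]
ENTER = {"PCIe_WATCH": 0.2, "REPLAY_STORM": 0.5, "SAFE_LINK_INTEGRITY": 0.8}
EXIT = {"PCIe_WATCH": 0.1, "REPLAY_STORM": 0.35, "SAFE_LINK_INTEGRITY": 0.6}

class RegimeController:
    def __init__(self, min_dwell: int = 5):
        self.state = "PCIe_CLEAN"
        self.dwell = 0                # windows spent in current state
        self.min_dwell = min_dwell    # windows required before downgrade

    def step(self, esi: float) -> str:
        self.dwell += 1
        idx = STATES.index(self.state)
        # escalate immediately to the highest crossed entry threshold
        for target in reversed(STATES[idx + 1:]):
            if esi >= ENTER[target]:
                self.state, self.dwell = target, 0
                return self.state
        # de-escalate one state only after dwell, below the exit threshold
        if idx > 0 and self.dwell >= self.min_dwell and esi < EXIT[self.state]:
            self.state, self.dwell = STATES[idx - 1], 0
        return self.state

ctl = RegimeController(min_dwell=3)
```

Asymmetric thresholds (enter REPLAY_STORM at 0.5, exit below 0.35) plus the dwell gate are what implement the "hysteresis + minimum dwell" requirement above.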
Desk metrics
- ESI (Electrical Stress Index): weighted corrected-error pressure score
- RDR (Replay Delay Ratio): stressed-window completion p95 / clean-window p95
- DCJ (DMA Completion Jitter): completion inter-arrival variance score
- BCI (Burst Catch-up Intensity): share of child actions executed in recovery clusters
- HUL (Hardware Uplift Loss): realized IS - baseline IS during integrity-stress windows
Track by host, NIC model/firmware, venue path, and symbol-liquidity bucket.
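Two of the metrics above (RDR and DCJ) reduce to short functions over windowed completion latencies. The percentile implementation and the stressed/clean window tagging are assumptions here; in production the tagging would come from the integrity features above, and latencies from the timing features.

```python
# Sketch: RDR and DCJ from windowed completion latencies/inter-arrivals.
# p95 uses a simple nearest-rank rule; swap in the desk's percentile
# convention if it differs.
import statistics

def p95(xs: list[float]) -> float:
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(0.95 * len(xs)))]

def rdr(stressed_lat: list[float], clean_lat: list[float]) -> float:
    """Replay Delay Ratio: stressed-window p95 / clean-window p95."""
    return p95(stressed_lat) / p95(clean_lat)

def dcj(inter_arrivals: list[float]) -> float:
    """DMA Completion Jitter: variance of completion inter-arrivals."""
    return statistics.pvariance(inter_arrivals)

clean = [10.0] * 96 + [12.0] * 4        # microseconds, illustrative
stressed = [10.0] * 80 + [40.0] * 20
```

An RDR well above 1 with flat wire RTT is exactly the "throughput fine, tails degrading" signature this playbook targets.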
Mitigation ladder
- Correlation-first observability
  - align AER/device counters with execution events on one timeline
- Execution containment in watch states
  - bound catch-up burst size
  - add paced recovery instead of immediate backlog flush
- Host/path hygiene
  - isolate unstable hosts/NICs from high-urgency flows
  - enforce firmware/driver baseline consistency
- Operational remediation
  - trigger hardware health checks (thermals, seating, retimer path, lane training)
  - keep a rapid failover policy for recurring replay storms
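"Paced recovery instead of immediate backlog flush" can be sketched as a token bucket over recovery windows: delayed child actions are released at a bounded rate with a hard burst cap, rather than flushed the moment the link recovers. Parameters are illustrative.

```python
# Sketch: token-bucket pacing of a delayed child-action backlog.
# Rate and burst cap are illustrative; tune per tactic and urgency tier.
def paced_release(backlog: int, windows: int, rate_per_window: int,
                  burst_cap: int) -> list[int]:
    """Return the number of child actions released in each window."""
    releases = []
    tokens = 0
    for _ in range(windows):
        tokens = min(tokens + rate_per_window, burst_cap)  # refill, capped
        sent = min(backlog, tokens)
        releases.append(sent)
        backlog -= sent
        tokens -= sent
    return releases
```

A 20-order backlog with `rate_per_window=4` drains as five windows of 4 instead of one 20-order flush, which is precisely the anti-bunching behavior the watch states require.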
Failure drills (must run)
- Synthetic integrity-stress replay
  - replay historical AER burst windows and verify early PCIe_WATCH detection
- Catch-up control drill
  - confirm bounded recovery pacing reduces queue-reset tax
- Confounder drill
  - separate PCIe-path stress from exchange/network-wide latency events
- Host failover drill
  - verify fast reroute to healthy hardware without tactic thrash
Anti-patterns
- Treating corrected AER errors as harmless because packets still flow
- Monitoring only wire RTT while ignoring DMA/completion timing quality
- Flushing delayed child backlog immediately after recovery
- Labeling tail slippage as "market noise" without hardware-path attribution
Bottom line
PCIe corrected-error storms can quietly tax execution quality long before obvious outages appear. In low-latency systems, "no hard failure" is not the same as "no economic damage."
Model PCIe integrity uplift explicitly and wire it into execution-state control, or hidden basis-point leakage will persist in tail windows.