PCIe AER Correctable-Error Storm Slippage Playbook

2026-03-17 · finance

Why this exists

Execution teams often explain sudden tail slippage as "market microstructure noise" when venue-side telemetry looks normal.

A recurring hidden cause is PCIe link-level instability between the NIC and the host, especially AER correctable-error storms and replay bursts. Throughput can look fine while latency tails quietly degrade.

If the NIC can still push packets but DMA/completion timing becomes jittery, child-order timing quality collapses before standard dashboards turn red.


Core failure mode

When PCIe electrical integrity degrades (marginal lane quality, retimer issues, connector aging, thermals, power transients):

Result: q95/q99 implementation shortfall rises even when average wire RTT and fill rates look "acceptable."


Slippage decomposition with PCIe integrity terms

For parent order \(i\):

\[ IS_i = C_{delay} + C_{impact} + C_{miss} + C_{pcie} \]

Where:

\[ C_{pcie} = C_{replay-latency} + C_{dma-jitter} + C_{burst-recovery} \]
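
To make the bookkeeping concrete, here is a minimal sketch of the decomposition in Python. The class and function names are hypothetical, the unit (basis points) is an assumption, and the numbers are illustrative only, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PcieCost:
    """PCIe integrity terms of the decomposition (assumed unit: basis points)."""
    replay_latency: float
    dma_jitter: float
    burst_recovery: float

    @property
    def total(self) -> float:
        # C_pcie = C_replay-latency + C_dma-jitter + C_burst-recovery
        return self.replay_latency + self.dma_jitter + self.burst_recovery


def implementation_shortfall(delay: float, impact: float, miss: float,
                             pcie: PcieCost) -> float:
    """IS_i = C_delay + C_impact + C_miss + C_pcie."""
    return delay + impact + miss + pcie.total


# Illustrative values only.
pcie = PcieCost(replay_latency=0.25, dma_jitter=0.5, burst_recovery=0.25)
is_bps = implementation_shortfall(delay=1.5, impact=2.0, miss=0.5, pcie=pcie)
print(f"C_pcie={pcie.total:.2f} bps, IS={is_bps:.2f} bps")
```

Keeping C_pcie as its own object makes it easy to report the hardware-path share of shortfall separately from the market-driven terms.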


Feature set (production-ready)

1) PCIe + device integrity features

From AER / driver / system counters:

From host health context:

2) Execution-path timing features

3) Microstructure outcome features
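
For the AER counter features in item 1, one possible collection path on Linux is the per-device sysfs file aer_dev_correctable, which lists one "name count" pair per line. The sketch below assumes that file is present (it requires kernel AER support), and the counter labels differ between kernel versions, so the parser makes no assumption about specific names. The function names and the sample text are hypothetical.

```python
from pathlib import Path


def parse_aer_stats(text: str) -> dict:
    """Parse 'label count' lines; the label may itself contain spaces."""
    counts = {}
    for line in text.splitlines():
        name, _, value = line.strip().rpartition(" ")
        if name and value.isdigit():
            counts[name.strip()] = int(value)
    return counts


def read_correctable(bdf: str, root: str = "/sys/bus/pci/devices") -> dict:
    """Read correctable-error counters for one device, e.g. bdf='0000:3b:00.0'."""
    return parse_aer_stats((Path(root) / bdf / "aer_dev_correctable").read_text())


def counter_deltas(prev: dict, curr: dict) -> dict:
    """Per-interval increments; a counter that went backwards is treated as reset."""
    deltas = {}
    for name, value in curr.items():
        before = prev.get(name, 0)
        deltas[name] = value - before if value >= before else value
    return deltas


# Illustrative sample text, not a real capture.
sample_t0 = "Receiver Error 2\nBad TLP 0\nTOTAL_ERR_COR 2\n"
sample_t1 = "Receiver Error 9\nBad TLP 4\nTOTAL_ERR_COR 13\n"
print(counter_deltas(parse_aer_stats(sample_t0), parse_aer_stats(sample_t1)))
```

Features are usually more useful as per-interval deltas than as raw cumulative counts, since a storm shows up as a burst in the rate of increase.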


Model architecture

Use a baseline model plus a hardware-integrity overlay.

  1. Baseline slippage model
    • existing impact/fill/deadline model
  2. PCIe integrity overlay
    • predicts incremental uplift:
      • delta_is_mean
      • delta_is_q95

Final estimate:

\[ \hat{IS}_{final} = \hat{IS}_{baseline} + \Delta\hat{IS}_{pcie} \]

Train with episode windows around AER/replay bursts and matched controls (same symbol/session/volatility bucket) so hardware-path uplift is separated from market-state effects.
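
One way to build training targets for the overlay is to compare shortfall inside burst episodes against matched controls within the same bucket. The sketch below is a simplification under stated assumptions: the tuple layout, the bucket key, and the function names are hypothetical, and the q95 estimate uses the standard library's inclusive quantile method rather than any particular production estimator.

```python
from collections import defaultdict
from statistics import mean, quantiles


def q95(values):
    """95th percentile: last of the 19 cut points that split data into 20 groups."""
    return quantiles(values, n=20, method="inclusive")[18]


def uplift_by_bucket(samples):
    """samples: iterable of (bucket, is_bps, in_episode) tuples.

    Returns {bucket: {"delta_is_mean": ..., "delta_is_q95": ...}} for buckets
    with at least two episode and two control observations.
    """
    grouped = defaultdict(lambda: ([], []))  # bucket -> (episode, control)
    for bucket, is_bps, in_episode in samples:
        grouped[bucket][0 if in_episode else 1].append(is_bps)

    out = {}
    for bucket, (episode, control) in grouped.items():
        if len(episode) < 2 or len(control) < 2:
            continue  # too few observations to compare
        out[bucket] = {
            "delta_is_mean": mean(episode) - mean(control),
            "delta_is_q95": q95(episode) - q95(control),
        }
    return out


# Illustrative data: episode shortfall is the control shifted up by 2 bps.
demo = [("bucket_a", x, False) for x in (1.0, 2.0, 3.0, 4.0, 5.0)]
demo += [("bucket_a", x, True) for x in (3.0, 4.0, 5.0, 6.0, 7.0)]
demo += [("bucket_b", 9.0, True)]  # dropped: no matched controls
print(uplift_by_bucket(demo))
```

Differencing within a bucket is what keeps market-state effects out of the uplift estimate: both sides of the comparison share symbol, session, and volatility regime.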


Regime controller

State A: PCIe_CLEAN

State B: PCIe_WATCH

State C: REPLAY_STORM

State D: SAFE_LINK_INTEGRITY

Apply hysteresis + minimum dwell time to prevent controller flapping.
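
A minimal sketch of hysteresis plus minimum dwell time follows. It rests on several assumptions not stated above: the four states are treated as ordered severity levels, the input is a single stress score (for example a correctable-error rate), escalation is immediate while de-escalation steps down one level at a time, and every threshold value is a hypothetical placeholder.

```python
from dataclasses import dataclass

STATES = ("PCIe_CLEAN", "PCIe_WATCH", "REPLAY_STORM", "SAFE_LINK_INTEGRITY")

# Hypothetical thresholds on the stress score. EXIT < ENTER gives hysteresis.
ENTER = (0.0, 5.0, 50.0, 500.0)  # score >= ENTER[k] escalates to level k
EXIT = (0.0, 2.0, 20.0, 200.0)   # score < EXIT[k] permits leaving level k


@dataclass
class RegimeController:
    min_dwell_s: float = 30.0
    level: int = 0
    entered_at: float = 0.0

    def update(self, score: float, now: float) -> str:
        """Advance the state machine with one observation at time `now` (seconds)."""
        target = max((k for k in range(len(STATES)) if score >= ENTER[k]), default=0)
        if target > self.level:
            # Escalate immediately: reacting late to a storm is the costly error.
            self.level, self.entered_at = target, now
        elif (self.level > 0
              and score < EXIT[self.level]
              and now - self.entered_at >= self.min_dwell_s):
            # De-escalate one level only after the dwell time has elapsed.
            self.level, self.entered_at = self.level - 1, now
        return STATES[self.level]


ctl = RegimeController(min_dwell_s=30.0)
trace = [ctl.update(s, t) for s, t in
         [(1.0, 0.0), (6.0, 1.0), (3.0, 2.0), (1.0, 10.0), (1.0, 40.0)]]
print(trace)
```

In the trace, the score of 3.0 does not leave PCIe_WATCH because it is still above the exit threshold, and the score of 1.0 at t=10 does not leave it either because the dwell time has not elapsed.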


Desk metrics

Track by host, NIC model/firmware, venue path, and symbol-liquidity bucket.


Mitigation ladder

  1. Correlation-first observability
    • align AER/device counters with execution events on one timeline
  2. Execution containment in watch states
    • bound catch-up burst size
    • add paced recovery instead of immediate backlog flush
  3. Host/path hygiene
    • isolate unstable hosts/NICs from high-urgency flows
    • enforce firmware/driver baseline consistency
  4. Operational remediation
    • trigger hardware health checks (thermals, seating, retimer path, lane training)
    • keep a rapid failover policy for recurring replay storms
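
The containment step in item 2 (bounded catch-up bursts with paced recovery) can be sketched as a simple release schedule. The function name and parameters are hypothetical, and a real implementation would hook into the order gateway's clock and queue rather than a plain list.

```python
from collections import deque


def paced_release(backlog, max_burst, interval_s, start_s=0.0):
    """Yield (send_time_s, batch) pairs with at most max_burst items per interval."""
    if max_burst < 1:
        raise ValueError("max_burst must be at least 1")
    queue = deque(backlog)
    t = start_s
    while queue:
        size = min(max_burst, len(queue))
        batch = [queue.popleft() for _ in range(size)]
        yield t, batch
        t += interval_s  # wait one pacing interval before the next batch


# Five queued child orders, released two at a time every 0.5 s.
schedule = list(paced_release(["o1", "o2", "o3", "o4", "o5"],
                              max_burst=2, interval_s=0.5))
for send_time, batch in schedule:
    print(send_time, batch)
```

The contrast is with an immediate backlog flush, which would send all five orders at t=0 in a single burst.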

Failure drills (must run)

  1. Synthetic integrity-stress replay
    • replay historical AER burst windows and verify early PCIe_WATCH detection
  2. Catch-up control drill
    • confirm bounded recovery pacing reduces queue-reset tax
  3. Confounder drill
    • separate PCIe-path stress from exchange/network-wide latency events
  4. Host failover drill
    • verify fast reroute to healthy hardware without tactic thrash

Anti-patterns


Bottom line

PCIe correctable-error storms can quietly tax execution quality long before obvious outages appear. In low-latency systems, "no hard failure" is not the same as "no economic damage."

Model PCIe integrity uplift explicitly and wire it into execution-state control, or hidden basis-point leakage will persist in tail windows.