PCIe AER Correctable-Error Storm Slippage Playbook
Why this exists
Execution teams often explain sudden tail slippage as "market microstructure noise" when venue-side telemetry looks normal.
A recurring hidden cause is PCIe link-level instability (especially AER-correctable error storms and replay bursts) between NIC and host. Throughput can look fine while latency tails quietly degrade.
If the NIC can still push packets but DMA/completion timing becomes jittery, child-order timing quality collapses before standard dashboards turn red.
Core failure mode
When PCIe electrical integrity degrades (marginal lane quality, retimer issues, connector aging, thermals, power transients):
- correctable AER counters climb quickly,
- data-link replays increase,
- NIC DMA completion latency becomes bursty,
- TX/RX service cadence desynchronizes from strategy loop timing,
- child intents bunch after delayed completion windows,
- queue-age quality degrades and fallback aggression rises.
Result: q95/q99 implementation shortfall rises even when average wire RTT and fill rates look "acceptable."
Slippage decomposition with PCIe integrity terms
For parent order (i):
[ IS_i = C_{delay} + C_{impact} + C_{miss} + C_{pcie} ]
Where:
[ C_{pcie} = C_{replay-latency} + C_{dma-jitter} + C_{burst-recovery} ]
- Replay latency: link replay events stretch completion paths
- DMA jitter: host-visible send/receive readiness becomes uneven
- Burst recovery: delayed processing triggers clustered child actions and queue resets
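The decomposition above can be sketched as a small accounting structure. This is an illustrative sketch, not a fitted attribution model: the dataclass, field names, and the example basis-point values are all assumptions for demonstration.

```python
# Hypothetical sketch of the IS decomposition above, in bps.
# All field names and values are illustrative, not a real attribution run.
from dataclasses import dataclass

@dataclass
class SlippageBreakdown:
    c_delay: float           # decision-to-arrival delay cost
    c_impact: float          # market impact cost
    c_miss: float            # opportunity cost of unfilled quantity
    c_replay_latency: float  # link-replay completion stretch
    c_dma_jitter: float      # uneven host-visible send/receive readiness
    c_burst_recovery: float  # clustered catch-up child actions

    @property
    def c_pcie(self) -> float:
        # C_pcie = C_replay-latency + C_dma-jitter + C_burst-recovery
        return self.c_replay_latency + self.c_dma_jitter + self.c_burst_recovery

    @property
    def total_is(self) -> float:
        # IS_i = C_delay + C_impact + C_miss + C_pcie
        return self.c_delay + self.c_impact + self.c_miss + self.c_pcie

bd = SlippageBreakdown(1.2, 2.5, 0.8, 0.3, 0.2, 0.4)
```

Keeping the three PCIe sub-terms separate (rather than a single hardware bucket) is what later lets the overlay model and the desk metrics attribute uplift to replay stretch vs. jitter vs. catch-up clustering.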
Feature set (production-ready)
1) PCIe + device integrity features
From AER / driver / system counters:
- Correctable error rate per minute (device + upstream port)
- replay-related link events (where exposed)
- receiver error / bad DLLP/TLP deltas
- corrected-error burst length (max consecutive stressed windows)
- NIC PCIe link speed/width downgrade events
From host health context:
- PCIe device temperature and thermal-throttle flags
- power-state transitions impacting NIC latency path
- IRQ service delay drift under equal traffic load
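A minimal collector for the first feature (correctable error rate per minute) can be sketched as below. The sysfs path and counter names follow the Linux kernel's `aer_dev_correctable` format, but availability and exact field names vary by kernel version and platform, so treat both as assumptions to verify on your hosts.

```python
# Sketch: parse Linux AER correctable-error counters and turn successive
# snapshots into per-minute rates. Path/field names assume the kernel's
# aer_dev_correctable sysfs format; verify on the target platform.
from pathlib import Path

def parse_aer_correctable(text: str) -> dict[str, int]:
    """Parse 'CounterName value' lines (e.g. 'BadTLP 12')."""
    counters = {}
    for line in text.strip().splitlines():
        name, _, value = line.rpartition(" ")
        if name and value.isdigit():
            counters[name.strip()] = int(value)
    return counters

def per_minute_rate(prev: dict[str, int], curr: dict[str, int],
                    elapsed_s: float) -> dict[str, float]:
    """Delta each counter between snapshots and scale to errors/minute."""
    return {k: max(curr.get(k, 0) - prev.get(k, 0), 0) * 60.0 / elapsed_s
            for k in curr}

def read_device_counters(bdf: str) -> dict[str, int]:
    # bdf like "0000:3b:00.0"; the file exists only when AER is enabled
    path = Path(f"/sys/bus/pci/devices/{bdf}/aer_dev_correctable")
    return parse_aer_correctable(path.read_text())

snap0 = parse_aer_correctable("RxErr 10\nBadTLP 4\nBadDLLP 0\nTimeout 2")
snap1 = parse_aer_correctable("RxErr 70\nBadTLP 10\nBadDLLP 0\nTimeout 8")
rates = per_minute_rate(snap0, snap1, elapsed_s=30.0)
```

Sampling both the endpoint and its upstream port (as the feature list suggests) usually requires walking the device's parent in the sysfs hierarchy; the sketch covers a single function only.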
2) Execution-path timing features
- strategy decision-to-send gap quantiles
- send-to-NIC-completion latency (p50/p95/p99)
- completion inter-arrival burstiness index
- cancel/replace clustering after completion stalls
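One workable definition of the burstiness index above is the coefficient of variation of completion inter-arrival gaps; this is one common choice, not necessarily the desk's canonical one. A CV near 1 is Poisson-like, while values well above 1 flag stall-then-flush completion patterns.

```python
# Sketch: completion inter-arrival burstiness as the coefficient of
# variation (CV) of gaps between completion timestamps (seconds).
# CV ~ 1 is Poisson-like; CV >> 1 indicates stalled/clustered completions.
import statistics

def burstiness_index(completion_ts: list[float]) -> float:
    gaps = [b - a for a, b in zip(completion_ts, completion_ts[1:])]
    if len(gaps) < 2:
        return 0.0
    mean = statistics.fmean(gaps)
    if mean == 0:
        return 0.0
    return statistics.pstdev(gaps) / mean

# even cadence -> near-zero index
even = [i * 0.001 for i in range(100)]
# 50 ms stall followed by a tight flush -> high index
stalled = even[:50] + [even[49] + 0.050 + i * 0.0001 for i in range(50)]
```

The same gap series feeds the DCJ metric later in this playbook, so computing it once per window and reusing it keeps the feature and the desk metric consistent.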
3) Microstructure outcome features
- passive fill ratio by completion-latency bucket
- markout ladders (10ms / 100ms / 1s / 5s)
- completion deficit vs schedule/deadline
- branch labels: clean, electrical-watch, replay-storm, panic-catchup
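A labeling rule for the four branches might look like the sketch below. The feature names and every threshold are placeholders to be calibrated per host/NIC fleet, not values taken from this playbook.

```python
# Sketch: assign the four branch labels from window-level features.
# All thresholds are illustrative placeholders; calibrate per fleet.
def label_window(corr_err_per_min: float, replay_burst_len: int,
                 completion_p99_ratio: float, catchup_share: float) -> str:
    """completion_p99_ratio = window p99 / clean-baseline p99."""
    if catchup_share > 0.30:
        return "panic-catchup"
    if replay_burst_len >= 3 and completion_p99_ratio > 2.0:
        return "replay-storm"
    if corr_err_per_min > 10.0 or completion_p99_ratio > 1.3:
        return "electrical-watch"
    return "clean"
```

Checking the most severe branch first keeps the labels mutually exclusive, which matters when they are used as training targets for the overlay model.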
Model architecture
Use baseline + hardware-integrity overlay.
- Baseline slippage model
  - existing impact/fill/deadline model
- PCIe integrity overlay
  - predicts incremental uplift: delta_is_mean, delta_is_q95
Final estimate:
[ \hat{IS}_{final} = \hat{IS}_{baseline} + \Delta\hat{IS}_{pcie} ]
Train with episode windows around AER/replay bursts and matched controls (same symbol/session/volatility bucket) so hardware-path uplift is separated from market-state effects.
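The baseline + overlay composition can be sketched as a thin wrapper over two fitted predictors. The gating threshold, the `esi` feature key, and the stand-in predictors below are all assumptions for illustration; any fitted regressors would slot in.

```python
# Sketch: compose the final IS estimate from baseline + PCIe overlay,
# gated so the overlay contributes only in stressed windows and clean
# windows fall back to the baseline exactly. Names are illustrative.
def predict_final_is(baseline_predict, overlay_predict,
                     market_features: dict, integrity_features: dict,
                     esi_threshold: float = 0.2) -> float:
    is_baseline = baseline_predict(market_features)
    if integrity_features.get("esi", 0.0) < esi_threshold:
        return is_baseline  # clean window: no hardware uplift term
    return is_baseline + overlay_predict(integrity_features)

# Stand-in predictors (bps); replace with fitted models.
base = lambda features: 3.0
overlay = lambda features: 1.5 * features["esi"]
```

Gating on an electrical-stress score mirrors the matched-control training design: the overlay is only ever asked to explain uplift in windows where the hardware path is actually stressed.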
Regime controller
State A: PCIe_CLEAN
- low corrected-error flow, stable completion cadence
- normal tactic policy
State B: PCIe_WATCH
- rising corrected-error slope, early completion-tail widening
- reduce replace churn, cap child burst size
State C: REPLAY_STORM
- sustained error/replay bursts with timing distortion
- avoid fragile passive queue-chasing; shift to smoother pacing template
State D: SAFE_LINK_INTEGRITY
- repeated storm episodes + deadline risk
- conservative completion mode with strict anti-bunch guardrails
Apply hysteresis + minimum dwell time to prevent controller flapping.
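The four states with hysteresis and minimum dwell can be sketched as a small state machine. The ESI thresholds and dwell length are illustrative; the key property is that entry thresholds sit above exit thresholds, and downgrades wait out the dwell, so the controller cannot flap around a single cutoff.

```python
# Sketch: four-state controller with hysteresis + minimum dwell.
# Thresholds and dwell are placeholders; escalation is immediate,
# de-escalation is one state at a time and dwell-gated.
STATES = ["PCIe_CLEAN", "PCIe_WATCH", "REPLAY_STORM", "SAFE_LINK_INTEGRITY"]
ENTER = {"PCIe_WATCH": 0.2, "REPLAY_STORM": 0.5, "SAFE_LINK_INTEGRITY": 0.8}
EXIT = {"PCIe_WATCH": 0.1, "REPLAY_STORM": 0.35, "SAFE_LINK_INTEGRITY": 0.6}

class RegimeController:
    def __init__(self, min_dwell: int = 5):
        self.state = "PCIe_CLEAN"
        self.dwell = 0                # windows spent in current state
        self.min_dwell = min_dwell    # windows required before downgrade

    def step(self, esi: float) -> str:
        self.dwell += 1
        idx = STATES.index(self.state)
        # escalate immediately to the highest crossed entry threshold
        for target in reversed(STATES[idx + 1:]):
            if esi >= ENTER[target]:
                self.state, self.dwell = target, 0
                return self.state
        # de-escalate one state only after dwell, below the exit threshold
        if idx > 0 and self.dwell >= self.min_dwell and esi < EXIT[self.state]:
            self.state, self.dwell = STATES[idx - 1], 0
        return self.state

ctl = RegimeController(min_dwell=3)
```

Asymmetric thresholds (enter REPLAY_STORM at 0.5, exit below 0.35) plus the dwell gate are what implement the "hysteresis + minimum dwell" requirement above.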
Desk metrics
- ESI (Electrical Stress Index): weighted corrected-error pressure score
- RDR (Replay Delay Ratio): stressed-window completion p95 / clean-window p95
- DCJ (DMA Completion Jitter): completion inter-arrival variance score
- BCI (Burst Catch-up Intensity): share of child actions executed in recovery clusters
- HUL (Hardware Uplift Loss): realized IS - baseline IS during integrity-stress windows
Track by host, NIC model/firmware, venue path, and symbol-liquidity bucket.
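Two of the metrics above (RDR and DCJ) reduce to short functions over windowed completion latencies. The percentile implementation and the stressed/clean window tagging are assumptions here; in production the tagging would come from the integrity features above, and latencies from the timing features.

```python
# Sketch: RDR and DCJ from windowed completion latencies/inter-arrivals.
# p95 uses a simple nearest-rank rule; swap in the desk's percentile
# convention if it differs.
import statistics

def p95(xs: list[float]) -> float:
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(0.95 * len(xs)))]

def rdr(stressed_lat: list[float], clean_lat: list[float]) -> float:
    """Replay Delay Ratio: stressed-window p95 / clean-window p95."""
    return p95(stressed_lat) / p95(clean_lat)

def dcj(inter_arrivals: list[float]) -> float:
    """DMA Completion Jitter: variance of completion inter-arrivals."""
    return statistics.pvariance(inter_arrivals)

clean = [10.0] * 96 + [12.0] * 4        # microseconds, illustrative
stressed = [10.0] * 80 + [40.0] * 20
```

An RDR well above 1 with flat wire RTT is exactly the "throughput fine, tails degrading" signature this playbook targets.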
Mitigation ladder
- Correlation-first observability
  - align AER/device counters with execution events on one timeline
- Execution containment in watch states
  - bound catch-up burst size
  - add paced recovery instead of immediate backlog flush
- Host/path hygiene
  - isolate unstable hosts/NICs from high-urgency flows
  - enforce firmware/driver baseline consistency
- Operational remediation
  - trigger hardware health checks (thermals, seating, retimer path, lane training)
  - keep a rapid failover policy for recurring replay storms
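"Paced recovery instead of immediate backlog flush" can be sketched as a token bucket over recovery windows: delayed child actions are released at a bounded rate with a hard burst cap, rather than flushed the moment the link recovers. Parameters are illustrative.

```python
# Sketch: token-bucket pacing of a delayed child-action backlog.
# Rate and burst cap are illustrative; tune per tactic and urgency tier.
def paced_release(backlog: int, windows: int, rate_per_window: int,
                  burst_cap: int) -> list[int]:
    """Return the number of child actions released in each window."""
    releases = []
    tokens = 0
    for _ in range(windows):
        tokens = min(tokens + rate_per_window, burst_cap)  # refill, capped
        sent = min(backlog, tokens)
        releases.append(sent)
        backlog -= sent
        tokens -= sent
    return releases
```

A 20-order backlog with `rate_per_window=4` drains as five windows of 4 instead of one 20-order flush, which is precisely the anti-bunching behavior the watch states require.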
Failure drills (must run)
- Synthetic integrity-stress replay
  - replay historical AER burst windows and verify early PCIe_WATCH detection
- Catch-up control drill
  - confirm bounded recovery pacing reduces queue-reset tax
- Confounder drill
  - separate PCIe-path stress from exchange/network-wide latency events
- Host failover drill
  - verify fast reroute to healthy hardware without tactic thrash
Anti-patterns
- Treating corrected AER errors as harmless because packets still flow
- Monitoring only wire RTT while ignoring DMA/completion timing quality
- Flushing delayed child backlog immediately after recovery
- Labeling tail slippage as "market noise" without hardware-path attribution
Bottom line
PCIe corrected-error storms can quietly tax execution quality long before obvious outages appear. In low-latency systems, "no hard failure" is not the same as "no economic damage."
Model PCIe integrity uplift explicitly and wire it into execution-state control, or hidden basis-point leakage will persist in tail windows.