IOMMU TLB Flush-Storm DMA-Remap Slippage Playbook
Why this exists
Execution stacks can pass ordinary CPU/network health checks and still leak implementation shortfall at the p95/p99 tails.
One under-modeled source is IOMMU translation pressure: NIC DMA mappings/unmappings trigger IOTLB invalidations, and under bursty traffic or allocator churn this can become a flush storm.
When that happens, packet/descriptor handling latency becomes uneven, and order-path timing starts paying an invisible tax.
Core failure mode
Under high packet turnover and frequent DMA map updates:
- IOTLB invalidations spike,
- DMA completion latency becomes bursty,
- RX/TX service cadence loses smoothness,
- market-data ingest and order-send timing dephase,
- cancel/replace loops bunch,
- queue quality decays,
- late-cycle urgency raises convex crossing cost.
Result: tail slippage rises even if average RTT/CPU still looks "normal."
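One way to see this failure mode in telemetry is to score IOTLB invalidation burstiness directly. Below is a minimal sketch: it assumes you already sample a per-millisecond invalidation-counter delta series (the `invalidations_per_ms` input is hypothetical; wire in whatever your IOMMU driver or perf counters expose) and flags how much the tail rate exceeds the median.

```python
def flush_burst_score(invalidations_per_ms, window=50, q=0.95):
    """Burstiness score for IOTLB invalidations: tail rate / median rate.

    `invalidations_per_ms` is a hypothetical per-millisecond counter-delta
    series sampled from the host's IOMMU telemetry. A score near 1.0 means
    steady invalidation pressure; a large score means flush bursts.
    """
    recent = invalidations_per_ms[-window:]
    ranked = sorted(recent)
    median = ranked[len(ranked) // 2]
    tail = ranked[int(q * (len(ranked) - 1))]
    return tail / median if median > 0 else float("inf")
```

A flat series scores 1.0; a series with occasional spikes scores roughly spike/median, which is the shape you would alert on.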
Slippage decomposition with IOMMU term
For parent order (i):
[ IS_i = C_{delay} + C_{impact} + C_{miss} + C_{iommu} ]
Where:
[ C_{iommu} = C_{dma-jitter} + C_{service-burst} + C_{queue-decay} ]
- DMA jitter cost: variable NIC DMA completion/descriptor service timing
- Service burst cost: uneven packet processing cadence (microbursting at app layer)
- Queue decay cost: stale reaction windows and reset-heavy retries
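The decomposition above is additive, so it can be carried as a plain accumulator per parent order. A minimal sketch, assuming all costs are already expressed in the same units (e.g. bps of arrival price); the component names are illustrative:

```python
def implementation_shortfall(c_delay, c_impact, c_miss, c_iommu_parts):
    """IS_i = C_delay + C_impact + C_miss + C_iommu, where
    C_iommu = C_dma_jitter + C_service_burst + C_queue_decay.

    `c_iommu_parts` maps component name -> cost, all in the same units.
    Returns (total IS, aggregated C_iommu) so the IOMMU term can be
    tracked separately from the classic decomposition.
    """
    c_iommu = sum(c_iommu_parts.values())
    return c_delay + c_impact + c_miss + c_iommu, c_iommu
```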
Feature set (production-ready)
1) Host / DMA-path features
- IOTLB flush rate and burst quantiles
- DMA map/unmap rate by queue
- NIC ring occupancy oscillation amplitude
- RX/TX NAPI poll duration variance
- per-NUMA memory locality for NIC buffers
2) Execution timing features
- market-data ingress gap variance
- decision-to-send latency quantiles (p50/p95/p99)
- cancel-to-ack and replace-to-ack drift
- child-order inter-dispatch burst index
- scheduler phase error vs intended dispatch grid
3) Outcome features
- passive fill ratio by flush-pressure bucket
- short-horizon markout ladder (10ms / 100ms / 1s / 5s)
- completion deficit under matched liquidity regime
- branch labels: map-stable, pressure-watch, flush-storm, deadline-chase
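The branch labels can be derived from the host features at bucketing time. A minimal sketch, assuming a scalar DMA Flush Intensity score and a boolean deadline-pressure flag; the thresholds are illustrative placeholders, not calibrated values:

```python
def branch_label(dfi_score, deadline_pressure, watch=2.0, storm=5.0):
    """Map flush pressure + deadline state to an outcome-feature label.

    `dfi_score` is the host's DMA Flush Intensity score; `watch` and
    `storm` are hypothetical thresholds to be fit per host pool.
    """
    if dfi_score >= storm and deadline_pressure:
        return "deadline-chase"
    if dfi_score >= storm:
        return "flush-storm"
    if dfi_score >= watch:
        return "pressure-watch"
    return "map-stable"
```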
Model architecture
Use a baseline + remap-overlay design:
- Baseline slippage model
- spread/impact/fill/deadline stack
- IOMMU pressure overlay
- predicts incremental uplift: delta_is_mean, delta_is_q95
Final estimate:
[ \hat{IS}_{final} = \hat{IS}_{baseline} + \Delta\hat{IS}_{iommu} ]
Train with matched market windows (symbol/session/volatility/liquidity bucket) across different remap-pressure states to isolate infra effects from market confounders.
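The combination step itself is trivial once both pieces exist. A minimal sketch, assuming the overlay is any fitted regressor exposing a scikit-learn-style `predict` method (the interface here is an assumption; swap in your own model stack):

```python
def final_is_estimate(baseline_is, overlay_features, overlay_model):
    """IS_final = IS_baseline + delta_IS_iommu.

    `baseline_is` comes from the spread/impact/fill/deadline stack;
    `overlay_model` is a hypothetical fitted regressor that predicts
    the incremental IOMMU-pressure uplift from DMA-path features.
    """
    delta = overlay_model.predict([overlay_features])[0]
    return baseline_is + delta
```

Keeping the overlay additive makes the infra term auditable: you can zero it out and recover the baseline estimate exactly.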
Regime controller
State A: MAP_STABLE
- low flush pressure, stable DMA cadence
- normal execution policy
State B: PRESSURE_WATCH
- flush bursts rising, timing tails widening
- reduce unnecessary replace churn, smooth pacing
State C: FLUSH_STORM
- sustained invalidation bursts + packet-service oscillation
- cap burst size, increase minimum spacing, avoid fragile queue races
State D: SAFE_DMA_CONTAIN
- repeated storm + deadline pressure
- route urgency-sensitive flow through validated low-pressure host/queue paths, conservative completion policy
Use hysteresis + minimum dwell time to prevent policy flapping.
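The hysteresis and minimum-dwell logic can be sketched as a small state machine. This is illustrative only: it covers the A/B/C ladder (SAFE_DMA_CONTAIN escalation is omitted for brevity), escalates immediately on breaching an upper threshold, and de-escalates only below a lower threshold after the dwell expires. All thresholds and dwell ticks are hypothetical placeholders:

```python
class RemapRegimeController:
    """Hysteresis + minimum-dwell controller over a DFI score stream."""

    ORDER = ["MAP_STABLE", "PRESSURE_WATCH", "FLUSH_STORM"]

    def __init__(self, up=(2.0, 5.0), down=(1.5, 4.0), min_dwell=10):
        self.up, self.down = up, down       # escalation / de-escalation thresholds
        self.min_dwell = min_dwell          # ticks before de-escalation is allowed
        self.state, self.dwell = "MAP_STABLE", 0

    def step(self, dfi):
        self.dwell += 1
        idx = self.ORDER.index(self.state)
        # Escalate immediately: storms are expensive, react fast.
        if idx < 2 and dfi >= self.up[idx]:
            self.state, self.dwell = self.ORDER[idx + 1], 0
        # De-escalate only below the *lower* threshold and after dwell,
        # which is what prevents policy flapping near the boundary.
        elif idx > 0 and dfi < self.down[idx - 1] and self.dwell >= self.min_dwell:
            self.state, self.dwell = self.ORDER[idx - 1], 0
        return self.state
```

The asymmetric up/down thresholds are the hysteresis band; the dwell counter is the flap guard.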
Desk metrics
- DFI (DMA Flush Intensity): invalidation pressure score
- DJS (DMA Jitter Spread): completion-time variability severity
- PSO (Packet Service Oscillation): ingest/send cadence instability
- QDL (Queue Decay Loss): passive quality degradation under remap pressure
- IUL (IOMMU Uplift Loss): realized IS minus baseline IS in high-pressure regimes
Track by host pool, NIC model/driver, NUMA placement, symbol-liquidity bucket, and session segment.
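IUL is the metric that ties the others back to realized cost. A minimal sketch, assuming per-window realized and baseline IS series plus a boolean high-pressure flag per window (the inputs are hypothetical; in practice the flag would come from the DFI bucketing):

```python
def iommu_uplift_loss(realized_is, baseline_is, pressure_flags):
    """IUL: mean (realized - baseline) IS over high-pressure windows only.

    Windows where `pressure_flags` is False are excluded, so the metric
    isolates the cost paid specifically under remap pressure.
    """
    gaps = [r - b
            for r, b, hot in zip(realized_is, baseline_is, pressure_flags)
            if hot]
    return sum(gaps) / len(gaps) if gaps else 0.0
```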
Mitigation ladder
- Mapping churn reduction
- prefer stable DMA mapping strategies and buffer lifecycle discipline
- NUMA and queue locality hygiene
- align NIC queues, CPU affinity, and memory locality
- Burst-containment execution policy
- bounded catch-up pacing over panic flushes
- Topology-aware routing
- route urgent flow away from hosts/queues with rising DFI/PSO
- Change-aware recalibration
- re-fit overlay after kernel/NIC-driver/IOMMU config updates
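The topology-aware routing rung can be sketched as a pressure-ranked host selection. This is illustrative, assuming each candidate pool reports its current DFI and PSO scores (the pool-record shape and the cap value are assumptions):

```python
def pick_host(pools, dfi_cap=5.0):
    """Route urgency-sensitive flow to the lowest-pressure eligible pool.

    `pools` is a list of dicts like {"name": ..., "dfi": ..., "pso": ...}.
    Pools at or above `dfi_cap` are excluded outright; among the rest,
    pick the minimum combined DFI+PSO. Returning None signals the caller
    to fall back to the conservative completion policy.
    """
    eligible = [p for p in pools if p["dfi"] < dfi_cap]
    if not eligible:
        return None
    return min(eligible, key=lambda p: p["dfi"] + p["pso"])["name"]
```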
Failure drills (must run)
- Flush-burst replay drill
- verify early transition to PRESSURE_WATCH
- Storm containment drill
- confirm bounded recovery beats panic catch-up on q95 IS
- Confounder separation drill
- distinguish remap-pressure effects from pure venue/network latency shocks
- Fallback path drill
- validate safe reroute to low-pressure host/queue pools under stress
Anti-patterns
- Treating average RTT as complete timing truth
- Ignoring DMA map/unmap churn in low-latency hosts
- Disabling IOMMU blindly without security/compliance review
- Running retry-heavy execution logic that amplifies cadence oscillation
Bottom line
IOMMU is often viewed as a security/performance toggle, but in execution systems the real issue is translation-pressure dynamics.
If IOTLB flush storms are not modeled as a slippage factor, tail cost will keep leaking through “normal-looking” infra dashboards.