Net DIM Adaptive-Interrupt Oscillation Slippage Playbook

2026-03-22 · finance

Scope: How dynamic interrupt moderation (DIM) mode-flips in Linux NIC drivers create tail-latency bursts and execution slippage

Why this matters

Most low-latency teams tune NIC coalescing once and move on.

But when adaptive moderation (Net DIM / driver-specific DIM logic) is enabled, the NIC+driver can continuously move between interrupt profiles (e.g., low-usec/high-irq vs high-usec/low-irq). Under mixed burst regimes, this can become a flip-flop control loop instead of stable optimization.

For execution stacks, that looks like:

  - decision-to-wire tails (p95/p99) widening while medians barely move
  - child-order dispatch turning lumpy as packet release becomes bursty
  - strategies reacting to stale or phase-shifted market snapshots

This is not a hard outage. It is a state-dependent latency tax that hides inside "normal adaptive behavior."


Failure mechanism (one timeline)

  1. Market-data/order-ack load alternates between microbursts and lulls.
  2. DIM classifier marks one interval as throughput-favoring (more moderation).
  3. Next interval flips to latency-favoring (less moderation).
  4. Profile keeps hopping (left/right in moderation space) before the system settles.
  5. Softirq/NAPI service-time variance increases; packet release becomes bursty.
  6. Strategy reacts to stale or phase-shifted market snapshots.
  7. Slippage tails rise even if median latency barely moves.

Pathology: oscillatory moderation creates cadence distortion.
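The cadence distortion is easy to see in a toy simulation (illustrative numbers, not a driver model): packets are released on a cadence set by the active profile, and the profile flips every fixed interval. The helper name and constants below are hypothetical.

```python
def release_times(flip_period_us=1_000, horizon_us=10_000):
    """Toy model: DIM alternates between a latency-favoring profile (release
    every 20 us) and a throughput-favoring one (every 200 us) each interval."""
    t, out, profile = 0, [], 0
    while t < horizon_us:
        hold = 20 if profile == 0 else 200   # effective coalescing in usec
        prev, t = t, t + hold
        out.append(t)
        if t // flip_period_us != prev // flip_period_us:
            profile ^= 1                     # profile flip at interval edge
    return out

rt = release_times()
gaps = sorted(b - a for a, b in zip(rt, rt[1:]))
p50, p95 = gaps[len(gaps) // 2], gaps[int(0.95 * len(gaps))]
```

Median pacing looks healthy (p50 stays at the low-moderation gap) while p95 is an order of magnitude wider, which is exactly how the tax hides from average-latency dashboards.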


Extend slippage decomposition with DIM oscillation tax

[ IS = IS_{market} + IS_{impact} + IS_{timing} + IS_{fees} + \underbrace{IS_{dim}}_{\text{adaptive moderation tax}} ]

Practical approximation:

[ IS_{dim,t} \approx a\cdot PFS_t + b\cdot FRS_t + c\cdot JI95_t + d\cdot SDR_t ]

Where:

  - (PFS_t): Profile Flip Score for the interval (profile changes per unit time)
  - (FRS_t): Flip Reversal Score (A→B→A reversals per unit time)
  - (JI95_t): inter-interrupt jitter p95
  - (SDR_t): softirq delay ratio, capturing NAPI service-time degradation (step 5 of the timeline)
  - (a, b, c, d): coefficients fitted on matched production cohorts

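As a concrete illustration, the approximation is a plain linear combination; the coefficients below are placeholders, not calibrated values.

```python
def is_dim_tax(pfs, frs, ji95, sdr, coef=(0.5, 1.0, 0.02, 0.8)):
    """IS_dim for one interval, in bps. `coef` holds (a, b, c, d); the
    defaults are placeholders to be replaced by fitted values."""
    a, b, c, d = coef
    return a * pfs + b * frs + c * ji95 + d * sdr
```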

What to measure (production features)

1) Profile Flip Score (PFS)

[ PFS = \frac{\#(\text{profile changes})}{\Delta t} ]

Compute per RX/TX queue and aggregate with a traffic-weighted average.
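A minimal sketch, assuming the active profile is sampled per queue into a sequence (function and variable names are hypothetical):

```python
def pfs(profile_samples, dt_s):
    """Profile Flip Score for one queue: profile changes per second."""
    flips = sum(1 for a, b in zip(profile_samples, profile_samples[1:]) if a != b)
    return flips / dt_s

def weighted_pfs(per_queue):
    """Traffic-weighted aggregate; per_queue maps queue -> (pfs, packets)."""
    total = sum(pkts for _, pkts in per_queue.values())
    return sum(score * pkts for score, pkts in per_queue.values()) / total
```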

2) Flip Reversal Score (FRS)

[ FRS = \frac{\#(A \to B \to A\ \text{within short window})}{\Delta t} ]

Separates healthy adaptation from unstable oscillation.
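One way to count reversals (a sketch; profile samples and flip timestamps are assumed to be collected per queue):

```python
def frs(profile_samples, timestamps, window_s, dt_s):
    """Flip Reversal Score: A->B->A patterns whose two flips land within
    window_s of each other, normalized per second."""
    flips = [(timestamps[i + 1], profile_samples[i], profile_samples[i + 1])
             for i in range(len(profile_samples) - 1)
             if profile_samples[i] != profile_samples[i + 1]]
    return sum(1 for (t1, a, _), (t2, _, c) in zip(flips, flips[1:])
               if c == a and (t2 - t1) <= window_s) / dt_s
```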

3) Moderation Span (MS)

Difference between max and min effective coalescing usec visited in a short bucket.

High MS + high PFS is usually the most toxic combination.

4) Inter-Interrupt Jitter p95/p99 (JI95/JI99)

Compute per queue from IRQ timestamp deltas. Rising tails indicate unstable pacing.
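Nearest-rank quantiles over the deltas are enough for a first pass (sketch; per-queue IRQ timestamps in usec are assumed):

```python
def jitter_tails(irq_ts_us):
    """JI95/JI99: p95 and p99 of inter-interrupt deltas, nearest-rank."""
    deltas = sorted(b - a for a, b in zip(irq_ts_us, irq_ts_us[1:]))
    def q(p):
        return deltas[min(len(deltas) - 1, int(p * len(deltas)))]
    return q(0.95), q(0.99)
```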

5) NAPI Cycle Skew (NCS)

Variance of packets-per-poll and time-per-poll across consecutive cycles.

6) Dispatch Clump Factor (DCF)

[ DCF = \frac{\text{p95 child-order inter-send gap}}{\text{p50 child-order inter-send gap}} ]

Inflates when dispatch becomes lumpy.
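Sketch of the ratio over child-order send timestamps (nearest-rank quantiles; names hypothetical):

```python
def dcf(send_ts):
    """Dispatch Clump Factor: p95/p50 of inter-send gaps. Values near 1
    mean even pacing; large values mean lumpy dispatch."""
    gaps = sorted(b - a for a, b in zip(send_ts, send_ts[1:]))
    p50 = gaps[len(gaps) // 2]
    p95 = gaps[min(len(gaps) - 1, int(0.95 * len(gaps)))]
    return p95 / p50
```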

7) Decision-to-Wire Tail (DWT95/DWT99)

Critical execution latency metric; should be conditioned on DIM regime.

8) DIM Stress Markout Gap (DSMG)

Matched-cohort markout delta between dim_stress=1 and baseline windows.


Minimal model architecture

Stage 1: DIM stress classifier

Inputs:

  - PFS, FRS, and MS per queue (traffic-weighted)
  - JI95/JI99 from IRQ timestamp deltas
  - NCS from NAPI poll statistics

Output:

  - (p(dim\_stress)) plus a discrete regime label for the controller state machine

Stage 2: Conditional execution cost model

Predict:

  - (\Delta E[IS]) and tail quantiles such as (\Delta q95(IS)), conditioned on DIM regime
  - completion-risk uplift for urgent child orders

Key interaction:

[ \Delta IS \sim \beta_1\,\text{urgency} + \beta_2\,\text{DIM\_STRESS} + \beta_3\,(\text{urgency} \times \text{DIM\_STRESS}) ]

Interpretation: urgent flow pays disproportionately during oscillation regimes.
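On synthetic data the interaction term is recoverable with ordinary least squares; this is only a shape check, and the coefficients and data below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000
urgency = rng.uniform(0.0, 1.0, n)
stress = rng.integers(0, 2, n).astype(float)      # DIM_STRESS indicator
# Synthetic truth: urgent flow pays extra only under DIM stress (beta3 = 2).
d_is = (1.0 * urgency + 0.5 * stress + 2.0 * urgency * stress
        + rng.normal(0.0, 0.1, n))

X = np.column_stack([np.ones(n), urgency, stress, urgency * stress])
beta, *_ = np.linalg.lstsq(X, d_is, rcond=None)
# beta[3] recovers the interaction: the urgency slope roughly triples
# when DIM_STRESS = 1.
```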


Controller state machine

GREEN — DIM_STABLE
  Flip and jitter metrics at baseline; run the normal execution policy.

YELLOW — DIM_DRIFT
  Profile changes trending up; apply mild clip/trim and jitter send timing.

ORANGE — DIM_OSCILLATION
  Sustained flip reversals; cap urgency catch-up and reduce child-order fanout.

RED — DIM_UNSTABLE_TAIL
  Jitter and decision-to-wire tails breach budget; switch to a containment policy with an explicit tail budget.

Use hysteresis + minimum dwell to avoid control flapping.
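A sketch of the dwell-plus-hysteresis logic (thresholds are illustrative): escalate immediately, de-escalate only after the stress score has stayed below the exit threshold for a minimum number of ticks.

```python
GREEN, YELLOW, ORANGE, RED = range(4)

class DimStateMachine:
    ENTER = {YELLOW: 0.30, ORANGE: 0.60, RED: 0.85}  # escalate at/above
    EXIT = {YELLOW: 0.20, ORANGE: 0.50, RED: 0.75}   # hysteresis gap

    def __init__(self, min_dwell=3):
        self.state, self.min_dwell, self.calm = GREEN, min_dwell, 0

    def step(self, p_stress):
        target = GREEN
        for s, thr in self.ENTER.items():
            if p_stress >= thr:
                target = s
        if target > self.state:
            self.state, self.calm = target, 0        # escalate immediately
        elif self.state > GREEN and p_stress < self.EXIT[self.state]:
            self.calm += 1
            if self.calm >= self.min_dwell:          # dwell before stepping down
                self.state, self.calm = self.state - 1, 0
        else:
            self.calm = 0                            # reset dwell on any excursion
        return self.state
```

Stepping down one state at a time (rather than jumping straight to GREEN) is deliberate: it keeps the controller from re-triggering on the very oscillation it is trying to damp.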


Engineering mitigations (highest ROI first)

  1. Separate critical vs non-critical queues
    Keep latency-sensitive execution traffic away from queues carrying noisy bulk/background flow.

  2. Bound adaptation range
    Narrow allowable moderation-profile span on execution-critical queues to prevent wide oscillation.

  3. Tune sampling/decision cadence
    DIM reacts to sample deltas; overly reactive settings can chase noise.

  4. Queue-local policy, not host-global policy
    Different queues have different burst structure; one-size adaptive policy is fragile.

  5. Couple DIM telemetry into execution controller
    Treat DIM regime as first-class model feature, not a postmortem artifact.

  6. Run controlled burst-replay tests before rollout
    Validate profile stability under synthetic open/close and event-driven bursts.


Validation protocol

  1. Label dim_stress windows with thresholds over PFS/FRS/JI95.
  2. Build matched cohorts by symbol, spread, volatility, participation, and time bucket.
  3. Estimate (\Delta E[IS]), (\Delta q95(IS)), completion-risk uplift.
  4. Shadow controller actions (no-trade-impact mode) first.
  5. Promote only if out-of-sample tails improve without throughput collapse.
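Steps 1-3 can be sketched as follows (thresholds and field names are hypothetical; real cohorts would also match on spread, volatility, and participation):

```python
from collections import defaultdict

def dim_stress_label(w, pfs_thr=1.0, frs_thr=0.5, ji95_thr=200.0):
    """Step 1: threshold labeling of one telemetry window. Set thresholds
    from baseline percentiles; the defaults here are placeholders."""
    return int(w["pfs"] > pfs_thr or w["frs"] > frs_thr or w["ji95"] > ji95_thr)

def cohort_delta(rows):
    """Steps 2-3 (simplified): mean-IS gap between stressed and baseline
    windows within each (symbol, time-bucket) cohort, averaged over cohorts."""
    groups = defaultdict(lambda: {0: [], 1: []})
    for r in rows:
        groups[(r["symbol"], r["bucket"])][r["stress"]].append(r["is_bps"])
    deltas = [sum(g[1]) / len(g[1]) - sum(g[0]) / len(g[0])
              for g in groups.values() if g[0] and g[1]]
    return sum(deltas) / len(deltas)
```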

Practical observability checklist

  - per-queue DIM profile-change counters (PFS, FRS, MS)
  - IRQ timestamp deltas with p95/p99 rollups (JI95/JI99)
  - NAPI packets-per-poll and time-per-poll distributions (NCS)
  - DWT95/DWT99 tagged with the active DIM regime
  - markout deltas for dim_stress vs baseline windows (DSMG)

Success criterion: tail execution stability under bursty flow, not just higher average throughput.


Pseudocode sketch

features = collect_dim_features()  # PFS, FRS, MS, JI95, NCS, DWT95
p_stress = dim_stress_model.predict_proba(features)
state = decode_dim_state(p_stress, features)

if state == "GREEN":
    params = normal_policy()
elif state == "YELLOW":
    params = mild_clip_trim_and_send_jitter()
elif state == "ORANGE":
    params = cap_urgency_catchup_reduce_fanout()
else:  # RED
    params = containment_policy_with_tail_budget()

execute_with(params)
log(state=state, p_stress=p_stress)

Bottom line

Adaptive interrupt moderation is useful, but in mixed burst regimes it can become a control-loop instability source.

If you do not model DIM oscillation, you will misclassify infra-induced timing error as market randomness. Treat moderation state transitions as a live slippage feature, and attach explicit guardrails before tails eat your edge.

