SMT Sibling Contention Dispatch-Jitter Slippage Playbook

2026-03-18 · finance


Why this exists

Execution hosts can look idle enough on average while still leaking p95/p99 implementation shortfall.

A frequent hidden source is SMT (Hyper-Threading) sibling contention: a latency-critical thread shares one physical core with another runnable thread (or noisy IRQ/ksoftirq work), causing bursty dispatch delays.

These delays are often too small for coarse dashboards but large enough to degrade queue quality in fast books.


Core failure mode

When execution-critical threads share physical cores with active siblings:

  • dispatch inherits runqueue waits from the sibling's work
  • shared core resources (execution units, L1/L2 cache) add bursty per-dispatch jitter
  • delayed dispatches decay queue position in fast books

Result: tail slippage rises even when mean CPU utilization looks acceptable.
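A tiny simulation makes the mean-vs-tail gap concrete. All numbers here (base cost, burst size, burst probability) are illustrative stand-ins, not measurements:

```python
import random
import statistics

def simulate_dispatch_delays(n=10_000, base_us=3.0, burst_prob=0.02,
                             burst_us=150.0, seed=7):
    """Mostly a small fixed dispatch cost, with rare bursts standing in
    for sibling-contention stalls (numbers are illustrative)."""
    rng = random.Random(seed)
    return [base_us + (burst_us if rng.random() < burst_prob else 0.0)
            for _ in range(n)]

def pctile(xs, q):
    """Nearest-rank percentile of a list."""
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(q * len(xs)))]

delays = simulate_dispatch_delays()
mean_us = statistics.fmean(delays)  # stays close to the base cost
p99_us = pctile(delays, 0.99)       # dominated by the burst magnitude
```

With a ~2% burst rate the mean barely moves, while p99 jumps to the burst size: exactly the "looks idle on average" failure mode above.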


Slippage decomposition with SMT term

For parent order (i):

[ IS_i = C_{delay} + C_{impact} + C_{miss} + C_{smt} ]

Where:

[ C_{smt} = C_{runqueue} + C_{shared-core-jitter} + C_{queue-decay} ]
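The decomposition above can be carried around as a plain record; a minimal sketch (field names mirror the formula terms, values in basis points are illustrative):

```python
from dataclasses import dataclass

@dataclass
class SlippageDecomposition:
    """Per-parent-order implementation shortfall terms, in bps."""
    c_delay: float
    c_impact: float
    c_miss: float
    c_runqueue: float
    c_shared_core_jitter: float
    c_queue_decay: float

    @property
    def c_smt(self) -> float:
        # C_smt = C_runqueue + C_shared-core-jitter + C_queue-decay
        return self.c_runqueue + self.c_shared_core_jitter + self.c_queue_decay

    @property
    def implementation_shortfall(self) -> float:
        # IS_i = C_delay + C_impact + C_miss + C_smt
        return self.c_delay + self.c_impact + self.c_miss + self.c_smt

d = SlippageDecomposition(1.2, 0.8, 0.5, 0.1, 0.3, 0.2)
```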


Feature set (production-ready)

1) Host scheduling / CPU-topology features
  • sibling runnable/busy occupancy for each critical core's SMT pair
  • runqueue wait and involuntary context-switch counts on critical CPUs
  • IRQ/softirq activity landing on critical sibling pairs

2) Execution-path timing features
  • decision-to-dispatch latency (mean, p95, p99)
  • dispatch-jitter burstiness (burst count, burst duration)

3) Outcome features
  • realized implementation shortfall per parent order
  • queue-position decay and fill/miss outcomes under delayed dispatch
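A minimal sketch of condensing raw host samples into model features. The input shapes (a boolean sibling-occupancy sample stream, per-dispatch latencies in microseconds) are assumptions about the collector, not a prescribed interface:

```python
def pctile(xs, q):
    """Nearest-rank percentile of a list."""
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(q * len(xs)))]

def host_timing_features(sibling_busy_samples, dispatch_latencies_us):
    """Condense raw samples into per-window model features."""
    p50 = pctile(dispatch_latencies_us, 0.50)
    p95 = pctile(dispatch_latencies_us, 0.95)
    return {
        # fraction of samples where the critical core's SMT sibling was busy
        "sibling_busy_frac": (sum(sibling_busy_samples) / len(sibling_busy_samples)
                              if sibling_busy_samples else 0.0),
        "dispatch_p95_us": p95,
        "dispatch_jitter_us": p95 - p50,  # tail-vs-median spread as a burstiness proxy
    }
```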


Model architecture

Use a baseline + SMT-overlay design:

  1. Baseline slippage model
    • spread/impact/fill/deadline stack
  2. SMT contention overlay
    • predicts incremental uplift:
      • delta_is_mean
      • delta_is_q95

Final estimate:

[ \hat{IS}_{final} = \hat{IS}_{baseline} + \Delta\hat{IS}_{smt} ]

Train on matched market windows (symbol/session/volatility/liquidity bucket) with different sibling-load states to isolate host-topology effects from market confounders.
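The composition itself is trivial; a sketch assuming each fitted model exposes a `predict(features)` method (the `ConstantModel` stub is purely illustrative, standing in for regressors trained on the matched windows above):

```python
class ConstantModel:
    """Stand-in for a fitted regressor."""
    def __init__(self, value):
        self.value = value

    def predict(self, features):
        return self.value

def predict_final_is(features, baseline_model, smt_overlay):
    """IS_hat_final = IS_hat_baseline + delta_IS_hat_smt."""
    return baseline_model.predict(features) + smt_overlay.predict(features)
```

Keeping the overlay separate lets you zero it out (or re-fit it alone) after topology or kernel changes without touching the baseline stack.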


Regime controller

State A: CORE_ISOLATED
  • critical threads run with quiet siblings; normal execution posture

State B: SIBLING_WATCH
  • sibling load rising; tighten monitoring and pacing headroom

State C: CONTENTION_STRESS
  • sustained contention; apply execution containment (bounded catch-up pacing)

State D: SAFE_ISOLATION_MODE
  • migrate critical work to validated isolated host/core pools

Use hysteresis and minimum dwell times to prevent policy flapping.
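A minimal sketch of the state machine with both guards. The thresholds are illustrative, not calibrated, and the contention `score` is assumed to be a normalized 0-1 signal derived from the features above:

```python
class ContentionRegime:
    """Four-state controller with hysteresis bands and minimum dwell."""
    STATES = ["CORE_ISOLATED", "SIBLING_WATCH",
              "CONTENTION_STRESS", "SAFE_ISOLATION_MODE"]
    ENTER = [0.30, 0.60, 0.85]  # score needed to escalate into state i+1
    EXIT = [0.20, 0.50, 0.70]   # score must fall below this to de-escalate

    def __init__(self, min_dwell=5):
        self.state_idx = 0
        self.min_dwell = min_dwell  # ticks before any transition is allowed
        self.dwell = 0

    def update(self, score):
        """Advance at most one state per transition to avoid policy jumps."""
        self.dwell += 1
        idx = self.state_idx
        if self.dwell >= self.min_dwell:
            if idx < len(self.ENTER) and score >= self.ENTER[idx]:
                self.state_idx, self.dwell = idx + 1, 0
            elif idx > 0 and score < self.EXIT[idx - 1]:
                self.state_idx, self.dwell = idx - 1, 0
        return self.STATES[self.state_idx]
```

The gap between each ENTER/EXIT pair is the hysteresis band: a score sitting inside it (e.g. 0.25) holds the current state rather than flapping.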


Desk metrics

Track tail implementation shortfall (p95/p99), the SMT overlay uplift, and regime dwell times by host pool, core topology profile, symbol-liquidity bucket, and session segment.
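A sketch of the grouped tail metric, assuming observations arrive as `(key_tuple, is_bps)` pairs where the key carries the grouping dimensions above (this input shape is an assumption, not a prescribed schema):

```python
from collections import defaultdict

def tail_is_by_bucket(records, q=0.95):
    """Nearest-rank q-quantile of IS (bps) per grouping key."""
    groups = defaultdict(list)
    for key, is_bps in records:
        groups[key].append(is_bps)
    out = {}
    for key, xs in groups.items():
        xs = sorted(xs)
        out[key] = xs[min(len(xs) - 1, int(q * len(xs)))]
    return out
```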


Mitigation ladder

  1. Critical-thread core isolation
    • pin latency-critical paths away from busy siblings where possible
  2. IRQ/softirq hygiene
    • keep noisy interrupt paths off critical sibling pairs
  3. Execution containment under watch/stress
    • bounded catch-up pacing, no blind backlog flush
  4. Topology-aware routing
    • send urgency-sensitive parents only through validated low-contention hosts
  5. Continuous recalibration
    • re-fit SMT uplift after kernel, affinity, or fleet-profile changes

Failure drills (must run)

  1. Sibling-load injection drill
    • replay known contention patterns and verify early SIBLING_WATCH detection
  2. Burst-containment drill
    • confirm bounded recovery outperforms panic flush on q95 IS
  3. Confounder drill
    • separate SMT effects from NIC/network/venue latency spikes
  4. Isolation fallback drill
    • verify rapid migration to safe host/core pools under stress

Anti-patterns

  • treating acceptable mean CPU utilization as proof that hosts are healthy
  • panic-flushing backlog after a jitter burst instead of bounded catch-up
  • attributing all tail slippage to venue/network latency without separating confounders
  • reusing a stale SMT uplift fit after kernel, affinity, or fleet-profile changes


Bottom line

SMT is not inherently harmful, but unmanaged sibling contention can become a hidden slippage tax.

If you do not model host-topology-induced timing distortion, queue-quality erosion will keep leaking basis points in tail windows.