SMT Sibling Contention Dispatch-Jitter Slippage Playbook
Why this exists
Execution hosts can look idle enough on average while still leaking p95/p99 implementation shortfall.
A frequent hidden source is SMT (Hyper-Threading) sibling contention: a latency-critical thread shares one physical core with another runnable thread (or noisy IRQ/ksoftirq work), causing bursty dispatch delays.
These delays are often too small for coarse dashboards but large enough to degrade queue quality in fast books.
Core failure mode
When execution-critical threads share physical cores with active siblings:
- run-queue competition adds variable scheduling delay,
- shared core resources (frontend/backend/cache/TLB bandwidth) create execution-time jitter,
- decision and cancel/replace loops become phase-noisy,
- child orders arrive in uneven bursts,
- queue-age quality decays,
- late-cycle urgency rises and crossing cost convexifies.
Result: tail slippage rises even when mean CPU utilization looks acceptable.
Slippage decomposition with SMT term
For parent order (i):
[ IS_i = C_{delay} + C_{impact} + C_{miss} + C_{smt} ]
Where:
[ C_{smt} = C_{runqueue} + C_{shared-core-jitter} + C_{queue-decay} ]
- Runqueue cost: scheduling wait added by sibling activity
- Shared-core jitter cost: variable compute/dispatch latency from resource contention
- Queue decay cost: stale quotes and reset-heavy retries caused by timing instability
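The decomposition above can be sketched as a small accounting structure. This is a minimal illustration, not a desk standard: the field names mirror the cost terms, all values are assumed to be in basis points, and the example numbers are made up.

```python
from dataclasses import dataclass

@dataclass
class SlippageBreakdown:
    """Per-parent-order cost terms (bps); names mirror the decomposition above."""
    c_delay: float
    c_impact: float
    c_miss: float
    c_runqueue: float              # scheduling wait added by sibling activity
    c_shared_core_jitter: float    # variable latency from shared-core contention
    c_queue_decay: float           # stale quotes / reset-heavy retries

    @property
    def c_smt(self) -> float:
        # SMT term: runqueue + shared-core jitter + queue-decay components
        return self.c_runqueue + self.c_shared_core_jitter + self.c_queue_decay

    @property
    def implementation_shortfall(self) -> float:
        return self.c_delay + self.c_impact + self.c_miss + self.c_smt
```

With illustrative inputs `SlippageBreakdown(1.2, 2.0, 0.5, 0.3, 0.2, 0.4)`, the SMT term contributes 0.9 bps of a 4.6 bps total shortfall.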
Feature set (production-ready)
1) Host scheduling / CPU-topology features
- physical-core vs logical-core pin map for execution threads
- sibling runnable ratio (critical thread sibling busy-time %)
- per-core runqueue depth quantiles
- context-switch and migration rate on critical cores
- IRQ/softirq load overlapping critical sibling
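The pin map and sibling features above need the host's SMT topology. On Linux this is exposed in sysfs via `thread_siblings_list` files; a minimal reader, assuming the standard sysfs layout:

```python
import glob

def parse_cpu_list(text: str) -> list[int]:
    """Parse a sysfs CPU list such as "0,32" or "0-3,8-11" into sorted ints."""
    ids: set[int] = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            ids.update(range(lo, hi + 1))
        elif part:
            ids.add(int(part))
    return sorted(ids)

def smt_sibling_map() -> dict[int, list[int]]:
    """Map each logical CPU to its SMT siblings (Linux sysfs topology files)."""
    siblings: dict[int, list[int]] = {}
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    for path in glob.glob(pattern):
        cpu = int(path.split("/")[-3][3:])  # ".../cpu17/..." -> 17
        with open(path) as f:
            ids = parse_cpu_list(f.read())
        siblings[cpu] = [c for c in ids if c != cpu]
    return siblings
```

Cross-checking this map against the execution-thread pin map is what makes the "critical thread sibling busy-time" feature computable.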
2) Execution-path timing features
- decision-to-send latency quantiles (p50/p95/p99)
- inter-dispatch gap variance and burst index
- cancel-to-replace turnaround drift
- timing phase error vs intended schedule grid
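The inter-dispatch gap features can be computed directly from child-order send timestamps. One common choice for a burst index, assumed here for illustration, is the squared coefficient of variation of gaps (near 1 for a Poisson-like stream, above 1 when bursty, near 0 when evenly paced):

```python
import statistics

def dispatch_timing_features(send_ts: list[float]) -> dict[str, float]:
    """Gap mean/variance and burst index from send timestamps (seconds)."""
    if len(send_ts) < 2:
        raise ValueError("need at least two dispatch timestamps")
    gaps = [b - a for a, b in zip(send_ts, send_ts[1:])]
    mean = statistics.fmean(gaps)
    var = statistics.pvariance(gaps)
    return {
        "gap_mean": mean,
        "gap_var": var,
        # squared coefficient of variation of inter-dispatch gaps
        "burst_index": var / (mean * mean) if mean > 0 else float("nan"),
    }
```

Evenly spaced dispatches give a burst index of 0; a long silence followed by a flush pushes it well above 1.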
3) Outcome features
- passive fill ratio by sibling-load bucket
- markout ladder (10ms / 100ms / 1s / 5s)
- completion deficit vs schedule under same liquidity regime
- branch labels: isolated-core, mild-contention, contention-burst, deadline-chase
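A labeling rule for the four branches can be as simple as thresholding the host and timing features. The thresholds below are illustrative placeholders, not calibrated values; fit them per fleet:

```python
def branch_label(sibling_runnable_ratio: float,
                 dispatch_burst_index: float,
                 deadline_pressure: bool) -> str:
    """Assign one of the branch labels above from host/timing features.

    Thresholds are illustrative placeholders; calibrate per host pool.
    """
    if deadline_pressure and sibling_runnable_ratio > 0.5:
        return "deadline-chase"
    if dispatch_burst_index > 2.0:
        return "contention-burst"
    if sibling_runnable_ratio > 0.2:
        return "mild-contention"
    return "isolated-core"
```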
Model architecture
Use a baseline + SMT-overlay design:
- Baseline slippage model
  - spread/impact/fill/deadline stack
- SMT contention overlay
  - predicts incremental uplift: delta_is_mean, delta_is_q95
Final estimate:
[ \hat{IS}_{final} = \hat{IS}_{baseline} + \Delta\hat{IS}_{smt} ]
Train on matched market windows (symbol/session/volatility/liquidity bucket) with different sibling-load states to isolate host-topology effects from market confounders.
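The composition step is mechanical once both models exist. A minimal sketch, where `overlay` stands in for any fitted regressor mapping SMT features to the predicted uplift (the linear stub and its coefficient are assumptions, not a recommended form):

```python
from typing import Callable

def final_is_estimate(baseline_is: float,
                      smt_features: dict[str, float],
                      overlay: Callable[[dict[str, float]], float]) -> float:
    """Final IS estimate = baseline prediction + SMT overlay uplift (bps)."""
    return baseline_is + overlay(smt_features)

# Stub overlay: uplift linear in sibling runnable ratio (illustrative only).
overlay = lambda f: 0.8 * f["sibling_runnable_ratio"]
estimate = final_is_estimate(3.0, {"sibling_runnable_ratio": 0.5}, overlay)
```

Keeping the overlay separate means the baseline stack stays valid on isolated hosts, and the uplift term can be re-fit alone after topology or kernel changes.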
Regime controller
State A: CORE_ISOLATED
- critical thread has low sibling pressure
- normal execution policy
State B: SIBLING_WATCH
- sibling runnable ratio rising, timing tails widening
- reduce replace churn, smooth child pacing
State C: CONTENTION_STRESS
- sustained sibling pressure + bursty dispatch
- cap burst size, enforce minimum spacing, avoid fragile queue-chasing
State D: SAFE_ISOLATION_MODE
- repeated stress + deadline pressure
- route urgent flow only to isolated cores/hosts, conservative completion policy
Use hysteresis and minimum dwell times to prevent policy flapping.
Desk metrics
- SCI (Sibling Contention Index): pressure from SMT sibling activity
- RQS (RunQueue Stress): scheduler backlog severity on critical cores
- DBI (Dispatch Burst Index): uneven child-order release intensity
- QRL (Queue Reliability Loss): passive-fill quality degradation under contention
- SUL (SMT Uplift Loss): realized IS - baseline IS in contention regimes
Track by host pool, core topology profile, symbol-liquidity bucket, and session segment.
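Of these, SUL is the simplest to compute from fill records. A sketch assuming a record schema with `regime`, `realized_is`, and `baseline_is` fields (the schema is an assumption for illustration):

```python
from collections import defaultdict

def smt_uplift_loss(records: list[dict]) -> dict[str, float]:
    """SUL per regime: mean(realized_is - baseline_is), in bps.

    `records` are per-parent-order dicts with keys 'regime',
    'realized_is', 'baseline_is' -- an assumed schema.
    """
    sums: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for r in records:
        sums[r["regime"]] += r["realized_is"] - r["baseline_is"]
        counts[r["regime"]] += 1
    return {regime: sums[regime] / counts[regime] for regime in sums}
```

A persistently positive SUL in contention regimes, with near-zero SUL on isolated cores, is the signature that host topology (not the market) is taxing execution.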
Mitigation ladder
- Critical-thread core isolation
  - pin latency-critical paths away from busy siblings where possible
- IRQ/softirq hygiene
  - keep noisy interrupt paths off critical sibling pairs
- Execution containment under watch/stress
  - bounded catch-up pacing, no blind backlog flush
- Topology-aware routing
  - send urgency-sensitive parents only through validated low-contention hosts
- Continuous recalibration
  - re-fit SMT uplift after kernel, affinity, or fleet-profile changes
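The first rung of the ladder is a one-liner on Linux via `os.sched_setaffinity` (Linux-only; pass PID 0 for the calling process). Which CPUs count as "isolated" is site-specific; the function just applies and confirms the mask:

```python
import os

def pin_to_isolated_cores(cores: set[int]) -> set[int]:
    """Pin the calling process to the given logical CPUs (Linux-only).

    Choose CPUs whose SMT siblings are kept free of runnable work;
    returns the effective affinity mask for verification.
    """
    os.sched_setaffinity(0, cores)       # PID 0 = current process
    return os.sched_getaffinity(0)
```

Pinning only helps if the sibling side of the pair is actually kept quiet; pair this with the IRQ/softirq hygiene step above.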
Failure drills (must run)
- Sibling-load injection drill
  - replay known contention patterns and verify early SIBLING_WATCH detection
- Burst-containment drill
  - confirm bounded recovery outperforms panic flush on q95 IS
- Confounder drill
  - separate SMT effects from NIC/network/venue latency spikes
- Isolation fallback drill
  - verify rapid migration to safe host/core pools under stress
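For the sibling-load injection drill, a throwaway busy-loop process pinned to the SMT sibling of a critical core is often enough to reproduce the contention signature. A minimal injector sketch (affinity pinning applies on Linux only; elsewhere the loop still runs but unpinned):

```python
import subprocess
import sys

def inject_sibling_load(cpu: int, seconds: float) -> subprocess.Popen:
    """Spawn a busy-loop child process pinned (on Linux) to one logical CPU.

    Point `cpu` at the SMT sibling of a critical core to replay a
    contention burst, then watch for early SIBLING_WATCH detection.
    """
    code = (
        "import os, sys, time\n"
        "cpu, secs = int(sys.argv[1]), float(sys.argv[2])\n"
        "if hasattr(os, 'sched_setaffinity'):\n"
        "    os.sched_setaffinity(0, {cpu})\n"
        "end = time.monotonic() + secs\n"
        "while time.monotonic() < end:\n"
        "    pass\n"
    )
    return subprocess.Popen([sys.executable, "-c", code, str(cpu), str(seconds)])
```

Run it against a non-production host pool first, and always bound `seconds` so a forgotten drill cannot become the contention it was meant to simulate.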
Anti-patterns
- Trusting average CPU% as latency-health truth
- Co-locating critical execution threads with unbounded sibling workload
- Ignoring topology/affinity drift after host maintenance
- Aggressive retry loops that amplify contention-caused timing bursts
Bottom line
SMT is not inherently harmful, but unmanaged sibling contention can become a hidden slippage tax.
If you do not model host-topology-induced timing distortion, queue-quality erosion will keep leaking basis points in tail windows.