Runqueue Migration Cache-Cold Slippage Playbook
Date: 2026-03-18
Category: research
Why this exists
Execution engines often optimize network and venue latency but under-model a quieter infra tax: scheduler-driven cross-core task migration.
When critical threads bounce between CPU cores, they lose cache warmth (L1/L2/LLC locality), pay branch-predictor relearn cost, and stretch decision→dispatch latency right where queue priority matters most.
The host can still show healthy average CPU utilization while p95/p99 execution cost worsens.
Core failure mode
A latency-sensitive strategy thread starts on core A, then is migrated to core B due to load balancing, IRQ pressure, cgroup placement, or competing tasks.
That migration can trigger:
- Cold-start compute window (cache/predictor warm-up)
- Decision loop stall inflation (longer compute-to-send gap)
- Dispatch bunching (late micro-bursts after catch-up)
- Queue-age loss (slower amend/cancel keeps stale intent live)
- Tail slippage expansion
In fast books, the hidden damage is not mean delay but timing-shape distortion of child flow.
Slippage decomposition with migration term
For parent order \(i\):
\[ IS_i = C_{spread} + C_{impact} + C_{opportunity} + C_{migration} \]
Where:
\[ C_{migration} = C_{cold} + C_{burst} + C_{queue\_decay} + C_{phase\_error} \]
- \(C_{cold}\): cache-cold compute penalty post migration
- \(C_{burst}\): burstier dispatch after delayed decision loop
- \(C_{queue\_decay}\): queue rank erosion from delayed lifecycle actions
- \(C_{phase\_error}\): timing mismatch versus microstructure refill cadence
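The decomposition above can be sketched as a simple accounting structure. A minimal Python sketch; the class name, field names, and the bps units are illustrative assumptions, not part of any production schema:

```python
from dataclasses import dataclass

@dataclass
class SlippageDecomposition:
    """Per-parent-order implementation-shortfall terms, in bps of arrival price."""
    spread: float
    impact: float
    opportunity: float
    cold: float = 0.0          # cache-cold compute penalty post migration
    burst: float = 0.0         # burstier dispatch after delayed decision loop
    queue_decay: float = 0.0   # queue rank erosion from delayed lifecycle actions
    phase_error: float = 0.0   # mismatch vs. microstructure refill cadence

    @property
    def migration(self) -> float:
        # C_migration = C_cold + C_burst + C_queue_decay + C_phase_error
        return self.cold + self.burst + self.queue_decay + self.phase_error

    @property
    def total_is(self) -> float:
        # IS_i = C_spread + C_impact + C_opportunity + C_migration
        return self.spread + self.impact + self.opportunity + self.migration

d = SlippageDecomposition(spread=1.2, impact=0.8, opportunity=0.5,
                          cold=0.15, burst=0.10, queue_decay=0.20, phase_error=0.05)
print(round(d.migration, 2), round(d.total_is, 2))  # 0.5 3.0
```

Keeping the migration term as separate fields (rather than one lump) makes it easy to attribute uplift to a specific mechanism later.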
Production observability (minimum)
1) Scheduler / placement telemetry
- per-thread CPU residency timeline
- context switches (voluntary/involuntary)
- migration count (per thread / per second)
- run-queue delay quantiles
2) Hardware locality hints
- LLC miss rate (or proxy counters)
- cycles-per-instruction drift during migration bursts
- optional branch-miss and iTLB/dTLB miss deltas
3) Execution-path telemetry
- decision-to-send latency (p50/p95/p99)
- cancel-to-replace latency
- inter-child dispatch gap variance
- burstiness index of child sends
4) Outcome telemetry
- IS by urgency/liquidity bucket
- short-horizon markout ladder (10ms/100ms/1s)
- completion deficit near deadline windows
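A minimal sketch of the per-thread scheduler telemetry in item 1, assuming a Linux kernel that exposes `se.nr_migrations` and switch counters in `/proc/<pid>/task/<tid>/sched` (this typically requires CONFIG_SCHED_DEBUG; field availability varies by kernel build):

```python
import re
from pathlib import Path
from typing import Optional

# Matches "name : value" lines as printed by /proc/<tid>/sched.
_FIELD_RE = re.compile(r"^([\w.]+)\s*:\s*([\d.]+)", re.MULTILINE)

def parse_sched_stats(text: str) -> dict:
    """Parse 'name : value' pairs from a /proc/<pid>/task/<tid>/sched dump."""
    return {m.group(1): float(m.group(2)) for m in _FIELD_RE.finditer(text)}

def thread_migration_count(tid: int) -> Optional[float]:
    """Live read for one thread; returns None if the file is unavailable."""
    path = Path(f"/proc/{tid}/sched")
    if not path.exists():
        return None
    return parse_sched_stats(path.read_text()).get("se.nr_migrations")

# Parsed from a captured dump rather than live /proc, so it runs anywhere:
sample = """\
se.nr_migrations                     :          42
nr_switches                          :        1337
nr_voluntary_switches                :        1200
nr_involuntary_switches              :         137
"""
stats = parse_sched_stats(sample)
print(stats["se.nr_migrations"], stats["nr_involuntary_switches"])
```

Sampling these counters on a fixed tick and diffing gives the migration-rate and context-switch series the desk metrics below consume.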
Desk metrics to track
Use rolling windows (e.g., 1m / 5m):
- TMR (Thread Migration Rate)
\[ TMR = \frac{\Delta \text{migrations}}{\Delta \text{time}} \]
- RQS (Runqueue Stretch)
\[ RQS = p_{95}(\text{runqueue delay}) - p_{50}(\text{runqueue delay}) \]
- CLI (Cache Locality Impairment)
Normalized LLC-miss uplift during high migration windows vs matched baseline.
- DGI (Dispatch Gap Inflation)
\[ DGI = \frac{p_{95}(\text{dispatch gap})}{\text{median}(\text{dispatch gap})} \]
- QDI (Queue Decay Impact)
Passive fill-ratio drop conditioned on migration spikes vs matched calm windows.
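TMR, RQS, and DGI can be computed directly from rolling-window samples. A dependency-free sketch; the sample data is synthetic and the nearest-rank percentile is a deliberate simplification:

```python
import statistics

def pctl(xs, q):
    """Nearest-rank percentile (simple, dependency-free)."""
    xs = sorted(xs)
    idx = max(0, min(len(xs) - 1, round(q / 100 * (len(xs) - 1))))
    return xs[idx]

def tmr(migrations_delta, window_s):
    """Thread Migration Rate: migrations per second over the window."""
    return migrations_delta / window_s

def rqs(runqueue_delays_us):
    """Runqueue Stretch: p95 - p50 of run-queue delay."""
    return pctl(runqueue_delays_us, 95) - pctl(runqueue_delays_us, 50)

def dgi(dispatch_gaps_us):
    """Dispatch Gap Inflation: p95 gap over median gap."""
    return pctl(dispatch_gaps_us, 95) / statistics.median(dispatch_gaps_us)

delays = [50, 55, 60, 70, 90, 120, 400, 60, 58, 65]      # us, synthetic
gaps = [200, 210, 190, 205, 900, 198, 202, 207, 195, 200]  # us, synthetic
print(tmr(30, 60))  # 30 migrations in a 60 s window -> 0.5/s
print(rqs(delays), round(dgi(gaps), 2))
```

Note how one 900 us outlier gap dominates DGI while barely moving the median; that is exactly the tail-shape sensitivity these metrics are meant to capture.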
Modeling approach
Use a baseline cost model plus a migration-uplift overlay.
Stage A: baseline cost model
Standard spread/impact/fill model with market-state features.
Stage B: migration uplift model
Predict incremental:
- \(\Delta IS_{mean}\)
- \(\Delta IS_{q95}\)
with features:
- TMR, RQS, CLI, DGI, QDI
- CPU residency entropy (how dispersed thread placement is)
- cgroup/cpuset topology hints
- urgency, participation, symbol liquidity regime
Final estimate:
\[ \hat{IS}_{final} = \hat{IS}_{baseline} + \Delta\hat{IS}_{migration} \]
Use matched windows (same symbol/session/volatility regime) to separate infra uplift from market turbulence.
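The final estimate is additive, so the overlay can be sketched independently of the baseline model. A linear uplift term stands in here for whatever Stage B actually fits; all feature names, coefficients, and the zero floor are illustrative assumptions:

```python
def final_is_estimate(baseline_bps, uplift_features, uplift_coefs, intercept=0.0):
    """IS_hat_final = IS_hat_baseline + max(0, linear uplift in infra features).

    The uplift is floored at zero on the view that migration pressure
    should not *reduce* expected cost.
    """
    uplift = intercept + sum(uplift_coefs[k] * v for k, v in uplift_features.items())
    return baseline_bps + max(0.0, uplift)

# Illustrative feature snapshot and fitted coefficients:
feats = {"tmr": 2.0, "rqs_us": 300.0, "dgi": 3.5}
coefs = {"tmr": 0.05, "rqs_us": 0.001, "dgi": 0.08}
print(round(final_is_estimate(1.8, feats, coefs), 2))  # baseline 1.8 bps + 0.68 uplift
```

In practice Stage B would be trained on matched windows as described above; the point of the sketch is only the additive composition.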
Controller state machine
State 1 — PINNED_STABLE
- low TMR
- stable runqueue tails
- no meaningful CLI uplift
Action: standard policy.
State 2 — MIGRATION_PRESSURE
- TMR rising
- RQS widening
Action: reduce replace churn, smooth child cadence, increase passive selectivity.
State 3 — CACHE_COLD_DRIFT
- persistent migration + CLI/DGI elevation
- markout tails worsening
Action: cap burst size, enforce min inter-send spacing, reduce tactic oscillation.
State 4 — SAFE_AFFINITY_MODE
- sustained CACHE_COLD_DRIFT with deadline risk
Action: affinity-hardened execution profile (pin critical workers, tighten concurrency, conservative completion mode).
Use hysteresis and minimum dwell times to avoid policy flapping.
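The ladder above can be sketched as a small controller with hysteresis (separate enter/exit thresholds) and a minimum dwell time. Driving it off TMR alone is a simplification (the real triggers also involve RQS/CLI/DGI and markouts), and the thresholds are illustrative:

```python
STATES = ["PINNED_STABLE", "MIGRATION_PRESSURE", "CACHE_COLD_DRIFT", "SAFE_AFFINITY_MODE"]

class MigrationController:
    """One-step escalation ladder with hysteresis and minimum dwell.

    Escalates when the metric crosses the next state's enter threshold,
    de-escalates only below the current state's (lower) exit threshold,
    and refuses any move before MIN_DWELL_S has elapsed -- so the policy
    does not flap around a single cut-off.
    """
    ENTER = {"MIGRATION_PRESSURE": 5.0, "CACHE_COLD_DRIFT": 15.0, "SAFE_AFFINITY_MODE": 30.0}
    EXIT = {"MIGRATION_PRESSURE": 3.0, "CACHE_COLD_DRIFT": 10.0, "SAFE_AFFINITY_MODE": 20.0}
    MIN_DWELL_S = 30.0

    def __init__(self, now=0.0):
        self.state = "PINNED_STABLE"
        self.entered_at = now

    def step(self, tmr, now):
        if now - self.entered_at < self.MIN_DWELL_S:
            return self.state  # dwell time not met: hold state
        i = STATES.index(self.state)
        if i + 1 < len(STATES) and tmr >= self.ENTER[STATES[i + 1]]:
            self._move(STATES[i + 1], now)
        elif i > 0 and tmr < self.EXIT[self.state]:
            self._move(STATES[i - 1], now)
        return self.state

    def _move(self, state, now):
        self.state, self.entered_at = state, now

c = MigrationController()
print(c.step(tmr=8.0, now=10.0))   # PINNED_STABLE: dwell not yet met
print(c.step(tmr=8.0, now=40.0))   # escalates to MIGRATION_PRESSURE
print(c.step(tmr=2.0, now=80.0))   # 2.0 < exit 3.0: back to PINNED_STABLE
```

The enter/exit gap (e.g., 5.0 in, 3.0 out) is the hysteresis; without it, a TMR hovering near one threshold would toggle the policy every tick.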
Mitigation ladder
Pin truly critical execution threads
- CPU affinity/cpuset for decision + dispatch hot paths
- keep housekeeping/background workers outside the same core island
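A minimal sketch of the pinning step, assuming a Linux host (`os.sched_setaffinity` is Linux-only); the lowest-cores-first layout is purely illustrative, and real placement should respect NUMA nodes and SMT siblings:

```python
import os

def partition_cores(n_cores, n_critical):
    """Split the host into a critical island and a housekeeping island.

    Reserving the lowest-numbered cores is an illustrative convention;
    the point is that the two sets are disjoint, so background workers
    can never preempt (and thus migrate) the execution hot path.
    """
    critical = set(range(n_critical))
    housekeeping = set(range(n_critical, n_cores))
    return critical, housekeeping

def pin_current_thread(cores):
    """Best-effort pin of the calling thread to the given core set."""
    if hasattr(os, "sched_setaffinity"):  # Linux only
        os.sched_setaffinity(0, cores)    # pid 0 = calling thread
        return True
    return False

crit, hk = partition_cores(n_cores=8, n_critical=2)
print(sorted(crit), sorted(hk))  # [0, 1] [2, 3, 4, 5, 6, 7]
```

Decision and dispatch hot-path threads would call `pin_current_thread(crit)` at startup, while housekeeping workers get `hk`, keeping the two islands disjoint.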
Tune migration sensitivity, not just CPU usage
- review scheduler balancing behavior and migration-cost heuristics
- avoid over-reactive balancing in latency-critical pools
Isolate interrupt pressure
- align IRQ affinity away from core(s) hosting execution-critical threads
- avoid hidden preemption that triggers downstream migrations
Bound catch-up behavior
- never repay decision lag with uncontrolled child-order bursts
- use capped repayment slope
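The capped repayment slope can be sketched as a per-interval send budget; the `catchup_cap` parameter and the budget formula are illustrative assumptions:

```python
def dispatch_allowance(backlog, base_rate, catchup_cap=0.25):
    """Per-interval child-send budget with a capped repayment slope.

    Instead of flushing the whole backlog after a stall (a micro-burst),
    allow at most base_rate * (1 + catchup_cap) sends per interval, so
    accumulated lag is repaid gradually over several intervals.
    """
    allowed = base_rate * (1.0 + catchup_cap)
    return min(backlog + base_rate, allowed)

# After a stall left a 10-child backlog at a steady 4 children/interval:
print(dispatch_allowance(backlog=10, base_rate=4))  # 5.0, not 14
# With no backlog, the budget is just the steady rate:
print(dispatch_allowance(backlog=0, base_rate=4))   # 4
```

With a 25% cap, a 10-child backlog drains at one extra child per interval rather than as a single late burst, which is exactly the \(C_{burst}\) mechanism being bounded.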
Topology-aware host classes
- separate latency-critical execution nodes from noisy multi-tenant workloads
Validation drills
Synthetic migration stress
- induce controlled scheduler churn and verify uplift detector response.
Affinity A/B canary
- compare pinned vs floating thread placement on matched symbols/time slices.
Burst-policy A/B
- naive catch-up vs capped repayment under migration stress.
Confounder split
- prove migration uplift remains after controlling for spread/volatility regime shifts.
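The confounder split can be sketched as a matched comparison inside each market-state bucket; the field names, the bucketing scheme, and the synthetic windows are illustrative:

```python
from statistics import mean

def matched_uplift(windows, vol_key="vol_bucket", spike_key="migration_spike",
                   is_key="is_bps"):
    """Mean IS gap between migration-spike and calm windows, matched by bucket.

    Comparing only within a volatility bucket (and averaging the per-bucket
    gaps) controls for regime shifts; a surviving positive gap is evidence
    the uplift is infra-driven rather than market turbulence.
    """
    buckets = {}
    for w in windows:
        side = "spike" if w[spike_key] else "calm"
        buckets.setdefault(w[vol_key], {"spike": [], "calm": []})[side].append(w[is_key])
    gaps = [mean(b["spike"]) - mean(b["calm"])
            for b in buckets.values() if b["spike"] and b["calm"]]
    return mean(gaps) if gaps else None

ws = [
    {"vol_bucket": "low",  "migration_spike": True,  "is_bps": 2.0},
    {"vol_bucket": "low",  "migration_spike": False, "is_bps": 1.4},
    {"vol_bucket": "high", "migration_spike": True,  "is_bps": 4.1},
    {"vol_bucket": "high", "migration_spike": False, "is_bps": 3.5},
]
print(round(matched_uplift(ws), 3))  # 0.6 bps uplift in both buckets
```

Buckets with no spike (or no calm) windows are skipped rather than extrapolated, which keeps the comparison honestly matched.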
Anti-patterns
- “CPU% is low, so scheduler placement cannot hurt us.”
- optimizing median latency while ignoring p95/p99 dispatch gaps
- co-locating strategy and noisy background workers on same core set
- letting migration bursts trigger aggressive catch-up sends
- modeling infra only at host level (not per-thread residency)
Practical rollout checklist
- Add per-thread CPU residency + migration counters to telemetry.
- Dashboard TMR/RQS/CLI/DGI/QDI by strategy and host class.
- Label migration-spike windows in TCA pipeline.
- Train migration-uplift overlay and backtest tail impact.
- Shadow-run state machine (PINNED_STABLE→SAFE_AFFINITY_MODE).
- Canary affinity-hardened profile with q95 IS and completion stability gates.
Bottom line
Cross-core task migration is not just an OS detail; in low-latency execution, it is a queue-priority and tail-cost control variable.
If you ignore scheduler locality dynamics, you may keep blaming “market noise” for slippage that is largely self-inflicted by cache-cold timing drift.
References
- Linux scheduler design docs: https://www.kernel.org/doc/html/latest/scheduler/index.html
- CFS scheduler internals: https://www.kernel.org/doc/html/latest/scheduler/sched-design-CFS.html
- cgroup v2 CPU controller docs: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
- perf examples (PMU counters and scheduler events): https://www.brendangregg.com/perf.html