Runqueue Migration Cache-Cold Slippage Playbook
Date: 2026-03-18
Category: research
Why this exists
Execution engines often optimize network and venue latency but under-model a quieter infra tax: scheduler-driven cross-core task migration.
When critical threads bounce between CPU cores, they lose cache warmth (L1/L2/LLC locality), pay branch-predictor relearn cost, and stretch decision→dispatch latency right where queue priority matters most.
The host can still show healthy average CPU utilization while p95/p99 execution cost worsens.
Core failure mode
A latency-sensitive strategy thread starts on core A, then is migrated to core B due to load balancing, IRQ pressure, cgroup placement, or competing tasks.
That migration can trigger:
- Cold-start compute window (cache/predictor warm-up)
- Decision loop stall inflation (longer compute-to-send gap)
- Dispatch bunching (late micro-bursts after catch-up)
- Queue-age loss (slower amend/cancel keeps stale intent live)
- Tail slippage expansion
In fast books, the hidden damage is not mean delay but timing-shape distortion of child flow.
Slippage decomposition with migration term
For parent order \(i\):
\[ IS_i = C_{spread} + C_{impact} + C_{opportunity} + C_{migration} \]
Where:
\[ C_{migration} = C_{cold} + C_{burst} + C_{queue\_decay} + C_{phase\_error} \]
- \(C_{cold}\): cache-cold compute penalty post migration
- \(C_{burst}\): burstier dispatch after delayed decision loop
- \(C_{queue\_decay}\): queue rank erosion from delayed lifecycle actions
- \(C_{phase\_error}\): timing mismatch versus microstructure refill cadence
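The decomposition above can be sketched as a simple accounting structure. A minimal Python sketch; the class name, field names, and the bps units are illustrative assumptions, not part of any production schema:

```python
from dataclasses import dataclass

@dataclass
class SlippageDecomposition:
    """Per-parent-order implementation-shortfall terms, in bps of arrival price."""
    spread: float
    impact: float
    opportunity: float
    cold: float = 0.0          # cache-cold compute penalty post migration
    burst: float = 0.0         # burstier dispatch after delayed decision loop
    queue_decay: float = 0.0   # queue rank erosion from delayed lifecycle actions
    phase_error: float = 0.0   # mismatch vs. microstructure refill cadence

    @property
    def migration(self) -> float:
        # C_migration = C_cold + C_burst + C_queue_decay + C_phase_error
        return self.cold + self.burst + self.queue_decay + self.phase_error

    @property
    def total_is(self) -> float:
        # IS_i = C_spread + C_impact + C_opportunity + C_migration
        return self.spread + self.impact + self.opportunity + self.migration

d = SlippageDecomposition(spread=1.2, impact=0.8, opportunity=0.5,
                          cold=0.15, burst=0.10, queue_decay=0.20, phase_error=0.05)
print(round(d.migration, 2), round(d.total_is, 2))  # 0.5 3.0
```

Keeping the migration term as separate fields (rather than one lump) makes it easy to attribute uplift to a specific mechanism later.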
Production observability (minimum)
1) Scheduler / placement telemetry
- per-thread CPU residency timeline
- context switches (voluntary/involuntary)
- migration count (per thread / per second)
- run-queue delay quantiles
2) Hardware locality hints
- LLC miss rate (or proxy counters)
- cycles-per-instruction drift during migration bursts
- optional branch-miss and iTLB/dTLB miss deltas
3) Execution-path telemetry
- decision-to-send latency (p50/p95/p99)
- cancel-to-replace latency
- inter-child dispatch gap variance
- burstiness index of child sends
4) Outcome telemetry
- IS by urgency/liquidity bucket
- short-horizon markout ladder (10ms/100ms/1s)
- completion deficit near deadline windows
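A minimal sketch of the per-thread scheduler telemetry in item 1, assuming a Linux kernel that exposes `se.nr_migrations` and switch counters in `/proc/<pid>/task/<tid>/sched` (this typically requires CONFIG_SCHED_DEBUG; field availability varies by kernel build):

```python
import re
from pathlib import Path
from typing import Optional

# Matches "name : value" lines as printed by /proc/<tid>/sched.
_FIELD_RE = re.compile(r"^([\w.]+)\s*:\s*([\d.]+)", re.MULTILINE)

def parse_sched_stats(text: str) -> dict:
    """Parse 'name : value' pairs from a /proc/<pid>/task/<tid>/sched dump."""
    return {m.group(1): float(m.group(2)) for m in _FIELD_RE.finditer(text)}

def thread_migration_count(tid: int) -> Optional[float]:
    """Live read for one thread; returns None if the file is unavailable."""
    path = Path(f"/proc/{tid}/sched")
    if not path.exists():
        return None
    return parse_sched_stats(path.read_text()).get("se.nr_migrations")

# Parsed from a captured dump rather than live /proc, so it runs anywhere:
sample = """\
se.nr_migrations                     :          42
nr_switches                          :        1337
nr_voluntary_switches                :        1200
nr_involuntary_switches              :         137
"""
stats = parse_sched_stats(sample)
print(stats["se.nr_migrations"], stats["nr_involuntary_switches"])
```

Sampling these counters on a fixed tick and diffing gives the migration-rate and context-switch series the desk metrics below consume.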
Desk metrics to track
Use rolling windows (e.g., 1m / 5m):
- TMR (Thread Migration Rate)
\[ TMR = \frac{\Delta \text{migrations}}{\Delta \text{time}} \]
- RQS (Runqueue Stretch)
\[ RQS = p_{95}(\text{runqueue delay}) - p_{50}(\text{runqueue delay}) \]
- CLI (Cache Locality Impairment)
Normalized LLC-miss uplift during high migration windows vs matched baseline.
- DGI (Dispatch Gap Inflation)
\[ DGI = \frac{p_{95}(\text{dispatch gap})}{\text{median}(\text{dispatch gap})} \]
- QDI (Queue Decay Impact)
Passive fill-ratio drop conditioned on migration spikes vs matched calm windows.
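TMR, RQS, and DGI can be computed directly from rolling-window samples. A dependency-free sketch; the sample data is synthetic and the nearest-rank percentile is a deliberate simplification:

```python
import statistics

def pctl(xs, q):
    """Nearest-rank percentile (simple, dependency-free)."""
    xs = sorted(xs)
    idx = max(0, min(len(xs) - 1, round(q / 100 * (len(xs) - 1))))
    return xs[idx]

def tmr(migrations_delta, window_s):
    """Thread Migration Rate: migrations per second over the window."""
    return migrations_delta / window_s

def rqs(runqueue_delays_us):
    """Runqueue Stretch: p95 - p50 of run-queue delay."""
    return pctl(runqueue_delays_us, 95) - pctl(runqueue_delays_us, 50)

def dgi(dispatch_gaps_us):
    """Dispatch Gap Inflation: p95 gap over median gap."""
    return pctl(dispatch_gaps_us, 95) / statistics.median(dispatch_gaps_us)

delays = [50, 55, 60, 70, 90, 120, 400, 60, 58, 65]      # us, synthetic
gaps = [200, 210, 190, 205, 900, 198, 202, 207, 195, 200]  # us, synthetic
print(tmr(30, 60))  # 30 migrations in a 60 s window -> 0.5/s
print(rqs(delays), round(dgi(gaps), 2))
```

Note how one 900 us outlier gap dominates DGI while barely moving the median; that is exactly the tail-shape sensitivity these metrics are meant to capture.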
Modeling approach
Use a baseline cost model plus a migration-uplift overlay.
Stage A: baseline cost model
Standard spread/impact/fill model with market-state features.
Stage B: migration uplift model
Predict incremental:
- \(\Delta IS_{mean}\)
- \(\Delta IS_{q95}\)
with features:
- TMR, RQS, CLI, DGI, QDI
- CPU residency entropy (how dispersed thread placement is)
- cgroup/cpuset topology hints
- urgency, participation, symbol liquidity regime
Final estimate:
\[ \hat{IS}_{final} = \hat{IS}_{baseline} + \Delta\hat{IS}_{migration} \]
Use matched windows (same symbol/session/volatility regime) to separate infra uplift from market turbulence.
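The final estimate is additive, so the overlay can be sketched independently of the baseline model. A linear uplift term stands in here for whatever Stage B actually fits; all feature names, coefficients, and the zero floor are illustrative assumptions:

```python
def final_is_estimate(baseline_bps, uplift_features, uplift_coefs, intercept=0.0):
    """IS_hat_final = IS_hat_baseline + max(0, linear uplift in infra features).

    The uplift is floored at zero on the view that migration pressure
    should not *reduce* expected cost.
    """
    uplift = intercept + sum(uplift_coefs[k] * v for k, v in uplift_features.items())
    return baseline_bps + max(0.0, uplift)

# Illustrative feature snapshot and fitted coefficients:
feats = {"tmr": 2.0, "rqs_us": 300.0, "dgi": 3.5}
coefs = {"tmr": 0.05, "rqs_us": 0.001, "dgi": 0.08}
print(round(final_is_estimate(1.8, feats, coefs), 2))  # baseline 1.8 bps + 0.68 uplift
```

In practice Stage B would be trained on matched windows as described above; the point of the sketch is only the additive composition.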
Controller state machine
State 1 — PINNED_STABLE
- low TMR
- stable runqueue tails
- no meaningful CLI uplift
Action: standard policy.
State 2 — MIGRATION_PRESSURE
- TMR rising
- RQS widening
Action: reduce replace churn, smooth child cadence, increase passive selectivity.
State 3 — CACHE_COLD_DRIFT
- persistent migration + CLI/DGI elevation
- markout tails worsening
Action: cap burst size, enforce min inter-send spacing, reduce tactic oscillation.
State 4 — SAFE_AFFINITY_MODE
- sustained CACHE_COLD_DRIFT with deadline risk
Action: affinity-hardened execution profile (pin critical workers, tighten concurrency, conservative completion mode).
Use hysteresis and minimum dwell times to avoid policy flapping.
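The ladder above can be sketched as a small controller with hysteresis (separate enter/exit thresholds) and a minimum dwell time. Driving it off TMR alone is a simplification (the real triggers also involve RQS/CLI/DGI and markouts), and the thresholds are illustrative:

```python
STATES = ["PINNED_STABLE", "MIGRATION_PRESSURE", "CACHE_COLD_DRIFT", "SAFE_AFFINITY_MODE"]

class MigrationController:
    """One-step escalation ladder with hysteresis and minimum dwell.

    Escalates when the metric crosses the next state's enter threshold,
    de-escalates only below the current state's (lower) exit threshold,
    and refuses any move before MIN_DWELL_S has elapsed -- so the policy
    does not flap around a single cut-off.
    """
    ENTER = {"MIGRATION_PRESSURE": 5.0, "CACHE_COLD_DRIFT": 15.0, "SAFE_AFFINITY_MODE": 30.0}
    EXIT = {"MIGRATION_PRESSURE": 3.0, "CACHE_COLD_DRIFT": 10.0, "SAFE_AFFINITY_MODE": 20.0}
    MIN_DWELL_S = 30.0

    def __init__(self, now=0.0):
        self.state = "PINNED_STABLE"
        self.entered_at = now

    def step(self, tmr, now):
        if now - self.entered_at < self.MIN_DWELL_S:
            return self.state  # dwell time not met: hold state
        i = STATES.index(self.state)
        if i + 1 < len(STATES) and tmr >= self.ENTER[STATES[i + 1]]:
            self._move(STATES[i + 1], now)
        elif i > 0 and tmr < self.EXIT[self.state]:
            self._move(STATES[i - 1], now)
        return self.state

    def _move(self, state, now):
        self.state, self.entered_at = state, now

c = MigrationController()
print(c.step(tmr=8.0, now=10.0))   # PINNED_STABLE: dwell not yet met
print(c.step(tmr=8.0, now=40.0))   # escalates to MIGRATION_PRESSURE
print(c.step(tmr=2.0, now=80.0))   # 2.0 < exit 3.0: back to PINNED_STABLE
```

The enter/exit gap (e.g., 5.0 in, 3.0 out) is the hysteresis; without it, a TMR hovering near one threshold would toggle the policy every tick.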
Mitigation ladder
Pin truly critical execution threads
- CPU affinity/cpuset for decision + dispatch hot paths
- keep housekeeping/background workers outside the same core island
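A minimal sketch of the pinning step, assuming a Linux host (`os.sched_setaffinity` is Linux-only); the lowest-cores-first layout is purely illustrative, and real placement should respect NUMA nodes and SMT siblings:

```python
import os

def partition_cores(n_cores, n_critical):
    """Split the host into a critical island and a housekeeping island.

    Reserving the lowest-numbered cores is an illustrative convention;
    the point is that the two sets are disjoint, so background workers
    can never preempt (and thus migrate) the execution hot path.
    """
    critical = set(range(n_critical))
    housekeeping = set(range(n_critical, n_cores))
    return critical, housekeeping

def pin_current_thread(cores):
    """Best-effort pin of the calling thread to the given core set."""
    if hasattr(os, "sched_setaffinity"):  # Linux only
        os.sched_setaffinity(0, cores)    # pid 0 = calling thread
        return True
    return False

crit, hk = partition_cores(n_cores=8, n_critical=2)
print(sorted(crit), sorted(hk))  # [0, 1] [2, 3, 4, 5, 6, 7]
```

Decision and dispatch hot-path threads would call `pin_current_thread(crit)` at startup, while housekeeping workers get `hk`, keeping the two islands disjoint.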
Tune migration sensitivity, not just CPU usage
- review scheduler balancing behavior and migration-cost heuristics
- avoid over-reactive balancing in latency-critical pools
Isolate interrupt pressure
- align IRQ affinity away from core(s) hosting execution-critical threads
- avoid hidden preemption that triggers downstream migrations
Bound catch-up behavior
- never repay decision lag with uncontrolled child-order bursts
- use capped repayment slope
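The capped repayment slope can be sketched as a per-interval send budget; the `catchup_cap` parameter and the budget formula are illustrative assumptions:

```python
def dispatch_allowance(backlog, base_rate, catchup_cap=0.25):
    """Per-interval child-send budget with a capped repayment slope.

    Instead of flushing the whole backlog after a stall (a micro-burst),
    allow at most base_rate * (1 + catchup_cap) sends per interval, so
    accumulated lag is repaid gradually over several intervals.
    """
    allowed = base_rate * (1.0 + catchup_cap)
    return min(backlog + base_rate, allowed)

# After a stall left a 10-child backlog at a steady 4 children/interval:
print(dispatch_allowance(backlog=10, base_rate=4))  # 5.0, not 14
# With no backlog, the budget is just the steady rate:
print(dispatch_allowance(backlog=0, base_rate=4))   # 4
```

With a 25% cap, a 10-child backlog drains at one extra child per interval rather than as a single late burst, which is exactly the \(C_{burst}\) mechanism being bounded.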
Topology-aware host classes
- separate latency-critical execution nodes from noisy multi-tenant workloads
Validation drills
Synthetic migration stress
- induce controlled scheduler churn and verify uplift detector response.
Affinity A/B canary
- compare pinned vs floating thread placement on matched symbols/time slices.
Burst-policy A/B
- naive catch-up vs capped repayment under migration stress.
Confounder split
- prove migration uplift remains after controlling for spread/volatility regime shifts.
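The confounder split can be sketched as a matched comparison inside each market-state bucket; the field names, the bucketing scheme, and the synthetic windows are illustrative:

```python
from statistics import mean

def matched_uplift(windows, vol_key="vol_bucket", spike_key="migration_spike",
                   is_key="is_bps"):
    """Mean IS gap between migration-spike and calm windows, matched by bucket.

    Comparing only within a volatility bucket (and averaging the per-bucket
    gaps) controls for regime shifts; a surviving positive gap is evidence
    the uplift is infra-driven rather than market turbulence.
    """
    buckets = {}
    for w in windows:
        side = "spike" if w[spike_key] else "calm"
        buckets.setdefault(w[vol_key], {"spike": [], "calm": []})[side].append(w[is_key])
    gaps = [mean(b["spike"]) - mean(b["calm"])
            for b in buckets.values() if b["spike"] and b["calm"]]
    return mean(gaps) if gaps else None

ws = [
    {"vol_bucket": "low",  "migration_spike": True,  "is_bps": 2.0},
    {"vol_bucket": "low",  "migration_spike": False, "is_bps": 1.4},
    {"vol_bucket": "high", "migration_spike": True,  "is_bps": 4.1},
    {"vol_bucket": "high", "migration_spike": False, "is_bps": 3.5},
]
print(round(matched_uplift(ws), 3))  # 0.6 bps uplift in both buckets
```

Buckets with no spike (or no calm) windows are skipped rather than extrapolated, which keeps the comparison honestly matched.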
Anti-patterns
- “CPU% is low, so scheduler placement cannot hurt us.”
- optimizing median latency while ignoring p95/p99 dispatch gaps
- co-locating strategy and noisy background workers on same core set
- letting migration bursts trigger aggressive catch-up sends
- modeling infra only at host level (not per-thread residency)
Practical rollout checklist
- Add per-thread CPU residency + migration counters to telemetry.
- Dashboard TMR/RQS/CLI/DGI/QDI by strategy and host class.
- Label migration-spike windows in TCA pipeline.
- Train migration-uplift overlay and backtest tail impact.
- Shadow-run state machine (PINNED_STABLE→SAFE_AFFINITY_MODE).
- Canary affinity-hardened profile with q95 IS and completion stability gates.
Bottom line
Cross-core task migration is not just an OS detail; in low-latency execution, it is a queue-priority and tail-cost control variable.
If you ignore scheduler locality dynamics, you may keep blaming “market noise” for slippage that is largely self-inflicted by cache-cold timing drift.
References
- Linux scheduler design docs: https://www.kernel.org/doc/html/latest/scheduler/index.html
- CFS scheduler internals: https://www.kernel.org/doc/html/latest/scheduler/sched-design-CFS.html
- cgroup v2 CPU controller docs: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
- perf examples (PMU counters and scheduler events): https://www.brendangregg.com/perf.html