Clocksource Instability & Timestamp Jitter Slippage Playbook

2026-03-19 · finance

Why this matters

Most slippage stacks model spread + impact + queue risk, but assume event timing is trustworthy.

When host timekeeping becomes unstable (TSC drift/skew, clocksource fallback, aggressive clock corrections), event ordering and age features get noisy. That can mis-time child orders, over-trust stale quotes, and silently inflate tail implementation shortfall.


Failure mechanism (infra -> execution)

  1. Clock instability appears (cross-core TSC inconsistency, watchdog-triggered clocksource fallback, or abrupt offset correction).
  2. Timestamp deltas become noisy or occasionally discontinuous.
  3. Market-data age / decision-latency features are mis-estimated.
  4. Router chooses wrong urgency (too passive on stale info, or panic-aggressive on phantom lag).
  5. Queue priority and adverse-selection outcomes degrade, especially in p95/p99.

This is a time-truth failure: the model may be correct, but its time inputs are corrupted.
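The noisy or discontinuous deltas in step 2 can be caught with a crude detector before they reach age features. A minimal sketch, assuming nanosecond capture timestamps; the median-cadence baseline and the `jump_factor` threshold are illustrative, not from any specific stack:

```python
from statistics import median

def flag_timestamp_jumps(ts_ns, jump_factor=20.0):
    """Flag event indices whose inter-arrival delta is negative
    (ordering violation) or deviates wildly from the rolling median
    delta (discontinuity). ts_ns: captured event timestamps in ns;
    jump_factor: hypothetical threshold, tune per feed."""
    deltas = [b - a for a, b in zip(ts_ns, ts_ns[1:])]
    med = median(abs(d) for d in deltas) or 1
    return [i + 1 for i, d in enumerate(deltas)
            if d < 0 or abs(d) > jump_factor * med]
```

Flagged events can then be excluded from age/latency feature computation rather than silently distorting it.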


Observable metrics

Use a dedicated time-integrity bundle.

1) CSD — Clocksource Switch Density: clocksource switches per host per hour (e.g., watchdog-triggered fallbacks).

2) TJO95 — Timestamp Jump Offset p95: 95th percentile of timestamp-delta deviation from the expected inter-event cadence.

3) OOR — Ordering-Override Rate: share of events whose timestamp order contradicts their sequence order.

4) ABE — Age-Bucket Error: rate at which quote/feature age is assigned to the wrong staleness bucket.

5) DCL — Dispatch-Cadence Lift: shift in child-order dispatch cadence relative to the clean-clock baseline.
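Three of the five metrics can be computed directly from an event log plus a clocksource-switch count. A sketch with illustrative definitions reconstructed from the metric names, not a spec:

```python
import math
from statistics import median

def time_integrity_bundle(events, clocksource_switches, window_s):
    """Compute CSD, TJO95, and OOR over one window.
    events: list of (capture_ts_ns, sequence_number), at least two,
    in received order. clocksource_switches: switch count in the
    window. window_s: window length in seconds."""
    deltas = [b[0] - a[0] for a, b in zip(events, events[1:])]
    med = median(deltas)
    devs = sorted(abs(d - med) for d in deltas)
    # TJO95: p95 of timestamp-delta deviation from the median cadence
    tjo95 = devs[max(0, math.ceil(0.95 * len(devs)) - 1)]
    # OOR: share of adjacent pairs where timestamp order contradicts
    # sequence order
    oor = sum((tb - ta) * (sb - sa) < 0
              for (ta, sa), (tb, sb) in zip(events, events[1:])) / len(deltas)
    # CSD: clocksource switches per hour in the window
    csd = clocksource_switches / (window_s / 3600.0)
    return {"CSD": csd, "TJO95": tjo95, "OOR": oor}
```

ABE and DCL additionally need the staleness-bucket boundaries and a clean-clock dispatch baseline, so they are omitted here.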


Modeling pattern

Augment the residual-cost model with time-integrity state features:

Train both:

  1. Mean residual head (baseline cost)
  2. q95 residual head (tail protection)

Time-instability features often look weak in the mean but dominate the tails.


Regime state machine

CLOCK_CLEAN

CLOCK_DRIFTING

CLOCK_UNSTABLE

SAFE_TIME_CONTAIN

Use hysteresis and minimum dwell times to avoid flip-flop behavior.
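A minimal sketch of the hysteresis and dwell-time logic over the four states, driven by a scalar instability score (higher = worse). The thresholds, dwell length, and score construction are hypothetical; tune them against CSD/TJO95 in shadow mode:

```python
class ClockRegime:
    """Hysteresis state machine over the four clock regimes.
    Escalation thresholds (up) sit above de-escalation thresholds
    (down), so small oscillations in the score cannot flip-flop the
    state; min_dwell forces a minimum number of ticks per state."""
    STATES = ["CLOCK_CLEAN", "CLOCK_DRIFTING",
              "CLOCK_UNSTABLE", "SAFE_TIME_CONTAIN"]

    def __init__(self, up=(0.3, 0.6, 0.9), down=(0.2, 0.4, 0.7),
                 min_dwell=5):
        self.up, self.down, self.min_dwell = up, down, min_dwell
        self.level = 0   # index into STATES
        self.dwell = 0   # ticks spent in current state

    def step(self, score):
        self.dwell += 1
        if self.dwell < self.min_dwell:
            return self.STATES[self.level]
        # escalate one level when score crosses the next up-threshold
        if self.level < 3 and score >= self.up[self.level]:
            self.level += 1
            self.dwell = 0
        # de-escalate only when score falls below the lower
        # down-threshold for the current level
        elif self.level > 0 and score < self.down[self.level - 1]:
            self.level -= 1
            self.dwell = 0
        return self.STATES[self.level]
```

Moving one level per tick (rather than jumping straight to containment) keeps state changes attributable in post-trade analysis.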


Control actions by state

CLOCK_CLEAN -> CLOCK_DRIFTING: keep routing unchanged; down-weight age- and latency-sensitive features and raise monitoring resolution.

CLOCK_DRIFTING -> CLOCK_UNSTABLE: cap urgency changes driven by timing signals; lean on the q95 head for tail protection.

CLOCK_UNSTABLE -> SAFE_TIME_CONTAIN: contain: prefer time-clean hosts and stop trusting host timestamps for ordering decisions.


Fast diagnostics checklist

  1. Did slippage tails widen with stable spread/impact but rising OOR/ABE?
  2. Are clocksource-switch or time-jump signals present near degradation windows?
  3. Do affected hosts show stronger residual drift than unaffected hosts?
  4. Does rerouting to time-clean hosts reduce q95 residual quickly?

If yes, this is likely timestamp-integrity-driven slippage, not pure market regime change.
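Checks 3 and 4 reduce to a simple tail-gap comparison between flagged and clean hosts. A sketch; the nearest-rank percentile, host grouping, and residual units (bps) are illustrative:

```python
def q95(xs):
    """Empirical 95th percentile, nearest-rank method."""
    s = sorted(xs)
    return s[max(0, -(-95 * len(s) // 100) - 1)]

def host_tail_gap(residuals_by_host, flagged_hosts):
    """Diagnostic #3 as code: q95 slippage residual on hosts flagged
    by the time-integrity bundle minus q95 on clean hosts.
    residuals_by_host: {host: [residual_bps, ...]}.
    A clearly positive gap supports the time-truth hypothesis."""
    flagged = [r for h, rs in residuals_by_host.items()
               if h in flagged_hosts for r in rs]
    clean = [r for h, rs in residuals_by_host.items()
             if h not in flagged_hosts for r in rs]
    return q95(flagged) - q95(clean)
```

Re-running the same gap after rerouting flow to the clean hosts (diagnostic #4) should show it shrinking quickly if the hypothesis holds.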


Deployment playbook (safe rollout)

  1. Shadow: log time-integrity bundle and attribution only
  2. Advisory: produce non-binding state recommendations
  3. Canary: enable controls for a small flow slice
  4. Promotion: require q95 improvement with no completion-rate collapse
  5. Rollback: auto-disable if underfill/opportunity-cost exceeds budget
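Steps 4 and 5 can be encoded as a single promotion gate. The threshold values below are illustrative budget parameters, not numbers from this playbook:

```python
def promotion_decision(q95_baseline, q95_canary,
                       fill_baseline, fill_canary,
                       min_q95_gain=0.05, max_fill_drop=0.02):
    """Gate the canary: promote only on q95 residual improvement with
    no completion-rate collapse; roll back when the underfill budget
    is exceeded; otherwise hold at the current stage.
    q95_*: q95 slippage residuals; fill_*: completion rates in [0, 1]."""
    q95_gain = (q95_baseline - q95_canary) / abs(q95_baseline)
    fill_drop = fill_baseline - fill_canary
    if fill_drop > max_fill_drop:
        return "ROLLBACK"   # underfill / opportunity-cost budget blown
    if q95_gain >= min_q95_gain:
        return "PROMOTE"
    return "HOLD"
```

Evaluating this gate on every canary window (rather than once at the end) gives the auto-disable behavior step 5 asks for.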

Common mistakes

  1. Blaming a market regime change for tails that are actually timestamp-integrity-driven (run the diagnostics checklist first).
  2. Switching regimes without hysteresis or minimum dwell times, producing flip-flop control.
  3. Promoting on mean improvement alone when the damage is concentrated in p95/p99.

Bottom line

Clock instability is execution risk, not just observability noise.

If you do not model time-integrity regimes, your router can make confidently wrong decisions on stale or misordered timing signals, and pay a hidden basis-point tax in the tails.