Clocksource Instability & Timestamp Jitter Slippage Playbook

Why this matters

Most slippage stacks model spread + impact + queue risk, but assume event timing is trustworthy.

When host timekeeping becomes unstable (TSC drift/skew, clocksource fallback, aggressive clock corrections), event ordering and age features get noisy. That can mis-time child orders, over-trust stale quotes, and silently inflate tail implementation shortfall.

Failure mechanism (infra -> execution)

1) Clock instability appears (cross-core TSC inconsistency, watchdog-triggered clocksource fallback, or abrupt offset correction).
2) Timestamp deltas become noisy or occasionally discontinuous.
3) Market-data age / decision-latency features are mis-estimated.
4) Router chooses wrong urgency (too passive on stale info, or panic-aggressive on phantom lag).
5) Queue priority and adverse-selection outcomes degrade, especially in p95/p99.

This is a time-truth failure: the model may be correct, but its time inputs are corrupted.

Observable metrics

Use a dedicated time-integrity bundle.

1) CSD — Clocksource Switch Density

Count of clocksource changes per host/hour (e.g., tsc -> hpet)
Non-zero CSD is a strong instability warning in latency-sensitive systems
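
As a rough illustration of how CSD could be collected, the sketch below polls the standard Linux sysfs entry for the active clocksource and counts transitions between samples. The helper names (`read_current_clocksource`, `count_switches`, `csd_per_hour`) are hypothetical, and polling is an assumption of this sketch: it can miss a switch that reverts between two samples, so the kernel log is usually a more complete record.

```python
from pathlib import Path
from typing import Iterable, Optional

# Standard Linux sysfs location of the active clocksource name.
CLOCKSOURCE_PATH = Path(
    "/sys/devices/system/clocksource/clocksource0/current_clocksource"
)


def read_current_clocksource(path: Path = CLOCKSOURCE_PATH) -> Optional[str]:
    """Return the active clocksource name (e.g. 'tsc'), or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def count_switches(samples: Iterable[Optional[str]]) -> int:
    """Count transitions between distinct clocksource names, in sample order.

    Unreadable samples (None) are skipped rather than counted as switches.
    """
    switches = 0
    previous = None
    for name in samples:
        if name is None:
            continue
        if previous is not None and name != previous:
            switches += 1
        previous = name
    return switches


def csd_per_hour(samples: Iterable[Optional[str]], window_seconds: float) -> float:
    """Normalize the switch count to a per-hour rate (hypothetical helper)."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return count_switches(samples) * 3600.0 / window_seconds
```

For example, samples of `tsc, tsc, hpet, hpet, tsc` contain two switches; over a 30-minute window that would be a CSD of 4 per hour.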

2) TJO95 — Timestamp Jump Offset p95

p95 of the absolute deviation of consecutive monotonic-clock deltas from their expected value, counting only the part that exceeds the normal jitter envelope
Detects discontinuity-like behavior in local timing
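
One possible way to compute TJO95 is sketched below. It assumes a fixed-cadence probe that reads the monotonic clock at a nominal interval, so each delta has a known expected value; `expected_delta_ns`, `envelope_ns`, and the nearest-rank percentile are choices made for this sketch, not something the metric definition prescribes.

```python
import math
from typing import Sequence


def nearest_rank_percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct / 100.0 * len(ordered))
    return ordered[rank - 1]


def tjo95(
    timestamps_ns: Sequence[int],
    expected_delta_ns: int,
    envelope_ns: int,
) -> float:
    """p95 of the delta deviation that exceeds the jitter envelope.

    Deviations inside the envelope contribute zero, so a clean host
    reports a TJO95 of 0.
    """
    if len(timestamps_ns) < 2:
        raise ValueError("need at least two timestamps")
    excess = []
    for earlier, later in zip(timestamps_ns, timestamps_ns[1:]):
        deviation = abs((later - earlier) - expected_delta_ns)
        excess.append(max(0, deviation - envelope_ns))
    return nearest_rank_percentile(excess, 95)
```

With a small sample the p95 is dominated by the single worst delta, so in practice the window should hold enough probes for the percentile to be meaningful.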

3) OOR — Ordering-Override Rate

Fraction of event pairs whose observed arrival order disagrees with the expected causal sequence under normal latency bounds
Practical proxy for time-order trust degradation
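
A minimal sketch of OOR, assuming each pair is a (cause, effect) timestamp tuple on the same host clock, such as an order send and its acknowledgement. The `min_latency_ns` bound is a hypothetical parameter: any pair whose observed gap falls below it (including negative gaps, where the effect appears to precede the cause) is counted as an override.

```python
from typing import Iterable, Tuple


def ordering_override_rate(
    pairs: Iterable[Tuple[int, int]],
    min_latency_ns: int = 0,
) -> float:
    """Fraction of (cause_ts, effect_ts) pairs with an implausible gap.

    A pair is a violation when effect_ts - cause_ts is below the minimum
    plausible latency. Returns 0.0 when there are no pairs.
    """
    total = 0
    violations = 0
    for cause_ts, effect_ts in pairs:
        total += 1
        if effect_ts - cause_ts < min_latency_ns:
            violations += 1
    if total == 0:
        return 0.0
    return violations / total
```

Setting `min_latency_ns` to a known physical floor (for example, a wire round trip) makes the check stricter than a pure sign test.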

4) ABE — Age-Bucket Error

Error between expected quote-age bucket and realized post-trade age diagnostics
Bridges timekeeping noise into execution feature corruption
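
One way ABE might be scored is as the mean absolute distance between the age bucket used at decision time and the bucket reconstructed post-trade. This is a sketch under that assumption; integer bucket indices and the function name `age_bucket_error` are illustrative.

```python
from typing import Sequence


def age_bucket_error(
    expected_buckets: Sequence[int],
    realized_buckets: Sequence[int],
) -> float:
    """Mean absolute bucket distance between decision-time and post-trade age."""
    if len(expected_buckets) != len(realized_buckets):
        raise ValueError("bucket sequences must have the same length")
    if not expected_buckets:
        raise ValueError("bucket sequences must be non-empty")
    total = 0
    for expected, realized in zip(expected_buckets, realized_buckets):
        total += abs(expected - realized)
    return total / len(expected_buckets)
```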

5) DCL — Dispatch-Cadence Lift

Incremental lift in inter-dispatch gap variance vs clean baseline
Captures downstream cadence damage from timing uncertainty
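
A hedged sketch of DCL, interpreting "incremental lift" as the ratio of current to baseline gap variance minus one. That interpretation, the use of population variance, and the function names are assumptions of this example.

```python
import statistics
from typing import List, Sequence


def inter_dispatch_gaps(timestamps: Sequence[float]) -> List[float]:
    """Gaps between consecutive dispatch timestamps."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def dispatch_cadence_lift(
    current_ts: Sequence[float],
    baseline_ts: Sequence[float],
) -> float:
    """Relative increase in inter-dispatch gap variance vs a clean baseline.

    0.0 means no change; 1.0 means the variance doubled.
    """
    if len(current_ts) < 2 or len(baseline_ts) < 2:
        raise ValueError("need at least two timestamps per series")
    current_var = statistics.pvariance(inter_dispatch_gaps(current_ts))
    baseline_var = statistics.pvariance(inter_dispatch_gaps(baseline_ts))
    if baseline_var == 0:
        raise ValueError("baseline gap variance is zero; lift is undefined")
    return current_var / baseline_var - 1.0
```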

Modeling pattern

Augment the residual model with a time-integrity state:

IS_residual_t = f(market_state_t, order_state_t, time_integrity_t)
time_integrity_t = {CSD, TJO95, OOR, ABE, DCL}

Train both:

Mean residual head (baseline cost)
q95 residual head (tail protection)

Time-instability features often look weak in the mean head but dominate in the tail head.
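
The q95 head is typically trained with a quantile (pinball) loss rather than squared error. The sketch below shows that loss in plain Python; the text does not say which objective the author uses, so treat this as one common option.

```python
from typing import Sequence


def pinball_loss(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    quantile: float = 0.95,
) -> float:
    """Mean pinball loss for a single quantile.

    At quantile 0.95, under-predicting the residual costs 19x more than
    over-predicting it by the same amount, which pushes the head toward
    the upper tail.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must be non-empty")
    if not 0.0 < quantile < 1.0:
        raise ValueError("quantile must be in (0, 1)")
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        error = actual - predicted
        total += max(quantile * error, (quantile - 1.0) * error)
    return total / len(y_true)
```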

Regime state machine

CLOCK_CLEAN

Stable clocksource, low OOR/ABE
Normal execution policy

CLOCK_DRIFTING

Early rise in ABE or DCL without hard discontinuity
Pre-emptive caution mode

CLOCK_UNSTABLE

CSD > 0 or TJO95/OOR jump beyond limits
High risk of wrong urgency and stale-feature decisions

SAFE_TIME_CONTAIN

Persistent instability or repeated switch events
Force conservative execution + host remediation

Use hysteresis and minimum dwell times to avoid flip-flop behavior.
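
The four states, with hysteresis and a minimum dwell time, could be sketched as follows. Every threshold in `Limits` is a placeholder, "repeated switch events" is simplified to a CSD count within the window, and the policy of escalating immediately while stepping down one level at a time is a design assumption of this sketch.

```python
from dataclasses import dataclass
from enum import IntEnum


class ClockState(IntEnum):
    CLOCK_CLEAN = 0
    CLOCK_DRIFTING = 1
    CLOCK_UNSTABLE = 2
    SAFE_TIME_CONTAIN = 3


@dataclass(frozen=True)
class Snapshot:
    csd: float    # clocksource switches in the current window
    tjo95: float  # ns
    oor: float
    abe: float
    dcl: float


@dataclass(frozen=True)
class Limits:
    # Placeholder values; calibrate against a clean baseline.
    soft_abe: float = 0.05
    soft_dcl: float = 0.25
    hard_tjo95: float = 50_000.0
    hard_oor: float = 0.01
    contain_csd: float = 2.0   # "repeated switch events"
    exit_ratio: float = 0.5    # hysteresis: exit limits = entry limits * ratio
    min_dwell_s: float = 60.0


def severity(s: Snapshot, lim: Limits, scale: float = 1.0) -> ClockState:
    """Map a snapshot to a state; scale < 1 tightens the thresholds."""
    if s.csd >= lim.contain_csd:
        return ClockState.SAFE_TIME_CONTAIN
    if s.csd > 0 or s.tjo95 > lim.hard_tjo95 * scale or s.oor > lim.hard_oor * scale:
        return ClockState.CLOCK_UNSTABLE
    if s.abe > lim.soft_abe * scale or s.dcl > lim.soft_dcl * scale:
        return ClockState.CLOCK_DRIFTING
    return ClockState.CLOCK_CLEAN


class ClockRegime:
    def __init__(self, limits: Limits = Limits()) -> None:
        self.limits = limits
        self.state = ClockState.CLOCK_CLEAN
        self.entered_at = 0.0

    def update(self, snapshot: Snapshot, now_s: float) -> ClockState:
        entry_view = severity(snapshot, self.limits)
        if entry_view > self.state:
            # Escalate immediately: reacting late is the costly error.
            self.state, self.entered_at = entry_view, now_s
            return self.state
        # De-escalate only if metrics clear the tighter exit thresholds
        # and the minimum dwell time in the current state has elapsed.
        exit_view = severity(snapshot, self.limits, scale=self.limits.exit_ratio)
        dwelled = now_s - self.entered_at >= self.limits.min_dwell_s
        if exit_view < self.state and dwelled:
            self.state = ClockState(self.state - 1)  # one level at a time
            self.entered_at = now_s
        return self.state
```

Stepping down one level per dwell period means a host that reached CLOCK_UNSTABLE needs at least two clean dwell periods before it returns to CLOCK_CLEAN.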

Control actions by state

CLOCK_CLEAN -> CLOCK_DRIFTING

Tighten time-integrity monitoring window and alerting
Reduce dependence on ultra-fine age features (coarser buckets)
Increase weighting of robust market-state features
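
Coarsening the age feature can be as simple as widening the bucket with the regime. The widths below are illustrative placeholders, not recommended values, and clamping negative ages to zero is an assumption made here to absorb misordered timestamps.

```python
# Hypothetical bucket widths in microseconds, keyed by regime state name.
BUCKET_WIDTH_US = {
    "CLOCK_CLEAN": 50,
    "CLOCK_DRIFTING": 250,
    "CLOCK_UNSTABLE": 1000,
    "SAFE_TIME_CONTAIN": 5000,
}


def age_bucket(age_us: float, state: str) -> int:
    """Bucket index for a quote age, coarser as the clock regime worsens.

    Negative ages (possible when timestamps are misordered) are clamped
    to zero instead of producing a negative bucket.
    """
    width = BUCKET_WIDTH_US[state]
    return int(max(age_us, 0.0) // width)
```

A 120 µs quote age lands in bucket 2 under CLOCK_CLEAN but in bucket 0 under CLOCK_DRIFTING, so small timing errors stop moving the feature.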

CLOCK_DRIFTING -> CLOCK_UNSTABLE

Cap aggression escalation triggered purely by latency/age signals
Increase passive-order timeout conservatism to avoid stale joins
Apply stricter tail-budget gating on fast reprice loops

CLOCK_UNSTABLE -> SAFE_TIME_CONTAIN

Freeze non-essential strategy processes on affected hosts
Route sensitive flow to time-clean hosts/instances
Trigger host remediation lane (clocksource/PTP discipline checks) before restoring normal mode

Fast diagnostics checklist

Did slippage tails widen with stable spread/impact but rising OOR/ABE?
Are clocksource-switch or time-jump signals present near degradation windows?
Do affected hosts show stronger residual drift than unaffected hosts?
Does rerouting to time-clean hosts reduce q95 residual quickly?

If the answer to most of these is yes, this is likely timestamp-integrity-driven slippage, not pure market regime change.
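
The host-comparison question in the checklist can be checked with a quick q95 split. The sketch assumes residuals are already grouped per host in basis points and that the set of affected hosts is known from the time-integrity bundle; both the data layout and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set


def q95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]


def q95_gap_bps(
    residuals_by_host: Dict[str, List[float]],
    affected_hosts: Set[str],
) -> float:
    """q95 residual on affected hosts minus q95 residual on clean hosts."""
    affected: List[float] = []
    clean: List[float] = []
    for host, residuals in residuals_by_host.items():
        (affected if host in affected_hosts else clean).extend(residuals)
    if not affected or not clean:
        raise ValueError("need residuals from both affected and clean hosts")
    return q95(affected) - q95(clean)
```

A clearly positive gap while spread and impact are stable is consistent with the time-integrity explanation; it is not proof, since host placement can correlate with flow mix.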

Deployment playbook (safe rollout)

Shadow: log time-integrity bundle and attribution only
Advisory: produce non-binding state recommendations
Canary: enable controls for a small flow slice
Promotion: require q95 improvement with no completion-rate collapse
Rollback: auto-disable if underfill/opportunity-cost exceeds budget
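
The promotion and rollback gates might be encoded as two small predicates. The field names and default tolerances are assumptions for illustration; "completion-rate collapse" is approximated here as a drop larger than `max_completion_drop`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceStats:
    q95_residual_bps: float
    completion_rate: float        # fraction in [0, 1]
    opportunity_cost_bps: float


def should_promote(
    control: SliceStats,
    canary: SliceStats,
    min_q95_gain_bps: float = 0.5,
    max_completion_drop: float = 0.02,
) -> bool:
    """Promote only if the tail improves and completion does not collapse."""
    q95_gain = control.q95_residual_bps - canary.q95_residual_bps
    completion_drop = control.completion_rate - canary.completion_rate
    return q95_gain >= min_q95_gain_bps and completion_drop <= max_completion_drop


def should_rollback(canary: SliceStats, opportunity_cost_budget_bps: float) -> bool:
    """Auto-disable when underfill/opportunity cost exceeds its budget."""
    return canary.opportunity_cost_bps > opportunity_cost_budget_bps
```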

Common mistakes

Assuming monotonic timestamps are always trustworthy in production
Ignoring clocksource switch events as “just system logs”
Overfitting urgency logic to fine-grained age features without integrity checks
Treating time sync as compliance-only, not execution alpha-protection

Bottom line

Clock instability is execution risk, not just observability noise.

If you do not model time-integrity regimes, your router can make confidently wrong decisions on stale or misordered timing signals and pay a hidden basis-point tax in the tails.