Clocksource Instability & Timestamp Jitter Slippage Playbook
Why this matters
Most slippage stacks model spread + impact + queue risk, but assume event timing is trustworthy.
When host timekeeping becomes unstable (TSC drift/skew, clocksource fallback, aggressive clock corrections), event ordering and age features get noisy. That can mis-time child orders, over-trust stale quotes, and silently inflate tail implementation shortfall.
Failure mechanism (infra -> execution)
- Clock instability appears (cross-core TSC inconsistency, watchdog-triggered clocksource fallback, or abrupt offset correction).
- Timestamp deltas become noisy or occasionally discontinuous.
- Market-data age / decision-latency features are mis-estimated.
- Router chooses wrong urgency (too passive on stale info, or panic-aggressive on phantom lag).
- Queue priority and adverse-selection outcomes degrade, especially in p95/p99.
This is a time-truth failure: the model may be correct, but its time inputs are corrupted.
Observable metrics
Use a dedicated time-integrity bundle.
1) CSD — Clocksource Switch Density
- Count of clocksource changes per host per hour (e.g., tsc -> hpet)
- Non-zero CSD is a strong instability warning in latency-sensitive systems
2) TJO95 — Timestamp Jump Offset p95
- p95 of the absolute jump between consecutive monotonic-clock deltas, counted beyond the expected jitter envelope
- Detects discontinuity-like behavior in local timing
3) OOR — Ordering-Override Rate
- Fraction of event pairs whose arrival order disagrees with the expected causal sequence under normal latency bounds
- Practical proxy for time-order trust degradation
4) ABE — Age-Bucket Error
- Error between expected quote-age bucket and realized post-trade age diagnostics
- Bridges timekeeping noise into execution feature corruption
5) DCL — Dispatch-Cadence Lift
- Incremental lift in inter-dispatch gap variance vs clean baseline
- Captures downstream cadence damage from timing uncertainty
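As a rough illustration of the first two metrics, the sketch below computes CSD from periodically sampled clocksource names and TJO95 from a fixed-interval monotonic probe. The function names, the probe design, and the choice of "deviation from the median delta, minus the envelope" as the jump measure are all assumptions made for illustration; the playbook does not prescribe an exact formula. On Linux the active clocksource can typically be read from /sys/devices/system/clocksource/clocksource0/current_clocksource, though sampling will miss a switch that reverts between two samples.

```python
from statistics import median


def clocksource_switch_density(samples, hours):
    """CSD: clocksource changes per hour, from periodically sampled names."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    switches = sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)
    return switches / hours


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile; pct is an integer such as 95."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling
    return ordered[rank - 1]


def tjo95(probe_ns, envelope_ns):
    """TJO95 (one possible reading): p95 of how far each probe delta strays
    from the median delta, counting only the part beyond the envelope."""
    deltas = [b - a for a, b in zip(probe_ns, probe_ns[1:])]
    if not deltas:
        return 0.0
    typical = median(deltas)
    excess = [max(0.0, abs(d - typical) - envelope_ns) for d in deltas]
    return nearest_rank_percentile(excess, 95)
```

The envelope (envelope_ns here) would need to be calibrated per host class from a clean baseline; a value that is too tight turns ordinary scheduler jitter into false alarms.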
Modeling pattern
Augment residual model with time-integrity state:
IS_residual_t = f(market_state_t, order_state_t, time_integrity_t)
time_integrity_t = {CSD, TJO95, OOR, ABE, DCL}
Train both:
- Mean residual head (baseline cost)
- q95 residual head (tail protection)
Time-instability features often look weak in the mean head but dominate in the q95 head.
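One common way to train a q95 head is the quantile (pinball) loss, which is assumed here; the playbook does not say which objective it uses. The sketch shows why a tail head reacts to features that a mean head can ignore: under-predicting a residual is penalised 19 times more heavily than over-predicting it at q = 0.95.

```python
def pinball_loss(actual, predicted, q=0.95):
    """Mean quantile (pinball) loss for quantile level q in (0, 1)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(actual, predicted):
        err = y - p
        # Under-prediction (err > 0) costs q per unit,
        # over-prediction costs (1 - q) per unit.
        total += q * err if err >= 0 else (q - 1.0) * err
    return total / len(actual)


def squared_loss(actual, predicted):
    """Mean squared error, the usual objective for the mean head."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)
```

For example, missing a 10 bps residual from below costs 9.5 under the q95 loss, while overshooting by the same amount costs only 0.5; the squared loss treats both misses identically.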
Regime state machine
CLOCK_CLEAN
- Stable clocksource, low OOR/ABE
- Normal execution policy
CLOCK_DRIFTING
- Early rise in ABE or DCL without hard discontinuity
- Pre-emptive caution mode
CLOCK_UNSTABLE
- CSD > 0 or TJO95/OOR jump beyond limits
- High risk of wrong urgency and stale-feature decisions
SAFE_TIME_CONTAIN
- Persistent instability or repeated switch events
- Force conservative execution + host remediation
Use hysteresis and minimum dwell times to avoid flip-flop behavior.
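The four states, the hysteresis band, and the minimum dwell time can be sketched as a small state machine. Everything numeric below (thresholds, dwell length, the rule that three consecutive unstable ticks mean "persistent") is a hypothetical placeholder, as are the class and field names. The design choice illustrated is asymmetric: escalation is immediate, while de-escalation steps down one level at a time and only after the dwell time has elapsed.

```python
from dataclasses import dataclass
from enum import IntEnum


class ClockState(IntEnum):
    CLOCK_CLEAN = 0
    CLOCK_DRIFTING = 1
    CLOCK_UNSTABLE = 2
    SAFE_TIME_CONTAIN = 3


@dataclass
class Limits:
    # All values are illustrative assumptions, not calibrated thresholds.
    drift_enter: float = 0.02     # ABE/DCL level that enters CLOCK_DRIFTING
    drift_exit: float = 0.01      # lower exit level, forming the hysteresis band
    tjo95_limit_ns: float = 5000.0
    oor_limit: float = 0.001
    contain_after: int = 3        # consecutive unstable ticks before containment
    min_dwell: int = 5            # ticks to wait before stepping down


class ClockRegime:
    def __init__(self, limits=None):
        self.limits = limits or Limits()
        self.state = ClockState.CLOCK_CLEAN
        self.dwell = 0
        self.unstable_streak = 0

    def update(self, csd=0, tjo95_ns=0.0, oor=0.0, abe=0.0, dcl=0.0):
        lim = self.limits
        hard = csd > 0 or tjo95_ns > lim.tjo95_limit_ns or oor > lim.oor_limit
        self.unstable_streak = self.unstable_streak + 1 if hard else 0
        soft = max(abe, dcl)

        if self.unstable_streak >= lim.contain_after:
            target = ClockState.SAFE_TIME_CONTAIN
        elif hard:
            target = ClockState.CLOCK_UNSTABLE
        elif soft > lim.drift_enter:
            target = ClockState.CLOCK_DRIFTING
        elif soft > lim.drift_exit and self.state >= ClockState.CLOCK_DRIFTING:
            target = ClockState.CLOCK_DRIFTING  # inside the band: stay cautious
        else:
            target = ClockState.CLOCK_CLEAN

        if target > self.state:
            self.state, self.dwell = target, 0  # escalate immediately
        elif target < self.state and self.dwell >= lim.min_dwell:
            self.state, self.dwell = ClockState(self.state - 1), 0  # one step down
        else:
            self.dwell += 1
        return self.state
```

A typical use would call update once per monitoring tick with the latest bundle values and feed the returned state to the execution policy.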
Control actions by state
CLOCK_CLEAN -> CLOCK_DRIFTING
- Tighten time-integrity monitoring window and alerting
- Reduce dependence on ultra-fine age features (coarser buckets)
- Increase weighting of robust market-state features
CLOCK_DRIFTING -> CLOCK_UNSTABLE
- Cap aggression escalation triggered purely by latency/age signals
- Increase passive-order timeout conservatism to avoid stale joins
- Apply stricter tail-budget gating on fast reprice loops
CLOCK_UNSTABLE -> SAFE_TIME_CONTAIN
- Freeze non-essential strategy processes on affected hosts
- Route sensitive flow to time-clean hosts/instances
- Trigger host remediation lane (clocksource/PTP discipline checks) before restoring normal mode
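The actions above can be expressed as a per-state policy table that the router consults. This is a minimal sketch: the field names and every number are hypothetical, and only a subset of the listed controls (coarser age buckets, a cap on age-driven escalation, a stricter tail budget, rerouting) is represented.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExecutionPolicy:
    age_bucket_ms: float                        # width of quote-age buckets
    max_age_driven_escalations: Optional[int]   # None means uncapped
    tail_budget_scale: float                    # 1.0 = normal, lower = stricter
    reroute_sensitive_flow: bool


# Placeholder values for illustration only, not calibrated settings.
POLICIES = {
    "CLOCK_CLEAN": ExecutionPolicy(1.0, None, 1.0, False),
    "CLOCK_DRIFTING": ExecutionPolicy(5.0, None, 1.0, False),
    "CLOCK_UNSTABLE": ExecutionPolicy(5.0, 1, 0.5, False),
    "SAFE_TIME_CONTAIN": ExecutionPolicy(25.0, 0, 0.25, True),
}


def bucket_age(age_ms, state):
    """Snap a quote age down to the bucket width allowed in this state."""
    width = POLICIES[state].age_bucket_ms
    return math.floor(age_ms / width) * width
```

With these placeholder widths, a 7.3 ms quote age is reported as 7.0 ms in CLOCK_CLEAN but as 5.0 ms in CLOCK_DRIFTING, so small timestamp errors are less likely to move a feature across a bucket boundary.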
Fast diagnostics checklist
- Did slippage tails widen with stable spread/impact but rising OOR/ABE?
- Are clocksource-switch or time-jump signals present near degradation windows?
- Do affected hosts show stronger residual drift than unaffected hosts?
- Does rerouting to time-clean hosts reduce q95 residual quickly?
If yes, this is likely timestamp-integrity-driven slippage, not pure market regime change.
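The host-comparison question in the checklist can be checked with a simple tail gap: the q95 residual on hosts flagged as affected minus the q95 residual on the remaining hosts. The function names and input layout (a dict of host name to residuals in bps) are assumptions for illustration.

```python
from statistics import quantiles


def q95(values):
    """95th percentile: the last of the 19 cut points for n=20."""
    return quantiles(values, n=20)[-1]


def host_tail_gap(residuals_by_host, affected_hosts):
    """q95 residual on affected hosts minus q95 on all other hosts."""
    affected, clean = [], []
    for host, residuals in residuals_by_host.items():
        (affected if host in affected_hosts else clean).extend(residuals)
    if len(affected) < 2 or len(clean) < 2:
        raise ValueError("need at least two residuals in each group")
    return q95(affected) - q95(clean)
```

A clearly positive gap that shrinks after rerouting is consistent with the timestamp-integrity explanation; a gap near zero points back toward a market-regime cause. Pooling hosts like this ignores differences in flow mix, so it is a first-pass screen rather than an attribution.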
Deployment playbook (safe rollout)
- Shadow: log time-integrity bundle and attribution only
- Advisory: produce non-binding state recommendations
- Canary: enable controls for a small flow slice
- Promotion: require q95 improvement with no completion-rate collapse
- Rollback: auto-disable if underfill/opportunity-cost exceeds budget
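The promotion and rollback rules can be written as a single canary gate. This is a sketch under assumed thresholds; the parameter names and default values are hypothetical and would come from the desk's own budgets.

```python
def canary_decision(
    q95_control_bps,
    q95_canary_bps,
    fill_rate_control,
    fill_rate_canary,
    underfill_cost_bps,
    *,
    min_q95_gain_bps=0.5,       # assumed minimum tail improvement
    max_fill_rate_drop=0.02,    # assumed tolerated completion-rate loss
    underfill_budget_bps=1.0,   # assumed opportunity-cost budget
):
    """Return 'rollback', 'promote', or 'hold' for the canary slice."""
    # Rollback takes priority: a budget breach disables the controls.
    if underfill_cost_bps > underfill_budget_bps:
        return "rollback"
    gain = q95_control_bps - q95_canary_bps
    fill_drop = fill_rate_control - fill_rate_canary
    if gain >= min_q95_gain_bps and fill_drop <= max_fill_rate_drop:
        return "promote"
    return "hold"
```

Checking the rollback condition first means a large q95 improvement cannot mask an underfill problem.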
Common mistakes
- Assuming monotonic timestamps are always trustworthy in production
- Ignoring clocksource switch events as “just system logs”
- Overfitting urgency logic to fine-grained age features without integrity checks
- Treating time sync as compliance-only, not execution alpha-protection
Bottom line
Clock instability is execution risk, not just observability noise.
If you do not model time-integrity regimes, your router can make confidently wrong decisions on stale or misordered timing signals, and pay a hidden basis-point tax in the tails.