PTP Grandmaster Failover Clock-Step Slippage Playbook
Date: 2026-03-25
Category: research
Audience: low-latency execution operators with PTP/NTP-synchronized trading stacks
Why this note
Many slippage models assume timestamps are trustworthy enough to align signal, decision, and fill events.
That assumption breaks during time-sync regime transitions (PTP grandmaster failover, BMCA re-election, bad holdover, sudden NTP step correction). When local clocks step or rapidly slew, you can get:
- causal misordering (fill appears before child send),
- stale alpha acceptance (signal-age gate bypass),
- broken queue/latency attribution,
- and wrong emergency behavior (controller believes it is “fast” while actually blind).
This is a hidden source of tail slippage because execution policy still looks normal while time integrity is degraded.
1) Cost decomposition with time-integrity penalty
For a child-order decision at time \(t\):
\[ \mathbb{E}[C_t] = \mathbb{E}[C_{exec}\mid a_t, s_t] + \lambda_\tau \cdot \mathbb{E}[C_{time}\mid z_t] \]
Where:
- \(C_{exec}\): spread + impact + fees + opportunity cost
- \(z_t\): time-integrity state (offset, drift, step events, source quality)
- \(C_{time}\): excess cost from timestamp corruption (mis-gating, stale decisions, unsafe pacing)
- \(\lambda_\tau\): weight on the time-integrity penalty
Practical approximation:
\[ C_{time} \approx p_{inv}\cdot L_{inv} + p_{stale}\cdot L_{stale} + p_{ctrl}\cdot L_{ctrl} \]
- \(p_{inv}\): probability of causal inversion in event joins
- \(p_{stale}\): probability that feature age is under-estimated
- \(p_{ctrl}\): probability of control-loop instability from bad latency clocks
- \(L_{inv}, L_{stale}, L_{ctrl}\): expected loss given that the corresponding event occurs
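As a rough worked example of the approximation above, the sketch below evaluates the time-integrity penalty and the total expected cost in basis points. The `TimeRisk` container, the function names, and every number are illustrative assumptions, not calibrated values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeRisk:
    """One time-integrity failure mode (hypothetical container)."""
    prob: float      # probability of the failure mode, in [0, 1]
    loss_bps: float  # expected loss given that it occurs, in bps


def time_cost_bps(inv: TimeRisk, stale: TimeRisk, ctrl: TimeRisk) -> float:
    """C_time ~= p_inv*L_inv + p_stale*L_stale + p_ctrl*L_ctrl."""
    return sum(r.prob * r.loss_bps for r in (inv, stale, ctrl))


def expected_cost_bps(exec_cost_bps: float, c_time_bps: float,
                      lam_tau: float = 1.0) -> float:
    """E[C_t] = E[C_exec] + lambda_tau * E[C_time]."""
    return exec_cost_bps + lam_tau * c_time_bps


# Illustrative inputs only.
c_time = time_cost_bps(
    inv=TimeRisk(prob=0.02, loss_bps=5.0),
    stale=TimeRisk(prob=0.10, loss_bps=2.0),
    ctrl=TimeRisk(prob=0.01, loss_bps=10.0),
)
total = expected_cost_bps(exec_cost_bps=3.0, c_time_bps=c_time, lam_tau=1.5)
```

With these inputs the penalty is about 0.4 bps and the total about 3.6 bps. In practice the probabilities would come from the models in section 3, conditioned on the current time-integrity state.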
2) Data contract (must be point-in-time)
A) Clock-health telemetry (host + NIC)
- offset-to-master (ns/us)
- frequency adjustment (ppm)
- servo state (locked, holdover, freerun)
- step/slew events with magnitude and direction
- source identity + quality (GM ID, stratum/priority, PTP domain)
B) Execution-path timing
- decision timestamp (mono + wall)
- send timestamp (kernel/NIC if available)
- ACK/reject/fill timestamps (exchange + local receive)
- feed event receive timestamps and sequence numbers
C) Session/change events
- BMCA winner changes
- grandmaster switchover windows
- NTP step-correction events
- leap-second handling mode
If mono/wall dual-timestamping is not available, confidence in attribution should be marked low automatically.
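One possible shape for a dual-timestamped event record is sketched below. The `StampedEvent` type, its field names, and the `"low"`/`"normal"` confidence labels are assumptions for illustration; only `time.monotonic_ns()` and `time.time_ns()` are standard-library calls.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StampedEvent:
    kind: str               # e.g. "decision", "send", "fill"
    mono_ns: Optional[int]  # local monotonic clock; None if the producer cannot supply it
    wall_ns: int            # wall clock; may step or slew under PTP/NTP correction
    gm_id: str              # identity of the time source in effect

    @property
    def attribution_confidence(self) -> str:
        # Without a monotonic stamp, local ordering cannot be verified.
        return "low" if self.mono_ns is None else "normal"


def stamp(kind: str, gm_id: str) -> StampedEvent:
    """Capture both clocks as close together as possible."""
    return StampedEvent(kind, time.monotonic_ns(), time.time_ns(), gm_id)


def local_latency_ns(first: StampedEvent, second: StampedEvent) -> Optional[int]:
    """Elapsed time on the monotonic timeline (same host only)."""
    if first.mono_ns is None or second.mono_ns is None:
        return None
    return second.mono_ns - first.mono_ns
```

Monotonic stamps are only comparable within one host, so cross-host joins still depend on the wall clock and on the clock-health telemetry recorded alongside it.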
3) Modeling stack
A) Time-integrity state classifier
Define states from clock telemetry:
- LOCKED (low offset, stable ppm)
- SLEWING (offset correcting without step)
- STEP_EVENT (discrete jump)
- HOLDOVER (source lost, oscillator-only)
- UNTRUSTED (integrity unknown)
Use either rules + hysteresis or a small HMM/state-space filter with observed \((\text{offset},\ \text{ppm},\ \text{source\_changes})\).
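A minimal sketch of the rules + hysteresis option follows. It degrades immediately but only recovers after `dwell` consecutive better samples. The severity ordering, the 200 ns lock threshold, and the `ClockSample` fields are assumptions, and the ppm input is omitted for brevity.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class TimeState(IntEnum):
    # Ordered by assumed severity so that "worse" compares greater.
    LOCKED = 0
    SLEWING = 1
    STEP_EVENT = 2
    HOLDOVER = 3
    UNTRUSTED = 4


@dataclass(frozen=True)
class ClockSample:
    offset_ns: Optional[float]   # offset-to-master; None if telemetry is missing
    step_detected: bool = False  # servo reported a discrete jump
    source_lost: bool = False    # no usable master (oscillator-only)


def raw_state(s: ClockSample, locked_ns: float = 200.0) -> TimeState:
    """Stateless rule layer (thresholds are placeholders)."""
    if s.offset_ns is None:
        return TimeState.UNTRUSTED
    if s.source_lost:
        return TimeState.HOLDOVER
    if s.step_detected:
        return TimeState.STEP_EVENT
    if abs(s.offset_ns) > locked_ns:
        return TimeState.SLEWING
    return TimeState.LOCKED


class HysteresisClassifier:
    def __init__(self, dwell: int = 3) -> None:
        self.dwell = dwell
        self.state = TimeState.UNTRUSTED  # fail closed until proven healthy
        self._better: List[TimeState] = []

    def update(self, s: ClockSample) -> TimeState:
        raw = raw_state(s)
        if raw >= self.state:
            # Same or worse: apply at once and restart the recovery streak.
            self.state = raw
            self._better = []
        else:
            # Better: wait for `dwell` consecutive better samples, then
            # recover only as far as the worst of them.
            self._better.append(raw)
            if len(self._better) >= self.dwell:
                self.state = max(self._better)
                self._better = []
        return self.state
```

The asymmetry is deliberate: a false alarm costs a few cautious seconds, while a premature return to LOCKED re-exposes the policy to bad timestamps.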
B) Causal inversion risk model
Estimate
\[ P(inv\mid z_t, \Delta_{path}, \sigma_{jit}) \]
with labels taken from impossible event orderings (e.g., fill-ts earlier than send-ts by more than known transport bounds and normal jitter can explain).
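The labeling rule can be sketched as a single comparison. The function name, the parameters, and the default margin are hypothetical. `min_transit_ns` stands for the known lower bound on one-way transport time, and `margin_ns` for the tolerated clock error under healthy sync.

```python
def label_inversion(send_ts_ns: int, fill_ts_ns: int,
                    min_transit_ns: int = 0, margin_ns: int = 1_000) -> bool:
    """Return True when the fill is stamped earlier than physically possible.

    A fill cannot happen before send + minimum transit time, so anything
    earlier than that bound (less a tolerance margin) is a causal inversion.
    """
    earliest_plausible = send_ts_ns + min_transit_ns - margin_ns
    return fill_ts_ns < earliest_plausible


# Illustrative timestamps (ns): a normal fill and an inverted one.
normal = label_inversion(1_000_000, 1_050_000, min_transit_ns=20_000, margin_ns=5_000)
inverted = label_inversion(1_000_000, 990_000, min_transit_ns=20_000, margin_ns=5_000)
```

Here `normal` is `False` and `inverted` is `True`. Labels produced this way should be stored with the time-integrity state in effect at both timestamps, so the model can learn the conditional rate.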
C) Staleness under-report model
For each feature used by routing/sizing, estimate probability that true age exceeds policy budget under current clock state.
\[ P(age_{true} > age_{budget} \mid z_t) \]
This term feeds urgency/routing clamps.
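One simple way to get a first estimate is to assume that, in a given time state, the clock error added to the measured age is Gaussian. That distributional choice, the `SIGMA_NS` table, and the function name are assumptions for illustration only. Real step events are unlikely to be Gaussian.

```python
import math

# Hypothetical per-state clock-error spread (ns); calibrate from telemetry.
SIGMA_NS = {
    "LOCKED": 100.0,
    "SLEWING": 5_000.0,
    "STEP_EVENT": 200_000.0,
    "HOLDOVER": 50_000.0,
}


def p_stale(measured_age_ns: float, budget_ns: float, state: str,
            bias_ns: float = 0.0) -> float:
    """P(age_true > budget | state), with age_true = measured + N(bias, sigma^2)."""
    sigma = SIGMA_NS.get(state)
    if sigma is None:
        # UNTRUSTED or unknown state: assume the feature is stale.
        return 1.0
    z = (budget_ns - measured_age_ns - bias_ns) / sigma
    # Upper tail of the standard normal: 1 - Phi(z).
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

For a feature measured at 400 us against a 500 us budget, this gives a negligible probability in LOCKED and roughly 0.31 in STEP_EVENT under the placeholder spreads. That is the kind of gap the urgency/routing clamps should react to.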
D) Tail slippage conditional model
Model q90/q95 slippage conditioned on time state + liquidity state:
- time state: LOCKED/SLEWING/STEP_EVENT/HOLDOVER/UNTRUSTED
- liquidity: spread/depth/imbalance regime
- session: auction/open/close/news windows
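Before fitting a full quantile model, an empirical baseline can be obtained by bucketing fills on (time state, liquidity regime) and taking a nearest-rank quantile per bucket. The tuple layout, function names, and `min_n` cutoff below are assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def nearest_rank_quantile(values: List[float], q: float) -> float:
    """Nearest-rank quantile: smallest value with at least q of the data at or below it."""
    xs = sorted(values)
    # Small tolerance guards against float error in q * n (e.g. 0.9 * 10).
    k = max(1, math.ceil(q * len(xs) - 1e-9))
    return xs[k - 1]


def tail_by_state(fills: Iterable[Tuple[str, str, float]], q: float = 0.95,
                  min_n: int = 5) -> Dict[Tuple[str, str], float]:
    """fills: (time_state, liquidity_regime, slippage_bps) per child order."""
    groups: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for time_state, liq_regime, slip_bps in fills:
        groups[(time_state, liq_regime)].append(slip_bps)
    # Buckets with too few samples are dropped rather than reported noisily.
    return {key: nearest_rank_quantile(v, q)
            for key, v in groups.items() if len(v) >= min_n}
```

Non-LOCKED buckets are usually sparse, so the `min_n` guard matters. Pooling across sessions or shrinking toward the LOCKED estimate may be needed before these numbers drive policy.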
4) Policy layer (what to do live)
Use an explicit execution safety ladder:
GREEN (LOCKED)
Normal policy.
YELLOW (SLEWING / mild degradation)
Tighten feature-age thresholds, reduce passive patience, lower child size.
ORANGE (STEP_EVENT / HOLDOVER)
Freeze time-sensitive alpha features, switch to conservative schedule, widen cancel/replace hysteresis to avoid oscillation.
RED (UNTRUSTED)
Disable aggressive discretionary logic, allow only minimal-risk unwind/kill-switch policy until integrity recovers.
Never blend GREEN and ORANGE behaviors silently. Transition rules must be auditable.
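The ladder can be expressed as an explicit state-to-level table plus an append-only transition log, sketched below. The class and field names are hypothetical. The mapping itself follows the ladder above, and unknown states are assumed to fail closed to RED.

```python
from enum import Enum
from typing import List, Tuple


class Level(Enum):
    GREEN = "GREEN"
    YELLOW = "YELLOW"
    ORANGE = "ORANGE"
    RED = "RED"


STATE_TO_LEVEL = {
    "LOCKED": Level.GREEN,
    "SLEWING": Level.YELLOW,
    "STEP_EVENT": Level.ORANGE,
    "HOLDOVER": Level.ORANGE,
    "UNTRUSTED": Level.RED,
}


class SafetyLadder:
    def __init__(self) -> None:
        self.level = Level.RED  # fail closed until a state is classified
        # Audit records: (mono_ns, from_level, to_level, triggering_state)
        self.audit: List[Tuple[int, str, str, str]] = []

    def on_state(self, time_state: str, mono_ns: int) -> Level:
        # Any state missing from the table is treated as RED.
        new = STATE_TO_LEVEL.get(time_state, Level.RED)
        if new is not self.level:
            self.audit.append((mono_ns, self.level.name, new.name, time_state))
            self.level = new
        return self.level
```

Each level then selects exactly one parameter set in the execution policy, so no order is ever produced by a mix of two levels, and the `audit` list answers "why was the policy in ORANGE at this time" after the fact.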
5) Production KPIs
- TII (Time Integrity Index): weighted score from offset, ppm, source stability
- CIR (Causal Inversion Rate): impossible ordering incidents per 10k child orders
- SAR (Stale Acceptance Rate): share of accepted decisions whose true feature-age exceeded budget
- TSC (Time-State Cost): incremental slippage vs LOCKED baseline by state
- RRT (Recovery Re-lock Time): time from a failover/step event until the clock is stably LOCKED again
Alert examples:
- CIR spike + STEP_EVENT detected → force ORANGE
- TII below threshold for >N seconds → RED fallback
- RRT above historical p95 → investigate GM/servo configuration
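The first two alert rules might be wired up roughly as follows. The TII weights and normalisation scales, the "spike" definition as a multiple of baseline CIR, and all function names are assumptions. The source specifies only which inputs each KPI uses.

```python
from typing import Optional, Sequence


def cir_per_10k(inversions: int, child_orders: int) -> float:
    """Causal Inversion Rate: impossible orderings per 10k child orders."""
    return 0.0 if child_orders == 0 else 10_000.0 * inversions / child_orders


def tii(offset_ns: float, ppm_jitter: float, source_changes: float,
        weights: Sequence[float] = (0.5, 0.3, 0.2),
        scales: Sequence[float] = (1_000.0, 1.0, 3.0)) -> float:
    """Time Integrity Index in [0, 1]; 1.0 is healthy.

    Each input is divided by a placeholder scale, clipped to [0, 1],
    and combined as a weighted penalty.
    """
    inputs = (offset_ns, ppm_jitter, source_changes)
    penalties = [min(1.0, abs(x) / s) for x, s in zip(inputs, scales)]
    return 1.0 - sum(w * p for w, p in zip(weights, penalties))


def forced_level(cir: float, cir_baseline: float, step_event: bool,
                 tii_low_secs: float, n_secs: float,
                 spike_mult: float = 3.0) -> Optional[str]:
    """Return a forced ladder level, or None when no alert fires."""
    if tii_low_secs > n_secs:
        return "RED"  # TII below threshold for more than N seconds
    if step_event and cir > spike_mult * cir_baseline:
        return "ORANGE"  # CIR spike coinciding with a detected step
    return None
```

RED is checked first so that a sustained integrity loss is never masked by the milder ORANGE rule.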
6) Validation ladder
- Historical reconstruction: replay with real clock telemetry + execution logs.
- Fault injection: synthetic step/slew and GM failover in staging.
- Shadow state machine: run ladder without acting, compare counterfactual cost.
- Canary enforcement: enforce ORANGE/RED on small flow slice with rollback triggers.
Key anti-pattern: training/validating with corrected timestamps only. That erases the very failure mode you need to control.
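For the fault-injection step, a synthetic wall-clock step can be applied to recorded (monotonic, wall) pairs, and a detector can check that wall-clock deltas still match monotonic deltas. Both helpers and the 10 us tolerance are illustrative assumptions. The example keeps the raw stepped timestamps rather than corrected ones.

```python
from typing import List, Tuple

Stamp = Tuple[int, int]  # (mono_ns, wall_ns) captured together on one host


def inject_step(events: List[Stamp], at_mono_ns: int, step_ns: int) -> List[Stamp]:
    """Return a copy in which the wall clock jumps by step_ns from at_mono_ns on."""
    return [(m, w + step_ns if m >= at_mono_ns else w) for m, w in events]


def detect_steps(events: List[Stamp], tol_ns: int = 10_000) -> List[Tuple[int, int]]:
    """Flag intervals where the wall delta disagrees with the monotonic delta."""
    found = []
    for (m0, w0), (m1, w1) in zip(events, events[1:]):
        jump = (w1 - w0) - (m1 - m0)
        if abs(jump) > tol_ns:
            found.append((m1, jump))  # (when it was first visible, signed size)
    return found


# Five events 1 ms apart, then a -250 us step injected at the fourth.
base = 1_700_000_000_000_000_000
clean = [(i * 1_000_000, base + i * 1_000_000) for i in range(5)]
stepped = inject_step(clean, at_mono_ns=3_000_000, step_ns=-250_000)
```

Running `detect_steps` on `clean` finds nothing, while on `stepped` it reports one jump of -250,000 ns at the fourth event. Replaying execution logs through the stepped timeline shows whether the classifier and ladder react before the policy consumes the bad stamps.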
7) Two-week implementation plan
Days 1-2
Add dual timestamp fields (monotonic + wall) end-to-end; record clock-source metadata.
Days 3-4
Build TII calculator and state classifier (LOCKED→UNTRUSTED).
Days 5-7
Label causal inversions/stale accepts; train simple risk models.
Days 8-9
Integrate execution ladder with explicit GREEN/YELLOW/ORANGE/RED transitions.
Days 10-11
Run shadow-mode and compare state-conditional q95 slippage.
Days 12-13
Canary deploy with strict rollback guardrails.
Day 14
Finalize runbook for GM failover drills and post-incident attribution.
Common mistakes
Using only wall-clock for sequencing
Always keep monotonic timeline for local causality.
Assuming NTP/PTP corrections are harmless
Small average offset can still hide destructive step events.
No hysteresis in safety states
Without dwell/hysteresis, controller can flap between modes.
Attribution without clock confidence
If time integrity is low, queue/latency attribution should be down-weighted.
Bottom line
Clock integrity is not just compliance plumbing; it is a direct input to slippage control.
A practical production setup needs:
- explicit time-integrity states,
- state-conditional slippage modeling,
- and deterministic safety transitions when clocks degrade.
Treating clock failover as “infra-only” is how timestamp bugs become trading losses.
References
IEEE 1588-2019, Precision Clock Synchronization Protocol for Networked Measurement and Control Systems
https://ieeexplore.ieee.org/document/9120376
ESMA, MiFID II RTS 25 – Clock Synchronisation
https://www.esma.europa.eu/policy-rules/mifid-ii-and-mifir/mifid-ii-technical-standards
FINRA Rule 4590, Synchronization of Member Business Clocks
https://www.finra.org/rules-guidance/rulebooks/finra-rules/4590
Perold, A. F. (1988), The Implementation Shortfall: Paper versus Reality
https://www.hbs.edu/faculty/Pages/item.aspx?num=2083
Marzullo, K., Owicki, S. (1983), Maintaining the Time in a Distributed System
https://dl.acm.org/doi/10.1145/358527.358537