PTP Grandmaster Failover Clock-Step Slippage Playbook

Date: 2026-03-25
Category: research
Audience: low-latency execution operators with PTP/NTP-synchronized trading stacks

Why this note

Many slippage models assume timestamps are trustworthy enough to align signal, decision, and fill events.

That assumption breaks during time-sync regime transitions (PTP grandmaster failover, BMCA re-election, bad holdover, sudden NTP step correction). When local clocks step or rapidly slew, you can get:

causal misordering (fill appears before child send),
stale alpha acceptance (signal-age gate bypass),
broken queue/latency attribution,
and wrong emergency behavior (controller believes it is “fast” while actually blind).

This is a hidden source of tail slippage because execution policy still looks normal while time integrity is degraded.

1) Cost decomposition with time-integrity penalty

For a child-order decision at time \(t\):

\[ \mathbb{E}[C_t] = \mathbb{E}[C_{exec}\mid a_t, s_t] + \lambda_\tau \cdot \mathbb{E}[C_{time}\mid z_t] \]

Where:

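\(C_{exec}\): spread + impact + fees + opportunity cost
\(a_t,\ s_t\): the execution action taken and the market state observed at \(t\)
\(z_t\): time-integrity state (offset, drift, step events, source quality)
\(C_{time}\): excess cost from timestamp corruption (mis-gating, stale decisions, unsafe pacing)
\(\lambda_\tau\): weight placed on the time-integrity penalty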

Practical approximation:

\[ C_{time} \approx p_{inv}\cdot L_{inv} + p_{stale}\cdot L_{stale} + p_{ctrl}\cdot L_{ctrl} \]

\(p_{inv}\): probability of causal inversion in event joins
\(p_{stale}\): probability that feature age is under-estimated
\(p_{ctrl}\): probability of control-loop instability from bad latency clocks
\(L_{inv},\ L_{stale},\ L_{ctrl}\): expected loss given that the corresponding event occurs
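
As a worked illustration of the approximation above, the sketch below evaluates the time-integrity cost and the total expected cost. The class and function names, the probabilities, the losses (in basis points), and the weight are all illustrative assumptions, not calibrated figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeRisk:
    """One time-corruption channel: event probability and loss (bps) if it occurs."""
    prob: float
    loss_bps: float


def time_cost_bps(inv: TimeRisk, stale: TimeRisk, ctrl: TimeRisk) -> float:
    """C_time ~= p_inv*L_inv + p_stale*L_stale + p_ctrl*L_ctrl."""
    return sum(r.prob * r.loss_bps for r in (inv, stale, ctrl))


def expected_cost_bps(exec_cost_bps: float, c_time_bps: float, lam: float) -> float:
    """E[C_t] = E[C_exec] + lambda_tau * E[C_time]."""
    return exec_cost_bps + lam * c_time_bps


# Hypothetical inputs for a degraded-clock window.
c_time = time_cost_bps(
    inv=TimeRisk(prob=0.02, loss_bps=5.0),    # contributes 0.10 bps
    stale=TimeRisk(prob=0.10, loss_bps=3.0),  # contributes 0.30 bps
    ctrl=TimeRisk(prob=0.01, loss_bps=20.0),  # contributes 0.20 bps
)
total = expected_cost_bps(exec_cost_bps=4.0, c_time_bps=c_time, lam=1.5)
```

With these made-up inputs the time-integrity term is 0.6 bps, and with a weight of 1.5 it adds 0.9 bps to a 4.0 bps execution cost.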

2) Data contract (must be point-in-time)

A) Clock-health telemetry (host + NIC)

offset-to-master (ns/us)
frequency adjustment (ppm)
servo state (locked, holdover, freerun)
step/slew events with magnitude and direction
source identity + quality (GM ID, stratum/priority, PTP domain)

B) Execution-path timing

decision timestamp (mono + wall)
send timestamp (kernel/NIC if available)
ACK/reject/fill timestamps (exchange + local receive)
feed event receive timestamps and sequence numbers

C) Session/change events

BMCA winner changes
grandmaster switchover windows
NTP step-correction events
leap-second handling mode

If mono/wall dual-timestamping is not available, confidence in attribution should be marked low automatically.
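
A minimal sketch of a dual-timestamped event record, using only the Python standard library. The `StampedEvent` type, its field names, and the "normal"/"low" confidence labels are hypothetical; a production stack would more likely take kernel or NIC hardware timestamps than call these functions in user space.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StampedEvent:
    kind: str                # e.g. "decision", "send", "fill"
    mono_ns: Optional[int]   # host-local monotonic clock
    wall_ns: Optional[int]   # wall clock, subject to step/slew corrections

    @property
    def attribution_confidence(self) -> str:
        """Mark attribution confidence low when either timestamp is missing."""
        if self.mono_ns is None or self.wall_ns is None:
            return "low"
        return "normal"


def stamp(kind: str) -> StampedEvent:
    """Capture both clocks back to back for one local event."""
    return StampedEvent(kind, time.monotonic_ns(), time.time_ns())


def local_elapsed_ns(earlier: StampedEvent, later: StampedEvent) -> Optional[int]:
    """Elapsed time between two events on the same host, from the monotonic clock only."""
    if earlier.mono_ns is None or later.mono_ns is None:
        return None
    return later.mono_ns - earlier.mono_ns
```

The monotonic value is meaningful only within one host and one boot, so it supports local causality checks but cannot replace the wall clock for cross-host or exchange-side joins.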

3) Modeling stack

A) Time-integrity state classifier

Define states from clock telemetry:

LOCKED (low offset, stable ppm)
SLEWING (offset correcting without step)
STEP_EVENT (discrete jump)
HOLDOVER (source lost, oscillator-only)
UNTRUSTED (integrity unknown)

Use either rules + hysteresis or a small HMM/state-space filter over the observed offset, ppm, and source changes.
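
The rules-plus-hysteresis option could look roughly like the sketch below. The 500 ns offset threshold, the three-sample dwell, and the `ClockSample` fields are assumptions for illustration; the frequency (ppm) check is omitted for brevity. In this simplification, degradation takes effect immediately and only the return to LOCKED is dwell-gated.

```python
from dataclasses import dataclass
from enum import Enum


class TimeState(Enum):
    LOCKED = 0
    SLEWING = 1
    STEP_EVENT = 2
    HOLDOVER = 3
    UNTRUSTED = 4


@dataclass(frozen=True)
class ClockSample:
    offset_ns: float          # offset to master
    step_detected: bool       # servo reported a discrete jump
    source_present: bool      # False when the time source is lost
    telemetry_ok: bool = True # False when clock telemetry itself is missing


def raw_state(s: ClockSample, locked_offset_ns: float = 500.0) -> TimeState:
    """Stateless rule mapping one sample to a state (threshold is hypothetical)."""
    if not s.telemetry_ok:
        return TimeState.UNTRUSTED
    if not s.source_present:
        return TimeState.HOLDOVER
    if s.step_detected:
        return TimeState.STEP_EVENT
    if abs(s.offset_ns) > locked_offset_ns:
        return TimeState.SLEWING
    return TimeState.LOCKED


class HysteresisClassifier:
    """Degrade immediately; return to LOCKED only after a run of clean samples."""

    def __init__(self, dwell_samples: int = 3):
        self.dwell_samples = dwell_samples
        self.state = TimeState.UNTRUSTED  # fail-safe until proven healthy
        self._clean_run = 0

    def update(self, sample: ClockSample) -> TimeState:
        raw = raw_state(sample)
        if raw is TimeState.LOCKED:
            self._clean_run += 1
            if self._clean_run >= self.dwell_samples:
                self.state = TimeState.LOCKED
        else:
            self._clean_run = 0
            self.state = raw
        return self.state
```

Starting in UNTRUSTED is a deliberate design choice in this sketch: the classifier has to earn LOCKED rather than assume it at startup.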

B) Causal inversion risk model

Estimate

\[ P(inv\mid z_t, \Delta_{path}, \sigma_{jit}) \]

with labels taken from impossible event orderings (e.g., a fill timestamp that precedes the send timestamp by more than known transport bounds and timestamp jitter can explain).
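
One way to generate those labels, and the simplest possible estimator built on them, is sketched below. The `ChildOrder` fields and the tolerance values are hypothetical, and the per-state empirical rate is only a baseline: it conditions on the time state but ignores the path-delay and jitter terms in the probability above.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class ChildOrder(NamedTuple):
    send_ts_ns: int   # local send timestamp
    fill_ts_ns: int   # fill timestamp joined to this order
    time_state: str   # time-integrity state at send, e.g. "LOCKED"


def is_causal_inversion(order: ChildOrder, min_path_ns: int = 0,
                        tolerance_ns: int = 1_000) -> bool:
    """Flag a fill that appears earlier than the minimum path delay allows.

    min_path_ns is the known lower bound on send-to-fill delay; tolerance_ns
    absorbs ordinary timestamp jitter. Both defaults are placeholders.
    """
    observed_gap = order.fill_ts_ns - order.send_ts_ns
    return observed_gap < min_path_ns - tolerance_ns


def inversion_rate_by_state(orders: Iterable[ChildOrder],
                            min_path_ns: int = 0,
                            tolerance_ns: int = 1_000) -> Dict[str, float]:
    """Empirical inversion frequency per time state."""
    totals: Counter = Counter()
    hits: Counter = Counter()
    for o in orders:
        totals[o.time_state] += 1
        hits[o.time_state] += int(is_causal_inversion(o, min_path_ns, tolerance_ns))
    return {state: hits[state] / totals[state] for state in totals}


orders = [
    ChildOrder(1_000_000, 1_050_000, "LOCKED"),
    ChildOrder(2_000_000, 1_900_000, "STEP_EVENT"),  # fill "before" send
    ChildOrder(3_000_000, 3_040_000, "STEP_EVENT"),
]
rates = inversion_rate_by_state(orders, min_path_ns=20_000, tolerance_ns=1_000)
```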

C) Staleness under-report model

For each feature used by routing/sizing, estimate the probability that its true age exceeds the policy budget under the current clock state.

\[ P(age_{true} > age_{budget} \mid z_t) \]

This term feeds urgency/routing clamps.
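
A rough sketch of this estimate under one strong assumption: the error in measured feature age is zero-mean Gaussian with a standard deviation that depends only on the time state. The sigma table is invented for illustration, and treating any unlisted state (including UNTRUSTED) as certainly stale is a conservative choice of this sketch, not something the model requires.

```python
import math

# Hypothetical per-state standard deviation of the age-measurement error (microseconds).
SIGMA_US = {
    "LOCKED": 1.0,
    "SLEWING": 50.0,
    "HOLDOVER": 200.0,
    "STEP_EVENT": 1000.0,
}


def p_stale(measured_age_us: float, budget_us: float, time_state: str) -> float:
    """P(age_true > budget | state), with age_true = measured + N(0, sigma_state^2)."""
    sigma = SIGMA_US.get(time_state)
    if sigma is None:
        return 1.0  # unknown or untrusted clock: assume the feature is stale
    z = (budget_us - measured_age_us) / (sigma * math.sqrt(2.0))
    # Upper-tail probability of the normal distribution via the complementary error function.
    return 0.5 * math.erfc(z)
```

For a feature measured at 400 us against a 500 us budget, this gives a negligible probability in LOCKED, a few percent in SLEWING, and close to a coin flip in STEP_EVENT, which is the kind of gradient an urgency or routing clamp can act on.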

D) Tail slippage conditional model

Model q90/q95 slippage conditioned on time state, liquidity state, and session:

time state: LOCKED/SLEWING/STEP_EVENT/HOLDOVER/UNTRUSTED
liquidity: spread/depth/imbalance regime
session: auction/open/close/news windows
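
Before fitting anything more elaborate, an empirical baseline for this conditional tail can be computed by bucketing fills, as in the sketch below. The tuple layout, the nearest-rank quantile, and the minimum bucket size of 20 are assumptions; the session dimension is left out to keep the example short.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Fill = Tuple[str, str, float]  # (time_state, liquidity_regime, slippage_bps)


def nearest_rank_quantile(values: List[float], q_pct: int) -> float:
    """Nearest-rank quantile; q_pct is an integer percentage such as 95."""
    ordered = sorted(values)
    rank = max(1, -(-q_pct * len(ordered) // 100))  # integer ceiling of q*n
    return ordered[rank - 1]


def tail_slippage(fills: Iterable[Fill], q_pct: int = 95,
                  min_count: int = 20) -> Dict[Tuple[str, str], float]:
    """Tail slippage per (time state, liquidity regime), skipping sparse buckets."""
    buckets: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for time_state, liquidity, slippage_bps in fills:
        buckets[(time_state, liquidity)].append(slippage_bps)
    return {
        key: nearest_rank_quantile(vals, q_pct)
        for key, vals in buckets.items()
        if len(vals) >= min_count
    }
```

Sparse buckets are dropped rather than reported, since a q95 from a handful of STEP_EVENT fills would be mostly noise.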

4) Policy layer (what to do live)

Use an explicit execution safety ladder:

GREEN (LOCKED): Normal policy.
YELLOW (SLEWING / mild degradation): Tighten feature-age thresholds, reduce passive patience, lower child size.
ORANGE (STEP_EVENT / HOLDOVER): Freeze time-sensitive alpha features, switch to conservative schedule, widen cancel/replace hysteresis to avoid oscillation.
RED (UNTRUSTED): Disable aggressive discretionary logic, allow only minimal-risk unwind/kill-switch policy until integrity recovers.

Never blend GREEN and ORANGE behaviors silently. Transition rules must be auditable.
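
A minimal sketch of an explicit, auditable mapping from time state to ladder level follows. The `SafetyLadder` class, the audit-log tuple layout, and the choice to start in RED are assumptions of the sketch; dwell logic is expected to live in the state classifier upstream.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Time-integrity state -> ladder level, as listed in the ladder above.
LADDER = {
    "LOCKED": "GREEN",
    "SLEWING": "YELLOW",
    "STEP_EVENT": "ORANGE",
    "HOLDOVER": "ORANGE",
    "UNTRUSTED": "RED",
}

# (monotonic timestamp ns, old level, new level, triggering time state)
AuditRecord = Tuple[int, str, str, str]


@dataclass
class SafetyLadder:
    level: str = "RED"  # fail-safe until the first healthy state arrives
    audit_log: List[AuditRecord] = field(default_factory=list)

    def on_time_state(self, ts_mono_ns: int, time_state: str) -> str:
        """Apply a new time state; record every level change with its cause."""
        new_level = LADDER.get(time_state, "RED")  # unknown states fail closed
        if new_level != self.level:
            self.audit_log.append((ts_mono_ns, self.level, new_level, time_state))
            self.level = new_level
        return self.level
```

Because the level is a single discrete value and every change is logged with its trigger, there is no path by which GREEN and ORANGE behaviors can mix without leaving a record.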

5) Production KPIs

TII (Time Integrity Index): weighted score from offset, ppm, source stability
CIR (Causal Inversion Rate): impossible ordering incidents per 10k child orders
SAR (Stale Acceptance Rate): share of decisions accepted with true feature-age > budget
TSC (Time-State Cost): incremental slippage vs LOCKED baseline by state
RRT (Recovery Re-lock Time): time from a failover/step event until the clock is stably LOCKED again

Alert examples:

CIR spike + STEP_EVENT detected → force ORANGE
TII below threshold for >N seconds → RED fallback
RRT above historical p95 → investigate GM/servo configuration
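
The sketch below shows one possible way to compute TII and CIR and to evaluate the sustained-breach alert. The penalty scales, the weights, and the 0-to-1 range of the index are assumptions; the definitions above fix only which inputs go into each KPI.

```python
from typing import Iterable, Tuple


def tii(offset_ns: float, ppm_change: float, source_changes: int,
        offset_scale_ns: float = 1000.0, ppm_scale: float = 1.0,
        change_scale: float = 1.0,
        weights: Tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Time Integrity Index in [0, 1]; 1.0 means no observed degradation.

    Each input is turned into a penalty capped at 1, then combined with weights
    that sum to 1. Scales and weights here are placeholders.
    """
    penalties = (
        min(1.0, abs(offset_ns) / offset_scale_ns),
        min(1.0, abs(ppm_change) / ppm_scale),
        min(1.0, source_changes / change_scale),
    )
    return 1.0 - sum(w * p for w, p in zip(weights, penalties))


def cir_per_10k(inversions: int, child_orders: int) -> float:
    """Causal Inversion Rate: impossible orderings per 10k child orders."""
    return 10_000 * inversions / child_orders if child_orders else 0.0


def red_fallback(samples: Iterable[Tuple[float, float]],
                 threshold: float, n_seconds: float) -> bool:
    """True if TII stays below threshold continuously for more than n_seconds.

    samples is a time-ordered iterable of (t_seconds, tii_score).
    """
    breach_start = None
    for t, score in samples:
        if score < threshold:
            if breach_start is None:
                breach_start = t
            if t - breach_start > n_seconds:
                return True
        else:
            breach_start = None  # any recovery resets the breach timer
    return False
```

A brief recovery resets the breach timer in this version; if that makes the alert too easy to dodge during flapping, counting cumulative breach time within a window is a reasonable alternative.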

6) Validation ladder

Historical reconstruction: replay with real clock telemetry + execution logs.
Fault injection: synthetic step/slew and GM failover in staging.
Shadow state machine: run ladder without acting, compare counterfactual cost.
Canary enforcement: enforce ORANGE/RED on small flow slice with rollback triggers.

Key anti-pattern: training/validating with corrected timestamps only. That erases the very failure mode you need to control.
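
For the fault-injection rung, a synthetic wall-clock step can be applied to recorded events while the monotonic timeline is left untouched, as in this sketch. The `Event` layout and helper names are hypothetical, and only a host-local step is modeled; exchange-side timestamps and slews are out of scope here.

```python
from typing import List, NamedTuple, Tuple


class Event(NamedTuple):
    name: str
    mono_ns: int  # host-local monotonic timestamp (ground truth for local order)
    wall_ns: int  # wall-clock timestamp (what a step corrupts)


def inject_wall_step(events: List[Event], at_mono_ns: int, step_ns: int) -> List[Event]:
    """Shift wall timestamps by step_ns for every event at or after at_mono_ns."""
    return [
        e._replace(wall_ns=e.wall_ns + step_ns) if e.mono_ns >= at_mono_ns else e
        for e in events
    ]


def wall_order_violations(events: List[Event]) -> List[Tuple[str, str]]:
    """Adjacent pairs (in monotonic order) whose wall-clock order is reversed."""
    ordered = sorted(events, key=lambda e: e.mono_ns)
    return [
        (a.name, b.name)
        for a, b in zip(ordered, ordered[1:])
        if b.wall_ns < a.wall_ns
    ]


clean = [
    Event("decision", 100, 1_000_100),
    Event("send", 200, 1_000_200),
    Event("fill", 900, 1_000_900),
]
# A 5 us backward step between send and fill makes the fill look earlier than the send.
stepped = inject_wall_step(clean, at_mono_ns=500, step_ns=-5_000)
violations = wall_order_violations(stepped)
```

Replaying the stepped series through the classifier and ladder, rather than the clean one, is what keeps the failure mode visible to the models being validated.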

7) Two-week implementation plan

Days 1-2
Add dual timestamp fields (monotonic + wall) end-to-end; record clock-source metadata.

Days 3-4
Build TII calculator and state classifier (LOCKED→UNTRUSTED).

Days 5-7
Label causal inversions/stale accepts; train simple risk models.

Days 8-9
Integrate execution ladder with explicit GREEN/YELLOW/ORANGE/RED transitions.

Days 10-11
Run shadow-mode and compare state-conditional q95 slippage.

Days 12-13
Canary deploy with strict rollback guardrails.

Day 14
Finalize runbook for GM failover drills and post-incident attribution.

Common mistakes

Using only wall-clock for sequencing: always keep a monotonic timeline for local causality.
Assuming NTP/PTP corrections are harmless: a small average offset can still hide destructive step events.
No hysteresis in safety states: without dwell/hysteresis, the controller can flap between modes.
Attribution without clock confidence: if time integrity is low, queue/latency attribution should be down-weighted.

Bottom line

Clock integrity is not just compliance plumbing; it is a direct input to slippage control.

A practical production setup needs:

explicit time-integrity states,
state-conditional slippage modeling,
and deterministic safety transitions when clocks degrade.

Treating clock failover as “infra-only” is how timestamp bugs become trading losses.

References

IEEE 1588-2019, Precision Clock Synchronization Protocol for Networked Measurement and Control Systems
https://ieeexplore.ieee.org/document/9120376
ESMA, MiFID II RTS 25 – Clock Synchronisation
https://www.esma.europa.eu/policy-rules/mifid-ii-and-mifir/mifid-ii-technical-standards
FINRA Rule 4590, Synchronization of Member Business Clocks
https://www.finra.org/rules-guidance/rulebooks/finra-rules/4590
Perold, A. F. (1988), The Implementation Shortfall: Paper versus Reality
https://www.hbs.edu/faculty/Pages/item.aspx?num=2083
Marzullo, K., Owicki, S. (1983), Maintaining the Time in a Distributed System
https://dl.acm.org/doi/10.1145/358527.358537