PTP Grandmaster Failover Clock-Step Slippage Playbook

2026-03-25 · finance

Date: 2026-03-25
Category: research
Audience: low-latency execution operators with PTP/NTP-synchronized trading stacks


Why this note

Many slippage models assume timestamps are trustworthy enough to align signal, decision, and fill events.

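That assumption breaks during time-sync regime transitions (PTP grandmaster failover, BMCA re-election, bad holdover, sudden NTP step correction). When local clocks step or slew rapidly, timestamp-derived event ordering and feature ages can both be wrong: fills can appear to precede sends, and stale inputs can look fresh.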

This is a hidden source of tail slippage because execution policy still looks normal while time integrity is degraded.


1) Cost decomposition with time-integrity penalty

For a child-order decision at time \(t\):

\[ \mathbb{E}[C_t] = \mathbb{E}[C_{exec}\mid a_t, s_t] + \lambda_\tau \cdot \mathbb{E}[C_{time}\mid z_t] \]

Where:

Practical approximation:

\[ C_{time} \approx p_{inv}\cdot L_{inv} + p_{stale}\cdot L_{stale} + p_{ctrl}\cdot L_{ctrl} \]
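As a worked illustration of the two formulas above, here is a minimal Python sketch. It assumes each p term is a probability and each L term is a loss in basis points; that reading, the `TimeRisk` container, and the function names are hypothetical and only mirror the symbols in the equations.

```python
from dataclasses import dataclass


@dataclass
class TimeRisk:
    # Assumed reading: p_* are probabilities, l_* are losses in bps.
    p_inv: float
    l_inv: float
    p_stale: float
    l_stale: float
    p_ctrl: float
    l_ctrl: float


def time_cost(r: TimeRisk) -> float:
    """C_time ~= p_inv*L_inv + p_stale*L_stale + p_ctrl*L_ctrl."""
    return r.p_inv * r.l_inv + r.p_stale * r.l_stale + r.p_ctrl * r.l_ctrl


def expected_cost(c_exec: float, r: TimeRisk, lam_tau: float = 1.0) -> float:
    """E[C_t] = E[C_exec] + lambda_tau * E[C_time], with point estimates plugged in."""
    return c_exec + lam_tau * time_cost(r)
```

With `lam_tau = 0` the time-integrity term vanishes and the model reduces to the usual execution-cost estimate, which makes it easy to compare both versions in backtests.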


2) Data contract (must be point-in-time)

A) Clock-health telemetry (host + NIC)

B) Execution-path timing

C) Session/change events

If dual timestamping (monotonic plus wall clock) is not available, attribution confidence should be marked low automatically.
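A minimal sketch of dual timestamping on a single host, using Python's standard `time` module. The `DualStamp` type and the `"low"`/`"normal"` confidence labels are illustrative assumptions, not part of any existing schema.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DualStamp:
    wall_ns: int            # wall clock; can step or slew under PTP/NTP correction
    mono_ns: Optional[int]  # monotonic clock; None if the source did not record it


def stamp() -> DualStamp:
    """Capture both clocks as close together as possible."""
    return DualStamp(wall_ns=time.time_ns(), mono_ns=time.monotonic_ns())


def elapsed_ns(a: DualStamp, b: DualStamp) -> int:
    """Elapsed time from a to b; prefers the monotonic timeline when both have it.

    Monotonic values are only comparable between stamps taken on the same host.
    """
    if a.mono_ns is not None and b.mono_ns is not None:
        return b.mono_ns - a.mono_ns
    return b.wall_ns - a.wall_ns  # fallback; may be negative after a clock step


def attribution_confidence(a: DualStamp, b: DualStamp) -> str:
    """Mark confidence low automatically when either stamp lacks a monotonic value."""
    if a.mono_ns is None or b.mono_ns is None:
        return "low"
    return "normal"
```

The fallback path deliberately does not clamp negative wall-clock intervals, so downstream labeling can still see them.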


3) Modeling stack

A) Time-integrity state classifier

Define states from clock telemetry:

  1. LOCKED (low offset, stable ppm)
  2. SLEWING (offset correcting without step)
  3. STEP_EVENT (discrete jump)
  4. HOLDOVER (source lost, oscillator-only)
  5. UNTRUSTED (integrity unknown)

Use either rules + hysteresis or a small HMM/state-space filter with observed \((offset,\ ppm,\ source\_changes)\).
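The rules-plus-hysteresis option could look like the sketch below. The thresholds (`lock_ns`, `step_ns`), the `dwell` count, and the choice to start in UNTRUSTED are placeholder assumptions to be tuned per site; ppm is omitted for brevity, and the HMM variant is not shown.

```python
from enum import Enum
from typing import Optional


class ClockState(Enum):
    LOCKED = 1
    SLEWING = 2
    STEP_EVENT = 3
    HOLDOVER = 4
    UNTRUSTED = 5


class ClockClassifier:
    """Rule-based classifier: degradations apply at once, LOCKED needs a dwell."""

    def __init__(self, lock_ns: float = 200.0, step_ns: float = 10_000.0, dwell: int = 3):
        self.lock_ns = lock_ns    # |offset| at or below this counts as locked (assumed)
        self.step_ns = step_ns    # sample-to-sample jump treated as a step (assumed)
        self.dwell = dwell        # consecutive healthy samples required for LOCKED
        self.state = ClockState.UNTRUSTED
        self._prev_offset: Optional[float] = None
        self._healthy = 0

    def _raw(self, offset_ns: Optional[float], source_lost: bool) -> ClockState:
        if offset_ns is None:
            return ClockState.UNTRUSTED          # no telemetry: integrity unknown
        if source_lost:
            return ClockState.HOLDOVER           # oscillator-only
        prev = self._prev_offset
        if prev is not None and abs(offset_ns - prev) >= self.step_ns:
            return ClockState.STEP_EVENT         # discrete jump between samples
        if abs(offset_ns) > self.lock_ns:
            return ClockState.SLEWING            # offset still being corrected
        return ClockState.LOCKED

    def update(self, offset_ns: Optional[float], source_lost: bool = False) -> ClockState:
        raw = self._raw(offset_ns, source_lost)
        self._prev_offset = offset_ns
        if raw is ClockState.LOCKED:
            self._healthy += 1
            if self._healthy >= self.dwell:
                self.state = ClockState.LOCKED
            # otherwise keep the previous (degraded) state until the dwell is met
        else:
            self._healthy = 0
            self.state = raw
        return self.state
```

The asymmetry is deliberate: one bad sample is enough to degrade, while recovery to LOCKED requires sustained evidence, which keeps the state from flapping.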

B) Causal inversion risk model

Estimate

\[ P(inv\mid z_t, \Delta_{path}, \sigma_{jit}) \]

with labels derived from impossible event orderings (e.g., a fill timestamp that precedes the send timestamp by more than known transport bounds and jitter can explain).
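One way to generate such labels is sketched below. The parameter names, the nanosecond units, and the default margin are assumptions; `min_transport_ns` stands in for whatever lower bound on send-to-fill latency is known for the path.

```python
from typing import Iterable, List, Tuple


def is_causal_inversion(
    send_ts_ns: int,
    fill_ts_ns: int,
    min_transport_ns: int = 0,
    margin_ns: int = 1_000,
) -> bool:
    """True when the fill is stamped earlier than physically plausible.

    The earliest plausible fill time is send + minimum transport latency.
    `margin_ns` absorbs ordinary jitter so small noise is not labeled.
    """
    earliest_plausible_fill = send_ts_ns + min_transport_ns
    return fill_ts_ns < earliest_plausible_fill - margin_ns


def label_events(
    events: Iterable[Tuple[int, int]],
    min_transport_ns: int = 0,
    margin_ns: int = 1_000,
) -> List[int]:
    """Return 0/1 labels for (send_ts_ns, fill_ts_ns) pairs."""
    return [
        int(is_causal_inversion(s, f, min_transport_ns, margin_ns))
        for s, f in events
    ]
```

These labels should be computed on raw, uncorrected timestamps; once timestamps are repaired the inversions disappear from the training set.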

C) Staleness under-report model

For each feature used by routing/sizing, estimate probability that true age exceeds policy budget under current clock state.

\[ P(age_{true} > age_{budget} \mid z_t) \]

This term feeds urgency/routing clamps.
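A minimal sketch of this probability under a strong simplifying assumption: the error between observed and true feature age is Gaussian, with a spread that depends on the clock state. The per-state sigma values below are invented placeholders, not measurements, and the Gaussian form itself is an assumption that should be checked against telemetry.

```python
import math

# Hypothetical age-error standard deviations (microseconds) per clock state.
SIGMA_US_BY_STATE = {
    "LOCKED": 1.0,
    "SLEWING": 20.0,
    "STEP_EVENT": 500.0,
    "HOLDOVER": 200.0,
    "UNTRUSTED": 5_000.0,
}


def p_age_exceeds_budget(
    observed_age_us: float,
    budget_us: float,
    sigma_us: float,
    bias_us: float = 0.0,
) -> float:
    """P(age_true > budget) when age_true = observed + N(bias, sigma^2)."""
    if sigma_us <= 0.0:
        return 1.0 if observed_age_us + bias_us > budget_us else 0.0
    z = (budget_us - observed_age_us - bias_us) / sigma_us
    # Upper tail of the standard normal: 1 - Phi(z) = 0.5 * erfc(z / sqrt(2)).
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def p_stale(observed_age_us: float, budget_us: float, clock_state: str) -> float:
    """Look up the state's sigma; unknown states fall back to UNTRUSTED."""
    sigma = SIGMA_US_BY_STATE.get(clock_state, SIGMA_US_BY_STATE["UNTRUSTED"])
    return p_age_exceeds_budget(observed_age_us, budget_us, sigma)
```

For a feature that looks comfortably within budget, the probability stays near zero in LOCKED and rises as the clock state degrades, which is the behavior the clamps need.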

D) Tail slippage conditional model

Model q90/q95 slippage conditioned on time state + liquidity state:
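Before fitting anything, a bucketed empirical baseline is a reasonable first pass. The sketch below groups slippage observations by (time state, liquidity state) and takes a nearest-rank quantile per bucket; the row layout and state labels are assumptions, and this is a baseline rather than a fitted quantile regression.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple

Row = Tuple[str, str, float]  # (time_state, liquidity_state, slippage_bps)


def quantile_nearest_rank(values: Sequence[float], q: float) -> float:
    """Nearest-rank quantile: the smallest value with at least q of the mass at or below it."""
    if not values:
        raise ValueError("quantile of empty sample")
    xs = sorted(values)
    k = max(1, math.ceil(q * len(xs)))
    return xs[k - 1]


def tail_slippage_by_state(rows: Iterable[Row], q: float = 0.95) -> Dict[Tuple[str, str], float]:
    """Empirical q-quantile of slippage for each (time state, liquidity state) bucket."""
    buckets: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for time_state, liq_state, slip_bps in rows:
        buckets[(time_state, liq_state)].append(slip_bps)
    return {key: quantile_nearest_rank(vals, q) for key, vals in buckets.items()}
```

Degraded-clock buckets will usually hold few observations, so sample counts should be reported next to each quantile before any policy decision relies on it.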


4) Policy layer (what to do live)

Use an explicit execution safety ladder:

  1. GREEN (LOCKED)
    Normal policy.

  2. YELLOW (SLEWING / mild degradation)
    Tighten feature-age thresholds, reduce passive patience, lower child size.

  3. ORANGE (STEP_EVENT / HOLDOVER)
    Freeze time-sensitive alpha features, switch to conservative schedule, widen cancel/replace hysteresis to avoid oscillation.

  4. RED (UNTRUSTED)
    Disable aggressive discretionary logic, allow only minimal-risk unwind/kill-switch policy until integrity recovers.

Never blend GREEN and ORANGE behaviors silently. Transition rules must be auditable.
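The ladder and its audit requirement can be sketched as a small state holder that records every level change. The mapping follows the four levels above; the class name, the audit tuple layout, starting in RED, and treating unknown clock states as RED are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Clock state -> safety level, following the ladder in this section.
LADDER = {
    "LOCKED": "GREEN",
    "SLEWING": "YELLOW",
    "STEP_EVENT": "ORANGE",
    "HOLDOVER": "ORANGE",
    "UNTRUSTED": "RED",
}

# (monotonic timestamp ns, old level, new level, triggering clock state)
AuditEntry = Tuple[int, str, str, str]


@dataclass
class SafetyLadder:
    level: str = "RED"  # assumed: start restrictive until the clock is shown healthy
    audit: List[AuditEntry] = field(default_factory=list)

    def on_clock_state(self, mono_ns: int, clock_state: str) -> str:
        """Apply a clock state; log the transition whenever the level changes."""
        new_level = LADDER.get(clock_state, "RED")  # unknown input fails safe
        if new_level != self.level:
            self.audit.append((mono_ns, self.level, new_level, clock_state))
            self.level = new_level
        return self.level
```

Exactly one level is active at a time, and the audit entries are keyed on monotonic time so the log stays ordered even while the wall clock is stepping.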


5) Production KPIs

Alert examples:


6) Validation ladder

  1. Historical reconstruction: replay with real clock telemetry + execution logs.
  2. Fault injection: synthetic step/slew and GM failover in staging.
  3. Shadow state machine: run ladder without acting, compare counterfactual cost.
  4. Canary enforcement: enforce ORANGE/RED on small flow slice with rollback triggers.

Key anti-pattern: training/validating with corrected timestamps only. That erases the very failure mode you need to control.
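For the fault-injection rung, synthetic faults can be applied to a copy of recorded wall-clock timestamps before replay. The helpers below are a hypothetical sketch: they model a step as a constant shift from some index onward and a slew as a constant frequency error in ppm, which is cruder than real servo behavior.

```python
from typing import List, Sequence


def inject_step(wall_ns: Sequence[int], at_index: int, step_ns: int) -> List[int]:
    """Shift every timestamp from at_index onward by step_ns (negative = backward step)."""
    return [t + step_ns if i >= at_index else t for i, t in enumerate(wall_ns)]


def inject_slew(wall_ns: Sequence[int], start_index: int, ppm: float) -> List[int]:
    """Apply a constant frequency error of `ppm` from start_index onward."""
    t0 = wall_ns[start_index]
    return [
        t + round((t - t0) * ppm / 1_000_000) if i >= start_index else t
        for i, t in enumerate(wall_ns)
    ]


def has_backward_jump(ts: Sequence[int]) -> bool:
    """True if any timestamp is earlier than its predecessor."""
    return any(b < a for a, b in zip(ts, ts[1:]))
```

A backward step produces out-of-order wall-clock timestamps in the replay, which is the condition the inversion labels and the safety ladder are supposed to catch; the untouched monotonic series remains the ground truth for scoring.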


7) Two-week implementation plan

Days 1-2
Add dual timestamp fields (monotonic + wall) end-to-end; record clock-source metadata.

Days 3-4
Build TII calculator and state classifier (LOCKED→UNTRUSTED).

Days 5-7
Label causal inversions/stale accepts; train simple risk models.

Days 8-9
Integrate execution ladder with explicit GREEN/YELLOW/ORANGE/RED transitions.

Days 10-11
Run shadow-mode and compare state-conditional q95 slippage.

Days 12-13
Canary deploy with strict rollback guardrails.

Day 14
Finalize runbook for GM failover drills and post-incident attribution.


Common mistakes

  1. Using only wall-clock for sequencing
    Always keep monotonic timeline for local causality.

  2. Assuming NTP/PTP corrections are harmless
    Small average offset can still hide destructive step events.

  3. No hysteresis in safety states
    Without dwell/hysteresis, controller can flap between modes.

  4. Attribution without clock confidence
    If time integrity is low, queue/latency attribution should be down-weighted.


Bottom line

Clock integrity is not just compliance plumbing; it is a direct input to slippage control.

A practical production setup needs:

Treating clock failover as “infra-only” is how timestamp bugs become trading losses.

