Trading Clock Synchronization Playbook (NTP/PTP, Compliance, and Incident Hygiene)


Date: 2026-02-25 (KST)

TL;DR

Bad timestamps are not just an audit problem — they are a trading control problem.

If your clocks drift, you get:

  • cross-host event ordering you cannot trust (fills appear before the orders that caused them)
  • latency metrics that attribute delay to the wrong component
  • audit trails that can fail regulatory divergence tolerances

This playbook gives a practical way to run time as production infrastructure: regulatory target → timing budget → measurement loop → incident runbook.


1) Why this matters in live execution

In a modern execution stack (signal → router → broker/API → venue), even small clock errors can break analysis:

  • send/receive pairs get misordered, so per-hop latency turns negative or inflated
  • slippage and queue-position attribution become unreliable
  • cross-host correlation of signals, orders, and fills silently degrades

Treat time sync as a first-class SRE + risk-control domain, not an OS default.


2) Regulatory baseline you should design against

2.1 EU (MiFID II / RTS 25)

Commission Delegated Regulation (EU) 2017/574 (RTS 25) sets explicit UTC traceability and timestamp accuracy/granularity requirements.

Key thresholds from the annex tables:

  • high-frequency algorithmic trading: max divergence from UTC of 100 microseconds, timestamp granularity of 1 microsecond or better
  • other electronic trading activity: max divergence of 1 millisecond, granularity of 1 millisecond or better
  • voice/manual and comparable flows: max divergence of 1 second, granularity of 1 second or better

Practical implication: if your strategy is fast enough to care about queue priority, your time architecture must support sub-millisecond discipline.

2.2 US (FINRA / CAT context)

FINRA Rule 6820 requires business clocks to be synchronized to NIST time with:

  • a 50-millisecond tolerance for computer system clocks used to record reportable events
  • a 1-second tolerance for manual/mechanical clocks

FINRA Regulatory Notice 16-23 explains the tightening from 1 second to 50 milliseconds and clarifies that the tolerance is interpreted to include both transmission delay and clock drift.


3) NTP vs PTP: when to use what

3.1 NTP (Chrony): the default for broad infra

For most servers/services, NTP with Chrony is the pragmatic baseline:

  • default on major Linux distributions, with well-understood operations
  • robust step/slew behavior and good handling of intermittent connectivity
  • offset, frequency error, and dispersion are directly observable via chronyc

Chrony’s own comparison highlights faster convergence and better behavior in intermittent/congested environments versus legacy ntpd defaults.

3.2 PTP for low-latency timestamp domains

Use PTP (IEEE 1588) when you need tighter and more stable offsets, especially for:

  • market-data capture and order-gateway hosts
  • cross-host latency attribution with sub-100-microsecond budgets
  • timestamp domains targeting the RTS 25 100-microsecond class

On Linux, typical production chain is:

  1. ptp4l syncs PHC (NIC hardware clock) to grandmaster
  2. phc2sys syncs CLOCK_REALTIME to PHC (or vice versa per design)

Important detail from linuxptp docs: in hardware timestamping mode, PHC and system clock timescales/offset handling must be explicit (especially around UTC offset/leap handling).
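To make that UTC-offset point concrete, here is a minimal sketch of the timescale conversion involved: the PHC normally runs on the PTP timescale (TAI) while CLOCK_REALTIME is UTC, so mapping between them requires applying the currentUtcOffset distributed by the grandmaster. The 37-second value and the function name are illustrative assumptions, not linuxptp API.

```python
# Sketch: converting a PTP (TAI-timescale) hardware-clock timestamp to UTC.
# currentUtcOffset comes from the grandmaster's announce messages; 37 s is
# the value in effect since the 2016-12-31 leap second — verify it against
# your grandmaster rather than hard-coding it in production.

CURRENT_UTC_OFFSET_S = 37  # TAI - UTC, distributed by PTP

def tai_ns_to_utc_ns(tai_ns: int, utc_offset_s: int = CURRENT_UTC_OFFSET_S) -> int:
    """Map a TAI-timescale nanosecond timestamp to UTC nanoseconds."""
    return tai_ns - utc_offset_s * 1_000_000_000

# A PHC reading 37 s "ahead" of UTC maps back exactly.
phc_reading_ns = 1_700_000_037_000_000_000
assert tai_ns_to_utc_ns(phc_reading_ns) == 1_700_000_000_000_000_000
```

If this offset is applied twice (or not at all) somewhere in the chain, every timestamp in the domain is wrong by exactly 37 seconds — which is why the linuxptp docs insist the handling be explicit.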


4) Architecture pattern (small quant team friendly)

Use a 3-tier model:

  1. Source layer

    • GNSS-disciplined source and/or high-quality upstream stratum
    • at least 2 independent references (avoid single timing SPOF)
  2. Distribution layer

    • internal NTP/Chrony servers for general fleet
    • optional PTP grandmaster/boundary path for low-latency segment
  3. Consumer layer

    • execution/market-data nodes with an explicit clock role
    • app-level monotonic + wall-clock separation

Design principle: timestamp as close as possible to ingress/egress edge where decisions happen, and document that point.
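The consumer-layer "monotonic + wall-clock separation" above can be sketched with the standard library: stamp events with both clocks, use the wall clock for audit records and the monotonic clock for ordering and durations, since the monotonic clock is immune to steps and slews.

```python
import time

# Sketch: capture both clocks at the timestamping edge. The wall clock
# (time_ns) is for audit/UTC traceability; the monotonic clock survives
# clock steps and is the only safe basis for computing durations.

def stamp_event() -> dict:
    return {"utc_ns": time.time_ns(), "mono_ns": time.monotonic_ns()}

a = stamp_event()
b = stamp_event()

# Durations come from the monotonic clock, never wall-clock subtraction.
elapsed_ns = b["mono_ns"] - a["mono_ns"]
assert elapsed_ns >= 0  # guaranteed; wall-clock deltas carry no such guarantee
```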


5) Timing budget (don’t just set “<X ms”)

Define an additive error budget per node:

total_divergence <= source_error + network_delay_asymmetry + local_clock_drift + measurement_uncertainty

Then assign hard budgets, e.g. for a 100 µs node:

  • source_error ≤ 20 µs
  • network_delay_asymmetry ≤ 40 µs
  • local_clock_drift ≤ 25 µs
  • measurement_uncertainty ≤ 15 µs

This makes incident triage faster: you know which component can mathematically explain a breach.
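The budget-driven triage can be sketched in a few lines; the component names mirror the formula above and the numeric allocations are illustrative assumptions for a 100 µs node.

```python
# Sketch: additive per-node error budget and a triage helper that reports
# which components exceeded their individual allocation (illustrative values).

BUDGET_NS = {
    "source_error": 20_000,
    "network_delay_asymmetry": 40_000,
    "local_clock_drift": 25_000,
    "measurement_uncertainty": 15_000,
}
TOTAL_TOLERANCE_NS = 100_000  # e.g. the RTS 25 100 µs class

def over_budget(measured_ns: dict, budget: dict = BUDGET_NS) -> list:
    """Return the components that exceed their allocation — the ones that
    can mathematically explain a total-divergence breach."""
    return [k for k, v in measured_ns.items() if v > budget[k]]

# The allocations must themselves sum within the total tolerance.
assert sum(BUDGET_NS.values()) <= TOTAL_TOLERANCE_NS

measured = {
    "source_error": 5_000,
    "network_delay_asymmetry": 70_000,  # asymmetry spike
    "local_clock_drift": 10_000,
    "measurement_uncertainty": 5_000,
}
assert over_budget(measured) == ["network_delay_asymmetry"]
```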


6) Leap second policy (pick one, enforce everywhere)

Leap-second handling causes real outages when handled inconsistently across systems.

Options:

  • step: apply the leap second as a discontinuity (strict UTC, but a repeated or frozen second can break naive assumptions)
  • smear/slew: spread the extra second over a window (operationally smooth, but off strict UTC during the window)
  • ignore-and-resync: let the daemon correct afterwards (simple, worst for cross-host consistency)

Google’s public guidance describes a 24h linear smear (noon-to-noon UTC). This is operationally smooth for distributed systems, but it intentionally deviates from strict UTC during smear windows.
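The arithmetic of that linear smear is worth seeing explicitly: over the 24-hour window the extra second accumulates linearly, so a smeared host's offset from strict UTC grows from 0 to 1 second. A minimal sketch (function names are illustrative):

```python
# Sketch: Google-style 24 h linear smear around a leap second at midnight UTC,
# applied noon-to-noon. The accumulated offset versus strict UTC grows
# linearly from 0 to 1 second across the window.

SMEAR_WINDOW_S = 86_400  # noon before the leap to noon after

def smear_offset_s(seconds_into_window: float) -> float:
    """Accumulated smear offset (seconds) vs strict UTC at a point in the window."""
    s = min(max(seconds_into_window, 0.0), SMEAR_WINDOW_S)
    return s / SMEAR_WINDOW_S

assert smear_offset_s(0) == 0.0        # window start: on strict UTC
assert smear_offset_s(43_200) == 0.5   # midnight: half the second absorbed
assert smear_offset_s(86_400) == 1.0   # window end: full second absorbed
```

The 0.5 s offset at midnight is exactly why mixing smeared and non-smeared sources in one timestamp domain is dangerous: the two disagree by up to half a second at the moment of the leap.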

Rule of thumb:

  • pick exactly one mode per timestamp domain; never mix smeared and non-smeared sources behind the same consumers
  • if a regulator requires strict UTC traceability, document the smear window and flag timestamps recorded inside it


7) Minimal observability set

Collect and alert on these by host role:

  • clock offset versus the selected reference
  • frequency error (ppm) and its rate of change
  • root dispersion / estimated error bound
  • source reachability and reference-selection changes

And operational signals:

  • clock step events, leap-status flags, daemon restarts, and source flaps
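A minimal sketch of evaluating offset samples against a role's budget, assuming illustrative role names and nanosecond budgets (not a prescribed monitoring schema):

```python
import statistics

# Illustrative per-role offset budgets in nanoseconds (assumption).
ROLE_BUDGET_NS = {"execution": 100_000, "general": 50_000_000}

def offset_health(samples_ns: list, role: str) -> dict:
    """Summarize recent offset samples and flag a budget breach."""
    magnitudes = [abs(s) for s in samples_ns]
    worst = max(magnitudes)
    return {
        "worst_ns": worst,
        "median_ns": statistics.median(magnitudes),
        "breach": worst > ROLE_BUDGET_NS[role],
    }

h = offset_health([-12_000, 35_000, 150_000, 8_000], role="execution")
assert h["breach"] is True and h["worst_ns"] == 150_000

assert offset_health([-12_000, 8_000], role="general")["breach"] is False
```

Alerting on the worst-case sample (not the mean) matters here: a budget is a bound, and a single 150 µs excursion is a breach even if the average looks healthy.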


8) Incident runbook (fast path)

Trigger examples

  • offset breach past the node's budget threshold
  • reference sources disagreeing beyond tolerance
  • an unexpected step event or leap-status flag

Immediate actions

  1. Freeze non-essential deploys touching execution path.
  2. Mark affected timestamps with quality flag in logs/warehouse.
  3. Compare reference hierarchy (source → distribution → host).
  4. If needed, isolate bad source and force known-good source set.
  5. Reconstruct impacted interval with monotonic-clock aids.
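Step 2 (quality-flagging affected timestamps) can be sketched as a simple interval tag over warehouse rows; the field names are illustrative assumptions, not a fixed schema.

```python
# Sketch: mark rows whose wall-clock timestamps fall inside a suspect
# interval, so downstream analytics can exclude or down-weight them.

def flag_suspect(rows: list, start_ns: int, end_ns: int) -> list:
    """Attach a ts_quality flag to each row based on the suspect window."""
    for r in rows:
        r["ts_quality"] = "suspect" if start_ns <= r["utc_ns"] <= end_ns else "ok"
    return rows

rows = [{"utc_ns": 100}, {"utc_ns": 250}, {"utc_ns": 400}]
flag_suspect(rows, start_ns=200, end_ns=300)
assert [r["ts_quality"] for r in rows] == ["ok", "suspect", "ok"]
```

Flagging rather than deleting keeps the audit trail intact while making the degraded interval explicit to every consumer.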

Post-incident checklist

  • quantify the affected interval and every timestamp consumer it touched
  • check which budget component exceeded its allocation, and re-derive budgets if needed
  • update the runbook and add a regression alert for the failure mode


9) Practical command-level checks (Linux)

NTP/Chrony quick checks:

  • chronyc tracking — current offset, frequency error, root dispersion
  • chronyc sources -v — source selection and reachability
  • chronyc sourcestats — per-source drift and jitter estimates

PTP quick checks:

  • pmc -u -b 0 'GET CURRENT_DATA_SET' — steps removed and offset from master
  • ptp4l / phc2sys logs — master offset and servo state
  • ethtool -T <iface> — hardware timestamping capability of the NIC

Automate periodic snapshots into a retained audit log.
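A sketch of that snapshot automation, assuming chronyc-tracking-style "key : value" output (the sample text mimics chrony's format; exact field names can vary across versions):

```python
# Sketch: parse `chronyc tracking`-style output into an audit-log record.
# SAMPLE imitates chrony's output format; in production you would capture
# the real command output on a schedule and retain the parsed records.

SAMPLE = """\
Reference ID    : A9FEA97B (ntp1.example.com)
Stratum         : 2
System time     : 0.000013 seconds fast of NTP time
Last offset     : -0.000007 seconds
Root dispersion : 0.001201 seconds
"""

def parse_tracking(text: str) -> dict:
    """Turn 'key : value' lines into a flat dict for the audit log."""
    record = {}
    for line in text.splitlines():
        key, sep, val = line.partition(":")
        if sep:
            record[key.strip()] = val.strip()
    return record

rec = parse_tracking(SAMPLE)
assert rec["Stratum"] == "2"
assert rec["Last offset"].startswith("-0.000007")
```

Retaining these parsed snapshots (rather than raw logs alone) makes the RTS 25 / Rule 6820 traceability evidence queryable during an audit.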


10) Implementation ladder (2-week pragmatic rollout)

  1. Inventory all timestamp-producing systems.
  2. Classify them by required tolerance (100µs / 1ms / 50ms / 1s).
  3. Define one canonical UTC traceability design doc.
  4. Add offset/frequency metrics + alerts.
  5. Dry-run leap-second policy simulation in staging.
  6. Run quarterly “clock chaos” drill (source loss, asymmetry spike, bad NTP peer).

One-line takeaway

Time is a risk control surface: if clock discipline is explicit, measured, and rehearsed, execution analytics stay trustworthy and compliance gets boring.

