Trading Clock Synchronization Playbook (NTP/PTP, Compliance, and Incident Hygiene)
Date: 2026-02-25 (KST)
TL;DR
Bad timestamps are not just an audit problem — they are a trading control problem.
If your clocks drift, you get:
- wrong event ordering,
- misleading slippage attribution,
- broken cross-system joins,
- and regulatory exposure.
This playbook gives a practical way to run time as production infrastructure: regulatory target → timing budget → measurement loop → incident runbook.
1) Why this matters in live execution
In a modern execution stack (signal → router → broker/API → venue), even small clock errors can break analysis:
- You misread whether delay happened before or after a venue event.
- You over/underestimate adverse selection windows.
- You cannot reliably reconstruct a parent order timeline.
- Post-trade controls become un-auditable.
Treat time sync as a first-class SRE + risk-control domain, not an OS default.
2) Regulatory baseline you should design against
2.1 EU (MiFID II / RTS 25)
Commission Delegated Regulation (EU) 2017/574 (RTS 25) sets explicit UTC traceability and timestamp accuracy/granularity requirements.
Key thresholds from annex tables:
- Trading venues
- gateway-to-gateway latency > 1 ms → max divergence 1 ms, granularity 1 ms or better
- gateway-to-gateway latency ≤ 1 ms → max divergence 100 µs, granularity 1 µs or better
- Members/participants
- high-frequency algorithmic activity → max divergence 100 µs, granularity 1 µs or better
- other activity tiers can be 1 ms or 1 s depending on trading model
Practical implication: if your strategy is fast enough to care about queue priority, your time architecture must support sub-millisecond discipline.
2.2 US (FINRA / CAT context)
FINRA Rule 6820 requires business clocks to be synchronized to NIST time with:
- 50 ms tolerance for most electronic order-event clocks,
- 1 s tolerance for manual-order/allocation-only clocks,
- and explicit logging/documentation of sync checks and drift events.
FINRA Regulatory Notice 16-23 explains the 1 s → 50 ms tightening and clarifies that transmission delay and accumulated drift both count against the tolerance.
3) NTP vs PTP: when to use what
3.1 NTP (Chrony) default for broad infra
For most servers/services, NTP with Chrony is the pragmatic baseline:
- easier to operate,
- robust over imperfect networks,
- supports NTS and modern filtering.
Chrony’s own comparison highlights faster convergence and better behavior in intermittent/congested environments versus legacy ntpd defaults.
3.2 PTP for low-latency timestamp domains
Use PTP (IEEE 1588) when you need tighter and more stable offsets, especially for:
- market data capture,
- execution gateways,
- packet timestamping NIC paths.
On Linux, typical production chain is:
- `ptp4l` syncs the PHC (NIC hardware clock) to the grandmaster
- `phc2sys` syncs `CLOCK_REALTIME` to the PHC (or vice versa, per design)
Important detail from linuxptp docs: in hardware timestamping mode, PHC and system clock timescales/offset handling must be explicit (especially around UTC offset/leap handling).
4) Architecture pattern (small quant team friendly)
Use a 3-tier model:
Source layer
- GNSS-disciplined source and/or high-quality upstream stratum
- at least 2 independent references (avoid single timing SPOF)
Distribution layer
- internal NTP/Chrony servers for general fleet
- optional PTP grandmaster/boundary path for low-latency segment
Consumer layer
- execution/market-data nodes with an explicit clock role
- app-level monotonic + wall-clock separation
Design principle: timestamp as close as possible to ingress/egress edge where decisions happen, and document that point.
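The monotonic + wall-clock separation mentioned above can be made concrete with a small edge-stamping helper. A minimal sketch, assuming a Python consumer process; the `EdgeStamp` type and field names are illustrative:

```python
# Sketch: capture wall-clock AND monotonic time together at the ingress/egress
# edge, so intra-process ordering and latency survive wall-clock steps.
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class EdgeStamp:
    wall_ns: int   # CLOCK_REALTIME: for compliance records and cross-system joins
    mono_ns: int   # CLOCK_MONOTONIC: for ordering and latency within this process

def stamp_now() -> EdgeStamp:
    """Take both clocks as close together as Python allows."""
    return EdgeStamp(wall_ns=time.time_ns(), mono_ns=time.monotonic_ns())

a = stamp_now()
b = stamp_now()
# Monotonic deltas are immune to wall-clock steps/slews:
latency_ns = b.mono_ns - a.mono_ns
```

Persist both fields per event: the wall clock answers "when, in UTC", the monotonic clock answers "in what order, and how far apart" even across a clock step.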
5) Timing budget (don’t just set “<X ms”)
Define an additive error budget per node:
total_divergence <= source_error + network_delay_asymmetry + local_clock_drift + measurement_uncertainty
Then assign hard budgets, e.g.:
- source contribution: 20%
- network/asymmetry: 30%
- host oscillator/drift: 30%
- measurement/observability error: 20%
This makes incident triage faster: you know which component can mathematically explain a breach.
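The additive budget above is simple enough to encode directly. A sketch using the example percentage split (the split itself is the illustrative allocation from this section, not a standard):

```python
# Sketch: turn a regulatory divergence target into per-component hard budgets
# using the example 20/30/30/20 split above.
def timing_budget(target_divergence_ns: float) -> dict[str, float]:
    split = {
        "source": 0.20,
        "network_asymmetry": 0.30,
        "host_drift": 0.30,
        "measurement": 0.20,
    }
    return {component: frac * target_divergence_ns for component, frac in split.items()}

# Example: 100 µs MiFID II HFT-member target, expressed in nanoseconds
budget = timing_budget(100_000)
# budget["network_asymmetry"] is 30_000 ns: if asymmetry alone exceeds that,
# it can mathematically explain a breach on its own.
```

During triage, compare each measured component against its budget line rather than the total: the first component over its line is the first suspect.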
6) Leap second policy (pick one, enforce everywhere)
Leap-second handling causes real outages when mixed inconsistently.
Options:
- Step (strict UTC insertion)
- Smear (gradual adjustment)
Google’s public guidance describes a 24h linear smear (noon-to-noon UTC). This is operationally smooth for distributed systems, but it intentionally deviates from strict UTC during smear windows.
Rule of thumb:
- compliance-critical event stamping domains: prefer strict/controlled UTC policy,
- internal analytics domains: smear can be acceptable,
- never mix smeared and unsmeared clocks in the same event-ordering pipeline without explicit conversion metadata.
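The 24-hour linear smear is easy to reason about numerically. A sketch of the offset-vs-strict-UTC curve, assuming the noon-to-noon window described above; the window bounds in a real deployment come from the actual leap event announcement:

```python
# Sketch: 24-hour linear leap smear (noon-to-noon UTC). smear_offset_s gives
# how far a smeared clock deviates from strict UTC at a point in the window.
SMEAR_SECONDS = 24 * 3600  # length of the noon-to-noon smear window

def smear_offset_s(seconds_into_window: float, leap_s: float = 1.0) -> float:
    """Linear smear: offset grows from 0 to leap_s across the window."""
    frac = min(max(seconds_into_window / SMEAR_SECONDS, 0.0), 1.0)
    return frac * leap_s

# Halfway through the window, a smeared clock is 0.5 s off strict UTC —
# exactly the kind of cross-domain divergence the "never mix" rule guards against.
```

This is also the conversion metadata you need if a smeared analytics domain must ever be joined against a strict-UTC compliance domain for the smear interval.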
7) Minimal observability set
Collect and alert on these by host role:
- `clock_offset_ns` (vs chosen reference)
- `clock_freq_ppm`
- `last_sync_age_sec`
- `max_offset_5m_ns`, `p99_offset_1h_ns`
- PTP state (`MASTER`/`SLAVE`, port state transitions)
- source quality / source switch count
And operational signals:
- leap announcement state,
- step events count,
- “time jumped backward” detector in app logs.
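The "time jumped backward" detector above can be implemented by comparing wall-clock progress against monotonic progress. A minimal sketch; the 1 ms tolerance is an illustrative default, not a recommendation:

```python
# Sketch: flag a backward wall-clock jump by checking that wall-clock progress
# roughly matches monotonic elapsed time between successive checks.
import time

class BackwardJumpDetector:
    def __init__(self, tolerance_ns: int = 1_000_000):  # 1 ms slack for normal slewing
        self.prev_wall = time.time_ns()
        self.prev_mono = time.monotonic_ns()
        self.tolerance_ns = tolerance_ns

    def check(self) -> bool:
        """Return True if the wall clock regressed relative to monotonic time."""
        wall, mono = time.time_ns(), time.monotonic_ns()
        wall_delta = wall - self.prev_wall
        mono_delta = mono - self.prev_mono
        self.prev_wall, self.prev_mono = wall, mono
        # If wall advanced much less than monotonic did (or went backward), flag it.
        return (mono_delta - wall_delta) > self.tolerance_ns
```

Run `check()` on a periodic tick and emit the result as one of the operational signals above; a `True` here is usually the earliest app-level evidence of a clock step.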
8) Incident runbook (fast path)
Trigger examples
- offset breaches threshold for N minutes
- sudden source flip + increasing jitter
- downstream event ordering anomalies
Immediate actions
- Freeze non-essential deploys touching execution path.
- Mark affected timestamps with quality flag in logs/warehouse.
- Compare reference hierarchy (source → distribution → host).
- If needed, isolate bad source and force known-good source set.
- Reconstruct impacted interval with monotonic-clock aids.
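The "mark affected timestamps with a quality flag" step above can be sketched as a simple interval tag applied before records reach the warehouse. Record and field names here (`ts_ns`, `ts_quality`) are hypothetical:

```python
# Sketch: tag records whose timestamps fall inside a suspect interval, so
# downstream joins and analytics can exclude or down-weight them.
def flag_suspect(records: list[dict], bad_start_ns: int, bad_end_ns: int) -> list[dict]:
    """Annotate each record with a timestamp-quality flag."""
    for rec in records:
        rec["ts_quality"] = (
            "suspect" if bad_start_ns <= rec["ts_ns"] <= bad_end_ns else "ok"
        )
    return records

rows = [{"ts_ns": 100}, {"ts_ns": 250}, {"ts_ns": 400}]
flag_suspect(rows, bad_start_ns=200, bad_end_ns=300)
# rows[1] is flagged "suspect"; the others are "ok"
```

Flagging (rather than deleting) preserves the audit trail while keeping the bad interval out of ordering-sensitive analysis.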
Post-incident checklist
- root cause category (source/network/host/config/mixed-timescale)
- whether breach crossed regulatory tolerance bucket
- exact interval of unreliable ordering
- preventive config/test added (not just docs)
9) Practical command-level checks (Linux)
NTP/Chrony quick checks:
- `chronyc tracking`
- `chronyc sources -v`
- `chronyc sourcestats`
PTP quick checks:
- `ptp4l ... -m` (state transitions, offsets)
- `phc2sys ... -m` (PHC↔system alignment)
- `pmc` queries for grandmaster/offset status
Automate periodic snapshots into a retained audit log.
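Automating those snapshots mostly means parsing command output into structured records. A sketch for `chronyc tracking`, using its documented "System time"/"Frequency" line formats; the exact spacing in real output may vary, so treat the regexes as a starting point:

```python
# Sketch: parse key fields of `chronyc tracking` output into an audit record.
import re

def parse_tracking(text: str) -> dict[str, float]:
    """Extract system-time offset and frequency error from chronyc tracking output."""
    out: dict[str, float] = {}
    m = re.search(r"System time\s*:\s*([\d.]+) seconds (slow|fast)", text)
    if m:
        sign = -1.0 if m.group(2) == "slow" else 1.0
        out["system_time_offset_s"] = sign * float(m.group(1))
    m = re.search(r"Frequency\s*:\s*([\d.]+) ppm (slow|fast)", text)
    if m:
        sign = -1.0 if m.group(2) == "slow" else 1.0
        out["freq_ppm"] = sign * float(m.group(1))
    return out

sample = (
    "System time     : 0.000014 seconds slow of NTP time\n"
    "Frequency       : 8.213 ppm fast\n"
)
record = parse_tracking(sample)  # ready to append to a retained audit log
```

Cron the command, parse, append JSON lines to an append-only log, and you have the documented sync-check trail the FINRA/MiFID regimes expect.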
10) Implementation ladder (2-week pragmatic rollout)
- Inventory all timestamp-producing systems.
- Classify them by required tolerance (100 µs / 1 ms / 50 ms / 1 s).
- Define one canonical UTC traceability design doc.
- Add offset/frequency metrics + alerts.
- Dry-run leap-second policy simulation in staging.
- Run quarterly “clock chaos” drill (source loss, asymmetry spike, bad NTP peer).
One-line takeaway
Time is a risk control surface: if clock discipline is explicit, measured, and rehearsed, execution analytics stay trustworthy and compliance gets boring.
References
- FINRA Rule 6820 (Clock Synchronization)
- FINRA Regulatory Notice 16-23
- EU RTS 25 (Commission Delegated Regulation (EU) 2017/574)
- chrony comparison of NTP implementations
- linuxptp documentation
- Google leap smear guidance
- NIST leap second / UT1-UTC notes