Clock Synchronization & Timestamp Integrity: Production Playbook
Date: 2026-03-03
Category: knowledge
Domain: software / timekeeping / trading infrastructure
Why this matters
If your clocks are inconsistent, your data lies:
- event order becomes ambiguous,
- latency metrics get distorted,
- incident timelines break,
- compliance/audit evidence becomes weak.
In market systems, this is not a “nice to have.” It is operational and regulatory risk.
Core principle
Treat time as a first-class production dependency, not a host-level default.
That means you need:
- a declared reference timescale,
- explicit sync architecture (NTP/PTP + distribution model),
- timestamp semantics per use case,
- continuous drift/error monitoring,
- documented incident handling for clock faults.
1) Start with a time model (the part teams skip)
Define these explicitly in architecture docs:
- Reference clock for audit/compliance: UTC traceable source
- Clock for elapsed durations: monotonic clock (CLOCK_MONOTONIC family)
- Clock for absolute labels/logs: wall clock (CLOCK_REALTIME / UTC)
- Clock for leap-second-aware workflows: consider CLOCK_TAI where needed
From the Linux clock_gettime(2) docs:
- CLOCK_REALTIME is wall time; it can jump and is adjusted by NTP-like mechanisms.
- CLOCK_MONOTONIC does not go backward but can be frequency-adjusted.
- CLOCK_MONOTONIC_RAW is monotonic and not subject to frequency adjustments.
- CLOCK_TAI counts leap seconds and avoids leap-insertion discontinuities.
Operational rule:
- use monotonic clocks for measuring durations/timeouts/retries,
- use UTC wall time for externally visible timestamps and cross-system correlation.
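The two rules above can be shown in a minimal Python sketch: the monotonic clock times the work (immune to wall-clock steps mid-measurement), while UTC wall time labels the event for cross-system correlation.

```python
import time
from datetime import datetime, timezone

# Elapsed duration: monotonic clock, which never jumps backward even if
# NTP steps the wall clock while the measurement is in flight.
start_ns = time.monotonic_ns()
time.sleep(0.01)  # stand-in for the work being timed
elapsed_ms = (time.monotonic_ns() - start_ns) / 1e6

# Externally visible timestamp: UTC wall time, suitable for logs and
# cross-host correlation (but never for computing durations).
event_time_utc = datetime.now(timezone.utc).isoformat()

print(f"elapsed_ms={elapsed_ms:.2f} event_time_utc={event_time_utc}")
```

Using `CLOCK_REALTIME` for the subtraction instead would silently produce negative or wildly wrong durations whenever the wall clock is stepped.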
2) Pick synchronization architecture by requirement, not habit
Baseline mode (general infra)
- chrony + reliable NTP sources
- monitor offset/jitter continuously
- keep stratum and source diversity sane
Chrony examples show real-world behavior:
- public-pool client setups are often within a few milliseconds, but excursions happen,
- local-server setups with tighter polling and better network conditions can be much tighter.
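To monitor offset/jitter continuously in the baseline mode, one common approach is to scrape `chronyc tracking` output. A minimal parsing sketch follows; the field labels ("Last offset", "RMS offset") are assumptions based on typical chrony output and should be verified against your chrony version.

```python
import re

def parse_chronyc_tracking(text: str) -> dict:
    """Extract offset fields (in seconds) from `chronyc tracking` output.

    Field labels assumed from typical chrony output; verify them against
    the chrony version deployed on your hosts.
    """
    metrics = {}
    for label, key in [("Last offset", "last_offset_s"),
                       ("RMS offset", "rms_offset_s")]:
        m = re.search(rf"^{label}\s*:\s*([+-]?[\d.]+) seconds",
                      text, re.MULTILINE)
        if m:
            metrics[key] = float(m.group(1))
    return metrics

# Illustrative sample of the two lines this parser cares about:
sample = """Last offset     : +0.000001545 seconds
RMS offset      : 0.000012000 seconds
"""
print(parse_chronyc_tracking(sample))
```

Emitting these values to your metrics pipeline per host is what makes the offset/jitter dashboards in section 6 possible.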
Precision mode (low-latency/strict ordering domains)
Use a PTP stack where required:
- ptp4l for PTP clocking (OC/BC/TC modes),
- phc2sys to synchronize the PHC and the system clock,
- hardware timestamping when available.
The ptp4l/phc2sys docs make this clear: the software supports boundary/ordinary/transparent clock operation and explicit system-clock synchronization flows.
Practical decision:
- don’t deploy PTP “just because.”
- deploy PTP when your error budget and downstream consumers require it.
3) Regulatory nuance (important update)
Historically, many teams anchored to MiFID RTS 25 (EU 2017/574), including UTC traceability and strict divergence/granularity classes.
But note: the regulation text indicates 2017/574 is repealed (end of validity: 2026-03-01) and replaced by EU 2025/1155.
Action item
If your runbooks still quote only 2017/574 tables, update them immediately:
- confirm current applicable thresholds in 2025/1155 and local supervisory guidance,
- align monitoring alarms and evidence collection to current rules,
- keep old thresholds only as historical context.
4) Leap-second policy: choose one, then standardize everywhere
A leap-handling mismatch across services causes silent disagreement ranging from hundreds of milliseconds up to a full second.
Google Public NTP’s documented approach is a 24-hour linear leap smear (noon-to-noon UTC), with about 11.6 ppm frequency change and temporary deviation from unsmeared UTC during the smear window.
Required governance
- declare whether your estate is smear or step,
- do not mix policies inside a tightly-coupled event pipeline,
- if you must interoperate, explicitly convert/normalize at boundaries.
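The arithmetic behind the Google-style linear smear is simple enough to sketch: one leap second absorbed linearly over a 24-hour window implies a frequency change of 1/86400 ≈ 11.57 ppm (the "about 11.6 ppm" figure above), and a smeared-vs-unsmeared offset that grows linearly to a full second.

```python
SMEAR_WINDOW_S = 24 * 3600  # 24-hour noon-to-noon smear window

def smear_offset_s(seconds_into_window: float) -> float:
    """Smeared-minus-unsmeared UTC offset during a linear smear of +1 leap second."""
    frac = min(max(seconds_into_window / SMEAR_WINDOW_S, 0.0), 1.0)
    return frac * 1.0  # the full leap second is absorbed by window end

# Frequency change implied by stretching 86400 s to cover 86401 s of UTC:
freq_change_ppm = 1.0 / SMEAR_WINDOW_S * 1e6
print(f"{freq_change_ppm:.3f} ppm")   # ~11.574 ppm
print(smear_offset_s(12 * 3600))      # 0.5 s at the window midpoint
```

This is exactly why mixing smeared and stepped hosts in one pipeline is dangerous: at the midpoint of the window they disagree by half a second even though both are "correct" under their own policy.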
5) Timestamping semantics at the application layer
Define timestamp fields with meaning, not just datatype.
Recommended event schema:
- event_time_utc (RFC 3339 with subsecond precision)
- event_time_source (exchange, gateway, app, db)
- ingest_time_utc
- process_monotonic_ns
- clock_state (offset estimate, sync source id, leap state)
This makes backtesting, forensic reconstruction, and audit defensible.
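The schema above can be pinned down as a concrete type; the field names match the list, while the sample values and the `leap_state` vocabulary are illustrative assumptions, not a fixed standard.

```python
from dataclasses import dataclass, asdict

@dataclass
class ClockState:
    offset_estimate_s: float  # estimated offset to reference at capture time
    sync_source_id: str       # NTP/PTP source identifier (format is yours to define)
    leap_state: str           # illustrative vocabulary: "none", "smearing", "pending"

@dataclass
class Event:
    event_time_utc: str       # RFC 3339 with subsecond precision
    event_time_source: str    # exchange, gateway, app, db
    ingest_time_utc: str
    process_monotonic_ns: int
    clock_state: ClockState

evt = Event(
    event_time_utc="2026-03-03T09:30:00.000123456Z",
    event_time_source="gateway",
    ingest_time_utc="2026-03-03T09:30:00.000987654Z",
    process_monotonic_ns=123456789,
    clock_state=ClockState(1.2e-6, "ptp:grandmaster-01", "none"),
)
print(asdict(evt)["clock_state"])
```

Carrying `clock_state` on every event is what lets a forensic reconstruction say not just *when* an event claims to have happened, but *how much* that claim can be trusted.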
6) Monitoring & SLOs for time quality
Treat time as an SLO surface.
Minimum metrics
- estimated offset to reference
- drift rate (ppm)
- jitter / RMS offset
- source reachability and source changes
- leap status/state
- step events and large corrections
Example SLO framing
- % hosts in sync within target offset
- % events with trusted clock_state metadata
- max unresolved clock incidents per month
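The first SLO is just a ratio over per-host offset estimates. A minimal sketch, assuming you already collect one signed offset value per host (e.g. from the chrony metrics above):

```python
def pct_hosts_in_sync(offsets_s: dict, target_s: float) -> float:
    """% of hosts whose estimated offset to reference is within the target."""
    in_sync = sum(1 for off in offsets_s.values() if abs(off) <= target_s)
    return 100.0 * in_sync / len(offsets_s)

# Illustrative fleet snapshot: offsets in seconds, target of 1 ms.
offsets = {"host-a": 0.4e-3, "host-b": -2.1e-3,
           "host-c": 0.1e-3, "host-d": 9.0e-3}
print(pct_hosts_in_sync(offsets, target_s=1e-3))  # 50.0 (two of four within 1 ms)
```

Alerting on this ratio, rather than on individual hosts, keeps the SLO framing: one noisy host is an incident, a falling fleet-wide ratio is an erosion of the time-quality budget.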
7) Incident runbook (clock fault)
When time quality degrades:
- Freeze assumptions: stop trusting cross-host order blindly.
- Capture evidence: sync daemon status, source lists, offset history, leap flags.
- Bound blast radius: isolate affected hosts/services from decision-critical paths.
- Recover safely: prefer controlled slew/step strategy per service sensitivity.
- Annotate data windows: mark suspect intervals for analytics/trading/reporting.
- Postmortem update: root cause + control gap + prevention control.
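The "annotate data windows" step can be mechanized: once the incident review produces suspect intervals, tag every event whose timestamp falls inside one. The shapes below (epoch-second tuples, a `time_suspect` flag) are illustrative assumptions, not a fixed API.

```python
def mark_suspect(events, suspect_windows):
    """Tag events whose event time falls inside any suspect interval.

    events: iterable of (event_time_epoch_s, payload) pairs.
    suspect_windows: iterable of (start_epoch_s, end_epoch_s) intervals.
    Both shapes are illustrative; adapt to your event schema.
    """
    marked = []
    for ts, payload in events:
        suspect = any(start <= ts <= end for start, end in suspect_windows)
        marked.append({"ts": ts, "payload": payload, "time_suspect": suspect})
    return marked

events = [(100.0, "fill"), (205.0, "cancel"), (310.0, "quote")]
windows = [(200.0, 300.0)]  # interval flagged during the clock incident
print([e["time_suspect"] for e in mark_suspect(events, windows)])
```

Downstream analytics, backtests, and reports can then exclude or down-weight flagged events instead of silently consuming fiction.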
8) 12-point production checklist
- Time model documented (UTC vs monotonic vs TAI use cases)
- Sync architecture declared (NTP/PTP and source topology)
- Leap policy declared and uniform
- App event schema includes clock provenance
- Drift/offset dashboards with alerting exist
- On-host sync health included in host readiness checks
- Clock step events are audited
- Cross-region source diversity validated
- Runbooks updated for current regulation set (incl. 2025/1155)
- Backtest/analytics pipelines handle suspect-time intervals
- Chaos drill includes clock skew scenario
- Incident review template includes “time quality” section
Common anti-patterns
- using CLOCK_REALTIME for elapsed-time calculations,
- assuming “NTP installed” means “time quality solved,”
- mixing smeared and unsmeared UTC silently,
- storing timestamps without source/provenance,
- ignoring regulation changes after initial compliance project.
References
- Linux man-pages — clock_gettime(2)
  https://man7.org/linux/man-pages/man2/clock_gettime.2.html
- Chrony examples (config + observed accuracy behavior)
  https://chrony-project.org/examples.html
- LinuxPTP — ptp4l
  https://www.linuxptp.org/documentation/ptp4l/
- LinuxPTP — phc2sys
  https://www.linuxptp.org/documentation/phc2sys/
- Google Public NTP leap smear
  https://developers.google.com/time/smear
- UK legislation mirror — EU 2017/574 (includes Annex tables and repeal status metadata)
  https://www.legislation.gov.uk/eur/2017/574
- EUR-Lex — EU 2025/1155 (repeals 2017/574 and updates framework)
  https://eur-lex.europa.eu/eli/reg_del/2025/1155/oj/eng
One-line takeaway
If you don’t operate time as infrastructure, every “precise” metric and timeline in your system is potentially fiction.