Clock Synchronization & Timestamp Integrity: Production Playbook
Date: 2026-03-03
Category: knowledge
Domain: software / timekeeping / trading infrastructure
Why this matters
If your clocks are inconsistent, your data lies:
- event order becomes ambiguous,
- latency metrics get distorted,
- incident timelines break,
- compliance/audit evidence becomes weak.
In market systems, this is not a “nice to have.” It is operational and regulatory risk.
Core principle
Treat time as a first-class production dependency, not a host-level default.
That means you need:
- a declared reference timescale,
- explicit sync architecture (NTP/PTP + distribution model),
- timestamp semantics per use case,
- continuous drift/error monitoring,
- documented incident handling for clock faults.
1) Start with a time model (the part teams skip)
Define these explicitly in architecture docs:
- Reference clock for audit/compliance: UTC traceable source
- Clock for elapsed durations: monotonic clock (CLOCK_MONOTONIC family)
- Clock for absolute labels/logs: wall clock (CLOCK_REALTIME / UTC)
- Clock for leap-second-aware workflows: consider CLOCK_TAI where needed
From the Linux clock_gettime(2) docs:
- CLOCK_REALTIME is wall time; it can jump and is adjusted by NTP-like mechanisms.
- CLOCK_MONOTONIC does not go backward but can be frequency-adjusted.
- CLOCK_MONOTONIC_RAW is monotonic and not subject to frequency adjustments.
- CLOCK_TAI counts leap seconds and avoids leap-insertion discontinuities.
Operational rule:
- use monotonic clocks for measuring durations/timeouts/retries,
- use UTC wall time for externally visible timestamps and cross-system correlation.
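The two rules above can be shown in a minimal Python sketch: the monotonic clock times the work (immune to wall-clock steps mid-measurement), while UTC wall time labels the event for cross-system correlation.

```python
import time
from datetime import datetime, timezone

# Elapsed duration: monotonic clock, which never jumps backward even if
# NTP steps the wall clock while the measurement is in flight.
start_ns = time.monotonic_ns()
time.sleep(0.01)  # stand-in for the work being timed
elapsed_ms = (time.monotonic_ns() - start_ns) / 1e6

# Externally visible timestamp: UTC wall time, suitable for logs and
# cross-host correlation (but never for computing durations).
event_time_utc = datetime.now(timezone.utc).isoformat()

print(f"elapsed_ms={elapsed_ms:.2f} event_time_utc={event_time_utc}")
```

Using `CLOCK_REALTIME` for the subtraction instead would silently produce negative or wildly wrong durations whenever the wall clock is stepped.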
2) Pick synchronization architecture by requirement, not habit
Baseline mode (general infra)
- chrony + reliable NTP sources
- monitor offset/jitter continuously
- keep stratum and source diversity sane
Chrony examples show real-world behavior:
- public-pool client setups are often within a few milliseconds, but excursions happen,
- local-server setups with tighter polling and better network conditions can be much tighter.
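To monitor offset/jitter continuously in the baseline mode, one common approach is to scrape `chronyc tracking` output. A minimal parsing sketch follows; the field labels ("Last offset", "RMS offset") are assumptions based on typical chrony output and should be verified against your chrony version.

```python
import re

def parse_chronyc_tracking(text: str) -> dict:
    """Extract offset fields (in seconds) from `chronyc tracking` output.

    Field labels assumed from typical chrony output; verify them against
    the chrony version deployed on your hosts.
    """
    metrics = {}
    for label, key in [("Last offset", "last_offset_s"),
                       ("RMS offset", "rms_offset_s")]:
        m = re.search(rf"^{label}\s*:\s*([+-]?[\d.]+) seconds",
                      text, re.MULTILINE)
        if m:
            metrics[key] = float(m.group(1))
    return metrics

# Illustrative sample of the two lines this parser cares about:
sample = """Last offset     : +0.000001545 seconds
RMS offset      : 0.000012000 seconds
"""
print(parse_chronyc_tracking(sample))
```

Emitting these values to your metrics pipeline per host is what makes the offset/jitter dashboards in section 6 possible.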
Precision mode (low-latency/strict ordering domains)
Use a PTP stack where required:
- ptp4l for PTP clocking (OC/BC/TC modes),
- phc2sys to synchronize the PHC and the system clock,
- hardware timestamping when available.
The ptp4l/phc2sys docs make this clear: the software supports boundary/ordinary/transparent clock operation and explicit system-clock synchronization flows.
Practical decision:
- don’t deploy PTP “just because.”
- deploy PTP when your error budget and downstream consumers require it.
3) Regulatory nuance (important update)
Historically, many teams anchored to MiFID RTS 25 (EU 2017/574), including UTC traceability and strict divergence/granularity classes.
But note: the regulation text indicates 2017/574 is repealed (end of validity: 2026-03-01) and replaced by EU 2025/1155.
Action item
If your runbooks still quote only 2017/574 tables, update them immediately:
- confirm current applicable thresholds in 2025/1155 and local supervisory guidance,
- align monitoring alarms and evidence collection to current rules,
- keep old thresholds only as historical context.
4) Leap-second policy: choose one, then standardize everywhere
A leap-handling mismatch across services causes silent disagreement ranging from hundreds of milliseconds up to a full second.
Google Public NTP’s documented approach is a 24-hour linear leap smear (noon-to-noon UTC), with about 11.6 ppm frequency change and temporary deviation from unsmeared UTC during the smear window.
Required governance
- declare whether your estate is smear or step,
- do not mix policies inside a tightly-coupled event pipeline,
- if you must interoperate, explicitly convert/normalize at boundaries.
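The arithmetic behind the Google-style linear smear is simple enough to sketch: one leap second absorbed linearly over a 24-hour window implies a frequency change of 1/86400 ≈ 11.57 ppm (the "about 11.6 ppm" figure above), and a smeared-vs-unsmeared offset that grows linearly to a full second.

```python
SMEAR_WINDOW_S = 24 * 3600  # 24-hour noon-to-noon smear window

def smear_offset_s(seconds_into_window: float) -> float:
    """Smeared-minus-unsmeared UTC offset during a linear smear of +1 leap second."""
    frac = min(max(seconds_into_window / SMEAR_WINDOW_S, 0.0), 1.0)
    return frac * 1.0  # the full leap second is absorbed by window end

# Frequency change implied by stretching 86400 s to cover 86401 s of UTC:
freq_change_ppm = 1.0 / SMEAR_WINDOW_S * 1e6
print(f"{freq_change_ppm:.3f} ppm")   # ~11.574 ppm
print(smear_offset_s(12 * 3600))      # 0.5 s at the window midpoint
```

This is exactly why mixing smeared and stepped hosts in one pipeline is dangerous: at the midpoint of the window they disagree by half a second even though both are "correct" under their own policy.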
5) Timestamping semantics at the application layer
Define timestamp fields with meaning, not just datatype.
Recommended event schema:
- event_time_utc (RFC 3339 with subsecond precision)
- event_time_source (exchange, gateway, app, db)
- ingest_time_utc
- process_monotonic_ns
- clock_state (offset estimate, sync source id, leap state)
This makes backtesting, forensic reconstruction, and audit defensible.
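The schema above can be pinned down as a concrete type; the field names match the list, while the sample values and the `leap_state` vocabulary are illustrative assumptions, not a fixed standard.

```python
from dataclasses import dataclass, asdict

@dataclass
class ClockState:
    offset_estimate_s: float  # estimated offset to reference at capture time
    sync_source_id: str       # NTP/PTP source identifier (format is yours to define)
    leap_state: str           # illustrative vocabulary: "none", "smearing", "pending"

@dataclass
class Event:
    event_time_utc: str       # RFC 3339 with subsecond precision
    event_time_source: str    # exchange, gateway, app, db
    ingest_time_utc: str
    process_monotonic_ns: int
    clock_state: ClockState

evt = Event(
    event_time_utc="2026-03-03T09:30:00.000123456Z",
    event_time_source="gateway",
    ingest_time_utc="2026-03-03T09:30:00.000987654Z",
    process_monotonic_ns=123456789,
    clock_state=ClockState(1.2e-6, "ptp:grandmaster-01", "none"),
)
print(asdict(evt)["clock_state"])
```

Carrying `clock_state` on every event is what lets a forensic reconstruction say not just *when* an event claims to have happened, but *how much* that claim can be trusted.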
6) Monitoring & SLOs for time quality
Treat time as an SLO surface.
Minimum metrics
- estimated offset to reference
- drift rate (ppm)
- jitter / RMS offset
- source reachability and source changes
- leap status/state
- step events and large corrections
Example SLO framing
- % hosts in sync within target offset
- % events with trusted clock_state metadata
- max unresolved clock incidents per month
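The first SLO is just a ratio over per-host offset estimates. A minimal sketch, assuming you already collect one signed offset value per host (e.g. from the chrony metrics above):

```python
def pct_hosts_in_sync(offsets_s: dict, target_s: float) -> float:
    """% of hosts whose estimated offset to reference is within the target."""
    in_sync = sum(1 for off in offsets_s.values() if abs(off) <= target_s)
    return 100.0 * in_sync / len(offsets_s)

# Illustrative fleet snapshot: offsets in seconds, target of 1 ms.
offsets = {"host-a": 0.4e-3, "host-b": -2.1e-3,
           "host-c": 0.1e-3, "host-d": 9.0e-3}
print(pct_hosts_in_sync(offsets, target_s=1e-3))  # 50.0 (two of four within 1 ms)
```

Alerting on this ratio, rather than on individual hosts, keeps the SLO framing: one noisy host is an incident, a falling fleet-wide ratio is an erosion of the time-quality budget.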
7) Incident runbook (clock fault)
When time quality degrades:
- Freeze assumptions: stop trusting cross-host order blindly.
- Capture evidence: sync daemon status, source lists, offset history, leap flags.
- Bound blast radius: isolate affected hosts/services from decision-critical paths.
- Recover safely: prefer controlled slew/step strategy per service sensitivity.
- Annotate data windows: mark suspect intervals for analytics/trading/reporting.
- Postmortem update: root cause + control gap + prevention control.
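The "annotate data windows" step can be mechanized: once the incident review produces suspect intervals, tag every event whose timestamp falls inside one. The shapes below (epoch-second tuples, a `time_suspect` flag) are illustrative assumptions, not a fixed API.

```python
def mark_suspect(events, suspect_windows):
    """Tag events whose event time falls inside any suspect interval.

    events: iterable of (event_time_epoch_s, payload) pairs.
    suspect_windows: iterable of (start_epoch_s, end_epoch_s) intervals.
    Both shapes are illustrative; adapt to your event schema.
    """
    marked = []
    for ts, payload in events:
        suspect = any(start <= ts <= end for start, end in suspect_windows)
        marked.append({"ts": ts, "payload": payload, "time_suspect": suspect})
    return marked

events = [(100.0, "fill"), (205.0, "cancel"), (310.0, "quote")]
windows = [(200.0, 300.0)]  # interval flagged during the clock incident
print([e["time_suspect"] for e in mark_suspect(events, windows)])
```

Downstream analytics, backtests, and reports can then exclude or down-weight flagged events instead of silently consuming fiction.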
8) 12-point production checklist
- Time model documented (UTC vs monotonic vs TAI use cases)
- Sync architecture declared (NTP/PTP and source topology)
- Leap policy declared and uniform
- App event schema includes clock provenance
- Drift/offset dashboards with alerting exist
- On-host sync health included in host readiness checks
- Clock step events are audited
- Cross-region source diversity validated
- Runbooks updated for current regulation set (incl. 2025/1155)
- Backtest/analytics pipelines handle suspect-time intervals
- Chaos drill includes clock skew scenario
- Incident review template includes “time quality” section
Common anti-patterns
- using CLOCK_REALTIME for elapsed-time calculations,
- assuming “NTP installed” means “time quality solved,”
- mixing smeared and unsmeared UTC silently,
- storing timestamps without source/provenance,
- ignoring regulation changes after initial compliance project.
References
- Linux man-pages — clock_gettime(2)
  https://man7.org/linux/man-pages/man2/clock_gettime.2.html
- Chrony examples (config + observed accuracy behavior)
  https://chrony-project.org/examples.html
- LinuxPTP — ptp4l
  https://www.linuxptp.org/documentation/ptp4l/
- LinuxPTP — phc2sys
  https://www.linuxptp.org/documentation/phc2sys/
- Google Public NTP leap smear
  https://developers.google.com/time/smear
- UK legislation mirror — EU 2017/574 (includes Annex tables and repeal status metadata)
  https://www.legislation.gov.uk/eur/2017/574
- EUR-Lex — EU 2025/1155 (repeals 2017/574 and updates framework)
  https://eur-lex.europa.eu/eli/reg_del/2025/1155/oj/eng
One-line takeaway
If you don’t operate time as infrastructure, every “precise” metric and timeline in your system is potentially fiction.