Clock Synchronization & Timestamp Integrity: Production Playbook


Date: 2026-03-03
Category: knowledge
Domain: software / timekeeping / trading infrastructure

Why this matters

If your clocks are inconsistent, your data lies: event ordering, latency measurements, and cross-host timelines all become unreliable.

In market systems, this is not a "nice to have." It is operational and regulatory risk.


Core principle

Treat time as a first-class production dependency, not a host-level default.

That means you need:

  1. a declared reference timescale,
  2. explicit sync architecture (NTP/PTP + distribution model),
  3. timestamp semantics per use case,
  4. continuous drift/error monitoring,
  5. documented incident handling for clock faults.

1) Start with a time model (the part teams skip)

Define these explicitly in architecture docs: the reference timescale (normally UTC), the traceability chain to it, and which clock each service reads for which purpose.

From the Linux clock_gettime docs: CLOCK_REALTIME is the settable wall clock and is subject to NTP adjustments and discontinuous jumps; CLOCK_MONOTONIC never goes backwards and is the appropriate source for measuring elapsed time.

Operational rule: use the wall clock only for externally comparable timestamps; use the monotonic clock for durations, timeouts, and intervals.
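The distinction is easy to demonstrate on a Linux/POSIX host (Python exposes both clocks via `time.clock_gettime`; this is a sketch, not part of the original playbook):

```python
import time

# Wall clock (CLOCK_REALTIME): externally comparable, but it can jump
# if the clock is stepped by the sync daemon or an operator.
wall_a = time.clock_gettime(time.CLOCK_REALTIME)

# Monotonic clock (CLOCK_MONOTONIC): never goes backwards, so it is
# the right source for durations and timeouts.
mono_a = time.clock_gettime(time.CLOCK_MONOTONIC)

time.sleep(0.01)

wall_elapsed = time.clock_gettime(time.CLOCK_REALTIME) - wall_a
mono_elapsed = time.clock_gettime(time.CLOCK_MONOTONIC) - mono_a

# mono_elapsed is guaranteed non-negative; wall_elapsed usually agrees,
# but can differ if the wall clock was stepped during the sleep.
print(f"wall: {wall_elapsed:.4f}s  mono: {mono_elapsed:.4f}s")
```

Measuring a latency with the wall clock is exactly how "negative durations" end up in production logs.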


2) Pick synchronization architecture by requirement, not habit

Baseline mode (general infra)

NTP via chrony against multiple trusted sources, with offsets monitored and step/slew limits configured deliberately. chronyc output shows real-world behavior: current offset, frequency error, and root dispersion toward the selected source.
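One way to surface that data programmatically is to parse `chronyc tracking` output. A sketch (the field labels below match current chrony versions, but treat the exact text format as an assumption and pin it in tests):

```python
import re
import subprocess

def parse_chronyc_tracking(text: str) -> dict:
    """Extract numeric fields (in seconds) from `chronyc tracking` output."""
    fields = {}
    for line in text.splitlines():
        m = re.match(
            r"(Last offset|RMS offset|Root dispersion)\s*:\s*([+-]?[\d.]+) seconds",
            line,
        )
        if m:
            fields[m.group(1)] = float(m.group(2))
    return fields

def current_tracking() -> dict:
    """Run chronyc and parse it; requires chrony installed and running."""
    out = subprocess.run(
        ["chronyc", "tracking"], capture_output=True, text=True, check=True
    )
    return parse_chronyc_tracking(out.stdout)

# Sample output lines, for illustration only:
sample = """\
Last offset     : +0.000012487 seconds
RMS offset      : 0.000035221 seconds
Root dispersion : 0.001233 seconds
"""
print(parse_chronyc_tracking(sample))
```

Exporting these three values per host is the cheapest possible drift-monitoring signal.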

Precision mode (low-latency/strict ordering domains)

Use a PTP (IEEE 1588) stack where required: hardware-timestamping NICs, ptp4l disciplining the NIC's hardware clock, and phc2sys keeping the system clock aligned to it.

The ptp4l/phc2sys docs make this clear: the stack supports ordinary, boundary, and transparent clock operation, with explicit flows for synchronizing the system clock to the NIC's hardware clock.

Practical decision: if you need sub-microsecond accuracy or defensible cross-host ordering at microsecond scale, deploy PTP with hardware timestamping; otherwise, well-monitored chrony/NTP is usually sufficient.


3) Regulatory nuance (important update)

Historically, many teams anchored to MiFID RTS 25 (EU 2017/574), including UTC traceability and strict divergence/granularity classes.

But note: the regulation text indicates 2017/574 is repealed (end of validity: 2026-03-01) and replaced by EU 2025/1155.

Action item

If your runbooks still quote only the 2017/574 tables, update them immediately: verify the applicable divergence and granularity requirements against EU 2025/1155 rather than relying on the repealed text.


4) Leap-second policy: choose one, then standardize everywhere

Mismatched leap-second handling across services causes silent disagreements ranging from hundreds of milliseconds up to a full second.

Google Public NTP’s documented approach is a 24-hour linear leap smear (noon-to-noon UTC), with about 11.6 ppm frequency change and temporary deviation from unsmeared UTC during the smear window.
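That description can be sanity-checked with a little arithmetic. The sketch below assumes a positive leap second smeared linearly over the 24-hour window, per the description above; it reports the magnitude of the deviation from unsmeared UTC:

```python
SMEAR_SECONDS = 24 * 60 * 60  # noon-to-noon UTC window

def smear_offset(elapsed_s: float, leap: int = 1) -> float:
    """Magnitude (seconds) of the smeared clock's deviation from unsmeared
    UTC, `elapsed_s` seconds into the smear window, for a `leap`-second
    insertion smeared linearly."""
    if elapsed_s <= 0:
        return 0.0
    if elapsed_s >= SMEAR_SECONDS:
        return float(leap)
    return leap * elapsed_s / SMEAR_SECONDS

# Frequency change implied by smearing one second over 24 hours:
freq_ppm = 1 / SMEAR_SECONDS * 1e6
print(f"{freq_ppm:.1f} ppm")           # matches the ~11.6 ppm figure
print(smear_offset(SMEAR_SECONDS / 2)) # 0.5 s halfway through the window
```

The halfway-point deviation of half a second is exactly why mixing smeared and non-smeared sources in one estate produces the sub-second disagreements described above.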

Required governance: pick one leap policy (smear or step) per estate, apply it to every source and every service, and never mix smeared and non-smeared NTP sources in the same pool.


5) Timestamping semantics at the application layer

Define timestamp fields with meaning, not just datatype.

Recommended event schema:
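A minimal sketch of such a schema (all field names here are illustrative, not prescriptive):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    # Illustrative fields: each timestamp carries its meaning and provenance.
    event_time_ns: int    # when the event happened, UTC, ns since epoch
    ingest_time_ns: int   # when our system first observed it, same timescale
    clock_source: str     # e.g. "PTP/PHC" or "chrony/NTP": which sync chain
                          # produced event_time_ns
    est_error_ns: int     # estimated clock error bound at capture time
    monotonic_seq: int    # per-producer sequence number for intra-host ordering
```

Carrying `clock_source` and `est_error_ns` alongside the timestamp is what lets a later audit say not just *when*, but *how confidently*.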

This makes backtesting, forensic reconstruction, and audit defensible.


6) Monitoring & SLOs for time quality

Treat time as an SLO surface.

Minimum metrics: offset to the selected source, frequency error, estimated error bound (root dispersion), sync-daemon reachability, and leap status flags, exported per host.

Example SLO framing: "99.9% of one-minute windows have |offset| within the host-class budget (e.g. 100 µs for trading hosts, 10 ms for general infra)," with alerts firing on error-bound growth before the budget is exhausted.
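The compliance calculation itself is trivial; a sketch over sampled offsets (the 100 µs budget is a placeholder, not a recommendation):

```python
def offset_slo_compliance(offsets_s, budget_s=100e-6):
    """Fraction of offset samples (seconds) whose magnitude is within budget."""
    if not offsets_s:
        return 1.0  # no samples: vacuously compliant (or alert on missing data)
    ok = sum(1 for o in offsets_s if abs(o) <= budget_s)
    return ok / len(offsets_s)

samples = [12e-6, -35e-6, 80e-6, 150e-6]  # seconds
print(offset_slo_compliance(samples))      # 0.75: one sample breaches 100 µs
```

In practice you would also alert on the *trend* of the error bound, since a growing bound predicts an SLO breach before it happens.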


7) Incident runbook (clock fault)

When time quality degrades:

  1. Freeze assumptions: stop trusting cross-host order blindly.
  2. Capture evidence: sync daemon status, source lists, offset history, leap flags.
  3. Bound blast radius: isolate affected hosts/services from decision-critical paths.
  4. Recover safely: prefer controlled slew/step strategy per service sensitivity.
  5. Annotate data windows: mark suspect intervals for analytics/trading/reporting.
  6. Postmortem update: root cause + control gap + prevention control.
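Step 5 (annotating data windows) can be as simple as an interval check applied in downstream queries; a sketch, with illustrative timestamps:

```python
def in_suspect_window(ts_ns: int, windows: list[tuple[int, int]]) -> bool:
    """True if a timestamp falls inside any [start, end) suspect interval."""
    return any(start <= ts_ns < end for start, end in windows)

# Mark a 10-minute window around a clock fault (values illustrative):
suspect = [(1_700_000_000_000_000_000, 1_700_000_600_000_000_000)]

print(in_suspect_window(1_700_000_300_000_000_000, suspect))  # True
print(in_suspect_window(1_700_001_000_000_000_000, suspect))  # False
```

Persisting these windows alongside the data, rather than deleting the data, keeps the record defensible for later analytics and reporting.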

8) 12-point production checklist


Common anti-patterns


One-line takeaway

If you don’t operate time as infrastructure, every “precise” metric and timeline in your system is potentially fiction.