Leap-Smear Mixed Time Sources & Timeout-Convexity Slippage Playbook
Date: 2026-03-25
Category: research
Audience: low-latency execution operators using mixed NTP/PTP/clock domains
Why this note
Most execution systems treat "clock quality" as a monitoring concern, not a slippage variable.
That assumption fails around leap-second handling windows when infrastructure mixes:
- smeared UTC sources (gradual adjustment), and
- unsmeared/step UTC sources (discrete leap insertion).
Even if both are "correct" relative to their own standard, they can disagree by up to ~0.5 s around the event (for a 24 h linear smear centered on the leap; other smear profiles differ). In execution pipelines, this can quietly produce:
- stale-signal admission mistakes,
- timeout misclassification,
- out-of-order event stitching,
- and forced urgency bursts near schedule deadlines.
Treating this as generic latency noise is a category error. It is a clock-regime branch risk.
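To make the ~0.5 s figure concrete, here is a minimal sketch of the offset between a smeared and a step clock, assuming a 24 h linear smear centered on the leap (as in Google's public smear). The helper name and window constants are illustrative, not a vendor API:

```python
# Offset between a linear-24h smeared clock and a step (non-smeared) clock
# around a leap-second insertion. Constants are illustrative assumptions.

LEAP = 1.0               # inserted leap second (s)
SMEAR_WINDOW = 86400.0   # 24 h linear smear (s), centered on the event

def smear_vs_step_offset(t_rel: float) -> float:
    """Offset (smeared - step) in seconds.

    t_rel: seconds relative to the leap event (event at t_rel = 0).
    """
    start = -SMEAR_WINDOW / 2
    # Fraction of the leap already absorbed by the smeared clock.
    frac = min(max((t_rel - start) / SMEAR_WINDOW, 0.0), 1.0)
    smeared = frac * LEAP
    # The step clock absorbs the whole second at the event instant.
    step = LEAP if t_rel >= 0 else 0.0
    return smeared - step

# Worst case sits at the step: just before it, the smeared clock has
# absorbed ~0.5 s that the step clock has not; just after, the sign flips.
print(smear_vs_step_offset(-1e-6))   # ~ +0.5
print(smear_vs_step_offset(0.0))     # ~ -0.5
```

Outside the smear window the two clocks agree again, which is why the divergence is a bounded but sharply time-localized risk.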
1) Cost decomposition with clock-regime term
For a child decision at time $t$:

$$ \mathbb{E}[C_t] = \mathbb{E}[C_{base} \mid a_t, s_t] + \lambda_{clk}\,\mathbb{E}[C_{clk} \mid r_t] $$

Where:
- $C_{base}$: spread + impact + fees + opportunity cost,
- $r_t$: clock-regime state (aligned / mixed-smear / split),
- $C_{clk}$: incremental cost from clock inconsistency.
Branch approximation:

$$ C_{clk} \approx p_{stale}L_{stale} + p_{timeout}L_{timeout} + p_{reorder}L_{reorder} + p_{deadline}L_{deadline} $$

Key latent variable:
- $\delta_{clk}(t)$: effective cross-component timestamp offset (decision engine vs gateway vs TCA/logging vs venue-clock proxy).
When $|\delta_{clk}|$ grows, branch probabilities rise nonlinearly.
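The branch approximation can be written out directly. The probabilities and per-branch losses below are hypothetical placeholders; in practice they would come from the fitted models in section 3:

```python
# Sketch of the E[C_clk] branch approximation. Inputs are illustrative.

def expected_clock_cost(p: dict, loss: dict, lam_clk: float = 1.0) -> float:
    """E[C_clk] ~ sum over branches of p_branch * L_branch, scaled by lambda_clk."""
    branches = ("stale", "timeout", "reorder", "deadline")
    return lam_clk * sum(p[b] * loss[b] for b in branches)

# Example (hypothetical numbers): as offsets grow, the stale/timeout
# branch probabilities dominate the incremental cost.
p = {"stale": 0.02, "timeout": 0.01, "reorder": 0.002, "deadline": 0.005}
loss = {"stale": 3.0, "timeout": 5.0, "reorder": 8.0, "deadline": 12.0}  # bps
print(expected_clock_cost(p, loss))  # ~ 0.186 bps
```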
2) Observability contract (required fields)
A) Clock provenance (every host/service)
- active time source (`ptp`, `ntp`, `phc`, vendor local clock)
- smear policy (`none`, `linear24h`, `custom`)
- measured offset/drift to reference source
- leap indicator state
- monotonic-vs-realtime delta stability
B) Execution timeline (point-in-time)
- decision timestamp (monotonic + realtime pair)
- send/ACK/reject/fill timestamps
- timeout budget and timeout firing instant
- feature-age-at-decision and estimated feature-age-at-arrival
C) Data quality tags
- source clock domain ID per event
- confidence class: `TRUSTED`, `DEGRADED`, `UNRECONCILED`
Without explicit clock-domain tags, replay/TCA can fabricate causal narratives.
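As one possible shape for this contract, a minimal per-event schema sketch; field names are illustrative, not a reference to any production message format:

```python
# Illustrative event schema for the observability contract above.
from dataclasses import dataclass
from enum import Enum

class ClockConfidence(Enum):
    TRUSTED = "trusted"
    DEGRADED = "degraded"
    UNRECONCILED = "unreconciled"

@dataclass(frozen=True)
class ExecutionEvent:
    clock_domain_id: str      # e.g. "gw-ptp-1", "engine-ntp-smeared"
    time_source: str          # "ptp" | "ntp" | "phc" | "vendor_local"
    smear_policy: str         # "none" | "linear24h" | "custom"
    ts_realtime_ns: int       # wall-clock timestamp
    ts_monotonic_ns: int      # monotonic timestamp (interval-safe)
    offset_to_ref_ns: int     # measured offset to reference source
    leap_indicator: int       # 0 = none, 1 = insertion pending, ...
    confidence: ClockConfidence
```

Carrying both realtime and monotonic timestamps on every event is what later lets replay re-derive ordering under a unified reference clock.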
3) Practical model components
A) Clock split detector
States:
- TIME_ALIGNED — negligible cross-domain offset
- SMEAR_MIX_RISK — smear and non-smear coexist, growing divergence
- CLOCK_SPLIT_ACTIVE — offset above safety threshold; branch risk material
- RECOVERY_RECONCILE — offsets normalize, reconciliation still pending
Implementation: rules + hysteresis first, HMM later.
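A rules-plus-hysteresis sketch of the detector; the thresholds and dwell count below are placeholders to be calibrated per deployment:

```python
# Clock split detector: rules + hysteresis. All thresholds are placeholders.

ENTER_SPLIT_NS = 50_000_000   # 50 ms: offset above this => split is material
EXIT_SPLIT_NS = 10_000_000    # 10 ms: must drop below this to begin recovery
DWELL_SAMPLES = 5             # consecutive clean samples before realigning

class ClockSplitDetector:
    def __init__(self):
        self.state = "TIME_ALIGNED"
        self.clean = 0

    def update(self, abs_offset_ns: int, mixed_smear: bool) -> str:
        if abs_offset_ns >= ENTER_SPLIT_NS:
            self.state, self.clean = "CLOCK_SPLIT_ACTIVE", 0
        elif self.state == "CLOCK_SPLIT_ACTIVE":
            if abs_offset_ns < EXIT_SPLIT_NS:
                self.state = "RECOVERY_RECONCILE"
        elif self.state == "RECOVERY_RECONCILE":
            # Hysteresis: require a dwell period of consecutive clean samples.
            self.clean = self.clean + 1 if abs_offset_ns < EXIT_SPLIT_NS else 0
            if self.clean >= DWELL_SAMPLES:
                self.state = "TIME_ALIGNED"
        else:
            self.state = "SMEAR_MIX_RISK" if mixed_smear else "TIME_ALIGNED"
        return self.state
```

Note the asymmetric enter/exit thresholds plus the dwell counter: the detector can only leave `CLOCK_SPLIT_ACTIVE` through `RECOVERY_RECONCILE`, which matches the no-direct-RED-to-GREEN rule in section 4.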
B) Timeout misfire model
Estimate:
$$ P(\text{false timeout} \mid \delta_{clk}, \text{venue}, \text{path}, \text{load}) $$
False timeout = local timeout fires while external execution path was still healthy.
C) Causal inversion model
Estimate probability that event ordering flips under clock drift:
$$ P(T_{ack} < T_{send}) \;\text{or}\; P(T_{fill} < T_{ack}) \text{ (after normalization)} $$
Use to down-weight fragile labels in online learning and TCA attribution.
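A sketch of the normalize-then-check step, assuming per-domain offsets are available from the clock provenance telemetry above:

```python
# Flag causal inversions after mapping each timestamp into a single
# reference domain via its measured domain offset (assumed available).

def normalize(ts_ns: int, domain_offset_ns: int) -> int:
    return ts_ns - domain_offset_ns

def causal_inversions(events: list[tuple[str, int, int]]) -> list[str]:
    """events: (name, ts_ns, domain_offset_ns) in required causal order,
    e.g. [("send", ...), ("ack", ...), ("fill", ...)].
    Returns the violated adjacent pairs."""
    norm = [(name, normalize(ts, off)) for name, ts, off in events]
    return [f"{a} >= {b}" for (a, ta), (b, tb) in zip(norm, norm[1:]) if ta >= tb]
```

Any non-empty result marks the event chain as fragile, and its labels can be down-weighted in online learning and TCA attribution.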
D) Deadline convexity under clock uncertainty
$$ \Delta C_{deadline}(u, \delta_{clk}) = C_{forced}(u, \delta_{clk}) - C_{smooth}(u) $$

As the deadline approaches, even a modest $\delta_{clk}$ inflates forced-aggression probability.
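A toy illustration only, with an invented, uncalibrated functional form: a clock offset that effectively shortens the remaining time steepens a convex participation-style cost:

```python
# Toy deadline-convexity illustration. u = fraction of the parent order
# still unfilled, t_left_s = time to deadline. Functional form is an
# assumption (square-root-impact style), not a calibrated model.

def forced_cost(u: float, t_left_s: float, delta_clk_s: float,
                impact_bps: float = 10.0) -> float:
    # If the local clock runs ahead of the venue clock, effective time
    # to deadline shrinks by the offset.
    t_eff = max(t_left_s - delta_clk_s, 1e-3)
    # Convex in required participation intensity u / t_eff.
    return impact_bps * (u / t_eff) ** 0.5
```

The point of the sketch is the interaction: the same offset that is harmless mid-schedule dominates the cost once `t_left_s` is of the same order as `delta_clk_s`.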
4) Live execution controller
GREEN — TIME_ALIGNED
- Normal policy.
AMBER — SMEAR_MIX_RISK
- tighten stale-signal admission windows,
- reduce child slice size and burst tolerance,
- increase timeout buffers with bounded retry pacing,
- mark TCA confidence as degraded.
RED — CLOCK_SPLIT_ACTIVE
- disable short-half-life alpha branches,
- freeze aggressive deadline catch-up logic,
- cap participation and avoid panic-cross loops,
- switch to conservative completion mode with explicit operator alert.
BLUE — RECOVERY_RECONCILE
- keep caps until offset and ordering metrics stabilize for dwell period,
- reconcile positions/fills using monotonic sequence + venue IDs,
- no direct RED→GREEN jump.
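The regime gating and the no-direct-RED-to-GREEN rule can be encoded as a small transition table; cap values below are placeholders:

```python
# Regime-gated policy caps and allowed regime transitions. All numeric
# values are placeholders; the structural point is the explicit
# RED -> BLUE -> GREEN path (no direct RED -> GREEN jump).

POLICY = {
    "TIME_ALIGNED":       {"max_slice": 1.00, "timeout_mult": 1.0, "short_alpha": True},
    "SMEAR_MIX_RISK":     {"max_slice": 0.50, "timeout_mult": 1.5, "short_alpha": True},
    "CLOCK_SPLIT_ACTIVE": {"max_slice": 0.20, "timeout_mult": 2.0, "short_alpha": False},
    "RECOVERY_RECONCILE": {"max_slice": 0.35, "timeout_mult": 1.5, "short_alpha": False},
}

ALLOWED = {
    "TIME_ALIGNED":       {"SMEAR_MIX_RISK", "CLOCK_SPLIT_ACTIVE"},
    "SMEAR_MIX_RISK":     {"TIME_ALIGNED", "CLOCK_SPLIT_ACTIVE"},
    "CLOCK_SPLIT_ACTIVE": {"RECOVERY_RECONCILE"},  # never straight to GREEN
    "RECOVERY_RECONCILE": {"TIME_ALIGNED", "CLOCK_SPLIT_ACTIVE"},
}

def transition(current: str, proposed: str) -> str:
    """Clamp an illegal regime jump to the current state."""
    return proposed if proposed in ALLOWED[current] else current
```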
5) KPI stack
- CMD (Clock Mismatch Distance): p95 absolute cross-domain offset
- FTR (False Timeout Rate): timeout-fired but later-valid ACK/fill fraction
- CII (Causal Inversion Incidence): impossible ordering rate after normalization
- SBF (Stale-Branch Fraction): fraction of decisions violating signal freshness because of clock mismatch
- DCS-CLK (Deadline Convexity Slope, clock-conditioned): urgency-cost steepness conditioned on clock regime
Alert examples:
- CMD↑ + CII↑ for N minutes => enter RED
- FTR spike with stable network RTT => likely clock, not transport
- SBF above weekly p95 near auction windows => force conservative mode
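The first two alert examples, written as predicates over a KPI snapshot; threshold names and values are placeholders:

```python
# Alert predicates over the KPI stack. Thresholds are illustrative.

def enter_red(cmd_p95_ns: int, cii_rate: float,
              cmd_thr_ns: int = 50_000_000, cii_thr: float = 0.001) -> bool:
    # CMD up together with CII up => clock split, escalate to RED.
    return cmd_p95_ns > cmd_thr_ns and cii_rate > cii_thr

def likely_clock_not_transport(ftr: float, rtt_ratio: float,
                               ftr_thr: float = 0.02,
                               rtt_band: float = 0.1) -> bool:
    # FTR spike while RTT stays within +/-10% of baseline => suspect the
    # clocks, not the venue or network path.
    return ftr > ftr_thr and abs(rtt_ratio - 1.0) < rtt_band
```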
6) Validation ladder
Historical relabeling
Recompute event timelines under a unified reference clock and compare branch outcomes.
Counterfactual replay
Simulate AMBER/RED controls during known clock-divergence windows.
Leap-window game day
Inject synthetic smear/non-smear offsets in staging; verify timeout and ordering resilience.
Shadow -> canary
Shadow the detector first, then canary the controls on low-risk flow.
Primary success metric: lower q95/q99 slippage and fewer timeout/retry bursts with bounded completion degradation.
7) 14-day implementation plan
Days 1-2
Add clock-domain metadata and smear-policy telemetry to all execution events.
Days 3-4
Build CMD/FTR/CII/SBF dashboards and baseline distributions.
Days 5-6
Implement TIME_ALIGNED/SMEAR_MIX_RISK/CLOCK_SPLIT rules + hysteresis.
Days 7-9
Train timeout-misfire and stale-arrival branch estimators.
Days 10-11
Integrate AMBER/RED policy guards in shadow mode.
Days 12-13
Canary enablement with hard rollback gates.
Day 14
Finalize incident runbook + reconciliation checklist.
Common mistakes
Mixing smeared NTP and non-smeared PTP without explicit policy
This is a documented risk on cloud platforms and creates silent split-brain time.
Using the realtime clock for interval measurement
Interval logic (timeouts, RTT estimates) should rely on monotonic clocks.
Assuming transport degradation when timeouts spike
False-timeout spikes can be clock artifacts, not venue/network deterioration.
TCA attribution without clock-confidence tagging
You will overfit policy to timestamp artifacts.
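The monotonic-clock point can be made concrete with a small timeout helper: `time.monotonic()` is unaffected by wall-clock steps (including leap adjustments), so interval logic keyed to it cannot misfire when `CLOCK_REALTIME` jumps:

```python
# Interval measurement with the monotonic clock. Using time.time() here
# would make the budget sensitive to wall-clock steps; time.monotonic()
# is not.
import time

def wait_with_timeout(done, budget_s: float, poll_s: float = 0.01) -> bool:
    """Poll done() until it returns True or the monotonic budget expires."""
    deadline = time.monotonic() + budget_s
    while time.monotonic() < deadline:
        if done():
            return True
        time.sleep(poll_s)
    return False
```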
Bottom line
Leap-second handling is no longer just a timekeeping detail.
In mixed clock-source environments, smear-vs-step divergence can become a first-class slippage driver through timeout, ordering, and deadline-convexity branches.
Model it explicitly, gate live policy by clock-regime state, and enforce conservative recovery + reconciliation. That is how you prevent "time semantics" from turning into hidden basis-point leakage.
References
- Google Public NTP — Leap Smear: https://developers.google.com/time/smear
- AWS EC2 Time Sync docs (smear vs PTP non-smear caution): https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html
- BIPM — Resolution 4 (27th CGPM, 2022), future UTC/leap-second direction: https://www.bipm.org/en/cgpm-2022/resolution-4
- Cloudflare postmortem: leap-second time assumptions and DNS impact: https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/
- LKML thread: 2012 leap-second futex timeout behavior: https://lkml.iu.edu/hypermail/linux/kernel/1206.3/03186.html