Leap-Smear Mixed Time Sources & Timeout-Convexity Slippage Playbook
Date: 2026-03-25
Category: research
Audience: low-latency execution operators using mixed NTP/PTP/clock domains
Why this note
Most execution systems treat "clock quality" as a monitoring concern, not a slippage variable.
That assumption fails around leap-second handling windows when infrastructure mixes:
- smeared UTC sources (gradual adjustment), and
- unsmeared/step UTC sources (discrete leap insertion).
Even if both are "correct" relative to their own standard, they can disagree by up to ~0.5 s around the event (for a 24 h linear smear centered on the leap; other smear profiles differ). In execution pipelines, this can quietly produce:
- stale-signal admission mistakes,
- timeout misclassification,
- out-of-order event stitching,
- and forced urgency bursts near schedule deadlines.
Treating this as generic latency noise is a category error. It is a clock-regime branch risk.
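To make the ~0.5 s figure concrete, here is a minimal sketch of the offset between a smeared and a step clock, assuming a 24 h linear smear centered on the leap (as in Google's public smear). The helper name and window constants are illustrative, not a vendor API:

```python
# Offset between a linear-24h smeared clock and a step (non-smeared) clock
# around a leap-second insertion. Constants are illustrative assumptions.

LEAP = 1.0               # inserted leap second (s)
SMEAR_WINDOW = 86400.0   # 24 h linear smear (s), centered on the event

def smear_vs_step_offset(t_rel: float) -> float:
    """Offset (smeared - step) in seconds.

    t_rel: seconds relative to the leap event (event at t_rel = 0).
    """
    start = -SMEAR_WINDOW / 2
    # Fraction of the leap already absorbed by the smeared clock.
    frac = min(max((t_rel - start) / SMEAR_WINDOW, 0.0), 1.0)
    smeared = frac * LEAP
    # The step clock absorbs the whole second at the event instant.
    step = LEAP if t_rel >= 0 else 0.0
    return smeared - step

# Worst case sits at the step: just before it, the smeared clock has
# absorbed ~0.5 s that the step clock has not; just after, the sign flips.
print(smear_vs_step_offset(-1e-6))   # ~ +0.5
print(smear_vs_step_offset(0.0))     # ~ -0.5
```

Outside the smear window the two clocks agree again, which is why the divergence is a bounded but sharply time-localized risk.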
1) Cost decomposition with clock-regime term
For a child decision at time $t$:

$$ \mathbb{E}[C_t] = \mathbb{E}[C_{base} \mid a_t, s_t] + \lambda_{clk}\,\mathbb{E}[C_{clk} \mid r_t] $$

Where:
- $C_{base}$: spread + impact + fees + opportunity cost,
- $r_t$: clock-regime state (aligned / mixed-smear / split),
- $C_{clk}$: incremental cost from clock inconsistency.
Branch approximation:

$$ C_{clk} \approx p_{stale}L_{stale} + p_{timeout}L_{timeout} + p_{reorder}L_{reorder} + p_{deadline}L_{deadline} $$

Key latent variable:
- $\delta_{clk}(t)$: effective cross-component timestamp offset (decision engine vs gateway vs TCA/logging vs venue-clock proxy).
When $|\delta_{clk}|$ grows, branch probabilities rise nonlinearly.
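The branch approximation can be written out directly. The probabilities and per-branch losses below are hypothetical placeholders; in practice they would come from the fitted models in section 3:

```python
# Sketch of the E[C_clk] branch approximation. Inputs are illustrative.

def expected_clock_cost(p: dict, loss: dict, lam_clk: float = 1.0) -> float:
    """E[C_clk] ~ sum over branches of p_branch * L_branch, scaled by lambda_clk."""
    branches = ("stale", "timeout", "reorder", "deadline")
    return lam_clk * sum(p[b] * loss[b] for b in branches)

# Example (hypothetical numbers): as offsets grow, the stale/timeout
# branch probabilities dominate the incremental cost.
p = {"stale": 0.02, "timeout": 0.01, "reorder": 0.002, "deadline": 0.005}
loss = {"stale": 3.0, "timeout": 5.0, "reorder": 8.0, "deadline": 12.0}  # bps
print(expected_clock_cost(p, loss))  # ~ 0.186 bps
```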
2) Observability contract (required fields)
A) Clock provenance (every host/service)
- active time source (`ptp`, `ntp`, `phc`, vendor local clock)
- smear policy (`none`, `linear24h`, `custom`)
- measured offset/drift to reference source
- leap indicator state
- monotonic-vs-realtime delta stability
B) Execution timeline (point-in-time)
- decision timestamp (monotonic + realtime pair)
- send/ACK/reject/fill timestamps
- timeout budget and timeout firing instant
- feature-age-at-decision and estimated feature-age-at-arrival
C) Data quality tags
- source clock domain ID per event
- confidence class: `TRUSTED`, `DEGRADED`, `UNRECONCILED`
Without explicit clock-domain tags, replay/TCA can fabricate causal narratives.
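As one possible shape for this contract, a minimal per-event schema sketch; field names are illustrative, not a reference to any production message format:

```python
# Illustrative event schema for the observability contract above.
from dataclasses import dataclass
from enum import Enum

class ClockConfidence(Enum):
    TRUSTED = "trusted"
    DEGRADED = "degraded"
    UNRECONCILED = "unreconciled"

@dataclass(frozen=True)
class ExecutionEvent:
    clock_domain_id: str      # e.g. "gw-ptp-1", "engine-ntp-smeared"
    time_source: str          # "ptp" | "ntp" | "phc" | "vendor_local"
    smear_policy: str         # "none" | "linear24h" | "custom"
    ts_realtime_ns: int       # wall-clock timestamp
    ts_monotonic_ns: int      # monotonic timestamp (interval-safe)
    offset_to_ref_ns: int     # measured offset to reference source
    leap_indicator: int       # 0 = none, 1 = insertion pending, ...
    confidence: ClockConfidence
```

Carrying both realtime and monotonic timestamps on every event is what later lets replay re-derive ordering under a unified reference clock.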
3) Practical model components
A) Clock split detector
States:
- TIME_ALIGNED — negligible cross-domain offset
- SMEAR_MIX_RISK — smear and non-smear coexist, growing divergence
- CLOCK_SPLIT_ACTIVE — offset above safety threshold; branch risk material
- RECOVERY_RECONCILE — offsets normalize, reconciliation still pending
Implementation: rules + hysteresis first, HMM later.
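A rules-plus-hysteresis sketch of the detector; the thresholds and dwell count below are placeholders to be calibrated per deployment:

```python
# Clock split detector: rules + hysteresis. All thresholds are placeholders.

ENTER_SPLIT_NS = 50_000_000   # 50 ms: offset above this => split is material
EXIT_SPLIT_NS = 10_000_000    # 10 ms: must drop below this to begin recovery
DWELL_SAMPLES = 5             # consecutive clean samples before realigning

class ClockSplitDetector:
    def __init__(self):
        self.state = "TIME_ALIGNED"
        self.clean = 0

    def update(self, abs_offset_ns: int, mixed_smear: bool) -> str:
        if abs_offset_ns >= ENTER_SPLIT_NS:
            self.state, self.clean = "CLOCK_SPLIT_ACTIVE", 0
        elif self.state == "CLOCK_SPLIT_ACTIVE":
            if abs_offset_ns < EXIT_SPLIT_NS:
                self.state = "RECOVERY_RECONCILE"
        elif self.state == "RECOVERY_RECONCILE":
            # Hysteresis: require a dwell period of consecutive clean samples.
            self.clean = self.clean + 1 if abs_offset_ns < EXIT_SPLIT_NS else 0
            if self.clean >= DWELL_SAMPLES:
                self.state = "TIME_ALIGNED"
        else:
            self.state = "SMEAR_MIX_RISK" if mixed_smear else "TIME_ALIGNED"
        return self.state
```

Note the asymmetric enter/exit thresholds plus the dwell counter: the detector can only leave `CLOCK_SPLIT_ACTIVE` through `RECOVERY_RECONCILE`, which matches the no-direct-RED-to-GREEN rule in section 4.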
B) Timeout misfire model
Estimate:
$$ P(\text{false timeout} \mid \delta_{clk}, \text{venue}, \text{path}, \text{load}) $$
False timeout = local timeout fires while external execution path was still healthy.
C) Causal inversion model
Estimate probability that event ordering flips under clock drift:
$$ P(T_{ack} < T_{send}) \;\text{or}\; P(T_{fill} < T_{ack}) \text{ (after normalization)} $$
Use to down-weight fragile labels in online learning and TCA attribution.
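A sketch of the normalize-then-check step, assuming per-domain offsets are available from the clock provenance telemetry above:

```python
# Flag causal inversions after mapping each timestamp into a single
# reference domain via its measured domain offset (assumed available).

def normalize(ts_ns: int, domain_offset_ns: int) -> int:
    return ts_ns - domain_offset_ns

def causal_inversions(events: list[tuple[str, int, int]]) -> list[str]:
    """events: (name, ts_ns, domain_offset_ns) in required causal order,
    e.g. [("send", ...), ("ack", ...), ("fill", ...)].
    Returns the violated adjacent pairs."""
    norm = [(name, normalize(ts, off)) for name, ts, off in events]
    return [f"{a} >= {b}" for (a, ta), (b, tb) in zip(norm, norm[1:]) if ta >= tb]
```

Any non-empty result marks the event chain as fragile, and its labels can be down-weighted in online learning and TCA attribution.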
D) Deadline convexity under clock uncertainty
$$ \Delta C_{deadline}(u, \delta_{clk}) = C_{forced}(u, \delta_{clk}) - C_{smooth}(u) $$

As the deadline approaches, even a modest $\delta_{clk}$ inflates forced-aggression probability.
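A toy illustration only, with an invented, uncalibrated functional form: a clock offset that effectively shortens the remaining time steepens a convex participation-style cost:

```python
# Toy deadline-convexity illustration. u = fraction of the parent order
# still unfilled, t_left_s = time to deadline. Functional form is an
# assumption (square-root-impact style), not a calibrated model.

def forced_cost(u: float, t_left_s: float, delta_clk_s: float,
                impact_bps: float = 10.0) -> float:
    # If the local clock runs ahead of the venue clock, effective time
    # to deadline shrinks by the offset.
    t_eff = max(t_left_s - delta_clk_s, 1e-3)
    # Convex in required participation intensity u / t_eff.
    return impact_bps * (u / t_eff) ** 0.5
```

The point of the sketch is the interaction: the same offset that is harmless mid-schedule dominates the cost once `t_left_s` is of the same order as `delta_clk_s`.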
4) Live execution controller
GREEN — TIME_ALIGNED
- Normal policy.
AMBER — SMEAR_MIX_RISK
- tighten stale-signal admission windows,
- reduce child slice size and burst tolerance,
- increase timeout buffers with bounded retry pacing,
- mark TCA confidence as degraded.
RED — CLOCK_SPLIT_ACTIVE
- disable short-half-life alpha branches,
- freeze aggressive deadline catch-up logic,
- cap participation and avoid panic-cross loops,
- switch to conservative completion mode with explicit operator alert.
BLUE — RECOVERY_RECONCILE
- keep caps until offset and ordering metrics stabilize for dwell period,
- reconcile positions/fills using monotonic sequence + venue IDs,
- no direct RED→GREEN jump.
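The regime gating and the no-direct-RED-to-GREEN rule can be encoded as a small transition table; cap values below are placeholders:

```python
# Regime-gated policy caps and allowed regime transitions. All numeric
# values are placeholders; the structural point is the explicit
# RED -> BLUE -> GREEN path (no direct RED -> GREEN jump).

POLICY = {
    "TIME_ALIGNED":       {"max_slice": 1.00, "timeout_mult": 1.0, "short_alpha": True},
    "SMEAR_MIX_RISK":     {"max_slice": 0.50, "timeout_mult": 1.5, "short_alpha": True},
    "CLOCK_SPLIT_ACTIVE": {"max_slice": 0.20, "timeout_mult": 2.0, "short_alpha": False},
    "RECOVERY_RECONCILE": {"max_slice": 0.35, "timeout_mult": 1.5, "short_alpha": False},
}

ALLOWED = {
    "TIME_ALIGNED":       {"SMEAR_MIX_RISK", "CLOCK_SPLIT_ACTIVE"},
    "SMEAR_MIX_RISK":     {"TIME_ALIGNED", "CLOCK_SPLIT_ACTIVE"},
    "CLOCK_SPLIT_ACTIVE": {"RECOVERY_RECONCILE"},  # never straight to GREEN
    "RECOVERY_RECONCILE": {"TIME_ALIGNED", "CLOCK_SPLIT_ACTIVE"},
}

def transition(current: str, proposed: str) -> str:
    """Clamp an illegal regime jump to the current state."""
    return proposed if proposed in ALLOWED[current] else current
```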
5) KPI stack
- CMD (Clock Mismatch Distance): p95 absolute cross-domain offset
- FTR (False Timeout Rate): timeout-fired but later-valid ACK/fill fraction
- CII (Causal Inversion Incidence): impossible ordering rate after normalization
- SBF (Stale-Branch Fraction): fraction of decisions violating signal freshness because of clock mismatch
- DCS-CLK (Deadline Convexity Slope, clock-conditioned): urgency-cost steepness conditioned on clock regime
Alert examples:
- CMD↑ + CII↑ for N minutes => enter RED
- FTR spike with stable network RTT => likely clock, not transport
- SBF above weekly p95 near auction windows => force conservative mode
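The first two alert examples, written as predicates over a KPI snapshot; threshold names and values are placeholders:

```python
# Alert predicates over the KPI stack. Thresholds are illustrative.

def enter_red(cmd_p95_ns: int, cii_rate: float,
              cmd_thr_ns: int = 50_000_000, cii_thr: float = 0.001) -> bool:
    # CMD up together with CII up => clock split, escalate to RED.
    return cmd_p95_ns > cmd_thr_ns and cii_rate > cii_thr

def likely_clock_not_transport(ftr: float, rtt_ratio: float,
                               ftr_thr: float = 0.02,
                               rtt_band: float = 0.1) -> bool:
    # FTR spike while RTT stays within +/-10% of baseline => suspect the
    # clocks, not the venue or network path.
    return ftr > ftr_thr and abs(rtt_ratio - 1.0) < rtt_band
```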
6) Validation ladder
Historical relabeling
Recompute event timelines under a unified reference clock and compare branch outcomes.
Counterfactual replay
Simulate AMBER/RED controls during known clock-divergence windows.
Leap-window game day
Inject synthetic smear/non-smear offsets in staging; verify timeout and ordering resilience.
Shadow -> canary
Shadow the detector first, then canary the controls on low-risk flow.
Primary success metric: lower q95/q99 slippage and fewer timeout/retry bursts with bounded completion degradation.
7) 14-day implementation plan
Days 1-2
Add clock-domain metadata and smear-policy telemetry to all execution events.
Days 3-4
Build CMD/FTR/CII/SBF dashboards and baseline distributions.
Days 5-6
Implement TIME_ALIGNED/SMEAR_MIX_RISK/CLOCK_SPLIT rules + hysteresis.
Days 7-9
Train timeout-misfire and stale-arrival branch estimators.
Days 10-11
Integrate AMBER/RED policy guards in shadow mode.
Days 12-13
Canary enablement with hard rollback gates.
Day 14
Finalize incident runbook + reconciliation checklist.
Common mistakes
Mixing smeared NTP and non-smeared PTP without explicit policy
This is a documented risk on cloud platforms and creates silent split-brain time.
Using the realtime clock for interval measurement
Interval logic (timeouts, RTT estimates) should rely on monotonic clocks.
Assuming transport degradation when timeouts spike
False-timeout spikes can be clock artifacts, not venue/network deterioration.
TCA attribution without clock-confidence tagging
You will overfit policy to timestamp artifacts.
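The monotonic-clock point can be made concrete with a small timeout helper: `time.monotonic()` is unaffected by wall-clock steps (including leap adjustments), so interval logic keyed to it cannot misfire when `CLOCK_REALTIME` jumps:

```python
# Interval measurement with the monotonic clock. Using time.time() here
# would make the budget sensitive to wall-clock steps; time.monotonic()
# is not.
import time

def wait_with_timeout(done, budget_s: float, poll_s: float = 0.01) -> bool:
    """Poll done() until it returns True or the monotonic budget expires."""
    deadline = time.monotonic() + budget_s
    while time.monotonic() < deadline:
        if done():
            return True
        time.sleep(poll_s)
    return False
```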
Bottom line
Leap-second handling is no longer just a timekeeping detail.
In mixed clock-source environments, smear-vs-step divergence can become a first-class slippage driver through timeout, ordering, and deadline-convexity branches.
Model it explicitly, gate live policy by clock-regime state, and enforce conservative recovery + reconciliation. That is how you prevent "time semantics" from turning into hidden basis-point leakage.
References
- Google Public NTP — Leap Smear: https://developers.google.com/time/smear
- AWS EC2 Time Sync docs (smear vs PTP non-smear caution): https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html
- BIPM — Resolution 4 (27th CGPM, 2022), future UTC/leap-second direction: https://www.bipm.org/en/cgpm-2022/resolution-4
- Cloudflare postmortem: leap-second time assumptions and DNS impact: https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/
- LKML thread: 2012 leap-second futex timeout behavior: https://lkml.iu.edu/hypermail/linux/kernel/1206.3/03186.html