Leap-Smear Mixed Time Sources & Timeout-Convexity Slippage Playbook

2026-03-25 · finance

Date: 2026-03-25
Category: research
Audience: low-latency execution operators using mixed NTP/PTP/clock domains


Why this note

Most execution systems treat "clock quality" as a monitoring concern, not a slippage variable.

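That assumption fails around leap-second handling windows when infrastructure mixes leap-smeared time sources (for example, smeared NTP) with non-smeared ones (for example, PTP).

Even if both are "correct" relative to their own standard, they can disagree by up to ~0.5 s around the event, depending on smear profile. In execution pipelines, this can quietly produce stale arrivals, false timeouts, event reordering, and forced aggression near deadlines.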

Treating this as generic latency noise is a category error. It is a clock-regime branch risk.
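To see where the ~0.5 s figure comes from, here is a minimal sketch. It assumes a positive leap second, a linear smear, and a stepping clock that applies the whole second at the midpoint of the smear window. The function name and the fraction-based parameterization are illustrative, not taken from any particular time service.

```python
def smear_minus_step_offset(frac_elapsed: float) -> float:
    """Offset in seconds of a linearly smeared clock minus a stepping clock.

    frac_elapsed is the fraction of the smear window that has elapsed (0..1).
    Assumptions (illustrative): positive leap second, linear smear, and the
    stepping clock applies the full second at the window midpoint.
    """
    if not 0.0 <= frac_elapsed <= 1.0:
        return 0.0  # outside the window the two clocks agree
    absorbed_by_smear = frac_elapsed  # smear absorbs the second gradually
    absorbed_by_step = 1.0 if frac_elapsed >= 0.5 else 0.0  # step absorbs it at once
    return absorbed_by_step - absorbed_by_smear
```

Under these assumptions the smeared clock trails the stepping clock by up to 0.5 s just before the step, leads it by 0.5 s just after, and the gap then decays linearly to zero by the end of the window.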


1) Cost decomposition with clock-regime term

For a child-order decision at time (t):

[ \mathbb{E}[C_t] = \mathbb{E}[C_{base} \mid a_t, s_t] + \lambda_{clk}\,\mathbb{E}[C_{clk} \mid r_t] ]

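Where (C_{base}) is the baseline cost given action (a_t) and state (s_t), (C_{clk}) is the clock-induced cost given clock regime (r_t), and (\lambda_{clk}) weights the clock term.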

Branch approximation:

[ C_{clk} \approx p_{stale}L_{stale} + p_{timeout}L_{timeout} + p_{reorder}L_{reorder} + p_{deadline}L_{deadline} ]

Key latent variable: (\delta_{clk}), the cross-domain clock offset.

When (|\delta_{clk}|) grows, branch probabilities rise nonlinearly.
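The two formulas above can be sketched numerically as follows, treating (\delta_{clk}) as the cross-domain clock offset. The logistic link from (|\delta_{clk}|) to a branch probability is only one possible nonlinear shape and is an assumption of this sketch, as are all function and parameter names.

```python
import math

BRANCHES = ("stale", "timeout", "reorder", "deadline")


def branch_prob(abs_delta_s: float, midpoint_s: float, steepness: float) -> float:
    """Hypothetical logistic link from |delta_clk| (seconds) to a branch probability."""
    return 1.0 / (1.0 + math.exp(-steepness * (abs_delta_s - midpoint_s)))


def clock_cost(probs: dict, losses: dict) -> float:
    """C_clk as the sum over branches of p_branch * L_branch."""
    return sum(probs[b] * losses[b] for b in BRANCHES)


def expected_cost(c_base: float, lam_clk: float, c_clk: float) -> float:
    """E[C_t] = E[C_base | a_t, s_t] + lambda_clk * E[C_clk | r_t]."""
    return c_base + lam_clk * c_clk
```

In practice each branch would get its own fitted link and loss estimate; the point of the sketch is only that the clock term enters the cost as an explicit, separately weighted component.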


2) Observability contract (required fields)

A) Clock provenance (every host/service)

B) Execution timeline (point-in-time)

C) Data quality tags

Without explicit clock-domain tags, replay/TCA can fabricate causal narratives.
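As a rough illustration of explicit clock-domain tagging, the sketch below attaches provenance to every timestamped event and refuses to treat timestamps from different domains as directly comparable. All field names and example values are hypothetical; they stand in for whatever fields the contract actually requires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockTag:
    clock_domain: str   # hypothetical values, e.g. "ntp" or "ptp"
    smear_policy: str   # hypothetical values, e.g. "smear" or "step"
    confidence: float   # hypothetical 0..1 clock-confidence tag


@dataclass(frozen=True)
class TaggedEvent:
    name: str
    ts_ns: int
    tag: ClockTag


def directly_comparable(a: TaggedEvent, b: TaggedEvent) -> bool:
    """True only if both timestamps share a clock domain and smear policy.

    Simplification: two hosts in the same domain can still disagree; this
    check only catches the cross-domain case.
    """
    return (a.tag.clock_domain == b.tag.clock_domain
            and a.tag.smear_policy == b.tag.smear_policy)
```

A replay or TCA job could use a check like this to route cross-domain event pairs through offset normalization instead of subtracting raw timestamps.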


3) Practical model components

A) Clock split detector

States:

  1. TIME_ALIGNED — negligible cross-domain offset
  2. SMEAR_MIX_RISK — smear and non-smear coexist, growing divergence
  3. CLOCK_SPLIT_ACTIVE — offset above safety threshold; branch risk material
  4. RECOVERY_RECONCILE — offsets normalize, reconciliation still pending

Implementation: rules + hysteresis first, HMM later.
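A rules-plus-hysteresis detector over the four states could look like the sketch below. The threshold values, the `mixed_sources` and `reconciled` inputs, and the class name are assumptions chosen for illustration; real thresholds would come from baseline offset distributions.

```python
from enum import Enum


class Regime(Enum):
    TIME_ALIGNED = 1
    SMEAR_MIX_RISK = 2
    CLOCK_SPLIT_ACTIVE = 3
    RECOVERY_RECONCILE = 4


class ClockSplitDetector:
    """Rule-based regime detector; enter thresholds sit above exit thresholds."""

    def __init__(self, risk_enter=0.001, risk_exit=0.0005,
                 split_enter=0.050, split_exit=0.020):
        # Thresholds in seconds; the defaults are hypothetical placeholders.
        self.risk_enter, self.risk_exit = risk_enter, risk_exit
        self.split_enter, self.split_exit = split_enter, split_exit
        self.state = Regime.TIME_ALIGNED

    def update(self, abs_offset_s, mixed_sources, reconciled=False):
        s = self.state
        if s is Regime.TIME_ALIGNED:
            if abs_offset_s >= self.split_enter:
                s = Regime.CLOCK_SPLIT_ACTIVE
            elif mixed_sources and abs_offset_s >= self.risk_enter:
                s = Regime.SMEAR_MIX_RISK
        elif s is Regime.SMEAR_MIX_RISK:
            if abs_offset_s >= self.split_enter:
                s = Regime.CLOCK_SPLIT_ACTIVE
            elif not mixed_sources or abs_offset_s < self.risk_exit:
                s = Regime.TIME_ALIGNED
        elif s is Regime.CLOCK_SPLIT_ACTIVE:
            # Leave the split state only once the offset drops below the lower exit bound.
            if abs_offset_s < self.split_exit:
                s = Regime.RECOVERY_RECONCILE
        else:  # RECOVERY_RECONCILE
            if abs_offset_s >= self.split_enter:
                s = Regime.CLOCK_SPLIT_ACTIVE
            elif reconciled and abs_offset_s < self.risk_exit:
                s = Regime.TIME_ALIGNED
        self.state = s
        return s
```

The gap between each enter and exit threshold is what prevents the state from flapping when the offset hovers near a boundary, and the `reconciled` flag keeps the detector in recovery until reconciliation is confirmed even if offsets have already normalized.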

B) Timeout misfire model

Estimate:

[ P(\text{false timeout} \mid \delta_{clk}, venue, path, load) ]

False timeout = local timeout fires while external execution path was still healthy.
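Before fitting any model, an empirical baseline for this probability can be read off bucketed counts. The sketch below conditions only on (|\delta_{clk}|) and ignores venue, path, and load for brevity; the event tuple layout and bucket edges are assumptions.

```python
from bisect import bisect_right
from collections import defaultdict


def false_timeout_rate(events, edges):
    """Empirical P(false timeout) per |delta_clk| bucket.

    events: iterable of (abs_delta_s, timed_out, path_healthy) tuples.
    edges:  sorted bucket boundaries in seconds; bucket i covers offsets
            from edges[i-1] (inclusive) up to edges[i] (exclusive).
    A timeout counts as false when it fired while the path was healthy.
    """
    fired = defaultdict(int)
    total = defaultdict(int)
    for abs_delta_s, timed_out, path_healthy in events:
        bucket = bisect_right(edges, abs_delta_s)
        total[bucket] += 1
        if timed_out and path_healthy:
            fired[bucket] += 1
    return {bucket: fired[bucket] / total[bucket] for bucket in total}
```

Deciding whether the path was "healthy" needs an independent signal, such as a late acknowledgement measured on a monotonic clock, which this sketch takes as given.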

C) Causal inversion model

Estimate probability that event ordering flips under clock drift:

[ P(T_{ack} < T_{send})\;\text{or}\;P(T_{fill} < T_{ack})\text{ (after normalization)} ]

Use to down-weight fragile labels in online learning and TCA attribution.
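One simple way to put a number on the inversion probability is to assume the residual cross-domain timestamp error is zero-mean Gaussian with standard deviation sigma. That distributional choice, the weighting rule, and both function names are assumptions of this sketch.

```python
import math


def p_inversion(true_gap_s: float, sigma_s: float) -> float:
    """P(observed gap < 0) when observed gap = true gap + N(0, sigma^2) error."""
    if sigma_s <= 0.0:
        return 1.0 if true_gap_s < 0.0 else 0.0
    z = -true_gap_s / sigma_s
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF at z


def label_weight(true_gap_s: float, sigma_s: float) -> float:
    """Hypothetical training weight: 1 for a certain ordering, 0 for a coin flip."""
    return max(0.0, 1.0 - 2.0 * p_inversion(true_gap_s, sigma_s))
```

With a 1 ms expected gap and 10 ms of clock uncertainty the ordering is close to a coin flip, so the label contributes almost nothing; with the same gap and microsecond-level uncertainty the weight is effectively 1.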

D) Deadline convexity under clock uncertainty

[ \Delta C_{deadline}(u,\delta_{clk}) = C_{forced}(u,\delta_{clk}) - C_{smooth}(u) ]

As deadline approaches, even modest (\delta_{clk}) inflates forced-aggression probability.
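The convexity can be illustrated with a deliberately simple toy model. It assumes (u) is the time remaining to the deadline, that clock uncertainty removes (|\delta_{clk}|) of usable time, and that execution cost scales as quantity squared over usable time. None of these modeling choices comes from the note itself.

```python
import math


def smooth_cost(qty: float, time_left_s: float, k: float = 1.0) -> float:
    """Toy cost of working qty evenly over the full remaining time."""
    return k * qty * qty / time_left_s


def forced_cost(qty: float, time_left_s: float, abs_delta_s: float, k: float = 1.0) -> float:
    """Toy cost when clock uncertainty shrinks the usable time window."""
    usable_s = time_left_s - abs_delta_s
    if usable_s <= 0.0:
        return math.inf  # no usable time left under this toy model
    return k * qty * qty / usable_s


def deadline_penalty(qty: float, time_left_s: float, abs_delta_s: float, k: float = 1.0) -> float:
    """Delta C_deadline = C_forced(u, delta_clk) - C_smooth(u)."""
    return forced_cost(qty, time_left_s, abs_delta_s, k) - smooth_cost(qty, time_left_s, k)
```

With 10 units to trade and a 0.5 s offset, the penalty is about 0.53 cost units when 10 s remain but 100 cost units when only 1 s remains, which is the convex blow-up near the deadline that the formula describes.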


4) Live execution controller

GREEN — TIME_ALIGNED

AMBER — SMEAR_MIX_RISK

RED — CLOCK_SPLIT_ACTIVE

BLUE — RECOVERY_RECONCILE
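The mapping from detector state to controller mode can be kept as a small lookup. In this sketch the fail-safe default of RED for an unrecognized state is an assumption, not something the playbook prescribes.

```python
REGIME_TO_MODE = {
    "TIME_ALIGNED": "GREEN",
    "SMEAR_MIX_RISK": "AMBER",
    "CLOCK_SPLIT_ACTIVE": "RED",
    "RECOVERY_RECONCILE": "BLUE",
}


def controller_mode(regime: str) -> str:
    """Return the controller mode for a regime; unknown regimes fall back to RED (assumed)."""
    return REGIME_TO_MODE.get(regime, "RED")
```

Keeping the mapping in data rather than branching logic makes it easy to audit which policy guards are tied to which clock regime.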


5) KPI stack

Alert examples:


6) Validation ladder

  1. Historical relabeling
    Recompute event timelines under unified reference clock and compare branch outcomes.

  2. Counterfactual replay
    Simulate AMBER/RED controls during known clock divergence windows.

  3. Leap-window game day
    Inject synthetic smear/non-smear offsets in staging; verify timeout and ordering resilience.

  4. Shadow -> canary
    Shadow detector first, then canary controls on low-risk flow.

Primary success metric: lower q95/q99 slippage and fewer timeout/retry bursts with bounded completion degradation.
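For the historical relabeling step, a minimal sketch of normalizing timestamps to a unified reference clock and checking whether the apparent event order changes is shown below. The per-domain offset table, the domain labels, and the event tuple layout are hypothetical.

```python
def to_reference(ts_s: float, domain: str, offsets: dict) -> float:
    """Map a domain timestamp onto the reference clock.

    offsets[domain] is (domain clock - reference clock) in seconds.
    """
    return ts_s - offsets[domain]


def ordering_flips(events, offsets) -> bool:
    """True if sorting by raw timestamps and by normalized timestamps disagree.

    events: list of (name, ts_s, domain) tuples.
    """
    raw_order = [name for name, _, _ in sorted(events, key=lambda e: e[1])]
    normalized_order = [
        name for name, _, _ in
        sorted(events, key=lambda e: to_reference(e[1], e[2], offsets))
    ]
    return raw_order != normalized_order
```

For example, a send stamped 100.000 on a stepping PTP host and an ack stamped 99.800 on a smeared NTP host that is running 0.4 s behind the reference look inverted in the raw data, but normalize to send at 100.0 and ack at 100.2.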


7) 14-day implementation plan

Days 1-2
Add clock-domain metadata and smear-policy telemetry to all execution events.

Days 3-4
Build CMD/FTR/CII/SBF dashboards and baseline distributions.

Days 5-6
Implement TIME_ALIGNED/SMEAR_MIX_RISK/CLOCK_SPLIT_ACTIVE rules + hysteresis.

Days 7-9
Train timeout-misfire and stale-arrival branch estimators.

Days 10-11
Integrate AMBER/RED policy guards in shadow mode.

Days 12-13
Canary enablement with hard rollback gates.

Day 14
Finalize incident runbook + reconciliation checklist.


Common mistakes

  1. Mixing smeared NTP and non-smeared PTP without explicit policy
    This is a documented risk on cloud platforms and creates silent split-brain time.

  2. Using realtime clock for interval measurement
    Interval logic (timeouts, RTT estimates) should rely on monotonic clocks.

  3. Assuming transport degradation when timeouts spike
    False timeout spikes can be clock artifacts, not venue/network deterioration.

  4. TCA attribution without clock-confidence tagging
    You will overfit policy to timestamp artifacts.
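Mistake 2 is the easiest to fix in code. The sketch below measures a timeout against Python's `time.monotonic`, which is not affected by wall-clock steps or adjustments to the system time. The helper name, polling approach, and injectable `clock` and `sleep` parameters are illustrative choices.

```python
import time


def wait_until(predicate, timeout_s, poll_s=0.001,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns True or timeout_s elapses.

    The deadline is computed on a monotonic clock, so a step or smear of
    the realtime clock cannot make the timeout fire early or late.
    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

Wall-clock timestamps still belong on the event record for cross-host correlation; the monotonic clock is only for the local interval.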


Bottom line

Leap-second handling is no longer just a timekeeping detail.

In mixed clock-source environments, smear-vs-step divergence can become a first-class slippage driver through timeout, ordering, and deadline-convexity branches.

Model it explicitly, gate live policy by clock-regime state, and enforce conservative recovery + reconciliation. That is how you prevent "time semantics" from turning into hidden basis-point leakage.

