The Exactly-Once Illusion: Distributed Cron & Scheduler Correctness Playbook

2026-03-02 · software

Category: knowledge
Domain: software / distributed systems / operations

Why this matters

Periodic jobs look simple until they fail in production: duplicate runs, silent misses, and retries that collide with the next scheduled run.

In distributed systems, "exactly once" is usually a product requirement implemented on top of at-least-once infrastructure.


Core operating principle

Treat schedule dispatch as control-plane intent, and job side effects as data-plane transactions.

That means:

  1. Scheduler guarantees: when/what should run
  2. Worker guarantees: what side effects are allowed once per logical run
  3. Reconciliation guarantees: what to do when either side is uncertain

Failure model you should assume (always)

  1. Scheduler leader crashes after sending some (not all) dispatch RPCs
  2. Job starts, succeeds, but ack/update is lost
  3. Clock skew or controller lag causes late starts and backlog catch-up
  4. Retry policy and next schedule overlap
  5. Control plane emits duplicate creates for the same nominal time

If your design cannot survive these five, it is not production-safe.


Define the unit of uniqueness first

Create a logical run key and use it everywhere:

run_key = <job_id>#<scheduled_time_utc>#<version>

Rules:

  1. Every queue message, DB mutation, and audit event must carry run_key
  2. run_key must be stable across all retries of the same logical run
  3. Bump the version component when the schedule definition changes, so runs under old and new definitions cannot collide
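As a sketch, the key above can be built like this (the function name and timestamp format are illustrative choices, not prescribed by the playbook):

```python
from datetime import datetime, timezone

def make_run_key(job_id: str, scheduled_time: datetime, version: int) -> str:
    """Build the logical run key <job_id>#<scheduled_time_utc>#<version>.

    The scheduled time is normalized to UTC so the same nominal run always
    produces the same key, regardless of the caller's timezone.
    """
    ts = scheduled_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{job_id}#{ts}#{version}"

key = make_run_key("nightly-report",
                   datetime(2026, 3, 2, 4, 0, tzinfo=timezone.utc), 3)
# key == "nightly-report#2026-03-02T04:00:00Z#3"
```

Because the key is deterministic, a retried dispatch for the same nominal time produces the same key and is deduplicated downstream.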


Overlap policy = business semantics

Choose overlap behavior per workload, not per platform default.

Practical mapping:

  1. forbid (skip while a run is active): non-reentrant or stateful work such as settlement or migrations
  2. replace (cancel the running instance, start fresh): freshness-driven work where only the latest result matters
  3. allow (run concurrently): independent, idempotent batches keyed by run_key


Missed-run and catch-up policy must be explicit

Never leave outage behavior implicit.

For each schedule, define:

  1. max_lateness (catch-up window)
  2. backfill mode (none, sequential, bounded_parallel)
  3. safety gate (e.g., stop catch-up when downstream error rate > threshold)

Good default: max_lateness of roughly one schedule interval, sequential backfill, and an automatic halt when the safety gate trips.

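A minimal sketch of the catch-up computation for a fixed-interval schedule (function and parameter names are illustrative):

```python
from datetime import datetime, timedelta, timezone

def missed_runs(last_run: datetime, now: datetime,
                interval: timedelta, max_lateness: timedelta) -> list[datetime]:
    """Nominal times eligible for catch-up after an outage.

    Every nominal time after the last completed run is a candidate, but
    runs older than the max_lateness window are skipped, not replayed.
    """
    cutoff = now - max_lateness
    eligible = []
    t = last_run + interval
    while t <= now:
        if t >= cutoff:
            eligible.append(t)
        t += interval
    return eligible
```

For an hourly job down for five hours with a two-hour window, only the newest nominal times fall inside the cutoff; the older misses are dropped deliberately rather than replayed as a thundering backlog.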

Idempotency is the real correctness boundary

Use a durable idempotency table keyed by run_key, recording at minimum the run state (started, succeeded, failed), a heartbeat timestamp, and a reference to the committed result.

Execution rule:

  1. upsert run_key with compare-and-set semantics
  2. if already succeeded, no-op
  3. if started and stale heartbeat, route to recovery path
  4. commit side effect and success marker atomically where possible

If atomic commit is impossible, use outbox/inbox or compensating transaction patterns.
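The execution rule above can be sketched with SQLite standing in for the durable store (the table layout, state names, and staleness threshold are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE run_state (
        run_key   TEXT PRIMARY KEY,
        state     TEXT NOT NULL,      -- 'started' | 'succeeded'
        heartbeat REAL NOT NULL       -- seconds since epoch
    )""")

def try_claim(run_key: str, now: float, stale_after: float = 300.0) -> str:
    """Compare-and-set claim on a logical run.

    Returns 'run' if this worker may execute, 'noop' if the run already
    succeeded (or a live attempt holds it), 'recover' if a prior attempt
    holds a stale heartbeat and should go through the recovery path.
    """
    try:
        # The primary key makes the insert a CAS: only one claimant wins.
        conn.execute("INSERT INTO run_state VALUES (?, 'started', ?)",
                     (run_key, now))
        return "run"
    except sqlite3.IntegrityError:
        state, hb = conn.execute(
            "SELECT state, heartbeat FROM run_state WHERE run_key = ?",
            (run_key,)).fetchone()
        if state == "succeeded":
            return "noop"                   # rule 2: already done
        if now - hb > stale_after:
            return "recover"                # rule 3: stale heartbeat
        return "noop"                       # a live attempt holds the run

def mark_succeeded(run_key: str) -> None:
    conn.execute("UPDATE run_state SET state = 'succeeded' WHERE run_key = ?",
                 (run_key,))
```

In production the success marker should be written in the same transaction as the side effect wherever the store allows it; this sketch only shows the claim semantics.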


Leasing and fencing for active-active safety

Leader election alone is insufficient.

Use a lease epoch (fencing token): each lease acquisition takes a strictly larger epoch, every command from the leader carries it, and workers and downstream stores reject any command whose epoch is lower than the highest they have seen.

This blocks split-brain leaders from issuing valid-looking late commands.
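A toy illustration of the rejection rule (class and method names are invented for the example):

```python
class FencedStore:
    """A downstream store that rejects writes carrying a stale lease epoch.

    Each new leader acquires a strictly larger epoch; a command from a
    partitioned old leader arrives with a smaller number and is refused,
    no matter how valid it otherwise looks.
    """
    def __init__(self) -> None:
        self.highest_epoch = 0
        self.data: dict[str, str] = {}

    def write(self, epoch: int, key: str, value: str) -> bool:
        if epoch < self.highest_epoch:
            return False               # split-brain leader: reject
        self.highest_epoch = epoch
        self.data[key] = value
        return True

store = FencedStore()
store.write(epoch=2, key="run", value="dispatched")   # new leader: accepted
store.write(epoch=1, key="run", value="dispatched")   # old leader: rejected
```

The check must live in the store (or worker), not in the leader: a leader that believes it still holds the lease is exactly the failure being defended against.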


Retry policy should not fight schedule policy

A common production bug: a long exponential-backoff retry policy keeps retrying a failed run past the next nominal schedule time, so the retry and the next run execute concurrently and double their side effects.

Guardrails:

  1. Cap total retry duration below the schedule interval, minus a safety margin
  2. Carry the original run_key through every retry so idempotency still holds
  3. Cancel outstanding retries once the run's max_lateness window expires
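One such guardrail, capping the retry budget below the schedule interval, might look like this (the names and the five-minute margin are illustrative):

```python
from datetime import datetime, timedelta, timezone

def retry_deadline(scheduled: datetime, interval: timedelta,
                   safety_margin: timedelta = timedelta(minutes=5)) -> datetime:
    """Latest instant at which a retry of this logical run may start.

    Retries must stop before the next nominal run begins, so a
    still-retrying run can never collide with its successor.
    """
    return scheduled + interval - safety_margin

def should_retry(now: datetime, scheduled: datetime,
                 interval: timedelta) -> bool:
    return now < retry_deadline(scheduled, interval)
```

For an hourly schedule, a run nominally at 04:00 stops retrying at 04:55, leaving the 05:00 run a clean slot.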


Time correctness rules (non-negotiable)

  1. Store and compare schedule times in UTC
  2. Persist canonical scheduled timestamp with each run
  3. Version-control timezone intent and DST policy
  4. Alert on clock skew and scheduler/controller lag

Human-readable local time is a display concern, not storage truth.
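A small illustration of the storage-vs-display split, assuming Python's zoneinfo database is available:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Schedule intent is expressed in a named zone ("9am Eastern, daily"),
# but the canonical stored timestamp is always UTC. Local rendering is
# recomputed at display time, never persisted as truth.
intent = datetime(2026, 3, 2, 9, 0, tzinfo=ZoneInfo("America/New_York"))
stored_utc = intent.astimezone(timezone.utc)   # storage truth: 14:00 UTC
display = stored_utc.astimezone(ZoneInfo("America/New_York"))
```

Version-controlling the zone name and the DST policy alongside the schedule is what makes the UTC conversion reproducible after a tzdata update.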


Load-shaping: avoid synchronized thundering herds

If thousands of nodes run 00 * * * *, you manufactured your own incident.

Use both:

  1. Deterministic per-node splay (e.g., hash of node ID modulo the schedule window)
  2. Bounded random jitter on top, to break any residual alignment

Goal: preserve cadence while spreading load.
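A sketch of the deterministic half, hash-based splay (the function name is illustrative):

```python
import hashlib

def splay_seconds(node_id: str, window_seconds: int = 3600) -> int:
    """Deterministic per-node offset within the schedule window.

    Hashing keeps each node's cadence stable run-to-run while spreading
    the fleet roughly uniformly across the window, so no wall-clock
    minute carries the whole herd.
    """
    digest = hashlib.sha256(node_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds
```

Each node then fires at nominal_time + splay_seconds(node_id) plus a small random jitter, preserving the hourly cadence without the synchronized spike.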


Observability contract for scheduled systems

Track at least these metrics:

  1. Schedule-to-start latency (nominal time vs. actual start)
  2. Run duration and success/failure counts per job
  3. Duplicate executions detected per run_key
  4. Backlog depth and catch-up progress
  5. Scheduler leader changes and lease-epoch bumps

And one crucial derived SLI:

run_completeness = succeeded_runs / expected_runs

This catches silent misses better than raw failure rate.
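Computed over run keys, the SLI might look like this (a minimal sketch; the set-based inputs are an assumption):

```python
def run_completeness(succeeded_run_keys: set[str],
                     expected_run_keys: set[str]) -> float:
    """Silent-miss detector.

    Failure rate only sees runs that started; completeness compares
    against the runs the schedule promised, so a run that was never
    dispatched still drags the ratio down.
    """
    if not expected_run_keys:
        return 1.0
    hits = len(succeeded_run_keys & expected_run_keys)
    return hits / len(expected_run_keys)
```

An hourly job with 23 successes, zero recorded failures, and one never-dispatched run reports 0% failure rate but only ~95.8% completeness, which is the alert you actually want.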


Incident runbook (minimum)

When alerts fire:

  1. freeze schedule if duplicates are suspected
  2. identify affected run_key range
  3. reconcile state from idempotency table + downstream ledger
  4. replay only missing keys through controlled backfill mode
  5. publish postmortem including policy mismatch (overlap/catch-up/retry)

10-point pre-production checklist

  1. run_key format defined and carried end-to-end
  2. Overlap policy chosen per workload, not inherited from the platform default
  3. max_lateness and backfill mode set for every schedule
  4. Catch-up safety gate wired to downstream health
  5. Idempotency table with compare-and-set claim semantics in place
  6. Lease epochs enforced by workers and downstream stores
  7. Retry budget capped below the schedule interval
  8. All schedule times stored and compared in UTC
  9. Splay and jitter applied to fleet-wide schedules
  10. run_completeness SLI computed and alerting enabled


Platform notes (quick translation layer)

  1. Kubernetes CronJob: concurrencyPolicy (Allow, Forbid, Replace) maps to overlap policy; startingDeadlineSeconds maps to max_lateness
  2. Airflow: catchup and max_active_runs map to backfill mode and overlap policy
  3. Temporal Schedules: overlap policy and catch-up window are first-class fields

Use these as mechanism primitives, not as a substitute for application-level idempotency.



One-line takeaway

Distributed cron correctness is not "did it run?" — it is "did each logical run cause permitted side effects exactly once, even across crashes, retries, and time drift?"