The Exactly-Once Illusion: Distributed Cron & Scheduler Correctness Playbook
Date: 2026-03-02
Category: knowledge
Domain: software / distributed systems / operations
Why this matters
Periodic jobs look simple until they fail in production:
- duplicate runs trigger double-billing, duplicate emails, or duplicate state transitions
- missed runs silently break compliance, backups, settlements, and reports
- retry storms align on minute boundaries and overload downstream systems
In distributed systems, "exactly once" is usually a product requirement implemented on top of at-least-once infrastructure.
Core operating principle
Treat schedule dispatch as control-plane intent, and job side effects as data-plane transactions.
That means:
- Scheduler guarantees: when/what should run
- Worker guarantees: what side effects are allowed once per logical run
- Reconciliation guarantees: what to do when either side is uncertain
Failure model you should assume (always)
- Scheduler leader crashes after sending some (not all) dispatch RPCs
- Job starts, succeeds, but ack/update is lost
- Clock skew or controller lag causes late starts and backlog catch-up
- Retry policy and next schedule overlap
- Control plane emits duplicate creates for the same nominal time
If your design cannot survive these five, it is not production-safe.
Define the unit of uniqueness first
Create a logical run key and use it everywhere:
run_key = <job_id>#<scheduled_time_utc>#<version>
Rules:
- job_id: stable identity of the schedule intent
- scheduled_time_utc: canonical planned fire time (not worker receive time)
- version: schedule spec version (protects runs across spec edits)
Every queue message, DB mutation, and audit event should carry run_key.
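As a minimal sketch (Python; the helper name and ISO-style timestamp format are our choices), the key can be minted directly from the canonical schedule intent:

```python
from datetime import datetime, timezone

def make_run_key(job_id: str, scheduled_time: datetime, spec_version: int) -> str:
    """Build the logical run key from stable schedule intent.

    scheduled_time must be the canonical planned fire time (not the
    worker's receive time) and must be timezone-aware.
    """
    if scheduled_time.tzinfo is None:
        raise ValueError("scheduled_time must be timezone-aware")
    ts = scheduled_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{job_id}#{ts}#{spec_version}"
```

Because the key is derived from intent rather than execution, a retry, a duplicate dispatch, and the original run all carry the same identity.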
Overlap policy = business semantics
Choose overlap behavior per workload, not per platform default.
- Skip / Forbid: if freshness is less important than single in-flight run
- BufferOne / BufferAll: if every period matters (e.g., hourly reconciliation)
- Replace / Cancel / Terminate: if only latest state matters
- AllowAll: only when side effects are partitioned or naturally idempotent
Practical mapping:
- payroll, settlements, invoices → avoid overlap + explicit backfill process
- cache warmers, price snapshots → replace/cancel can be valid
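One way to make the chosen policy explicit in code is a small decision function. This is a sketch: the enum names loosely mirror the policies above, and buffering is simplified to a single pending slot.

```python
from enum import Enum

class Overlap(Enum):
    SKIP = "skip"              # forbid a second in-flight run
    BUFFER_ONE = "buffer_one"  # queue at most one pending run
    REPLACE = "replace"        # cancel current, start latest
    ALLOW_ALL = "allow_all"    # side effects are partitioned/idempotent

def on_fire(policy: Overlap, in_flight: bool, already_buffered: bool) -> str:
    """Decide what to do when run T+1 fires while run T may still be running."""
    if not in_flight or policy is Overlap.ALLOW_ALL:
        return "start"
    if policy is Overlap.SKIP:
        return "drop"
    if policy is Overlap.BUFFER_ONE:
        return "drop" if already_buffered else "buffer"
    return "cancel_current_then_start"  # REPLACE
```

Keeping this as one explicit function per job class makes the business decision reviewable, instead of burying it in platform defaults.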
Missed-run and catch-up policy must be explicit
Never leave outage behavior implicit.
For each schedule, define:
- max_lateness (catch-up window)
- backfill mode (none, sequential, bounded_parallel)
- safety gate (e.g., stop catch-up when downstream error rate > threshold)
Good default:
- compliance jobs: long catch-up window + sequential replay
- high-volume non-critical jobs: short window + skip stale runs
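The skip-stale-runs behavior can be sketched as follows (a hypothetical helper; assumes a fixed period and a single max_lateness window):

```python
from datetime import datetime, timedelta, timezone

def runs_to_backfill(last_fired: datetime, now: datetime,
                     period: timedelta, max_lateness: timedelta) -> list:
    """Enumerate scheduled times missed since last_fired that still fall
    inside the catch-up window; anything older is skipped as stale."""
    missed = []
    t = last_fired + period
    while t <= now:
        if now - t <= max_lateness:
            missed.append(t)
        t += period
    return missed
```

For an hourly job that last fired at 00:00 and recovers at 03:30 with a two-hour window, only the 02:00 and 03:00 runs are replayed; the 01:00 run is deliberately dropped as stale.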
Idempotency is the real correctness boundary
Use a durable idempotency table keyed by run_key:
- status: received | started | succeeded | failed
- side-effect fingerprint (optional but useful)
- first_seen / last_updated timestamps
Execution rule:
- upsert run_key with compare-and-set semantics
- if already succeeded, no-op
- if started and the heartbeat is stale, route to the recovery path
- commit the side effect and success marker atomically where possible
If atomic commit is impossible, use outbox/inbox or compensating transaction patterns.
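A minimal sketch of the claim path, using SQLite as a stand-in for the durable idempotency store (the table name, status values, and heartbeat threshold are illustrative):

```python
import sqlite3
import time

STALE_HEARTBEAT_S = 300  # illustrative threshold for a dead worker

SCHEMA = """CREATE TABLE IF NOT EXISTS runs (
    run_key TEXT PRIMARY KEY,
    status TEXT NOT NULL,
    first_seen REAL NOT NULL,
    last_updated REAL NOT NULL
)"""

def claim_run(conn: sqlite3.Connection, run_key: str) -> str:
    """Compare-and-set claim of a run_key.

    Returns 'started' if this worker now owns the run, 'duplicate' if it
    already succeeded, 'recover' if a prior attempt looks dead, and
    'in_flight' if someone else is actively running it.
    """
    now = time.time()
    try:
        # INSERT on a primary key is the atomic claim: only one worker wins.
        conn.execute(
            "INSERT INTO runs (run_key, status, first_seen, last_updated) "
            "VALUES (?, 'started', ?, ?)", (run_key, now, now))
        conn.commit()
        return "started"
    except sqlite3.IntegrityError:
        status, last = conn.execute(
            "SELECT status, last_updated FROM runs WHERE run_key = ?",
            (run_key,)).fetchone()
        if status == "succeeded":
            return "duplicate"
        if status == "started" and now - last > STALE_HEARTBEAT_S:
            return "recover"
        return "in_flight"

def mark_succeeded(conn: sqlite3.Connection, run_key: str) -> None:
    conn.execute("UPDATE runs SET status = 'succeeded', last_updated = ? "
                 "WHERE run_key = ?", (time.time(), run_key))
    conn.commit()
```

The primary-key INSERT is the entire mutual-exclusion mechanism: every duplicate dispatch for the same run_key falls into the exception branch and is classified instead of re-executed.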
Leasing and fencing for active-active safety
Leader election alone is insufficient.
Use a lease epoch (fencing token):
- dispatcher includes epoch with each dispatch
- worker rejects commands with a stale epoch
- downstream write path validates epoch monotonicity when applicable
This blocks split-brain leaders from issuing valid-looking late commands.
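The worker-side check itself is tiny; the hard part in practice is persisting the highest observed epoch durably so the bar survives restarts (kept in memory here purely as a sketch):

```python
class FencedWorker:
    """Rejects dispatches carrying a stale lease epoch, so a deposed
    (split-brain) scheduler leader cannot issue valid-looking late commands."""

    def __init__(self) -> None:
        self.highest_epoch = 0  # in production, persist this durably

    def accept(self, epoch: int) -> bool:
        if epoch < self.highest_epoch:
            return False  # stale leader; refuse the dispatch
        self.highest_epoch = epoch
        return True
```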
Retry policy should not fight schedule policy
A common production bug:
- retry of run T collides with next scheduled run T+1
Guardrails:
- cap retry duration to less than period if overlap is forbidden
- bind retries to the same run_key (do not mint a new identity)
- route exhausted retries to a DLQ plus an operator-visible replay command
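Capping the retry window against the period can be sketched like this (exponential backoff; the safety margin and the 24-period horizon for overlap-tolerant jobs are assumed knobs):

```python
from datetime import datetime, timedelta, timezone

def retry_times(scheduled: datetime, period: timedelta, base_delay: timedelta,
                factor: int, overlap_forbidden: bool,
                safety_margin: timedelta) -> list:
    """Exponential-backoff retry times for run T, truncated so they cannot
    spill into run T+1 when overlap is forbidden."""
    if overlap_forbidden:
        deadline = scheduled + period - safety_margin
    else:
        deadline = scheduled + 24 * period  # policy-defined horizon
    times, delay, t = [], base_delay, scheduled
    while True:
        t = t + delay
        if t >= deadline:
            return times
        times.append(t)
        delay *= factor
```

For an hourly job with a 5-minute base delay and doubling backoff, retries land at :05, :15, and :35 past the scheduled time; the next attempt would cross the deadline and is routed to the DLQ instead.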
Time correctness rules (non-negotiable)
- Store and compare schedule times in UTC
- Persist canonical scheduled timestamp with each run
- Version-control timezone intent and DST policy
- Alert on clock skew and scheduler/controller lag
Human-readable local time is a display concern, not storage truth.
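For example (using Python's zoneinfo; the helper name is ours), the same local-time intent resolves to different UTC instants across a DST boundary, which is exactly why UTC must be the stored truth:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def canonical_utc(naive_local: datetime, tz_name: str) -> datetime:
    """Resolve the operator's local-time intent to the canonical UTC instant
    that gets stored and compared everywhere."""
    return naive_local.replace(tzinfo=ZoneInfo(tz_name)).astimezone(timezone.utc)
```

"Daily at 09:00 New York time" is 14:00 UTC in winter (EST) but 13:00 UTC in summer (EDT); persisting the resolved UTC timestamp per run keeps the run_key stable even as the offset shifts.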
Load-shaping: avoid synchronized thundering herds
If thousands of nodes run 00 * * * *, you manufactured your own incident.
Use both:
- stable per-job offset (deterministic jitter per identity)
- bounded random delay where strict alignment is unnecessary
Goal: preserve cadence while spreading load.
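Deterministic jitter can be derived from the job identity itself (the hash choice is illustrative; any stable hash works):

```python
import hashlib

def stable_offset_seconds(job_id: str, window_seconds: int) -> int:
    """Deterministic per-identity offset: the same job always fires at the
    same point inside the window, spreading the fleet without losing cadence."""
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds
```

Unlike a random delay, this keeps each job's cadence perfectly regular (run N+1 is exactly one period after run N) while decorrelating jobs from one another.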
Observability contract for scheduled systems
Track at least these metrics:
- scheduler_dispatch_total{job_id,result}
- scheduler_dispatch_lag_seconds (actual dispatch time minus scheduled time)
- job_run_total{job_id,status}
- job_run_duration_seconds
- job_duplicate_suppressed_total
- job_backfill_total
- job_missed_total
And one crucial derived SLI:
run_completeness = succeeded_runs / expected_runs
This catches silent misses better than raw failure rate.
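Computed over the same run_key used everywhere else, the SLI is a simple ratio (a sketch assuming a fixed period and the key format defined earlier):

```python
from datetime import datetime, timedelta, timezone

def run_completeness(succeeded_keys: set, job_id: str, version: int,
                     window_start: datetime, window_end: datetime,
                     period: timedelta) -> float:
    """Fraction of expected runs in [window_start, window_end) whose
    run_key shows up as succeeded."""
    expected, t = [], window_start
    while t < window_end:
        ts = t.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        expected.append(f"{job_id}#{ts}#{version}")
        t += period
    if not expected:
        return 1.0
    return sum(1 for k in expected if k in succeeded_keys) / len(expected)
```

A run that was never dispatched produces no failure event at all, so failure-rate alerts stay green; the expected-vs-succeeded ratio is the only signal that catches it.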
Incident runbook (minimum)
When alerts fire:
- freeze schedule if duplicates are suspected
- identify the affected run_key range
- reconcile state from the idempotency table and downstream ledger
- replay only missing keys through controlled backfill mode
- publish postmortem including policy mismatch (overlap/catch-up/retry)
10-point pre-production checklist
- Logical run key defined and propagated end-to-end
- Idempotency table durability + retention policy
- Overlap policy mapped to business semantics
- Catch-up/backfill policy documented per job class
- Retry window aligned with schedule period
- DLQ and replay tooling tested
- Lease + fencing token behavior validated
- UTC + timezone handling tested around DST boundaries
- Load-shaping (stable offset/jitter) configured
- Completeness SLI and lag alerts live
Platform notes (quick translation layer)
- Kubernetes CronJob: supports concurrency policies (Allow, Forbid, Replace) and starting deadlines; the docs explicitly warn that job creation is approximate and jobs should be idempotent.
- Temporal Schedules: overlap policies (Skip, Buffer*, CancelOther, TerminateOther, AllowAll) plus a catch-up window and backfill controls.
- EventBridge Scheduler: supports flexible windows, retry policy limits, max event age, and DLQ integration.
- Google Cloud Scheduler: configurable exponential backoff knobs (retryCount, backoff bounds, max doublings, retry duration).
- systemd timers: persistent catch-up (Persistent=) and load spreading (RandomizedDelaySec=, stable offsets).
Use these as mechanism primitives, not as a substitute for application-level idempotency.
References
- Google SRE Book — Distributed Periodic Scheduling with Cron: https://sre.google/sre-book/distributed-periodic-scheduling/
- Kubernetes CronJob concepts: https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/
- Temporal Schedule docs: https://docs.temporal.io/schedule
- Amazon EventBridge Scheduler user guide: https://docs.aws.amazon.com/eventbridge/latest/userguide/using-eventbridge-scheduler.html
- Amazon EventBridge Scheduler RetryPolicy API: https://docs.aws.amazon.com/scheduler/latest/APIReference/API_RetryPolicy.html
- Google Cloud Scheduler retry configuration: https://cloud.google.com/scheduler/docs/configuring/retry-jobs
- systemd.timer man page: https://manpages.debian.org/testing/systemd/systemd.timer.5.en.html
One-line takeaway
Distributed cron correctness is not "did it run?" — it is "did each logical run cause permitted side effects exactly once, even across crashes, retries, and time drift?"