The Exactly-Once Illusion: Distributed Cron & Scheduler Correctness Playbook
Date: 2026-03-02
Category: knowledge
Domain: software / distributed systems / operations
Why this matters
Periodic jobs look simple until they fail in production:
- duplicate runs trigger double-billing, duplicate emails, or duplicate state transitions
- missed runs silently break compliance, backups, settlements, and reports
- retry storms align on minute boundaries and overload downstream systems
In distributed systems, "exactly once" is usually a product requirement implemented on top of at-least-once infrastructure.
Core operating principle
Treat schedule dispatch as control-plane intent, and job side effects as data-plane transactions.
That means:
- Scheduler guarantees: when/what should run
- Worker guarantees: what side effects are allowed once per logical run
- Reconciliation guarantees: what to do when either side is uncertain
Failure model you should assume (always)
- Scheduler leader crashes after sending some (not all) dispatch RPCs
- Job starts, succeeds, but ack/update is lost
- Clock skew or controller lag causes late starts and backlog catch-up
- Retry policy and next schedule overlap
- Control plane emits duplicate creates for the same nominal time
If your design cannot survive these five, it is not production-safe.
Define the unit of uniqueness first
Create a logical run key and use it everywhere:
run_key = <job_id>#<scheduled_time_utc>#<version>
Rules:
- job_id: stable identity of the schedule intent
- scheduled_time_utc: canonical planned fire time (not worker receive time)
- version: schedule spec version (protects runs across spec edits)
Every queue message, DB mutation, and audit event should carry run_key.
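As a minimal sketch (Python; the helper name and ISO-style timestamp format are our choices), the key can be minted directly from the canonical schedule intent:

```python
from datetime import datetime, timezone

def make_run_key(job_id: str, scheduled_time: datetime, spec_version: int) -> str:
    """Build the logical run key from stable schedule intent.

    scheduled_time must be the canonical planned fire time (not the
    worker's receive time) and must be timezone-aware.
    """
    if scheduled_time.tzinfo is None:
        raise ValueError("scheduled_time must be timezone-aware")
    ts = scheduled_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{job_id}#{ts}#{spec_version}"
```

Because the key is derived from intent rather than execution, a retry, a duplicate dispatch, and the original run all carry the same identity.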
Overlap policy = business semantics
Choose overlap behavior per workload, not per platform default.
- Skip / Forbid: if freshness is less important than single in-flight run
- BufferOne / BufferAll: if every period matters (e.g., hourly reconciliation)
- Replace / Cancel / Terminate: if only latest state matters
- AllowAll: only when side effects are partitioned or naturally idempotent
Practical mapping:
- payroll, settlements, invoices → avoid overlap + explicit backfill process
- cache warmers, price snapshots → replace/cancel can be valid
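One way to make the chosen policy explicit in code is a small decision function. This is a sketch: the enum names loosely mirror the policies above, and buffering is simplified to a single pending slot.

```python
from enum import Enum

class Overlap(Enum):
    SKIP = "skip"              # forbid a second in-flight run
    BUFFER_ONE = "buffer_one"  # queue at most one pending run
    REPLACE = "replace"        # cancel current, start latest
    ALLOW_ALL = "allow_all"    # side effects are partitioned/idempotent

def on_fire(policy: Overlap, in_flight: bool, already_buffered: bool) -> str:
    """Decide what to do when run T+1 fires while run T may still be running."""
    if not in_flight or policy is Overlap.ALLOW_ALL:
        return "start"
    if policy is Overlap.SKIP:
        return "drop"
    if policy is Overlap.BUFFER_ONE:
        return "drop" if already_buffered else "buffer"
    return "cancel_current_then_start"  # REPLACE
```

Keeping this as one explicit function per job class makes the business decision reviewable, instead of burying it in platform defaults.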
Missed-run and catch-up policy must be explicit
Never leave outage behavior implicit.
For each schedule, define:
- max_lateness (catch-up window)
- backfill mode (none, sequential, bounded_parallel)
- safety gate (e.g., stop catch-up when downstream error rate > threshold)
Good default:
- compliance jobs: long catch-up window + sequential replay
- high-volume non-critical jobs: short window + skip stale runs
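The skip-stale-runs behavior can be sketched as follows (a hypothetical helper; assumes a fixed period and a single max_lateness window):

```python
from datetime import datetime, timedelta, timezone

def runs_to_backfill(last_fired: datetime, now: datetime,
                     period: timedelta, max_lateness: timedelta) -> list:
    """Enumerate scheduled times missed since last_fired that still fall
    inside the catch-up window; anything older is skipped as stale."""
    missed = []
    t = last_fired + period
    while t <= now:
        if now - t <= max_lateness:
            missed.append(t)
        t += period
    return missed
```

For an hourly job that last fired at 00:00 and recovers at 03:30 with a two-hour window, only the 02:00 and 03:00 runs are replayed; the 01:00 run is deliberately dropped as stale.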
Idempotency is the real correctness boundary
Use a durable idempotency table keyed by run_key:
- status: received | started | succeeded | failed
- side-effect fingerprint (optional but useful)
- first_seen / last_updated timestamps
Execution rule:
- upsert run_key with compare-and-set semantics
- if already succeeded, no-op
- if started and the heartbeat is stale, route to the recovery path
- commit the side effect and success marker atomically where possible
If atomic commit is impossible, use outbox/inbox or compensating transaction patterns.
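A minimal sketch of the claim path, using SQLite as a stand-in for the durable idempotency store (the table name, status values, and heartbeat threshold are illustrative):

```python
import sqlite3
import time

STALE_HEARTBEAT_S = 300  # illustrative threshold for a dead worker

SCHEMA = """CREATE TABLE IF NOT EXISTS runs (
    run_key TEXT PRIMARY KEY,
    status TEXT NOT NULL,
    first_seen REAL NOT NULL,
    last_updated REAL NOT NULL
)"""

def claim_run(conn: sqlite3.Connection, run_key: str) -> str:
    """Compare-and-set claim of a run_key.

    Returns 'started' if this worker now owns the run, 'duplicate' if it
    already succeeded, 'recover' if a prior attempt looks dead, and
    'in_flight' if someone else is actively running it.
    """
    now = time.time()
    try:
        # INSERT on a primary key is the atomic claim: only one worker wins.
        conn.execute(
            "INSERT INTO runs (run_key, status, first_seen, last_updated) "
            "VALUES (?, 'started', ?, ?)", (run_key, now, now))
        conn.commit()
        return "started"
    except sqlite3.IntegrityError:
        status, last = conn.execute(
            "SELECT status, last_updated FROM runs WHERE run_key = ?",
            (run_key,)).fetchone()
        if status == "succeeded":
            return "duplicate"
        if status == "started" and now - last > STALE_HEARTBEAT_S:
            return "recover"
        return "in_flight"

def mark_succeeded(conn: sqlite3.Connection, run_key: str) -> None:
    conn.execute("UPDATE runs SET status = 'succeeded', last_updated = ? "
                 "WHERE run_key = ?", (time.time(), run_key))
    conn.commit()
```

The primary-key INSERT is the entire mutual-exclusion mechanism: every duplicate dispatch for the same run_key falls into the exception branch and is classified instead of re-executed.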
Leasing and fencing for active-active safety
Leader election alone is insufficient.
Use a lease epoch (fencing token):
- dispatcher includes epoch with each dispatch
- worker rejects commands with a stale epoch
- downstream write path validates epoch monotonicity when applicable
This blocks split-brain leaders from issuing valid-looking late commands.
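The worker-side check itself is tiny; the hard part in practice is persisting the highest observed epoch durably so the bar survives restarts (kept in memory here purely as a sketch):

```python
class FencedWorker:
    """Rejects dispatches carrying a stale lease epoch, so a deposed
    (split-brain) scheduler leader cannot issue valid-looking late commands."""

    def __init__(self) -> None:
        self.highest_epoch = 0  # in production, persist this durably

    def accept(self, epoch: int) -> bool:
        if epoch < self.highest_epoch:
            return False  # stale leader; refuse the dispatch
        self.highest_epoch = epoch
        return True
```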
Retry policy should not fight schedule policy
A common production bug:
- retry of run T collides with next scheduled run T+1
Guardrails:
- cap retry duration to less than period if overlap is forbidden
- bind retries to the same run_key (do not mint a new identity)
- route exhausted retries to a DLQ plus an operator-visible replay command
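Capping the retry window against the period can be sketched like this (exponential backoff; the safety margin and the 24-period horizon for overlap-tolerant jobs are assumed knobs):

```python
from datetime import datetime, timedelta, timezone

def retry_times(scheduled: datetime, period: timedelta, base_delay: timedelta,
                factor: int, overlap_forbidden: bool,
                safety_margin: timedelta) -> list:
    """Exponential-backoff retry times for run T, truncated so they cannot
    spill into run T+1 when overlap is forbidden."""
    if overlap_forbidden:
        deadline = scheduled + period - safety_margin
    else:
        deadline = scheduled + 24 * period  # policy-defined horizon
    times, delay, t = [], base_delay, scheduled
    while True:
        t = t + delay
        if t >= deadline:
            return times
        times.append(t)
        delay *= factor
```

For an hourly job with a 5-minute base delay and doubling backoff, retries land at :05, :15, and :35 past the scheduled time; the next attempt would cross the deadline and is routed to the DLQ instead.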
Time correctness rules (non-negotiable)
- Store and compare schedule times in UTC
- Persist canonical scheduled timestamp with each run
- Version-control timezone intent and DST policy
- Alert on clock skew and scheduler/controller lag
Human-readable local time is a display concern, not storage truth.
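For example (using Python's zoneinfo; the helper name is ours), the same local-time intent resolves to different UTC instants across a DST boundary, which is exactly why UTC must be the stored truth:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def canonical_utc(naive_local: datetime, tz_name: str) -> datetime:
    """Resolve the operator's local-time intent to the canonical UTC instant
    that gets stored and compared everywhere."""
    return naive_local.replace(tzinfo=ZoneInfo(tz_name)).astimezone(timezone.utc)
```

"Daily at 09:00 New York time" is 14:00 UTC in winter (EST) but 13:00 UTC in summer (EDT); persisting the resolved UTC timestamp per run keeps the run_key stable even as the offset shifts.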
Load-shaping: avoid synchronized thundering herds
If thousands of nodes run 00 * * * *, you manufactured your own incident.
Use both:
- stable per-job offset (deterministic jitter per identity)
- bounded random delay where strict alignment is unnecessary
Goal: preserve cadence while spreading load.
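Deterministic jitter can be derived from the job identity itself (the hash choice is illustrative; any stable hash works):

```python
import hashlib

def stable_offset_seconds(job_id: str, window_seconds: int) -> int:
    """Deterministic per-identity offset: the same job always fires at the
    same point inside the window, spreading the fleet without losing cadence."""
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds
```

Unlike a random delay, this keeps each job's cadence perfectly regular (run N+1 is exactly one period after run N) while decorrelating jobs from one another.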
Observability contract for scheduled systems
Track at least these metrics:
- scheduler_dispatch_total{job_id,result}
- scheduler_dispatch_lag_seconds (actual dispatch time minus scheduled time)
- job_run_total{job_id,status}
- job_run_duration_seconds
- job_duplicate_suppressed_total
- job_backfill_total
- job_missed_total
And one crucial derived SLI:
run_completeness = succeeded_runs / expected_runs
This catches silent misses better than raw failure rate.
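Computed over the same run_key used everywhere else, the SLI is a simple ratio (a sketch assuming a fixed period and the key format defined earlier):

```python
from datetime import datetime, timedelta, timezone

def run_completeness(succeeded_keys: set, job_id: str, version: int,
                     window_start: datetime, window_end: datetime,
                     period: timedelta) -> float:
    """Fraction of expected runs in [window_start, window_end) whose
    run_key shows up as succeeded."""
    expected, t = [], window_start
    while t < window_end:
        ts = t.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        expected.append(f"{job_id}#{ts}#{version}")
        t += period
    if not expected:
        return 1.0
    return sum(1 for k in expected if k in succeeded_keys) / len(expected)
```

A run that was never dispatched produces no failure event at all, so failure-rate alerts stay green; the expected-vs-succeeded ratio is the only signal that catches it.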
Incident runbook (minimum)
When alerts fire:
- freeze schedule if duplicates are suspected
- identify the affected run_key range
- reconcile state from the idempotency table and downstream ledger
- replay only missing keys through controlled backfill mode
- publish postmortem including policy mismatch (overlap/catch-up/retry)
10-point pre-production checklist
- Logical run key defined and propagated end-to-end
- Idempotency table durability + retention policy
- Overlap policy mapped to business semantics
- Catch-up/backfill policy documented per job class
- Retry window aligned with schedule period
- DLQ and replay tooling tested
- Lease + fencing token behavior validated
- UTC + timezone handling tested around DST boundaries
- Load-shaping (stable offset/jitter) configured
- Completeness SLI and lag alerts live
Platform notes (quick translation layer)
- Kubernetes CronJob: supports concurrency policies (Allow, Forbid, Replace) and starting deadlines; the docs explicitly warn that job creation is approximate and jobs should be idempotent.
- Temporal Schedules: overlap policies (Skip, Buffer*, CancelOther, TerminateOther, AllowAll) plus a catch-up window and backfill controls.
- EventBridge Scheduler: supports flexible windows, retry policy limits, max event age, and DLQ integration.
- Google Cloud Scheduler: configurable exponential backoff knobs (retryCount, backoff bounds, max doublings, retry duration).
- systemd timers: persistent catch-up (Persistent=) and load spreading (RandomizedDelaySec=, stable offsets).
Use these as mechanism primitives, not as a substitute for application-level idempotency.
References
- Google SRE Book — Distributed Periodic Scheduling with Cron: https://sre.google/sre-book/distributed-periodic-scheduling/
- Kubernetes CronJob concepts: https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/
- Temporal Schedule docs: https://docs.temporal.io/schedule
- Amazon EventBridge Scheduler user guide: https://docs.aws.amazon.com/eventbridge/latest/userguide/using-eventbridge-scheduler.html
- Amazon EventBridge Scheduler RetryPolicy API: https://docs.aws.amazon.com/scheduler/latest/APIReference/API_RetryPolicy.html
- Google Cloud Scheduler retry configuration: https://cloud.google.com/scheduler/docs/configuring/retry-jobs
- systemd.timer man page: https://manpages.debian.org/testing/systemd/systemd.timer.5.en.html
One-line takeaway
Distributed cron correctness is not "did it run?" — it is "did each logical run cause permitted side effects exactly once, even across crashes, retries, and time drift?"