Lease-Based Distributed Locks Without Illusions: Fencing Token Playbook

2026-02-24 · software

Date: 2026-02-24
Category: knowledge
Domain: distributed systems / correctness engineering

Why this matters

Many teams treat a distributed lock as if it were a local mutex that guarantees mutual exclusion.
In production, that assumption fails under exactly the conditions you care about:

A lock lease can expire while the old holder is still alive, and the old holder may still write stale state later.
If your system has correctness requirements (not just efficiency), a lock alone is not enough.

Core principle

For correctness-critical work:

  1. Use lock/lease only to reduce overlap probability.
  2. Attach a monotonic fencing token to each critical operation.
  3. Enforce token monotonicity at the resource being protected.

If the protected resource cannot reject stale tokens, your safety story is incomplete.


First decision: efficiency lock or correctness lock?

Borrow this practical split:

If this is efficiency-only (a duplicate run wastes work but corrupts nothing), a simple lease is often enough.
If this is correctness-critical (an overlapping or stale write corrupts state), move to fencing-token design immediately.

Failure mode in one timeline

  1. Client A acquires lease + lock.
  2. Client A pauses for longer than lease TTL.
  3. Client B acquires new lease + lock and performs update.
  4. Client A resumes and writes stale update.

Without downstream token checks, both writes can succeed, and the stale write from Client A lands last and overwrites Client B's newer one.
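The timeline above can be replayed in a few lines. This is a minimal simulation, not a real lock client: `UnfencedStore`, `replay_timeline`, the 10-second TTL, and the timestamps are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class UnfencedStore:
    """A resource that accepts any write, regardless of who holds the lease."""
    value: str = "initial"
    history: list = field(default_factory=list)

    def write(self, client: str, value: str) -> None:
        self.value = value
        self.history.append((client, value))


def replay_timeline() -> UnfencedStore:
    store = UnfencedStore()
    lease_ttl = 10.0             # seconds (hypothetical value)
    now = 0.0
    a_expiry = now + lease_ttl   # step 1: A acquires the lease at t=0

    # Step 2: A prepares its update, then pauses (GC, VM freeze, network stall).
    a_pending = "A's update (computed at t=0)"
    now = 25.0
    assert now > a_expiry        # A's lease expired while A was paused

    # Step 3: B acquires a fresh lease and performs its update.
    store.write("B", "B's update (computed at t=25)")

    # Step 4: A resumes and writes without rechecking anything.
    store.write("A", a_pending)
    return store


store = replay_timeline()
print(store.value)  # A's stale value is what remains in the store
```

Nothing in the store can tell the two writers apart, which is the gap a fencing token is meant to close.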


What a fencing token is

A fencing token is a strictly increasing value issued on lock acquisition (or ownership epoch change).

This converts "who thinks they hold lock" into "what writes are admissible".

Safety invariant

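For each protected resource R: R accepts a write only if the write's token is greater than every token R has already accepted, and R performs that check atomically with the write itself.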

Anything weaker (e.g., checking token outside write transaction) can still race.
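A minimal in-process sketch of an atomic check-and-write, assuming the resource can keep the highest accepted token next to its state. `FencedResource` is a hypothetical name. The strict comparison mirrors the `lock_epoch < ?` condition used in the SQL pattern in this document; whether an equal token should pass (to allow several writes per lease) is a design choice this sketch does not settle.

```python
import threading
from typing import Any


class FencedResource:
    """Hypothetical resource that rejects writes carrying a stale token."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.max_token = 0      # highest token accepted so far
        self.value: Any = None

    def write(self, token: int, value: Any) -> bool:
        # The comparison and the state change happen under one lock, so no
        # other writer can slip in between the check and the write.
        with self._lock:
            if token <= self.max_token:
                return False    # stale (or repeated) token: reject
            self.max_token = token
            self.value = value
            return True


resource = FencedResource()
resource.write(2, "B's update")   # accepted: 2 > 0
resource.write(1, "A's update")   # rejected: 1 <= 2
```

Moving the `if` outside the `with` block would reintroduce the race described above: two writers could both pass the check before either one records its token.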


Implementation patterns (practical)

Pattern A) SQL row/resource with token column

Add columns:

Write with compare condition:

UPDATE account_positions
SET qty = ?, avg_px = ?, lock_epoch = ?
WHERE account_id = ?
  AND lock_epoch < ?;

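Interpretation: one row updated means the token was newer than the stored epoch and the write was admitted; zero rows updated means the token was stale (or the row does not exist) and the caller must treat the write as rejected.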

Use one transaction boundary for state + epoch.
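A runnable sketch of the statement above using Python's standard `sqlite3` module. The table schema, the `fenced_write` helper, and the sample values are assumptions made for illustration; only the UPDATE itself comes from this document.

```python
import sqlite3

FENCED_UPDATE = """
UPDATE account_positions
SET qty = ?, avg_px = ?, lock_epoch = ?
WHERE account_id = ?
  AND lock_epoch < ?
"""


def fenced_write(conn: sqlite3.Connection, account_id: str,
                 qty: float, avg_px: float, token: int) -> bool:
    """Apply the update only if `token` is newer than the stored epoch."""
    with conn:  # one transaction: commits on success, rolls back on error
        cur = conn.execute(FENCED_UPDATE, (qty, avg_px, token, account_id, token))
        return cur.rowcount == 1  # 0 rows means stale token or unknown account


# Hypothetical schema for the demo.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account_positions ("
    " account_id TEXT PRIMARY KEY,"
    " qty REAL NOT NULL,"
    " avg_px REAL NOT NULL,"
    " lock_epoch INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO account_positions VALUES ('acct-1', 0, 0, 0)")
conn.commit()

fenced_write(conn, "acct-1", 10, 101.5, token=2)  # admitted
fenced_write(conn, "acct-1", 5, 99.0, token=1)    # rejected as stale
```

The caller has to inspect the affected-row count; an UPDATE that matches zero rows is not an error at the SQL level, so a stale write fails silently unless the application checks.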

Pattern B) KV store / document store CAS

Store {value, maxToken} together and update via CAS/precondition:

If DB supports conditional updates directly, encode both conditions server-side.
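For a store that only offers version-based compare-and-set, one possible shape is a read, check, CAS loop. `VersionedKV` below is a hypothetical in-memory stand-in, not the API of any real product; `fenced_put`, `Record`, and the retry limit are likewise illustrative.

```python
import threading
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class Record:
    value: Any
    max_token: int  # stored together with the value, as described above


class VersionedKV:
    """Hypothetical in-memory store with compare-and-set on a version number."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: Dict[str, Tuple[Record, int]] = {}

    def get(self, key: str) -> Tuple[Optional[Record], int]:
        with self._lock:
            return self._data.get(key, (None, 0))

    def compare_and_set(self, key: str, expected_version: int, record: Record) -> bool:
        with self._lock:
            _, version = self._data.get(key, (None, 0))
            if version != expected_version:
                return False  # someone else wrote since our read
            self._data[key] = (record, version + 1)
            return True


def fenced_put(store: VersionedKV, key: str, token: int, value: Any,
               max_retries: int = 5) -> bool:
    """Return True if written, False if the token is stale."""
    for _ in range(max_retries):
        current, version = store.get(key)
        if current is not None and token <= current.max_token:
            return False  # a newer holder has already written
        if store.compare_and_set(key, version, Record(value, token)):
            return True
        # Lost a race: loop to re-read and re-check the token.
    raise RuntimeError("CAS contention: retries exhausted")
```

The token check is only trustworthy because the CAS fails if the record changed after the read; the check alone, without the versioned write, would be the client-side-only comparison this playbook warns against.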

Pattern C) Object storage metadata gate

Object stores often support conditional writes (e.g., If-Match on ETag).
Use object metadata/state object carrying max_token, then update with conditional request.

Key point: token comparison must be part of authoritative update path, not client-side only.

Pattern D) External side-effects (APIs, queues)

When protected action is "call external system":

If receiver cannot enforce monotonic token, classify as best-effort only.


Where tokens come from

Good token sources are globally ordered and monotonic per lock domain:

Avoid: wall-clock timestamps, which can skew between machines or regress.

Token scope design

Define scope explicitly:

Rule of thumb: scope by actual contention boundary (what can conflict in one correctness domain).
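A sketch of a token issuer with one counter per scope, to make "monotonic per lock domain" concrete. `TokenIssuer` and the scope strings are hypothetical. The in-memory counters are an assumption for brevity: a real issuer would need durable, linearizable storage so that tokens never repeat or go backwards after a restart or failover.

```python
import threading
from collections import defaultdict
from typing import DefaultDict


class TokenIssuer:
    """Hypothetical issuer: strictly increasing tokens, independent per scope."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counters: DefaultDict[str, int] = defaultdict(int)

    def next_token(self, scope: str) -> int:
        # Increment and read under one lock so no two callers in the same
        # scope can ever receive the same token.
        with self._lock:
            self._counters[scope] += 1
            return self._counters[scope]


issuer = TokenIssuer()
a1 = issuer.next_token("account:42")
a2 = issuer.next_token("account:42")
b1 = issuer.next_token("account:7")
print(a1, a2, b1)  # 1 2 1
```

Tokens from different scopes are not comparable, so the protected resource should only ever compare tokens issued for its own scope.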


TTL and heartbeat tuning (what actually matters)

TTL tuning does not create correctness by itself; it changes overlap probability.

Still useful:

But even perfect tuning cannot rule out long pauses/delays.
Fencing remains the correctness backstop.
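One holder-side habit that reduces overlap is to stop working some margin before the lease could expire, measured on a monotonic clock. The sketch below is illustrative only: `LeaseBudget`, the margin value, and the injectable `clock` parameter are assumptions, not part of any lock library.

```python
import time
from typing import Callable


class LeaseBudget:
    """Hypothetical helper: how much of the lease is still safe to use."""

    def __init__(self, ttl_s: float, safety_margin_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if safety_margin_s >= ttl_s:
            raise ValueError("safety margin must be smaller than the TTL")
        self._clock = clock
        self._ttl_s = ttl_s
        self._margin_s = safety_margin_s
        # Ideally captured just before the acquire request is sent, so the
        # local estimate never outlives the server-side lease.
        self._started_at = clock()

    def renew(self) -> None:
        """Call after a successful heartbeat/renewal."""
        self._started_at = self._clock()

    def remaining_s(self) -> float:
        deadline = self._started_at + self._ttl_s - self._margin_s
        return max(0.0, deadline - self._clock())

    def should_continue(self) -> bool:
        return self.remaining_s() > 0.0


budget = LeaseBudget(ttl_s=10.0, safety_margin_s=2.0)
if budget.should_continue():
    pass  # do one bounded unit of work, then check again
```

A pause can still hit right after `should_continue()` returns True, so this check narrows the overlap window without closing it, which is why the resource-side token check stays in place.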


Observability checklist

Track these counters/time series:

Alert examples:


Rollout plan (safe migration)

  1. Instrument first: add token fields and passive logging.
  2. Shadow enforce: compute accept/reject but do not block yet.
  3. Canary enforce: reject stale tokens for small subset.
  4. Full enforce: block globally; keep override only for emergency.
  5. Remove unsafe paths: forbid writes missing token.
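The shadow, canary, and full stages can share one decision function that always computes the verdict and only blocks when the current mode says so. This is a sketch under assumptions: `Mode`, `admit`, `in_canary`, and the CRC-based bucketing are hypothetical, and writes with a missing token are let through here because forbidding them is the separate final step.

```python
import zlib
from enum import Enum
from typing import Optional, Tuple


class Mode(Enum):
    SHADOW = "shadow"   # compute verdict, never block
    CANARY = "canary"   # block only for a stable subset of resources
    FULL = "full"       # block everywhere


def in_canary(resource_id: str, percent: int) -> bool:
    # Stable bucketing: the same resource always lands in the same bucket.
    return zlib.crc32(resource_id.encode("utf-8")) % 100 < percent


def admit(mode: Mode, token: Optional[int], max_token: int,
          resource_id: str, canary_percent: int = 5) -> Tuple[bool, bool]:
    """Return (allowed, stale). `stale` should be logged in every mode."""
    stale = token is not None and token <= max_token
    if mode is Mode.FULL:
        enforce = True
    elif mode is Mode.CANARY:
        enforce = in_canary(resource_id, canary_percent)
    else:
        enforce = False
    return (not (stale and enforce), stale)


allowed, stale = admit(Mode.SHADOW, token=1, max_token=2, resource_id="acct-1")
print(allowed, stale)  # True True: would be rejected, but shadow mode lets it pass
```

Logging the `stale` flag during the shadow stage gives a count of writes that full enforcement would have rejected, before any of them is actually blocked.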

Success criteria:


Common footguns

  1. Lock acquired, token ignored downstream.
  2. Token checked in app code but not atomically with write.
  3. Using timestamp as token (clock skew/regression).
  4. Assuming Redlock/lease semantics alone guarantee exclusion.
  5. Calling external system that cannot reject stale epochs.

If any of these are true, document the system as a "best-effort lock", not as strict exclusion.


Quick decision cheat sheet

The practical stance: leases coordinate intent; fencing protects truth.

