PostgreSQL `SKIP LOCKED` Job Queue Playbook

2026-03-15 · software


Why this matters

A lot of teams add Redis/Rabbit/Kafka early, even when their first reliability problem is much simpler: background jobs that must survive crashes and must not run twice.

If your source of truth is already PostgreSQL, a SKIP LOCKED queue can be a very strong default for small-to-medium-throughput systems.

The trick is building it like an ops system, not a demo snippet.


When Postgres queue is the right tool

Use this approach when:

  • PostgreSQL is already your source of truth
  • Throughput is small to medium, not firehose-scale
  • You want transactional enqueue (the job insert commits with your business write)

Do not force this if you need:

  • Very high sustained throughput or log-style replay
  • Multi-consumer fan-out / pub-sub semantics
  • Long-term event retention

Then use a dedicated log/broker.


Core mental model

FOR UPDATE SKIP LOCKED lets workers race for jobs without blocking each other:

  • Each worker selects candidate rows and locks the ones it picks.
  • Rows already locked by another worker are skipped instead of waited on.
  • The lock holds until the claiming transaction commits.

So your DB row lock becomes the claim primitive.


Minimal schema that scales reasonably

create table jobs (
  id bigserial primary key,
  queue_name text not null default 'default',
  payload jsonb not null,
  status text not null check (status in ('queued','running','done','failed')),
  priority int not null default 0,
  run_at timestamptz not null default now(),
  attempts int not null default 0,
  max_attempts int not null default 20,
  locked_by text,
  locked_at timestamptz,
  last_error text,
  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now()
);

High-value indexes

-- Fast dequeue path (partial index for active queue)
create index idx_jobs_dequeue
  on jobs (queue_name, priority desc, run_at asc, id asc)
  where status = 'queued';

-- Reaper path for stuck workers
create index idx_jobs_running_timeout
  on jobs (locked_at)
  where status = 'running';

Partial indexes are important: they keep the hot index set small and focused on active rows.


Atomic claim query (single statement)

with picked as (
  select id
  from jobs
  where status = 'queued'
    and queue_name = $1
    and run_at <= now()
  order by priority desc, run_at asc, id asc
  for update skip locked
  limit 1
)
update jobs j
set status    = 'running',
    locked_by = $2,
    locked_at = now(),
    attempts  = attempts + 1,
    updated_at = now()
from picked
where j.id = picked.id
returning j.*;

Why this shape:

  • Single statement: select-and-mark is atomic, so there is no read-then-update race.
  • SKIP LOCKED: concurrent workers skip each other's rows instead of queueing on locks.
  • The ORDER BY matches the partial index, so the dequeue stays an index scan.
  • attempts is incremented at claim time, so a crash mid-job still counts as an attempt.


Worker lifecycle contract

  1. Claim one job (or small batch).
  2. Execute handler with strict timeout.
  3. On success: mark done (or delete/archive).
  4. On failure:
    • if attempts < max_attempts: reschedule with backoff (run_at = now() + delay) and set status back to queued
    • else: mark failed (DLQ semantics)
  5. Always write a structured failure reason.
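The contract above can be sketched as a single worker pass. This is a hedged sketch, not a prescribed implementation: `claim`, `complete`, `reschedule`, and `fail` are hypothetical injected callables standing in for the SQL statements elsewhere in this playbook.

```python
def run_one(claim, handler, complete, reschedule, fail, backoff, timeout_s=30):
    """One pass of the worker lifecycle. All persistence is injected:
    claim() returns a job dict or None; the other callables write
    the corresponding state transitions to the jobs table."""
    job = claim()                      # step 1: atomic claim (the SQL above)
    if job is None:
        return False                   # queue empty; caller sleeps and retries
    try:
        handler(job, timeout_s)        # step 2: run with a strict timeout
    except Exception as exc:
        reason = f"{type(exc).__name__}: {exc}"   # step 5: structured reason
        if job["attempts"] < job["max_attempts"]:
            # step 4a: backoff(attempts) -> delay seconds before re-run
            reschedule(job, delay_s=backoff(job["attempts"]), error=reason)
        else:
            fail(job, error=reason)    # step 4b: DLQ semantics
        return True
    complete(job)                      # step 3: mark done (or delete/archive)
    return True
```

Injecting the persistence callables keeps the lifecycle logic testable without a database.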

Retry policy that avoids retry storms

Use exponential backoff + jitter:

  • delay grows as base × 2^attempts, capped at some ceiling
  • add random jitter before setting the new run_at

Without jitter, synchronized retries can create periodic load spikes.
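A minimal sketch of "full jitter" backoff (the function name, defaults, and cap are illustrative, not from the source):

```python
import random

def backoff_delay(attempts, base=1.0, cap=600.0, rng=random.random):
    """Full-jitter exponential backoff: return a uniform delay between
    0 and min(cap, base * 2^attempts) seconds. The randomness spreads
    out retries that would otherwise synchronize into load spikes."""
    ceiling = min(cap, base * (2 ** attempts))
    return rng() * ceiling
```

The result feeds the reschedule step: run_at = now() + delay.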


Stuck-job recovery (reaper)

Workers crash. Network stalls. Pods get evicted.

Run a periodic reaper:

update jobs
set status = 'queued',
    locked_by = null,
    locked_at = null,
    run_at = now() + interval '5 seconds',
    updated_at = now(),
    last_error = coalesce(last_error, 'requeued by reaper after lock timeout')
where status = 'running'
  and locked_at < now() - interval '10 minutes';

Tune timeout from real p99 handler duration, not optimism.


Idempotency is non-negotiable

SKIP LOCKED gives solid claiming, but your side effects can still duplicate (crash after effect, before ack).

Use one or more:

  • Idempotency keys enforced by a unique constraint (insert-or-skip)
  • Natural-key upserts instead of blind inserts
  • A transactional outbox for external side effects
  • Handlers that check current state before acting

Goal: at-least-once delivery, effectively-once outcomes.
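One way to get there is an idempotency key recorded with a uniqueness check. A sketch under stated assumptions: `done_keys` stands in for a table with a UNIQUE constraint (a real implementation would use INSERT ... ON CONFLICT DO NOTHING in the same transaction as the effect); `apply_once` and the key format are illustrative.

```python
def apply_once(effects, done_keys, job):
    """Effectively-once outcomes on top of at-least-once delivery:
    derive a deterministic key per logical action, and perform the
    side effect only if the key has not been recorded yet."""
    key = f"{job['id']}:{job['kind']}"   # deterministic per logical action
    if key in done_keys:                 # duplicate delivery: skip the effect
        return False
    done_keys.add(key)                   # real version: unique-index insert
    effects.append(job["payload"])       # the actual side effect
    return True
```

A redelivered job then becomes a cheap no-op instead of a duplicate effect.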


Throughput knobs (practical)

  1. Batch claim size
    • Start with 1–10 per transaction.
  2. Polling interval
    • Short sleep when empty (e.g. 100–500ms) to reduce DB churn.
  3. LISTEN/NOTIFY assist
    • Push wakeups to reduce blind polling.
  4. Queue partitioning
    • Shard by queue_name/tenant for noisy-neighbor control.
  5. Autovacuum hygiene
    • Hot update/delete tables need vacuum tuning to avoid bloat drift.
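Knobs 1 and 2 combine into a simple loop shape: claim a small batch, process it, and sleep only when the queue came back empty. A sketch with injected callables (`claim_batch`, `process`, and the defaults are my own, not from the source):

```python
import time

def poll_loop(claim_batch, process, idle_sleep=0.25, max_iterations=None):
    """Batch-claim polling loop: busy queues are drained without sleeping,
    while an empty queue triggers a short pause to reduce DB churn."""
    n = 0
    while max_iterations is None or n < max_iterations:
        n += 1
        jobs = claim_batch()           # e.g. the claim query with LIMIT 10
        for job in jobs:
            process(job)
        if not jobs:
            time.sleep(idle_sleep)     # LISTEN/NOTIFY wakeups can replace this
```

With a LISTEN/NOTIFY assist, the sleep becomes a wait-for-notification with the same timeout as a fallback.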

Anti-patterns that hurt in production

  1. Read row, then update in separate transaction
    • causes duplicate claims/races.
  2. No partial index on queued jobs
    • dequeue path degrades as table grows.
  3. Infinite retries without max_attempts
    • poison jobs loop forever.
  4. No stuck-job reaper
    • crashed workers create zombie running jobs.
  5. Non-idempotent handlers
    • transient failures become data corruption.

30-day rollout plan

Week 1 — Foundation

  • Ship the schema, partial indexes, and atomic claim query behind one job type.

Week 2 — Reliability rails

  • Add max_attempts, backoff + jitter, the stuck-job reaper, and failed-job (DLQ) handling.

Week 3 — Scale hygiene

  • Tune batch size and polling, add LISTEN/NOTIFY wakeups, and check autovacuum on the jobs table.

Week 4 — Operational hardening

  • Wire up metrics and alerts on queue age and retries; write the requeue/replay runbook.


Suggested metrics

Track at minimum:

  • Queue depth per queue_name
  • Queue age (now minus the oldest queued run_at)
  • Claim-to-done latency
  • Retry rate and failed (DLQ) rate
  • Reaper requeue count

If you can’t see queue age and retries, incidents arrive “suddenly.”
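The queue-age and retry signals fall out of a single scan of the jobs table. A minimal sketch, assuming rows arrive as dicts mirroring the schema above (`queue_stats` and its output shape are illustrative):

```python
from datetime import datetime, timedelta, timezone

def queue_stats(rows, now):
    """Minimal health snapshot: queued depth, age of the oldest runnable
    job in seconds, and how many jobs have been retried at least once."""
    queued = [r for r in rows if r["status"] == "queued"]
    oldest_age_s = (
        max((now - r["run_at"]).total_seconds() for r in queued) if queued else 0.0
    )
    retried = sum(1 for r in rows if r["attempts"] > 1)
    return {"depth": len(queued), "oldest_age_s": oldest_age_s, "retried": retried}
```

In production the same numbers come from an aggregate query, exported on a short interval.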


Bottom line

Postgres + SKIP LOCKED is a legit production queue pattern when used with:

  • A single-statement atomic claim
  • Partial indexes on the hot paths
  • Bounded retries with backoff + jitter
  • A stuck-job reaper
  • Idempotent handlers and real metrics

You won’t out-broker Kafka with this.
But for many systems, you’ll ship faster with fewer moving parts and enough reliability headroom.


References

  1. PostgreSQL docs — SELECT locking clause (FOR UPDATE ... SKIP LOCKED)
    https://www.postgresql.org/docs/current/sql-select.html
  2. PostgreSQL docs — Explicit locking overview
    https://www.postgresql.org/docs/current/explicit-locking.html
  3. PostgreSQL docs — Partial indexes
    https://www.postgresql.org/docs/current/indexes-partial.html
  4. PostgreSQL docs — LISTEN / NOTIFY
    https://www.postgresql.org/docs/current/sql-notify.html
  5. PostgreSQL docs — Routine vacuuming / autovacuum concepts
    https://www.postgresql.org/docs/current/routine-vacuuming.html