PostgreSQL `SKIP LOCKED` Job Queue Playbook

2026-03-15 · software


Why this matters

A lot of teams add Redis/Rabbit/Kafka early, even when their first reliability problem is much simpler: background jobs that must survive crashes and must not run twice.

If your source of truth is already PostgreSQL, a SKIP LOCKED queue can be a very strong default for small-to-medium-throughput systems.

The trick is building it like an ops system, not a demo snippet.


When Postgres queue is the right tool

Use this approach when:

  • PostgreSQL is already your source of truth
  • Throughput is small to medium, not firehose-scale
  • You want transactional enqueue (the job insert commits with your business write)

Do not force this if you need:

  • Very high sustained throughput or log-style replay
  • Multi-consumer fan-out / pub-sub semantics
  • Long-term event retention

Then use a dedicated log/broker.


Core mental model

FOR UPDATE SKIP LOCKED lets workers race for jobs without blocking each other:

  • Each worker selects candidate rows and locks the ones it picks.
  • Rows already locked by another worker are skipped instead of waited on.
  • The lock holds until the claiming transaction commits.

So your DB row lock becomes the claim primitive.


Minimal schema that scales reasonably

create table jobs (
  id bigserial primary key,
  queue_name text not null default 'default',
  payload jsonb not null,
  status text not null check (status in ('queued','running','done','failed')),
  priority int not null default 0,
  run_at timestamptz not null default now(),
  attempts int not null default 0,
  max_attempts int not null default 20,
  locked_by text,
  locked_at timestamptz,
  last_error text,
  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now()
);

High-value indexes

-- Fast dequeue path (partial index for active queue)
create index idx_jobs_dequeue
  on jobs (queue_name, priority desc, run_at asc, id asc)
  where status = 'queued';

-- Reaper path for stuck workers
create index idx_jobs_running_timeout
  on jobs (locked_at)
  where status = 'running';

Partial indexes are important: they keep the hot index set small and focused on active rows.


Atomic claim query (single statement)

with picked as (
  select id
  from jobs
  where status = 'queued'
    and queue_name = $1
    and run_at <= now()
  order by priority desc, run_at asc, id asc
  for update skip locked
  limit 1
)
update jobs j
set status    = 'running',
    locked_by = $2,
    locked_at = now(),
    attempts  = attempts + 1,
    updated_at = now()
from picked
where j.id = picked.id
returning j.*;

Why this shape:

  • Single statement: select-and-mark is atomic, so there is no read-then-update race.
  • SKIP LOCKED: concurrent workers skip each other's rows instead of queueing on locks.
  • The ORDER BY matches the partial index, so the dequeue stays an index scan.
  • attempts is incremented at claim time, so a crash mid-job still counts as an attempt.


Worker lifecycle contract

  1. Claim one job (or small batch).
  2. Execute handler with strict timeout.
  3. On success: mark done (or delete/archive).
  4. On failure:
    • if attempts < max_attempts: reschedule with backoff (run_at = now() + delay) and set status back to queued
    • else: mark failed (DLQ semantics)
  5. Always write a structured failure reason.
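The contract above can be sketched as a single worker pass. This is a hedged sketch, not a prescribed implementation: `claim`, `complete`, `reschedule`, and `fail` are hypothetical injected callables standing in for the SQL statements elsewhere in this playbook.

```python
def run_one(claim, handler, complete, reschedule, fail, backoff, timeout_s=30):
    """One pass of the worker lifecycle. All persistence is injected:
    claim() returns a job dict or None; the other callables write
    the corresponding state transitions to the jobs table."""
    job = claim()                      # step 1: atomic claim (the SQL above)
    if job is None:
        return False                   # queue empty; caller sleeps and retries
    try:
        handler(job, timeout_s)        # step 2: run with a strict timeout
    except Exception as exc:
        reason = f"{type(exc).__name__}: {exc}"   # step 5: structured reason
        if job["attempts"] < job["max_attempts"]:
            # step 4a: backoff(attempts) -> delay seconds before re-run
            reschedule(job, delay_s=backoff(job["attempts"]), error=reason)
        else:
            fail(job, error=reason)    # step 4b: DLQ semantics
        return True
    complete(job)                      # step 3: mark done (or delete/archive)
    return True
```

Injecting the persistence callables keeps the lifecycle logic testable without a database.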

Retry policy that avoids retry storms

Use exponential backoff + jitter:

  • delay grows as base × 2^attempts, capped at some ceiling
  • add random jitter before setting the new run_at

Without jitter, synchronized retries can create periodic load spikes.
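A minimal sketch of "full jitter" backoff (the function name, defaults, and cap are illustrative, not from the source):

```python
import random

def backoff_delay(attempts, base=1.0, cap=600.0, rng=random.random):
    """Full-jitter exponential backoff: return a uniform delay between
    0 and min(cap, base * 2^attempts) seconds. The randomness spreads
    out retries that would otherwise synchronize into load spikes."""
    ceiling = min(cap, base * (2 ** attempts))
    return rng() * ceiling
```

The result feeds the reschedule step: run_at = now() + delay.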


Stuck-job recovery (reaper)

Workers crash. Network stalls. Pods get evicted.

Run a periodic reaper:

update jobs
set status = 'queued',
    locked_by = null,
    locked_at = null,
    run_at = now() + interval '5 seconds',
    updated_at = now(),
    last_error = coalesce(last_error, 'requeued by reaper after lock timeout')
where status = 'running'
  and locked_at < now() - interval '10 minutes';

Tune timeout from real p99 handler duration, not optimism.


Idempotency is non-negotiable

SKIP LOCKED gives solid claiming, but your side effects can still duplicate (crash after effect, before ack).

Use one or more:

  • Idempotency keys enforced by a unique constraint (insert-or-skip)
  • Natural-key upserts instead of blind inserts
  • A transactional outbox for external side effects
  • Handlers that check current state before acting

Goal: at-least-once delivery, effectively-once outcomes.
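One way to get there is an idempotency key recorded with a uniqueness check. A sketch under stated assumptions: `done_keys` stands in for a table with a UNIQUE constraint (a real implementation would use INSERT ... ON CONFLICT DO NOTHING in the same transaction as the effect); `apply_once` and the key format are illustrative.

```python
def apply_once(effects, done_keys, job):
    """Effectively-once outcomes on top of at-least-once delivery:
    derive a deterministic key per logical action, and perform the
    side effect only if the key has not been recorded yet."""
    key = f"{job['id']}:{job['kind']}"   # deterministic per logical action
    if key in done_keys:                 # duplicate delivery: skip the effect
        return False
    done_keys.add(key)                   # real version: unique-index insert
    effects.append(job["payload"])       # the actual side effect
    return True
```

A redelivered job then becomes a cheap no-op instead of a duplicate effect.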


Throughput knobs (practical)

  1. Batch claim size
    • Start with 1–10 per transaction.
  2. Polling interval
    • Short sleep when empty (e.g. 100–500ms) to reduce DB churn.
  3. LISTEN/NOTIFY assist
    • Push wakeups to reduce blind polling.
  4. Queue partitioning
    • Shard by queue_name/tenant for noisy-neighbor control.
  5. Autovacuum hygiene
    • Hot update/delete tables need vacuum tuning to avoid bloat drift.
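Knobs 1 and 2 combine into a simple loop shape: claim a small batch, process it, and sleep only when the queue came back empty. A sketch with injected callables (`claim_batch`, `process`, and the defaults are my own, not from the source):

```python
import time

def poll_loop(claim_batch, process, idle_sleep=0.25, max_iterations=None):
    """Batch-claim polling loop: busy queues are drained without sleeping,
    while an empty queue triggers a short pause to reduce DB churn."""
    n = 0
    while max_iterations is None or n < max_iterations:
        n += 1
        jobs = claim_batch()           # e.g. the claim query with LIMIT 10
        for job in jobs:
            process(job)
        if not jobs:
            time.sleep(idle_sleep)     # LISTEN/NOTIFY wakeups can replace this
```

With a LISTEN/NOTIFY assist, the sleep becomes a wait-for-notification with the same timeout as a fallback.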

Anti-patterns that hurt in production

  1. Read row, then update in separate transaction
    • causes duplicate claims/races.
  2. No partial index on queued jobs
    • dequeue path degrades as table grows.
  3. Infinite retries without max_attempts
    • poison jobs loop forever.
  4. No stuck-job reaper
    • crashed workers create zombie running jobs.
  5. Non-idempotent handlers
    • transient failures become data corruption.

30-day rollout plan

Week 1 — Foundation

  • Ship the schema, partial indexes, and atomic claim query behind one job type.

Week 2 — Reliability rails

  • Add max_attempts, backoff + jitter, the stuck-job reaper, and failed-job (DLQ) handling.

Week 3 — Scale hygiene

  • Tune batch size and polling, add LISTEN/NOTIFY wakeups, and check autovacuum on the jobs table.

Week 4 — Operational hardening

  • Wire up metrics and alerts on queue age and retries; write the requeue/replay runbook.


Suggested metrics

Track at minimum:

  • Queue depth per queue_name
  • Queue age (now minus the oldest queued run_at)
  • Claim-to-done latency
  • Retry rate and failed (DLQ) rate
  • Reaper requeue count

If you can’t see queue age and retries, incidents arrive “suddenly.”
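The queue-age and retry signals fall out of a single scan of the jobs table. A minimal sketch, assuming rows arrive as dicts mirroring the schema above (`queue_stats` and its output shape are illustrative):

```python
from datetime import datetime, timedelta, timezone

def queue_stats(rows, now):
    """Minimal health snapshot: queued depth, age of the oldest runnable
    job in seconds, and how many jobs have been retried at least once."""
    queued = [r for r in rows if r["status"] == "queued"]
    oldest_age_s = (
        max((now - r["run_at"]).total_seconds() for r in queued) if queued else 0.0
    )
    retried = sum(1 for r in rows if r["attempts"] > 1)
    return {"depth": len(queued), "oldest_age_s": oldest_age_s, "retried": retried}
```

In production the same numbers come from an aggregate query, exported on a short interval.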


Bottom line

Postgres + SKIP LOCKED is a legit production queue pattern when used with:

  • A single-statement atomic claim
  • Partial indexes on the hot paths
  • Bounded retries with backoff + jitter
  • A stuck-job reaper
  • Idempotent handlers and real metrics

You won’t out-broker Kafka with this.
But for many systems, you’ll ship faster with fewer moving parts and enough reliability headroom.


References

  1. PostgreSQL docs — SELECT locking clause (FOR UPDATE ... SKIP LOCKED)
    https://www.postgresql.org/docs/current/sql-select.html
  2. PostgreSQL docs — Explicit locking overview
    https://www.postgresql.org/docs/current/explicit-locking.html
  3. PostgreSQL docs — Partial indexes
    https://www.postgresql.org/docs/current/indexes-partial.html
  4. PostgreSQL docs — LISTEN / NOTIFY
    https://www.postgresql.org/docs/current/sql-notify.html
  5. PostgreSQL docs — Routine vacuuming / autovacuum concepts
    https://www.postgresql.org/docs/current/routine-vacuuming.html