PostgreSQL SKIP LOCKED Job Queue Playbook
Date: 2026-03-15
Category: knowledge
Why this matters
A lot of teams add Redis/Rabbit/Kafka early, even when their first reliability problem is actually:
- “we need background jobs,”
- “we need retries and scheduling,”
- “we need exactly-once effects (or close enough),”
- and “we need fewer moving parts right now.”
If your source of truth is already PostgreSQL, a SKIP LOCKED queue can be a very strong default for small-to-medium throughput systems.
The trick is doing it like an ops system, not a demo snippet.
When Postgres queue is the right tool
Use this approach when:
- your app already depends on Postgres,
- queue throughput is moderate (not internet-scale fanout),
- correctness and operational simplicity matter more than raw broker features,
- your team can enforce idempotent workers.
Do not force this if you need:
- very high fanout pub/sub,
- multi-region low-latency stream semantics,
- long-retention replay and consumer groups at broker scale.
Then use a dedicated log/broker.
Core mental model
FOR UPDATE SKIP LOCKED lets workers race for jobs without blocking each other:
- each worker tries to lock the next eligible row,
- locked rows are skipped by others,
- each worker gets different work in parallel.
So your DB row lock becomes the claim primitive.
Minimal schema that scales reasonably
create table jobs (
id bigserial primary key,
queue_name text not null default 'default',
payload jsonb not null,
status text not null check (status in ('queued','running','done','failed')),
priority int not null default 0,
run_at timestamptz not null default now(),
attempts int not null default 0,
max_attempts int not null default 20,
locked_by text,
locked_at timestamptz,
last_error text,
created_at timestamptz not null default now(),
updated_at timestamptz not null default now()
);
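Enqueueing is a plain insert against this schema. A minimal helper, sketched in Python; the DB-API cursor (e.g. psycopg) and the function name `enqueue` are assumptions, and only the columns without a usable default are supplied:

```python
import json

# Hypothetical enqueue helper: `cur` is any DB-API cursor (e.g. psycopg).
# status has no default in the schema above, so it is set explicitly;
# run_at/attempts/timestamps fall back to their column defaults.
ENQUEUE_SQL = """
insert into jobs (queue_name, payload, status, run_at)
values (%s, %s::jsonb, 'queued', coalesce(%s, now()))
returning id
"""

def enqueue(cur, queue_name, payload, run_at=None):
    cur.execute(ENQUEUE_SQL, (queue_name, json.dumps(payload), run_at))
    return cur.fetchone()[0]
```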
High-value indexes
-- Fast dequeue path (partial index for active queue)
create index idx_jobs_dequeue
on jobs (queue_name, priority desc, run_at asc, id asc)
where status = 'queued';
-- Reaper path for stuck workers
create index idx_jobs_running_timeout
on jobs (locked_at)
where status = 'running';
Partial indexes matter here: they keep the hot index set small and focused as the table accumulates done/failed rows.
Atomic claim query (single statement)
with picked as (
select id
from jobs
where status = 'queued'
and queue_name = $1
and run_at <= now()
order by priority desc, run_at asc, id asc
for update skip locked
limit 1
)
update jobs j
set status = 'running',
locked_by = $2,
locked_at = now(),
attempts = attempts + 1,
updated_at = now()
from picked
where j.id = picked.id
returning j.*;
Why this shape:
- claim + state transition is atomic,
- no separate read-then-update race,
- compatible with many workers.
Worker lifecycle contract
- Claim one job (or small batch).
- Execute handler with strict timeout.
- On success: mark done (or delete/archive).
- On failure:
  - if attempts < max_attempts: reschedule with backoff (run_at = now() + delay) and set status back to queued,
  - else: mark failed (DLQ semantics).
- Always write a structured failure reason.
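The contract above can be sketched as one orchestration function. `store`, its method names, and the `backoff` callable are hypothetical stand-ins for the UPDATE statements in this playbook; `job` mirrors a jobs row after the atomic claim (so attempts is already incremented):

```python
# Sketch of the worker lifecycle contract, assuming a `store` interface
# that wraps the done/requeue/failed UPDATE statements.

def process_claimed_job(job, handler, store, backoff):
    try:
        handler(job)  # real workers run this under a strict timeout
    except Exception as exc:
        if job["attempts"] < job["max_attempts"]:
            # reschedule: run_at = now() + delay, status back to 'queued'
            store.reschedule(job["id"], delay=backoff(job["attempts"]),
                             error=repr(exc))
        else:
            # terminal failure: DLQ semantics
            store.mark_failed(job["id"], error=repr(exc))
        return False
    store.mark_done(job["id"])
    return True
```

Keeping the branch logic in one place makes the retry/DLQ decision auditable and easy to unit-test against a fake store.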
Retry policy that avoids retry storms
Use exponential backoff + jitter:
- delay = base * 2^(attempts - 1), capped at a maximum,
- then multiplied by a random jitter factor (e.g. 0.8–1.2).
Without jitter, synchronized retries can create periodic load spikes.
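One way to sketch this policy in Python; the base and cap values here are illustrative, not prescriptive:

```python
import random

def retry_delay(attempts, base=2.0, cap=600.0, rng=random.random):
    """Exponential backoff with jitter: base * 2^(attempts - 1),
    capped at `cap`, then scaled by a random factor in [0.8, 1.2)."""
    delay = min(base * 2 ** (attempts - 1), cap)
    return delay * (0.8 + 0.4 * rng())
```

Injecting `rng` keeps the function deterministic in tests while using real randomness in production.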
Stuck-job recovery (reaper)
Workers crash. Network stalls. Pods get evicted.
Run a periodic reaper:
update jobs
set status = 'queued',
locked_by = null,
locked_at = null,
run_at = now() + interval '5 seconds',
updated_at = now(),
last_error = coalesce(last_error, 'requeued by reaper after lock timeout')
where status = 'running'
and locked_at < now() - interval '10 minutes';
Tune timeout from real p99 handler duration, not optimism.
Idempotency is non-negotiable
SKIP LOCKED gives solid claiming, but your side effects can still duplicate (crash after effect, before ack).
Use one or more:
- idempotency key on external API calls,
- unique business key in DB,
- outbox/inbox pattern for cross-service effects,
- “already-applied” check in handler.
Goal: at-least-once delivery, effectively-once outcomes.
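The "already-applied" check can be sketched as a key-claiming guard. In Postgres this would be an insert into an inbox table with a unique key (`insert ... on conflict do nothing`, side effect only if the insert won); here an in-memory set stands in for that unique index, which is an assumption for illustration:

```python
# Sketch of an "already-applied" guard. `claimed_keys` simulates a
# unique-indexed inbox table; in production the key claim and the
# side effect's DB writes should share one transaction.

def apply_once(idempotency_key, effect, claimed_keys):
    if idempotency_key in claimed_keys:
        return False  # duplicate delivery: skip the side effect
    claimed_keys.add(idempotency_key)  # in SQL: the conflict-free insert
    effect()
    return True
```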
Throughput knobs (practical)
- Batch claim size
- Start with 1–10 per transaction.
- Polling interval
- Short sleep when empty (e.g. 100–500ms) to reduce DB churn.
- LISTEN/NOTIFY assist
- Push wakeups to reduce blind polling.
- Queue partitioning
- shard by queue_name/tenant for noisy-neighbor control.
- Autovacuum hygiene
- hot update/delete tables need vacuum tuning to avoid bloat drift.
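The polling-interval knob can be sketched as an empty-poll backoff: sleep briefly after an empty claim, grow toward a cap, and reset the moment work appears. The 0.1–0.5s bounds mirror the suggestion above and are tunable assumptions:

```python
# Sketch of adaptive poll pacing: doubles the sleep on consecutive
# empty polls (bounded by floor/cap) and resets when a job is claimed.

def next_poll_delay(prev_delay, got_job, floor=0.1, cap=0.5):
    if got_job:
        return 0.0  # work available: claim again immediately
    return min(max(prev_delay * 2, floor), cap)
```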
Anti-patterns that hurt in production
- Read row, then update in separate transaction
- causes duplicate claims/races.
- No partial index on queued jobs
- dequeue path degrades as table grows.
- Infinite retries without max_attempts
- poison jobs loop forever.
- No stuck-job reaper
- crashed workers create zombie running jobs.
- Non-idempotent handlers
- transient failures become data corruption.
30-day rollout plan
Week 1 — Foundation
- implement schema + atomic claim query,
- set max_attempts + retry backoff policy,
- instrument queue depth, claim latency, success/failure rates.
Week 2 — Reliability rails
- add handler timeout contract,
- add reaper for stale running jobs,
- add DLQ/failed-state dashboard.
Week 3 — Scale hygiene
- tune partial indexes and vacuum settings,
- add LISTEN/NOTIFY wakeups if polling load is high,
- define per-queue concurrency limits.
Week 4 — Operational hardening
- run failure drills (worker crash, DB failover, poison payload),
- verify idempotency on critical handlers,
- add alerts for backlog growth, repeated failures, reaper activity spikes.
Suggested metrics
Track at minimum:
- queued count by queue_name,
- oldest queued age,
- claim rate and empty-poll rate,
- success/failure/retry counts,
- p50/p95 job latency (enqueue -> done),
- stuck-job requeue count,
- dead-letter (failed terminal) rate.
If you can’t see queue age and retries, incidents arrive “suddenly.”
Bottom line
Postgres + SKIP LOCKED is a legit production queue pattern when used with:
- atomic claim queries,
- idempotent handlers,
- bounded retries,
- stale-lock recovery,
- queue-focused observability.
You won’t out-broker Kafka with this.
But for many systems, you’ll ship faster with fewer moving parts and enough reliability headroom.
References
- PostgreSQL docs — SELECT locking clause (FOR UPDATE ... SKIP LOCKED): https://www.postgresql.org/docs/current/sql-select.html
- PostgreSQL docs — Explicit locking overview: https://www.postgresql.org/docs/current/explicit-locking.html
- PostgreSQL docs — Partial indexes: https://www.postgresql.org/docs/current/indexes-partial.html
- PostgreSQL docs — LISTEN/NOTIFY: https://www.postgresql.org/docs/current/sql-notify.html
- PostgreSQL docs — Routine vacuuming / autovacuum: https://www.postgresql.org/docs/current/routine-vacuuming.html