PostgreSQL 17 Logical Replication Failover Slots HA Playbook

2026-03-23 · software

Date: 2026-03-23
Category: knowledge
Scope: How to make PostgreSQL logical subscribers survive primary failover without re-snapshotting.


1) Why this matters

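Before PostgreSQL 17, logical slots were not carried over to a promoted standby, so a primary promotion often meant painful subscriber surgery: recreating slots and, in the worst case, re-snapshotting the subscribed tables.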

PostgreSQL 17 formalizes logical replication failover so subscriber continuity can survive primary failover if you wire the slot-sync pipeline correctly.


2) Core mental model

A subscription can continue after failover only if:

  1. Its logical slot is marked failover = true on the publisher side.
  2. Slot state was synchronized to the standby in time.
  3. The standby has a usable synced slot at promotion time.
  4. Subscriber conninfo is switched to the new primary.

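Think of it as two lanes: the logical change stream from publisher to subscriber, and the slot-state synchronization from primary to standby. Both must be healthy.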


3) Required wiring (minimum viable HA)

3.1 Publisher / primary
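
The primary-side settings are not spelled out in this subsection. As a minimal sketch, assuming the synchronized_standby_slots requirement named in section 8 and a hypothetical physical slot called standby1_slot for the target standby, the primary could be wired like this:

```sql
-- Run on the primary. 'standby1_slot' is a hypothetical name; use your own.

-- Logical decoding requires wal_level = logical (changing it needs a restart).
ALTER SYSTEM SET wal_level = 'logical';

-- Physical slot that the target standby will stream from.
SELECT pg_create_physical_replication_slot('standby1_slot');

-- Make logical walsenders wait until this standby has confirmed the WAL,
-- so subscribers cannot get ahead of the future primary.
ALTER SYSTEM SET synchronized_standby_slots = 'standby1_slot';

-- Apply the reloadable settings.
SELECT pg_reload_conf();
```

One trade-off to keep in mind: while a slot listed in synchronized_standby_slots is not being consumed, logical walsenders wait for it, so a standby outage can stall subscribers until the standby returns or the setting is changed.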

3.2 Physical standby (future primary)

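On standby, configure sync_replication_slots, primary_slot_name, and hot_standby_feedback (the settings named in section 8).

Without this set, failover slots won’t synchronize reliably.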
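
A minimal sketch of the standby side, assuming the same hypothetical standby1_slot physical slot and the host, user, and database names used in the subscription example in section 4.1. Adjust primary_conninfo to your existing value rather than overwriting it blindly:

```sql
-- Run on the standby. Slot name, host, user, and dbname are placeholders.

-- Enable the slot synchronization worker.
ALTER SYSTEM SET sync_replication_slots = on;

-- Stream through the physical slot created on the primary.
ALTER SYSTEM SET primary_slot_name = 'standby1_slot';

-- Required for slot synchronization.
ALTER SYSTEM SET hot_standby_feedback = on;

-- primary_conninfo must include a dbname so the sync worker can connect.
-- Credentials are omitted here (for example, supplied through .pgpass).
ALTER SYSTEM SET primary_conninfo = 'host=primary-db user=repl dbname=app';

SELECT pg_reload_conf();

-- Confirm the values took effect.
SELECT name, setting
FROM pg_settings
WHERE name IN ('sync_replication_slots', 'primary_slot_name', 'hot_standby_feedback')
ORDER BY name;
```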


4) Subscription creation patterns

4.1 Preferred: explicit failover-enabled subscription

CREATE SUBSCRIPTION sub_orders
CONNECTION 'host=primary-db dbname=app user=repl password=***'
PUBLICATION pub_orders
WITH (
  create_slot = true,
  slot_name = 'sub_orders',
  copy_data = false,
  failover = true
);

4.2 Deferred/manual slot mode (advanced)

If create_slot = false, the slot must already exist on the publisher, and its failover property must match the failover setting of the subscription. A mismatch creates confusing behavior: the subscription says failover-enabled while the slot is not, or vice versa.
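
As a sketch of the manual mode, assuming the same sub_orders and pub_orders names as in 4.1 and the built-in pgoutput plugin, the slot can be created with its failover flag set before the subscription attaches to it:

```sql
-- On the publisher: create the slot up front with failover enabled.
-- The arguments after the plugin are temporary, twophase, failover.
SELECT pg_create_logical_replication_slot('sub_orders', 'pgoutput', false, false, true);

-- On the subscriber: attach to the existing slot with the same failover setting.
CREATE SUBSCRIPTION sub_orders
CONNECTION 'host=primary-db dbname=app user=repl password=***'
PUBLICATION pub_orders
WITH (
  create_slot = false,
  slot_name = 'sub_orders',
  copy_data = false,
  failover = true
);

-- On the publisher: slot-side flag.
SELECT slot_name, failover
FROM pg_replication_slots
WHERE slot_name = 'sub_orders';

-- On the subscriber: subscription-side flag. The two values should agree.
SELECT subname, subfailover
FROM pg_subscription
WHERE subname = 'sub_orders';
```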


5) Pre-failover readiness checklist (must-pass)

5.1 On subscriber: list main slots tied to failover-enabled subscriptions

SELECT array_agg(quote_literal(s.subslotname)) AS slots
FROM pg_subscription s
WHERE s.subfailover
  AND s.subslotname IS NOT NULL;

5.2 On subscriber: list relevant table-sync slots (finished copy only)

SELECT array_agg(quote_literal(slot_name)) AS slots
FROM (
  SELECT CONCAT('pg_', srsubid, '_sync_', srrelid, '_', ctl.system_identifier) AS slot_name
  FROM pg_control_system() ctl,
       pg_subscription_rel r,
       pg_subscription s
  WHERE r.srsubstate = 'f'
    AND s.oid = r.srsubid
    AND s.subfailover
) t;

5.3 On target standby: confirm slots are failover-ready

SELECT slot_name,
       (synced AND NOT temporary AND invalidation_reason IS NULL) AS failover_ready
FROM pg_replication_slots
WHERE slot_name IN ('sub1','sub2','sub3');

Only promote when all critical slots show failover_ready = true.
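
The per-slot query in 5.3 silently omits slots that never reached the standby. One way to turn it into a single pass/fail gate is sketched below; the slot names are the same placeholders as above, and the column aliases are illustrative:

```sql
-- Run on the target standby. Replace the VALUES list with the slot names
-- returned by the subscriber queries in 5.1 and 5.2.
WITH expected(slot_name) AS (
  VALUES ('sub1'), ('sub2'), ('sub3')
)
SELECT bool_and(
         -- A slot that is absent on the standby counts as not ready.
         COALESCE(s.synced AND NOT s.temporary AND s.invalidation_reason IS NULL, false)
       ) AS all_failover_ready,
       array_agg(e.slot_name) FILTER (WHERE s.slot_name IS NULL) AS missing_slots
FROM expected e
LEFT JOIN pg_replication_slots s ON s.slot_name = e.slot_name;
```

A promotion gate can then refuse to proceed unless all_failover_ready is true and missing_slots is NULL.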


6) Failover runbook (planned event)

  1. Freeze apply on the subscribers:
    • ALTER SUBSCRIPTION ... DISABLE (recommended before promotion).
  2. Promote standby to primary.
  3. Update subscriber connection strings:
    • ALTER SUBSCRIPTION ... CONNECTION 'host=new-primary ...';
  4. Re-enable subscriptions:
    • ALTER SUBSCRIPTION ... ENABLE;
  5. Validate that confirmed_flush_lsn on the new primary shows no gap or regression, and that app-level monotonic checks still pass.

Why disable first? If the old primary is still reachable, subscribers may keep consuming from it until conninfo flips, risking divergence.
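
The runbook steps can be sketched as plain SQL. This assumes a single subscription named sub_orders, the new-primary host name from step 3, and the dbname and user from section 4.1; credentials are left out and would come from your own secret handling:

```sql
-- Step 1, on each subscriber: stop apply before promotion.
ALTER SUBSCRIPTION sub_orders DISABLE;

-- Step 2, on the standby: promote (by default waits up to 60 seconds).
SELECT pg_promote();

-- Step 3, on each subscriber: point the subscription at the new primary.
ALTER SUBSCRIPTION sub_orders CONNECTION 'host=new-primary dbname=app user=repl';

-- Step 4, on each subscriber: resume apply.
ALTER SUBSCRIPTION sub_orders ENABLE;

-- Step 5, on the new primary: the slot should be active and its
-- confirmed_flush_lsn should advance rather than jump backwards.
SELECT slot_name, active, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'sub_orders';
```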


7) Operational observability queries

7.1 Slot health on candidate primary/standby

SELECT slot_name,
       slot_type,
       failover,
       synced,
       active,
       wal_status,
       restart_lsn,
       confirmed_flush_lsn,
       invalidation_reason
FROM pg_replication_slots
ORDER BY slot_name;

7.2 Subscription posture on subscriber

SELECT subname,
       subenabled,
       subfailover,
       subslotname,
       subtwophasestate,
       subsynccommit
FROM pg_subscription
ORDER BY subname;

7.3 Table sync status that can influence slot expectations

SELECT s.subname,
       r.srrelid::regclass AS relation,
       r.srsubstate,
       r.srsublsn
FROM pg_subscription_rel r
JOIN pg_subscription s ON s.oid = r.srsubid
ORDER BY s.subname, relation;
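
A byte-lag figure is often easier to alert on than raw LSNs. The following is a minimal sketch, assuming it runs on the current primary (pg_current_wal_lsn() is not available during recovery); the unconfirmed_wal alias is illustrative:

```sql
-- Run on the primary: WAL not yet confirmed by each failover-enabled logical slot.
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
       ) AS unconfirmed_wal
FROM pg_replication_slots
WHERE slot_type = 'logical'
  AND failover
ORDER BY slot_name;
```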

8) Common failure modes

  1. failover=true forgotten at creation
    Subscriber appears healthy until first failover drill.

  2. Standby missing sync_replication_slots / primary_slot_name / hot_standby_feedback
    Slot sync silently incomplete; promotion breaks logical continuity.

  3. No synchronized_standby_slots on primary
    Logical consumers can receive changes before the standby has them; after promotion a subscriber may be ahead of the new primary, so the failover-ready window becomes fragile.

  4. Promoting with non-persistent synced slots
    synced=true alone is insufficient if slot is temporary or invalidated.

  5. Skipping pre-promotion slot readiness SQL
    You discover missing slot state only after cutover.
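
Most of these failure modes can be detected ahead of a drill. The checks below are a sketch, not an exhaustive audit; sub_orders is the example subscription name from section 4.1:

```sql
-- Failure mode 1, on the subscriber: subscriptions created without failover.
SELECT subname, subslotname
FROM pg_subscription
WHERE NOT subfailover;

-- Possible fix, on the subscriber. The failover option can only be changed
-- while the subscription is disabled, and not inside a transaction block.
ALTER SUBSCRIPTION sub_orders DISABLE;
ALTER SUBSCRIPTION sub_orders SET (failover = true);
ALTER SUBSCRIPTION sub_orders ENABLE;

-- Failure modes 2 and 3: run on the standby and on the primary, then compare
-- against the wiring in section 3.
SELECT name, setting
FROM pg_settings
WHERE name IN ('sync_replication_slots', 'primary_slot_name',
               'hot_standby_feedback', 'synchronized_standby_slots')
ORDER BY name;

-- Failure mode 4, on the standby: synced slots that are still temporary
-- or have been invalidated.
SELECT slot_name, temporary, invalidation_reason
FROM pg_replication_slots
WHERE synced
  AND (temporary OR invalidation_reason IS NOT NULL);
```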


9) Practical SLO guardrails


10) Bottom line

PostgreSQL 17 makes logical failover far more operationally sane, but only if you treat slot synchronization as a first-class HA dependency.

If you run logical replication in production, add failover-slot readiness checks to your promotion gate the same way you gate on replication lag and application health.
