Zero-Downtime Schema Migrations: Expand/Contract Playbook

2026-03-03 · software

Zero-Downtime Schema Migrations: Expand/Contract Playbook

Date: 2026-03-03
Category: software
Purpose: A practical guide for shipping database schema changes without breaking production traffic.


Why this matters

Most outages during feature releases are not caused by complicated algorithms—they are caused by schema changes that assume the app and DB update atomically. In real systems, deploys are staggered, workers are long-lived, and rollbacks happen.

The safe default is to design migrations around a phased model:

  1. Expand: add backwards-compatible schema
  2. Migrate: gradually backfill / dual-write / verify
  3. Contract: remove old schema only after full cutover

This is often called the expand/contract (or parallel change) pattern.


Core compatibility rules

For every release, ask:

If any answer is “no,” treat the migration as risky and redesign the phases.


The 4-phase migration model

0) Design phase (before writing SQL)

Clarify:

Deliverables:


1) Expand phase (safe schema additions)

Typical operations:

Avoid in this phase:

Example (PostgreSQL):

ALTER TABLE orders ADD COLUMN executed_at timestamptz;
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_executed_at ON orders(executed_at);

App behavior after expand:


2) Migrate phase (data + traffic transition)

2.1 Backfill

Backfill in small batches with checkpoints:

Pseudo-SQL strategy:

-- repeated in job loop
UPDATE orders
SET executed_at = created_at
WHERE executed_at IS NULL
  AND id > :last_id
ORDER BY id
LIMIT 1000;

(Use DB-specific safe batching primitives; some engines need a subquery/CTE pattern.)

2.2 Dual-write / dual-read

2.3 Verification

Track:

Cutover only when mismatch is near-zero and stable.


3) Cutover phase

If issues appear:


4) Contract phase (remove legacy)

Only after stable cutover:

Example:

ALTER TABLE orders DROP COLUMN legacy_status;

Treat contract as a separate release with its own observability.


High-risk change patterns and safer alternatives

Column rename

Risky:

Safer:

  1. Add new column
  2. Dual-write
  3. Backfill
  4. Cutover reads
  5. Drop old column

Type change (int → bigint, text → enum, etc.)

Risky:

Safer:

NOT NULL enforcement

Risky:

Safer:

Splitting one table into many

Safer flow:


Operational checklist (copy/paste)

Pre-deploy

Expand deployed

Migrate running

Cutover

Contract


Observability signals that matter

Minimum migration dashboard:

Alert on trends, not single spikes, unless safety thresholds are hard (e.g., lock wait > N sec).


Common failure modes

  1. Contract too early
    • Dropping old column before all workers deploy
  2. Unbounded backfill
    • Massive update creates lock pressure and replication lag
  3. No rollback flag
    • Forced emergency DB rewrites under pressure
  4. Silent divergence
    • Dual-write mismatches not monitored
  5. Migration script not idempotent
    • Retry corrupts or duplicates data

Practical rule of thumb

If a schema change cannot survive:

it is not production-ready yet.

Design for mixed versions first. Cleanup later.


Further reading keywords