Zero-Downtime Schema Migrations: Expand/Contract Playbook
Date: 2026-03-03
Category: software
Purpose: A practical guide for shipping database schema changes without breaking production traffic.
Why this matters
Most outages during feature releases are not caused by complicated algorithms; they are caused by schema changes that assume the app and the database update atomically. In real systems, deploys are staggered, workers are long-lived, and rollbacks happen.
The safe default is to design migrations around a phased model:
- Expand: add backwards-compatible schema
- Migrate: gradually backfill / dual-write / verify
- Contract: remove old schema only after full cutover
This is often called the expand/contract (or parallel change) pattern.
Core compatibility rules
For every release, ask:
- Can old code run on new schema?
- Can new code run on old schema (during rollout delay)?
- Can I roll back application code without emergency DB surgery?
If any answer is “no,” treat the migration as risky and redesign the phases.
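These questions can be made concrete as a compatibility-matrix check in CI. A minimal sketch in Python, where the dict rows and reader functions are hypothetical stand-ins for the two schema versions and the two app versions:

```python
# Sketch: old/new app code vs old/new schema compatibility matrix.
# Rows and readers are hypothetical stand-ins, not a real ORM.

OLD_ROW = {"id": 1, "created_at": "2026-03-01"}                       # old schema
NEW_ROW = {"id": 1, "created_at": "2026-03-01", "executed_at": None}  # after expand

def read_old_app(row):
    # Old code must ignore columns it does not know about.
    return row["created_at"]

def read_new_app(row):
    # New code must tolerate the new column being absent or null.
    return row.get("executed_at") or row["created_at"]

def compatible(reader, row):
    try:
        return reader(row) is not None
    except KeyError:
        return False

# Every cell must pass before the rollout is considered safe.
matrix = {
    (app, schema): compatible(reader, row)
    for app, reader in [("old_app", read_old_app), ("new_app", read_new_app)]
    for schema, row in [("old_schema", OLD_ROW), ("new_schema", NEW_ROW)]
}
assert all(matrix.values()), matrix
```

A failing cell here maps directly to one of the three questions above and tells you which phase needs to be redesigned.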
The 4-phase migration model
0) Design phase (before writing SQL)
Clarify:
- Data shape change type (add column, split table, rename semantics, etc.)
- Read/write path changes
- Rollback plan
- Verification metrics
- Estimated backfill duration and load impact
Deliverables:
- Migration plan doc (with phases and owner)
- Kill switch / feature flag strategy
- Runbook for rollback and pause
1) Expand phase (safe schema additions)
Typical operations:
- Add nullable columns
- Add new tables/indexes
- Add non-blocking constraints (or deferred validation)
- Add triggers or write adapters for transitional sync
Avoid in this phase:
- Dropping columns/tables
- Tightening constraints that old code violates
- Renaming fields used by live app versions
Example (PostgreSQL):
ALTER TABLE orders ADD COLUMN executed_at timestamptz;  -- nullable, no default: metadata-only, no table rewrite
-- CONCURRENTLY avoids long write locks, but cannot run inside a transaction block
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_executed_at ON orders (executed_at);
App behavior after expand:
- Continue serving with old code unchanged
- Optionally enable new code to read/write new fields behind a flag
2) Migrate phase (data + traffic transition)
2.1 Backfill
Backfill in small batches with checkpoints:
- Limit rows per batch
- Sleep between batches if needed
- Record progress cursor
- Retry idempotently
Pseudo-SQL strategy:
-- repeated in job loop (PostgreSQL does not allow ORDER BY / LIMIT
-- directly on UPDATE, so batch via a subquery on the primary key)
UPDATE orders
SET executed_at = created_at
WHERE id IN (
  SELECT id FROM orders
  WHERE executed_at IS NULL
    AND id > :last_id
  ORDER BY id
  LIMIT 1000
);
-- advance :last_id to the highest id touched before the next batch
(Batching primitives are engine-specific; MySQL, for example, does allow ORDER BY / LIMIT directly on UPDATE.)
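The job loop itself can be a small worker that batches, checkpoints a cursor, and leaves room for throttling. A sketch in Python using an in-memory SQLite table to stand in for `orders` (SQLite accepts the subquery form; the real engine and the durable cursor store will differ):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT, executed_at TEXT)")
conn.executemany("INSERT INTO orders (id, created_at) VALUES (?, ?)",
                 [(i, f"2026-03-{i % 28 + 1:02d}") for i in range(1, 5001)])

BATCH = 1000
last_id = 0                        # progress cursor; persist this in a real job
while True:
    cur = conn.execute(
        """UPDATE orders SET executed_at = created_at
           WHERE id IN (SELECT id FROM orders
                        WHERE executed_at IS NULL AND id > ?
                        ORDER BY id LIMIT ?)""",
        (last_id, BATCH))
    conn.commit()                  # commit per batch so retries resume cleanly
    if cur.rowcount == 0:
        break                      # backfill complete
    last_id = conn.execute(
        "SELECT max(id) FROM orders WHERE executed_at IS NOT NULL").fetchone()[0]
    time.sleep(0)                  # throttle between batches in production

remaining = conn.execute(
    "SELECT count(*) FROM orders WHERE executed_at IS NULL").fetchone()[0]
assert remaining == 0
```

Because the guard is `executed_at IS NULL`, re-running the loop after a crash is harmless: already-backfilled rows are never touched again.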
2.2 Dual-write / dual-read
- Start writing to both old and new fields
- Prefer read-old fallback-to-new or read-new fallback-to-old during ramp
- Measure mismatch rates
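A minimal app-layer sketch of dual-write plus read-with-fallback, assuming hypothetical `old_store`/`new_store` dicts in place of real tables and a plain counter standing in for the mismatch metric:

```python
# Hypothetical in-memory stores standing in for the old and new columns.
old_store, new_store = {}, {}
dual_write_mismatch_total = 0

def write_order_status(order_id, status):
    # Dual-write: keep both representations in sync during the migration.
    old_store[order_id] = status
    new_store[order_id] = status.upper()   # hypothetical new representation

def read_order_status(order_id):
    # Read-old, fall back to new; compare when both exist to measure drift.
    global dual_write_mismatch_total
    old = old_store.get(order_id)
    new = new_store.get(order_id)
    if old is not None and new is not None and old.upper() != new:
        dual_write_mismatch_total += 1     # emit a real metric in production
    return old if old is not None else (new or "").lower()

write_order_status(1, "shipped")
assert read_order_status(1) == "shipped"
assert dual_write_mismatch_total == 0
```

The comparison on every read is what makes silent divergence visible; the counter feeds the mismatch-rate dashboard used to gate cutover.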
2.3 Verification
Track:
- Null ratio in new column
- Old/new value mismatch count
- Error rate and latency shifts
- Replication lag / lock time
Cutover only when mismatch is near-zero and stable.
3) Cutover phase
- Flip read path fully to new schema
- Keep dual-write briefly as safety buffer
- Monitor error budget window (e.g., 24–72h)
If issues appear:
- Revert flag to old read path
- Keep schema intact
- Diagnose without destructive rollback
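The flip-and-revert behavior is cheapest when the read path is a pure function of a flag. A sketch, assuming a hypothetical in-process flag store:

```python
flags = {"orders_read_new_schema": False}   # hypothetical flag store

def read_executed_at(row):
    # Cutover: flip the flag to move reads; flip it back to roll back
    # instantly, with no schema change and no data loss.
    if flags["orders_read_new_schema"]:
        return row.get("executed_at") or row["created_at"]  # new path w/ fallback
    return row["created_at"]                                # old path

row = {"created_at": "2026-03-01", "executed_at": "2026-03-02"}

flags["orders_read_new_schema"] = True      # cutover
assert read_executed_at(row) == "2026-03-02"

flags["orders_read_new_schema"] = False     # rollback: flag only, schema intact
assert read_executed_at(row) == "2026-03-01"
```

Because dual-write is still running during this window, flipping back loses nothing; the new column keeps receiving data while you diagnose.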
4) Contract phase (remove legacy)
Only after stable cutover:
- Disable dual-write
- Drop old columns/tables/indexes
- Enforce stricter constraints if needed
- Remove dead code and migration flags
Example:
ALTER TABLE orders DROP COLUMN legacy_status;
Treat contract as a separate release with its own observability.
High-risk change patterns and safer alternatives
Column rename
Risky:
ALTER TABLE ... RENAME COLUMN ... while mixed app versions exist
Safer:
- Add new column
- Dual-write
- Backfill
- Cutover reads
- Drop old column
Type change (int → bigint, text → enum, etc.)
Risky:
- In-place conversion on hot table
Safer:
- Add new typed column
- Convert gradually in app/backfill
- Verify
- Swap reads and contract
NOT NULL enforcement
Risky:
- Add NOT NULL before all writers are compliant
Safer:
- Keep the column nullable during migration
- Monitor that null inserts stay at zero over an observation window
- Then enforce NOT NULL (in PostgreSQL, add a CHECK (col IS NOT NULL) NOT VALID constraint and validate it first, so SET NOT NULL can skip the long full-table scan)
Splitting one table into many
Safer flow:
- Add new tables
- Dual-write from app layer (or CDC/triggers)
- Backfill history
- Switch reads by domain
- Remove old table in final phase
Operational checklist (copy/paste)
Pre-deploy
- Compatibility matrix validated (old/new app vs old/new schema)
- Rollback path tested in staging
- Feature flags prepared
- Migration runtime and lock risk estimated
- Dashboard + alerts prepared
Expand deployed
- Schema applied without blocking incidents
- New indexes built safely
- Error rate unchanged
Migrate running
- Backfill progress visible
- Batch size tuned to stay under SLO impact
- Dual-write mismatch monitored
- Canary/partial traffic validated
Cutover
- Read path switched via flag
- Mismatch remains near-zero
- Rollback switch verified
Contract
- Legacy reads/writes removed in code
- Old schema dropped in separate deploy
- Cleanup PR merged (flags/runbooks/docs)
Observability signals that matter
Minimum migration dashboard:
- migration_backfill_rows_total
- migration_backfill_rows_remaining
- dual_write_mismatch_total
- migration_lock_wait_seconds
- App p95 latency and 5xx rate
- Replication lag / queue depth
Alert on trends, not single spikes, unless safety thresholds are hard (e.g., lock wait > N sec).
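One way to encode "trends, not single spikes, except for hard thresholds" is to combine a hard limit with a sustained-growth condition over a window. A sketch with hypothetical metric samples and threshold values:

```python
LOCK_WAIT_HARD_LIMIT = 5.0   # hypothetical hard safety threshold (seconds)
WINDOW = 5                   # samples required for a trend alert

def should_alert(lock_wait_samples, mismatch_samples):
    # Hard threshold: a single bad lock-wait sample is enough to page.
    if any(s > LOCK_WAIT_HARD_LIMIT for s in lock_wait_samples[-WINDOW:]):
        return True
    # Trend: alert only when mismatches grow monotonically across the window,
    # so an isolated noisy spike does not page anyone.
    recent = mismatch_samples[-WINDOW:]
    return len(recent) == WINDOW and all(a < b for a, b in zip(recent, recent[1:]))

assert not should_alert([0.1, 0.2, 0.1], [3, 0, 2, 1, 0])   # noisy, no alert
assert should_alert([0.1, 6.0], [0, 0])                     # hard limit breached
assert should_alert([0.1], [1, 2, 3, 4, 5])                 # sustained growth
```

Real alerting stacks express the same logic as recording rules or burn-rate queries; the split between hard limits and trend conditions is the part that carries over.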
Common failure modes
- Contract too early: old columns dropped before all workers deploy
- Unbounded backfill: one massive update creates lock pressure and replication lag
- No rollback flag: incidents force emergency DB rewrites under pressure
- Silent divergence: dual-write mismatches are not monitored
- Non-idempotent migration script: retries corrupt or duplicate data
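The idempotency failure mode is easy to demonstrate: a guarded update converges under retry, an unguarded one compounds. A sketch using SQLite in place of the production database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT,
                executed_at TEXT, touched INTEGER DEFAULT 0)""")
conn.execute("INSERT INTO orders (id, created_at) VALUES (1, '2026-03-01')")

def backfill_idempotent():
    # Guard clause: re-running after a crash or retry is harmless.
    conn.execute("UPDATE orders SET executed_at = created_at WHERE executed_at IS NULL")

def backfill_non_idempotent():
    # No guard: every retry mutates state again.
    conn.execute("UPDATE orders SET touched = touched + 1")

for _ in range(3):               # simulate a job runner retrying three times
    backfill_idempotent()
    backfill_non_idempotent()

row = conn.execute("SELECT executed_at, touched FROM orders WHERE id = 1").fetchone()
assert row[0] == "2026-03-01"    # converged: identical after every retry
assert row[1] == 3               # drifted: the unguarded write compounded
```

The guard (`WHERE executed_at IS NULL`) is the same one used in the backfill loop earlier; it is what makes "retry idempotently" a property of the SQL rather than of the scheduler.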
Practical rule of thumb
If a schema change cannot survive:
- staggered deploy,
- delayed workers,
- and one-click app rollback,
it is not production-ready yet.
Design for mixed versions first. Cleanup later.
Further reading keywords
- Parallel change pattern
- Expand and contract migrations
- Online schema change (OSC)
- Backfill throttling
- Progressive delivery + feature flags
- Data contract testing