Zero-Downtime Schema Migrations: Expand/Contract Playbook
Date: 2026-03-03
Category: software
Purpose: A practical guide for shipping database schema changes without breaking production traffic.
Why this matters
Most outages during feature releases are not caused by complicated algorithms; they are caused by schema changes that assume the app and the database update atomically. In real systems, deploys are staggered, workers are long-lived, and rollbacks happen.
The safe default is to design migrations around a phased model:
- Expand: add backwards-compatible schema
- Migrate: gradually backfill / dual-write / verify
- Contract: remove old schema only after full cutover
This is often called the expand/contract (or parallel change) pattern.
Core compatibility rules
For every release, ask:
- Can old code run on new schema?
- Can new code run on old schema (during rollout delay)?
- Can I roll back application code without emergency DB surgery?
If any answer is “no,” treat the migration as risky and redesign the phases.
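These questions can be made concrete as a compatibility-matrix check in CI. A minimal sketch in Python, where the dict rows and reader functions are hypothetical stand-ins for the two schema versions and the two app versions:

```python
# Sketch: old/new app code vs old/new schema compatibility matrix.
# Rows and readers are hypothetical stand-ins, not a real ORM.

OLD_ROW = {"id": 1, "created_at": "2026-03-01"}                       # old schema
NEW_ROW = {"id": 1, "created_at": "2026-03-01", "executed_at": None}  # after expand

def read_old_app(row):
    # Old code must ignore columns it does not know about.
    return row["created_at"]

def read_new_app(row):
    # New code must tolerate the new column being absent or null.
    return row.get("executed_at") or row["created_at"]

def compatible(reader, row):
    try:
        return reader(row) is not None
    except KeyError:
        return False

# Every cell must pass before the rollout is considered safe.
matrix = {
    (app, schema): compatible(reader, row)
    for app, reader in [("old_app", read_old_app), ("new_app", read_new_app)]
    for schema, row in [("old_schema", OLD_ROW), ("new_schema", NEW_ROW)]
}
assert all(matrix.values()), matrix
```

A failing cell here maps directly to one of the three questions above and tells you which phase needs to be redesigned.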
The 4-phase migration model
0) Design phase (before writing SQL)
Clarify:
- Data shape change type (add column, split table, rename semantics, etc.)
- Read/write path changes
- Rollback plan
- Verification metrics
- Estimated backfill duration and load impact
Deliverables:
- Migration plan doc (with phases and owner)
- Kill switch / feature flag strategy
- Runbook for rollback and pause
1) Expand phase (safe schema additions)
Typical operations:
- Add nullable columns
- Add new tables/indexes
- Add non-blocking constraints (or deferred validation)
- Add triggers or write adapters for transitional sync
Avoid in this phase:
- Dropping columns/tables
- Tightening constraints that old code violates
- Renaming fields used by live app versions
Example (PostgreSQL):
ALTER TABLE orders ADD COLUMN executed_at timestamptz;  -- nullable, no default: metadata-only, no table rewrite
-- CONCURRENTLY avoids long write locks, but cannot run inside a transaction block
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_executed_at ON orders (executed_at);
App behavior after expand:
- Continue serving with old code unchanged
- Optionally enable new code to read/write new fields behind a flag
2) Migrate phase (data + traffic transition)
2.1 Backfill
Backfill in small batches with checkpoints:
- Limit rows per batch
- Sleep between batches if needed
- Record progress cursor
- Retry idempotently
Pseudo-SQL strategy:
-- repeated in job loop (PostgreSQL does not allow ORDER BY / LIMIT
-- directly on UPDATE, so batch via a subquery on the primary key)
UPDATE orders
SET executed_at = created_at
WHERE id IN (
  SELECT id FROM orders
  WHERE executed_at IS NULL
    AND id > :last_id
  ORDER BY id
  LIMIT 1000
);
-- advance :last_id to the highest id touched before the next batch
(Batching primitives are engine-specific; MySQL, for example, does allow ORDER BY / LIMIT directly on UPDATE.)
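The job loop itself can be a small worker that batches, checkpoints a cursor, and leaves room for throttling. A sketch in Python using an in-memory SQLite table to stand in for `orders` (SQLite accepts the subquery form; the real engine and the durable cursor store will differ):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT, executed_at TEXT)")
conn.executemany("INSERT INTO orders (id, created_at) VALUES (?, ?)",
                 [(i, f"2026-03-{i % 28 + 1:02d}") for i in range(1, 5001)])

BATCH = 1000
last_id = 0                        # progress cursor; persist this in a real job
while True:
    cur = conn.execute(
        """UPDATE orders SET executed_at = created_at
           WHERE id IN (SELECT id FROM orders
                        WHERE executed_at IS NULL AND id > ?
                        ORDER BY id LIMIT ?)""",
        (last_id, BATCH))
    conn.commit()                  # commit per batch so retries resume cleanly
    if cur.rowcount == 0:
        break                      # backfill complete
    last_id = conn.execute(
        "SELECT max(id) FROM orders WHERE executed_at IS NOT NULL").fetchone()[0]
    time.sleep(0)                  # throttle between batches in production

remaining = conn.execute(
    "SELECT count(*) FROM orders WHERE executed_at IS NULL").fetchone()[0]
assert remaining == 0
```

Because the guard is `executed_at IS NULL`, re-running the loop after a crash is harmless: already-backfilled rows are never touched again.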
2.2 Dual-write / dual-read
- Start writing to both old and new fields
- Prefer read-old fallback-to-new or read-new fallback-to-old during ramp
- Measure mismatch rates
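A minimal app-layer sketch of dual-write plus read-with-fallback, assuming hypothetical `old_store`/`new_store` dicts in place of real tables and a plain counter standing in for the mismatch metric:

```python
# Hypothetical in-memory stores standing in for the old and new columns.
old_store, new_store = {}, {}
dual_write_mismatch_total = 0

def write_order_status(order_id, status):
    # Dual-write: keep both representations in sync during the migration.
    old_store[order_id] = status
    new_store[order_id] = status.upper()   # hypothetical new representation

def read_order_status(order_id):
    # Read-old, fall back to new; compare when both exist to measure drift.
    global dual_write_mismatch_total
    old = old_store.get(order_id)
    new = new_store.get(order_id)
    if old is not None and new is not None and old.upper() != new:
        dual_write_mismatch_total += 1     # emit a real metric in production
    return old if old is not None else (new or "").lower()

write_order_status(1, "shipped")
assert read_order_status(1) == "shipped"
assert dual_write_mismatch_total == 0
```

The comparison on every read is what makes silent divergence visible; the counter feeds the mismatch-rate dashboard used to gate cutover.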
2.3 Verification
Track:
- Null ratio in new column
- Old/new value mismatch count
- Error rate and latency shifts
- Replication lag / lock time
Cutover only when mismatch is near-zero and stable.
3) Cutover phase
- Flip read path fully to new schema
- Keep dual-write briefly as safety buffer
- Monitor error budget window (e.g., 24–72h)
If issues appear:
- Revert flag to old read path
- Keep schema intact
- Diagnose without destructive rollback
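The flip-and-revert behavior is cheapest when the read path is a pure function of a flag. A sketch, assuming a hypothetical in-process flag store:

```python
flags = {"orders_read_new_schema": False}   # hypothetical flag store

def read_executed_at(row):
    # Cutover: flip the flag to move reads; flip it back to roll back
    # instantly, with no schema change and no data loss.
    if flags["orders_read_new_schema"]:
        return row.get("executed_at") or row["created_at"]  # new path w/ fallback
    return row["created_at"]                                # old path

row = {"created_at": "2026-03-01", "executed_at": "2026-03-02"}

flags["orders_read_new_schema"] = True      # cutover
assert read_executed_at(row) == "2026-03-02"

flags["orders_read_new_schema"] = False     # rollback: flag only, schema intact
assert read_executed_at(row) == "2026-03-01"
```

Because dual-write is still running during this window, flipping back loses nothing; the new column keeps receiving data while you diagnose.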
4) Contract phase (remove legacy)
Only after stable cutover:
- Disable dual-write
- Drop old columns/tables/indexes
- Enforce stricter constraints if needed
- Remove dead code and migration flags
Example:
ALTER TABLE orders DROP COLUMN legacy_status;
Treat contract as a separate release with its own observability.
High-risk change patterns and safer alternatives
Column rename
Risky:
ALTER TABLE ... RENAME COLUMN ... while mixed app versions exist
Safer:
- Add new column
- Dual-write
- Backfill
- Cutover reads
- Drop old column
Type change (int → bigint, text → enum, etc.)
Risky:
- In-place conversion on hot table
Safer:
- Add new typed column
- Convert gradually in app/backfill
- Verify
- Swap reads and contract
NOT NULL enforcement
Risky:
- Add NOT NULL before all writers are compliant
Safer:
- Keep the column nullable during migration
- Monitor that null inserts stay at zero over an observation window
- Then enforce NOT NULL (in PostgreSQL, add a CHECK (col IS NOT NULL) NOT VALID constraint and validate it first, so SET NOT NULL can skip the long full-table scan)
Splitting one table into many
Safer flow:
- Add new tables
- Dual-write from app layer (or CDC/triggers)
- Backfill history
- Switch reads by domain
- Remove old table in final phase
Operational checklist (copy/paste)
Pre-deploy
- Compatibility matrix validated (old/new app vs old/new schema)
- Rollback path tested in staging
- Feature flags prepared
- Migration runtime and lock risk estimated
- Dashboard + alerts prepared
Expand deployed
- Schema applied without blocking incidents
- New indexes built safely
- Error rate unchanged
Migrate running
- Backfill progress visible
- Batch size tuned to stay under SLO impact
- Dual-write mismatch monitored
- Canary/partial traffic validated
Cutover
- Read path switched via flag
- Mismatch remains near-zero
- Rollback switch verified
Contract
- Legacy reads/writes removed in code
- Old schema dropped in separate deploy
- Cleanup PR merged (flags/runbooks/docs)
Observability signals that matter
Minimum migration dashboard:
- migration_backfill_rows_total
- migration_backfill_rows_remaining
- dual_write_mismatch_total
- migration_lock_wait_seconds
- App p95 latency and 5xx rate
- Replication lag / queue depth
Alert on trends, not single spikes, unless safety thresholds are hard (e.g., lock wait > N sec).
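One way to encode "trends, not single spikes, except for hard thresholds" is to combine a hard limit with a sustained-growth condition over a window. A sketch with hypothetical metric samples and threshold values:

```python
LOCK_WAIT_HARD_LIMIT = 5.0   # hypothetical hard safety threshold (seconds)
WINDOW = 5                   # samples required for a trend alert

def should_alert(lock_wait_samples, mismatch_samples):
    # Hard threshold: a single bad lock-wait sample is enough to page.
    if any(s > LOCK_WAIT_HARD_LIMIT for s in lock_wait_samples[-WINDOW:]):
        return True
    # Trend: alert only when mismatches grow monotonically across the window,
    # so an isolated noisy spike does not page anyone.
    recent = mismatch_samples[-WINDOW:]
    return len(recent) == WINDOW and all(a < b for a, b in zip(recent, recent[1:]))

assert not should_alert([0.1, 0.2, 0.1], [3, 0, 2, 1, 0])   # noisy, no alert
assert should_alert([0.1, 6.0], [0, 0])                     # hard limit breached
assert should_alert([0.1], [1, 2, 3, 4, 5])                 # sustained growth
```

Real alerting stacks express the same logic as recording rules or burn-rate queries; the split between hard limits and trend conditions is the part that carries over.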
Common failure modes
- Contract too early: old columns dropped before all workers deploy
- Unbounded backfill: one massive update creates lock pressure and replication lag
- No rollback flag: incidents force emergency DB rewrites under pressure
- Silent divergence: dual-write mismatches are not monitored
- Non-idempotent migration script: retries corrupt or duplicate data
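The idempotency failure mode is easy to demonstrate: a guarded update converges under retry, an unguarded one compounds. A sketch using SQLite in place of the production database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT,
                executed_at TEXT, touched INTEGER DEFAULT 0)""")
conn.execute("INSERT INTO orders (id, created_at) VALUES (1, '2026-03-01')")

def backfill_idempotent():
    # Guard clause: re-running after a crash or retry is harmless.
    conn.execute("UPDATE orders SET executed_at = created_at WHERE executed_at IS NULL")

def backfill_non_idempotent():
    # No guard: every retry mutates state again.
    conn.execute("UPDATE orders SET touched = touched + 1")

for _ in range(3):               # simulate a job runner retrying three times
    backfill_idempotent()
    backfill_non_idempotent()

row = conn.execute("SELECT executed_at, touched FROM orders WHERE id = 1").fetchone()
assert row[0] == "2026-03-01"    # converged: identical after every retry
assert row[1] == 3               # drifted: the unguarded write compounded
```

The guard (`WHERE executed_at IS NULL`) is the same one used in the backfill loop earlier; it is what makes "retry idempotently" a property of the SQL rather than of the scheduler.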
Practical rule of thumb
If a schema change cannot survive:
- staggered deploy,
- delayed workers,
- and one-click app rollback,
it is not production-ready yet.
Design for mixed versions first. Cleanup later.
Further reading keywords
- Parallel change pattern
- Expand and contract migrations
- Online schema change (OSC)
- Backfill throttling
- Progressive delivery + feature flags
- Data contract testing