Zero-Downtime Schema Migration (Expand/Contract) Operations Playbook
Date: 2026-03-01
Category: knowledge
Domain: software / databases / reliability engineering
Why this matters
Most production incidents around schema changes are not caused by SQL syntax errors. They come from coordination failures:
- code expects new columns before backfill is done,
- DDL lock behavior was misunderstood,
- rollback path was not prepared,
- migration throughput was tuned by hope instead of guardrails.
A robust schema migration approach treats change as a state machine, not a one-shot ALTER.
Mental model: Parallel Change (Expand → Migrate → Contract)
Martin Fowler’s Parallel Change pattern (aka expand/contract) gives the safest default:
- Expand: introduce new schema in a backward-compatible way.
- Migrate: move data + traffic gradually while both old/new paths coexist.
- Contract: remove the legacy schema only after parity is proven and rollback is no longer needed.
This avoids “big-bang” breakage and keeps each release reversible.
Step 0: Classify the DDL before touching production
Before writing migration SQL, classify each change into one of three risk buckets:
A) Metadata-only / cheap
Examples: nullable column add (engine/version dependent), comment changes.
B) Online but resource-heavy
Examples: concurrent index build, background table copy tools.
C) Blocking / rewrite-prone
Examples: type rewrites, non-compatible defaults, lock-heavy ALTER variants.
If you don’t know the bucket, assume C and stage accordingly.
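The bucket triage can be encoded as a small lookup so every migration gets a classification by default. A minimal Python sketch; the pattern table is illustrative and engine/version dependent, not an authoritative mapping:

```python
# Risk-bucket triage for DDL statements. The pattern table is
# illustrative and engine-dependent; anything unrecognized falls
# through to bucket C, per the rule above.

RISK_PATTERNS = [
    ("CREATE INDEX CONCURRENTLY", "B"),  # online but resource-heavy
    ("ADD COLUMN", "A"),                 # nullable add: often metadata-only
    ("ALTER COLUMN", "C"),               # type change: rewrite-prone
    ("DROP COLUMN", "C"),
]

def classify(ddl: str) -> str:
    upper = ddl.upper()
    for pattern, bucket in RISK_PATTERNS:
        if pattern in upper:
            return bucket
    return "C"  # unknown: assume the worst and stage accordingly
```

In practice this table should be reviewed per engine and major version; the point is that "unknown" defaults to C rather than to optimism.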
Engine reality check (what operators forget)
PostgreSQL
- ALTER TABLE lock levels differ by subform, but the default assumption should be strict: many forms take ACCESS EXCLUSIVE unless explicitly noted.
- CREATE INDEX CONCURRENTLY allows concurrent writes, but is slower and has caveats.
- In lock terms, only ACCESS EXCLUSIVE blocks plain SELECT; this matters for incident triage.
Operational takeaway: always preflight lock semantics for your exact subcommand, not “ALTER TABLE in general”.
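One concrete preflight habit is bounding how long a DDL statement may wait for its lock, so a queued ACCESS EXCLUSIVE request cannot stall all traffic behind it. A sketch that just assembles the session statements; the helper name and 2-second default are assumptions, while SET/RESET lock_timeout are standard PostgreSQL:

```python
# Fail-fast DDL wrapper: a short lock_timeout makes the ALTER error out
# instead of queueing behind a long-running transaction (while queued,
# the pending ACCESS EXCLUSIVE request blocks all later queries on the
# table). Helper name and default timeout are assumptions.

def guarded_ddl(ddl: str, lock_timeout_ms: int = 2000) -> list[str]:
    """Statements to run, in order, inside one session."""
    return [
        f"SET lock_timeout = '{lock_timeout_ms}ms';",
        ddl,
        "RESET lock_timeout;",
    ]

for stmt in guarded_ddl("ALTER TABLE users ADD COLUMN first_name text;"):
    print(stmt)
```

If the statement times out, retry later or during a quieter window; timing out cheaply is the desired behavior.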
MySQL ecosystems (InnoDB/Vitess tooling)
Two common online migration families:
- pt-online-schema-change: chunk-copy + triggers + atomic rename.
- gh-ost: chunk-copy + binlog apply (triggerless) + controllable cutover.
Trade-off intuition:
- pt-osc is widely used, but its trigger-based approach needs special care around existing triggers and foreign-key handling.
- gh-ost improves operational control (pause/throttle/cutover control) and avoids trigger overhead.
Managed platforms (e.g., PlanetScale safe migrations) add queueing/verification/throttling around this class of online copy-based change.
Production runbook (operator grade)
Phase 1 — Expand (compatible schema first)
Rules:
- Add new columns/tables as nullable or default-safe.
- Do not drop/rename old structures yet.
- Add indexes using online/concurrent strategy where applicable.
Example:
- Old: users.full_name
- New target: users.first_name, users.last_name
Expand by adding new columns first.
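For this split, the expand-phase statements might look like the following sketch (PostgreSQL assumed; the index and its name are illustrative):

```python
# Expand-phase DDL for the full_name -> first_name/last_name split.
# Nullable adds are cheap on modern engines; the index is built
# concurrently so writes keep flowing. Index name is an assumption.

EXPAND_STATEMENTS = [
    "ALTER TABLE users ADD COLUMN first_name text;",  # nullable: metadata-only
    "ALTER TABLE users ADD COLUMN last_name text;",   # nullable: metadata-only
    # slower than a plain build, and must run outside a transaction block:
    "CREATE INDEX CONCURRENTLY idx_users_last_name ON users (last_name);",
]

for stmt in EXPAND_STATEMENTS:
    print(stmt)
```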
Phase 2 — Dual write (forward compatibility)
Application writes to both schemas:
- old path (to preserve current reads),
- new path (to accumulate migrated truth).
Key safeguards:
- dual-write success metric,
- idempotent write path,
- feature flag to disable new-path writes quickly.
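The dual-write shape can be sketched with in-memory stand-ins for the two schemas; the flag, failure metric, and name-splitting rule are all illustrative:

```python
# Dual-write sketch. The old path stays authoritative for reads; the
# new path is gated behind a flag (kill switch) and its failures feed
# a metric instead of failing the request. All names are illustrative.

FLAGS = {"dual_write_enabled": True}
METRICS = {"dual_write_failures": 0}

old_store: dict = {}   # full_name schema (current reads)
new_store: dict = {}   # first_name/last_name schema (migrated truth)

def write_user(user_id: int, full_name: str) -> None:
    old_store[user_id] = {"full_name": full_name}   # old path first
    if not FLAGS["dual_write_enabled"]:
        return                                      # kill switch engaged
    try:
        first, _, last = full_name.partition(" ")
        # keyed upsert: idempotent, safe to retry
        new_store[user_id] = {"first_name": first, "last_name": last}
    except Exception:
        METRICS["dual_write_failures"] += 1         # alert on the rate; don't fail the write

write_user(1, "Ada Lovelace")
```

The ordering matters: the old path is written first because reads still depend on it; new-path failures degrade the migration, not the product.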
Phase 3 — Backfill historical data (bounded blast radius)
Run chunked backfill with explicit controls:
- chunk size,
- target rows/sec,
- sleep between chunks,
- auto-throttle on replica lag / DB load.
Control loop example:
- if p95 query latency or replication lag breaches threshold → reduce chunk size or pause.
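The control loop can be sketched as a chunked copy with naive adaptive throttling; the probe and copy functions are placeholders, and all thresholds are illustrative:

```python
import time

# Chunked backfill with adaptive throttling. get_lag() and copy_chunk()
# stand in for a real replication-lag probe and a real chunk copy; the
# chunk bounds and lag limit are illustrative.

CHUNK_MIN, CHUNK_MAX = 100, 5000
LAG_LIMIT_S = 10.0

def backfill(rows, get_lag, copy_chunk, sleep_s: float = 0.0) -> int:
    chunk = 1000
    i = 0
    while i < len(rows):
        if get_lag() > LAG_LIMIT_S:
            chunk = max(CHUNK_MIN, chunk // 2)  # breach: halve chunk size
            time.sleep(sleep_s)                 # and pause before retrying
            continue
        copy_chunk(rows[i:i + chunk])
        i += chunk
        chunk = min(CHUNK_MAX, chunk + 100)     # healthy: ramp back up slowly
    return i
```

A real implementation would also watch p95 latency and persist its cursor so a paused backfill resumes where it stopped.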
Phase 4 — Read switch + parity verification
Switch reads progressively:
- canary cohort → partial traffic → full traffic.
Gate progression on parity checks:
- row-count parity,
- checksum/sample parity,
- business invariant parity (not just raw counts).
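The three parity gates can be combined into one check. The dict "stores" stand in for the old and new tables; the projection and invariant are specific to the full_name example:

```python
import hashlib

# Parity gate combining row-count parity, sampled checksum parity, and
# a business invariant. split_name and the invariant are specific to
# the full_name -> first_name/last_name example.

def checksum(row: dict) -> str:
    return hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest()

def split_name(old_row: dict) -> dict:
    first, _, last = old_row["full_name"].partition(" ")
    return {"first_name": first, "last_name": last}

def parity_ok(old: dict, new: dict, project, sample_ids) -> bool:
    if len(old) != len(new):                  # row-count parity
        return False
    for uid in sample_ids:                    # checksum parity on a sample
        if checksum(project(old[uid])) != checksum(new[uid]):
            return False
    # business invariant: nobody ends up with a fully empty name
    return all(r["first_name"] or r["last_name"] for r in new.values())
```

Note the invariant check runs over all new rows, not just the sample: raw counts and checksums can both pass while a business rule is silently violated.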
Phase 5 — Contract (remove legacy only after soak)
Only after:
- full read migration complete,
- no rollback signals during soak window,
- observability stable.
Then:
- stop old writes,
- drop old columns/tables,
- remove dead code and feature flags.
Migration SLOs (use these, not vibes)
Define migration as an SLO-controlled operation:
- migration_error_rate
- dual_write_mismatch_rate
- backfill_rows_per_sec
- replica_lag_seconds
- db_cpu / p95_query_latency
- time_to_pause and time_to_resume
- cutover_duration_seconds
Set explicit budgets (example):
- mismatch rate < 0.01%
- replica lag < 10s sustained
- no P0 query latency regression > 15%
If budgets are breached, migration state machine must auto-transition to THROTTLED or PAUSED.
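That auto-transition rule can be sketched as a tiny state machine keyed on the example budgets above (metric names are illustrative):

```python
from enum import Enum

# SLO-driven migration state machine. Thresholds mirror the example
# budgets above; the metric names are illustrative.

class State(Enum):
    RUNNING = "running"
    THROTTLED = "throttled"
    PAUSED = "paused"

def next_state(m: dict) -> State:
    if m["dual_write_mismatch_rate"] > 0.0001:  # mismatch > 0.01%: correctness risk
        return State.PAUSED
    if m["replica_lag_seconds"] > 10:           # sustained lag budget breached
        return State.THROTTLED
    if m["p95_latency_regression"] > 0.15:      # > 15% latency regression
        return State.THROTTLED
    return State.RUNNING
```

The design choice here is that correctness breaches pause (a human investigates) while load breaches only throttle (the migration self-heals).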
Failure modes that repeatedly cause incidents
Coupling code deploy with irreversible DDL in one step
- Fix: separate expand and contract releases.
No explicit cutover window control
- Fix: postpone-able/manual cutover with rollback readiness.
Backfill without load feedback
- Fix: adaptive throttling tied to lag + latency.
Assuming online means zero risk
- Fix: preflight constraints, triggers, FK behavior, and lock edge cases.
Dropping old schema before proving parity
- Fix: parity checklist + soak period as hard gate.
Rollback strategy by phase
- Expand phase: easy rollback (ignore new schema).
- Dual-write/backfill phase: disable new writes, stop backfill, continue old reads.
- Read switch phase: flip reads back to old schema immediately.
- After contract: rollback cost spikes; require restore/rebuild paths.
Interpretation: your cheapest rollback exists before contract. Protect that window.
Tooling selection cheat sheet
- Need MySQL online migration with high operational control and a triggerless approach: consider gh-ost.
- Need a broad MySQL online schema change workflow with chunk-copy/triggers: consider pt-online-schema-change.
- Need a platform-managed non-blocking flow with branch/deploy governance: use the provider workflow (e.g., PlanetScale safe migrations).
- On PostgreSQL:
  - prefer migration plans that minimize ACCESS EXCLUSIVE exposure,
  - use CREATE INDEX CONCURRENTLY where appropriate,
  - stage incompatible changes via expand/contract.
Minimal preflight checklist
- Exact lock behavior verified for each DDL statement
- Expand SQL is backward-compatible
- Dual-write feature flag in place
- Backfill is chunked + throttle-aware
- Parity checks automated and observable
- Cutover can pause/postpone safely
- Rollback drill done before production execution
- Contract step scheduled separately (never same release by default)
If any item is missing, you are not ready for “zero-downtime” claims.
References (researched)
- Martin Fowler — Parallel Change (expand/migrate/contract): https://martinfowler.com/bliki/ParallelChange.html
- PostgreSQL ALTER TABLE docs (lock-level caveats by subform): https://www.postgresql.org/docs/current/sql-altertable.html
- PostgreSQL explicit locking modes (ACCESS EXCLUSIVE, SHARE UPDATE EXCLUSIVE, etc.): https://www.postgresql.org/docs/current/explicit-locking.html
- PostgreSQL CREATE INDEX docs (CONCURRENTLY): https://www.postgresql.org/docs/current/sql-createindex.html
- gh-ost (triggerless MySQL online schema migration; binlog-based apply and cutover controls): https://github.com/github/gh-ost
- Percona Toolkit pt-online-schema-change docs (chunk copy + triggers + rename swap): https://docs.percona.com/percona-toolkit/pt-online-schema-change.html
- PlanetScale non-blocking schema changes / safe migrations workflow: https://planetscale.com/docs/vitess/schema-changes
- Prisma Data Guide: Expand and contract pattern overview: https://www.prisma.io/dataguide/types/relational/expand-and-contract-pattern