Zero-Downtime Schema Migration (Expand/Contract) Operations Playbook
Date: 2026-03-01
Category: knowledge
Domain: software / databases / reliability engineering
Why this matters
Most production incidents around schema changes are not caused by SQL syntax errors. They come from coordination failures:
- code expects new columns before backfill is done,
- DDL lock behavior was misunderstood,
- rollback path was not prepared,
- migration throughput was tuned by hope instead of guardrails.
A robust schema migration approach treats change as a state machine, not a one-shot ALTER.
Mental model: Parallel Change (Expand → Migrate → Contract)
Martin Fowler’s Parallel Change pattern (aka expand/contract) gives the safest default:
- Expand: introduce new schema in a backward-compatible way.
- Migrate: move data + traffic gradually while both old/new paths coexist.
- Contract: remove the legacy schema only after parity is proven and rollback is no longer needed.
This avoids “big-bang” breakage and keeps each release reversible.
Step 0: Classify the DDL before touching production
Before writing migration SQL, classify each change into one of three risk buckets:
A) Metadata-only / cheap
Examples: nullable column add (engine/version dependent), comment changes.
B) Online but resource-heavy
Examples: concurrent index build, background table copy tools.
C) Blocking / rewrite-prone
Examples: type rewrites, non-compatible defaults, lock-heavy ALTER variants.
If you don’t know the bucket, assume C and stage accordingly.
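The bucket triage can be encoded as a small lookup so every migration gets a classification by default. A minimal Python sketch; the pattern table is illustrative and engine/version dependent, not an authoritative mapping:

```python
# Risk-bucket triage for DDL statements. The pattern table is
# illustrative and engine-dependent; anything unrecognized falls
# through to bucket C, per the rule above.

RISK_PATTERNS = [
    ("CREATE INDEX CONCURRENTLY", "B"),  # online but resource-heavy
    ("ADD COLUMN", "A"),                 # nullable add: often metadata-only
    ("ALTER COLUMN", "C"),               # type change: rewrite-prone
    ("DROP COLUMN", "C"),
]

def classify(ddl: str) -> str:
    upper = ddl.upper()
    for pattern, bucket in RISK_PATTERNS:
        if pattern in upper:
            return bucket
    return "C"  # unknown: assume the worst and stage accordingly
```

In practice this table should be reviewed per engine and major version; the point is that "unknown" defaults to C rather than to optimism.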
Engine reality check (what operators forget)
PostgreSQL
- ALTER TABLE lock levels differ by subform, but the default assumption should be strict: many forms take ACCESS EXCLUSIVE unless explicitly noted.
- CREATE INDEX CONCURRENTLY allows concurrent writes, but is slower and has caveats.
- In lock terms, only ACCESS EXCLUSIVE blocks plain SELECT; this matters for incident triage.
Operational takeaway: always preflight lock semantics for your exact subcommand, not “ALTER TABLE in general”.
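One concrete preflight habit is bounding how long a DDL statement may wait for its lock, so a queued ACCESS EXCLUSIVE request cannot stall all traffic behind it. A sketch that just assembles the session statements; the helper name and 2-second default are assumptions, while SET/RESET lock_timeout are standard PostgreSQL:

```python
# Fail-fast DDL wrapper: a short lock_timeout makes the ALTER error out
# instead of queueing behind a long-running transaction (while queued,
# the pending ACCESS EXCLUSIVE request blocks all later queries on the
# table). Helper name and default timeout are assumptions.

def guarded_ddl(ddl: str, lock_timeout_ms: int = 2000) -> list[str]:
    """Statements to run, in order, inside one session."""
    return [
        f"SET lock_timeout = '{lock_timeout_ms}ms';",
        ddl,
        "RESET lock_timeout;",
    ]

for stmt in guarded_ddl("ALTER TABLE users ADD COLUMN first_name text;"):
    print(stmt)
```

If the statement times out, retry later or during a quieter window; timing out cheaply is the desired behavior.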
MySQL ecosystems (InnoDB/Vitess tooling)
Two common online migration families:
- pt-online-schema-change: chunk-copy + triggers + atomic rename.
- gh-ost: chunk-copy + binlog apply (triggerless) + controllable cutover.
Trade-off intuition:
- pt-osc is widely used, but its trigger-based approach needs special care around existing triggers and foreign-key handling.
- gh-ost improves operational control (pause/throttle/cutover control) and avoids trigger overhead.
Managed platforms (e.g., PlanetScale safe migrations) add queueing/verification/throttling around this class of online copy-based change.
Production runbook (operator grade)
Phase 1 — Expand (compatible schema first)
Rules:
- Add new columns/tables as nullable or default-safe.
- Do not drop/rename old structures yet.
- Add indexes using online/concurrent strategy where applicable.
Example:
- Old: users.full_name
- New target: users.first_name, users.last_name
Expand by adding new columns first.
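For this split, the expand-phase statements might look like the following sketch (PostgreSQL assumed; the index and its name are illustrative):

```python
# Expand-phase DDL for the full_name -> first_name/last_name split.
# Nullable adds are cheap on modern engines; the index is built
# concurrently so writes keep flowing. Index name is an assumption.

EXPAND_STATEMENTS = [
    "ALTER TABLE users ADD COLUMN first_name text;",  # nullable: metadata-only
    "ALTER TABLE users ADD COLUMN last_name text;",   # nullable: metadata-only
    # slower than a plain build, and must run outside a transaction block:
    "CREATE INDEX CONCURRENTLY idx_users_last_name ON users (last_name);",
]

for stmt in EXPAND_STATEMENTS:
    print(stmt)
```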
Phase 2 — Dual write (forward compatibility)
Application writes to both schemas:
- old path (to preserve current reads),
- new path (to accumulate migrated truth).
Key safeguards:
- dual-write success metric,
- idempotent write path,
- feature flag to disable new-path writes quickly.
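The dual-write shape can be sketched with in-memory stand-ins for the two schemas; the flag, failure metric, and name-splitting rule are all illustrative:

```python
# Dual-write sketch. The old path stays authoritative for reads; the
# new path is gated behind a flag (kill switch) and its failures feed
# a metric instead of failing the request. All names are illustrative.

FLAGS = {"dual_write_enabled": True}
METRICS = {"dual_write_failures": 0}

old_store: dict = {}   # full_name schema (current reads)
new_store: dict = {}   # first_name/last_name schema (migrated truth)

def write_user(user_id: int, full_name: str) -> None:
    old_store[user_id] = {"full_name": full_name}   # old path first
    if not FLAGS["dual_write_enabled"]:
        return                                      # kill switch engaged
    try:
        first, _, last = full_name.partition(" ")
        # keyed upsert: idempotent, safe to retry
        new_store[user_id] = {"first_name": first, "last_name": last}
    except Exception:
        METRICS["dual_write_failures"] += 1         # alert on the rate; don't fail the write

write_user(1, "Ada Lovelace")
```

The ordering matters: the old path is written first because reads still depend on it; new-path failures degrade the migration, not the product.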
Phase 3 — Backfill historical data (bounded blast radius)
Run chunked backfill with explicit controls:
- chunk size,
- target rows/sec,
- sleep between chunks,
- auto-throttle on replica lag / DB load.
Control loop example:
- if p95 query latency or replication lag breaches threshold → reduce chunk size or pause.
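The control loop can be sketched as a chunked copy with naive adaptive throttling; the probe and copy functions are placeholders, and all thresholds are illustrative:

```python
import time

# Chunked backfill with adaptive throttling. get_lag() and copy_chunk()
# stand in for a real replication-lag probe and a real chunk copy; the
# chunk bounds and lag limit are illustrative.

CHUNK_MIN, CHUNK_MAX = 100, 5000
LAG_LIMIT_S = 10.0

def backfill(rows, get_lag, copy_chunk, sleep_s: float = 0.0) -> int:
    chunk = 1000
    i = 0
    while i < len(rows):
        if get_lag() > LAG_LIMIT_S:
            chunk = max(CHUNK_MIN, chunk // 2)  # breach: halve chunk size
            time.sleep(sleep_s)                 # and pause before retrying
            continue
        copy_chunk(rows[i:i + chunk])
        i += chunk
        chunk = min(CHUNK_MAX, chunk + 100)     # healthy: ramp back up slowly
    return i
```

A real implementation would also watch p95 latency and persist its cursor so a paused backfill resumes where it stopped.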
Phase 4 — Read switch + parity verification
Switch reads progressively:
- canary cohort → partial traffic → full traffic.
Gate progression on parity checks:
- row-count parity,
- checksum/sample parity,
- business invariant parity (not just raw counts).
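The three parity gates can be combined into one check. The dict "stores" stand in for the old and new tables; the projection and invariant are specific to the full_name example:

```python
import hashlib

# Parity gate combining row-count parity, sampled checksum parity, and
# a business invariant. split_name and the invariant are specific to
# the full_name -> first_name/last_name example.

def checksum(row: dict) -> str:
    return hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest()

def split_name(old_row: dict) -> dict:
    first, _, last = old_row["full_name"].partition(" ")
    return {"first_name": first, "last_name": last}

def parity_ok(old: dict, new: dict, project, sample_ids) -> bool:
    if len(old) != len(new):                  # row-count parity
        return False
    for uid in sample_ids:                    # checksum parity on a sample
        if checksum(project(old[uid])) != checksum(new[uid]):
            return False
    # business invariant: nobody ends up with a fully empty name
    return all(r["first_name"] or r["last_name"] for r in new.values())
```

Note the invariant check runs over all new rows, not just the sample: raw counts and checksums can both pass while a business rule is silently violated.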
Phase 5 — Contract (remove legacy only after soak)
Only after:
- full read migration complete,
- no rollback signals during soak window,
- observability stable.
Then:
- stop old writes,
- drop old columns/tables,
- remove dead code and feature flags.
Migration SLOs (use these, not vibes)
Define migration as an SLO-controlled operation:
- migration_error_rate
- dual_write_mismatch_rate
- backfill_rows_per_sec
- replica_lag_seconds
- db_cpu / p95_query_latency
- time_to_pause and time_to_resume
- cutover_duration_seconds
Set explicit budgets (example):
- mismatch rate < 0.01%
- replica lag < 10s sustained
- no P0 query latency regression > 15%
If budgets are breached, migration state machine must auto-transition to THROTTLED or PAUSED.
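That auto-transition rule can be sketched as a tiny state machine keyed on the example budgets above (metric names are illustrative):

```python
from enum import Enum

# SLO-driven migration state machine. Thresholds mirror the example
# budgets above; the metric names are illustrative.

class State(Enum):
    RUNNING = "running"
    THROTTLED = "throttled"
    PAUSED = "paused"

def next_state(m: dict) -> State:
    if m["dual_write_mismatch_rate"] > 0.0001:  # mismatch > 0.01%: correctness risk
        return State.PAUSED
    if m["replica_lag_seconds"] > 10:           # sustained lag budget breached
        return State.THROTTLED
    if m["p95_latency_regression"] > 0.15:      # > 15% latency regression
        return State.THROTTLED
    return State.RUNNING
```

The design choice here is that correctness breaches pause (a human investigates) while load breaches only throttle (the migration self-heals).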
Failure modes that repeatedly cause incidents
Coupling code deploy with irreversible DDL in one step
- Fix: separate expand and contract releases.
No explicit cutover window control
- Fix: postpone-able/manual cutover with rollback readiness.
Backfill without load feedback
- Fix: adaptive throttling tied to lag + latency.
Assuming online means zero risk
- Fix: preflight constraints, triggers, FK behavior, and lock edge cases.
Dropping old schema before proving parity
- Fix: parity checklist + soak period as hard gate.
Rollback strategy by phase
- Expand phase: easy rollback (ignore new schema).
- Dual-write/backfill phase: disable new writes, stop backfill, continue old reads.
- Read switch phase: flip reads back to old schema immediately.
- After contract: rollback cost spikes; require restore/rebuild paths.
Interpretation: your cheapest rollback exists before contract. Protect that window.
Tooling selection cheat sheet
- Need MySQL online migration with high operational control and a triggerless approach: consider gh-ost.
- Need a broad MySQL online schema change workflow with chunk-copy/triggers: consider pt-online-schema-change.
- Need a platform-managed non-blocking flow with branch/deploy governance: use the provider workflow (e.g., PlanetScale safe migrations).
- On PostgreSQL:
  - prefer migration plans that minimize ACCESS EXCLUSIVE exposure,
  - use CREATE INDEX CONCURRENTLY where appropriate,
  - stage incompatible changes via expand/contract.
Minimal preflight checklist
- Exact lock behavior verified for each DDL statement
- Expand SQL is backward-compatible
- Dual-write feature flag in place
- Backfill is chunked + throttle-aware
- Parity checks automated and observable
- Cutover can pause/postpone safely
- Rollback drill done before production execution
- Contract step scheduled separately (never same release by default)
If any item is missing, you are not ready for “zero-downtime” claims.
References (researched)
- Martin Fowler — Parallel Change (expand/migrate/contract): https://martinfowler.com/bliki/ParallelChange.html
- PostgreSQL ALTER TABLE docs (lock-level caveats by subform): https://www.postgresql.org/docs/current/sql-altertable.html
- PostgreSQL explicit locking modes (ACCESS EXCLUSIVE, SHARE UPDATE EXCLUSIVE, etc.): https://www.postgresql.org/docs/current/explicit-locking.html
- PostgreSQL CREATE INDEX docs (CONCURRENTLY): https://www.postgresql.org/docs/current/sql-createindex.html
- gh-ost (triggerless MySQL online schema migration; binlog-based apply and cutover controls): https://github.com/github/gh-ost
- Percona Toolkit pt-online-schema-change docs (chunk copy + triggers + rename swap): https://docs.percona.com/percona-toolkit/pt-online-schema-change.html
- PlanetScale non-blocking schema changes / safe migrations workflow: https://planetscale.com/docs/vitess/schema-changes
- Prisma Data Guide: Expand and contract pattern overview: https://www.prisma.io/dataguide/types/relational/expand-and-contract-pattern