Zero-Downtime Schema Migration (Expand/Contract) Operations Playbook


Date: 2026-03-01
Category: knowledge
Domain: software / databases / reliability engineering

Why this matters

Most production incidents around schema changes are not caused by SQL syntax errors. They come from coordination failures: application code and schema versions drifting out of sync, backfills overloading production, and legacy schema being removed while something still depends on it.

A robust schema migration approach treats change as a state machine, not a one-shot ALTER.


Mental model: Parallel Change (Expand → Migrate → Contract)

Martin Fowler’s Parallel Change pattern (aka expand/contract) gives the safest default:

  1. Expand: introduce new schema in a backward-compatible way.
  2. Migrate: move data + traffic gradually while both old/new paths coexist.
  3. Contract: remove legacy schema only after proving parity and rollback irrelevance.

This avoids “big-bang” breakage and keeps each release reversible.
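The progression above is worth enforcing in code, not just in a wiki page. A minimal sketch of the migration-as-state-machine idea (phase names and the forward-only rule are illustrative; a real controller would also model throttled/paused states):

```python
from enum import Enum, auto

class Phase(Enum):
    EXPAND = auto()
    DUAL_WRITE = auto()
    BACKFILL = auto()
    READ_SWITCH = auto()
    CONTRACT = auto()

# Each phase may only advance to the next; skipping ahead (e.g. straight
# to CONTRACT) is exactly the "big-bang" failure mode this pattern avoids.
ALLOWED = {
    Phase.EXPAND: {Phase.DUAL_WRITE},
    Phase.DUAL_WRITE: {Phase.BACKFILL},
    Phase.BACKFILL: {Phase.READ_SWITCH},
    Phase.READ_SWITCH: {Phase.CONTRACT},
    Phase.CONTRACT: set(),
}

def advance(current: Phase, target: Phase) -> Phase:
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

The point of making transitions explicit is that "who is allowed to move the migration forward, and when" becomes reviewable code instead of tribal knowledge.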


Step 0: Classify the DDL before touching production

Before writing migration SQL, classify each change into one of three risk buckets:

A) Metadata-only / cheap

Examples: nullable column add (engine/version dependent), comment changes.

B) Online but resource-heavy

Examples: concurrent index build, background table copy tools.

C) Blocking / rewrite-prone

Examples: type rewrites, non-compatible defaults, lock-heavy ALTER variants.

If you don’t know the bucket, assume C and stage accordingly.
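A crude way to make the "assume C" rule operational is a pattern-based pre-classifier that defaults pessimistically. The patterns below are illustrative only; real classification must consult your engine's documentation for the exact version in production:

```python
import re

# Illustrative allowlists -- anything unrecognized falls through to bucket C.
CHEAP = [r"ADD COLUMN .* NULL\b", r"COMMENT ON"]          # bucket A candidates
ONLINE_HEAVY = [r"CREATE INDEX CONCURRENTLY"]             # bucket B candidates

def classify(ddl: str) -> str:
    s = ddl.upper()
    if any(re.search(p, s) for p in ONLINE_HEAVY):
        return "B"
    if any(re.search(p, s) for p in CHEAP):
        return "A"
    return "C"  # unknown => assume blocking and stage accordingly
```

Wiring something like this into CI forces every migration to declare its bucket before review, rather than discovering it under load.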


Engine reality check (what operators forget)

PostgreSQL

Facts worth re-checking for your exact version:

  • Most ALTER TABLE subcommands take an ACCESS EXCLUSIVE lock; even a fast one queues behind long-running queries and blocks everything that arrives after it.
  • Adding a column with a non-volatile default is metadata-only since PostgreSQL 11; volatile defaults still force a table rewrite.
  • CREATE INDEX CONCURRENTLY avoids blocking writes but runs longer, cannot run inside a transaction, and can leave an INVALID index behind on failure.

Operational takeaway: always preflight lock semantics for your exact subcommand, not “ALTER TABLE in general”.

MySQL ecosystems (InnoDB/Vitess tooling)

Two common online migration families:

  • Trigger-based copy (e.g., pt-online-schema-change): triggers on the original table mirror ongoing writes into a shadow copy.
  • Binlog-based copy (e.g., gh-ost, Vitess online DDL): a ghost table is populated while changes are tailed from the binary log.

Trade-off intuition: triggers add synchronous overhead to every production write and are hard to pause; binlog tailing decouples the copy from the write path and supports throttling, at the cost of requiring binlog access and more moving parts.

Managed platforms (e.g., PlanetScale safe migrations) add queueing/verification/throttling around this class of online copy-based change.


Production runbook (operator grade)

Phase 1 — Expand (compatible schema first)

Rules:

  • Additive changes only: new nullable columns, new tables, new indexes.
  • No renames, drops, or type changes in this phase.
  • Nothing in the expand release may require new application code to be live.

Example: expand by adding the new column as nullable, with no rewriting default, so existing readers and writers keep working unchanged.
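A small demonstration of why the additive rule works, using SQLite as a stand-in engine (lock and rewrite semantics differ by engine; the schema and names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")

# Expand: additive, nullable, no default rewrite -- backward compatible.
conn.execute("ALTER TABLE users ADD COLUMN email_normalized TEXT")

# Old code path, unaware of the new column, still inserts successfully.
conn.execute("INSERT INTO users (email) VALUES ('b@example.com')")

rows = conn.execute(
    "SELECT email, email_normalized FROM users ORDER BY id"
).fetchall()
# The new column stays NULL until dual write and backfill populate it.
```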

Phase 2 — Dual write (forward compatibility)

Application writes to both schemas, with the old schema remaining the source of truth until cutover.

Key safeguards:

  • Writes to the new schema are idempotent, so retries and backfill replays are safe.
  • A feature flag can disable the new-schema write path instantly, without a deploy.
  • Failures on the new path are counted and alerted on, never surfaced to the user.
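The dual-write contract can be sketched in a few lines; `DualWriter` and the stores here are illustrative stand-ins for your data-access layer, not a library API:

```python
import logging

log = logging.getLogger("migration")

class DualWriter:
    """Old store stays the source of truth; new-path failures are
    recorded and logged but never fail the caller's request."""

    def __init__(self, old_store, new_store, enabled=lambda: True):
        self.old, self.new = old_store, new_store
        self.enabled = enabled          # feature flag: instant kill switch
        self.new_write_errors = 0

    def write(self, key, value):
        self.old[key] = value           # must succeed; exceptions propagate
        if not self.enabled():
            return                      # dual write disabled: old path only
        try:
            self.new[key] = value
        except Exception:
            self.new_write_errors += 1  # alert on divergence, don't raise
            log.exception("dual write to new schema failed key=%r", key)

old, new = {}, {}
writer = DualWriter(old, new)
writer.write("u1", {"email": "a@example.com"})
```

The error counter is the important part: divergence between stores must be observable long before the read switch.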

Phase 3 — Backfill historical data (bounded blast radius)

Run chunked backfill with explicit controls:

  • Bounded chunk size and a pause between chunks.
  • Progress checkpointing, so the job is resumable rather than restart-from-zero.
  • A load signal (replication lag, p99 latency) that throttles or pauses the loop.
  • A kill switch an operator can flip without a deploy.

Control loop example:
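The control loop can be sketched as follows; `copy_chunk` and `get_lag_s` are stand-ins for your upsert routine and your load signal (replication lag, latency delta, etc.):

```python
import time

def backfill(rows, copy_chunk, get_lag_s,
             chunk_size=1000, max_lag_s=5.0, pause_s=0.5):
    """Copy `rows` in bounded chunks, pausing whenever the load signal
    exceeds its budget. All names and thresholds are illustrative."""
    done = 0
    while done < len(rows):
        while get_lag_s() > max_lag_s:   # adaptive throttle: back off
            time.sleep(pause_s)          # until the signal recovers
        chunk = rows[done:done + chunk_size]
        copy_chunk(chunk)                # idempotent upsert in practice
        done += len(chunk)               # checkpoint => resumable
    return done
```

In production the checkpoint would be persisted (e.g., last copied primary key) so a crashed job resumes instead of restarting.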

Phase 4 — Read switch + parity verification

Switch reads progressively: by percentage, tenant, or cohort behind a feature flag, with instant revert to the old read path.

Gate progression on parity checks: row counts, checksums over stable columns, and sampled record-by-record comparison between old and new paths.
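A sampled parity gate is simple to express; the read functions and tolerance here are illustrative, and a real check would also log each mismatching key for diagnosis:

```python
import random

def parity_ok(read_old, read_new, keys, sample=100, tolerance=0):
    """Sample keys and compare both read paths; gate the next
    read-percentage step on mismatches staying within tolerance."""
    mismatches = 0
    for key in random.sample(keys, min(sample, len(keys))):
        if read_old(key) != read_new(key):
            mismatches += 1
    return mismatches <= tolerance
```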

Phase 5 — Contract (remove legacy only after soak)

Only after:

  • Parity checks have passed at 100% reads for a defined soak period.
  • No code path or external consumer still reads or writes the legacy schema.
  • The team explicitly accepts that the cheap rollback window is closing.

Then:

  • Stop dual writes.
  • Drop legacy columns/tables in a separate, later release.
  • Keep a backup or archived copy until retention policy allows deletion.


Migration SLOs (use these, not vibes)

Define migration as an SLO-controlled operation: the migration job consumes an explicit error/latency budget like any other workload, and its state transitions are driven by measurements, not by an operator’s gut feel.

Set explicit budgets (example):

  • Replication lag: < 5 s sustained.
  • p99 write latency regression: < 10% over baseline.
  • Backfill chunk error rate: < 0.1%, with automatic retry.

If budgets are breached, migration state machine must auto-transition to THROTTLED or PAUSED.
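The budget-to-state mapping can be made mechanical. A minimal sketch, with example thresholds that are illustrations rather than recommendations, and string states for brevity:

```python
def next_state(state, lag_s, p99_delta_pct,
               lag_budget_s=5.0, p99_budget_pct=10.0):
    """Drive the migration state from measurements, not vibes."""
    if state == "PAUSED":
        return "PAUSED"              # only an operator resumes explicitly
    if lag_s > 2 * lag_budget_s:
        return "PAUSED"              # hard breach: stop entirely
    if lag_s > lag_budget_s or p99_delta_pct > p99_budget_pct:
        return "THROTTLED"           # soft breach: slow down
    return "RUNNING"
```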


Failure modes that repeatedly cause incidents

  1. Coupling code deploy with irreversible DDL in one step

    • Fix: separate expand and contract releases.
  2. No explicit cutover window control

    • Fix: postpone-able/manual cutover with rollback readiness.
  3. Backfill without load feedback

    • Fix: adaptive throttling tied to lag + latency.
  4. Assuming online means zero risk

    • Fix: preflight constraints, triggers, FK behavior, and lock edge cases.
  5. Dropping old schema before proving parity

    • Fix: parity checklist + soak period as hard gate.

Rollback strategy by phase

  • Expand: roll back by simply not proceeding; the additive schema is inert.
  • Dual write: disable the new write path via flag; the old schema is untouched.
  • Backfill: pause or abandon; partially filled columns are harmless while reads still hit the old path.
  • Read switch: flip reads back to the old path; dual writes have kept it current.
  • Contract: restore from backup — slow and expensive. Avoid ever needing it.

Interpretation: your cheapest rollback exists before contract. Protect that window.


Tooling selection cheat sheet

  • Native online DDL (PostgreSQL metadata-only column adds, CREATE INDEX CONCURRENTLY; MySQL ALGORITHM=INSTANT/INPLACE): first choice when the change qualifies.
  • pt-online-schema-change: trigger-based copy; mature, but adds synchronous write overhead.
  • gh-ost: binlog-based copy; throttleable and pausable, requires binlog access.
  • Vitess / PlanetScale safe migrations: managed queueing, verification, and throttling on top of online copy-based change.

Minimal preflight checklist

  • Every DDL statement classified into bucket A/B/C for the exact engine version.
  • Lock/metadata semantics verified for the specific ALTER subcommand.
  • Dual-write kill switch and read-path flag exercised in staging.
  • Backfill throttle wired to a real load signal, not a fixed sleep.
  • Parity check method and soak duration agreed before cutover.
  • Named owner for cutover, and for rollback at every phase.

If any item is missing, you are not ready for “zero-downtime” claims.

