Local-First Collaboration Systems: OT vs CRDT Sync Engine Playbook

2026-03-05 · software

Purpose: A practical decision and operations guide for choosing and running a real-time collaborative sync engine (Operational Transformation vs CRDT) in production.


Why this matters

Most teams underestimate collaborative editing complexity until they hit one of these failures:

If your app handles notes, docs, whiteboards, knowledge bases, or design files, the sync model is effectively your database transaction model. Pick wrong, and every feature becomes harder.


The two dominant families

1) Operational Transformation (OT)

Mental model:

Strengths:

Tradeoffs:

2) CRDTs (Conflict-Free Replicated Data Types)

Mental model:

Strengths:

Tradeoffs:


Decision framework (use this, not vibes)

Pick CRDT-first when:

Pick OT-first when:

Hybrid is common:


Architecture blueprint (production-safe)

Layer A: State sync core

Layer B: Presence (separate from durable state)

Presence should be ephemeral:
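A minimal sketch of one way to keep presence ephemeral: an in-memory registry where entries expire unless the client keeps sending heartbeats. The names PresenceRegistry, heartbeat, and active, and the 30-second TTL, are illustrative assumptions rather than any particular library's API.

```python
import time
from typing import Callable, Dict, Tuple


class PresenceRegistry:
    """Hypothetical in-memory presence store; nothing here is persisted."""

    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def heartbeat(self, client_id: str, state: dict) -> None:
        # Each heartbeat overwrites the previous state and refreshes the timestamp.
        self._entries[client_id] = (self.clock(), state)

    def active(self) -> Dict[str, dict]:
        now = self.clock()
        # Drop entries whose last heartbeat is older than the TTL.
        self._entries = {
            cid: (ts, st)
            for cid, (ts, st) in self._entries.items()
            if now - ts <= self.ttl
        }
        return {cid: st for cid, (_, st) in self._entries.items()}
```

Because the registry never touches the durable log, a crashed or disconnected client simply ages out instead of leaving stale cursors in document history.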

Layer C: Policy and invariants

CRDT/OT convergence is not business correctness: replicas can agree on a state that still violates a product rule.

Enforce separately:

Pattern:

  1. Accept sync update
  2. Materialize candidate state
  3. Validate policy/invariants
  4. Commit + broadcast if valid, reject/repair otherwise
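The four steps above can be sketched as a small gate in front of the committed state. This is an illustration under simplifying assumptions: updates are plain dicts of changed fields (a real engine would apply an OT operation or CRDT update instead), and SyncGate, title_not_empty, and max_assignees are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

Doc = Dict[str, Any]
# An invariant returns an error message, or None when the candidate is acceptable.
Invariant = Callable[[Doc], Optional[str]]


@dataclass
class SyncGate:
    state: Doc
    invariants: List[Invariant]
    broadcast: Callable[[Doc], None]
    rejected: List[str] = field(default_factory=list)

    def handle(self, update: Doc) -> bool:
        # 1. Accept sync update (assumed here to be a dict of changed fields).
        # 2. Materialize candidate state without touching committed state.
        candidate = {**self.state, **update}
        # 3. Validate policy/invariants against the candidate.
        errors = []
        for invariant in self.invariants:
            message = invariant(candidate)
            if message is not None:
                errors.append(message)
        if errors:
            # 4b. Reject; a real system might emit a repair update instead.
            self.rejected.extend(errors)
            return False
        # 4a. Commit, then broadcast the accepted update.
        self.state = candidate
        self.broadcast(update)
        return True


def title_not_empty(doc: Doc) -> Optional[str]:
    return None if str(doc.get("title", "")).strip() else "title must not be empty"


def max_assignees(doc: Doc) -> Optional[str]:
    return None if len(doc.get("assignees", [])) <= 3 else "at most 3 assignees"
```

Keeping validation outside the merge function means the sync core stays generic while product rules can change without touching it.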

Layer D: Persistence and replay

Store:

Without replay tooling, incident response becomes guesswork.
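As a rough sketch of what replay tooling can look like, the store below keeps an append-only update log plus periodic snapshots and can rebuild the state at any sequence number. DocStore and its methods are hypothetical, and updates are modeled as JSON dicts that overwrite fields; with a real CRDT or OT library the payloads would be that library's encoded updates.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DocStore:
    # Append-only log of (sequence number, JSON-encoded update).
    log: List[Tuple[int, str]] = field(default_factory=list)
    # Sequence number -> JSON-encoded materialized state at that point.
    snapshots: Dict[int, str] = field(default_factory=dict)

    def append(self, update: dict) -> int:
        seq = len(self.log) + 1
        self.log.append((seq, json.dumps(update, sort_keys=True)))
        return seq

    def snapshot(self) -> int:
        seq = len(self.log)
        self.snapshots[seq] = json.dumps(self.replay(seq), sort_keys=True)
        return seq

    def replay(self, upto: int) -> dict:
        # Start from the newest snapshot at or before `upto`, if any.
        base = max((s for s in self.snapshots if s <= upto), default=0)
        state = json.loads(self.snapshots[base]) if base else {}
        # Log index 0 holds seq 1, so entries base+1..upto are log[base:upto].
        for _, raw in self.log[base:upto]:
            state.update(json.loads(raw))
        return state
```

During an incident, replaying to the sequence number just before and just after a suspect update shows exactly what that update changed.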


Data modeling rules that prevent pain

  1. Prefer coarse-grained top-level docs, fine-grained fields inside
    • Too many tiny docs increase fanout and transactional complexity
  2. Model order explicitly for lists/blocks
    • Avoid hidden dependence on array index semantics
  3. Separate ephemeral vs durable state
    • Presence in awareness channel; content in sync log
  4. Use stable IDs everywhere
    • Never rely on position as identity
  5. Plan tombstone/compaction lifecycle up front
    • Long-lived docs otherwise degrade in memory/load latency
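Rules 2, 4, and 5 can be illustrated together with a block list where identity is a stable ID, order is an explicit sort key, and deletes leave tombstones until compaction. This is a simplified single-replica sketch: Block and BlockList are hypothetical names, and float midpoints stand in for a proper fractional-indexing scheme (floats run out of precision after many inserts at the same spot).

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Block:
    id: str            # stable identity, never derived from position
    order: float       # explicit sort key
    text: str
    deleted: bool = False  # tombstone flag


class BlockList:
    def __init__(self) -> None:
        self.blocks: Dict[str, Block] = {}

    def _sorted(self) -> List[Block]:
        # Ties on order are broken by ID so every replica sorts the same way.
        return sorted(self.blocks.values(), key=lambda b: (b.order, b.id))

    def visible(self) -> List[Block]:
        return [b for b in self._sorted() if not b.deleted]

    def insert_after(self, prev_id: Optional[str], text: str) -> str:
        ordered = self._sorted()  # tombstones still occupy positions
        ids = [b.id for b in ordered]
        pos = ids.index(prev_id) + 1 if prev_id is not None else 0
        lo = ordered[pos - 1].order if pos > 0 else None
        hi = ordered[pos].order if pos < len(ordered) else None
        if lo is None and hi is None:
            order = 0.0
        elif lo is None:
            order = hi - 1.0
        elif hi is None:
            order = lo + 1.0
        else:
            order = (lo + hi) / 2
        block = Block(id=uuid.uuid4().hex, order=order, text=text)
        self.blocks[block.id] = block
        return block.id

    def delete(self, block_id: str) -> None:
        self.blocks[block_id].deleted = True

    def compact(self) -> int:
        # Assumption: only safe once every replica has seen the deletes.
        before = len(self.blocks)
        self.blocks = {i: b for i, b in self.blocks.items() if not b.deleted}
        return before - len(self.blocks)
```

Inserting "B" between "A" and "C" changes no existing block, which is what keeps concurrent inserts from rewriting each other's positions.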

Performance and cost realities

What dominates in practice:

Operational heuristics:

Define SLOs early:


Testing strategy (must-have)

  1. Deterministic simulation harness
    • Reorder, drop, duplicate, and delay messages
  2. Multi-device offline/reconnect fuzzing
    • N replicas, random edit streams, random partitions
  3. Schema migration replay tests
    • Old logs + new code = same materialized state
  4. Invariant property tests
    • “Converged” is insufficient; assert business correctness
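A seeded simulation along the lines of items 1, 2, and 4 might look like the sketch below. It uses a state-based grow-only counter as a stand-in CRDT (an assumption chosen for brevity; the document does not prescribe a data type), shuffles the network with a seeded random generator, and checks both convergence and an invariant: no increment is lost. GCounter and simulate are hypothetical names.

```python
import random
from typing import Dict, List, Tuple


class GCounter:
    """State-based grow-only counter: merge takes the per-replica maximum."""

    def __init__(self, replica_id: str) -> None:
        self.id = replica_id
        self.counts: Dict[str, int] = {}

    def increment(self) -> None:
        self.counts[self.id] = self.counts.get(self.id, 0) + 1

    def merge(self, other: Dict[str, int]) -> None:
        for rid, n in other.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self) -> int:
        return sum(self.counts.values())


def simulate(seed: int, replicas: int = 3, steps: int = 200) -> Tuple[List[int], int]:
    rng = random.Random(seed)  # same seed, same run: bugs stay reproducible
    nodes = [GCounter(f"r{i}") for i in range(replicas)]
    in_flight: List[Tuple[int, Dict[str, int]]] = []
    total = 0
    for _ in range(steps):
        if rng.random() < 0.4:
            node = rng.choice(nodes)
            node.increment()
            total += 1
            for target in range(replicas):
                if nodes[target] is not node:
                    in_flight.append((target, dict(node.counts)))
        elif in_flight:
            # Picking a random message models reordering and delay.
            i = rng.randrange(len(in_flight))
            target, payload = in_flight[i]
            fate = rng.random()
            if fate < 0.2:
                in_flight.pop(i)                 # drop
            elif fate < 0.4:
                nodes[target].merge(payload)     # deliver but keep: duplicate
            else:
                in_flight.pop(i)
                nodes[target].merge(payload)     # normal delivery
    # Final anti-entropy round, modeling reconnect after partitions heal.
    for a in nodes:
        for b in nodes:
            b.merge(a.counts)
    return [n.value() for n in nodes], total


if __name__ == "__main__":
    for seed in range(100):
        values, total = simulate(seed)
        assert len(set(values)) == 1, f"diverged with seed {seed}"
        assert values[0] == total, f"lost increments with seed {seed}"
```

When an assertion fails, the seed in the message is the whole bug report: rerunning simulate with it replays the exact message schedule.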

Golden rule:

If you can’t replay and deterministically reproduce a sync bug, you don’t own the system yet.


Common anti-patterns

  1. Treating CRDT as “no backend logic needed”
  2. Mixing presence and durable state in one channel
  3. No update-format versioning
  4. No compaction plan for long-lived docs
  5. Assuming convergence implies valid business state
  6. Shipping without partition/reconnect fuzz tests
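For anti-pattern 3, one possible shape for update-format versioning is a small envelope with an explicit version and a chain of upgraders. The field names "v" and "payload", the v1-to-v2 rename, and the function names are all invented for illustration.

```python
import json
from typing import Callable, Dict

CURRENT_VERSION = 2


def _v1_to_v2(payload: dict) -> dict:
    # Hypothetical migration: v1 stored content under "text", v2 uses "body".
    out = dict(payload)
    out["body"] = out.pop("text", "")
    return out


# Maps a version to the function that upgrades it to the next version.
UPGRADERS: Dict[int, Callable[[dict], dict]] = {1: _v1_to_v2}


def encode(payload: dict) -> str:
    return json.dumps({"v": CURRENT_VERSION, "payload": payload})


def decode(raw: str) -> dict:
    envelope = json.loads(raw)
    version = envelope.get("v")
    if not isinstance(version, int) or version < 1:
        raise ValueError("update has no valid format version")
    if version > CURRENT_VERSION:
        # Refuse rather than guess; an old client must not corrupt new data.
        raise ValueError("update is newer than this client understands")
    payload = envelope["payload"]
    while version < CURRENT_VERSION:
        payload = UPGRADERS[version](payload)
        version += 1
    return payload
```

The same decode path can be pointed at archived logs, which is what makes "old logs + new code" replay tests practical.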

Migration guidance (centralized app → local-first)

Minimal-risk path:

  1. Start with a single collaborative surface (e.g., notes body)
  2. Keep existing backend as authority for permissions/billing/search
  3. Add client-local persistence and background sync
  4. Introduce snapshots + replay tooling before scale-up
  5. Gradually move more surfaces as observability matures
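Step 3 is often built as a local outbox: writes land in client-local storage first, and a background task pushes them when the network allows. The sketch below uses the standard sqlite3 module; LocalOutbox, its table layout, and the send callback are assumptions, and the server is assumed to tolerate a resend if the client crashes between sending and marking a row as synced.

```python
import sqlite3
from typing import Callable


class LocalOutbox:
    """Hypothetical client-side outbox backed by SQLite."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "payload TEXT NOT NULL, "
            "synced INTEGER NOT NULL DEFAULT 0)"
        )
        self.db.commit()

    def write(self, payload: str) -> None:
        # Local write succeeds even when offline.
        self.db.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))
        self.db.commit()

    def pending(self) -> int:
        row = self.db.execute(
            "SELECT COUNT(*) FROM outbox WHERE synced = 0"
        ).fetchone()
        return row[0]

    def flush(self, send: Callable[[str], bool]) -> int:
        rows = self.db.execute(
            "SELECT id, payload FROM outbox WHERE synced = 0 ORDER BY id"
        ).fetchall()
        sent = 0
        for row_id, payload in rows:
            if not send(payload):
                break  # preserve order; retry from here on the next flush
            self.db.execute("UPDATE outbox SET synced = 1 WHERE id = ?", (row_id,))
            self.db.commit()
            sent += 1
        return sent
```

A background timer or connectivity callback would call flush; the existing backend stays the authority and simply receives the same updates later than it used to.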

Don’t attempt full-domain migration in one release.


Tooling signals from ecosystem docs

The practical takeaway: library choice is less important than operational discipline around schema boundaries, invariants, replay, and compaction.


Rule of thumb

Choose the sync model that matches your failure mode tolerance, not your demo speed.

Either way, the winning teams treat sync as infrastructure, not widget glue.