Local-First Collaboration Systems: OT vs CRDT Sync Engine Playbook
Date: 2026-03-05
Category: software
Purpose: A practical decision and operations guide for choosing and running a real-time collaborative sync engine (Operational Transformation vs CRDT) in production.
Why this matters
Most teams underestimate collaborative editing complexity until they hit one of these failures:
- Offline edits vanish after reconnect
- Cursor/presence desync causes phantom users
- Reordering bugs corrupt rich-text structure
- Server-side constraints fight client-side merges
- History/log format becomes impossible to migrate
If your app is notes, docs, whiteboards, knowledge bases, or design tools, the sync model is your database transaction model. Pick wrong, and every feature becomes harder.
The two dominant families
1) Operational Transformation (OT)
Mental model:
- Clients submit operations relative to a document version
- A server (or another single authoritative ordering point) transforms concurrent ops against each other
- Goal: preserve user intent while converging to one state
Strengths:
- Mature history in collaborative editors
- Can be efficient for centrally coordinated systems
- Good fit if a central authority is guaranteed and always online
Tradeoffs:
- Correct transform functions are hard (especially beyond plain text)
- Rich structured data can be painful
- Offline-first, multi-device, and P2P topologies become harder to operate
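To make the OT mental model concrete, here is a minimal sketch of transforming two concurrent plain-text inserts. The `Insert` type, the `site` tie-breaker, and the function names are assumptions made for illustration; they are not taken from ShareDB or any other library, and real OT systems must also handle deletes and composed operations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int    # index in the document version the op was authored against
    text: str
    site: str   # hypothetical replica id, used only to break position ties


def apply(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op: Insert, against: Insert) -> Insert:
    """Rewrite `op` so it can be applied after `against` has already been applied."""
    shifts = against.pos < op.pos or (
        against.pos == op.pos and against.site < op.site
    )
    if shifts:
        return Insert(op.pos + len(against.text), op.text, op.site)
    return op


base = "helo"
a = Insert(3, "l", "alice")  # alice fixes the typo
b = Insert(4, "!", "bob")    # bob appends punctuation concurrently

# Each side applies its own op first, then the transformed remote op.
alice_view = apply(apply(base, a), transform(b, a))
bob_view = apply(apply(base, b), transform(a, b))
```

Both views end up as the same string. Even this two-op, insert-only case needs an explicit tie-break rule, which hints at why transform functions for rich structures get hard.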
2) CRDTs (Conflict-Free Replicated Data Types)
Mental model:
- Replicas apply local changes immediately
- Changes are merged deterministically without central arbitration
- Goal: eventual convergence under arbitrary message ordering/duplication
Strengths:
- Strong offline-first and multi-device behavior
- Natural fit for local-first UX
- P2P and server-assisted topologies both possible
Tradeoffs:
- Data structure and metadata overhead can be non-trivial
- Some invariants (business rules) are not solved by CRDT merge alone
- Garbage collection/compaction strategy must be explicit
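As a sketch of what "merged deterministically without central arbitration" means, the example below uses a grow-only counter (G-Counter), one of the simplest CRDTs. The function names and replica ids are illustrative assumptions; document CRDTs such as those in Yjs or Automerge are far more involved, but they rely on the same merge properties.

```python
def increment(state: dict, replica: str, n: int = 1) -> dict:
    """Each replica only ever bumps its own slot."""
    out = dict(state)
    out[replica] = out.get(replica, 0) + n
    return out


def merge(a: dict, b: dict) -> dict:
    """Per-replica maximum: commutative, associative, and idempotent."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def value(state: dict) -> int:
    return sum(state.values())


# Two devices edit while disconnected from each other.
phone = increment(increment({}, "phone"), "phone")
laptop = increment({}, "laptop", 3)

merged = merge(phone, laptop)
```

Merging in either order, or merging the same state twice, yields the same result. That is what lets replicas tolerate reordered and duplicated messages.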
Decision framework (use this, not vibes)
Pick CRDT-first when:
- Offline editing is core, not “nice to have”
- You need resilient multi-device sync with weak connectivity
- P2P or edge-heavy sync is a roadmap item
- You want local-first UX (instant writes, optimistic by design)
Pick OT-first when:
- Architecture is strictly central-server and always-connected
- Core domain is linear text with strong centralized coordination
- Team has existing OT expertise/infra and low offline requirements
Hybrid is common:
- CRDT for document state
- Separate channels for presence, locks, and server validation
Architecture blueprint (production-safe)
Layer A: State sync core
- Use a battle-tested library (e.g., Yjs or Automerge and their surrounding ecosystems)
- Define explicit document schema boundaries (doc-level sharding)
- Version the binary update format from day 1
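One way to version the update format is a small envelope around the library's opaque update bytes. The sketch below is an assumption for illustration: the magic bytes, header layout, and function names are hypothetical and are not part of the Yjs or Automerge wire formats.

```python
import struct

MAGIC = b"SYNC"                   # hypothetical marker for this app's frames
HEADER = struct.Struct(">4sBI")   # magic, format version, payload length


def encode_frame(version: int, payload: bytes) -> bytes:
    return HEADER.pack(MAGIC, version, len(payload)) + payload


def decode_frame(frame: bytes):
    magic, version, length = HEADER.unpack_from(frame)
    if magic != MAGIC:
        raise ValueError("not a sync frame")
    payload = frame[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated frame")
    # Callers dispatch on `version` to pick the right decoder or migration.
    return version, payload


frame = encode_frame(1, b"opaque-update-bytes")
```

Carrying the version outside the payload lets old logs be identified and migrated without first parsing them with the wrong decoder.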
Layer B: Presence (separate from durable state)
Presence should be ephemeral:
- Cursor position, selection, “typing”, online status
- Heartbeat + timeout expiration
- Never treat presence as canonical durable state
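The heartbeat-plus-timeout idea can be sketched as an in-memory registry that is never persisted. `PresenceRegistry`, its method names, and the 30-second TTL are illustrative assumptions; the clock is passed in explicitly so the behavior stays testable.

```python
from dataclasses import dataclass, field


@dataclass
class PresenceRegistry:
    ttl_seconds: float = 30.0                     # assumed timeout, tune per product
    _entries: dict = field(default_factory=dict)  # client_id -> (last_seen, state)

    def heartbeat(self, client_id: str, state: dict, now: float) -> None:
        """Record the latest ephemeral state (cursor, selection, typing flag)."""
        self._entries[client_id] = (now, state)

    def active(self, now: float) -> dict:
        """Drop clients whose last heartbeat is older than the TTL, return the rest."""
        self._entries = {
            cid: (seen, state)
            for cid, (seen, state) in self._entries.items()
            if now - seen <= self.ttl_seconds
        }
        return {cid: state for cid, (_, state) in self._entries.items()}


presence = PresenceRegistry()
presence.heartbeat("alice", {"cursor": 12}, now=0.0)
presence.heartbeat("bob", {"cursor": 40, "typing": True}, now=20.0)
```

Because entries expire on their own, a crashed client cannot leave a phantom user behind for longer than the TTL.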
Layer C: Policy and invariants
CRDT/OT convergence is not business correctness.
Enforce separately:
- Permission checks
- Referential integrity
- Domain rules (e.g., workflow states, unique constraints)
Pattern:
- Accept sync update
- Materialize candidate state
- Validate policy/invariants
- Commit + broadcast if valid, reject/repair otherwise
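The four-step pattern above can be sketched as a single function. Everything here is a simplifying assumption: state and updates are plain dicts, `materialize` is a shallow merge standing in for the real sync library, and `workflow_rule` is a hypothetical domain rule. The sketch only rejects; a repair path (for example, issuing a compensating update so clients do not diverge) is left out.

```python
from dataclasses import dataclass

ALLOWED_TRANSITIONS = {
    ("draft", "review"),
    ("review", "draft"),
    ("review", "published"),
}


def workflow_rule(old: dict, new: dict) -> list:
    """Hypothetical invariant: status may only move along allowed edges."""
    before, after = old.get("status"), new.get("status")
    if before != after and (before, after) not in ALLOWED_TRANSITIONS:
        return [f"illegal status transition: {before} -> {after}"]
    return []


@dataclass
class Result:
    accepted: bool
    state: dict
    errors: list


def materialize(state: dict, update: dict) -> dict:
    return {**state, **update}  # stand-in for applying a real sync update


def handle_update(committed: dict, update: dict, validators, broadcast) -> Result:
    candidate = materialize(committed, update)                      # step 2
    errors = [e for check in validators for e in check(committed, candidate)]
    if errors:                                                      # step 3
        return Result(False, committed, errors)
    broadcast(update)                                               # step 4
    return Result(True, candidate, [])


sent = []
doc = {"status": "draft", "title": "Q3 plan"}
rejected = handle_update(doc, {"status": "published"}, [workflow_rule], sent.append)
accepted = handle_update(doc, {"status": "review"}, [workflow_rule], sent.append)
```

The merge itself would have converged in both cases; only the validator knows that skipping review is invalid.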
Layer D: Persistence and replay
Store:
- Snapshot checkpoints (fast load)
- Incremental updates/oplog (audit + replay)
- Compaction metadata
Without replay tooling, incident response becomes guesswork.
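A minimal sketch of snapshot-plus-oplog storage with replay, assuming updates are plain dict patches and storage is in memory. `DocStore` and `apply_update` are hypothetical names; a production store would persist both structures and use the sync library's own update application.

```python
from dataclasses import dataclass, field


def apply_update(state: dict, update: dict) -> dict:
    return {**state, **update}  # stand-in for the real merge


@dataclass
class DocStore:
    log: list = field(default_factory=list)        # [(seq, update)], append-only
    snapshots: list = field(default_factory=list)  # [(seq, state)], ascending seq

    def append(self, update: dict) -> int:
        seq = len(self.log) + 1
        self.log.append((seq, update))
        return seq

    def snapshot(self) -> None:
        seq = len(self.log)
        self.snapshots.append((seq, self.replay(upto=seq)))

    def replay(self, upto=None, use_snapshots: bool = True) -> dict:
        """Rebuild state at sequence `upto` from the nearest earlier snapshot."""
        upto = len(self.log) if upto is None else upto
        start_seq, state = 0, {}
        if use_snapshots:
            for seq, snap in self.snapshots:
                if seq <= upto:
                    start_seq, state = seq, dict(snap)
        for _, update in self.log[start_seq:upto]:
            state = apply_update(state, update)
        return state


store = DocStore()
store.append({"title": "Draft"})
store.append({"body": "hello"})
store.snapshot()
store.append({"title": "Final"})
```

Replaying to an arbitrary sequence number is what lets you reconstruct the document as it was just before an incident, and comparing the snapshot path against a full log replay is a cheap consistency check.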
Data modeling rules that prevent pain
- Prefer coarse-grained top-level docs, fine-grained fields inside
  - Too many tiny docs increase fanout and transactional complexity
- Model order explicitly for lists/blocks
  - Avoid hidden dependence on array index semantics
- Separate ephemeral vs durable state
  - Presence in awareness channel; content in sync log
- Use stable IDs everywhere
  - Never rely on position as identity
- Plan tombstone/compaction lifecycle up front
  - Long-lived docs otherwise degrade in memory/load latency
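As an illustration of stable IDs plus explicit ordering, the sketch below keys blocks by ID and stores a sortable `order` value instead of relying on array position. The midpoint scheme with `Fraction` and the `blk_*` IDs are assumptions for the example; CRDT libraries ship their own list types, and string-based fractional indexes are a common alternative.

```python
from fractions import Fraction
from typing import Optional


def key_between(lo: Optional[Fraction], hi: Optional[Fraction]) -> Fraction:
    """Pick an order key strictly between two neighbors (None means open end)."""
    if lo is None and hi is None:
        return Fraction(0)
    if lo is None:
        return hi - 1
    if hi is None:
        return lo + 1
    return (lo + hi) / 2


def ordered_ids(blocks: dict) -> list:
    # Tie-break on the stable ID so concurrent inserts into the same gap
    # still sort identically on every replica.
    return sorted(blocks, key=lambda bid: (blocks[bid]["order"], bid))


blocks = {
    "blk_a": {"order": Fraction(0), "text": "Intro"},
    "blk_c": {"order": Fraction(1), "text": "Summary"},
}

# Insert between two existing blocks without renumbering anything.
blocks["blk_b"] = {
    "order": key_between(blocks["blk_a"]["order"], blocks["blk_c"]["order"]),
    "text": "Details",
}

# Edits address the block by ID, never by its current position.
blocks["blk_c"]["text"] = "Conclusion"
```

Moving a block is then a single-field update to `order`, which merges far more cleanly than shifting every later index.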
Performance and cost realities
What dominates in practice:
- Initial document load and catch-up latency
- Update fanout and backpressure under hot rooms
- Memory overhead from metadata/tombstones
- Snapshot frequency and compaction stalls
Operational heuristics:
- Snapshot periodically by update count + time window
- Run compaction asynchronously with bounded CPU budget
- Apply room-level rate limits for abusive clients
- Use binary transport and compression for update frames
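The "update count + time window" snapshot heuristic can be sketched as a small policy object. The thresholds below are placeholder assumptions, not recommendations; tune them against your load and catch-up SLOs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotPolicy:
    max_updates: int = 500           # assumed threshold
    max_age_seconds: float = 300.0   # assumed threshold

    def should_snapshot(self, updates_since: int, seconds_since: float) -> bool:
        """Snapshot when either bound is hit, but never for an idle document."""
        if updates_since == 0:
            return False
        return (
            updates_since >= self.max_updates
            or seconds_since >= self.max_age_seconds
        )


policy = SnapshotPolicy()
```

The count bound caps replay cost for hot rooms, while the time bound keeps quiet documents from accumulating a long tail of unsnapshotted updates.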
Define SLOs early:
- P95 local apply latency
- P95 reconnect catch-up latency
- Presence freshness lag
- Server CPU/memory per active room
Testing strategy (must-have)
- Deterministic simulation harness
  - Reorder, drop, duplicate, and delay messages
- Multi-device offline/reconnect fuzzing
  - N replicas, random edit streams, random partitions
- Schema migration replay tests
  - Old logs + new code = same materialized state
- Invariant property tests
  - “Converged” is insufficient; assert business correctness
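A deterministic simulation harness can start very small. The sketch below assumes a toy last-writer-wins map as the system under test and models delivery faults as reordering and duplication driven by a seeded RNG. Drops are modeled only as delays, since every op is eventually delivered; a fuller harness would add real drops plus a retransmit path. All names are illustrative.

```python
import random


def apply(state: dict, op: tuple) -> None:
    """Last-writer-wins per key; the (counter, replica) stamp is unique per op."""
    key, value, stamp = op
    if key not in state or state[key][0] < stamp:
        state[key] = (stamp, value)


def simulate(seed: int, replicas: int = 3, ops_per_replica: int = 20) -> list:
    rng = random.Random(seed)  # the seed is the whole repro case
    ops = [
        (rng.choice("abc"), rng.randint(0, 99), (n, r))
        for r in range(replicas)
        for n in range(ops_per_replica)
    ]
    states = []
    for _ in range(replicas):
        delivery = ops + rng.sample(ops, k=len(ops) // 2)  # duplicates
        rng.shuffle(delivery)                              # reorder / delay
        state = {}
        for op in delivery:
            apply(state, op)
        states.append(state)
    return states


def converged(states: list) -> bool:
    return all(s == states[0] for s in states)
```

Because every run is a pure function of the seed, a failing seed from CI can be replayed locally as-is. Business invariants would be asserted on each converged state in the same loop.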
Golden rule:
If you can’t replay and deterministically reproduce a sync bug, you don’t own the system yet.
Common anti-patterns
- Treating CRDT as “no backend logic needed”
- Mixing presence and durable state in one channel
- No update-format versioning
- No compaction plan for long-lived docs
- Assuming convergence implies valid business state
- Shipping without partition/reconnect fuzz tests
Migration guidance (centralized app → local-first)
Minimal-risk path:
- Start with a single collaborative surface (e.g., notes body)
- Keep existing backend as authority for permissions/billing/search
- Add client-local persistence and background sync
- Introduce snapshots + replay tooling before scale-up
- Gradually move more surfaces as observability matures
Don’t attempt full-domain migration in one release.
Tooling signals from ecosystem docs
- Automerge positions itself as a local-first data structure layer with compact format and sync protocol, emphasizing offline + merge semantics.
- Yjs emphasizes network-agnostic CRDT shared types, offline editing, and provider-based awareness/presence.
- ShareDB remains a clear OT reference architecture for centralized real-time JSON collaboration.
The practical takeaway: library choice is less important than operational discipline around schema boundaries, invariants, replay, and compaction.
References
- Local-first software essay (Ink & Switch): https://www.inkandswitch.com/essay/local-first/
- Automerge docs (intro): https://automerge.org/docs/hello/
- Automerge repository README (sync protocol + local-first positioning): https://github.com/automerge/automerge
- Yjs docs (awareness protocol): https://docs.yjs.dev/api/about-awareness
- Yjs repository README (network-agnostic CRDT shared types): https://github.com/yjs/yjs
- ShareDB (OT backend reference): https://github.com/share/sharedb
Rule of thumb
Choose the sync model that matches your failure mode tolerance, not your demo speed.
- If offline correctness is product-critical: CRDT-first
- If centralized always-online editing is enough: OT can be simpler
Either way, the winning teams treat sync as infrastructure, not widget glue.