Local-First Collaboration Systems: OT vs CRDT Sync Engine Playbook

Date: 2026-03-05
Category: software
Purpose: A practical decision and operations guide for choosing and running a real-time collaborative sync engine (Operational Transformation vs CRDT) in production.

Why this matters

Most teams underestimate collaborative editing complexity until they hit one of these failures:

Offline edits vanish after reconnect
Cursor/presence desync causes phantom users
Reordering bugs corrupt rich-text structure
Server-side constraints fight client-side merges
History/log format becomes impossible to migrate

If you are building notes, docs, whiteboards, knowledge bases, or design tools, the sync model is effectively your database transaction model. Pick wrong, and every feature becomes harder.

The two dominant families

1) Operational Transformation (OT)

Mental model:

Clients submit operations relative to a document version
A server (or authority path) transforms concurrent ops against each other
Goal: preserve user intent while converging to one state
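To make the transform step concrete, here is a minimal sketch for the simplest possible case: two concurrent plain-text inserts against the same base version. The `Insert` type, the `site` tie-breaker, and the function names are assumptions for illustration, not taken from any particular OT library; deletes and rich-text operations need considerably more cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int   # index in the document version the op was created against
    text: str
    site: str  # hypothetical replica id, used only to break position ties


def apply(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op: Insert, against: Insert) -> Insert:
    """Rewrite `op` so it can be applied after `against` has already been applied."""
    shifts = against.pos < op.pos or (
        against.pos == op.pos and against.site < op.site
    )
    if shifts:
        return Insert(op.pos + len(against.text), op.text, op.site)
    return op


base = "abc"
a = Insert(1, "X", "alice")
b = Insert(1, "Y", "bob")

# Each side applies its own op first, then the transformed remote op.
left = apply(apply(base, a), transform(b, a))
right = apply(apply(base, b), transform(a, b))
```

Both application orders end in the same string because the tie at position 1 is resolved the same way on each side. Getting this property to hold for every pair of operation types is the part that becomes hard beyond plain text.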

Strengths:

Mature history in collaborative editors
Can be efficient for centrally coordinated systems
Good fit if a central authority is guaranteed and always online

Tradeoffs:

Correct transform functions are hard (especially beyond plain text)
Rich structured data can be painful
Offline-first, multi-device, and P2P operation all become harder

2) CRDTs (Conflict-Free Replicated Data Types)

Mental model:

Replicas apply local changes immediately
Changes are merged deterministically without central arbitration
Goal: eventual convergence regardless of message order or duplication, provided every update is eventually delivered
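A small sketch of what "merged deterministically" means in practice, using a grow-only counter, one of the simplest textbook state-based CRDTs. This is an illustration of the merge properties only; it is not how text CRDT libraries represent documents, and the replica names are made up.

```python
from typing import Dict

GCounter = Dict[str, int]  # replica id -> total contributed by that replica


def increment(state: GCounter, replica: str, by: int = 1) -> GCounter:
    out = dict(state)
    out[replica] = out.get(replica, 0) + by
    return out


def merge(a: GCounter, b: GCounter) -> GCounter:
    # Element-wise max is commutative, associative, and idempotent,
    # so order and duplication of merges do not affect the result.
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def value(state: GCounter) -> int:
    return sum(state.values())


phone = increment({}, "phone", 2)    # edits made offline on one device
laptop = increment({}, "laptop", 3)  # concurrent edits on another

merged = merge(phone, laptop)
```

Merging in the opposite order, or merging the same remote state twice, yields the same `merged` value, which is what lets replicas converge without a central arbiter.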

Strengths:

Strong offline-first and multi-device behavior
Natural fit for local-first UX
P2P and server-assisted topologies both possible

Tradeoffs:

Data structure and metadata overhead can be non-trivial
Some invariants (business rules) are not solved by CRDT merge alone
Garbage collection/compaction strategy must be explicit

Decision framework (use this, not vibes)

Pick CRDT-first when:

Offline editing is core, not “nice to have”
You need resilient multi-device sync with weak connectivity
P2P or edge-heavy sync is a roadmap item
You want local-first UX (instant writes, optimistic by design)

Pick OT-first when:

Architecture is strictly central-server and always-connected
Core domain is linear text with strong centralized coordination
Team has existing OT expertise/infra and low offline requirements

Hybrid is common:

CRDT for document state
Separate channels for presence, locks, and server validation

Architecture blueprint (production-safe)

Layer A: State sync core

Use a battle-tested library (e.g., Yjs or Automerge and their surrounding ecosystems)
Define explicit document schema boundaries (doc-level sharding)
Version the binary update format from day 1
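One way to version the update format is to wrap every stored or transmitted update in a small envelope. The sketch below is an assumption-laden illustration: the `SYNC` magic bytes, the header layout, and the version number are invented here, and the payload is treated as opaque bytes produced by whatever sync library is in use.

```python
import struct
from typing import Tuple

# Hypothetical header: 4-byte magic, 1-byte format version, 4-byte payload length.
HEADER = struct.Struct(">4sBI")
MAGIC = b"SYNC"
CURRENT_VERSION = 2


def encode_frame(payload: bytes, version: int = CURRENT_VERSION) -> bytes:
    return HEADER.pack(MAGIC, version, len(payload)) + payload


def decode_frame(frame: bytes) -> Tuple[int, bytes]:
    magic, version, length = HEADER.unpack_from(frame)
    if magic != MAGIC:
        raise ValueError("not a sync frame")
    if version > CURRENT_VERSION:
        raise ValueError(f"unsupported format version {version}")
    payload = frame[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated frame")
    # Callers dispatch on `version` to pick the right decoder or migration.
    return version, payload


frame = encode_frame(b"opaque-update-bytes")
version, payload = decode_frame(frame)
```

Returning the version alongside the payload keeps old logs readable after a format change, because the reader can route version 1 frames through a migration step instead of failing.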

Layer B: Presence (separate from durable state)

Presence should be ephemeral:

Cursor position, selection, “typing”, online status
Heartbeat + timeout expiration
Never treat presence as canonical durable state
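A minimal sketch of heartbeat plus timeout expiration, kept entirely in memory so nothing here is ever persisted. The class name, the 30-second TTL, and the state shape are assumptions; time is passed in explicitly so the behavior is easy to test.

```python
from typing import Dict, Tuple


class PresenceTable:
    """Ephemeral presence: entries vanish unless refreshed by a heartbeat."""

    def __init__(self, ttl_seconds: float = 30.0) -> None:
        self.ttl = ttl_seconds
        self._entries: Dict[str, Tuple[float, dict]] = {}  # client -> (last_seen, state)

    def heartbeat(self, client_id: str, state: dict, now: float) -> None:
        self._entries[client_id] = (now, state)

    def active(self, now: float) -> Dict[str, dict]:
        # Drop anyone whose last heartbeat is older than the TTL.
        self._entries = {
            c: (seen, s) for c, (seen, s) in self._entries.items()
            if now - seen <= self.ttl
        }
        return {c: s for c, (_, s) in self._entries.items()}


table = PresenceTable(ttl_seconds=30.0)
table.heartbeat("alice", {"cursor": 12}, now=0.0)
table.heartbeat("bob", {"cursor": 40, "typing": True}, now=20.0)

at_25 = table.active(now=25.0)  # both still fresh
at_40 = table.active(now=40.0)  # alice has timed out
```

Because expiry is driven purely by missing heartbeats, a crashed client disappears on its own, which avoids the phantom users described earlier.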

Layer C: Policy and invariants

CRDT/OT convergence is not business correctness.

Enforce separately:

Permission checks
Referential integrity
Domain rules (e.g., workflow states, unique constraints)

Pattern:

Accept sync update
Materialize candidate state
Validate policy/invariants
Commit + broadcast if valid, reject/repair otherwise
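The four steps above can be sketched as a single function. Everything named here is hypothetical: the dict overlay stands in for a real CRDT or OT apply step, and the two validators are example domain rules, not a recommended set.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, str]
Validator = Callable[[State], Optional[str]]  # error message, or None if OK

ALLOWED_STATUS = {"draft", "review", "published"}  # hypothetical workflow states


def status_is_known(state: State) -> Optional[str]:
    status = state.get("status", "draft")
    return None if status in ALLOWED_STATUS else f"unknown status: {status}"


def title_not_empty(state: State) -> Optional[str]:
    return None if state.get("title", "").strip() else "title must not be empty"


def handle_update(
    committed: State, update: State, validators: List[Validator]
) -> Tuple[State, List[str]]:
    candidate = {**committed, **update}  # materialize candidate state
    errors = []
    for check in validators:             # validate policy/invariants
        message = check(candidate)
        if message is not None:
            errors.append(message)
    if errors:
        return committed, errors         # reject: keep last committed state
    return candidate, []                 # commit: caller may now broadcast


CHECKS = [status_is_known, title_not_empty]
committed = {"title": "Plan", "status": "draft"}
ok_state, ok_errors = handle_update(committed, {"status": "review"}, CHECKS)
bad_state, bad_errors = handle_update(ok_state, {"status": "shipped"}, CHECKS)
```

This sketch only rejects. A repair path would instead emit a compensating update so that clients which already applied the change locally converge back to a valid state.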

Layer D: Persistence and replay

Store:

Snapshot checkpoints (fast load)
Incremental updates/oplog (audit + replay)
Compaction metadata

Without replay tooling, incident response becomes guesswork.
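A minimal sketch of how snapshots and an oplog fit together, assuming a toy key/value document and in-memory storage. The `DocStore` name and its methods are illustrative; a real store would persist both structures and hold library-specific update bytes rather than key/value pairs.

```python
from typing import Dict, List, Optional, Tuple

Op = Tuple[int, str, str]  # (sequence number, key, value)


class DocStore:
    def __init__(self) -> None:
        self.oplog: List[Op] = []           # full history, kept for audit + replay
        self.snapshot: Dict[str, str] = {}  # materialized state at snapshot_seq
        self.snapshot_seq = 0

    def append(self, key: str, value: str) -> int:
        seq = len(self.oplog) + 1
        self.oplog.append((seq, key, value))
        return seq

    def replay(self, upto: Optional[int] = None) -> Dict[str, str]:
        """Rebuild state from the log alone, optionally stopping at `upto`."""
        state: Dict[str, str] = {}
        for seq, key, value in self.oplog:
            if upto is not None and seq > upto:
                break
            state[key] = value
        return state

    def checkpoint(self) -> None:
        self.snapshot = self.replay()
        self.snapshot_seq = len(self.oplog)

    def load(self) -> Dict[str, str]:
        """Fast path: start from the snapshot and apply only the tail."""
        state = dict(self.snapshot)
        for _, key, value in self.oplog[self.snapshot_seq:]:
            state[key] = value
        return state


store = DocStore()
store.append("title", "Draft")
store.append("body", "hello")
store.checkpoint()
store.append("title", "Final")
```

The useful check during an incident is that `store.load()` and `store.replay()` agree, and that `store.replay(upto=n)` can reconstruct the document as it stood at any earlier sequence number.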

Data modeling rules that prevent pain

Prefer coarse-grained top-level docs, fine-grained fields inside
- Too many tiny docs increase fanout and transactional complexity
Model order explicitly for lists/blocks
- Avoid hidden dependence on array index semantics
Separate ephemeral vs durable state
- Presence in awareness channel; content in sync log
Use stable IDs everywhere
- Never rely on position as identity
Plan tombstone/compaction lifecycle up front
- Long-lived docs otherwise degrade in memory/load latency
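To illustrate stable IDs combined with explicit ordering, the sketch below gives every block an ID that never changes and a separate sortable order key, so inserting between two blocks does not renumber anything. The midpoint scheme using `Fraction` and all names are assumptions chosen for brevity; sequence CRDTs in real libraries use more robust algorithms, and concurrent inserts at the same spot can produce equal keys, which is why the sort falls back to the ID.

```python
from fractions import Fraction
from typing import Dict, List, Optional


def key_between(lo: Optional[Fraction], hi: Optional[Fraction]) -> Fraction:
    """Pick an order key strictly between two neighbors (None means open end)."""
    if lo is None and hi is None:
        return Fraction(0)
    if lo is None:
        return hi - 1
    if hi is None:
        return lo + 1
    return (lo + hi) / 2


blocks: Dict[str, dict] = {}  # stable block id -> {"order": Fraction, "text": str}


def insert_block(block_id: str, text: str,
                 after: Optional[str] = None, before: Optional[str] = None) -> None:
    lo = blocks[after]["order"] if after else None
    hi = blocks[before]["order"] if before else None
    blocks[block_id] = {"order": key_between(lo, hi), "text": text}


def ordered_ids() -> List[str]:
    # Tie-break on id so equal keys still sort the same way on every replica.
    return sorted(blocks, key=lambda b: (blocks[b]["order"], b))


insert_block("blk-intro", "Intro")
insert_block("blk-outro", "Outro", after="blk-intro")
insert_block("blk-body", "Body", after="blk-intro", before="blk-outro")
blocks["blk-body"]["text"] = "Body, revised"  # edits address the id, not a position
```

Here identity lives in the ID and order lives in the key, so an edit to `blk-body` still lands on the right block even if other replicas insert or move blocks around it.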

Performance and cost realities

What dominates in practice:

Initial document load and catch-up latency
Update fanout and backpressure under hot rooms
Memory overhead from metadata/tombstones
Snapshot frequency and compaction stalls

Operational heuristics:

Snapshot when either an update-count threshold or a time window is exceeded
Run compaction asynchronously with bounded CPU budget
Apply room-level rate limits for abusive clients
Use binary transport and compression for update frames

Define SLOs early:

P95 local apply latency
P95 reconnect catch-up latency
Presence freshness lag
Server CPU/memory per active room

Testing strategy (must-have)

Deterministic simulation harness
- Reorder, drop, duplicate, and delay messages
Multi-device offline/reconnect fuzzing
- N replicas, random edit streams, random partitions
Schema migration replay tests
- Old logs + new code = same materialized state
Invariant property tests
- “Converged” is insufficient; assert business correctness
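A deterministic simulation harness can be quite small. The sketch below uses a seeded random generator so any failing run can be reproduced from its seed, and a toy last-writer-wins map as the structure under test. All names are hypothetical. It models reordering and duplication only: every message is eventually delivered, so permanent drops and partitions would need retransmission logic on top.

```python
import random
from typing import Dict, List, Tuple

Tag = Tuple[int, str]      # (logical clock, replica id), unique per op
Op = Tuple[str, Tag, str]  # (key, tag, value)


class LwwMap:
    """Toy last-writer-wins map: the entry with the highest tag wins."""

    def __init__(self) -> None:
        self.entries: Dict[str, Tuple[Tag, str]] = {}

    def apply(self, op: Op) -> None:
        key, tag, value = op
        current = self.entries.get(key)
        if current is None or tag > current[0]:
            self.entries[key] = (tag, value)

    def view(self) -> Dict[str, str]:
        return {k: v for k, (_, v) in self.entries.items()}


def simulate(seed: int, replicas: int = 3, ops_per_replica: int = 20) -> List[Dict[str, str]]:
    rng = random.Random(seed)  # all randomness flows from the seed
    ops: List[Op] = []
    for r in range(replicas):
        for clock in range(1, ops_per_replica + 1):
            key = rng.choice(["title", "body", "status"])
            ops.append((key, (clock, f"r{r}"), f"v{rng.randrange(1000)}"))
    views = []
    for _ in range(replicas):
        inbox = ops + rng.sample(ops, k=len(ops) // 2)  # duplicate half the messages
        rng.shuffle(inbox)                              # reorder / delay
        replica = LwwMap()
        for op in inbox:
            replica.apply(op)
        views.append(replica.view())
    return views


def converged(views: List[Dict[str, str]]) -> bool:
    return all(v == views[0] for v in views)
```

A test suite would loop `simulate` over many seeds, assert `converged`, and then layer business invariant checks on each view, logging the seed of any failure so the exact run can be replayed.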

Golden rule:

If you can’t replay and deterministically reproduce a sync bug, you don’t own the system yet.

Common anti-patterns

Treating CRDT as “no backend logic needed”
Mixing presence and durable state in one channel
No update-format versioning
No compaction plan for long-lived docs
Assuming convergence implies valid business state
Shipping without partition/reconnect fuzz tests

Migration guidance (centralized app → local-first)

Minimal-risk path:

Start with a single collaborative surface (e.g., notes body)
Keep existing backend as authority for permissions/billing/search
Add client-local persistence and background sync
Introduce snapshots + replay tooling before scale-up
Gradually move more surfaces as observability matures

Don’t attempt full-domain migration in one release.

Tooling signals from ecosystem docs

Automerge positions itself as a local-first data structure layer with a compact format and a sync protocol, emphasizing offline editing and merge semantics.
Yjs emphasizes network-agnostic CRDT shared types, offline editing, and provider-based awareness/presence.
ShareDB remains a clear OT reference architecture for centralized real-time JSON collaboration.

The practical takeaway: library choice is less important than operational discipline around schema boundaries, invariants, replay, and compaction.

References

Local-first software essay (Ink & Switch):
https://www.inkandswitch.com/essay/local-first/
Automerge docs (intro):
https://automerge.org/docs/hello/
Automerge repository README (sync protocol + local-first positioning):
https://github.com/automerge/automerge
Yjs docs (awareness protocol):
https://docs.yjs.dev/api/about-awareness
Yjs repository README (network-agnostic CRDT shared types):
https://github.com/yjs/yjs
ShareDB (OT backend reference):
https://github.com/share/sharedb

Rule of thumb

Choose the sync model that matches your failure mode tolerance, not your demo speed.

If offline correctness is product-critical: CRDT-first
If centralized always-online editing is enough: OT can be simpler

Either way, the winning teams treat sync as infrastructure, not widget glue.