CRDT vs OT for Realtime Collaboration (Practical Playbook)
Date: 2026-02-26
Category: knowledge
Domain: distributed systems / collaborative editing
Why this matters
If you want Google-Docs-like collaboration, the hard part is not WebSocket plumbing.
The hard part is convergence under concurrent edits while keeping latency low.
Most teams eventually choose one of two families:
- OT (Operational Transformation)
- CRDT (Conflict-free Replicated Data Types)
Both can work in production. Most failures come from choosing a model that mismatches your product shape.
Executive summary (fast choice)
Choose OT when:
- you have a mostly text-document product,
- you can maintain a strong central server path,
- you need compact wire ops and strict compatibility with classic editor semantics.
Choose CRDT when:
- you need robust offline-first behavior,
- peer-to-peer or multi-region eventually-consistent sync matters,
- your data is not just plain text (lists/maps/comments/presence trees).
If undecided: start with centralized CRDT sync (client-server) and avoid P2P until product need is real.
Mental model
OT in one sentence
OT rewrites incoming operations against already-applied concurrent operations so every replica applies a transformed op sequence and converges.
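A minimal sketch of that idea for plain-text inserts, assuming a single `transformInsert` helper (the names and the tie-break parameter are illustrative; real OT systems also handle deletes, replaces, and full transform matrices):

```typescript
// Transform an incoming insert against a concurrent insert that was
// already applied, shifting its index so both replicas converge.
type Insert = { kind: "insert"; index: number; text: string };

// `appliedWinsTie` breaks ties at equal indices deterministically
// (e.g. by site id), which is required for convergence.
function transformInsert(op: Insert, applied: Insert, appliedWinsTie: boolean): Insert {
  const shift =
    applied.index < op.index ||
    (applied.index === op.index && appliedWinsTie);
  return shift ? { ...op, index: op.index + applied.text.length } : op;
}

function applyInsert(doc: string, op: Insert): string {
  return doc.slice(0, op.index) + op.text + doc.slice(op.index);
}

// Two replicas start from "ac"; A inserts "b" at 1, B inserts "d" at 2.
const base = "ac";
const opA: Insert = { kind: "insert", index: 1, text: "b" };
const opB: Insert = { kind: "insert", index: 2, text: "d" };

// Replica 1 applies A, then B transformed against A; replica 2 the reverse.
const r1 = applyInsert(applyInsert(base, opA), transformInsert(opB, opA, true));
const r2 = applyInsert(applyInsert(base, opB), transformInsert(opA, opB, false));
// Both replicas converge to "abcd".
```

The entire difficulty of OT lives in that transform function: getting it right for every pair of operation types, including ties, is where implementations break.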
CRDT in one sentence
CRDT designs the data type/operations so merges are mathematically convergent without needing a global transformation pipeline.
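The canonical minimal example is a grow-only counter (G-Counter), a state-based CRDT whose merge is commutative, associative, and idempotent by construction (a sketch; real text CRDTs are far more involved):

```typescript
// G-Counter: each replica increments only its own slot; merge takes the
// per-slot max, so replicas converge regardless of delivery order
// or duplicated messages.
type GCounter = Record<string, number>;

function increment(c: GCounter, replicaId: string, by = 1): GCounter {
  return { ...c, [replicaId]: (c[replicaId] ?? 0) + by };
}

function merge(a: GCounter, b: GCounter): GCounter {
  const out: GCounter = { ...a };
  for (const [id, n] of Object.entries(b)) {
    out[id] = Math.max(out[id] ?? 0, n);
  }
  return out;
}

function value(c: GCounter): number {
  return Object.values(c).reduce((s, n) => s + n, 0);
}

// Concurrent increments merge to the same total in either order.
const a = increment(increment({}, "A"), "A"); // A counted 2
const b = increment({}, "B");                 // B counted 1
const mergedAB = value(merge(a, b));          // 3
const mergedBA = value(merge(b, a));          // 3
```

Because convergence is a property of the data type itself, there is no transform pipeline to get wrong; the cost moves into metadata design instead.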
Tradeoff table
| Dimension | OT | CRDT |
|---|---|---|
| Core mechanism | Transform ops by context/version | Merge commutative/idempotent ops/states |
| Offline support | Possible, but server/version coupling can be tricky | Natural fit, especially op-based CRDTs |
| Central server dependency | Usually strong | Optional (still common in practice) |
| Text performance | Very good with mature implementations | Good, but metadata/GC design is crucial |
| Non-text shared data | More custom logic needed | Strong with map/list/set CRDT families |
| Implementation complexity | High in transform correctness | High in metadata, tombstones, compaction |
| Debuggability | Version + transform bugs are painful | Metadata growth + causal ordering bugs |
| Typical wire size | Often smaller | Can grow unless aggressively compacted |
Architecture patterns that actually work
1) Central relay (recommended default)
- Clients maintain local replicas.
- Server relays ops and stores durable history/snapshots.
- No client-to-client trust boundary issues.
This works for both OT and CRDT and is usually enough for 95% of products.
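The relay pattern can be sketched in memory (the `Relay` class and its method names are illustrative, not a real API; production relays run over WebSockets with durable storage):

```typescript
// Central relay sketch: the server assigns a total order via sequence
// numbers, keeps a durable history, and fans ops out to every client.
type RelayOp = { seq?: number; clientId: string; payload: string };

class Relay {
  private history: RelayOp[] = [];
  private clients = new Map<string, (op: RelayOp) => void>();

  // Connecting returns the stored history so a client can catch up.
  connect(clientId: string, onOp: (op: RelayOp) => void): RelayOp[] {
    this.clients.set(clientId, onOp);
    return [...this.history];
  }

  submit(op: RelayOp): void {
    const stamped = { ...op, seq: this.history.length };
    this.history.push(stamped); // durable log (in-memory here)
    for (const deliver of this.clients.values()) deliver(stamped);
  }
}

// Both clients observe ops in the same server-chosen order.
const relay = new Relay();
const seenByA: number[] = [];
const seenByB: number[] = [];
relay.connect("A", (op) => seenByA.push(op.seq!));
relay.connect("B", (op) => seenByB.push(op.seq!));
relay.submit({ clientId: "A", payload: "insert x" });
relay.submit({ clientId: "B", payload: "insert y" });
```

The server-assigned sequence number is what makes this pattern friendly to both models: OT uses it as the transform context, and CRDT sync simply benefits from ordered, durable delivery.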
2) Offline-first with reconnect reconciliation
- Local-first editing even without network.
- On reconnect: exchange missing ops via vector clock/version frontier.
- Use explicit backpressure to avoid replay storms.
CRDT usually simplifies this path.
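The frontier exchange step can be sketched as follows, assuming per-replica monotonic counters (a common but not universal design):

```typescript
// Reconnect reconciliation via version vectors: each side advertises the
// highest op counter it has seen per replica; the peer sends only ops
// beyond that frontier.
type VersionVector = Record<string, number>;
type LogOp = { origin: string; counter: number };

// Ops the remote side is missing, given its advertised frontier.
function missingOps(log: LogOp[], remote: VersionVector): LogOp[] {
  return log.filter((op) => op.counter > (remote[op.origin] ?? 0));
}

const localLog: LogOp[] = [
  { origin: "A", counter: 1 },
  { origin: "A", counter: 2 },
  { origin: "B", counter: 1 },
];
// The peer has seen A up to counter 1 and nothing from B,
// so it is missing A:2 and B:1.
const toSend = missingOps(localLog, { A: 1 });
```

Combined with the bounded-batch guardrail below under "Performance guardrails", this keeps reconnects cheap even after long offline periods.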
3) Multi-region collaboration
- Region-local ingest endpoints reduce RTT.
- Async cross-region replication.
- Strong observability for lag windows.
CRDT is often easier to reason about at region boundaries; OT can still work with disciplined sequencer design.
Data model choices (most teams underestimate this)
Text-only editor
- OT and CRDT both viable.
- Key question: do you need robust offline + background sync while app is suspended?
Rich document (text + comments + blocks + embeds)
- Model main structure as a CRDT map/list tree.
- Keep ephemeral UI state (cursor color, hover, selection previews) out of durable CRDT logs.
Whiteboard/graph app
- A CRDT map (set of objects) with per-object fields is often cleaner.
- Use LWW carefully; for numeric accumulators or ordering semantics use dedicated CRDT types.
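The "use LWW carefully" warning can be made concrete with a last-writer-wins register sketch (types and names are illustrative):

```typescript
// LWW register: merge keeps the write with the higher timestamp, breaking
// ties by replica id so both merge orders pick the same winner.
// Fine for independent fields (e.g. a shape's fill color); wrong for
// counters, where concurrent increments would silently overwrite each other.
type LWW<T> = { value: T; ts: number; replica: string };

function mergeLWW<T>(a: LWW<T>, b: LWW<T>): LWW<T> {
  if (a.ts !== b.ts) return a.ts > b.ts ? a : b;
  return a.replica > b.replica ? a : b; // deterministic tie-break
}

const left: LWW<string> = { value: "blue", ts: 10, replica: "A" };
const right: LWW<string> = { value: "red", ts: 12, replica: "B" };
// Both merge orders agree on the winner.
const w1 = mergeLWW(left, right).value; // "red"
const w2 = mergeLWW(right, left).value; // "red"
```

The deterministic tie-break is the easy-to-forget part: without it, two replicas merging writes with equal timestamps can pick different winners and diverge permanently.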
Performance guardrails
Snapshot + incremental log
- Periodically persist snapshots.
- Replay only tail ops at load.
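A minimal sketch of tail replay, assuming sequence-numbered ops and a string-append "op" as a stand-in for real document operations:

```typescript
// Snapshot + incremental log: on load, start from the snapshot state and
// replay only ops whose sequence number is past the snapshot.
type SeqOp = { seq: number; append: string };
type Snapshot = { seq: number; state: string };

function loadDocument(snapshot: Snapshot, log: SeqOp[]): string {
  return log
    .filter((op) => op.seq > snapshot.seq) // replay only the tail
    .reduce((state, op) => state + op.append, snapshot.state);
}

const snapshot: Snapshot = { seq: 2, state: "ab" };
const log: SeqOp[] = [
  { seq: 1, append: "a" }, // already folded into the snapshot
  { seq: 2, append: "b" }, // already folded into the snapshot
  { seq: 3, append: "c" }, // only this op is replayed
];
const doc = loadDocument(snapshot, log); // "abc"
```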
Compaction policy
- Define compaction trigger by op count or byte size.
- Keep compaction deterministic and versioned.
Tombstone/metadata control (CRDT)
- Plan garbage collection from day one.
- Never assume tombstones stay “small enough.”
Transform test corpus (OT)
- Maintain randomized concurrent-edit fuzz suite.
- Regression-test classic edge cases: insert/insert same index, delete-overlap, replace chains.
Bounded payloads
- Limit max ops per message and max catch-up batch.
- Throttle reconnect storms with server tokens.
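The batching half of this guardrail is a few lines (the chunk size is illustrative; production systems also bound bytes and pace batches with server-issued tokens):

```typescript
// Bounded catch-up batches: never ship the whole backlog in one message.
function chunkOps<T>(ops: T[], maxPerMessage: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < ops.length; i += maxPerMessage) {
    batches.push(ops.slice(i, i + maxPerMessage));
  }
  return batches;
}

const backlog = Array.from({ length: 7 }, (_, i) => i);
const batches = chunkOps(backlog, 3); // [[0,1,2],[3,4,5],[6]]
```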
Consistency model UX checklist
Users forgive tiny delays. They do not forgive data loss.
- Show sync state (online, syncing, offline-local).
- Expose conflict-safe semantics for destructive actions (delete block, resolve comment).
- Make undo/redo model explicit: local undo stack vs global collaborative history.
- Keep presence ephemeral and cheap (don’t persist cursor jitter as document history).
Testing strategy (minimum viable seriousness)
Deterministic simulation
- N clients, randomized interleavings, partitions, reconnects.
- Assert convergence hash equality after quiescence.
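The convergence assertion can be sketched like this, with a commutative set-union "apply" standing in for a real CRDT or OT engine:

```typescript
// Deterministic simulation skeleton: apply the same op set to replicas in
// different orders, then compare a canonical fingerprint after quiescence.
function applyAll(ops: string[]): Set<string> {
  const state = new Set<string>();
  for (const op of ops) state.add(op);
  return state;
}

// Canonical fingerprint: sort before joining so the representation is
// stable regardless of internal iteration order. Use a real hash
// (e.g. SHA-256 of the canonical form) in production.
function convergenceHash(state: Set<string>): string {
  return [...state].sort().join("|");
}

const ops = ["ins:a", "ins:b", "del:c"];
const replica1 = convergenceHash(applyAll(ops));
const replica2 = convergenceHash(applyAll([...ops].reverse()));
// replica1 === replica2 after quiescence, or the test fails loudly.
```

The key detail is canonicalizing before hashing: two convergent replicas can still serialize internal state differently, and a naive hash would report false divergence.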
Jepsen-lite chaos for collaboration backend
- Drop/reorder/duplicate websocket frames in staging.
- Restart sync nodes during heavy concurrent editing.
- Verify no permanent divergence and bounded recovery time.
Property tests
- Idempotency: applying same op twice doesn’t corrupt state.
- Commutativity (where expected): op ordering under causal constraints converges.
- Monotonic clocks/version frontier behavior remains valid under retries.
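A property-test sketch for the idempotency item, using a max-merge apply function as the system under test (swap in your real engine):

```typescript
// Idempotency property: applying the same op twice must leave state
// byte-for-byte unchanged.
type State = Record<string, number>;

function applyOp(s: State, op: { id: string; counter: number }): State {
  return { ...s, [op.id]: Math.max(s[op.id] ?? 0, op.counter) };
}

function equalStates(a: State, b: State): boolean {
  return JSON.stringify(a) === JSON.stringify(b);
}

// Randomized check across many generated ops.
let idempotent = true;
for (let i = 0; i < 1000; i++) {
  const op = { id: String(i % 5), counter: Math.floor(Math.random() * 100) };
  const once = applyOp({}, op);
  const twice = applyOp(once, op);
  if (!equalStates(once, twice)) idempotent = false;
}
```

The same harness shape covers the commutativity property: generate causally independent op pairs, apply them in both orders, and compare states.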
Observability that prevents 3 a.m. incidents
Track these as first-class metrics:
- replication lag (p50/p95/p99)
- catch-up replay duration
- document divergence alarms (checksum mismatch sampling)
- op ingest reject rate (schema/version mismatch)
- snapshot load latency
- metadata growth rate per document (CRDT)
Alert on trends, not just static thresholds.
Migration patterns
OT → CRDT
- Dual-write operation envelopes into both engines in shadow mode.
- Compare periodic canonical render hashes.
- Cut over by cohort, keep rollback bridge for a full release cycle.
CRDT → OT (rare but possible)
- Usually motivated by strict centralization plus a low appetite for CRDT metadata overhead.
- Requires canonical linearization layer and careful semantic mapping.
Common failure modes
- Treating presence events as durable document events.
- Shipping without compaction/GC strategy.
- Assuming “eventual consistency” means “no UX design needed.”
- Ignoring versioned schemas for operation payloads.
- No anti-entropy protocol for missed ops.
Practical stack suggestions
- CRDT-heavy path: Yjs / Automerge + custom sync relay + snapshot store
- OT-heavy path: ShareDB-style architecture + strict central sequencer
Regardless of stack:
- define protocol versioning policy,
- design replay/backfill APIs early,
- and automate convergence tests in CI.
30-minute decision rubric
Score each axis from 1 to 5:
- Offline-first criticality
- Multi-region or P2P need
- Data shape complexity (text-only vs rich graph)
- Team familiarity with transform math (OT) vs metadata/GC (CRDT)
- Acceptable operational complexity
Interpretation:
- High offline + rich data + multi-region → CRDT bias
- Centralized text editor + mature server infra → OT bias
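As a toy illustration of the rubric, a scorer might look like this (the weights and the cutover threshold are illustrative assumptions, not calibrated values):

```typescript
// Toy decision scorer: higher scores on the first three axes bias toward
// CRDT; the familiarity gap nudges the result toward whichever model the
// team already understands.
type Scores = {
  offline: number;         // offline-first criticality, 1-5
  multiRegionP2P: number;  // multi-region or P2P need, 1-5
  dataShape: number;       // data shape complexity, 1-5
  otFamiliarity: number;   // comfort with transform math, 1-5
  crdtFamiliarity: number; // comfort with metadata/GC, 1-5
};

function recommend(s: Scores): "CRDT" | "OT" {
  const crdtBias =
    s.offline + s.multiRegionP2P + s.dataShape +
    (s.crdtFamiliarity - s.otFamiliarity);
  // Midpoint of the three core axes (3 * 3 = 9) as the cutover.
  return crdtBias > 9 ? "CRDT" : "OT";
}

const offlineRichApp = recommend({
  offline: 5, multiRegionP2P: 4, dataShape: 4,
  otFamiliarity: 2, crdtFamiliarity: 3,
}); // "CRDT"
const centralTextEditor = recommend({
  offline: 1, multiRegionP2P: 1, dataShape: 1,
  otFamiliarity: 5, crdtFamiliarity: 2,
}); // "OT"
```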
Pick one, then invest in testing + observability before adding fancy features.
That discipline matters more than framework choice.