CRDT vs OT for Realtime Collaboration (Practical Playbook)
Date: 2026-02-26
Category: knowledge
Domain: distributed systems / collaborative editing
Why this matters
If you want Google-Docs-like collaboration, the hard part is not WebSocket plumbing.
The hard part is convergence under concurrent edits while keeping latency low.
Most teams eventually choose one of two families:
- OT (Operational Transformation)
- CRDT (Conflict-free Replicated Data Types)
Both can work in production. Most failures come from choosing a model that mismatches your product shape.
Executive summary (fast choice)
Choose OT when:
- you have a mostly text-document product,
- you can maintain a strong central server path,
- you need compact wire ops and strict compatibility with classic editor semantics.
Choose CRDT when:
- you need robust offline-first behavior,
- peer-to-peer or multi-region eventually-consistent sync matters,
- your data is not just plain text (lists/maps/comments/presence trees).
If undecided: start with centralized CRDT sync (client-server) and avoid P2P until product need is real.
Mental model
OT in one sentence
OT rewrites incoming operations against already-applied concurrent operations so every replica applies a transformed op sequence and converges.
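A minimal sketch of that idea for plain-text inserts, assuming a single `transformInsert` helper (the names and the tie-break parameter are illustrative; real OT systems also handle deletes, replaces, and full transform matrices):

```typescript
// Transform an incoming insert against a concurrent insert that was
// already applied, shifting its index so both replicas converge.
type Insert = { kind: "insert"; index: number; text: string };

// `appliedWinsTie` breaks ties at equal indices deterministically
// (e.g. by site id), which is required for convergence.
function transformInsert(op: Insert, applied: Insert, appliedWinsTie: boolean): Insert {
  const shift =
    applied.index < op.index ||
    (applied.index === op.index && appliedWinsTie);
  return shift ? { ...op, index: op.index + applied.text.length } : op;
}

function applyInsert(doc: string, op: Insert): string {
  return doc.slice(0, op.index) + op.text + doc.slice(op.index);
}

// Two replicas start from "ac"; A inserts "b" at 1, B inserts "d" at 2.
const base = "ac";
const opA: Insert = { kind: "insert", index: 1, text: "b" };
const opB: Insert = { kind: "insert", index: 2, text: "d" };

// Replica 1 applies A, then B transformed against A; replica 2 the reverse.
const r1 = applyInsert(applyInsert(base, opA), transformInsert(opB, opA, true));
const r2 = applyInsert(applyInsert(base, opB), transformInsert(opA, opB, false));
// Both replicas converge to "abcd".
```

The entire difficulty of OT lives in that transform function: getting it right for every pair of operation types, including ties, is where implementations break.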
CRDT in one sentence
CRDT designs the data type/operations so merges are mathematically convergent without needing a global transformation pipeline.
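The canonical minimal example is a grow-only counter (G-Counter), a state-based CRDT whose merge is commutative, associative, and idempotent by construction (a sketch; real text CRDTs are far more involved):

```typescript
// G-Counter: each replica increments only its own slot; merge takes the
// per-slot max, so replicas converge regardless of delivery order
// or duplicated messages.
type GCounter = Record<string, number>;

function increment(c: GCounter, replicaId: string, by = 1): GCounter {
  return { ...c, [replicaId]: (c[replicaId] ?? 0) + by };
}

function merge(a: GCounter, b: GCounter): GCounter {
  const out: GCounter = { ...a };
  for (const [id, n] of Object.entries(b)) {
    out[id] = Math.max(out[id] ?? 0, n);
  }
  return out;
}

function value(c: GCounter): number {
  return Object.values(c).reduce((s, n) => s + n, 0);
}

// Concurrent increments merge to the same total in either order.
const a = increment(increment({}, "A"), "A"); // A counted 2
const b = increment({}, "B");                 // B counted 1
const mergedAB = value(merge(a, b));          // 3
const mergedBA = value(merge(b, a));          // 3
```

Because convergence is a property of the data type itself, there is no transform pipeline to get wrong; the cost moves into metadata design instead.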
Tradeoff table
| Dimension | OT | CRDT |
|---|---|---|
| Core mechanism | Transform ops by context/version | Merge commutative/idempotent ops/states |
| Offline support | Possible, but server/version coupling can be tricky | Natural fit, especially op-based CRDTs |
| Central server dependency | Usually strong | Optional (still common in practice) |
| Text performance | Very good with mature implementations | Good, but metadata/GC design is crucial |
| Non-text shared data | More custom logic needed | Strong with map/list/set CRDT families |
| Implementation complexity | High in transform correctness | High in metadata, tombstones, compaction |
| Debuggability | Version + transform bugs are painful | Metadata growth + causal ordering bugs |
| Typical wire size | Often smaller | Can grow unless aggressively compacted |
Architecture patterns that actually work
1) Central relay (recommended default)
- Clients maintain local replicas.
- Server relays ops and stores durable history/snapshots.
- No client-to-client trust boundary issues.
This works for both OT and CRDT and is usually enough for 95% of products.
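The relay pattern can be sketched in memory (the `Relay` class and its method names are illustrative, not a real API; production relays run over WebSockets with durable storage):

```typescript
// Central relay sketch: the server assigns a total order via sequence
// numbers, keeps a durable history, and fans ops out to every client.
type RelayOp = { seq?: number; clientId: string; payload: string };

class Relay {
  private history: RelayOp[] = [];
  private clients = new Map<string, (op: RelayOp) => void>();

  // Connecting returns the stored history so a client can catch up.
  connect(clientId: string, onOp: (op: RelayOp) => void): RelayOp[] {
    this.clients.set(clientId, onOp);
    return [...this.history];
  }

  submit(op: RelayOp): void {
    const stamped = { ...op, seq: this.history.length };
    this.history.push(stamped); // durable log (in-memory here)
    for (const deliver of this.clients.values()) deliver(stamped);
  }
}

// Both clients observe ops in the same server-chosen order.
const relay = new Relay();
const seenByA: number[] = [];
const seenByB: number[] = [];
relay.connect("A", (op) => seenByA.push(op.seq!));
relay.connect("B", (op) => seenByB.push(op.seq!));
relay.submit({ clientId: "A", payload: "insert x" });
relay.submit({ clientId: "B", payload: "insert y" });
```

The server-assigned sequence number is what makes this pattern friendly to both models: OT uses it as the transform context, and CRDT sync simply benefits from ordered, durable delivery.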
2) Offline-first with reconnect reconciliation
- Local-first editing even without network.
- On reconnect: exchange missing ops via vector clock/version frontier.
- Use explicit backpressure to avoid replay storms.
CRDT usually simplifies this path.
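The frontier exchange step can be sketched as follows, assuming per-replica monotonic counters (a common but not universal design):

```typescript
// Reconnect reconciliation via version vectors: each side advertises the
// highest op counter it has seen per replica; the peer sends only ops
// beyond that frontier.
type VersionVector = Record<string, number>;
type LogOp = { origin: string; counter: number };

// Ops the remote side is missing, given its advertised frontier.
function missingOps(log: LogOp[], remote: VersionVector): LogOp[] {
  return log.filter((op) => op.counter > (remote[op.origin] ?? 0));
}

const localLog: LogOp[] = [
  { origin: "A", counter: 1 },
  { origin: "A", counter: 2 },
  { origin: "B", counter: 1 },
];
// The peer has seen A up to counter 1 and nothing from B,
// so it is missing A:2 and B:1.
const toSend = missingOps(localLog, { A: 1 });
```

Combined with the bounded-batch guardrail below under "Performance guardrails", this keeps reconnects cheap even after long offline periods.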
3) Multi-region collaboration
- Region-local ingest endpoints reduce RTT.
- Async cross-region replication.
- Strong observability for lag windows.
CRDT is often easier to reason about at region boundaries; OT can still work with disciplined sequencer design.
Data model choices (most teams underestimate this)
Text-only editor
- OT and CRDT both viable.
- Key question: do you need robust offline + background sync while app is suspended?
Rich document (text + comments + blocks + embeds)
- Model main structure as a CRDT map/list tree.
- Keep ephemeral UI state (cursor color, hover, selection previews) out of durable CRDT logs.
Whiteboard/graph app
- A CRDT map (set of objects) with per-object fields is often cleaner.
- Use LWW carefully; for numeric accumulators or ordering semantics use dedicated CRDT types.
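The "use LWW carefully" warning can be made concrete with a last-writer-wins register sketch (types and names are illustrative):

```typescript
// LWW register: merge keeps the write with the higher timestamp, breaking
// ties by replica id so both merge orders pick the same winner.
// Fine for independent fields (e.g. a shape's fill color); wrong for
// counters, where concurrent increments would silently overwrite each other.
type LWW<T> = { value: T; ts: number; replica: string };

function mergeLWW<T>(a: LWW<T>, b: LWW<T>): LWW<T> {
  if (a.ts !== b.ts) return a.ts > b.ts ? a : b;
  return a.replica > b.replica ? a : b; // deterministic tie-break
}

const left: LWW<string> = { value: "blue", ts: 10, replica: "A" };
const right: LWW<string> = { value: "red", ts: 12, replica: "B" };
// Both merge orders agree on the winner.
const w1 = mergeLWW(left, right).value; // "red"
const w2 = mergeLWW(right, left).value; // "red"
```

The deterministic tie-break is the easy-to-forget part: without it, two replicas merging writes with equal timestamps can pick different winners and diverge permanently.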
Performance guardrails
Snapshot + incremental log
- Periodically persist snapshots.
- Replay only tail ops at load.
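A minimal sketch of tail replay, assuming sequence-numbered ops and a string-append "op" as a stand-in for real document operations:

```typescript
// Snapshot + incremental log: on load, start from the snapshot state and
// replay only ops whose sequence number is past the snapshot.
type SeqOp = { seq: number; append: string };
type Snapshot = { seq: number; state: string };

function loadDocument(snapshot: Snapshot, log: SeqOp[]): string {
  return log
    .filter((op) => op.seq > snapshot.seq) // replay only the tail
    .reduce((state, op) => state + op.append, snapshot.state);
}

const snapshot: Snapshot = { seq: 2, state: "ab" };
const log: SeqOp[] = [
  { seq: 1, append: "a" }, // already folded into the snapshot
  { seq: 2, append: "b" }, // already folded into the snapshot
  { seq: 3, append: "c" }, // only this op is replayed
];
const doc = loadDocument(snapshot, log); // "abc"
```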
Compaction policy
- Define compaction trigger by op count or byte size.
- Keep compaction deterministic and versioned.
Tombstone/metadata control (CRDT)
- Plan garbage collection from day one.
- Never assume tombstones stay “small enough.”
Transform test corpus (OT)
- Maintain randomized concurrent-edit fuzz suite.
- Regression-test classic edge cases: insert/insert same index, delete-overlap, replace chains.
Bounded payloads
- Limit max ops per message and max catch-up batch.
- Throttle reconnect storms with server tokens.
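The batching half of this guardrail is a few lines (the chunk size is illustrative; production systems also bound bytes and pace batches with server-issued tokens):

```typescript
// Bounded catch-up batches: never ship the whole backlog in one message.
function chunkOps<T>(ops: T[], maxPerMessage: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < ops.length; i += maxPerMessage) {
    batches.push(ops.slice(i, i + maxPerMessage));
  }
  return batches;
}

const backlog = Array.from({ length: 7 }, (_, i) => i);
const batches = chunkOps(backlog, 3); // [[0,1,2],[3,4,5],[6]]
```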
Consistency model UX checklist
Users forgive tiny delays. They do not forgive data loss.
- Show sync state (online, syncing, offline-local).
- Expose conflict-safe semantics for destructive actions (delete block, resolve comment).
- Make undo/redo model explicit: local undo stack vs global collaborative history.
- Keep presence ephemeral and cheap (don’t persist cursor jitter as document history).
Testing strategy (minimum viable seriousness)
Deterministic simulation
- N clients, randomized interleavings, partitions, reconnects.
- Assert convergence hash equality after quiescence.
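The convergence assertion can be sketched like this, with a commutative set-union "apply" standing in for a real CRDT or OT engine:

```typescript
// Deterministic simulation skeleton: apply the same op set to replicas in
// different orders, then compare a canonical fingerprint after quiescence.
function applyAll(ops: string[]): Set<string> {
  const state = new Set<string>();
  for (const op of ops) state.add(op);
  return state;
}

// Canonical fingerprint: sort before joining so the representation is
// stable regardless of internal iteration order. Use a real hash
// (e.g. SHA-256 of the canonical form) in production.
function convergenceHash(state: Set<string>): string {
  return [...state].sort().join("|");
}

const ops = ["ins:a", "ins:b", "del:c"];
const replica1 = convergenceHash(applyAll(ops));
const replica2 = convergenceHash(applyAll([...ops].reverse()));
// replica1 === replica2 after quiescence, or the test fails loudly.
```

The key detail is canonicalizing before hashing: two convergent replicas can still serialize internal state differently, and a naive hash would report false divergence.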
Jepsen-lite chaos for collaboration backend
- Drop/reorder/duplicate websocket frames in staging.
- Restart sync nodes during heavy concurrent editing.
- Verify no permanent divergence and bounded recovery time.
Property tests
- Idempotency: applying same op twice doesn’t corrupt state.
- Commutativity (where expected): op ordering under causal constraints converges.
- Monotonic clocks/version frontier behavior remains valid under retries.
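A property-test sketch for the idempotency item, using a max-merge apply function as the system under test (swap in your real engine):

```typescript
// Idempotency property: applying the same op twice must leave state
// byte-for-byte unchanged.
type State = Record<string, number>;

function applyOp(s: State, op: { id: string; counter: number }): State {
  return { ...s, [op.id]: Math.max(s[op.id] ?? 0, op.counter) };
}

function equalStates(a: State, b: State): boolean {
  return JSON.stringify(a) === JSON.stringify(b);
}

// Randomized check across many generated ops.
let idempotent = true;
for (let i = 0; i < 1000; i++) {
  const op = { id: String(i % 5), counter: Math.floor(Math.random() * 100) };
  const once = applyOp({}, op);
  const twice = applyOp(once, op);
  if (!equalStates(once, twice)) idempotent = false;
}
```

The same harness shape covers the commutativity property: generate causally independent op pairs, apply them in both orders, and compare states.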
Observability that prevents 3 a.m. incidents
Track these as first-class metrics:
- replication lag (p50/p95/p99)
- catch-up replay duration
- document divergence alarms (checksum mismatch sampling)
- op ingest reject rate (schema/version mismatch)
- snapshot load latency
- metadata growth rate per document (CRDT)
Alert on trends, not just static thresholds.
Migration patterns
OT → CRDT
- Dual-write operation envelopes into both engines in shadow mode.
- Compare periodic canonical render hashes.
- Cut over by cohort, keep rollback bridge for a full release cycle.
CRDT → OT (rare but possible)
- Usually motivated by strict centralization plus a low appetite for CRDT metadata overhead.
- Requires canonical linearization layer and careful semantic mapping.
Common failure modes
- Treating presence events as durable document events.
- Shipping without compaction/GC strategy.
- Assuming “eventual consistency” means “no UX design needed.”
- Ignoring versioned schemas for operation payloads.
- No anti-entropy protocol for missed ops.
Practical stack suggestions
- CRDT-heavy path: Yjs / Automerge + custom sync relay + snapshot store
- OT-heavy path: ShareDB-style architecture + strict central sequencer
Regardless of stack:
- define protocol versioning policy,
- design replay/backfill APIs early,
- and automate convergence tests in CI.
30-minute decision rubric
Score each axis from 1 to 5:
- Offline-first criticality
- Multi-region or P2P need
- Data shape complexity (text-only vs rich graph)
- Team familiarity with transform math (OT) vs metadata/GC (CRDT)
- Acceptable operational complexity
Interpretation:
- High offline + rich data + multi-region → CRDT bias
- Centralized text editor + mature server infra → OT bias
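As a toy illustration of the rubric, a scorer might look like this (the weights and the cutover threshold are illustrative assumptions, not calibrated values):

```typescript
// Toy decision scorer: higher scores on the first three axes bias toward
// CRDT; the familiarity gap nudges the result toward whichever model the
// team already understands.
type Scores = {
  offline: number;         // offline-first criticality, 1-5
  multiRegionP2P: number;  // multi-region or P2P need, 1-5
  dataShape: number;       // data shape complexity, 1-5
  otFamiliarity: number;   // comfort with transform math, 1-5
  crdtFamiliarity: number; // comfort with metadata/GC, 1-5
};

function recommend(s: Scores): "CRDT" | "OT" {
  const crdtBias =
    s.offline + s.multiRegionP2P + s.dataShape +
    (s.crdtFamiliarity - s.otFamiliarity);
  // Midpoint of the three core axes (3 * 3 = 9) as the cutover.
  return crdtBias > 9 ? "CRDT" : "OT";
}

const offlineRichApp = recommend({
  offline: 5, multiRegionP2P: 4, dataShape: 4,
  otFamiliarity: 2, crdtFamiliarity: 3,
}); // "CRDT"
const centralTextEditor = recommend({
  offline: 1, multiRegionP2P: 1, dataShape: 1,
  otFamiliarity: 5, crdtFamiliarity: 2,
}); // "OT"
```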
Pick one, then invest in testing + observability before adding fancy features.
That discipline matters more than framework choice.