Point-in-Time Columnar Research Store (Arrow + Parquet + DuckDB) Playbook
Date: 2026-03-07
Category: knowledge (quant data engineering / reproducible research)
Why this matters
Many quant teams lose edge not because models are weak, but because data truth is inconsistent across backtest, replay, and live monitoring.
Typical failure pattern:
- Data is fast but not point-in-time safe (lookahead leaks).
- Data is correct but too slow/costly to iterate.
- Teams optimize either speed or correctness, rarely both.
A practical fix is a columnar, point-in-time research store:
- Arrow in-memory layout for fast local compute,
- Parquet on disk for efficient compression + scans,
- DuckDB for deterministic SQL analytics and as-of joins.
The key is not the tools alone; it is the contract around time semantics and replayability.
Core design contract (non-negotiable)
For every record, carry at least:
- `event_time`: when the market/business event happened
- `ingest_time`: when your system first observed it
- `effective_from`, `effective_to`: validity window in your canonical view
- `source`, `source_seq`: provenance and ordering within the source
- `revision`: monotonic correction version for the same business key
If you store only event_time, you will eventually leak hindsight through late corrections.
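The contract can be pinned down as a minimal record type. A sketch, assuming nanosecond epoch integers and a hypothetical `OPEN_ENDED` sentinel for open validity windows (field names follow the list above; everything else is illustrative):

```python
from dataclasses import dataclass

OPEN_ENDED = 2**62  # hypothetical sentinel for an open validity window


@dataclass(frozen=True)
class PitRecord:
    event_time: int      # when the market/business event happened (epoch ns)
    ingest_time: int     # when our system first observed the record (epoch ns)
    effective_from: int  # validity window, half-open: [from, to)
    effective_to: int    # OPEN_ENDED means "still valid"
    source: str          # provenance: which feed/system produced it
    source_seq: int      # ordering within that source
    revision: int        # monotonic correction version per business key


r = PitRecord(event_time=100, ingest_time=105,
              effective_from=100, effective_to=OPEN_ENDED,
              source="venueA", source_seq=1, revision=0)
```

Making the record frozen matches the append-only discipline below: corrections become new records with a higher `revision`, never mutations.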
Table lanes: split by mutation pattern
Use separate physical lanes instead of one giant “everything” table.
1) Immutable event lane
Examples: trades, quotes, order acknowledgements, fills.
- Append-only
- No update in place
- Deduplicate by `(source, source_seq)` or a deterministic hash
2) Correctable reference lane
Examples: symbol metadata, corporate action adjustments, borrow flags, risk limits.
- Bitemporal validity (`effective_*` + `ingest_time`)
- Explicit correction records (never silent overwrite)
3) Derived feature lane
Examples: realized volatility windows, imbalance factors, toxicity metrics.
- Versioned by feature logic hash and input snapshot id
- Rebuildable from immutable + reference lanes
This separation keeps replay possible when feature logic changes.
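The derived lane's "versioned by feature logic hash and input snapshot id" can be made concrete with a deterministic id. A sketch; the logic string and snapshot-id format are assumptions, not a prescribed scheme:

```python
import hashlib


def feature_version_id(feature_logic: str, input_snapshot_id: str) -> str:
    """Same logic + same input snapshot => same id; change either => new id."""
    h = hashlib.sha256()
    h.update(feature_logic.encode("utf-8"))
    h.update(b"\x00")  # separator so ("ab","c") never collides with ("a","bc")
    h.update(input_snapshot_id.encode("utf-8"))
    return h.hexdigest()[:16]


v1 = feature_version_id("rv = log(p).diff().rolling(300).std()", "snap-2026-03-06")
v2 = feature_version_id("rv = log(p).diff().rolling(600).std()", "snap-2026-03-06")
assert v1 != v2  # changed window length produces a new feature version
```

Because the id is a pure function of logic and inputs, any stored feature can be checked for staleness by recomputing the id, which is what makes replay after logic changes tractable.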
Physical storage layout (Parquet)
Partitioning
Prefer low-cardinality first, high-cardinality later. Example:
`dataset=quotes/date=YYYY-MM-DD/hour=HH/symbol_bucket=NN/`
Avoid partitioning directly by full symbol if universe is large; you create tiny-file hell.
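The `symbol_bucket` partition key needs a hash that is stable across processes and runs. A sketch using CRC32 (the bucket count of 32 is an assumption to tune against universe size; Python's builtin `hash()` is deliberately avoided because it is salted per process):

```python
import zlib

NUM_BUCKETS = 32  # assumed; tune so each bucket gets healthy file sizes


def symbol_bucket(symbol: str, n: int = NUM_BUCKETS) -> int:
    """Stable bucket assignment: crc32 is deterministic across runs,
    unlike Python's salted builtin hash()."""
    return zlib.crc32(symbol.encode("utf-8")) % n


path = (f"dataset=quotes/date=2026-03-06/hour=09/"
        f"symbol_bucket={symbol_bucket('AAPL'):02d}/")
```

Queries for one symbol still prune to a single bucket partition, but the file count scales with `NUM_BUCKETS`, not with the size of the universe.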
File sizing
Target practical file sizes that are scan-friendly and compactable (commonly hundreds of MB scale).
For row groups, the Parquet documentation discusses large row groups (often 512 MB–1 GB in HDFS-oriented setups), but for local/dev and mixed query workloads you will often tune lower to balance memory use against parallelism.
Operational rule: choose one baseline and measure p95 query latency + memory pressure before “optimizing.”
Encoding defaults
- Dictionary encoding: venue codes, side, condition flags
- Delta/bit-pack friendly integer timestamps (ns/us precision as needed)
- Avoid float for money; store scaled integers (`price_nanos`, `notional_micros`)
- Compression: ZSTD by default for research archives, Snappy for very hot write paths
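The scaled-integer rule is easy to get wrong if the conversion itself passes through float. A sketch that parses the venue's decimal string exactly (the `price_nanos` scale is from the list above; the function names are illustrative):

```python
from decimal import Decimal

NANOS = 10**9  # scale factor for price_nanos


def to_price_nanos(text_price: str) -> int:
    """Parse the venue's decimal string exactly; never round-trip through
    float, which would reintroduce non-determinism."""
    return int(Decimal(text_price) * NANOS)


def from_price_nanos(nanos: int) -> Decimal:
    return Decimal(nanos) / NANOS


p = to_price_nanos("101.2345")
assert p == 101_234_500_000
assert from_price_nanos(p) == Decimal("101.2345")
```

The key point is that the string-to-integer path has exactly one rounding policy, applied once at ingest, instead of accumulating binary-float error through every downstream computation.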
In-memory layout (Arrow)
Use Arrow-native buffers in pipeline boundaries where possible:
- parser -> normalizer -> feature builder -> query engine
Benefits:
- vectorization-friendly scans
- low-copy interchange across language/runtime boundaries
- predictable column-wise memory behavior
But remember: Arrow optimizes analytic reads, not random row mutation.
Point-in-time query contract
Every research query must answer:
“What did we know as of decision timestamp T, not what we know now?”
Use an explicit query shape:
- select candidate events with `event_time <= T`
- constrain corrections by `ingest_time <= T`
- apply `effective_from <= T < effective_to`
- pick the latest valid revision by deterministic ordering
If any one of these constraints is skipped, the backtest is likely contaminated by lookahead optimism.
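The four constraints compose into one "known at T" selection. A pure-Python sketch of the semantics (in practice this would be a DuckDB query; the dict-based rows and `key`/`px` field names are illustrative):

```python
def known_at(records, t):
    """Apply the point-in-time constraints, then pick the latest valid
    revision per business key with a deterministic tie-break."""
    best = {}
    for r in records:
        if r["event_time"] > t:      # event has not happened yet at T
            continue
        if r["ingest_time"] > t:     # correction we had not yet observed at T
            continue
        if not (r["effective_from"] <= t < r["effective_to"]):
            continue
        key = r["key"]
        cur = best.get(key)
        # Deterministic ordering: highest revision, ties by (source, source_seq).
        rank = (r["revision"], r["source"], r["source_seq"])
        if cur is None or rank > (cur["revision"], cur["source"], cur["source_seq"]):
            best[key] = r
    return best


rows = [
    {"key": "X", "event_time": 10, "ingest_time": 11,
     "effective_from": 10, "effective_to": 10**9,
     "source": "a", "source_seq": 1, "revision": 0, "px": 100},
    # Correction ingested at t=50: must be invisible to a query as of t=20.
    {"key": "X", "event_time": 10, "ingest_time": 50,
     "effective_from": 10, "effective_to": 10**9,
     "source": "a", "source_seq": 2, "revision": 1, "px": 99},
]
assert known_at(rows, 20)["X"]["px"] == 100  # pre-correction view preserved
assert known_at(rows, 60)["X"]["px"] == 99   # correction visible after ingest
```

Note how dropping only the `ingest_time` filter would make the t=20 query return the corrected price, which is exactly the hindsight leak the contract exists to prevent.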
As-of joins (price/reference alignment)
As-of joins are essential when streams are not perfectly timestamp-aligned.
Practical guardrails:
- Require join key + monotonic time field
- Define tolerance windows (`max_staleness_ms`)
- Record the matched right-side timestamp for audit (`matched_ts`)
- Track the unmatched ratio; a rising unmatched ratio often means feed lag or schema drift
Treat as-of join outcomes as telemetry, not just SQL plumbing.
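DuckDB provides AsOf joins natively; the guardrails above can be illustrated with a pure-Python sketch of the same semantics, including the `matched_ts` audit field and the unmatched-ratio telemetry (timestamp units and variable names are illustrative):

```python
import bisect


def asof_join(left_ts, right_ts, right_vals, max_staleness):
    """For each left timestamp, match the most recent right row at or
    before it, but only within max_staleness; right_ts must be sorted."""
    out, unmatched = [], 0
    for t in left_ts:
        i = bisect.bisect_right(right_ts, t) - 1
        if i >= 0 and t - right_ts[i] <= max_staleness:
            out.append({"ts": t, "matched_ts": right_ts[i], "val": right_vals[i]})
        else:
            out.append({"ts": t, "matched_ts": None, "val": None})
            unmatched += 1
    return out, unmatched / len(left_ts)  # second value is telemetry


trades = [100, 105, 500]
quote_ts = [98, 103]
quote_mid = [10.0, 10.1]
rows, ratio = asof_join(trades, quote_ts, quote_mid, max_staleness=50)
assert rows[0]["matched_ts"] == 98 and rows[1]["matched_ts"] == 103
assert rows[2]["matched_ts"] is None  # quote is stale beyond tolerance
```

The returned ratio is the `asof_unmatched_ratio` KPI: here one of three trades had no fresh quote, and a rising value of that ratio in production is the feed-lag signal the guardrails call for.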
Late data and corrections policy
Define explicit states:
- `ON_TIME`
- `LATE_WITHIN_SLA`
- `LATE_BREACH`
- `CORRECTION`
For each state, define behavior:
- Recompute affected feature windows? (yes/no + depth)
- Trigger replay queue? (threshold-based)
- Block promotion of new models? (if correction blast radius high)
No policy = silent inconsistencies between yesterday’s backtest and today’s replay.
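Making the states executable forces the thresholds to be explicit. A sketch of the classification step; the `on_time_ms` grace threshold and millisecond units are assumptions, and the downstream recompute/replay/block decisions would key off the returned state:

```python
def classify_arrival(event_time_ms, ingest_time_ms, sla_ms,
                     on_time_ms=100, prior_revision_exists=False):
    """Map an arriving record to one of the explicit lateness states.
    on_time_ms is an assumed grace window for normal pipeline latency."""
    if prior_revision_exists:
        return "CORRECTION"  # a record for this key was already ingested
    lag = ingest_time_ms - event_time_ms
    if lag <= on_time_ms:
        return "ON_TIME"
    if lag <= sla_ms:
        return "LATE_WITHIN_SLA"
    return "LATE_BREACH"


assert classify_arrival(0, 50, sla_ms=1000) == "ON_TIME"
assert classify_arrival(0, 500, sla_ms=1000) == "LATE_WITHIN_SLA"
assert classify_arrival(0, 5000, sla_ms=1000) == "LATE_BREACH"
```

Because the state is a pure function of timestamps, yesterday's backtest and today's replay classify every record identically, which is precisely the consistency the policy is meant to guarantee.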
Reproducibility primitives
Attach immutable manifests to each experiment/run:
- input dataset ids (or content hashes)
- query text hash / DAG hash
- feature code commit
- engine/runtime versions
- timezone/calendar profile
If two runs with same manifest produce different outputs, treat as incident.
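A manifest is only useful if it hashes identically for identical inputs. A sketch that canonicalizes to sorted-key JSON before hashing (the field names mirror the list above; the 64-hex digest and sorting choices are implementation assumptions):

```python
import hashlib
import json


def run_manifest(dataset_ids, query_hash, feature_commit,
                 engine_versions, tz_profile):
    """Build an immutable run manifest and its content hash. Canonical
    JSON (sorted keys, fixed separators) makes the hash deterministic."""
    doc = {
        "dataset_ids": sorted(dataset_ids),  # order-insensitive input list
        "query_hash": query_hash,
        "feature_commit": feature_commit,
        "engine_versions": engine_versions,
        "tz_profile": tz_profile,
    }
    blob = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return doc, hashlib.sha256(blob.encode("utf-8")).hexdigest()


m1, h1 = run_manifest(["ds-a", "ds-b"], "q123", "abc9",
                      {"duckdb": "1.1.0"}, "UTC/XNYS")
m2, h2 = run_manifest(["ds-b", "ds-a"], "q123", "abc9",
                      {"duckdb": "1.1.0"}, "UTC/XNYS")
assert h1 == h2  # dataset order does not change the manifest identity
```

Storing `h1` alongside the run's outputs gives the incident rule its trigger: rerun the manifest, compare hashes and outputs, and any divergence is investigable rather than mysterious.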
Observability KPIs for the data layer
Track these continuously:
- `freshness_lag_p50/p95` (event -> ingest)
- `late_ratio` by dataset and venue
- `correction_rate` and correction depth
- `asof_unmatched_ratio`
- `small_file_ratio` and average file size
- `bytes_scanned_per_query` vs expected
- `replay_drift_rate` (recomputed vs stored feature mismatch)
Data platform quality should be measured like SRE quality.
Anti-patterns to avoid
Single timestamp truth
- looks simple but hides causality violations.
Overwrite-in-place reference tables
- destroys historical explainability.
Partition by high-cardinality first key
- creates a massive metadata/tiny-file tax.
Float prices in feature pipelines
- introduces subtle non-determinism.
No manifest for experiments
- “cannot reproduce” becomes normal.
Rollout plan (practical)
- Pick one dataset pair (e.g., top-of-book + trades) as pilot.
- Implement full time contract fields and immutable manifests.
- Build one canonical as-of query used by both backtest and notebook paths.
- Add correction replay for a bounded horizon (e.g., 5 days).
- Gate promotion on replay drift + unmatched ratio thresholds.
- Expand dataset-by-dataset only after stability.
Do not attempt full historical migration in one shot.
Quick readiness checklist
- Do records carry both `event_time` and `ingest_time`?
- Can we reconstruct "known at T" deterministically?
- Are corrections append-only and versioned?
- Are as-of unmatcheds and staleness monitored?
- Can we replay a prior run exactly from manifest?
References
- Apache Arrow documentation — columnar in-memory format and zero-copy design goals.
- Apache Parquet documentation — file-format configuration notes (row groups/pages).
- DuckDB documentation — AsOf join usage for temporal alignment.
One-line takeaway
A fast quant stack is nice, but a point-in-time-correct quant stack is compounding: Arrow + Parquet + DuckDB works only when time semantics are treated as a hard contract.