Point-in-Time Columnar Research Store (Arrow + Parquet + DuckDB) Playbook
Date: 2026-03-07
Category: knowledge (quant data engineering / reproducible research)
Why this matters
Many quant teams lose edge not because models are weak, but because data truth is inconsistent across backtest, replay, and live monitoring.
Typical failure pattern:
- Data is fast but not point-in-time safe (lookahead leaks).
- Data is correct but too slow/costly to iterate.
- Teams optimize either speed or correctness, rarely both.
A practical fix is a columnar, point-in-time research store:
- Arrow in-memory layout for fast local compute,
- Parquet on disk for efficient compression + scans,
- DuckDB for deterministic SQL analytics and as-of joins.
The key is not the tools alone; it is the contract around time semantics and replayability.
Core design contract (non-negotiable)
For every record, carry at least:
- `event_time`: when the market/business event happened
- `ingest_time`: when your system first observed it
- `effective_from`, `effective_to`: validity window in your canonical view
- `source`, `source_seq`: provenance and ordering within the source
- `revision`: monotonic correction version for the same business key
If you store only event_time, you will eventually leak hindsight through late corrections.
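The contract can be pinned down as a minimal record type. A sketch, assuming nanosecond epoch integers and a hypothetical `OPEN_ENDED` sentinel for open validity windows (field names follow the list above; everything else is illustrative):

```python
from dataclasses import dataclass

OPEN_ENDED = 2**62  # hypothetical sentinel for an open validity window


@dataclass(frozen=True)
class PitRecord:
    event_time: int      # when the market/business event happened (epoch ns)
    ingest_time: int     # when our system first observed the record (epoch ns)
    effective_from: int  # validity window, half-open: [from, to)
    effective_to: int    # OPEN_ENDED means "still valid"
    source: str          # provenance: which feed/system produced it
    source_seq: int      # ordering within that source
    revision: int        # monotonic correction version per business key


r = PitRecord(event_time=100, ingest_time=105,
              effective_from=100, effective_to=OPEN_ENDED,
              source="venueA", source_seq=1, revision=0)
```

Making the record frozen matches the append-only discipline below: corrections become new records with a higher `revision`, never mutations.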
Table lanes: split by mutation pattern
Use separate physical lanes instead of one giant “everything” table.
1) Immutable event lane
Examples: trades, quotes, order acknowledgements, fills.
- Append-only
- No update in place
- Deduplicate by `(source, source_seq)` or a deterministic hash
2) Correctable reference lane
Examples: symbol metadata, corporate action adjustments, borrow flags, risk limits.
- Bitemporal validity (`effective_*` + `ingest_time`)
- Explicit correction records (never silent overwrite)
3) Derived feature lane
Examples: realized volatility windows, imbalance factors, toxicity metrics.
- Versioned by feature logic hash and input snapshot id
- Rebuildable from immutable + reference lanes
This separation keeps replay possible when feature logic changes.
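The derived lane's "versioned by feature logic hash and input snapshot id" can be made concrete with a deterministic id. A sketch; the logic string and snapshot-id format are assumptions, not a prescribed scheme:

```python
import hashlib


def feature_version_id(feature_logic: str, input_snapshot_id: str) -> str:
    """Same logic + same input snapshot => same id; change either => new id."""
    h = hashlib.sha256()
    h.update(feature_logic.encode("utf-8"))
    h.update(b"\x00")  # separator so ("ab","c") never collides with ("a","bc")
    h.update(input_snapshot_id.encode("utf-8"))
    return h.hexdigest()[:16]


v1 = feature_version_id("rv = log(p).diff().rolling(300).std()", "snap-2026-03-06")
v2 = feature_version_id("rv = log(p).diff().rolling(600).std()", "snap-2026-03-06")
assert v1 != v2  # changed window length produces a new feature version
```

Because the id is a pure function of logic and inputs, any stored feature can be checked for staleness by recomputing the id, which is what makes replay after logic changes tractable.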
Physical storage layout (Parquet)
Partitioning
Prefer low-cardinality first, high-cardinality later. Example:
`dataset=quotes/date=YYYY-MM-DD/hour=HH/symbol_bucket=NN/`
Avoid partitioning directly by full symbol if universe is large; you create tiny-file hell.
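The `symbol_bucket` partition key needs a hash that is stable across processes and runs. A sketch using CRC32 (the bucket count of 32 is an assumption to tune against universe size; Python's builtin `hash()` is deliberately avoided because it is salted per process):

```python
import zlib

NUM_BUCKETS = 32  # assumed; tune so each bucket gets healthy file sizes


def symbol_bucket(symbol: str, n: int = NUM_BUCKETS) -> int:
    """Stable bucket assignment: crc32 is deterministic across runs,
    unlike Python's salted builtin hash()."""
    return zlib.crc32(symbol.encode("utf-8")) % n


path = (f"dataset=quotes/date=2026-03-06/hour=09/"
        f"symbol_bucket={symbol_bucket('AAPL'):02d}/")
```

Queries for one symbol still prune to a single bucket partition, but the file count scales with `NUM_BUCKETS`, not with the size of the universe.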
File sizing
Target practical file sizes that are scan-friendly and compactable (commonly hundreds of MB scale).
For row groups, the Parquet documentation discusses large row groups (often 512 MB–1 GB in HDFS-oriented setups), but for local/dev and mixed query workloads you will often tune lower to balance memory use against parallelism.
Operational rule: choose one baseline and measure p95 query latency + memory pressure before “optimizing.”
Encoding defaults
- Dictionary encoding: venue codes, side, condition flags
- Delta/bit-pack friendly integer timestamps (ns/us precision as needed)
- Avoid float for money; store scaled integers (`price_nanos`, `notional_micros`)
- Compression: ZSTD by default for research archives, Snappy for very hot write paths
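The scaled-integer rule is easy to get wrong if the conversion itself passes through float. A sketch that parses the venue's decimal string exactly (the `price_nanos` scale is from the list above; the function names are illustrative):

```python
from decimal import Decimal

NANOS = 10**9  # scale factor for price_nanos


def to_price_nanos(text_price: str) -> int:
    """Parse the venue's decimal string exactly; never round-trip through
    float, which would reintroduce non-determinism."""
    return int(Decimal(text_price) * NANOS)


def from_price_nanos(nanos: int) -> Decimal:
    return Decimal(nanos) / NANOS


p = to_price_nanos("101.2345")
assert p == 101_234_500_000
assert from_price_nanos(p) == Decimal("101.2345")
```

The key point is that the string-to-integer path has exactly one rounding policy, applied once at ingest, instead of accumulating binary-float error through every downstream computation.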
In-memory layout (Arrow)
Use Arrow-native buffers in pipeline boundaries where possible:
- parser -> normalizer -> feature builder -> query engine
Benefits:
- vectorization-friendly scans
- low-copy interchange across language/runtime boundaries
- predictable column-wise memory behavior
But remember: Arrow optimizes analytic reads, not random row mutation.
Point-in-time query contract
Every research query must answer:
“What did we know as of decision timestamp T, not what we know now?”
Use an explicit query shape:
- select candidate events with `event_time <= T`
- constrain corrections by `ingest_time <= T`
- apply `effective_from <= T < effective_to`
- pick the latest valid revision by deterministic ordering
If any one of these constraints is skipped, the backtest is likely contaminated by lookahead optimism.
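The four constraints compose into one "known at T" selection. A pure-Python sketch of the semantics (in practice this would be a DuckDB query; the dict-based rows and `key`/`px` field names are illustrative):

```python
def known_at(records, t):
    """Apply the point-in-time constraints, then pick the latest valid
    revision per business key with a deterministic tie-break."""
    best = {}
    for r in records:
        if r["event_time"] > t:      # event has not happened yet at T
            continue
        if r["ingest_time"] > t:     # correction we had not yet observed at T
            continue
        if not (r["effective_from"] <= t < r["effective_to"]):
            continue
        key = r["key"]
        cur = best.get(key)
        # Deterministic ordering: highest revision, ties by (source, source_seq).
        rank = (r["revision"], r["source"], r["source_seq"])
        if cur is None or rank > (cur["revision"], cur["source"], cur["source_seq"]):
            best[key] = r
    return best


rows = [
    {"key": "X", "event_time": 10, "ingest_time": 11,
     "effective_from": 10, "effective_to": 10**9,
     "source": "a", "source_seq": 1, "revision": 0, "px": 100},
    # Correction ingested at t=50: must be invisible to a query as of t=20.
    {"key": "X", "event_time": 10, "ingest_time": 50,
     "effective_from": 10, "effective_to": 10**9,
     "source": "a", "source_seq": 2, "revision": 1, "px": 99},
]
assert known_at(rows, 20)["X"]["px"] == 100  # pre-correction view preserved
assert known_at(rows, 60)["X"]["px"] == 99   # correction visible after ingest
```

Note how dropping only the `ingest_time` filter would make the t=20 query return the corrected price, which is exactly the hindsight leak the contract exists to prevent.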
As-of joins (price/reference alignment)
As-of joins are essential when streams are not perfectly timestamp-aligned.
Practical guardrails:
- Require join key + monotonic time field
- Define tolerance windows (`max_staleness_ms`)
- Record the matched right-side timestamp for audit (`matched_ts`)
- Track the unmatched ratio; a rising unmatched ratio often means feed lag or schema drift
Treat as-of join outcomes as telemetry, not just SQL plumbing.
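DuckDB provides AsOf joins natively; the guardrails above can be illustrated with a pure-Python sketch of the same semantics, including the `matched_ts` audit field and the unmatched-ratio telemetry (timestamp units and variable names are illustrative):

```python
import bisect


def asof_join(left_ts, right_ts, right_vals, max_staleness):
    """For each left timestamp, match the most recent right row at or
    before it, but only within max_staleness; right_ts must be sorted."""
    out, unmatched = [], 0
    for t in left_ts:
        i = bisect.bisect_right(right_ts, t) - 1
        if i >= 0 and t - right_ts[i] <= max_staleness:
            out.append({"ts": t, "matched_ts": right_ts[i], "val": right_vals[i]})
        else:
            out.append({"ts": t, "matched_ts": None, "val": None})
            unmatched += 1
    return out, unmatched / len(left_ts)  # second value is telemetry


trades = [100, 105, 500]
quote_ts = [98, 103]
quote_mid = [10.0, 10.1]
rows, ratio = asof_join(trades, quote_ts, quote_mid, max_staleness=50)
assert rows[0]["matched_ts"] == 98 and rows[1]["matched_ts"] == 103
assert rows[2]["matched_ts"] is None  # quote is stale beyond tolerance
```

The returned ratio is the `asof_unmatched_ratio` KPI: here one of three trades had no fresh quote, and a rising value of that ratio in production is the feed-lag signal the guardrails call for.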
Late data and corrections policy
Define explicit states:
- `ON_TIME`
- `LATE_WITHIN_SLA`
- `LATE_BREACH`
- `CORRECTION`
For each state, define behavior:
- Recompute affected feature windows? (yes/no + depth)
- Trigger replay queue? (threshold-based)
- Block promotion of new models? (if correction blast radius high)
No policy = silent inconsistencies between yesterday’s backtest and today’s replay.
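Making the states executable forces the thresholds to be explicit. A sketch of the classification step; the `on_time_ms` grace threshold and millisecond units are assumptions, and the downstream recompute/replay/block decisions would key off the returned state:

```python
def classify_arrival(event_time_ms, ingest_time_ms, sla_ms,
                     on_time_ms=100, prior_revision_exists=False):
    """Map an arriving record to one of the explicit lateness states.
    on_time_ms is an assumed grace window for normal pipeline latency."""
    if prior_revision_exists:
        return "CORRECTION"  # a record for this key was already ingested
    lag = ingest_time_ms - event_time_ms
    if lag <= on_time_ms:
        return "ON_TIME"
    if lag <= sla_ms:
        return "LATE_WITHIN_SLA"
    return "LATE_BREACH"


assert classify_arrival(0, 50, sla_ms=1000) == "ON_TIME"
assert classify_arrival(0, 500, sla_ms=1000) == "LATE_WITHIN_SLA"
assert classify_arrival(0, 5000, sla_ms=1000) == "LATE_BREACH"
```

Because the state is a pure function of timestamps, yesterday's backtest and today's replay classify every record identically, which is precisely the consistency the policy is meant to guarantee.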
Reproducibility primitives
Attach immutable manifests to each experiment/run:
- input dataset ids (or content hashes)
- query text hash / DAG hash
- feature code commit
- engine/runtime versions
- timezone/calendar profile
If two runs with same manifest produce different outputs, treat as incident.
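A manifest is only useful if it hashes identically for identical inputs. A sketch that canonicalizes to sorted-key JSON before hashing (the field names mirror the list above; the 64-hex digest and sorting choices are implementation assumptions):

```python
import hashlib
import json


def run_manifest(dataset_ids, query_hash, feature_commit,
                 engine_versions, tz_profile):
    """Build an immutable run manifest and its content hash. Canonical
    JSON (sorted keys, fixed separators) makes the hash deterministic."""
    doc = {
        "dataset_ids": sorted(dataset_ids),  # order-insensitive input list
        "query_hash": query_hash,
        "feature_commit": feature_commit,
        "engine_versions": engine_versions,
        "tz_profile": tz_profile,
    }
    blob = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return doc, hashlib.sha256(blob.encode("utf-8")).hexdigest()


m1, h1 = run_manifest(["ds-a", "ds-b"], "q123", "abc9",
                      {"duckdb": "1.1.0"}, "UTC/XNYS")
m2, h2 = run_manifest(["ds-b", "ds-a"], "q123", "abc9",
                      {"duckdb": "1.1.0"}, "UTC/XNYS")
assert h1 == h2  # dataset order does not change the manifest identity
```

Storing `h1` alongside the run's outputs gives the incident rule its trigger: rerun the manifest, compare hashes and outputs, and any divergence is investigable rather than mysterious.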
Observability KPIs for the data layer
Track these continuously:
- `freshness_lag_p50/p95` (event -> ingest)
- `late_ratio` by dataset and venue
- `correction_rate` and correction depth
- `asof_unmatched_ratio`
- `small_file_ratio` and average file size
- `bytes_scanned_per_query` vs expected
- `replay_drift_rate` (recomputed vs stored feature mismatch)
Data platform quality should be measured like SRE quality.
Anti-patterns to avoid
Single timestamp truth
- looks simple but hides causality violations.
Overwrite-in-place reference tables
- destroys historical explainability.
Partition by high-cardinality first key
- creates a massive metadata/tiny-file tax.
Float prices in feature pipelines
- introduces subtle non-determinism.
No manifest for experiments
- “cannot reproduce” becomes normal.
Rollout plan (practical)
- Pick one dataset pair (e.g., top-of-book + trades) as pilot.
- Implement full time contract fields and immutable manifests.
- Build one canonical as-of query used by both backtest and notebook paths.
- Add correction replay for a bounded horizon (e.g., 5 days).
- Gate promotion on replay drift + unmatched ratio thresholds.
- Expand dataset-by-dataset only after stability.
Do not attempt full historical migration in one shot.
Quick readiness checklist
- Do records carry both `event_time` and `ingest_time`?
- Can we reconstruct "known at T" deterministically?
- Are corrections append-only and versioned?
- Are as-of unmatcheds and staleness monitored?
- Can we replay a prior run exactly from manifest?
References
- Apache Arrow documentation — columnar in-memory format and zero-copy design goals.
- Apache Parquet documentation — file-format configuration notes (row groups/pages).
- DuckDB documentation — AsOf join usage for temporal alignment.
One-line takeaway
A fast quant stack is nice, but a point-in-time-correct quant stack is compounding: Arrow + Parquet + DuckDB works only when time semantics are treated as a hard contract.