Point-in-Time Columnar Research Store (Arrow + Parquet + DuckDB) Playbook

2026-03-07 · software

Category: knowledge (quant data engineering / reproducible research)

Why this matters

Many quant teams lose edge not because models are weak, but because data truth is inconsistent across backtest, replay, and live monitoring.

Typical failure pattern:

  1. Data is fast but not point-in-time safe (lookahead leaks).
  2. Data is correct but too slow/costly to iterate.
  3. Teams optimize either speed or correctness, rarely both.

A practical fix is a columnar, point-in-time research store: Parquet for durable storage, Arrow for in-memory interchange, and DuckDB as the query engine.

The key is not the tools alone; it is the contract around time semantics and replayability.


Core design contract (non-negotiable)

For every record, carry at least:

  • event_time: when the event occurred in the market
  • ingest_time: when your systems first learned of it
  • effective_from / effective_to: the validity interval of the record
  • revision: a deterministic ordering key for corrections

If you store only event_time, you will eventually leak hindsight through late corrections.


Table lanes: split by mutation pattern

Use separate physical lanes instead of one giant “everything” table.

1) Immutable event lane

Append-only; rows are written once and never updated.
Examples: trades, quotes, order acknowledgements, fills.

2) Correctable reference lane

Changes arrive as new revisions with validity intervals; nothing is overwritten in place.
Examples: symbol metadata, corporate action adjustments, borrow flags, risk limits.

3) Derived feature lane

Fully recomputable from the other lanes; versioned by feature logic.
Examples: realized volatility windows, imbalance factors, toxicity metrics.

This separation keeps replay possible when feature logic changes.


Physical storage layout (Parquet)

Partitioning

Prefer low-cardinality keys first, higher-cardinality later. Example (illustrative keys):

  date=2026-03-06/venue=XNAS/part-0.parquet

Avoid partitioning directly by full symbol if universe is large; you create tiny-file hell.

File sizing

Target practical file sizes that are scan-friendly and compactable (commonly hundreds of MB scale).
For row groups, Parquet docs discuss larger row groups (often 512MB–1GB in HDFS-oriented setups), but for local/dev + mixed query workloads you often tune lower for memory/parallelism balance.

Operational rule: choose one baseline and measure p95 query latency + memory pressure before “optimizing.”

Encoding defaults

  • dictionary-encode low-cardinality string columns (symbols, venues)
  • store timestamps in UTC with an explicit unit (e.g. microseconds)
  • store prices as integer ticks or decimals, never raw floats
  • pick one compression codec (e.g. ZSTD or Snappy) and keep it uniform


In-memory layout (Arrow)

Use Arrow-native buffers at pipeline boundaries where possible: reader output, feature computation input, and the hand-off into the query engine.

Benefits:

  • zero-copy hand-off between tools that speak Arrow
  • vectorized, cache-friendly columnar compute
  • one memory layout shared by backtest and live paths

But remember: Arrow optimizes analytic reads, not random row mutation.


Point-in-time query contract

Every research query must answer:

“What did we know as of decision timestamp T, not what we know now?”

Use an explicit query shape:

  1. select candidate events with event_time <= T
  2. constrain corrections by ingest_time <= T
  3. apply effective_from <= T < effective_to
  4. pick latest valid revision by deterministic ordering

If any one of these steps is skipped, the backtest is likely contaminated with hindsight optimism.


As-of joins (price/reference alignment)

As-of joins are essential when streams are not perfectly timestamp-aligned.

Practical guardrails:

  • only join backward in time (latest record at or before the decision timestamp)
  • bound staleness with a maximum lookback window and flag matches beyond it
  • track the unmatched ratio and alert when it drifts

Treat as-of join outcomes as telemetry, not just SQL plumbing.


Late data and corrections policy

Define explicit states:

  • on-time: arrived within normal latency
  • late within replay horizon: arrived late, but inside the bounded correction window
  • late beyond horizon: arrived after the window closed
  • restatement: a correction to a previously ingested record

For each state, define behavior:

  • on-time: append normally
  • late within horizon: append, then reprocess the affected window
  • late beyond horizon: quarantine and alert; never merge silently
  • restatement: write a new revision; never overwrite in place

No policy = silent inconsistencies between yesterday’s backtest and today’s replay.


Reproducibility primitives

Attach immutable manifests to each experiment/run:

  • exact input file list (or dataset snapshot ID)
  • as-of decision timestamp T
  • code and query versions
  • parameter values, plus a content hash over all of the above

If two runs with the same manifest produce different outputs, treat it as an incident.


Observability KPIs for the data layer

Track these continuously:

  • ingest-to-available latency
  • late-arrival rate per dataset
  • as-of join unmatched ratio
  • replay drift between backtest and re-run
  • p95 query latency and tiny-file count per partition

Data platform quality should be measured like SRE quality.


Anti-patterns to avoid

  1. Single timestamp truth

    • looks simple, hides causality violations.
  2. Overwrite-in-place reference tables

    • destroys historical explainability.
  3. Partition by high-cardinality first key

    • creates massive metadata/tiny-file tax.
  4. Float prices in feature pipelines

    • introduces subtle non-determinism.
  5. No manifest for experiments

    • “cannot reproduce” becomes normal.

Rollout plan (practical)

  1. Pick one dataset pair (e.g., top-of-book + trades) as pilot.
  2. Implement full time contract fields and immutable manifests.
  3. Build one canonical as-of query used by both backtest and notebook paths.
  4. Add correction replay for a bounded horizon (e.g., 5 days).
  5. Gate promotion on replay drift + unmatched ratio thresholds.
  6. Expand dataset-by-dataset only after stability.

Do not attempt full historical migration in one shot.


Quick readiness checklist

  • every record carries event_time, ingest_time, effective_from/effective_to, revision
  • lanes are physically split by mutation pattern
  • one canonical as-of query shared by backtest and notebook paths
  • late-data and correction policy written down and enforced
  • immutable manifests attached to every run
  • data-layer KPIs tracked continuously with alerts



One-line takeaway

A fast quant stack is nice, but a point-in-time-correct quant stack is compounding: Arrow + Parquet + DuckDB works only when time semantics are treated as a hard contract.