ADR 0027: Load Avro/ORC + Extract Avro (G1)
- Status: Accepted
Context
The v1.0 competitor-parity workstream G1 closes three rows of the
goccy bigquery-emulator feature matrix that bqemulator missed before
this session:
| Gap | Before | After |
|---|---|---|
| G-5 Load format: Avro | ❌ UnsupportedFeatureError at the executor | ✅ |
| G-6 Load format: ORC | ❌ UnsupportedFeatureError at the executor | ✅ |
| G-7 Extract format: Avro | ❌ InvalidQueryError("Unknown destination format") | ✅ |
The motivating use case is Java BigQuery clients (Hadoop / Hive / Trino
/ Presto migrations) that default to Avro as a schema-preserving wire
format and to ORC for the legacy Hadoop ecosystem. Before G1, a Java
integration test that loaded from an .avro URI surfaced
UnsupportedFeatureError and the test author had to hand-convert
to Parquet. After G1, the same code runs unchanged.
The constraints to satisfy:
- Boot performance + offline tolerance: engine startup must not
network-fetch a DuckDB extension every time, and must not crash
when the extension repository is unreachable (constrained
deployments, air-gapped CI). Contrast with the spatial extension
(it has no numbered ADR; see the engine code at
engine.py:_load_spatial), which is a hard fail because GEOGRAPHY
semantics cannot work without it.
- Coverage: every new branch hits the ≥90% line+branch coverage gate.
- Conformance shape: recorded against real BigQuery via the existing
HTTP corpus framework (P2.f). Hand-authored baselines are explicitly
forbidden by ADR 0022 §2.
- Four-language E2E: Python / Node / Go / Java suites all exercise the
load + extract surfaces against a fresh container (AGENTS.md
non-negotiable; Java gets a third test, covering ORC, because ORC is
most common in the Java/Hadoop ecosystem).
Decisions
1. Avro: DuckDB's avro extension for both read and write
DuckDB 1.5+ ships an avro extension at
http://extensions.duckdb.org/v1.5.2/<platform>/avro.duckdb_extension.gz
that provides:
- SELECT * FROM read_avro('<path>') for reads, and
- COPY (<query>) TO '<path>' (FORMAT AVRO) for writes.
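A minimal sketch of how the two statements above could be assembled in Python. The helper names (quote_sql_literal, avro_read_sql, avro_write_sql) are hypothetical and not taken from the bqemulator source; only the statement shapes come from this ADR.

```python
def quote_sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def avro_read_sql(path: str) -> str:
    """Build the read statement described above for a local Avro file."""
    return f"SELECT * FROM read_avro({quote_sql_literal(path)})"


def avro_write_sql(query: str, path: str) -> str:
    """Build the write statement described above for an extract query."""
    return f"COPY ({query}) TO {quote_sql_literal(path)} (FORMAT AVRO)"
```

Quoting the path as a literal keeps a file name that contains a single quote from breaking the statement. The query text itself is passed through unchanged, so this sketch assumes it is already trusted.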
The extension was verified to support both read and write of
records-of-records (BigQuery STRUCT round-trips), Avro logical types
(decimal, date, timestamp-millis, timestamp-micros), and nullable
unions (["null", "<T>"]).
Both the load and extract executor branches use the same DuckDB
function family, so a single INSTALL avro; LOAD avro at engine boot
covers them. The install is best-effort: a failure logs a warning and
boot continues. The at-query SELECT or COPY then either succeeds
(DuckDB's runtime autoload picks up the extension if it later becomes
available) or fails with a clear "Catalog Error: ... is not in the
catalog, but it exists in the avro extension" envelope, which the
executor's _is_missing_extension_error classifier translates to
UnsupportedFeatureError for the client.
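The best-effort boot contract can be sketched as follows. This is an illustration under assumptions, not the engine's actual code: load_avro_best_effort and the logger name are hypothetical, and conn is any object with a DuckDB-style execute(sql) method.

```python
import logging

# Hypothetical logger name, chosen for the example only.
logger = logging.getLogger("bqemulator.engine")


def load_avro_best_effort(conn) -> bool:
    """Try INSTALL avro; LOAD avro. Return True on success, False on failure.

    A failure is logged and swallowed so that engine boot continues even
    when the extension repository is unreachable.
    """
    try:
        conn.execute("INSTALL avro")
        conn.execute("LOAD avro")
    except Exception as exc:  # deliberately broad: boot must not crash here
        logger.warning("avro extension unavailable, continuing: %s", exc)
        return False
    return True
```

Returning a boolean rather than raising keeps the caller free to record the outcome while still treating the failure as non-fatal, which is the opposite of the hard-fail behaviour described for the spatial extension.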
2. ORC: Python pyorc package via Arrow bridge
The DuckDB community ORC extension is not packaged for darwin-arm64 in the 1.5.2 release (404 at the extension repository URL). The choices were:
- Option A: Wait for upstream packaging. Blocks the G-6 row indefinitely; v1.0 has shipped on the v1-confidence-plan timeline.
- Option B (chosen): Route ORC reads through the pyorc package (Python
bindings for the Apache ORC library, installed via the new optional
[orc] extra). The reader in src/bqemulator/jobs/orc_reader.py parses
an ORC file into a pyarrow.Table and the executor inserts via DuckDB's
register(...) + INSERT INTO ... SELECT * FROM <view>.
- Option C: Hand-build an ORC reader. Multi-week scope; ORC's
on-disk format (stripes, footer, postscript, compression codecs)
is fully specified but rebuilding it is unjustified when
pyorc exists and is maintained.
Option B is bounded scope (~150 LoC), depends on a maintained
library, and degrades cleanly when the optional extra isn't
installed (UnsupportedFeatureError with actionable remediation in
the error message).
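A rough sketch of the Option B flow, assuming pyorc's Reader API and DuckDB's Python register/unregister calls. It is not the contents of orc_reader.py: the function names, the staging view name, the error text, and the bqemulator[orc] install hint are all assumptions, and UnsupportedFeatureError here is a local stand-in for the emulator's error type. Column types are left to pyarrow's inference, which a real reader would likely replace with an explicit schema mapping.

```python
import importlib


class UnsupportedFeatureError(Exception):
    """Local stand-in for the emulator's error type of the same name."""


def orc_to_arrow(path, import_module=importlib.import_module):
    """Read an ORC file into a pyarrow.Table via pyorc."""
    try:
        pyorc = import_module("pyorc")
    except ImportError as exc:
        # Degrade cleanly when the optional extra is not installed.
        raise UnsupportedFeatureError(
            "ORC load needs the optional [orc] extra; "
            "try: pip install 'bqemulator[orc]'"  # assumed package name
        ) from exc
    pa = import_module("pyarrow")

    with open(path, "rb") as fileobj:
        reader = pyorc.Reader(fileobj)
        names = list(reader.schema.fields)  # top-level struct field names
        rows = reader.read()  # list of tuples, one per row

    # Transpose rows into columns; an empty file yields empty columns.
    columns = list(zip(*rows)) if rows else [() for _ in names]
    return pa.table({name: list(col) for name, col in zip(names, columns)})


def insert_arrow(conn, table_name, arrow_table, view_name="_orc_staging"):
    """Expose the Arrow table as a view, insert from it, then clean up."""
    conn.register(view_name, arrow_table)
    try:
        conn.execute(f"INSERT INTO {table_name} SELECT * FROM {view_name}")
    finally:
        conn.unregister(view_name)
```

The import_module parameter exists only so the missing-dependency branch can be exercised without uninstalling pyorc.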
3. ORC writes are NOT supported (out-of-scope)
BigQuery itself does not support ORC as a destination extract format (BigQuery extract docs: formats list AVRO, CSV, JSON, PARQUET only). Adding ORC write would put the emulator ahead of BigQuery on a surface where parity matters — a user who extracts to ORC against the emulator and then tries to repeat the workflow on the real service would get a surprising failure.
We therefore pin ORC extract as out-of-scope in
docs/reference/out-of-scope.md
with the rationale above. The workaround for users who genuinely
need ORC output is to extract to Parquet and run a downstream
conversion via pyorc or pyarrow.
4. Configuration flag: enable_format_extensions
Defaults to True. When False, the engine boot skips the
INSTALL avro; LOAD avro calls entirely — useful in constrained
deployments that cannot reach extensions.duckdb.org. The ORC
path is unaffected because it uses pyorc, not a DuckDB
extension. Exposed as BQEMU_ENABLE_FORMAT_EXTENSIONS env var via
the existing Settings configuration surface.
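One way the flag could gate the boot statements is sketched below. The real Settings surface is not shown in this ADR, so the parsing rules here (which strings count as false) and both function names are assumptions made for illustration; only the variable name and the default of True come from the text above.

```python
import os

# Assumed set of strings treated as "disabled"; the real parser may differ.
_FALSE_VALUES = {"0", "false", "no", "off"}


def format_extensions_enabled(environ=None) -> bool:
    """Read BQEMU_ENABLE_FORMAT_EXTENSIONS, defaulting to True when unset."""
    environ = os.environ if environ is None else environ
    raw = environ.get("BQEMU_ENABLE_FORMAT_EXTENSIONS")
    if raw is None:
        return True
    return raw.strip().lower() not in _FALSE_VALUES


def boot_statements(enabled: bool) -> list[str]:
    """Statements to run at engine boot; empty when the flag is off."""
    return ["INSTALL avro", "LOAD avro"] if enabled else []
```

With the flag off, no statement ever reaches DuckDB at boot, so an air-gapped deployment never attempts a network fetch.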
Consequences
Capability matrix shifts
| Surface | Before G1 | After G1 |
|---|---|---|
| Load CSV | ✅ | ✅ |
| Load NEWLINE_DELIMITED_JSON | ✅ | ✅ |
| Load PARQUET | ✅ | ✅ |
| Load AVRO | ❌ | ✅ |
| Load ORC | ❌ | ✅ |
| Extract CSV | ✅ | ✅ |
| Extract NEWLINE_DELIMITED_JSON | ✅ | ✅ |
| Extract PARQUET | ✅ | ✅ |
| Extract AVRO | ❌ | ✅ |
| Extract ORC | ❌ (matches BigQuery) | ❌ (matches BigQuery) |
After G1, bqemulator matches BigQuery on every load/extract format in the table above and is a strict superset of goccy on the same axis: ahead of goccy on Parquet extract, Avro/ORC load, and Avro extract.
Coverage + test surface
- 4 new unit tests in tests/unit/storage/test_engine_format_extensions.py for the engine flag + best-effort load contract.
- 5 new unit tests in tests/unit/jobs/test_executor.py for the _is_missing_extension_error classifier.
- 12 new integration tests in tests/integration/test_load_avro_orc.py covering Avro/ORC load (basic, nested, missing-file, round-trip), Avro extract, the ORC reader unit tests, and the Settings flag.
- 8 new conformance fixtures under tests/conformance/http_corpus/jobs/ (recorded against real BigQuery; the recorder runs as an operator-side step because the fixture-recording flow needs the operator's BQ ADC + GCS access).
- 9 new four-language E2E tests (2 × Python/Node/Go + 3 × Java including ORC).
Optional dependencies
A new [orc] extra in pyproject.toml
gates the pyorc dependency. The [all] umbrella extra picks it
up automatically; existing user installs of bqemulator[avro] keep
working without ORC support.
Avro logical-type coverage
The recorded conformance fixture set covers these Avro logical
types: decimal, date, timestamp-millis, timestamp-micros.
Logical types BigQuery itself doesn't surface (uuid,
local-timestamp-*, time-millis/time-micros at full precision)
are out of scope by definition — BigQuery's load path either
coerces to a supported type or rejects them, and the emulator
matches that surface.
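For orientation, the four covered logical types and a nullable union look like this in an Avro record schema. The record and field names are invented for the example and are not taken from the recorded fixtures; the base types under each logical type follow the Avro specification.

```python
import json

# Hypothetical record; only the logicalType values come from this ADR.
EXAMPLE_SCHEMA = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "amount",
         "type": {"type": "bytes", "logicalType": "decimal",
                  "precision": 38, "scale": 9}},
        {"name": "order_date",
         "type": {"type": "int", "logicalType": "date"}},
        {"name": "created_ms",
         "type": {"type": "long", "logicalType": "timestamp-millis"}},
        # Nullable union: ["null", <T>] with a null default.
        {"name": "updated_us",
         "type": ["null", {"type": "long", "logicalType": "timestamp-micros"}],
         "default": None},
    ],
}


def logical_types(schema: dict) -> list[str]:
    """Collect logicalType names from top-level fields, looking inside unions."""
    found = []
    for field in schema["fields"]:
        ftype = field["type"]
        branches = ftype if isinstance(ftype, list) else [ftype]
        for branch in branches:
            if isinstance(branch, dict) and "logicalType" in branch:
                found.append(branch["logicalType"])
    return found


SCHEMA_JSON = json.dumps(EXAMPLE_SCHEMA)  # the form an .avro header carries
```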
Error envelope
Two new classifier branches in the executor (_is_missing_extension_error
for Avro load + Avro extract) plus a new helper module
(src/bqemulator/jobs/orc_reader.py). Both branches preserve the
existing error_mapper chain — missing-file / schema-mismatch errors
flow through the standard BigQuery-shape translator unchanged.
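The classifier's role can be illustrated with a small sketch. The actual _is_missing_extension_error implementation is not shown in this ADR; the substring match below is an assumption built from the "is not in the catalog, but it exists in the ... extension" fragment quoted in Decision 1, and UnsupportedFeatureError is a local stand-in. The translate helper and its message text are hypothetical.

```python
class UnsupportedFeatureError(Exception):
    """Local stand-in for the emulator's error type of the same name."""


_CATALOG_FRAGMENT = "is not in the catalog, but it exists in the"


def is_missing_extension_error(message: str) -> bool:
    """True when a DuckDB error message points at an unloaded extension."""
    text = message.lower()
    return _CATALOG_FRAGMENT in text and "extension" in text


def translate(exc: Exception) -> Exception:
    """Map missing-extension errors; pass everything else through unchanged."""
    if is_missing_extension_error(str(exc)):
        return UnsupportedFeatureError(
            "Avro support requires the DuckDB avro extension, "
            "which is not loaded in this engine"
        )
    return exc
```

Errors that do not match, such as a missing file, come back untouched so the existing error_mapper chain can still handle them.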
Alternatives considered
- Apache Arrow pyarrow.avro for Avro read/write: pyarrow lacks a stable Avro module (the experimental pyarrow.dataset.avro reader is not officially supported and was removed in recent releases). Rejected.
- fastavro for Avro write fallback: a viable fallback if DuckDB's COPY TO AVRO ever regresses, but adds a second code path with subtle Arrow-to-Avro schema mapping logic. Tabled as a fallback to wire in only if the DuckDB path fails in CI.
- A custom ORC codec: multi-day scope; ORC's documented but
intricate stripe/footer format does not justify reinvention when
pyorc exists.
- Routing both formats through pyarrow.dataset: pyarrow surfaces read_orc (via the pyorc package internally for ORC, or its own C++ implementation when available). Rejected for ORC because the pyarrow ORC binding is not built by default in the pyarrow wheels and would force operators to compile from source.
Related work
- G2 (multipart/resumable upload endpoints) — closes the load-from-local-file path the four client libraries use by default. Independent of G1.
- G3 (Storage Read Avro encoding) — extends the Storage Read API beyond Arrow. Independent.