ADR 0030: Storage Read API — Avro output format (G3)

Status: Accepted

Context

The v1.0 competitor-parity workstream G3 closes the single row of the goccy bigquery-emulator feature matrix where bqemulator's Storage Read API supported only Arrow IPC while goccy supported both Arrow and Apache Avro:

| Gap | Before | After |
| --- | --- | --- |
| G-8 Storage Read API — Avro output | ❌ servicer hard-coded `data_format=types.DataFormat.ARROW` at `read_servicer.py:283` | ✅ |

The motivating use case: the Java BigQuery Storage Read client defaults to Avro, not Arrow. A Java consumer that uses BigQueryReadClient.create().createReadSession(...) without explicitly setting DataFormat.ARROW requests Avro and (before G3) got a gRPC error from the emulator. Python / Go / Node default to Arrow so the gap was invisible there; in Java, it was blocking.

The wire-format contract the Storage Read API obeys (per BigQuery's Storage Read API docs):

ReadSession.avro_schema.schema carries the writer schema as a JSON string (Avro schema is itself JSON).
ReadRowsResponse.avro_rows.serialized_binary_rows carries the per-chunk row bytes encoded via Avro's binary encoding, back-to-back. NO Avro Object Container File (OCF) header — no Obj\x01 magic, no embedded schema, no sync marker per chunk. The schema is sent once on the session; each row block is naked.
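
To make the "naked rows" shape concrete, the sketch below hand-encodes two rows for a toy schema (a `long` id followed by a `string` name) using the binary encoding rules from the Avro specification. It is an illustration only: the helper names are hypothetical, the toy schema is an assumption, and a production encoder would come from an Avro library rather than hand-rolled code.

```python
import io


def encode_long(n: int) -> bytes:
    """Avro long: zig-zag the sign, then emit a base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # assumes n fits in a signed 64-bit range
    out = bytearray()
    while True:
        low7 = z & 0x7F
        z >>= 7
        if z:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def decode_long(buf: io.BytesIO) -> int:
    z, shift = 0, 0
    while True:
        byte = buf.read(1)[0]
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (z >> 1) ^ -(z & 1)  # undo the zig-zag
        shift += 7


def encode_row(row: dict) -> bytes:
    # A record is its fields in schema order: long, then string
    # (a string is its byte length as a long, followed by UTF-8 bytes).
    name = row["name"].encode("utf-8")
    return encode_long(row["id"]) + encode_long(len(name)) + name


def decode_rows(block: bytes, row_count: int) -> list:
    # The block has no framing, so the reader needs the schema (sent once
    # on the session) and a row count supplied alongside the bytes.
    buf = io.BytesIO(block)
    rows = []
    for _ in range(row_count):
        row_id = decode_long(buf)
        name = buf.read(decode_long(buf)).decode("utf-8")
        rows.append({"id": row_id, "name": name})
    return rows


rows = [{"id": 1, "name": "a"}, {"id": -2, "name": "bc"}]
block = b"".join(encode_row(r) for r in rows)
```

Here `block` is the seven bytes `b"\x02\x02a\x03\x04bc"`: two records back-to-back, with no `Obj\x01` magic, no embedded schema, and no sync marker. `decode_rows(block, 2)` returns the original two rows.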

The constraints to satisfy:

No drift on the Arrow path. Phase 4 + P3.d shipped the Arrow surface, and it works for every client that requests Arrow; only the Java default (Avro) path was broken. The Arrow path must stay byte-identical to its post-P3.d shape.
Real Avro decoders must accept the bytes. The defining failure mode this workstream guards against is "emit proto-valid bytes that no Avro decoder can actually parse." Every layer (unit, integration, conformance, E2E) asserts at the decoded-row level via a real Avro implementation.
Cross-implementation interop. A single Python implementation (fastavro) can validate itself in a circle; we also need a second, independent Avro implementation to decode the bytes. The Apache Avro implementation (avro-tools jar; Java's DataFileReader<GenericRecord>) serves that role.
Coverage. Every new code path must reach ≥90% line and branch coverage.
Conformance shape. Fixtures are recorded via the existing P3.d gRPC corpus framework, with the new three-layer comparator (proto envelope + Avro schema parse-equality + decoded-row equality).

Decisions

1. fastavro as the encoder

fastavro (PyPI fastavro>=1.9) is already a soft dependency under the [avro] optional extra for G1's Avro load/extract path. It provides fastavro.schemaless_writer, a per-row encoder for cases where the schema is already known, which matches the Storage Read "naked binary rows" wire shape exactly.

Alternatives considered:

Apache python-avro — slower (pure-Python encoder), more full-featured than we need. fastavro is the de-facto Python Avro library and ~10× faster per row.
Custom Avro encoder — Avro's binary encoding is fully documented and not enormous, but writing and maintaining an encoder duplicates well-tested code with no upside.

2. Promote fastavro to runtime (not optional)

The Java client default is Avro; ANY deployment serving Java consumers needs Avro support. Promoting fastavro from [avro] extra to runtime means a one-time install footprint increase (~1 MB wheel) in exchange for the canonical Java BQ Storage Read code path working out-of-the-box.

3. BigQuery → Avro type mapping per Google's documented contract

The schema converter (arrow_schema_to_avro_json) implements Google's documented BigQuery → Avro export mapping:

| BigQuery | Avro |
| --- | --- |
| INT64 | `long` |
| FLOAT64 | `double` |
| NUMERIC | `bytes` + `logicalType=decimal`, precision=38, scale=9 |
| BIGNUMERIC | `bytes` + `logicalType=decimal`, precision=76, scale=38 |
| STRING | `string` |
| BYTES | `bytes` |
| BOOL | `boolean` |
| DATE | `int` + `logicalType=date` |
| TIME | `long` + `logicalType=time-micros` |
| DATETIME | `string` (BigQuery-special — no native Avro logical type) |
| TIMESTAMP | `long` + `logicalType=timestamp-micros` |
| GEOGRAPHY | `string` (WKT encoding, per BQ docs) |
| JSON | `string` |
| RANGE\<T> | `record` with `start`/`end` fields, recursive on T |
| INTERVAL | `string` (canonical Y-M D H:M:S form) |
| ARRAY\<T> | `array` of T |
| STRUCT | `record` |
| nullable T | `["null", <T>]` union with `"null"` first |

The mapping is cited in the converter's module docstring and tested exhaustively in tests/unit/streaming/test_avro_serializer.py.
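
As a rough illustration of the mapping table above, the sketch below builds an Avro schema JSON string from BigQuery column descriptions. It is not the emulator's converter: `arrow_schema_to_avro_json` works from an Arrow schema, whereas the hypothetical `field_to_avro` and `table_to_avro_json` helpers here take BigQuery type names and modes directly. The record name `Root` is an assumption, and STRUCT and RANGE are left out for brevity.

```python
import json

# Scalar rows of the mapping table, keyed by BigQuery type name.
SCALARS = {
    "INT64": "long",
    "FLOAT64": "double",
    "STRING": "string",
    "BYTES": "bytes",
    "BOOL": "boolean",
    "DATETIME": "string",
    "GEOGRAPHY": "string",
    "JSON": "string",
    "INTERVAL": "string",
    "DATE": {"type": "int", "logicalType": "date"},
    "TIME": {"type": "long", "logicalType": "time-micros"},
    "TIMESTAMP": {"type": "long", "logicalType": "timestamp-micros"},
    "NUMERIC": {"type": "bytes", "logicalType": "decimal",
                "precision": 38, "scale": 9},
    "BIGNUMERIC": {"type": "bytes", "logicalType": "decimal",
                   "precision": 76, "scale": 38},
}


def field_to_avro(name: str, bq_type: str, mode: str = "NULLABLE") -> dict:
    """Hypothetical helper: one BigQuery column -> one Avro record field."""
    avro_type = SCALARS[bq_type]
    if mode == "REPEATED":
        avro_type = {"type": "array", "items": avro_type}
    elif mode == "NULLABLE":
        avro_type = ["null", avro_type]  # "null" is the first union branch
    return {"name": name, "type": avro_type}


def table_to_avro_json(record_name: str, columns: list) -> str:
    """Hypothetical helper: wrap the fields in a top-level Avro record."""
    schema = {
        "type": "record",
        "name": record_name,
        "fields": [field_to_avro(*col) for col in columns],
    }
    return json.dumps(schema)


schema_json = table_to_avro_json(
    "Root",
    [
        ("id", "INT64", "REQUIRED"),
        ("price", "NUMERIC"),  # mode defaults to NULLABLE
        ("tags", "STRING", "REPEATED"),
    ],
)
```

In this sketch the REQUIRED `id` column maps to a bare `long`, the nullable `price` column becomes a `["null", {...decimal...}]` union, and the REPEATED `tags` column becomes an `array` of `string`.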

4. Servicer dispatch on the request's `data_format`

The read servicer dispatches at session-creation time:

raw_format = read_session._pb.data_format  # bypass proto-plus enum warn
if raw_format in (UNSPECIFIED, ARROW):
    session_format = FORMAT_ARROW
    wire_format    = DataFormat.ARROW
elif raw_format == AVRO:
    session_format = FORMAT_AVRO
    wire_format    = DataFormat.AVRO
else:
    return INVALID_ARGUMENT("Unsupported data_format: …")

The chosen format lives on ReadSessionState.data_format (with avro_schema_json pre-computed once at session creation) so every subsequent ReadRows call and every SplitReadStream child serves the same format without re-deriving it from the request. The state is format-agnostic at the row layer (the Arrow table IS the snapshot); the format-specific bytes are computed on the fly per chunk.

Reading _pb.data_format (the raw protobuf int) rather than the proto-plus enum property side-steps proto-plus's UserWarning on unknown enum values. The test runner's filterwarnings = ["error"] would otherwise convert that warning into an Unexpected [UserWarning] 500 error for any hand-crafted client request that sends an out-of-range data_format value.
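
The resolve-once, read-many pattern described in this section can be sketched in isolation as follows. Everything here is a stand-in: `DataFormat` is a local enum assumed to mirror the Storage Read proto values, `SessionState` is a trimmed-down hypothetical version of ReadSessionState, and `ValueError` stands in for returning gRPC INVALID_ARGUMENT.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class DataFormat(IntEnum):
    # Local stand-in; values assumed to match the Storage Read proto enum.
    DATA_FORMAT_UNSPECIFIED = 0
    AVRO = 1
    ARROW = 2


@dataclass(frozen=True)
class SessionState:
    # Hypothetical, trimmed-down stand-in for ReadSessionState.
    data_format: DataFormat
    avro_schema_json: Optional[str] = None


def open_session(raw_format: int, avro_schema_json: str) -> SessionState:
    """Resolve the wire format once, from the raw protobuf int."""
    if raw_format in (DataFormat.DATA_FORMAT_UNSPECIFIED, DataFormat.ARROW):
        return SessionState(DataFormat.ARROW)
    if raw_format == DataFormat.AVRO:
        # The schema JSON is stored once so later calls never re-derive it.
        return SessionState(DataFormat.AVRO, avro_schema_json)
    # Stand-in for returning gRPC INVALID_ARGUMENT to the client.
    raise ValueError(f"Unsupported data_format: {raw_format}")


def serve_chunk(state: SessionState) -> str:
    # Later reads and split children branch on stored state only,
    # never on the original request.
    if state.data_format is DataFormat.AVRO:
        return "avro_rows"
    return "arrow_record_batch"
```

Because the function compares a plain `int` against the enum members, an out-of-range value such as `99` simply falls through to the error branch instead of triggering an enum-conversion warning.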

5. Three-layer conformance comparison

The G3 fixtures land in the existing P3.d grpc_corpus/ directory. The comparator gains two new helpers in _grpc_comparison.py:

| Layer | What it checks | How |
| --- | --- | --- |
| Proto envelope | `ReadSession` + `ReadRowsResponse` proto fields match the recorded structural subset | existing P3.d `compare_grpc_calls` |
| Avro schema | The emulator's `avro_schema.schema` parses as valid Avro JSON AND is canonically equal to the recorded schema | new `compare_avro_schema` (uses `fastavro.parse_schema` for normalisation) |
| Avro rows | `fastavro.schemaless_reader` decodes the emulator's bytes AND yields a row list equal to the recorded `decoded_rows` | new `decode_and_compare_avro_rows` (with FLOAT64 / Decimal tolerance per ADR 0022 §3) |

Bytes are NOT compared (encoder freedom in varint padding / union ordering). Decoded values ARE compared.
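
A minimal sketch of the value-level comparison idea follows, operating on rows that have already been decoded. It is not the code in _grpc_comparison.py: the function names are hypothetical, the schema check is a plain JSON comparison rather than fastavro's normalisation, and `rel_tol` is an assumed placeholder rather than the tolerance ADR 0022 specifies.

```python
import json
import math
from decimal import Decimal


def schemas_equal(actual_json: str, recorded_json: str) -> bool:
    # Parsing first makes whitespace and key order irrelevant. This is a
    # simplification: it does not normalise names or defaults the way a
    # real Avro schema parser would.
    return json.loads(actual_json) == json.loads(recorded_json)


def values_equal(actual, expected, rel_tol: float = 1e-9) -> bool:
    # rel_tol is an assumed placeholder value.
    if isinstance(expected, float) and isinstance(actual, float):
        if math.isnan(expected):
            return math.isnan(actual)
        return math.isclose(actual, expected, rel_tol=rel_tol)
    if isinstance(expected, Decimal) and isinstance(actual, Decimal):
        return actual == expected  # numeric equality: 1.50 == 1.5
    if isinstance(expected, dict) and isinstance(actual, dict):
        return expected.keys() == actual.keys() and all(
            values_equal(actual[k], expected[k], rel_tol) for k in expected
        )
    if isinstance(expected, list) and isinstance(actual, list):
        return len(actual) == len(expected) and all(
            values_equal(a, e, rel_tol) for a, e in zip(actual, expected)
        )
    return actual == expected


def rows_equal(actual_rows: list, recorded_rows: list) -> bool:
    # Rows are lists of dicts, so the recursive check covers them directly.
    return values_equal(actual_rows, recorded_rows)
```

Comparing decoded values this way means two encoders that produce different bytes for the same data still pass, while any difference a consumer could observe still fails.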

6. Cross-implementation Avro interop

The conformance comparator uses fastavro; that's a single implementation cross-checking itself. To catch a class of drift where the emulator emits bytes only fastavro accepts, we exercise a second, independent Apache Avro implementation in two places:

Integration: test_emulator_avro_bytes_decode_with_apache_avro_tools uses the canonical Apache Avro avro-tools jar (getschema + tojson) against an emulator-emitted OCF.
E2E (Java): the Java suite's round-trip-to-.avro test uses the official Apache Avro DataFileReader<GenericRecord> (the canonical Java implementation, NOT fastavro) to read back what the emulator emitted. This is the load-bearing cross-impl proof.

The two-implementation rule MUST hold: if the emulator drifts away from the documented wire format, at least one of these two independent decoders will fail before any user does.
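
One way to drive the second decoder from a Python test is to shell out to the avro-tools jar and parse its output. The sketch below assumes a JVM is available, that the jar and an OCF file exist at caller-supplied paths, and that `tojson` prints one JSON object per record per line. The helper names are hypothetical and are not taken from the emulator's test suite.

```python
import json
import subprocess


def avro_tools_command(jar_path: str, subcommand: str, ocf_path: str) -> list:
    # "getschema" and "tojson" are the avro-tools subcommands named above.
    return ["java", "-jar", jar_path, subcommand, ocf_path]


def parse_tojson_output(stdout: str) -> list:
    # Assumes one JSON object per non-empty line of output.
    return [json.loads(line) for line in stdout.splitlines() if line.strip()]


def decode_with_avro_tools(jar_path: str, ocf_path: str) -> list:
    # Needs a JVM and the jar on disk; raises CalledProcessError if the
    # Apache implementation rejects the file.
    result = subprocess.run(
        avro_tools_command(jar_path, "tojson", ocf_path),
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_tojson_output(result.stdout)
```

Note that Avro's JSON encoding wraps a non-null union value in a single-key object such as `{"long": 1}`, so rows with nullable fields need unwrapping before they can be compared against rows decoded by fastavro.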

7. Reference files under `tests/fixtures/avro/`

Six hand-authored Avro OCF files (one per conformance fixture) live under tests/fixtures/avro/. They are NOT the wire-format bytes the Storage Read API emits — they are standard OCFs with embedded schemas + sync markers, used as the source-of-truth contract for the schema converter. They are regenerated via make generate-avro-fixtures from scripts/generate_avro_fixtures.py, which drives the emulator's own schema converter so the files stay in lock-step with any future schema-mapping change.

Consequences

Positive

Java consumer default path works. The canonical BigQueryReadClient.create().createReadSession(...) Java code runs unchanged.
goccy parity row G-8 closed. Twelve gaps remain (G2–G4 etc.; see v1-confidence-plan).
Real Avro file round-trip provably works end-to-end via the three integration tests + four-language E2E + cross-impl avro-tools/DataFileReader checks.

Negative

fastavro becomes a runtime dependency. Adds ~1 MB to the install footprint; pip-audit must report it clean on every release.
Avro test surface expands. Six new conformance fixtures, six reference OCFs, three load-bearing integration tests, four-language E2E (Java gets the most attention). Maintenance is bounded by the Avro spec being fully documented and stable.

Neutral

No changes to the Arrow path beyond what's needed to wire the format-branch dispatch. Existing Arrow conformance + integration + E2E coverage continues to assert byte-identical behaviour.

Alternatives considered

Skip Java's default. Document "set DataFormat.ARROW explicitly in Java" as a known limitation. Rejected: the whole point of an emulator is drop-in compatibility; adding "set this flag" instructions defeats the purpose.
Keep fastavro as an optional extra. Rejected: the canonical Java BQ Storage code path silently breaks unless the install includes the extra. Either fastavro is in the runtime tree or Avro is documented as not-supported; the half-measure leaves every deployment one mistake away from a broken Java consumer.
Use Apache python-avro as the encoder. Rejected: slower pure-Python encoder; fastavro is the de-facto Python Avro library with a broader install base and faster per-row encoding.
Hand-write the Avro encoder. Rejected: maintenance cost without upside; Avro's wire format is fully documented, but an encoder is enough code that re-using a maintained dependency is the better choice.

References

BigQuery Storage Read API — Avro format — the wire-format contract.
BigQuery → Avro export mapping — the BQ-to-Avro type table.
ADR 0022 — recorded-baseline design + FLOAT64 / numeric tolerance rules.
ADR 0027 — sibling G1 workstream closing Avro/ORC on the load/extract axis.
ADR 0008 — Storage Read API session-time materialisation snapshot (still authoritative).