Storage Read API
Implementation in src/bqemulator/grpc_api/read_servicer.py +
src/bqemulator/streaming/read_session.py +
src/bqemulator/streaming/avro_serializer.py.
Session lifecycle
- Client calls CreateReadSession(table, selected_fields, row_filter, max_streams, data_format).
- Servicer dispatches on data_format:
  - ARROW (Python / Go / Node default; also the proto3 default for an unset field): session emits Arrow IPC.
  - AVRO (Java client default): session emits naked Avro binary rows (ADR 0030).
  - Any other value → INVALID_ARGUMENT.
- Servicer builds a projection+filter query and executes it against DuckDB, materializing a pyarrow.Table.
- Servicer splits the table into N = min(max_streams, 10) row ranges (with a small-table cap of 1 stream below ~1 MB) and records them as ReadStream references on a ReadSessionState. The session state carries the chosen data_format and (for AVRO sessions) a pre-computed Avro schema JSON so subsequent ReadRows calls and SplitReadStream children all emit the same format without re-deriving it.
- Each stream has a name of the form projects/{p}/locations/{loc}/sessions/{sid}/streams/{n}.
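The splitting step can be sketched as follows. This is a minimal illustration, not the emulator's code: plan_streams, StreamRange, the exact byte threshold, the even-split policy, and the handling of max_streams values below 1 are all assumptions. Only the min(max_streams, 10) cap, the single-stream rule for small tables, and the stream-name format come from the description above.

```python
from dataclasses import dataclass

MAX_STREAMS = 10               # cap from the text: N = min(max_streams, 10)
SMALL_TABLE_BYTES = 1_000_000  # assumed value for the "~1 MB" small-table threshold


@dataclass(frozen=True)
class StreamRange:
    """Hypothetical record of one stream: a name plus a half-open row range."""
    name: str
    start: int  # inclusive
    stop: int   # exclusive


def plan_streams(session_path: str, num_rows: int, table_bytes: int,
                 max_streams: int) -> list[StreamRange]:
    """Split [0, num_rows) into at most min(max_streams, 10) contiguous ranges."""
    n = min(max_streams, MAX_STREAMS)
    if table_bytes < SMALL_TABLE_BYTES:
        n = 1  # small tables are served by a single stream
    # Assumed policy: always at least one stream, never more streams than rows.
    n = max(1, min(n, num_rows))
    base, extra = divmod(num_rows, n)
    ranges, start = [], 0
    for i in range(n):
        # The first `extra` streams take one additional row each.
        stop = start + base + (1 if i < extra else 0)
        ranges.append(StreamRange(f"{session_path}/streams/{i}", start, stop))
        start = stop
    return ranges


session = "projects/p/locations/us/sessions/s1"
for stream in plan_streams(session, num_rows=10, table_bytes=5_000_000, max_streams=3):
    print(stream.name, stream.start, stream.stop)
```

Half-open ranges keep the slices contiguous and non-overlapping, so every row of the materialized table belongs to exactly one stream.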
Read path
Client calls ReadRows(stream_name) (server-streaming RPC). The servicer
looks up the stream state, slices the materialized Arrow table for the
stream's row range, and streams chunks in the session's chosen format:
- Arrow IPC: via pyarrow.ipc.RecordBatchStreamWriter into a BytesIO buffer; each chunk carries an ArrowRecordBatch and the first chunk additionally carries the arrow_schema.
- Avro: via fastavro.schemaless_writer (ADR 0030); each chunk carries an AvroRows.serialized_binary_rows payload and the first chunk additionally carries the avro_schema. The bytes are naked (no Object Container File header per chunk), per BigQuery's documented wire contract.
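To make the Avro chunk layout concrete, the sketch below hand-encodes rows for a two-field record (a long and a string) using Avro's binary encoding and groups them into chunks, attaching the schema to the first chunk only. It deliberately uses the standard library in place of fastavro so the bytes are visible. Chunk, encode_long, encode_row, read_rows, the chunk_rows parameter, and the example schema are hypothetical names for illustration, not the emulator's API.

```python
import json
from dataclasses import dataclass
from typing import Iterator, Optional

# Illustrative schema only; the real one is pre-computed per session.
SCHEMA_JSON = json.dumps({
    "type": "record", "name": "Row",
    "fields": [{"name": "id", "type": "long"}, {"name": "name", "type": "string"}],
})


def encode_long(n: int) -> bytes:
    """Avro long: zigzag encoding, then a base-128 varint, low group first."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_row(row_id: int, name: str) -> bytes:
    """A record is its fields back to back; a string is length-prefixed UTF-8."""
    data = name.encode("utf-8")
    return encode_long(row_id) + encode_long(len(data)) + data


@dataclass
class Chunk:
    """Hypothetical stand-in for one ReadRows response message."""
    serialized_binary_rows: bytes
    row_count: int
    avro_schema: Optional[str] = None  # populated on the first chunk only


def read_rows(rows: list[tuple[int, str]], chunk_rows: int = 2) -> Iterator[Chunk]:
    """Yield chunks of concatenated naked rows; no container header is written."""
    for offset in range(0, len(rows), chunk_rows):
        batch = rows[offset:offset + chunk_rows]
        payload = b"".join(encode_row(i, s) for i, s in batch)
        yield Chunk(payload, len(batch), SCHEMA_JSON if offset == 0 else None)


for chunk in read_rows([(1, "a"), (2, "bc"), (-1, "")]):
    print(chunk.row_count, chunk.serialized_binary_rows, chunk.avro_schema is not None)
```

An Object Container File would begin with the magic bytes Obj followed by 0x01 and embed the schema in its header. Here each payload starts directly with the first row's fields, which is why a reader needs the avro_schema delivered with the first chunk to decode anything.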
Consistency
Per ADR 0008, the materialization IS the snapshot. Writes after session creation do not affect in-flight sessions.
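The snapshot behaviour can be illustrated with a toy in-memory table. Everything here (the list-backed table and the create_session helper) is hypothetical; it only mirrors the idea that rows are copied out once, at session creation, and later writes touch the live table alone.

```python
def create_session(table_rows: list[dict]) -> tuple[dict, ...]:
    """Materialize an immutable copy of the rows; this copy is the snapshot."""
    return tuple(dict(row) for row in table_rows)


table = [{"id": 1}, {"id": 2}]
snapshot = create_session(table)

# Writes that land after session creation...
table.append({"id": 3})
table[0]["id"] = 99

# ...are visible in the live table but not in the session's snapshot.
live_ids = [row["id"] for row in table]
session_ids = [row["id"] for row in snapshot]
print(live_ids, session_ids)
```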
SplitReadStream
The API supports subdividing a stream further. The servicer re-splits the
stream's row range into two halves and returns the new stream names.
Both child streams inherit the parent session's data_format (a child
of an Avro session continues serving Avro).
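A minimal sketch of the halving step, assuming half-open row ranges. StreamState, split_stream, and the next_index parameter are hypothetical, and the child naming scheme in particular is an assumption, since the text above does not say how child streams are named.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamState:
    """Hypothetical per-stream state: name, half-open row range, session format."""
    name: str
    start: int        # inclusive
    stop: int         # exclusive
    data_format: str  # "ARROW" or "AVRO", fixed at session creation


def split_stream(parent: StreamState, next_index: int) -> tuple[StreamState, StreamState]:
    """Halve the parent's row range; both children keep the parent's format."""
    mid = parent.start + (parent.stop - parent.start) // 2
    # Assumed naming: reuse the parent's .../streams/ prefix with fresh indices.
    prefix = parent.name.rsplit("/", 1)[0]
    first = StreamState(f"{prefix}/{next_index}", parent.start, mid, parent.data_format)
    second = StreamState(f"{prefix}/{next_index + 1}", mid, parent.stop, parent.data_format)
    return first, second


parent = StreamState("projects/p/locations/us/sessions/s1/streams/0", 0, 7, "AVRO")
for child in split_stream(parent, next_index=3):
    print(child.name, child.start, child.stop, child.data_format)
```

Because the children only narrow the parent's range over the same materialized table, the two halves together cover exactly the rows the parent would have served.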