Storage API¶
Status: shipped — both Storage Read and Storage Write.
Storage Read API¶
CreateReadSession materializes your projection + filter as a
pyarrow.Table at session creation (see
ADR 0008). ReadRows streams
slices of that snapshot in the session's chosen wire format.
Choosing Arrow vs Avro¶
| Format | Default in | When to pick |
|---|---|---|
| Arrow IPC | Python, Go, Node | The fastest read path — zero-copy into pyarrow / arrow-js / arrow-go memory. Pick this unless you have a specific reason to use Avro. |
| Apache Avro | Java | Java's BigQueryReadClient.create().createReadSession(...) defaults to Avro. Lower overhead in JVM consumers that already have an Avro pipeline, and matches BigQuery's documented Avro export shape. See ADR 0030. |
The proto3 default for an unset data_format is treated as Arrow,
matching real BigQuery. Any other value (a hypothetical future
PROTO format) surfaces INVALID_ARGUMENT.
The Avro wire shape is "schema-once on the session, naked binary rows
per response chunk" — ReadSession.avro_schema.schema carries the
writer schema as JSON; each ReadRowsResponse.avro_rows.serialized_binary_rows
carries Avro's binary encoding back-to-back with no Avro Object
Container File header. Decode with fastavro.schemaless_reader (Python),
org.apache.avro.io.DatumReader<GenericRecord> (Java),
@google-cloud/bigquery-storage's built-in Avro decoder (Node), or
github.com/linkedin/goavro (Go).
Storage Write API¶
Four stream types, all fully supported in v1 — see ADR 0013 for the design.
| Stream type | Visibility | Commit required | Offset dedup |
|---|---|---|---|
DEFAULT |
Immediate (at-least-once) | No | No |
COMMITTED |
Immediate (exactly-once) | No | Yes |
PENDING |
On BatchCommitWriteStreams |
Yes | Yes |
BUFFERED |
On FlushRows |
Flush | Yes |
Both Arrow and dynamic-protobuf row formats are accepted.
Quickstart (Python)¶
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types, writer
from google.api_core.client_options import ClientOptions
from google.auth.credentials import AnonymousCredentials
import grpc
# Point the Write client at the emulator's gRPC endpoint.
channel = grpc.insecure_channel("localhost:9060")
write_client = bigquery_storage_v1.BigQueryWriteClient(
transport=bigquery_storage_v1.services.big_query_write.transports
.BigQueryWriteGrpcTransport(channel=channel),
)
# Create a COMMITTED stream.
stream = write_client.create_write_stream(
parent="projects/my-project/datasets/sales/tables/orders",
write_stream=types.WriteStream(type_=types.WriteStream.Type.COMMITTED),
)
# AppendRows is a bidi stream — use the high-level AppendRowsStream helper or
# drive it manually via `channel.stream_stream(...)`.
Choosing a stream type¶
- Development loops / ad-hoc scripts — use
DEFAULT; no setup, no teardown. Rows are visible the moment AppendRows acknowledges. - Real production parity / exactly-once ingestion — use
COMMITTED. Keep a per-producer offset counter; retry with the same offset after a transient error. - Batch loads with an atomic swap — use
PENDING. Write N streams in parallel, thenFinalizeWriteStreameach, then commit them all in a singleBatchCommitWriteStreamscall. Either everything lands or nothing does. - Incremental flush — use
BUFFERED. Stage rows, thenFlushRows(offset)when you want them visible. Useful when upstream data arrives in micro-batches and you want to hold back rows until they're certified clean.
Error surface¶
| Client action | Signal |
|---|---|
| Duplicate offset on COMMITTED/PENDING/BUFFERED | AppendRowsResponse.error.code = ALREADY_EXISTS |
| Gap offset (offset > next_offset) | AppendRowsResponse.error.code = OUT_OF_RANGE |
| Append after Finalize | AppendRowsResponse.error.code = FAILED_PRECONDITION |
| Invalid stream name / missing table | gRPC INVALID_ARGUMENT / NOT_FOUND |
| Commit non-finalized PENDING stream | stream_errors[...].code = INVALID_STREAM_STATE |
Limitations¶
- Stream state is in-memory; a process restart forgets every stream. This is intentional — see ADR 0013.
- Schema evolution (AppendRows updating
updated_schema) is not yet emulated; the emulator always reports the target table schema.