Storage API¶

Status: shipped — both Storage Read and Storage Write.

Storage Read API¶

CreateReadSession materializes your projection + filter as a pyarrow.Table at session creation (see ADR 0008). ReadRows streams slices of that snapshot in the session's chosen wire format.

Choosing Arrow vs Avro¶

Format	Default in	When to pick
Arrow IPC	Python, Go, Node	The fastest read path — zero-copy into pyarrow / arrow-js / arrow-go memory. Pick this unless you have a specific reason to use Avro.
Apache Avro	Java	Java's `BigQueryReadClient.create().createReadSession(...)` defaults to Avro. Lower overhead in JVM consumers that already have an Avro pipeline, and matches BigQuery's documented Avro export shape. See ADR 0030.

The proto3 default for an unset data_format is treated as Arrow, matching real BigQuery. Any other value (a hypothetical future PROTO format) surfaces INVALID_ARGUMENT.

The Avro wire shape is "schema-once on the session, naked binary rows per response chunk" — ReadSession.avro_schema.schema carries the writer schema as JSON; each ReadRowsResponse.avro_rows.serialized_binary_rows carries Avro's binary encoding back-to-back with no Avro Object Container File header. Decode with fastavro.schemaless_reader (Python), org.apache.avro.io.DatumReader<GenericRecord> (Java), @google-cloud/bigquery-storage's built-in Avro decoder (Node), or github.com/linkedin/goavro (Go).

Storage Write API¶

Four stream types, all fully supported in v1 — see ADR 0013 for the design.

Stream type	Visibility	Commit required	Offset dedup
`DEFAULT`	Immediate (at-least-once)	No	No
`COMMITTED`	Immediate (exactly-once)	No	Yes
`PENDING`	On `BatchCommitWriteStreams`	Yes	Yes
`BUFFERED`	On `FlushRows`	Flush	Yes

Both Arrow and dynamic-protobuf row formats are accepted.

Quickstart (Python)¶

from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types, writer
from google.api_core.client_options import ClientOptions
from google.auth.credentials import AnonymousCredentials
import grpc

# Point the Write client at the emulator's gRPC endpoint.
channel = grpc.insecure_channel("localhost:9060")
write_client = bigquery_storage_v1.BigQueryWriteClient(
    transport=bigquery_storage_v1.services.big_query_write.transports
        .BigQueryWriteGrpcTransport(channel=channel),
)

# Create a COMMITTED stream.
stream = write_client.create_write_stream(
    parent="projects/my-project/datasets/sales/tables/orders",
    write_stream=types.WriteStream(type_=types.WriteStream.Type.COMMITTED),
)

# AppendRows is a bidi stream — use the high-level AppendRowsStream helper or
# drive it manually via `channel.stream_stream(...)`.

Choosing a stream type¶

Development loops / ad-hoc scripts — use DEFAULT; no setup, no teardown. Rows are visible the moment AppendRows acknowledges.
Real production parity / exactly-once ingestion — use COMMITTED. Keep a per-producer offset counter; retry with the same offset after a transient error.
Batch loads with an atomic swap — use PENDING. Write N streams in parallel, then FinalizeWriteStream each, then commit them all in a single BatchCommitWriteStreams call. Either everything lands or nothing does.
Incremental flush — use BUFFERED. Stage rows, then FlushRows(offset) when you want them visible. Useful when upstream data arrives in micro-batches and you want to hold back rows until they're certified clean.

Error surface¶

Client action	Signal
Duplicate offset on COMMITTED/PENDING/BUFFERED	`AppendRowsResponse.error.code = ALREADY_EXISTS`
Gap offset (offset > next_offset)	`AppendRowsResponse.error.code = OUT_OF_RANGE`
Append after Finalize	`AppendRowsResponse.error.code = FAILED_PRECONDITION`
Invalid stream name / missing table	gRPC `INVALID_ARGUMENT` / `NOT_FOUND`
Commit non-finalized PENDING stream	`stream_errors[...].code = INVALID_STREAM_STATE`

Limitations¶

Stream state is in-memory; a process restart forgets every stream. This is intentional — see ADR 0013.
Schema evolution (AppendRows updating updated_schema) is not yet emulated; the emulator always reports the target table schema.