PySpark reading from bqemulator¶
Reads data from bqemulator via the BigQuery Storage Read API (Arrow
output) into a PySpark DataFrame using the
google-cloud-bigquery-storage
client and Spark's createDataFrame from an Arrow table.
Pairs with the Spark integration guide.
Why this shape (vs. the official spark-bigquery-connector)¶
The official
spark-bigquery-connector
is a Scala/Java JAR that talks to the real BigQuery Storage Read API
over gRPC. Pointing it at an emulator host requires JVM-level
--conf overrides that vary by connector version and tend to break
on upgrades.
The robust, version-stable pattern is:
- Use the
google-cloud-bigquery-storagePython client (which respects the standard endpoint override semantics) to read rows out of BigQuery as Arrow record batches. - Hand the Arrow table to Spark via
spark.createDataFrame.
This matches the surface bqemulator exposes (see
ADR 0030) and is the
pattern used in the Storage Read API guide.
What it demonstrates¶
- Seeding a 5-row table via the standard
google-cloud-bigqueryclient against the emulator. - Reading those rows back via the Storage Read API
(
google-cloud-bigquery-storage) withDataFormat.ARROW. - Constructing a Spark
DataFramefrom the result and running an aggregate. - Asserting the round-tripped count matches.
Layout¶
Run¶
What to look for¶
- The Storage Read client is constructed with a
BigQueryReadGrpcTransportwrapping an explicitgrpc.insecure_channel(...)— bqemulator serves the Storage Read API over plaintext gRPC, and the default transport wraps every call in TLS (failing the handshake withSSL_ERROR_SSL: WRONG_VERSION_NUMBER). - Spark runs in local-master mode (
local[*]) so the example is hermetic — no Hadoop/YARN cluster required.
Storage Read API — bare record-batch IPC contract (v1.0.1)¶
The example uses the canonical
reader.to_arrow(session) helper from the
google-cloud-bigquery-storage client. Under v1.0.0 this call tripped
on a bqemulator wire-format bug (the server packed a full Arrow IPC
stream into ArrowRecordBatch.serialized_record_batch); v1.0.1
#15 /
ADR 0033
shipped the spec-conforming bare record-batch format and removed
the workaround. If you're pinning to v1.0.0 the helper still trips
the format mismatch; upgrade to v1.0.1+.