Skip to content

PySpark reading from bqemulator

Reads data from bqemulator via the BigQuery Storage Read API (Arrow output) into a PySpark DataFrame using the google-cloud-bigquery-storage client and Spark's createDataFrame from an Arrow table.

Pairs with the Spark integration guide.

Why this shape (vs. the official spark-bigquery-connector)

The official spark-bigquery-connector is a Scala/Java JAR that talks to the real BigQuery Storage Read API over gRPC. Pointing it at an emulator host requires JVM-level --conf overrides that vary by connector version and tend to break on upgrades.

The robust, version-stable pattern is:

  1. Use the google-cloud-bigquery-storage Python client (which respects the standard endpoint override semantics) to read rows out of BigQuery as Arrow record batches.
  2. Hand the Arrow table to Spark via spark.createDataFrame.

This matches the surface bqemulator exposes (see ADR 0030) and is the pattern used in the Storage Read API guide.

What it demonstrates

  • Seeding a 5-row table via the standard google-cloud-bigquery client against the emulator.
  • Reading those rows back via the Storage Read API (google-cloud-bigquery-storage) with DataFormat.ARROW.
  • Constructing a Spark DataFrame from the result and running an aggregate.
  • Asserting the round-tripped count matches.

Layout

example.py — full demo: seed, Storage Read, Spark aggregate

Run

make test

What to look for

  • The Storage Read client is constructed with a BigQueryReadGrpcTransport wrapping an explicit grpc.insecure_channel(...) — bqemulator serves the Storage Read API over plaintext gRPC, and the default transport wraps every call in TLS (failing the handshake with SSL_ERROR_SSL: WRONG_VERSION_NUMBER).
  • Spark runs in local-master mode (local[*]) so the example is hermetic — no Hadoop/YARN cluster required.

Storage Read API — bare record-batch IPC contract (v1.0.1)

The example uses the canonical reader.to_arrow(session) helper from the google-cloud-bigquery-storage client. Under v1.0.0 this call tripped on a bqemulator wire-format bug (the server packed a full Arrow IPC stream into ArrowRecordBatch.serialized_record_batch); v1.0.1 #15 / ADR 0033 shipped the spec-conforming bare record-batch format and removed the workaround. If you're pinning to v1.0.0 the helper still trips the format mismatch; upgrade to v1.0.1+.