Loading from local files (upload host)¶
The emulator ships multipart and resumable upload host endpoints. See ADR 0029.
This guide covers the standard load_table_from_file() idiom across
the four official client libraries. The emulator's
/upload/bigquery/v2/... routes implement the same multipart /
resumable upload protocols that real BigQuery uses, so client code
runs unchanged.
Quick reference¶
| Client | API | Default protocol | Upload host code path |
|---|---|---|---|
| Python | `Client.load_table_from_file(BytesIO, …)` | Auto (multipart < 5 MiB, resumable otherwise) | ✅ |
| Node | `Table.load(path, …)` / `Table.createWriteStream(…)` | Auto (multipart < 5 MiB, resumable otherwise) | ✅ |
| Go | `Table.LoaderFrom(ReaderSource).Run(ctx)` | Resumable | ✅ |
| Java | `BigQuery.writer(WriteChannelConfiguration)` | Resumable | ✅ |
All four route through /upload/bigquery/v2/projects/{p}/jobs rather
than the data-plane /bigquery/v2/projects/{p}/jobs endpoint.
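To make the routing difference concrete, the sketch below builds both URLs for a project. The helper names and the base URL used in the usage line are hypothetical, chosen for illustration only; they are not emulator defaults.

```python
from urllib.parse import urlencode


def jobs_url(base_url: str, project: str) -> str:
    """Data-plane jobs endpoint (no file payload)."""
    return f"{base_url.rstrip('/')}/bigquery/v2/projects/{project}/jobs"


def upload_jobs_url(base_url: str, project: str, upload_type: str) -> str:
    """Upload-host jobs endpoint, the one file loads go through."""
    if upload_type not in ("multipart", "resumable"):
        raise ValueError(f"unsupported uploadType: {upload_type}")
    query = urlencode({"uploadType": upload_type})
    return f"{base_url.rstrip('/')}/upload/bigquery/v2/projects/{project}/jobs?{query}"
```

For example, `upload_jobs_url("http://localhost:9050", "my-project", "resumable")` yields the initiate URL for a resumable session, assuming the emulator listens on that (hypothetical) address.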
Python¶
```python
import io

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    schema=[
        bigquery.SchemaField("id", "INTEGER"),
        bigquery.SchemaField("name", "STRING"),
    ],
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

csv_bytes = b"id,name\n1,alice\n2,bob\n3,carol\n"
job = client.load_table_from_file(
    io.BytesIO(csv_bytes),
    "my-project.sales.customers",
    job_config=job_config,
)
job.result()  # waits for the load to complete
```
The Python client picks multipart when the `size` argument is passed and is
under 5 MiB, and resumable otherwise, including when `size` is omitted. The
emulator handles both shapes identically: the same `LoadJobConfig` flags apply.
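The size-based choice can be sketched as a small function. This is an illustration, not the client's code: the function and constant names are hypothetical, and treating an unknown size as resumable is an assumption based on how `google-cloud-bigquery` behaves when `size` is not passed.

```python
from typing import Optional

# 5 MiB threshold between the two protocols.
MULTIPART_LIMIT = 5 * 1024 * 1024


def choose_upload_protocol(size: Optional[int]) -> str:
    """Return "multipart" for small known sizes, "resumable" otherwise."""
    if size is None or size >= MULTIPART_LIMIT:
        return "resumable"
    return "multipart"
```

Either result ends up on the same upload host routes, so the distinction matters mainly when debugging which request shape the emulator received.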
Node.js¶
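```javascript
const { BigQuery } = require("@google-cloud/bigquery");
const { Readable } = require("node:stream");

const bq = new BigQuery({ projectId: "my-project" });
const source = Readable.from(Buffer.from("id,name\n1,alice\n2,bob\n"));

// Table.load() takes a file path or a Cloud Storage File, not a stream;
// for in-memory data, pipe into createWriteStream() instead.
const sink = bq.dataset("sales").table("customers").createWriteStream({
  sourceFormat: "CSV",
  skipLeadingRows: 1,
  writeDisposition: "WRITE_TRUNCATE",
  schema: { fields: [
    { name: "id", type: "INTEGER" },
    { name: "name", type: "STRING" },
  ] },
});

// "complete" fires once the load job has finished.
await new Promise((resolve, reject) => {
  source.pipe(sink).on("complete", resolve).on("error", reject);
});
```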
Go¶
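```go
import (
	"bytes"
	"log"

	"cloud.google.com/go/bigquery"
)

// client is a *bigquery.Client, ctx a context.Context,
// and csvBytes holds the CSV payload.
rs := bigquery.NewReaderSource(bytes.NewReader(csvBytes))
rs.SourceFormat = bigquery.CSV
rs.SkipLeadingRows = 1
rs.Schema = bigquery.Schema{
	{Name: "id", Type: bigquery.IntegerFieldType},
	{Name: "name", Type: bigquery.StringFieldType},
}

loader := client.Dataset("sales").Table("customers").LoaderFrom(rs)
loader.WriteDisposition = bigquery.WriteTruncate

job, err := loader.Run(ctx)
if err != nil {
	log.Fatal(err)
}
status, err := job.Wait(ctx) // blocks until the load job finishes
if err != nil {
	log.Fatal(err)
}
if err := status.Err(); err != nil { // the job ran but reported a failure
	log.Fatal(err)
}
```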
Java¶
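```java
import com.google.cloud.bigquery.*;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// `client` is an already-configured BigQuery instance.
Schema schema = Schema.of(
    Field.of("id", StandardSQLTypeName.INT64),
    Field.of("name", StandardSQLTypeName.STRING));

WriteChannelConfiguration cfg = WriteChannelConfiguration
    .newBuilder(TableId.of("my-project", "sales", "customers"))
    // skipLeadingRows is a CSV option, so it is set on CsvOptions.
    .setFormatOptions(CsvOptions.newBuilder().setSkipLeadingRows(1).build())
    .setSchema(schema)
    .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
    .build();

try (TableDataWriteChannel channel = client.writer(cfg)) {
    byte[] csv = "id,name\n1,alice\n2,bob\n".getBytes(StandardCharsets.UTF_8);
    channel.write(ByteBuffer.wrap(csv));
}  // closing the channel finishes the upload
```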
Supported formats¶
| `sourceFormat` | Multipart media `Content-Type` | Notes |
|---|---|---|
| `CSV` | `text/csv` | `skipLeadingRows`, `fieldDelimiter`, and `quote` are all honored. |
| `NEWLINE_DELIMITED_JSON` | `application/json` | `autodetect` flag honored. |
| `PARQUET` | `application/x-parquet` or `application/octet-stream` | Schema inferred from file. |
| `AVRO` | `application/avro` or `application/octet-stream` | Requires DuckDB's `avro` extension (G1, ADR 0027). |
| `ORC` | `application/x-orc` or `application/octet-stream` | Requires `pip install bqemulator[orc]` (G1, ADR 0027). |
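The table can be restated as a lookup that a test harness might use to check a request before sending it. This is a client-side sketch under the assumption that the media part's `Content-Type` should match the table; how strictly the emulator enforces the pairing is not stated here, and the names are hypothetical.

```python
# Accepted media Content-Types per sourceFormat, copied from the table above.
ACCEPTED_MEDIA_TYPES = {
    "CSV": {"text/csv"},
    "NEWLINE_DELIMITED_JSON": {"application/json"},
    "PARQUET": {"application/x-parquet", "application/octet-stream"},
    "AVRO": {"application/avro", "application/octet-stream"},
    "ORC": {"application/x-orc", "application/octet-stream"},
}


def media_type_allowed(source_format: str, content_type: str) -> bool:
    """Check a Content-Type header value against the table for a format."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    base = content_type.split(";", 1)[0].strip().lower()
    return base in ACCEPTED_MEDIA_TYPES.get(source_format, set())
```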
Operator configuration¶
| Setting | Default | Reason |
|---|---|---|
| `BQEMU_UPLOAD_MAX_BYTES` | 1 GiB | Total bytes per upload. The cap is hard: uploads larger than this are rejected with HTTP 400 (`invalidQuery`) before anything is written to disk. |
| `BQEMU_UPLOAD_SESSION_TTL_SECONDS` | 3600 | How long an idle resumable session is retained before eviction. |
| `BQEMU_UPLOAD_STAGING_DIR` | (system tempdir) | Where staging temp files live. Set this to a known disk in CI so that temp-file cleanup is predictable. |
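As an illustration of how these settings and their defaults fit together, the sketch below reads them from an environment mapping. It assumes the values are plain integers (bytes and seconds); the emulator's actual parsing is not documented here, and `UploadSettings` and `load_upload_settings` are hypothetical names.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class UploadSettings:
    max_bytes: int
    session_ttl_seconds: int
    staging_dir: str


def load_upload_settings(env: Mapping[str, str] = os.environ) -> UploadSettings:
    """Read the three upload settings, falling back to the documented defaults."""
    return UploadSettings(
        max_bytes=int(env.get("BQEMU_UPLOAD_MAX_BYTES", 1024 ** 3)),  # 1 GiB
        session_ttl_seconds=int(env.get("BQEMU_UPLOAD_SESSION_TTL_SECONDS", 3600)),
        staging_dir=env.get("BQEMU_UPLOAD_STAGING_DIR") or tempfile.gettempdir(),
    )
```

A CI job could, for instance, pass `{"BQEMU_UPLOAD_STAGING_DIR": "/ci/staging"}` to pin the staging location while leaving the other two at their defaults.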
Resumable protocol details¶
The resumable protocol is exposed as two phases that the client libraries already implement:
- **Initiate**: `POST /upload/bigquery/v2/projects/{p}/jobs?uploadType=resumable` with the `Job` resource as the JSON body. Response: `200 OK` with `Location: …?upload_id={session}` and `X-GUploader-UploadID: {session}` headers; empty body.
- **Chunk upload**: `PUT /upload/bigquery/v2/projects/{p}/jobs?upload_id={session}` with the file bytes as the body and `Content-Range: bytes {start}-{end}/{total}` declaring the chunk's byte range. Each non-final chunk returns `308 Resume Incomplete` with `Range: bytes=0-{last_received}`. The final chunk returns `200 OK` with the `Job` resource.
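The chunk phase is mostly bookkeeping over inclusive byte offsets. The sketch below splits a payload and computes the `Content-Range` value for each `PUT`; the function name is hypothetical, and the tiny chunk size in the usage note is for readability only (Google's resumable protocol expects non-final chunks to be multiples of 256 KiB).

```python
from typing import Iterator, Tuple


def iter_chunks(payload: bytes, chunk_size: int) -> Iterator[Tuple[str, bytes]]:
    """Yield (Content-Range value, chunk bytes) pairs for a resumable upload."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    total = len(payload)
    for start in range(0, total, chunk_size):
        chunk = payload[start:start + chunk_size]
        end = start + len(chunk) - 1  # Content-Range end offsets are inclusive
        yield f"bytes {start}-{end}/{total}", chunk
```

With a 10-byte payload and `chunk_size=4`, this produces `bytes 0-3/10`, `bytes 4-7/10`, and `bytes 8-9/10`; only the last request would receive `200 OK`.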
A client that loses track of the offset can probe the session with
PUT … Content-Range: bytes */{total} (no body); the server replies
308 with the Range header reflecting the current offset.
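The server side of this exchange, including the offset probe, can be modeled in a few lines. This is a toy in-memory model for reasoning about status codes, not the emulator's implementation; the class name is hypothetical, and omitting the `Range` header when no bytes have arrived is an assumption borrowed from Google's resumable protocol.

```python
import re
from typing import Dict, Tuple

# Matches "bytes {start}-{end}/{total}" and the probe form "bytes */{total}".
_CONTENT_RANGE = re.compile(r"bytes (?:(\d+)-(\d+)|\*)/(\d+)")


class UploadSession:
    """Toy model of one resumable upload session."""

    def __init__(self) -> None:
        self.received = bytearray()

    def put(self, content_range: str, body: bytes = b"") -> Tuple[int, Dict[str, str]]:
        match = _CONTENT_RANGE.fullmatch(content_range)
        if match is None:
            return 400, {}
        start, end, total = match.groups()
        if start is not None:
            # Accept a chunk only if it continues exactly at the current offset
            # and its declared range matches the body length.
            if int(start) != len(self.received) or int(end) - int(start) + 1 != len(body):
                return 400, {}
            self.received.extend(body)
        if len(self.received) == int(total):
            return 200, {}  # final chunk: the real response carries the Job
        headers = {}
        if self.received:
            headers["Range"] = f"bytes=0-{len(self.received) - 1}"
        return 308, headers
```

A client that forgot its position calls `put("bytes */10")` with no body and resumes from the byte after the reported range.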
Known limitations¶
- Session state is in-memory. A pod restart drops every in-progress upload; clients must restart from offset 0. See out-of-scope.md#durable-upload-session-state.
- `uploadType=media` is rejected. BigQuery itself rejects `media` for `jobs.insert`; the emulator mirrors the rejection. Use `multipart` or `resumable` instead.
- Multipart envelope is parsed via the stdlib `email` package. The boundary syntax follows RFC 2387 (`multipart/related`). Other multipart variants (`multipart/form-data`, etc.) are rejected.
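For readers unfamiliar with using `email` on HTTP bodies, here is a minimal sketch of splitting a `multipart/related` request into its job metadata and media parts. It is not the emulator's parser; the function name and the two-part assumption are illustrative.

```python
from email import message_from_bytes
from email.policy import HTTP


def split_multipart_related(content_type: str, body: bytes):
    """Return (metadata bytes, media Content-Type, media bytes)."""
    # The email parser expects headers, so prepend the request's Content-Type.
    raw = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n" + body
    message = message_from_bytes(raw, policy=HTTP)
    if message.get_content_type() != "multipart/related":
        raise ValueError("expected multipart/related")
    parts = list(message.iter_parts())
    if len(parts) != 2:
        raise ValueError("expected exactly two parts: job metadata and media")
    metadata, media = parts
    return (
        metadata.get_payload(decode=True),
        media.get_content_type(),
        media.get_payload(decode=True),
    )
```

A `multipart/form-data` request fails the content-type check here, which mirrors the rejection described in the list above.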
Runnable example¶
A complete runnable example lives at
docs/examples/local-file-load: a
single-file Python script that starts the emulator, runs the
multipart upload, queries the rows back, and asserts on the result.
The docs build runs the example in CI to prevent doc rot.