ADR 0029: Upload-host endpoints (G2: multipart + resumable)
- Status: Accepted
Context
The v1.0 competitor-parity workstream G2 closes the single row of
the goccy bigquery-emulator feature
matrix
where bqemulator silently lacked the canonical local-file load path:
| Gap | Before | After |
|---|---|---|
| G-4 Load from a local file (multipart / resumable upload) | ❌ no /upload/bigquery/v2/... routes existed | ✅ |
The four official BigQuery client libraries (Python, Node, Go, Java)
all route client.load_table_from_file(io.BytesIO(...)) style calls
through the upload host — a separate URL prefix
(https://bigquery.googleapis.com/upload/bigquery/v2/...) distinct
from the data-plane prefix
(https://bigquery.googleapis.com/bigquery/v2/...). Without those
routes, any caller using load_table_from_file got a 404 Not Found
back from the emulator's FastAPI app and had to manually convert the
load into a gs:// URI flow (Phase 3) or tabledata.insertAll flow
(Phase 2) — neither of which exercises the upload-protocol code path
that real production code uses.
The constraints to satisfy:
- Wire-format parity: three documented upload protocols
  (media / multipart / resumable) with the exact response shapes
  BigQuery emits: Location + X-GUploader-UploadID on resumable
  initiation; 308 Resume Incomplete + Range: bytes=0-N on partial
  chunks; final 200 with the Job resource on the last chunk.
- Security: the new REST surface accepts arbitrary file bytes;
  that's exactly the threat surface AGENTS.md's security-review
  non-negotiable was written for. Specifically, path traversal,
  size-cap overrun, multipart envelope injection, and
  Content-Length spoofing.
- Coverage: every new branch hits the ≥90% line+branch gate.
- Conformance shape: 12 HTTP corpus fixtures recorded against
  real BigQuery (or hand-authored against the emulator with a TODO
  to re-record when operator credentials are available). The
  recorder framework is extended to support binary request bodies
  via a sibling request.body.bin file.
- Four-language E2E: Python / Node / Go / Java suites each get two
  upload-protocol tests against a fresh container.
- Cleanup: temp files must not leak even when the executor raises
  mid-load.
Decisions
1. New router at /upload/bigquery/v2
src/bqemulator/api/routes/upload.py
hosts the four endpoints:
| Method | Path | Purpose |
|---|---|---|
| POST | /projects/{p}/jobs?uploadType=media | Reserved; rejected with 400 (mirrors real BQ; load jobs accept only multipart and resumable). |
| POST | /projects/{p}/jobs?uploadType=multipart | Single-shot multipart/related upload. |
| POST | /projects/{p}/jobs?uploadType=resumable | Initiate a resumable session; returns 200 + Location + X-GUploader-UploadID. |
| PUT | /projects/{p}/jobs?upload_id=<session> | Append a chunk or query session status; returns 308 partial or 200 final. |
Mounted alongside the existing data-plane jobs_router in
api/app.py. The
GzipRequestMiddleware chain leaves multipart and binary bodies
untouched (the gzip middleware only inflates Content-Encoding:
gzip bodies — verified against the upload tests).
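The PUT row's status logic can be illustrated with a minimal sketch. It is not the emulator's handler: the function name handle_chunk, the 400 on a malformed or out-of-order Content-Range, and the omission of the Range header when no bytes are stored yet are all assumptions made for illustration.

```python
import re

# Accepts "bytes 0-99/200", "bytes 0-99/*", and the status query "bytes */200".
_CONTENT_RANGE = re.compile(r"bytes (?:(\d+)-(\d+)|\*)/(\d+|\*)")


def handle_chunk(offset: int, content_range: str) -> tuple[int, int, dict[str, str]]:
    """Return (new_offset, status, headers) for one PUT on a resumable session.

    `offset` is the number of bytes the session has already stored.
    """
    match = _CONTENT_RANGE.fullmatch(content_range)
    if match is None:
        return offset, 400, {}
    first, last, total = match.groups()

    new_offset = offset
    if first is not None:
        # A data chunk must start exactly where the session left off.
        if int(first) != offset or int(last) < int(first):
            return offset, 400, {}
        new_offset = int(last) + 1

    if total != "*":
        if new_offset > int(total):
            return offset, 400, {}
        if new_offset == int(total):
            # Final chunk: a real handler would run the load and return the Job.
            return new_offset, 200, {}

    # Partial upload: report the stored byte range so the client can resume.
    headers = {"Range": f"bytes=0-{new_offset - 1}"} if new_offset else {}
    return new_offset, 308, headers
```

A status query (bytes */total) takes the same path as a data chunk but leaves the offset unchanged, which is why one PUT route can serve both purposes.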
2. Multipart parsing via the stdlib email package
The multipart/related MIME shape is structurally identical to the
multipart framing the stdlib email.parser.BytesParser already
handles. We do not take a dependency on python-multipart
because that package is for multipart/form-data (named-field,
positional metadata in Content-Disposition) — a completely
different wire format from multipart/related (positional parts,
type-distinguished). Mixing them would either misparse the BQ
client's upload envelopes or require a per-format dispatch.
The parser walks the two parts in declaration order. The first part
must declare a JSON content type and parses as the Job resource
(specifically configuration.load). The second part carries the
media bytes; its declared Content-Type is validated against a
whitelist (application/octet-stream, text/csv,
application/json, application/avro, application/x-parquet,
application/x-orc, plus a few common variants) so the server
can't be coerced into materialising an arbitrary MIME envelope
inside the staging directory.
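A minimal sketch of this two-part walk, assuming the request's Content-Type header and raw body are already available as a string and bytes. The function name parse_multipart_related and the ValueError messages are hypothetical, and the whitelist below lists only the six types named above, not the "common variants".

```python
import json
from email.parser import BytesParser

# Only the media types named in this ADR; the extra variants are omitted here.
ALLOWED_MEDIA_TYPES = frozenset({
    "application/octet-stream", "text/csv", "application/json",
    "application/avro", "application/x-parquet", "application/x-orc",
})


def parse_multipart_related(content_type: str, body: bytes) -> tuple[dict, bytes]:
    """Split a multipart/related body into (job resource, media bytes)."""
    # BytesParser expects a complete MIME message, so the request's
    # Content-Type header is prepended as a synthetic header block.
    header = f"Content-Type: {content_type}\r\n\r\n".encode("ascii")
    message = BytesParser().parsebytes(header + body)
    if message.get_content_type() != "multipart/related" or not message.is_multipart():
        raise ValueError("expected a multipart/related body")

    parts = message.get_payload()  # list of sub-messages, in declaration order
    if len(parts) != 2:
        raise ValueError("expected exactly two parts")
    metadata_part, media_part = parts

    if metadata_part.get_content_type() != "application/json":
        raise ValueError("first part must declare a JSON content type")
    if media_part.get_content_type() not in ALLOWED_MEDIA_TYPES:
        raise ValueError("media content type is not whitelisted")

    job = json.loads(metadata_part.get_payload(decode=True))
    if "load" not in job.get("configuration", {}):
        raise ValueError("job resource lacks configuration.load")
    return job, media_part.get_payload(decode=True)
```

get_payload(decode=True) returns bytes and undoes any Content-Transfer-Encoding the part declares, so the media part needs no separate decoding step.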
3. In-memory resumable session manager
src/bqemulator/jobs/upload_session_manager.py
holds the active sessions in a thread-safe dict. Each session owns
a per-session temp file under Settings.upload_staging_dir (or the
system tempdir if unset). The manager enforces three security
boundaries:
- upload_id shape: ^[A-Za-z0-9_-]{8,64}$. The value flows into a
  filesystem path, so a strict character set closes the
  path-traversal blast radius. Anything outside the pattern returns
  404 rather than 400, because leaking "you matched the wrong
  pattern" gives the attacker information. UUID hex output already
  conforms.
- upload_max_bytes cap: checked on every append by computing the
  prospective new offset; an over-cap chunk raises before any bytes
  touch disk. The declared Content-Range total is also validated.
- TTL eviction: sessions older than
  Settings.upload_session_ttl_seconds (default 1 hour) are evicted
  lazily on the next call that touches the manager. Eviction unlinks
  the staging file, preventing orphaned bytes from accumulating on a
  CI runner.
Session state is process-local. A restart drops every in-progress
upload — see the new "Durable upload session state" section in
out-of-scope.md, mirroring the
existing "Durable Storage Write API stream state" exclusion (ADR
0013).
4. Temp-file materialisation, then call execute_load_job
Once the multipart body is decoded (single-shot) or the final
resumable chunk arrives, the handler synthesises a file://
sourceUris entry pointing at the staging path and invokes
execute_load_job with the same configuration the client supplied.
The executor's existing _resolve_uri handles file:// URIs
uniformly — no new code paths in the load executor itself; the
upload host is purely a translation layer.
Cleanup runs in a finally arm regardless of load outcome
(success, executor exception, request abort). An integration test
drives a deliberate schema-mismatch failure and asserts the staging
directory is empty afterward (pinned at
test_temp_file_cleaned_up_on_load_failure).
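The materialise-then-clean-up flow can be sketched as below. The wrapper name load_from_bytes is hypothetical, and execute_load_job is passed in as a plain callable taking a configuration dict, which is an assumption; the real executor's signature is not given in this ADR.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable, Optional


def load_from_bytes(
    media: bytes,
    load_config: dict,
    execute_load_job: Callable[[dict], dict],  # stand-in for the real executor
    staging_dir: Optional[str] = None,  # None falls back to the system tempdir
) -> dict:
    """Stage the media bytes, run the load against a file:// URI, always clean up."""
    fd, name = tempfile.mkstemp(dir=staging_dir, suffix=".upload")
    staging_path = Path(name)
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(media)
        # Same configuration the client supplied, plus the synthesised source URI.
        config = dict(load_config, sourceUris=[staging_path.resolve().as_uri()])
        return execute_load_job(config)
    finally:
        # Runs on success and on any exception raised by the executor.
        staging_path.unlink(missing_ok=True)
```

Because the unlink sits in the finally arm, an executor exception still propagates to the caller; only the staging file is guaranteed to be gone.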
5. Settings additions
| Setting | Default | Reason |
|---|---|---|
| upload_max_bytes | 1 GiB | BigQuery's production cap is 5 TiB; the emulator default keeps CI runs bounded. |
| upload_session_ttl_seconds | 3600 (1 hour) | Long enough that a slow uploader on a flaky network finishes; short enough that a leaked session doesn't accumulate disk forever. |
| upload_staging_dir | None (system tempdir) | Operators with persistent data dirs can pin sessions under that dir; the default plays well with ephemeral containers. |
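For illustration only, the three settings and their defaults could be modelled as a small dataclass. The class name UploadSettings and the validation in __post_init__ are hypothetical; how the project's Settings object actually declares and validates these fields is not shown in this ADR. The 24-hour TTL ceiling is the cap noted under Consequences.

```python
from dataclasses import dataclass
from typing import Optional

MAX_SESSION_TTL_SECONDS = 24 * 60 * 60  # 24-hour cap stated in this ADR


@dataclass(frozen=True)
class UploadSettings:
    upload_max_bytes: int = 1024 ** 3           # 1 GiB
    upload_session_ttl_seconds: int = 3600      # 1 hour
    upload_staging_dir: Optional[str] = None    # None means the system tempdir

    def __post_init__(self) -> None:
        if self.upload_max_bytes <= 0:
            raise ValueError("upload_max_bytes must be positive")
        if not 0 < self.upload_session_ttl_seconds <= MAX_SESSION_TTL_SECONDS:
            raise ValueError("upload_session_ttl_seconds must be in (0, 86400]")
```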
Consequences
Positive
- The Python load_table_from_file(io.BytesIO(...)) idiom works
  end-to-end against the emulator. Same for the Node
  table.load(stream), Go Loader.From(reader), and Java
  BigQuery.writer(...) APIs.
- The four-language E2E suite now exercises the standard load path
  rather than the synthetic gs:// flow, so future load-protocol
  regressions in the upstream client libraries surface in CI rather
  than at user runtime.
- The single goccy gap left on the load axis (G-4) is closed; the
  remaining 12 gaps are independent of the load surface.
Negative
- New attack surface: the upload host accepts arbitrary file bytes.
  Mitigated by (a) the upload_max_bytes size cap enforced before
  disk write, (b) the upload_id character-set validator, (c) the
  multipart media-type whitelist, (d) staging directory ownership
  and permissions inherited from the OS tempdir. /security-review
  ran against this PR (see commit log); no unresolved findings.
- Upload session state is in-memory only. A pod restart mid-upload
  forces the client to restart the upload from offset 0. Operators
  running long-running CI emulator instances may want to pin
  upload_session_ttl_seconds higher; the cap is 24 hours.
Neutral
- No new third-party dependencies. Multipart parsing uses the
  stdlib email package; in-memory state uses threading.Lock
  (already in the project for WriteStreamManager lifecycle).
Alternatives considered
- Proxy upload requests to a real GCS-compatible emulator
  (e.g., fake-gcs-server). Rejected: adds an external process to
  the container, breaks the offline-first charter, and provides no
  additional fidelity since the load executor already speaks
  file:// URIs natively.
- Decline upload support entirely and document the workaround.
  Rejected: every user who upgrades the emulator hits the same 404,
  then has to rewrite their load code. The cost of supporting the
  protocol is a single router file plus a session manager; the cost
  of forcing every user to rewrite is paid forever.
- Persist upload session state to disk. Rejected per
  out-of-scope.md's ephemeral-by-default precedent. The Storage
  Write API exclusion (ADR 0013) sets the same expectation for the
  gRPC streaming surface; the upload host's in-memory semantics are
  consistent.
- Hand-roll a multipart/related parser. Rejected: the stdlib
  email.parser.BytesParser handles the RFC 2387 framing correctly
  (including nested boundaries, content-transfer-encoding, and the
  terminating --<boundary>-- line). Re-implementing it is wasteful
  and error-prone.