Testing strategy¶

bqemulator enforces a 7-tier testing pyramid. Coverage threshold is ≥90% line and ≥90% branch, enforced by CI.

Tier	What	Per-PR?	Other invocation
1 Unit	Pure domain, no I/O	✅ via `ci.yml` (Python 3.11/3.12/3.13 × ubuntu/macos/windows)	`make test-unit`
2 Property	Hypothesis invariants	✅ via `ci.yml`	`make test-property`
3 Integration	Emulator in-process + Python client	✅ via `ci.yml`	`make test-integration`
4 E2E	Live container + five conformance clients (Python / Node / Go / Java / `bq` CLI)	✅ via `e2e.yml`	`make test-e2e`
5 Conformance	Replay baselines recorded against real BigQuery	✅ via `ci.yml` (replay-only; recording is a deliberate operator action)	`make test-conformance`; `make record-conformance`
6 Performance	pytest-benchmark; per-arch baseline	manual-only — `perf.yml` `workflow_dispatch`	`make test-perf`
7 Chaos	Fault injection + resource exhaustion + crash recovery	manual-only — `chaos.yml` `workflow_dispatch`	`make test-chaos`

Three ancillary tiers run alongside the pyramid as comparison gates (each ships as a manual-only workflow_dispatch workflow per the deferred-cadence CI policy — the per-PR vs nightly vs release-gate decision is deferred until post-repo-setup when there is real CI traffic to measure runtime / flakiness / runner-cost trade-offs against):

Sibling tier	Workflow	Local target	ADR
Mutation testing	`mutation.yml`	`make test-mutation`	ADR 0026
Differential (row-order perturbation of corpus)	`differential.yml`	`make test-differential`	ADR 0028
Fuzz (Atheris coverage-guided)	`fuzz.yml`	`make test-fuzz`	ADR 0031

Tier 1 — Unit (`tests/unit/`)¶

Pure domain, no I/O. Target <10s total runtime. Covers:

Type mapping, Arrow bridge
SQL translation rules (each rule gets its own test)
Catalog repository contracts (against the in-memory implementation)
Job state machine
Domain error → ErrorProto rendering

Tier 2 — Property (`tests/property/`)¶

Hypothesis-driven. Invariants rather than examples:

SQL translation never crashes — always returns Ok(sql) or Err(error).
Type round-trips preserve semantics (BQ → Arrow → DuckDB → Arrow → BQ).
Arrow bridge handles all supported types for arbitrary values.
Scripting interpreter preserves lexical scope under arbitrary nesting.

Tier 3 — Integration (`tests/integration/`)¶

Emulator in-process (pytest fixture) + official Python client:

REST CRUD workflows
Storage Read / Write API flows
Error-shape compatibility with the Python client's parsing

Tier 4 — E2E against live containers (`tests/e2e/`)¶

The user-mandated bar. Testcontainers spins a built-from-source image; each of the five conformance clients (Python, Node, Go, Java, bq CLI) runs the full scenario set against it.

Scenario coverage enumerated in the architecture overview and the CHANGELOG.

Tier 5 — Conformance (`tests/conformance/`)¶

Replays recorded baselines against the emulator with row-for-row, type-aware tolerance. The corpus ships 1276 active fixtures — 1202 SQL + 48 HTTP + 26 gRPC. 12 documented XFAILs are pinned as permanent design-decision divergences in tests/conformance/divergences.py.

The corpus includes full TPC-H (22/22 queries) and a 70-of-99 TPC-DS subset. The remaining 29 TPC-DS queries are tracked in tpcds-expansion-plan.md, which lists the missing queries (numerical order, with complexity hints), the per-query authoring recipe, BigQuery adaptation patterns, and cost guardrails. Replay is per-PR in CI (no external credentials needed — the baselines are committed). Re-recording is the deliberate operator action make record-conformance, gated on GOOGLE_APPLICATION_CREDENTIALS + BQEMU_CONFORMANCE_PROJECT.

Tier 6 — Performance (`tests/perf/`)¶

pytest-benchmark. Five scenario files / 19 benchmarks: cold-start (containerized), query latency (TPC-H SF0.01 Q1/Q3/Q5/Q6/Q10), Storage Read Arrow throughput, insertAll throughput (batches of 1 / 10 / 100 / 1000), Storage Write throughput (4 stream types x 2 payload formats). Per-arch baselines at tests/perf/baselines/<arch>.json (one per linux-x86_64 / linux-arm64 / darwin-arm64); CI compares each run against the committed baseline with --benchmark-compare-fail=median:10%. A regression > 10% on any single benchmark fails the release. The design contract is locked in ADR 0025.

Tier 7 — Chaos (`tests/chaos/`)¶

Deliberately disruptive. Each test injects a real failure (resource exhaustion, crash, network drop, race) and asserts the emulator either preserves invariants or fails in a clean, documented way. Chaos runs manual-only in CI via the chaos.yml workflow (workflow_dispatch); the cadence-migration decision is deferred per the deferred-cadence CI policy chaos / perf / mutation / differential / fuzz all share. Locally: make test-chaos.

Five categories, one file each under tests/chaos/:

Category	File	What it injects
Concurrency	`test_concurrency.py`	100+ readers on stale MVs; retry storms (1000 threads); mixed read/write contention with time-travel
Resource	`test_resource_exhaustion.py`	Disk-full during EXPORT/COPY; memory cap during Arrow batch; FD exhaustion under many gRPC streams
Crash	`test_crash_recovery.py`	`kill -9` mid-AppendRows, mid-DDL; gRPC client cancellation mid-stream
Storage	`test_storage_failures.py`	Two emulators racing same `data_dir`; spatial extension missing; migration rollback
Network	`test_network_failures.py`	gRPC server-side stream cancellation; slow client back-pressure on ReadRows; connection drop during BatchCommit

Rules:

Flaky chaos tests are not tolerated. Every scenario must be deterministic given the test seed.
Chaos tests must assert one of: (a) invariant preserved despite fault (e.g., offset monotonicity under retry storms), or (b) clean documented failure (e.g., InternalError with row identity), or © graceful degradation (e.g., snapshot-level isolation under concurrent writers).
Chaos tests use pytest-timeout to cap runaway scenarios at 60s each.
The chaos tier ships 18 passing scenarios plus 1 documented environment-conditional skip (the spatial-extension-offline scenario; its deterministic unit-tier counterpart is in tests/unit/storage/test_engine_spatial.py). Catalog-hydration robustness and concurrent-writer contention are also covered inline elsewhere. The design contract for the tier is documented in ADR 0021.

Differential tier (Tier 5 sibling)¶

pytest tests/conformance/test_corpus_row_order_perturbed.py -m differential. Re-runs the conformance corpus with every INSERT … VALUES (…), (…), … tuple list reversed in setup.sql and asserts the emulator's output still matches the recorded baseline under canonical row sorting. Catches the fixture-specific-shortcut bug class: emulator logic that accidentally happens to be correct on the recorded data and wrong on permuted data (e.g., a LIMIT N shortcut that picks the first row in DuckDB's storage order, which happens to match BigQuery's storage order on the recorded dataset).

The tier exercises ~77 of the ~1202 SQL fixtures; the remaining fixtures are skipped because their queries use BigQuery-documented order-sensitive contracts (ORDER BY, LIMIT, ARRAY_AGG / STRING_AGG / window functions without explicit OVER ORDER BY, TABLESAMPLE). The skip rules are conservative — false- positive divergences would drown the genuine shortcut-bug signal.

v1.0 ships row-order perturbation only (mode A). Mode B (value-shift) and mode C (schema-reorder) require operator BigQuery time to re-record perturbed-sibling fixtures and are deferred to v1.0.x.

The differential workflow at differential.yml ships as workflow_dispatch only — the gating / cadence decision (per-PR vs nightly vs release-gate) is deferred until post-repo-setup when there is real CI traffic to measure runtime / flakiness / runner-cost trade-offs. The design contract — the perturbation taxonomy (A / B / C), eligibility rules, comparator behaviour, skip-list policy, and triage protocol on divergence — is locked in ADR 0028.

The differential tier is intentionally not numbered into the seven-tier pyramid above — it sits alongside Tier 6 (performance), Tier 7 (chaos), and the mutation tier as a comparison gate whose unit of analysis is a delta from a stored baseline (here: the recorded expected.json).

Fuzz tier (Tier 2 sibling)¶

python fuzz/fuzz_sql_translator.py … / …fuzz_dyn_proto.py / …fuzz_arrow_bridge.py — three Atheris coverage-guided harnesses covering the project's three highest-attack-surface translator- input boundaries:

Harness	Surface	Entry point
`fuzz_sql_translator.py`	SQL translator	`SQLTranslator.translate`
`fuzz_dyn_proto.py`	Storage Write API dynamic protobuf	`ProtoRowDecoder.decode`
`fuzz_arrow_bridge.py`	Arrow REST-JSON bridge + Arrow IPC deserialiser	`bq_rows_to_arrow` + `deserialize_arrow_rows`

Each harness's:func:TestOneInput asserts the baseline contract: any uncaught Python exception that is NOT a documented domain error (or the parser-specific upstream error class — DecodeError for protobuf, ArrowInvalid / ValueError for Arrow) is a bug. The fuzz tier is the only tier in the project that exercises translator inputs nobody hand-authored — the long tail of malformed UTF-8, unbalanced syntactic tokens, oversized arrays, mis-typed protobuf fields, and Arrow buffers with bogus length prefixes that a human-authored fixture catalogue cannot enumerate.

Tool choice is Atheris 3.0.0 (Google's CPython binding for libFuzzer). Atheris supports Python 3.11/3.12/3.13 — matching the ci.yml per-PR matrix. The dev-box's Python 3.14 is NOT supported yet; make test-fuzz prints a remediation message routing the operator to a 3.13 venv.

The committed fuzz/corpus/ tree carries one seed per major translator branch (canonical SQL shapes; a representative populated proto wire-bytes blob; a valid Arrow IPC stream + the documented zero-row / garbage shapes). Atheris's coverage-guided mutation expands from there.

The fuzz.yml workflow ships with workflow_dispatch only — the gating / cadence decision (per-PR vs nightly vs release-gate vs stays-manual) is deferred until post-repo-setup when there is real CI traffic to measure runtime / flakiness / runner-cost trade-offs. Each harness runs for 10 minutes in CI (30 minutes total wall time); make test-fuzz runs for 60 seconds per harness locally. The design contract — surface enumeration, tool choice rationale, baseline-contract invariant, per-harness time budget, no-skip-list discipline, and triage protocol on crash — is locked in ADR 0031.

The fuzz tier is intentionally not numbered into the seven-tier pyramid above — it sits alongside the differential (Tier 5 sibling) and mutation tiers as a comparison gate whose sampling discipline (coverage-guided libFuzzer mutation) differs from the pyramid tiers' "one invariant per test" structure. It shares the property-tier (Tier 2) discipline rather than asserting fresh invariants.

Mutation testing¶

mutmut runs manual-only (workflow_dispatch) in the mutation.yml workflow against a curated pilot scope (nine pure-domain modules, ~1 800 LOC). The cadence migration (per-PR scoped vs nightly vs release-gate) is deferred per the deferred-cadence CI policy chaos / perf / mutation / differential / fuzz all share. The committed baseline lives at tests/mutation/baseline.json; the regression gate fails the release when the live mutation score drops more than 2 percentage points below it. Re-baselining is the operator action make test-mutation-baseline.

The mutation tier is intentionally not numbered into the seven-tier pyramid above — it sits alongside Tier 6 (performance) as a comparison gate whose unit of analysis is a delta from a stored baseline. The design contract — pilot scope, score formula (killed / (killed + survived), excluding no_tests / skipped / timeout / suspicious), cadence, and the v1.0.x scope-expansion plan — is locked in ADR 0026.

Determinism¶

Clock protocol (default SystemClock; tests inject FrozenClock).
IdGenerator protocol (default UUID4; tests inject deterministic sequences).
Fixed random seeds; Hypothesis uses explicit per-test seeds.

Never do¶

Mock DuckDB in integration or e2e tests.
Skip a client language for an e2e scenario.
Merge a PR that drops coverage below 90%.
Add a test that sometimes fails ("flaky") without an issue marking it to be fixed.

Testing strategy¶

Tier 1 — Unit (tests/unit/)¶

Tier 2 — Property (tests/property/)¶

Tier 3 — Integration (tests/integration/)¶

Tier 4 — E2E against live containers (tests/e2e/)¶

Tier 5 — Conformance (tests/conformance/)¶

Tier 6 — Performance (tests/perf/)¶

Tier 7 — Chaos (tests/chaos/)¶