ADR 0026: Mutation-tier design contract

Status: Accepted

Context

The docs/architecture/testing-strategy.md document defines seven tiers of automated tests plus two independent gates (fuzz and mutation) that sit alongside the pyramid. Tiers 1–7 have shipped; the Phase 11 roadmap doc lists "mutation score tracked; regressions >2 points block release" as a v1.0 ship criterion. That criterion has been aspirational since Phase 0 — no baseline was ever recorded, so the gate could not actually fail.

Two operational facts make mutation testing fundamentally different from the seven-tier pyramid:

A mutation score with no baseline is meaningless. A run that ends "killed=420, survived=12, score=97.2%" is not actionable without a prior reference point. Like Tier 6 (performance), Tier mutation's unit of analysis is a comparison, not an assertion.
First-run cost dwarfs steady-state cost. Mutmut applies thousands of source-tree mutations and runs the unit suite once per mutant. The first run takes hours; subsequent runs reuse the mutants/ cache and only re-test mutants whose source neighbourhood changed. The CI cadence has to match that asymmetry.

This ADR captures the contract every mutation run must honour. The two-point regression threshold was already set by the Phase 11 doc and the v1-confidence-plan; this ADR locks in the scope, storage, and gate plumbing.

Decisions

1. The mutation gate is a v1.0 ship criterion, not a pyramid tier

Tiers 1–7 are listed in docs/architecture/testing-strategy.md. Mutation testing is intentionally not numbered into the pyramid — it sits alongside Tier 6 (performance) as an independent comparison gate whose unit of analysis is a delta from a stored baseline, not an intrinsic assertion. The fuzz harness (P3.c) will take the same shape when it lands.

The gate fails the release when the mutation score drops more than 2 percentage points below the committed baseline. That bound is the Phase 11 contract; this ADR codifies it in scripts/check_mutation_baseline.py and wires it into make test-mutation.
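The threshold rule can be sketched as a single predicate. This is a minimal illustration, not the contents of scripts/check_mutation_baseline.py; the function name `is_regression` and the parameter `max_drop_pp` are hypothetical, and the scores used are made-up numbers.

```python
def is_regression(baseline_score: float, current_score: float,
                  max_drop_pp: float = 2.0) -> bool:
    """Return True when the score fell more than max_drop_pp points below baseline.

    Scores are percentages (0-100). An improvement, or a drop that does not
    exceed the bound, is not a regression.
    """
    return (baseline_score - current_score) > max_drop_pp


# Illustrative numbers only; they are not the project's recorded scores.
print(is_regression(90.0, 88.5))  # drop of 1.5 points -> False
print(is_regression(90.0, 87.0))  # drop of 3.0 points -> True
print(is_regression(90.0, 93.0))  # improvement -> False
```

The comparison is one-sided by design: only a drop below the baseline fails the gate, so a rising score never blocks a release.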

2. v1.0 ships a pilot scope; broader scope is v1.0.x

Mutmut applies ~5–10 mutants per non-comment LOC. The full src/bqemulator/ source tree (after the structural exclusions in decision 3) is ~27 000 lines spread over 133 modules. A first run would burn 10+ hours of wall-clock, and ~80% of the resulting surviving mutants would be either in code exercised only through the integration / e2e tiers (slow per-mutant) or in framework-driven modules whose mutants are functionally equivalent (FastAPI routes, gRPC servicers, Click decorators). Scoring on uncoverable surface inflates the baseline and produces a gate that flakes on coverage churn alone.

The v1.0 pilot scope mutates nine modules — pure-domain, deterministic, with strong direct unit-test coverage:

Module	LOC	Tests
`src/bqemulator/catalog/etag.py`	40	`tests/unit/catalog/test_etag.py`
`src/bqemulator/sql/cache.py`	136	`tests/unit/sql/test_cache.py`
`src/bqemulator/scripting/lexer.py`	299	`tests/unit/scripting/test_lexer.py`
`src/bqemulator/scripting/frames.py`	128	`tests/unit/scripting/test_frames.py`
`src/bqemulator/scripting/exceptions.py`	65	`tests/unit/scripting/test_exceptions.py`
`src/bqemulator/scripting/ast.py`	180	(exercised through `test_parser.py` + `test_interpreter.py`)
`src/bqemulator/jobs/error_mapper.py`	420	`tests/unit/jobs/test_error_mapper.py`
`src/bqemulator/types/interval.py`	370	`tests/unit/types/test_interval.py`
`src/bqemulator/types/range_type.py`	157	`tests/unit/types/test_range_type.py`

These nine modules collectively own the deterministic, framework-free invariants the project rests on (ETag stability, LRU eviction, scripting lexer / frame semantics, INTERVAL & RANGE arithmetic, DuckDB → BigQuery error mapping). A regression in any of them silently breaks the wire-shape contracts the v1.0 ship criteria depend on — exactly the surface where mutation testing beats line coverage.

The v1.0.x roadmap entry for "mutation scope expansion" sweeps in the next concentric ring — catalog/memory_repository.py, sql/translator.py, sql/rules/*, versioning/, row_access/policy.py, udf/, storage/arrow_bridge.py — once the pilot's CI cadence has proven out.

3. Hard-excluded surfaces (`do_not_mutate`-equivalent)

Even when scope expands in v1.0.x, the following surfaces are permanently out of the mutation tier:

Path	Reason
`src/bqemulator/grpc_api/proto/`	Generated protobuf stubs; no semantic logic to mutate.
`src/bqemulator/observability/`	structlog / OTel / Prometheus wiring — mutants flip logger names or counter labels, not behaviour.
`src/bqemulator/testing/`	Test fixtures and helpers exercised through the e2e / CI tiers, not `tests/unit/`.
`src/bqemulator/api/routes/`	FastAPI route handlers — exercised primarily through `tests/integration`; per-mutant runtime is dominated by ASGI startup and most surviving mutants are wire-shape edge cases unit tests can't pin.
`src/bqemulator/grpc_api/` (excluding proto)	gRPC servicers — exercised through `tests/integration`; same wall-clock problem as routes.
`src/bqemulator/server.py` / `__main__.py` / `cli.py`	Process bootstrap, uvicorn glue, Click decorators — many equivalent mutants on argv handling.

The pilot scope is intentionally a strict subset of "everything not on this list."
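The subset relationship can be checked mechanically. The sketch below copies the paths from the two tables above; the helper name `is_excluded` is hypothetical, and it assumes that `__main__.py` and `cli.py` live directly under src/bqemulator/ alongside server.py.

```python
# Hard-excluded surfaces from decision 3. The grpc_api/ prefix covers both
# the generated proto stubs and the servicers.
EXCLUDED_PREFIXES = (
    "src/bqemulator/grpc_api/",
    "src/bqemulator/observability/",
    "src/bqemulator/testing/",
    "src/bqemulator/api/routes/",
    "src/bqemulator/server.py",
    "src/bqemulator/__main__.py",  # assumed location
    "src/bqemulator/cli.py",       # assumed location
)

# The nine pilot modules from decision 2.
PILOT_MODULES = (
    "src/bqemulator/catalog/etag.py",
    "src/bqemulator/sql/cache.py",
    "src/bqemulator/scripting/lexer.py",
    "src/bqemulator/scripting/frames.py",
    "src/bqemulator/scripting/exceptions.py",
    "src/bqemulator/scripting/ast.py",
    "src/bqemulator/jobs/error_mapper.py",
    "src/bqemulator/types/interval.py",
    "src/bqemulator/types/range_type.py",
)


def is_excluded(path: str) -> bool:
    """Return True when the path falls under a hard-excluded surface."""
    return path.startswith(EXCLUDED_PREFIXES)


# Any pilot module that lands on the exclusion list is a scope error.
violations = [path for path in PILOT_MODULES if is_excluded(path)]
print(violations)  # []
```

A check of this kind would be most useful when v1.0.x widens the scope, since new entries are the ones most likely to collide with an excluded prefix.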

4. Score formula

Score = killed / (killed + survived) expressed as a percentage. Both no_tests mutants (mutmut couldn't find a test that touches the line at all) and skipped mutants are excluded from the denominator. A no_tests mutant reflects a coverage-tier gap, not a test-tier weakness; counting them would inflate or deflate the score on coverage churn alone.

timeout and suspicious mutants are also excluded — both are infrastructure signals (the test runner hit its budget; the runner produced an inconsistent exit code) rather than test-quality signals.

The mapping is:

mutmut status	Counted in numerator?	Counted in denominator?
`killed`	yes	yes
`survived`	no	yes
`no_tests`	no	no
`skipped`	no	no
`timeout`	no	no
`suspicious`	no	no
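The mapping table translates directly into code. The following is a minimal sketch under stated assumptions: the names `STATUS_RULES` and `mutation_score` are hypothetical, rounding to two decimals is a choice made here, and raising on an empty denominator is an assumption since the ADR does not say how that case is handled. The counts are illustrative.

```python
# status -> (counted in numerator, counted in denominator), per the table above
STATUS_RULES = {
    "killed": (True, True),
    "survived": (False, True),
    "no_tests": (False, False),
    "skipped": (False, False),
    "timeout": (False, False),
    "suspicious": (False, False),
}


def mutation_score(counts: dict[str, int]) -> float:
    """Compute killed / (killed + survived) as a percentage."""
    numerator = sum(counts.get(status, 0)
                    for status, (in_num, _) in STATUS_RULES.items() if in_num)
    denominator = sum(counts.get(status, 0)
                      for status, (_, in_den) in STATUS_RULES.items() if in_den)
    if denominator == 0:
        # Assumption: an undefined score is an error rather than 0 or 100.
        raise ValueError("no killed or survived mutants; score is undefined")
    return round(100.0 * numerator / denominator, 2)


# Illustrative counts: no_tests and timeout do not move the score.
print(mutation_score({"killed": 90, "survived": 10, "no_tests": 50, "timeout": 3}))  # 90.0
print(mutation_score({"killed": 90, "survived": 10}))  # 90.0
```

Because excluded statuses never enter either sum, a swing in `no_tests` caused by coverage churn leaves the score unchanged, which is the property this decision is after.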

5. Baseline is committed; updates are a deliberate operator action

tests/mutation/baseline.json is committed to the repo. The file carries the recording date and the raw mutmut counts so a reader can audit drift. The committed shape:

{
  "score": 92.36,
  "killed": 423,
  "survived": 35,
  "no_tests": 9,
  "skipped": 0,
  "timeout": 0,
  "suspicious": 0,
  "total": 467,
  "run_at": "2024-05-19"
}
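A record of this shape could be assembled from raw counts as follows. This is a sketch only: the function name `build_baseline` is hypothetical, `total` is assumed to be the sum of the six status counts (which matches the example above), and returning 0.0 when nothing was scored is an assumption. The counts in the usage lines are illustrative.

```python
import json
from datetime import date

STATUSES = ("killed", "survived", "no_tests", "skipped", "timeout", "suspicious")


def build_baseline(counts: dict[str, int], run_at: date) -> dict:
    """Assemble a baseline record in the committed key order."""
    record = {status: int(counts.get(status, 0)) for status in STATUSES}
    scored = record["killed"] + record["survived"]
    # Assumption: an empty denominator is recorded as 0.0 rather than raising.
    score = round(100.0 * record["killed"] / scored, 2) if scored else 0.0
    total = sum(record.values())
    return {"score": score, **record, "total": total, "run_at": run_at.isoformat()}


# Illustrative counts, not the project's recorded run.
baseline = build_baseline({"killed": 90, "survived": 10, "no_tests": 5},
                          date(2024, 1, 1))
print(json.dumps(baseline, indent=2))
```

Keeping the raw counts next to the derived score lets a reviewer recompute the score from the committed file during re-baseline review.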

Re-baselining is a deliberate operator command:

make test-mutation-baseline   # overwrites tests/mutation/baseline.json

The forcing function mirrors performance baselines (ADR 0025) and conformance fixtures: a baseline drift is a code change that lands through review, not an automated diff in CI.

The committed JSON is the gate's reference. The surviving-mutant detail (mutants/mutmut-cicd-stats.json) is not committed — it leaks the implementation neighbourhood without adding signal. Triage data lives on the CI artefact attached to each nightly run.

6. Cadence: nightly, not on every commit

The full mutmut run takes 30+ minutes even on the pilot scope (and hours on the full scope when v1.0.x expands it). Wiring it into make verify would multiply the CI runtime roughly 30-fold for a comparison whose delta-from-baseline is mostly noise on a per-commit cadence: most PRs touch zero of the nine pilot modules.

The mutation.yml workflow runs nightly on main and uploads the surviving-mutant detail as a CI artefact. PRs that touch a pilot module can opt in via workflow_dispatch; the gate's >2pp check fires on both nightly and dispatched runs.

A regression detected by the nightly run blocks the next release tag — not the in-flight PR that introduced it — because mutation score is not a per-commit signal. The PR-level signal is the existing 90% line + branch coverage gate; the mutation gate catches quality drift that line coverage hides.

7. `mutate_only_covered_lines` is off (v1.0 pilot)

Mutmut's mutate_only_covered_lines = true option uses coverage.py to identify lines the unit suite actually executes, then mutates only those lines. On a larger scope it pays for itself by avoiding thousands of mutants on dead code. On the v1.0 pilot scope (nine modules, ~1 800 LOC, near-100% direct unit coverage) the saving is small.

Enabling it also carries a Python 3.14 compatibility hazard. Coverage's trace function interferes with DuckDB's C-extension submodule registration (_duckdb._sqltypes) when import duckdb runs under coverage.collect(); the failure surfaces as ModuleNotFoundError: No module named '_duckdb._sqltypes' only in the coverage-gather phase of mutmut run. The setting is off for the pilot to keep the toolchain stable on the project's reference Python (3.14).

The v1.0.x scope-expansion entry will re-evaluate enabling it once either (a) the upstream DuckDB / coverage.py interaction is fixed or (b) the v1.0.x scope includes enough uncovered lines that the filter pays for itself.

8. mutmut 3.x configuration (not 2.x)

The mutmut>=2.4 floor in pyproject.toml resolves to mutmut 3.5+ in modern Python environments. The 3.x config differs from 2.x in four ways the project depends on:

Aspect	2.x	3.x
`paths_to_mutate`	comma-separated string	list of strings
`tests_dir`	comma-separated string	list of strings
`runner`	shell command string	removed — pytest is invoked in-process
Working dir	`.mutmut-cache`	`mutants/` (a parallel copy of the source tree)

The [tool.mutmut] section in pyproject.toml reflects 3.x shape. also_copy is set to the full src/bqemulator/ tree so the package remains importable when pytest runs from inside mutants/ (mutmut 3.x copies only paths_to_mutate files by default, which would break import bqemulator.foo for any module outside the scope).

Consequences

Positive. The Phase 11 ship criterion "mutation score tracked; regressions >2 points block release" is now enforceable. Phase 11's contract was aspirational for nine months; this work recorded the baseline and wired the comparison.
Positive. Score-formula choice (excluding no_tests and skipped from the denominator) means the gate measures test-suite quality, not coverage. A drop in line coverage doesn't move the mutation score; only a drop in the suite's ability to distinguish good code from mutated code does.
Positive. A nightly CI cadence + per-PR coverage gate is cheaper than running both per-PR. Nightly is the right cadence for a comparison that takes 30+ minutes on the pilot scope (hours on the full scope) and whose value is delta-from-baseline.
Negative. A pilot scope of nine modules covers a small fraction of src/bqemulator/. The v1.0 ship criterion is met (the gate fires; the baseline is real), but the gate catches regressions only in the pilot surface. A regression in, e.g., sql/rules/spatial.py won't be caught by Tier mutation until v1.0.x sweeps the SQL-rule cluster into scope.
Negative. Mutmut 3.x's mutants/ working directory conflicts with the convention 2.x established (.mutmut-cache). The Makefile's clean target and .gitignore were updated in the same PR.
Negative. First-run wall clock is non-trivial. A nightly job budget of 30+ minutes is reasonable; a per-commit budget of 30+ minutes would block PR merge, which is why the gate is nightly.
Negative — local-recording limitation on macOS. Mutmut 3.5.0 calls multiprocessing.set_start_method('fork') at module level in mutmut/__main__.py and then uses os.fork() directly for each mutant. macOS prohibits fork() after the parent process has loaded native extensions that touch the Objective-C runtime (DuckDB, pyarrow, structlog, mini-racer in this project); the child segfaults on first non-trivial work and the mutant is recorded as segfault. The effect: a local make test-mutation on a developer's macOS box records every mutant as segfault and yields a 0 % score. The mutation.yml workflow runs on Linux GitHub-Actions runners where fork() is not subject to the same restriction, so the nightly gate (and the record-baseline=true workflow_dispatch mode) work correctly. Re-baselining is therefore a CI-side operation: the operator dispatches the workflow with record-baseline=true, downloads the mutation-baseline artefact, and commits it. Until upstream mutmut accepts a fix to use spawn semantics for child processes (tracked in the v1.0.x mutation-scope expansion entry), this asymmetry stands.

Implementation notes

Pilot scope is encoded in pyproject.toml [tool.mutmut] paths_to_mutate.
mutmut>=2.4 floor is kept in the [dev] extra (the project uses 3.x in practice, but the floor is honest about minimum compatibility).
make test-mutation runs mutmut run + mutmut export-cicd-stats + the regression-check script and fails when the score drops >2pp.
make test-mutation-baseline re-runs and overwrites the committed baseline; intended for operator use after the test suite has materially expanded.
scripts/check_mutation_baseline.py reads mutants/mutmut-cicd-stats.json (mutmut's output) and compares to tests/mutation/baseline.json (committed).
.github/workflows/mutation.yml schedules the nightly run and uploads the surviving-mutant artefact for triage.

References

Mutation testing subsection in the Phase 11 roadmap doc
Mutation testing in the testing-strategy doc
ADR 0025 — Tier 6 perf gate is structurally analogous (compare-to-baseline, deliberate re-record, per-arch storage)
v1-confidence-plan workstream P4.a — this ADR closes the workstream