ADR 0026: Mutation-tier design contract
- Status: Accepted
Context
The docs/architecture/testing-strategy.md
document defines seven tiers of automated tests plus two
independent gates (fuzz and mutation) that sit alongside the
pyramid. Tiers 1–7 have shipped; the
Phase 11 roadmap doc lists
"mutation score tracked; regressions >2 points block release" as
a v1.0 ship criterion. That criterion has been aspirational since
Phase 0 — no baseline was ever recorded, so the gate could not
actually fail.
Two operational facts make mutation testing fundamentally different from the seven-tier pyramid:
- A mutation score with no baseline is meaningless. A run that ends "killed=420, survived=12, score=97.2%" is not actionable without a prior reference point. As with Tier 6 (performance), the mutation gate's unit of analysis is a comparison, not an assertion.
- First-run cost dwarfs steady-state cost. Mutmut applies
thousands of source-tree mutations and runs the unit suite once
per mutant. The first run takes hours; subsequent runs reuse
mutants/cache and only re-test mutants whose source neighbourhood changed. The CI cadence has to match that asymmetry.
This ADR captures the contract every mutation run must honour. The two-point regression threshold was already set by the Phase 11 doc and the v1-confidence-plan; this ADR locks in the scope, storage, and gate plumbing.
Decisions
1. The mutation gate is a v1.0 ship criterion, not a pyramid tier
Tiers 1–7 are listed in
docs/architecture/testing-strategy.md.
Mutation testing is intentionally not numbered into the
pyramid — it sits alongside Tier 6 (performance) as an independent
comparison gate whose unit of analysis is a delta from a stored
baseline, not an intrinsic assertion. The fuzz harness (P3.c) will
take the same shape when it lands.
The gate fails the release when the mutation score drops more than
2 percentage points below the committed baseline. That bound is the
Phase 11 contract; this ADR codifies it in
scripts/check_mutation_baseline.py
and wires it into make test-mutation.
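
The comparison itself is small enough to sketch. The following is a minimal illustration of the "more than 2 percentage points below the baseline" rule, not the contents of scripts/check_mutation_baseline.py; the function and constant names are hypothetical.

```python
# Hypothetical names; only the 2-point bound comes from the Phase 11 contract.
REGRESSION_THRESHOLD_PP = 2.0


def gate_passes(
    baseline_score: float,
    current_score: float,
    threshold_pp: float = REGRESSION_THRESHOLD_PP,
) -> bool:
    """Return True unless the score fell more than threshold_pp below baseline.

    Scores are percentages (0-100). A drop of exactly threshold_pp still
    passes, because the rule is "more than 2 points", and any improvement
    over the baseline passes trivially.
    """
    drop = baseline_score - current_score
    return drop <= threshold_pp
```

Under this reading, a run at 90.0 against a baseline of 92.0 passes, while a run at 89.9 fails.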
2. v1.0 ships a pilot scope; broader scope is v1.0.x
Mutmut applies ~5–10 mutants per non-comment LOC. The full
src/bqemulator/ source tree (after the structural exclusions in
decision 3) is ~27 000 lines spread over 133 modules. A first run
would burn 10+ hours of wall-clock time, and ~80% of the resulting
surviving mutants would sit either in code exercised only through the
integration / e2e tiers (slow per-mutant) or in framework-driven
modules whose mutants are functionally equivalent (FastAPI routes,
gRPC servicers, Click decorators). Scoring on that uncoverable surface
distorts the baseline and produces a gate that flakes on coverage
churn alone.
The v1.0 pilot scope mutates nine modules — pure-domain, deterministic, with strong direct unit-test coverage:
| Module | LOC | Tests |
|---|---|---|
| src/bqemulator/catalog/etag.py | 40 | tests/unit/catalog/test_etag.py |
| src/bqemulator/sql/cache.py | 136 | tests/unit/sql/test_cache.py |
| src/bqemulator/scripting/lexer.py | 299 | tests/unit/scripting/test_lexer.py |
| src/bqemulator/scripting/frames.py | 128 | tests/unit/scripting/test_frames.py |
| src/bqemulator/scripting/exceptions.py | 65 | tests/unit/scripting/test_exceptions.py |
| src/bqemulator/scripting/ast.py | 180 | (exercised through test_parser.py + test_interpreter.py) |
| src/bqemulator/jobs/error_mapper.py | 420 | tests/unit/jobs/test_error_mapper.py |
| src/bqemulator/types/interval.py | 370 | tests/unit/types/test_interval.py |
| src/bqemulator/types/range_type.py | 157 | tests/unit/types/test_range_type.py |
These nine modules collectively own the deterministic, framework-free invariants the project rests on (ETag stability, LRU eviction, scripting lexer / frame semantics, INTERVAL & RANGE arithmetic, DuckDB → BigQuery error mapping). A regression in any of them silently breaks the wire-shape contracts the v1.0 ship criteria depend on — exactly the surface where mutation testing beats line coverage.
The v1.0.x roadmap entry for "mutation scope expansion" sweeps in
the next concentric ring — catalog/memory_repository.py,
sql/translator.py, sql/rules/*, versioning/,
row_access/policy.py, udf/, storage/arrow_bridge.py —
once the pilot's CI cadence has proven out.
3. Hard-excluded surfaces (do_not_mutate-equivalent)
Even when scope expands in v1.0.x, the following surfaces are permanently out of the mutation tier:
| Path | Reason |
|---|---|
| src/bqemulator/grpc_api/proto/ | Generated protobuf stubs; no semantic logic to mutate. |
| src/bqemulator/observability/ | structlog / OTel / Prometheus wiring; mutants flip logger names or counter labels, not behaviour. |
| src/bqemulator/testing/ | Test fixtures and helpers exercised through the e2e / CI tiers, not tests/unit/. |
| src/bqemulator/api/routes/ | FastAPI route handlers, exercised primarily through tests/integration; per-mutant runtime is dominated by ASGI startup and most surviving mutants are wire-shape edge cases unit tests can't pin. |
| src/bqemulator/grpc_api/ (excluding proto) | gRPC servicers, exercised through tests/integration; same wall-clock problem as routes. |
| src/bqemulator/server.py / __main__.py / cli.py | Process bootstrap, uvicorn glue, Click decorators; many equivalent mutants on argv handling. |
The pilot scope is intentionally a strict subset of "everything not on this list."
4. Score formula
Score = killed / (killed + survived) expressed as a percentage.
Both no_tests mutants (mutmut couldn't find a test that touches
the line at all) and skipped mutants are excluded from the
denominator. A no_tests mutant reflects a coverage-tier gap, not
a test-tier weakness; counting them would inflate or deflate the
score on coverage churn alone.
timeout and suspicious mutants are also excluded — both are
infrastructure signals (the test runner hit its budget; the runner
produced an inconsistent exit code) rather than test-quality
signals.
The mapping is:
| mutmut status | Counted in numerator? | Counted in denominator? |
|---|---|---|
| killed | yes | yes |
| survived | no | yes |
| no_tests | no | no |
| skipped | no | no |
| timeout | no | no |
| suspicious | no | no |
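
The mapping above can be expressed as a short function. This is a sketch under assumptions: the `counts` dictionary keyed by mutmut status is an illustrative input shape, and raising on an empty denominator is a choice made here, not something this ADR specifies.

```python
def mutation_score(counts: dict[str, int]) -> float:
    """Return killed / (killed + survived) as a percentage.

    no_tests, skipped, timeout and suspicious are deliberately ignored:
    they appear in neither the numerator nor the denominator.
    """
    killed = counts.get("killed", 0)
    survived = counts.get("survived", 0)
    denominator = killed + survived
    if denominator == 0:
        # Assumption: an undefined score is treated as an error, not as 0 or 100.
        raise ValueError("no killed or survived mutants; score is undefined")
    return 100.0 * killed / denominator
```

For example, `mutation_score({"killed": 90, "survived": 10, "no_tests": 50, "timeout": 3})` returns `90.0`: the 53 excluded mutants do not move the result.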
5. Baseline is committed; updates are a deliberate operator action
tests/mutation/baseline.json is committed to the repo. It records
the run date alongside the raw mutmut counts so a
reader can audit drift. The committed shape:
{
"score": 92.36,
"killed": 423,
"survived": 35,
"no_tests": 9,
"skipped": 0,
"timeout": 0,
"suspicious": 0,
"total": 467,
"run_at": "2024-05-19"
}
Re-baselining is a deliberate operator command (make test-mutation-baseline).
The forcing function mirrors performance baselines (ADR 0025) and conformance fixtures: a baseline drift is a code change that lands through review, not an automated diff in CI.
The committed JSON is the gate's reference. The
surviving-mutant detail (mutants/mutmut-cicd-stats.json)
is not committed — it leaks the implementation neighbourhood
without adding signal. Triage data lives on the CI artefact
attached to each nightly run.
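
Because the baseline stores raw counts next to the score, a reviewer can check a proposed baseline for internal consistency before approving it. The helper below is a hypothetical sketch of such an audit; it assumes `total` is the sum of the six status counts, as the committed shape suggests, and that `score` is rounded to two decimals.

```python
# Hypothetical audit helper; not part of the repository described here.
COUNT_KEYS = ("killed", "survived", "no_tests", "skipped", "timeout", "suspicious")


def audit_baseline(baseline: dict) -> list[str]:
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []

    counted = sum(baseline[key] for key in COUNT_KEYS)
    if counted != baseline["total"]:
        problems.append(
            f"total is {baseline['total']} but the status counts sum to {counted}"
        )

    decided = baseline["killed"] + baseline["survived"]
    if decided:
        expected = round(100.0 * baseline["killed"] / decided, 2)
        # Tolerate rounding noise in the second decimal place.
        if abs(expected - baseline["score"]) > 0.01:
            problems.append(
                f"score is {baseline['score']} but the counts give {expected}"
            )
    return problems
```

An empty list means the recorded score and total agree with the raw counts; anything else is worth resolving in review before the new baseline lands.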
6. Cadence: nightly, not on every commit
The full mutmut run takes 30+ minutes even on the pilot scope (and
hours on the full scope when v1.0.x expands it). Wiring it into
make verify would multiply CI runtime roughly 30-fold for a comparison whose
delta-from-baseline is mostly noise on a per-commit cadence: most
PRs touch none of the nine pilot modules.
The
mutation.yml
workflow runs nightly on main and uploads the surviving-mutant
detail as a CI artefact. PRs that touch a pilot module can opt in
via workflow_dispatch; the gate's >2pp check fires on both
nightly and dispatched runs.
A regression detected by the nightly run blocks the next release tag — not the in-flight PR that introduced it — because mutation score is not a per-commit signal. The PR-level signal is the existing 90% line + branch coverage gate; the mutation gate catches quality drift that line coverage hides.
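
To decide whether a PR is worth an opt-in dispatch, the changed files can be intersected with the pilot scope. The helper below is a hypothetical illustration, not part of the workflow; how the list of changed paths is obtained (for example from a diff against main) is left to the caller.

```python
# The nine pilot modules from decision 2.
PILOT_MODULES = frozenset({
    "src/bqemulator/catalog/etag.py",
    "src/bqemulator/sql/cache.py",
    "src/bqemulator/scripting/lexer.py",
    "src/bqemulator/scripting/frames.py",
    "src/bqemulator/scripting/exceptions.py",
    "src/bqemulator/scripting/ast.py",
    "src/bqemulator/jobs/error_mapper.py",
    "src/bqemulator/types/interval.py",
    "src/bqemulator/types/range_type.py",
})


def touched_pilot_modules(changed_files: list[str]) -> list[str]:
    """Return the pilot modules present in changed_files, sorted for stable output."""
    return sorted(PILOT_MODULES.intersection(changed_files))
```

An empty result means the PR cannot move the pilot-scope mutation score through source changes, so a dispatched run would add little; changes to the unit tests of those modules are a separate case this sketch does not cover.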
7. mutate_only_covered_lines is off (v1.0 pilot)
Mutmut's mutate_only_covered_lines = true option uses
coverage.py to identify lines
the unit suite actually executes, then mutates only those lines.
On a larger scope it pays for itself by avoiding thousands of mutants
on dead code. On the v1.0 pilot scope (nine modules, ~1 800 LOC,
near-100% direct unit coverage) the saving is small.
Enabling it would also introduce a Python 3.14 compatibility hazard. Coverage's
trace function interferes with DuckDB's C-extension submodule
registration (_duckdb._sqltypes) when import duckdb runs
under coverage.collect(); the failure surfaces as
ModuleNotFoundError: No module named '_duckdb._sqltypes' only in
the coverage-gather phase of mutmut run. The setting is off for
the pilot to keep the toolchain stable on the project's reference
Python (3.14).
The v1.0.x scope-expansion entry will re-evaluate enabling it once either (a) the upstream DuckDB / coverage.py interaction is fixed or (b) the v1.0.x scope includes enough uncovered lines that the filter pays for itself.
8. mutmut 3.x configuration (not 2.x)
The mutmut>=2.4 floor in pyproject.toml resolves to
mutmut 3.5+ in modern Python environments. The 3.x config differs
from 2.x in four ways the project depends on:
| Aspect | 2.x | 3.x |
|---|---|---|
| paths_to_mutate | comma-separated string | list of strings |
| tests_dir | comma-separated string | list of strings |
| runner | shell command string | removed; pytest is invoked in-process |
| Working dir | .mutmut-cache | mutants/ (a parallel copy of the source tree) |
The
[tool.mutmut]
section in pyproject.toml reflects 3.x shape. also_copy
is set to the full src/bqemulator/ tree so the package remains
importable when pytest runs from inside mutants/ (mutmut 3.x
copies only paths_to_mutate files by default, which would
break import bqemulator.foo for any module outside the scope).
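
The shape change in the table can be illustrated with a small conversion over a settings dictionary. This is a hypothetical helper written for this ADR, not a mutmut API; it only mirrors the three key-level differences listed above (two keys become lists, runner disappears).

```python
# Keys whose 2.x value is a comma-separated string and whose 3.x value is a list.
LIST_KEYS = ("paths_to_mutate", "tests_dir")


def migrate_mutmut_settings(old: dict) -> dict:
    """Return a 3.x-shaped copy of a 2.x-shaped [tool.mutmut] mapping."""
    # runner is dropped: per the table, 3.x invokes pytest in-process.
    new = {key: value for key, value in old.items() if key != "runner"}
    for key in LIST_KEYS:
        value = new.get(key)
        if isinstance(value, str):
            new[key] = [part.strip() for part in value.split(",") if part.strip()]
    return new
```

Values that are already lists pass through unchanged, so the function is safe to apply to a mapping that has been partly migrated.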
Consequences
- Positive. The Phase 11 ship criterion "mutation score tracked; regressions >2 points block release" is now enforceable. Phase 11's contract was aspirational for nine months; this session recorded the baseline and wired the comparison.
- Positive. Score-formula choice (excluding no_tests and skipped from the denominator) means the gate measures test-suite quality, not coverage. A drop in line coverage doesn't move the mutation score; only a drop in the suite's ability to distinguish good code from mutated code does.
- Positive. A nightly CI cadence + per-PR coverage gate is cheaper than running both per-PR. Nightly is the right cadence for a multi-hour comparison whose value is delta-from-baseline.
- Negative. A pilot scope of nine modules covers a small fraction of src/bqemulator/. The v1.0 ship criterion is met (the gate fires; the baseline is real), but the gate catches regressions only in the pilot surface. A regression in, e.g., sql/rules/spatial.py won't be caught by the mutation gate until v1.0.x sweeps the SQL-rule cluster into scope.
- Negative. Mutmut 3.x's mutants/ working directory conflicts with the convention 2.x established (.mutmut-cache). The Makefile's clean target and .gitignore were updated in the same PR.
- Negative. First-run wall clock is non-trivial. A nightly job budget of 30+ minutes is reasonable; a per-commit budget of 30+ minutes would block PR merge, which is why the gate is nightly.
- Negative (local-recording limitation on macOS). Mutmut 3.5.0 calls multiprocessing.set_start_method('fork') at module level in mutmut/__main__.py and then uses os.fork() directly for each mutant. macOS prohibits fork() after the parent process has loaded native extensions that touch the Objective-C runtime (DuckDB, pyarrow, structlog, mini-racer in this project); the child segfaults on first non-trivial work and the mutant is recorded as segfault. The effect: a local make test-mutation on a developer's macOS box records every mutant as segfault and yields a 0% score. The mutation.yml workflow runs on Linux GitHub-Actions runners where fork() is not subject to the same restriction, so the nightly gate (and the record-baseline=true workflow_dispatch mode) work correctly. Re-baselining is therefore a CI-side operation: the operator dispatches the workflow with record-baseline=true, downloads the mutation-baseline artefact, and commits it. Until upstream mutmut accepts a fix to use spawn semantics for child processes (tracked in the v1.0.x mutation-scope expansion entry), this asymmetry stands.
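
One way to keep a developer from mistaking an all-segfault local run for a real 0% score is a platform check before the run starts. The guard below is a hypothetical sketch, not something the Makefile is stated to contain; it relies only on `sys.platform`, which reports `"darwin"` on macOS.

```python
import sys


def local_mutation_run_supported(platform: str = sys.platform) -> bool:
    """Return False on macOS, where the fork-based mutmut run segfaults.

    The platform argument exists so the check can be exercised in tests
    without depending on the host operating system.
    """
    return platform != "darwin"


def explain_if_unsupported(platform: str = sys.platform) -> str:
    """Return a short hint for the operator, or an empty string if all is well."""
    if local_mutation_run_supported(platform):
        return ""
    return (
        "Local mutation runs are unreliable on macOS; dispatch the "
        "mutation.yml workflow instead (record-baseline=true to re-baseline)."
    )
```

A wrapper script could print the hint and exit before invoking mutmut, leaving the Linux CI path untouched.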
Implementation notes
- Pilot scope is encoded in pyproject.toml [tool.mutmut] paths_to_mutate.
- mutmut>=2.4 floor is kept in the [dev] extra (the project uses 3.x in practice, but the floor is honest about minimum compatibility).
- make test-mutation runs mutmut run + mutmut export-cicd-stats + the regression-check script and fails when the score drops >2pp.
- make test-mutation-baseline re-runs and overwrites the committed baseline; intended for operator use after the test suite has materially expanded.
- scripts/check_mutation_baseline.py reads mutants/mutmut-cicd-stats.json (mutmut's output) and compares to tests/mutation/baseline.json (committed).
- .github/workflows/mutation.yml schedules the nightly run and uploads the surviving-mutant artefact for triage.
References
- Mutation testing subsection in the Phase 11 roadmap doc
- Mutation testing in the testing-strategy doc
- ADR 0025 — Tier 6 perf gate is structurally analogous (compare-to-baseline, deliberate re-record, per-arch storage)
- v1-confidence-plan workstream P4.a — this ADR closes the workstream