ADR 0021: Chaos-tier (Tier 7) design contract¶
- Status: Accepted
Context¶
Phase 10's production-readiness audit surfaced eight failure modes that no existing test tier reliably covered: process crashes mid-DDL and mid-stream, concurrent stale-readers on materialized views, retry storms at 1000+ threads, resource exhaustion (disk-full, FD-cap, buffer-cap), DuckDB file-lock contention across processes, forward- only migration partial-apply recovery, the spatial-extension offline startup path, and gRPC stream cancellation under back-pressure.
These failure modes share three properties:
- They are real production hazards an operator will see at some point.
- They are not naturally reproducible from a unit or integration test — they need fault injection or process-level orchestration.
- They are slower than the per-commit gate tolerates: a single subprocess-kill scenario takes seconds, not the milliseconds the unit/property/integration tiers target.
We need a dedicated test tier that owns these scenarios with rules that distinguish a meaningful chaos test from a flaky one. ADR 0017 documented the MV refresh model that one of these scenarios validates under contention; ADR 0013 documented the Write API's exactly-once semantics that the retry-storm scenario property-tests. Neither contract was actually exercised under real contention until Phase 11 shipped the chaos tier.
This ADR captures the contract every chaos test in
tests/chaos/ must honour.
Decisions¶
1. Chaos is Tier 7 of a 7-tier pyramid¶
The testing-strategy document
defines seven tiers. Chaos sits at the top — slower than the gates
run on every commit, but mandatory before a release. Chaos runs
nightly in CI; make test-chaos runs it locally.
Tiers 1-4 + 6 (unit, property, integration, e2e, perf) run on every commit. Tier 5 (conformance) and Tier 7 (chaos) run nightly.
2. Five categories, one file per category¶
The Phase 11 audit mapped each open gap to one of five categories. Adding a new chaos scenario means picking the category it belongs to and adding it to the existing file:
| File | Category | Audit gap |
|---|---|---|
test_concurrency.py |
Concurrent races | #5 (MV refresh) + extends #4 |
test_resource_exhaustion.py |
Resource pressure | #6 |
test_crash_recovery.py |
Process-level crash | #2 process-side |
test_storage_failures.py |
Storage primitives | #7, #9, #10 |
test_network_failures.py |
gRPC / REST boundary | #2 network-side |
Six categories would have been a defensible split (e.g. "lock contention" separated from "concurrent races"); five matches the audit's natural fault-mode taxonomy and keeps the surface manageable.
3. Deterministic given a seed¶
Every chaos test must reproduce the same interleaving on every run given the same seed. Three implementation strategies, in order of preference:
- No randomness at all. A
threading.Barrierto release N threads simultaneously, anasyncio.Barrierfor async scenarios — the interleaving is determined by the lock + scheduling primitives, not by chance. - Fixed seed for RNG-driven inputs. The chaos conftest seeds
randomat import. Tests may consume the RNG freely; theBQEMU_CHAOS_SEEDenv var overrides the default (0). The seed is printed in pytest's session header so a CI flake can be reproduced by copying the seed line. - Subprocess-based scenarios publish a
READYsentinel on stdout before the parent injects the fault. The parent reads the pipe (not a sleep) to know the child has reached the right state.
What we explicitly reject:
time.sleep(0.1)as a "let the other thread get there" pattern.random.random() < 0.5to choose which fault to inject.hypothesissettings that vary across runs without a stored seed.
A chaos test that "usually" passes is a regression-class bug. The remedy is to introduce a primitive that forces the desired ordering, or to assert a weaker property that holds under any ordering.
4. Every test asserts one of three outcomes¶
A chaos scenario must explicitly assert one of:
- (a) Invariant preserved despite fault. Example: 1000 threads
retrying the same offset → exactly one OK, 999 ALREADY_EXISTS;
next_offsetis exactly 1 regardless of OS scheduling. - (b) Clean documented failure. Example: corrupt catalog row
raises
InternalErrorwhose message names the offending row's identity. The error class and message-shape regression-test the operator-facing contract. - © Graceful degradation. Example: a second emulator on a
locked
data_dirraises a recognisable subset of exceptions (IOException,InternalError, or chained variant) without hanging.
A chaos test must not assert "didn't crash" — that's too weak a contract for a tier designed to enumerate failure modes. The distinction matters because a real production incident landed in one of these three buckets, and the operator's runbook reads from the same taxonomy.
5. 60-second per-scenario timeout¶
make test-chaos runs pytest with --timeout=60. Inside the chaos
conftest we re-assert the cap so direct pytest tests/chaos/
invocations get the same bound. Override via BQEMU_CHAOS_TIMEOUT
when debugging.
A scenario that hits the timeout is a bug — either the scenario
hangs (deadlock, missed event), or it's truly slower than the chaos
tier permits and needs to live in a separate slow tier.
6. Environment-dependent scenarios may pytest.skip¶
Two scenarios in Phase 11's initial chaos build-out are
environment-dependent: the spatial-extension-offline scenario in
test_storage_failures.py and the FD-cap scenario in
test_resource_exhaustion.py. Both fall back to
pytest.skip(...) with a documented reason when the host's
configuration can't reproduce the fault (DuckDB has a cached
spatial extension; the soft FD limit can't be lowered without
breaking pytest's own logging FDs).
This is acceptable iff:
- The skip reason is specific and actionable (not "skipped because slow").
- A unit-tier counterpart exists that asserts the same contract deterministically. The unit test is the canonical lock; chaos adds a process-level dimension for environments that support it.
Without the unit-tier counterpart, the scenario must use a different injection mechanism that doesn't depend on the host's caches.
7. Subprocess scenarios share a helper¶
tests/chaos/test_crash_recovery.py and tests/chaos/test_storage_failures.py
both spawn subprocesses to inject process-level faults. They share
the _spawn_emulator_child / _wait_for_ready pattern: the
child runs a snippet, prints READY once setup is complete,
then awaits an asyncio.Event until the parent kills it.
This pattern matters because it makes the test deterministic — the parent's kill is synchronised to the child's state transition rather than to wall-clock time. A more elaborate IPC (named pipes, sockets) would buy nothing here; stdout line-buffering is sufficient.
Consequences¶
-
Positive. The Phase 10 audit gaps are now property-tested under contention rather than just claimed in design notes. The exactly-once guarantee on COMMITTED streams, the MV collapse-onto-one-recompute guarantee, the catalog corruption recovery flow, and the kill-recovery semantics all have explicit regression coverage.
-
Positive. The five-category split keeps the file surface manageable. A new chaos scenario maps to one of the five obvious files.
-
Positive. Determinism rules make CI investigation cheap. A failure is reproducible from the seed alone; there's no "couldn't reproduce locally" failure mode.
-
Negative. Chaos runs nightly (not on every commit), so a scenario regression can land in main and persist for up to 24 hours. The mitigation is that every chaos scenario also has a unit-tier counterpart or an integration-tier prerequisite; the unit gates catch most regressions before chaos sees them.
-
Negative. Subprocess-based scenarios are slower than unit tests — typically 0.5-2 seconds each because Python startup + emulator initialisation dominates. The 60-second per-scenario budget absorbs this without complaint, but it's a real cost compared to a pure-Python unit test.
-
Negative. Two scenarios skip in some environments (spatial extension cache present; FD cap can't be lowered). The unit-tier counterparts (
test_engine_spatial.py) prevent the contract from drifting, but the chaos test's process-level coverage is conditional. We accept this rather than build elaborate environment manipulation that would itself become a maintenance liability.
Implementation notes¶
- The
chaospytest marker is registered inpyproject.toml. - The
make test-chaostarget invokespytest tests/chaos -m chaos --timeout=60. pytest-timeoutis declared as an explicit testing dependency inpyproject.toml(not a transitive of another package) somake dev-setupalways installs it.- The chaos seed is logged in pytest's session header, not via
warnings.warn(the project'sfilterwarnings = ["error",...]would convert that into a test failure). - New chaos scenarios should add their audit-gap reference to the module docstring; the Phase 11 review (created when Phase 11 closes) carries the audit-finding traceability table.
References¶
- Tier 7 in the testing-strategy doc
- Phase 11 roadmap doc — chaos section
- Phase 10 review's production-readiness audit — the source of the 8 gaps the chaos tier closes (the Phase 11 review, when Phase 11 closes, will carry the audit-finding traceability table)
- ADR 0013 — exactly-once guarantee the retry-storm scenario property-tests
- ADR 0017 — collapse-onto-one- recompute guarantee the MV scenario property-tests
- ADR 0019 — spatial extension required- startup that the gap-#10 chaos scenario verifies the failure path of