TPC-DS expansion plan — closing the 40-query gap¶

The conformance corpus today carries 59 of 99 TPC-DS queries (and 22/22 TPC-H). This document tracks the planned expansion to 99/99 and lives in the repo so the work survives session boundaries.

What's missing¶

40 queries, enumerated below in numerical order with a one-line complexity hint (table count + key SQL feature). "Easier" rows are roughly 2-4 tables and a single GROUP BY or window function; "harder" rows are 5+ tables, multi-CTE chains, or ROLLUP/GROUPING SETS.

#	Q	Hint (tables / key feature)	Complexity
1	q10	8 tables: customer, c_address, c_demo, h_demo, store_sales, web_sales, catalog_sales, date_dim. EXISTS to 3 fact tables.	hard
2	q11	6 tables: customer, c_address, store_sales, web_sales, date_dim. Multi-CTE: year1 / year2 ratio.	medium
3	q12	3 tables: web_sales, item, date_dim. Window SUM OVER PARTITION BY i_class.	easier
4	q13	4 tables: store_sales, store, customer_demographics, household_demographics + date_dim. Three OR'd customer-segment filters + DECIMAL aggregates.	medium
5	q19	5 tables: date_dim, store_sales, item, customer, customer_address, store. Manager-rank by brand-category-state. ORDER BY i_brand_id, item, store; LIMIT 100.	medium
6	q20	3 tables: catalog_sales, item, date_dim. Item revenue ratio over class. Window SUM OVER.	easier
7	q25	5 tables: store_sales, store_returns, catalog_sales, store, item, date_dim ×3 (sold + returned + reseller). Customer return ratio.	hard
8	q26	4 tables: catalog_sales, customer_demographics, date_dim, item, promotion. Gender + marital + ed-status filter; promotion ID predicates.	medium
9	q30	6 tables: web_returns, date_dim, customer_address, customer + customer-state ranking. Top-100 customers per state via CTE.	hard
10	q32	3 tables: catalog_sales, item, date_dim. SUM(cs_ext_discount_amt) > 1.3 × AVG(...) correlated subquery.	medium
11	q33	8 tables (3 channels × 3 dimension tables): store/web/catalog_sales × item × date_dim × customer_address. UNION ALL across 3 channels with GMT-zone filter.	hard
12	q37	3 tables: item, inventory, date_dim, catalog_sales. Item-on-promotion + min-inventory filter.	easier
13	q40	3 tables: catalog_sales, catalog_returns, date_dim, item, warehouse. Returns net-sales by warehouse + item, date-range PIVOT.	medium
14	q45	4 tables: web_sales, customer, customer_address, date_dim, item. Zip-prefix filter + city aggregation.	easier
15	q46	6 tables: store_sales, customer_demographics, household_demographics, customer, customer_address, store, date_dim. Multi-segment customer filter.	medium
16	q54	8 tables: catalog_sales + web_sales + customer + c_address + store + item + date_dim. Customer-ROI per item class. Multi-CTE.	hard
17	q57	3 tables: call_center, item, date_dim, catalog_sales. Avg-sales window over call-center / month.	easier
18	q58	3 tables: store_sales / web_sales / catalog_sales × item × date_dim. Item-rank by week sales across 3 channels. UNION ALL + window.	medium
19	q59	3 tables: store_sales, date_dim, store. Week-over-week sales ratio. Self-join on week_seq.	medium
20	q60	8 tables: store/web/catalog_sales × item × date_dim × customer_address. Sales channel UNION ALL + GROUPING SETS-like ROLLUP-by-item.	hard
21	q61	8 tables: store_sales + promotion + date_dim + customer + customer_address + item + store. Promotion-driven sales ratio (with-promotion / no-promotion).	medium
22	q62	5 tables: web_sales, warehouse, ship_mode, web_site, date_dim. Shipping-mode aggregates with NULLs for un-shipped.	medium
23	q65	4 tables: store, item, store_sales, date_dim. DENSE_RANK over store-item revenue.	easier
24	q68	6 tables: store_sales, date_dim, store, household_demographics, customer_address, customer. Customer-store visit counts.	medium
25	q69	5 tables: customer, customer_address, customer_demographics, store_sales/web_sales/catalog_sales (EXISTS to one, NOT EXISTS to other two).	hard
26	q71	6 tables: web_sales / catalog_sales / store_sales × item × time_dim × date_dim × promotion. By-hour sales rank across 3 channels.	medium
27	q74	6 tables: customer, store_sales, web_sales, date_dim. Year-over-year ratio. Multi-CTE (year1 / year2 per channel).	medium
28	q76	3 tables (one of: store/web/catalog_sales) × item × date_dim. NULL with missing-value statistics by channel. UNION ALL.	medium
29	q78	3 tables: store_sales/web_sales/catalog_sales paired with returns. Sales-vs-returns ratio per customer per year.	medium
30	q79	4 tables: store_sales, date_dim, store, household_demographics. Aggregates by date + customer (clean rendering for the day's grand-total).	easier
31	q80	8 tables: store_sales / web_sales / catalog_sales each + their returns + date_dim + item + promotion. Returns ratio UNION ALL across 3 channels with ROLLUP.	hard
32	q81	4 tables: store_returns, customer, customer_address, date_dim. Top-100 customers by return-amt with address fragments.	easier
33	q82	5 tables: item, inventory, date_dim, store_sales. Item dependency on store sales over date-range.	easier
34	q89	4 tables: item, store_sales, date_dim, store. Year-over-year monthly trend with AVG window.	medium
35	q91	5 tables: call_center, catalog_returns, date_dim, customer, customer_address, customer_demographics, household_demographics. Call-center returns by manager.	hard
36	q92	4 tables: web_sales, item, date_dim. SUM(ws_ext_discount_amt) > 1.3 × AVG correlated subquery.	medium
37	q93	3 tables: store_sales, store_returns, reason. UPDATE-style net-amount per customer with reason filter.	easier
38	q94	4 tables: web_sales, web_returns, date_dim, customer_address, web_site. State-filtered returns ratio.	medium
39	q95	5 tables: web_sales, web_returns, date_dim, customer_address, web_site. Adds late-shipped filter on top of q94's shape.	medium
40	q98	3 tables: store_sales, item, date_dim. ROLLUP over item-class with retail-price ratio.	easier

Counts: 11 easier, 20 medium, 9 hard.

Recommended execution order: ascending complexity within family, so the easiest fixtures land first, prove the toolchain, and the hard ones benefit from the seed-data patterns the easier ones establish.

Per-query recipe¶

Each fixture is a directory under tests/conformance/sql_corpus/standard_functions/tpcds_q<N>/ containing exactly three files:

setup.sql — minimal seeded data (CREATE OR REPLACE TABLE ${DATASET}.<name> + INSERT VALUES). The dataset is per-fixture and ephemeral; ${DATASET} is substituted by the recorder at record-time. Seed rows are designed so the query returns a small, non-empty, deterministic result — never an empty set (which would fail to prove the query did anything), never thousands of rows (which bloats expected.json). Target: 1-20 result rows.
query.sql — the TPC-DS reference query body, adapted to BigQuery's syntax. References tables via \`${DATASET}.<table>\``` backtick quoting. Template substitution variables in the TPC-DS spec (e.g.,[YEAR],[STATE]``) are hard-coded to specific values matching the seed data.
expected.json — the recorder writes this file with the real-BigQuery result, schema, job_id, total_bytes_processed, duration_class, and statement_type. Never hand-edit — the runner cross-checks against the recorded job_id.

The triple is round-tripped end-to-end by make test-conformance (offline, replay-only — no BQ creds needed; the baselines are committed) and re-recorded by make record-conformance (live, requires the GCP project + ADC credentials).

Adapting TPC-DS reference SQL to BigQuery¶

TPC-DS templates are ANSI SQL with a few Oracle-isms. Required adaptations:

TPC-DS reference	BigQuery
`"identifier"` (double-quoted)	`\identifier``` (backtick) — only for column / table identifiers that need quoting; otherwise unquoted.
`schema.table`	`\${DATASET}.table``` — single-segment dataset binding via the recorder.
`DATE '2000-01-01'`	`DATE "2000-01-01"` (double-quoted literal — BQ-preferred)
`INTERVAL '30' DAY`	`INTERVAL 30 DAY` (no quotes around the integer)
`DECIMAL'123.45'`	`NUMERIC "123.45"` (BQ's DECIMAL is named NUMERIC)
`COUNT_BIG(*)`, `DECODE()`	Avoid — use `COUNT(*)`, `CASE WHEN` instead.
`ROWNUM`, `CONNECT BY`	Out of scope; never appears in TPC-DS but worth flagging.

The 59 existing fixtures collectively exercise every adaptation pattern. When in doubt, grep the existing corpus for the same function / construct.

Seed-data sizing¶

The conformance recorder enforces a 1 GiB per-fixture byte-scan cap. Real TPC-DS data at SF=1 is several GiB per fact table — the emulator's corpus deliberately uses minimal seeded data (a few rows per dimension; a few rows per fact) so:

Each recording job scans < 1 MiB (well under the cap).
The total cost of recording all 40 fixtures is ~$0.001 (TPC-DS SF=1 scan cost would be ~$5).
The recorded result rows fit in a hand-readable expected.json (typically 5-30 rows).

Seed-data design rule: pick the smallest set of rows that exercises every WHERE-clause branch the query has. If the query has WHERE region IN ('US','EU','APAC'), seed 3 rows with those regions + 1 row with a different region (so the filter actually filters something).

Anti-patterns¶

Don't copy TPC-DS template literally; the template has substitution variables. Replace them with concrete values that match seed data.
Don't seed data that returns empty results — that proves nothing.
Don't seed thousands of rows — recorder + runner both pay the cost.
Don't hand-edit expected.json — the runner cross-checks the recorded job_id.
Don't use random / non-deterministic functions in the query body (RAND, GENERATE_UUID, SESSION_USER, CURRENT_*). These are documented as conformance-excluded by ADR 0022 §1.2.

Cost guardrails¶

Guardrail	Value	Where
Per-fixture byte-scan cap	1 GiB (configurable via the `--byte-cap` flag)	`scripts/record_conformance_fixtures.py` — the `DEFAULT_BYTE_CAP` constant + the `--byte-cap` CLI flag
Project billing	`$BQEMU_CONFORMANCE_PROJECT` (operator-supplied)	`Makefile` `record-conformance` target
Expected total scan for all 40 fixtures	< 40 MiB	seeded-data sizing rule
Expected total cost (US multi-region, on-demand)	< $0.01	$5 / TiB scanned

The cap is a safety net: any well-formed TPC-DS fixture with minimal seeded data scans single-digit kilobytes, multiple orders of magnitude below the cap.

Recording flow (operator)¶

# 1. Authenticate to GCP (one-time per terminal session).
gcloud auth application-default login

# 2. Set the project that owns the recording dataset.
export BQEMU_CONFORMANCE_PROJECT=<your-gcp-project>
export BQEMU_CONFORMANCE_LOCATION=US   # or EU; matches existing corpus

# 3. (Optional) Filter to a single fixture or substring.
python scripts/record_conformance_fixtures.py \
    --project "$BQEMU_CONFORMANCE_PROJECT" \
    --location "$BQEMU_CONFORMANCE_LOCATION" \
    --filter tpcds_q12

# Re-record everything (rare; almost always run --filter instead):
make record-conformance

# 4. Replay offline to confirm the recording matches the emulator.
make test-conformance -- -k tpcds_q12

# 5. Commit setup.sql + query.sql + expected.json together.

What's NOT changing¶

No new SQL rules are needed — every missing query uses features already supported by the 92-rule translator. If a recording reveals an unsupported feature, raise it in the chip's follow-up notes (the gap analysis lives in docs/reference/gap-analysis.md).
No conformance-corpus exclusion changes — tests/conformance/_surface_inventory.py's SESSION_USER / GENERATE_UUID exclusions (ADR 0022 §1.2) stand; the 40 new queries don't use either function.
No ADR — adding more fixtures of an existing shape isn't an architectural decision. The shape's design is in ADR 0022.

Open questions (resolve before starting bulk recording)¶

Should any of the 40 queries be excluded from the corpus? The existing 59 are unanimously inclusion-worthy; the 40 missing are similarly mainstream. Recommendation: include all 40.
Cost approval. The total recording cost is ~$0.01 against the supplied GCP project's billing account. No approval needed for that order of magnitude.
Re-record cadence. TPC-DS query bodies are stable (specifications change rarely). Re-recording is only needed when the emulator changes a translator rule whose output diverges from the recorded baseline; that's caught by the per-PR replay step. No periodic re-record needed.