Specialized types architecture¶
The emulator supports three BigQuery types — GEOGRAPHY, RANGE<T>, and
INTERVAL — none of which map 1:1 with a DuckDB type. Each is backed
by a modelled DuckDB representation plus a thin translation layer.
Module map¶
| Concern | Module |
|---|---|
| Function-name and shape table | types/geography.py |
| WKB → WKT helper | types/geography.py (wkb_to_wkt) |
| RANGE element-type validation | types/range_type.py |
| Interval literal parser | types/interval.py (parse_interval_literal) |
| JUSTIFY expression builder | types/interval.py (justify_*_expr) |
| BQ canonical interval emit | types/interval.py (format_bq_interval) |
| Type mapping | storage/type_map.py |
| REST schema round-trip | api/routes/tables.py, api/routes/tabledata.py |
| Arrow ↔ REST wire | storage/arrow_bridge.py |
| Pre-translator rewrites | sql/rewriter/specialized_types.py |
| Post-translator SQL rules | sql/rules/{spatial,range_rules,interval_rules}.py |
Backing representations¶
GEOGRAPHY → DuckDB GEOMETRY¶
DuckDB's spatial extension is required at engine startup
(storage/engine.py::_load_spatial). The emulator fails fast with a
clear InternalError if INSTALL spatial or LOAD spatial fails —
GEOGRAPHY queries cannot be emulated without it.
Storage on disk: WKB (DuckDB's native GEOMETRY representation).
Wire format on the way out: WKT, produced by a lazy in-process
DuckDB conversion connection in types/geography.py::wkb_to_wkt.
The conversion connection avoids a heavyweight shapely / geoarrow
dependency.
Trade-off: DuckDB's GEOMETRY is planar (Cartesian) while BigQuery's
GEOGRAPHY is spheroidal. Distance, area, and perimeter values
diverge at continental scales. Documented in
ADR 0019.
RANGE → STRUCT¶
DuckDB has no native RANGE. The emulator models it as a two-field
struct. The field names start and end mirror the BigQuery
projection names — r.start, r.end. In DDL the field names are
double-quoted because end is a SQL keyword.
The constructor RANGE(a, b) is rewritten before SQLGlot's
transpile pass to a BigQuery STRUCT literal
(STRUCT(a AS \start`, b AS `end`)). This prevents SQLGlot's
DuckDB-side parser from folding the two-argument call into aGenerateSeriesnode — the same shape it uses forGENERATE_ARRAY`,
which would lose the distinction between the two.
The RANGE_* family (CONTAINS, OVERLAPS, INTERSECT,
GENERATE_RANGE_ARRAY) lives in the post-translator rule pass; each
rule expands the BigQuery call into a DuckDB expression that accesses
the struct's start and end fields.
RANGE_SESSIONIZE is a TVF. A pre-translator at
src/bqemulator/sql/rewriter/range_sessionize.py rewrites the
RANGE_SESSIONIZE(TABLE <ref>, '<range_col>', [<part_cols>] [,
'<sessionize_option>']) call into a windowed gaps-and-islands
subquery. The rewrite operates on the raw SQL text because
SQLGlot's BigQuery parser rejects the TABLE <ref> keyword in TVF
arguments. Mode dispatch matches the documented BigQuery
semantics: MEETS (default) and the OVERLAPS_OR_MEETS alias use
strict > for the new-session predicate so touching ranges share
a session; OVERLAPS uses >= so touching ranges form separate
sessions. The expansion emits RANGE(MIN(...) OVER...,
MAX(...) OVER...) which the existing _rewrite_range_constructor
pass picks up and converts to the canonical STRUCT shape; the
resulting STRUCT("start" T, "end" T) column lands as RANGE<T>
on the REST wire via detect_range_element. A second pass
_rewrite_range_data_types rewrites RANGE<T> column-type
references so DDL like CREATE TABLE t (col RANGE<DATE>)
survives the DuckDB parser.
INTERVAL → DuckDB INTERVAL¶
DuckDB's INTERVAL is a 3-tuple of months / days / microseconds.
Most BigQuery interval forms map directly: INTERVAL 1 DAY,
MAKE_INTERVAL(...), and DATE / TIMESTAMP arithmetic all parse
correctly under DuckDB's grammar.
The exception is BigQuery's compound literal
INTERVAL '1-2 3 4:5:6.789' YEAR TO SECOND, which DuckDB's parser
refuses. The pre-translator rewriter parses the string in Python
(types/interval.parse_interval_literal) and emits the equivalent
sum of single-unit intervals.
JUSTIFY_HOURS / JUSTIFY_DAYS / JUSTIFY_INTERVAL are absent from
DuckDB and synthesised from extract + to_<unit> primitives at
translate time.
Translation pipeline¶
BigQuery SQL
│
▼
[pre-translator] sql/rewriter/range_sessionize.py (scope-#15)
│ - RANGE_SESSIONIZE(TABLE …, …) → windowed gaps-and-islands
│ subquery emitting RANGE(MIN OVER …, MAX OVER …)
│ (operates on raw SQL text — SQLGlot rejects TABLE <ref>)
▼
[pre-translator] sql/rewriter/specialized_types.py
│ - INTERVAL '…-… … …:…:…' YEAR TO SECOND → sum of singles
│ - RANGE(a, b) (Anonymous in BQ AST) → STRUCT(a AS start, b AS end)
│ - RANGE<T> '[s, e)' typed literal → STRUCT(CAST AS T)
│ - RANGE<T> column-type / non-literal-CAST → STRUCT<start T, end T>
▼
[SQLGlot transpile] read=bigquery, write=duckdb
│
▼
[post-translator rule pass] sql/translator.py::_apply_rules
│ Walks AST in post-order (reversed pre-order) so children get
│ rewritten before parents.
│ Rules:
│ - ST_*_RENAME (BQ ST_* → DuckDB ST_* + alias mapping)
│ - ST_GEOGFROMWKB (wraps arg in hex())
│ - ST_ISCOLLECTION (→ ST_GeometryType(…) IN (…))
│ - RANGE_CONTAINS / RANGE_OVERLAPS / RANGE_INTERSECT
│ - GENERATE_RANGE_ARRAY
│ - JUSTIFY_HOURS / JUSTIFY_DAYS / JUSTIFY_INTERVAL
▼
DuckDB SQL
REST wire format¶
tables.insert / tables.get accept and emit:
{"type": "GEOGRAPHY"}— no sub-fields.{"type": "INTERVAL"}— no sub-fields.{"type": "RANGE", "rangeElementType": {"type": "DATE"}}— element type isDATE/DATETIME/TIMESTAMP. Required when the field type isRANGE.
tabledata.insertAll accepts BigQuery-style JSON values:
- GEOGRAPHY: WKT string (e.g.
"POINT(1 2)"). The emulator converts WKT → hex-WKB inbq_rows_to_arrowand emits anINSERT INTO... SELECT ST_GeomFromHEXWKB(col_hex), … FROM <register>for the storage write. - INTERVAL: BigQuery-canonical string (e.g.
"1-2 3 4:5:6.789"). - RANGE:
{"start": "...", "end": "..."}JSON object.
getQueryResults / tabledata.list emit:
- GEOGRAPHY column → WKT string in the row.
- INTERVAL column → BigQuery-canonical string (
Y-M D H:M:S[.ffffff]). - RANGE column → standard STRUCT nested-row shape.