⚙️ Expression DSL & AST¶
This document describes elevata’s vendor-neutral Expression DSL (Domain-Specific Language) and its corresponding Expression AST (Abstract Syntax Tree).
It forms the foundation of the multi-dialect SQL engine and powers:
- surrogate-key hashing
- foreign-key lineage hashing
- CONCAT/COALESCE operations
- window functions (ROW_NUMBER, etc.)
- subqueries in the LogicalPlan
- deterministic SQL generation across dialects
🔧 1. Purpose of the DSL & AST¶
The architecture introduces:
- a safe, declarative Expression DSL stored in metadata
- a parser converting DSL → AST
- a vendor-neutral AST describing expressions
- dialect renderers (BigQuery, Databricks, DuckDB, Fabric Warehouse, MSSQL, Postgres, Snowflake) that emit actual SQL
This ensures:
- deterministic SQL generation
- cross-dialect reproducibility
- fully testable and composable expression logic
- consistent hashing on all platforms
🔧 2. DSL → AST → SQL Rendering Pipeline¶
DSL string → DSL Parser → Expression AST → Dialect Renderer → Final SQL
Example DSL:
HASH256(
  CONCAT_WS('|',
    CONCAT('productid', '~', COALESCE({expr:productid}, 'null_replaced')),
    'pepper'
  )
)
AST (conceptual):
Hash256(
  ConcatWs('|', [
    Concat([
      Literal('productid'),
      Literal('~'),
      Coalesce(ExprRef('productid'), Literal('null_replaced'))
    ]),
    Literal('pepper')
  ])
)
Dialect renderings:
- BigQuery → TO_HEX(SHA256(CONCAT_WS('|', ...)))
- Databricks → SHA2(CONCAT_WS('|', ...), 256)
- DuckDB → SHA256(CONCAT_WS('|', ...))
- Fabric Warehouse → CONVERT(VARCHAR(64), HASHBYTES('SHA2_256', CAST(CONCAT_WS('|', ...) AS VARCHAR(4000))), 2)
- MSSQL → CONVERT(VARCHAR(64), HASHBYTES('SHA2_256', CONCAT_WS('|', ...)), 2)
- Postgres → ENCODE(DIGEST(CONCAT_WS('|', ...), 'sha256'), 'hex')
- Snowflake → LOWER(TO_HEX(SHA2(CONCAT_WS('|', ...), 256)))
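The per-dialect hash wrappers listed above can be pictured as a simple template lookup. The following is a minimal sketch, not elevata's renderer code: the dictionary `HASH256_TEMPLATES`, the dialect keys, and the function `render_hash256` are hypothetical names chosen for illustration; only the SQL templates themselves are taken from the list above.

```python
# Hypothetical lookup table. The SQL templates mirror the renderings listed
# in this section; the names and dialect keys are illustrative assumptions.
HASH256_TEMPLATES = {
    "bigquery": "TO_HEX(SHA256({inner}))",
    "databricks": "SHA2({inner}, 256)",
    "duckdb": "SHA256({inner})",
    "fabric_warehouse": (
        "CONVERT(VARCHAR(64), HASHBYTES('SHA2_256', "
        "CAST({inner} AS VARCHAR(4000))), 2)"
    ),
    "mssql": "CONVERT(VARCHAR(64), HASHBYTES('SHA2_256', {inner}), 2)",
    "postgres": "ENCODE(DIGEST({inner}, 'sha256'), 'hex')",
    "snowflake": "LOWER(TO_HEX(SHA2({inner}, 256)))",
}


def render_hash256(dialect: str, inner_sql: str) -> str:
    """Wrap already-rendered inner SQL in the dialect's SHA-256 expression."""
    try:
        template = HASH256_TEMPLATES[dialect]
    except KeyError:
        raise ValueError(f"unsupported dialect: {dialect}") from None
    # str.replace is used instead of str.format so that braces inside
    # inner_sql are never interpreted as format fields.
    return template.replace("{inner}", inner_sql)


print(render_hash256("postgres", "CONCAT_WS('|', a, b)"))
```

Keeping the wrapper separate from the inner expression means the CONCAT_WS part is rendered once per dialect and the hash wrapper is applied on top of it.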
🔧 3. DSL Syntax¶
🧩 3.1 Supported core functions¶
| DSL Function | Description |
|---|---|
| HASH256(expr) | Vendor-neutral SHA‑256 hash wrapper |
| CONCAT(a,b,...) | Null-propagating concatenation |
| CONCAT_WS(sep,a,b,...) | Null-safe concatenation with separator |
| COALESCE(a,b) | Standard SQL coalesce |
| COL(name) | Column reference |
| {expr:column} | Reference to upstream expression column |
The DSL is intentionally minimal and safe.
🧩 3.2 Identifiers¶
- COL(bk1) and COL("bk1") behave equivalently.
- Dialects re-apply proper quoting.
🧩 3.3 Literals¶
String literals may be defined using '...' or "...".
🧩 3.4 Upstream Expression References¶
Syntax:
{expr:column_name}
This refers to an upstream expression already defined in the execution graph.
🔧 4. DSL Parser¶
Located in: metadata/rendering/dsl.py
Responsibilities:
- Normalize input
- Detect function calls
- Parse nested expressions
- Parse literals
- Split arguments respecting parentheses
- Convert to AST nodes
Specialized rules:
- COL(name) → ColumnRef
- 'literal' → Literal
- {expr:x} → ExprRef
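The responsibilities and rules above can be sketched as a small recursive parser. This is an illustrative approximation and not the code in metadata/rendering/dsl.py: the names `split_args`, `parse`, and `FUNCTIONS` are hypothetical, and nodes are shown as plain tuples rather than AST classes to keep the sketch short.

```python
import re
from typing import List

# Functions from the "Supported core functions" table (COL is handled separately).
FUNCTIONS = {"HASH256", "CONCAT", "CONCAT_WS", "COALESCE"}
_CALL = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)\((.*)\)$", re.DOTALL)
_EXPR_REF = re.compile(r"^\{expr:([A-Za-z_][A-Za-z0-9_]*)\}$")


def split_args(text: str) -> List[str]:
    """Split on top-level commas, ignoring commas in quotes or parentheses."""
    args, current, depth, quote = [], [], 0, None
    for ch in text:
        if quote:                       # inside a string literal
            current.append(ch)
            if ch == quote:
                quote = None
        elif ch in "'\"":               # a string literal starts
            quote = ch
            current.append(ch)
        elif ch == "," and depth == 0:  # top-level argument boundary
            args.append("".join(current).strip())
            current = []
        else:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        args.append(tail)
    return args


def parse(dsl: str):
    """Convert a DSL string into nested tuples standing in for AST nodes."""
    text = dsl.strip()                  # normalize input
    if len(text) >= 2 and text[0] in "'\"" and text[-1] == text[0]:
        return ("Literal", text[1:-1])
    ref = _EXPR_REF.match(text)
    if ref:
        return ("ExprRef", ref.group(1))
    call = _CALL.match(text)
    if not call:
        raise ValueError(f"cannot parse: {text!r}")
    name, body = call.group(1).upper(), call.group(2)
    if name == "COL":                   # COL(bk1) and COL("bk1") are equivalent
        return ("ColumnRef", body.strip().strip("\"'"))
    if name not in FUNCTIONS:
        raise ValueError(f"unsupported function: {name}")
    return (name, [parse(arg) for arg in split_args(body)])


print(parse("COALESCE({expr:productid}, 'null_replaced')"))
```

Rejecting unknown function names at parse time is one way to keep arbitrary SQL out of metadata; whether the real parser does this in the same place is an assumption.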
🔧 5. Expression AST Nodes¶
All expression classes derive from a common base.
🧩 5.1 Primitive Nodes¶
🔎 Literal(value)¶
Represents a literal value in SQL.
🔎 ColumnRef(column_name, table_alias=None)¶
Represents a reference to a column.
🔎 ExprRef(name)¶
References an upstream-generated expression.
🧩 5.2 Function Expression Nodes¶
🔎 ConcatExpr(args)¶
Represents CONCAT(a,b,...).
🔎 ConcatWsExpr(separator, args)¶
Represents CONCAT_WS(sep, ...).
🔎 CoalesceExpr(a,b)¶
Represents COALESCE(a,b).
🔎 Hash256Expr(expr)¶
Vendor-neutral representation of SHA‑256 hashing.
Each dialect chooses its own SQL form.
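One way to picture these node classes is as immutable dataclasses sharing a base class. The sketch below follows the constructor signatures given in this section; the base class name `Expr`, the field types, and the use of frozen dataclasses are assumptions made for illustration, not a description of the actual class definitions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


class Expr:
    """Common base for all expression nodes (the name is an assumption)."""


@dataclass(frozen=True)
class Literal(Expr):
    value: object


@dataclass(frozen=True)
class ColumnRef(Expr):
    column_name: str
    table_alias: Optional[str] = None


@dataclass(frozen=True)
class ExprRef(Expr):
    name: str


@dataclass(frozen=True)
class ConcatExpr(Expr):
    args: Tuple[Expr, ...]


@dataclass(frozen=True)
class ConcatWsExpr(Expr):
    separator: str            # assumed to be a plain string here
    args: Tuple[Expr, ...]


@dataclass(frozen=True)
class CoalesceExpr(Expr):
    a: Expr
    b: Expr


@dataclass(frozen=True)
class Hash256Expr(Expr):
    expr: Expr


# AST for: HASH256(CONCAT_WS('|', COALESCE(COL(bk1), 'null_replaced'), 'pepper'))
example = Hash256Expr(
    ConcatWsExpr(
        "|",
        (
            CoalesceExpr(ColumnRef("bk1"), Literal("null_replaced")),
            Literal("pepper"),
        ),
    )
)
print(example)
```

Frozen dataclasses compare by value, so two trees built from the same DSL string are equal, which makes parser output easy to assert on in unit tests.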
🔧 6. Window Functions¶
Represented by:
WindowFunctionExpr(
  function_name,
  partition_by=[...],
  order_by=[OrderItem(expr, direction)]
)
Example:
ROW_NUMBER() OVER (PARTITION BY src ORDER BY updated_at DESC)
Used primarily in multi-source Stage ranking mode.
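A minimal sketch of how such a node could be turned into the SQL shown above. The field names follow the `WindowFunctionExpr` signature in this section; treating partition and order expressions as plain strings (rather than nested AST nodes) and the helper name `render_window` are simplifying assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OrderItem:
    expr: str                 # simplified: a column name instead of an AST node
    direction: str = "ASC"


@dataclass
class WindowFunctionExpr:
    function_name: str
    partition_by: List[str] = field(default_factory=list)
    order_by: List[OrderItem] = field(default_factory=list)


def render_window(node: WindowFunctionExpr) -> str:
    """Render an argument-less window function such as ROW_NUMBER()."""
    clauses = []
    if node.partition_by:
        clauses.append("PARTITION BY " + ", ".join(node.partition_by))
    if node.order_by:
        items = ", ".join(f"{item.expr} {item.direction}" for item in node.order_by)
        clauses.append("ORDER BY " + items)
    return f"{node.function_name}() OVER ({' '.join(clauses)})"


row_number = WindowFunctionExpr(
    "ROW_NUMBER",
    partition_by=["src"],
    order_by=[OrderItem("updated_at", "DESC")],
)
print(render_window(row_number))
# ROW_NUMBER() OVER (PARTITION BY src ORDER BY updated_at DESC)
```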
🔧 7. Subqueries in the AST¶
Subqueries are modeled using:
SubquerySource(select, alias)
Example pattern:
SELECT *
FROM (
  SELECT *, ROW_NUMBER() OVER (...) AS rn
  FROM union_all
) AS ranked
WHERE rn = 1
The dialect handles parenthesis placement and alias rendering.
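The pattern above can be sketched as a subquery source wrapped by an outer filter. In this simplified illustration `SubquerySource.select` holds already-rendered SQL text, whereas the real AST presumably holds a select node; the helpers `render_subquery` and `render_rank_filter` are hypothetical names.

```python
import textwrap
from dataclasses import dataclass


@dataclass
class SubquerySource:
    select: str   # simplified: pre-rendered inner SELECT text
    alias: str


def render_subquery(source: SubquerySource) -> str:
    """Wrap the inner SELECT in parentheses and attach the alias."""
    body = textwrap.indent(source.select.strip(), "  ")
    return f"(\n{body}\n) AS {source.alias}"


def render_rank_filter(source: SubquerySource, rank_column: str = "rn") -> str:
    """Keep only the top-ranked row per partition from the subquery."""
    return f"SELECT *\nFROM {render_subquery(source)}\nWHERE {rank_column} = 1"


ranked = SubquerySource(
    select="SELECT *, ROW_NUMBER() OVER (...) AS rn\nFROM union_all",
    alias="ranked",
)
print(render_rank_filter(ranked))
```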
🔧 8. Dialect Rendering Responsibilities¶
Each SQL dialect must render:
- literals
- identifiers
- CONCAT / CONCAT_WS
- COALESCE
- HASH256
- window functions
- subqueries
Consistency across dialects is ensured because every renderer starts from the same AST.
Examples:
- BigQuery: TO_HEX(SHA256(...))
- Databricks: SHA2(..., 256)
- DuckDB: SHA256(...)
- Fabric Warehouse: CONVERT(VARCHAR(64), HASHBYTES('SHA2_256', ...), 2)
- MSSQL: CONVERT(VARCHAR(64), HASHBYTES('SHA2_256', ...), 2)
- Postgres: ENCODE(DIGEST(..., 'sha256'), 'hex')
- Snowflake: LOWER(TO_HEX(SHA2(..., 256)))
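Identifier rendering is a compact illustration of why each dialect needs its own renderer. The sketch below uses the quoting characters commonly associated with each engine (backticks, double quotes, square brackets); the table `IDENTIFIER_QUOTES`, the dialect keys, and the function `render_identifier` are hypothetical, and restricting identifiers to a safe character set is an assumption made to avoid escaping rules.

```python
import re

# Opening and closing quote characters per dialect (illustrative table).
IDENTIFIER_QUOTES = {
    "bigquery": ("`", "`"),
    "databricks": ("`", "`"),
    "duckdb": ('"', '"'),
    "fabric_warehouse": ("[", "]"),
    "mssql": ("[", "]"),
    "postgres": ('"', '"'),
    "snowflake": ('"', '"'),
}

# Assumption: only simple names are accepted, so no escaping is required.
_SAFE_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def render_identifier(dialect: str, name: str) -> str:
    """Quote a column or table name in the style of the given dialect."""
    if not _SAFE_IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    open_quote, close_quote = IDENTIFIER_QUOTES[dialect]
    return f"{open_quote}{name}{close_quote}"


for dialect in ("bigquery", "mssql", "postgres"):
    print(dialect, render_identifier(dialect, "bk1"))
```

Because the DSL stores only the bare name (COL(bk1)), quoting is applied at render time and never leaks into metadata.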
🔧 9. Use in Surrogate & Foreign Keys¶
The SK/FK hashing pipeline uses the DSL and AST exclusively.
Guarantees:
- deterministic key generation
- lexicographically ordered BK parts
- proper literal separators: '~' (within a pair) and '|' (between pairs)
- null-protection using COALESCE
- dialect-consistent hashing algorithms
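The guarantees above can be illustrated by a small builder that assembles the key DSL from business-key column names. This is a sketch under assumptions: the function `build_key_dsl` is hypothetical, and the pepper value, the null token, and the position of the pepper as the last CONCAT_WS argument are taken from the example DSL in section 2 rather than from the actual pipeline code.

```python
from typing import Iterable


def build_key_dsl(
    bk_columns: Iterable[str],
    pepper: str = "pepper",
    null_token: str = "null_replaced",
) -> str:
    """Assemble a HASH256 DSL expression for a set of business-key columns."""
    pairs = [
        # name '~' value, with NULL values replaced by a fixed token
        f"CONCAT('{col}', '~', COALESCE({{expr:{col}}}, '{null_token}'))"
        for col in sorted(bk_columns)   # lexicographic order for determinism
    ]
    parts = ", ".join(pairs + [f"'{pepper}'"])
    return f"HASH256(CONCAT_WS('|', {parts}))"


print(build_key_dsl(["productid"]))
```

Sorting the column names before building the pairs means the same set of business keys yields the same DSL string, and therefore the same hash input, regardless of the order in which the columns were declared.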
🔧 10. Benefits of the DSL & AST¶
- deterministic and reproducible SQL
- clean abstraction from vendor SQL
- first-class testability
- enables multi-dialect rendering
- no SQL in metadata
- simple addition of new dialects
- supports window functions and subqueries
© 2025-2026 elevata - Technical Documentation