Skip to content

⚙️ Expression DSL & AST

This document describes elevata’s vendor-neutral Expression DSL (Domain Specific Language) and its corresponding AST (Expression Abstract Syntax Tree).

It forms the foundation of the multi-dialect SQL engine and powers:
- surrogate-key hashing
- foreign-key lineage hashing
- CONCAT/COALESCE operations
- window functions (ROW_NUMBER, etc.)
- subqueries in the LogicalPlan
- deterministic SQL generation across dialects


🔧 1. Purpose of the DSL & AST

The architecture introduces:
- a safe, declarative Expression DSL stored in metadata
- a parser converting DSL → AST
- a vendor-neutral AST describing expressions
- dialect renderers (BigQuery, Databricks, DuckDB, Fabric Warehouse, MSSQL, Postgres, Snowflake) that emit actual SQL

This ensures:
- deterministic SQL generation
- cross-dialect reproducibility
- fully testable and composable expression logic
- consistent hashing on all platforms


🔧 2. DSL → AST → SQL Rendering Pipeline

DSL string  →  DSL Parser  →  Expression AST  →  Dialect Renderer  →  Final SQL

Example DSL:

HASH256(
  CONCAT_WS('|',
    CONCAT('productid', '~', COALESCE({expr:productid}, 'null_replaced')),
    'pepper'
  )
)

AST (conceptual):

Hash256(
  ConcatWs('|', [
    Concat([
      Literal('productid'),
      Literal('~'),
      Coalesce(ColumnRef('productid'), Literal('null_replaced'))
    ]),
    Literal('pepper')
  ])
)

Dialect renderings:
- BigQueryTO_HEX(SHA256(CONCAT_WS('|', ...)))
- DatabricksSHA2(CONCAT_WS('|', ...), 256)
- DuckDBSHA256(CONCAT_WS('|', ...))
- Fabric WarehouseCONVERT(VARCHAR(64), HASHBYTES('SHA2_256', CAST(CONCAT_WS('|', ...) AS VARCHAR(4000))), 2)
- MSSQLCONVERT(VARCHAR(64), HASHBYTES('SHA2_256', CONCAT_WS('|', ...)), 2)
- PostgresENCODE(DIGEST(CONCAT_WS('|', ...), 'sha256'), 'hex')
- SnowflakeLOWER(TO_HEX(SHA2(CONCAT_WS('|', ...), 256)))


🔧 3. DSL Syntax

🧩 3.1 Supported core functions

DSL Function Description
HASH256(expr) Vendor-neutral SHA‑256 hash wrapper
CONCAT(a,b,...) Null-propagating concatenation
CONCAT_WS(sep,a,b,...) Null-safe concatenation with separator
COALESCE(a,b) Standard SQL coalesce
COL(name) Column reference
{expr:column} Reference to upstream expression column

The DSL is intentionally minimal and safe.

🧩 3.2 Identifiers

  • COL(bk1) and COL("bk1") behave equivalently.
  • Dialects re-apply proper quoting.

🧩 3.3 Literals

String literals may be defined using '...' or "...".

🧩 3.4 Upstream Expression References

Syntax:

{expr:column_name}

This refers to an upstream expression already defined in the execution graph.


🔧 4. DSL Parser

Located in: metadata/rendering/dsl.py

Responsibilities:
1. Normalize input
2. Detect function calls
3. Parse nested expressions
4. Parse literals
5. Split arguments respecting parentheses
6. Convert to AST nodes

Specialized rules:
- COL(name)ColumnRef
- 'literal'Literal
- {expr:x}ExprRef


🔧 5. Expression AST Nodes

All expression classes derive from a common base.

🧩 5.1 Primitive Nodes

🔎 Literal(value)

Represents a literal value in SQL.

🔎 ColumnRef(column_name, table_alias=None)

Represents a reference to a column.

🔎 ExprRef(name)

References an upstream-generated expression.


🧩 5.2 Function Expression Nodes

🔎 ConcatExpr(args)

Represents CONCAT(a,b,...).

🔎 ConcatWsExpr(separator, args)

Represents CONCAT_WS(sep, ...).

🔎 CoalesceExpr(a,b)

Represents COALESCE(a,b).

🔎 Hash256Expr(expr)

Vendor-neutral representation of SHA‑256 hashing.

Each dialect chooses its own SQL form.


🔧 6. Window Functions

Represented by:

WindowFunctionExpr(
  function_name,
  partition_by=[...],
  order_by=[OrderItem(expr, direction)]
)

Example:

ROW_NUMBER() OVER (PARTITION BY src ORDER BY updated_at DESC)

Used primarily in multi-source Stage ranking mode.


🔧 7. Subqueries in the AST

Subqueries are modeled using:

SubquerySource(select, alias)

Example pattern:

SELECT *
FROM (
  SELECT *, ROW_NUMBER() OVER (...) AS rn
  FROM union_all
) AS ranked
WHERE rn = 1

The dialect handles parenthesis placement and alias rendering.


🔧 8. Dialect Rendering Responsibilities

Each SQL dialect must render:
- literals
- identifiers
- CONCAT / CONCAT_WS
- COALESCE
- HASH256
- window functions
- subqueries

Consistency across dialects is ensured because all begin from the same AST.

Examples:
- BigQuery: TO_HEX(SHA256(...))
- Databricks: SHA2(..., 256)
- DuckDB: SHA256(...)
- Fabric Warehouse: CONVERT(VARCHAR(64), HASHBYTES('SHA2_256', ...), 2)
- MSSQL: HASHBYTES('SHA2_256', ...)
- Postgres: ENCODE(DIGEST(...), 'hex')
- Snowflake: LOWER(TO_HEX(SHA2(..., 256)))


🔧 9. Use in Surrogate & Foreign Keys

The SK/FK hashing pipeline uses the DSL and AST exclusively.

Guarantees:
- deterministic key generation
- lexicographically ordered BK parts
- proper literal separators: '~' (pair) and '|' (between pairs)
- null-protection using COALESCE
- dialect-consistent hashing algorithms


🔧 10. Benefits of the DSL & AST

  • deterministic and reproducible SQL
  • clean abstraction from vendor SQL
  • first-class testability
  • enables multi-dialect rendering
  • no SQL in metadata
  • simple addition of new dialects
  • supports window functions and subqueries

© 2025-2026 elevata Labs — Internal Technical Documentation