⚙️ SQL Dialect System

The elevata dialect system is the execution abstraction layer that makes the platform
vendor-neutral, deterministic, and production-ready across modern data platforms.

A dialect is responsible for translating a logical, metadata-driven query plan into
concrete, executable SQL for a specific target engine — while preserving:

  • semantic correctness
  • deterministic behavior
  • schema and contract guarantees

This means:

The same logical dataset definition produces the same semantic result —
regardless of whether it runs on BigQuery, Databricks, DuckDB, Fabric Warehouse, MSSQL, Postgres
or Snowflake.


🔧 1. Role of the Dialect System

The Dialect System abstracts all SQL-vendor differences.

High-level pipeline:

LogicalPlan  →  Expression AST  →  SqlDialect renderer  →  final SQL string

Each dialect is responsible for:

  • identifier quoting
  • literal rendering
  • expression rendering (DSL AST)
  • function names and argument patterns
  • window functions
  • subqueries and UNIONs

The LogicalPlan and Expression AST are vendor-neutral. Only SqlDialect knows how to turn them into concrete SQL.


🔧 2. SqlDialect Base Class

All dialects extend a shared base class, conceptually:

class SqlDialect:
    DIALECT_NAME: str

    def render_expr(self, expr: Expr) -> str:
        ...

    def render_select(self, select: LogicalSelect) -> str:
        ...

    def cast_expression(self, expr: Expr, target_type: str) -> str:
        ...

Every dialect must implement at least:

  • render_expr(expr) – render Expression AST nodes
  • render_select(select) – render LogicalSelect / LogicalUnion / SubquerySource
  • cast_expression(expr, target_type) – render dialect-specific CAST/CONVERT syntax

Helper methods inside dialects (e.g. quote_ident, _render_literal, _render_function_call) are built on top of these primitives.


🔧 3. Dialect Registry & Factory

Dialects are registered via a central registry defined in dialect_factory.py.

Each dialect module defines a class with a unique DIALECT_NAME, for example:

class PostgresDialect(SqlDialect):
    DIALECT_NAME = "postgres"

When the package is imported, dialect classes are discovered and registered into an internal dictionary:

REGISTRY = {
    "bigquery": BigQueryDialect,
    "databricks": DatabricksDialect,
    "duckdb": DuckDbDialect,
    "fabric_warehouse": FabricWarehouseDialect,
    "mssql": MssqlDialect,
    "postgres": PostgresDialect,
    "snowflake": SnowflakeDialect,
}

Public factory functions:

  • get_dialect(name: str) -> type[SqlDialect]
  • get_active_dialect(optional_override: str | None = None) -> SqlDialect
  • available_dialects() -> dict[str, type[SqlDialect]]

This allows:

  • explicit dialect selection by name
  • defaulting from environment or profile
  • future dynamic extension without touching core code
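The registry pattern can be sketched as follows. This is an illustrative, simplified sketch — the real `dialect_factory.py` may differ in detail, and only two dialects are shown:

```python
# Illustrative sketch of the dialect registry; the real dialect_factory.py
# may differ in detail.
class SqlDialect:
    DIALECT_NAME = "base"

class DuckDbDialect(SqlDialect):
    DIALECT_NAME = "duckdb"

class PostgresDialect(SqlDialect):
    DIALECT_NAME = "postgres"

# Discovery: each concrete class is keyed by its unique DIALECT_NAME.
REGISTRY = {cls.DIALECT_NAME: cls for cls in (DuckDbDialect, PostgresDialect)}

def get_dialect(name: str) -> type[SqlDialect]:
    try:
        return REGISTRY[name]
    except KeyError:
        raise ValueError(
            f"Unknown dialect {name!r}; available: {sorted(REGISTRY)}"
        ) from None

def available_dialects() -> dict[str, type[SqlDialect]]:
    # Return a copy so callers cannot mutate the registry.
    return dict(REGISTRY)
```

Because lookup happens by name at runtime, a new dialect only needs a unique `DIALECT_NAME` and an import to become selectable.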

🔧 4. Runtime Dialect Selection

Dialect selection follows a layered strategy:

  1. Explicit override (e.g. from a UI query param):
     ?dialect=postgres → PostgresDialect
  2. Profile / configuration (if set)
  3. Environment variable:
     ELEVATA_SQL_DIALECT=duckdb
  4. Fallback: a default dialect (usually DuckDB)

In the SQL Preview UI, the user selects a dialect from a dropdown.
The selection is passed as ?dialect=... to the preview endpoint, which then calls
get_active_dialect() with that override.


🔧 5. Identifier Rendering

All dialects receive unquoted identifiers from the LogicalPlan / AST and apply quoting rules themselves.

  • DuckDB: "identifier"
  • Postgres: "identifier"
  • MSSQL: "identifier"
  • ...

This avoids confusion around vendor-specific quoting and supports consistent behavior across dialects.

Rules:
- identifiers that contain uppercase letters, spaces, or reserved words are always quoted
- schema and table are rendered as "schema"."table" AS "alias"
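A minimal sketch of the quoting rules above, assuming a uniform double-quote policy with embedded quotes doubled (`render_table` is a hypothetical helper name):

```python
# Minimal sketch of double-quote identifier rendering; embedded quotes are
# doubled so the identifier stays valid SQL.
def quote_ident(name: str) -> str:
    return '"' + name.replace('"', '""') + '"'

def render_table(schema: str, table: str, alias: str) -> str:
    # Renders the "schema"."table" AS "alias" shape described above.
    return f"{quote_ident(schema)}.{quote_ident(table)} AS {quote_ident(alias)}"

print(render_table("rawcore", "rc_customer", "c"))
# → "rawcore"."rc_customer" AS "c"
```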


🔧 6. Literal Rendering

Literals are rendered via dialect helpers.

General rules:

  • strings → 'value' with proper escaping
  • integers / decimals → as-is
  • booleans → TRUE / FALSE (or 1/0)
  • None → NULL

There is no dialect-specific SQL stored in metadata.
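The general rules can be sketched as a single helper. This is an illustrative sketch; date handling (ISO strings wrapped as `DATE` literals) is an assumption:

```python
# Illustrative literal renderer following the rules above; the DATE literal
# shape is an assumption for illustration.
import datetime

def render_literal(value) -> str:
    if value is None:
        return "NULL"
    if isinstance(value, bool):            # check bool before int (bool is an int subclass)
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, datetime.date):
        return f"DATE '{value.isoformat()}'"
    # Strings: escape embedded single quotes by doubling them.
    return "'" + str(value).replace("'", "''") + "'"

print(render_literal("O'Brien"))  # → 'O''Brien'
```

Note the `bool` check must precede the numeric check, since `True` is also an `int` in Python.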


🔧 7. Expression Rendering (DSL AST)

All expressions in the LogicalPlan are built from the Expression DSL & AST.

Key node types include:

  • Literal
  • ColumnRef
  • ExprRef
  • ConcatExpr
  • ConcatWsExpr
  • CoalesceExpr
  • Hash256Expr
  • WindowFunctionExpr

The dialect implements render_expr(expr) by pattern-matching on node types.


🔧 8. Hashing (HASH256)

HASH256(<expr>)

All dialects must return a hex-encoded, 64-character, lowercase string.

🧩 BigQuery

TO_HEX(SHA256(<expr>))

🧩 Databricks

sha2(<expr>, 256)

🧩 DuckDB

sha256(<expr>)

🧩 Fabric Warehouse

CONVERT(VARCHAR(64), HASHBYTES('sha2_256', CAST(<expr> AS VARCHAR(4000))), 2)

🧩 MSSQL

CONVERT(VARCHAR(64), HASHBYTES('sha2_256', <expr>), 2)

🧩 Postgres

encode(digest(<expr>, 'sha256'), 'hex')

🧩 Snowflake

LOWER(TO_HEX(SHA2(<expr>, 256)))
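Since every dialect must emit the same digest, the expected value can be computed locally: Python's `hashlib` produces exactly the hex-encoded, 64-character, lowercase string the contract requires.

```python
# Reference value for cross-dialect HASH256 parity checks: SHA-256 of the
# UTF-8 bytes, hex-encoded (hexdigest() is lowercase by definition).
import hashlib

def expected_hash256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

digest = expected_hash256("customer|42")
print(len(digest), digest == digest.lower())  # → 64 True
```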

| Feature | BigQuery | Databricks | DuckDB | Fabric Warehouse | MSSQL | Postgres | Snowflake |
|---|---|---|---|---|---|---|---|
| Identifier quoting | `` `...` `` | `` `...` `` | `"..."` | `"..."` | `"..."` | `"..."` | `"..."` |
| HASH256 implementation | SHA256 + TO_HEX | SHA2 | SHA256 | HASHBYTES + CONVERT | HASHBYTES + CONVERT | DIGEST + ENCODE | SHA2 + TO_HEX |
| CONCAT / CONCAT_WS | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| COALESCE | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Window functions | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Subqueries in FROM | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| UNION / UNION ALL | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |

🔧 9. Window Functions

Window functions such as ROW_NUMBER() are represented at the AST level by WindowFunctionExpr.

Example:

ROW_NUMBER() OVER (
  PARTITION BY bk
  ORDER BY updated_at DESC
)

Window functions are rendered with ANSI SQL syntax in all dialects; they differ only in identifier quoting.

These are heavily used in multi-source Stage non-identity mode to compute __src_rank_ord.


🔧 10. Subqueries & UNION Rendering

🧩 Subqueries

Represented by SubquerySource(select, alias) in the LogicalPlan.

Rendered as:

(
  SELECT ...
) AS alias

Dialects can apply their own line-breaking / formatting rules, but the structure is identical.

🧩 UNION / UNION ALL

Represented by LogicalUnion(selects=[...], union_type='ALL'|'DISTINCT').

Rendered as:

SELECT ...
UNION ALL
SELECT ...
UNION ALL
SELECT ...

Again, only identifier/literal rendering and optional formatting differ by dialect.
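The two rendering shapes above can be sketched minimally. Select bodies are pre-rendered strings here for brevity (the real renderer works on `LogicalSelect` nodes):

```python
# Minimal sketch of subquery and UNION rendering; select bodies are
# pre-rendered strings for brevity.
def render_subquery(select_sql: str, alias: str) -> str:
    return f"(\n{select_sql}\n) AS {alias}"

def render_union(select_sqls: list[str], union_type: str = "ALL") -> str:
    sep = "\nUNION ALL\n" if union_type == "ALL" else "\nUNION\n"
    return sep.join(select_sqls)

print(render_union(["SELECT 1", "SELECT 2", "SELECT 3"]))
```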


🔧 12. Adding a New Dialect

To add a new dialect:

  1. Create a module under metadata/rendering/dialects/, e.g. snowflake.py.
  2. Implement a class:

     class SnowflakeDialect(SqlDialect):
         DIALECT_NAME = "snowflake"

         def render_expr(self, expr: Expr) -> str:
             ...

         def render_select(self, select: LogicalSelect) -> str:
             ...

         def cast_expression(self, expr: Expr, target_type: str) -> str:
             ...

  3. Ensure the module is imported (so the class registers itself).
  4. Add tests in tests/test_dialect_snowflake.py.
  5. Optionally expose it in the SQL preview dropdown.

No changes to metadata are required — all dialect logic is encapsulated.


🔧 13. Identifier vs. Table Identifier Rendering

Dialects implement two distinct methods for rendering identifiers:

| Method | Responsibility | Example Output |
|---|---|---|
| render_identifier(name: str) -> str | Renders a single identifier only (e.g. a column name) | "customer_id" |
| render_table_identifier(schema: str \| None, name: str) -> str | Renders a schema-qualified table reference | "rawcore"."rc_customer" |

🧩 Table existence semantics

Dialect introspection must reliably distinguish between:

  • table does not exist
  • table exists but has no columns (rare engine-specific case)

The planner assumes that table_exists=False means that the table must be provisioned.
Dialect implementations are responsible for handling engine-specific metadata quirks.

🧩 Usage in SQL Generation

col_sql = dialect.render_identifier(col_name)
tbl_sql = dialect.render_table_identifier(schema_name, table_name)

This separation ensures that:

- table-level quoting rules do not affect column expressions
- engines without schemas can pass None for schema
- identifiers remain valid for cross-schema and cross-dialect SQL

🧩 Dialect Support Summary

| Feature | BigQuery | Databricks | DuckDB | Fabric Warehouse | MSSQL | Postgres | Snowflake |
|---|---|---|---|---|---|---|---|
| Identifier quoting | `` `...` `` | `` `...` `` | `"..."` | `"..."` | `"..."` | `"..."` | `"..."` |
| Schema-qualified tables via render_table_identifier | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |

🔧 14. Testing Strategy

The Dialect System is validated via:

  • unit tests for each dialect
  • cross-dialect hashing tests (SK/FK equality)
  • SQL preview tests
  • LogicalPlan → SQL snapshot tests

Focus areas:

  • identifier quoting correctness
  • literal escaping
  • HASH256 implementation parity
  • CONCAT / CONCAT_WS behavior
  • window function correctness
  • subquery and UNION formatting

If all dialect tests pass, the multi-dialect engine behaves consistently.


🔧 15. Summary

The Dialect System is the backbone of elevata’s multi-backend strategy:

  • SQL is generated from a vendor-neutral LogicalPlan + AST
  • dialects implement only the final rendering
  • SK/FK hashing is cross-dialect identical
  • adding new engines is straightforward

This architecture ensures that elevata can support more backends (Snowflake, BigQuery, Databricks, …) without changing core metadata or generation logic.


🔧 16. Dialect diagnostics & health check

The Dialect System exposes a lightweight diagnostics layer to verify that all registered SQL dialects behave as expected.

Diagnostics can be accessed in two ways:

  1. Programmatic API
  2. CLI command via manage.py elevata_dialect_check

🧩 16.1 Programmatic diagnostics

The module metadata.rendering.dialects.diagnostics provides convenience functions to inspect all dialects at once:

  • collect_dialect_diagnostics(dialect)
  • snapshot_all_dialects()

Each snapshot contains:

  • name / class_name
  • supports_merge
  • supports_delete_detection
  • supports_hash_expression
  • example literal renderings (TRUE/FALSE/NULL/date)
  • example expressions for CONCAT and HASH256

This is useful for debugging and for asserting capabilities in tests, without having to introspect each dialect manually.

🧩 16.2 CLI: elevata_dialect_check

For a quick end-to-end smoke test of all registered SQL dialects, use:

python manage.py elevata_dialect_check

This command:

  • discovers all registered dialects (DuckDB, Postgres, MSSQL, …),
  • prints basic capabilities:
    • supports_merge
    • supports_delete_detection
  • runs a set of small self-checks per dialect:
    • identifier quoting (quote_ident)
    • literal rendering for strings, numbers, booleans, dates, datetimes, decimals
    • concat_expression(...)
    • hash_expression(...)
    • (optionally) render_insert_into_table(...)
    • (optionally) render_merge_statement(...)
    • (optionally) render_delete_detection_statement(...)
  • reports each check as:
    • OK → check succeeded
    • N/I → NotImplementedError (feature not implemented yet)
    • FAIL → any other exception

This is intentionally non-invasive: it only renders SQL; it does not execute it
against a live database. The command is meant as a quick guardrail during
development and CI to detect regressions in dialect implementations early.


🔧 17. Execution engines

Each SQL dialect can optionally provide an execution engine that knows how to
run SQL statements against a concrete target system.

The base interface lives in rendering/dialects/base.py:

  • BaseExecutionEngine.execute(sql: str) -> int | None
  • SqlDialect.get_execution_engine(system) -> BaseExecutionEngine

Concrete dialects (e.g. DuckDBDialect) implement their own execution engine in the same module:

  • DuckDBDialect.get_execution_engine(system) returns a DuckDbExecutionEngine
  • DuckDbExecutionEngine implements execute(sql: str)

🧩 Execution Engine vs SQLAlchemy Engine

In elevata, SQL execution and metadata introspection are intentionally separated concerns.

Two different connection mechanisms may exist for the same target system:

| Component | Purpose |
|---|---|
| Execution Engine | Executes rendered SQL statements (--execute) |
| SQLAlchemy Engine | Used for introspection, materialization planning, and schema inspection |

In many databases (e.g. Postgres, DuckDB, Snowflake) both roles may use the same underlying connection mechanism.

However, analytical platforms such as BigQuery or Databricks may require different technical connectors:

  • Execution via native API or warehouse connector
  • Introspection via SQLAlchemy-compatible drivers

This separation is intentional and allows elevata to support execution environments
that do not expose a full SQLAlchemy-compatible runtime.


🔧 18. Dialect Parity Checklist

Purpose: This checklist defines the mandatory contract every officially supported dialect must fulfill
in order to support --execute, auto-provisioning, and observability consistently across BigQuery, Databricks, DuckDB, Fabric Warehouse, MSSQL, PostgreSQL and Snowflake.

The goal is behavioral parity, not identical SQL syntax.

🧩 18.1 Core Rendering Contract (SQL Preview & Generation)

Every dialect must implement the core SQL rendering primitives so that SQL preview and generated SQL behave consistently.

Required methods:
- render_identifier(name: str) -> str
- render_table_identifier(schema: str | None, name: str) -> str
- render_literal(value: Any) -> str
- render_expr(expr: Expr) -> str
- render_select(select: LogicalSelect | LogicalUnion | ...) -> str
- cast_expression(expr: Expr, target_type: str) -> str

Required parity guarantees (v0.6.x):
- Deterministic HASH256 rendering (hex‑encoded, 64 characters)
- Consistent CONCAT / CONCAT_WS semantics
- Window functions parity (ROW_NUMBER, partitioning, ordering)
- Nested subqueries and UNION rendering parity

🧩 18.2 Execution Contract (--execute)

Every dialect that is considered supported must provide a working execution engine.

Required methods:
- get_execution_engine(system) -> BaseExecutionEngine
- BaseExecutionEngine.execute(sql: str) -> int | None

Execution expectations:
- Executes multi‑statement SQL safely
- Uses the resolved target connection (system.security["connection_string"])
- Raises clear, actionable exceptions on connection or SQL errors
- Returns affected row counts where the backend supports it (optional)

Dialects without a working execution engine are considered render‑only and must not be advertised as fully supported targets.

🧩 18.3 Auto‑Provisioning Contract (Warehouse‑Side DDL)

All supported dialects must support idempotent warehouse provisioning.

Required methods:
- render_create_schema_if_not_exists(schema: str) -> str
- render_create_table_if_not_exists(td: TargetDataset) -> str
- render_create_table_if_not_exists_from_columns(schema: str, table: str, columns: list[dict[str, object]]) -> str
- render_add_column(schema: str, table: str, column: str, column_type: str | None) -> str

Rules:
- DDL must be safe to execute multiple times
- No destructive operations (no DROP)
- Target table DDL must be derived from TargetColumn metadata

🧩 18.4 Observability & Run Logging Contract

Every supported dialect must support warehouse‑level execution logging.

Required behavior:
- Logging is emitted into meta.load_run_log with a canonical schema (registry-driven).
- The INSERT logic is centralized (dialects must not each hardcode their own log INSERT).


🧩 18.5 Incremental & Historization Contract

Dialects that support incremental pipelines (Rawcore / History) must be able to execute:

  • New-row inserts
  • Changed-row handling (close previous version + insert new version)
  • Delete detection SQL
  • Delete marking in history (version_state = 'deleted')

Dialect implementations must render the required SQL primitives consistently; architectural details are specified in:
- Load SQL Architecture
- Historization Architecture

🧩 18.6 Diagnostics & Parity Validation

Every supported dialect must pass a minimal diagnostic suite:

  • Render-only smoke tests
  • Identifier quoting tests
  • Literal escaping tests
  • HASH256 output consistency tests
  • Minimal merge / historization render tests (where applicable)

Execution parity is validated by:
- Running --execute on all supported targets
- Verifying schema provisioning, table provisioning, and run logging


Summary
A dialect is only considered fully supported in elevata when it satisfies all sections 18.1 – 18.6.
Partial implementations must be clearly marked as render-only or experimental.


🔧 19. Merge & Historization Contract

The Dialect System is responsible not only for expression and SELECT rendering,
but also for deterministic incremental data movement and historization semantics.

The architectural principle is:

load_sql provides semantic ingredients.
The dialect owns the SQL shape.

This guarantees clean separation of concerns and cross-dialect behavioral parity.


🧩 19.1 Merge Contract (Incremental RAWCORE Loads)

Incremental RAWCORE datasets require deterministic merge semantics.

The core layer prepares semantic ingredients:

  • source_select_sql
  • key_columns
  • update_columns
  • insert_columns

The dialect must implement:

render_merge_statement(...)

Dialect responsibilities:

  • Use native MERGE where supported
  • Provide a performant fallback (UPDATE + INSERT) when MERGE is unavailable
  • Render deterministic key predicates
  • Preserve type correctness
  • Apply identifier quoting consistently

All officially supported dialects must support merge semantics.
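A hedged sketch of how the semantic ingredients can be assembled into an ANSI MERGE shape. Real dialects add identifier quoting, type handling, and the UPDATE + INSERT fallback; this illustration skips all three:

```python
# Hedged sketch of an ANSI MERGE assembled from the semantic ingredients
# above; real dialects add quoting, typing, and a non-MERGE fallback.
def render_merge_statement(
    target: str,
    source_select_sql: str,
    key_columns: list[str],
    update_columns: list[str],
    insert_columns: list[str],
) -> str:
    on = " AND ".join(f"t.{c} = s.{c}" for c in key_columns)
    set_sql = ", ".join(f"{c} = s.{c}" for c in update_columns)
    cols = ", ".join(insert_columns)
    vals = ", ".join(f"s.{c}" for c in insert_columns)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING ({source_select_sql}) AS s\n"
        f"ON {on}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_sql}\n"
        f"WHEN NOT MATCHED THEN INSERT ({cols}) VALUES ({vals})"
    )

sql = render_merge_statement(
    "rawcore.rc_customer",
    "SELECT * FROM stage.customer",
    key_columns=["customer_bk"],
    update_columns=["name"],
    insert_columns=["customer_bk", "name"],
)
```

Deterministic key predicates fall out of iterating `key_columns` in order, which is exactly the cross-dialect parity the contract demands.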


🧩 19.2 Delete Detection Contract

For incremental pipelines with handle_deletes=True, dialects must implement:

render_delete_detection_statement(...)

The core provides:

  • target schema + table
  • stage schema + table
  • join predicates
  • scope filter

The dialect renders the DELETE (or equivalent) statement.

Dialects may opt out by setting:

supports_delete_detection = False

In that case, the core raises a clear NotImplementedError.


🧩 19.3 Historization Contract (SCD Type 2)

Historization pipelines for *_hist datasets are fully dialect-driven.

The core prepares semantic ingredients and calls:

render_hist_incremental_statement(...)

The default orchestration renders:

  1. Changed version UPDATE
  2. Deleted version UPDATE
  3. Changed version INSERT
  4. New version INSERT

Dialects may override orchestration if required by the backend.

Strict guarantees:

  • Exactly one open version per business key
  • Deterministic row_hash comparison
  • No overlapping validity intervals
  • Append-only behavior

Historization SQL must be fully executable and not a placeholder.


🧩 19.4 Strict Mode Guarantees

The core validates:

  • Required SCD columns exist
  • Insert column alignment is 1:1
  • Required surrogate keys are present

Dialects can assume validated semantic input and focus solely on SQL rendering.


🧩 19.5 Architectural Symmetry

Merge and Historization follow the same layered responsibility model:

| Layer | Responsibility |
|---|---|
| Logical layer | Semantic intent |
| load_sql | Semantic ingredient preparation |
| Dialect | SQL shape rendering |
| Execution engine | Execution only |

This symmetry ensures:

  • Predictable cross-dialect behavior
  • Clean extensibility
  • Stable internal architecture
  • No SQL-shape logic in the core layer

🔧 20. Reserved Keyword Registry

Each dialect provides a deterministic reserved keyword registry located in
rendering/dialects/keywords/<dialect>.py.

Keyword extraction follows one of two strategies:

  • Engine-truth (preferred when available)
  • Documentation-based parsing with strict sanity validation

All SQL required for engine-based extraction must be implemented inside the dialect class
via render_reserved_keywords_query().

Management commands must not contain vendor-specific SQL.
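As an example of the engine-truth strategy, Postgres exposes its own keyword list through `pg_get_keywords()` (catcode 'R' marks fully reserved words). The exact query below is an illustrative sketch of what `render_reserved_keywords_query()` could return:

```python
# Hedged example of engine-truth keyword extraction for Postgres;
# pg_get_keywords() is a real server function, catcode 'R' = reserved.
class PostgresDialect:
    DIALECT_NAME = "postgres"

    def render_reserved_keywords_query(self) -> str:
        return (
            "SELECT word FROM pg_get_keywords() "
            "WHERE catcode = 'R' ORDER BY word"
        )
```

Keeping this SQL inside the dialect class (not the management command) is what the rule above enforces.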



© 2025-2026 elevata Labs — Internal Technical Documentation