⚙️ SQL Dialect System

The elevata dialect system is the execution abstraction layer that makes the platform
vendor-neutral, deterministic, and production-ready across modern data platforms.

A dialect is responsible for translating a logical, metadata-driven query plan into
concrete, executable SQL for a specific target engine — while preserving:

  • semantic correctness
  • deterministic behavior
  • schema and contract guarantees

This means:

The same logical dataset definition produces the same semantic result —
regardless of whether it runs on BigQuery, Databricks, DuckDB, Fabric Warehouse, MSSQL, Postgres
or Snowflake.


🔧 1. Role of the Dialect System

The Dialect System abstracts all SQL-vendor differences.

High-level pipeline:

LogicalPlan  →  Expression AST  →  SqlDialect renderer  →  final SQL string

Each dialect is responsible for:

  • identifier quoting
  • literal rendering
  • expression rendering (DSL AST)
  • function names and argument patterns
  • window functions
  • subqueries and UNIONs

The LogicalPlan and Expression AST are vendor-neutral. Only SqlDialect knows how to turn them into concrete SQL.


🔧 2. SqlDialect Base Class

All dialects extend a shared base class, conceptually:

class SqlDialect:
    DIALECT_NAME: str

    def render_expr(self, expr: Expr) -> str:
        ...

    def render_select(self, select: LogicalSelect) -> str:
        ...

    def cast_expression(self, expr: Expr, target_type: str) -> str:
        ...

Every dialect must implement at least:

  • render_expr(expr) – render Expression AST nodes
  • render_select(select) – render LogicalSelect / LogicalUnion / SubquerySource
  • cast_expression(expr, target_type) – render dialect-specific CAST/CONVERT syntax

Helper methods inside dialects (e.g. quote_ident, _render_literal, _render_function_call) are built on top of these primitives.


🔧 3. Dialect Registry & Factory

Dialects are registered via a central registry defined in dialect_factory.py.

Each dialect module defines a class with a unique DIALECT_NAME, for example:

class PostgresDialect(SqlDialect):
    DIALECT_NAME = "postgres"

When the package is imported, dialect classes are discovered and registered into an internal dictionary:

REGISTRY = {
    "bigquery": BigQueryDialect,
    "databricks": DatabricksDialect,
    "duckdb": DuckDbDialect,
    "fabric_warehouse": FabricWarehouseDialect,
    "mssql": MssqlDialect,
    "postgres": PostgresDialect,
    "snowflake": SnowflakeDialect,
}

Public factory functions:

  • get_dialect(name: str) -> type[SqlDialect]
  • get_active_dialect(optional_override: str | None = None) -> SqlDialect
  • available_dialects() -> dict[str, type[SqlDialect]]

This allows:

  • explicit dialect selection by name
  • defaulting from environment or profile
  • future dynamic extension without touching core code
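The registry pattern can be sketched as follows. This is an illustrative, simplified sketch — the real `dialect_factory.py` may differ in detail, and only two dialects are shown:

```python
# Illustrative sketch of the dialect registry; the real dialect_factory.py
# may differ in detail.
class SqlDialect:
    DIALECT_NAME = "base"

class DuckDbDialect(SqlDialect):
    DIALECT_NAME = "duckdb"

class PostgresDialect(SqlDialect):
    DIALECT_NAME = "postgres"

# Discovery: each concrete class is keyed by its unique DIALECT_NAME.
REGISTRY = {cls.DIALECT_NAME: cls for cls in (DuckDbDialect, PostgresDialect)}

def get_dialect(name: str) -> type[SqlDialect]:
    try:
        return REGISTRY[name]
    except KeyError:
        raise ValueError(
            f"Unknown dialect {name!r}; available: {sorted(REGISTRY)}"
        ) from None

def available_dialects() -> dict[str, type[SqlDialect]]:
    # Return a copy so callers cannot mutate the registry.
    return dict(REGISTRY)
```

Because lookup happens by name at runtime, a new dialect only needs a unique `DIALECT_NAME` and an import to become selectable.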

🔧 4. Runtime Dialect Selection

Dialect selection follows a layered strategy:

  1. Explicit override (e.g. from a UI query param):
     ?dialect=postgres → PostgresDialect
  2. Profile / configuration (if set)
  3. Environment variable:
     ELEVATA_SQL_DIALECT=duckdb
  4. Fallback: a default dialect (usually DuckDB)

In the SQL Preview UI, the user selects a dialect from a dropdown.
The selection is passed as ?dialect=... to the preview endpoint, which then calls
get_active_dialect() with that override.


🔧 5. Identifier Rendering

All dialects receive unquoted identifiers from the LogicalPlan / AST and apply quoting rules themselves.

  • DuckDB: "identifier"
  • Postgres: "identifier"
  • MSSQL: "identifier"
  • ...

This avoids confusion around vendor-specific quoting and supports consistent behavior across dialects.

Rules:
- identifiers that contain uppercase letters, spaces, or reserved words are always quoted
- schema and table are rendered as "schema"."table" AS "alias"
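A minimal sketch of the quoting rules above, assuming a uniform double-quote policy with embedded quotes doubled (`render_table` is a hypothetical helper name):

```python
# Minimal sketch of double-quote identifier rendering; embedded quotes are
# doubled so the identifier stays valid SQL.
def quote_ident(name: str) -> str:
    return '"' + name.replace('"', '""') + '"'

def render_table(schema: str, table: str, alias: str) -> str:
    # Renders the "schema"."table" AS "alias" shape described above.
    return f"{quote_ident(schema)}.{quote_ident(table)} AS {quote_ident(alias)}"

print(render_table("rawcore", "rc_customer", "c"))
# → "rawcore"."rc_customer" AS "c"
```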


🔧 6. Literal Rendering

Literals are rendered via dialect helpers.

General rules:

  • strings → 'value' with proper escaping
  • integers / decimals → as-is
  • booleans → TRUE / FALSE (or 1/0)
  • None → NULL

There is no dialect-specific SQL stored in metadata.
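The general rules can be sketched as a single helper. This is an illustrative sketch; date handling (ISO strings wrapped as `DATE` literals) is an assumption:

```python
# Illustrative literal renderer following the rules above; the DATE literal
# shape is an assumption for illustration.
import datetime

def render_literal(value) -> str:
    if value is None:
        return "NULL"
    if isinstance(value, bool):            # check bool before int (bool is an int subclass)
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, datetime.date):
        return f"DATE '{value.isoformat()}'"
    # Strings: escape embedded single quotes by doubling them.
    return "'" + str(value).replace("'", "''") + "'"

print(render_literal("O'Brien"))  # → 'O''Brien'
```

Note the `bool` check must precede the numeric check, since `True` is also an `int` in Python.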


🔧 7. Expression Rendering (DSL AST)

All expressions in the LogicalPlan are built from the Expression DSL & AST.

Key node types include:

  • Literal
  • ColumnRef
  • ExprRef
  • ConcatExpr
  • ConcatWsExpr
  • CoalesceExpr
  • Hash256Expr
  • WindowFunctionExpr

The dialect implements render_expr(expr) by pattern-matching on node types.


🔧 8. Hashing (HASH256)

HASH256(<expr>)

All dialects must return a hex-encoded, 64-character, lowercase string.

🧩 BigQuery

TO_HEX(SHA256(<expr>))

🧩 Databricks

sha2(<expr>, 256)

🧩 DuckDB

sha256(<expr>)

🧩 Fabric Warehouse

CONVERT(VARCHAR(64), HASHBYTES('sha2_256', CAST(<expr> AS VARCHAR(4000))), 2)

🧩 MSSQL

CONVERT(VARCHAR(64), HASHBYTES('sha2_256', <expr>), 2)

🧩 Postgres

encode(digest(<expr>, 'sha256'), 'hex')

🧩 Snowflake

LOWER(TO_HEX(SHA2(<expr>, 256)))
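Since every dialect must emit the same digest, the expected value can be computed locally: Python's `hashlib` produces exactly the hex-encoded, 64-character, lowercase string the contract requires.

```python
# Reference value for cross-dialect HASH256 parity checks: SHA-256 of the
# UTF-8 bytes, hex-encoded (hexdigest() is lowercase by definition).
import hashlib

def expected_hash256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

digest = expected_hash256("customer|42")
print(len(digest), digest == digest.lower())  # → 64 True
```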

| Feature | BigQuery | Databricks | DuckDB | Fabric Warehouse | MSSQL | Postgres | Snowflake |
|---|---|---|---|---|---|---|---|
| Identifier quoting | `` `...` `` | `` `...` `` | `"..."` | `"..."` | `"..."` | `"..."` | `"..."` |
| HASH256 implementation | SHA256 + TO_HEX | SHA2 | SHA256 | HASHBYTES + CONVERT | HASHBYTES + CONVERT | DIGEST + ENCODE | SHA2 + TO_HEX |
| CONCAT / CONCAT_WS | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| COALESCE | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Window functions | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Subqueries in FROM | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| UNION / UNION ALL | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |

🔧 9. Window Functions

Window functions such as ROW_NUMBER() are represented at the AST level by WindowFunctionExpr.

Example:

ROW_NUMBER() OVER (
  PARTITION BY bk
  ORDER BY updated_at DESC
)

Window functions are rendered with ANSI SQL syntax in all dialects; they differ only in identifier quoting.

These are heavily used in multi-source Stage non-identity mode to compute __src_rank_ord.


🔧 10. Subqueries & UNION Rendering

🧩 Subqueries

Represented by SubquerySource(select, alias) in the LogicalPlan.

Rendered as:

(
  SELECT ...
) AS alias

Dialects can apply their own line-breaking / formatting rules, but the structure is identical.

🧩 UNION / UNION ALL

Represented by LogicalUnion(selects=[...], union_type='ALL'|'DISTINCT').

Rendered as:

SELECT ...
UNION ALL
SELECT ...
UNION ALL
SELECT ...

Again, only identifier/literal rendering and optional formatting differ by dialect.
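The two rendering shapes above can be sketched minimally. Select bodies are pre-rendered strings here for brevity (the real renderer works on `LogicalSelect` nodes):

```python
# Minimal sketch of subquery and UNION rendering; select bodies are
# pre-rendered strings for brevity.
def render_subquery(select_sql: str, alias: str) -> str:
    return f"(\n{select_sql}\n) AS {alias}"

def render_union(select_sqls: list[str], union_type: str = "ALL") -> str:
    sep = "\nUNION ALL\n" if union_type == "ALL" else "\nUNION\n"
    return sep.join(select_sqls)

print(render_union(["SELECT 1", "SELECT 2", "SELECT 3"]))
```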


🔧 12. Adding a New Dialect

To add a new dialect:

  1. Create a module under metadata/rendering/dialects/, e.g. snowflake.py.
  2. Implement a class:

     class SnowflakeDialect(SqlDialect):
         DIALECT_NAME = "snowflake"

         def render_expr(self, expr: Expr) -> str:
             ...

         def render_select(self, select: LogicalSelect) -> str:
             ...

         def cast_expression(self, expr: Expr, target_type: str) -> str:
             ...

  3. Ensure the module is imported (so the class registers itself).
  4. Add tests in tests/test_dialect_snowflake.py.
  5. Optionally expose it in the SQL preview dropdown.

No changes to metadata are required — all dialect logic is encapsulated.


🔧 13. Identifier vs. Table Identifier Rendering

Dialects implement two distinct methods for rendering identifiers:

| Method | Responsibility | Example Output |
|---|---|---|
| render_identifier(name: str) -> str | Renders a single identifier only (e.g. a column name) | "customer_id" |
| render_table_identifier(schema: str \| None, name: str) -> str | Renders a schema-qualified table reference | "rawcore"."rc_customer" |

🧩 Table existence semantics

Dialect introspection must reliably distinguish between:

  • table does not exist
  • table exists but has no columns (rare engine-specific case)

The planner assumes that table_exists=False means that the table must be provisioned.
Dialect implementations are responsible for handling engine-specific metadata quirks.

🧩 Usage in SQL Generation

col_sql = dialect.render_identifier(col_name)
tbl_sql = dialect.render_table_identifier(schema_name, table_name)

This separation ensures that:

- table-level quoting rules do not affect column expressions
- engines without schemas can pass None for schema
- identifiers remain valid for cross-schema and cross-dialect SQL

🧩 Dialect Support Summary

| Feature | BigQuery | Databricks | DuckDB | Fabric Warehouse | MSSQL | Postgres | Snowflake |
|---|---|---|---|---|---|---|---|
| Identifier quoting | `` `...` `` | `` `...` `` | `"..."` | `"..."` | `"..."` | `"..."` | `"..."` |
| Schema-qualified tables via render_table_identifier | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |

🔧 14. Testing Strategy

The Dialect System is validated via:

  • unit tests for each dialect
  • cross-dialect hashing tests (SK/FK equality)
  • SQL preview tests
  • LogicalPlan → SQL snapshot tests

Focus areas:

  • identifier quoting correctness
  • literal escaping
  • HASH256 implementation parity
  • CONCAT / CONCAT_WS behavior
  • window function correctness
  • subquery and UNION formatting

If all dialect tests pass, the multi-dialect engine behaves consistently.


🔧 15. Summary

The Dialect System is the backbone of elevata’s multi-backend strategy:

  • SQL is generated from a vendor-neutral LogicalPlan + AST
  • dialects implement only the final rendering
  • SK/FK hashing is cross-dialect identical
  • adding new engines is straightforward

This architecture ensures that elevata can support more backends (Snowflake, BigQuery, Databricks, …) without changing core metadata or generation logic.


🔧 16. Dialect diagnostics & health check

The Dialect System exposes a lightweight diagnostics layer to verify that all registered SQL dialects behave as expected.

Diagnostics can be accessed in two ways:

  1. Programmatic API
  2. CLI command via manage.py elevata_dialect_check

🧩 16.1 Programmatic diagnostics

The module metadata.rendering.dialects.diagnostics provides convenience functions to inspect all dialects at once:

  • collect_dialect_diagnostics(dialect)
  • snapshot_all_dialects()

Each snapshot contains:

  • name / class_name
  • supports_merge
  • supports_delete_detection
  • supports_hash_expression
  • example literal renderings (TRUE/FALSE/NULL/date)
  • example expressions for CONCAT and HASH256

This is useful for debugging and for asserting capabilities in tests, without having to introspect each dialect manually.

🧩 16.2 CLI: elevata_dialect_check

For a quick end-to-end smoke test of all registered SQL dialects, use:

python manage.py elevata_dialect_check

This command:

  • discovers all registered dialects (DuckDB, Postgres, MSSQL, …),
  • prints basic capabilities:
    • supports_merge
    • supports_delete_detection
  • runs a set of small self-checks per dialect:
    • identifier quoting (quote_ident)
    • literal rendering for strings, numbers, booleans, dates, datetimes, decimals
    • concat_expression(...)
    • hash_expression(...)
    • (optionally) render_insert_into_table(...)
    • (optionally) render_merge_statement(...)
    • (optionally) render_delete_detection_statement(...)
  • reports each check as:
    • OK → check succeeded
    • N/I → NotImplementedError (feature not implemented yet)
    • FAIL → any other exception

This is intentionally non-invasive: it only renders SQL; it does not execute it
against a live database. The command is meant as a quick guardrail during
development and CI to detect regressions in dialect implementations early.


🔧 17. Execution engines

Each SQL dialect can optionally provide an execution engine that knows how to
run SQL statements against a concrete target system.

The base interface lives in rendering/dialects/base.py:

  • BaseExecutionEngine.execute(sql: str) -> int | None
  • SqlDialect.get_execution_engine(system) -> BaseExecutionEngine

Concrete dialects (e.g. DuckDBDialect) implement their own execution engine in the same module:

  • DuckDBDialect.get_execution_engine(system) returns a DuckDbExecutionEngine
  • DuckDbExecutionEngine implements execute(sql: str)

🧩 Execution Engine vs SQLAlchemy Engine

In elevata, SQL execution and metadata introspection are intentionally separated concerns.

Two different connection mechanisms may exist for the same target system:

| Component | Purpose |
|---|---|
| Execution Engine | Executes rendered SQL statements (--execute) |
| SQLAlchemy Engine | Used for introspection, materialization planning, and schema inspection |

In many databases (e.g. Postgres, DuckDB, Snowflake) both roles may use the same underlying connection mechanism.

However, analytical platforms such as BigQuery or Databricks may require different technical connectors:

  • Execution via native API or warehouse connector
  • Introspection via SQLAlchemy-compatible drivers

This separation is intentional and allows elevata to support execution environments
that do not expose a full SQLAlchemy-compatible runtime.


🔧 18. Dialect Parity Checklist

Purpose: This checklist defines the mandatory contract every officially supported dialect must fulfill
in order to support --execute, auto-provisioning, and observability consistently across BigQuery, Databricks, DuckDB, Fabric Warehouse, MSSQL, PostgreSQL and Snowflake.

The goal is behavioral parity, not identical SQL syntax.

🧩 18.1 Core Rendering Contract (SQL Preview & Generation)

Every dialect must implement the core SQL rendering primitives so that SQL preview and generated SQL behave consistently.

Required methods:
- render_identifier(name: str) -> str
- render_table_identifier(schema: str | None, name: str) -> str
- render_literal(value: Any) -> str
- render_expr(expr: Expr) -> str
- render_select(select: LogicalSelect | LogicalUnion | ...) -> str
- cast_expression(expr: Expr, target_type: str) -> str

Required parity guarantees (v0.6.x):
- Deterministic HASH256 rendering (hex‑encoded, 64 characters)
- Consistent CONCAT / CONCAT_WS semantics
- Window functions parity (ROW_NUMBER, partitioning, ordering)
- Nested subqueries and UNION rendering parity

🧩 18.2 Execution Contract (--execute)

Every dialect that is considered supported must provide a working execution engine.

Required methods:
- get_execution_engine(system) -> BaseExecutionEngine
- BaseExecutionEngine.execute(sql: str) -> int | None

Execution expectations:
- Executes multi‑statement SQL safely
- Uses the resolved target connection (system.security["connection_string"])
- Raises clear, actionable exceptions on connection or SQL errors
- Returns affected row counts where the backend supports it (optional)

Dialects without a working execution engine are considered render‑only and must not be advertised as fully supported targets.

🧩 18.3 Auto‑Provisioning Contract (Warehouse‑Side DDL)

All supported dialects must support idempotent warehouse provisioning.

Required methods:
- render_create_schema_if_not_exists(schema: str) -> str
- render_create_table_if_not_exists(td: TargetDataset) -> str
- render_create_table_if_not_exists_from_columns(schema: str, table: str, columns: list[dict[str, object]]) -> str
- render_add_column(schema: str, table: str, column: str, column_type: str | None) -> str

Rules:
- DDL must be safe to execute multiple times
- No destructive operations (no DROP)
- Target table DDL must be derived from TargetColumn metadata

🧩 18.4 Observability & Run Logging Contract

Every supported dialect must support warehouse‑level execution logging.

Required behavior:
- Logging is emitted into meta.load_run_log with a canonical schema (registry-driven).
- The INSERT logic is centralized (dialects must not each hardcode their own log INSERT).


🧩 18.5 Incremental & Historization Contract

Dialects that support incremental pipelines (Rawcore / History) must be able to execute:

  • New-row inserts
  • Changed-row handling (close previous version + insert new version)
  • Delete detection SQL
  • Delete marking in history (version_state = 'deleted')

Dialect implementations must render the required SQL primitives consistently; architectural details are specified in:
- Load SQL Architecture
- Historization Architecture

🧩 18.6 Diagnostics & Parity Validation

Every supported dialect must pass a minimal diagnostic suite:

  • Render-only smoke tests
  • Identifier quoting tests
  • Literal escaping tests
  • HASH256 output consistency tests
  • Minimal merge / historization render tests (where applicable)

Execution parity is validated by:
- Running --execute on all supported targets
- Verifying schema provisioning, table provisioning, and run logging


Summary
A dialect is only considered fully supported in elevata when it satisfies all sections 18.1 – 18.6.
Partial implementations must be clearly marked as render-only or experimental.


🔧 19. Merge & Historization Contract

The Dialect System is responsible not only for expression and SELECT rendering,
but also for deterministic incremental data movement and historization semantics.

The architectural principle is:

load_sql provides semantic ingredients.
The dialect owns the SQL shape.

This guarantees clean separation of concerns and cross-dialect behavioral parity.


🧩 19.1 Merge Contract (Incremental RAWCORE Loads)

Incremental RAWCORE datasets require deterministic merge semantics.

The core layer prepares semantic ingredients:

  • source_select_sql
  • key_columns
  • update_columns
  • insert_columns

The dialect must implement:

render_merge_statement(...)

Dialect responsibilities:

  • Use native MERGE where supported
  • Provide a performant fallback (UPDATE + INSERT) when MERGE is unavailable
  • Render deterministic key predicates
  • Preserve type correctness
  • Apply identifier quoting consistently

All officially supported dialects must support merge semantics.
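A hedged sketch of how the semantic ingredients can be assembled into an ANSI MERGE shape. Real dialects add identifier quoting, type handling, and the UPDATE + INSERT fallback; this illustration skips all three:

```python
# Hedged sketch of an ANSI MERGE assembled from the semantic ingredients
# above; real dialects add quoting, typing, and a non-MERGE fallback.
def render_merge_statement(
    target: str,
    source_select_sql: str,
    key_columns: list[str],
    update_columns: list[str],
    insert_columns: list[str],
) -> str:
    on = " AND ".join(f"t.{c} = s.{c}" for c in key_columns)
    set_sql = ", ".join(f"{c} = s.{c}" for c in update_columns)
    cols = ", ".join(insert_columns)
    vals = ", ".join(f"s.{c}" for c in insert_columns)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING ({source_select_sql}) AS s\n"
        f"ON {on}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_sql}\n"
        f"WHEN NOT MATCHED THEN INSERT ({cols}) VALUES ({vals})"
    )

sql = render_merge_statement(
    "rawcore.rc_customer",
    "SELECT * FROM stage.customer",
    key_columns=["customer_bk"],
    update_columns=["name"],
    insert_columns=["customer_bk", "name"],
)
```

Deterministic key predicates fall out of iterating `key_columns` in order, which is exactly the cross-dialect parity the contract demands.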


🧩 19.2 Delete Detection Contract

For incremental pipelines with handle_deletes=True, dialects must implement:

render_delete_detection_statement(...)

The core provides:

  • target schema + table
  • stage schema + table
  • join predicates
  • scope filter

The dialect renders the DELETE (or equivalent) statement.

Dialects may opt out by setting:

supports_delete_detection = False

In that case, the core raises a clear NotImplementedError.


🧩 19.3 Historization Contract (SCD Type 2)

Historization pipelines for *_hist datasets are fully dialect-driven.

The core prepares semantic ingredients and calls:

render_hist_incremental_statement(...)

The default orchestration renders:

  1. Changed version UPDATE
  2. Deleted version UPDATE
  3. Changed version INSERT
  4. New version INSERT

Dialects may override orchestration if required by the backend.

Strict guarantees:

  • Exactly one open version per business key
  • Deterministic row_hash comparison
  • No overlapping validity intervals
  • Append-only behavior

Historization SQL must be fully executable and not a placeholder.


🧩 19.4 Strict Mode Guarantees

The core validates:

  • Required SCD columns exist
  • Insert column alignment is 1:1
  • Required surrogate keys are present

Dialects can assume validated semantic input and focus solely on SQL rendering.


🧩 19.5 Architectural Symmetry

Merge and Historization follow the same layered responsibility model:

| Layer | Responsibility |
|---|---|
| Logical layer | Semantic intent |
| load_sql | Semantic ingredient preparation |
| Dialect | SQL shape rendering |
| Execution engine | Execution only |

This symmetry ensures:

  • Predictable cross-dialect behavior
  • Clean extensibility
  • Stable internal architecture
  • No SQL-shape logic in the core layer

🔧 20. Reserved Keyword Registry

Each dialect provides a deterministic reserved keyword registry located in
rendering/dialects/keywords/<dialect>.py.

Keyword extraction follows one of two strategies:

  • Engine-truth (preferred when available)
  • Documentation-based parsing with strict sanity validation

All SQL required for engine-based extraction must be implemented inside the dialect class
via render_reserved_keywords_query().

Management commands must not contain vendor-specific SQL.
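As an example of the engine-truth strategy, Postgres exposes its own keyword list through `pg_get_keywords()` (catcode 'R' marks fully reserved words). The exact query below is an illustrative sketch of what `render_reserved_keywords_query()` could return:

```python
# Hedged example of engine-truth keyword extraction for Postgres;
# pg_get_keywords() is a real server function, catcode 'R' = reserved.
class PostgresDialect:
    DIALECT_NAME = "postgres"

    def render_reserved_keywords_query(self) -> str:
        return (
            "SELECT word FROM pg_get_keywords() "
            "WHERE catcode = 'R' ORDER BY word"
        )
```

Keeping this SQL inside the dialect class (not the management command) is what the rule above enforces.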



© 2025-2026 elevata Labs — Internal Technical Documentation