⚙️ Load SQL Architecture¶

This document describes how elevata transforms metadata into executable SQL for load operations.

🔧 1. Overview¶

The load SQL pipeline turns metadata into complete SQL statements suitable for execution on analytical backends.

High‑level flow:

Metadata → Logical Plan → Expression AST → Dialect Rendering → SQL Load Statement → Execution

The system is designed so that:
- metadata remains backend‑agnostic
- logical plans describe what is required, not how it is written
- dialects encapsulate syntactic differences
- execution engines are decoupled from SQL generation

Note
This document focuses exclusively on SQL generation and rendering.
Execution order, orchestration semantics, retries, failure handling,
and execution observability are described separately in
Load Execution & Orchestration Architecture.

🔧 2. Logical Plans for Load Operations¶

elevata represents load operations using SQL primitives:

LogicalSelect – core building block
LogicalUnion – multi-source resolution
SubquerySource – ranking, filtering, pre‑aggregation

Logical plans are intentionally dialect‑neutral.

🔧 3. Expression AST in Load SQL¶

Every column expression is represented as an AST node during load generation.

Supported node types include:
- ColumnRef
- Literal
- ExprRef
- ConcatExpr
- ConcatWsExpr
- CoalesceExpr
- WindowFunctionExpr
- Hash256Expr

The AST guarantees that:
- hashing is consistent across dialects
- CONCAT / COALESCE behave uniformly
- window functions are structured, not string‑built
- SQL remains predictable and comparable

🔧 4. Dialect Rendering¶

After a logical plan is constructed, the selected dialect renders the plan and its AST into concrete SQL.

sql = dialect.render_select(plan)

Each dialect implements:
- identifier quoting
- literal rendering
- hashing functions
- CONCAT / CONCAT_WS
- COALESCE
- window functions
- subqueries and unions

This ensures consistent semantics while using native SQL syntax per backend.

🔧 5. Intent layer (generation + ingestion)¶

elevata uses a small "intent layer" to keep core decisions consistent across
target generation, ingestion, drift detection and quality checks.

🧩 Landing intent¶

The function landing_required(SourceDataset) determines whether a dataset
conceptually requires a RAW landing object.

Decision rules:
- SourceDataset.integrate must be True (hard gate).
- SourceDataset.generate_raw_table overrides the system default.
- If unset, inherit System.generate_raw_tables.

This decision is shared across:
- RAW target generation
- ingestion execution planning
- future drift detection and quality checks

🧩 Ingestion mode¶

The function resolve_ingest_mode(SourceDataset) determines how RAW is populated:
- native: elevata extracts and loads
- external: an external tool populates RAW; elevata validates + continues
- none: no landing ingestion (federated/virtual access)

If RAW landing is required but include_ingest='none', configuration is inconsistent
and must fail fast.

🧩 Allowed states (RAW landing × include_ingest)¶

The decision whether a RAW landing exists is driven by landing_required(SourceDataset).
The execution mode is driven by System.include_ingest, but only becomes relevant
when RAW landing is required.

landing_required	include_ingest	Result / behavior
false	none	✅ Valid. No RAW landing. Ingestion is skipped (federated/virtual or intentionally no landing).
false	external	✅ Valid (ignored). No RAW landing, so ingestion is skipped.
false	native	✅ Valid (ignored). No RAW landing, so ingestion is skipped.
true	native	✅ Valid. elevata performs native ingestion (extract + load into RAW).
true	external	✅ Valid. RAW is expected to be populated by an external tool; elevata validates RAW existence and continues with downstream steps (drift/quality/etc.).
true	none	❌ Invalid. Configuration inconsistency: RAW landing is required but no ingestion mode is enabled. Must fail fast.

Notes:
- include_ingest is only actionable when landing_required=True.
- For include_ingest=external, elevata does not extract data but still logs runs and can run drift/quality checks on RAW.

🔧 6. Load Runner¶

The Load Runner CLI (elevata_load) orchestrates SQL generation and execution.

It:
- resolves the active profile and target system
- reads target dataset metadata
- constructs the logical plan
- renders SQL via the active dialect
- optionally executes SQL in the target warehouse

The same pipeline is used for SQL preview and execution.

🔧 7. Deterministic Generation¶

The SQL generation pipeline is fully deterministic:
- stable business-key ordering
- stable hashing patterns
- stable helper column naming
- stable logical plan structure

This guarantees reproducible SQL and predictable diffs.

🔧 8. Merge‑based Incremental SQL Generation (Rawcore)¶

This section documents how merge‑based incremental loads are implemented for Rawcore targets.

🧩 8.1 Source Resolution¶

For targets using incremental_strategy = "merge", the SQL layer resolves the Stage upstream dataset as the merge source:

source: stage.<table> AS s
target: rawcore.<table> AS t

Lineage metadata guarantees compatible natural keys and attribute sets.

🧩 8.2 Natural Key Join¶

Natural key fields define:
- the merge join condition
- identification of new vs. existing rows
- delete‑detection scope

If no natural key is defined, SQL generation fails.

🧩 8.3 Logical Plan Reuse¶

All column expressions used in UPDATE and INSERT branches are reused from the logical plan.

Business logic is defined once and rendered consistently.

🧩 8.4 Dialect‑dependent Strategy¶

Dialects choose between:
- native MERGE statements
- fallback UPDATE + INSERT ... WHERE NOT EXISTS patterns

Both paths reuse the same logical plan expressions.

🧩 8.5 Delete Detection¶

Delete detection is implemented as a separate anti‑join statement that runs before the merge.

The SQL layer translates incremental scope filters from source lineage into target column expressions.

The incremental scope used for delete detection is derived from the SourceDataset.increment_filter and
translated via lineage into target column references.

🔧 9. Execution Semantics¶

Execution semantics are defined by target layer:

Layer	Behaviour
`raw`	Replace when loaded or seeded
`stage`	Always replace (truncate before insert)
`rawcore`	Replace only when `mode = full`; incremental runs never truncate
`*_hist`	Never truncate; versioned updates only

Execution always runs inside the target system.

🔧 10. Execution, Auto‑Provisioning & Warehouse Logging¶

🧩 10.1 Execution Modes¶

elevata_load supports:

Dry‑run: render SQL without executing it
Execute (--execute): render and execute SQL in the target warehouse

🧩 10.2 Layer-aware execution (RAW = ingestion)¶

elevata_load --execute is intentionally layer-aware:

For raw targets, --execute does not produce SQL but triggers ingestion logic instead.
The same execution command therefore has different semantics depending on the target layer.
For downstream layers (stage, rawcore, *_hist), --execute renders and executes
warehouse-native SQL as usual.

Why this matters:
- elevata treats ingestion as a first-class citizen of the pipeline.
- The same lineage metadata that drives target generation also drives ingestion planning.
- This closes an important gap in dbt-style stacks: dbt excels at transformations but does not provide
ingestion as part of its core execution model.

Practical rule:
- If you can --execute a RAW table, elevata will bring the data in (native/external mode).
- If you --execute a Stage model, elevata assumes that its upstream exists inside the target execution context
(RAW landing or federated/external availability), and will fail fast otherwise.

🔎 Ingestion configuration (`ingestion_config`)¶

For RAW ingestion, source connection details and behavior are defined via SourceDataset.ingestion_config.

For relational sources, scoping is controlled via static_filter and increment_filter.
For non-relational sources (Files / REST), ingestion_config contains connector-specific parameters
(e.g. uri or url).

Regardless of source type, RAW ingestion is always executed as Full Replace:

Drop (if supported)
Create
Truncate
Insert

🧩 10.3 Auto‑Provisioning¶

When enabled, execution automatically provisions:
- target schemas
- the meta schema
- the load_run_log table

All DDL is idempotent.

🧩 10.4 Warehouse‑level Load Run Log¶

Each executed load writes a row into meta.load_run_log, capturing:
- batch and load run IDs
- target dataset and system
- load mode and flags
- timestamps and durations
- execution status and error details

This enables warehouse‑native observability and auditing.

🧩 10.5 Materialization Plan (Safe Drift Sync)¶

Before executing generated DML, elevata may apply a materialization plan that aligns
physical target tables with metadata-defined schemas:

ENSURE_SCHEMA (idempotent)
RENAME TABLE (dataset rename via former_names)
RENAME COLUMN (column rename via former_names)
ADD COLUMN (when safe and supported)

The plan is generated by the materialization planner and executed by the applier,
which only runs safe, SQL-bearing steps.

Table creation remains the responsibility of the load runner via ensure_target_table(...).

🔧 11. Load Observability & Debugging¶

Load runs expose structured summaries, batch grouping, and CLI‑level logging to support debugging and monitoring.

🔧 12. CLI Usage¶

The elevata_load command supports preview, debugging, batch execution, and warehouse execution.

🔧 13. Execute Mode¶

The --execute flag enables direct execution of load SQL in the target warehouse via dialect‑specific execution engines.