⚙️ Supported Source Backends¶

This document provides an overview of supported database platforms, tested SQLAlchemy dialects,
and connection examples for elevata.

It describes how elevata connects to external systems for schema discovery, metadata introspection,
and optional federated access at the Stage layer.
Execution semantics are defined separately.

It also documents non-relational ingestion sources (Files and REST) and how to configure them via SourceDataset.ingestion_config.

🔧 1. General Pattern (SQLAlchemy-backed systems)¶

All SQL database connections use the standard SQLAlchemy URI format:

dialect[+driver]://username:password@host:port/database[?params]

Example:

mssql+pyodbc://user:pwd@server,1433/db?driver=ODBC%20Driver%2018%20for%20SQL%20Server

In your .env, define these values using the elevata secret convention:

SEC_<ENVIRONMENT>_CONN_<TYPE>_<SHORTNAME>=<SQLAlchemy URI>

Example:

SEC_DEV_CONN_MSSQL_SAP=mssql+pyodbc://user:pwd@sql01,1433/SAPDB?driver=ODBC%20Driver%2018%20for%20SQL%20Server

🔧 2. Fully Supported Backends¶

Type	Label	Driver / Package	Reflection Support	Example URI
`bigquery`	Google BigQuery	`pybigquery`	✅ Columns only	`bigquery://project-id`
`databricks`	Databricks SQL (beta)	`sqlalchemy-databricks`	⚠️ Columns only	`databricks://token:<TOKEN>@<host>?http_path=/sql/1.0/warehouses/<id>`
`duckdb`	DuckDB	`duckdb-engine`	✅ Full	`duckdb:///C:/data/local.duckdb`
`mssql`	Microsoft SQL Server	`pyodbc`	✅ Full	`mssql+pyodbc://user:pwd@host,1433/db?driver=ODBC%20Driver%2018%20for%20SQL%20Server`
`mysql`	MySQL / MariaDB	`mysqlclient` or `pymysql`	✅ Full	`mysql+pymysql://user:pwd@localhost:3306/db`
`oracle`	Oracle	`python-oracledb`	✅ Columns & PKs	`oracle+oracledb://user:pwd@host:1521/servicename`
`postgres` / `postgresql`	PostgreSQL	`psycopg2`	✅ Full	`postgresql+psycopg2://user:pwd@localhost:5432/dbname`
`redshift`	Amazon Redshift	`sqlalchemy-redshift`	✅ Columns only	`redshift+psycopg2://user:pwd@redshift:5439/dev`
`snowflake`	Snowflake	`snowflake-sqlalchemy`	✅ Columns only	`snowflake://user:pwd@account/DB/SCHEMA?warehouse=WH`
`sqlite`	SQLite	built-in	✅ Full	`sqlite:///C:/data/dev.db`

🔧 3. Connection Notes¶

Use URL-encoded driver names (e.g. ODBC%20Driver%2018%20for%20SQL%20Server).
Backslashes in paths should be replaced by slashes (C:/path/file.db).
SSL parameters and special driver arguments can be appended via query params.
For cloud systems (e.g. Snowflake, BigQuery, Databricks), credentials and tokens can be stored
in .env or a vault.
For BigQuery, SQLAlchemy-based reflection is used for schema and column discovery.
Execution as a target backend is described separately.

🔧 4. Beta Dialects (Reflection may be limited)¶

trino / presto
clickhouse
teradata
db2
hana

You can register these manually under System.type if you want to experiment — elevata will attempt a
generic reflection where possible, but may not retrieve primary or foreign key details.

🔧 5. File-based Sources (RAW Ingestion)¶

elevata supports file-based sources for native RAW ingestion. The file type is determined via System.type.

🧩 Supported File Types¶

`System.type`	Format	Notes
`csv`	CSV with header row	Header row is used as column names
`json`	JSON array of objects (`[{...}, {...}]`)	Each element must be an object
`jsonl`	Newline-delimited JSON (`{"a":1} {"a":2}`)	One object per line
`parquet`	Parquet	Currently: local path or `file://`
`excel`	Excel (`.xlsx`, `.xlsm`)	Sheet and header configurable

RAW landing is always Full Replace: Drop/Create/Truncate/Insert.

RAW tables are system-managed landing zones and always include technical columns such as:
- load_run_id
- loaded_at
- payload (preserved JSON object for the original record)

🧩 `SourceDataset.ingestion_config` (File)¶

🔎 Required fields¶

{
  "uri": "file:///data/orders.parquet"
}

uri (string, required)
Path or URI to the file.

Supported: - file:///... (recommended)
- local paths (e.g. /data/orders.csv, C:/data/orders.xlsx)
- http(s)://... for CSV/JSON/JSONL/Excel (Parquet currently not supported over HTTP)

Local paths vs. file URIs

The uri field accepts both plain local paths and file:// URIs.

Valid examples:

Linux / macOS:

/data/orders.csv

file:///data/orders.csv

Windows:

C:/data/orders.xlsx

file:///C:/data/orders.xlsx

Using file:/// is recommended for portability and clarity in configuration files, but it is not strictly required.

Backslashes (\) should be avoided. Use forward slashes (/) instead.

🔎 Environment variable expansion¶

uri supports environment variable expansion using the ${VAR_NAME} syntax.

Example:

{
  "uri": "${ELEVATA_INGEST_ROOT}/finance/orders.xlsx"
}

🧩 `.env` expectations¶

Recommended pattern:

ELEVATA_INGEST_ROOT=/mnt/dev/inbox

Then reference it in metadata:

{
  "uri": "${ELEVATA_INGEST_ROOT}/finance/orders.xlsx"
}

This allows using the same metadata across dev/test/prod while switching file roots via .env.

🔎 Optional fields¶

{
  "uri": "https://example.com/download?id=123",
  "file_type": "jsonl"
}

file_type (string, optional)
Overrides detection based on file suffix. Useful if the URL does not contain a file extension (e.g. presigned URLs).

Allowed values: csv | json | jsonl | parquet | excel

Important: If set, file_type must match System.type (to avoid inconsistent configuration).

Detection rules (if file_type is not set):
- .csv → CSV
- .json → JSON array
- .jsonl / .ndjson → JSON Lines
- .parquet → Parquet
- .xlsx / .xlsm → Excel

🧩 CSV-specific notes (`System.type = "csv"`)¶

CSV must have a header row.
Values are read as strings; type inference happens via auto-import / column profiling.

{
  "uri": "file:///data/orders.csv",
  "delimiter": ";",
  "quotechar": "\"",
  "encoding": "utf-8-sig"
}

delimiter (string, optional, default: ,)
Column separator character.
Common in German Excel exports: ;.
quotechar (string, optional, default: \")
Character used to quote fields containing delimiters or line breaks.
Standard CSV uses double quotes (").
encoding (string, optional, default: utf-8)
Character encoding used when decoding the file.
Useful values:
utf-8
utf-8-sig (common for Excel exports on Windows)
cp1252 (legacy Windows encoding)

If not specified, elevata assumes UTF-8 encoding and standard CSV quoting.

🧩 JSON-specific notes (`System.type = "json"`)¶

The file must contain a JSON array of objects.
Non-object elements are ignored.

🧩 JSONL-specific notes (`System.type = "jsonl"`)¶

One JSON object per line.
Empty lines are ignored.

🧩 Parquet-specific notes (`System.type = "parquet"`)¶

Parquet ingestion is chunked for memory safety.
Currently supported locations: local path or file:// URI.

🧩 Excel-specific Options (`System.type = "excel"`)¶

{
  "uri": "file:///data/orders.xlsx",
  "sheet_name": "Orders",
  "header_row": 1,
  "max_rows": 50000
}

sheet_name (string, optional)
Name of the worksheet. If not set, the first sheet is used.
sheet_index (int, optional)
0-based sheet index (alternative to sheet_name).
header_row (int, optional, default: 1)
1-based row number containing the header.
max_rows (int, optional)
Limits the number of data rows read (after the header).

🔧 6. REST Sources (RAW Ingestion)¶

For REST APIs, use System.type = "rest".

REST ingestion configuration is split into:

System-level REST connection (environment-specific):
base_url and default headers
Dataset-level ingestion_config (dataset-specific):
path, record extraction, pagination, cursor rules

This keeps metadata stable across dev/test/prod while allowing endpoint switches via secrets.

Important: base_url is environment-/system-level configuration and must be provided via secrets
(e.g. .env or vault). It should not be duplicated per dataset.

🧩 System-level REST connection (secret)¶

REST systems resolve connection details via the profile-aware secret reference:

sec/{profile}/conn/rest/{short_name}

The resolved secret must provide at least:

base_url (string, required)

Optional:

headers (object): default headers included with every request
query (object): fixed query parameters included with every request

Example secret JSON:

{
  "base_url": "https://jsonplaceholder.typicode.com",
  "headers": {"Authorization": "Bearer <TOKEN>"},
  "query": {"locale": "en_US"}
}

🧩 `.env` expectations¶

In local development, this is typically backed by .env values using the same convention as SQL connection strings.

Example (conceptual) .env entry:

SEC_DEV_CONN_REST_JSONPH={"base_url":"https://jsonplaceholder.typicode.com"}

Optional default headers and fixed query params may be provided as well:

SEC_DEV_CONN_REST_REQRES={"base_url":"https://reqres.in","headers":{"Authorization":"Bearer <TOKEN>"},"query":{"locale":"en_US"}}

Note: The exact .env representation depends on your secret provider.
The important contract is that resolving the secret yields a JSON object containing at least base_url.

🧩 `SourceDataset.ingestion_config` (REST)¶

🔎 Required fields¶

{
  "path": "/posts"
}

path (string, required)
Endpoint path relative to the system base_url.

🔎 Optional fields¶

{
  "path": "/api/users",
  "method": "GET",
  "headers": {"X-Debug": "1"},
  "params": {"limit": 100},
  "record_path": "data"
}

method (string, optional, default: GET)
HTTP method.
headers (object, optional)
Request headers added on top of system default headers.
params (object, optional)
Request query parameters added on top of system fixed query parameters.
record_path (string, optional)
Dotted path to the list of records inside the JSON response (e.g. data.items). If omitted, the response must be a JSON array.

🔎 Pagination (page-based)¶

{
  "path": "/api/users",
  "page_param": "page",
  "page_size_param": "per_page",
  "page_size": 6,
  "record_path": "data"
}

page_param (string, optional)
Name of the page parameter.
page_size_param (string, optional)
Name of the page size parameter.
page_size (int, optional)
Requested page size.

🔎 Incremental scoping (cursor / since)¶

Note: RAW remains Full Replace. Cursor configuration affects only the extraction scope.

{
  "path": "/events",
  "since_param": "since",
  "cursor_field": "updated_at",
  "record_path": "items"
}

since_param (string, optional)
Query parameter receiving the delta cutoff value.
cursor_field (string, optional)
Field name used to derive a cursor/max timestamp.

⚙️ Supported Source Backends¶

🔧 1. General Pattern (SQLAlchemy-backed systems)¶

🔧 2. Fully Supported Backends¶

🔧 3. Connection Notes¶

🔧 4. Beta Dialects (Reflection may be limited)¶

🔧 5. File-based Sources (RAW Ingestion)¶

🧩 Supported File Types¶

🧩 SourceDataset.ingestion_config (File)¶

🔎 Required fields¶

🔎 Environment variable expansion¶

🧩 .env expectations¶

🔎 Optional fields¶

🧩 CSV-specific notes (System.type = "csv")¶

🧩 JSON-specific notes (System.type = "json")¶

🧩 JSONL-specific notes (System.type = "jsonl")¶

🧩 Parquet-specific notes (System.type = "parquet")¶

🧩 Excel-specific Options (System.type = "excel")¶

🔧 6. REST Sources (RAW Ingestion)¶

🧩 System-level REST connection (secret)¶

🧩 .env expectations¶

🧩 SourceDataset.ingestion_config (REST)¶

🔎 Required fields¶

🔎 Optional fields¶

🔎 Pagination (page-based)¶

🔎 Incremental scoping (cursor / since)¶

🧩 `SourceDataset.ingestion_config` (File)¶

🧩 `.env` expectations¶

🧩 CSV-specific notes (`System.type = "csv"`)¶

🧩 JSON-specific notes (`System.type = "json"`)¶

🧩 JSONL-specific notes (`System.type = "jsonl"`)¶

🧩 Parquet-specific notes (`System.type = "parquet"`)¶

🧩 Excel-specific Options (`System.type = "excel"`)¶

🧩 `.env` expectations¶

🧩 `SourceDataset.ingestion_config` (REST)¶