⚙️ Hashing Architecture¶
This document describes elevata’s surrogate-key (SK) and foreign-key (FK) hashing architecture.
The hashing engine is:
- deterministic (stable outputs for identical metadata)
- dialect-neutral (AST → rendered per SQL dialect)
- metadata-safe (no vendor SQL stored in the database)
- cross-dialect identical (BigQuery, Databricks, DuckDB, Fabric Warehouse, MSSQL, Postgres, Snowflake produce the same hash)
- testable (AST inspections instead of string asserts)
Hashing is standardized as hex-encoded, 64-character, lowercase SHA-256 across all dialects.
🔧 1. Goals of the Architecture¶
- a vendor-neutral Hashing DSL
- a DSL Parser → AST
- a structured Expression AST
- dialect-specific SQL renderers
- deterministic ordering for all BK components
🔧 2. SK/FK Hashing Requirements¶
🧩 2.1 Surrogate Key (SK) must be:¶
- deterministic
- reproducible across dialects
- independent of physical column order
- sensitive only to BK semantics
🔎 SK derivation inputs:¶
- ordered list of business key columns
- literal separators
~between field name & value|between BK pairs- null replacement literal:
'null_replaced' - system-wide pepper: e.g.
'pepper'^
🧩 2.3 History Surrogate Keys¶
- History SK reuses the parent rawcore SK structure:
- natural key columns: [rawcore_sk_column, "version_started_at"].
- same hashing DSL & AST (CONCAT_WS, COALESCE, HASH256, pepper).
- This ensures:
- BK for history = rawcore SK + version_started_at.
- joinability between rawcore and history later (if needed).
- _hist never defines its own FK hashes; all FK semantics stay in the rawcore layer.
🧩 2.2 Foreign Key (FK) must:¶
- follow parent’s SK structure exactly
- replace parent column references with child references
- maintain ordering
- use the same pepper
- remain cross-dialect consistent
FK structure for two BK columns looks like:
CONCAT_WS('|',
CONCAT('bk1', '~', COALESCE(parent_child_bk1, 'null_replaced')),
CONCAT('bk2', '~', COALESCE(parent_child_bk2, 'null_replaced')),
'pepper'
)
🔧 3. Hashing DSL (Human-Readable)¶
Surrogate keys and foreign keys are generated via a safe declarative DSL:
HASH256(
CONCAT_WS('|',
CONCAT('productid', '~', COALESCE({expr:productid}, 'null_replaced')),
'pepper'
)
)
Characteristics:
- No vendor SQL inside the DSL
- All logic expressible via COL(), CONCAT(), CONCAT_WS(), COALESCE(), HASH256()
- Designed to be a serialization format for metadata
🔧 4. DSL Parser → AST¶
The DSL parser (in dsl.py) converts the DSL into a structured AST composed of:
- Literal
- ColumnRef
- ExprRef
- ConcatExpr
- ConcatWsExpr
- CoalesceExpr
- Hash256Expr
Example AST snippet:
Hash256Expr(
ConcatWsExpr('|', [
ConcatExpr([
Literal('productid'),
Literal('~'),
CoalesceExpr(ColumnRef('productid'), Literal('null_replaced'))
]),
Literal('pepper')
])
)
🔧 5. Deterministic Ordering Rules (Critical)¶
BK components are sorted lexicographically by BK name.
For BKs: [bk2, bk1, bk10] → sorted → [bk1, bk10, bk2].
This ensures:
- identical hashes regardless of metadata ordering
- stable lineage comparisons
- deterministic FK reconstruction
🧩 Ordering within pair:¶
CONCAT("bk1", '~', COALESCE(expr, 'null_replaced'))
🧩 Ordering of pairs:¶
Joined using CONCAT_WS('|', ...).
🔧 6. Foreign Key Hashing¶
FK hashing mirrors SK hashing exactly, except column references point to child columns.
Example:
CONCAT_WS('|',
CONCAT('bk1', '~', COALESCE(child.bk1, 'null_replaced')),
CONCAT('bk2', '~', COALESCE(child.bk2, 'null_replaced')),
'pepper'
)
This is guaranteed by using ExprRef and standard BK ordering.
The dialect renderer never needs to know whether it’s SK or FK.
🔧 7. Dialect Rendering of HASH256¶
The AST-based hash expression is rendered differently per dialect.
🧩 BigQuery¶
TO_HEX(SHA256(CONCAT_WS('|', ...)))
🧩 Databricks¶
SHA2(CONCAT_WS('|', ...), 256)
🧩 DuckDB¶
SHA256(CONCAT_WS('|', ...))
🧩 Fabric Warehouse¶
CONVERT(VARCHAR(64),
HASHBYTES('SHA2_256', CAST(CONCAT_WS('|', ...) AS VARCHAR(4000))),
2)
🧩 MSSQL¶
CONVERT(VARCHAR(64),
HASHBYTES('SHA2_256', CONCAT_WS('|', ...)),
2)
🧩 Postgres¶
On PostgreSQL, SHA256 hashing relies on the pgcrypto extension.
ENCODE(
DIGEST(CONCAT_WS('|', ...), 'sha256'),
'hex'
)
🧩 Snowflake¶
TO_HEX(SHA2(TO_VARCHAR(CONCAT_WS('|', ...)), 256))
All platforms produce byte-identical SHA-256 outputs.
🔧 8. Null Semantics¶
All BK values are wrapped in:
COALESCE(value, 'null_replaced')
This avoids platform-specific differences:
- NULL || 'x' vs CONCAT(NULL, 'x')
- Postgres treating empty strings differently
- MSSQL + operator behavior
🔧 9. Pepper Semantics¶
A global pepper (e.g. 'pepper') is appended as the last argument of the CONCAT_WS call.
Purpose:
- prevent predictable hashes
- add stability across dialects
- disable hash attacks on BKs
Pepper is constant and not column-dependent.
🔧 10. How FK hashing mirrors parent SK logic¶
FK hashing logic:
1. Retrieve parent SK structure (BK names, ordering)
2. Inject child columns into the same structure
3. Reconstruct the same AST pattern
4. Apply Hash256Expr
This guarantees:
- referential equality
- stable lineage
- consistent join keys
🔧 11. Advantages of the New Hashing Engine¶
- cross-dialect identical hashing
- no SQL inside metadata
- fully testable (AST-level tests)
- deterministic BK ordering
- safe DSL (no SQL injection)
- shared logic for SKs and FKs
- consistent subquery + expression behavior
🔧 12. Testing Strategy¶
Tests cover:
- DSL → AST correctness
- dialect rendering for SK & FK
- BigQuery/Databricks/DuckDB/Fabric Warehouse/MSSQL/Postgres/Snowflake hash equivalence
- BK ordering rules
- null-coalescing behavior
- pepper correctness
All tests pass when AST rendering is correct.
© 2025-2026 elevata Labs — Internal Technical Documentation