⚙️ Historization Architecture (SCD Type 2)¶

This document describes the architecture, semantics, and SQL generation logic
used by elevata to maintain history (“_hist”) tables for any Rawcore dataset.
It covers the full lifecycle: new, changed, and deleted business keys, and how
incremental historization SQL is generated in a dialect-agnostic way.

🔧 1. Purpose & Scope¶

Rawcore tables represent the latest snapshot of a domain dataset.

History (“_hist”) tables represent the temporal evolution of that dataset
based on SCD Type 2 semantics:

Every Business Key (BK) maintains a timeline of versions.
Every version is valid for a continuous time interval.
Changes in attribute values create new versions.
Disappearing business keys generate deleted versions.

Historization is fully metadata-driven and uses the same logical plan as the
Rawcore load.

🧩 1.1 Placeholder Standardisation¶

All runtime placeholders must use double-brace notation and are resolved during execution:

{{ load_timestamp }}
{{ load_run_id }}

🧩 1.2 Execution Semantics¶

History tables are never truncated.

The SCD Type 2 pipeline operates exclusively via:
- closing existing versions (version_ended_at)
- inserting new versions

At no point is historical data removed or replaced.

🔧 2. History Table Structure¶

Every History table contains:

🧩 2.1 Surrogate Keys¶

<rawcorename>_hist_key (history surrogate key, deterministic hash key)
<rawcorename>_key (foreign key to the Rawcore surrogate key)

🧩 2.2 Business Attributes¶

Inherited from Rawcore via lineage:

all non-surrogate, non-technical Rawcore columns
column types and expressions are identical to the Rawcore schema

🧩 2.3 SCD Technical Columns¶

Column	Type	Meaning
`version_started_at`	timestamp	Start of validity interval
`version_ended_at`	timestamp	End of validity; NULL for open versions
`version_state`	text	`new`, `changed`, `deleted`
`load_run_id`	string	ID of the load that produced this version

version_ended_at IS NULL uniquely identifies the current active version.

🔧 3. Schema Sync, Drift & Renames¶

History tables are an exact schema mirror of their Rawcore base table (plus SCD technical fields).
To avoid accidental duplication when columns are renamed, elevata relies on

🧩 Metadata-driven rename tracking:¶

Base dataset column rename updates TargetColumn.former_names
Hist dataset metadata must retain corresponding former names as well
so that materialization planning can emit RENAME COLUMN instead of ADD COLUMN.

🧩 Orphan preservation (schema drift)¶

If a business column disappears from Rawcore, elevata does not drop it from _hist.
Instead, it is preserved as an inactive column without upstream inputs, so that:

historical data remains analyzable
old reports continue to work
downstream schemas do not silently change

Rename cases are not treated as orphans.
They are detected via existing TargetColumnInput lineage links and therefore migrated as proper renames, not preserved under the old name.

When a FK reference is deleted, elevata detaches dependent _hist inputs instead of dropping columns,
so historized schemas remain analyzable and stable.

🧩 Policy: Never drop business columns in `_hist`¶

_hist tables are treated as long-lived audit/history artifacts.
Therefore, elevata applies a strict policy:

Business columns in _hist are never physically dropped.
If a business column disappears from Rawcore (e.g., source stopped delivering it, FK reference deleted),
the corresponding _hist column is:
marked inactive (active=false, retired_at set)
detached from lineage (all TargetColumnInput links removed)

Technical SCD columns (version_*, load_run_id, loaded_at) remain generator-managed and may be rebuilt,
but business-history stays analyzable and stable over time.

🧩 Guardrails:¶

Hist datasets must only rename from hist-like former names (*_hist)
to prevent accidental base → hist table renames.

🔧 4. Historization Workflow (Incremental, SCD2)¶

Historization runs after the Rawcore Merge Load.

The historization pipeline consists of four steps, executed in order:

Changed-UPDATE
Delete-UPDATE
Changed-INSERT
New-INSERT

Each step uses surrogate keys and row hashes to detect which BKs require action.

Worked example:

Rawcore snapshot:
  - BK exists and attributes changed → new version
  - BK no longer exists              → deleted version
  - BK completely new                → new version

History table:
  updates old versions, inserts new ones accordingly

🔧 5. Step 1: Version Detection (UPDATE)¶

Change detection is based on a deterministic row_hash computed in Rawcore.
The hash covers all non-key, non-technical attributes and is reused unchanged
by the historization pipeline to ensure consistent and dialect-independent
change detection.

A version is considered changed if:

its Business Key exists in Rawcore, but
its row_hash differs from the current Rawcore row.

SQL shape:

UPDATE rawcore.rc_aw_product_hist AS h
SET
  version_ended_at = {{ load_timestamp }},
  version_state    = 'changed',
  load_run_id      = {{ load_run_id }}
WHERE h.version_ended_at IS NULL
  AND EXISTS (
    SELECT 1
    FROM rawcore.rc_aw_product AS r
    WHERE r.rc_aw_product_key = h.rc_aw_product_key
      AND r.row_hash <> h.row_hash
  );

🔧 6. Step 2: Delete Detection (UPDATE)¶

A version is considered deleted if:

its BK does not appear in the Rawcore snapshot.

SQL shape:

UPDATE rawcore.rc_aw_product_hist AS h
SET
  version_ended_at = {{ load_timestamp }},
  version_state    = 'deleted',
  load_run_id      = {{ load_run_id }}
WHERE h.version_ended_at IS NULL
  AND NOT EXISTS (
    SELECT 1
    FROM rawcore.rc_aw_product AS r
    WHERE r.rc_aw_product_key = h.rc_aw_product_key
  );

No hard-deletes occur – only closing open versions.

🔧 7. Step 3: Changed Version Insert (INSERT)¶

After Step 1 closed old changed versions, we must create new versions for
each changed BK.

SQL shape:

INSERT INTO rawcore.<hist_table> (<hist_cols...>)
SELECT
  r.<rawcorename>_key,
  <attribute columns>,
  r.row_hash,
  {{ load_timestamp }},
  NULL,
  'changed',
  {{ load_run_id }}
FROM rawcore.<rawcore_table> AS r
WHERE EXISTS (
  SELECT 1
  FROM rawcore.<hist_table> AS h
  WHERE h.version_ended_at = {{ load_timestamp }}
    AND h.version_state = 'changed'
    AND h.<rawcorename>_key = r.<rawcorename>_key
);

🔧 8. Step 4: New Version Insert (INSERT)¶

A version is considered new if:

its BK exists in Rawcore,
but no history entry for this BK exists at all.

SQL shape:

INSERT INTO rawcore.<hist_table> (<hist_cols...>)
SELECT
  r.<rawcorename>_key,
  <attribute columns>,
  r.row_hash,
  {{ load_timestamp }},
  NULL,
  'new',
  {{ load_run_id }}
FROM rawcore.<rawcore_table> AS r
WHERE NOT EXISTS (
  SELECT 1
  FROM rawcore.<hist_table> AS h
  WHERE h.<rawcorename>_key = r.<rawcorename>_key
);

🔧 9. Ordering Guarantees¶

Historization is always executed as a downstream step of the Rawcore load
and relies on the Rawcore snapshot being fully materialized for the same
load timestamp and load run ID.

The SCD2 pipeline must always run in the following order:

Close changed versions
Close deleted versions
Insert new changed versions
Insert new versions

This ensures:

correct temporal consistency
no overlapping validity periods
all new rows receive a correct SCD state
deletes never override changed states

🔧 10. Expression Reuse¶

All attribute expressions come from the Rawcore logical select plan.
This guarantees:

identical typing
identical coercions
identical cast rules
identical derived expressions (hashing, concatenations, etc.)

History never reimplements logic — it inherits all transformations from Rawcore.

🔧 11. Dialect Abstraction¶

All historization SQL is generated via the same dialect interface used
throughout the load engine:

identifier rendering
table rendering
quoting rules

History SQL is therefore dialect-safe and works across:

DuckDB
PostgreSQL
SQL Server
any custom dialect with the same contract

No dialect requires MERGE for historization.

🔧 12. Interaction with Rawcore Merge Load¶

Historization relies on Rawcore being loaded with a consistent incremental
merge before it runs.

Historization consumes:

surrogate keys
row hashes
attribute columns
business key lineage
load timestamp & load run ID

and produces a clean SCD timeline for all BKs.

🔧 13. Guarantees & Invariants¶

The historization layer guarantees:

exactly one open version per BK (or none after deletion)
no overlapping date ranges
strict sequential versioning
deterministic change detection via row_hash
idempotent reruns of the same load_run_id
append-only history (no data loss)

🔧 14. Testability¶

The SQL generation is fully covered via:

unit tests for UPDATE and INSERT blocks
integration test verifying the combined pipeline
monkeypatched lineage & expression maps for deterministic SQL
surrogate key joins validated in both EXISTS and NOT EXISTS branches

Historization never relies on ORM runtime behavior.

🔧 15. Optional Extensions (Out of Scope)¶

Optional extensions that can be built later:

effective dating vs. load dating
metadata-driven state labels
soft-deleted Rawcore entries
point-in-time reconstruction helpers
history pruning / archiving modules

🔧 16. Summary¶

Historization is a complete, metadata-driven SCD Type 2 engine:

fully generated SQL
lineage-consistent attribute mapping

⚙️ Historization Architecture (SCD Type 2)¶

🔧 1. Purpose & Scope¶

🧩 1.1 Placeholder Standardisation¶

🧩 1.2 Execution Semantics¶

🔧 2. History Table Structure¶

🧩 2.1 Surrogate Keys¶

🧩 2.2 Business Attributes¶

🧩 2.3 SCD Technical Columns¶

🔧 3. Schema Sync, Drift & Renames¶

🧩 Metadata-driven rename tracking:¶

🧩 Orphan preservation (schema drift)¶

🧩 Policy: Never drop business columns in _hist¶

🧩 Guardrails:¶

🔧 4. Historization Workflow (Incremental, SCD2)¶

🔧 5. Step 1: Version Detection (UPDATE)¶

🔧 6. Step 2: Delete Detection (UPDATE)¶

🔧 7. Step 3: Changed Version Insert (INSERT)¶

🔧 8. Step 4: New Version Insert (INSERT)¶

🔧 9. Ordering Guarantees¶

🔧 10. Expression Reuse¶

🔧 11. Dialect Abstraction¶

🔧 12. Interaction with Rawcore Merge Load¶

🔧 13. Guarantees & Invariants¶

🔧 14. Testability¶

🔧 15. Optional Extensions (Out of Scope)¶

🔧 16. Summary¶

🧩 Policy: Never drop business columns in `_hist`¶