Resolve Step — FK Resolution¶

The resolve step encapsulates the complete foreign key resolution pattern: validate business key completeness, join to a dimension lookup, assign sentinel values for invalid and unknown BKs, handle SCD2 point-in-time narrowing, and optionally include extra columns from the lookup.

Why resolve?¶

Without resolve, FK resolution requires chaining join + derive + filter + coalesce steps — verbose, error-prone, and lacking standardized sentinel handling. The resolve step replaces that entire chain with one declarative block.

Value over manual steps:

Sentinel assignment replaces join + coalesce + case_when
BK completeness check (null/blank detection) is automatic
Single output column — no cleanup of extra join columns
Batch mode — 8 FKs in one block vs 8 join chains
Lookup integration — references lookups by name
fk_sentinel_rate assertion — purpose-built FK quality gate

Single FK resolve¶

The simplest form maps a source column to a lookup business key and produces a surrogate key column:

steps:
  - resolve:
      name: plant_id
      lookup: dim_plant
      match: plant_code
      pk: id

match supports three sugar forms:

# String — single column, same name both sides
match: natural_id
# Equivalent to: { natural_id: natural_id }

# List — multiple columns, same names
match: [mandt, plant]
# Equivalent to: { mandt: mandt, plant: plant }

# Dict — explicit mapping (different column names)
match:
  plant_code: natural_id
  region: region_code

Sentinel values¶

Rows with incomplete or unmatched BKs receive sentinel values instead of nulls:

Scenario	Default	Config field
Null/blank source column	`-4`	`on_invalid`
No match in lookup	`-1`	`on_unknown`

These align with system member codes from dimension seeding.

- resolve:
    name: plant_id
    lookup: dim_plant
    match: plant_code
    pk: id
    on_invalid: -4
    on_unknown: -1

SCD2 resolution — effective block¶

For slowly changing dimensions, the effective block narrows the lookup to the correct version of each record.

Current flag¶

Filter to the active record using a boolean or coded column:

- resolve:
    name: product_id
    lookup: dim_product
    match: product_code
    pk: id
    effective:
      current: is_current

Custom value (e.g. "Y" instead of true):

effective:
  current:
    column: is_current
    value: "Y"

Date range¶

Half-open interval [from, to) checked against a fact date column. Null to matches as the current record:

- resolve:
    name: product_id
    lookup: dim_product
    match: product_code
    pk: id
    effective:
      date_column: order_date
      from: effective_from
      to: effective_to

Where predicate¶

General SQL filter applied to the lookup before joining:

- resolve:
    name: plant_id
    lookup: dim_plant
    match: plant_code
    pk: id
    where: "region = 'EMEA'"

effective and where compose with AND semantics — both filters apply when present together.

Include columns¶

Bring additional columns from the lookup into the fact:

- resolve:
    name: plant_id
    lookup: dim_plant
    match: plant_code
    pk: id
    include:
      - description
      - category
    include_prefix: "plant_"

include supports list (keep names), dict (rename), or string sugar (single column). include_prefix prepends a string to all included column names.

Normalization¶

Normalize both source and lookup BK columns before joining:

- resolve:
    name: plant_id
    lookup: dim_plant
    match: plant_code
    pk: id
    normalize: trim_lower

Available presets: trim_lower, trim_upper, trim, none.

Dropping source columns¶

After resolution, the original BK source columns can be dropped:

- resolve:
    name: plant_id
    lookup: dim_plant
    match: plant_code
    pk: id
    drop_source_columns: true

Defaults to false. In batch mode, source columns are dropped only after all FKs complete — preventing mid-batch failures when the same source column is used by multiple resolutions.

Duplicate handling¶

When multiple lookup rows match a single fact row:

on_duplicate: warn    # default — log warning, take first
on_duplicate: error   # raise an exception
on_duplicate: first   # take first silently

Batch mode¶

Resolve multiple FKs in one step with shared defaults:

- resolve:
    pk: id
    on_invalid: -4
    on_unknown: -1
    batch:
      - name: plant_id
        lookup: dim_plant
        match: plant_code
      - name: customer_id
        lookup: dim_customer
        match:
          customer_code: natural_id
        normalize: trim_lower
      - name: product_id
        lookup: dim_product
        match: product_code
        effective:
          current: is_current

Shared defaults (pk, on_invalid, etc.) apply to all items. Item-level values override the shared default. When drop_source_columns: true, source columns are dropped only after all FKs complete.

on_failure handling¶

Control behavior when a lookup cannot be found (single mode):

on_failure: abort   # default — fail the thread
on_failure: warn    # assign on_unknown to all rows, continue

In batch mode, a missing lookup always aborts regardless of on_failure.

fk_sentinel_rate assertion¶

Post-write assertion that checks FK sentinel rates:

assertions:
  # Single column, single sentinel
  - type: fk_sentinel_rate
    column: plant_id
    sentinel: -4
    max_rate: 0.05
    message: "plant FK invalid rate exceeded"

  # Named sentinel groups with per-group rates
  - type: fk_sentinel_rate
    column: plant_id
    sentinels:
      invalid:
        value: -4
        max_rate: 0.02
      unknown:
        value: -1
        max_rate: 0.10
    message: "plant FK quality check"

  # Multiple columns, shared config
  - type: fk_sentinel_rate
    columns: [plant_id, customer_id, product_id]
    sentinels:
      invalid: -4
      unknown: -1
    max_rate: 0.10

Resolution statistics¶

Each resolve step records per-FK statistics in the step result metadata:

total — total fact rows processed
matched — rows with a successful FK match
unknown — rows assigned on_unknown sentinel
invalid — rows assigned on_invalid sentinel
duplicates — rows with multiple matches (before dedup)
match_rate — percentage of matched rows