Release Notes — v1.10

Release date: March 2026

This release adds three new transformation steps (concat, map, format) and a type_defaults mode for fill_null, two analytical target modes (dimension and fact), audit column templates with built-in presets, shared resource definitions at the loom, weave, and thread levels, the resolve step for declarative FK resolution, and the fk_sentinel_rate assertion. It also includes a broad set of codebase quality improvements spanning model validation, error handling, security hardening, and telemetry reliability.


Concat Step

Concatenates multiple columns into a single string column with configurable null handling, separators, and trimming.

steps:
  - concat:
      target: full_address
      columns: [street, city, state, zip]
      separator: ", "
      null_mode: skip
      trim: true
      collapse_separators: true

Key features:

  • null_mode — skip (omit nulls and collapse their separator), empty (convert nulls to empty strings), or literal (replace with a custom string)
  • null_literal — replacement string when null_mode is literal (default <NULL>)
  • trim — strip leading/trailing whitespace from source columns before joining
  • collapse_separators — when nulls are skipped, collapse adjacent separators into one (default true)
  • When all source values are null in skip mode, the result is NULL rather than an empty string
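The null handling described above can be modelled in a few lines of plain Python. This is only a sketch of the documented behaviour, not weevr's implementation: the function name concat_values is hypothetical, and collapse_separators is left out because skip mode here simply never emits a separator for a null.

```python
from typing import Optional, Sequence


def concat_values(
    values: Sequence[Optional[str]],
    separator: str = ", ",
    null_mode: str = "skip",
    null_literal: str = "<NULL>",
    trim: bool = False,
) -> Optional[str]:
    """Illustrative model of the concat step's null handling (hypothetical helper)."""
    if trim:
        # Strip whitespace from each source value before joining.
        values = [v.strip() if v is not None else None for v in values]
    if null_mode == "skip":
        kept = [v for v in values if v is not None]
        # All-null input yields NULL (None) rather than an empty string.
        return separator.join(kept) if kept else None
    if null_mode == "empty":
        return separator.join("" if v is None else v for v in values)
    if null_mode == "literal":
        return separator.join(null_literal if v is None else v for v in values)
    raise ValueError(f"unsupported null_mode: {null_mode!r}")


print(concat_values([" 1 Main St ", None, "WA", "98101"], trim=True))
```

With skip mode the missing city leaves no dangling separator, while empty mode would keep one separator per source column.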

Map Step

Maps discrete values in a column to new values using a lookup dictionary. Handles nulls and unmapped values independently.

steps:
  - map:
      column: status_code
      target: status_label
      values:
        A: "Active"
        B: "Blocked"
        C: "Closed"
      default: "Unknown"
      on_null: "Missing"
      case_sensitive: false

Key features:

  • target — optional output column; when omitted, the source column is overwritten in place
  • default — fallback value for unmapped inputs (mutually exclusive with unmapped: null and unmapped: validate)
  • on_null — specific replacement for null inputs
  • unmapped — keep (retain original), null (set NULL), or validate (keep and add a boolean flag column __map_unmapped_{column})
  • case_sensitive — set false for case-insensitive matching (default true)
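To make the interplay of default, on_null, and unmapped concrete, here is a small per-value model in Python. It is a sketch under assumptions: map_value is a hypothetical name, the second element of the returned tuple stands in for the __map_unmapped_{column} flag, and default is assumed to take precedence over unmapped: keep.

```python
from typing import Any, Dict, Optional, Tuple


def map_value(
    value: Optional[str],
    values: Dict[str, Any],
    default: Any = None,
    on_null: Any = None,
    unmapped: str = "keep",
    case_sensitive: bool = True,
) -> Tuple[Any, bool]:
    """Return (mapped value, unmapped flag) for one input value (hypothetical helper)."""
    if value is None:
        # Null inputs are handled independently of unmapped values.
        return on_null, False
    table = values if case_sensitive else {k.casefold(): v for k, v in values.items()}
    key = value if case_sensitive else value.casefold()
    if key in table:
        return table[key], False
    if default is not None:
        # Assumption: a configured default wins for unmapped inputs.
        return default, False
    if unmapped == "keep":
        return value, False
    if unmapped == "null":
        return None, False
    if unmapped == "validate":
        # Keep the original value and raise the flag.
        return value, True
    raise ValueError(f"unsupported unmapped mode: {unmapped!r}")


labels = {"A": "Active", "B": "Blocked", "C": "Closed"}
print(map_value("a", labels, case_sensitive=False))
print(map_value("Z", labels, unmapped="validate"))
```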

Format Step

Formats columns using pattern, number, or date rules. Each column specifies exactly one format type.

steps:
  - format:
      columns:
        phone:
          source: raw_phone
          pattern: "({1:3}){4:3}-{7:4}"
        amount_display:
          source: amount
          number: "#,##0.00"
        event_date:
          date: "yyyy-MM-dd"

Three format types:

  • pattern — positional extraction with {position:length} syntax (e.g., phone number formatting). on_short controls behavior when input is shorter than the pattern: null (default) or partial (best effort)
  • number — DecimalFormat pattern for numeric display (e.g., #,##0.00 → 1,234,567.89). strict_types: false auto-casts non-numeric sources
  • date — SimpleDateFormat pattern for date display (e.g., yyyy-MM-dd)
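The {position:length} syntax of the pattern type can be illustrated with a short Python sketch. Two assumptions are baked in and are not confirmed by these notes: positions are treated as 1-based (which is consistent with the phone example, where positions 1, 4, and 7 tile a ten-digit string), and partial mode emits whatever characters are available. The name apply_pattern is hypothetical.

```python
import re
from typing import Optional

# Matches tokens such as {1:3}: start position, then length.
_TOKEN = re.compile(r"\{(\d+):(\d+)\}")


def apply_pattern(value: Optional[str], pattern: str, on_short: str = "null") -> Optional[str]:
    """Fill each {position:length} token with a slice of value (hypothetical helper)."""
    if value is None:
        return None
    short = False

    def substitute(match: "re.Match[str]") -> str:
        nonlocal short
        start = int(match.group(1)) - 1  # assumption: positions are 1-based
        length = int(match.group(2))
        piece = value[start:start + length]
        if len(piece) < length:
            short = True  # input ran out before the token was satisfied
        return piece

    result = _TOKEN.sub(substitute, pattern)
    if short and on_short == "null":
        return None
    return result


print(apply_pattern("5551234567", "({1:3}){4:3}-{7:4}"))
```

For a ten-digit input the pattern from the example yields (555)123-4567; a five-digit input returns None under the default on_short setting.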

Type-Aware fill_null

The fill_null step gains a new type_defaults mode that assigns semantically appropriate defaults based on each column's data type.

steps:
  - fill_null:
      mode: type_defaults
      code: unknown
      include: ["amount_*"]
      exclude: ["id"]
      overrides:
        region: "Unspecified"

Semantic codes and their defaults:

Code            String          Integer  Boolean  Date
unknown         Unknown         0        false    1970-01-01
not_applicable  Not Applicable  0        false    1970-01-01
invalid         Invalid         0        false    1970-01-01

Key features:

  • include/exclude — glob patterns to restrict which columns receive defaults (e.g., include: ["addr_*"])
  • overrides — per-column replacements applied on top of type-based defaults
  • where — conditional predicate to fill only matching rows
  • Composable — mode: type_defaults and explicit columns can coexist in the same step; type defaults apply first, then explicit columns override
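One way to picture how type_defaults, include/exclude, and overrides combine is to compute a per-column fill plan. The sketch below is illustrative only: plan_fill_values and the type names in the schema dict are hypothetical, and it assumes overrides apply even to columns that the include patterns skipped, as the region override in the example suggests.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict, Iterable, Optional

# Defaults taken from the semantic code table above.
TYPE_DEFAULTS: Dict[str, Dict[str, Any]] = {
    "unknown": {"string": "Unknown", "integer": 0, "boolean": False, "date": "1970-01-01"},
    "not_applicable": {"string": "Not Applicable", "integer": 0, "boolean": False, "date": "1970-01-01"},
    "invalid": {"string": "Invalid", "integer": 0, "boolean": False, "date": "1970-01-01"},
}


def plan_fill_values(
    schema: Dict[str, str],
    code: str = "unknown",
    include: Optional[Iterable[str]] = None,
    exclude: Optional[Iterable[str]] = None,
    overrides: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return a column -> fill value plan (hypothetical helper)."""
    include, exclude = list(include or []), list(exclude or [])
    defaults = TYPE_DEFAULTS[code]
    plan: Dict[str, Any] = {}
    for column, dtype in schema.items():
        if include and not any(fnmatchcase(column, p) for p in include):
            continue
        if any(fnmatchcase(column, p) for p in exclude):
            continue
        if dtype in defaults:
            plan[column] = defaults[dtype]
    # Assumption: overrides are applied last and regardless of include/exclude.
    plan.update(overrides or {})
    return plan


schema = {"amount_net": "integer", "amount_note": "string", "id": "integer", "region": "string"}
print(plan_fill_values(schema, include=["amount_*"], exclude=["id"], overrides={"region": "Unspecified"}))
```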

Analytical Target Modes

Dimension Mode

Declares a dimension table with surrogate key generation, business key identification, SCD Type 2 history tracking, change detection groups, and system member rows.

target:
  alias: gold.dim_customer
  dimension:
    business_key: [customer_id]
    surrogate_key:
      name: sk_customer_id
      algorithm: sha256
      columns: [customer_id]
      output: native
    track_history: true
    change_detection:
      attrs:
        columns: [customer_name, address]
        on_change: version
      static:
        columns: [region_code]
        on_change: static
    columns:
      valid_from: _valid_from
      valid_to: _valid_to
      is_current: _is_current
    system_members:
      - sk: -1
        code: UNKNOWN
        label: Unknown
      - sk: -4
        code: INVALID
        label: Invalid Data

Key features:

  • surrogate_key — hash-based key generation with configurable algorithm (sha256, xxhash64, md5, murmur3, etc.) and output type (native or string)
  • business_key — one or more columns forming the natural key
  • track_history — enables SCD Type 2 with valid_from, valid_to, and is_current tracking columns
  • change_detection — named groups with independent on_change behavior: version (new SCD row), overwrite (update in place), or static (non-versioned metadata). columns: auto captures remaining unclaimed columns
  • previous_columns — capture prior values before an update (e.g., previous_name: customer_name)
  • additional_keys — extra hash keys beyond the primary surrogate
  • system_members — sentinel rows with negative SK values for Unknown, Invalid, etc. seed_system_members: true inserts them on first load
  • history_filter — expose a filtered view of current rows only (default true)
  • dates — configurable SCD boundary dates (min default 1970-01-01, max default 9999-12-31)
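The on_change behaviours can be sketched as a generic SCD Type 2 update on plain dicts. This is not weevr's merge logic: apply_change is a hypothetical name, the tracking column names come from the configuration example above, and static groups are assumed to never alter the stored row.

```python
from typing import Dict, List


def apply_change(
    current: dict,
    incoming: dict,
    groups: Dict[str, dict],
    load_date: str,
    max_date: str = "9999-12-31",
) -> List[dict]:
    """Return the rows that replace `current` after comparing it with `incoming`."""
    # Groups in which at least one column differs between stored and incoming values.
    changed = [
        group for group in groups.values()
        if any(current[c] != incoming[c] for c in group["columns"])
    ]
    actions = {group["on_change"] for group in changed}

    def copy_changes(row: dict, kinds: set) -> dict:
        for group in changed:
            if group["on_change"] in kinds:
                for column in group["columns"]:
                    row[column] = incoming[column]
        return row

    if "version" in actions:
        # Close the current version and open a new one.
        closed = {**current, "_valid_to": load_date, "_is_current": False}
        opened = copy_changes(dict(current), {"version", "overwrite"})
        opened.update({"_valid_from": load_date, "_valid_to": max_date, "_is_current": True})
        return [closed, opened]
    if "overwrite" in actions:
        # Update in place without creating history.
        return [copy_changes(dict(current), {"overwrite"})]
    # Assumption: changes confined to static groups leave the row untouched.
    return [current]


groups = {
    "attrs": {"columns": ["customer_name", "address"], "on_change": "version"},
    "static": {"columns": ["region_code"], "on_change": "static"},
}
current = {"customer_id": 1, "customer_name": "Ann", "address": "1 Main", "region_code": "W",
           "_valid_from": "1970-01-01", "_valid_to": "9999-12-31", "_is_current": True}
incoming = {"customer_id": 1, "customer_name": "Ann", "address": "2 Oak", "region_code": "E"}
for row in apply_change(current, incoming, groups, "2026-03-01"):
    print(row)
```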

Fact Mode

Declares a fact table with foreign key columns and sentinel value conventions for data quality enforcement.

target:
  alias: gold.fact_sales
  fact:
    foreign_keys:
      - customer_sk
      - product_sk
      - region_sk
    sentinel_values:
      invalid: -4
      missing: -1

Key features:

  • foreign_keys — required non-empty list of FK columns referencing dimensions
  • sentinel_values — conventions for missing and invalid data (defaults: invalid: -4, missing: -1). Values must be distinct to prevent stats double-counting
  • Pairs naturally with the resolve step and fk_sentinel_rate assertion for end-to-end FK resolution and quality gates

Audit Column Templates

Background

v1.6 introduced audit column injection via inline audit_columns dicts at the thread target level. v1.10 adds named templates — reusable sets of audit columns that can be defined at any level and referenced by name, with two built-in presets.

Built-in Presets

Preset   Cols  Description
fabric   9     Fabric pipeline metadata — batch, pipeline, workspace, Spark app
minimal  3     Lightweight — loaded_at, run_id, thread name

Configuration

Define custom templates and reference them by name:

# Define at loom, weave, or thread level
audit_templates:
  my_standard:
    columns:
      _loaded_at: "current_timestamp()"
      _run_id: "${param.run_id}"
      _environment: "'${param.env}'"

target:
  alias: gold.customers
  audit_template: minimal

Multiple templates can be referenced and merged in order:

target:
  alias: gold.orders
  audit_template:
    - minimal
    - my_standard
  audit_columns:
    _custom_col: "custom_value()"
  audit_columns_exclude:
    - "_batch_*"

Inheritance

Templates cascade from loom → weave → thread:

  • Templates defined at the loom level are inherited by all weaves and threads
  • Weave-level templates extend or override loom-level ones
  • Thread-level templates extend or override weave-level ones
  • Set audit_template_inherit: false on a thread to block inheritance from parent levels
  • Inline audit_columns merge additively on top of resolved templates (same-named columns override)
  • audit_columns_exclude applies glob patterns last to remove unwanted columns
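The resolution order listed above (templates in order, then inline columns, then exclusions) can be expressed as a small Python function. This is a sketch, not the actual resolver: resolve_audit_columns is a hypothetical name, and the template named base_sketch is invented for the example rather than being the content of a built-in preset.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable, Optional


def resolve_audit_columns(
    templates: Dict[str, dict],
    template_names: Iterable[str],
    inline: Optional[Dict[str, str]] = None,
    exclude: Iterable[str] = (),
) -> Dict[str, str]:
    """Merge templates in order, apply inline columns, then drop excluded names."""
    patterns = list(exclude)
    resolved: Dict[str, str] = {}
    for name in template_names:
        # Later templates override same-named columns from earlier ones.
        resolved.update(templates[name]["columns"])
    # Inline audit_columns merge on top of the resolved templates.
    resolved.update(inline or {})
    # Glob exclusions are applied last.
    return {
        column: expr for column, expr in resolved.items()
        if not any(fnmatchcase(column, p) for p in patterns)
    }


templates = {
    "base_sketch": {"columns": {"_loaded_at": "current_timestamp()", "_run_id": "${run.id}",
                                "_batch_id": "${param.batch}"}},
    "my_standard": {"columns": {"_run_id": "${param.run_id}", "_environment": "'${param.env}'"}},
}
print(resolve_audit_columns(templates, ["base_sketch", "my_standard"],
                            inline={"_custom_col": "custom_value()"}, exclude=["_batch_*"]))
```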

Context Variables

Templates support runtime substitution:

  • ${thread.name}, ${thread.qualified_key}, ${thread.source}, ${thread.sources}
  • ${weave.name}, ${loom.name}
  • ${run.timestamp}, ${run.id}
  • ${param.*} (runtime parameters)

Shared Resource Universality

Prior to v1.10, shared resources like lookups, column sets, and variables could only be defined at the thread level. This release promotes resource definitions to the loom and weave levels, enabling centralized configuration that cascades down the hierarchy.

Resources at Every Level

The following resources can now be defined at loom, weave, or thread levels:

  • lookups — named lookup source definitions
  • column_sets — named column rename mappings
  • variables — named variable specs
  • pre_steps / post_steps — named hook steps
  • params — parameter definitions
  • execution — execution config (log level, tracing)
  • naming — column naming conventions
  • audit_templates — audit column template definitions

Cascading Rules

Resources merge from loom → weave → thread, with the most specific level winning on conflicts:

# loom.yaml — shared across all weaves
lookups:
  dim_customer:
    source:
      type: delta
      alias: staging.dim_customer

# weave.yaml — adds weave-specific lookups
lookups:
  ref_status:
    source:
      type: delta
      alias: reference.status_codes

# thread.yaml — overrides loom-level customer lookup
lookups:
  dim_customer:
    source:
      type: delta
      alias: dev.dim_customer_test

Merge semantics:

  • Additive merge — lookups, column_sets, variables, pre_steps, post_steps merge by name across levels; same-named entries at a lower level override the parent
  • Replacement — execution, naming, params at a lower level replace the parent entirely
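The two merge behaviours can be sketched in Python by folding the levels from loom to thread. The function name cascade is hypothetical, and the log_level and tracing keys inside execution are placeholders chosen for the example, not confirmed field names.

```python
from typing import Any, Dict

# Resource kinds that merge by name; every other key is replaced wholesale.
ADDITIVE = {"lookups", "column_sets", "variables", "pre_steps", "post_steps"}


def cascade(*levels: Dict[str, Any]) -> Dict[str, Any]:
    """Fold configs ordered loom, weave, thread into one effective config."""
    effective: Dict[str, Any] = {}
    for level in levels:
        for key, value in level.items():
            if key in ADDITIVE:
                # Same-named entries at the lower level override the parent's.
                effective[key] = {**effective.get(key, {}), **value}
            else:
                effective[key] = value
    return effective


loom = {
    "lookups": {"dim_customer": {"source": {"type": "delta", "alias": "staging.dim_customer"}}},
    "execution": {"log_level": "INFO", "tracing": True},  # hypothetical keys
}
weave = {"lookups": {"ref_status": {"source": {"type": "delta", "alias": "reference.status_codes"}}}}
thread = {
    "lookups": {"dim_customer": {"source": {"type": "delta", "alias": "dev.dim_customer_test"}}},
    "execution": {"log_level": "DEBUG"},
}
effective = cascade(loom, weave, thread)
print(sorted(effective["lookups"]), effective["execution"])
```

Note that the thread-level execution block drops the loom's tracing setting entirely, which is the practical difference between replacement and additive merge.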

Resolve Step

Foreign key resolution is a universal pattern in dimensional modeling. Every fact table requires resolving business keys from source systems into surrogate keys from dimension tables. Previously, this required chaining join, derive, filter, and coalesce steps — verbose, error-prone, and lacking standardized sentinel handling.

Single FK Resolve

The resolve step encapsulates the complete FK resolution pattern in one declarative block:

steps:
  - resolve:
      name: plant_id
      lookup: dim_plant
      match: plant_code
      pk: id
      on_invalid: -4
      on_unknown: -1

Key features:

  • BK completeness check — null/blank source columns automatically receive the on_invalid sentinel
  • Sentinel assignment — aligned with system member codes (on_invalid default -4, on_unknown default -1)
  • Match sugar — string, list, or dict forms for column mapping
  • Normalization — trim_lower, trim_upper, trim presets applied symmetrically to both sides
  • Include columns — bring additional lookup columns into the fact with optional rename dict and prefix
  • on_duplicate — warn (default), error, or first for multi-match scenarios
  • Resolution stats — per-FK metadata (total, matched, unknown, invalid, duplicates, match_rate)
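As a rough model of the completeness check, sentinel assignment, symmetric normalization, and stats, the following Python sketch resolves a list of business keys against an in-memory lookup. It is not the Spark-based implementation: resolve_keys is a hypothetical name, the lookup is a plain dict of business key to surrogate key, and duplicate handling is omitted.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

_NORMALIZERS: Dict[Optional[str], Callable[[str], str]] = {
    None: lambda s: s,
    "trim": lambda s: s.strip(),
    "trim_lower": lambda s: s.strip().lower(),
    "trim_upper": lambda s: s.strip().upper(),
}


def resolve_keys(
    values: List[Optional[str]],
    lookup: Dict[str, int],
    on_invalid: int = -4,
    on_unknown: int = -1,
    normalize: Optional[str] = None,
) -> Tuple[List[int], dict]:
    """Resolve business keys to surrogate keys with sentinel fallbacks."""
    norm = _NORMALIZERS[normalize]
    # Normalize the lookup side too, so both sides compare symmetrically.
    index = {norm(bk): sk for bk, sk in lookup.items()}
    resolved: List[int] = []
    stats: Counter = Counter()
    for value in values:
        if value is None or not value.strip():
            resolved.append(on_invalid)  # incomplete business key
            stats["invalid"] += 1
        elif norm(value) in index:
            resolved.append(index[norm(value)])
            stats["matched"] += 1
        else:
            resolved.append(on_unknown)  # well-formed key with no match
            stats["unknown"] += 1
    total = len(values)
    return resolved, {
        "total": total,
        "matched": stats["matched"],
        "unknown": stats["unknown"],
        "invalid": stats["invalid"],
        "match_rate": stats["matched"] / total if total else 0.0,
    }


keys, stats = resolve_keys([" p100 ", "P999", None, "  ", "P200"],
                           {"P100": 1, "P200": 2}, normalize="trim_lower")
print(keys, stats)
```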

SCD2 Narrowing

The effective block supports two sub-modes for point-in-time dimension resolution:

# Current flag (string sugar or dict form)
effective:
  current: is_current

# Date range (half-open interval [from, to))
effective:
  date_column: order_date
  from: effective_from
  to: effective_to

A general where predicate can compose with effective using AND semantics for custom narrowing.
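The half-open interval matters at version boundaries: a fact dated exactly on a boundary should match the newer version only. A minimal Python sketch of that comparison follows; effective_row and the bk and sk field names are hypothetical, while effective_from and effective_to mirror the example above.

```python
from datetime import date
from typing import List, Optional


def effective_row(rows: List[dict], business_key: str, as_of: date) -> Optional[dict]:
    """Pick the dimension version whose [from, to) interval contains as_of."""
    for row in rows:
        # Inclusive lower bound, exclusive upper bound.
        if row["bk"] == business_key and row["effective_from"] <= as_of < row["effective_to"]:
            return row
    return None


versions = [
    {"bk": "P100", "sk": 1, "effective_from": date(2025, 1, 1), "effective_to": date(2026, 1, 1)},
    {"bk": "P100", "sk": 2, "effective_from": date(2026, 1, 1), "effective_to": date(9999, 12, 31)},
]
print(effective_row(versions, "P100", date(2026, 1, 1))["sk"])
```

Because the upper bound is exclusive, the two versions never both match the same order date.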

Batch Mode

Resolve multiple FKs in one step with shared defaults:

- resolve:
    pk: id
    on_invalid: -4
    on_unknown: -1
    batch:
      - name: plant_id
        lookup: dim_plant
        match: plant_code
      - name: customer_id
        lookup: dim_customer
        match:
          customer_code: natural_id
        normalize: trim_lower

Item-level values override shared defaults. Source columns are dropped only after all FKs complete, preventing mid-batch failures when columns are shared across resolutions.
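The shared-defaults rule amounts to a dict merge in which each batch item wins over the step-level values. A short Python sketch, with expand_batch as a hypothetical helper name and an on_unknown override added to the second item purely for illustration:

```python
from typing import Any, Dict, List


def expand_batch(step: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Expand a batch resolve step into one effective config per FK."""
    shared = {key: value for key, value in step.items() if key != "batch"}
    # Item-level values override the shared defaults.
    return [{**shared, **item} for item in step["batch"]]


step = {
    "pk": "id",
    "on_invalid": -4,
    "on_unknown": -1,
    "batch": [
        {"name": "plant_id", "lookup": "dim_plant", "match": "plant_code"},
        {"name": "customer_id", "lookup": "dim_customer",
         "match": {"customer_code": "natural_id"}, "normalize": "trim_lower",
         "on_unknown": -2},  # illustrative item-level override
    ],
}
for item in expand_batch(step):
    print(item)
```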

Pipeline Integration

The resolve step is dispatched via special handling in run_pipeline(), which now accepts an optional lookups parameter. The executor passes the effective cached lookups (merged from loom, weave, and thread levels) through to the pipeline.

Setting on_failure: warn changes what happens when a lookup cannot be found: instead of aborting the thread, the step assigns the on_unknown sentinel to all rows and logs a warning. This option applies to single mode only.


fk_sentinel_rate Assertion

A new post-write assertion type for checking FK sentinel value rates:

assertions:
  - type: fk_sentinel_rate
    column: plant_id
    sentinel: -4
    max_rate: 0.05
    message: "plant FK invalid rate exceeded"

Supports:

  • Single column or columns list — each checked independently
  • Named sentinel groups — dict-of-int (shared max_rate) or dict-of-dict (per-group rates)
  • System member codes — string values resolved at evaluation time
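The check itself reduces to a ratio compared against a threshold. The Python sketch below models the single-column, single-sentinel case on an in-memory list; the function names are hypothetical, and treating a rate equal to max_rate as passing is an assumption.

```python
from typing import List, Tuple


def sentinel_rate(values: List[int], sentinel: int) -> float:
    """Fraction of rows whose FK equals the sentinel value."""
    if not values:
        return 0.0
    return sum(1 for v in values if v == sentinel) / len(values)


def check_fk_sentinel_rate(values: List[int], sentinel: int, max_rate: float) -> Tuple[bool, float]:
    """Return (passed, observed rate); assumes rate <= max_rate passes."""
    rate = sentinel_rate(values, sentinel)
    return rate <= max_rate, rate


plant_ids = [7] * 96 + [-4] * 4
print(check_fk_sentinel_rate(plant_ids, sentinel=-4, max_rate=0.05))
```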

Config Validation

validate_resolve_lookups() checks at config time that all resolve steps reference defined lookup names, catching configuration errors before execution.


Codebase Quality Improvements

Model Validation

  • Step discriminator fix — multi-word step types (case_when, fill_null, string_ops, date_ops) now round-trip correctly through the discriminated union when passed as model instances
  • Empty collection guards — SelectParams, DropParams, CastParams, DeriveParams, SortParams, UnionParams now reject empty column/source lists at parse time
  • Target requires alias or path — at least one must be set
  • Thread requires non-empty sources — empty sources dict rejected at parse time
  • ColumnSetSource type validation — delta requires alias, yaml requires path
  • fk_sentinel_rate presence validation — requires at least one of column/columns and sentinel/sentinels
  • on_invalid/on_unknown uniqueness — equal sentinel values rejected to prevent stats double-counting
  • DimensionSurrogateKeyConfig — renamed from SurrogateKeyConfig in the dimension module to resolve namespace collision with keys.SurrogateKeyConfig

Public API

  • STEP_TYPES — renamed from _STEP_TYPES (now public)
  • New exports from weevr.model: ConcatStep, ConcatParams, MapStep, MapParams, FormatStep, FormatSpec, FormatParams, ResolveStep, ResolveParams, ResolveBatchItem, EffectiveConfig, CurrentConfig, DimensionSurrogateKeyConfig
  • quote_identifier — renamed from _quote_identifier
  • CONTEXT_VAR_PATTERN — renamed from _CONTEXT_VAR_PATTERN

Security and Reliability

  • SQL injection guard — table aliases in quality gate SQL queries are now backtick-escaped via _quote_table_ref()
  • Path traversal guard — resolve_ref_path rejects refs that escape the project root
  • Thread name sanitization — single quotes escaped in table property keys
  • Assert replacement — all assert statements in production code replaced with explicit if-raise guards that survive Python -O optimization
  • Error type normalization — ValueError replaced with ExecutionError in resolve and formatting handlers for consistent error taxonomy
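The assert replacement noted above follows a common Python pattern: assert statements are removed when the interpreter runs with -O, so a check that must always run needs an explicit raise. A generic sketch, using a hypothetical function and ValueError as a stand-in exception type rather than any specific weevr guard:

```python
from typing import Optional


def require_alias_or_path(alias: Optional[str], path: Optional[str]) -> str:
    """Return the target identifier, failing loudly when neither is set."""
    # An `assert alias or path` here would be stripped under `python -O`;
    # the explicit guard below runs in every mode.
    if not (alias or path):
        raise ValueError("target requires an alias or a path")
    return alias or path  # type: ignore[return-value]


print(require_alias_or_path("gold.customers", None))
```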

Telemetry and Observability

  • Span finalization — weave and loom telemetry spans are now finalized in finally blocks, producing complete traces even on failure paths
  • Weave status fix — a weave where all threads succeeded or were conditionally skipped now correctly returns "success" (was "partial")
  • CDC count consolidation — three separate .count() calls in execute_cdc_merge consolidated into a single groupBy aggregation
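The span finalization item above relies on the guarantee that a finally block runs on both success and failure paths. The sketch below shows the general shape with plain dicts standing in for telemetry spans; run_with_span and the sink list are hypothetical and do not reflect weevr's telemetry API.

```python
from typing import Any, Callable, List


def run_with_span(name: str, work: Callable[[], Any], sink: List[dict]) -> Any:
    """Run work() and always record a finalized span, even when it raises."""
    span = {"name": name, "status": "running"}
    try:
        result = work()
        span["status"] = "success"
        return result
    except Exception:
        span["status"] = "failure"
        raise
    finally:
        # Executes on both paths, so the trace is never left incomplete.
        sink.append(span)


def failing_work() -> None:
    raise RuntimeError("thread failed")


spans: List[dict] = []
run_with_span("weave_ok", lambda: 42, spans)
try:
    run_with_span("weave_failed", failing_work, spans)
except RuntimeError:
    pass
print(spans)
```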

Minor Improvements

  • __dedup_rn__ temp column name prevents collision with user columns named _rn
  • FormatSpec rejects empty pattern strings
  • ConcatParams rejects empty null_literal with literal mode
  • Export treats empty-string path/alias same as None
  • fact.py logs exceptions at DEBUG level instead of silently swallowing them in the FK sentinel advisory check
  • WeaveTelemetry.column_set_results typed as list[ColumnSetResult] (was list[Any])
  • Internal planning IDs stripped from all source and test files