
# Weave YAML Schema

A weave groups a collection of threads into a dependency graph. It defines which threads run, their execution order, optional shared defaults, and runtime settings.


## Top-level keys

| Key | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `config_version` | string | yes | -- | Schema version identifier (e.g. `"1.0"`) |
| `name` | string | no | `""` | Human-readable weave name |
| `threads` | list[ThreadEntry or string] | yes | -- | Thread references. Strings are shorthand for `{ name: "<value>" }` (local name references). Use `ref` for external file references. |
| `lookups` | dict[string, Lookup] | no | null | Named lookup sources shared across threads in this weave |
| `column_sets` | dict[string, ColumnSet] | no | null | Named column sets for bulk renames from external dictionaries, shared across threads in this weave |
| `variables` | dict[string, VariableSpec] | no | null | Weave-scoped typed variables, settable by hook steps |
| `pre_steps` | list[HookStep] | no | null | Hook steps executed before any thread runs |
| `post_steps` | list[HookStep] | no | null | Hook steps executed after all threads complete |
| `defaults` | dict[string, any] | no | null | Default values cascaded into every thread in this weave. `audit_columns` and `exports` use additive merge (see the Exports guide). |
| `params` | dict[string, ParamSpec] | no | null | Typed parameter declarations scoped to this weave |
| `execution` | ExecutionConfig | no | null | Runtime settings (logging, tracing) cascaded to threads |
| `naming` | NamingConfig | no | null | Naming normalization cascaded to threads |
| `audit_templates` | dict[string, AuditTemplate] | no | null | Named audit column templates cascaded to all threads in this weave. Thread-level definitions override weave-level definitions with the same name. A flat dict[string, string] shorthand is also accepted. See the Audit Templates guide. |
| `connections` | dict[string, OneLakeConnection] | no | null | Named connection definitions cascaded to threads. Thread-level connections with the same name override weave-level ones. See the Connections guide. |

## threads (ThreadEntry)

Each entry in the `threads` list is either a plain string (shorthand) or a ThreadEntry object. Use `ref` to reference an external `.thread` file, or `name` for an inline thread definition within the weave.

| Key | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `ref` | string | no | null | Path to an external `.thread` file, relative to the project root. Mutually exclusive with inline `name`. |
| `name` | string | no | `""` | Thread name. Required for inline definitions; derived from the filename stem when using `ref`. |
| `as` | string | no | null | Alias overriding the thread's effective name. Required when the same `ref` appears multiple times. See the Thread Templates guide. |
| `params` | dict[string, any] | no | null | Key-value pairs injected into the thread's `${param.x}` namespace for this entry. See the Thread Templates guide. |
| `dependencies` | list[string] | no | null | Explicit upstream thread names. Merged with dependencies auto-inferred from source/target path matching. |
| `condition` | ConditionSpec | no | null | Conditional execution gate |
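
For instance, a templated entry might combine these fields (the file path, alias, and parameter names below are illustrative, not part of the schema):

```yaml
threads:
  - ref: templates/load_entity.thread   # external .thread file
    as: load_products                    # alias; required if this ref is reused
    params:
      entity: products                   # available as ${param.entity} in the thread
    dependencies: [load_reference_data]  # merged with auto-inferred dependencies
```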

### threads.condition (ConditionSpec)

| Key | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `when` | string | yes | -- | Condition expression. Supports `${param.x}` references, `table_exists()`, `table_empty()`, `row_count()`, and boolean operators. |
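
As a sketch, a gate that skips a thread unless its upstream table exists and has rows (the `and`/`not` spelling is an assumption; the schema documents only that boolean operators are supported):

```yaml
threads:
  - ref: curated/refresh_snapshot.thread
    condition:
      when: "table_exists('curated.order_summary') and not table_empty('curated.order_summary')"
```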

## lookups

Named lookup data sources shared across threads. Each key is a lookup name that threads can reference via `source.lookup` instead of declaring a direct source type.

| Key | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `source` | Source | yes | -- | Source definition for the lookup data |
| `materialize` | bool | no | false | Pre-read and cache/broadcast the data before thread execution |
| `strategy` | string | no | `"cache"` | Materialization strategy: `"broadcast"` (Spark broadcast join hints) or `"cache"` (DataFrame caching). Only meaningful when `materialize` is true. |
| `key` | list[string] | no | null | Column(s) used for matching (join key). When set together with `values`, only these columns plus the `values` columns are retained. |
| `values` | list[string] | no | null | Payload column(s) to retrieve. Requires `key` to be set. Must not overlap with `key`. |
| `filter` | string | no | null | SQL WHERE expression applied to the source before projection |
| `unique_key` | bool | no | false | Validate that the `key` columns form a unique key after filtering and projection |
| `on_failure` | string | no | `"abort"` | Behavior when the `unique_key` check finds duplicates: `"abort"` or `"warn"`. Only valid when `unique_key` is true. |

```yaml
lookups:
  dim_product:
    source:
      type: delta
      alias: silver.dim_product
    materialize: true
    strategy: cache
    key: [product_id]
    values: [product_name, category]
    unique_key: true

  exchange_rates:
    source:
      type: delta
      alias: ref.exchange_rates
    filter: "effective_date = current_date()"
    materialize: true
    strategy: broadcast
```

Threads reference lookups via the `lookup` field on a source:

```yaml
# In a thread's sources block
sources:
  products:
    lookup: dim_product   # resolved from the weave's lookups map
```

## column_sets

Named column sets define externally-sourced column mappings for bulk rename operations. Each key is a column set name that rename steps can reference via `column_set: <name>`.

Column sets can be sourced from a Delta table, a YAML file, or a runtime parameter. A column set must declare exactly one of `source` or `param`.

| Key | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `source` | ColumnSetSource | no | null | External source for the column mapping (Delta or YAML) |
| `param` | string | no | null | Runtime parameter name containing a dict[str, str] mapping. Mutually exclusive with `source`. |
| `on_unmapped` | string | no | `"pass_through"` | Behavior when a DataFrame column has no rename entry: `"pass_through"` or `"error"` |
| `on_extra` | string | no | `"ignore"` | Behavior when a mapping key references a column not in the DataFrame: `"ignore"`, `"warn"`, or `"error"` |
| `on_failure` | string | no | `"abort"` | Behavior when the source cannot be read or returns no rows: `"abort"`, `"warn"`, or `"skip"` |

### column_sets.source (ColumnSetSource)

A delta source can be addressed in two ways: by registered table alias (resolved against the default Spark catalog) or by named connection plus table (resolved through the lakehouse referenced by a `connections:` entry on the loom or weave). The two modes are mutually exclusive.

| Key | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `type` | string | yes | -- | Source type: `"delta"` or `"yaml"` |
| `alias` | string | cond. | null | Lakehouse table alias. One of `alias` or `connection` is required when `type="delta"`. Mutually exclusive with `connection`. |
| `path` | string | cond. | null | File path. Required when `type="yaml"`. |
| `connection` | string | cond. | null | Reference to a named connection defined at the loom or weave level. Only valid for `type="delta"`. Mutually exclusive with `alias`. |
| `schema` | string | no | null | Schema name within the connection's lakehouse. Only valid alongside `connection`. |
| `table` | string | cond. | null | Table name within the connection's lakehouse. Required when `connection` is set. |
| `from_column` | string | no | `"source_name"` | Column containing the original (raw) column names |
| `to_column` | string | no | `"target_name"` | Column containing the target (friendly) column names |
| `filter` | string | no | null | SQL WHERE expression applied before collecting mappings |

```yaml
# Connection-backed Delta source — reads from a non-default lakehouse
# referenced by a loom-level connections: entry.
connections:
  ref:
    type: onelake
    workspace: ${param.workspace_id}
    lakehouse: ${param.ref_lakehouse_id}

column_sets:
  sap_dictionary:
    source:
      type: delta
      connection: ref
      schema: dictionary
      table: sap_column_renames
      from_column: raw_name
      to_column: friendly_name
      filter: "system = 'SAP'"
    on_unmapped: pass_through
    on_extra: ignore
    on_failure: abort

  # Alias-backed Delta source — resolves against the active Spark catalog
  # (the attached notebook lakehouse on Fabric).
  legacy_dictionary:
    source:
      type: delta
      alias: ref.column_dictionary

  hr_mappings:
    source:
      type: yaml
      path: mappings/hr_columns.yaml

  runtime_overrides:
    param: column_overrides
```

Rename steps reference column sets by name:

```yaml
steps:
  - rename:
      column_set: sap_dictionary
      columns:
        MANUAL_COL: manual_override  # static entries override the column set
```

## pre_steps / post_steps (HookStep)

Hook steps that run before or after thread execution within a weave. `pre_steps` run before any thread starts; `post_steps` run after all threads complete.

Each hook step is one variant of a discriminated union with three step types, identified by the `type` field.

### Common fields

All hook step types share these fields:

| Key | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `type` | string | yes | -- | Step type: `"quality_gate"`, `"sql_statement"`, or `"log_message"` |
| `name` | string | no | null | Optional name for telemetry and diagnostics |
| `on_failure` | string | no | phase default | Failure behavior: `"abort"` or `"warn"`. Defaults to `"abort"` for `pre_steps` and `"warn"` for `post_steps`. |

### quality_gate

Runs a predefined quality check against a data source or target.

| Key | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `check` | string | yes | -- | Check type: `"source_freshness"`, `"row_count_delta"`, `"row_count"`, `"table_exists"`, `"expression"` |
| `source` | string | cond. | null | Table alias. Required for `source_freshness` and `table_exists`. |
| `max_age` | string | cond. | null | Duration string (e.g. `"24h"`). Required for `source_freshness`. |
| `target` | string | cond. | null | Table alias. Required for `row_count_delta` and `row_count`. |
| `max_decrease_pct` | float | no | null | Maximum allowed decrease percentage for `row_count_delta` |
| `max_increase_pct` | float | no | null | Maximum allowed increase percentage for `row_count_delta` |
| `min_delta` | int | no | null | Minimum absolute row change for `row_count_delta` |
| `max_delta` | int | no | null | Maximum absolute row change for `row_count_delta` |
| `min_count` | int | cond. | null | Minimum row count for `row_count`. At least one of `min_count` or `max_count` is required. |
| `max_count` | int | cond. | null | Maximum row count for `row_count` |
| `sql` | string | cond. | null | Spark SQL boolean expression. Required for the `expression` check. |
| `message` | string | no | null | Failure message for `expression` check diagnostics |
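
As a sketch, a post-run gate bounding run-over-run row drift (the target alias and thresholds are illustrative):

```yaml
post_steps:
  - type: quality_gate
    name: bound_row_drift
    check: row_count_delta
    target: silver.fact_orders
    max_decrease_pct: 5.0   # fail if the table shrinks by more than 5%
    max_increase_pct: 50.0  # fail if it grows by more than 50%
    on_failure: warn        # the post_steps default, shown explicitly
```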

### sql_statement

Executes an arbitrary Spark SQL statement. Optionally captures the scalar result into a weave-scoped variable.

| Key | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `sql` | string | yes | -- | Spark SQL statement to execute |
| `set_var` | string | no | null | Variable name to capture the scalar result into. The variable must be declared in `variables`. |

### log_message

Emits a log message. Supports `${var.name}` placeholders resolved from weave variables.

| Key | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `message` | string | yes | -- | Message template. Supports `${var.name}` placeholders. |
| `level` | string | no | `"info"` | Log level: `"info"`, `"warn"`, `"error"` |

```yaml
pre_steps:
  - type: quality_gate
    name: check_source_freshness
    check: source_freshness
    source: raw.orders
    max_age: "24h"

  - type: sql_statement
    name: capture_baseline
    sql: "SELECT count(*) FROM silver.fact_orders"
    set_var: baseline_count

post_steps:
  - type: quality_gate
    name: check_row_count
    check: row_count
    target: silver.fact_orders
    min_count: 1

  - type: log_message
    message: "Pipeline complete. Baseline was ${var.baseline_count} rows."
    level: info
```

## variables (VariableSpec)

Typed scalar variables scoped to the weave. Variables can be set by `sql_statement` hook steps via `set_var` and referenced in downstream config as `${var.name}`.

| Key | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `type` | string | yes | -- | Scalar type: `"string"`, `"int"`, `"long"`, `"float"`, `"double"`, `"boolean"`, `"timestamp"`, `"date"` |
| `default` | string \| int \| float \| bool | no | null | Default value used when no hook step sets the variable |

```yaml
variables:
  baseline_count:
    type: int
    default: 0
  run_label:
    type: string
    default: "scheduled"
```

## execution (ExecutionConfig)

Cascades to all threads within the weave. Thread-level settings override these.

| Key | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `log_level` | string | no | `"standard"` | Logging verbosity: `"minimal"`, `"standard"`, `"verbose"`, `"debug"` |
| `trace` | bool | no | true | Collect execution spans for telemetry |
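
A minimal sketch (values are illustrative):

```yaml
execution:
  log_level: debug   # one of: minimal, standard, verbose, debug
  trace: false       # skip span collection for this weave
```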

## naming (NamingConfig)

Cascades to all threads within the weave. Thread-level and target-level naming settings override these.

| Key | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `columns` | NamingPattern | no | null | Column naming pattern |
| `tables` | NamingPattern | no | null | Table naming pattern |
| `exclude` | list[string] | no | [] | Names or glob patterns excluded from normalization |
| `on_collision` | string | no | `"error"` | Behavior when normalization produces duplicate names: `"error"` or `"suffix"` (appends `_2`, `_3`, etc.) |
| `reserved_words` | ReservedWordConfig | no | null | Reserved word protection for column and table names |

Supported patterns: `snake_case`, `camelCase`, `PascalCase`, `UPPER_SNAKE_CASE`, `Title_Snake_Case`, `Title Case`, `lowercase`, `UPPERCASE`, `kebab-case`, `none`.

> **kebab-case requires backtick-quoting.** Column names with hyphens must be backtick-quoted in Spark SQL expressions (e.g., `` `my-column` ``). A validation warning is emitted when `kebab-case` is selected.
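
A minimal sketch combining these keys (the glob pattern is illustrative):

```yaml
naming:
  columns: snake_case
  tables: snake_case
  exclude: ["Legacy_*"]   # glob: leave matching names untouched
  on_collision: suffix    # duplicates become name_2, name_3, ...
```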

### naming.reserved_words (ReservedWordConfig)

| Key | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `strategy` | string | no | `"quote"` | Collision strategy (see below) |
| `prefix` | string | no | `"_"` | Prefix applied by the `prefix` strategy |
| `suffix` | string | no | `"_col"` | Suffix applied by the `suffix` strategy |
| `rename_map` | dict | cond. | null | Name mapping for the `rename` strategy |
| `fallback` | string | no | `"quote"` | Fallback strategy for reserved words not covered by `rename_map` |
| `preset` | string \| list | no | null | Named reserved-word list preset(s) |
| `extend` | list[string] | no | [] | Extra words to treat as reserved |
| `exclude` | list[string] | no | [] | Words to remove from the reserved-word check |

Strategies: `quote`, `prefix`, `suffix`, `error`, `rename`, `revert`, `drop`. The `rename` strategy requires a non-empty `rename_map`. The `drop` strategy is not valid for table names, either as a direct strategy or as a fallback. See Configuration Keys for full strategy descriptions and YAML examples.
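
As an illustrative sketch of the `rename` strategy (the word list and map entries are hypothetical):

```yaml
naming:
  reserved_words:
    preset: ansi
    extend: [measure]    # also treat this word as reserved
    strategy: rename
    rename_map:
      order: order_nbr
      date: date_key
    fallback: quote      # reserved words not in the map fall back to quoting
```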


## Complete example

```yaml
config_version: "1.0"
name: sales_pipeline

lookups:
  dim_product:
    source:
      type: delta
      alias: silver.dim_product
    materialize: true
    strategy: cache
    key: [product_id]
    values: [product_name, category]
    unique_key: true

column_sets:
  sap_dictionary:
    source:
      type: delta
      alias: ref.column_dictionary
      from_column: raw_name
      to_column: friendly_name

variables:
  baseline_count:
    type: int
    default: 0

pre_steps:
  - type: quality_gate
    name: check_orders_freshness
    check: source_freshness
    source: raw.orders
    max_age: "24h"
  - type: sql_statement
    name: capture_baseline
    sql: "SELECT count(*) FROM curated.order_summary"
    set_var: baseline_count

threads:
  - ref: staging/load_orders.thread
  - ref: staging/load_customers.thread
  - ref: curated/build_order_summary.thread
    dependencies: [load_orders, load_customers]
  - ref: curated/refresh_snapshot.thread
    dependencies: [build_order_summary]
    condition:
      when: "table_exists('curated.order_summary')"

post_steps:
  - type: quality_gate
    name: verify_row_count
    check: row_count
    target: curated.order_summary
    min_count: 1
  - type: log_message
    message: "Pipeline complete. Baseline was ${var.baseline_count} rows."
    level: info

defaults:
  write:
    mode: merge
  exports:
    - name: audit_archive
      type: parquet
      path: /archive/${thread.name}/${run.timestamp}/

params:
  run_date:
    name: run_date
    type: date
    required: true
    description: "Pipeline execution date"

execution:
  log_level: verbose
  trace: true

naming:
  columns: snake_case
  on_collision: suffix
  reserved_words:
    preset: [ansi, dax]
    strategy: prefix
    prefix: "_"
```

## Shorthand thread syntax

Thread entries can be plain strings for inline definitions that need no overrides:

```yaml
threads:
  - build_snapshot
  - refresh_index
```

This is equivalent to:

```yaml
threads:
  - name: build_snapshot
  - name: refresh_index
```

For external file references, use `ref` explicitly:

```yaml
threads:
  - ref: staging/load_orders.thread
  - ref: staging/load_customers.thread
```