
Release Notes — v1.0

Release date: February 2026

This is the first stable release of weevr, a configuration-driven execution framework for Spark in Microsoft Fabric. All features described below are new.

weevr lets you declare data shaping intent in YAML configuration files. A PySpark engine interprets that intent at runtime and executes optimized, repeatable data transformations — no code generation, no manual notebook orchestration.


Core Execution Model

  • Thread execution — The smallest unit of work. Each thread reads one or more sources, applies an ordered sequence of transforms, and writes the result to a Delta target.
  • Weave execution — Groups related threads into a dependency-aware DAG. Independent threads execute concurrently via ThreadPoolExecutor with topological ordering to respect data dependencies.
  • Loom execution — The deployable unit. Orchestrates one or more weaves in configurable sequential order, providing the top-level entry point for pipeline runs.
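The weave model above can be sketched with Python's standard library. This is an illustrative model of dependency-aware concurrent execution, not weevr's engine code; the thread names and the `run_thread` function are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Hypothetical weave: silver_orders depends on both bronze threads.
dependencies = {
    "bronze_orders": set(),
    "bronze_customers": set(),
    "silver_orders": {"bronze_orders", "bronze_customers"},
}

completed = []

def run_thread(name):
    # Placeholder for: read sources, apply transforms, write Delta target.
    completed.append(name)

sorter = TopologicalSorter(dependencies)
sorter.prepare()
with ThreadPoolExecutor(max_workers=4) as pool:
    while sorter.is_active():
        ready = sorter.get_ready()  # threads whose dependencies are satisfied
        futures = [(pool.submit(run_thread, n), n) for n in ready]
        for fut, name in futures:
            fut.result()            # wait for completion, then mark done
            sorter.done(name)
```

Independent threads in the same `ready` batch are submitted to the pool together, while the topological sorter guarantees that a dependent thread never starts before its upstream threads finish.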

Configuration System

  • YAML-driven pipelines — Threads, weaves, and looms are defined entirely in YAML. No Python is required to describe data transformations.
  • Config inheritance — Settings cascade from loom to weave to thread, with the most specific level winning. Define patterns once and let them propagate.
  • Variable injection — Use ${variable_name} syntax to parameterize configs. Variables resolve from runtime parameters passed to Context.
  • Reference resolution — Use ref to point to external config files by path, keeping configurations DRY and maintainable.
  • Schema validation — Pydantic-based validation catches config errors before any data is read, with clear error messages pointing to the problem.
  • Config versioning — The config_version field supports forward compatibility as the schema evolves.
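The cascade and `${variable_name}` resolution rules above can be modeled in a few lines of stdlib Python. The setting names and values here are illustrative assumptions, not weevr's actual config schema.

```python
from string import Template

# Hypothetical settings at each level; the most specific level wins.
loom_cfg   = {"target_schema": "bronze", "write_mode": "append"}
weave_cfg  = {"write_mode": "merge"}
thread_cfg = {"source_path": "Files/orders/${run_date}/*.parquet"}

# Cascade: later (more specific) levels override earlier ones.
effective = {**loom_cfg, **weave_cfg, **thread_cfg}

# ${variable_name} injection from runtime parameters.
params = {"run_date": "2026-02-25"}
resolved = {
    k: Template(v).safe_substitute(params) if isinstance(v, str) else v
    for k, v in effective.items()
}
```

After resolution, `write_mode` is `"merge"` (the weave overrides the loom) and `source_path` has the runtime date injected.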

Write Modes

Four write modes cover the most common data landing patterns:

  • overwrite — Full table replacement. The target is rewritten on every run.
  • append — Adds rows to an existing table without modifying existing data.
  • merge — Upsert semantics using match keys. Configurable update and insert behavior with support for soft deletes.
  • insert_only — Inserts only new rows that do not match existing keys. No updates are applied on match.
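The difference between merge and insert_only is easiest to see on a toy in-memory table keyed by a single match key. This is a conceptual sketch of the semantics, not how the Delta writer is implemented.

```python
# Toy model of a Delta target keyed by a match key ("id").
target = {1: {"id": 1, "status": "open"}, 2: {"id": 2, "status": "open"}}
incoming = [{"id": 2, "status": "closed"}, {"id": 3, "status": "open"}]

def merge(table, rows):
    # Upsert: update on key match, insert otherwise.
    for row in rows:
        table[row["id"]] = row

def insert_only(table, rows):
    # Insert new keys only; matched rows are left untouched.
    for row in rows:
        table.setdefault(row["id"], row)

merged = dict(target)
merge(merged, incoming)        # id 2 updated, id 3 inserted

kept = dict(target)
insert_only(kept, incoming)    # id 2 unchanged, id 3 inserted
```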

Data Quality

  • Validation rules — Row-level and aggregate validation with configurable severity actions:
    • info / warn — Log and continue
    • error — Quarantine failing rows to a {target}_quarantine table
    • fatal — Abort execution immediately
  • Post-write assertions — Verify row counts, null checks, uniqueness, and custom expressions after writes complete.
  • Null-safe key handling — Automatic null detection in join and merge keys prevents silent data loss from Spark's default null join behavior.
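The severity actions above amount to a routing decision per failing row. The following is a minimal sketch of that routing; the rule name and row shape are illustrative.

```python
# Toy routing of validation outcomes by severity.
rows = [{"order_id": 1, "amount": 10.0},
        {"order_id": 2, "amount": -5.0}]

def amount_non_negative(row):
    return row["amount"] >= 0

severity = "error"  # info/warn: log and continue; error: quarantine; fatal: abort

passed, quarantine = [], []
for row in rows:
    if amount_non_negative(row):
        passed.append(row)
    elif severity in ("info", "warn"):
        passed.append(row)        # failure is logged, but the row flows through
    elif severity == "error":
        quarantine.append(row)    # would land in the {target}_quarantine table
    else:                         # fatal
        raise RuntimeError("validation failed: amount_non_negative")
```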

Incremental Processing

  • Watermark-based loads — Incremental reads using a high-water mark column. Supports timestamp, date, int, and long column types.
  • State persistence — Two built-in stores for watermark state:
    • Table properties — Stores watermarks directly in Delta table properties
    • Metadata table — Centralized metadata table for cross-pipeline state
  • Automatic state management — The engine handles watermark reads before source filtering and watermark writes after successful target commits.
  • CDC support — Change Data Capture merge routing with configurable operation columns, hard/soft delete handling, and a Delta Change Data Feed preset.
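The watermark lifecycle (read state, filter the source, commit new state only after a successful write) can be sketched as follows. The column name, state shape, and data are illustrative assumptions.

```python
# Toy watermark flow: filter on a high-water mark, advance it after the write.
state = {"watermark": "2026-02-20T00:00:00"}  # e.g. stored in Delta table properties

source = [
    {"id": 1, "modified_at": "2026-02-19T08:00:00"},
    {"id": 2, "modified_at": "2026-02-21T09:30:00"},
    {"id": 3, "modified_at": "2026-02-22T11:00:00"},
]

# 1. Read the watermark and filter the source incrementally
#    (ISO-8601 strings compare correctly as text).
batch = [r for r in source if r["modified_at"] > state["watermark"]]

# 2. Write the batch to the target (elided), then commit the new watermark
#    only after the write succeeds, so failures never skip data.
if batch:
    state["watermark"] = max(r["modified_at"] for r in batch)
```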

Telemetry and Observability

  • Structured JSON logging — All log output uses OTel-compatible field names for integration with observability platforms.
  • Execution spans — Hierarchical trace/span model follows the loom → weave → thread execution tree. Each span captures timing, row counts, and status.
  • Configurable log levels — Four verbosity tiers: MINIMAL, STANDARD, VERBOSE, and DEBUG.
  • Progress tracking — Span events mark each execution phase (read, transform, validate, write) for fine-grained observability.
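A structured log line in this style is a single JSON object with OTel-flavored field names. The exact field and attribute names below are an assumption for illustration, not weevr's actual schema.

```python
import json

# Illustrative structured log record with OTel-style field names.
record = {
    "timestamp": "2026-02-25T06:00:01Z",
    "severity_text": "INFO",
    "body": "thread completed",
    "trace_id": "0af7651916cd43dd8448eb211c80319c",
    "span_id": "b7ad6b7169203331",
    "attributes": {
        "weevr.loom": "daily.loom",       # hypothetical attribute keys
        "weevr.thread": "silver_orders",
        "rows_written": 1842,
    },
}

line = json.dumps(record)   # one JSON object per log line
parsed = json.loads(line)
```

Because every record is machine-parseable JSON with stable field names, observability platforms can index, filter, and correlate runs by trace and span IDs.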

Python API

  • Context class — Single entry point for all execution. Accepts a SparkSession, optional parameters, and a config path:

    from weevr import Context
    
    ctx = Context(spark, "my-project.weevr", params={"run_date": "2026-02-25"})
    result = ctx.run("daily.loom")
    
  • RunResult — Structured result object with execution status, timing, row counts, and telemetry spans.
  • LoadedConfig — Intermediate representation returned by ctx.load() for inspecting resolved configuration before execution.
  • ExecutionMode — Enum supporting execute, validate, plan, and preview modes for development and production workflows.

Operations

Readers

Read from the most common Fabric and Spark source formats:

  • Delta — Native Delta Lake table reads
  • Parquet — Columnar file reads
  • CSV — Delimited text with configurable options (header, schema inference)
  • JSON — Structured and semi-structured JSON files

Transforms

Nineteen transform types cover standard data shaping needs:

  • select — Choose and reorder columns
  • filter — Row filtering with Spark SQL expressions
  • rename — Column renaming
  • cast — Type casting
  • derive — Computed columns from expressions
  • deduplicate — Row deduplication with configurable ordering
  • aggregate — Group-by aggregations
  • sort — Row ordering
  • drop — Remove columns
  • join — Multi-source joins with null-safe key handling
  • union — Combine datasets
  • window — Window function support
  • pivot — Reshape data with pivot operations
  • Additional utility transforms for surrogate key generation and change detection hashing
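An ordered transform sequence like the one a thread declares can be modeled over plain Python rows. The implementations below are conceptual stand-ins for the Spark versions, and the data is invented for illustration.

```python
# Minimal model of an ordered transform chain over list-of-dict rows.
rows = [
    {"id": 2, "qty": 3, "price": 4.0, "region": "eu"},
    {"id": 1, "qty": 1, "price": 9.5, "region": "us"},
    {"id": 1, "qty": 1, "price": 9.5, "region": "us"},  # duplicate row
]

def filter_rows(rows, pred):              # filter
    return [r for r in rows if pred(r)]

def derive(rows, name, fn):               # derive (computed column)
    return [{**r, name: fn(r)} for r in rows]

def deduplicate(rows, keys):              # deduplicate
    seen, out = set(), []
    for r in rows:
        k = tuple(r[key] for key in keys)
        if k not in seen:
            seen.add(k)
            out.append(r)
    return out

def select(rows, cols):                   # select
    return [{c: r[c] for c in cols} for r in rows]

# Transforms apply in the order declared, like a thread's transform list.
out = rows
out = deduplicate(out, ["id"])
out = derive(out, "total", lambda r: r["qty"] * r["price"])
out = filter_rows(out, lambda r: r["total"] > 5)
out = select(out, ["id", "total"])
```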

Writers

  • Delta writer — Writes to Delta Lake tables with support for all four write modes (overwrite, append, merge, insert_only), partition management, and schema evolution options.

Engine

  • ExecutionPlanner — Builds optimized execution plans from parsed configuration. Resolves the full dependency graph before any data is read.
  • CacheManager — Configurable DataFrame caching for lookup tables shared across multiple threads within a weave. Automatic cache cleanup after execution completes.
  • DAG resolution — Topological sort with cycle detection. Dependencies are inferred from source/target path analysis, so threads do not need to declare explicit dependencies.
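Inferring dependencies from source/target overlap can be sketched as: thread B depends on thread A whenever B reads a path that A writes. The thread names and paths below are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical thread configs with their read paths and write targets.
threads = {
    "bronze_orders": {"sources": ["Files/raw/orders"],
                      "target": "Tables/bronze_orders"},
    "silver_orders": {"sources": ["Tables/bronze_orders"],
                      "target": "Tables/silver_orders"},
}

# Map each target path back to the thread that produces it.
targets = {cfg["target"]: name for name, cfg in threads.items()}

# A thread depends on whichever threads produce its sources.
deps = {
    name: {targets[s] for s in cfg["sources"] if s in targets}
    for name, cfg in threads.items()
}

# Topological sort; graphlib raises CycleError if the graph has a cycle.
order = list(TopologicalSorter(deps).static_order())
```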

Compatibility

Component          Version
Python             3.11
PySpark            3.5.x
Delta Lake         3.2.x
Microsoft Fabric   Runtime 1.3

Installation

pip install weevr

Or in a Fabric notebook:

%pip install weevr

Getting Started

The fastest way to get up and running is the Your First Loom tutorial, which walks through building a complete pipeline from scratch — defining a thread, composing it into a weave and loom, and running it through the Python API.

For task-oriented recipes, see the How-to Guides.

What's Next

The v1.0 release delivers the core execution engine. Upcoming development will focus on:

  • Naming normalization — Automatic column name standardization
  • Advanced merge patterns — More flexible merge strategies for complex slowly changing dimension scenarios
  • Extensibility — Stitch patterns for reusable transform sequences, helper function registries, and project-level UDF registration
  • Operational resilience — Retry policies, circuit breakers, and mirror targets for fault-tolerant pipelines
  • Developer tooling — CLI-based config validation, dry-run modes, and a test framework for integration projects

Follow the GitHub repository for updates and release announcements.