
Release Notes — v1.0

Release date: February 2026

This is the first stable release of weevr, a configuration-driven execution framework for Spark in Microsoft Fabric. All features described below are new.

weevr lets you declare data shaping intent in YAML configuration files. A PySpark engine interprets that intent at runtime and executes optimized, repeatable data transformations — no code generation, no manual notebook orchestration.


Core Execution Model

  • Thread execution — The smallest unit of work. Each thread reads one or more sources, applies an ordered sequence of transforms, and writes the result to a Delta target.
  • Weave execution — Groups related threads into a dependency-aware DAG. Independent threads execute concurrently via ThreadPoolExecutor with topological ordering to respect data dependencies.
  • Loom execution — The deployable unit. Orchestrates one or more weaves in configurable sequential order, providing the top-level entry point for pipeline runs.
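The weave model above can be sketched with Python's standard library. This is an illustrative model of dependency-aware concurrent execution, not weevr's engine code; the thread names and the `run_thread` function are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Hypothetical weave: silver_orders depends on both bronze threads.
dependencies = {
    "bronze_orders": set(),
    "bronze_customers": set(),
    "silver_orders": {"bronze_orders", "bronze_customers"},
}

completed = []

def run_thread(name):
    # Placeholder for: read sources, apply transforms, write Delta target.
    completed.append(name)

sorter = TopologicalSorter(dependencies)
sorter.prepare()
with ThreadPoolExecutor(max_workers=4) as pool:
    while sorter.is_active():
        ready = sorter.get_ready()  # threads whose dependencies are satisfied
        futures = [(pool.submit(run_thread, n), n) for n in ready]
        for fut, name in futures:
            fut.result()            # wait for completion, then mark done
            sorter.done(name)
```

Independent threads in the same `ready` batch are submitted to the pool together, while the topological sorter guarantees that a dependent thread never starts before its upstream threads finish.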

Configuration System

  • YAML-driven pipelines — Threads, weaves, and looms are defined entirely in YAML. No Python is required to describe data transformations.
  • Config inheritance — Settings cascade from loom to weave to thread, with the most specific level winning. Define patterns once and let them propagate.
  • Variable injection — Use ${variable_name} syntax to parameterize configs. Variables resolve from runtime parameters passed to Context.
  • Reference resolution — Use ref to point to external config files by path, keeping configurations DRY and maintainable.
  • Schema validation — Pydantic-based validation catches config errors before any data is read, with clear error messages pointing to the problem.
  • Config versioning — The config_version field supports forward compatibility as the schema evolves.
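The cascade and `${variable_name}` resolution rules above can be modeled in a few lines of stdlib Python. The setting names and values here are illustrative assumptions, not weevr's actual config schema.

```python
from string import Template

# Hypothetical settings at each level; the most specific level wins.
loom_cfg   = {"target_schema": "bronze", "write_mode": "append"}
weave_cfg  = {"write_mode": "merge"}
thread_cfg = {"source_path": "Files/orders/${run_date}/*.parquet"}

# Cascade: later (more specific) levels override earlier ones.
effective = {**loom_cfg, **weave_cfg, **thread_cfg}

# ${variable_name} injection from runtime parameters.
params = {"run_date": "2026-02-25"}
resolved = {
    k: Template(v).safe_substitute(params) if isinstance(v, str) else v
    for k, v in effective.items()
}
```

After resolution, `write_mode` is `"merge"` (the weave overrides the loom) and `source_path` has the runtime date injected.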

Write Modes

Four write modes cover the most common data landing patterns:

  • overwrite — Full table replacement. The target is rewritten on every run.
  • append — Adds rows to an existing table without modifying existing data.
  • merge — Upsert semantics using match keys. Configurable update and insert behavior with support for soft deletes.
  • insert_only — Inserts only new rows that do not match existing keys. No updates are applied on match.
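The difference between merge and insert_only is easiest to see on a toy in-memory table keyed by a single match key. This is a conceptual sketch of the semantics, not how the Delta writer is implemented.

```python
# Toy model of a Delta target keyed by a match key ("id").
target = {1: {"id": 1, "status": "open"}, 2: {"id": 2, "status": "open"}}
incoming = [{"id": 2, "status": "closed"}, {"id": 3, "status": "open"}]

def merge(table, rows):
    # Upsert: update on key match, insert otherwise.
    for row in rows:
        table[row["id"]] = row

def insert_only(table, rows):
    # Insert new keys only; matched rows are left untouched.
    for row in rows:
        table.setdefault(row["id"], row)

merged = dict(target)
merge(merged, incoming)        # id 2 updated, id 3 inserted

kept = dict(target)
insert_only(kept, incoming)    # id 2 unchanged, id 3 inserted
```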

Data Quality

  • Validation rules — Row-level and aggregate validation with configurable severity actions:
    • info / warn — Log and continue
    • error — Quarantine failing rows to a {target}_quarantine table
    • fatal — Abort execution immediately
  • Post-write assertions — Verify row counts, null checks, uniqueness, and custom expressions after writes complete.
  • Null-safe key handling — Automatic null detection in join and merge keys prevents silent data loss from Spark's default null join behavior.
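The severity actions above amount to a routing decision per failing row. The following is a minimal sketch of that routing; the rule name and row shape are illustrative.

```python
# Toy routing of validation outcomes by severity.
rows = [{"order_id": 1, "amount": 10.0},
        {"order_id": 2, "amount": -5.0}]

def amount_non_negative(row):
    return row["amount"] >= 0

severity = "error"  # info/warn: log and continue; error: quarantine; fatal: abort

passed, quarantine = [], []
for row in rows:
    if amount_non_negative(row):
        passed.append(row)
    elif severity in ("info", "warn"):
        passed.append(row)        # failure is logged, but the row flows through
    elif severity == "error":
        quarantine.append(row)    # would land in the {target}_quarantine table
    else:                         # fatal
        raise RuntimeError("validation failed: amount_non_negative")
```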

Incremental Processing

  • Watermark-based loads — Incremental reads using a high-water mark column. Supports timestamp, date, int, and long column types.
  • State persistence — Two built-in stores for watermark state:
    • Table properties — Stores watermarks directly in Delta table properties
    • Metadata table — Centralized metadata table for cross-pipeline state
  • Automatic state management — The engine handles watermark reads before source filtering and watermark writes after successful target commits.
  • CDC support — Change Data Capture merge routing with configurable operation columns, hard/soft delete handling, and a Delta Change Data Feed preset.
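The watermark lifecycle (read state, filter the source, commit new state only after a successful write) can be sketched as follows. The column name, state shape, and data are illustrative assumptions.

```python
# Toy watermark flow: filter on a high-water mark, advance it after the write.
state = {"watermark": "2026-02-20T00:00:00"}  # e.g. stored in Delta table properties

source = [
    {"id": 1, "modified_at": "2026-02-19T08:00:00"},
    {"id": 2, "modified_at": "2026-02-21T09:30:00"},
    {"id": 3, "modified_at": "2026-02-22T11:00:00"},
]

# 1. Read the watermark and filter the source incrementally
#    (ISO-8601 strings compare correctly as text).
batch = [r for r in source if r["modified_at"] > state["watermark"]]

# 2. Write the batch to the target (elided), then commit the new watermark
#    only after the write succeeds, so failures never skip data.
if batch:
    state["watermark"] = max(r["modified_at"] for r in batch)
```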

Telemetry and Observability

  • Structured JSON logging — All log output uses OTel-compatible field names for integration with observability platforms.
  • Execution spans — Hierarchical trace/span model follows the loom → weave → thread execution tree. Each span captures timing, row counts, and status.
  • Configurable log levels — Four verbosity tiers: MINIMAL, STANDARD, VERBOSE, and DEBUG.
  • Progress tracking — Span events mark each execution phase (read, transform, validate, write) for fine-grained observability.
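A structured log line in this style is a single JSON object with OTel-flavored field names. The exact field and attribute names below are an assumption for illustration, not weevr's actual schema.

```python
import json

# Illustrative structured log record with OTel-style field names.
record = {
    "timestamp": "2026-02-25T06:00:01Z",
    "severity_text": "INFO",
    "body": "thread completed",
    "trace_id": "0af7651916cd43dd8448eb211c80319c",
    "span_id": "b7ad6b7169203331",
    "attributes": {
        "weevr.loom": "daily.loom",       # hypothetical attribute keys
        "weevr.thread": "silver_orders",
        "rows_written": 1842,
    },
}

line = json.dumps(record)   # one JSON object per log line
parsed = json.loads(line)
```

Because every record is machine-parseable JSON with stable field names, observability platforms can index, filter, and correlate runs by trace and span IDs.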

Python API

  • Context class — Single entry point for all execution. Accepts a SparkSession, optional parameters, and a config path:

    from weevr import Context
    
    ctx = Context(spark, "my-project.weevr", params={"run_date": "2026-02-25"})
    result = ctx.run("daily.loom")
    
  • RunResult — Structured result object with execution status, timing, row counts, and telemetry spans.
  • LoadedConfig — Intermediate representation returned by ctx.load() for inspecting resolved configuration before execution.
  • ExecutionMode — Enum supporting execute, validate, plan, and preview modes for development and production workflows.

Operations

Readers

Read from the most common Fabric and Spark source formats:

  • Delta — Native Delta Lake table reads
  • Parquet — Columnar file reads
  • CSV — Delimited text with configurable options (header, schema inference)
  • JSON — Structured and semi-structured JSON files

Transforms

Nineteen transform types cover standard data shaping needs:

  • select — Choose and reorder columns
  • filter — Row filtering with Spark SQL expressions
  • rename — Column renaming
  • cast — Type casting
  • derive — Computed columns from expressions
  • deduplicate — Row deduplication with configurable ordering
  • aggregate — Group-by aggregations
  • sort — Row ordering
  • drop — Remove columns
  • join — Multi-source joins with null-safe key handling
  • union — Combine datasets
  • window — Window function support
  • pivot — Reshape data with pivot operations
  • Additional utility transforms for surrogate key generation and change detection hashing
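An ordered transform sequence like the one a thread declares can be modeled over plain Python rows. The implementations below are conceptual stand-ins for the Spark versions, and the data is invented for illustration.

```python
# Minimal model of an ordered transform chain over list-of-dict rows.
rows = [
    {"id": 2, "qty": 3, "price": 4.0, "region": "eu"},
    {"id": 1, "qty": 1, "price": 9.5, "region": "us"},
    {"id": 1, "qty": 1, "price": 9.5, "region": "us"},  # duplicate row
]

def filter_rows(rows, pred):              # filter
    return [r for r in rows if pred(r)]

def derive(rows, name, fn):               # derive (computed column)
    return [{**r, name: fn(r)} for r in rows]

def deduplicate(rows, keys):              # deduplicate
    seen, out = set(), []
    for r in rows:
        k = tuple(r[key] for key in keys)
        if k not in seen:
            seen.add(k)
            out.append(r)
    return out

def select(rows, cols):                   # select
    return [{c: r[c] for c in cols} for r in rows]

# Transforms apply in the order declared, like a thread's transform list.
out = rows
out = deduplicate(out, ["id"])
out = derive(out, "total", lambda r: r["qty"] * r["price"])
out = filter_rows(out, lambda r: r["total"] > 5)
out = select(out, ["id", "total"])
```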

Writers

  • Delta writer — Writes to Delta Lake tables with support for all four write modes (overwrite, append, merge, insert_only), partition management, and schema evolution options.

Engine

  • ExecutionPlanner — Builds optimized execution plans from parsed configuration. Resolves the full dependency graph before any data is read.
  • CacheManager — Configurable DataFrame caching for lookup tables shared across multiple threads within a weave. Automatic cache cleanup after execution completes.
  • DAG resolution — Topological sort with cycle detection. Dependencies are inferred from source/target path analysis, so threads do not need to declare explicit dependencies.
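Inferring dependencies from source/target overlap can be sketched as: thread B depends on thread A whenever B reads a path that A writes. The thread names and paths below are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical thread configs with their read paths and write targets.
threads = {
    "bronze_orders": {"sources": ["Files/raw/orders"],
                      "target": "Tables/bronze_orders"},
    "silver_orders": {"sources": ["Tables/bronze_orders"],
                      "target": "Tables/silver_orders"},
}

# Map each target path back to the thread that produces it.
targets = {cfg["target"]: name for name, cfg in threads.items()}

# A thread depends on whichever threads produce its sources.
deps = {
    name: {targets[s] for s in cfg["sources"] if s in targets}
    for name, cfg in threads.items()
}

# Topological sort; graphlib raises CycleError if the graph has a cycle.
order = list(TopologicalSorter(deps).static_order())
```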

Compatibility

Component          Version
Python             3.11
PySpark            3.5.x
Delta Lake         3.2.x
Microsoft Fabric   Runtime 1.3

Installation

pip install weevr

Or in a Fabric notebook:

%pip install weevr

Getting Started

The fastest way to get up and running is the Your First Loom tutorial, which walks through building a complete pipeline from scratch — defining a thread, composing it into a weave and loom, and running it through the Python API.

For task-oriented recipes, see the How-to Guides.

What's Next

The v1.0 release delivers the core execution engine. Upcoming development will focus on:

  • Naming normalization — Automatic column name standardization
  • Advanced merge patterns — More flexible merge strategies for complex slowly changing dimension scenarios
  • Extensibility — Stitch patterns for reusable transform sequences, helper function registries, and project-level UDF registration
  • Operational resilience — Retry policies, circuit breakers, and mirror targets for fault-tolerant pipelines
  • Developer tooling — CLI-based config validation, dry-run modes, and a test framework for integration projects

Follow the GitHub repository for updates and release announcements.