FAQ¶
Frequently asked questions about weevr.
What is weevr?¶
weevr is a configuration-driven execution framework for PySpark. You declare data shaping intent in YAML -- sources, transforms, targets -- and weevr interprets that configuration at runtime to execute optimized, repeatable data transformations using Spark DataFrame APIs. It is designed for Microsoft Fabric but works in any Spark + Delta Lake environment.
Does weevr generate PySpark code?¶
No. weevr does not generate code from YAML. Configuration is read at runtime and drives execution directly through Spark DataFrame operations. There are no intermediate scripts, notebooks, or code artifacts produced. This makes execution deterministic and inspectable -- what you see in the config is what runs.
What Spark and Delta versions are supported?¶
weevr targets PySpark 3.5.x and Delta Lake 3.2.x, which align with Microsoft Fabric Runtime 1.3 and Python 3.11. See the Compatibility page for the full version matrix.
Can I use weevr outside Microsoft Fabric?¶
Yes. weevr runs on any environment that provides PySpark and Delta Lake. For local development, use a standalone Spark installation with the Delta extension:
```python
# Requires the delta-spark package on the classpath
# (e.g. pip install delta-spark, then use configure_spark_with_delta_pip
# or set spark.jars.packages accordingly).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)
```
The same YAML configs work in both local Spark and Fabric. Use parameter files to swap environment-specific paths. See the Fabric Runtime guide for details.
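One plausible shape for such parameter files (the structure and keys below are assumptions for illustration, not the confirmed format; the Fabric Runtime guide documents the real one):

```yaml
# params.local.yaml -- hypothetical structure
paths:
  bronze: /data/local/bronze
  silver: /data/local/silver

# params.fabric.yaml -- hypothetical structure; <workspace> and <lakehouse>
# are placeholders for your own OneLake identifiers
paths:
  bronze: abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>/bronze
  silver: abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>/silver
```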
How does configuration inheritance work?¶
Configuration cascades through three levels: Loom > Weave > Thread. The rule is simple -- most specific wins.
A loom can define default write modes, execution settings, and audit
columns. A weave can override any of those. A thread can override
anything the weave set. Scalar values are replaced outright; collections
(lists, maps) are also replaced entirely, not merged. The exceptions are
audit_columns and exports, which use additive merge -- each
level extends the set, and same-named entries at a lower level override
those from the higher level.
```
Loom default: write.mode = overwrite
└─ Weave: write.mode = merge          ← overrides loom
   ├─ Thread A: (inherits merge)
   └─ Thread B: write.mode = append   ← overrides weave
```
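The cascade rules can be sketched in plain Python. This is an illustrative model of the behavior described above (replace outright for most keys, additive merge for `audit_columns` and `exports`), not weevr's actual resolver:

```python
# Illustrative model of the cascade -- not weevr's implementation.
ADDITIVE_KEYS = {"audit_columns", "exports"}

def resolve(loom: dict, weave: dict, thread: dict) -> dict:
    resolved = dict(loom)
    for level in (weave, thread):
        for key, value in level.items():
            if key in ADDITIVE_KEYS:
                merged = dict(resolved.get(key, {}))
                merged.update(value)   # same-named entries: more specific level wins
                resolved[key] = merged
            else:
                resolved[key] = value  # scalars and collections replaced outright
    return resolved
```

For example, a loom-level `write_mode: overwrite` overridden to `merge` at the weave and `append` at a thread resolves to `append`, while `audit_columns` accumulates entries from every level.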
See Thread, Weave, Loom for the full explanation.
What write modes are supported?¶
weevr supports three write modes:
| Mode | Behavior |
|---|---|
| `overwrite` | Replace the entire target table (default) |
| `append` | Insert new rows without modifying existing data |
| `merge` | Upsert: match on keys, update/insert/delete as configured |
Merge mode is the most configurable, with separate controls for
on_match, on_no_match_target, and on_no_match_source behaviors
including soft delete support. See
Execution Modes for details.
How do incremental loads work?¶
weevr supports watermark-based incremental loading. You configure a watermark column and type on the thread's load block:
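A sketch of what such a load block might look like (the field names here are assumptions for illustration, not confirmed schema):

```yaml
# Hypothetical thread config -- field names are illustrative
load:
  watermark:
    column: updated_at   # compared against the stored high-water mark
    type: timestamp      # one of the supported watermark types
```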
On the first run, all rows are read. On subsequent runs, weevr reads only rows where the watermark column exceeds the stored high-water mark. The watermark state is persisted either as a Delta table property or in a dedicated metadata table.
Supported watermark types: timestamp, date, int, long.
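Conceptually, the read-side logic works like the plain-Python sketch below (weevr does this with Spark filters; the row shape here is illustrative only):

```python
def incremental_rows(rows, watermark_column, high_water_mark):
    """First run (no stored mark): read everything.
    Later runs: only rows whose watermark value exceeds the stored mark."""
    if high_water_mark is None:
        return list(rows)
    return [row for row in rows if row[watermark_column] > high_water_mark]
```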
Can I mix Python code with YAML configuration?¶
Yes. weevr supports two extension points for custom logic:
- Helper functions -- Registered Python functions that can be called from Spark SQL expressions in your YAML config (e.g., in `derive`, `filter`, or `condition` expressions).
- UDFs -- User-defined functions registered with Spark that can be used in any expression context.
Both are registered through the project-level configuration and are available across all threads in a loom execution. Custom helpers and UDFs live in your integration project alongside the YAML configs.
How do I debug a failing thread?¶
Start by checking the RunResult:
```python
result = ctx.run("nightly.loom")
print(result.status)  # "success", "failure", or "partial"
print(result.summary())
```
If that does not pinpoint the issue, increase the log level:
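Assuming the level can be passed at run time, a sketch might look like this (the parameter name and accepted values are assumptions, not confirmed API):

```python
# Hypothetical -- parameter name is an assumption
result = ctx.run("nightly.loom", log_level="verbose")
```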
At verbose level, weevr logs every step of execution -- source reads,
transform steps, validation outcomes, and write operations. You can also
inspect the execution span on the thread's telemetry object, which
contains timing, status, and error attributes.
For transform logic issues, use preview mode to run against sampled data without writing:
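Following the mode pattern used for validate and plan elsewhere in this FAQ (treat the exact mode name as an assumption):

```python
# Hypothetical sketch -- assumes preview is invoked like the other modes
result = ctx.run("nightly.loom", mode="preview")
print(result.summary())
```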
See the Observability guide for the full debugging workflow.
Is weevr thread-safe?¶
Yes, within the execution model. Independent threads within a weave execute
concurrently using a ThreadPoolExecutor on a shared SparkSession. Each
thread operates on its own SpanBuilder and SpanCollector, and
collectors are merged only after thread completion. Spark itself handles
concurrent DataFrame operations on a shared session.
Weaves within a loom execute sequentially in the declared order, so there is no cross-weave concurrency to manage.
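The concurrency model described above can be sketched generically with the standard library (callables stand in for threads and lists for span collectors; this is not weevr's internals):

```python
from concurrent.futures import ThreadPoolExecutor

def run_weave(thread_fns):
    """Run independent thread functions concurrently; merge their
    per-thread results only after all of them complete."""
    with ThreadPoolExecutor() as pool:
        per_thread = list(pool.map(lambda fn: fn(), thread_fns))
    merged = []
    for spans in per_thread:  # deterministic merge after completion
        merged.extend(spans)
    return merged
```

Because `ThreadPoolExecutor.map` returns results in submission order, the merge is deterministic even though execution is concurrent.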
What is the difference between a thread and a weave?¶
A thread is the smallest executable unit. It defines a single data flow: read from one or more sources, apply transforms, write to one target. A thread is a self-contained transformation with no awareness of other threads.
A weave is a collection of threads organized into a dependency DAG. The weave is responsible for execution order (which threads depend on which), parallel execution of independent threads, and conditional execution logic. A weave represents a subject area or processing stage (e.g., "dimensions", "facts").
See Thread, Weave, Loom for the full hierarchy.
How do I handle null keys in joins?¶
By default, weevr uses null-safe join semantics to prevent the common
Spark pitfall where NULL = NULL evaluates to NULL (not true),
silently dropping rows from join results.
Null-safe behavior is configured per join step and is enabled by default:
```yaml
steps:
  - join:
      source: dim_region
      type: left
      on: [region_id]
      null_safe: true  # default -- rows with null keys are preserved
```
Set null_safe: false on a specific join step to use standard Spark join
semantics where null keys do not match.
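The two semantics can be illustrated in plain Python, mirroring Spark's null-safe equality operator (`<=>` in SQL, `eqNullSafe` on columns):

```python
def null_safe_eq(a, b):
    """Spark's <=> semantics: NULL <=> NULL is true, NULL <=> x is false."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

def standard_eq(a, b):
    """Standard SQL '=': any comparison involving NULL yields NULL,
    which a join predicate treats as no match."""
    if a is None or b is None:
        return None
    return a == b
```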
Can I cache lookup tables?¶
Yes. There are two approaches:
Thread-level auto-caching — The CacheManager analyzes the thread
dependency DAG within a weave. When it detects that a thread's output
feeds two or more downstream consumers, it persists the output at
MEMORY_AND_DISK level automatically. You can disable auto-caching on
a specific thread with cache: false.
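The fan-out rule can be sketched as follows (a simplified model of the described behavior, not the CacheManager implementation):

```python
def select_cache_targets(consumers_by_thread):
    """Cache any thread whose output feeds two or more downstream consumers."""
    return {
        thread
        for thread, consumers in consumers_by_thread.items()
        if len(consumers) >= 2
    }
```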
Weave-level lookups — Define named lookups in the weave's lookups
block. Lookups can be pre-materialized (read once, cached or broadcast
before threads run) and shared across threads via the lookup source
field:
```yaml
# In the weave
lookups:
  dim_product:
    source:
      type: delta
      alias: silver.dim_product
    materialize: true
    strategy: cache
    key: [product_id]
    values: [product_name, category]

# In a thread's sources
sources:
  products:
    lookup: dim_product
```
Narrow lookups (key, values, filter) reduce memory by retaining
only the columns needed for the join. See
Cache a Lookup for the full guide.
What are execution hooks?¶
Hooks are steps that run before or after thread execution within a weave.
Define them with pre_steps (run before any thread) and post_steps
(run after all threads complete).
Three hook step types are available:
| Type | Purpose |
|---|---|
| `quality_gate` | Run predefined quality checks (source freshness, row counts, table existence, expressions) |
| `sql_statement` | Execute arbitrary Spark SQL. Optionally capture a scalar result into a weave variable. |
| `log_message` | Emit a log message with variable placeholders |
Each step has an on_failure policy (abort or warn). Pre-steps
default to abort; post-steps default to warn.
```yaml
pre_steps:
  - type: quality_gate
    check: source_freshness
    source: raw.orders
    max_age: "24h"

post_steps:
  - type: log_message
    message: "Pipeline complete."
```
See the Weave YAML Schema for the full field reference.
How do I validate my config without running a pipeline?¶
Use validate mode:
```python
result = ctx.run("nightly.loom", mode="validate")
print(result.status)
print(result.validation_errors)
```
Validate mode parses the config, resolves variables and references, checks the DAG for cycles, and verifies that source paths exist -- all without reading or writing any data. Use it in CI or as a pre-flight check before production runs.
You can also use plan mode to see the execution order without running:
```python
result = ctx.run("nightly.loom", mode="plan")
print(result.summary())  # compact view with cache markers (e.g., dim_product*)
print(result.explain())  # detailed breakdown: dependencies, cache targets, thread detail
```
Plan mode builds the full execution plan — thread ordering, dependency
analysis, and cache target selection — without reading or writing any data.
The summary() output marks cache targets with an asterisk and includes
footer counts (threads, cached, lookups). explain() provides a
section-by-section breakdown including dependency provenance (inferred vs
explicit), cache consumers, lookup schedule, and per-thread source/target
detail. For loom-scoped results, explain() opens with a loom summary
line showing the loom name and weave count. In notebooks, all result modes render automatically as styled
HTML. Plan mode includes an embedded DAG diagram; execute mode shows an
executive summary with per-thread detail, flow diagrams, and data waterfalls.
See Execution Modes for the full set of available modes.