Execution Modes¶
weevr has two distinct concepts that share the word "mode":
- Run modes — the `mode` argument to `ctx.run()`, which controls what the engine does (execute, validate, plan, preview).
- Data modes — the `load` and `write` blocks on a thread, which control how data is read and how data is written.
This page covers both.
Run modes¶
The mode argument to Context.run() controls how far the engine
progresses through the plan → execute → aggregate pipeline.
| Mode | What happens | Result contains |
|---|---|---|
| `execute` | Full execution: read, transform, write (default) | Status, row counts, telemetry |
| `validate` | Parse config, resolve references, check DAG — no data touched | `validation_errors` list |
| `plan` | Build the execution plan and return it without running | `execution_plan` list, `summary()` with cache markers, `explain()` for detailed breakdown |
| `preview` | Run transforms against sampled data, skip writes | `preview_data` DataFrames |
```python
# Validate config without touching data
result = ctx.run("nightly.loom", mode="validate")
print(result.validation_errors)

# Inspect the execution plan
result = ctx.run("nightly.loom", mode="plan")
print(result.summary())  # compact: execution groups with cache markers
print(result.explain())  # detailed: dependencies, cache targets, thread detail

# Preview transforms without writing
result = ctx.run("nightly.loom", mode="preview")
```
In notebooks, results render automatically as styled HTML when you evaluate the result in a cell. Each mode gets a tailored report:
- execute — executive summary, per-thread detail with flow diagrams and data waterfalls, execution timeline, and annotated DAG
- validate — check/error report with color-coded status
- plan — summary table with embedded DAG diagram
- preview — output shape table (columns × rows per thread)
For plan mode, you can also retrieve the DAG diagram directly:
```python
result = ctx.run("nightly.loom", mode="plan")
dag = result.dag()     # single-weave DAG or loom-level swimlane
dag.save("plan.svg")   # export to file
dag                    # renders inline in a notebook
```
For per-weave access in a loom, use result.execution_plan[0].dag().
Note that result.dag() includes full resolved-thread context (sources,
targets, step counts) while the per-plan accessor does not.
Data modes¶
weevr separates how data is read from how data is written. The load mode controls source reading. The write mode controls target writing.
Write modes¶
The write block on a thread controls how shaped data reaches the target
Delta table.
Overwrite¶
Replaces the entire target table with the new data.
Use overwrite for full refreshes, snapshots, and targets where you always want the complete current state. This is the default write mode.
Overwrite is naturally idempotent -- rerunning the same config with the same source data produces the same target state.
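As a minimal sketch, an overwrite thread needs no explicit write configuration at all — the mode line below is shown only for clarity, since overwrite is the default:

```yaml
write:
  mode: overwrite  # default; may be omitted entirely
```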
Append¶
Inserts new rows into the target without modifying existing rows.
Use append for event logs, audit trails, and any target that accumulates rows over time.
Append is not idempotent
Rerunning an append produces duplicate rows. Pair append with an incremental load mode (watermark or parameter) to prevent reprocessing the same source data. See Idempotency for details.
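A hedged sketch of the pairing recommended above — append writes bounded by an incremental watermark load. The `watermark_column`/`watermark_type` field names mirror the CDC examples later on this page and are an assumption for this mode; `event_time` is an illustrative column name:

```yaml
load:
  mode: incremental_watermark
  watermark_column: event_time  # hypothetical source column
  watermark_type: timestamp
write:
  mode: append
```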
Merge¶
Performs an upsert: matching rows are updated, unmatched source rows are inserted, and unmatched target rows are optionally deleted.
```yaml
write:
  mode: merge
  match_keys: [customer_id, source_system]
  on_match: update
  on_no_match_target: insert
  on_no_match_source: ignore
```
Merge requires match_keys -- the columns used to match source rows to
target rows. The behavior for each match outcome is independently
configurable:
| Parameter | Options | Default |
|---|---|---|
| `on_match` | `update`, `ignore` | `update` |
| `on_no_match_target` | `insert`, `ignore` | `insert` |
| `on_no_match_source` | `delete`, `soft_delete`, `ignore` | `ignore` |
For soft deletes, specify the marker column and value:
```yaml
write:
  mode: merge
  match_keys: [customer_id]
  on_no_match_source: soft_delete
  soft_delete_column: is_deleted
  soft_delete_value: true
  soft_delete_active_value: false  # explicit value on retained rows
```
soft_delete_active_value is optional — when set, retained rows have
the column populated with that value on every merge. When omitted,
active rows keep whatever value the source supplied (often null),
which is the right choice if downstream queries treat null as "not
deleted".
Merge is idempotent by match key -- rerunning with the same data produces the same target state.
SCD patterns
SCD Type 1 and SCD Type 2 are not separate write modes. They are delivered as reusable stitches that compose on top of the core merge mode. See the YAML Schema Reference for stitch usage.
Load modes¶
The load block on a thread controls how source data is bounded on each
execution.
Full¶
Reads all source data on every run.
This is the default. Pair with write.mode: overwrite for a complete
refresh, or with write.mode: merge for a full comparison merge.
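Both halves of the default combination can be spelled out explicitly — a sketch of a full-refresh thread (both values are defaults, so omitting them behaves identically):

```yaml
load:
  mode: full       # default: read all source rows every run
write:
  mode: overwrite  # default: replace the target table wholesale
```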
Incremental watermark¶
Reads only source rows that have changed since the last successful run. weevr persists a high-water mark and filters subsequent reads automatically.
On the first run, all rows are read (no prior watermark exists). On
subsequent runs, only rows where modified_date exceeds the stored
watermark are read.
The watermark is persisted in a configurable state store -- either as a Delta table property on the target or in a dedicated metadata table.
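A sketch of a watermark-bounded load block using the `modified_date` column from the description above. The `watermark_column`/`watermark_type` field names follow the CDC examples later on this page and are an assumption for this mode:

```yaml
load:
  mode: incremental_watermark
  watermark_column: modified_date
  watermark_type: timestamp
```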
Incremental parameter¶
Incremental boundaries are passed as runtime parameters. weevr does not manage state -- the caller is responsible for providing the correct range.
```yaml
load:
  mode: incremental_parameter
  params:
    start_date:
      type: date
      required: true
    end_date:
      type: date
      required: true
```
Use this mode when the orchestration layer (Fabric pipeline, Airflow) controls the processing window.
CDC¶
The thread understands change data capture patterns. Source rows carry operation flags (insert, update, delete) that weevr applies as merge operations on the target.
CDC mode supports two configuration styles:
- Preset -- Use `delta_cdf` to auto-configure for Delta Change Data Feed conventions.
- Explicit -- Declare the operation column and flag values directly.
```yaml
load:
  mode: cdc
  cdc:
    operation_column: change_type
    insert_value: "I"
    update_value: "U"
    delete_value: "D"
    # Optional: compose with a watermark column to narrow reads on
    # append-only history tables (see below).
    # watermark_column: updated_at
    # watermark_type: timestamp
```
The explicit path may be combined with watermark_column and
watermark_type to narrow the read window for append-only CDC history
tables — for example, SAP data landed by Fabric Open Database Mirror,
where every change row carries a change timestamp like AEDATTM. Without
the watermark, weevr re-reads the full history on every run. The
delta_cdf preset tracks progress via commit versions and rejects
watermark_column.
String-typed watermark columns (watermark_format)¶
When a source lands the watermark column as a string rather than a native
Delta TIMESTAMP or DATE — common for SAP/mainframe extracts and JSON
dumps — declare a Spark DateTimeFormatter pattern via watermark_format:
```yaml
load:
  mode: cdc
  cdc:
    operation_column: OPFLAG
    insert_value: "I"
    update_value: "U"
    delete_value: "D"
    watermark_column: AEDATTM
    watermark_type: timestamp
    watermark_format: "yyyy-MM-dd HH:mm:ss.SSSSSX"
```
With the field set, weevr wraps the column in to_timestamp(col, format)
(or to_date(col, format) for watermark_type: date) in both the filter
predicate and the high-water mark aggregate. The persisted HWM is stored
as a canonical ISO string and is independent of the source pattern, so
changing watermark_format on a later run does not require replaying
state. watermark_format is opt-in and only valid with watermark_type
of timestamp or date; pairing it with int or long is rejected at
config load time.
Rows whose values do not parse against the declared pattern are silently
excluded from both the predicate and the HWM — Spark's to_timestamp
and to_date return NULL on parse failure. The practical implication:
if every row fails to parse on the first run, the thread reads zero
rows, captures no HWM, and nothing is persisted. This is the
"zero rows through" diagnostic signal that the format string is wrong;
fix the pattern and re-run. A DEBUG log line is emitted from the reader
whenever watermark_format is active, capturing the column, format,
prior HWM, and new HWM — enable DEBUG on weevr.operations.readers to
see it.
This field assumes the default Fabric Spark 3.5+ time parser policy.
Under the default policy, to_timestamp and to_date with an explicit
format return NULL on parse failure rather than throwing — which is
what the silent-drop contract above relies on. Sessions that override
spark.sql.legacy.timeParserPolicy will observe whatever parse
semantics Spark applies under that policy.
Delta-typed timestamp and date columns should not set watermark_format;
the implicit-cast path is already correct and keeps Delta predicate
pushdown available. Wrapping a Delta-typed column in to_timestamp has
no functional benefit and would defeat file skipping.
Choosing the right combination¶
| Scenario | Load mode | Write mode |
|---|---|---|
| Full snapshot refresh | `full` | `overwrite` |
| Accumulating event log | `incremental_watermark` | `append` |
| Dimension table with updates | `full` or `incremental_watermark` | `merge` |
| CDC from upstream system | `cdc` | `merge` |
| CDC from append-only history table | `cdc` + `watermark_column` | `merge` |
| Externally bounded batch | `incremental_parameter` | `append` or `merge` |
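Putting one row of the table into config form -- a hedged end-to-end sketch of a CDC thread fed by an append-only history table. The `cdc` fields follow the examples earlier on this page; the match key and column names are illustrative:

```yaml
load:
  mode: cdc
  cdc:
    operation_column: OPFLAG    # illustrative flag column
    insert_value: "I"
    update_value: "U"
    delete_value: "D"
    watermark_column: AEDATTM   # narrows reads on the history table
    watermark_type: timestamp
write:
  mode: merge
  match_keys: [customer_id]     # illustrative match key
  on_no_match_source: delete    # apply upstream deletes to the target
```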
Next steps¶
- Idempotency -- Understand rerun safety per mode
- Artifacts Model -- How targets map to Delta tables
- YAML Schema: Thread -- Full write and load configuration reference