Basic Example

Overview

This example demonstrates how Metaxy automatically detects changes in upstream features and triggers recomputation of downstream features. It shows the core value proposition of Metaxy: avoiding unnecessary recomputation while ensuring data consistency.

We will build a simple two-feature pipeline where a child feature depends on a parent feature. When the parent's algorithm changes (represented by code_version), the child feature is automatically recomputed.

The Pipeline

Let's define a pipeline with two features:

---
title: Feature Graph
---
flowchart LR
    %% Snapshot version: none
    %%{init: {'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'themeVariables': {'fontSize': '14px'}}}%%
    examples_parent["<div style="text-align:left"><b>examples/parent</b><br/>7de0f5e8<br/><font color="#999">---</font><br/>- embeddings (05e66510)</div>"]
    examples_child["<div style="text-align:left"><b>examples/child</b><br/>b10ea448<br/><font color="#999">---</font><br/>- predictions (9cd1c608)</div>"]
    examples_parent --> examples_child

Defining features: "examples/parent"

The parent feature represents raw embeddings computed from source data. It has a single field embeddings with a code_version that tracks the algorithm version.

src/example_basic/features.py
import metaxy as mx


class ParentFeature(
    mx.BaseFeature,
    spec=mx.FeatureSpec(
        key="examples/parent",
        fields=[
            mx.FieldSpec(
                key="embeddings",
                code_version="1",
            ),
        ],
        id_columns=("sample_uid",),
    ),
):
    """Parent feature that generates embeddings from raw data."""

    pass

Defining features: "examples/child"

The child feature depends on the parent and produces predictions. The key configuration is the FeatureDep, declared here through deps=[ParentFeature], which states that "examples/child" depends on "examples/parent".

src/example_basic/features.py
class ChildFeature(
    mx.BaseFeature,
    spec=mx.FeatureSpec(
        key="examples/child",
        deps=[ParentFeature],
        fields=["predictions"],
        id_columns=("sample_uid",),
    ),
):
    """Child feature that uses parent embeddings to generate predictions."""

    pass

The FeatureDep declaration tells Metaxy:

  1. "examples/child" depends on "examples/parent"
  2. When the parent's field provenance changes, the child must be recomputed
  3. This dependency is tracked automatically, enabling incremental recomputation
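To make the dependency tracking above concrete, here is a minimal sketch of how a declared dependency map can be walked to find every feature affected by an upstream change. It does not use Metaxy's API: the DEPS mapping and the downstream_of helper are hypothetical names introduced only for illustration.

```python
from collections import deque

# Hypothetical dependency map (not Metaxy's internal structure):
# feature key -> keys of the features it depends on.
DEPS = {
    "examples/parent": [],
    "examples/child": ["examples/parent"],
}


def downstream_of(changed: str, deps: dict[str, list[str]]) -> list[str]:
    """Return every feature that transitively depends on `changed`."""
    # Invert the map so we can walk from a parent to its children.
    children: dict[str, list[str]] = {}
    for key, parents in deps.items():
        for parent in parents:
            children.setdefault(parent, []).append(key)

    affected: list[str] = []
    seen = {changed}
    queue = deque([changed])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


print(downstream_of("examples/parent", DEPS))
```

In this two-feature graph, a change to "examples/parent" affects only "examples/child", and a change to the child affects nothing else.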

Getting Started

Install the example's dependencies:

uv sync

Walkthrough

Step 1: Initial Run

Run the pipeline to create parent embeddings and child predictions:

$ python src/example_basic/pipeline.py
Graph project_version: 490f2c18
Written 3 rows for feature examples/parent
Pipeline
============================================================

[1/2] Computing parent feature...

[2/2] Computing child feature...
Graph project_version: 490f2c18

📊 Computing examples/child...
  feature_version: b10ea448
Identified: 3 new samples, 0 samples with new provenance_by_field
 Materialized 3 new samples

📋 Child provenance_by_field:
  sample_uid=1: {'predictions': '24503967'}
  sample_uid=2: {'predictions': '24458329'}
  sample_uid=3: {'predictions': '26963083'}


 Pipeline complete!

The pipeline materialized 3 samples for the child feature and recorded a provenance value for the predictions field of each sample.

Step 2: Verify Idempotency

Run the pipeline again without any changes:

$ python src/example_basic/pipeline.py
Graph project_version: 490f2c18
Metadata already exists for feature examples/parent (feature_version: 7de0f5e8...)
Skipping write to avoid duplicates
Pipeline
============================================================

[1/2] Computing parent feature...

[2/2] Computing child feature...
Graph project_version: 490f2c18

📊 Computing examples/child...
  feature_version: b10ea448
Identified: 0 new samples, 0 samples with new provenance_by_field

📋 Child provenance_by_field:
  sample_uid=1: {'predictions': '24503967'}
  sample_uid=2: {'predictions': '24458329'}
  sample_uid=3: {'predictions': '26963083'}

No changes detected (idempotent)

 Pipeline complete!

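Key observation: No recomputation occurred. The parent write was skipped because metadata already exists for feature_version 7de0f5e8, and the child run identified 0 new samples and 0 samples with new provenance_by_field.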
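The idempotent behavior can be pictured as a comparison between the provenance a run would produce and the provenance already stored. The sketch below is an assumption about the general technique, not Metaxy's implementation; resolve_update is a hypothetical helper, and the provenance values are copied from the output above.

```python
Provenance = dict[str, str]  # field name -> provenance hash


def resolve_update(
    expected: dict[int, Provenance],
    stored: dict[int, Provenance],
) -> tuple[list[int], list[int]]:
    """Return (new sample_uids, sample_uids whose provenance changed)."""
    new = sorted(uid for uid in expected if uid not in stored)
    changed = sorted(
        uid
        for uid, prov in expected.items()
        if uid in stored and stored[uid] != prov
    )
    return new, changed


# Provenance values taken from the pipeline output shown above.
expected = {
    1: {"predictions": "24503967"},
    2: {"predictions": "24458329"},
    3: {"predictions": "26963083"},
}

first_run = resolve_update(expected, stored={})
second_run = resolve_update(expected, stored=dict(expected))
print(first_run)   # ([1, 2, 3], [])
print(second_run)  # ([], [])
```

On the first run nothing is stored, so all three samples are new. On the second run the stored provenance matches what would be computed, so there is nothing to do.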

Step 3: Update Parent Algorithm

Now let's simulate an algorithm improvement by changing the parent's code_version from "1" to "2":

patches/01_update_parent_algorithm.patch
--- a/src/example_basic/features.py
+++ b/src/example_basic/features.py
@@ -15,7 +15,7 @@ class ParentFeature(
         fields=[
             FieldSpec(
                 key="embeddings",
-                code_version="1",
+                code_version="2",
             ),
         ],
         id_columns=("sample_uid",),
---
title: Feature Graph Changes
---
flowchart TB
    %% Snapshot version: none
    %%{init: {'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'themeVariables': {'fontSize': '14px'}}}%%
    examples_parent["<div style="text-align:left"><b>examples/parent</b><br/><font color="#FF0000">7de0f5e8</font> → <font color="#00FF00">68827f3e</font><br/><font color="#999">---</font><br/>- <font color="#FFAA00">embeddings</font> (<font color="#FF0000">05e66510</font> → <font color="#00FF00">3c8d3e9b</font>)</div>"]
    examples_child["<div style="text-align:left"><b>examples/child</b><br/><font color="#FF0000">b10ea448</font> → <font color="#00FF00">e5b92b18</font><br/><font color="#999">---</font><br/>- <font color="#FFAA00">predictions</font> (<font color="#FF0000">9cd1c608</font> → <font color="#00FF00">7cef6acb</font>)</div>"]
    examples_parent --> examples_child


    style examples_child stroke:#FFAA00,stroke-width:2px
    style examples_parent stroke:#FFAA00,stroke-width:2px

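This change alters the version of the embeddings field and, with it, the versions of both features (see the diagram above), so the existing embeddings and the downstream child feature have to be recomputed.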

Step 4: Observe Automatic Recomputation

Run the pipeline again after the algorithm change:

$ python src/example_basic/pipeline.py
Graph project_version: c423d51a
Written 3 rows for feature examples/parent
Pipeline
============================================================

[1/2] Computing parent feature...

[2/2] Computing child feature...
Graph project_version: c423d51a

📊 Computing examples/child...
  feature_version: e5b92b18
Identified: 3 new samples, 0 samples with new provenance_by_field
 Materialized 3 new samples

📋 Child provenance_by_field:
  sample_uid=1: {'predictions': '24503967'}
  sample_uid=2: {'predictions': '24458329'}
  sample_uid=3: {'predictions': '26963083'}


 Pipeline complete!

Key observation: The child feature was automatically recomputed because:

  1. The parent's code_version changed from "1" to "2"
  2. This changed the parent's metaxy_feature_version
  3. Metaxy detected the change through the child's dependency on the embeddings field
  4. All child samples were marked for recomputation
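One way to read the output above, where all 3 samples are reported as new rather than as having new provenance, is that stored metadata is looked up per feature version. The sketch below illustrates that reading only. It is an assumption, not a description of Metaxy's storage layer; store, write_if_absent, and count_new are hypothetical names, and the provenance values are placeholders.

```python
Rows = dict[int, dict[str, str]]  # sample_uid -> provenance_by_field

# Hypothetical in-memory store keyed by (feature key, feature_version).
store: dict[tuple[str, str], Rows] = {}


def write_if_absent(key: str, version: str, rows: Rows) -> bool:
    """Write rows for this feature version; skip if it already has metadata."""
    if (key, version) in store:
        return False  # comparable to "Skipping write to avoid duplicates"
    store[(key, version)] = dict(rows)
    return True


def count_new(key: str, version: str, sample_uids: list[int]) -> int:
    """Count samples that have no stored row under this feature version."""
    existing = store.get((key, version), {})
    return sum(uid not in existing for uid in sample_uids)


# Placeholder provenance values, for illustration only.
rows = {uid: {"predictions": "placeholder"} for uid in (1, 2, 3)}

# Initial run under the old child feature version.
write_if_absent("examples/child", "b10ea448", rows)
print(count_new("examples/child", "b10ea448", [1, 2, 3]))  # 0
# After the parent change the child has a new feature version.
print(count_new("examples/child", "e5b92b18", [1, 2, 3]))  # 3
```

Under this assumption, a new feature version starts with no stored rows, so every sample counts as new and is materialized again.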

How It Works

Metaxy tracks provenance at the field level using:

  1. Field Version: A hash combining the field's code_version and provenances of upstream fields
  2. Feature Version: A hash combining the field versions of all fields in the feature
  3. Dependency Resolution: When resolving updates, Metaxy computes what the provenance would be and compares it to what's stored

This enables precise, incremental recomputation without re-processing unchanged data.
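The versioning scheme described above can be sketched with plain hashing. This is an illustration under stated assumptions, not Metaxy's actual algorithm: the helpers short_hash, field_version, feature_version, and build are hypothetical, the child's code_version is assumed to be "1", and the resulting hashes will not match the ones shown in the diagrams.

```python
import hashlib

CHILD_CODE_VERSION = "1"  # assumption: the example does not state the child's value


def short_hash(*parts: str) -> str:
    """Join the parts and return the first 8 hex characters of their SHA-256."""
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:8]


def field_version(key: str, code_version: str, upstream: list[str]) -> str:
    """Combine a field's code_version with the versions of its upstream fields."""
    # Sort upstream versions so the result does not depend on declaration order.
    return short_hash(key, code_version, *sorted(upstream))


def feature_version(field_versions: dict[str, str]) -> str:
    """Combine the versions of all fields in a feature."""
    return short_hash(*(f"{k}={v}" for k, v in sorted(field_versions.items())))


def build(parent_code_version: str) -> tuple[str, str]:
    """Return (parent feature version, child feature version)."""
    embeddings = field_version("embeddings", parent_code_version, [])
    predictions = field_version("predictions", CHILD_CODE_VERSION, [embeddings])
    return (
        feature_version({"embeddings": embeddings}),
        feature_version({"predictions": predictions}),
    )


before = build("1")
after = build("2")
print(before, after)
```

Because the predictions field hashes in the version of embeddings, bumping the parent's code_version changes both the parent and the child feature versions, while rebuilding with the same code_version yields identical versions.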

Conclusion

Metaxy provides automatic change detection and incremental recomputation through field-level versioning and declared feature dependencies.

This mechanism keeps your pipelines efficient and your data up to date.
