Versioning¶

Metaxy calculates a few types of versions at feature, field, and sample levels.

Metaxy's versioning system is declarative, static, and deterministic.

Because it is static, versions can be calculated ahead of time (before the data processing job is executed).

Metaxy uses hashing algorithms to compute all versions. The algorithm and the hash length can be configured.
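To make "configurable algorithm and hash length" concrete, here is a minimal sketch built on Python's standard hashlib. The helper name short_hash and its parameters are hypothetical and are not part of Metaxy's API; the sketch only illustrates the general idea of hashing a sequence of strings and truncating the digest.

```python
import hashlib


def short_hash(*parts: str, algorithm: str = "sha256", length: int = 8) -> str:
    """Hash the given strings and keep the first `length` hex characters."""
    hasher = hashlib.new(algorithm)
    for part in parts:
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        hasher.update(f"{len(part)}:{part}".encode("utf-8"))
    return hasher.hexdigest()[:length]


# Same inputs, different (hypothetical) settings.
default = short_hash("example/video", "audio", "1")
longer = short_hash("example/video", "audio", "1", length=16)
other_algorithm = short_hash("example/video", "audio", "1", algorithm="sha512")
```

Because the function is deterministic, the same inputs and settings always give the same version string, which is what allows versions to be computed before any data is processed.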

Here is how these versions are calculated, from bottom to top.

Definitions¶

These versions can be computed from Metaxy definitions (e.g. Python code or historical snapshots of the feature graph). We don't need to access the metadata store in order to calculate them. They exist in Python at runtime, and are also serialized to the metadata store when metaxy push is called.

Field Level¶

Field Code Version is defined on the field and is provided by the user (defaults to "__metaxy_initial__"). Apart from overriding data versions, this is the only input to the versioning system that can be directly modified by the user.

Code Version Value

The value can be an arbitrary string, but in the future we might implement something around semantic versioning.

Field Version is computed from the code version of this field, the fully qualified field path, and the field versions of its parent fields, if any exist (fields on root features, for example, have no dependencies).

Figure: field-version
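The rule above can be sketched in a few lines. This is an illustration under assumptions, not Metaxy's implementation: the function name field_version, the "|" separator, and the use of an untruncated SHA-256 digest are all choices made for this example.

```python
import hashlib


def field_version(field_path: str, code_version: str, parent_versions: list[str]) -> str:
    """Toy field version: hash of the path, code version, and parent versions."""
    payload = "|".join([field_path, code_version, *sorted(parent_versions)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# A root field has no parents; a downstream field folds in its parent's version.
video_audio_v1 = field_version("example/video/audio", "1", [])
crop_audio_v1 = field_version("example/crop/audio", "1", [video_audio_v1])

# Bumping the upstream code version changes the downstream field version too,
# even though the downstream code version is still "1".
video_audio_v2 = field_version("example/video/audio", "2", [])
crop_audio_v2 = field_version("example/crop/audio", "1", [video_audio_v2])
```

Including the fully qualified path in the hash keeps two fields with identical code versions and parents from sharing a version.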

Feature Level¶

Feature Version is computed from the Field Versions of all fields defined on the feature and the key of the feature.

Figure: feature-version

This version is stored in the metaxy_feature_version system column.

Feature Code Version is computed from the Field Code Versions of all fields defined on the feature. Unlike Feature Version, this version does not change when dependencies change. The value of this version is determined entirely by user input.
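The difference between the two feature-level versions is easiest to see side by side. The sketch below is a toy model: the function names, the "|" separator, and the hashing scheme are assumptions made for illustration and do not reflect Metaxy's internals.

```python
import hashlib


def digest(*parts: str) -> str:
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def feature_version(key: str, field_versions: dict[str, str]) -> str:
    """Toy feature version: depends on the feature key and every field version."""
    fields = [f"{name}={version}" for name, version in sorted(field_versions.items())]
    return digest(key, *fields)


def feature_code_version(field_code_versions: dict[str, str]) -> str:
    """Toy feature code version: depends only on user-provided code versions."""
    fields = [f"{name}={version}" for name, version in sorted(field_code_versions.items())]
    return digest(*fields)


# Code versions set by the user on the feature's own fields.
crop_code_versions = {"audio": "1", "frames": "1"}
code_version = feature_code_version(crop_code_versions)

# Field versions before and after an upstream change: only the audio field
# version moves, and the user's code versions are untouched.
before = feature_version("example/crop", {"audio": "4c726c4b", "frames": "2419e09d"})
after = feature_version("example/crop", {"audio": "e2b6ce39", "frames": "2419e09d"})
```

Here before and after differ because a field version changed, while code_version is computed from inputs that the upstream change never touches.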

Project Level¶

Project Version is computed from the Feature Versions of all features in the Metaxy project.

Figure: project-version

How is project version used?

This value uniquely encodes the versioned feature graph topology. The metaxy push CLI command can be used to keep track of previous versions of the feature graph, enabling features such as data version reconciliation migrations.

This version is stored in the metaxy_project_version system column.

Samples¶

These versions are sample-level and require access to the metadata store in order to be computed. They are stored separately for each row in the feature table.

Provenance¶

Provenance By Field is computed from the upstream Provenance By Field (with respect to the defined field-level lineage) and the code versions of the current fields. It is a dictionary mapping sample field names to their respective versions. This is what it looks like in the metadata store (database):

id	metaxy_provenance_by_field
video_001	`{"audio": "a7f3c2d8", "frames": "b9e1f4a2"}`
video_002	`{"audio": "d4b8e9c1", "frames": "f2a6d7b3"}`
video_003	`{"audio": "c9f2a8e4", "frames": "e7d3b1c5"}`
video_004	`{"audio": "b1e4f9a7", "frames": "a8c2e6d9"}`

Sample Provenance is derived from the Provenance By Field by simply hashing it.

Computing this value is the goal of the entire versioning engine. It is a string value that only changes when the versions of the specific upstream fields the sample depends on change. It acts as the source of truth for resolving incremental updates for feature metadata.

Most of the time metaxy_provenance_by_field and metaxy_provenance are used for the final data version columns as is, except when the user wants to override the latter. These final versions are then recursively used to compute downstream provenances.

Figure: sample-version

These versions are stored in the metaxy_provenance_by_field and metaxy_provenance system columns.
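As a rough illustration of the two sample-level values, the sketch below derives a per-field provenance for one row and then hashes the whole mapping. Everything here is an assumption made for the example: the function names, the canonical JSON encoding, the downstream field "faces", and its code version are hypothetical, and Metaxy's real computation runs inside the metadata store rather than in plain Python.

```python
import hashlib
import json


def digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def provenance_by_field(
    upstream_versions: dict[str, str],
    lineage: dict[str, list[str]],
    code_versions: dict[str, str],
) -> dict[str, str]:
    """For each field, hash its code version with the upstream fields it reads."""
    result = {}
    for field, parents in lineage.items():
        parent_versions = [f"{p}={upstream_versions[p]}" for p in sorted(parents)]
        result[field] = digest("|".join([field, code_versions[field], *parent_versions]))
    return result


def sample_provenance(by_field: dict[str, str]) -> str:
    """Hash the whole mapping in a key-order-independent way."""
    return digest(json.dumps(by_field, sort_keys=True, separators=(",", ":")))


upstream_row = {"audio": "a7f3c2d8", "frames": "b9e1f4a2"}  # video_001 from the table
lineage = {"faces": ["frames"]}  # hypothetical downstream field reading only frames
by_field = provenance_by_field(upstream_row, lineage, {"faces": "1"})
provenance = sample_provenance(by_field)
```

Since the hypothetical faces field reads only frames, a change to the upstream audio version leaves both by_field and provenance untouched for this row.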

Data Version¶

Users can override the computed sample-level versions (metaxy_provenance_by_field) by setting metaxy_data_version_by_field on their metadata, effectively providing a Data Version for the sample. This can be used for preventing unnecessary downstream updates, if the computed sample stays the same even after upstream data has changed. metaxy_data_version_by_field is then used to compute metaxy_data_version by hashing all the fields together.

For example, the data version can be calculated by running sha256 over the file, or by applying a perceptual hashing method to images and videos.

This customization only affects how downstream increments are calculated, as the data version cannot be known until the feature is computed.

These versions are stored in the metaxy_data_version_by_field and metaxy_data_version system columns.
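A content hash of this kind can be produced with the standard library alone. The sketch below is one possible way to do it; the helper name file_data_version and the chunk size are choices made for this example, and how the resulting string is written into metaxy_data_version_by_field is left out.

```python
import hashlib
import tempfile
from pathlib import Path


def file_data_version(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    clip = Path(tmp) / "clip.bin"
    clip.write_bytes(b"fake audio bytes")
    first = file_data_version(clip)
    clip.write_bytes(b"fake audio bytes")  # rewritten with identical content
    second = file_data_version(clip)
```

The two digests are equal because the file content is identical, which is exactly the situation where an overridden data version avoids a downstream recomputation.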

Provenance Vs Data Version¶

To summarize, metaxy_provenance and metaxy_provenance_by_field are used to determine whether the current feature has to be updated. Usually they are used for metaxy_data_version and metaxy_data_version_by_field, but the user can override this. These columns in turn are used to calculate provenances for downstream features.
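The relationship can be modeled with plain dictionaries. This is a simplified sketch, not Metaxy code: the function names are hypothetical, the version strings are made up, and the downstream provenance is reduced to a hash of the upstream data versions (ignoring code versions and lineage) to keep the example short.

```python
import hashlib
import json
from typing import Optional


def digest_mapping(mapping: dict[str, str]) -> str:
    canonical = json.dumps(mapping, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def data_version_by_field(
    provenance_by_field: dict[str, str],
    override: Optional[dict[str, str]] = None,
) -> dict[str, str]:
    """Use the user-provided data version when present, else the provenance."""
    return dict(override) if override is not None else dict(provenance_by_field)


def downstream_provenance(upstream_data_version_by_field: dict[str, str]) -> str:
    """Downstream provenance is derived from upstream data versions."""
    return digest_mapping(upstream_data_version_by_field)


# Upstream provenance changes (for example after a code version bump) ...
old_provenance = {"frames": "b9e1f4a2"}
new_provenance = {"frames": "0c1d2e3f"}  # made-up value
# ... but the produced file is byte-identical, so its content hash is unchanged.
content_hash = {"frames": "sha256-of-file"}  # placeholder for a real digest

without_override = (
    downstream_provenance(data_version_by_field(old_provenance)),
    downstream_provenance(data_version_by_field(new_provenance)),
)
with_override = (
    downstream_provenance(data_version_by_field(old_provenance, content_hash)),
    downstream_provenance(data_version_by_field(new_provenance, content_hash)),
)
```

Without an override the downstream provenance changes and the downstream feature is considered stale; with a stable content hash it stays the same.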

Example: Partial Data Updates¶

This example makes use of Metaxy's syntactic sugar.

Consider a video processing pipeline with these features:

import metaxy as mx


class Video(
    mx.BaseFeature,
    spec=mx.FeatureSpec(
        key="example/video",
        id_columns=["video_id"],
        fields=[
            mx.FieldSpec(key="audio", code_version="1"),
            mx.FieldSpec(key="frames", code_version="1"),
        ],
    ),
):
    """Video metadata feature (root)."""

    video_id: str
    frames: int
    duration: float
    size: int


class Crop(
    mx.BaseFeature,
    spec=mx.FeatureSpec(
        key="example/crop",
        id_columns=["video_id"],
        deps=[Video],
        fields=[
            mx.FieldSpec(key="audio", code_version="1"),  # (1)
            mx.FieldSpec(key="frames", code_version="1"),  # (2)
        ],
    ),
):
    video_id: str  # ID column


class FaceDetection(
    mx.BaseFeature,
    spec=mx.FeatureSpec(
        key="example/face_detection",
        id_columns=["video_id"],
        deps=[Crop],
        fields=[
            mx.FieldSpec(
                key="faces",
                code_version="1",
                deps=[mx.FieldDep(feature=Crop, fields=["frames"])],
            ),
        ],
    ),
):
    video_id: str


class SpeechToText(
    mx.BaseFeature,
    spec=mx.FeatureSpec(
        key="example/stt",
        id_columns=["video_id"],
        deps=[Video],
        fields=[
            mx.FieldSpec(
                key="transcription",
                code_version="1",
                deps=[mx.FieldDep(feature=Video, fields=["audio"])],
            ),
        ],
    ),
):
    video_id: str

(1) This audio field automatically depends on the audio field of the "example/video" feature, because their names match.
(2) This frames field automatically depends on the frames field of the "example/video" feature, because their names match.

Running metaxy graph render --format mermaid produces this graph:

---
title: Feature Graph
---
flowchart LR
    %% Snapshot version: none
    %%{init: {'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'themeVariables': {'fontSize': '14px'}}}%%
    example_video["<div style="text-align:left"><b>example/video</b><br/>c2ac395f<br/><font color="#999">---</font><br/>- audio (22742381)<br/>- frames (794116a9)</div>"]
    example_crop["<div style="text-align:left"><b>example/crop</b><br/>34d75856<br/><font color="#999">---</font><br/>- audio (4c726c4b)<br/>- frames (2419e09d)</div>"]
    example_face_detection["<div style="text-align:left"><b>example/face_detection</b><br/>f1526ee0<br/><font color="#999">---</font><br/>- faces (006efeef)</div>"]
    example_stt["<div style="text-align:left"><b>example/stt</b><br/>d953dea4<br/><font color="#999">---</font><br/>- transcription (3ec3826d)</div>"]
    example_video --> example_crop
    example_crop --> example_face_detection
    example_video --> example_stt

Tracking Definitions Changes¶

Imagine the audio field of the "example/video" feature changes (perhaps something like denoising has been applied):

patches/01_update_audio_version.patch

--- a/src/example_overview/features.py
+++ b/src/example_overview/features.py
@@ -14,7 +14,7 @@ class Video(
         fields=[
             FieldSpec(
                 key="audio",
-                code_version="1",
+                code_version="2",
             ),
             FieldSpec(
                 key="frames",

Here is how the change affects feature and field versions through the feature graph:

---
title: Feature Graph Changes
---
flowchart LR
    %% Snapshot version: none
    %%{init: {'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'themeVariables': {'fontSize': '14px'}}}%%
    example_video["<div style="text-align:left"><b>example/video</b><br/><font color="#FF0000">c2ac395f</font> → <font color="#00FF00">2faffb98</font><br/><font color="#999">---</font><br/>- frames (794116a9)<br/>- <font color="#FFAA00">audio</font> (<font color="#FF0000">22742381</font> → <font color="#00FF00">09c8398b</font>)</div>"]
    example_crop["<div style="text-align:left"><b>example/crop</b><br/><font color="#FF0000">34d75856</font> → <font color="#00FF00">fe237dc9</font><br/><font color="#999">---</font><br/>- frames (2419e09d)<br/>- <font color="#FFAA00">audio</font> (<font color="#FF0000">4c726c4b</font> → <font color="#00FF00">e2b6ce39</font>)</div>"]
    example_face_detection["<div style="text-align:left"><b>example/face_detection</b><br/>f1526ee0<br/><font color="#999">---</font><br/>- faces (006efeef)</div>"]
    example_stt["<div style="text-align:left"><b>example/stt</b><br/><font color="#FF0000">d953dea4</font> → <font color="#00FF00">e57e7555</font><br/><font color="#999">---</font><br/>- <font color="#FFAA00">transcription</font> (<font color="#FF0000">3ec3826d</font> → <font color="#00FF00">9f7ea40c</font>)</div>"]
    example_video --> example_crop
    example_crop --> example_face_detection
    example_video --> example_stt


    style example_crop stroke:#FFAA00,stroke-width:2px
    style example_face_detection stroke:#808080
    style example_stt stroke:#FFAA00,stroke-width:2px
    style example_video stroke:#FFAA00,stroke-width:2px

Info

"example/video", "example/crop", and "example/stt" have changed
"example/face_detection" remained unchanged (depends only on frames and not on audio)
Audio field versions have changed throughout the graph
Frame field versions have stayed the same
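The propagation pattern listed above can be reproduced with a small toy model of the field-level dependency graph. The hashing scheme and names below are assumptions made for illustration, so the resulting digests will not match the ones in the rendered graph; only the set of fields that change is meant to correspond.

```python
import hashlib

# Field path -> parent field paths, mirroring the example feature graph.
FIELD_DEPS = {
    "example/video/audio": [],
    "example/video/frames": [],
    "example/crop/audio": ["example/video/audio"],
    "example/crop/frames": ["example/video/frames"],
    "example/face_detection/faces": ["example/crop/frames"],
    "example/stt/transcription": ["example/video/audio"],
}


def field_versions(code_versions: dict[str, str]) -> dict[str, str]:
    """Compute a toy version for every field, resolving parents recursively."""
    resolved: dict[str, str] = {}

    def resolve(path: str) -> str:
        if path not in resolved:
            parents = sorted(resolve(parent) for parent in FIELD_DEPS[path])
            payload = "|".join([path, code_versions[path], *parents])
            resolved[path] = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        return resolved[path]

    for path in FIELD_DEPS:
        resolve(path)
    return resolved


initial = {path: "1" for path in FIELD_DEPS}
bumped = {**initial, "example/video/audio": "2"}  # the patch shown earlier

before = field_versions(initial)
after = field_versions(bumped)
changed = sorted(path for path in FIELD_DEPS if before[path] != after[path])
```

In this model changed contains the audio fields of "example/video" and "example/crop" plus the transcription field of "example/stt", while every frames-based field, including faces, keeps its version.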

Incremental Computations¶

The single most important piece of code in Metaxy is the resolve_update method. For a given feature, it takes the inputs (metadata from the upstream features), computes the expected provenances for that feature, and compares them with the current state in the metadata store. Learn more about this process here.

The Python pipeline needs to handle the result of the resolve_update call:

with store:  # MetadataStore
    # Metaxy computes provenance_by_field and identifies changes
    increment = store.resolve_update(DownstreamFeature)

    # Process only the changed samples

The increment object has attributes for new upstream samples, samples identified as stale, and samples that have been removed from the upstream metadata.
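Conceptually, the comparison behind these three groups can be sketched with two dictionaries that map sample IDs to provenance strings. The function and key names below are hypothetical and are not the attribute names of Metaxy's increment object; the real comparison is performed against the metadata store rather than in-memory dictionaries.

```python
def classify_samples(expected: dict[str, str], stored: dict[str, str]) -> dict[str, list[str]]:
    """Compare expected provenance per sample ID with what the store holds."""
    return {
        # Present upstream, but never computed for this feature.
        "new": sorted(expected.keys() - stored.keys()),
        # Computed before, but the expected provenance no longer matches.
        "stale": sorted(
            sample_id
            for sample_id in expected.keys() & stored.keys()
            if expected[sample_id] != stored[sample_id]
        ),
        # Still in the store, but gone from the upstream metadata.
        "removed": sorted(stored.keys() - expected.keys()),
    }


# Made-up provenance values for four samples.
expected = {"video_001": "a1", "video_002": "b2", "video_004": "d4"}
stored = {"video_001": "a1", "video_002": "old", "video_003": "c3"}
result = classify_samples(expected, stored)
```

In this example video_001 is up to date and appears in none of the groups, so a pipeline would only need to process video_004 and video_002 and decide what to do about video_003.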