Versioning¶
Metaxy calculates a few types of versions at feature, field, and sample levels.
Metaxy's versioning system is declarative, static, and deterministic. Being static means that versions can be calculated ahead of time, before the data processing job is executed.
Metaxy uses hashing algorithms to compute all versions. The algorithm and the hash length can be configured.
Here is how these versions are calculated, from bottom to top.
Definitions¶
These versions can be computed from Metaxy definitions (e.g. Python code or historical snapshots of the feature graph). We don't need to access the metadata store in order to calculate them. They exist in Python at runtime, and are also serialized to the metadata store when metaxy push is called.
Field Level¶
Field Code Version is defined on the field and is provided by the user (defaults to "__metaxy_initial__"). Apart from overriding data versions, this is the only input to the versioning system that can be directly modified by the user.
Code Version Value
The value can be an arbitrary string, but in the future we might implement something around semantic versioning.
Field Version is computed from the code version of this field, the fully qualified field path, and the field versions of its parent fields, if any exist (fields on root features, for example, have no dependencies).
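The three inputs listed above can be combined as in the following minimal sketch. It is not Metaxy's actual implementation: the helper names, the `"feature:field"` path format, the separator, and the SHA-256 default with an 8-character length are all assumptions made for illustration.

```python
import hashlib


def short_hash(*parts: str, algorithm: str = "sha256", length: int = 8) -> str:
    """Hash the joined parts and keep the first `length` hex characters.

    Both the algorithm and the length are parameters, mirroring the idea
    that they are configurable (the defaults here are assumptions).
    """
    payload = "|".join(parts).encode("utf-8")
    return hashlib.new(algorithm, payload).hexdigest()[:length]


def field_version(field_path: str, code_version: str, parent_versions: list[str]) -> str:
    """Hypothetical field version: path + code version + parent field versions."""
    # Sort parent versions so the result does not depend on declaration order.
    return short_hash(field_path, code_version, *sorted(parent_versions))


# A root field has no parents; a child field folds in its parent's version.
root = field_version("example/video:audio", "1", [])
child = field_version("example/crop:audio", "1", [root])
```

Because the child hashes the parent's version, bumping the parent's code version changes both values, while unrelated fields keep theirs.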
Feature Level¶
Feature Version is computed from the Field Versions of all fields defined on the feature and the key of the feature.
This version is stored as the metaxy_feature_version system column.
Feature Code Version is computed from the Field Code Versions of all fields defined on the feature. Unlike Feature Version, this version does not change when dependencies change. The value of this version is determined entirely by user input.
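The difference between the two feature-level versions can be illustrated with a short sketch. The function names and the hashing scheme are assumptions, not Metaxy's real code. The field version strings are borrowed from the example graph later on this page and serve only as inputs, so the resulting hashes will not match Metaxy's output.

```python
import hashlib


def short_hash(*parts: str, length: int = 8) -> str:
    """Truncated SHA-256 over the joined parts (scheme assumed for illustration)."""
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:length]


def feature_version(feature_key: str, field_versions: dict[str, str]) -> str:
    """Hypothetical: feature key + field versions (these reflect upstream changes)."""
    items = [f"{name}={version}" for name, version in sorted(field_versions.items())]
    return short_hash(feature_key, *items)


def feature_code_version(field_code_versions: dict[str, str]) -> str:
    """Hypothetical: only the user-provided field code versions."""
    items = [f"{name}={version}" for name, version in sorted(field_code_versions.items())]
    return short_hash(*items)


code_versions = {"audio": "1", "frames": "1"}

# The upstream audio field changed, so the audio field version of this feature changed.
before = feature_version("example/crop", {"audio": "4c726c4b", "frames": "2419e09d"})
after = feature_version("example/crop", {"audio": "e2b6ce39", "frames": "2419e09d"})

# The code versions of this feature were not touched.
code_before = feature_code_version(code_versions)
code_after = feature_code_version(code_versions)
```

In this sketch `before` and `after` differ, while `code_before` and `code_after` are identical, which is the distinction the paragraph above draws.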
Project Level¶
Project Version is computed from the Feature Versions of all features in the Metaxy project.
How is project version used?
This value uniquely encodes the versioned topology of the feature graph. The metaxy push CLI command can be used to keep track of previous versions of the feature graph, enabling features such as data version reconciliation migrations.
This version is stored as the metaxy_project_version system column.
Samples¶
These versions are sample-level and require access to the metadata store in order to be computed. They are stored separately for each row in the feature table.
Provenance¶
Provenance By Field is computed from the upstream Provenance By Field (with respect to the defined field-level lineage) and the code versions of the current fields. It is a dictionary mapping sample field names to their respective versions. This is what it looks like in the metadata store (database):
| id | metaxy_provenance_by_field |
|---|---|
| video_001 | {"audio": "a7f3c2d8", "frames": "b9e1f4a2"} |
| video_002 | {"audio": "d4b8e9c1", "frames": "f2a6d7b3"} |
| video_003 | {"audio": "c9f2a8e4", "frames": "e7d3b1c5"} |
| video_004 | {"audio": "b1e4f9a7", "frames": "a8c2e6d9"} |
Sample Provenance is derived from the Provenance By Field by simply hashing it.
Computing this value is the goal of the entire versioning engine. It is a string value that changes only when the versions of the specific upstream fields the sample depends on change. It acts as the source of truth for resolving incremental updates for feature metadata.
Most of the time metaxy_provenance_by_field and metaxy_provenance are used for the final data version columns as is, except when the user wants to override the latter. These final versions are then recursively used to compute downstream provenances.
These versions are stored as the metaxy_provenance_by_field and metaxy_provenance system columns.
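The two sample-level values can be sketched for a single row as follows. This is an illustration under stated assumptions, not Metaxy's engine: the function names, the lineage dictionary shape, the hypothetical downstream fields `faces` and `transcription`, and the replacement audio version `0f0f0f0f` are all made up for the example. Only the `video_001` values come from the table above.

```python
import hashlib
import json


def short_hash(text: str, length: int = 8) -> str:
    """Truncated SHA-256 hex digest (algorithm and length are assumptions)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]


def provenance_by_field(
    upstream: dict[str, str],
    lineage: dict[str, list[str]],
    code_versions: dict[str, str],
) -> dict[str, str]:
    """For each field, hash its code version with the upstream versions it depends on."""
    result = {}
    for field, parents in lineage.items():
        parts = [field, code_versions[field]]
        parts += [f"{parent}={upstream[parent]}" for parent in sorted(parents)]
        result[field] = short_hash("|".join(parts))
    return result


def sample_provenance(by_field: dict[str, str]) -> str:
    """Hash the whole dictionary; sorted keys keep the result deterministic."""
    return short_hash(json.dumps(by_field, sort_keys=True))


lineage = {"faces": ["frames"], "transcription": ["audio"]}
code_versions = {"faces": "1", "transcription": "1"}

video_001 = {"audio": "a7f3c2d8", "frames": "b9e1f4a2"}
denoised = {**video_001, "audio": "0f0f0f0f"}  # made-up new upstream audio version

old = provenance_by_field(video_001, lineage, code_versions)
new = provenance_by_field(denoised, lineage, code_versions)
```

Here only `transcription` changes between `old` and `new`, because `faces` depends on `frames` alone. The sample-level hash changes as well, since it covers every field.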
Data Version¶
Users can override the computed sample-level versions (metaxy_provenance_by_field) by setting metaxy_data_version_by_field on their metadata, effectively providing a Data Version for the sample. This prevents unnecessary downstream updates when the computed sample stays the same even after upstream data has changed. metaxy_data_version_by_field is then used to compute metaxy_data_version by hashing all the fields together.
For example, the data version can be calculated by running sha256 over the file, or a perceptual hashing method for images and videos.
This customization only affects how downstream increments are calculated, as the data version cannot be known until the feature is computed.
These versions are stored as the metaxy_data_version_by_field and metaxy_data_version system columns.
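A content-based override might look like the sketch below. It assumes rows are plain dictionaries and that the produced artifact is available as bytes. The helper name, the `payload` key, and the provenance value `0c11aa57` are hypothetical, and how the column is actually attached to metadata in Metaxy is not shown here.

```python
import hashlib


def content_data_version(payload: bytes, length: int = 8) -> str:
    """Hash the produced artifact itself instead of its lineage."""
    return hashlib.sha256(payload).hexdigest()[:length]


# Two runs whose upstream provenance differs but whose output bytes are identical.
run_1 = {"metaxy_provenance_by_field": {"frames": "b9e1f4a2"}, "payload": b"same pixels"}
run_2 = {"metaxy_provenance_by_field": {"frames": "0c11aa57"}, "payload": b"same pixels"}

for run in (run_1, run_2):
    # Override the data version with a hash of the content.
    run["metaxy_data_version_by_field"] = {"frames": content_data_version(run["payload"])}

same_data_version = (
    run_1["metaxy_data_version_by_field"] == run_2["metaxy_data_version_by_field"]
)
```

Since downstream provenance is derived from the data version, a downstream feature would see no change between the two runs in this sketch, even though the upstream provenance differs.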
Provenance Vs Data Version¶
To summarize, metaxy_provenance and metaxy_provenance_by_field are used to determine whether the current feature has to be updated. Usually they are used for metaxy_data_version and metaxy_data_version_by_field, but the user can override this. These columns in turn are used to calculate provenances for downstream features.
Example: Partial Data Updates¶
This example makes use of Metaxy's syntactic sugar.
Consider a video processing pipeline with these features:
import metaxy as mx


class Video(
    mx.BaseFeature,
    spec=mx.FeatureSpec(
        key="example/video",
        id_columns=["video_id"],
        fields=[
            mx.FieldSpec(key="audio", code_version="1"),
            mx.FieldSpec(key="frames", code_version="1"),
        ],
    ),
):
    """Video metadata feature (root)."""

    video_id: str
    frames: int
    duration: float
    size: int


class Crop(
    mx.BaseFeature,
    spec=mx.FeatureSpec(
        key="example/crop",
        id_columns=["video_id"],
        deps=[Video],
        fields=[
            mx.FieldSpec(key="audio", code_version="1"),  # (1)!
            mx.FieldSpec(key="frames", code_version="1"),  # (2)!
        ],
    ),
):
    video_id: str  # ID column


class FaceDetection(
    mx.BaseFeature,
    spec=mx.FeatureSpec(
        key="example/face_detection",
        id_columns=["video_id"],
        deps=[Crop],
        fields=[
            mx.FieldSpec(
                key="faces",
                code_version="1",
                deps=[mx.FieldDep(feature=Crop, fields=["frames"])],
            ),
        ],
    ),
):
    video_id: str


class SpeechToText(
    mx.BaseFeature,
    spec=mx.FeatureSpec(
        key="example/stt",
        id_columns=["video_id"],
        deps=[Video],
        fields=[
            mx.FieldSpec(
                key="transcription",
                code_version="1",
                deps=[mx.FieldDep(feature=Video, fields=["audio"])],
            ),
        ],
    ),
):
    video_id: str
1. This audio field automatically depends on the audio field of the "example/video" feature, because their names match.
2. This frames field automatically depends on the frames field of the "example/video" feature, because their names match.
Running metaxy graph render --format mermaid produces this graph:
---
title: Feature Graph
---
flowchart LR
%% Snapshot version: none
%%{init: {'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'themeVariables': {'fontSize': '14px'}}}%%
example_video["<div style="text-align:left"><b>example/video</b><br/>c2ac395f<br/><font color="#999">---</font><br/>- audio (22742381)<br/>- frames (794116a9)</div>"]
example_crop["<div style="text-align:left"><b>example/crop</b><br/>34d75856<br/><font color="#999">---</font><br/>- audio (4c726c4b)<br/>- frames (2419e09d)</div>"]
example_face_detection["<div style="text-align:left"><b>example/face_detection</b><br/>f1526ee0<br/><font color="#999">---</font><br/>- faces (006efeef)</div>"]
example_stt["<div style="text-align:left"><b>example/stt</b><br/>d953dea4<br/><font color="#999">---</font><br/>- transcription (3ec3826d)</div>"]
example_video --> example_crop
example_crop --> example_face_detection
example_video --> example_stt
Tracking Definitions Changes¶
Imagine the audio field of the "example/video" feature changes (perhaps something like denoising has been applied):
patches/01_update_audio_version.patch
Here is how the change affects feature and field versions through the feature graph:
---
title: Feature Graph Changes
---
flowchart LR
%% Snapshot version: none
%%{init: {'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'themeVariables': {'fontSize': '14px'}}}%%
example_video["<div style="text-align:left"><b>example/video</b><br/><font color="#FF0000">c2ac395f</font> → <font color="#00FF00">2faffb98</font><br/><font color="#999">---</font><br/>- frames (794116a9)<br/>- <font color="#FFAA00">audio</font> (<font color="#FF0000">22742381</font> → <font color="#00FF00">09c8398b</font>)</div>"]
example_crop["<div style="text-align:left"><b>example/crop</b><br/><font color="#FF0000">34d75856</font> → <font color="#00FF00">fe237dc9</font><br/><font color="#999">---</font><br/>- frames (2419e09d)<br/>- <font color="#FFAA00">audio</font> (<font color="#FF0000">4c726c4b</font> → <font color="#00FF00">e2b6ce39</font>)</div>"]
example_face_detection["<div style="text-align:left"><b>example/face_detection</b><br/>f1526ee0<br/><font color="#999">---</font><br/>- faces (006efeef)</div>"]
example_stt["<div style="text-align:left"><b>example/stt</b><br/><font color="#FF0000">d953dea4</font> → <font color="#00FF00">e57e7555</font><br/><font color="#999">---</font><br/>- <font color="#FFAA00">transcription</font> (<font color="#FF0000">3ec3826d</font> → <font color="#00FF00">9f7ea40c</font>)</div>"]
example_video --> example_crop
example_crop --> example_face_detection
example_video --> example_stt
style example_crop stroke:#FFAA00,stroke-width:2px
style example_face_detection stroke:#808080
style example_stt stroke:#FFAA00,stroke-width:2px
style example_video stroke:#FFAA00,stroke-width:2px
Info
- "example/video", "example/crop", and "example/stt" have changed
- "example/face_detection" remained unchanged (it depends only on frames and not on audio)
- Audio field versions have changed throughout the graph
- Frame field versions have stayed the same
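The propagation summarized above can be reproduced with a small standalone sketch that walks the same field-level lineage. It does not use Metaxy. The `PARENTS` table, the `field_versions` helper, and the hashing scheme are assumptions, so the hashes differ from those in the diagrams, but the set of changed fields is the same.

```python
import hashlib

# (feature, field) -> parent fields, mirroring the example graph above.
PARENTS = {
    ("example/video", "audio"): [],
    ("example/video", "frames"): [],
    ("example/crop", "audio"): [("example/video", "audio")],
    ("example/crop", "frames"): [("example/video", "frames")],
    ("example/face_detection", "faces"): [("example/crop", "frames")],
    ("example/stt", "transcription"): [("example/video", "audio")],
}


def field_versions(code_versions: dict) -> dict:
    """Resolve every field version recursively from code versions and lineage."""
    cache = {}

    def resolve(node):
        if node not in cache:
            parents = sorted(resolve(parent) for parent in PARENTS[node])
            raw = "|".join([node[0], node[1], code_versions[node], *parents])
            cache[node] = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:8]
        return cache[node]

    return {node: resolve(node) for node in PARENTS}


initial = {node: "1" for node in PARENTS}
bumped = {**initial, ("example/video", "audio"): "2"}  # the audio code version changes

before = field_versions(initial)
after = field_versions(bumped)
changed = sorted(node for node in PARENTS if before[node] != after[node])
```

`changed` contains the audio field of "example/video", the audio field of "example/crop", and the transcription field of "example/stt". The frames fields and the faces field are absent from it.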
Incremental Computations¶
The single most important piece of code in Metaxy is the resolve_update method. For a given feature, it takes the inputs (metadata from the upstream features), computes the expected provenances for the given feature, and compares them with the current state in the metadata store. Learn more about this process here.
The Python pipeline needs to handle the result of the resolve_update call:
with store:  # MetadataStore
    # Metaxy computes provenance_by_field and identifies changes
    increment = store.resolve_update(DownstreamFeature)
    # Process only the changed samples
The increment object has attributes for new upstream samples, samples identified as stale, and samples that have been removed from the upstream metadata.
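The three categories can be pictured as a plain dictionary comparison between expected and stored provenance. The sketch below is conceptual only: `IncrementSketch`, its attribute names, and `diff_provenance` are hypothetical and are not the attributes or functions Metaxy exposes. The provenance strings are placeholders.

```python
from dataclasses import dataclass


@dataclass
class IncrementSketch:
    """Hypothetical stand-in for the object returned by resolve_update."""

    new: set[str]      # present upstream, missing from the store
    stale: set[str]    # present in both, but the provenance differs
    removed: set[str]  # present in the store, gone from upstream


def diff_provenance(expected: dict[str, str], stored: dict[str, str]) -> IncrementSketch:
    """Compare two mappings of sample id -> provenance hash."""
    shared = expected.keys() & stored.keys()
    return IncrementSketch(
        new=expected.keys() - stored.keys(),
        stale={sample for sample in shared if expected[sample] != stored[sample]},
        removed=stored.keys() - expected.keys(),
    )


expected = {"video_001": "aaa", "video_002": "bbb2", "video_003": "ccc"}
stored = {"video_001": "aaa", "video_002": "bbb", "video_004": "ddd"}

increment = diff_provenance(expected, stored)
to_process = increment.new | increment.stale  # only these need recomputation
```

With these inputs, `video_003` is new, `video_002` is stale, `video_004` has been removed, and `video_001` is left alone.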