Design Choices¶
As discussed on the front page and the Pitch, Metaxy aims to be pluggable, reliable, scalable, and developer-friendly. Here are some of the design choices we made to achieve these goals.
Storage¶
Data vs Metadata Clarifications
Metaxy features represent tabular metadata, typically containing references to external multimodal data such as files, images, or videos.
| Subject | Description |
|---|---|
| Data | The actual multimodal data itself, such as images, audio files, video files, text documents, and other raw content that your pipelines process and transform. |
| Metadata | Information about the data, typically including references to where data is stored (e.g., object store keys), plus additional descriptive entries such as video length, file size, format, version, and other attributes. |
Metaxy does not interact with data and is not responsible for its content. As an edge case, Metaxy may also manage pure metadata tables that do not reference any external data.
Metaxy is designed to be compatible with storage systems which satisfy the following requirements:
- has an append operation
- can store map-like elements (e.g. dictionaries)
Lifting This Requirement
Unfortunately, the most popular database, PostgreSQL, does not satisfy it. While PostgreSQL is not an ideal choice for a Metaxy Metadata Store for other reasons (mainly analytical query performance), we recognize the need to support it and are exploring a solution in anam-org/metaxy#223. This requirement may not be necessary in the future.
This allows Metaxy to target modern data warehouses (e.g. ClickHouse, BigQuery, Snowflake, or the more minimalistic DuckDB) and storage formats such as DeltaLake, Iceberg, DuckLake, and anything compatible with Apache Arrow.
The Metaxy abstraction that implements these design choices and is used to interact with storage systems is known as Metadata Store.
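The two storage requirements can be pictured with a toy in-memory store. This is only a sketch: `AppendOnlyStorage` and `InMemoryStorage` are hypothetical names made up for illustration and are not part of the Metaxy API.

```python
from typing import Any, Protocol


class AppendOnlyStorage(Protocol):
    """Hypothetical interface: the only write operation is an append."""

    def append(self, table: str, rows: list[dict[str, Any]]) -> None: ...

    def read(self, table: str) -> list[dict[str, Any]]: ...


class InMemoryStorage:
    """Toy implementation used only to illustrate the two requirements."""

    def __init__(self) -> None:
        self._tables: dict[str, list[dict[str, Any]]] = {}

    def append(self, table: str, rows: list[dict[str, Any]]) -> None:
        # Requirement 1: new rows are added, existing rows are never touched.
        self._tables.setdefault(table, []).extend(dict(row) for row in rows)

    def read(self, table: str) -> list[dict[str, Any]]:
        return list(self._tables.get(table, []))


storage: AppendOnlyStorage = InMemoryStorage()
storage.append(
    "video",
    [
        {
            "id": "video_001",
            # Requirement 2: a single cell holds a map-like value.
            "metaxy_provenance_by_field": {"audio": "a7f3c2d8", "frames": "b9e1f4a2"},
        }
    ],
)
```

Any backend that can express these two operations (an append and a map-typed column) fits the same shape, which is why warehouses and Arrow-compatible formats both qualify.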
Table Schema¶
Metaxy uses the same storage layout for all storage systems. Each feature gets a separate table.
Here is what a typical Metaxy feature table looks like:
| id | metaxy_feature_version | metaxy_data_version | metaxy_data_version_by_field | metaxy_provenance | metaxy_provenance_by_field | metaxy_created_at | metaxy_updated_at | metaxy_deleted_at |
|---|---|---|---|---|---|---|---|---|
| video_001 | a1b2c3d4 | e7f8a9b0 | {"audio": "a7f3c2d8", "frames": "b9e1f4a2"} | e7f8a9b0 | {"audio": "a7f3c2d8", "frames": "b9e1f4a2"} | 2024-01-15T10:30:00Z | 2024-01-15T10:30:00Z | null |
| video_002 | a1b2c3d4 | c1e4b9d8 | {"audio": "d4b8e9c1", "frames": "f2a6d7b3"} | c1e4b9d8 | {"audio": "d4b8e9c1", "frames": "f2a6d7b3"} | 2024-01-15T10:31:00Z | 2024-01-15T10:31:00Z | null |
| video_003 | a1b2c3d4 | k1j2ah7v | {"audio": "custom01", "frames": "custom02"} | a8e2f4c9 | {"audio": "c9f2a8e4", "frames": "e7d3b1c5"} | 2024-01-15T10:32:00Z | 2024-01-16T14:20:00Z | null |
| video_001 | f5d6e7c8 | b2c3d4e5 | {"audio": "b1e4f9a7", "frames": "a8c2e6d9"} | b2c3d4e5 | {"audio": "b1e4f9a7", "frames": "a8c2e6d9"} | 2024-01-18T09:00:00Z | 2024-01-18T09:00:00Z | null |
It can also contain custom user-defined columns (1).
- and in fact, id is such a column, because ID columns are customizable
Info
metaxy_data_version/metaxy_data_version_by_field and metaxy_provenance/metaxy_provenance_by_field serve slightly different purposes.
Provenance columns hold static versioning information entirely defined by the Metaxy framework. Data version defaults to the same value as provenance, but can be customized by the user at runtime, for example by deriving it from the contents of the computed sample. Learn more here.
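The relationship between the two columns can be sketched as a simple fallback. The helpers `resolve_data_version` and `content_hash` are hypothetical and only illustrate the idea; they are not Metaxy functions, and the hashing scheme is an assumption.

```python
import hashlib
from typing import Optional


def content_hash(payload: bytes, length: int = 8) -> str:
    """Illustrative content-derived version: a truncated SHA-256 hex digest."""
    return hashlib.sha256(payload).hexdigest()[:length]


def resolve_data_version(provenance: str, custom: Optional[str] = None) -> str:
    """Return the user-supplied data version, or fall back to provenance."""
    return custom if custom is not None else provenance


# Default: the data version mirrors provenance (like video_001 in the table).
default_version = resolve_data_version("e7f8a9b0")

# Customized at runtime (like video_003 in the table).
custom_version = resolve_data_version("a8e2f4c9", custom="k1j2ah7v")

# Derived from the contents of the computed sample.
derived_version = resolve_data_version("a8e2f4c9", custom=content_hash(b"sample bytes"))
```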
All historical records for a given feature are stored in the same table. They can be separated by the following system columns:
- metaxy_feature_version is shared among multiple rows and changes whenever the code_version of the feature or of any upstream feature changes
- metaxy_data_version, metaxy_data_version_by_field, metaxy_provenance, metaxy_provenance_by_field carry versioning and provenance information about the specific row
- metaxy_created_at, metaxy_updated_at, metaxy_deleted_at make it possible to identify the latest active row for a given feature version
Metadata Operations¶
Metaxy tables are immutable. Once written, a row is never modified or deleted (1).
- but users can delete rows manually if needed
As discussed earlier, writing metadata in Metaxy is done by appending to a feature table. Subsequent writes with the same feature version effectively act as overwrites. This is achieved by filtering out older rows at read time using the metaxy_updated_at column (1). Soft-deletes are implemented as appends as well.
- also known as merge-on-read
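A minimal merge-on-read sketch in plain Python is shown below. It assumes that a soft-delete is an appended row with metaxy_deleted_at set and a newer metaxy_updated_at; that representation, along with the helper names `make_row` and `latest_active_rows`, is an assumption made for illustration rather than Metaxy's actual implementation.

```python
from typing import Any, Optional

Row = dict[str, Any]


def make_row(id_: str, feature_version: str, data_version: str,
             updated_at: str, deleted_at: Optional[str] = None) -> Row:
    """Build a row with only the columns this sketch needs."""
    return {
        "id": id_,
        "metaxy_feature_version": feature_version,
        "metaxy_data_version": data_version,
        "metaxy_updated_at": updated_at,
        "metaxy_deleted_at": deleted_at,
    }


def latest_active_rows(rows: list[Row], feature_version: str) -> list[Row]:
    """Keep the newest row per id for one feature version, then drop soft-deleted ids."""
    latest: dict[str, Row] = {}
    for row in rows:
        if row["metaxy_feature_version"] != feature_version:
            continue
        current = latest.get(row["id"])
        # ISO-8601 UTC timestamps in the same format compare correctly as strings.
        if current is None or row["metaxy_updated_at"] > current["metaxy_updated_at"]:
            latest[row["id"]] = row
    return [row for row in latest.values() if row["metaxy_deleted_at"] is None]


table = [
    make_row("video_001", "a1b2c3d4", "e7f8a9b0", "2024-01-15T10:30:00Z"),
    make_row("video_002", "a1b2c3d4", "c1e4b9d8", "2024-01-15T10:31:00Z"),
    # Appending again with the same feature version acts as an overwrite.
    make_row("video_001", "a1b2c3d4", "0d1e2f3a", "2024-01-16T08:00:00Z"),
    # Assumed soft-delete: an appended row with metaxy_deleted_at set.
    make_row("video_002", "a1b2c3d4", "c1e4b9d8", "2024-01-17T12:00:00Z",
             deleted_at="2024-01-17T12:00:00Z"),
    # Rows of another feature version share the table but are ignored here.
    make_row("video_001", "f5d6e7c8", "b2c3d4e5", "2024-01-18T09:00:00Z"),
]

active = latest_active_rows(table, "a1b2c3d4")
```

Nothing in `table` is ever mutated: the overwrite and the delete are both expressed as extra rows, and the read path decides which row wins.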
The append-only design choice has a few significant benefits:
- unlocks easier and lock-free setups for multiple writers
- ensures existing and historical metadata can never be lost or corrupted
Tip
Users can implement storage cleanup based on their specific needs and constraints.
- avoids additional write-time checks or operations, which has performance benefits
- allows Metaxy to be used with storage systems which lack ACID guarantees and do not support transactions
Info
These design choices come at the cost of increased storage usage. But storage is cheap while mistakes aren't.
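For users who do want to reclaim storage, one possible cleanup policy is to find rows that have been superseded by a newer row with the same ID and feature version. The function `split_for_cleanup` is a hypothetical helper, not something Metaxy provides, and whether superseded rows are safe to delete depends on your own retention needs.

```python
from typing import Any

Row = dict[str, Any]


def split_for_cleanup(rows: list[Row]) -> tuple[list[Row], list[Row]]:
    """Partition rows into (keep, superseded) per (id, feature version)."""
    newest: dict[tuple[str, str], Row] = {}
    for row in rows:
        key = (row["id"], row["metaxy_feature_version"])
        if key not in newest or row["metaxy_updated_at"] > newest[key]["metaxy_updated_at"]:
            newest[key] = row
    keep = list(newest.values())
    # Every row that is not the newest for its key is a cleanup candidate.
    superseded = [
        row for row in rows
        if row is not newest[(row["id"], row["metaxy_feature_version"])]
    ]
    return keep, superseded


rows = [
    {"id": "video_001", "metaxy_feature_version": "a1b2c3d4",
     "metaxy_updated_at": "2024-01-15T10:30:00Z"},
    {"id": "video_001", "metaxy_feature_version": "a1b2c3d4",
     "metaxy_updated_at": "2024-01-16T08:00:00Z"},
    {"id": "video_001", "metaxy_feature_version": "f5d6e7c8",
     "metaxy_updated_at": "2024-01-18T09:00:00Z"},
]

keep, superseded = split_for_cleanup(rows)
```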
DataFrame API¶
In order to be versatile and support different compute engines, Metaxy uses Narwhals for DataFrame manipulations.
This allows metadata store implementations to reuse the same code. Currently, only a small subset of operations requires storage-specific implementations (1).
- such as providing database-specific hashing syntax and some other operations
Compute¶
Increment resolution in Metaxy involves running computations: every time the user requests an increment for a given feature, Metaxy has to join upstream features, hash their versions, and filter out samples that have already been processed. This can be performed either locally (typically favored in development environments) or remotely (which achieves better performance in production). Metaxy supports both options: databases for remote compute and storage-only metadata stores for embedded compute (1).
- e.g. Polars or DuckDB
When resolving incremental updates for a feature, Metaxy attempts to perform all computations such as sample version calculations within the metadata store.
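The three steps of increment resolution (join, hash, filter) can be sketched in plain Python. This is a deliberately simplified model: the hashing scheme, the function names `provenance_hash` and `resolve_increment`, and the flat per-sample version (no per-field breakdown) are all assumptions for illustration and do not reflect Metaxy's actual algorithm.

```python
import hashlib


def provenance_hash(*parts: str, length: int = 8) -> str:
    """Hash version strings into a short hex digest (illustrative scheme)."""
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:length]


def resolve_increment(
    upstream: dict[str, dict[str, str]],
    existing: dict[str, str],
    feature_version: str,
) -> dict[str, str]:
    """Return {sample id: new provenance} for samples that are new or stale.

    upstream maps upstream feature name -> {sample id: data version};
    existing maps sample id -> provenance already stored for this feature.
    """
    # 1. Join: keep only ids present in every upstream feature.
    tables = list(upstream.values())
    common_ids = set(tables[0]).intersection(*tables[1:]) if tables else set()

    increment: dict[str, str] = {}
    for sample_id in sorted(common_ids):
        # 2. Hash the upstream versions together with this feature's version.
        versions = [upstream[name][sample_id] for name in sorted(upstream)]
        provenance = provenance_hash(feature_version, *versions)
        # 3. Filter: skip samples whose stored provenance is already current.
        if existing.get(sample_id) != provenance:
            increment[sample_id] = provenance
    return increment


upstream = {
    "audio": {"video_001": "a7f3c2d8", "video_002": "d4b8e9c1"},
    "frames": {"video_001": "b9e1f4a2", "video_002": "f2a6d7b3", "video_003": "e7d3b1c5"},
}

first = resolve_increment(upstream, existing={}, feature_version="a1b2c3d4")
second = resolve_increment(upstream, existing=first, feature_version="a1b2c3d4")

# An upstream change invalidates only the affected sample.
upstream["audio"]["video_002"] = "changed01"
third = resolve_increment(upstream, existing=first, feature_version="a1b2c3d4")
```

In this sketch `first` contains both joinable samples, `second` is empty because nothing changed, and `third` contains only video_002.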
When can local computations happen instead
Metaxy runs versioning computations on the local Polars versioning engine if:
- The metadata store does not have a compute engine at all: for example, DeltaLake is just a storage format.
- The user explicitly requested to keep the computations local by setting versioning_engine="polars" when instantiating the metadata store.
- A fallback store had to be used to retrieve one of the parent features missing in the current store.
None of these three cases can happen by accident: each requires preconfigured settings or explicit user action. In the third case, Metaxy also issues a warning in case the user has unintentionally configured a fallback store in production.
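The decision between the store's own engine and the local Polars engine can be summarized as a small function. Only the versioning_engine="polars" setting comes from the text above; the function name, the "auto" default, the "native" return value, and the warning message are hypothetical and chosen for illustration.

```python
import warnings


def choose_versioning_engine(
    store_has_compute: bool,
    versioning_engine: str = "auto",
    used_fallback_store: bool = False,
) -> str:
    """Return "native" or "polars" following the three cases listed above."""
    # Case 1: storage-only backends have nothing to push computations to.
    if not store_has_compute:
        return "polars"
    # Case 2: the user explicitly asked for local computations.
    if versioning_engine == "polars":
        return "polars"
    # Case 3: a parent feature came from a fallback store, so data must be
    # brought together locally; warn because this may be unintended.
    if used_fallback_store:
        warnings.warn(
            "Falling back to the local Polars engine because a fallback store was used.",
            stacklevel=2,
        )
        return "polars"
    return "native"


storage_only = choose_versioning_engine(store_has_compute=False)
explicit_local = choose_versioning_engine(True, versioning_engine="polars")
in_database = choose_versioning_engine(True)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    with_fallback = choose_versioning_engine(True, used_fallback_store=True)
```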
All metadata store implementations are guaranteed to return equivalent results. They are continuously tested against the reference Polars implementation.
🚀 What's Next?¶
- Itching to write some Metaxy code? Jump to Quickstart.
- Learn more about Metaxy concepts
- View complete, end-to-end examples
- Explore Metaxy integrations
- Invoke mx CLI from your terminal
- Learn how to configure Metaxy
- Get lost in our API Reference