
Design Choices

As discussed on the front page and the Pitch, Metaxy aims to be pluggable, reliable, scalable, and developer-friendly. Here are some of the design choices we made to achieve these goals.

Storage

Data vs Metadata Clarifications

Metaxy features represent tabular metadata, typically containing references to external multimodal data such as files, images, or videos.

| Subject  | Description |
|----------|-------------|
| Data     | The actual multimodal data itself, such as images, audio files, video files, text documents, and other raw content that your pipelines process and transform. |
| Metadata | Information about the data, typically including references to where data is stored (e.g., object store keys), plus additional descriptive entries such as video length, file size, format, version, and other attributes. |

Metaxy does not interact with data and is not responsible for its content. As an edge case, Metaxy may also manage pure metadata tables that do not reference any external data.

Metaxy is designed to be compatible with storage systems which satisfy the following requirements:

  • has an append operation

  • can store map-like elements (e.g. dictionaries)

    Lifting This Requirement

    Unfortunately, the most popular database, PostgreSQL, does not satisfy it. PostgreSQL is not an ideal choice for a Metaxy Metadata Store for other reasons as well (mainly analytical query performance), but we recognize the need to support it and are exploring a solution in anam-org/metaxy#223. This requirement may be dropped in the future.

This allows Metaxy to target modern data warehouses (e.g. ClickHouse, BigQuery, Snowflake, or the more minimalistic DuckDB) and storage formats such as DeltaLake, Iceberg, DuckLake, and anything compatible with Apache Arrow.

The Metaxy abstraction that implements these design choices and is used to interact with storage systems is known as Metadata Store.
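To make the two storage requirements concrete, here is a minimal sketch of a backend that supports appends and map-like values. It is not Metaxy's actual Metadata Store interface: `AppendOnlyStore`, `InMemoryStore`, and their methods are hypothetical names used for illustration only.

```python
from typing import Any, Protocol


class AppendOnlyStore(Protocol):
    """Hypothetical minimal contract: append rows and read them back."""

    def append(self, table: str, rows: list[dict[str, Any]]) -> None: ...

    def read(self, table: str) -> list[dict[str, Any]]: ...


class InMemoryStore:
    """Toy in-memory backend, for illustration only."""

    def __init__(self) -> None:
        self._tables: dict[str, list[dict[str, Any]]] = {}

    def append(self, table: str, rows: list[dict[str, Any]]) -> None:
        # Rows are only ever added; existing rows are never touched.
        self._tables.setdefault(table, []).extend(dict(r) for r in rows)

    def read(self, table: str) -> list[dict[str, Any]]:
        return [dict(r) for r in self._tables.get(table, [])]


store: AppendOnlyStore = InMemoryStore()
store.append(
    "video",
    [
        {
            "id": "video_001",
            # A map-like value: one version hash per field.
            "metaxy_provenance_by_field": {"audio": "a7f3c2d8", "frames": "b9e1f4a2"},
        }
    ],
)
```

Any system that can offer these two capabilities (an append and a column type that holds a dictionary) fits the requirements listed above.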

Table Schema

Metaxy uses the same storage layout for all storage systems. Each feature gets a separate table.

Here is what a typical Metaxy feature table looks like:

| id | metaxy_feature_version | metaxy_data_version | metaxy_data_version_by_field | metaxy_provenance | metaxy_provenance_by_field | metaxy_created_at | metaxy_updated_at | metaxy_deleted_at |
|----|------------------------|---------------------|------------------------------|-------------------|----------------------------|-------------------|-------------------|-------------------|
| video_001 | a1b2c3d4 | e7f8a9b0 | {"audio": "a7f3c2d8", "frames": "b9e1f4a2"} | e7f8a9b0 | {"audio": "a7f3c2d8", "frames": "b9e1f4a2"} | 2024-01-15T10:30:00Z | 2024-01-15T10:30:00Z | null |
| video_002 | a1b2c3d4 | c1e4b9d8 | {"audio": "d4b8e9c1", "frames": "f2a6d7b3"} | c1e4b9d8 | {"audio": "d4b8e9c1", "frames": "f2a6d7b3"} | 2024-01-15T10:31:00Z | 2024-01-15T10:31:00Z | null |
| video_003 | a1b2c3d4 | k1j2ah7v | {"audio": "custom01", "frames": "custom02"} | a8e2f4c9 | {"audio": "c9f2a8e4", "frames": "e7d3b1c5"} | 2024-01-15T10:32:00Z | 2024-01-16T14:20:00Z | null |
| video_001 | f5d6e7c8 | b2c3d4e5 | {"audio": "b1e4f9a7", "frames": "a8c2e6d9"} | b2c3d4e5 | {"audio": "b1e4f9a7", "frames": "a8c2e6d9"} | 2024-01-18T09:00:00Z | 2024-01-18T09:00:00Z | null |

It can also contain custom user-defined columns (1).

  1. and in fact, id is such a column, because ID columns are customizable

Info

metaxy_data_version/metaxy_data_version_by_field and metaxy_provenance/metaxy_provenance_by_field serve slightly different purposes. Provenance columns hold static versioning information defined entirely by the Metaxy framework. Data version columns default to the same values as provenance, but can be customized by the user at runtime, for example by deriving them from the contents of the computed sample. Learn more here.
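The difference between the two column families can be sketched roughly as follows. This is not Metaxy's implementation: the function names, the inputs to the provenance hash, and the truncated SHA-256 hash are all assumptions made for the example.

```python
import hashlib
from typing import Optional


def short_hash(*parts: str) -> str:
    """Illustrative hash: first 8 hex characters of SHA-256 (an assumption)."""
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:8]


def provenance(code_version: str, upstream_versions: list[str]) -> str:
    # Static information only (assumed here to be the code version
    # and the upstream versions); the computed data is never inspected.
    return short_hash(code_version, *sorted(upstream_versions))


def data_version(provenance_value: str, content: Optional[bytes] = None) -> str:
    # Defaults to provenance; optionally derived from the computed content.
    if content is None:
        return provenance_value
    return hashlib.sha256(content).hexdigest()[:8]


prov = provenance("1", ["e7f8a9b0"])
default_dv = data_version(prov)  # identical to provenance
custom_dv_a = data_version(prov, content=b"same output bytes")
custom_dv_b = data_version(prov, content=b"same output bytes")
```

In this sketch a content-derived data version depends only on the computed output, so two runs that produce identical bytes get the same value even if their provenance differs.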

All historical records for a given feature are stored in the same table. They can be separated by the following system columns:

  • metaxy_feature_version is shared among multiple rows and changes whenever the code_version of the feature or of any of its upstream features changes

  • metaxy_data_version, metaxy_data_version_by_field, metaxy_provenance, metaxy_provenance_by_field carry versioning and provenance information about the specific row

  • metaxy_created_at, metaxy_updated_at, metaxy_deleted_at make it possible to identify the latest active row for a given feature version

Metadata Operations

Metaxy tables are immutable. Once written, a row is never modified or deleted (1).

  1. but users can delete rows manually if needed

As discussed earlier, writing metadata in Metaxy is done by appending to a feature table. Subsequent writes with the same feature version effectively act as overwrites: at read time, older rows are filtered out using the metaxy_updated_at column (1). Soft-deletes are implemented as appends as well.

  1. also known as merge-on-read
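The read-side filtering described above can be sketched in plain Python. This is a simplified illustration, not Metaxy's code: `latest_active_rows` is a hypothetical helper, and it assumes that a soft-delete is an appended row with metaxy_deleted_at set.

```python
from datetime import datetime


def latest_active_rows(rows: list[dict], feature_version: str, id_column: str = "id") -> list[dict]:
    """Keep the newest row per ID for one feature version, then drop soft-deleted IDs."""
    latest: dict[str, dict] = {}
    for row in rows:
        if row["metaxy_feature_version"] != feature_version:
            continue
        current = latest.get(row[id_column])
        # A later metaxy_updated_at wins, which makes re-writes act as overwrites.
        if current is None or row["metaxy_updated_at"] > current["metaxy_updated_at"]:
            latest[row[id_column]] = row
    return [r for r in latest.values() if r["metaxy_deleted_at"] is None]


def make_row(id_, provenance, updated_at, deleted_at=None):
    return {
        "id": id_,
        "metaxy_feature_version": "a1b2c3d4",
        "metaxy_provenance": provenance,
        "metaxy_updated_at": updated_at,
        "metaxy_deleted_at": deleted_at,
    }


# The table only ever grows: every write below is an append.
table = [
    make_row("video_001", "e7f8a9b0", datetime(2024, 1, 15, 10, 30)),
    make_row("video_002", "c1e4b9d8", datetime(2024, 1, 15, 10, 31)),
    # Second write for video_001 with the same feature version: acts as an overwrite.
    make_row("video_001", "0c0ffee0", datetime(2024, 1, 16, 9, 0)),
    # Soft-delete of video_002, also expressed as an append.
    make_row("video_002", "c1e4b9d8", datetime(2024, 1, 17, 8, 0), deleted_at=datetime(2024, 1, 17, 8, 0)),
]

active = latest_active_rows(table, "a1b2c3d4")
```

All four rows remain in storage; only the read path decides which of them are visible.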

The append-only design choice has a few significant benefits:

  • unlocks easier and lock-free setups for multiple writers

  • ensures existing and historical metadata can never be lost or corrupted

Tip

Users can implement storage cleanup based on their specific needs and constraints.

  • avoids additional write-time checks or operations, which has performance benefits

  • allows Metaxy to be used with storage systems which lack ACID guarantees and do not support transactions


Info

These design choices come at the cost of increased storage usage. But storage is cheap, while mistakes aren't.

DataFrame API

In order to be versatile and support different compute engines, Metaxy uses Narwhals for DataFrame manipulations.

This allows metadata store implementations to reuse the same code. Currently only a thin subset requires storage-specific implementations (1).

  1. such as providing database-specific hashing syntax and some other operations

Compute

Increment resolution in Metaxy involves running computations: every time the user requests an increment for a given feature, Metaxy has to join upstream features, hash their versions, and filter out samples that have already been processed. This can be performed either locally (typically favored in development environments) or remotely (achieves better performance in production). Metaxy supports both options: databases for remote compute and storage-only metadata stores for embedded compute (1).

  1. e.g. Polars or DuckDB
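The three steps of increment resolution (join, hash, filter) can be sketched as follows. This is a toy pure-Python version, not Metaxy's versioning engine: `resolve_increment` is a hypothetical function, the inner join on sample ID and the truncated SHA-256 hash are assumptions, and at least one upstream feature is required.

```python
import hashlib


def resolve_increment(upstreams: dict[str, dict[str, str]], existing: dict[str, str]) -> dict[str, str]:
    """Return {sample_id: new_version} for samples that are new or stale.

    upstreams: upstream feature name -> {sample_id: data_version}
    existing:  sample_id -> version already stored for the downstream feature
    """
    # 1. Join: keep only sample IDs present in every upstream feature.
    common_ids = set.intersection(*(set(v) for v in upstreams.values()))
    increment: dict[str, str] = {}
    for sample_id in sorted(common_ids):
        # 2. Hash the upstream versions in a deterministic order.
        joined = "|".join(f"{name}={upstreams[name][sample_id]}" for name in sorted(upstreams))
        version = hashlib.sha256(joined.encode()).hexdigest()[:8]
        # 3. Filter: skip samples whose stored version already matches.
        if existing.get(sample_id) != version:
            increment[sample_id] = version
    return increment


upstreams = {
    "audio": {"video_001": "a7f3c2d8", "video_002": "d4b8e9c1"},
    "frames": {"video_001": "b9e1f4a2", "video_002": "f2a6d7b3"},
}

first = resolve_increment(upstreams, existing={})      # everything is new
second = resolve_increment(upstreams, existing=first)  # nothing left to process

upstreams["audio"]["video_002"] = "changed00"          # one upstream sample changes
third = resolve_increment(upstreams, existing=first)   # only video_002 is stale
```

Whether this logic runs inside a database or in an embedded engine, the inputs and the result are the same; only the place where the join and hashing execute differs.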

When resolving incremental updates for a feature, Metaxy attempts to perform all computations such as sample version calculations within the metadata store.

When can local computations happen instead

Metaxy falls back to the local Polars versioning engine if:

  1. The metadata store does not have a compute engine at all: for example, DeltaLake is just a storage format.

  2. The user explicitly requested to keep the computations local by setting versioning_engine="polars" when instantiating the metadata store.

  3. A fallback store had to be used to retrieve one of the parent features missing in the current store.

None of these three cases can occur by accident: each requires preconfigured settings or explicit user action. In the third case, Metaxy also issues a warning, in case the user has unintentionally configured a fallback store in production.
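The decision described by the three cases can be summarized as a small function. This is only a sketch of the logic, not Metaxy's code: `choose_versioning_engine`, its parameters, and the "auto" and "native" values are hypothetical; only versioning_engine="polars" comes from the text above.

```python
import warnings


def choose_versioning_engine(
    store_has_compute: bool,
    versioning_engine: str = "auto",
    used_fallback_store: bool = False,
) -> str:
    """Return "polars" for local computation, "native" for in-store computation."""
    # Case 1: storage-only formats have no compute engine of their own.
    if not store_has_compute:
        return "polars"
    # Case 2: the user explicitly asked for local computation.
    if versioning_engine == "polars":
        return "polars"
    # Case 3: a parent feature had to be loaded from a fallback store.
    if used_fallback_store:
        warnings.warn(
            "A fallback store was used; versioning runs locally on Polars.",
            stacklevel=2,
        )
        return "polars"
    return "native"


delta_like = choose_versioning_engine(store_has_compute=False)
database = choose_versioning_engine(store_has_compute=True)
forced_local = choose_versioning_engine(store_has_compute=True, versioning_engine="polars")
```

Only the third branch emits a warning, matching the note above that a fallback store in production may be unintentional.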

All metadata store implementations are guaranteed to return equivalent results. They are continuously tested against the reference Polars implementation.

🚀 What's Next?