Metaxy


Metaxy is a pluggable metadata layer for building multimodal Data and ML pipelines. It manages and tracks metadata across complex computational graphs and implements sample and sub-sample versioning, allowing the codebase to evolve over time without friction.

The problem: sample-level versioning

Info

Data, ML and AI workloads that process large amounts of images, video, audio, or text (1) can be very expensive to run. Unlike in traditional data engineering, re-running the whole pipeline on every change is no longer an option. It therefore becomes crucial to correctly implement incremental processing, sample-level versioning and prunable updates.

  1. or really any kind of data

These workloads evolve all the time, with new data being shipped, bugfixes or algorithm changes introduced, and new features added to the pipeline. This means the pipeline has to be re-computed frequently, but at the same time it's important to avoid unnecessary recomputations for individual data samples.
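
To make the idea of avoiding unnecessary recomputation concrete, here is a minimal plain-Python sketch of sample-level incremental processing. It is not Metaxy's API: the helper names (`sample_version`, `samples_to_process`), the sample IDs and the content hashes are all assumptions made up for illustration. Each sample gets a version key derived from its input hash and the processing code version, and only samples whose key changed (or is missing) are selected for recomputation.

```python
import hashlib


def sample_version(content_hash: str, code_version: str) -> str:
    """Derive a version key from a sample's input hash and the processing code version."""
    payload = f"{content_hash}|{code_version}".encode()
    return hashlib.sha256(payload).hexdigest()[:16]


def samples_to_process(inputs: dict, stored: dict, code_version: str) -> list:
    """Return IDs whose current version key is missing from, or differs from, the stored one."""
    stale = []
    for sample_id, content_hash in inputs.items():
        if stored.get(sample_id) != sample_version(content_hash, code_version):
            stale.append(sample_id)
    return sorted(stale)


# Hypothetical state: version keys recorded by a previous run with code version "1".
stored = {
    "video_a": sample_version("hash-a", "1"),
    "video_b": sample_version("hash-b", "1"),
}

# New run: video_b was re-uploaded, video_c is new, video_a is unchanged.
inputs = {"video_a": "hash-a", "video_b": "hash-b2", "video_c": "hash-c"}

print(samples_to_process(inputs, stored, code_version="1"))  # ['video_b', 'video_c']
print(samples_to_process(inputs, stored, code_version="2"))  # all three: the code changed
```

With the code version unchanged, only the modified and the new sample are selected; bumping the code version invalidates every sample.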

Info

Unnecessary recomputations can waste tens of thousands of dollars on compute, and battling sample-level orchestration complexity can cost even more in engineering effort.

Data vs Metadata Clarifications

Metaxy features represent tabular metadata, typically containing references to external multimodal data such as files, images, or videos.

Subject | Description
Data | The actual multimodal data itself, such as images, audio files, video files, text documents, and other raw content that your pipelines process and transform.
Metadata | Information about the data, typically including references to where data is stored (e.g., object store keys), plus additional descriptive entries such as video length, file size, format, version, and other attributes.

Metaxy does not interact with data and is not responsible for its content. As an edge case, Metaxy may also manage pure metadata tables that do not reference any external data.

Counterintuitively, spinning up a thousand compute nodes is the much easier task, with established solutions available, while sample-level versioning remains a challenging problem (1).

  1. it is hard to overestimate the amount of pain @danielgafni has endured before building Metaxy

The solution

Until recently, no general solution to this problem existed. Now there is one 🎉!

Metaxy allows creating and updating feature definitions which can independently version different fields of the same data sample and express granular field-level lineage.
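
The idea of independently versioned fields with field-level lineage can be sketched in plain Python. This is an illustration under stated assumptions, not Metaxy's feature definition API: the `FIELDS` mapping, the field names and the helpers `field_version` and `changed_fields` are hypothetical. Each field's version is a hash of its own code version and the versions of the upstream fields it reads, so a change propagates only along the declared lineage.

```python
import hashlib

# Hypothetical pipeline: field name -> (code version, upstream fields it reads).
FIELDS = {
    "video.frames": ("1", []),
    "video.audio": ("1", []),
    "faces.boxes": ("1", ["video.frames"]),
    "speech.transcript": ("1", ["video.audio"]),
}


def field_version(name: str, fields: dict) -> str:
    """Hash a field's own code version together with the versions of its upstream fields."""
    code_version, deps = fields[name]
    parts = [name, code_version] + [field_version(dep, fields) for dep in sorted(deps)]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:12]


def changed_fields(old: dict, new: dict) -> list:
    """List fields whose version differs between two pipeline definitions."""
    return sorted(
        name
        for name in new
        if name not in old or field_version(name, old) != field_version(name, new)
    )


# Bump only the audio extraction code.
updated = dict(FIELDS)
updated["video.audio"] = ("2", [])

print(changed_fields(FIELDS, updated))  # ['speech.transcript', 'video.audio']
```

In this sketch `faces.boxes` keeps its version because it reads only `video.frames`, so the face detection step would not need to rerun after an audio-only change.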

Just Use Metaxy

Metaxy has quite a few superpowers:

  • Cache every single sample in the data pipeline. Millions of cache keys can be calculated in under a second (1). Benefit from prunable partial updates.
  • Freedom from storage lock-in. Swap storage backends in development and production environments without breaking a sweat (2).
  • Metaxy is pluggable, declarative, composable and extensible (3): use it to build custom integrations and workflows, benefit from emergent capabilities that enable tooling, visualizations and optimizations you didn't even plan for.
  1. Our experience at Anam with ClickHouse
  2. For example, develop against DeltaLake and scale production with ClickHouse without code changes.
  3. See our official integrations here
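
The storage point above rests on a general design principle: pipeline code talks to an interface, not to a concrete backend. The sketch below illustrates that principle with the Python standard library only. It is not Metaxy's store API; `VersionStore`, `InMemoryStore`, `SqliteStore` and `record_if_changed` are hypothetical names, and SQLite merely stands in for a real production backend.

```python
import sqlite3
from typing import Optional, Protocol


class VersionStore(Protocol):
    """Hypothetical minimal interface; not Metaxy's actual store API."""

    def put(self, sample_id: str, version: str) -> None: ...
    def get(self, sample_id: str) -> Optional[str]: ...


class InMemoryStore:
    """Development backend: a plain dictionary."""

    def __init__(self) -> None:
        self._rows = {}

    def put(self, sample_id: str, version: str) -> None:
        self._rows[sample_id] = version

    def get(self, sample_id: str) -> Optional[str]:
        return self._rows.get(sample_id)


class SqliteStore:
    """Stand-in for a database-backed store."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS versions (sample_id TEXT PRIMARY KEY, version TEXT)"
        )

    def put(self, sample_id: str, version: str) -> None:
        self._conn.execute("INSERT OR REPLACE INTO versions VALUES (?, ?)", (sample_id, version))
        self._conn.commit()

    def get(self, sample_id: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT version FROM versions WHERE sample_id = ?", (sample_id,)
        ).fetchone()
        return row[0] if row else None


def record_if_changed(store: VersionStore, sample_id: str, version: str) -> bool:
    """Pipeline code depends only on the interface, so either backend works unchanged."""
    if store.get(sample_id) == version:
        return False
    store.put(sample_id, version)
    return True


for store in (InMemoryStore(), SqliteStore()):
    first = record_if_changed(store, "video_a", "v1")
    second = record_if_changed(store, "video_a", "v1")
    print(type(store).__name__, first, second)  # True False for both backends
```

Swapping the backend changes only the object passed to `record_if_changed`; the function itself stays the same.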

And now the killer feature:

Super Granular Data Versioning

The feature that makes Metaxy really stand out is the ability to identify prunable partial data updates (1) and skip unnecessary downstream computations. At the time of writing, Metaxy is the only available tool that tackles this problem.

  1. which are very common in multimodal pipelines

Read The Pitch to be impressed even more.

All of this is possible thanks to Narwhals, Ibis, and a few clever tricks (1).

  1. we really do stand on the shoulders of giants

Reliability

Metaxy was designed to handle large amounts of metadata in distributed environments. It makes very few assumptions about usage patterns and is non-invasive to the rest of the data pipeline.

Metaxy is fanatically tested across all supported metadata stores, Python versions and platforms [1]. We guarantee versioning consistency across the supported metadata stores.

We have been dogfooding Metaxy at Anam since December 2025. We are running it in production with ClickHouse, Dagster, and Ray (1), and it's powering all our pipelines that prepare training data for our video generation models.

  1. and integrations with these tools are probably the most complete at the moment

That being said, Metaxy is still an early project: the core functionality is rock solid, but expect some rough edges elsewhere.

Installation

Install Metaxy from PyPI:

uv add metaxy

Quickstart

Tip

Itching to get your hands dirty? Head to Quickstart.

What's Next?

Here are a few more useful links:


  1. The CLI is not tested on Windows yet.