Metadata Stores

Metaxy abstracts interactions with metadata stored in external systems, such as databases, files, or object stores, through a unified interface: MetadataStore. Each MetadataStore implementation adapts this interface to the design of a particular storage backend.

All metadata store operations can reference features through any of the supported syntactic shortcuts. In practice, it is usually most convenient to pass either feature classes or stringified feature keys.

Metadata stores accept Narwhals-compatible dataframes and return Narwhals dataframes. In practice, Metaxy has been tested with pandas, Polars, and Ibis dataframes.

Instantiation

There are generally two ways to create a MetadataStore. We are going to demonstrate both with DeltaLake as an example.

  1. Using the Python API directly:

    from metaxy.ext.polars.handlers.delta import DeltaMetadataStore
    
    store = DeltaMetadataStore(root_path="/path/to/directory")
    
  2. Via Metaxy configuration:

    First, create a metaxy.toml file:

    metaxy.toml
    [stores.dev]
    type = "metaxy.ext.polars.DeltaMetadataStore"
    root_path = "/path/to/directory"
    

    Now the metadata store can be constructed from a MetaxyConfig instance, obtained either from the already-loaded configuration or by initializing Metaxy:

    import metaxy as mx
    
    config = mx.MetaxyConfig.get()
    store = config.get_store("dev")
    

    import metaxy as mx
    
    config = mx.init()
    store = config.get_store("dev")
    

Now the store is ready to be used. We'll also assume a MyFeature feature class (1) has been defined.

  1. with "my/feature" key

Writing Metadata

In order to write metadata to a metadata store, you can use the MetadataStore.write method:

Example

with store.open("w"):
    store.write(MyFeature, df)

Subsequent writes effectively overwrite the previous metadata, although physically they append rows to the same table.
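This append-but-overwrite behavior can be illustrated with a minimal, self-contained sketch (plain Python, not the actual Metaxy implementation; all names here are illustrative): each write appends rows tagged with a monotonically increasing write ID, and reads keep only the latest row per sample.

```python
from itertools import count

class AppendOnlyTable:
    """Toy model of write-as-append with read-as-latest semantics."""

    def __init__(self):
        self._rows = []           # every row ever written is kept
        self._write_id = count()  # monotonically increasing write counter

    def write(self, rows):
        wid = next(self._write_id)
        for row in rows:
            self._rows.append({**row, "_write_id": wid})

    def read(self):
        # For each sample id, keep only the row from the latest write.
        latest = {}
        for row in self._rows:
            key = row["sample_id"]
            if key not in latest or row["_write_id"] >= latest[key]["_write_id"]:
                latest[key] = row
        return [{k: v for k, v in r.items() if k != "_write_id"}
                for r in latest.values()]

table = AppendOnlyTable()
table.write([{"sample_id": 1, "score": 0.2}])
table.write([{"sample_id": 1, "score": 0.9}])  # overwrites sample 1 for readers
print(table.read())  # only the latest row per sample is visible
```

Physically, both rows remain in the table; logically, a reader only ever sees the latest one.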

Flushing Metadata In The Background

It is usually desirable to write metadata to the metadata store as soon as it becomes available, so that the pipeline can resume processing after a failure without losing data. BufferedMetadataWriter can be used to achieve this: it writes metadata in real time from a background thread.
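BufferedMetadataWriter's exact API isn't shown here; the following is a generic sketch of the underlying pattern only (a queue drained by a background thread, so producers never block on storage I/O), with all names invented for illustration:

```python
import queue
import threading

class BackgroundWriter:
    """Generic sketch of a background-flushing writer (not Metaxy's actual class)."""

    def __init__(self, sink):
        self._sink = sink                  # callable that persists a batch
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def submit(self, batch):
        self._queue.put(batch)             # returns immediately; no storage I/O here

    def _drain(self):
        while True:
            batch = self._queue.get()
            if batch is None:              # sentinel: stop draining
                break
            self._sink(batch)

    def close(self):
        self._queue.put(None)
        self._thread.join()

written = []
writer = BackgroundWriter(written.append)
writer.submit({"sample_id": 1, "status": "done"})
writer.submit({"sample_id": 2, "status": "done"})
writer.close()  # drains remaining batches before returning
```

The sentinel-based shutdown guarantees that everything submitted before `close()` is persisted before the call returns.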

Reading Metadata

Metadata can be retrieved using the MetadataStore.read method:

Example

with store.open("w"):
    store.write("my/feature", df)  # string keys work as well

with store:
    df = store.read("my/feature")

By default, Metaxy drops historical records with the same feature version, which makes the write-read sequence idempotent for an outside observer.

Resolving Incremental Updates

Increments can be computed using the MetadataStore.resolve_update method:

Example

with store.open("w"):
    inc = store.resolve_update("my/feature")

The returned Increment (or LazyIncrement) holds fresh samples that haven't been processed yet, stale samples that need to be processed again, and orphaned samples that are no longer present in upstream features and may be deleted.

Tip

Root features (1) require the samples argument to be set as well, since Metaxy would not be able to load upstream metadata automatically.

  1. features that do not have upstream features

It is up to the caller to decide how to handle the processing and potential deletion of orphaned samples.

Once processing is complete, the caller is expected to call MetadataStore.write to record metadata about the processed samples.
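The fresh/stale/orphaned classification can be sketched with simple set logic over sample IDs and provenance hashes (a toy model with hypothetical field names, not Metaxy's actual computation, which is described in the pages linked below):

```python
def resolve_update(upstream, downstream):
    """Toy increment resolution.

    upstream / downstream: dicts mapping sample_id -> provenance hash.
    Returns (fresh, stale, orphaned) sets of sample ids.
    """
    fresh = set(upstream) - set(downstream)            # never processed downstream
    stale = {s for s in upstream.keys() & downstream.keys()
             if upstream[s] != downstream[s]}          # provenance changed upstream
    orphaned = set(downstream) - set(upstream)         # no longer present upstream
    return fresh, stale, orphaned

upstream = {"a": "h1", "b": "h2", "c": "h3"}
downstream = {"b": "h2", "c": "old", "d": "h4"}
fresh, stale, orphaned = resolve_update(upstream, downstream)
print(fresh, stale, orphaned)  # {'a'} {'c'} {'d'}
```

Sample "b" appears in neither set: it exists on both sides with matching provenance, so it needs no work.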

Custom staleness conditions

By default, resolve_update only marks samples as stale when their upstream provenance has changed. The staleness_predicates parameter allows marking additional records as stale based on arbitrary conditions. This is useful for forcing reprocessing after a bug fix that affected metadata (1), backfilling records that were processed with incomplete data, or invalidating samples that meet certain quality criteria.

  1. and not data, because in this case you should be changing the feature version

Predicates are Narwhals expressions evaluated against stored metadata. When multiple predicates are provided, a sample is considered stale if it matches any of them.

Reprocess failed or incomplete samples

import narwhals as nw

with store.open("w"):
    inc = store.resolve_update(
        "my/feature",
        staleness_predicates=[
            nw.col("status") == "failed",
            nw.col("embedding_dim").is_null(),
        ],
    )

Any record where status is "failed" or embedding_dim is null will appear in inc.stale, even if its upstream provenance has not changed.

Where are increments computed?

Learn more here.

How are increments computed?

Learn more here.

Deleting Metadata

To delete rows from a metadata store, call MetadataStore.delete and provide conditions to identify rows to be deleted:

from datetime import datetime, timedelta, timezone

import narwhals as nw

with store.open("w"):
    store.delete(
        MyFeature,
        filters=[nw.col("metaxy_created_at") < datetime.now(timezone.utc) - timedelta(days=30)],
    )

Metaxy supports two deletion modes: soft deletes that only mark records as deleted, and hard deletes that physically remove records from storage. Soft deletion is enabled by default.

Soft deletes

Soft deletes mark records as deleted without physically removing them. This is achieved by appending new rows with the metaxy_deleted_at system column set to the deletion timestamp. These records remain in storage and can still be queried if needed.
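The tombstone mechanism can be illustrated with a small stand-alone sketch (plain Python; the column name metaxy_deleted_at matches the real system column, everything else is illustrative): deletion appends a copy of each matching row with the deletion timestamp set, and default reads filter those rows out.

```python
from datetime import datetime, timezone

rows = [
    {"sample_id": 1, "status": "pending", "metaxy_deleted_at": None},
    {"sample_id": 2, "status": "done", "metaxy_deleted_at": None},
]

def soft_delete(rows, predicate):
    # Append tombstone copies instead of removing matching rows.
    now = datetime.now(timezone.utc)
    rows.extend({**r, "metaxy_deleted_at": now} for r in list(rows) if predicate(r))

def read(rows, include_soft_deleted=False):
    # The latest version of each sample wins, so tombstones shadow live rows.
    latest = {}
    for r in rows:
        latest[r["sample_id"]] = r
    visible = list(latest.values())
    if not include_soft_deleted:
        visible = [r for r in visible if r["metaxy_deleted_at"] is None]
    return visible

soft_delete(rows, lambda r: r["status"] == "pending")
print(len(read(rows)))                             # 1: sample 1 is hidden
print(len(read(rows, include_soft_deleted=True)))  # 2: the tombstone is visible
```

Note that the original "pending" row is still physically present; the tombstone merely shadows it on read.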

By default, MetadataStore.read filters out soft-deleted records. In order to disable this filtering, set include_soft_deleted to True:

import narwhals as nw

with store.open("w"):
    store.delete(
        MyFeature,
        filters=nw.col("status") == "pending",
    )

with store:
    active = store.read(MyFeature)
    all_rows = store.read(MyFeature, include_soft_deleted=True)

Hard deletes

Hard deletes permanently remove rows from storage and can be enabled by setting soft to False:

import narwhals as nw

with store.open("w"):
    store.delete(
        MyFeature,
        filters=nw.col("quality") < 0.8,
        soft=False,
    )

Deleting metadata from CLI

It is possible to delete metadata from the command line:

# Soft delete by default
metaxy metadata delete --feature predictions --filter "confidence < 0.3"

# Hard delete
metaxy metadata delete --feature predictions --filter "created_at < '2024-01-01'" --soft=false

Learn more in the CLI reference.

Rebasing Metadata Versions

When a feature definition changes but the underlying computation stays the same (e.g., dependency graph refactoring, field renaming, code reorganization), existing metadata can be rebased onto the new feature version using MetadataStore.rebase. This recalculates provenance based on the target feature graph while preserving all user data columns.

rebase takes a dataframe of existing metadata, typically acquired with MetadataStore.read (1). The returned frame includes the target feature and project version columns, so pass preserve_feature_version=True to MetadataStore.write to retain them.

  1. With non-default version filtering. See an example below.

Example

import narwhals as nw

from metaxy.models.constants import METAXY_FEATURE_VERSION

with store.open("w"):
    existing = store.read(
        "example/child",
        with_feature_history=True,
        # an older feature version to be rebased
        filters=[nw.col(METAXY_FEATURE_VERSION) == "abc123"],
    )
    rebased = store.rebase(
        "example/child",
        existing,
        to_feature_version="def456",  # new feature version
    )
    store.write("example/child", rebased, preserve_feature_version=True)

Rebasing is also available as a CLI command with simplified options:

$ metaxy metadata rebase example/child --from abc123 --to def456

Tip

Use --dry-run to preview how many rows would be affected without writing the result.

Fallback Stores

Metaxy metadata stores can be configured to pull missing metadata from another store. This is very useful for local and testing workflows, because it avoids materializing the entire data pipeline locally: Metaxy stores can automatically pull missing metadata from production instead.

Example Metaxy configuration:

metaxy.toml
[stores.dev]
type = "metaxy.ext.polars.DeltaMetadataStore"
root_path = "${HOME}/.metaxy/dev"
fallback_stores = ["prod"]

[stores.prod]
type = "metaxy.ext.polars.DeltaMetadataStore"
root_path = "s3://my-prod-bucket/metaxy"

Warning

Currently, "missing metadata" detection works by checking whether the feature table exists in the store. This works in conjunction with automatic table creation, but breaks down if empty tables are pre-created by, e.g., migration tooling or CI/CD workflows. This will be improved in the future.

Metaxy doesn't mix metadata from different stores: either the entire feature is going to be pulled from the fallback store, or the primary store will be used.

Fallback stores can be chained at arbitrary depth.
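The all-or-nothing fallback resolution can be sketched as follows (illustrative code, not the Metaxy implementation; class and attribute names are invented): a store checks whether the feature's table exists locally and, if not, delegates the entire read to the next store in the chain.

```python
class FallbackStore:
    """Toy store that delegates missing features to a fallback chain."""

    def __init__(self, name, tables, fallback=None):
        self.name = name
        self.tables = tables      # feature key -> list of metadata rows
        self.fallback = fallback  # next store in the chain, or None

    def read(self, feature):
        # All-or-nothing: if the table exists here, never mix in fallback rows.
        if feature in self.tables:
            return self.name, self.tables[feature]
        if self.fallback is not None:
            return self.fallback.read(feature)
        raise KeyError(feature)

# A three-store chain: dev -> staging -> prod.
prod = FallbackStore("prod", {"my/feature": [{"sample_id": 1}]})
staging = FallbackStore("staging", {}, fallback=prod)
dev = FallbackStore("dev", {"other/feature": []}, fallback=staging)

print(dev.read("my/feature"))     # pulled from 'prod' through the chain
print(dev.read("other/feature"))  # served locally by 'dev'
```

Because resolution recurses into the fallback, chains of arbitrary depth work without any special handling.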

Map Datatype

Metaxy internally uses dictionary-like columns for its field-level versioning. Most storage systems and Apache Arrow have native support for the Map type, but Polars doesn't: it converts Map columns to List(Struct(key, value)) (physically equivalent to Map). This means that:

  1. user-defined Map columns lose their type when round-tripped through Polars

  2. Metaxy has to represent field-versioning columns as Struct instead, which is far from ideal: the struct fields change over time, causing problems with some storage systems.
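The List(Struct(key, value)) encoding can be shown with a small round-trip sketch (plain Python; real backends perform this conversion at the Arrow/Polars type level, not on dicts):

```python
def map_to_list_of_structs(m):
    # How Polars physically represents a Map column: List(Struct(key, value)).
    return [{"key": k, "value": v} for k, v in m.items()]

def list_of_structs_to_map(entries):
    # The reverse conversion; the *logical* Map type information is already lost.
    return {e["key"]: e["value"] for e in entries}

field_versions = {"text": "v1", "embedding": "v3"}
encoded = map_to_list_of_structs(field_versions)
print(encoded)
assert list_of_structs_to_map(encoded) == field_versions  # values survive, the type doesn't
```

The values round-trip cleanly, but any downstream consumer now sees a list of structs rather than a Map, which is exactly the problem described above.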

Note

This is also known as "The Map Hell" problem (a term I invented).

Experimental Map Datatype Support

These problems can be solved with the enable_map_datatype configuration option:

metaxy.toml
enable_map_datatype = true

Experimental

Map datatype support is experimental and requires the narwhals-map and polars-map packages to be installed. They are shipped with metaxy[map] extra.

When enabled, Metaxy uses narwhals-map to represent Map columns as narwhals_map.Map across all dataframe backends. Metadata stores consume and return Narwhals frames with narwhals_map.Map and polars_map.Map columns, keeping Map columns intact across operations and preserving user-defined Map columns.

The following metadata stores support Map columns when enable_map_datatype is enabled:

Map support in Narwhals

Narwhals does not have native Map support yet (see the tracking issue). Metaxy uses narwhals-map to bridge this gap, providing a narwhals_map.Map datatype and the nw.Expr.map namespace for working with Map columns across backends.

Collecting results with Map columns

Standard narwhals.DataFrame.to_polars() and other conversion methods are not aware of narwhals_map.Map columns and will lose them when converting from other dataframe backends. Use collect_to_polars or collect_to_arrow to materialize lazy frames while preserving Map columns.

from metaxy.utils import collect_to_polars, collect_to_arrow

df = collect_to_polars(lazy_frame)    # -> pl.DataFrame with `polars_map.Map` columns
table = collect_to_arrow(lazy_frame)  # -> pa.Table with native `MapArray` columns

Metadata Store Implementations

Metaxy provides ready-made MetadataStore implementations for popular databases and storage systems.