Metadata Stores¶
Metaxy abstracts interactions with metadata stored in external systems, such as databases, files, or object stores, behind a unified interface: MetadataStore. Each MetadataStore implementation encapsulates the design choices of its backing storage system.
All metadata store operations can reference features using any of the supported syntactic-sugar alternatives. In practice, it is usually most convenient to pass either feature classes or stringified feature keys.
Metadata stores accept Narwhals-compatible dataframes and return Narwhals dataframes. In practice, Metaxy has been tested with Pandas, Polars, and Ibis dataframes.
Instantiation¶
There are generally two ways to create a MetadataStore. We are going to demonstrate both with DeltaLake as an example.
- Using the Python API directly.
- Via Metaxy configuration. First, create a metaxy.toml file:

[stores.dev]
type = "metaxy.ext.polars.DeltaMetadataStore"
root_path = "/path/to/directory"

Now the metadata store can be constructed from a MetaxyConfig instance.
Now the store is ready to be used. We'll also assume there is a MyFeature feature class prepared, with the "my/feature" key.
Writing Metadata¶
In order to write metadata to a metadata store, you can use the MetadataStore.write method:
Subsequent writes effectively overwrite the previous metadata from a reader's perspective, while physically appending rows to the same table.
Flushing Metadata In The Background
It is usually desirable to write metadata to the metadata store as soon as it becomes available, so that the pipeline can resume processing after a failure without losing data.
BufferedMetadataWriter can be used to achieve this: it writes metadata in real-time from a background thread.
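The underlying pattern can be illustrated with a plain-Python sketch (this is a conceptual illustration, not Metaxy's actual BufferedMetadataWriter API): a background thread drains a queue of metadata batches and flushes each one as soon as it arrives.

```python
import queue
import threading

class BackgroundFlusher:
    """Conceptual sketch: flush batches from a queue on a background thread."""

    def __init__(self, flush):
        self._flush = flush  # callable invoked for every buffered batch
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, batch):
        # Hand a batch to the background thread; returns immediately.
        self._queue.put(batch)

    def close(self):
        self._queue.put(None)  # sentinel: tell the worker to stop
        self._thread.join()

    def _run(self):
        while True:
            batch = self._queue.get()
            if batch is None:
                break
            self._flush(batch)

# Usage: metadata batches are flushed as soon as they are produced.
written = []
flusher = BackgroundFlusher(written.append)
flusher.submit({"sample": "a", "status": "ok"})
flusher.submit({"sample": "b", "status": "ok"})
flusher.close()
```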
Reading Metadata¶
Metadata can be retrieved using the MetadataStore.read method:
Example
By default, Metaxy drops historical records with the same feature version, which makes the write-read sequence idempotent for an outside observer.
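Conceptually, the write/read semantics look like the following plain-Python sketch (an illustration of the idea, not Metaxy's implementation): writes only ever append rows, while reads keep the latest record per sample.

```python
# Conceptual sketch of append-on-write, latest-wins-on-read semantics.
table = []  # the physical table only ever grows

def write(rows):
    table.extend(rows)

def read():
    latest = {}
    for row in table:  # later rows win for the same sample id
        latest[row["sample"]] = row
    return list(latest.values())

write([{"sample": "a", "score": 0.1}])
write([{"sample": "a", "score": 0.9}])  # "overwrites" the earlier row

assert len(table) == 2                            # both rows physically stored
assert read() == [{"sample": "a", "score": 0.9}]  # readers see only the latest
```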
Resolving Incremental Updates¶
Increments can be computed using the MetadataStore.resolve_update method:
The returned Increment (or LazyIncrement) holds fresh samples that haven't been processed yet, stale samples that need to be processed again, and orphaned samples that are no longer present in upstream features and may be deleted.
Tip
Root features (i.e. features that do not have upstream features) require the samples argument to be set as well, since Metaxy would not be able to load upstream metadata automatically.
It is up to the caller to decide how to handle the processing and potential deletion of orphaned samples.
Once processing is complete, the caller is expected to call MetadataStore.write to record metadata about the processed samples.
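The three categories returned by resolve_update can be illustrated with simple set logic (a conceptual sketch, not Metaxy's provenance-based algorithm):

```python
# Conceptual sketch: classify samples by comparing upstream state
# (sample id -> provenance) against previously recorded metadata.
upstream = {"a": "v1", "b": "v2", "c": "v1"}
recorded = {"b": "v1", "c": "v1", "d": "v1"}

fresh = {s for s in upstream if s not in recorded}
stale = {s for s in upstream if s in recorded and upstream[s] != recorded[s]}
orphaned = {s for s in recorded if s not in upstream}

assert fresh == {"a"}     # never processed before
assert stale == {"b"}     # upstream provenance changed
assert orphaned == {"d"}  # no longer present upstream
```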
Custom staleness conditions¶
By default, resolve_update only marks samples as stale when their upstream provenance has changed. The staleness_predicates parameter allows marking additional records as stale based on arbitrary conditions.
This is useful for forcing reprocessing after a bug fix that affected metadata (and not data; in that case you should be changing the feature version instead), backfilling records that were processed with incomplete data, or invalidating samples that meet certain quality criteria.
Predicates are Narwhals expressions evaluated against stored metadata. When multiple predicates are provided, a sample is considered stale if it matches any of them.
Reprocess failed or incomplete samples
import narwhals as nw

with store.open("w"):
    inc = store.resolve_update(
        "my/feature",
        staleness_predicates=[
            nw.col("status") == "failed",
            nw.col("embedding_dim").is_null(),
        ],
    )
Any record where status is "failed" or embedding_dim is null will appear in inc.stale, even if its upstream provenance has not changed.
Where are increments computed?
Learn more here.
How are increments computed?
Learn more here.
Deleting Metadata¶
To delete rows from a metadata store, call MetadataStore.delete and provide conditions to identify rows to be deleted:
from datetime import datetime, timedelta, timezone

import narwhals as nw

with store.open("w"):
    store.delete(
        MyFeature,
        filters=[nw.col("metaxy_created_at") < datetime.now(timezone.utc) - timedelta(days=30)],
    )
Metaxy supports two deletion modes: soft deletes that only mark records as deleted, and hard deletes that physically remove records from storage. Soft deletion is enabled by default.
Soft deletes¶
Soft deletes mark records as deleted without physically removing them. This is achieved by appending new rows with the metaxy_deleted_at system column set to the deletion timestamp. These records remain available and can still be queried if needed.
By default, MetadataStore.read filters out soft-deleted records. In order to disable this filtering, set include_soft_deleted to True:
import narwhals as nw

with store.open("w"):
    store.delete(
        MyFeature,
        filters=nw.col("status") == "pending",
    )

with store:
    active = store.read(MyFeature)
    all_rows = store.read(MyFeature, include_soft_deleted=True)
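The mechanism behind soft deletes can be sketched in plain Python (a conceptual illustration, not Metaxy's implementation): deletion appends a tombstone row, and reads filter tombstoned samples out unless asked otherwise.

```python
from datetime import datetime, timezone

# Conceptual sketch: an append-only table where the latest row per sample wins.
table = [
    {"sample": "a", "deleted_at": None},
    {"sample": "b", "deleted_at": None},
]

def soft_delete(sample):
    # Append a tombstone row; history stays intact.
    table.append({"sample": sample, "deleted_at": datetime.now(timezone.utc)})

def read(include_soft_deleted=False):
    latest = {}
    for row in table:  # later rows win for the same sample id
        latest[row["sample"]] = row
    rows = list(latest.values())
    if not include_soft_deleted:
        rows = [r for r in rows if r["deleted_at"] is None]
    return rows

soft_delete("a")
assert [r["sample"] for r in read()] == ["b"]        # "a" is hidden by default
assert len(read(include_soft_deleted=True)) == 2     # but still recoverable
```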
Hard deletes¶
Hard deletes permanently remove rows from storage and can be enabled by setting soft to False:
import narwhals as nw

with store.open("w"):
    store.delete(
        MyFeature,
        filters=nw.col("quality") < 0.8,
        soft=False,
    )
Deleting metadata from CLI¶
It is possible to delete metadata from the command line:
# Soft delete by default
metaxy metadata delete --feature predictions --filter "confidence < 0.3"
# Hard delete
metaxy metadata delete --feature predictions --filter "created_at < '2024-01-01'" --soft=false
Learn more in the CLI reference
Rebasing Metadata Versions¶
When a feature definition changes but the underlying computation stays the same (e.g., dependency graph refactoring, field renaming, code reorganization), existing metadata can be rebased onto the new feature version using MetadataStore.rebase. This recalculates provenance based on the target feature graph while preserving all user data columns.
rebase takes a dataframe of existing metadata, typically acquired with MetadataStore.read using non-default version filtering (see the example below). The returned frame includes the target feature and project version columns, so pass preserve_feature_version=True to MetadataStore.write to retain them.
Example
import narwhals as nw

from metaxy.models.constants import METAXY_FEATURE_VERSION

with store.open("w"):
    existing = store.read(
        "example/child",
        with_feature_history=True,
        # an older feature version to be rebased
        filters=[nw.col(METAXY_FEATURE_VERSION) == "abc123"],
    )
    rebased = store.rebase(
        "example/child",
        existing,
        to_feature_version="def456",  # new feature version
    )
    store.write("example/child", rebased, preserve_feature_version=True)
Rebasing is also available as a CLI command with simplified options:
Tip
Use --dry-run to preview how many rows would be affected without writing the result.
Fallback Stores¶
Metaxy metadata stores can be configured to pull missing metadata from another store. This is very useful for local and testing workflows, because it avoids having to materialize the entire data pipeline locally. Instead, Metaxy stores can automatically pull missing metadata from production.
Example Metaxy configuration:
[stores.dev]
type = "metaxy.ext.polars.DeltaMetadataStore"
root_path = "${HOME}/.metaxy/dev"
fallback_stores = ["prod"]
[stores.prod]
type = "metaxy.ext.polars.DeltaMetadataStore"
root_path = "s3://my-prod-bucket/metaxy"
Warning
Currently, the "missing metadata" detection works by checking whether the feature table exists in the store. This works in conjunction with automatic table creation, but doesn't work if empty tables are pre-created by, for example, migration tooling or CI/CD workflows. This will be improved in the future.
Metaxy doesn't mix metadata from different stores: either the entire feature is going to be pulled from the fallback store, or the primary store will be used.
Fallback stores can be chained at arbitrary depth.
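The lookup behavior can be sketched as a simple chain in plain Python (a conceptual illustration, not Metaxy's implementation): each store either serves the whole feature itself or delegates to its fallback.

```python
# Conceptual sketch of per-feature fallback resolution.
prod = {"features": {"my/feature": ["row1", "row2"]}, "fallback": None}
dev = {"features": {}, "fallback": prod}

def read(store, feature):
    # Whole-feature granularity: either this store has the table,
    # or the entire feature is read from the fallback chain.
    if feature in store["features"]:
        return store["features"][feature]
    if store["fallback"] is not None:
        return read(store["fallback"], feature)
    raise KeyError(feature)

assert read(dev, "my/feature") == ["row1", "row2"]  # pulled from the fallback
```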
Map Datatype¶
Metaxy uses dictionary-like columns internally for its field-level versioning columns. Most storage systems and Apache Arrow have native support for the Map type, but Polars doesn't: it converts Map columns to List(Struct(key, value)), which is physically equivalent to Map. This means that:
- user-defined Map columns lose their type when round-tripped through Polars
- Metaxy has to represent field-versioning columns as Struct instead, which is far from ideal because the fields change over time, causing problems with some storage systems
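The physical equivalence between the two representations can be illustrated in plain Python (conceptual only): a map and a list of key/value structs carry exactly the same information.

```python
# Conceptual sketch: Map <-> List(Struct(key, value)) round-trip.
def map_to_list_of_structs(m):
    return [{"key": k, "value": v} for k, v in m.items()]

def list_of_structs_to_map(lst):
    return {entry["key"]: entry["value"] for entry in lst}

field_versions = {"field_a": "v1", "field_b": "v2"}
physical = map_to_list_of_structs(field_versions)

# No information is lost either way.
assert list_of_structs_to_map(physical) == field_versions
```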
Note
This is also known as "The Map Hell" problem (a term we coined ourselves).
Experimental Map Datatype Support¶
These problems can be solved with the enable_map_datatype configuration option:
Experimental
Map datatype support is experimental and requires the narwhals-map and polars-map packages to be installed. They are shipped with metaxy[map] extra.
When enabled, Metaxy uses narwhals-map to represent Map columns as narwhals_map.Map across all dataframe backends. Metadata stores consume and return Narwhals frames with narwhals_map.Map and polars_map.Map columns, keeping Map columns intact across operations and preserving user-defined Map columns.
The following metadata stores support Map columns when enable_map_datatype is enabled:
Map support in Narwhals
Narwhals does not have native Map support yet (see the tracking issue). Metaxy uses narwhals-map to bridge this gap, providing a narwhals_map.Map datatype and the nw.Expr.map namespace for working with Map columns across backends.
Collecting results with Map columns
Standard narwhals.DataFrame.to_polars() and other conversion methods are not aware of narwhals_map.Map columns and will lose them when converting from other dataframe backends.
Use collect_to_polars or collect_to_arrow to materialize lazy frames while preserving Map columns.
Metadata Store Implementations¶
Metaxy provides ready-made MetadataStore implementations for popular databases and storage systems.