# DuckLake

## Overview
This example demonstrates how to run a small Metaxy pipeline against a DuckLake attachment on the DuckDB metadata store. DuckLake is an open lakehouse format that separates the metadata catalog (table definitions, schema evolution, and transaction history) from data file storage. This lets you choose independent backends for each layer, for example PostgreSQL for the catalog and S3 for data files. The example keeps the workflow Metaxy-native by writing a small feature dataset and then reading it back through the public API.
Available backend combinations:
| Catalog backend | Storage backend |
|---|---|
| DuckDB, SQLite, PostgreSQL | local filesystem, S3, Cloudflare R2, Google Cloud Storage |
| MotherDuck | managed (no storage backend needed), or BYOB with S3/R2/GCS |
!!! tip
    To use the credential chain (IAM roles, environment variables, etc.) instead of static credentials, set `secret_parameters = { provider = "credential_chain" }` on S3, R2, or GCS storage backends.
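As a sketch, applying the tip above to an S3 storage backend would look like the following (field names taken from the Step 1 example below; static `key_id`/`secret` fields are dropped because credentials are resolved at runtime):

```toml
# Hypothetical S3 storage backend using the AWS credential chain
# (IAM roles, env vars, etc.) instead of static keys.
[stores.dev.config.ducklake.storage]
type = "s3"
secret_name = "ducklake_s3"
bucket = "${DUCKLAKE_S3_BUCKET}"
region = "${DUCKLAKE_S3_REGION:-eu-central-1}"
secret_parameters = { provider = "credential_chain" }
```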
!!! note
    MotherDuck supports a "Bring Your Own Bucket" (BYOB) mode where MotherDuck manages the DuckLake catalog while you provide your own S3-compatible storage. Storage secrets are created *in MotherDuck* so that MotherDuck compute can access your bucket.
!!! tip "Recommended: enable Map datatype"
    DuckLake has native Map type support. Enabling `enable_map_datatype` preserves Map columns across read and write operations.
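A minimal sketch of enabling it in the store config. The exact location of `enable_map_datatype` (here assumed to sit under the store's `config` table) should be checked against the DuckLake integration reference:

```toml
[stores.dev.config]
database = "${DUCKLAKE_DEMO_DB:-:memory:}"
enable_map_datatype = true  # preserve Map columns on read/write
```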
## Getting Started
Install the example's dependencies:
For the full list of backend combinations and advanced options, see the DuckLake integration reference.
### Step 1: Configure DuckLake
DuckLake is configured in two parts:
- Catalog backend: the transaction log and table metadata
- Storage backend: the data files
The example below is intentionally minimal and runnable out of the box.
```toml
project = "example_ducklake"
entrypoints = ["example_ducklake.definitions"]
auto_create_tables = true  # Enable for development/examples

[stores.dev]
type = "metaxy.ext.duckdb.DuckDBMetadataStore"

[stores.dev.config]
database = "${DUCKLAKE_DEMO_DB:-:memory:}"

[stores.dev.config.ducklake]
alias = "ducklake"

[stores.dev.config.ducklake.attach_options]
override_data_path = true

# Local workflow used by this example:
[stores.dev.config.ducklake.catalog]
type = "sqlite"
uri = "${DUCKLAKE_META_DB:-/tmp/example_ducklake_meta.db}"

[stores.dev.config.ducklake.storage]
type = "local"
path = "${DUCKLAKE_STORAGE_PATH:-/tmp/example_ducklake_storage}"

# Example for real deployments (Postgres catalog + S3 storage):
# [stores.dev.config.ducklake.catalog]
# type = "postgres"
# secret_name = "ducklake_pg"
# host = "${DUCKLAKE_PG_HOST}"
# port = "${DUCKLAKE_PG_PORT:-5432}"
# database = "${DUCKLAKE_PG_DATABASE}"
# user = "${DUCKLAKE_PG_USER}"
# password = "${DUCKLAKE_PG_PASSWORD}"
#
# [stores.dev.config.ducklake.storage]
# type = "s3"
# secret_name = "ducklake_s3"
# bucket = "${DUCKLAKE_S3_BUCKET}"
# region = "${DUCKLAKE_S3_REGION:-eu-central-1}"
# key_id = "${DUCKLAKE_S3_KEY_ID}"
# secret = "${DUCKLAKE_S3_SECRET}"
```
### Step 2: Initial Run
Let's prepare a small Metaxy pipeline using the configured DuckLake metadata store:
"""Minimal Metaxy pipeline backed by DuckLake."""
import metaxy as mx
import polars as pl
from metaxy.ext.duckdb import DuckDBMetadataStore
from metaxy.models.constants import METAXY_PROVENANCE_BY_FIELD
from example_ducklake.definitions import DuckLakeDemoFeature
def build_demo_rows() -> pl.DataFrame:
"""Create deterministic sample metadata for the example pipeline."""
return pl.DataFrame(
{
"sample_uid": ["clip_001", "clip_002"],
"path": [
"s3://demo-bucket/processed/clip_001.parquet",
"s3://demo-bucket/processed/clip_002.parquet",
],
METAXY_PROVENANCE_BY_FIELD: [
{"path": "path_hash_clip_001_v1"},
{"path": "path_hash_clip_002_v1"},
],
}
)
def load_feature_rows(store: DuckDBMetadataStore) -> list[tuple[str, str]]:
"""Read back a few rows through Metaxy's public read API."""
feature_df = (
store.read(DuckLakeDemoFeature, columns=["sample_uid", "path"])
.collect()
.to_polars()
.sort("sample_uid")
)
return [
(str(row["sample_uid"]), str(row["path"]))
for row in feature_df.iter_rows(named=True)
]
def discover_feature_keys() -> list[str]:
"""List feature keys discovered in the active Metaxy project."""
graph = mx.current_graph()
return sorted(feature_key.to_string() for feature_key in graph.list_features())
if __name__ == "__main__":
config = mx.init()
store = config.get_store()
assert isinstance(store, DuckDBMetadataStore), (
"DuckLake example misconfigured: expected DuckDBMetadataStore."
)
demo_rows = build_demo_rows()
feature_table_name = store.get_table_name(DuckLakeDemoFeature.spec().key)
feature_rows: list[tuple[str, str]] = []
print("DuckLake pipeline")
print(f" Store class: {store.__class__.__name__}")
print(f" Database: {store.database}")
with store.open("w"):
store.write(DuckLakeDemoFeature, demo_rows)
feature_rows = load_feature_rows(store)
print(f" Wrote {len(demo_rows)} rows for {DuckLakeDemoFeature.spec().key}")
print()
print("Discovered Metaxy features:")
for feature_key in discover_feature_keys():
print(f" {feature_key}")
print()
print(f"Created feature table: {feature_table_name}")
print("Files:")
for sample_uid, path in feature_rows:
print(f" {sample_uid}: {path}")
### Step 3: Inspect Recorded Metadata
Running the pipeline produces output similar to:

```text
DuckLake pipeline
  Store class: DuckDBMetadataStore
  Database: /private/var/folders/s8/1bk6bx6d1l53q944ls244sdw0000gn/T/pytest-of-geoheil/pytest-1/test_example_snapshot_example_0/example-ducklake.db
  Wrote 2 rows for examples/ducklake_demo

Discovered Metaxy features:
  examples/ducklake_demo

Created feature table: examples__ducklake_demo
Files:
  clip_001: s3://demo-bucket/processed/clip_001.parquet
  clip_002: s3://demo-bucket/processed/clip_002.parquet
```
You should see:
- A successful Metaxy write
- The physical DuckLake-backed feature table name Metaxy created
- The rows read back for `examples/ducklake_demo`
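With the local storage backend, you can also confirm that DuckLake wrote Parquet data files under the configured storage path. A small sketch of such a check; the directory layout simulated below is purely illustrative, since DuckLake manages the real structure and file names itself:

```python
import tempfile
from pathlib import Path

def list_data_files(storage_root: str) -> list[str]:
    """Return Parquet files under a DuckLake storage root, relative to it."""
    root = Path(storage_root)
    return sorted(str(p.relative_to(root)) for p in root.rglob("*.parquet"))

# Simulated layout for illustration only; the actual DuckLake layout differs.
with tempfile.TemporaryDirectory() as tmp:
    table_dir = Path(tmp) / "main" / "examples__ducklake_demo"
    table_dir.mkdir(parents=True)
    (table_dir / "data_0.parquet").write_bytes(b"")
    print(list_data_files(tmp))
```

In a real run you would point `list_data_files` at `$DUCKLAKE_STORAGE_PATH` (default `/tmp/example_ducklake_storage` in this example's config).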