DuckLake

Overview

This example demonstrates how to run a small Metaxy pipeline against a DuckLake attachment on the DuckDB metadata store. DuckLake is an open lakehouse format that separates the metadata catalog (table definitions, schema evolution, and transaction history) from data file storage. This lets you choose independent backends for each layer, for example PostgreSQL for the catalog and S3 for data files. The example keeps the workflow Metaxy-native by writing a small feature dataset and then reading it back through the public API.

Available backend combinations:

Catalog backend               Storage backend
DuckDB, SQLite, PostgreSQL    local filesystem, S3, Cloudflare R2, Google Cloud Storage
MotherDuck                    managed (no storage backend needed), or BYOB with S3/R2/GCS

Tip

To use the credential chain (IAM roles, environment variables, etc.) instead of static credentials, set secret_parameters = { provider = "credential_chain" } on S3, R2, or GCS storage backends.
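For example, the S3 storage section of the config below could drop its static keys in favor of the credential chain. This is a sketch; the exact placement of secret_parameters on the storage table is an assumption based on the tip above:

```toml
[stores.dev.config.ducklake.storage]
type = "s3"
secret_name = "ducklake_s3"
bucket = "${DUCKLAKE_S3_BUCKET}"
region = "${DUCKLAKE_S3_REGION:-eu-central-1}"
# No key_id/secret entries: credentials are resolved from the
# ambient chain (IAM role, env vars, ~/.aws/credentials, ...)
secret_parameters = { provider = "credential_chain" }
```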

Note

MotherDuck supports a "Bring Your Own Bucket" (BYOB) mode where MotherDuck manages the DuckLake catalog while you provide your own S3-compatible storage. Storage secrets are created in MotherDuck so that MotherDuck compute can access your bucket.
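A purely hypothetical sketch of what a BYOB setup might look like in metaxy.toml. Every key name below is illustrative, not a documented option; consult the DuckLake integration reference for the real names:

```toml
# Hypothetical -- key names are illustrative only
[stores.dev.config.ducklake.catalog]
type = "motherduck"                  # assumed catalog type name
database = "${MOTHERDUCK_DATABASE}"

[stores.dev.config.ducklake.storage]
type = "s3"
secret_name = "ducklake_s3"          # secret is created in MotherDuck, not locally
bucket = "${DUCKLAKE_S3_BUCKET}"
```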

Recommended: enable Map datatype

DuckLake has native Map type support. Enabling enable_map_datatype preserves Map columns across read and write operations.
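In config terms this is a single flag. Its placement under the ducklake table is an assumption; the flag name comes from the note above:

```toml
[stores.dev.config.ducklake]
alias = "ducklake"
enable_map_datatype = true  # preserve Map columns across reads and writes
```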

Getting Started

Install the example's dependencies:

uv sync

For the full list of backend combinations and advanced options, see the DuckLake integration reference.

Step 1: Configure DuckLake

DuckLake is configured with two parts:

  1. Catalog backend: transaction log and metadata
  2. Storage backend: data files

The example below is intentionally minimal and runnable out of the box.

metaxy.toml
project = "example_ducklake"
entrypoints = ["example_ducklake.definitions"]
auto_create_tables = true # Enable for development/examples

[stores.dev]
type = "metaxy.ext.duckdb.DuckDBMetadataStore"

[stores.dev.config]
database = "${DUCKLAKE_DEMO_DB:-:memory:}"

[stores.dev.config.ducklake]
alias = "ducklake"

[stores.dev.config.ducklake.attach_options]
override_data_path = true

# Local workflow used by this example:
[stores.dev.config.ducklake.catalog]
type = "sqlite"
uri = "${DUCKLAKE_META_DB:-/tmp/example_ducklake_meta.db}"

[stores.dev.config.ducklake.storage]
type = "local"
path = "${DUCKLAKE_STORAGE_PATH:-/tmp/example_ducklake_storage}"

# Example for real deployments (Postgres catalog + S3 storage):
# [stores.dev.config.ducklake.catalog]
# type = "postgres"
# secret_name = "ducklake_pg"
# host = "${DUCKLAKE_PG_HOST}"
# port = "${DUCKLAKE_PG_PORT:-5432}"
# database = "${DUCKLAKE_PG_DATABASE}"
# user = "${DUCKLAKE_PG_USER}"
# password = "${DUCKLAKE_PG_PASSWORD}"
#
# [stores.dev.config.ducklake.storage]
# type = "s3"
# secret_name = "ducklake_s3"
# bucket = "${DUCKLAKE_S3_BUCKET}"
# region = "${DUCKLAKE_S3_REGION:-eu-central-1}"
# key_id = "${DUCKLAKE_S3_KEY_ID}"
# secret = "${DUCKLAKE_S3_SECRET}"

Step 2: Initial Run

Let's prepare a small Metaxy pipeline using the configured DuckLake metadata store:

src/example_ducklake/pipeline.py
"""Minimal Metaxy pipeline backed by DuckLake."""

import metaxy as mx
import polars as pl
from metaxy.ext.duckdb import DuckDBMetadataStore
from metaxy.models.constants import METAXY_PROVENANCE_BY_FIELD

from example_ducklake.definitions import DuckLakeDemoFeature


def build_demo_rows() -> pl.DataFrame:
    """Create deterministic sample metadata for the example pipeline."""
    return pl.DataFrame(
        {
            "sample_uid": ["clip_001", "clip_002"],
            "path": [
                "s3://demo-bucket/processed/clip_001.parquet",
                "s3://demo-bucket/processed/clip_002.parquet",
            ],
            METAXY_PROVENANCE_BY_FIELD: [
                {"path": "path_hash_clip_001_v1"},
                {"path": "path_hash_clip_002_v1"},
            ],
        }
    )


def load_feature_rows(store: DuckDBMetadataStore) -> list[tuple[str, str]]:
    """Read back a few rows through Metaxy's public read API."""
    feature_df = (
        store.read(DuckLakeDemoFeature, columns=["sample_uid", "path"])
        .collect()
        .to_polars()
        .sort("sample_uid")
    )
    return [
        (str(row["sample_uid"]), str(row["path"]))
        for row in feature_df.iter_rows(named=True)
    ]


def discover_feature_keys() -> list[str]:
    """List feature keys discovered in the active Metaxy project."""
    graph = mx.current_graph()
    return sorted(feature_key.to_string() for feature_key in graph.list_features())


if __name__ == "__main__":
    config = mx.init()
    store = config.get_store()
    assert isinstance(store, DuckDBMetadataStore), (
        "DuckLake example misconfigured: expected DuckDBMetadataStore."
    )
    demo_rows = build_demo_rows()
    feature_table_name = store.get_table_name(DuckLakeDemoFeature.spec().key)
    feature_rows: list[tuple[str, str]] = []

    print("DuckLake pipeline")
    print(f"  Store class: {store.__class__.__name__}")
    print(f"  Database: {store.database}")

    with store.open("w"):
        store.write(DuckLakeDemoFeature, demo_rows)
        feature_rows = load_feature_rows(store)

    print(f"  Wrote {len(demo_rows)} rows for {DuckLakeDemoFeature.spec().key}")
    print()
    print("Discovered Metaxy features:")
    for feature_key in discover_feature_keys():
        print(f"  {feature_key}")
    print()
    print(f"Created feature table: {feature_table_name}")
    print("Files:")
    for sample_uid, path in feature_rows:
        print(f"  {sample_uid}: {path}")

Step 3: Inspect Recorded Metadata

$ python src/example_ducklake/pipeline.py
DuckLake pipeline
  Store class: DuckDBMetadataStore
  Database: /private/var/folders/s8/1bk6bx6d1l53q944ls244sdw0000gn/T/pytest-of-geoheil/pytest-1/test_example_snapshot_example_0/example-ducklake.db
  Wrote 2 rows for examples/ducklake_demo

Discovered Metaxy features:
  examples/ducklake_demo

Created feature table: examples__ducklake_demo
Files:
  clip_001: s3://demo-bucket/processed/clip_001.parquet
  clip_002: s3://demo-bucket/processed/clip_002.parquet

You should see:

  1. A successful Metaxy write
  2. The physical DuckLake-backed feature table name Metaxy created
  3. The rows read back for examples/ducklake_demo
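With the local storage backend from Step 1, the data files DuckLake wrote should be visible on disk under the configured storage path. A small helper to inspect them (the storage path and `.parquet` suffix are assumptions matching the example config; DuckLake's internal directory layout may vary):

```python
from pathlib import Path


def list_data_files(storage_path: str, suffix: str = ".parquet") -> list[str]:
    """Recursively list data files under the DuckLake storage path,
    returned as paths relative to its root."""
    root = Path(storage_path)
    return sorted(str(p.relative_to(root)) for p in root.rglob(f"*{suffix}"))


# Assumes the default path from the example config:
for data_file in list_data_files("/tmp/example_ducklake_storage"):
    print(data_file)
```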