Metaxy + BigQuery

Experimental

This functionality is experimental.

BigQuery is a serverless data warehouse managed by Google Cloud. To use Metaxy with BigQuery, configure BigQueryMetadataStore. Versioning computations run natively in BigQuery.

Installation

pip install 'metaxy[bigquery]'

API Reference

metaxy.ext.metadata_stores.bigquery

BigQuery metadata store - thin wrapper around IbisMetadataStore.

metaxy.ext.metadata_stores.bigquery.BigQueryMetadataStore

BigQueryMetadataStore(
    project_id: str | None = None,
    dataset_id: str | None = None,
    *,
    credentials_path: str | None = None,
    credentials: Any | None = None,
    location: str | None = None,
    connection_params: dict[str, Any] | None = None,
    fallback_stores: list[MetadataStore] | None = None,
    **kwargs: Any,
)

Bases: IbisMetadataStore

BigQuery metadata store using Ibis backend.

Warning

Setting up the underlying infrastructure correctly is the user's responsibility. Make sure large tables are partitioned appropriately for your use case.

Note

BigQuery optimizes queries on partitioned tables automatically. When tables are partitioned (e.g., by date, or by ingestion time with _PARTITIONTIME), BigQuery prunes partitions based on the WHERE clauses in a query, with no explicit configuration needed in the metadata store. Make sure to use appropriate filters when calling BigQueryMetadataStore.read.
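As a rough illustration of why partition filters matter (a toy sketch, not Metaxy or BigQuery code): when a table is partitioned by date, a filter on the partition column lets the engine skip whole partitions instead of scanning every row.

```python
from datetime import date

# Toy model of a date-partitioned table: partition key -> rows.
partitions = {
    date(2024, 1, 1): ["row-a", "row-b"],
    date(2024, 1, 2): ["row-c"],
    date(2024, 1, 3): ["row-d", "row-e"],
}

def scanned_partitions(partitions, start, end):
    """Partitions a date-range filter would touch; the rest are pruned."""
    return sorted(d for d in partitions if start <= d <= end)

# A WHERE-style filter on the partition column touches one partition,
# not all three.
touched = scanned_partitions(partitions, date(2024, 1, 2), date(2024, 1, 2))
```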

Basic Connection
store = BigQueryMetadataStore(
    project_id="my-project",
    dataset_id="my_dataset",
)
With Service Account
store = BigQueryMetadataStore(
    project_id="my-project",
    dataset_id="my_dataset",
    credentials_path="/path/to/service-account.json",
)
With Location Configuration
store = BigQueryMetadataStore(
    project_id="my-project",
    dataset_id="my_dataset",
    location="EU",  # Specify data location
)
With Custom Hash Algorithm
store = BigQueryMetadataStore(
    project_id="my-project",
    dataset_id="my_dataset",
    hash_algorithm=HashAlgorithm.SHA256,  # Use SHA256 instead of default FARMHASH
)

Parameters:

  • project_id (str | None, default: None ) –

    Google Cloud project ID containing the dataset. Can also be set via GOOGLE_CLOUD_PROJECT environment variable.

  • dataset_id (str | None, default: None ) –

    BigQuery dataset name for storing metadata tables. If not provided, uses the default dataset for the project.

  • credentials_path (str | None, default: None ) –

    Path to service account JSON file. Alternative to passing credentials object directly.

  • credentials (Any | None, default: None ) –

    Google Cloud credentials object. If not provided, uses default credentials from environment.

  • location (str | None, default: None ) –

    Default location for BigQuery resources (e.g., "US", "EU"). If not specified, BigQuery determines based on dataset location.

  • connection_params (dict[str, Any] | None, default: None ) –

    Additional Ibis BigQuery connection parameters. Overrides individual parameters if provided.

  • fallback_stores (list[MetadataStore] | None, default: None ) –

    Ordered list of read-only fallback stores.

  • **kwargs (Any, default: {} ) –

    Passed to IbisMetadataStore

Raises:

  • ImportError

    If ibis-bigquery is not installed.

  • ValueError

    If neither project_id nor connection_params is provided.

Note

Authentication priority:

1. Explicit credentials or credentials_path
2. Application Default Credentials (ADC)
3. Google Cloud SDK credentials

BigQuery automatically handles partition pruning when querying partitioned tables. If your tables are partitioned (e.g., by date or ingestion time), BigQuery will automatically optimize queries with appropriate WHERE clauses on the partition column.
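The priority order above amounts to a first-match resolution, which can be sketched as follows (the helper below is purely illustrative and not part of the Metaxy API):

```python
def resolve_credentials(credentials=None, credentials_path=None,
                        adc=None, sdk_credentials=None):
    """Return the first available credential source, in documented priority order."""
    if credentials is not None:
        return ("explicit", credentials)
    if credentials_path is not None:
        return ("service_account_file", credentials_path)
    if adc is not None:
        return ("application_default", adc)
    return ("gcloud_sdk", sdk_credentials)

# Explicit credentials win even when a key file path is also given.
source, _ = resolve_credentials(credentials=object(),
                                credentials_path="/path/to/key.json")
```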

Example
# Using environment authentication
store = BigQueryMetadataStore(
    project_id="my-project",
    dataset_id="ml_metadata",
)

# Using service account
store = BigQueryMetadataStore(
    project_id="my-project",
    dataset_id="ml_metadata",
    credentials_path="/path/to/key.json",
)

# With location specification
store = BigQueryMetadataStore(
    project_id="my-project",
    dataset_id="ml_metadata",
    location="EU",
)
Source code in src/metaxy/ext/metadata_stores/bigquery.py
def __init__(
    self,
    project_id: str | None = None,
    dataset_id: str | None = None,
    *,
    credentials_path: str | None = None,
    credentials: Any | None = None,
    location: str | None = None,
    connection_params: dict[str, Any] | None = None,
    fallback_stores: list["MetadataStore"] | None = None,
    **kwargs: Any,
):
    """
    Initialize [BigQuery](https://cloud.google.com/bigquery) metadata store.

    Args:
        project_id: Google Cloud project ID containing the dataset.
            Can also be set via GOOGLE_CLOUD_PROJECT environment variable.
        dataset_id: BigQuery dataset name for storing metadata tables.
            If not provided, uses the default dataset for the project.
        credentials_path: Path to service account JSON file.
            Alternative to passing credentials object directly.
        credentials: Google Cloud credentials object.
            If not provided, uses default credentials from environment.
        location: Default location for BigQuery resources (e.g., "US", "EU").
            If not specified, BigQuery determines based on dataset location.
        connection_params: Additional Ibis BigQuery connection parameters.
            Overrides individual parameters if provided.
        fallback_stores: Ordered list of read-only fallback stores.
        **kwargs: Passed to [`IbisMetadataStore`][metaxy.metadata_store.ibis.IbisMetadataStore]

    Raises:
        ImportError: If ibis-bigquery not installed
        ValueError: If neither project_id nor connection_params provided

    Note:
        Authentication priority:
        1. Explicit credentials or credentials_path
        2. Application Default Credentials (ADC)
        3. Google Cloud SDK credentials

        BigQuery automatically handles partition pruning when querying partitioned tables.
        If your tables are partitioned (e.g., by date or ingestion time), BigQuery will
        automatically optimize queries with appropriate WHERE clauses on the partition column.

    Example:
        <!-- skip next -->
        ```py
        # Using environment authentication
        store = BigQueryMetadataStore(
            project_id="my-project",
            dataset_id="ml_metadata",
        )

        # Using service account
        store = BigQueryMetadataStore(
            project_id="my-project",
            dataset_id="ml_metadata",
            credentials_path="/path/to/key.json",
        )

        # With location specification
        store = BigQueryMetadataStore(
            project_id="my-project",
            dataset_id="ml_metadata",
            location="EU",
        )
        ```
    """
    # Build connection parameters if not provided
    if connection_params is None:
        connection_params = self._build_connection_params(
            project_id=project_id,
            dataset_id=dataset_id,
            credentials_path=credentials_path,
            credentials=credentials,
            location=location,
        )

    # Validate we have minimum required parameters
    if "project_id" not in connection_params and project_id is None:
        raise ValueError(
            "Must provide either project_id or connection_params with project_id. Example: project_id='my-project'"
        )

    # Store parameters for display
    self.project_id = project_id or connection_params.get("project_id")
    self.dataset_id = dataset_id or connection_params.get("dataset_id", "")

    # Initialize Ibis store with BigQuery backend
    super().__init__(
        backend="bigquery",
        connection_params=connection_params,
        fallback_stores=fallback_stores,
        **kwargs,
    )

Configuration

Configuration for BigQueryMetadataStore.

Example
metaxy.toml
[stores.dev]
type = "metaxy.ext.metadata_stores.bigquery.BigQueryMetadataStore"

[stores.dev.config]
project_id = "my-project"
dataset_id = "my_dataset"
credentials_path = "/path/to/service-account.json"
JSON schema:
{
  "$defs": {
    "HashAlgorithm": {
      "description": "Supported hash algorithms for field provenance calculation.\n\nThese algorithms are chosen for:\n- Speed (non-cryptographic hashes preferred)\n- Cross-database availability\n- Good collision resistance for field provenance calculation",
      "enum": [
        "xxhash64",
        "xxhash32",
        "wyhash",
        "sha256",
        "md5",
        "farmhash"
      ],
      "title": "HashAlgorithm",
      "type": "string"
    }
  },
  "additionalProperties": false,
  "description": "Configuration for BigQueryMetadataStore.\n\nExample:\n    ```toml title=\"metaxy.toml\"\n    [stores.dev]\n    type = \"metaxy.ext.metadata_stores.bigquery.BigQueryMetadataStore\"\n\n    [stores.dev.config]\n    project_id = \"my-project\"\n    dataset_id = \"my_dataset\"\n    credentials_path = \"/path/to/service-account.json\"\n    ```",
  "properties": {
    "fallback_stores": {
      "description": "List of fallback store names to search when features are not found in the current store.",
      "items": {
        "type": "string"
      },
      "title": "Fallback Stores",
      "type": "array"
    },
    "hash_algorithm": {
      "anyOf": [
        {
          "$ref": "#/$defs/HashAlgorithm"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Hash algorithm for versioning. If None, uses store's default."
    },
    "versioning_engine": {
      "default": "auto",
      "description": "Which versioning engine to use: 'auto' (prefer native), 'native', or 'polars'.",
      "enum": [
        "auto",
        "native",
        "polars"
      ],
      "title": "Versioning Engine",
      "type": "string"
    },
    "connection_string": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Ibis connection string (e.g., 'clickhouse://host:9000/db').",
      "title": "Connection String"
    },
    "backend": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Ibis backend name (e.g., 'clickhouse', 'postgres', 'duckdb').",
      "mkdocs_metaxy_hide": true,
      "title": "Backend"
    },
    "connection_params": {
      "anyOf": [
        {
          "additionalProperties": true,
          "type": "object"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Backend-specific connection parameters.",
      "title": "Connection Params"
    },
    "table_prefix": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Optional prefix for all table names.",
      "title": "Table Prefix"
    },
    "auto_create_tables": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "If True, create tables on open. For development/testing only.",
      "title": "Auto Create Tables"
    },
    "project_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Google Cloud project ID containing the dataset.",
      "title": "Project Id"
    },
    "dataset_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "BigQuery dataset name for storing metadata tables.",
      "title": "Dataset Id"
    },
    "credentials_path": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Path to service account JSON file.",
      "title": "Credentials Path"
    },
    "credentials": {
      "anyOf": [
        {},
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Google Cloud credentials object.",
      "title": "Credentials"
    },
    "location": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Default location for BigQuery resources (e.g., 'US', 'EU').",
      "title": "Location"
    }
  },
  "title": "BigQueryMetadataStoreConfig",
  "type": "object"
}

metaxy.ext.metadata_stores.bigquery.BigQueryMetadataStoreConfig.fallback_stores pydantic-field

fallback_stores: list[str]

List of fallback store names to search when features are not found in the current store.

[stores.dev.config]
fallback_stores = []
[tool.metaxy.stores.dev.config]
fallback_stores = []
export METAXY_STORES__DEV__CONFIG__FALLBACK_STORES=[]
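The double-underscore segments in the environment variable name mirror the nested config path. A small sketch of that convention (assuming pydantic-settings-style `__` nesting; the helper is not part of Metaxy):

```python
def env_key_to_config_path(key, prefix="METAXY_"):
    """Split a METAXY_* variable name into its nested config keys."""
    assert key.startswith(prefix), f"expected {prefix}* variable, got {key}"
    return [part.lower() for part in key[len(prefix):].split("__")]

# METAXY_STORES__DEV__CONFIG__FALLBACK_STORES addresses
# [stores.dev.config] fallback_stores in metaxy.toml.
path = env_key_to_config_path("METAXY_STORES__DEV__CONFIG__FALLBACK_STORES")
```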

metaxy.ext.metadata_stores.bigquery.BigQueryMetadataStoreConfig.hash_algorithm pydantic-field

hash_algorithm: HashAlgorithm | None = None

Hash algorithm for versioning. If None, uses store's default.

[stores.dev.config]
hash_algorithm = "..."
[tool.metaxy.stores.dev.config]
hash_algorithm = "..."
export METAXY_STORES__DEV__CONFIG__HASH_ALGORITHM=...

metaxy.ext.metadata_stores.bigquery.BigQueryMetadataStoreConfig.versioning_engine pydantic-field

versioning_engine: Literal["auto", "native", "polars"] = (
    "auto"
)

Which versioning engine to use: 'auto' (prefer native), 'native', or 'polars'.

[stores.dev.config]
versioning_engine = "auto"
[tool.metaxy.stores.dev.config]
versioning_engine = "auto"
export METAXY_STORES__DEV__CONFIG__VERSIONING_ENGINE=auto

metaxy.ext.metadata_stores.bigquery.BigQueryMetadataStoreConfig.connection_string pydantic-field

connection_string: str | None = None

Ibis connection string (e.g., 'clickhouse://host:9000/db').

[stores.dev.config]
connection_string = "..."
[tool.metaxy.stores.dev.config]
connection_string = "..."
export METAXY_STORES__DEV__CONFIG__CONNECTION_STRING=...

metaxy.ext.metadata_stores.bigquery.BigQueryMetadataStoreConfig.connection_params pydantic-field

connection_params: dict[str, Any] | None = None

Backend-specific connection parameters.

[stores.dev.config]
connection_params = {}
[tool.metaxy.stores.dev.config]
connection_params = {}
export METAXY_STORES__DEV__CONFIG__CONNECTION_PARAMS=...

metaxy.ext.metadata_stores.bigquery.BigQueryMetadataStoreConfig.table_prefix pydantic-field

table_prefix: str | None = None

Optional prefix for all table names.

[stores.dev.config]
table_prefix = "..."
[tool.metaxy.stores.dev.config]
table_prefix = "..."
export METAXY_STORES__DEV__CONFIG__TABLE_PREFIX=...

metaxy.ext.metadata_stores.bigquery.BigQueryMetadataStoreConfig.auto_create_tables pydantic-field

auto_create_tables: bool | None = None

If True, create tables on open. For development/testing only.

[stores.dev.config]
auto_create_tables = false
[tool.metaxy.stores.dev.config]
auto_create_tables = false
export METAXY_STORES__DEV__CONFIG__AUTO_CREATE_TABLES=...

metaxy.ext.metadata_stores.bigquery.BigQueryMetadataStoreConfig.project_id pydantic-field

project_id: str | None = None

Google Cloud project ID containing the dataset.

[stores.dev.config]
project_id = "..."
[tool.metaxy.stores.dev.config]
project_id = "..."
export METAXY_STORES__DEV__CONFIG__PROJECT_ID=...

metaxy.ext.metadata_stores.bigquery.BigQueryMetadataStoreConfig.dataset_id pydantic-field

dataset_id: str | None = None

BigQuery dataset name for storing metadata tables.

[stores.dev.config]
dataset_id = "..."
[tool.metaxy.stores.dev.config]
dataset_id = "..."
export METAXY_STORES__DEV__CONFIG__DATASET_ID=...

metaxy.ext.metadata_stores.bigquery.BigQueryMetadataStoreConfig.credentials_path pydantic-field

credentials_path: str | None = None

Path to service account JSON file.

[stores.dev.config]
credentials_path = "..."
[tool.metaxy.stores.dev.config]
credentials_path = "..."
export METAXY_STORES__DEV__CONFIG__CREDENTIALS_PATH=...

metaxy.ext.metadata_stores.bigquery.BigQueryMetadataStoreConfig.credentials pydantic-field

credentials: Any | None = None

Google Cloud credentials object.

[stores.dev.config]
credentials = "..."
[tool.metaxy.stores.dev.config]
credentials = "..."
export METAXY_STORES__DEV__CONFIG__CREDENTIALS=...

metaxy.ext.metadata_stores.bigquery.BigQueryMetadataStoreConfig.location pydantic-field

location: str | None = None

Default location for BigQuery resources (e.g., 'US', 'EU').

[stores.dev.config]
location = "..."
[tool.metaxy.stores.dev.config]
location = "..."
export METAXY_STORES__DEV__CONFIG__LOCATION=...