BigQuery¶
Experimental
This functionality is experimental.
BigQuery is a serverless data warehouse managed by Google Cloud. To use Metaxy with BigQuery, configure BigQueryMetadataStore. Versioning computations run natively in BigQuery.
Installation¶
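The store requires the Ibis BigQuery backend and raises `ImportError` when it is missing. Installing Metaxy with a BigQuery extra, e.g. `pip install "metaxy[bigquery]"` (extra name assumed), or adding `ibis-framework[bigquery]` to the environment provides the backend.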
metaxy.ext.metadata_stores.bigquery¶
BigQuery metadata store - thin wrapper around IbisMetadataStore.
metaxy.ext.metadata_stores.bigquery.BigQueryMetadataStore¶
```py
BigQueryMetadataStore(
    project_id: str | None = None,
    dataset_id: str | None = None,
    *,
    credentials_path: str | None = None,
    credentials: Any | None = None,
    location: str | None = None,
    connection_params: dict[str, Any] | None = None,
    fallback_stores: list[MetadataStore] | None = None,
    **kwargs: Any,
)
```
Bases: IbisMetadataStore
BigQuery metadata store using Ibis backend.
Warning
Setting up the infrastructure for Metaxy correctly is the user's responsibility. In particular, make sure large tables are partitioned appropriately for your use case.
Note
BigQuery automatically optimizes queries on partitioned tables.
When tables are partitioned (e.g., by date or ingestion time with _PARTITIONTIME), BigQuery will
automatically prune partitions based on WHERE clauses in queries, without needing
explicit configuration in the metadata store.
Make sure to use appropriate filters when calling BigQueryMetadataStore.read.
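Metaxy does not partition tables for you. As a sketch of the setup the warning above calls for, here is how a day-partitioned table could be created with the google-cloud-bigquery client; the table name and columns are hypothetical, not Metaxy's actual metadata schema:

```py
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Hypothetical table name and columns for illustration only.
table = bigquery.Table(
    "my-project.ml_metadata.feature_versions",
    schema=[
        bigquery.SchemaField("feature_key", "STRING"),
        bigquery.SchemaField("created_at", "TIMESTAMP"),
    ],
)
# Day-partition on created_at so WHERE filters on that column prune partitions.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="created_at",
)
client.create_table(table)
```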
Parameters:

- project_id (str | None, default: None) – Google Cloud project ID containing the dataset. Can also be set via the GOOGLE_CLOUD_PROJECT environment variable.
- dataset_id (str | None, default: None) – BigQuery dataset name for storing metadata tables. If not provided, uses the default dataset for the project.
- credentials_path (str | None, default: None) – Path to a service account JSON file. Alternative to passing a credentials object directly.
- credentials (Any | None, default: None) – Google Cloud credentials object. If not provided, uses default credentials from the environment.
- location (str | None, default: None) – Default location for BigQuery resources (e.g., "US", "EU"). If not specified, BigQuery determines it from the dataset location.
- connection_params (dict[str, Any] | None, default: None) – Additional Ibis BigQuery connection parameters. Overrides individual parameters if provided.
- fallback_stores (list[MetadataStore] | None, default: None) – Ordered list of read-only fallback stores.
- **kwargs (Any, default: {}) – Passed to IbisMetadataStore.
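Instead of individual arguments, everything can go through connection_params; a sketch, assuming the Ibis BigQuery backend's usual parameter names:

```py
store = BigQueryMetadataStore(
    connection_params={
        "project_id": "my-project",
        "dataset_id": "ml_metadata",
        "location": "EU",
    },
)
```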
Raises:

- ImportError – If ibis-bigquery is not installed.
- ValueError – If neither project_id nor connection_params is provided.
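For instance, constructing the store with neither project_id nor connection_params raises the documented ValueError:

```py
# Raises ValueError: "Must provide either project_id or
# connection_params with project_id. Example: project_id='my-project'"
store = BigQueryMetadataStore()
```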
Note
Authentication priority:

1. Explicit credentials or credentials_path
2. Application Default Credentials (ADC)
3. Google Cloud SDK credentials
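To pin authentication explicitly (priority 1 above), a credentials object can be built with google-auth and passed directly; a minimal sketch:

```py
from google.oauth2 import service_account

creds = service_account.Credentials.from_service_account_file("/path/to/key.json")
store = BigQueryMetadataStore(
    project_id="my-project",
    dataset_id="ml_metadata",
    credentials=creds,
)
```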
Example
```py
# Using environment authentication
store = BigQueryMetadataStore(
    project_id="my-project",
    dataset_id="ml_metadata",
)

# Using service account
store = BigQueryMetadataStore(
    project_id="my-project",
    dataset_id="ml_metadata",
    credentials_path="/path/to/key.json",
)

# With location specification
store = BigQueryMetadataStore(
    project_id="my-project",
    dataset_id="ml_metadata",
    location="EU",
)
```
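A fallback chain keeps reads working when a feature is missing from the primary store; a sketch, assuming a second, team-wide dataset named shared_metadata:

```py
# Hypothetical: fall back to a shared read-only dataset when a
# feature is not found in the team's own dataset.
shared_store = BigQueryMetadataStore(
    project_id="my-project",
    dataset_id="shared_metadata",
)
store = BigQueryMetadataStore(
    project_id="my-project",
    dataset_id="team_metadata",
    fallback_stores=[shared_store],
)
```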
Source code in src/metaxy/ext/metadata_stores/bigquery.py
````py
def __init__(
    self,
    project_id: str | None = None,
    dataset_id: str | None = None,
    *,
    credentials_path: str | None = None,
    credentials: Any | None = None,
    location: str | None = None,
    connection_params: dict[str, Any] | None = None,
    fallback_stores: list["MetadataStore"] | None = None,
    **kwargs: Any,
):
    """
    Initialize [BigQuery](https://cloud.google.com/bigquery) metadata store.

    Args:
        project_id: Google Cloud project ID containing the dataset.
            Can also be set via GOOGLE_CLOUD_PROJECT environment variable.
        dataset_id: BigQuery dataset name for storing metadata tables.
            If not provided, uses the default dataset for the project.
        credentials_path: Path to service account JSON file.
            Alternative to passing credentials object directly.
        credentials: Google Cloud credentials object.
            If not provided, uses default credentials from environment.
        location: Default location for BigQuery resources (e.g., "US", "EU").
            If not specified, BigQuery determines based on dataset location.
        connection_params: Additional Ibis BigQuery connection parameters.
            Overrides individual parameters if provided.
        fallback_stores: Ordered list of read-only fallback stores.
        **kwargs: Passed to [`IbisMetadataStore`][metaxy.metadata_store.ibis.IbisMetadataStore]

    Raises:
        ImportError: If ibis-bigquery not installed
        ValueError: If neither project_id nor connection_params provided

    Note:
        Authentication priority:
        1. Explicit credentials or credentials_path
        2. Application Default Credentials (ADC)
        3. Google Cloud SDK credentials

        BigQuery automatically handles partition pruning when querying partitioned tables.
        If your tables are partitioned (e.g., by date or ingestion time), BigQuery will
        automatically optimize queries with appropriate WHERE clauses on the partition column.

    Example:
        <!-- skip next -->
        ```py
        # Using environment authentication
        store = BigQueryMetadataStore(
            project_id="my-project",
            dataset_id="ml_metadata",
        )

        # Using service account
        store = BigQueryMetadataStore(
            project_id="my-project",
            dataset_id="ml_metadata",
            credentials_path="/path/to/key.json",
        )

        # With location specification
        store = BigQueryMetadataStore(
            project_id="my-project",
            dataset_id="ml_metadata",
            location="EU",
        )
        ```
    """
    # Build connection parameters if not provided
    if connection_params is None:
        connection_params = self._build_connection_params(
            project_id=project_id,
            dataset_id=dataset_id,
            credentials_path=credentials_path,
            credentials=credentials,
            location=location,
        )

    # Validate we have minimum required parameters
    if "project_id" not in connection_params and project_id is None:
        raise ValueError(
            "Must provide either project_id or connection_params with project_id. Example: project_id='my-project'"
        )

    # Store parameters for display
    self.project_id = project_id or connection_params.get("project_id")
    self.dataset_id = dataset_id or connection_params.get("dataset_id", "")

    # Initialize Ibis store with BigQuery backend
    super().__init__(
        backend="bigquery",
        connection_params=connection_params,
        fallback_stores=fallback_stores,
        **kwargs,
    )
````
Configuration¶
fallback_stores¶
List of fallback store names to search when features are not found in the current store.
Type: list[str]
hash_algorithm¶
Hash algorithm for versioning. If None, uses store's default.
Type: metaxy.versioning.types.HashAlgorithm | None
versioning_engine¶
Which versioning engine to use: 'auto' (prefer native), 'native', or 'polars'.
Type: Literal['auto', 'native', 'polars'] | Default: "auto"
connection_string¶
Ibis connection string (e.g., 'bigquery://my-project/my_dataset').
Type: str | None
connection_params¶
Backend-specific connection parameters.
Type: dict[str, Any] | None
table_prefix¶
Optional prefix for all table names.
Type: str | None
auto_create_tables¶
If True, create tables on open. For development/testing only.
Type: bool | None
project_id¶
Google Cloud project ID containing the dataset.
Type: str | None
dataset_id¶
BigQuery dataset name for storing metadata tables.
Type: str | None
credentials_path¶
Path to service account JSON file.
Type: str | None
credentials¶
Google Cloud credentials object.
Type: Any | None
location¶
Default location for BigQuery resources (e.g., 'US', 'EU').
Type: str | None
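These fields mirror the constructor arguments; the Ibis-level options are forwarded through **kwargs. A sketch, assuming the keyword names match the fields above:

```py
store = BigQueryMetadataStore(
    project_id="my-project",
    dataset_id="ml_metadata",
    location="US",
    # Assumed to be accepted by IbisMetadataStore via **kwargs,
    # matching the configuration fields documented above.
    table_prefix="metaxy_",
    versioning_engine="native",
    auto_create_tables=True,  # for development/testing only
)
```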