Skip to content

Feat (bigquery): store resource schema in the table's own OPTIONS, drop the _table_metadata store#3

Merged
sagargg merged 6 commits into
mainfrom
feat/bigquery-schema-in-table-options
May 27, 2026
Merged

Feat (bigquery): store resource schema in the table's own OPTIONS, drop the _table_metadata store#3
sagargg merged 6 commits into
mainfrom
feat/bigquery-schema-in-table-options

Conversation

@sagargg
Copy link
Copy Markdown
Member

@sagargg sagargg commented May 27, 2026

What

Stores the resource's Frictionless schema inside each data table's own metadata instead of a separate per-resource _table_metadata table.

The schema (fields + the info data dictionary + primaryKey/unique_key) is JSON-encoded under a datastore key in the table-level description OPTION:

  • written via OPTIONS(...) on CREATE TABLE and SET OPTIONS(...) on ALTER,
  • read back with a single tables.get call (no query job, ~200ms),
  • tables created outside the engine fall back to BigQuery-column inference.

This removes the need for the separate metadata store entirely.

Changes

  • Drop the _table_metadata store: delete bigquery/metadata.py (BigQueryMetadataStore) and the MetadataStore Protocol from engines/base.py. Encode/parse helpers now live in bigquery/lib.py (table_options_clause / set_table_options_sql / table_to_schema).
  • Write-path refactor (readability): collapse the duplicated insert/merge/update plumbing into _build_dml (builder SQL; ValueError → 400) and _write_rows (run a @rows write; BQ errors → 400); remove _insert_records_sql. Share normalize_pk() / _user_fields() in lib.py. Extract search_sql's COUNT-path selection into _search_sql_count_query.
  • Tests: replace the obsolete test_metadata.py (it tested the deleted store and broke collection) with unit coverage for the native-metadata helpers; update test_tables.py.
  • Also: add PERF_PROFILE_ENABLED config flag; minor ckan_client docstring tidy.

Testing

Notes

  • This branch was built on a pre-merge base; only the schema-in-OPTIONS commit is included here (the duplicate "CodeRabbit review" commit on the old local branch is already on main as cc4cf95).
  • mypy --strict has pre-existing failures across the engine package (bare list/dict generics, Config | None access, WriteResult-as-dict) that this PR neither introduces nor fixes.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Table schemas and resource metadata are now embedded in the storage engine’s native table options, and new resource creation may submit CREATE+INSERT as a single operation.
  • Chores

    • Removed the separate metadata table and centralized schema/DDL handling in the engine for simpler, more consistent behavior.
    • Placeholder (misconfigured) mode now cleanly returns empty/echo responses.
  • Documentation

    • Updated docs to describe the new native-table schema storage.
  • Tests

    • Test suite and test setup adjusted to reflect these changes.

Review Change Stack

…op the _table_metadata store

The Frictionless schema (fields + the `info` data dictionary + primaryKey/unique_key) is now JSON-encoded into each data table's own table-level `description` OPTION under a `datastore` key — written via `OPTIONS(...)` on CREATE and `SET OPTIONS(...)` on ALTER, and read back with a single `tables.get` call (no query job, ~200ms). Tables created outside the engine fall back to BigQuery-column inference. This removes the need for a separate per-resource metadata table.

- Delete bigquery/metadata.py (BigQueryMetadataStore) and the MetadataStore Protocol from engines/base.py; encode/parse helpers now live in bigquery/lib.py (table_options_clause / set_table_options_sql / table_to_schema).
- Drop the `_table_metadata` mention from the BIGQUERY_DATASET config description.

Write-path refactor (readability):
- Collapse the duplicated insert/merge/update plumbing into two helpers — _build_dml (builder SQL; ValueError -> 400) and _write_rows (run a @rows write; BQ errors -> 400); remove _insert_records_sql.
- Share normalize_pk() and _user_fields() in lib.py (used by merge_sql/update_sql/info/_drop_columns).
- Extract search_sql's COUNT-path selection into _search_sql_count_query.

Tests: replace the obsolete test_metadata.py (which tested the deleted store and broke collection) with unit coverage for the native-metadata helpers; update test_tables.py. Full suite: 292 passing, ruff clean.

Also: add PERF_PROFILE_ENABLED config flag (on-demand pyinstrument profiling via X-Profile header); minor ckan_client docstring tidy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

Warning

Review limit reached

@sagargg, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 25 minutes and 57 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: da0e846b-adf2-4533-84e3-c6e77f361d84

📥 Commits

Reviewing files that changed from the base of the PR and between ffd5311 and 503dfcc.

📒 Files selected for processing (12)
  • CLAUDE.md
  • datastore/api/auth.py
  • datastore/api/endpoints/datastore.py
  • datastore/infrastructure/engines/base.py
  • datastore/infrastructure/engines/bigquery/backend.py
  • datastore/schemas/responses.py
  • datastore/services/write.py
  • tests/auth/test_orchestration.py
  • tests/engines/bigquery/test_tables.py
  • tests/test_datastore_create.py
  • tests/test_datastore_delete.py
  • tests/test_datastore_upsert.py
📝 Walkthrough

Walkthrough

Remove the BigQuery MetadataStore and _table_metadata; store Frictionless schemas JSON-encoded in BigQuery table descriptions, read schemas via direct table metadata, refactor BigQuery backend DDL/DML into shared helpers, and update tests to exercise native-metadata helpers and new orchestration.

Changes

BigQuery Metadata Storage Migration

Layer / File(s) Summary
Configuration and Interface Cleanup
datastore/core/config.py, datastore/infrastructure/ckan_client.py, datastore/infrastructure/engines/base.py, tests/conftest.py, CLAUDE.md
Simplify BIGQUERY_DATASET description, trim docstring whitespace, remove MetadataStore protocol from engine interface, default test CKAN_URL, and update docs to reflect table-description metadata storage.
Native Metadata Encoding/Decoding Helpers
datastore/infrastructure/engines/bigquery/lib.py
Add normalize_pk(), _user_fields, table_to_schema(), table_options_clause(), set_table_options_sql(), drop_columns_sql, and other SQL helpers; centralize DDL/DML clause generation and system-field stripping.
Backend Initialization and Schema Read
datastore/infrastructure/engines/bigquery/backend.py
Make BigQuery client optional when config incomplete; add _read_schema(resource_id) using client.get_table() (NotFound→None, other errors→ServerError); update initialization and imports.
DDL & DML Helpers
datastore/infrastructure/engines/bigquery/backend.py
Introduce _create_table_sql and _refresh_table_options; centralize parameterized writes with _rows_job_config, _build_dml, and _write_rows; map builder errors to ValidationError and BigQuery write failures to translated errors.
Resource Declaration & Create Orchestration
datastore/infrastructure/engines/bigquery/backend.py
New resources run combined CREATE TABLE (+ optional INSERT) script; existing resources run ALTER column diffs then SET OPTIONS refresh followed by inserts; create() and upsert() use _read_schema() and placeholder mode is client is None.
Search and Total Count Computation
datastore/infrastructure/engines/bigquery/backend.py
search() uses _read_schema() and returns a placeholder iterator in client==None mode; _search_sql_count_query() chooses INFORMATION_SCHEMA metadata or COUNT(*) and treats COUNT failures as non-fatal.
Delete, Info, and Column Management
datastore/infrastructure/engines/bigquery/backend.py
Full delete drops the BigQuery data table only; _drop_columns issues ALTER DROP COLUMN then refreshes OPTIONS; info() derives primary_key via normalize_pk(schema); dump() maps BigQuery NotFound to NotFoundError.
Test Native Metadata Helpers
tests/engines/bigquery/test_metadata.py
Replace BigQueryMetadataStore SQL/client tests with tests for native helpers: round-trip table description encoding/decoding, set_table_options_sql shape, table_to_schema managed/unmanaged/malformed behavior, and normalize_pk handling.
Test DDL Orchestration & Create Paths
tests/engines/bigquery/test_tables.py
Add tests asserting CREATE TABLE includes table-level OPTIONS with schema, ALTER performs column changes then OPTIONS refresh, combined CREATE+INSERT script for new resources (or CREATE-only when no rows), existing-resource ALTER+OPTIONS no-insert path, and placeholder-mode tests using backend.client = None.
Test Schema Loading, Undeclared Paths & Delete Semantics
tests/engines/bigquery/test_tables.py, tests/test_datastore_dump.py
Update tests to stub _read_schema() instead of metadata.get, assert undeclared-resource returns None and no BigQuery calls, update delete tests to expect DROP TABLE only (no metadata cleanup) or ALTER+SET OPTIONS when dropping columns, and remove explicit backend.metadata=None test setup.

Sequence Diagram

sequenceDiagram
  participant Client
  participant Backend
  participant LibHelpers
  participant BigQuery
  Client->>Backend: create(resource_id, schema, rows)
  Backend->>BigQuery: get_table(resource_id)
  alt exists
    BigQuery-->>Backend: table metadata
    Backend->>LibHelpers: table_to_schema
    LibHelpers-->>Backend: schema
    Backend->>BigQuery: client.query(ALTER statements or INSERT)
  else missing
    BigQuery-->>Backend: NotFound
    Backend->>LibHelpers: _create_table_sql
    LibHelpers-->>Backend: CREATE TABLE with OPTIONS and optional INSERT
    Backend->>BigQuery: client.query(CREATE and optional INSERT)
  end
  BigQuery-->>Backend: job result
  Backend-->>Client: success or error
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • datopian/datastore#1: Introduced the BigQueryMetadataStore and _table_metadata implementation that this PR removes and replaces with native table-description metadata.

Suggested reviewers

  • luccasmmg

Poem

🐰 I wrapped schemas in descriptions so neat,

no shadow table hiding what tables keep.
One read finds the schema, one script writes the rows,
hop on — tidy metadata now grows!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title directly and clearly describes the main change: moving resource schema storage from a separate _table_metadata store to BigQuery table OPTIONS.
Docstring Coverage ✅ Passed Docstring coverage is 85.23% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/bigquery-schema-in-table-options

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

…F_PROFILE_ENABLED

`create_app()` runs at import (module-level `app`), and the auth refactor's validator rejects AUTH_TYPE=ckan with an empty CKAN_URL — so `pytest` failed at collection in CI (no .env, no CKAN_URL env var). conftest now sets a dummy CKAN_URL in os.environ before importing datastore.main (pydantic prioritises env over .env); tests override the CKAN client via DI, so it is never contacted. This had left main red since the auth/bigquery merge.

Also drop the unused PERF_PROFILE_ENABLED config field (no consumer; the profiling middleware that would read it isn't in this branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
datastore/infrastructure/engines/bigquery/backend.py (1)

78-96: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't make missing BigQuery config fall through to placeholder CRUD.

Leaving client=None here makes the no-op/empty-data branches below reachable for ordinary misconfiguration. That means create / upsert / delete can acknowledge success without persisting anything, while search / info fabricate empty results. Please fail fast here, or gate placeholder mode behind an explicit test-only switch.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@datastore/infrastructure/engines/bigquery/backend.py` around lines 78 - 96,
The current client-builder returns early with self.client left as None, which
allows normal CRUD methods (create, upsert, delete, search, info) to silently
run in placeholder/no-op mode on misconfiguration; change the build logic in the
BigQueryBackend client construction method so it fails fast instead of falling
through: either raise a clear configuration error (e.g., RuntimeError) when
BIGQUERY_PROJECT or BIGQUERY_DATASET are missing, or explicitly set a guarded
placeholder flag (e.g., self._placeholder_mode) that is only enabled via an
explicit test-only config switch and is checked by
create/upsert/delete/search/info; update the code paths that currently assume
client may be None to rely on the explicit flag or to never be reached because
an exception was raised.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@datastore/infrastructure/engines/bigquery/backend.py`:
- Around line 105-126: The dump() implementation still calls
self.metadata.get(...), but self.metadata was removed; update dump() to call the
existing _read_schema(resource_id) for the existence check instead: replace the
self.metadata.get(...) call in dump() with schema =
self._read_schema(resource_id) and if schema is None raise the same not-found
error type used elsewhere in this module (e.g., ResourceNotFound or equivalent)
or return early, then proceed using the schema/result; this removes the
dependency on self.metadata and reuses _read_schema(resource_id) for presence
checking.
- Around line 105-126: Several public and helper signatures in BigQuery backend
use bare generics (e.g., `_read_schema(...)-> dict | None`, variables `schema:
dict`, `records: list`, and return types like `-> tuple[str | None, list]`)
which fails mypy --strict; update these to parameterized container types: use
dict[str, Any] for schema maps, list[dict[str, Any]] for record lists, and
explicit element types for any parameter lists (or list[Any] if heterogeneous).
Locate `_read_schema`, plus the functions/variables around the noted ranges
(including create, upsert, search and any helper that returns (maybe_resource,
records)), and replace bare dict/list annotations with the concrete typed forms
(import Any from typing if needed) so all signatures and local variable type
hints are parameterized. Ensure return annotations (e.g., Optional[dict[str,
Any]] or tuple[str | None, list[dict[str, Any]]]) match the new types.
- Around line 173-178: The search() method currently calls
self.client.query(...) for both COUNT and page queries outside of the
_run_query() wrapper so submit-time exceptions aren't converted to ServerError;
update search() to wrap each self.client.query(...) call (both the COUNT submit
and the page/data query submit) in the same try/except logic used by
_run_query() or call _run_query() for submission so any exception from
client.query is caught and re-raised as ServerError (use the same op +
resource_id format), and ensure search_sql() remains consistent by treating the
non-fatal COUNT submit the same way.

In `@datastore/infrastructure/engines/bigquery/lib.py`:
- Around line 518-529: The fallback schema returned by _infer_schema_from_bq
currently uses raw BigQuery names (e.g., "INT64", "STRING") which downstream
helpers like table_to_schema() / delete_sql() and _FILTER_PARAM_TYPE don't
recognize; update _infer_schema_from_bq to map col.field_type to the module's
canonical Frictionless types before returning (e.g., use the existing
BigQuery->Frictionless mapping or add one in this module) so the returned
{"fields": ...} uses the normalized type names that _FILTER_PARAM_TYPE expects,
or alternatively extend the downstream type map to accept both representations.

---

Outside diff comments:
In `@datastore/infrastructure/engines/bigquery/backend.py`:
- Around line 78-96: The current client-builder returns early with self.client
left as None, which allows normal CRUD methods (create, upsert, delete, search,
info) to silently run in placeholder/no-op mode on misconfiguration; change the
build logic in the BigQueryBackend client construction method so it fails fast
instead of falling through: either raise a clear configuration error (e.g.,
RuntimeError) when BIGQUERY_PROJECT or BIGQUERY_DATASET are missing, or
explicitly set a guarded placeholder flag (e.g., self._placeholder_mode) that is
only enabled via an explicit test-only config switch and is checked by
create/upsert/delete/search/info; update the code paths that currently assume
client may be None to rely on the explicit flag or to never be reached because
an exception was raised.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7bf7b408-d04a-44e1-9d06-a80243d01326

📥 Commits

Reviewing files that changed from the base of the PR and between fdd3dc3 and 503381c.

📒 Files selected for processing (8)
  • datastore/core/config.py
  • datastore/infrastructure/ckan_client.py
  • datastore/infrastructure/engines/base.py
  • datastore/infrastructure/engines/bigquery/backend.py
  • datastore/infrastructure/engines/bigquery/lib.py
  • datastore/infrastructure/engines/bigquery/metadata.py
  • tests/engines/bigquery/test_metadata.py
  • tests/engines/bigquery/test_tables.py
💤 Files with no reviewable changes (1)
  • datastore/infrastructure/engines/bigquery/metadata.py

Comment thread datastore/infrastructure/engines/bigquery/backend.py
Comment thread datastore/infrastructure/engines/bigquery/backend.py
Comment thread datastore/infrastructure/engines/bigquery/lib.py
Comment on lines +532 to +552
def table_options_clause(schema: dict[str, Any]) -> str:
"""Return ` OPTIONS(description = '…', labels = […])` for a table DDL."""
desc = _encode_table_description(schema)
return (
f" OPTIONS(description = {_sql_literal(desc)}, "
f"labels = [(\"datastore_managed\", \"true\")])"
)


def set_table_options_sql(table_ref: str, schema: dict[str, Any]) -> str:
"""Stand-alone `ALTER TABLE … SET OPTIONS(...)` statement.

Separate from column ALTERs because BQ refuses to mix column
actions and SET OPTIONS in one statement.
"""
desc = _encode_table_description(schema)
return (
f"ALTER TABLE {table_ref} "
f"SET OPTIONS(description = {_sql_literal(desc)}, "
f"labels = [(\"datastore_managed\", \"true\")])"
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

This metadata writer still overwrites unrelated table metadata.

_encode_table_description() and set_table_options_sql() synthesize description and labels from schema alone, so every refresh drops any existing human-authored description content and non-datastore labels. That contradicts the preservation claim at Lines 41-44, and unless the backend contract is being updated in lockstep, it also moves schema metadata off the repo's required _table_metadata round-trip path. As per coding guidelines, "Store field schema metadata (info dict with title, description, comment, example, unit, custom keys) verbatim via _table_metadata table in BigQuery; round-trip via datastore_info."

Also applies to: 566-575

sagargg added 4 commits May 27, 2026 15:30
- dump(): drop the dead `self.metadata` existence check (the metadata store was removed in this PR, so it would AttributeError on the first dump). Rely on the existing get_table call, mapping NotFound -> 404. [Critical]
- search(): move the COUNT + page query submits inside the try so submit-time failures map to ServerError, not raw google exceptions. [Major]
- _infer_schema_from_bq(): map BigQuery column types to canonical Frictionless names so downstream filter type maps work for unmanaged tables. [Major]
- lib.py: correct the table-description note — the description is engine-owned and rewritten wholesale; manual edits / non-datastore labels are NOT preserved. [Major]
- CLAUDE.md: update the stale `_table_metadata` references to the native table-level metadata design (schema stored in the table's own `description` OPTION).
- tests: expect Frictionless types for unmanaged-table inference; drop the now-dead `backend.metadata` setup.
When datastore_delete drops columns (the `fields` path), the response now
includes `result.schema` — the Frictionless Table Schema after the listed
columns were removed — so callers can confirm the table's new shape without a
follow-up datastore_info.

Threaded through the existing `new_schema` that `_drop_columns` already builds:
_drop_columns now returns it, delete() carries it on WriteResult.schema, and the
service maps it onto the response. Row-delete and whole-table-drop paths leave
schema unset (omitted from the envelope).
…y force guard

- datastore_create's resource-dict path now sends `url_type="datastore"` in the
  resource_create payload, so the new CKAN resource is marked datastore-managed.
- datastore_create / _upsert / _delete refuse to write a CKAN resource whose
  record carries `url_type="datastore"` unless the request sets `force: true`,
  raising a Validation Error ("Cannot update a read-only resource. Use \"force\"
  to force update."). Mirrors CKAN's guard against clobbering managed data;
  no-op under non-CKAN auth (no resource record).
- Guard lives in api/auth.py (ensure_resource_writable), called after authorize
  in each write endpoint.
The guard already no-op'd under non-CKAN auth (only the CKAN provider returns a
resource record with url_type), but make the scope explicit: ensure_resource_writable
now takes auth_type and returns early unless it's 'ckan', so a custom non-CKAN
provider that happens to surface a url_type can't trip it. Adds unit tests for the
gate (ckan blocks/forces, anonymous/jwt skip, non-datastore url_type ignored).
@sagargg sagargg merged commit d9a3430 into main May 27, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant