Feat (dump): /datastore/dump/<rid> with 302-or-stream-concat download#2
Conversation
- Drop JWT `jti` cache-key derivation in default_key_id — an unverified `jti` could collide a forged token with a real user's cached decision. Always sha256 the full credential. - CKAN provider: treat a corrupt cache entry as a miss (fall back to CKAN) instead of failing the request. - CKAN provider: store the hashed key in Decision.subject instead of the raw credential, so the cache never carries plaintext keys. - Postman generator: fix OUT_FILE typo (`ollection.json` → `postman/collection.json`), URL-encode query-string values, and fix the docstring's run-from-root path. - README: document `BIGQUERY_DATASET` in the env-var table. - Tests: the "without API key" assertions actually drop the Authorization header now (the conftest fixture installs one by default). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements `GET /datastore/dump/{resource_id}?format=csv|ndjson|parquet`
as a single user-facing URL that yields one file regardless of how
BigQuery shards the underlying EXPORT DATA job.
Pipeline (engine):
1. Read `table.modified`, derive a content-addressed prefix
`dumps/<rid>/<fmt>/<rev>` where rev = hex(microsec epoch).
2. List GCS at that prefix — non-empty → cache hit, skip the export.
3. On miss, submit `EXPORT DATA` (Parquet → single URI, CSV/NDJSON →
wildcard URI so BQ shards >1 GB exports).
4. After a successful extract, sweep stale revisions for the (rid, fmt)
pair — storage is bounded to one rev per format.
5. Generate V4 signed URLs with `response-content-disposition` so the
downloaded filename is `<rid>.<ext>` regardless of GCS object name.
Why asyncio (and not threads / sync httpx):
- **One URL → one file** is the user-visible contract. For shards
>1 GB BQ refuses single-file CSV/NDJSON, so we have to stream-concat
on the server. The byte path is GCS → us → client.
- With sync httpx + StreamingResponse FastAPI would offload each shard
read to the threadpool. With multi-GB dumps that means a worker
thread held for the entire transfer — a handful of concurrent dumps
starves the pool and blocks unrelated requests.
- Async httpx + `aiter_bytes(64 KiB)` keeps the whole flow on the
event loop: ~64 KiB resident per active dump, zero threads, and
client disconnect cancels the coroutine which propagates through
httpx to release the GCS connection immediately.
- Same reasoning drives `asyncio.to_thread(job.reload)` + `await
asyncio.sleep(...)` in the engine's poll loop — the worker is held
only for the brief `reload` round-trip, not during the wait.
CSV stream-concat strips the duplicate header off non-first shards via
`_skip_first_line` (memory bound: header bytes + one chunk). NDJSON is
pure byte concatenation. Parquet >1 GB is refused upstream with 413
(footers can't be concatenated); caller switches to CSV/NDJSON.
IAM follows the ro-for-reading, rw-for-writing split documented in
CLAUDE.md §5.3 — `BIGQUERY_CREDENTIALS_RO` for `get_table` +
`list_blobs` cache lookup, `BIGQUERY_CREDENTIALS` for
`EXPORT DATA` + post-extract list + signing + GC.
Includes 592-line test suite covering single-shard redirect,
multi-shard stream-concat (CSV header-dedup, NDJSON concat), cache
hit/miss, stale-rev GC, Parquet >1 GB rejection, and IAM split.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (24)
📝 WalkthroughWalkthroughThis PR adds a new ChangesDump Export Feature & Foundational Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary
GET /datastore/dump/{resource_id}?format=csv|ndjson|parquet— one URL → one file from the caller's POV.EXPORT DATAproduces shards in GCS; the endpoint stream-concats them through async httpx (StreamingResponse) so peak memory ≈ one 64 KiB chunk and no worker threads are held.413 PayloadTooLargeError(Parquet footers can't be byte-concatenated); caller switches to CSV/NDJSON.table.modified(dumps/<rid>/<fmt>/<rev>); a GCSlist_blobslookup short-circuits re-export. After a successful extract, stale revisions for the (rid, fmt) pair are swept so storage is bounded to one rev per format.BIGQUERY_CREDENTIALS_ROfor metadata reads + cache lookup,BIGQUERY_CREDENTIALSfor the export job, post-extract list, signing, and GC.Why asyncio (not threads / sync httpx)
aiter_bytes(64 KiB)keeps the whole flow on the event loop: ~64 KiB resident per active dump, zero threads, and client disconnect cancels the coroutine which propagates through httpx → GCS connection released immediately.asyncio.to_thread(job.reload)+await asyncio.sleep(...)in the engine's poll loop — the worker is held only during the briefreloadRPC, not during the wait.Test plan
response-content-disposition._skip_first_line).EXPORT DATAand sweeps prior revs.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
/datastore/dump/{resource_id}endpoint for exporting resources in CSV, NDJSON, or Parquet formats with GCS caching and multi-shard streaming supportImprovements
Configuration
BIGQUERY_EXPORT_BUCKET,BIGQUERY_EXPORT_URL_EXPIRY_HOURSDependencies