Skip to content

fix(db): tolerate missing workers column in benchmark read queries#407

Open
arygupt wants to merge 1 commit into
masterfrom
fix/benchmarks-workers-column-tolerant
Open

fix(db): tolerate missing workers column in benchmark read queries#407
arygupt wants to merge 1 commit into
masterfrom
fix/benchmarks-workers-column-tolerant

Conversation

@arygupt
Copy link
Copy Markdown
Collaborator

@arygupt arygupt commented May 29, 2026

Problem

Production error rate on /api/v1/benchmarks spiked to ~63% immediately after #405 merged (Vercel error chart). Root cause:

  • feat(power): measured-power multinode support (workers[] + per-stage joules) #405 changed the read queries to SELECT ... br.workers / lb.workers — a column added by migration 006_benchmark_results_workers.sql.
  • Migrations in this repo are applied manually (pnpm admin:db:migrate), separately from the Vercel app deploy. The app deploy shipped the new query, but migration 006 was not applied to prod, so the column doesn't exist there.
  • Postgres fails to plan a query that names a non-existent column (column "workers" does not exist), so the statement throws and the route returns 500.

Why ~63% and not 100%: /api/v1/benchmarks is wrapped in cachedQuery({ blobOnly: true }). Cache hits return pre-#405 blobs (200); only cache misses run the SQL → 500. The Neon panel shows 0% connection errors because the connection succeeds — only the query fails.

Fix

Read workers via to_jsonb(row) -> 'workers' in all three read paths:

  • getLatestBenchmarks — base-table (date) branch
  • getLatestBenchmarks — materialized-view (no-date) branch
  • getAllBenchmarksForHistory

A JSON key lookup is a runtime operation that returns NULL for an absent key, instead of failing at parse time like a bare column reference. So the endpoint keeps serving every other field whether or not migration 006 has been applied. Behavior is identical once the column exists (jsonb in → jsonb out).

This makes the read path resilient to migration/deploy ordering — a forgotten or not-yet-run migration degrades to "workers is null" instead of taking down the endpoint.

Note on the operational fix

This PR stops the 500s on its own. Migration 006 should still be applied to prod (pnpm admin:db:migrate) so the measured-power workers data is actually populated — and the cache purged afterward. This change just removes the hard dependency on migration-before-deploy ordering.

Tests

  • New packages/db/src/queries/benchmarks.test.ts — recording-mock DbClient asserts none of the three read queries reference the workers column directly (regression guard).
  • pnpm typecheck, pnpm lint, pnpm fmt, and full db suite (226 tests) pass.

Note

Low Risk
Targeted SQL selection change on read-only benchmark queries plus unit tests; no auth or write-path changes, with identical results once migration 006 exists.

Overview
Fixes production 500s on /api/v1/benchmarks when the app ships before migration 006 adds the workers column: bare br.workers / lb.workers in SELECT fails at plan time, while to_jsonb(row) -> 'workers' still plans and yields null until the column exists.

All three benchmark read paths in packages/db/src/queries/benchmarks.ts are updated—getLatestBenchmarks (date-filtered base table and materialized-view branches) and getAllBenchmarksForHistory. Behavior is unchanged once migration 006 is applied.

Adds benchmarks.test.ts with a recording mock DbClient so generated SQL must use the to_jsonb form and must not match \b(br|lb)\.workers\b.

Reviewed by Cursor Bugbot for commit d9f2a93. Bugbot is set up for automated code reviews on this repo. Configure here.

#405's read queries select the `workers` column added by migration 006.
Migrations are applied manually (pnpm admin:db:migrate), separately from the
Vercel app deploy, so the deploy shipped a query referencing a column the
production DB didn't have yet. Postgres fails to plan a bare reference to a
non-existent column ("column \"workers\" does not exist"), so every cache-miss
request to /api/v1/benchmarks 500'd — surfacing as the ~63% error rate after
#405 (cache hits returned stale blobs; only misses hit the DB).

Read `workers` via `to_jsonb(row) -> 'workers'` in all three read paths
(getLatestBenchmarks base-table + view branches, getAllBenchmarksForHistory).
JSON key lookup is a runtime op that returns NULL for an absent key instead of
failing at parse time, so the endpoint keeps serving every other field whether
or not migration 006 has run. Identical behavior once the column exists.

Adds a regression test asserting the read queries never reference the workers
column directly.
@arygupt arygupt requested a review from adibarra as a code owner May 29, 2026 19:00
@vercel
Copy link
Copy Markdown

vercel Bot commented May 29, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
inferencemax-app Ready Ready Preview, Comment May 29, 2026 7:01pm

Request Review

@claude
Copy link
Copy Markdown

claude Bot commented May 29, 2026

Claude finished @arygupt's task in 1m 6s —— View job


Claude Code Review

  • Read changed files (benchmarks.ts, benchmarks.test.ts)
  • Verify approach against migration 006 + materialized view
  • Check test coverage
  • Post review

🟢 LGTM — no blocking issues found

The fix correctly addresses the parse-time column lookup problem. Verified against migrations/006_benchmark_results_workers.sql:

  • Migration 006 adds workers to benchmark_results and drops/recreates latest_benchmarks with br.* — so without 006 applied, neither br.workers nor lb.workers exists. The PR correctly patches all three read paths.
  • to_jsonb(row) -> 'workers' is the right primitive: row-to-jsonb is a runtime op and -> returns NULL for missing keys (vs. bare column refs which fail at parse time). Behavior matches br.workers once the column exists (jsonb in → jsonb out).
  • Per-row to_jsonb(br) adds some overhead in the history path, but it's negligible compared to the existing per-row work and not worth optimizing for a resilience fix.
  • The regression test correctly reconstructs the SQL template by joining static segments with ? and asserts both the new form is present and the bare form is absent — covers all three branches (date / no-date / history).
  • Comments in benchmarks.ts:93-96, 136, 184 explain the why (migration ordering, parse-time vs. runtime), which is exactly the non-obvious context that belongs in code.

Operationally, as the PR body notes: migration 006 still needs to be applied to populate the column. This change just removes the hard ordering dependency between migration and deploy.
• Branch: fix/benchmarks-workers-column-tolerant

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant