feat(model-eval-ingest): [WIP] cloud-side worker for bench → kiloBench ingest #3028
Draft
lambertjosh wants to merge 1 commit into main
DRAFT / WIP — do not merge until the bench-side promotion flow is
finished and end-to-end tested against this worker. This PR is opened
to preserve the cloud-side work in progress; the kilo-bench side is
still being built (see kilo-bench/.plans/dashboard-v2.md Track 2).
Adds:
- `model_eval_ingest` table: immutable append-only record of eval
  results promoted from bench.s1lv.com. Primary partitioning key is
  (provider, model, variant, task_source); a later row supersedes by
  `promoted_at` at query time.
- `KiloBenchEvalSchema` + extended `ModelStatsBenchmarksSchema.kiloBench`:
  denormalised read cache for public model pages. Only public-safe
  fields (no bench URL, no ingest id, no promoter email).
- `services/model-eval-ingest/` Cloudflare Worker:
  - HMAC-SHA-256 signature verification (`X-Ingest-Signature` +
    `X-Ingest-Timestamp`, 5-min skew), portable constant-time compare.
  - `POST /api/model-eval-ingest/submit`: insert + recompute the
    `modelStats.benchmarks.kiloBench` JSONB via a
    DISTINCT-ON-by-timestamp rollup (JSONB merge preserves other
    benchmark sources).
  - `GET /api/model-eval-ingest/:id` and `/latest/:model` for bench-side
    status + delta preview.
  - 33 vitest unit tests (19 HMAC middleware + 14 submission schema).
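For reviewers, the signature check described in the list above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the signed payload format (`<timestamp>.<rawBody>`), the Unix-seconds timestamp, and all function names are assumptions, and the HMAC itself uses `node:crypto` for brevity. Only the hand-rolled comparison is meant to show the "portable constant-time compare" idea.

```typescript
import { createHmac } from "node:crypto";

const MAX_SKEW_MS = 5 * 60 * 1000; // 5-minute skew window, as stated above

// Parse a hex string into bytes; returns null on malformed input.
function hexToBytes(hex: string): Uint8Array | null {
  if (hex.length === 0 || hex.length % 2 !== 0 || !/^[0-9a-fA-F]+$/.test(hex)) {
    return null;
  }
  const out = new Uint8Array(hex.length / 2);
  for (let i = 0; i < out.length; i++) {
    out[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return out;
}

// Constant-time comparison with no runtime-specific helper: every byte is
// visited and differences are OR-ed together, so timing does not reveal
// the position of the first mismatch.
function constantTimeEqual(a: Uint8Array, b: Uint8Array): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) diff |= a[i] ^ b[i];
  return diff === 0;
}

// ASSUMPTION: the signed payload is "<timestamp>.<rawBody>", hex-encoded.
function sign(secret: string, timestamp: string, rawBody: string): string {
  return createHmac("sha256", secret).update(`${timestamp}.${rawBody}`).digest("hex");
}

// ASSUMPTION: the timestamp header carries Unix seconds.
function verify(
  secret: string,
  signatureHex: string,
  timestamp: string,
  rawBody: string,
  nowMs: number,
): boolean {
  const tsSeconds = Number(timestamp);
  if (!Number.isFinite(tsSeconds)) return false;
  if (Math.abs(nowMs - tsSeconds * 1000) > MAX_SKEW_MS) return false; // skew in either direction
  const provided = hexToBytes(signatureHex);
  if (provided === null) return false;
  const expected = hexToBytes(sign(secret, timestamp, rawBody));
  return expected !== null && constantTimeEqual(provided, expected);
}
```

Passing `nowMs` explicitly keeps the skew check deterministic in unit tests.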
Operational notes for reviewers:
- `MODEL_EVAL_INGEST_SECRET_PROD` and `_DEV` secrets need to be
created in the shared Secrets Store (342a86d9e3a94da698e82d0c6e2a36f0)
before deploy. Rotate by bumping both sides simultaneously.
- No admin UI yet (`/admin/model-eval-ingest` pages); tracked as Phase
2 of the plan.
- No public `KiloBenchSection` component yet in kilocode-landing;
tracked as Phase 3 of the plan.
- No bench-side promotion endpoint yet; tracked as Phase D of the
kilo-bench rebuild (Track 2).
Verification run locally against origin/main (0c8c06b):
pnpm --filter cloudflare-model-eval-ingest test → 33 / 33 pass
pnpm --filter cloudflare-model-eval-ingest typecheck → clean
pnpm --filter @kilocode/db typecheck → clean
pnpm --filter web typecheck → 0 errors
pnpm --filter web test -- schema.test.ts → 6 / 6 pass
Design: `.plans/dashboard-v2.md` in the kilo-bench repo.
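The latest-wins rule for the append-only table can be illustrated in memory. The PR describes doing this in SQL with `DISTINCT ON … ORDER BY promoted_at DESC`; the sketch below only emulates the same semantics in TypeScript for one (provider, model, variant) tuple. The row shape and field names are hypothetical.

```typescript
// Hypothetical, simplified row shape; the real table has more columns.
interface IngestRow {
  id: string;
  taskSource: string;
  promotedAt: string; // ISO-8601 timestamp
  successRate: number;
}

// Keep, for each task source, only the row with the greatest promotedAt.
// Nothing is updated or deleted: older rows are simply ignored at read time.
function latestPerTaskSource(rows: IngestRow[]): IngestRow[] {
  const latest = new Map<string, IngestRow>();
  for (const row of rows) {
    const current = latest.get(row.taskSource);
    if (current === undefined || Date.parse(row.promotedAt) > Date.parse(current.promotedAt)) {
      latest.set(row.taskSource, row);
    }
  }
  return Array.from(latest.values());
}

const exampleRows: IngestRow[] = [
  { id: "a", taskSource: "suite-1", promotedAt: "2025-01-01T00:00:00Z", successRate: 0.5 },
  { id: "b", taskSource: "suite-1", promotedAt: "2025-02-01T00:00:00Z", successRate: 0.7 },
  { id: "c", taskSource: "suite-2", promotedAt: "2025-01-15T00:00:00Z", successRate: 0.9 },
];
const winners = latestPerTaskSource(exampleRows);
```

Here `winners` contains rows `b` and `c`; row `a` stays in the table but no longer contributes to the rollup.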
Summary
Cloud-side plumbing for a new promotion workflow: human-reviewed eval results from bench.s1lv.com (internal, oauth2-proxy gated) get ingested into the public `modelStats.benchmarks.kiloBench` so kilo.ai/models/[slug] can show real-world benchmark numbers (success rate, avg cost per task, avg token usage) alongside the existing Artificial Analysis scores.

The design explicitly keeps bench and cloud separate: normal users never access bench, and internal identifiers (bench URLs, ingest IDs, promoter emails) stay out of the public JSONB.
What's in this PR
Schema (`packages/db/src/schema.ts`)
- `model_eval_ingest` table: immutable append-only record of every promotion act. Partitioned by `(provider, model, variant, task_source)`; latest wins at query time via `DISTINCT ON … ORDER BY promoted_at DESC` (no `is_active` flag, no supersede links).
- `KiloBenchEvalSchema` Zod type and `kiloBench` extension on `ModelStatsBenchmarksSchema`. Only public-safe fields: `successRate`, `avgCostUsd`, `avgInputTokens?`, `avgOutputTokens?`, `avgCacheReadTokens?`, `avgExecutionMs?`, `nTrials`, `lastPromotedAt`. No bench URL, no ingest id, no promoter email.

Migration (`0109_watery_freak.sql`)
- Generated by `pnpm drizzle generate` against current main; contains only the new table + indexes.

Worker (`services/model-eval-ingest/`)
- Follows the `session-ingest` / `webhook-agent-ingest` template (same `-ingest` naming convention). Single Hono app, single Hyperdrive binding, one Secrets Store secret.
- `src/middleware/hmac-auth.ts`: HMAC-SHA-256 signature verification (`X-Ingest-Signature` + `X-Ingest-Timestamp`, 5-min skew window), portable constant-time compare (works under both Workers runtime and Node for unit tests).
- `src/routes/api.ts`
  - `POST /api/model-eval-ingest/submit`: validate + insert + recompute the `modelStats.benchmarks.kiloBench` JSONB. Rollup uses `DISTINCT ON (task_source) … ORDER BY promoted_at DESC`; JSONB `||` merge preserves `artificialAnalysis` and other benchmark sources.
  - `GET /api/model-eval-ingest/latest/:model`: bench-side delta preview.
  - `GET /api/model-eval-ingest/:id`: bench-side status / admin audit.

Tests (33 passing)
- 19 HMAC middleware (hex parsing, determinism, missing/invalid headers, skew in both directions, wrong secret, missing cloud secret, happy path).
- 14 submission schema (valid shapes, invalid URL/email, n_trials=0, negative cost, absurd success_rate, float tokens, zero-cost-is-fine).
Verification
Ran locally against `origin/main` (`0c8c06b40`).

Pre-push hook (lint + typecheck) passes without `--no-verify`.

Visual Changes
N/A, no UI in this PR. The public `KiloBenchSection` component for `kilocode-landing` model pages is a separate follow-up.

Reviewer Notes
Still needed before this can ship to production:
- `MODEL_EVAL_INGEST_SECRET_PROD` and `_DEV` need to be created in the shared Secrets Store (342a86d9e3a94da698e82d0c6e2a36f0) before first deploy. Rotate by bumping both sides simultaneously.
- `/admin/model-eval-ingest` list + detail pages (Phase 2). Can land in a separate PR after this one.
- `KiloBenchSection` on `kilocode-landing/src/app/models/[...slug]/page.tsx` (Phase 3). Same: separate PR.

Risk areas:
- The `||` merge in `recomputeKiloBench` relies on the `kiloBench` key being a single top-level field. If someone later adds a `kiloBench.evals` cross-reference that needs deeper merging, that's the place to revisit.
- `variant` is nullable. The `DISTINCT ON` SQL uses a conditional `variant IS NULL` vs `variant = $variant` branch. Covered by the plan's latest-wins-per-tuple semantics but worth a second set of eyes.
- The HMAC middleware reads `c.req.text()` exactly once and caches it on `c.var.rawBody`. Downstream handlers MUST use `c.var.rawBody` + `JSON.parse` rather than `c.req.json()`, because they won't be able to re-read the stream. The current handler does this correctly; any future handler on the same middleware needs to as well.

Rollout order:
Design doc: `.plans/dashboard-v2.md` in the kilo-bench repo.
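The shallow-merge caveat listed under Risk areas can be illustrated outside the database. PostgreSQL's `jsonb || jsonb` on two objects replaces top-level keys wholesale, and object spread in TypeScript behaves the same way, so the sketch below uses spread as a stand-in. The JSON shapes and values are made up for the example.

```typescript
type JsonObject = Record<string, unknown>;

// Stand-in for Postgres `existing || patch` on two JSONB objects:
// keys from `patch` overwrite same-named top-level keys in `existing`,
// and nested objects are NOT merged recursively.
function shallowMerge(existing: JsonObject, patch: JsonObject): JsonObject {
  return { ...existing, ...patch };
}

// Hypothetical current value of the benchmarks JSONB.
const benchmarks: JsonObject = {
  artificialAnalysis: { score: 42 },
  kiloBench: { successRate: 0.5, evals: ["old-cross-reference"] },
};

// Recompute writes a fresh kiloBench object.
const merged = shallowMerge(benchmarks, { kiloBench: { successRate: 0.8 } });

console.log(JSON.stringify(merged));
```

`artificialAnalysis` survives untouched, which is the behaviour the PR relies on. The hypothetical nested `evals` key is dropped, because the whole `kiloBench` value is replaced rather than merged.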