feat(telemetry): Cloudflare Worker + KV backend for opt-in relay telemetry (#34 infra) #42
Draft
Narrator wants to merge 1 commit into
…metry

Implements the infra half of #34 per RFC 0002 §(4): `POST https://telemetry.domscribe.dev/v1/session` stores one record per (ISO week, session_id) in Cloudflare KV with a 12-week TTL. A strict Zod schema rejects any unknown field; that is the privacy guarantee.

Lives under `infra/telemetry-worker/` (outside `packages/*`) so it does not touch the protocol package's SemVer surface or the release flow.

Key design choices:
- `ctx.waitUntil` for fire-and-forget, matching the relay client contract (issue-34-client). The response returns 204 in <50ms; the KV write completes in the background.
- Key partitioned by ISO week so the WAU readout (`GET /v1/wau/:week`) is a single prefix scan. The session_id is the partition, not a stored field, so duplicate POSTs are idempotent overwrites without read-modify.
- 1 KB body cap enforced two ways (declared content-length + streaming reader) so an oversized body cannot exhaust the worker before the cap fires.
- WAU readout gated on a constant-time token comparison; disabled (404) when the secret is unset.
- Rollback target at `rollback-vercel/` ports the same behaviour to a Vercel Edge Function backed by Vercel KV; activate per sprint 2491 replanning trigger #2 if Cloudflare provisioning blocks > 1 day.

Coverage:
- 26 tests pass under @cloudflare/vitest-pool-workers (real workerd runtime + in-memory KV). They cover schema strictness, body cap, KV write semantics, idempotency, routing, read-endpoint auth, and ISO-week boundary cases.
- Load-test harness at `scripts/load-test.ts` drives the live endpoint with concurrent POSTs and asserts p95 + error-rate ceilings. README documents the headroom calculation: even at 100k WAU we are 14x under the 1000 writes/s KV ceiling.

The deployment runbook (README.md) walks through wrangler login, KV provisioning, secret seeding, custom-domain binding, and smoke-test curl commands. It is operator-readable end-to-end.
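The ISO-week key partitioning described above can be sketched as follows. This is a minimal illustration, not the Worker's actual code: the function names `isoWeekLabel` and `sessionKey` and the `<week>:<session_id>` key layout are assumptions made for the example; only the `YYYY-Www` week format comes from the read endpoint's route.

```typescript
// Sketch only: names and key layout are hypothetical, not taken from the PR.

// Returns the ISO 8601 week label (for example "2026-W34") of a UTC instant.
function isoWeekLabel(date: Date): string {
  // Work on a UTC-midnight copy so the time of day cannot shift the result.
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  // ISO weekday: Monday = 1 ... Sunday = 7.
  const weekday = d.getUTCDay() || 7;
  // Move to the Thursday of the same ISO week; its calendar year is the ISO week-year.
  d.setUTCDate(d.getUTCDate() + 4 - weekday);
  const isoYear = d.getUTCFullYear();
  const dayOfYear = Math.floor((d.getTime() - Date.UTC(isoYear, 0, 1)) / 86400000) + 1;
  const week = Math.ceil(dayOfYear / 7);
  return `${isoYear}-W${String(week).padStart(2, "0")}`;
}

// Assumed key layout: one key per (ISO week, session_id), so a repeated POST
// in the same week overwrites the same key and a week is one key prefix.
function sessionKey(now: Date, sessionId: string): string {
  return `${isoWeekLabel(now)}:${sessionId}`;
}
```

With a layout like this, a weekly readout only needs to list keys under the `YYYY-Www:` prefix, and the year-boundary cases (a Sunday in early January belonging to the previous ISO year, for instance) are handled by the Thursday shift.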
Out of scope for this PR (delivered by issue-34-client):
- Relay startup POST and 24h interval
- `.domscribe/config.json` `telemetry.enabled` flag
- `npx domscribe init` opt-in prompt

Refs: RFC 0002 §(4), sprint 2491 issue-34-infra, falsifier signal (c)
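The commit message mentions a constant-time token comparison gating the WAU readout, with the endpoint disabled when the secret is unset. The sketch below shows one common way to write such a gate. The names `timingSafeEqualStrings` and `gateWauRead` and the three-state return value are hypothetical; how the Worker responds to a wrong token is not stated in this PR, so the example deliberately does not pick a status code for that case.

```typescript
// Sketch only: function names and the result type are assumptions.

// Compares two strings without exiting early at the first differing byte.
function timingSafeEqualStrings(a: string, b: string): boolean {
  const enc = new TextEncoder();
  const x = enc.encode(a);
  const y = enc.encode(b);
  // Fold a length mismatch into the result instead of returning early.
  let diff = x.length ^ y.length;
  const n = Math.max(x.length, y.length);
  for (let i = 0; i < n; i++) {
    // Out-of-range bytes count as 0, so the loop always runs n iterations.
    diff |= (x[i] ?? 0) ^ (y[i] ?? 0);
  }
  return diff === 0;
}

type GateResult = "disabled" | "denied" | "ok";

// "disabled" corresponds to the 404-when-unset behaviour described in the PR.
function gateWauRead(secret: string | undefined, presented: string | null): GateResult {
  if (!secret) return "disabled";
  if (presented === null) return "denied";
  return timingSafeEqualStrings(secret, presented) ? "ok" : "denied";
}
```

The Workers runtime also exposes a non-standard `crypto.subtle.timingSafeEqual`, which may be preferable to a hand-rolled loop where it is available.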
## Summary

Implements the infra half of #34 per RFC 0002 §(4). Companion client work (relay startup POST, `.domscribe/config.json` flag, `npx domscribe init` prompt) is tracked separately as issue-34-client.

This PR delivers the server-side endpoint the relay client will POST to:

- `POST /v1/session`: single fire-and-forget endpoint; returns `204` on success, with strict-schema validation, a 1 KB body cap, and `ctx.waitUntil` for the background KV write
- `GET /v1/wau/<YYYY-Www>`: operator-only read of the WAU readout the DOP falsifier needs (≥10 weekly-active sessions by 2026-08-20). Gated on a constant-time token check; returns `404` when the secret is unset.
- `GET /healthz`: liveness check

Lives under `infra/telemetry-worker/` (outside `packages/*`) so it does not touch the protocol package's SemVer surface or the release flow.

## Acceptance: sprint 2491 issue-34-infra
- `wrangler.toml` declares the `SESSIONS` KV binding; the deployment runbook in the README walks the operator through `wrangler login` → `wrangler kv:namespace create` → `wrangler deploy` → smoke `curl`. The production `id` and `preview_id` are placeholder values to be filled in by the deploying operator (they are scoped to a Cloudflare account and only meaningful with an auth token, so they are safe to commit).
- `scripts/load-test.ts` drives concurrent POSTs against the live endpoint and asserts `p95 < 500ms` and `error_rate < 1%`. README documents the headroom calculation: even at 100k WAU we are 14× under the 1000 writes/s KV ceiling. Per-key throttling is a non-issue because every `session_id` produces a unique key.
- Returns `204` immediately while the KV write completes via `ctx.waitUntil`. Worst case for the relay client is the 2 s socket timeout it owns; worst case for the Worker is the workerd CPU limit. Neither blocks the relay.
- Endpoint: `https://telemetry.domscribe.dev/v1/session`. The `rollback-vercel/` directory contains a behaviour-identical Vercel Edge Function backed by Vercel KV, activated per sprint 2491 replanning trigger #2 if Cloudflare provisioning blocks > 1 day. The CNAME swap is a DNS change; no code change in the relay.

## Test plan
- `pnpm test`: 26 tests pass under `@cloudflare/vitest-pool-workers` (real workerd runtime + in-memory KV)
  - … `content-length` and streaming reader
  - … `first_seen_at` populated server-side)
  - … `Allow: POST` / `/healthz`)
- `pnpm typecheck`: clean across both `tsconfig.json` (Worker code under workers-types) and `tsconfig.scripts.json` (load-test under node types)
- `pnpm deploy` against the live Cloudflare account (out-of-band; this PR cannot run wrangler against production from CI). Companion verification: the issue-34-client smoke test after deploy must produce ≥1 row in `GET /v1/wau/...`.

## Design notes worth surfacing
- `ctx.waitUntil` for fire-and-forget. Vercel Edge Functions don't have a direct equivalent, so the rollback path awaits the write and relies on the relay's 2 s client-side timeout to bound latency. README spells out this trade-off.
- `first_seen_at` resets on each POST. If we need true first-contact semantics we'd add one KV read per request. I opted out of that to keep the per-request cost at exactly one KV write.
- Strict Zod schema (`z.object().strict()`). This is the privacy guarantee. A relay that adds a new field is rejected with 400 until the schema is bumped here in lockstep, so accidental leakage by silent field addition is structurally impossible.

## Out of scope (delivered by issue-34-client)
- Relay startup POST and 24h interval
- `.domscribe/config.json` `telemetry.enabled` flag (default off)
- `npx domscribe init` opt-in prompt with payload shown inline

Refs: RFC 0002 §(4), sprint 2491 plan, DOP falsifier signal (c)
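To make the strict-schema design note concrete, here is a dependency-free stand-in for the reject-unknown-fields behaviour. It is not the PR's Zod schema: it is a hand-written check that mimics what a strict object schema does, and the field names in `ALLOWED_FIELDS` (including `relay_version`) are hypothetical, since the real payload fields are not listed in this description. The 400 status is the one the design note cites.

```typescript
// Sketch only: a hand-rolled substitute for a strict object schema.
// The allow-list below is hypothetical; the real schema's fields are not shown in this PR text.
const ALLOWED_FIELDS: readonly string[] = ["session_id", "relay_version"];

type ParseResult =
  | { ok: true; value: Record<string, string> }
  | { ok: false; status: 400; reason: string };

function parseStrict(input: unknown): ParseResult {
  if (typeof input !== "object" || input === null || Array.isArray(input)) {
    return { ok: false, status: 400, reason: "body must be a JSON object" };
  }
  const value: Record<string, string> = {};
  for (const [key, field] of Object.entries(input)) {
    // Any key outside the allow-list rejects the whole payload.
    if (!ALLOWED_FIELDS.includes(key)) {
      return { ok: false, status: 400, reason: `unknown field: ${key}` };
    }
    if (typeof field !== "string") {
      return { ok: false, status: 400, reason: `field must be a string: ${key}` };
    }
    value[key] = field;
  }
  for (const key of ALLOWED_FIELDS) {
    if (value[key] === undefined) {
      return { ok: false, status: 400, reason: `missing field: ${key}` };
    }
  }
  return { ok: true, value };
}
```

The point of the pattern is that an extra field fails the request outright rather than being silently dropped or stored, so a client that starts sending more data is noticed immediately.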