feat(telemetry): Cloudflare Worker + KV backend for opt-in relay telemetry (#34 infra)#42

Draft
Narrator wants to merge 1 commit into main from feat/issue-34-telemetry-infra

Summary

Implements the infra half of #34 per RFC 0002 §(4). Companion client work (relay startup POST, .domscribe/config.json flag, npx domscribe init prompt) is tracked separately as issue-34-client.

This PR delivers the server-side endpoint the relay client will POST to:

  • POST /v1/session — single fire-and-forget endpoint, returns 204 on success, strict-schema validation, 1 KB body cap, ctx.waitUntil for background KV write
  • GET /v1/wau/<YYYY-Www> — operator-only read of the WAU readout the DOP falsifier needs (≥10 weekly-active sessions by 2026-08-20). Gated on a constant-time token check; returns 404 when the secret is unset.
  • GET /healthz — liveness check
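
The token gate on the WAU read endpoint can be sketched as follows. This is a minimal, dependency-free illustration, not the code in this PR: the names `constantTimeEqual`, `gateWauRead`, and `ReadGate` are hypothetical, and the sketch stops short of mapping a failed comparison to an HTTP status because the description only specifies 404 for an unset secret.

```typescript
// Hypothetical helper names; the PR description does not show the real implementation.
function constantTimeEqual(a: string, b: string): boolean {
  // Fold the length mismatch into the accumulator rather than returning early.
  let diff = a.length ^ b.length;
  const n = Math.max(a.length, b.length);
  for (let i = 0; i < n; i++) {
    // charCodeAt yields NaN past the end of a string; `|| 0` maps that to 0.
    diff |= (a.charCodeAt(i) || 0) ^ (b.charCodeAt(i) || 0);
  }
  return diff === 0;
}

type ReadGate = "disabled" | "denied" | "ok";

function gateWauRead(presented: string | null, secret: string | undefined): ReadGate {
  // Unset secret: the endpoint is switched off (the PR answers 404 in this case).
  if (secret === undefined || secret === "") return "disabled";
  if (presented === null) return "denied";
  return constantTimeEqual(presented, secret) ? "ok" : "denied";
}
```

The loop always runs over the longer of the two inputs, so the time taken does not reveal the position of the first mismatching character.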

Lives under infra/telemetry-worker/ (outside packages/*) so it does not touch the protocol package's SemVer surface or the release flow.

Acceptance — sprint 2491 issue-34-infra

  • Cloudflare Worker deployed; KV namespace provisioned: wrangler.toml declares the SESSIONS KV binding; the deployment runbook in the README walks the operator through wrangler login → wrangler kv:namespace create → wrangler deploy → smoke curl. The production id and preview_id are placeholder values to be filled in by the deploying operator (they are scoped to a Cloudflare account and only meaningful with an auth token, so they are safe to commit).
  • Write-rate ceiling tested: scripts/load-test.ts drives concurrent POSTs against the live endpoint and asserts p95 < 500ms, error_rate < 1%. README documents the headroom calculation: even at 100k WAU we are 14× under the 1000 writes/s KV ceiling. Per-key throttling is a non-issue because every session_id produces a unique key.
  • 2 s timeout enforced on fire-and-forget POSTs — the request returns 204 immediately while the KV write completes via ctx.waitUntil. Worst case for the relay client is the 2 s socket timeout it owns; worst case for the Worker is the workerd CPU limit. Neither blocks the relay.
  • Endpoint URL documented + Vercel rollback path: https://telemetry.domscribe.dev/v1/session. The rollback-vercel/ directory contains a behaviour-identical Vercel Edge Function backed by Vercel KV, activated per sprint 2491 replanning trigger #2 if Cloudflare provisioning blocks > 1 day. The CNAME swap is a DNS change; no code change in the relay.

Test plan

  • pnpm test — 26 tests pass under @cloudflare/vitest-pool-workers (real workerd runtime + in-memory KV)
    • Schema strictness (unknown fields rejected = privacy guarantee)
    • Body cap via both declared content-length and streaming reader
    • KV write semantics (key partition by ISO week, idempotent duplicates, first_seen_at populated server-side)
    • Routing (404 / 405 with Allow: POST / /healthz)
    • Read endpoint auth (constant-time token check, 404 when disabled, 400 on malformed week)
    • ISO-week boundary cases (year-start/year-end, ISO 53-week years)
  • pnpm typecheck — clean across both tsconfig.json (Worker code under workers-types) and tsconfig.scripts.json (load-test under node types)
  • Operator action required: pnpm deploy against the live Cloudflare account (out-of-band; this PR cannot run wrangler against production from CI). Companion verification: issue-34-client smoke after deploy must produce ≥1 row in GET /v1/wau/....
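
For readers unfamiliar with the ISO-week boundary cases named in the test plan, here is a minimal sketch of deriving the `YYYY-Www` label and validating the path segment. The function names `isoWeekLabel` and `isValidWeekLabel` are hypothetical and the Worker's own implementation may differ; the sketch uses the standard "Thursday of the same week" rule from ISO 8601.

```typescript
// Hypothetical helper: formats a Date as an ISO 8601 week label such as "2026-W34".
function isoWeekLabel(date: Date): string {
  // Work in UTC so the label does not depend on the host time zone.
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  // ISO weekday: Monday = 1 ... Sunday = 7.
  const weekday = d.getUTCDay() || 7;
  // Shift to the Thursday of this week; its calendar year is the ISO week-year.
  d.setUTCDate(d.getUTCDate() + 4 - weekday);
  const isoYear = d.getUTCFullYear();
  const jan1 = Date.UTC(isoYear, 0, 1);
  const dayOfYear = (d.getTime() - jan1) / 86400000 + 1;
  const week = Math.ceil(dayOfYear / 7);
  return `${isoYear}-W${String(week).padStart(2, "0")}`;
}

// Hypothetical validator for the <YYYY-Www> path segment (weeks 01 to 53).
// It checks the shape only; it does not check whether a given year has a week 53.
function isValidWeekLabel(label: string): boolean {
  return /^\d{4}-W(0[1-9]|[1-4]\d|5[0-3])$/.test(label);
}
```

Under this rule, Sunday 2021-01-03 belongs to 2020-W53 and Monday 2024-12-30 belongs to 2025-W01, which is why the year in the label can differ from the calendar year.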

Design notes worth surfacing

  • ctx.waitUntil for fire-and-forget. Vercel Edge Functions don't have a direct equivalent — the rollback path awaits the write and relies on the relay's 2 s client-side timeout to bound latency. README spells out this trade-off.
  • Session ID lives in the KV key (under an ISO-week prefix), not in a stored field. The WAU readout is therefore a prefix scan with no read-modify-write step, and duplicate POSTs are idempotent overwrites.
  • first_seen_at resets on each POST. If we need true first-contact semantics we'd add one KV read per request. I opted out of that to keep the per-request cost at exactly one KV write.
  • Unknown-field rejection (z.object().strict()). This is the privacy guarantee. A relay that adds a new field is rejected with 400 until the schema is bumped here in lockstep — accidental leakage by silent field-addition is structurally impossible.
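
To make the unknown-field rule concrete, here is a dependency-free sketch of the behaviour that `z.object().strict()` provides. It is not the schema in this PR: the field list (`session_id`, `relay_version`) is an assumption made for illustration, as are the names `parseStrict` and `ParseResult`.

```typescript
// Hypothetical field list; the real payload schema is not shown in this description.
const ALLOWED_FIELDS = ["session_id", "relay_version"] as const;

type ParseResult =
  | { ok: true; value: Record<string, string> }
  | { ok: false; status: 400; reason: string };

function parseStrict(body: unknown): ParseResult {
  if (typeof body !== "object" || body === null || Array.isArray(body)) {
    return { ok: false, status: 400, reason: "body must be a JSON object" };
  }
  const record = body as Record<string, unknown>;
  const allowed = new Set<string>(ALLOWED_FIELDS);
  // Reject any key the schema does not name, mirroring strict-object validation.
  for (const key of Object.keys(record)) {
    if (!allowed.has(key)) {
      return { ok: false, status: 400, reason: `unknown field: ${key}` };
    }
  }
  // Every named field must be present as a non-empty string.
  const value: Record<string, string> = {};
  for (const key of ALLOWED_FIELDS) {
    const v = record[key];
    if (typeof v !== "string" || v.length === 0) {
      return { ok: false, status: 400, reason: `missing or invalid field: ${key}` };
    }
    value[key] = v;
  }
  return { ok: true, value };
}
```

A relay that starts sending an extra key gets a 400 until the allow-list is extended, which is the lockstep property described above.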

Out of scope (delivered by issue-34-client)

  • Relay startup POST + 24h interval
  • .domscribe/config.json telemetry.enabled flag (default off)
  • npx domscribe init opt-in prompt with payload shown inline
  • README privacy paragraph on the relay side

Refs: RFC 0002 §(4), sprint 2491 plan, DOP falsifier signal (c)

…metry

Implements the infra half of #34 per RFC 0002 §(4):
POST https://telemetry.domscribe.dev/v1/session storing one record per
(ISO week, session_id) in Cloudflare KV with a 12-week TTL. Strict
Zod schema rejects any unknown field — that is the privacy guarantee.

Lives under `infra/telemetry-worker/` (outside `packages/*`) so it does
not touch the protocol package's SemVer surface or the release flow.

Key design choices:
- `ctx.waitUntil` for fire-and-forget — matches the relay client contract
  (issue-34-client). Response returns 204 in <50ms; KV write completes in
  the background.
- Key partitioned by ISO week so the WAU readout (`GET /v1/wau/:week`)
  is a single prefix-scan. Session_id is the partition, not a stored
  field — duplicate POSTs are idempotent overwrites without read-modify.
- 1 KB body cap enforced two ways (declared content-length + streaming
  reader) so an oversized body cannot exhaust the worker before the cap
  fires.
- WAU readout gated on a constant-time token comparison; disabled (404)
  when the secret is unset.
- Rollback target at `rollback-vercel/` ports the same behaviour to a
  Vercel Edge Function backed by Vercel KV — activate per sprint 2491
  replanning trigger #2 if Cloudflare provisioning blocks > 1 day.
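
The two-step body cap can be sketched roughly as below. This is an illustration under assumptions, not the Worker's code: `exceedsDeclaredLength`, `readCapped`, and the `ChunkReader` interface are hypothetical names, with `ChunkReader` standing in for the reader returned by `request.body.getReader()`.

```typescript
const MAX_BODY_BYTES = 1024;

// Hypothetical stand-in for the reader returned by request.body.getReader().
interface ChunkReader {
  read(): Promise<{ done: boolean; value?: Uint8Array }>;
  cancel(): Promise<void>;
}

// Check 1: reject early when the declared content-length already exceeds the cap.
// A missing or non-numeric header falls through to the streaming check.
function exceedsDeclaredLength(header: string | null, maxBytes: number = MAX_BODY_BYTES): boolean {
  return header !== null && Number(header) > maxBytes;
}

// Check 2: count bytes as they arrive, since the header may be absent or wrong.
// Returns the body bytes, or null as soon as the running total passes the cap.
async function readCapped(reader: ChunkReader, maxBytes: number = MAX_BODY_BYTES): Promise<Uint8Array | null> {
  const chunks: Uint8Array[] = [];
  let total = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done || value === undefined) break;
    total += value.byteLength;
    if (total > maxBytes) {
      // Stop pulling data instead of buffering the rest of an oversized body.
      await reader.cancel();
      return null;
    }
    chunks.push(value);
  }
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return out;
}
```

Because the reader is cancelled at the first chunk that crosses the limit, at most one chunk beyond the cap is ever held in memory.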

Coverage:
- 26 tests pass under @cloudflare/vitest-pool-workers (real workerd
  runtime + in-memory KV). Cover schema strictness, body cap, KV write
  semantics, idempotency, routing, read-endpoint auth, and ISO-week
  boundary cases.
- Load-test harness at `scripts/load-test.ts` drives the live endpoint
  with concurrent POSTs and asserts p95 + error-rate ceilings. README
  documents the headroom calculation: even at 100k WAU we are 14x under
  the 1000 writes/s KV ceiling.

The deployment runbook (README.md) walks through wrangler login, KV
provisioning, secret seeding, custom-domain binding, and smoke-test
curl commands. Operator-readable end-to-end.

Out of scope for this PR (delivered by issue-34-client):
- Relay startup POST and 24h interval
- `.domscribe/config.json` `telemetry.enabled` flag
- `npx domscribe init` opt-in prompt

Refs: RFC 0002 §(4), sprint 2491 issue-34-infra, falsifier signal (c)