Commit 0204a37

jahooma and claude committed
Replace Fireworks Prometheus monitor with reachability probe
Delete the 1.5k-LOC fireworks-monitor package (Prometheus scrape, health computation, admin endpoint, CLI scripts) in favor of a single-function reachability probe inline in free-session/admission.ts: GET the account metrics endpoint with a 5s timeout and fail closed on non-OK.

The full-health-scoring machinery was load-bearing on nothing: admission only ever read the boolean gate, and reachability is what actually matters for halting during an outage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent febb263 commit 0204a37
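
The probe described in the commit message (one GET, a 5s timeout, fail closed on anything that is not OK) could look roughly like the sketch below. This is not the code from free-session/admission.ts: the names `probeReachable`, `FetchLike`, and `PROBE_TIMEOUT_MS` are hypothetical, the fetch implementation is injected only so the sketch can be exercised without a network, and any authentication the real endpoint needs is omitted.

```typescript
// Hypothetical sketch of a fail-closed reachability probe.
// Names and the injectable fetch are assumptions, not taken from the commit.
type FetchLike = (
  url: string,
  init?: { signal?: AbortSignal },
) => Promise<{ ok: boolean }>

// The commit message mentions a 5s timeout.
const PROBE_TIMEOUT_MS = 5_000

async function probeReachable(
  url: string,
  fetchImpl: FetchLike = fetch,
  timeoutMs: number = PROBE_TIMEOUT_MS,
): Promise<boolean> {
  const controller = new AbortController()
  // Abort the request if it has not settled within the timeout.
  const timer = setTimeout(() => controller.abort(), timeoutMs)
  try {
    const res = await fetchImpl(url, { signal: controller.signal })
    // Fail closed: only a 2xx response counts as reachable.
    return res.ok
  } catch {
    // Network errors and timeouts are treated the same as a non-OK response.
    return false
  } finally {
    clearTimeout(timer)
  }
}
```

Returning a plain boolean matches the commit's observation that admission only ever read a boolean gate; every failure mode collapses to `false` so an outage halts admission rather than letting it proceed.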

16 files changed, +46 -1707 lines changed


docs/freebuff-waiting-room.md

Lines changed: 5 additions & 5 deletions
@@ -35,7 +35,7 @@ flowchart LR
 Gate[checkSessionAdmissible]
 Ticker[Admission Ticker<br/>every 5s, 1 pod]
 Store[(free_session<br/>Postgres)]
-Monitor[FireworksMonitor<br/>isFireworksAdmissible]
+Probe[isFireworksAdmissible<br/>Fireworks metrics GET]

 CLI -- "POST on startup<br/>(gets instance_id)" --> SessionAPI
 CLI -- "GET to poll state" --> SessionAPI
@@ -44,7 +44,7 @@ flowchart LR
 ChatAPI --> Gate
 Gate --> Store
 Ticker --> Store
-Ticker --> Monitor
+Ticker --> Probe
 ```

 ### Components
@@ -123,7 +123,7 @@ The rotation is important: it happens even if the caller is already in the `acti
 ### What this does NOT prevent

 - A single user manually syncing `instance_id` between two CLIs (e.g. editing a config file). This is possible but requires them to re-sync after every startup call, so it's high-friction. We accept this.
-- A user creating multiple accounts. That is covered by other gates (MIN_ACCOUNT_AGE_FOR_PAID_MS, geo check) and the Fireworks monitor's overall throttle.
+- A user creating multiple accounts. That is covered by other gates (MIN_ACCOUNT_AGE_FOR_PAID_MS, geo check) and the overall drip-admission rate.

 ## Admission Loop

@@ -132,8 +132,8 @@ One pod runs the admission loop at a time, coordinated via Postgres advisory loc
 Each tick does (in order):

 1. **Sweep expired.** `DELETE FROM free_session WHERE status='active' AND expires_at < now()`. Runs regardless of upstream health so zombie sessions are cleaned up even during an outage.
-2. **Check upstream health.** `isFireworksAdmissible()` from the monitor. If not `healthy`, skip admission for this tick (queue grows; users see `status: 'queued'` with increasing position).
-3. **Admit.** `SELECT ... WHERE status='queued' ORDER BY queued_at, user_id LIMIT MAX_ADMITS_PER_TICK FOR UPDATE SKIP LOCKED`, then `UPDATE` those rows to `status='active'` with `admitted_at=now()`, `expires_at=now()+sessionLength`. Staggering the queue at `MAX_ADMITS_PER_TICK=1` / 15s keeps Fireworks from getting hit by a thundering herd of newly-admitted CLIs; once metrics show the deployment is saturated, step 2 halts further admissions.
+2. **Check upstream reachability.** `isFireworksAdmissible()` does a short-timeout GET against the Fireworks account metrics endpoint. If it doesn't respond OK, skip admission for this tick (queue grows; users see `status: 'queued'` with increasing position).
+3. **Admit.** `SELECT ... WHERE status='queued' ORDER BY queued_at, user_id LIMIT MAX_ADMITS_PER_TICK FOR UPDATE SKIP LOCKED`, then `UPDATE` those rows to `status='active'` with `admitted_at=now()`, `expires_at=now()+sessionLength`. Staggering the queue at `MAX_ADMITS_PER_TICK=1` / 15s keeps Fireworks from getting hit by a thundering herd of newly-admitted CLIs; if the probe starts failing, step 2 halts further admissions.

 ### Tunables

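
The three-step tick described in the updated doc (sweep, then probe, then admit) can be illustrated with a small sketch. Everything here is an assumption made for illustration: `TickDeps`, `TickResult`, and `runAdmissionTick` are hypothetical names, and the database and probe calls are injected as plain functions rather than written against the real `free_session` table.

```typescript
// Hypothetical sketch of the tick ordering: sweep, probe, admit.
// The dependency shape and all names are assumptions, not the commit's code.
interface TickDeps {
  // Deletes expired active sessions and returns how many were removed.
  sweepExpired: () => Promise<number>
  // Boolean reachability gate (isFireworksAdmissible in the doc).
  isUpstreamReachable: () => Promise<boolean>
  // Promotes up to `limit` queued sessions to active; returns how many.
  admitFromQueue: (limit: number) => Promise<number>
}

interface TickResult {
  expired: number
  admitted: number
  skipped: boolean
}

// The doc quotes MAX_ADMITS_PER_TICK=1.
const MAX_ADMITS_PER_TICK = 1

async function runAdmissionTick(
  deps: TickDeps,
  maxAdmits: number = MAX_ADMITS_PER_TICK,
): Promise<TickResult> {
  // Step 1 always runs, so zombie sessions are cleaned up during an outage.
  const expired = await deps.sweepExpired()

  // Step 2: fail closed if the probe reports unreachable or throws.
  let reachable = false
  try {
    reachable = await deps.isUpstreamReachable()
  } catch {
    reachable = false
  }
  if (!reachable) {
    return { expired, admitted: 0, skipped: true }
  }

  // Step 3: admit at most maxAdmits queued sessions this tick.
  const admitted = await deps.admitFromQueue(maxAdmits)
  return { expired, admitted, skipped: false }
}
```

Keeping the sweep ahead of the gate is what lets expiry continue while admission is paused; the queue simply grows until the probe succeeds again.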

scripts/check-fireworks-health.ts

Lines changed: 0 additions & 142 deletions
This file was deleted.

web/instrumentation.ts

Lines changed: 0 additions & 3 deletions
@@ -8,7 +8,6 @@
 * causing Render's proxy to return 502 Bad Gateway errors.
 */

-import { startFireworksMonitor } from '@/server/fireworks-monitor/monitor'
 import { logger } from '@/util/logger'

 export async function register() {
@@ -47,8 +46,6 @@ export async function register() {

 logger.info({}, '[Instrumentation] Global error handlers registered')

-startFireworksMonitor()
-
 // DB-touching admission module uses `postgres`, which imports Node built-ins
 // like `crypto`. Gate on NEXT_RUNTIME so the edge bundle doesn't try to
 // resolve them.
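
The hunk ends before the code that the comment describes, so the actual gating is not shown here. As a rough sketch of the idea, assuming a lazily loaded module with a hypothetical `start` function, a runtime gate might look like the following; `startIfNodeRuntime` and `AdmissionModule` are illustrative names only.

```typescript
// Hypothetical sketch of gating a Node-only module on the Next.js runtime.
// The function name and the module shape are assumptions.
interface AdmissionModule {
  start: () => void
}

async function startIfNodeRuntime(
  runtime: string | undefined,
  loadModule: () => Promise<AdmissionModule>,
): Promise<boolean> {
  // Next.js sets NEXT_RUNTIME to 'nodejs' or 'edge'. Skipping the load for
  // anything other than 'nodejs' means the loader is never invoked there.
  if (runtime !== 'nodejs') {
    return false
  }
  const mod = await loadModule()
  mod.start()
  return true
}
```

In a `register()` hook this would typically be called with `process.env.NEXT_RUNTIME` and a loader built on a dynamic `import()`, so the Node-only dependency is only pulled in on the Node runtime.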

web/scripts/scrape-check.ts

Lines changed: 0 additions & 54 deletions
This file was deleted.

web/src/app/api/admin/fireworks-health/__tests__/fireworks-health.test.ts

Lines changed: 0 additions & 66 deletions
This file was deleted.

web/src/app/api/admin/fireworks-health/_get.ts

Lines changed: 0 additions & 22 deletions
This file was deleted.

web/src/app/api/admin/fireworks-health/route.ts

Lines changed: 0 additions & 11 deletions
This file was deleted.
