Replace Fireworks Prometheus monitor with reachability probe

Delete the 1.5k-LOC fireworks-monitor package (Prometheus scrape, health
computation, admin endpoint, CLI scripts) in favor of a single-function
reachability probe inline in free-session/admission.ts: GET the account
metrics endpoint with a 5s timeout and fail closed on non-OK. The
full-health-scoring machinery was load-bearing on nothing: admission only
ever read the boolean gate, and reachability is what actually matters for
halting during an outage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
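
For readers without the source at hand, here is a minimal sketch of what a probe like the one described above could look like. The actual code in free-session/admission.ts is not shown in this view, so the signature, the injected `fetchImpl` parameter, the `FetchLike` type, and the `PROBE_TIMEOUT_MS` constant are all assumptions made for illustration. Only the 5s timeout and the fail-closed behavior come from the commit message.

```typescript
// Illustrative sketch only: every name here except isFireworksAdmissible is hypothetical.
type FetchLike = (
  url: string,
  init?: { signal?: AbortSignal },
) => Promise<{ ok: boolean }>;

const PROBE_TIMEOUT_MS = 5_000; // "5s timeout" per the commit message

async function isFireworksAdmissible(
  metricsUrl: string,
  fetchImpl: FetchLike,
  timeoutMs: number = PROBE_TIMEOUT_MS,
): Promise<boolean> {
  const controller = new AbortController();
  // Abort the request if the endpoint has not answered within the timeout.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(metricsUrl, { signal: controller.signal });
    // Any non-OK status counts as unreachable.
    return res.ok;
  } catch {
    // Timeout, DNS failure, connection reset, etc.: fail closed.
    return false;
  } finally {
    clearTimeout(timer);
  }
}
```

In this sketch a caller would pass the global `fetch` as `fetchImpl`; injecting it keeps the probe testable without network access. Note that the sketch also treats thrown errors and timeouts as "not admissible", which is one reading of "fail closed" rather than something the commit message spells out.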
 CLI -- "POST on startup<br/>(gets instance_id)" --> SessionAPI
 CLI -- "GET to poll state" --> SessionAPI
@@ -44,7 +44,7 @@ flowchart LR
 ChatAPI --> Gate
 Gate --> Store
 Ticker --> Store
-Ticker --> Monitor
+Ticker --> Probe
 ```
 
 ### Components
@@ -123,7 +123,7 @@ The rotation is important: it happens even if the caller is already in the `acti
 ### What this does NOT prevent
 
 - A single user manually syncing `instance_id` between two CLIs (e.g. editing a config file). This is possible but requires them to re-sync after every startup call, so it's high-friction. We accept this.
-- A user creating multiple accounts. That is covered by other gates (MIN_ACCOUNT_AGE_FOR_PAID_MS, geo check) and the Fireworks monitor's overall throttle.
+- A user creating multiple accounts. That is covered by other gates (MIN_ACCOUNT_AGE_FOR_PAID_MS, geo check) and the overall drip-admission rate.
 
 ## Admission Loop
 
@@ -132,8 +132,8 @@ One pod runs the admission loop at a time, coordinated via Postgres advisory loc
 Each tick does (in order):
 
 1. **Sweep expired.** `DELETE FROM free_session WHERE status='active' AND expires_at < now()`. Runs regardless of upstream health so zombie sessions are cleaned up even during an outage.
-2. **Check upstream health.** `isFireworksAdmissible()` from the monitor. If not `healthy`, skip admission for this tick (queue grows; users see `status: 'queued'` with increasing position).
-3. **Admit.** `SELECT ... WHERE status='queued' ORDER BY queued_at, user_id LIMIT MAX_ADMITS_PER_TICK FOR UPDATE SKIP LOCKED`, then `UPDATE` those rows to `status='active'` with `admitted_at=now()`, `expires_at=now()+sessionLength`. Staggering the queue at `MAX_ADMITS_PER_TICK=1` / 15s keeps Fireworks from getting hit by a thundering herd of newly-admitted CLIs; once metrics show the deployment is saturated, step 2 halts further admissions.
+2. **Check upstream reachability.** `isFireworksAdmissible()` does a short-timeout GET against the Fireworks account metrics endpoint. If it doesn't respond OK, skip admission for this tick (queue grows; users see `status: 'queued'` with increasing position).
+3. **Admit.** `SELECT ... WHERE status='queued' ORDER BY queued_at, user_id LIMIT MAX_ADMITS_PER_TICK FOR UPDATE SKIP LOCKED`, then `UPDATE` those rows to `status='active'` with `admitted_at=now()`, `expires_at=now()+sessionLength`. Staggering the queue at `MAX_ADMITS_PER_TICK=1` / 15s keeps Fireworks from getting hit by a thundering herd of newly-admitted CLIs; if the probe starts failing, step 2 halts further admissions.
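
The ordering of the three steps matters: the sweep runs unconditionally, and the probe gates only the admit step. A rough sketch of that control flow follows. The `TickDeps` interface, the `runAdmissionTick` function, and the result shape are hypothetical names introduced here and are not taken from the repository; the SQL behind `sweepExpired` and `admitFromQueue` is assumed to be the statements quoted in the doc.

```typescript
// Illustrative sketch only: TickDeps, runAdmissionTick, and TickResult are hypothetical.
interface TickDeps {
  sweepExpired(): Promise<number>; // step 1: delete expired active sessions
  isUpstreamReachable(): Promise<boolean>; // step 2: the reachability probe
  admitFromQueue(limit: number): Promise<number>; // step 3: promote queued rows
}

interface TickResult {
  swept: number;
  admitted: number;
  skippedAdmission: boolean;
}

const MAX_ADMITS_PER_TICK = 1; // value quoted in the doc

async function runAdmissionTick(deps: TickDeps): Promise<TickResult> {
  // Step 1 always runs, so zombie sessions are cleaned up even during an outage.
  const swept = await deps.sweepExpired();

  // Step 2: if the probe fails, skip admission for this tick; the queue simply grows.
  if (!(await deps.isUpstreamReachable())) {
    return { swept, admitted: 0, skippedAdmission: true };
  }

  // Step 3: admit at most MAX_ADMITS_PER_TICK users to avoid a thundering herd.
  const admitted = await deps.admitFromQueue(MAX_ADMITS_PER_TICK);
  return { swept, admitted, skippedAdmission: false };
}
```

Passing the three steps in as dependencies is a choice made for this sketch so the ordering can be exercised with stubs; the real loop may well call the database and the probe directly.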