phantom: Socket Mode lifecycle metrics + slack_app_token in AllowedSecretNames by mcheemaa · Pull Request #110 · ghostwright/phantom

mcheemaa · 2026-05-01T05:42:35Z

Summary

Phase 8a of the multi-agent Slack flow (R7 spec dated 2026-04-30) hardens the existing Socket Mode receiver with Prometheus-grade observability and locks the cross-repo allowlist mirror against phantomd's AllowedSecretNames.

The Socket Mode receiver itself was already in place (R7 §3 confirmed the lift was hardening, not implementation). What this PR adds:

Lifecycle metrics on a new /metrics Prometheus endpoint, hooked via the underlying SocketModeClient event emitter.
A frozen AllowedSecretNamesMirror constant pinning the three Slack-related secret names that traverse the metadata gateway, paired 1-1 with phantomd's AllowedSecretNames.

Metrics families

phantom_slack_socket_state{state} (gauge, 7 series). Exactly one series is 1.0 at any instant. Powers 1 - max_over_time(...{state="connected"}[1m]) "tenant offline" alerts.
phantom_slack_socket_reconnects_total (counter). Bolt auto-reconnects under the hood; this measures wobble rate. Alert at sustained > 5/min.
phantom_slack_socket_connection_seconds (histogram, custom buckets 1s -> 24h). Lifetime of a single Socket Mode connection. p99 should hold above 1 hour.
phantom_slack_event_dispatch_seconds{event_type} (histogram, custom buckets 5ms -> 30s). End-to-end Bolt middleware time. Slack ack deadline is 3s.

The /metrics endpoint follows the existing /health precedent: unauthenticated, content-type from prom-client. Per-tenant isolation is at the Caddy URL boundary, not per-route.

Implementation notes

Switched from new App({socketMode: true, ...}) to explicit new SocketModeReceiver({...}) so we can reach receiver.client and subscribe lifecycle events. The State enum is verbatim from @slack/socket-mode/dist/src/SocketModeClient.js: connecting, authenticated, connected, reconnecting, disconnecting, disconnected. R7 §3.3 mentioned unable_to_socket_mode_start; the SDK does not actually emit that event (it surfaces as UnrecoverableSocketModeStartError thrown from start()), so we map the unrecoverable path to a synthetic error state on the gauge in connect().
Dispatch timing via app.use() global middleware: try/await next() / finally observe(). Captures handler completion including async LLM/memory work the agent triggers in response to a Slack event.
prom-client backed. The metrics module owns its own Registry so future Telegram/email channels can plug in via parallel registries without name collisions on the global registry.

Cross-repo invariant

slack-channel-factory.ts exports a frozen AllowedSecretNamesMirror = ["slack_bot_token", "slack_app_token", "slack_gateway_signing_secret"]. The phantomd side lands in PR #28 (TestIsAllowedName_AcceptsSlackAppToken + _RejectsSlackAppTokenVariants). Drift on either side breaks tenant boot with HTTP 404 (the metadata gateway maps ErrInvalidName to 404 to defeat name enumeration). The factory test pins the Mirror against the SECRET_RESPONSES test fixture so a future production-side rename fails loud in CI on both repos.

R7 §3 documents the four-way invariant; this PR closes phantom's two of the four.

Tests

29 net-new:

24 in slack-metrics.test.ts: every metric family, the noop emitter, dispatch middleware including throw-path, lifecycle hooks including disconnect-without-prior-connect.
5 in slack-channel-factory.test.ts: AllowedSecretNamesMirror contains all three names, matches SECRET_RESPONSES, is frozen.

Bun-mock hygiene: every test file that mocks @slack/bolt now also exports SocketModeReceiver in its mock object. Without this, partial mocks leaked across test files and broke any subsequent test that imported slack.ts (caught by running the full channels/ suite together).

Verification

bun run typecheck clean
bun run lint clean
bun test green: 2111 pass / 0 fail / 0 errors / 4769 expects across 156 files
Cross-repo invariant verified manually: slack_app_token appears in 4 places (phantomd types.go + phantomd types_test.go + phantom slack-channel-factory.ts + phantom slack-channel-factory.test.ts).
Codex review (requested below)
Orchestrator/Cheema merge (per feedback_agent_canary_authority, agents do not merge phantom)

Sequencing

The phantomd half (PR #28) is the pre-req: without phantomd's allowlist update, a Phase 8b sidekick that calls PutSecret(name="slack_app_token", ...) is rejected with ErrInvalidName. Recommended merge order: phantomd #28 first (after Codex-clean), then this PR.

Test plan

Manual smoke against the canary tenant: deploy this build, scrape <slug>.phantom.ghostwright.dev/metrics, confirm the four metric families render with non-zero data.
Force a reconnect (kill the upstream WSS endpoint via iptables -A OUTPUT -p tcp --dport 443 -j DROP for 30s), verify phantom_slack_socket_reconnects_total increments and phantom_slack_socket_connection_seconds observes the lifetime.
Send a Slack message, verify phantom_slack_event_dispatch_seconds{event_type="app_mention"}_count increments.
Confirm Phase 8b sidekick install path can PutSecret(slack_app_token) against phantomd built with PR Replace Qdrant memory stack with ClawMem #28 (cross-repo smoke).

Net diff: 11 files, ~390 LOC. Follows R7 §3.1's "150 net-new" floor; the rest is test mirror coverage and observability documentation in CLAUDE.md.

…ecretNamesMirror R7 Phase 8a (multi-agent Slack flow, dated 2026-04-30) hardens the existing Socket Mode receiver with Prometheus-grade observability and locks the cross-repo allowlist mirror against phantomd's AllowedSecretNames. Metrics surface (new /metrics endpoint): - phantom_slack_socket_state{state} (gauge, 7 series). Exactly one series is 1.0 at any instant. Powers "tenant offline" alerts. - phantom_slack_socket_reconnects_total (counter). Bolt's auto- reconnect runs under the hood; this measures wobble rate. - phantom_slack_socket_connection_seconds (histogram). Connection lifetime from connect to disconnect; long-tail SLO. - phantom_slack_event_dispatch_seconds{event_type} (histogram). End-to-end Bolt middleware time. Slack ack deadline is 3s. Implementation: - New SocketModeReceiver hookup (vs. socketMode: true shorthand) to reach receiver.client and subscribe lifecycle events. The State enum is verbatim from @slack/socket-mode SocketModeClient.js: connecting, authenticated, connected, reconnecting, disconnecting, disconnected. - The unrecoverable start() error path maps to a synthetic "error" state on the gauge so the fleet view does not silently flatline when the receiver never gets past handshake. - Dispatch timing via app.use() global middleware: try/await next() / finally observe(). Captures handler completion including any async LLM/memory work the agent does in response. - prom-client backed; the metrics module owns its own Registry so Telegram/email can plug in via parallel registries without name collisions on the global registry. AllowedSecretNamesMirror cross-repo invariant: - New frozen const exported from slack-channel-factory.ts listing slack_bot_token + slack_app_token + slack_gateway_signing_secret. - The phantomd side ships in the symmetric PR (phantomd #28, TestIsAllowedName_AcceptsSlackAppToken pins the assertion). - Drift on either side breaks tenant boot with HTTP 404 (the metadata gateway maps ErrInvalidName to 404 to defeat name enumeration). The factory test pins the Mirror against the SECRET_RESPONSES test fixture so a future production-side rename fails loud in CI. Tests (29 net-new): - 24 in slack-metrics.test.ts: each metric family, the noop emitter, the dispatch middleware including throw-path, the lifecycle hooks including disconnect-without-prior-connect. - 5 in slack-channel-factory.test.ts: AllowedSecretNamesMirror contains all three names, matches SECRET_RESPONSES, is frozen. - All 2111 existing tests still green; full suite: 2111 pass / 0 fail / 0 errors. Bun-mock cross-suite hygiene: every test that previously mocked @slack/bolt now also exports SocketModeReceiver in its mock object. Without this, a test file that ran AFTER one of those partial mocks would import the real module and hit "Export named 'SocketModeReceiver' not found" at module load time. Caught by running the full channels/ suite together. Net diff: 11 files, ~390 LOC. Follows R7 §3.1's "150 net-new" floor; the rest is test mirror coverage and documentation.

mcheemaa · 2026-05-01T05:43:11Z

Ready for Codex review (feedback_review_protocol_codex_first).

Failure-path + refresh-survival track to focus on (feedback_reviewer_failure_path_track):

State-left-bad-after-failure: dispatch middleware throw path. The try/await next() / finally observe() block records timing even when the handler throws. Verify the metric still emits cleanly across a sequence: success -> throw -> success -> throw. The slack_event_dispatch_seconds{event_type="message"}_count should be exactly the number of dispatches, not the number of successes. Test exists (the dispatch middleware still records on handler throw); confirm the test's assertion shape catches a regression that "swallows the throw without rethrowing", which would silently flip the channel into a non-fail-loud state.
State-left-bad-after-failure: connection lifetime double-observe. observeConnectionLifetime() zeros connectedAtMs on observation. A pathological Bolt sequence (connected -> reconnecting -> disconnected) would call observeConnectionLifetime() twice; the second call short-circuits on the null guard. But there's a subtle race: if reconnecting and disconnected fire in the same tick, the second observation sees null and the histogram count stays at 1 even though two reconnect cycles happened. Is the alternative (observe twice with a half-zero second observation) better? My read: current behavior is correct because Bolt's reconnect path conceptually treats the prior connection as one continuous span until the reconnect succeeds. Worth a second look.
Refresh-survival: prom-client Registry across process restart. The Registry is in-memory; phantomd restart re-adopts the running CH process via /proc//cmdline (KillMode=process). The in-VM Phantom process itself does NOT restart in that path. But: when the in-VM Phantom DOES restart (post-evolution-bump or operator-driven), every counter resets to 0 and every histogram bucket resets. Prometheus handles the counter reset via rate() automatically. The histogram reset is fine. The gauge re-initializes to 0/0/0/0/0/0/1 (disconnected) before any lifecycle event lands; this is the constructor's pre-emit. Correct.
Class of mistake worth checking: prom-client name collisions. prom-client throws Metric with name X has already been registered when the same registry is constructed twice. The factory pattern (one SlackMetrics per process) avoids this in production. Tests pass new Registry() per case, also clean. But: if a future contributor adds a second new SlackMetrics() somewhere (Telegram metrics module copy-pasted from this one with shared metric names), the second one throws on construction. Catch: phantom_slack_* prefix on every metric name limits collision risk to within-Slack metrics. The CLAUDE.md note about per-channel registries is the documented contract.
Mock hygiene cross-suite (Bun-bug per feedback_codex_loop_finite + B.1.4 history): the four test files that mock @slack/bolt now all export SocketModeReceiver. If a future contributor adds a 5th test that mocks @slack/bolt and forgets SocketModeReceiver, the test file passes in isolation but breaks the full suite. Worth a comment in the slack-metrics.ts module header pointing at the mock.module requirement, or worth a CI gate that runs bun test src/ (full suite) before merge.
Bolt 4.6 app.use middleware contract. I'm asserting that await next() returns when the matching listener finishes. Cross-check Bolt 4.6 source to confirm the middleware chain doesn't short-circuit on a no-match (e.g. an event Bolt has not classified might bypass the user listener and next() returns immediately, leading to a sub-microsecond observation). Such observations are fine for the histogram but would skew _count upward. Worth a note in the middleware comment if confirmed.
Cross-repo P0 from the phantomd PR Replace Qdrant memory stack with ClawMem #28 comment: is the AllowedSecretNamesMirror constant + the SECRET_RESPONSES fixture pin enough for the four-way invariant? Or do we need a third pin (e.g. a CI script that diffs phantomd's AllowedSecretNames keys against this Mirror)? Defer to the orchestrator.

Beyond Codex, the cross-repo P1 question: the test file mocks the SocketModeReceiver as a no-op object with client.on() = () => {}. The slack-metrics.test.ts integration tests use a real recording mock that captures listeners. If a future SocketModeReceiver mock in an unrelated test file diverges (e.g. forgets client.on), the SlackChannel constructor throws a confusing error. Worth pinning the mock shape via a shared test helper? Defer to follow-up cleanup PR.

mcheemaa · 2026-05-01T05:44:33Z

@codex review

Per feedback_reviewer_failure_path_track. Specific risks for review:

Dispatch-middleware throw-path correctness (does next() throw propagate to the outer dispatch_seconds histogram correctly?)
Connection-lifetime double-observe race (could a 'connecting' fire twice without an intervening 'connected' or 'disconnected'?)
prom-client name collisions on a hypothetical second emitter
Bun-mock hygiene if a future test file mocks @slack/bolt without exporting SocketModeReceiver
Companion PR: ghostwright/phantomd#28 (allowlist for slack_app_token)

Architect: phantom-cloud-deploy/local/2026-04-30-research-r7-slack-multi-agent-impl.md (Phase 8a hardening pass, 4-place cross-repo invariant).

chatgpt-codex-connector · 2026-05-01T05:49:55Z

Codex Review: Didn't find any major issues. Hooray!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

mcheemaa merged commit 7acaef7 into main May 1, 2026
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

phantom: Socket Mode lifecycle metrics + slack_app_token in AllowedSecretNames#110

phantom: Socket Mode lifecycle metrics + slack_app_token in AllowedSecretNames#110
mcheemaa merged 1 commit intomainfrom
feat/2026-04-30-phase8a-slack-socket-metrics

mcheemaa commented May 1, 2026

Uh oh!

mcheemaa commented May 1, 2026

Uh oh!

mcheemaa commented May 1, 2026

Uh oh!

chatgpt-codex-connector Bot commented May 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mcheemaa commented May 1, 2026

Summary

Metrics families

Implementation notes

Cross-repo invariant

Tests

Verification

Sequencing

Test plan

Uh oh!

mcheemaa commented May 1, 2026

Uh oh!

mcheemaa commented May 1, 2026

Uh oh!

chatgpt-codex-connector Bot commented May 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant