Skip to content

phantom: Socket Mode lifecycle metrics + slack_app_token in AllowedSecretNames#110

Merged
mcheemaa merged 1 commit intomainfrom
feat/2026-04-30-phase8a-slack-socket-metrics
May 1, 2026
Merged

phantom: Socket Mode lifecycle metrics + slack_app_token in AllowedSecretNames#110
mcheemaa merged 1 commit intomainfrom
feat/2026-04-30-phase8a-slack-socket-metrics

Conversation

@mcheemaa
Copy link
Copy Markdown
Member

@mcheemaa mcheemaa commented May 1, 2026

Summary

Phase 8a of the multi-agent Slack flow (R7 spec dated 2026-04-30) hardens the existing Socket Mode receiver with Prometheus-grade observability and locks the cross-repo allowlist mirror against phantomd's AllowedSecretNames.

The Socket Mode receiver itself was already in place (R7 §3 confirmed the lift was hardening, not implementation). What this PR adds:

  1. Lifecycle metrics on a new /metrics Prometheus endpoint, hooked via the underlying SocketModeClient event emitter.
  2. A frozen AllowedSecretNamesMirror constant pinning the three Slack-related secret names that traverse the metadata gateway, paired 1-1 with phantomd's AllowedSecretNames.

Metrics families

  • phantom_slack_socket_state{state} (gauge, 7 series). Exactly one series is 1.0 at any instant. Powers 1 - max_over_time(...{state="connected"}[1m]) "tenant offline" alerts.
  • phantom_slack_socket_reconnects_total (counter). Bolt auto-reconnects under the hood; this measures wobble rate. Alert at sustained > 5/min.
  • phantom_slack_socket_connection_seconds (histogram, custom buckets 1s -> 24h). Lifetime of a single Socket Mode connection. p99 should hold above 1 hour.
  • phantom_slack_event_dispatch_seconds{event_type} (histogram, custom buckets 5ms -> 30s). End-to-end Bolt middleware time. Slack ack deadline is 3s.

The /metrics endpoint follows the existing /health precedent: unauthenticated, content-type from prom-client. Per-tenant isolation is at the Caddy URL boundary, not per-route.

Implementation notes

  • Switched from new App({socketMode: true, ...}) to explicit new SocketModeReceiver({...}) so we can reach receiver.client and subscribe lifecycle events. The State enum is verbatim from @slack/socket-mode/dist/src/SocketModeClient.js: connecting, authenticated, connected, reconnecting, disconnecting, disconnected. R7 §3.3 mentioned unable_to_socket_mode_start; the SDK does not actually emit that event (it surfaces as UnrecoverableSocketModeStartError thrown from start()), so we map the unrecoverable path to a synthetic error state on the gauge in connect().
  • Dispatch timing via app.use() global middleware: try/await next() / finally observe(). Captures handler completion including async LLM/memory work the agent triggers in response to a Slack event.
  • prom-client backed. The metrics module owns its own Registry so future Telegram/email channels can plug in via parallel registries without name collisions on the global registry.

Cross-repo invariant

slack-channel-factory.ts exports a frozen AllowedSecretNamesMirror = ["slack_bot_token", "slack_app_token", "slack_gateway_signing_secret"]. The phantomd side lands in PR #28 (TestIsAllowedName_AcceptsSlackAppToken + _RejectsSlackAppTokenVariants). Drift on either side breaks tenant boot with HTTP 404 (the metadata gateway maps ErrInvalidName to 404 to defeat name enumeration). The factory test pins the Mirror against the SECRET_RESPONSES test fixture so a future production-side rename fails loud in CI on both repos.

R7 §3 documents the four-way invariant; this PR closes phantom's two of the four.

Tests

29 net-new:

  • 24 in slack-metrics.test.ts: every metric family, the noop emitter, dispatch middleware including throw-path, lifecycle hooks including disconnect-without-prior-connect.
  • 5 in slack-channel-factory.test.ts: AllowedSecretNamesMirror contains all three names, matches SECRET_RESPONSES, is frozen.

Bun-mock hygiene: every test file that mocks @slack/bolt now also exports SocketModeReceiver in its mock object. Without this, partial mocks leaked across test files and broke any subsequent test that imported slack.ts (caught by running the full channels/ suite together).

Verification

  • bun run typecheck clean
  • bun run lint clean
  • bun test green: 2111 pass / 0 fail / 0 errors / 4769 expects across 156 files
  • Cross-repo invariant verified manually: slack_app_token appears in 4 places (phantomd types.go + phantomd types_test.go + phantom slack-channel-factory.ts + phantom slack-channel-factory.test.ts).
  • Codex review (requested below)
  • Orchestrator/Cheema merge (per feedback_agent_canary_authority, agents do not merge phantom)

Sequencing

The phantomd half (PR #28) is the pre-req: without phantomd's allowlist update, a Phase 8b sidekick that calls PutSecret(name="slack_app_token", ...) is rejected with ErrInvalidName. Recommended merge order: phantomd #28 first (after Codex-clean), then this PR.

Test plan

  • Manual smoke against the canary tenant: deploy this build, scrape <slug>.phantom.ghostwright.dev/metrics, confirm the four metric families render with non-zero data.
  • Force a reconnect (kill the upstream WSS endpoint via iptables -A OUTPUT -p tcp --dport 443 -j DROP for 30s), verify phantom_slack_socket_reconnects_total increments and phantom_slack_socket_connection_seconds observes the lifetime.
  • Send a Slack message, verify phantom_slack_event_dispatch_seconds{event_type="app_mention"}_count increments.
  • Confirm Phase 8b sidekick install path can PutSecret(slack_app_token) against phantomd built with PR Replace Qdrant memory stack with ClawMem #28 (cross-repo smoke).

Net diff: 11 files, ~390 LOC. Follows R7 §3.1's "150 net-new" floor; the rest is test mirror coverage and observability documentation in CLAUDE.md.

…ecretNamesMirror

R7 Phase 8a (multi-agent Slack flow, dated 2026-04-30) hardens the
existing Socket Mode receiver with Prometheus-grade observability and
locks the cross-repo allowlist mirror against phantomd's
AllowedSecretNames.

Metrics surface (new /metrics endpoint):
- phantom_slack_socket_state{state} (gauge, 7 series). Exactly one
  series is 1.0 at any instant. Powers "tenant offline" alerts.
- phantom_slack_socket_reconnects_total (counter). Bolt's auto-
  reconnect runs under the hood; this measures wobble rate.
- phantom_slack_socket_connection_seconds (histogram). Connection
  lifetime from connect to disconnect; long-tail SLO.
- phantom_slack_event_dispatch_seconds{event_type} (histogram).
  End-to-end Bolt middleware time. Slack ack deadline is 3s.

Implementation:
- New SocketModeReceiver hookup (vs. socketMode: true shorthand) to
  reach receiver.client and subscribe lifecycle events. The State
  enum is verbatim from @slack/socket-mode SocketModeClient.js:
  connecting, authenticated, connected, reconnecting, disconnecting,
  disconnected.
- The unrecoverable start() error path maps to a synthetic "error"
  state on the gauge so the fleet view does not silently flatline
  when the receiver never gets past handshake.
- Dispatch timing via app.use() global middleware: try/await
  next() / finally observe(). Captures handler completion including
  any async LLM/memory work the agent does in response.
- prom-client backed; the metrics module owns its own Registry so
  Telegram/email can plug in via parallel registries without name
  collisions on the global registry.

AllowedSecretNamesMirror cross-repo invariant:
- New frozen const exported from slack-channel-factory.ts listing
  slack_bot_token + slack_app_token + slack_gateway_signing_secret.
- The phantomd side ships in the symmetric PR (phantomd #28,
  TestIsAllowedName_AcceptsSlackAppToken pins the assertion).
- Drift on either side breaks tenant boot with HTTP 404 (the
  metadata gateway maps ErrInvalidName to 404 to defeat name
  enumeration). The factory test pins the Mirror against the
  SECRET_RESPONSES test fixture so a future production-side rename
  fails loud in CI.

Tests (29 net-new):
- 24 in slack-metrics.test.ts: each metric family, the noop emitter,
  the dispatch middleware including throw-path, the lifecycle hooks
  including disconnect-without-prior-connect.
- 5 in slack-channel-factory.test.ts: AllowedSecretNamesMirror
  contains all three names, matches SECRET_RESPONSES, is frozen.
- All 2111 existing tests still green; full suite: 2111 pass / 0
  fail / 0 errors.

Bun-mock cross-suite hygiene: every test that previously mocked
@slack/bolt now also exports SocketModeReceiver in its mock object.
Without this, a test file that ran AFTER one of those partial
mocks would import the real module and hit "Export named
'SocketModeReceiver' not found" at module load time. Caught by
running the full channels/ suite together.

Net diff: 11 files, ~390 LOC. Follows R7 §3.1's "150 net-new" floor;
the rest is test mirror coverage and documentation.
@mcheemaa
Copy link
Copy Markdown
Member Author

mcheemaa commented May 1, 2026

Ready for Codex review (feedback_review_protocol_codex_first).

Failure-path + refresh-survival track to focus on (feedback_reviewer_failure_path_track):

  1. State-left-bad-after-failure: dispatch middleware throw path. The try/await next() / finally observe() block records timing even when the handler throws. Verify the metric still emits cleanly across a sequence: success -> throw -> success -> throw. The slack_event_dispatch_seconds{event_type="message"}_count should be exactly the number of dispatches, not the number of successes. Test exists (the dispatch middleware still records on handler throw); confirm the test's assertion shape catches a regression that "swallows the throw without rethrowing", which would silently flip the channel into a non-fail-loud state.

  2. State-left-bad-after-failure: connection lifetime double-observe. observeConnectionLifetime() zeros connectedAtMs on observation. A pathological Bolt sequence (connected -> reconnecting -> disconnected) would call observeConnectionLifetime() twice; the second call short-circuits on the null guard. But there's a subtle race: if reconnecting and disconnected fire in the same tick, the second observation sees null and the histogram count stays at 1 even though two reconnect cycles happened. Is the alternative (observe twice with a half-zero second observation) better? My read: current behavior is correct because Bolt's reconnect path conceptually treats the prior connection as one continuous span until the reconnect succeeds. Worth a second look.

  3. Refresh-survival: prom-client Registry across process restart. The Registry is in-memory; phantomd restart re-adopts the running CH process via /proc//cmdline (KillMode=process). The in-VM Phantom process itself does NOT restart in that path. But: when the in-VM Phantom DOES restart (post-evolution-bump or operator-driven), every counter resets to 0 and every histogram bucket resets. Prometheus handles the counter reset via rate() automatically. The histogram reset is fine. The gauge re-initializes to 0/0/0/0/0/0/1 (disconnected) before any lifecycle event lands; this is the constructor's pre-emit. Correct.

  4. Class of mistake worth checking: prom-client name collisions. prom-client throws Metric with name X has already been registered when the same registry is constructed twice. The factory pattern (one SlackMetrics per process) avoids this in production. Tests pass new Registry() per case, also clean. But: if a future contributor adds a second new SlackMetrics() somewhere (Telegram metrics module copy-pasted from this one with shared metric names), the second one throws on construction. Catch: phantom_slack_* prefix on every metric name limits collision risk to within-Slack metrics. The CLAUDE.md note about per-channel registries is the documented contract.

  5. Mock hygiene cross-suite (Bun-bug per feedback_codex_loop_finite + B.1.4 history): the four test files that mock @slack/bolt now all export SocketModeReceiver. If a future contributor adds a 5th test that mocks @slack/bolt and forgets SocketModeReceiver, the test file passes in isolation but breaks the full suite. Worth a comment in the slack-metrics.ts module header pointing at the mock.module requirement, or worth a CI gate that runs bun test src/ (full suite) before merge.

  6. Bolt 4.6 app.use middleware contract. I'm asserting that await next() returns when the matching listener finishes. Cross-check Bolt 4.6 source to confirm the middleware chain doesn't short-circuit on a no-match (e.g. an event Bolt has not classified might bypass the user listener and next() returns immediately, leading to a sub-microsecond observation). Such observations are fine for the histogram but would skew _count upward. Worth a note in the middleware comment if confirmed.

  7. Cross-repo P0 from the phantomd PR Replace Qdrant memory stack with ClawMem #28 comment: is the AllowedSecretNamesMirror constant + the SECRET_RESPONSES fixture pin enough for the four-way invariant? Or do we need a third pin (e.g. a CI script that diffs phantomd's AllowedSecretNames keys against this Mirror)? Defer to the orchestrator.

Beyond Codex, the cross-repo P1 question: the test file mocks the SocketModeReceiver as a no-op object with client.on() = () => {}. The slack-metrics.test.ts integration tests use a real recording mock that captures listeners. If a future SocketModeReceiver mock in an unrelated test file diverges (e.g. forgets client.on), the SlackChannel constructor throws a confusing error. Worth pinning the mock shape via a shared test helper? Defer to follow-up cleanup PR.

@mcheemaa
Copy link
Copy Markdown
Member Author

mcheemaa commented May 1, 2026

@codex review

Per feedback_reviewer_failure_path_track. Specific risks for review:

  • Dispatch-middleware throw-path correctness (does next() throw propagate to the outer dispatch_seconds histogram correctly?)
  • Connection-lifetime double-observe race (could a 'connecting' fire twice without an intervening 'connected' or 'disconnected'?)
  • prom-client name collisions on a hypothetical second emitter
  • Bun-mock hygiene if a future test file mocks @slack/bolt without exporting SocketModeReceiver
  • Companion PR: ghostwright/phantomd#28 (allowlist for slack_app_token)

Architect: phantom-cloud-deploy/local/2026-04-30-research-r7-slack-multi-agent-impl.md (Phase 8a hardening pass, 4-place cross-repo invariant).

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Hooray!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@mcheemaa mcheemaa merged commit 7acaef7 into main May 1, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant