channels: persistent intro-DM ledger for HTTP mode (H3 onboarding fix) by mcheemaa · Pull Request #119 · ghostwright/phantom

mcheemaa · 2026-05-01T22:21:30Z

Summary

Closes finding H3 from the 2026-05-01 real-vs-fake audit v2: two reachable HTTP-mode tenants (gilded-hearth, gogo-aim) sat at /health.onboarding="pending" for 49-65h with no intro DM ever fired. Root cause is two layered bugs in this repo:

The HTTP receiver's firstDmSent flag is in-memory only. On a transient Slack send failure (rate limit, no ts returned, conversations.open failure), the flag stays false, but the next opportunity to retry is the next process restart. Production tenants with continuous uptime get exactly one shot.
markOnboardingStarted / markOnboardingComplete only run from the Socket Mode startOnboarding path in src/index.ts, which is gated on channelsConfig?.slack?.{owner_user_id,default_user_id,default_channel_id}. None of those exist in HTTP-mode production because no channels.yaml is baked into the rootfs. So /health.onboarding stays at the default "pending" forever even when the HTTP-mode intro DM did go out.

Reproduction

curl https://gilded-hearth.phantom.ghostwright.dev/health -> "onboarding":"pending" after 65h uptime, evolution.generation: 0, scheduler.total: 0. Same for gogo-aim (49h, generation 1).
The agents are technically alive (channels.slack: true, memory healthy) but inert because no first user contact has ever happened.

Root cause and fix

src/channels/slack-http-receiver.ts connect() does the intro DM via sendIntroductionDm and gates on the in-memory firstDmSent instance flag. Both gating and observability need to be persistent.

This PR adds an optional IntroductionLedger surface on SlackHttpChannel:

export type IntroductionLedger = {
  isIntroSent(): boolean;
  markIntroSent(): void;
};

src/index.ts wires it against the existing SQLite onboarding_state table:

introductionLedger: {
  isIntroSent: () => getOnboardingStatus(db).status === "complete",
  markIntroSent: () => {
    markOnboardingStarted(db);
    markOnboardingComplete(db);
  },
}

Behaviour the ledger pins:

/health.onboarding flips from "pending" to "complete" on the first successful intro DM, so audits and dashboards see the right state.
A process restart consults the SQLite row and skips the intro DM when it is already sent, so the installer never gets re-DMed.
A transient Slack failure (rate limit, ts returned empty, conversations.open error, exception thrown) leaves the row clear so the next process start retries naturally. The stamp happens only AFTER result.sent === true.

The ledger is optional; tests and single-tenant dev keep the in-memory firstDmSent fallback with the same semantics it had before. The constructor stays back-compat with every existing caller.

Architect note

This fix lands BEFORE the rootfs rebuild (audit M2) so the next rebuild ships a tenant whose first DM correctly stamps /health and is idempotent across restarts. It rides cleanly with phantom #116 (Phase 12 user research enrichment) which adds firstboot_state for the Socket Mode path; the two ledgers cover different transports and do not overlap.

Live tenants stuck at pending today still need a one-shot operator action when bare-metal SSH is restored: stamp onboarding_state to complete on the existing in-VM SQLite (or send the intro DM manually). Documented as an operator action in the H3 root-cause note (see local doc in phantom-cloud-deploy).

Test plan

bun test src/channels/__tests__/slack-http-receiver.test.ts - 64 pass (7 new ledger-path cases)
bun test src/channels/__tests__/slack-channel-factory.test.ts - 20 pass (1 new factory-forward case)
bun test (full suite) - 2316 pass / 10 skip / 1 todo / 0 fail
bun run typecheck - clean
bun run lint - clean

New regression tests pin: ledger short-circuits intro DM on a true read; ledger stamps only after a successful send; transient Slack errors keep the row clear; the stamp survives reconnect; a throwing markIntroSent does not derail connect(); the in-memory fallback still works without a ledger; the factory forwards the ledger to the channel.

Production HTTP-mode tenants stayed at /health.onboarding="pending" forever and could re-DM the installer on every process restart. Both behaviours trace to the in-memory firstDmSent flag in SlackHttpChannel plus the fact that markOnboardingStarted only runs from the Socket Mode flow (gated on a channels.yaml that is never baked in HTTP-mode rootfs). Add an optional IntroductionLedger surface on SlackHttpChannel that production wires against the SQLite onboarding_state table. The receiver short-circuits the intro DM when the ledger reports sent and stamps it only AFTER a successful Slack send, so: - /health.onboarding flips to "complete" on the first successful DM - a process restart never re-DMs the installer - a transient Slack failure (rate limit, no ts returned, throw) leaves the row clear so the next process start retries The ledger is optional; tests and single-tenant dev fall back to the existing in-memory flag with the prior semantics. The SlackHttpChannel constructor stays compatible with all existing callers. Tests: 7 new ledger-path cases on slack-http-receiver plus a slack-channel-factory regression that pins the introductionLedger forward, all green. Full suite 2316 pass / 0 fail. Refs: real-vs-fake audit v2 finding H3

…otency

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a98d91cd69

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-01T22:26:07Z

 		metadataBaseUrl: process.env.METADATA_BASE_URL,
 		metrics: slackMetrics,
+		introductionLedger: {
+			isIntroSent: () => getOnboardingStatus(db).status === "complete",


Key intro-DM ledger by installer identity

The new ledger wiring treats any onboarding_state.status === "complete" as proof that the intro DM was already sent, but that state is global and not tied to identity.slack.installerUserId. After a Slack reinstall/identity rotation (which this codebase supports via metadata update + restart), SlackHttpChannel.connect() will skip sendIntroductionDm for the new installer because the old install already marked complete. This regresses first-contact behavior for reinstall flows and leaves the new installer without the onboarding DM.

Useful? React with 👍 / 👎.

mcheemaa and others added 2 commits May 1, 2026 15:20

Merge branch 'main' into fix/2026-05-02-h3-onboarding-http-mode-idemp…

2488714

…otency

mcheemaa enabled auto-merge (squash) May 1, 2026 22:24

chatgpt-codex-connector Bot reviewed May 1, 2026

View reviewed changes

mcheemaa merged commit 3e37d50 into main May 1, 2026
1 check passed

mcheemaa deleted the fix/2026-05-02-h3-onboarding-http-mode-idempotency branch May 1, 2026 22:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

channels: persistent intro-DM ledger for HTTP mode (H3 onboarding fix)#119

channels: persistent intro-DM ledger for HTTP mode (H3 onboarding fix)#119
mcheemaa merged 2 commits intomainfrom
fix/2026-05-02-h3-onboarding-http-mode-idempotency

mcheemaa commented May 1, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mcheemaa commented May 1, 2026

Summary

Reproduction

Root cause and fix

Architect note

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant