Skip to content

channels: persistent intro-DM ledger for HTTP mode (H3 onboarding fix)#119

Merged
mcheemaa merged 2 commits intomainfrom
fix/2026-05-02-h3-onboarding-http-mode-idempotency
May 1, 2026
Merged

channels: persistent intro-DM ledger for HTTP mode (H3 onboarding fix)#119
mcheemaa merged 2 commits intomainfrom
fix/2026-05-02-h3-onboarding-http-mode-idempotency

Conversation

@mcheemaa
Copy link
Copy Markdown
Member

@mcheemaa mcheemaa commented May 1, 2026

Summary

Closes finding H3 from the 2026-05-01 real-vs-fake audit v2: two reachable HTTP-mode tenants (gilded-hearth, gogo-aim) sat at /health.onboarding="pending" for 49-65h with no intro DM ever fired. Root cause is two layered bugs in this repo:

  1. The HTTP receiver's firstDmSent flag is in-memory only. On a transient Slack send failure (rate limit, no ts returned, conversations.open failure), the flag stays false, but the next opportunity to retry is the next process restart. Production tenants with continuous uptime get exactly one shot.
  2. markOnboardingStarted / markOnboardingComplete only run from the Socket Mode startOnboarding path in src/index.ts, which is gated on channelsConfig?.slack?.{owner_user_id,default_user_id,default_channel_id}. None of those exist in HTTP-mode production because no channels.yaml is baked into the rootfs. So /health.onboarding stays at the default "pending" forever even when the HTTP-mode intro DM did go out.

Reproduction

  • curl https://gilded-hearth.phantom.ghostwright.dev/health -> "onboarding":"pending" after 65h uptime, evolution.generation: 0, scheduler.total: 0. Same for gogo-aim (49h, generation 1).
  • The agents are technically alive (channels.slack: true, memory healthy) but inert because no first user contact has ever happened.

Root cause and fix

src/channels/slack-http-receiver.ts connect() does the intro DM via sendIntroductionDm and gates on the in-memory firstDmSent instance flag. Both gating and observability need to be persistent.

This PR adds an optional IntroductionLedger surface on SlackHttpChannel:

export type IntroductionLedger = {
  isIntroSent(): boolean;
  markIntroSent(): void;
};

src/index.ts wires it against the existing SQLite onboarding_state table:

introductionLedger: {
  isIntroSent: () => getOnboardingStatus(db).status === "complete",
  markIntroSent: () => {
    markOnboardingStarted(db);
    markOnboardingComplete(db);
  },
}

Behaviour the ledger pins:

  • /health.onboarding flips from "pending" to "complete" on the first successful intro DM, so audits and dashboards see the right state.
  • A process restart consults the SQLite row and skips the intro DM when it is already sent, so the installer never gets re-DMed.
  • A transient Slack failure (rate limit, ts returned empty, conversations.open error, exception thrown) leaves the row clear so the next process start retries naturally. The stamp happens only AFTER result.sent === true.

The ledger is optional; tests and single-tenant dev keep the in-memory firstDmSent fallback with the same semantics it had before. The constructor stays back-compat with every existing caller.

Architect note

This fix lands BEFORE the rootfs rebuild (audit M2) so the next rebuild ships a tenant whose first DM correctly stamps /health and is idempotent across restarts. It rides cleanly with phantom #116 (Phase 12 user research enrichment) which adds firstboot_state for the Socket Mode path; the two ledgers cover different transports and do not overlap.

Live tenants stuck at pending today still need a one-shot operator action when bare-metal SSH is restored: stamp onboarding_state to complete on the existing in-VM SQLite (or send the intro DM manually). Documented as an operator action in the H3 root-cause note (see local doc in phantom-cloud-deploy).

Test plan

  • bun test src/channels/__tests__/slack-http-receiver.test.ts - 64 pass (7 new ledger-path cases)
  • bun test src/channels/__tests__/slack-channel-factory.test.ts - 20 pass (1 new factory-forward case)
  • bun test (full suite) - 2316 pass / 10 skip / 1 todo / 0 fail
  • bun run typecheck - clean
  • bun run lint - clean

New regression tests pin: ledger short-circuits intro DM on a true read; ledger stamps only after a successful send; transient Slack errors keep the row clear; the stamp survives reconnect; a throwing markIntroSent does not derail connect(); the in-memory fallback still works without a ledger; the factory forwards the ledger to the channel.

mcheemaa and others added 2 commits May 1, 2026 15:20
Production HTTP-mode tenants stayed at /health.onboarding="pending"
forever and could re-DM the installer on every process restart. Both
behaviours trace to the in-memory firstDmSent flag in SlackHttpChannel
plus the fact that markOnboardingStarted only runs from the Socket Mode
flow (gated on a channels.yaml that is never baked in HTTP-mode rootfs).

Add an optional IntroductionLedger surface on SlackHttpChannel that
production wires against the SQLite onboarding_state table. The
receiver short-circuits the intro DM when the ledger reports sent and
stamps it only AFTER a successful Slack send, so:

  - /health.onboarding flips to "complete" on the first successful DM
  - a process restart never re-DMs the installer
  - a transient Slack failure (rate limit, no ts returned, throw) leaves
    the row clear so the next process start retries

The ledger is optional; tests and single-tenant dev fall back to the
existing in-memory flag with the prior semantics. The SlackHttpChannel
constructor stays compatible with all existing callers.

Tests: 7 new ledger-path cases on slack-http-receiver plus a
slack-channel-factory regression that pins the introductionLedger
forward, all green. Full suite 2316 pass / 0 fail.

Refs: real-vs-fake audit v2 finding H3
@mcheemaa mcheemaa enabled auto-merge (squash) May 1, 2026 22:24
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a98d91cd69

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/index.ts
metadataBaseUrl: process.env.METADATA_BASE_URL,
metrics: slackMetrics,
introductionLedger: {
isIntroSent: () => getOnboardingStatus(db).status === "complete",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Key intro-DM ledger by installer identity

The new ledger wiring treats any onboarding_state.status === "complete" as proof that the intro DM was already sent, but that state is global and not tied to identity.slack.installerUserId. After a Slack reinstall/identity rotation (which this codebase supports via metadata update + restart), SlackHttpChannel.connect() will skip sendIntroductionDm for the new installer because the old install already marked complete. This regresses first-contact behavior for reinstall flows and leaves the new installer without the onboarding DM.

Useful? React with 👍 / 👎.

@mcheemaa mcheemaa merged commit 3e37d50 into main May 1, 2026
1 check passed
@mcheemaa mcheemaa deleted the fix/2026-05-02-h3-onboarding-http-mode-idempotency branch May 1, 2026 22:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant