
Delete a plan, push PR review notes to explain the preceding PRs #395

Closed
OisinKyne wants to merge 1 commit into oisin/377-5 from oisin/377-6

Conversation

@OisinKyne
Contributor

PR #386 review and surgery — 2026-04-29

This file captures the review process for integration/pr377-pr381 (PR #386,
"Integrate PR #377 OBOL Permit2 + PR #381 Hermes runtime, harden flows, add
flow-13 dual-stack OBOL"), what got cut, what stayed, and why. It is not
intended to live in the repo long-term; CLAUDE.md says plan/PR-review docs
shouldn't be committed. Keeping it under plans/ (gitignored) so it survives
the work session and can be pasted into the PR body or archived.

Branch state at start of review

Oisin's bottom-of-PR concerns (the starting point)

From the GitHub PR thread:

  1. SKILL.md "supplies a signing key to the controller" (lines 464–467)
    "is this really the best way to do this? would supplying a signature be
    more sane?" + "is this key in any way privileged?"
  2. probe.go (line 1) — "I think the commit that brings this in (and a
    lot of the focus of the defensive programming of this pr) is for an AI
    fuck up in a clean down of an automated test... 1) chainID is an
    insufficient probe for a status of an upstream. 2) erpc does the status
    management already, can't we parse it?"
  3. setMetadata diagnosis — "wrong diagnosis, probably right fix? i can
    imagine our erpc returning stale reverts unless we wait a block/request
    more exactly a fresh one"
  4. Flow private key in env — "i dont know why it needs the private key
    outside in an env as an example, maybe i'll know when i look"

Plus three GitHub-Advanced-Security CodeQL hits (the int conversion at
probe.go:144 was the only legitimate one; vanished with the file).

Layered-on findings from my own audit (pre-edit)

  • The controller's loadRegistrationSigningKey() codepath (controller.go:1383
    pre-edit) was fully wired — 9 call sites for SubmitRegister /
    SetMetadata / SetAgentURI — but never triggered: the
    ERC8004_PRIVATE_KEY / _FILE env vars are not set in the deployment
    template (internal/embed/infrastructure/base/templates/x402.yaml), no
    flow sets them, no test sets them, no GitHub Actions sets them. Pure dead
    code unless someone manually kubectl set envs the controller.
  • The CLI's --private-key-file flag was real — used by flow-11-dual-stack.sh
    and flow-14-live-obol-base-sepolia.sh for bootstrap. It was redundant
    with obol wallet import (which exists for Hermes since the Hermes
    default-agent landing).
  • model.Rank is not on a hot path. Called only from Hermes.SetupDefault,
    Onboard, writeDeploymentFiles, Sync (deploy/sync time). Not per chat
    request. Oisin's "called all the time?" worry was unfounded.
  • model.Rank vs the obol model prefer branch (#379, "Add provider smokes
    and model preference") — real conflict. Prefer reorders the LiteLLM
    ConfigMap so the user's choice sits at index 0; Rank reorders again on
    next sync. These two branches will collide and need reconciling when
    #379 rebases. Deferred to that PR.
  • Hermes 9119 vs 80 — 9119 is the dashboard container port; users hit it
    through Traefik on port 80 (8080 on macOS due to the privileged port
    caveat already in CLAUDE.md). Pre-existing UX, not introduced here.
  • 9ef8cda harden existing stack refresh was a bundle of (a) frontend
    v0.1.17-rc.3 → rc.5 bump (already shipped via #383, "chore: bump
    frontend image to v0.1.17-rc.5"), (b) hardcoded-resource
    Helm-ownership migration, (c) kubectl-patch field-manager migration. (b)
    and (c) target a one-off transition seen on dev laptops.
  • 4323dd3 Preserve LiteLLM config across defaults refresh is a clean
    refactor with real user value — the base helmfile template only contains
    paid/*, so without preserve+restore every obol stack up would wipe
    user-added cloud providers.
  • c08c873/34e62e5 Docker-network reclaim only fixes the leak inside flow
    scripts. obol stack purge calls k3d cluster delete which leaks the
    network when dev-mode pull-through registry-mirror containers are
    attached, exhausting the predefined CIDR pool after ~16 cycles.
  • The qwen3:8b laptop default and the scattered qwen3:0.6b examples
    predate the qwen3.5 / qwen3.6 generations. Cross-checked against
    ollama.com/library at review time:
    • qwen3.6 smallest is 27b (17GB); no small variants exist.
    • qwen3.5 has 4b/9b/27b/35b/122b. Validated test baseline is
      qwen3.5:9b (6.6GB).
    • qwen3:0.6b does still exist (523MB) but is too small to be a sane
      default in 2026.
    • The 026d30d commit message claim "qwen3.6:27b-coding-mxfp8 ~13 GB" was
      wrong; actual is 31GB.

Decisions and dialog, group by group

Group A — Controller signing key + --private-key-file flag

The most architecturally consequential cut. Tied directly to memory record:
"Never extract private keys from remote-signer; use its REST API for all
signing."

Two questions surfaced before any edits:

Bullet 1: what does the flag do? and if its called only on the
bootstrapping of the flows test, do we need it?
Bullet 2: [the controller env path] — okay delete. (not even in the
flows nor actions nor anywhere?)
Bullet 3: [the env-var on the CLI flag] sounds like what we want
instead of the flag in bullet 1, right?

I presented two options for the flag:

  • A: delete the flag and rewrite flow-11/flow-14 to call obol wallet import once at bootstrap, then run obol sell http / obol sell register
    with no flag. One signing surface.
  • B: keep the flag, fix the misleading Usage and SKILL.md text, keep
    the escape hatch.

Oisin chose A explicitly:

"it might have been written before we got obol hermes wallet import. I
think remove it all and call import in the flows. (option a)"

What got cut

  • internal/serviceoffercontroller/controller.go:
    • registrationKey *ecdsa.PrivateKey field
    • registrationOwnerAddress string field
    • loadRegistrationSigningKey() (read of ERC8004_PRIVATE_KEY/_FILE)
    • syncRegistrationMetadata() and the metadata-sync block in
      reconcileRegistrationActive
    • The c.registrationKey != nil self-register branch and on-chain
      tombstone branch — controller now only publishes the registration
      document and watches for the externally-driven register tx.
    • crypto/ecdsa and crypto imports
  • cmd/obol/sell.go:
    • --private-key-file flag from obol sell http and obol sell register
    • readPrivateKeyMaterial() helper
    • registerDirectWithKey() helper
    • PrivateKeyFile / PrivateKeyInput struct fields
    • The "fallback to private key file if no remote-signer" branch on both
      paths. Now: remote-signer required; clear error if missing.
  • cmd/obol/sell_test.go: tests for readPrivateKeyMaterial and the
    --private-key-file flag presence.

What stayed

  • Remote-signer-only registration (registerSponsored,
    registerDirectViaSigner) — already correct, untouched.
  • obol wallet import — already existed (cmd/obol/wallet.go:44,
    internal/hermes/wallet_import.go), untouched.
  • MetadataSynced, RegistrationTxHash, RegistrationSearchFromBlock
    fields on the CRD — kept for compat; no longer written by the controller.

Flow rewrites

  • flow-11-dual-stack.sh: insert alice wallet import --instance obol-agent --private-key-file <file> --force before alice sell http. Drop
    --private-key-file from the sell http call. Bob already had this
    pattern; Alice now matches.
  • flow-14-live-obol-base-sepolia.sh: same pattern before obol sell register. Drop --private-key-file from the call.
  • flow-13-dual-stack-obol.sh: comment refreshed to stop referencing the
    deleted flag.
  • .agents/skills/obol-stack-dev/SKILL.md: ERC-8004 prerequisites section
    rewritten to describe the remote-signer-only path. The setMetadata-revert
    section credits #387's WaitForAgent ("fix(erc8004): wait for read-side
    consistency before setMetadata") as the actual fix and demotes the
    "stale Anvil fork" hypothesis from the original PR description to a
    less-likely cause.

Group B — Drop internal/network/probe.go + obol network status chainId column

Oisin:

"for the probe.go and setMetadata stuff, it sounds like we should be
taking most of the slop out here and pointing out that the later fix
actually improved it, and were we to make an obol network status
command again, it should parse erpc."

Reasoning kept in commit:

  • eth_chainId is a poor liveness probe — can't catch lagging upstreams.
  • An Anvil fork of base-sepolia returns the same chain id (84532) as the
    real chain — exactly the false-positive case the probe was built for.
  • eRPC has its own per-upstream health metrics; the right column would
    parse those, not redo the work badly in-process.
  • The actual symptom that motivated the probe (ERC721NonexistentToken
    on a freshly-minted agent ID) was eRPC read-side staleness, fixed by
    #387's WaitForAgent ("fix(erc8004): wait for read-side consistency
    before setMetadata").

What got cut

  • internal/network/probe.go (deleted)
  • internal/network/probe_test.go (deleted)
  • cmd/obol/network.go: --no-probe and --probe-timeout flags,
    renderUpstreamProbes(), uiPrinter interface, time import.

flow-02 and flow-05 only grep for eRPC|Pod|Upstream headers, which
remain — no flow regressions.

Group C — Strip Helm-ownership migration code

Oisin:

"if we can't see a normal way kubectl-patch could be writing the
llmconfig, and we're happy this was on the fly edits, maybe we strip
all the rest of this code and tell the dev to not code for hacks unless
we really need to add resilience for some reason."

  • The kubectl-patch field manager ending up on the ConfigMap was a
    historical case (model.go before it passed --field-manager=helm).
    Current code uses --field-manager=helm, so the trigger is unreachable.
  • The hardcoded-namespace migration (migrateBaseHelmOwnership) had no
    resilience case — pure one-off.
  • BUT: the preserve+restore+merge mechanism IS a real-resilience case.
    internal/embed/infrastructure/base/templates/llm.yaml:60–77 shows
    the base ConfigMap only contains paid/*. Without preserve+restore,
    every obol stack up would wipe user-added cloud providers and custom
    endpoints. Kept.

What got cut

  • migrateBaseHelmOwnership() + baseHelmResource + namespaceArgs()
    (whole block)
  • needsLiteLLMConfigHelmMigration() and the conditional delete-and-restore
    branch in preserveLiteLLMConfigForHelm
  • TestNeedsLiteLLMConfigHelmMigration

What stayed (justified resilience)

  • preserveLiteLLMConfigForHelm — simplified to always snapshot
  • restoreLiteLLMConfig, mergeLiteLLMConfig, configMapFieldOwnershipManifest
  • kubectl.ApplyServerSideForceConflicts (used by restore)
  • --field-manager=helm patches in internal/model/model.go (one-line
    good hygiene; matches helm's manager so future patches don't conflict)
  • TestMergeLiteLLMConfigPreservesChartDefaultsAndPreviousModels
  • The frontend image bump (rc.3 → rc.5) — Oisin: "leave the frontend bump,
    i'm not worried about history neatness"
  • The "Existing Dev Stack Refresh" section in
    .agents/skills/obol-stack-dev/SKILL.md — Oisin: "the obol-stack-dev
    skill is dev focused tbf so maybe okay"

Group D — Reclaim leaked dev k3d networks on obol stack purge

Oisin:

"yes lets do in purge, maybe down if up will gracefully recreate them,
idk if that makes sense. only in development mode."

obol stack down calls k3d cluster stop which preserves the network
for Up to resume; cleaning on Down would break the stop/resume workflow.
Only Purge is wired up. The rare graceful-stop-fails-fall-through-to-delete
path inside Down is left to the next Purge to clean.

Implementation: lifted cleanup_k3d_obol_networks from flows/lib.sh
into Go (reclaimLeakedDevK3dNetworks in internal/stack/stack.go).
Filters to k3d-obol-stack-* networks, skips any with a live *-serverlb
or *-server-N attachment, force-disconnects mirror containers, removes.
Logs "Reclaimed N leaked dev registry network(s)" only when a network
was actually freed.

Added TestHasLiveK3dCluster to pin the live-cluster heuristic.

Group E — Qwen recommendation cleanup

Oisin:

"verify, and if there's a small 3.6, that can be recommended, else,
3.5:4b (for the laptop case). Do a little sanity check we're leaving
behind no qwen3:0.6b's, which are far too old and small these days."

Web-checked ollama.com/library:

| Family  | Small variants                   | Notes                    |
| ------- | -------------------------------- | ------------------------ |
| qwen3.6 | none under 27b                   | 27b = 17 GB              |
| qwen3.5 | 0.8b/2b/4b/9b/27b/35b/122b       | 4b = 3.4 GB; 9b = 6.6 GB |
| qwen3   | 0.6b/1.7b/4b/8b/14b/30b/32b/235b | older generation         |

Decision tree:

  • Capable host (≥32GB): qwen3.6:27b 17 GB
  • Coding (capable host): qwen3.6:27b-coding-mxfp8 31 GB (was wrongly listed as ~13 GB in 026d30d's commit message)
  • Validated baseline: qwen3.5:9b 6.6 GB
  • Low-RAM laptop: qwen3.5:4b 3.4 GB (replaces every qwen3:0.6b user-facing reference)
  • Reasoning: deepseek-r1:8b 4.9 GB
  • Lightweight: gemma3:4b 3.3 GB

internal/model/rank.go and internal/model/rank_test.go deliberately
untouched — the qwen3:0.6b references there are regression guards for
users who happen to have it pulled, not recommendations.

Group F — Delete docs/plans/obol-x402-path-comparison.md

CLAUDE.md says plan/report docs don't belong in the repo. The 2026-04-11
benchmark write-up belongs in PR #386's body or a Linear thread, not
docs/plans/. Empty docs/plans/ directory removed too.

Group G — This file

Per Oisin: "Add to your task list that you'll need to write to a file a
transcript of this review process, including stuff like this dialog and
why its being removed."

Lives under plans/ (untracked) so it doesn't violate the no-plan-doc
rule.

Things deliberately not changed

  • Rank vs prefer interaction (with #379) — real conflict, but #379 is
    not on this branch and will need to rebase. Ownership of fixing it sits
    with whoever rebases #379. Note from Oisin: "okay, we'll fix this when
    we tackle the obol model prefer branch, which isn't in this at all."
  • Hermes port 9119 — No action; this is an internal container port,
    user lands on port 80 via Traefik. UX issue with :8080 on macOS
    pre-exists, covered already in CLAUDE.md.
  • /etc/hosts URL print — Oisin: "deal with that later, iirc we print
    the right url if we fail to bind."
  • The frontend image bump in 9ef8cda — left in. Oisin: "leave the
    frontend bump, i'm not worried about history neatness."
  • Existing-Dev-Stack-Refresh section in obol-stack-dev SKILL.md
    kept since the skill is dev-focused.
  • internal/model/rank.go / rank_test.go — left intact. Regression
    guards for users with qwen3:0.6b pulled, not recommendations.
  • x402-verifier's verifyOnly: true invariant — unchanged. Group A
    did not touch the payment verification path.
  • Tunnel HTTPRoute hostname restrictions — verified intact during
    review; Group A did not change.

Carry-overs / follow-up PRs

  1. Reconcile obol model prefer (#379, "Add provider smokes and model
    preference") with model.Rank when #379 rebases. Either Rank respects
    the user's explicit preference, or Prefer writes a separate
    preferred: field the agent reads first.
  2. go vet warnings on internal/enclave/enclave_darwin.go about
    unsafe.Pointer usage — pre-existing CGo Secure Enclave code,
    untouched here. Worth a separate cleanup pass.
  3. Network reclaim on obol stack down fallback path — the rare
    k3d cluster stop failure → fallback to k3d cluster delete could
    leak a network in dev mode. Currently relies on the next Purge to
    clean up. If this becomes a real problem, add a call there.
  4. Future obol network status reachability column — if we want one,
    parse eRPC's existing health metrics rather than redo the chainId
    probe in-process.

Final status

go build ./... && go test ./... clean after every group. Pre-existing
go vet complaints on enclave Darwin code only.

Files touched across all groups:

| Group | File                                            | Change                                  |
| ----- | ----------------------------------------------- | --------------------------------------- |
| A     | internal/serviceoffercontroller/controller.go   | strip registration signing path         |
| A     | cmd/obol/sell.go                                | strip --private-key-file flag + helpers |
| A     | cmd/obol/sell_test.go                           | drop tests for removed code             |
| A     | flows/flow-11-dual-stack.sh                     | call wallet import before sell http     |
| A     | flows/flow-13-dual-stack-obol.sh                | comment refresh                         |
| A     | flows/flow-14-live-obol-base-sepolia.sh         | call wallet import before sell register |
| A     | .agents/skills/obol-stack-dev/SKILL.md          | rewrite ERC-8004 + setMetadata sections |
| B     | internal/network/probe.go                       | deleted                                 |
| B     | internal/network/probe_test.go                  | deleted                                 |
| B     | cmd/obol/network.go                             | drop probe rendering + flags            |
| C     | internal/stack/stack.go                         | drop migration helpers                  |
| C     | internal/stack/stack_test.go                    | drop migration test                     |
| D     | internal/stack/stack.go                         | add reclaimLeakedDevK3dNetworks (purge) |
| D     | internal/stack/stack_test.go                    | add TestHasLiveK3dCluster               |
| E     | cmd/obol/model.go                               | refresh recommendations                 |
| E     | internal/openclaw/openclaw.go                   | no-models hint                          |
| E     | internal/embed/skills/{sell,monetize-guide}/SKILL.md | examples                           |
| E     | docs/getting-started.md                         | pull example                            |
| E     | docs/guides/monetize-inference.md               | every qwen3:0.6b updated                |
| E     | CLAUDE.md                                       | example with two ollama-detected models |
| E     | flows/lib.sh                                    | model removal loop                      |
| E     | flows/flow-03-inference.sh                      | comment cleanup                         |
| E     | internal/openclaw/monetize_integration_test.go  | comment                                 |
| F     | docs/plans/obol-x402-path-comparison.md         | deleted                                 |
| F     | docs/plans/                                     | directory removed                       |

Six commits planned, one per group A–F. Group G (this transcript) is
intentionally not committed.

@OisinKyne OisinKyne requested a review from bussyjd April 29, 2026 00:27

@bussyjd
Collaborator

bussyjd commented Apr 29, 2026

Superseded by #386 — all commits from this branch are already present at the tip of integration/pr377-pr381 (verified by git log <branch> ^origin/integration/pr377-pr381 returning empty). Closing to keep the queue tidy. Track final landing on main via #386.

@bussyjd bussyjd closed this Apr 29, 2026