
Delete a plan, push PR review notes to explain the preceding PRs #395

Closed
OisinKyne wants to merge 1 commit into oisin/377-5 from oisin/377-6

Conversation

@OisinKyne
Contributor

PR #386 review and surgery — 2026-04-29

This file captures the review process for integration/pr377-pr381 (PR #386,
"Integrate PR #377 OBOL Permit2 + PR #381 Hermes runtime, harden flows, add
flow-13 dual-stack OBOL"), what got cut, what stayed, and why. It is not
intended to live in the repo long-term; CLAUDE.md says plan/PR-review docs
shouldn't be committed. Keeping it under plans/ (gitignored) so it survives
the work session and can be pasted into the PR body or archived.

Branch state at start of review

Oisin's bottom-of-PR concerns (the starting point)

From the GitHub PR thread:

  1. SKILL.md "supplies a signing key to the controller" (lines 464–467)
    "is this really the best way to do this? would supplying a signature be
    more sane?" + "is this key in any way privileged?"
  2. probe.go (line 1) — "I think the commit that brings this in (and a
    lot of the focus of the defensive programming of this pr) is for an AI
    fuck up in a clean down of an automated test... 1) chainID is an
    insufficient probe for a status of an upstream. 2) erpc does the status
    management already, can't we parse it?"
  3. setMetadata diagnosis — "wrong diagnosis, probably right fix? i can
    imagine our erpc returning stale reverts unless we wait a block/request
    more exactly a fresh one"
  4. Flow private key in env — "i dont know why it needs the private key
    outside in an env as an example, maybe i'll know when i look"

Plus three GitHub-Advanced-Security CodeQL hits (the int conversion at
probe.go:144 was the only legitimate one; vanished with the file).

Layered-on findings from my own audit (pre-edit)

  • The controller's loadRegistrationSigningKey() codepath (controller.go:1383
    pre-edit) was fully wired — 9 call sites for SubmitRegister /
    SetMetadata / SetAgentURI — but never triggered: the
    ERC8004_PRIVATE_KEY / _FILE env vars are not set in the deployment
    template (internal/embed/infrastructure/base/templates/x402.yaml), no
    flow sets them, no test sets them, no GitHub Actions sets them. Pure dead
    code unless someone manually kubectl set envs the controller.
  • The CLI's --private-key-file flag was real — used by flow-11-dual-stack.sh
    and flow-14-live-obol-base-sepolia.sh for bootstrap. It was redundant
    with obol wallet import (which exists for Hermes since the Hermes
    default-agent landing).
  • model.Rank is not on a hot path. Called only from Hermes.SetupDefault,
    Onboard, writeDeploymentFiles, Sync (deploy/sync time). Not per chat
    request. Oisin's "called all the time?" worry was unfounded.
  • model.Rank vs the obol model prefer branch (#379, "Add provider smokes
    and model preference") — real conflict. Prefer reorders the LiteLLM
    ConfigMap so the user's choice sits at index 0; Rank reorders again on
    next sync. These two branches will collide and need reconciling when
    #379 rebases. Deferred to that PR.
  • Hermes 9119 vs 80 — 9119 is the dashboard container port; users hit it
    through Traefik on port 80 (8080 on macOS due to the privileged port
    caveat already in CLAUDE.md). Pre-existing UX, not introduced here.
  • 9ef8cda harden existing stack refresh was a bundle of (a) frontend
    v0.1.17-rc.3 → rc.5 bump (already shipped via #383, "chore: bump
    frontend image to v0.1.17-rc.5"), (b) hardcoded-resource
    Helm-ownership migration, (c) kubectl-patch field-manager migration. (b)
    and (c) target a one-off transition seen on dev laptops.
  • 4323dd3 Preserve LiteLLM config across defaults refresh is a clean
    refactor with real user value — the base helmfile template only contains
    paid/*, so without preserve+restore every obol stack up would wipe
    user-added cloud providers.
  • c08c873/34e62e5 Docker-network reclaim only fixes the leak inside flow
    scripts. obol stack purge calls k3d cluster delete which leaks the
    network when dev-mode pull-through registry-mirror containers are
    attached, exhausting the predefined CIDR pool after ~16 cycles.
  • The qwen3:8b laptop default and the scattered qwen3:0.6b examples
    predate the qwen3.5 / qwen3.6 generations. Cross-checked against
    ollama.com/library at review time:
    • qwen3.6 smallest is 27b (17GB); no small variants exist.
    • qwen3.5 has 4b/9b/27b/35b/122b. Validated test baseline is
      qwen3.5:9b (6.6GB).
    • qwen3:0.6b does still exist (523MB) but is too small to be a sane
      default in 2026.
    • The 026d30d commit message claim "qwen3.6:27b-coding-mxfp8 ~13 GB" was
      wrong; actual is 31GB.

Decisions and dialog, group by group

Group A — Controller signing key + --private-key-file flag

The most architecturally consequential cut. Tied directly to memory record:
"Never extract private keys from remote-signer; use its REST API for all
signing."

Two questions surfaced before any edits:

Bullet 1: what does the flag do? and if its called only on the
bootstrapping of the flows test, do we need it?
Bullet 2: [the controller env path] — okay delete. (not even in the
flows nor actions nor anywhere?)
Bullet 3: [the env-var on the CLI flag] sounds like what we want
instead of the flag in bullet 1, right?

I presented two options for the flag:

  • A: delete the flag and rewrite flow-11/flow-14 to call obol wallet import once at bootstrap, then run obol sell http / obol sell register
    with no flag. One signing surface.
  • B: keep the flag, fix the misleading Usage and SKILL.md text, keep
    the escape hatch.

Oisin chose A explicitly:

"it might have been written before we got obol hermes wallet import. I
think remove it all and call import in the flows. (option a)"

What got cut

  • internal/serviceoffercontroller/controller.go:
    • registrationKey *ecdsa.PrivateKey field
    • registrationOwnerAddress string field
    • loadRegistrationSigningKey() (read of ERC8004_PRIVATE_KEY/_FILE)
    • syncRegistrationMetadata() and the metadata-sync block in
      reconcileRegistrationActive
    • The c.registrationKey != nil self-register branch and on-chain
      tombstone branch — controller now only publishes the registration
      document and watches for the externally-driven register tx.
    • crypto/ecdsa and crypto imports
  • cmd/obol/sell.go:
    • --private-key-file flag from obol sell http and obol sell register
    • readPrivateKeyMaterial() helper
    • registerDirectWithKey() helper
    • PrivateKeyFile / PrivateKeyInput struct fields
    • The "fallback to private key file if no remote-signer" branch on both
      paths. Now: remote-signer required; clear error if missing.
  • cmd/obol/sell_test.go: tests for readPrivateKeyMaterial and the
    --private-key-file flag presence.

What stayed

  • Remote-signer-only registration (registerSponsored,
    registerDirectViaSigner) — already correct, untouched.
  • obol wallet import — already existed (cmd/obol/wallet.go:44,
    internal/hermes/wallet_import.go), untouched.
  • MetadataSynced, RegistrationTxHash, RegistrationSearchFromBlock
    fields on the CRD — kept for compat; no longer written by the controller.

Flow rewrites

  • flow-11-dual-stack.sh: insert alice wallet import --instance obol-agent --private-key-file <file> --force before alice sell http. Drop
    --private-key-file from the sell http call. Bob already had this
    pattern; Alice now matches.
  • flow-14-live-obol-base-sepolia.sh: same pattern before obol sell register. Drop --private-key-file from the call.
  • flow-13-dual-stack-obol.sh: comment refreshed to stop referencing the
    deleted flag.
  • .agents/skills/obol-stack-dev/SKILL.md: ERC-8004 prerequisites section
    rewritten to describe the remote-signer-only path. The setMetadata-revert
    section credits #387's WaitForAgent ("fix(erc8004): wait for read-side
    consistency before setMetadata") as the actual fix and demotes the
    "stale Anvil fork" hypothesis from the original PR description to a
    less-likely cause.

Group B — Drop internal/network/probe.go + obol network status chainId column

Oisin:

"for the probe.go and setMetadata stuff, it sounds like we should be
taking most of the slop out here and pointing out that the later fix
actually improved it, and were we to make an obol network status
command again, it should parse erpc."

Reasoning kept in commit:

  • eth_chainId is a poor liveness probe — can't catch lagging upstreams.
  • An Anvil fork of base-sepolia returns the same chain id (84532) as the
    real chain — exactly the false-positive case the probe was built for.
  • eRPC has its own per-upstream health metrics; the right column would
    parse those, not redo the work badly in-process.
  • The actual symptom that motivated the probe (ERC721NonexistentToken
    on a freshly-minted agent ID) was eRPC read-side staleness, fixed by
    #387's WaitForAgent ("fix(erc8004): wait for read-side consistency
    before setMetadata").

What got cut

  • internal/network/probe.go (deleted)
  • internal/network/probe_test.go (deleted)
  • cmd/obol/network.go: --no-probe and --probe-timeout flags,
    renderUpstreamProbes(), uiPrinter interface, time import.

flow-02 and flow-05 only grep for eRPC|Pod|Upstream headers, which
remain — no flow regressions.

Group C — Strip Helm-ownership migration code

Oisin:

"if we can't see a normal way kubectl-patch could be writing the
llmconfig, and we're happy this was on the fly edits, maybe we strip
all the rest of this code and tell the dev to not code for hacks unless
we really need to add resilience for some reason."

  • The kubectl-patch field manager ending up on the ConfigMap was a
    historical case (model.go before it passed --field-manager=helm).
    Current code uses --field-manager=helm, so the trigger is unreachable.
  • The hardcoded-namespace migration (migrateBaseHelmOwnership) had no
    resilience case — pure one-off.
  • BUT: the preserve+restore+merge mechanism IS a real-resilience case.
    internal/embed/infrastructure/base/templates/llm.yaml:60–77 shows
    the base ConfigMap only contains paid/*. Without preserve+restore,
    every obol stack up would wipe user-added cloud providers and custom
    endpoints. Kept.

What got cut

  • migrateBaseHelmOwnership() + baseHelmResource + namespaceArgs()
    (whole block)
  • needsLiteLLMConfigHelmMigration() and the conditional delete-and-restore
    branch in preserveLiteLLMConfigForHelm
  • TestNeedsLiteLLMConfigHelmMigration

What stayed (justified resilience)

  • preserveLiteLLMConfigForHelm — simplified to always snapshot
  • restoreLiteLLMConfig, mergeLiteLLMConfig, configMapFieldOwnershipManifest
  • kubectl.ApplyServerSideForceConflicts (used by restore)
  • --field-manager=helm patches in internal/model/model.go (one-line
    good hygiene; matches helm's manager so future patches don't conflict)
  • TestMergeLiteLLMConfigPreservesChartDefaultsAndPreviousModels
  • The frontend image bump (rc.3 → rc.5) — Oisin: "leave the frontend bump,
    i'm not worried about history neatness"
  • The "Existing Dev Stack Refresh" section in
    .agents/skills/obol-stack-dev/SKILL.md — Oisin: "the obol-stack-dev
    skill is dev focused tbf so maybe okay"

Group D — Reclaim leaked dev k3d networks on obol stack purge

Oisin:

"yes lets do in purge, maybe down if up will gracefully recreate them,
idk if that makes sense. only in development mode."

obol stack down calls k3d cluster stop which preserves the network
for Up to resume; cleaning on Down would break the stop/resume workflow.
Only Purge is wired up. The rare graceful-stop-fails-fall-through-to-delete
path inside Down is left to the next Purge to clean.

Implementation: lifted cleanup_k3d_obol_networks from flows/lib.sh
into Go (reclaimLeakedDevK3dNetworks in internal/stack/stack.go).
Filters to k3d-obol-stack-* networks, skips any with a live *-serverlb
or *-server-N attachment, force-disconnects mirror containers, removes.
Logs "Reclaimed N leaked dev registry network(s)" only when a network
was actually freed.

Added TestHasLiveK3dCluster to pin the live-cluster heuristic.

Group E — Qwen recommendation cleanup

Oisin:

"verify, and if there's a small 3.6, that can be recommended, else,
3.5:4b (for the laptop case). Do a little sanity check we're leaving
behind no qwen3:0.6b's, which are far too old and small these days."

Web-checked ollama.com/library:

| Family  | Small variants                   | Notes                    |
| ------- | -------------------------------- | ------------------------ |
| qwen3.6 | none under 27b                   | 27b = 17 GB              |
| qwen3.5 | 0.8b/2b/4b/9b/27b/35b/122b       | 4b = 3.4 GB; 9b = 6.6 GB |
| qwen3   | 0.6b/1.7b/4b/8b/14b/30b/32b/235b | older generation         |

Decision tree:

  • Capable host (≥32GB): qwen3.6:27b 17 GB
  • Coding (capable host): qwen3.6:27b-coding-mxfp8 31 GB (was wrongly listed as ~13 GB in 026d30d's commit message)
  • Validated baseline: qwen3.5:9b 6.6 GB
  • Low-RAM laptop: qwen3.5:4b 3.4 GB (replaces every qwen3:0.6b user-facing reference)
  • Reasoning: deepseek-r1:8b 4.9 GB
  • Lightweight: gemma3:4b 3.3 GB

internal/model/rank.go and internal/model/rank_test.go deliberately
untouched — the qwen3:0.6b references there are regression guards for
users who happen to have it pulled, not recommendations.

Group F — Delete docs/plans/obol-x402-path-comparison.md

CLAUDE.md says plan/report docs don't belong in the repo. The 2026-04-11
benchmark write-up belongs in PR #386's body or a Linear thread, not
docs/plans/. Empty docs/plans/ directory removed too.

Group G — This file

Per Oisin: "Add to your task list that you'll need to write to a file a
transcript of this review process, including stuff like this dialog and
why its being removed."

Lives under plans/ (untracked) so it doesn't violate the no-plan-doc
rule.

Things deliberately not changed

  • Rank vs prefer interaction (with #379) — real conflict, but #379 is
    not on this branch and will need to rebase. Ownership of fixing it sits
    with whoever rebases #379. Note from Oisin: "okay, we'll fix this when
    we tackle the obol model prefer branch, which isn't in this at all."
  • Hermes port 9119 — No action; this is an internal container port,
    user lands on port 80 via Traefik. UX issue with :8080 on macOS
    pre-exists, covered already in CLAUDE.md.
  • /etc/hosts URL print — Oisin: "deal with that later, iirc we print
    the right url if we fail to bind."
  • The frontend image bump in 9ef8cda — left in. Oisin: "leave the
    frontend bump, i'm not worried about history neatness."
  • Existing-Dev-Stack-Refresh section in obol-stack-dev SKILL.md
    kept since the skill is dev-focused.
  • internal/model/rank.go / rank_test.go — left intact. Regression
    guards for users with qwen3:0.6b pulled, not recommendations.
  • x402-verifier's verifyOnly: true invariant — unchanged. Group A
    did not touch the payment verification path.
  • Tunnel HTTPRoute hostname restrictions — verified intact during
    review; Group A did not change.

Carry-overs / follow-up PRs

  1. Reconcile obol model prefer (#379, "Add provider smokes and model
    preference") with model.Rank when #379 rebases. Either Rank respects
    the user's explicit preference, or Prefer writes a separate
    preferred: field the agent reads first.
  2. go vet warnings on internal/enclave/enclave_darwin.go about
    unsafe.Pointer usage — pre-existing CGo Secure Enclave code,
    untouched here. Worth a separate cleanup pass.
  3. Network reclaim on obol stack down fallback path — the rare
    k3d cluster stop failure → fallback to k3d cluster delete could
    leak a network in dev mode. Currently relies on the next Purge to
    clean up. If this becomes a real problem, add a call there.
  4. Future obol network status reachability column — if we want one,
    parse eRPC's existing health metrics rather than redo the chainId
    probe in-process.

Final status

go build ./... && go test ./... clean after every group. Pre-existing
go vet complaints on enclave Darwin code only.

Files touched across all groups:

| Group | File                                            | Change                                  |
| ----- | ----------------------------------------------- | --------------------------------------- |
| A     | internal/serviceoffercontroller/controller.go   | strip registration signing path         |
| A     | cmd/obol/sell.go                                | strip --private-key-file flag + helpers |
| A     | cmd/obol/sell_test.go                           | drop tests for removed code             |
| A     | flows/flow-11-dual-stack.sh                     | call wallet import before sell http     |
| A     | flows/flow-13-dual-stack-obol.sh                | comment refresh                         |
| A     | flows/flow-14-live-obol-base-sepolia.sh         | call wallet import before sell register |
| A     | .agents/skills/obol-stack-dev/SKILL.md          | rewrite ERC-8004 + setMetadata sections |
| B     | internal/network/probe.go                       | deleted                                 |
| B     | internal/network/probe_test.go                  | deleted                                 |
| B     | cmd/obol/network.go                             | drop probe rendering + flags            |
| C     | internal/stack/stack.go                         | drop migration helpers                  |
| C     | internal/stack/stack_test.go                    | drop migration test                     |
| D     | internal/stack/stack.go                         | add reclaimLeakedDevK3dNetworks (purge) |
| D     | internal/stack/stack_test.go                    | add TestHasLiveK3dCluster               |
| E     | cmd/obol/model.go                               | refresh recommendations                 |
| E     | internal/openclaw/openclaw.go                   | no-models hint                          |
| E     | internal/embed/skills/{sell,monetize-guide}/SKILL.md | examples                           |
| E     | docs/getting-started.md                         | pull example                            |
| E     | docs/guides/monetize-inference.md               | every qwen3:0.6b updated                |
| E     | CLAUDE.md                                       | example with two ollama-detected models |
| E     | flows/lib.sh                                    | model removal loop                      |
| E     | flows/flow-03-inference.sh                      | comment cleanup                         |
| E     | internal/openclaw/monetize_integration_test.go  | comment                                 |
| F     | docs/plans/obol-x402-path-comparison.md         | deleted                                 |
| F     | docs/plans/                                     | directory removed                       |

Six commits planned, one per group A–F. Group G (this transcript) is
intentionally not committed.

@OisinKyne OisinKyne requested a review from bussyjd April 29, 2026 00:27

@bussyjd
Collaborator

bussyjd commented Apr 29, 2026

Superseded by #386 — all commits from this branch are already present at the tip of integration/pr377-pr381 (verified by git log <branch> ^origin/integration/pr377-pr381 returning empty). Closing to keep the queue tidy. Track final landing on main via #386.

@bussyjd bussyjd closed this Apr 29, 2026