Integrate PR #377 (OBOL Permit2) + PR #381 (Hermes runtime), harden flows, add flow-13 dual-stack OBOL #386

Merged
bussyjd merged 84 commits into main from integration/pr377-pr381
Apr 29, 2026
bussyjd (Collaborator) commented Apr 27, 2026

Summary

This PR integrates #377 (OBOL Permit2) and #381 (Hermes default agent runtime) into a single testable branch and lands the hardening needed to keep both flows green together against real infrastructure (live Base Sepolia, real Cloudflare tunnel, two independent obol stacks on the same host).

Two end-to-end test flows now pass cleanly:

| Flow | Asset | Chain | Result |
|---|---|---|---|
| flow-11-dual-stack.sh | USDC | live Base Sepolia (84532) | 48/45 PASS, 0 FAIL |
| flow-13-dual-stack-obol.sh (new) | OBOL ERC20Permit | Anvil fork of 84532 | 56/53 PASS, 0 FAIL |

PASS-count > step-count because some steps assert multiple invariants (e.g. step 21 archives 2 receipts + 1 Agent ID).


Architecture under test

flow-11 — live Base Sepolia, USDC, ERC-8004

                    ┌─────────────────────────────────────────────┐
                    │   Base Sepolia (live, public)               │
                    │     ERC-8004 IdentityRegistry: 0x8004A818…  │
                    │     USDC:                       0x036CbD53… │
                    │     Facilitator: x402.gcp.obol.tech         │
                    └────────────▲─────────────────────▲──────────┘
                                 │                     │
                  ┌──── register/metadata tx ────┐    │
                  │                              │    │
                  │              ┌── Cloudflare ─┘    │
                  │              │  (real public      │
                  │              │  *.trycloudflare.com)
                  │              │                    │ USDC
                  │              ▼                    │ Transfer
                  │     ┌─────────────────────┐      │ (settlement)
                  │     │  cloudflared (k3d)  │      │
                  │     └────────┬────────────┘      │
                  │              │                    │
       ┌──────────┴──────────┐   │   ┌────────────────┴────────────┐
       │  Alice k3d cluster  │   │   │  Bob k3d cluster            │
       │  ─────────────────  │   │   │  ───────────────            │
       │  Traefik gateway    │   │   │  Hermes agent (api server)  │
       │  ServiceOffer CR    │◄──┘   │  buy.py (x402)              │
       │  LiteLLM + Ollama   │       │  x402-buyer sidecar         │
       │  remote-signer      │       │  remote-signer (signs auths)│
       └─────────────────────┘       └─────────────────────────────┘
                  testbed-A                       testbed-A
                          ┌──────── 1 host ───────┐
                          │ aarch64 Linux         │
                          │ docker / k3d / cast   │
                          └───────────────────────┘

flow-13 — dual-stack Anvil fork, OBOL Permit2

                  ┌──────────────────────────────────────────┐
                  │   Host shell (testbed-B)                  │
                  │                                           │
                  │   anvil --fork-url base-sepolia           │
                  │     --port $ANVIL_PORT --host 0.0.0.0     │
                  │     (chain 84532, EVM cheats)             │
                  │                                           │
                  │   x402-rs facilitator (ObolNetwork fork)  │
                  │     v2-eip155-exact +                     │
                  │     config.eip2612_gas_sponsoring=true    │
                  │                                           │
                  │   ForkObolToken (ERC20Permit/Permit2)     │
                  │     deployed via `forge create` to fork   │
                  │     mint(addr, 10*1e18) Alice + Bob       │
                  └─────────▲────────────────▲────────────────┘
                            │                │
                            │  host.k3d.internal:$PORT
                            │   (docker bridge)
                            │                │
                  ┌─────────┴────┐  ┌────────┴──────────────────┐
                  │  Alice k3d   │  │  Bob k3d                  │
                  │  ──────────  │  │  ───────                  │
                  │  cloudflared │◄─┤  Hermes agent             │
                  │  trycloudflare│ │  buy.py (Permit2 signer)  │
                  │  Traefik +   │  │  x402-buyer sidecar       │
                  │  ServiceOffer│  │   replays signed auths    │
                  │  (transferMethod=permit2,                   │
                  │   eip712Name=Obol Network,                  │
                  │   eip712Version=1)                          │
                  └──────────────┘  └───────────────────────────┘
                       │                    │
                       │   ┌────────────────┘
                       │   │  cross-cluster discovery via REAL
                       └───┤  Cloudflare tunnel (faithful to prod)
                           ▼
                       ServiceOffer URL: https://*.trycloudflare.com

The two clusters share one Anvil fork through host.k3d.internal:$ANVIL_PORT, so settlements are observable from both sides via the SAME ERC-20 contract address. The cross-cluster discovery still goes over a real Cloudflare tunnel — we do not bypass cloudflared.


On-chain receipts

flow-11 — Base Sepolia (live)

Latest green run (commit 234b05f, agent ID 5250), all four artifacts archived under .tmp/flow-11-<ts>/*-receipt.json:

| Receipt | TX hash | Basescan |
|---|---|---|
| ERC-8004 registration | 0x78087f828d42c14d8fbf1f0bcfc6589350109e05d27bcb1840a990bd8d78dd7a | link |
| Metadata SetMetadata | 0xae255c4ad4ce3919645e5c0cbacbd5fcb16e2995ef6c09fa36101cde06846e1b | link |
| Funding (USDC Alice→Bob signer, 0.05 USDC) | 0xccb35434c6a78450b70632ee5ea72795bbf31d167c9052aa428a743f8e3d39d1 | link |
| Settlement (USDC Bob signer→Alice, 0.001 USDC) | 0xe9a406f57fea24f795801880eb58e4ceea5040203accd5579f154bb36c4da55c | link |

flow-13 — Anvil fork (local, ephemeral)

Latest green run (commit d3ce623):

| Receipt | TX hash |
|---|---|
| ForkObolToken deploy | (logged at runtime; address 0x31e3fc9D98Cf9A62980755C741ae069F7150De07 is deterministic from deployer nonce) |
| Bob signer OBOL mint (10 OBOL) | 0x2b7fe02cc509341fa36ce819b5a17b69fe37e3027368a37a68b4f530a23bc3b0 |
| OBOL settlement (Bob signer → Alice, 0.001 OBOL = 1e15 wei) | 0xc1c81ed61ced7d8343bc0eb2d6f7114bc4e2375da9eccdf1d5eec91caa04728e |

Anvil hashes are not on the live chain by design (this flow is meant to exercise OBOL Permit2 without depending on a public bridged OBOL deployment).
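The amounts quoted for flow-13 (10 OBOL minted as 10*1e18, 0.001 OBOL settled as 1e15 wei) follow from an 18-decimal token. A minimal sketch of that conversion, assuming a hypothetical helper named toBaseUnits that is not part of the repository:

```go
package main

import (
	"fmt"
	"math/big"
	"strings"
)

// toBaseUnits converts a decimal token amount such as "0.001" into integer
// base units for a token with the given number of decimals.
// Hypothetical helper for illustration; not taken from the PR.
func toBaseUnits(amount string, decimals int) (*big.Int, error) {
	whole, frac, _ := strings.Cut(amount, ".")
	if len(frac) > decimals {
		return nil, fmt.Errorf("%q has more than %d fractional digits", amount, decimals)
	}
	// Right-pad the fractional part so the digits line up with the base unit.
	digits := whole + frac + strings.Repeat("0", decimals-len(frac))
	n, ok := new(big.Int).SetString(digits, 10)
	if !ok {
		return nil, fmt.Errorf("invalid amount %q", amount)
	}
	return n, nil
}

func main() {
	settle, _ := toBaseUnits("0.001", 18) // OBOL settlement amount
	mint, _ := toBaseUnits("10", 18)      // OBOL minted to each party
	fmt.Println(settle, mint)
}
```

The same helper with 6 decimals gives the USDC base-unit amount for flow-11's 0.001 USDC settlement.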


What changed

Code (Go)

  • internal/tunnel/tunnel.go: EnsureRunning refactored to WaitReady(cfg, *ui.UI) (string, error) with a 5-min budget (override via FLOW_TUNNEL_TIMEOUT). Polls deploy/cloudflared rollout AND obol tunnel status URL in the same loop; returns a clear error naming both subjects on timeout.
  • cmd/obol/sell.go — when obol sell http registers, calls tunnel.EnsureTunnelForSell BEFORE the kubectlApplyOutput so the controller's first reconcile sees a populated tunnelURL ConfigMap. Tunnel failure is fatal on the registered path; --no-register keeps best-effort behavior.
  • internal/hermes/hermes.go — injects GATEWAY_ALLOW_ALL_USERS=true on the hermes-dashboard container only, with an inline justification (local k3d clusters do not expose messaging integrations to the public internet; production must override via values overlay). Without this, the dashboard CrashLoopBackOff blocks pod readiness.
  • internal/openclaw/monetize_integration_test.go — emits FLOW12_SETTLEMENT_TX=<hash> markers so flow-12's shell harness can build a receipt-summary.json analogous to flow-11.

Flows

  • flows/flow-13-dual-stack-obol.sh (new) — 1262 lines. Two obol stacks, one shared Anvil fork via host.k3d.internal, x402-rs facilitator with eip2612_gas_sponsoring=true, ForkObolToken deployed by forge create, real Cloudflare tunnel for cross-cluster discovery. Trap-based cleanup tears down anvil + facilitator + port-forwards on any exit.
  • flows/run-detached.sh (new): launcher that survives SSH disconnect; tries tmux, screen, then setsid -f, in that order. Prints the log path; tail it from a fresh SSH session.
  • flows/lib.sh — adds 4 helpers reused by flow-08/11/12/13:
    • explicit PATH export for non-login shells (so cast/kubectl/k3d resolve under nohup/cron)
    • detect_buyer_runtime <runner> — sets BOB_AGENT_{NS,DEPLOY,CONTAINER,SERVICE,REMOTE_PORT,LABEL,RUNTIME} based on which agent namespace exists (Hermes or OpenClaw)
    • promoted USDC-receipt helpers (find_usdc_transfer, archive_receipt, wait_usdc_transfer_receipt, receipt_status_ok, extract_tx_hash) — generic ERC-20, the USDC name is historical
    • ensure_image_in_k3d <image> <cluster>: docker save + ctr -n k8s.io images import fallback for hosts where the registry mirror stalls (we hit this on aarch64 with cloudflared:2026.1.2)
  • flows/flow-11-dual-stack.sh — 4 hardening fixes (anchored .env grep, runtime-aware buyer vars, drop natural-language assertions on agent responses, numeric-only Agent ID extraction) + step-28 changed to poll the API-server container's ready=true instead of pod-summary STATUS.
  • flows/flow-08-buy.sh — captures BUY_START_BLOCK, emits PAID_AMOUNT_USDC from the signing Python, archives the on-chain settlement receipt via wait_usdc_transfer_receipt. Balance delta kept as defense in depth.
  • flows/flow-12-obol-payment.sh — pipes go test output through tee, parses FLOW12_* markers, writes receipt-summary.json mirroring flow-11.
  • flows/README.md — flow inventory + "Running a flow detached over SSH" section pointing at run-detached.sh.

Skill

  • .agents/skills/obol-stack-dev/SKILL.md — adds a 120-line "Running Flows on Remote Hosts" section distilling everything that bit us this session: tmux launcher, distroless probe pattern, multi-container readiness check, anvil --host 0.0.0.0, eRPC route pinning, x402-rs scheme config (no permit2 scheme), cloudflared lazy deploy, obol sell http --namespace overload, ERC-8004 registration prereqs, and the setMetadata revert recipe (see below).

Issues found and fixed

Each issue surfaced through real test runs on two aarch64 Linux test hosts (testbed-A and testbed-B). Both run docker + k3d 5.8 + foundry. Fixes are in this PR unless flagged otherwise.

| # | Issue | Fix |
|---|---|---|
| 1 | flow-11 line 526 .env grep matched comment lines containing REMOTE_SIGNER_PRIVATE_KEY → multi-line value passed to cast wallet address → parse error | anchored grep ^[[:space:]]*REMOTE_SIGNER_PRIVATE_KEY= + cut -d= -f2- |
| 2 | nohup/setsid bash didn't inherit ~/.foundry/bin and ~/.local/bin; cast/kubectl not found | flows/lib.sh exports canonical PATH at source time |
| 3 | flow-11 hardcoded openclaw-obol-agent namespace + app.kubernetes.io/name=openclaw label; broke on Hermes runtime (#381) | detect_buyer_runtime helper + step 28 polls the specific container's ready=true |
| 4 | natural-language regex on agent buy response (purchase complete\|...) didn't match Hermes wording → false FAIL | dropped NL assertion; rely on PurchaseRequest CR Ready=True poll as structural proof |
| 5 | flow-11 step 21 awk '{print $3}' captured "(not" from "Agent ID: (not yet registered)" → Python int() crash | numeric-only awk with ^[0-9]+$ validation |
| 6 | cloudflared 60s rollout timeout in obol sell http was silently tolerated → ServiceOffer registers with empty tunnel URL → controller stuck in AwaitingExternalRegistration forever | tunnel.WaitReady 5-min budget; called BEFORE kubectlApplyOutput on the registered path; failure is fatal |
| 7 | flow-08 didn't archive on-chain settlement receipts (only balance delta) | start-block snapshot + wait_usdc_transfer_receipt archival |
| 8 | flow-12 (PR #377 OBOL Permit2) had no shell-level receipt summary | Go test emits FLOW12_* markers; shell parses + writes receipt-summary.json |
| 9 | Hermes hermes-dashboard container CrashLoopBackOff in dev clusters: "No user allowlists configured" → exit 1 → pod Ready=False → port-forward fails | inject GATEWAY_ALLOW_ALL_USERS=true on the dashboard container only, with inline justification |
| 10 | aarch64-only: cloudflared:2026.1.2 image pull through k3d registry mirror stalls indefinitely (manifest HEAD returns 200 but no blob GETs follow) | ensure_image_in_k3d helper in flows/lib.sh: docker save + ctr import as a fallback |
| 11 | nohup/setsid -f over Cloudflare-tunneled SSH was observed to die mid-flow at the first long-running CLI call | flows/run-detached.sh launcher: tmux → screen → setsid; documented as the canonical entrypoint |
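Issue 1 comes down to anchoring the match on assignment lines. The flow itself uses grep and cut; the sketch below only restates the same idea in Go for illustration, with a hypothetical function name signerKey.

```go
package main

import (
	"fmt"
	"regexp"
)

// assignment matches only lines that assign the key, optionally indented.
// A comment line that merely mentions the key name does not match.
var assignment = regexp.MustCompile(`(?m)^[ \t]*REMOTE_SIGNER_PRIVATE_KEY=(.*)$`)

// signerKey returns everything after the first '=' on the first assignment
// line, mirroring `cut -d= -f2-`. Hypothetical helper for illustration.
func signerKey(env string) (string, bool) {
	m := assignment.FindStringSubmatch(env)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	env := "# set REMOTE_SIGNER_PRIVATE_KEY before running\n" +
		"REMOTE_SIGNER_PRIVATE_KEY=0xabc123\n"
	key, ok := signerKey(env)
	fmt.Println(key, ok)
}
```

With an unanchored match, both lines of the sample would be returned and the multi-line value would reach the next command, which is the failure described in the table.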

Two additional brittlenesses we noticed but did not block on:

  • flow-13 step 28 timing race on Hermes init (~120-180s on aarch64). The patched poll-budget at step 28 (180s now) covers it. Could be tightened by waiting on kubectl rollout status for the Hermes deployment instead of polling.
  • cast send text-output format drifted between foundry releases so regex-based tx-hash extraction was unreliable for the OBOL mint step. Switched to cast send --json + jq-style Python parser.
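The second point trades a text regex for structured output. A minimal sketch of that kind of parse, written in Go rather than the flow's Python, and assuming the JSON receipt printed by cast send --json carries a transactionHash field; parseReceipt is a hypothetical name:

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// txHash validates the shape of a 32-byte transaction hash.
var txHash = regexp.MustCompile(`^0x[0-9a-fA-F]{64}$`)

// receipt holds the only field this sketch reads. The field name is an
// assumption about the JSON receipt layout.
type receipt struct {
	TransactionHash string `json:"transactionHash"`
}

// parseReceipt extracts and validates the transaction hash from JSON output.
func parseReceipt(out []byte) (string, error) {
	var r receipt
	if err := json.Unmarshal(out, &r); err != nil {
		return "", fmt.Errorf("not a JSON receipt: %w", err)
	}
	if !txHash.MatchString(r.TransactionHash) {
		return "", fmt.Errorf("unexpected transactionHash %q", r.TransactionHash)
	}
	return r.TransactionHash, nil
}

func main() {
	// Sample input built from the OBOL mint hash quoted in the receipts section.
	out := []byte(`{"transactionHash":"0x2b7fe02cc509341fa36ce819b5a17b69fe37e3027368a37a68b4f530a23bc3b0"}`)
	hash, err := parseReceipt(out)
	fmt.Println(hash, err)
}
```

Failing loudly on anything that is not a well-formed hash keeps a format drift from turning into an empty or partial value further down the flow.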

setMetadata revert investigation (live Base Sepolia)

A colleague hit ! failed to set x402 metadata: erc8004: setMetadata tx: execution reverted on agent ID 5196, wallet 0x2FbFe6cF…. Their on-chain analysis:

  • eth_getTransactionCount for the wallet went 0→1 after the run → only the register tx broadcast.
  • Remote-signer logs are clean → never asked to sign setMetadata.
  • Therefore the revert happened in pre-broadcast simulation (eth_estimateGas/eth_call), not on-chain.

We reproduced the revert and decoded it:

$ cast call 0x8004A818BFB912233c491871b3d84c89A494BD9e \
    "setMetadata(uint256,string,bytes)" 999999999 "x402" 0x01 \
    --from 0x2FbFe6cF08Ac224f97915ecF07eE29Be0b213f51 \
    --rpc-url https://sepolia.base.org

Error: server returned an error response: error code 3:
  execution reverted, data: "0x7e273289000000000000000000000000000000000000000000000000000000003b9ac9ff"

$ cast 4byte 0x7e273289
ERC721NonexistentToken(uint256)

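The reproduction passes a token ID that does not exist (999999999) and gets that ID back in the revert data (0x…3b9ac9ff), so the colleague's generic "execution reverted" is consistent with ERC721NonexistentToken: their simulation hit state where token 5196 didn't yet exist, even though it WAS minted on-chain (verified: ownerOf(5196) → 0x2FbFe6cF… on live Base Sepolia today).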

Most plausible cause: a stale eRPC base-sepolia route pinned to a parallel/dead Anvil fork from an earlier flow-12/13 run that didn't unwind. The CLI's obol sell register path:

  1. broadcasts register through the chain's WRITE upstream (lands on live Base Sepolia)
  2. simulates setMetadata via the chain's READ upstream

If READ is pinned to a fork (live or dead), the fork has its own ERC-721 storage where 5196 was never minted → ERC721NonexistentToken revert. The answer to the colleague's question, "could a concurrent Anvil fork cause this?", is yes, transitively, via a leftover obol network add base-sepolia --endpoint http://...:anvil-port --allow-writes. (A killed Anvil would surface as RPC-connect error, not a revert.)

Proposed fix (separate PR, not in this one):

  1. After Register, the CLI should bind.WaitMined the receipt and re-pin the next call's block to latest so the simulation gets a fresh fetch.
  2. obol network status should warn when a custom upstream is unreachable (would have surfaced the leftover fork pin).
  3. For the test environments specifically, document obol network remove base-sepolia as a teardown step in flow-12/13 cleanup traps. (flow-13's trap already kills its anvil + facilitator; the eRPC-side route pin lingers in the cluster.)

This recipe is now documented in obol-stack-dev/SKILL.md so future contributors can short-circuit diagnosis with a single cast call --from <signer> reproduction.


How to reproduce

On any aarch64 Linux host with docker, foundry (cast/anvil/forge), kubectl, helm, helmfile, k3d, ollama, and a .env with REMOTE_SIGNER_PRIVATE_KEY (funded on Base Sepolia: ~0.05 ETH for gas, ~5 USDC for buyer):

git fetch origin
git checkout integration/pr377-pr381

# Pull the qwen3.5:9b model that flow-11 / flow-13 expect
ollama pull qwen3.5:9b

# flow-11 (live Base Sepolia, USDC) — ~12-15 min
bash flows/run-detached.sh flow-11-dual-stack.sh
# → tail the printed log path

# flow-13 (Anvil fork, OBOL Permit2) — ~12-18 min
# Requires an x402-rs build with eip2612_gas_sponsoring (ObolNetwork fork)
export X402_FACILITATOR_BIN=/path/to/x402-rs/target/release/x402-facilitator
bash flows/run-detached.sh flow-13-dual-stack-obol.sh

Receipt artifacts land under .tmp/flow-{11,13}-<timestamp>/*.json with a receipt-summary.json index.


Out of scope / follow-ups

  • setMetadata bind.Transact block-pinning — separate PR with the WaitMined + re-resolve fix described above.
  • cast send text-format brittleness in flows/lib.sh::extract_tx_hash — flow-13 now uses --json; flow-11 still uses the regex path on cast send output and could be migrated for symmetry.
  • CI run on this branch — none yet; both flows have only been validated manually on two aarch64 testbeds.
  • flow-13 registration variant — flow-13 deliberately disables ERC-8004 (the focus is OBOL Permit2). A follow-up could add flow-13-registered-obol.sh that includes registration once obol sell http exposes the OBOL Permit2 asset metadata flags.

Test plan

  • go build ./... clean
  • go test ./cmd/obol/... ./internal/tunnel/... ./internal/serviceoffercontroller/... ./internal/stack/... ./internal/hermes/... clean
  • bash -n on all flow scripts
  • flow-11 against live Base Sepolia: 48/45 PASS, 0 FAIL, all 4 receipts archived
  • flow-13 against Anvil fork (real cloudflared tunnel): 56/53 PASS, 0 FAIL, funding + settlement receipts archived
  • both spark testbeds torn down (no leftover k3d clusters or anvil/x402-rs processes)
  • reviewer to confirm against their own host (with their own funded .env)

bussyjd and others added 30 commits April 24, 2026 21:28
…ermit2)

- .dockerignore: keep broader .workspace* glob from #381
- internal/tunnel/tunnel_test.go: keep both new tests (parseQuickTunnelURL_PicksLatest from #377, BuildLocalManagedConfigYAMLRoutesOnlyRequestedHostname from #381)
- internal/stack/stack_test.go: take #381 port-handling test additions
…board env

Eleven brittleness issues were observed during real Base Sepolia test runs of
flow-11 against PR #377 alone, PR #381 alone, and the integration branch on
spark1 + spark2. This commit batches the fixes.

flow-11-dual-stack.sh
- env grep anchored to assignment lines so a comment containing
  REMOTE_SIGNER_PRIVATE_KEY no longer leaks into cast wallet address
- buyer-runtime detection (openclaw vs hermes) via detect_buyer_runtime,
  called after Bob's stack up; pod-readiness, exec, port-forward, and
  token retrieval all use BOB_AGENT_NS/DEPLOY/CONTAINER/SERVICE/RUNTIME/LABEL
- buy-success no longer relies on natural-language regex; the structural
  proof is the next PurchaseRequest CR Ready=True poll
- Agent ID extraction is numeric-only with explicit validation, so a
  pending registration ("Agent ID: (not yet registered)") fails cleanly
  instead of crashing the script via Python int()
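The numeric-only extraction can be illustrated as below. The flow does this in awk; this Go version is only a sketch of the same validation, with a hypothetical function name agentID and the "Agent ID:" label taken from the message quoted above.

```go
package main

import (
	"fmt"
	"regexp"
)

// agentIDLine accepts only "Agent ID: <digits>" lines, so placeholder text
// after the label never reaches a numeric parser.
var agentIDLine = regexp.MustCompile(`(?m)^\s*Agent ID:\s*([0-9]+)\s*$`)

// agentID returns the numeric Agent ID or an error when none is present.
// Hypothetical helper for illustration.
func agentID(output string) (string, error) {
	m := agentIDLine.FindStringSubmatch(output)
	if m == nil {
		return "", fmt.Errorf("no numeric Agent ID in output (registration may still be pending)")
	}
	return m[1], nil
}

func main() {
	for _, out := range []string{
		"Agent ID: 5250\n",
		"Agent ID: (not yet registered)\n",
	} {
		id, err := agentID(out)
		fmt.Printf("%q %v\n", id, err)
	}
}
```

The pending case now produces an explicit error at the extraction step instead of handing "(not" to a later integer conversion.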

flows/lib.sh
- explicit PATH export for ~/.foundry/bin and ~/.local/bin so nohup /
  setsid / cron launches resolve cast / kubectl / k3d
- detect_buyer_runtime helper (default Hermes, OpenClaw if present)
- promoted USDC-receipt helpers (write_receipt, receipt_status_ok,
  archive_receipt, extract_tx_hash, find_usdc_transfer,
  wait_usdc_transfer_receipt) so flow-08 and flow-12 can reuse them
- ensure_image_in_k3d helper for hosts where the k3d registry mirror
  stalls (aarch64 spark workaround for cloudflared:2026.1.2)

flow-08-buy.sh
- captures BUY_START_BLOCK, emits PAID_AMOUNT_USDC from the signing
  Python, then archives the on-chain settlement receipt via
  wait_usdc_transfer_receipt; balance delta check kept as defense in depth

flow-12-obol-payment.sh + monetize_integration_test.go
- Go test emits FLOW12_SETTLEMENT_TX marker; shell pipes test output
  through tee and writes receipt-summary.json with the same JSON shape
  as flow-11 (registration / funding markers omitted because the OBOL
  Permit2 path doesn't produce them on Anvil)

cmd/obol/sell.go + internal/tunnel/tunnel.go
- WaitReady(cfg, ui) refactored from EnsureRunning, default 5min budget
  (override via FLOW_TUNNEL_TIMEOUT). EnsureTunnelForSell is called
  before kubectlApplyOutput on the registration path so the controller's
  first reconcile sees a populated tunnelURL ConfigMap, fixing the
  AwaitingExternalRegistration race observed on spark1
- on registration path, tunnel failure is fatal with a hint to use
  --no-register; --no-register path keeps best-effort tunnel

internal/hermes/hermes.go
- GATEWAY_ALLOW_ALL_USERS=true on the hermes-dashboard container only,
  with an inline comment explaining that local k3d/dev clusters do not
  expose dashboard messaging integrations to the public internet, and
  that production must override via a values overlay. Unblocks the
  dashboard from CrashLoopBackOff so the pod reaches Ready=True and
  port-forward to the API server works

flows/run-detached.sh + flows/README.md
- new launcher script that survives SSH disconnect (tmux -> screen ->
  setsid -f); README documents the flow inventory and the new launcher

In multi-container pods like Hermes (API server + dashboard) the upstream
hermes-dashboard container can stay in CrashLoopBackOff for unrelated
reasons (missing fastapi/uvicorn in the image's web-UI optional deps),
which makes the pod-summary STATUS column read "CrashLoopBackOff" even
when the API-server container we actually need is happily Running.

Switch step 28 from `grep "Running"` on the STATUS column to a jsonpath
query for the specific container's `ready=true`, and bump the budget
from 24x5s = 120s to 36x5s = 180s to absorb slow init on aarch64 hosts.
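The per-container check reads status.containerStatuses from the Pod object instead of the summary STATUS column. A sketch of that lookup over the JSON form of a pod, where containerReady and the container name "api-server" are hypothetical:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pod mirrors only the fields of a Kubernetes Pod object this check reads.
type pod struct {
	Status struct {
		ContainerStatuses []struct {
			Name  string `json:"name"`
			Ready bool   `json:"ready"`
		} `json:"containerStatuses"`
	} `json:"status"`
}

// containerReady reports whether the named container is ready, regardless of
// the state of its siblings in the same pod.
func containerReady(podJSON []byte, name string) (bool, error) {
	var p pod
	if err := json.Unmarshal(podJSON, &p); err != nil {
		return false, err
	}
	for _, c := range p.Status.ContainerStatuses {
		if c.Name == name {
			return c.Ready, nil
		}
	}
	return false, fmt.Errorf("container %q not found in pod status", name)
}

func main() {
	// "api-server" is a hypothetical container name for illustration.
	doc := []byte(`{"status":{"containerStatuses":[
		{"name":"api-server","ready":true},
		{"name":"hermes-dashboard","ready":false}]}}`)
	ok, err := containerReady(doc, "api-server")
	fmt.Println(ok, err)
}
```

A crash-looping sibling no longer affects the result, which is why the flow can proceed while the dashboard container is still unhealthy.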

Result: integration flow-11 now goes 45/45 with 0 sub-step FAILs.

The discovery step issued a chat-completion to Bob's agent and required
the assistant content to be >100 chars. Hermes occasionally responds
with a short interim "let me check..." message (~93 chars) before
proceeding to the next tool call, causing a false FAIL even though the
agent went on to discover Alice and complete the buy in step 35-36.

Same fix as step 35: drop the natural-language assertion. The structural
proof of discovery is the next step's `buy.py` invocation succeeding and
the PurchaseRequest CR going Ready=True (step 36).
… fork

Mirrors flow-11's two-stack structure but the payment asset is a
fork-local OBOL ERC20Permit token instead of USDC, both Alice's and
Bob's obol stacks share ONE local Anvil fork (via the host.k3d.internal
alias), and the facilitator is a local x402-rs build with
eip2612GasSponsoring (not the public Obol facilitator).

- Anvil port + facilitator port allocated via pick_free_port
- ForkObolToken deployed on the fork via `forge create` against
  contracts/fork-obol/src/ForkObolToken.sol; mints 10 OBOL to Alice
  and 10 OBOL to Bob's signer
- Single trap-based cleanup tears down anvil, facilitator, and any
  port-forwards on any exit
- Skip-if-missing: emits one PASS and exits 0 when neither
  X402_FACILITATOR_BIN nor X402_RS_DIR resolve to a usable build
- Reuses the receipt helpers from lib.sh by setting
  USDC_ADDRESS_BASE_SEPOLIA=$OBOL_TOKEN at call sites; the helpers
  are generic ERC-20 despite the USDC-flavored name
- Bob's `obol network add base-sepolia` points at the same Anvil URL
  Alice uses, with eRPC pinned to the single custom upstream so both
  clusters see the same on-chain state for OBOL balance/Transfer logs

x402-rs has no standalone "v2-eip155-permit2" scheme. The OBOL Permit2 /
EIP-2612 gas sponsoring path is enabled via
config.eip2612_gas_sponsoring=true on the v2-eip155-exact scheme — same
as testutil.StartRealFacilitatorWithOptions does. The previous flow-13
config requested a phantom permit2 scheme, which the facilitator
silently ignored, leaving /supported with v1+v2 exact only and failing
the assertion that looked for a literal "permit2" scheme name.

- drop the bogus v2-eip155-permit2 scheme entry
- attach config.eip2612_gas_sponsoring=true to the v2-eip155-exact entry
- assert /supported lists v1+v2 exact for base-sepolia (the buyer-side
  produces the Permit2 payload; the facilitator's role on this path is
  to verify+settle the sponsored authorization)
@OisinKyne
Contributor

that includes registration once obol sell http exposes the OBOL Permit2 asset metadata flags.
worth keeping in mind particularly across network

If READ is pinned to a fork (live or dead), the fork has its own ERC-721 storage where 5196 was never minted → ERC721NonexistentToken revert. The colleague's question — "could a concurrent Anvil fork cause this?" — is yes,
setMetadata bind.Transact block-pinning — separate PR with the WaitMined + re-resolve fix described above.
wrong diagnosis, probably right fix? i can imagine our erpc returning stale reverts unless we wait a block/request more exactly a fresh one

going to skim code for anything objectionable (the pr description makes me feel the flow scripts take some shortcuts to make things happen, e.g. i dont know why it needs the private key outside in an env, maybe i'll know when i look). if it looks adequate, we can merge and then you tell me what else is left of open prs that need to go in


`obol model setup custom`, the LiteLLM `model_name` convention, and the
agent-side stripProviderPrefix helpers were tangled in a way that quietly
broke flow-14 with a 400 "no healthy deployments for this model" on every
chat-completion against a custom vLLM endpoint:

  1. AddCustomEndpoint wrote `model_name: custom/<name>/<model>`.
  2. hermes.configuredModels saw it, called rankModels which pre-stripped
     to `<name>/<model>` before delegating to model.Rank.
  3. model.Rank also strips internally for ranking heuristics — but
     returns the original string. With the pre-strip from (2) the
     "original" was already mutilated.
  4. configuredModels then ran stripProviderPrefix on the primary AGAIN
     before returning, leaving the agent calling LiteLLM with bare
     `<model>` while only `custom/<name>/<model>` was registered.

The band-aid in ca820c9 dropped the `custom/<name>/` prefix on writes,
which unblocked the flow but left the underlying double-strip surface
intact. This change picks the contract explicitly:

  LiteLLM `model_name` is the bare model identifier — the agent reads
  it straight back as the `model` field on chat-completion calls and
  must round-trip unchanged. Same convention every other code path
  already uses (Ollama, Anthropic, OpenAI explicit entries).
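The double-strip failure and the round-trip contract can be shown side by side. In this sketch stripFirstSegment is a hypothetical stand-in for the removed stripProviderPrefix helper (assumed to drop everything up to the first "/"), and "myvllm" is a made-up endpoint name:

```go
package main

import (
	"fmt"
	"strings"
)

// stripFirstSegment is a hypothetical stand-in for the removed
// stripProviderPrefix helper: it drops everything up to the first "/".
func stripFirstSegment(id string) string {
	if _, rest, ok := strings.Cut(id, "/"); ok {
		return rest
	}
	return id
}

// registered reports whether the id the agent sends matches a LiteLLM
// model_name byte for byte.
func registered(modelNames []string, id string) bool {
	for _, n := range modelNames {
		if n == id {
			return true
		}
	}
	return false
}

func main() {
	// Old write shape: the prefix is stripped twice on the way to the agent,
	// so the lookup against the registered name misses.
	legacy := []string{"custom/myvllm/qwen36-fast"}
	sent := stripFirstSegment(stripFirstSegment(legacy[0]))
	fmt.Println(sent, registered(legacy, sent))

	// Contract: model_name is the bare id and is sent back unchanged.
	bare := []string{"qwen36-fast"}
	fmt.Println(bare[0], registered(bare, bare[0]))
}
```

The first lookup fails and the second succeeds, which is the property the new tests pin: whatever LiteLLM publishes is what the agent sends back.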

Implementation:
  - internal/model/model.go: extract buildCustomEndpointEntry, document
    the contract on AddCustomEndpoint, drop the leftover `_ = name`
    bookkeeping.
  - internal/model/rank.go: keep the unexported stripProviderPrefix for
    ranking heuristics, add a doc comment explicitly forbidding its use
    on round-trippable identifiers.
  - internal/hermes/hermes.go: delete stripProviderPrefix /
    stripProviderPrefixes; rankModels now passes through to model.Rank
    without pre-stripping; configuredModels returns the LiteLLM model
    list unchanged. The agent's `model.default` is now byte-identical
    to the LiteLLM ConfigMap entry.
  - cmd/obol/model.go: clarify --name flag help to "informational only"
    — it still surfaces in `obol model status` but does not participate
    in the route key.

Tests:
  - internal/model/rank_test.go: TestRank_PreservesProviderPrefixOnOutput
    pins the round-trip property at the Rank() boundary, including the
    legacy `custom/<name>/<model>` shape.
  - internal/model/model_test.go: TestBuildCustomEndpointEntry covers
    the bare-model_name + openai/-routing shape, the empty-key fallback,
    and that colon-tagged ids survive intact.
  - internal/hermes/rankmodels_test.go: rewritten to assert the contract
    (was asserting the now-removed strip). Adds the
    `custom/<name>/<model>` regression guard.
  - internal/hermes/hermes_test.go: TestGenerateConfig_PrimaryIsRoundTrippable
    covers the end-to-end shape — whatever LiteLLM publishes is what the
    agent sends back.

Refs ca820c9 (band-aid).

Co-authored-by: bussyjd <bussyjd@users.noreply.github.com>
Contributor

@OisinKyne OisinKyne left a comment

Not an immediate yes, expecting my 6 prs merged then this one merged to main, pre-emptively giving the ✅ so you don't get blocked

@bussyjd
Collaborator Author

bussyjd commented Apr 29, 2026

Flow-validation update — flow-14 GREEN on PR #397 stack tip

I ran flow-14 against 3954852 (5 commits on top of this branch via PR #397) on spark2 + spark1 vLLM qwen36-fast. End-to-end OBOL Permit2 settlement confirmed on live Base Sepolia.

60/0 PASS/FAIL across 55 steps. Settlement tx: 0x7baead9ad4296b1ab5e0bda7a7b726b4203417074e4d91051d91942453d14b44 — Bob signer → Alice, exactly 1e15 wei OBOL. Full receipt JSON + transaction list in #397's recap comment.

This run surfaced (and PR #397 fixes) two latent bugs that pre-date Oisin's surgery on this branch but were exposed by the Group A --private-key-file removal:

  1. archiveReplacedHermesKeystore lacked the ensureVolumeWritable / fixRuntimeVolumeOwnership bookend → host-process EACCES traversing the container-owned keystores PVC.
  2. ApplyCluster flag was plumbed from cmd/obol/wallet.go:47 but never read inside ImportPrivateKeyWalletCmd → no helmfile-sync, no rollout-restart → pod kept the chart bootstrap key.

Plus a chart drift fix: Hermes was pinned to remote-signer 0.3.0 (image v0.1.0, legacy u64 contract) while OpenClaw was on 0.3.1 (image v0.2.0, canonical-string contract from #359). This was breaking every on-chain register call on this branch — obol sell register returned HTTP 422 chain_id: invalid type: string "84532", expected u64. Bumped to 0.3.1 and added TestRemoteSignerChartVersionConsistency to keep the two pins aligned via CI.

Per project rule (no main merge without external reviewer + flows green), the receipts are now in place; this PR + #397 are ready for human review whenever you're ready, Oisin.

@bussyjd
Collaborator Author

bussyjd commented Apr 29, 2026

Full investigation + fixes report — 2026-04-29

This comment dumps the complete trail of what was done on this branch since Oisin's review surgery merged, the bugs flow-14 surfaced, the fixes that landed in PR #397, and the on-chain receipts that confirm the end-to-end OBOL Permit2 path is now green on Base Sepolia.

TL;DR

  • Oisin's 6-commit review surgery (oisin/377-1 through oisin/377-6) was fast-forwarded onto this branch, tip moved 1804d43 → 3ee8073. Build + tests green at every checkpoint.
  • Running flow-14 against the surgery tip surfaced 3 latent bugs in the Hermes wallet path that Oisin's Group A simplification (removing --private-key-file) exposed for the first time.
  • Plus 1 cross-cutting blocker unrelated to the surgery: a chart pin drift between Hermes (0.3.0, image v0.1.0) and OpenClaw (0.3.1, image v0.2.0), which broke every on-chain obol sell register call after PR #359 (fix(erc8004): align with canonical remote-signer string contract) switched obol-stack to the canonical-string signer contract.
  • All 4 are fixed in PR #397 (fix(hermes): bookend wallet-import archival with k3d ownership flip; 5 commits, on top of this branch). flow-14 is now 60/0 PASS/FAIL, settlement landed on Base Sepolia.

Architecture under test (flow-14)

                   ┌─────────────────────────────────────────────┐
                   │   Base Sepolia (live)                       │
                   │     ERC-8004 IdentityRegistry: 0x8004A818…  │
                   │     ForkObolToken:             0x54AE82bc…  │
                   │     Facilitator: x402.gcp.obol.tech         │
                   └────────────▲─────────────────────▲──────────┘
                                │                     │
                  ┌──── register/setMetadata ────┐    │ OBOL Permit2 settle
                  │                              │    │
                  │              ┌── Cloudflare ─┘    │
                  │              │  trycloudflare.com │
                  │              ▼                    │
                  │     ┌─────────────────────┐      │
                  │     │  cloudflared (k3d)  │      │
                  │     └────────┬────────────┘      │
       ┌──────────┴──────────┐   │   ┌────────────────┴────────────┐
       │  Alice k3d cluster  │   │   │  Bob k3d cluster            │
       │  Hermes seller      │◄──┘   │  Hermes buyer               │
       │  remote-signer      │       │  remote-signer              │
       │  ServiceOffer       │       │  buy.py + x402-buyer        │
       │  LiteLLM            │       │  paid/qwen36-fast           │
       └─────────────────────┘       └─────────────────────────────┘
                          │
              spark2 (testbed-B, aarch64 Linux)
              GPU LLM endpoint: spark1 vLLM qwen36-fast (192.168.18.23:8000)

Oisin's surgery — what landed

The 6 commits fast-forwarded cleanly (1804d43 → 3ee8073). Net diff: +305 / −1044 across 16 files. Summary, newest first:

| # | SHA | Subject |
|---|-----|---------|
| 6 | 3ee8073 | Delete docs/plans/obol-x402-path-comparison.md (CLAUDE.md violation) |
| 5 | 44e6950 | Fix qwen3 / 0.6b user-facing recommendations (10 files) |
| 4 | af07f9e | feat(stack): reclaim leaked dev k3d networks on obol stack purge |
| 3 | 249b223 | Strip Helm-ownership migration code (unreachable historical hack) |
| 2 | 1f6cda2 | Remove internal/network/probe.go (eth_chainId is a poor liveness probe) |
| 1 | c10c068 | Remove --private-key-file from obol sell and the controller signing-key path; flows now use obol wallet import |

Group A (c10c068) is the load-bearing one for the bugs below — it routed Alice's seller wallet through obol wallet import against a live cluster for the first time. Bob's flow already used this path but ran before his stack came up, so the latent bugs never fired.

The 4 bugs flow-14 surfaced (each fixed in PR #397)

1. archiveReplacedHermesKeystore host-side EACCES (run #1, step 22)

✗ failed to archive replaced keystore: stat …/<uuid>.json: permission denied

provisionKeystoreToVolume's last step, fixRuntimeVolumeOwnership, chowns the keystores dir to container uid 10000 with mode 700. The archive helper then called os.Stat(<dir>/<oldUUID>.json) from the host (uid 1000), which cannot even traverse the dir. Fix in 6c5106a: bookend the archive with ensureVolumeWritable + fixRuntimeVolumeOwnership, mirroring provisionKeystoreToVolume's pattern.

2. ApplyCluster flag plumbed but never read (run #3, step 23)

Wallet: 0xb6aF1F8FDB5948677AbD1365aE7544875387bB15
! direct registration failed: erc8004: register tx: gas required exceeds allowance (0)

cmd/obol/wallet.go:47 passes ApplyCluster: walletClusterAvailable(cfg) to ImportPrivateKeyWalletCmd, but the function body never consumed the flag. Result: wallet metadata + keystore + values-remote-signer.yaml all written to disk, but no helmfile sync, no rollout restart. The running pod kept its chart-bootstrap throwaway key, which has 0 ETH on Base Sepolia → "gas required exceeds allowance (0)".

Fix in a214050: when ApplyCluster=true, call hermes.Sync(cfg, id, u) (helmfile sync). A Sync failure is non-fatal: the import emits a warning and a recovery hint instead of failing outright.

3. helm doesn't roll Deployments on Secret-data-only changes (run #4, step 23)

Wallet: 0x3379C44988d0E092f3B1896eedCd1c217FDE5d14
! direct registration failed: erc8004: register tx: gas required exceeds allowance (0)

hermes.Sync ran and the cluster Secret remote-signer-keystore-password did get the new password — but the Deployment template still references the same Secret by name, so helm patches the Secret in-place and does not roll the pod. The pod kept the old Secret data in env, decrypted only the chart's bootstrap keystore, signed with another throwaway address.

Fix in b17995a: after a successful Sync, run kubectl rollout restart deployment/remote-signer -n hermes-{id} and wait up to 120s for ready. Same pattern OpenClaw's restartRemoteSigner already used.

The same commit also fixed a silent-death pattern in flow-14 at step 23: under set -euo pipefail from lib.sh, register_out=$(timeout 300 obol sell register …) exiting non-zero killed the script before the if [ $register_rc -ne 0 ] check could fire fail + emit_metrics. The fix wraps the call in set +e / set -e, mirroring step 22's import block.

4. chain_id u64/string client/server drift (run #5, step 23) — chart pin drift

! direct registration failed: erc8004: register tx: remote sign: HTTP 422:
  chain_id: invalid type: string "84532", expected u64

The right wallet was finally loaded — but remote-signer rejected the JSON body. Root cause was a client/server contract drift across two PRs:

| PR | Outcome |
|----|---------|
| #357 | CLOSED: proposed sending all SignTxRequest numeric fields as integers |
| #359 | MERGED (commit b9495b8 / 499e0e4): switched obol-stack to canonical-string contract |
| #374 (chart 0.3.1) | MERGED: bumps OpenClaw chart pin to 0.3.1, ships image v0.2.0 (accepts strings) |

But #374 was scoped to OpenClaw only:

internal/openclaw/openclaw.go:54   remoteSignerChartVersion = "0.3.1"
internal/hermes/hermes.go:34       remoteSignerChartVersion = "0.3.0"   ← MISSED

Chart 0.3.0 ships image v0.1.0 (legacy u64 contract). Hermes-routed obol sell register always hit the 422.
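The failure mode is easy to reproduce in miniature. The sketch below is a Go analogy only: the signer's own implementation is not shown in this thread, and `legacyReq` / `canonicalReq` are hypothetical types that model a server expecting a JSON number versus one expecting a JSON string for chain_id. The exact error text will differ from the HTTP 422 message above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// legacyReq models a server that expects chain_id as a JSON number (u64 contract).
type legacyReq struct {
	ChainID uint64 `json:"chain_id"`
}

// canonicalReq models a server that expects chain_id as a JSON string.
type canonicalReq struct {
	ChainID string `json:"chain_id"`
}

// decodeLegacy reports whether a legacy-contract server would accept the body.
func decodeLegacy(body []byte) error {
	var r legacyReq
	return json.Unmarshal(body, &r)
}

// decodeCanonical reports whether a canonical-string server would accept the body.
func decodeCanonical(body []byte) (string, error) {
	var r canonicalReq
	err := json.Unmarshal(body, &r)
	return r.ChainID, err
}

func main() {
	// What a canonical-string client sends after the contract switch.
	body := []byte(`{"chain_id":"84532"}`)

	fmt.Println("legacy server:   ", decodeLegacy(body))

	id, err := decodeCanonical(body)
	fmt.Println("canonical server:", id, err)
}
```

The client is correct in both cases; only the server image differs, which is why the bug tracked the chart pin rather than any change in the CLI.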

Fix in 3954852: bump Hermes pin to 0.3.1 + add TestRemoteSignerChartVersionConsistency (reads both source files at test time, asserts agreement, fails CI on future drift).
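A consistency check of this kind can be quite small. The following is a sketch under assumptions, not the test from 3954852: it takes the two source texts as strings rather than reading the files, and `extractPin` / `pinsAgree` are hypothetical helpers; only the constant name remoteSignerChartVersion comes from the snippets above.

```go
package main

import (
	"fmt"
	"regexp"
)

// pinRE matches a Go assignment such as: remoteSignerChartVersion = "0.3.1"
var pinRE = regexp.MustCompile(`remoteSignerChartVersion\s*=\s*"([^"]+)"`)

// extractPin pulls the pinned chart version out of Go source text.
func extractPin(src string) (string, error) {
	m := pinRE.FindStringSubmatch(src)
	if m == nil {
		return "", fmt.Errorf("remoteSignerChartVersion not found")
	}
	return m[1], nil
}

// pinsAgree returns an error when the two sources pin different versions.
func pinsAgree(srcA, srcB string) error {
	a, err := extractPin(srcA)
	if err != nil {
		return fmt.Errorf("first source: %w", err)
	}
	b, err := extractPin(srcB)
	if err != nil {
		return fmt.Errorf("second source: %w", err)
	}
	if a != b {
		return fmt.Errorf("chart pin drift: %q vs %q", a, b)
	}
	return nil
}

func main() {
	openclaw := `const remoteSignerChartVersion = "0.3.1"`
	hermesOld := `const remoteSignerChartVersion = "0.3.0"`
	hermesNew := `const remoteSignerChartVersion = "0.3.1"`

	fmt.Println("before fix:", pinsAgree(openclaw, hermesOld))
	fmt.Println("after fix: ", pinsAgree(openclaw, hermesNew))
}
```

In a real test the two strings would come from os.ReadFile on the respective source files, and a non-nil result would fail the test.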

PR #397 final state — 5 commits, all green

3954852  fix(hermes): bump remote-signer chart 0.3.0 → 0.3.1 + consistency test
769086d  test(hermes): unit tests + flow-14 guard for wallet-import cluster wiring
b17995a  fix(hermes,flow-14): roll remote-signer after import + protect register from set -e
a214050  fix(hermes): honor ApplyCluster — helmfile-sync after wallet import
6c5106a  fix(hermes): bookend wallet-import archival with k3d ownership flip

Plus 7 new unit tests in internal/hermes/wallet_import_test.go (mock seams via var syncFn, restartHermesRemoteSignerFn, ensureVolumeWritableFn, fixRuntimeVolumeOwnershipFn) and 1 chart-consistency test in internal/hermes/chart_consistency_test.go. Plus a flow-level guard in flows/flow-14-live-obol-base-sepolia.sh that asserts the remote-signer pod is <120s old after wallet import — fast-fail diagnostic for any future drop of the rollout-restart.

Final flow-14 run — 60/0 PASS/FAIL

| Field | Value |
|-------|-------|
| Commit | 3954852 (PR #397 tip = 5 commits on top of 3ee8073) |
| Chart | remote-signer 0.3.1 / image ghcr.io/obolnetwork/remote-signer:v0.2.0 |
| Agent ID | 5274 |
| Tunnel | https://land-movement-refrigerator-databases.trycloudflare.com |
| Inference endpoint | spark1 vLLM qwen36-fast (http://192.168.18.23:8000/v1) |
| Total steps | 55 |
| PASS lines | 60 (some steps assert multiple invariants) |
| FAIL lines | 0 |

On-chain receipts (Base Sepolia, chain 84532)

| Tx | Purpose | Basescan |
|----|---------|----------|
| 0xad68b982…f3760 | ERC-8004 register (agentId 5274) | link |
| 0x481fb33a…0506a | SetMetadata | link |
| 0xc2faae06…3a63 | Funding (OBOL → Bob signer) | link |
| 0x7baead9a…d14b44 | Settlement Bob signer → Alice, exactly 1e15 wei (= 0.001 OBOL) | link |

Balance deltas asserted exact on both sides:

  • Alice 0x58aA1bB7… +1000000000000000 wei
  • Bob signer 0x2627b9D7… −1000000000000000 wei

Plus inference correctness on the paid path (step 48): the paid/qwen36-fast reply was a 26-char coherent answer, not the parrot regex from the colleague's earlier flow-13 screenshot.

What's NOT done — stays on the integration branch

Two scoped follow-ups that came up in review but are out of scope for #386 + #397:

  1. PR #379 (Add provider smokes and model preference; obol model prefer) needs a rebase against this branch and a reconciliation with model.Rank so prefer wins over rank. This is a real conflict and can't be resolved until #379 rebases.
  2. rc3 prep — OBOL on Ethereum mainnet (and possibly Base mainnet), facilitator gas monitoring (so we don't run dry), 6 open Dependabot vulnerabilities (4 high, 2 moderate) on default branch.

Plus three lower-priority enhancements the user and I discussed:

  • Pre-broadcast wallet-address assertion in obol sell register (proper layer fix for the stale-pod scenario; the flow guard catches it at the boundary but the CLI itself should refuse to broadcast against a stale signer).
  • Chart redesign: keystore password as PVC sidecar (<uuid>.password next to <uuid>.json, mode 0400). Removes the Secret coordination entirely; chart change, not Go change. Out of scope for current rc.
  • obol sell demo flow: Oisin still wants this; the state of PR #355 (First commit towards a demo sell command, worktree-sell-demo) needs review.

Hand-off

This branch + PR #397 are flows-validated. Per project rule (feedback_main_merge_gates.md), I do not press merge to main from my side — both PRs are queued for human review. When ready, suggested merge order:

1. Merge #397 → integration/pr377-pr381 (this branch)
2. CI re-run on #386
3. Human review approval on #386
4. Merge #386 → main
5. Tag v0.9.0-rc3 (after rc3 backlog items above are addressed)

…397)

* fix(hermes): bookend wallet-import archival with k3d ownership flip

archiveReplacedHermesKeystore stat/mkdir/rename the host-path PVC
directly, but provisionKeystoreToVolume's last step
(fixRuntimeVolumeOwnership) leaves the keystores dir as
mode 700 owned by the container's uid 10000. The host-side process
(uid 1000) then cannot traverse the dir, so os.Stat returns EACCES
and the wrapping caller surfaces "failed to archive replaced
keystore: stat …: permission denied".

Mirror the pattern provisionKeystoreToVolume already uses: call
ensureVolumeWritable up front (chowns to host uid via k3d node-exec),
defer fixRuntimeVolumeOwnership so all return paths restore container
ownership for the remote-signer pod.

The bug pre-dates the obol-wallet-import flow rewrite; flow-14 only
started exercising the path on Alice once the --private-key-file
escape hatch was removed.

* fix(hermes): honor ApplyCluster — helmfile-sync after wallet import

ImportPrivateKeyWalletOptions.ApplyCluster has been plumbed all the
way from cmd/obol/wallet.go since the OpenClaw → Hermes routing fix,
but ImportPrivateKeyWalletCmd never actually consumed it. Effect:
`obol wallet import` against a live cluster wrote the new keystore
to the host-path PVC and updated values-remote-signer.yaml on disk,
but the running remote-signer pod kept decrypting with the old
chart-bootstrap keystore-password Secret and signed with the chart's
throwaway address (e.g. 0xb6aF…). On a flow-14 register tx that
surfaced as "gas required exceeds allowance (0)" — chart key has
no funds.

Mirror OpenClaw's finalizeWalletProvision pattern: when the cluster
is reachable, run hermes.Sync to helmfile-sync the deployment.
helmfile reapplies the keystore-password Secret with the new value
and helm rolls the remote-signer deployment, so the pod restarts
against the freshly-imported keystore.

Failure to sync is best-effort — emits a warning and a recovery
hint instead of failing the import outright (cluster might come up
later).

* fix(hermes,flow-14): roll remote-signer after import + protect register from set -e

Two follow-ups to the helmfile-sync addition (a214050):

1. helm doesn't roll a Deployment when only a Secret's data changed —
   the Deployment template still references the same Secret name, so
   helm patches the Secret in-place and leaves the pod running with
   the stale env. After Sync, run an explicit `kubectl rollout restart
   deployment/remote-signer` and wait up to 120s for the new pod to be
   ready. Mirrors OpenClaw's restartRemoteSigner semantics.

2. flow-14 step 23 ran `register_out=$(timeout 300 obol sell register …)`
   under set -e from lib.sh. obol sell register correctly exits 1 on
   chain failure, but the assignment-with-command-substitution under
   errexit kills the script before the if-check can fire fail() and
   emit_metrics — the run looked like a silent death at "STEP: [23]"
   instead of a clean FAIL with metrics. Wrap in set +e/-e the same
   way step 22 (wallet import) already does.

Together with 6c5106a (archive bookend) and a214050 (Sync on
ApplyCluster), `obol wallet import` against a live Hermes cluster now
fully replaces the chart bootstrap key end-to-end without flow-level
workarounds.

* test(hermes): unit tests + flow-14 guard for wallet-import cluster wiring

Tests cover the regression classes surfaced in this PR:

- TestArchiveReplacedHermesKeystore_NilExisting / SameUUID — happy
  short-circuit paths must NOT call the k3d node-exec helpers.
- TestArchiveReplacedHermesKeystore_BookendOrder — guards 6c5106a:
  the (ensureVolumeWritable → fixRuntimeVolumeOwnership) bookend MUST
  run in order, and the deferred fix MUST fire on every return path
  including the os.Stat ENOENT early-return.
- TestArchiveReplacedHermesKeystore_RenamesToReplaced — happy-path
  archival writes the file under <dir>/replaced/<uuid>-<ts>.json and
  removes the original.
- TestImportPrivateKeyWalletCmd_ApplyClusterFalseSkipsCluster — guards
  the inverse of a214050: the pre-cluster bootstrap path must NOT
  helmfile-sync or rollout-restart.
- TestImportPrivateKeyWalletCmd_ApplyClusterTrueRollsPod — primary
  guard: ApplyCluster=true must invoke both Sync AND
  restartHermesRemoteSigner (helm doesn't roll on Secret-data changes,
  so the rollout-restart is non-optional).
- TestImportPrivateKeyWalletCmd_SyncFailureSkipsRestart — best-effort
  contract: Sync error → skip restart, do NOT fail the import as a
  whole; on-disk artifacts let a later `obol hermes sync` finish.

Tests use indirection seams (var syncFn, restartHermesRemoteSignerFn,
ensureVolumeWritableFn, fixRuntimeVolumeOwnershipFn) to spy/replace
without standing up a real k3d cluster.

Flow-level guard: a new step between 22 (wallet import) and 23
(register) asserts the remote-signer pod's startTime is within 120s
of now. If a regression drops the explicit kubectl rollout-restart,
the pod stays old → assertion fails fast with a clear "wallet import
did not roll the deployment" diagnostic, instead of falling through
to the 5-minute "gas required exceeds allowance (0)" symptom.

* fix(hermes): bump remote-signer chart 0.3.0 → 0.3.1 + consistency test

Chart 0.3.1 was published 2026-04-23 with appVersion `v0.2.0`, which
accepts the canonical-string signer contract (chain_id, value, etc.
serialized as JSON strings) introduced by PR #359 / commit b9495b8.
Chart 0.3.0 ships `v0.1.0` which only accepts the legacy u64 contract
and rejects every signing call from current obol-stack with HTTP 422
"chain_id: invalid type: string \"84532\", expected u64".

OpenClaw was bumped to 0.3.1 in PR #374 but Hermes was missed — the
two charts are pinned in independent constants and Renovate only
updated one. flow-14 step 23 (Alice ERC-8004 register via remote-
signer) reproduced the failure on every run against current main.

TestRemoteSignerChartVersionConsistency reads both source files at
test time and asserts the two pins agree, so future chart bumps
either touch both files together or fail CI.

Pairs with: PR #357 (closed in favour of #359), task #46.

* refactor(charts): single source of truth for remote-signer chart pin

Both Hermes and OpenClaw deploy the same `remote-signer` Helm chart but
each held its own private constant + Renovate annotation. PR #374
bumped only OpenClaw to 0.3.1; Hermes stayed on 0.3.0 and shipped image
v0.1.0 which rejects the canonical-string signer contract — exactly
the drift class TestRemoteSignerChartVersionConsistency was added to
catch.

Promote the pin to a single exported constant in
`internal/agentruntime/charts.go` (the package both consumers already
import for Namespace/Hostname/KeystoreVolumePath, no new dep edge),
move the Renovate annotation to live alongside it, and delete the
consistency test — drift is now structurally impossible.

Mirrors the OPENCLAW_VERSION pattern (single source-of-truth file +
TestOpenClawVersionConsistency over its three consumers); future
shared chart pins follow the same shape under internal/agentruntime/.

---------

Co-authored-by: bussyjd <bussyjd@users.noreply.github.com>