
fix(hermes): bookend wallet-import archival with k3d ownership flip#397

Merged
bussyjd merged 6 commits into integration/pr377-pr381 from fix/hermes-wallet-import-archive-perms
Apr 29, 2026

Conversation


@bussyjd bussyjd commented Apr 29, 2026

Summary

archiveReplacedHermesKeystore operated on the host-path PVC directly, but provisionKeystoreToVolume's final fixRuntimeVolumeOwnership step leaves the keystores dir as mode 700, owned by the container's uid 10000. From the host side (uid 1000), even os.Stat fails with EACCES, surfaced as:

failed to archive replaced keystore: stat …/<uuid>.json: permission denied

This bit flow-14 at step 22 (Alice: import seller wallet into remote-signer) immediately after Oisin's #386 surgery removed the --private-key-file escape hatch and routed the seller path through obol wallet import. flow-14 step 22 was confirmed reproducible on a freshly-wiped workspace on testbed-B (spark2) — the failure is intrinsic to the import path, not state pollution.

Fix

Mirror the pattern provisionKeystoreToVolume already uses for write access:

ensureVolumeWritable(cfg, dir, u)         // chown -R 1000:1000 via k3d node-exec
defer fixRuntimeVolumeOwnership(cfg, dir, u)  // chown -R 10000:10000 back

8 added lines, no logic change beyond the bookend.
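The bookend can be sketched as follows. The helper names come from the PR, but the types and signatures here are simplified stand-ins for illustration, not the real internal/hermes code:

```go
package main

import "fmt"

// Sketch of the bookend pattern (helper names from the PR; the
// signatures and cfgT are illustrative stand-ins).
type cfgT struct{}

var calls []string

// flip ownership to the host uid (chown -R 1000:1000 via k3d node-exec)
func ensureVolumeWritable(cfg *cfgT, dir string) error {
	calls = append(calls, "writable:"+dir)
	return nil
}

// restore container ownership (chown -R 10000:10000 back)
func fixRuntimeVolumeOwnership(cfg *cfgT, dir string) {
	calls = append(calls, "fix:"+dir)
}

func archiveReplacedKeystore(cfg *cfgT, dir string) error {
	if err := ensureVolumeWritable(cfg, dir); err != nil {
		return err
	}
	// deferred so every return path, including an os.Stat ENOENT
	// early-return, restores ownership for the remote-signer pod
	defer fixRuntimeVolumeOwnership(cfg, dir)

	// ... stat / mkdir replaced/ / rename would go here ...
	return nil
}

func main() {
	_ = archiveReplacedKeystore(&cfgT{}, "/var/keystores")
	fmt.Println(calls) // writable first, fix deferred to last
}
```

The defer is the important part: without it, any early return would leave the dir host-owned and break the pod.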

Why didn't this bite before

  • v0.9.0-rc1/rc2 flow-14 used --private-key-file directly; never went through obol wallet import → never called archiveReplacedHermesKeystore.
  • Bob's flow has called wallet import before, but it ran before his stack came up → no chart-bootstrapped keystore to archive → existingWallet was nil → the first guard returned at line 128.
  • The new Alice path (#386 Group A) calls wallet import after obol stack up, so a chart-deployed bootstrap keystore exists, with container ownership.

Test plan

  • go build ./... clean
  • go vet ./internal/hermes/... clean
  • go test ./... clean
  • flow-14 against live Base Sepolia on spark2 reaches the "Receipt summary:" marker (next: re-run on this branch + spark1 vLLM qwen36-fast)

Stacking

Sits on top of integration/pr377-pr381 (PR #386, tip 3ee8073 — Oisin's review surgery already merged in). Doesn't change anything in #386 itself; this is a pre-existing latent bug exposed by Oisin's Group A simplification of the seller wallet path.

bussyjd added 6 commits April 29, 2026 13:11
archiveReplacedHermesKeystore stats/mkdirs/renames on the host-path PVC
directly, but provisionKeystoreToVolume's last step
(fixRuntimeVolumeOwnership) leaves the keystores dir as
mode 700, owned by the container's uid 10000. The host-side process
(uid 1000) then cannot traverse the dir, so os.Stat returns EACCES
and the wrapping caller surfaces "failed to archive replaced
keystore: stat …: permission denied".

Mirror the pattern provisionKeystoreToVolume already uses: call
ensureVolumeWritable up front (chowns to host uid via k3d node-exec),
defer fixRuntimeVolumeOwnership so all return paths restore container
ownership for the remote-signer pod.

The bug pre-dates the obol-wallet-import flow rewrite; flow-14 only
started exercising the path on Alice once the --private-key-file
escape hatch was removed.
ImportPrivateKeyWalletOptions.ApplyCluster has been plumbed all the
way from cmd/obol/wallet.go since the OpenClaw → Hermes routing fix,
but ImportPrivateKeyWalletCmd never actually consumed it. Effect:
`obol wallet import` against a live cluster wrote the new keystore
to the host-path PVC and updated values-remote-signer.yaml on disk,
but the running remote-signer pod kept decrypting with the old
chart-bootstrap keystore-password Secret and signed with the chart's
throwaway address (e.g. 0xb6aF…). On a flow-14 register tx that
surfaced as "gas required exceeds allowance (0)" — chart key has
no funds.

Mirror OpenClaw's finalizeWalletProvision pattern: when the cluster
is reachable, run hermes.Sync to helmfile-sync the deployment.
helmfile reapplies the keystore-password Secret with the new value
and helm rolls the remote-signer deployment, so the pod restarts
against the freshly-imported keystore.

Failure to sync is best-effort — emits a warning and a recovery
hint instead of failing the import outright (cluster might come up
later).
…er from set -e

Two follow-ups to the helmfile-sync addition (a214050):

1. helm doesn't roll a Deployment when only a Secret's data changed —
   the Deployment template still references the same Secret name, so
   helm patches the Secret in-place and leaves the pod running with
   the stale env. After Sync, run an explicit `kubectl rollout restart
   deployment/remote-signer` and wait up to 120s for the new pod to be
   ready. Mirrors OpenClaw's restartRemoteSigner semantics.

2. flow-14 step 23 ran `register_out=$(timeout 300 obol sell register …)`
   under set -e from lib.sh. obol sell register correctly exits 1 on
   chain failure, but the assignment-with-command-substitution under
   errexit kills the script before the if-check can fire fail() and
   emit_metrics — the run looked like a silent death at "STEP: [23]"
   instead of a clean FAIL with metrics. Wrap in set +e/-e the same
   way step 22 (wallet import) already does.

Together with 6c5106a (archive bookend) and a214050 (Sync on
ApplyCluster), `obol wallet import` against a live Hermes cluster now
fully replaces the chart bootstrap key end-to-end without flow-level
workarounds.
…ring

Tests cover the regression classes surfaced in this PR:

- TestArchiveReplacedHermesKeystore_NilExisting / SameUUID — happy
  short-circuit paths must NOT call the k3d node-exec helpers.
- TestArchiveReplacedHermesKeystore_BookendOrder — guards 6c5106a:
  the (ensureVolumeWritable → fixRuntimeVolumeOwnership) bookend MUST
  run in order, and the deferred fix MUST fire on every return path
  including the os.Stat ENOENT early-return.
- TestArchiveReplacedHermesKeystore_RenamesToReplaced — happy-path
  archival writes the file under <dir>/replaced/<uuid>-<ts>.json and
  removes the original.
- TestImportPrivateKeyWalletCmd_ApplyClusterFalseSkipsCluster — guards
  the inverse of a214050: the pre-cluster bootstrap path must NOT
  helmfile-sync or rollout-restart.
- TestImportPrivateKeyWalletCmd_ApplyClusterTrueRollsPod — primary
  guard: ApplyCluster=true must invoke both Sync AND
  restartHermesRemoteSigner (helm doesn't roll on Secret-data changes,
  so the rollout-restart is non-optional).
- TestImportPrivateKeyWalletCmd_SyncFailureSkipsRestart — best-effort
  contract: Sync error → skip restart, do NOT fail the import as a
  whole; on-disk artifacts let a later `obol hermes sync` finish.

Tests use indirection seams (var syncFn, restartHermesRemoteSignerFn,
ensureVolumeWritableFn, fixRuntimeVolumeOwnershipFn) to spy/replace
without standing up a real k3d cluster.

Flow-level guard: a new step between 22 (wallet import) and 23
(register) asserts the remote-signer pod's startTime is within 120s
of now. If a regression drops the explicit kubectl rollout-restart,
the pod stays old → assertion fails fast with a clear "wallet import
did not roll the deployment" diagnostic, instead of falling through
to the 5-minute "gas required exceeds allowance (0)" symptom.
Chart 0.3.1 was published 2026-04-23 with appVersion `v0.2.0`, which
accepts the canonical-string signer contract (chain_id, value, etc.
serialized as JSON strings) introduced by PR #359 / commit b9495b8.
Chart 0.3.0 ships `v0.1.0` which only accepts the legacy u64 contract
and rejects every signing call from current obol-stack with HTTP 422
"chain_id: invalid type: string \"84532\", expected u64".

OpenClaw was bumped to 0.3.1 in PR #374 but Hermes was missed — the
two charts are pinned in independent constants and Renovate only
updated one. flow-14 step 23 (Alice ERC-8004 register via remote-
signer) reproduced the failure on every run against current main.

TestRemoteSignerChartVersionConsistency reads both source files at
test time and asserts the two pins agree, so future chart bumps
either touch both files together or fail CI.
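The shape of such a consistency check can be sketched as below; the constant name and file contents are assumptions for illustration, not the actual sources:

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical pin extractor: pull the chart version out of a source
// file's contents. The constant name is illustrative.
var pinRe = regexp.MustCompile(`remoteSignerChartVersion\s*=\s*"([^"]+)"`)

func extractPin(src string) (string, bool) {
	m := pinRe.FindStringSubmatch(src)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	// stand-ins for the two independently pinned source files
	hermesSrc := `const remoteSignerChartVersion = "0.3.1"`
	openclawSrc := `const remoteSignerChartVersion = "0.3.0"` // drifted
	a, _ := extractPin(hermesSrc)
	b, _ := extractPin(openclawSrc)
	fmt.Println(a == b) // false ⇒ the test fails CI until both bump
}
```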

Pairs with: PR #357 (closed in favour of #359), task #46.
Both Hermes and OpenClaw deploy the same `remote-signer` Helm chart but
each held its own private constant + Renovate annotation. PR #374
bumped only OpenClaw to 0.3.1; Hermes stayed on 0.3.0 and shipped image
v0.1.0 which rejects the canonical-string signer contract — exactly
the drift class TestRemoteSignerChartVersionConsistency was added to
catch.

Promote the pin to a single exported constant in
`internal/agentruntime/charts.go` (the package both consumers already
import for Namespace/Hostname/KeystoreVolumePath, no new dep edge),
move the Renovate annotation to live alongside it, and delete the
consistency test — drift is now structurally impossible.

Mirrors the OPENCLAW_VERSION pattern (single source-of-truth file +
TestOpenClawVersionConsistency over its three consumers); future
shared chart pins follow the same shape under internal/agentruntime/.

bussyjd commented Apr 29, 2026

flow-14 GREEN end-to-end on 3954852

60/0 PASS/FAIL across 55 steps. Live OBOL Permit2 settlement on Base Sepolia confirmed against the Obol public facilitator.

Run identity

  • Commit: 3954852 (PR tip)
  • Chart: remote-signer 0.3.1 (image v0.2.0)
  • Agent ID: 5274
  • Tunnel: https://land-movement-refrigerator-databases.trycloudflare.com
  • LLM endpoint: spark1 vLLM qwen36-fast (http://192.168.18.23:8000/v1)

On-chain receipts (Base Sepolia, chain 84532)

  • ERC-8004 register (agentId 5274): 0xad68b9826d389786980ed0a6c60e2a5c761e05b288c5a18eb83711ca4f2f3760
  • SetMetadata: 0x481fb33a19cb88194b61a10b8281152ef25dad248f5bc256bb7795dc7690506a
  • Funding (OBOL → Bob signer): 0xc2faae0652acd7aea7bb9391d21e486a812616b60bd77205ba3cfdd42c653a63
  • Settlement (Bob signer → Alice, exactly 1e15 wei = 0.001 OBOL): 0x7baead9ad4296b1ab5e0bda7a7b726b4203417074e4d91051d91942453d14b44

Balance deltas asserted exact on both sides:

  • Alice 0x58aA1bB7… +1000000000000000 wei
  • Bob signer 0x2627b9D7… −1000000000000000 wei

What this run validated

  • 6c5106a (archive perm bookend): step 22, wallet import archived the prior keystore without EACCES
  • a214050 (honor ApplyCluster): step 22, helmfile sync printed inline and the new password Secret was applied
  • b17995a (rollout-restart + flow set +e/-e): step 22, "✓ Remote-signer restarted" line emitted; step 24 had a clean error path
  • 769086d (Go tests + flow pod-age guard): step 23, pod age 13s assertion passed (catches future regressions of the same shape)
  • 3954852 (chart 0.3.0 → 0.3.1, image v0.1.0 → v0.2.0): step 24, obol sell register signed via the canonical-string contract; previously HTTP 422 chain_id: invalid type: string "84532"

Inference correctness (step 48): the paid response on paid/qwen36-fast was a 26-char coherent answer, not the parrot regex from the colleague's earlier flow-13 screenshot.


@bussyjd bussyjd merged commit 4885a5a into integration/pr377-pr381 Apr 29, 2026
bussyjd added a commit that referenced this pull request Apr 29, 2026
…lows, add flow-13 dual-stack OBOL (#386)

* feat(x402): support OBOL permit2 payments

* test: cover OBOL x402 payment flows

* test(flows): harden OBOL payment smoke flow

* fix flow-11 buyer wallet reuse

* Add Hermes default agent runtime

* Bump frontend RC for Hermes chat

* test(flows): harden OBOL payment live flow

* Expose Hermes native dashboard deeplink

* Fix flow 11 wallet preseed import

* fix hermes update and default deeplink

* preserve agent hosts across runtimes

* harden existing stack refresh

* Preserve LiteLLM config across defaults refresh

* Update paid route docs

* Pin paid route v1 invariants

* fix(flows,tunnel,hermes): harden flow-11/12 + cloudflared wait + dashboard env

Eleven brittleness issues were observed during real Base Sepolia test runs of
flow-11 against PR #377 alone, PR #381 alone, and the integration branch on
spark1 + spark2. This commit batches the fixes.

flow-11-dual-stack.sh
- env grep anchored to assignment lines so a comment containing
  REMOTE_SIGNER_PRIVATE_KEY no longer leaks into cast wallet address
- buyer-runtime detection (openclaw vs hermes) via detect_buyer_runtime,
  called after Bob's stack up; pod-readiness, exec, port-forward, and
  token retrieval all use BOB_AGENT_NS/DEPLOY/CONTAINER/SERVICE/RUNTIME/LABEL
- buy-success no longer relies on natural-language regex; the structural
  proof is the next PurchaseRequest CR Ready=True poll
- Agent ID extraction is numeric-only with explicit validation, so a
  pending registration ("Agent ID: (not yet registered)") fails cleanly
  instead of crashing the script via Python int()
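The numeric-only extraction with explicit validation can be sketched in Go (the flow itself does this in shell; names here are illustrative):

```go
package main

import (
	"fmt"
	"regexp"
)

// Accept only an all-digit Agent ID, so a pending registration line
// like "Agent ID: (not yet registered)" fails cleanly instead of
// crashing a later integer conversion.
var agentIDRe = regexp.MustCompile(`Agent ID:\s*([0-9]+)\s*$`)

func extractAgentID(line string) (string, bool) {
	m := agentIDRe.FindStringSubmatch(line)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	id, ok := extractAgentID("Agent ID: 5274")
	fmt.Println(id, ok) // 5274 true
	_, ok = extractAgentID("Agent ID: (not yet registered)")
	fmt.Println(ok) // false: caller can fail the step with a clear message
}
```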

flows/lib.sh
- explicit PATH export for ~/.foundry/bin and ~/.local/bin so nohup /
  setsid / cron launches resolve cast / kubectl / k3d
- detect_buyer_runtime helper (default Hermes, OpenClaw if present)
- promoted USDC-receipt helpers (write_receipt, receipt_status_ok,
  archive_receipt, extract_tx_hash, find_usdc_transfer,
  wait_usdc_transfer_receipt) so flow-08 and flow-12 can reuse them
- ensure_image_in_k3d helper for hosts where the k3d registry mirror
  stalls (aarch64 spark workaround for cloudflared:2026.1.2)

flow-08-buy.sh
- captures BUY_START_BLOCK, emits PAID_AMOUNT_USDC from the signing
  Python, then archives the on-chain settlement receipt via
  wait_usdc_transfer_receipt; balance delta check kept as defense in depth

flow-12-obol-payment.sh + monetize_integration_test.go
- Go test emits FLOW12_SETTLEMENT_TX marker; shell pipes test output
  through tee and writes receipt-summary.json with the same JSON shape
  as flow-11 (registration / funding markers omitted because the OBOL
  Permit2 path doesn't produce them on Anvil)

cmd/obol/sell.go + internal/tunnel/tunnel.go
- WaitReady(cfg, ui) refactored from EnsureRunning, default 5min budget
  (override via FLOW_TUNNEL_TIMEOUT). EnsureTunnelForSell is called
  before kubectlApplyOutput on the registration path so the controller's
  first reconcile sees a populated tunnelURL ConfigMap, fixing the
  AwaitingExternalRegistration race observed on spark1
- on registration path, tunnel failure is fatal with a hint to use
  --no-register; --no-register path keeps best-effort tunnel

internal/hermes/hermes.go
- GATEWAY_ALLOW_ALL_USERS=true on the hermes-dashboard container only,
  with an inline comment explaining that local k3d/dev clusters do not
  expose dashboard messaging integrations to the public internet, and
  that production must override via a values overlay. Unblocks the
  dashboard from CrashLoopBackOff so the pod reaches Ready=True and
  port-forward to the API server works

flows/run-detached.sh + flows/README.md
- new launcher script that survives SSH disconnect (tmux -> screen ->
  setsid -f); README documents the flow inventory and the new launcher

* fix(flow-11): step 28 checks API-server container ready, not pod STATUS

In multi-container pods like Hermes (API server + dashboard) the upstream
hermes-dashboard container can stay in CrashLoopBackOff for unrelated
reasons (missing fastapi/uvicorn in the image's web-UI optional deps),
which makes the pod-summary STATUS column read "CrashLoopBackOff" even
when the API-server container we actually need is happily Running.

Switch step 28 from `grep "Running"` on the STATUS column to a jsonpath
query for the specific container's `ready=true`, and bump the budget
from 24x5s = 120s to 36x5s = 180s to absorb slow init on aarch64 hosts.

Result: integration flow-11 now goes 45/45 with 0 sub-step FAILs.

* fix(flow-11): drop step-34 discovery length assertion

The discovery step issued a chat-completion to Bob's agent and required
the assistant content to be >100 chars. Hermes occasionally responds
with a short interim "let me check..." message (~93 chars) before
proceeding to the next tool call, causing a false FAIL even though the
agent went on to discover Alice and complete the buy in step 35-36.

Same fix as step 35: drop the natural-language assertion. The structural
proof of discovery is the next step's `buy.py` invocation succeeding and
the PurchaseRequest CR going Ready=True (step 36).

* feat(flows): add flow-13 dual-stack OBOL Permit2 against shared Anvil fork

Mirrors flow-11's two-stack structure but the payment asset is a
fork-local OBOL ERC20Permit token instead of USDC, both Alice's and
Bob's obol stacks share ONE local Anvil fork (via the host.k3d.internal
alias), and the facilitator is a local x402-rs build with
eip2612GasSponsoring (not the public Obol facilitator).

- Anvil port + facilitator port allocated via pick_free_port
- ForkObolToken deployed on the fork via `forge create` against
  contracts/fork-obol/src/ForkObolToken.sol; mints 10 OBOL to Alice
  and 10 OBOL to Bob's signer
- Single trap-based cleanup tears down anvil, facilitator, and any
  port-forwards on any exit
- Skip-if-missing: emits one PASS and exits 0 when neither
  X402_FACILITATOR_BIN nor X402_RS_DIR resolve to a usable build
- Reuses the receipt helpers from lib.sh by setting
  USDC_ADDRESS_BASE_SEPOLIA=$OBOL_TOKEN at call sites; the helpers
  are generic ERC-20 despite the USDC-flavored name
- Bob's `obol network add base-sepolia` points at the same Anvil URL
  Alice uses, with eRPC pinned to the single custom upstream so both
  clusters see the same on-chain state for OBOL balance/Transfer logs

* fix(flow-13): correct facilitator scheme config + /supported assertion

x402-rs has no standalone "v2-eip155-permit2" scheme. The OBOL Permit2 /
EIP-2612 gas sponsoring path is enabled via
config.eip2612_gas_sponsoring=true on the v2-eip155-exact scheme — same
as testutil.StartRealFacilitatorWithOptions does. The previous flow-13
config requested a phantom permit2 scheme, which the facilitator
silently ignored, leaving /supported with v1+v2 exact only and failing
the assertion that looked for a literal "permit2" scheme name.

- drop the bogus v2-eip155-permit2 scheme entry
- attach config.eip2612_gas_sponsoring=true to the v2-eip155-exact entry
- assert /supported lists v1+v2 exact for base-sepolia (the buyer-side
  produces the Permit2 payload; the facilitator's role on this path is
  to verify+settle the sponsored authorization)

* fix(flow-13): bind anvil to 0.0.0.0 so k3d pods can reach it

* fix(flow-13): use busybox transient pods for in-cluster probes

* fix(flow-13): drop ERC-8004 registration; informational discovery

* fix(flow-13): bypass cloudflared, use docker host route for cross-cluster

* fix(flow-13): bring up cloudflared explicitly; restore real tunnel path

* docs(obol-stack-dev): distil flow-11/12/13 session learnings

* fix(flow-13): scale cloudflared to 1 explicitly (rollout restart is no-op at 0)

* fix(flow-13): cast send --json + balance proof for Bob mint

* fix(erc8004): wait for read-side consistency before setMetadata

A colleague hit `! failed to set x402 metadata: erc8004: setMetadata tx:
execution reverted` on agent ID 5196 (live Base Sepolia). On-chain analysis
showed the wallet nonce went 0->1, only the register tx broadcast, and the
remote-signer was never asked to sign setMetadata. The revert happened in
pre-broadcast simulation (eth_estimateGas), with selector 0x7e273289 =
ERC721NonexistentToken(uint256). We reproduced it via static cast call.

Root cause: bind.Transact for setMetadata invokes eth_estimateGas through
the chain READER. The Register tx's WaitMined confirms on the WRITE
upstream, but the READER (especially through eRPC, which has independent
upstreams per chain) can lag behind by a block or two. If the simulation
fires before the reader sees the just-minted token, the registry's
ERC-721 ownerOf check inside setMetadata reverts.

The aggravated form is when an `obol network add base-sepolia --endpoint
http://...:anvil-port` from a prior flow-12/flow-13 run leaves a stale
custom upstream pinned for chain 84532. The simulation routes to that
fork (which has its own ERC-721 storage where 5196 was never minted),
producing the same revert. We documented this in the obol-stack-dev skill.

Fix:
- Add Client.AgentWallet (calls ERC-8004 getAgentWallet view) so we can
  probe the reader's view of a specific agent id without depending on
  ERC-721 ownerOf, which isn't in the registry's ABI subset.
- Add Client.WaitForAgent that polls AgentWallet until the reader returns
  a non-revert response, with a 30s default timeout.
- After client.Register / client.RegisterWithOpts in cmd/obol/sell.go's
  registerDirectWithKey and registerWithRemoteSigner paths, call
  WaitForAgent before SetMetadata. A reader that catches up is a
  prerequisite for the simulation to succeed.

Tests:
- TestWaitForAgent_RetriesUntilOwnerVisible — the reader returns
  "execution reverted" twice then succeeds; verifies WaitForAgent waits
  through the staleness window.
- TestWaitForAgent_TimeoutReturnsError — verifies persistent reverts
  surface as a clear timeout error after the budget.

Out of scope:
- Detection of stale eRPC custom upstreams (proposed: `obol network
  status` upstream-reachability check) — left as a follow-up.
- Cleanup-trap teardown in flow-12/flow-13 to remove the base-sepolia
  network pin — separate flow PR.

* Add Hermes default agent runtime

* Bump frontend RC for Hermes chat

* Expose Hermes native dashboard deeplink

* fix hermes update and default deeplink

* preserve agent hosts across runtimes

* harden existing stack refresh

* Preserve LiteLLM config across defaults refresh

* Update paid route docs

* Pin paid route v1 invariants

* feat(network): probe upstream chain ids in `obol network status`

Adds ProbeUpstream / ProbeAllUpstreams (eth_chainId, 2s parallel timeout)
and wires `obol network status` to warn on unreachable or chain-id
mismatched upstreams — typically a stale `obol network add base-sepolia
--endpoint <local-anvil>` left over from a flow run whose Anvil was
since killed or recreated.

The report covering v0.9.0-rc1 called this out as the root cause of the
setMetadata revert PR #387 fixed; this surfaces the same condition
proactively at status-check time.

`--no-probe` opts out for callers who don't want the network round-trip.

* feat(flow-14): live Base Sepolia OBOL Permit2 sibling of flow-13

Adds flow-14 — a live-network counterpart to the Anvil-fork flow-13.
Same dual-stack topology, but no Anvil, no local x402-rs facilitator;
talks to live https://sepolia.base.org and the public Obol facilitator
at x402.gcp.obol.tech. Required env vars OBOL_TOKEN_BASE_SEPOLIA (the
deployed ERC20Permit address) and BOB_FUNDING_PRIVATE_KEY (a real
funded buyer wallet) fail fast at the top so the script never spends
gas before the operator has set both.

Registration is enabled in flow-14 (flow-13 deliberately disables it
for the protocol-level fork test) so PR #387's WaitForAgent fix runs
on the OBOL path too. eip712Name is derived from the on-chain
name() — an early-fail probe that catches EIP-712 domain mismatches
before any Permit2 signing happens.

flow-13 picks up the same EIP-712 early-fail probe, plus a cleanup-trap
`obol network remove base-sepolia` on both clusters so a leftover
custom pin from a prior run can't leak into the next flow's reads.

monetize-inference.md gains an operator note: `eip2612_gas_sponsoring:
true` shifts gas to the facilitator signer, must monitor balance.

* test(fork-obol): assert ForkObolToken parity vs canonical OBOL

Adds a build-time parity check (TestForkObolToken_ParityWithCanonicalOBOL)
that catches drift between contracts/fork-obol/src/ForkObolToken.sol and
the canonical OBOL token at 0x0B010000b7624eb9B3DfBC279673C76E9D29D5F7
(verified via Sourcify full-match).

The test does three independent checks for the bits that affect x402
Permit2 settlement:

1. Greps the .sol source for the EIP-712 typehash + Permit typehash
   string literals (catches accidental constant edits).
2. keccak256s those literals in Go and compares to the canonical bytes
   (catches typo drift on either side).
3. Reproduces mainnet OBOL's DOMAIN_SEPARATOR() — 0x5a3cd81e... — from
   the formula keccak256(abi.encode(typeHash, nameHash, versionHash,
   chainid=1, address=0x0B01...)) (catches abi-encoding drift).

Asserts decimals = 18 and that the source still hashes the literals
"Obol Network" (name) and "1" (version).

PARITY.md documents what MUST match (and is now tested) vs the deltas
that are intentional (governance, access control, ENS, burn, transfer
hooks) and orthogonal to settlement.

contracts/fork-obol/.gitignore added so forge build artefacts (cache/,
out/, broadcast/) stop showing up as untracked.

* fix(model): rank Ollama models by parameter count, not Ollama list order

Symptom: a colleague's Hermes agent answered every prompt with a wall of
text describing its own tool list, because the configured default model
was llama3.2:1b — too small to handle the agent's tool-using system
prompt.

Root cause: rankModels in internal/hermes/hermes.go (and the duplicate
in internal/openclaw/openclaw.go) picked `local[0]` — whatever model
the Ollama daemon happened to return first. On hosts that had recently
pulled llama3.2:1b, that 1B model won over qwen3.5:9b every time. The
old comment ("Within a tier, the first model wins") was honest about
this, just wrong as a strategy.

Fix: extract a single capability-aware ranker into internal/model:

  - Cloud models (Claude, GPT, o-series) outrank local models.
  - Within the cloud tier, an explicit precedence table prefers Opus
    over Sonnet over Haiku, gpt-5 over gpt-4 over gpt-3.5, etc.
  - Within the local tier, models are sorted by parameter count parsed
    from the tag — `qwen3.5:9b` → 9, `mixtral:8x7b` → 56, `llama3.2:1b`
    → 1. Larger first.
  - Untagged Ollama models fall back to a family-default table; the
    table is iterated longest-prefix-first so `llama3.3` (default 70)
    matches before `llama3` (default 8).
  - Tiebreak alphabetically for determinism.
  - Embedding models (nomic-embed) score 0 so they never become the
    chat default.

Both internal/hermes/rankModels and internal/openclaw/rankModels are
now thin wrappers over model.Rank — the openclaw one preserves its
`openai/` prefix for LiteLLM routing.

Eight table-driven tests in internal/model/rank_test.go cover the
regression scenario, the cloud quality table, parameter parsing for
b/Bx7b/235b shapes, the longest-prefix family lookup, alphabetical
tiebreak, the embedding-model exclusion, and the empty-input case.

* test(inference): assert response coherence on free + paid paths

The model-rank fix prevents 1B-parameter models from becoming the agent
default, but the regression was only visible at the response layer
(tool-catalogue parroting). Add assertions that exercise both layers,
not just status codes:

flow-04 (free Hermes inference, getting-started.md §5):
- After the existing 200 OK assertion, send "hello" and assert the
  reply does not parrot the tool catalogue (numbered list of Hermes /
  Skills / Terminal / Todo / Vision Analyze with markdown bold), and
  is no longer than a coherent greeting deserves (600 char ceiling).
- Read the configured default model from hermes-config and reject any
  tag declaring 1B / 0.5B / 0.6B parameters as too small for the
  agent's tool-using system prompt.

flow-11 (live USDC) + flow-14 (live OBOL):
- After the existing paid-200 assertion, parse the CONTENT line and
  apply the same anti-parrot regex. A paid 200 with garbage in the
  body is still a regression from the buyer's perspective.

internal/hermes/rankmodels_test.go + internal/openclaw/rankmodels_test.go:
- Confirm each runtime's thin rank wrapper preserves the right
  shape (Hermes strips provider prefixes, OpenClaw re-adds openai/
  for LiteLLM routing) on top of model.Rank.

Together with the existing model.Rank tests, this is the regression
guard for the 1B-default scenario at three layers: ranker, runtime
wrapper, end-to-end inference response.

* fix(model): handle decimal parameter tags (qwen3:0.6b regression)

Ollama tags like `qwen3:0.6b` (and `1.5b`, `0.5b`, etc.) didn't match
the original regex `(\d+(?:x\d+)?)b` and fell through to the family
default — meaning `qwen3:0.6b` got rank 14 (qwen3 family) and was
mistakenly chosen over qwen3.5:9b. The 0.6B model has the same
small-model failure mode the rank fix was supposed to prevent.

Updated regex accepts `\d+(?:\.\d+)?(?:x\d+(?:\.\d+)?)?b` so decimal
sizes parse correctly. Ranks are now expressed in deci-billions
(params × 10) so `0.6b` → 6, `1b` → 10, `9b` → 90 — distinct integer
values for the comparator. Family defaults table scaled to match.
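The deci-billion parser can be sketched directly from the regex quoted above; the helper name is illustrative:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// paramDeciB parses the parameter count out of an Ollama-style tag and
// returns it in deci-billions (params × 10), so 0.6b → 6, 1b → 10,
// 9b → 90 are distinct integers for the comparator.
var paramRe = regexp.MustCompile(`(\d+(?:\.\d+)?)(?:x(\d+(?:\.\d+)?))?b`)

func paramDeciB(tag string) int {
	m := paramRe.FindStringSubmatch(strings.ToLower(tag))
	if m == nil {
		return 0 // real ranker falls back to the family-default table
	}
	n, _ := strconv.ParseFloat(m[1], 64)
	if m[2] != "" { // mixture tags: 8x7b means 56B total
		k, _ := strconv.ParseFloat(m[2], 64)
		n *= k
	}
	return int(n * 10)
}

func main() {
	for _, tag := range []string{"qwen3:0.6b", "llama3.2:1b", "qwen3.5:9b", "mixtral:8x7b"} {
		fmt.Println(tag, paramDeciB(tag))
	}
}
```

With this scaling, `qwen3:0.6b` (6) loses to `qwen3.5:9b` (90) instead of winning via the family default.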

Two new test cases pin the regression: `qwen3:0.6b` must lose to
`qwen3.5:9b`, and `smol:1.5b` (untagged family) must lose to a
known 9B model.

* fix(flow-14): poll for funding visibility on both public RPC and eRPC

Flow-14 ran clean through registration on spark2 but failed at step 36
("Bob signer OBOL balance 0") right after a successful funding transfer.
Bob's signer wallet at 0x9d87… had 5e15 wei on chain (verified post-
incident via cast call) but the public RPC's read replica returned 0
when the step queried it 0-1 blocks after the funding tx mined.

Then step 41's PurchaseRequest CR never appeared because buy.py inside
Bob's agent pod also read through eRPC (10s eth_call TTL) and saw 0
during its pre-sign balance check, refusing to sign auths. The cascade
took down steps 41-45 (sidecar empty, paid 200 → 404 model not found,
no settlement).

Same pattern flow-11 already uses for the USDC sibling flow — port it:

  - Step 36 wraps balanceOf in a 12-attempt × 2s poll against the public
    RPC. Fail-fast hard-exits the flow if balance never reaches
    OBOL_PRICE_WEI within 24s, instead of letting downstream steps cascade.
  - New step "Bob: eRPC reflects funding" runs buy.py's `balance` command
    inside the agent pod up to 18× × 5s, asserting the in-pod view
    matches the on-chain reality before any buy attempt.

bob_buy_skill_balance helper copied from flow-11; works against both
Hermes and OpenClaw runtimes via the BOB_AGENT_* vars exported by
detect_buyer_runtime.

This is the same class of read-side staleness PR #387 fixed for the
ERC-8004 setMetadata path.

* fix(flow-14): probe OBOL balance via direct eRPC eth_call (not buy.py)

The previous attempt at the in-pod balance poll called `buy.py balance`,
but that subcommand is hardcoded to query the USDC contract — flow-14
funds with OBOL, so the poll always returned 0 and timed out at 90s
even when the on-chain OBOL balance was visible to the public RPC.

Replace with `bob_obol_balance_via_erpc`: a small kubectl-exec helper
that runs python3 inside the litellm pod and POSTs an eth_call for
balanceOf(signer) on the OBOL token to Bob's eRPC at
http://erpc.erpc.svc.cluster.local:4000/rpc/base-sepolia. That's the
same URL pattern existing skills already use, and it queries the
correct asset.

Step 36 (public RPC poll) already proved the funding tx mined and
the on-chain balance >= price. This step now confirms the in-cluster
view has caught up before the agent's buy is invoked.

* fix(flow-14): probe eRPC on port 80, not 4000

The eRPC chart's Service exposes 80/TCP + 4001/TCP — port 4000 is
the container port, but the Service maps it to 80. Other in-cluster
skills (signer.py, rpc.py) get this right by hitting the bare
hostname; only discovery.py uses :4000 explicitly and it's wrong.

Verified against the live spark2 cluster: GET on
http://erpc.erpc.svc.cluster.local/rpc/base-sepolia returns
eth_chainId=0x14a34 (84532) instantly, and eth_call balanceOf
returns the correct 15e15 wei OBOL balance for Bob's signer.

Step 37's previous run timed out for 90s on every attempt against
:4000 because nothing was listening there.

* fix(flow-14): make Bob-signer balance delta tolerant of funding races

Step 48's strict pre/post equality on Bob's signer balance fails when
the funding tx in step 35 races the public RPC's read replicas:

  signer pre-fund:    10e15
  step 35 funds:      +5e15  → 15e15 actual
  step 36 polls:        15e15 (sometimes), 10e15 (when reads land on a
                        replica that hasn't seen the funding tx yet)
  step 47 settlement: -1e15  → 14e15 or 19e15 depending on which side
                                of the funding stale read landed

The settlement itself is correct in either case. We already assert the
two canonical proofs strictly:

  - Alice's balance delta == OBOL_PRICE_WEI (matches every run)
  - On-chain Transfer(signer → Alice, OBOL_PRICE_WEI) event archived

Convert the redundant Bob-signer pre/post check from a hard fail to an
informational pass that surfaces the diff. Settlement correctness is
unchanged.

Verified end-to-end on spark2 (run #4, 2026-04-28T14:31:55Z): all
critical assertions PASS, settlement tx
0x936b138e6cbb79e35920552f5c70ba14743744911f83db88d5c3cb4c994a1733
on Base Sepolia for exactly 0.001 OBOL.

* fix(flow-11): runtime-aware bob_remote_signer_address (Hermes too)

The helper was hardcoded to namespace `openclaw-obol-agent` and container
`openclaw`. After #381 makes Hermes the default agent runtime, that
exec hits a non-existent pod and returns empty silently — step 32 then
sees signer="unknown" and fails the wallet-mismatch check.

Use BOB_AGENT_NS / BOB_AGENT_DEPLOY / BOB_AGENT_CONTAINER which
detect_buyer_runtime exports based on which agent namespace actually
exists in Bob's cluster.

Caught by flow-11 run on spark2 against the merged #380+#381 branch.

* fix(flows): reclaim leaked Docker networks on flow start + exit

Each `k3d cluster create` reserves a /16 from Docker's predefined
172.16.0.0/12 pool (~16 networks max). If the create crashes mid-way
or the cluster is force-removed without `obol stack down`, the network
is orphaned. After enough leaks every new cluster fails with "all
predefined address pools have been fully subnetted" — exactly what
killed flow-11 run #3 on spark2 today after ~15 successive runs.

New helper `cleanup_k3d_obol_networks` in flows/lib.sh:
  - Filters strictly to `k3d-obol-stack-*` so it never touches user
    or other-app networks.
  - Relies on `docker network rm` refusing to remove networks with
    active endpoints, so it's safe to call while a flow is running —
    a live cluster's network is preserved automatically.

Wired into flow-11 / flow-13 / flow-14 both reactively (EXIT trap)
and proactively (top-of-flow), so a previously-leaked network from
an aborted run is reclaimed before the new run tries to allocate.
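
The strict filter can be sketched against a mocked network list (names below are hypothetical; the mocked list stands in for `docker network ls --format '{{.Name}}'`, and the real helper pipes the matches to `docker network rm`, which refuses networks that still have live endpoints):

```shell
#!/usr/bin/env bash
# Only networks matching the k3d-obol-stack- prefix are ever candidates;
# user networks and other clusters' networks pass through untouched.
mock_networks='bridge
host
k3d-obol-stack-flow11
k3d-some-other-cluster
my-app-net'

printf '%s\n' "$mock_networks" | grep '^k3d-obol-stack-'
```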

* fix(wallet): route obol wallet import to Hermes runtime

Hermes is the default agent runtime as of #381, but `obol wallet import`
was still wired exclusively into the OpenClaw codepath:
  - keystore was written to data/openclaw-{id}/remote-signer-keystores/
  - wallet metadata was written to config/applications/openclaw/{id}/

The Hermes remote-signer pod reads from data/hermes-{id}/...
so the preseed never reached the actual signer — flow-11 step 32 saw
the auto-generated wallet (0xa0A2…2033d) instead of the preseeded
buyer wallet (0x8E15…4916).

Add a new internal/hermes/wallet_import.go that mirrors the OpenClaw
import path but writes to the Hermes deployment dir + keystore volume,
and re-wire cmd/obol/wallet.go to dispatch there unconditionally
(active dev — no legacy OpenClaw fallback needed).

Update flow-11 preseed_bob_wallet to scaffold via `bob hermes onboard`
and verify via `bob hermes wallet address`, matching the new default.

* fix(hermes): allow onboard --no-sync without a live cluster

`obol hermes onboard --no-sync` was calling `writeDeploymentFiles`
which always invoked `model.ConfigureLiteLLM` against the cluster —
breaking pre-stack-up scaffolding (e.g. flow-11's wallet preseed step,
which scaffolds the agent before the cluster comes up).

Skip the LiteLLM auto-config when no kubeconfig is present. Stack-up's
own auto-config will run after the cluster is live, so nothing is lost.

* chore(stack): pull hermes image directly, drop local-build path

We have zero customization on top of nousresearch/hermes-agent — the
local hermes-agent clone tracks upstream main 1:1. Building the image
from source on every fresh `obol stack up` was wasting 7+ minutes when
a dev clone happened to be present (one of three candidate paths).

Drop:
  - hermesSourceDir() in internal/stack/stack.go
  - the OBOL_HERMES_SOURCE_DIR env var
  - devLocalImages() (collapsed back into baseLocalImages — only x402-*
    and serviceoffer-controller actually need source builds)

Hermes is pulled like any other upstream image via the tag in
internal/hermes/hermes.go (`nousresearch/hermes-agent:latest`,
overridable with OBOL_HERMES_IMAGE).

* flows: route to external LLM via canonical `obol model` CLI

Real-world recipe: an operator already has vLLM/sglang on their GPU
box and wants the Obol stack to use that endpoint instead of host
Ollama (the auto-config default). The canonical user flow is:

    obol model remove qwen3.5:9b qwen3:0.6b           # drop auto-detected Ollama
    obol model setup custom --name X --endpoint URL --model M

`setup custom` validates the endpoint, patches LiteLLM, hot-adds via
the model API, and internally calls syncAgentModels -> hermes.Sync,
which rewrites the default agent's deployment files with the new
primary model. No ConfigMap surgery, no manual restart.

Footgun documented: without the `obol model remove` step, the auto-
detected Ollama entries out-rank the new custom entry — internal/
model/rank.go:localRank parses `:9b` as 90 deci-billions while
`qwen36-fast` (no `:Nb` tag) ranks 0, so the agent silently stays on
the slow host model.

flows/lib.sh:route_llm_via_obol_cli wraps that exact CLI sequence
behind OBOL_LLM_ENDPOINT / OBOL_LLM_MODEL / OBOL_LLM_NAME /
OBOL_LLM_API_KEY env vars. Wired into flow-11 + flow-14 right after
each stack_init_and_up so both Alice (paid responses) and Bob (agent
autonomy) use the GPU when env is set; unset → flows keep the prior
auto-config behavior.
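
The env gate can be sketched as follows (helper and variable names match the description above, but the obol calls are echoed rather than executed, so this is illustrative only):

```shell
#!/usr/bin/env bash
# With OBOL_LLM_ENDPOINT unset the helper is a no-op, so flows keep the
# prior auto-config behavior; with it set, the canonical CLI sequence runs.
route_llm_via_obol_cli() {
  if [ -z "${OBOL_LLM_ENDPOINT:-}" ]; then
    echo "OBOL_LLM_ENDPOINT unset: keeping auto-config behavior"
    return 0
  fi
  echo "obol model setup custom --name ${OBOL_LLM_NAME:-custom}" \
       "--endpoint ${OBOL_LLM_ENDPOINT} --model ${OBOL_LLM_MODEL}"
}

unset OBOL_LLM_ENDPOINT OBOL_LLM_NAME OBOL_LLM_MODEL
route_llm_via_obol_cli
OBOL_LLM_ENDPOINT=http://gpu-box:8000/v1 OBOL_LLM_MODEL=qwen36-fast \
  route_llm_via_obol_cli
```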

CLAUDE.md gets a "Pointing the stack at an external OpenAI-compatible
LLM" subsection in the LLM Routing section, with the canonical
recipe and the rank-logic footgun. The gap that caused the prior
divergence was: CLAUDE.md framed `obol model setup` as cloud-provider
config only, the custom-endpoint flow was a single buried one-liner,
and the rank-logic interaction was undocumented.

* chore(model): bump user-facing model recommendations to qwen3.6 / qwen3

User-facing pull suggestions still pointed at qwen3.5:4b — outdated
now that Qwen3.6 (high-quality MoE 30B-A3B + 27B coding) and Qwen3 8B
(laptop-friendly) are the current generation.

cmd/obol/model.go (interactive `obol model pull` prompt):
  - Default tier: qwen3.6:27b (17 GB, recommended on ≥32GB RAM hosts)
  - Coding tier: qwen3.6:27b-coding-mxfp8 (~13 GB, MXFP8)
  - Laptop tier: qwen3:8b (5.2 GB) — qwen3.6 has no small variants
    on Ollama, so the previous gen's 8B is the right small default.
  - Reasoning: deepseek-r1:8b unchanged
  - Lightweight: gemma3:4b unchanged

internal/openclaw/openclaw.go + cmd/obol/model.go (no-models hint):
  qwen3:8b for laptops, qwen3.6:27b for capable hosts.

internal/embed/skills/monetize-guide/SKILL.md (the user-facing
monetize walkthrough): same swap.

Tests, smoke fixtures, and docs that record historical validation
against `qwen3.5:9b` are intentionally left alone — those describe
what was *actually run*, not what we currently recommend.

* fix(flows): only call `obol model remove` for existing entries

Each `obol model {remove,setup}` write op calls syncAgentModels →
hermes.Sync → helmfile sync, producing a fresh Deployment revision
and a new ReplicaSet. Three back-to-back rollouts in a slow image-
pull environment (host Ollama + k3d containerd cold cache) stack
ReplicaSets with no Ready replica. We saw this on spark2: three
RSes (787bb9d4d7, 7cdbcd6d77, 54996f74c8), the original scaled to 0
before the new ones became Ready, and the agent pod was stuck in
Init while the bootstrap-hermes-install initContainer waited on the
hermes-agent image pull.

Skip `obol model remove` when the entry isn't present so the helper
boils down to a single rollout (the one for `obol model setup
custom`). The auto-detected Ollama entries are explicitly checked
against `obol model list` before removal.
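
The presence check boils down to an exact-line match before each removal. A sketch (`model_list` mocks `obol model list` output; removals are echoed instead of executed):

```shell
#!/usr/bin/env bash
# Skip `obol model remove` for entries that aren't present, so the
# helper only triggers the single rollout from `obol model setup custom`.
model_list='qwen3.5:9b
qwen3:0.6b'

for m in 'qwen3.5:9b' 'not-installed:1b'; do
  if printf '%s\n' "$model_list" | grep -Fqx "$m"; then
    echo "remove: $m"
  else
    echo "skip (not present): $m"
  fi
done
```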

* fix(flow-14): drop bogus --namespace flag from `obol sell register`

step 20 was passing `--namespace llm` to `obol sell register` which
doesn't accept that flag. The bash `|| true` swallowed the error,
the script reported PASS, and the offer sat in
`Registered=AwaitingExternalRegistration` forever — step 21's
`Ready=True` poll then timed out.

`obol sell register` only accepts --chain / --sponsored / --endpoint /
--name / --description / --image / --private-key-file. The offer is
found by the controller (which publishes registration resources after
the on-chain tx lands); the CLI doesn't need namespace scoping.

Also drop the `|| true` so a real register failure surfaces immediately
instead of leaving the run wedged at step 21 polling.

* fix(flow-14): bring up tunnel BEFORE `obol sell register`

`obol sell register` calls tunnel.GetTunnelURL(cfg) when --endpoint
isn't passed. The flow had register at step 20 but the cloudflared
scale-up at step ~22, so register was hanging trying to fetch a
tunnel URL from an empty obol-frontend ConfigMap (cloudflared sat at
0 replicas — `obol stack up` deploys it that way; flow-14's direct
ServiceOffer YAML apply bypasses the CLI's EnsureTunnelForSell hook).

Reorder: bring cloudflared up + capture TUNNEL_URL right after
ServiceOffer creation, then run register with --endpoint $TUNNEL_URL
explicit (so the call doesn't depend on the in-cluster lookup at all).

Add a 5-minute `timeout` wrapper as a defense-in-depth — the on-chain
tx + WaitForAgent + SetMetadata should land in ~30-60s; anything
beyond that is a hang we want to surface as a fail, not silently
block the run.

* fix(flow-14): call obol binary directly under timeout, not the alice() function

`timeout 300 alice sell register …` doesn't work — `timeout` is an
external program that cannot see the bash `alice()` runner function,
so it fails to exec with exit 127 before the on-chain call ever
happens. Run #9 silently exited mid-step 22 because of this (the
captured `register_out` was empty, register_rc was 127, but the FAIL
line never made it through tee to the log before tmux died).

Call the obol binary directly with `env OBOL_*=… $ALICE_DIR/bin/obol
sell register …` under the timeout — same env the alice() function
exports, but visible to the timeout(1) child.
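
The failure mode reproduces in three lines: `timeout(1)` is an external binary, so it exec()s its argument and cannot see bash functions. (`/bin/echo` stands in for `$ALICE_DIR/bin/obol` below; the env var is a placeholder.)

```shell
#!/usr/bin/env bash
# A shell function is invisible to timeout's exec.
alice() { echo "running as alice: $*"; }

timeout 5 alice sell register >/dev/null 2>&1
echo "via function: exit $?"   # 127: timeout cannot find 'alice' as a command

# The fix: hand timeout a real binary with the env inlined.
timeout 5 env OBOL_PLACEHOLDER=1 /bin/echo sell register >/dev/null
echo "via binary:   exit $?"   # 0
```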

* fix(flows): force-disconnect registry mirrors before removing leaked k3d networks

`k3d cluster create` joins three persistent registry-mirror containers
(k3d-obol-{docker,ghcr,quay}-io.localhost) to the cluster's network.
`k3d cluster delete` removes the cluster nodes but does NOT disconnect
those mirror containers — the network is left with 3 attached
endpoints, so `docker network rm` refuses to remove it. After ~16
delete-create cycles the predefined CIDR pool exhausts and every new
cluster fails with "all predefined address pools have been fully
subnetted" (hit again on flow-14 run #10 on spark2).

Update cleanup_k3d_obol_networks to:
  1. Skip live clusters: a network with `*-server-N` or `*-serverlb`
     attached means k3d is still using it — leave alone.
  2. Otherwise (mirror-only attachments), force-disconnect every
     attached container and then remove the network.

Mirrors auto-rejoin the next cluster's network when k3d sets up the
new cluster, so disconnect is non-destructive for the cache.
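
The live-cluster guard is a name check over the network's attached containers. A sketch with mocked attachment lists (cluster names are hypothetical; the real code would read the lists from `docker network inspect`):

```shell
#!/usr/bin/env bash
# A network with a *-server-N or *-serverlb container attached is in use
# by a live k3d cluster; mirror-only attachments mean it leaked.
is_live() {
  printf '%s\n' "$1" | grep -qE -- '-server-[0-9]+$|-serverlb$'
}

live_net='k3d-obol-stack-a-server-0
k3d-obol-stack-a-serverlb
k3d-obol-docker-io.localhost'

leaked_net='k3d-obol-docker-io.localhost
k3d-obol-ghcr-io.localhost
k3d-obol-quay-io.localhost'

is_live "$live_net"   && echo "live cluster: leave network alone"
is_live "$leaked_net" || echo "mirror-only: disconnect each container, then docker network rm"
```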

* feat(model): --no-sync flag on obol model {remove,setup custom}

Each `obol model` write op runs syncAgentModels at the end, which calls
hermes.Sync -> helmfile sync, producing a fresh Deployment revision and
a new ReplicaSet on every call. Three back-to-back rollouts on a slow-
pull cluster (host Ollama + cold containerd cache + concurrent image
pulls from cascading RSes) wedge with the agent pod stuck in Init
forever — exactly what flow-14 has been hitting at step 32.

Add `--no-sync` to `obol model remove` and `obol model setup custom`
so callers can batch model edits and run `obol model sync` once at the
end. Real-operator value too: scripted setup ("remove auto-detected
Ollama, add my vLLM endpoint, then sync") shouldn't pay for two extra
agent rollouts.

Update flows/lib.sh:route_llm_via_obol_cli to use --no-sync on all
intermediate writes and call `obol model sync` once at the end. Should
collapse the three-rollout cascade to a single Hermes redeploy.

* fix(model): use bare model name in LiteLLM `model_name` for custom endpoints

`obol model setup custom --name X --model Y` was writing the LiteLLM
entry as `model_name: custom/X/Y`. The Hermes agent then read the
LiteLLM model list, picked that entry as primary, applied
stripProviderPrefix once (-> X/Y), and stripped again on the way to
the agent config (-> Y). At inference time the agent passed `Y` to
LiteLLM, which only had `custom/X/Y` as a literal model_name, so every
chat completion returned 400 "no healthy deployments for this model"
— exactly what flow-14 hit at step 40-41 with the spark1 vLLM endpoint.

Drop the `custom/<name>/` prefix: LiteLLM `model_name = <model>`. The
agent's call and the LiteLLM entry match by exact string. The `--name`
flag remains useful as a human-facing label in `obol model status`
output but isn't part of the route key. Re-running `setup custom` with
the same `--model` re-binds the route — which is the natural "repoint"
behavior operators want.

* fix(flows): pass OBOL_LLM_MODEL through buy.py prompt

flow-14 step 41 (and flow-11 equivalent) hardcoded `--model qwen3.5:9b`
in the agent's buy prompt. When the run uses an external GPU LLM
(OBOL_LLM_ENDPOINT + OBOL_LLM_MODEL=qwen36-fast), Alice's LiteLLM
serves the bare `qwen36-fast` entry — but Bob's PurchaseRequest CR
ends up keyed on `qwen3.5:9b`, the buyer sidecar publishes
`paid/qwen3.5:9b` as the paid alias, and Bob's agent calls
`paid/qwen3.5:9b` which routes via the `paid/*` wildcard to the
sidecar, which in turn forwards to Alice with model=`qwen3.5:9b` —
which Alice's LiteLLM doesn't have. 400 "no healthy deployments".

Use ${OBOL_LLM_MODEL:-qwen3.5:9b} so the buy.py call (and the PAID_MODEL
fallback) follow whatever the seller's actual model is. Defaults stay
unchanged when no env override is set.
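
The substitution itself is plain `${VAR:-default}` expansion:

```shell
#!/usr/bin/env bash
# Unset -> the historical default; set -> the operator's external model.
unset OBOL_LLM_MODEL
echo "model=${OBOL_LLM_MODEL:-qwen3.5:9b}"
OBOL_LLM_MODEL=qwen36-fast
echo "model=${OBOL_LLM_MODEL:-qwen3.5:9b}"
```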

* fix(stack): preload nousresearch/hermes-agent into k3d containerd

Each fresh k3d cluster has a cold containerd cache. The cluster's
hosts.toml mirrors docker.io through k3d-obol-docker-io.localhost so
in theory pulls go through the local mirror — but in practice the
mirror's resolve+blob handshake stalls under contention or after a
restart, leaving the first Hermes pod stuck in `Init:1/2` for 10+
minutes pulling the 2.4GB nousresearch/hermes-agent image. flow-14
keeps tripping on this at step 32 (hermes API ready polling) on
spark2.

Mirror existing dev-image preload pattern: pull the image to the host
docker daemon (cheap if cached), then `k3d image import` into the
cluster's containerd. Already done for openclaw — now done for
nousresearch/hermes-agent too via the new exported `hermes.ImageRef()`
helper.

This trades ~30s of host-to-cluster tarball import (one-time per
cluster) for the difference between a stuck pull and a working pod.

* fix(model): unify LiteLLM model_name contract, remove double-strip (#389)

`obol model setup custom`, the LiteLLM `model_name` convention, and the
agent-side stripProviderPrefix helpers were tangled in a way that quietly
broke flow-14 with a 400 "no healthy deployments for this model" on every
chat-completion against a custom vLLM endpoint:

  1. AddCustomEndpoint wrote `model_name: custom/<name>/<model>`.
  2. hermes.configuredModels saw it, called rankModels which pre-stripped
     to `<name>/<model>` before delegating to model.Rank.
  3. model.Rank also strips internally for ranking heuristics — but
     returns the original string. With the pre-strip from (2) the
     "original" was already mutilated.
  4. configuredModels then ran stripProviderPrefix on the primary AGAIN
     before returning, leaving the agent calling LiteLLM with bare
     `<model>` while only `custom/<name>/<model>` was registered.

The band-aid in ca820c9 dropped the `custom/<name>/` prefix on writes,
which unblocked the flow but left the underlying double-strip surface
intact. This change picks the contract explicitly:

  LiteLLM `model_name` is the bare model identifier — the agent reads
  it straight back as the `model` field on chat-completion calls and
  must round-trip unchanged. Same convention every other code path
  already uses (Ollama, Anthropic, OpenAI explicit entries).

Implementation:
  - internal/model/model.go: extract buildCustomEndpointEntry, document
    the contract on AddCustomEndpoint, drop the leftover `_ = name`
    bookkeeping.
  - internal/model/rank.go: keep the unexported stripProviderPrefix for
    ranking heuristics, add a doc comment explicitly forbidding its use
    on round-trippable identifiers.
  - internal/hermes/hermes.go: delete stripProviderPrefix /
    stripProviderPrefixes; rankModels now passes through to model.Rank
    without pre-stripping; configuredModels returns the LiteLLM model
    list unchanged. The agent's `model.default` is now byte-identical
    to the LiteLLM ConfigMap entry.
  - cmd/obol/model.go: clarify --name flag help to "informational only"
    — it still surfaces in `obol model status` but does not participate
    in the route key.

Tests:
  - internal/model/rank_test.go: TestRank_PreservesProviderPrefixOnOutput
    pins the round-trip property at the Rank() boundary, including the
    legacy `custom/<name>/<model>` shape.
  - internal/model/model_test.go: TestBuildCustomEndpointEntry covers
    the bare-model_name + openai/-routing shape, the empty-key fallback,
    and that colon-tagged ids survive intact.
  - internal/hermes/rankmodels_test.go: rewritten to assert the contract
    (was asserting the now-removed strip). Adds the
    `custom/<name>/<model>` regression guard.
  - internal/hermes/hermes_test.go: TestGenerateConfig_PrimaryIsRoundTrippable
    covers the end-to-end shape — whatever LiteLLM publishes is what the
    agent sends back.

Refs ca820c9 (band-aid).

Co-authored-by: bussyjd <bussyjd@users.noreply.github.com>

* Remove raw private-key handling now that it's no longer needed with the obol hermes wallet import command

* Remove the under-functional probe for now in favour of a better one later

* Avoid adding too much code just to cover hacks

* feat(stack): reclaim leaked dev k3d networks on obol stack purge

* Fix up old references to qwen3 and a 0.6b model

* Delete a plan, push pr review notes

* fix(hermes): bookend wallet-import archival with k3d ownership flip (#397)

* fix(hermes): bookend wallet-import archival with k3d ownership flip

archiveReplacedHermesKeystore stats, mkdirs, and renames on the
host-path PVC directly, but provisionKeystoreToVolume's last step
(fixRuntimeVolumeOwnership) leaves the keystores dir as
mode 700 owned by the container's uid 10000. The host-side process
(uid 1000) then cannot traverse the dir, so os.Stat returns EACCES
and the wrapping caller surfaces "failed to archive replaced
keystore: stat …: permission denied".

Mirror the pattern provisionKeystoreToVolume already uses: call
ensureVolumeWritable up front (chowns to host uid via k3d node-exec),
defer fixRuntimeVolumeOwnership so all return paths restore container
ownership for the remote-signer pod.

The bug pre-dates the obol-wallet-import flow rewrite; flow-14 only
started exercising the path on Alice once the --private-key-file
escape hatch was removed.

* fix(hermes): honor ApplyCluster — helmfile-sync after wallet import

ImportPrivateKeyWalletOptions.ApplyCluster has been plumbed all the
way from cmd/obol/wallet.go since the OpenClaw → Hermes routing fix,
but ImportPrivateKeyWalletCmd never actually consumed it. Effect:
`obol wallet import` against a live cluster wrote the new keystore
to the host-path PVC and updated values-remote-signer.yaml on disk,
but the running remote-signer pod kept decrypting with the old
chart-bootstrap keystore-password Secret and signed with the chart's
throwaway address (e.g. 0xb6aF…). On a flow-14 register tx that
surfaced as "gas required exceeds allowance (0)" — chart key has
no funds.

Mirror OpenClaw's finalizeWalletProvision pattern: when the cluster
is reachable, run hermes.Sync to helmfile-sync the deployment.
helmfile reapplies the keystore-password Secret with the new value
and helm rolls the remote-signer deployment, so the pod restarts
against the freshly-imported keystore.

Failure to sync is best-effort — emits a warning and a recovery
hint instead of failing the import outright (cluster might come up
later).

* fix(hermes,flow-14): roll remote-signer after import + protect register from set -e

Two follow-ups to the helmfile-sync addition (a214050):

1. helm doesn't roll a Deployment when only a Secret's data changed —
   the Deployment template still references the same Secret name, so
   helm patches the Secret in-place and leaves the pod running with
   the stale env. After Sync, run an explicit `kubectl rollout restart
   deployment/remote-signer` and wait up to 120s for the new pod to be
   ready. Mirrors OpenClaw's restartRemoteSigner semantics.

2. flow-14 step 23 ran `register_out=$(timeout 300 obol sell register …)`
   under set -e from lib.sh. obol sell register correctly exits 1 on
   chain failure, but the assignment-with-command-substitution under
   errexit kills the script before the if-check can fire fail() and
   emit_metrics — the run looked like a silent death at "STEP: [23]"
   instead of a clean FAIL with metrics. Wrap in set +e/-e the same
   way step 22 (wallet import) already does.
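
The pitfall reproduces in isolation: a plain assignment takes the exit status of its command substitution, so under `set -e` a failing capture kills the script before any if-check runs. (`false` stands in for a failing `obol sell register` here.)

```shell
#!/usr/bin/env bash
# Without the bracket, errexit fires on the assignment itself.
bash -c 'set -e; register_out=$(false); echo "never reached"' \
  || echo "script died, rc=$?, FAIL handler never ran"

# The flow's fix: bracket the capture with set +e / set -e.
bash -c 'set -e
set +e
register_out=$(false); register_rc=$?
set -e
echo "captured rc=$register_rc; fail() and emit_metrics can now run"'
```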

Together with 6c5106a (archive bookend) and a214050 (Sync on
ApplyCluster), `obol wallet import` against a live Hermes cluster now
fully replaces the chart bootstrap key end-to-end without flow-level
workarounds.

* test(hermes): unit tests + flow-14 guard for wallet-import cluster wiring

Tests cover the regression classes surfaced in this PR:

- TestArchiveReplacedHermesKeystore_NilExisting / SameUUID — happy
  short-circuit paths must NOT call the k3d node-exec helpers.
- TestArchiveReplacedHermesKeystore_BookendOrder — guards 6c5106a:
  the (ensureVolumeWritable → fixRuntimeVolumeOwnership) bookend MUST
  run in order, and the deferred fix MUST fire on every return path
  including the os.Stat ENOENT early-return.
- TestArchiveReplacedHermesKeystore_RenamesToReplaced — happy-path
  archival writes the file under <dir>/replaced/<uuid>-<ts>.json and
  removes the original.
- TestImportPrivateKeyWalletCmd_ApplyClusterFalseSkipsCluster — guards
  the inverse of a214050: the pre-cluster bootstrap path must NOT
  helmfile-sync or rollout-restart.
- TestImportPrivateKeyWalletCmd_ApplyClusterTrueRollsPod — primary
  guard: ApplyCluster=true must invoke both Sync AND
  restartHermesRemoteSigner (helm doesn't roll on Secret-data changes,
  so the rollout-restart is non-optional).
- TestImportPrivateKeyWalletCmd_SyncFailureSkipsRestart — best-effort
  contract: Sync error → skip restart, do NOT fail the import as a
  whole; on-disk artifacts let a later `obol hermes sync` finish.

Tests use indirection seams (var syncFn, restartHermesRemoteSignerFn,
ensureVolumeWritableFn, fixRuntimeVolumeOwnershipFn) to spy/replace
without standing up a real k3d cluster.

Flow-level guard: a new step between 22 (wallet import) and 23
(register) asserts the remote-signer pod's startTime is within 120s
of now. If a regression drops the explicit kubectl rollout-restart,
the pod stays old → assertion fails fast with a clear "wallet import
did not roll the deployment" diagnostic, instead of falling through
to the 5-minute "gas required exceeds allowance (0)" symptom.
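
The guard's age math can be sketched like this (GNU date assumed; startTime is fabricated 30s in the past so the example is self-contained, whereas the flow would read it via `kubectl get pod … -o jsonpath='{.status.startTime}'`):

```shell
#!/usr/bin/env bash
# Assert the remote-signer pod restarted within the last 120s.
start_time=$(date -u -d '-30 seconds' +%Y-%m-%dT%H:%M:%SZ)
age=$(( $(date +%s) - $(date -d "$start_time" +%s) ))
if [ "$age" -le 120 ]; then
  echo "remote-signer pod age ${age}s: wallet import rolled the deployment"
else
  echo "FAIL: pod age ${age}s; wallet import did not roll the deployment"
fi
```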

* fix(hermes): bump remote-signer chart 0.3.0 → 0.3.1 + consistency test

Chart 0.3.1 was published 2026-04-23 with appVersion `v0.2.0`, which
accepts the canonical-string signer contract (chain_id, value, etc.
serialized as JSON strings) introduced by PR #359 / commit b9495b8.
Chart 0.3.0 ships `v0.1.0` which only accepts the legacy u64 contract
and rejects every signing call from current obol-stack with HTTP 422
"chain_id: invalid type: string \"84532\", expected u64".

OpenClaw was bumped to 0.3.1 in PR #374 but Hermes was missed — the
two charts are pinned in independent constants and Renovate only
updated one. flow-14 step 23 (Alice ERC-8004 register via remote-
signer) reproduced the failure on every run against current main.

TestRemoteSignerChartVersionConsistency reads both source files at
test time and asserts the two pins agree, so future chart bumps
either touch both files together or fail CI.

Pairs with: PR #357 (closed in favour of #359), task #46.

* refactor(charts): single source of truth for remote-signer chart pin

Both Hermes and OpenClaw deploy the same `remote-signer` Helm chart but
each held its own private constant + Renovate annotation. PR #374
bumped only OpenClaw to 0.3.1; Hermes stayed on 0.3.0 and shipped image
v0.1.0 which rejects the canonical-string signer contract — exactly
the drift class TestRemoteSignerChartVersionConsistency was added to
catch.

Promote the pin to a single exported constant in
`internal/agentruntime/charts.go` (the package both consumers already
import for Namespace/Hostname/KeystoreVolumePath, no new dep edge),
move the Renovate annotation to live alongside it, and delete the
consistency test — drift is now structurally impossible.

Mirrors the OPENCLAW_VERSION pattern (single source-of-truth file +
TestOpenClawVersionConsistency over its three consumers); future
shared chart pins follow the same shape under internal/agentruntime/.

---------

Co-authored-by: bussyjd <bussyjd@users.noreply.github.com>

---------

Co-authored-by: bussyjd <bussyjd@users.noreply.github.com>
Co-authored-by: Oisín Kyne <oisin@obol.tech>