…ermit2)

- .dockerignore: keep broader .workspace* glob from #381
- internal/tunnel/tunnel_test.go: keep both new tests (parseQuickTunnelURL_PicksLatest from #377, BuildLocalManagedConfigYAMLRoutesOnlyRequestedHostname from #381)
- internal/stack/stack_test.go: take #381 port-handling test additions
…board env

Eleven brittleness issues were observed during real Base Sepolia test runs of flow-11 against PR #377 alone, PR #381 alone, and the integration branch on spark1 + spark2. This commit batches the fixes.

flow-11-dual-stack.sh
- env grep anchored to assignment lines so a comment containing REMOTE_SIGNER_PRIVATE_KEY no longer leaks into cast wallet address
- buyer-runtime detection (openclaw vs hermes) via detect_buyer_runtime, called after Bob's stack up; pod-readiness, exec, port-forward, and token retrieval all use BOB_AGENT_NS/DEPLOY/CONTAINER/SERVICE/RUNTIME/LABEL
- buy-success no longer relies on natural-language regex; the structural proof is the next PurchaseRequest CR Ready=True poll
- Agent ID extraction is numeric-only with explicit validation, so a pending registration ("Agent ID: (not yet registered)") fails cleanly instead of crashing the script via Python int()

flows/lib.sh
- explicit PATH export for ~/.foundry/bin and ~/.local/bin so nohup / setsid / cron launches resolve cast / kubectl / k3d
- detect_buyer_runtime helper (default Hermes, OpenClaw if present)
- promoted USDC-receipt helpers (write_receipt, receipt_status_ok, archive_receipt, extract_tx_hash, find_usdc_transfer, wait_usdc_transfer_receipt) so flow-08 and flow-12 can reuse them
- ensure_image_in_k3d helper for hosts where the k3d registry mirror stalls (aarch64 spark workaround for cloudflared:2026.1.2)

flow-08-buy.sh
- captures BUY_START_BLOCK, emits PAID_AMOUNT_USDC from the signing Python, then archives the on-chain settlement receipt via wait_usdc_transfer_receipt; balance delta check kept as defense in depth

flow-12-obol-payment.sh + monetize_integration_test.go
- Go test emits FLOW12_SETTLEMENT_TX marker; shell pipes test output through tee and writes receipt-summary.json with the same JSON shape as flow-11 (registration / funding markers omitted because the OBOL Permit2 path doesn't produce them on Anvil)

cmd/obol/sell.go + internal/tunnel/tunnel.go
- WaitReady(cfg, ui) refactored from EnsureRunning, default 5min budget (override via FLOW_TUNNEL_TIMEOUT). EnsureTunnelForSell is called before kubectlApplyOutput on the registration path so the controller's first reconcile sees a populated tunnelURL ConfigMap, fixing the AwaitingExternalRegistration race observed on spark1
- on registration path, tunnel failure is fatal with a hint to use --no-register; --no-register path keeps best-effort tunnel

internal/hermes/hermes.go
- GATEWAY_ALLOW_ALL_USERS=true on the hermes-dashboard container only, with an inline comment explaining that local k3d/dev clusters do not expose dashboard messaging integrations to the public internet, and that production must override via a values overlay. Unblocks the dashboard from CrashLoopBackOff so the pod reaches Ready=True and port-forward to the API server works

flows/run-detached.sh + flows/README.md
- new launcher script that survives SSH disconnect (tmux -> screen -> setsid -f); README documents the flow inventory and the new launcher
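The two flow-11 parsing fixes above (anchored env grep, numeric-only Agent ID) can be sketched roughly as follows. This is a minimal illustration, not the code in flow-11-dual-stack.sh: the function names `read_env_value` and `extract_agent_id` are hypothetical, and taking the last matching assignment with `tail -n 1` is an assumption.

```shell
# read_env_value KEY FILE
# Matches only assignment lines (optionally indented), so a comment that
# merely mentions KEY is never returned. Keeps everything after the first "=".
# Assumption: if KEY is assigned more than once, the last assignment wins.
read_env_value() {
  grep -E "^[[:space:]]*$1=" "$2" | tail -n 1 | cut -d= -f2-
}

# extract_agent_id TEXT
# Prints the token that follows "Agent ID:" only if it is purely numeric;
# otherwise returns 1 so the caller can emit a clean FAIL.
extract_agent_id() {
  local id
  id=$(printf '%s\n' "$1" |
    sed -n 's/^.*Agent ID:[[:space:]]*\([^[:space:]]*\).*$/\1/p' | head -n 1)
  case "$id" in
    '' | *[!0-9]*) return 1 ;;
    *) printf '%s\n' "$id" ;;
  esac
}
```

With this shape, `extract_agent_id "Agent ID: (not yet registered)"` returns non-zero instead of handing `(not` to a later integer conversion.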
In multi-container pods like Hermes (API server + dashboard) the upstream hermes-dashboard container can stay in CrashLoopBackOff for unrelated reasons (missing fastapi/uvicorn in the image's web-UI optional deps), which makes the pod-summary STATUS column read "CrashLoopBackOff" even when the API-server container we actually need is happily Running. Switch step 28 from `grep "Running"` on the STATUS column to a jsonpath query for the specific container's `ready=true`, and bump the budget from 24x5s = 120s to 36x5s = 180s to absorb slow init on aarch64 hosts. Result: integration flow-11 now goes 45/45 with 0 sub-step FAILs.
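A per-container readiness poll of the kind this commit describes might look like the sketch below. It assumes the first pod matching the selector is the one of interest; the function names and the `KUBECTL` override variable are hypothetical, and the namespace, selector, and container name are left as parameters because the source does not spell them out.

```shell
# KUBECTL can be overridden, which lets the loop run without a cluster.
KUBECTL="${KUBECTL:-kubectl}"

# container_ready NAMESPACE SELECTOR CONTAINER
# Succeeds only when the named container reports ready=true, regardless of
# what the pod-summary STATUS column says about sibling containers.
container_ready() {
  local ready
  ready=$("$KUBECTL" get pods -n "$1" -l "$2" \
    -o jsonpath="{.items[0].status.containerStatuses[?(@.name==\"$3\")].ready}" \
    2>/dev/null) || ready=""
  [ "$ready" = "true" ]
}

# wait_container_ready NAMESPACE SELECTOR CONTAINER [ATTEMPTS] [SLEEP_SECONDS]
# Defaults mirror the 36 x 5s = 180s budget mentioned above.
wait_container_ready() {
  local attempts="${4:-36}" delay="${5:-5}" i=0
  while [ "$i" -lt "$attempts" ]; do
    container_ready "$1" "$2" "$3" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

The `|| ready=""` guard keeps a failing `kubectl` call from aborting a caller that runs under `set -e`.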
The discovery step issued a chat-completion to Bob's agent and required the assistant content to be >100 chars. Hermes occasionally responds with a short interim "let me check..." message (~93 chars) before proceeding to the next tool call, causing a false FAIL even though the agent went on to discover Alice and complete the buy in step 35-36. Same fix as step 35: drop the natural-language assertion. The structural proof of discovery is the next step's `buy.py` invocation succeeding and the PurchaseRequest CR going Ready=True (step 36).
… fork

Mirrors flow-11's two-stack structure but the payment asset is a fork-local OBOL ERC20Permit token instead of USDC, both Alice's and Bob's obol stacks share ONE local Anvil fork (via the host.k3d.internal alias), and the facilitator is a local x402-rs build with eip2612GasSponsoring (not the public Obol facilitator).

- Anvil port + facilitator port allocated via pick_free_port
- ForkObolToken deployed on the fork via `forge create` against contracts/fork-obol/src/ForkObolToken.sol; mints 10 OBOL to Alice and 10 OBOL to Bob's signer
- Single trap-based cleanup tears down anvil, facilitator, and any port-forwards on any exit
- Skip-if-missing: emits one PASS and exits 0 when neither X402_FACILITATOR_BIN nor X402_RS_DIR resolve to a usable build
- Reuses the receipt helpers from lib.sh by setting USDC_ADDRESS_BASE_SEPOLIA=$OBOL_TOKEN at call sites; the helpers are generic ERC-20 despite the USDC-flavored name
- Bob's `obol network add base-sepolia` points at the same Anvil URL Alice uses, with eRPC pinned to the single custom upstream so both clusters see the same on-chain state for OBOL balance/Transfer logs
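The single trap-based cleanup mentioned in the list above can be sketched like this. It is an illustration under assumptions, not flow-13's actual trap: `track_pid`, `cleanup`, and `CLEANUP_PIDS` are hypothetical names, and the background commands are placeholders for anvil, the facilitator, and the port-forwards.

```shell
# Space-separated list of background PIDs owned by this script.
CLEANUP_PIDS=""

# track_pid PID: remember a background process for later teardown.
track_pid() {
  CLEANUP_PIDS="$CLEANUP_PIDS $1"
}

# cleanup: signal every tracked process, then reap them. Safe to call twice.
cleanup() {
  local pid
  for pid in $CLEANUP_PIDS; do
    kill "$pid" 2>/dev/null || true
  done
  for pid in $CLEANUP_PIDS; do
    wait "$pid" 2>/dev/null || true
  done
  CLEANUP_PIDS=""
}

# One trap for the whole script, registered before anything is started.
trap cleanup EXIT

# Typical use inside a flow (placeholder commands):
#   some_long_running_service & track_pid $!
```

Registering the trap before the first background process starts means an early failure still tears everything down.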
x402-rs has no standalone "v2-eip155-permit2" scheme. The OBOL Permit2 / EIP-2612 gas sponsoring path is enabled via config.eip2612_gas_sponsoring=true on the v2-eip155-exact scheme, the same way testutil.StartRealFacilitatorWithOptions does it. The previous flow-13 config requested a phantom permit2 scheme, which the facilitator silently ignored, leaving /supported with v1+v2 exact only and failing the assertion that looked for a literal "permit2" scheme name.

- drop the bogus v2-eip155-permit2 scheme entry
- attach config.eip2612_gas_sponsoring=true to the v2-eip155-exact entry
- assert /supported lists v1+v2 exact for base-sepolia (the buyer-side produces the Permit2 payload; the facilitator's role on this path is to verify+settle the sponsored authorization)
going to skim code for anything objectionable (the pr description makes me feel the flow scripts take some shortcuts to make things happen, e.g. i dont know why it needs the private key outside in an env, maybe i'll know when i look). if it looks adequate, we can merge and then you tell me what else is left of open prs that need to go in
`obol model setup custom`, the LiteLLM `model_name` convention, and the agent-side stripProviderPrefix helpers were tangled in a way that quietly broke flow-14 with a 400 "no healthy deployments for this model" on every chat-completion against a custom vLLM endpoint:

1. AddCustomEndpoint wrote `model_name: custom/<name>/<model>`.
2. hermes.configuredModels saw it, called rankModels which pre-stripped to `<name>/<model>` before delegating to model.Rank.
3. model.Rank also strips internally for ranking heuristics, but returns the original string. With the pre-strip from (2) the "original" was already mutilated.
4. configuredModels then ran stripProviderPrefix on the primary AGAIN before returning, leaving the agent calling LiteLLM with bare `<model>` while only `custom/<name>/<model>` was registered.

The band-aid in ca820c9 dropped the `custom/<name>/` prefix on writes, which unblocked the flow but left the underlying double-strip surface intact.

This change picks the contract explicitly: LiteLLM `model_name` is the bare model identifier. The agent reads it straight back as the `model` field on chat-completion calls and must round-trip unchanged. Same convention every other code path already uses (Ollama, Anthropic, OpenAI explicit entries).

Implementation:
- internal/model/model.go: extract buildCustomEndpointEntry, document the contract on AddCustomEndpoint, drop the leftover `_ = name` bookkeeping.
- internal/model/rank.go: keep the unexported stripProviderPrefix for ranking heuristics, add a doc comment explicitly forbidding its use on round-trippable identifiers.
- internal/hermes/hermes.go: delete stripProviderPrefix / stripProviderPrefixes; rankModels now passes through to model.Rank without pre-stripping; configuredModels returns the LiteLLM model list unchanged. The agent's `model.default` is now byte-identical to the LiteLLM ConfigMap entry.
- cmd/obol/model.go: clarify --name flag help to "informational only"; it still surfaces in `obol model status` but does not participate in the route key.

Tests:
- internal/model/rank_test.go: TestRank_PreservesProviderPrefixOnOutput pins the round-trip property at the Rank() boundary, including the legacy `custom/<name>/<model>` shape.
- internal/model/model_test.go: TestBuildCustomEndpointEntry covers the bare-model_name + openai/-routing shape, the empty-key fallback, and that colon-tagged ids survive intact.
- internal/hermes/rankmodels_test.go: rewritten to assert the contract (was asserting the now-removed strip). Adds the `custom/<name>/<model>` regression guard.
- internal/hermes/hermes_test.go: TestGenerateConfig_PrimaryIsRoundTrippable covers the end-to-end shape: whatever LiteLLM publishes is what the agent sends back.

Refs ca820c9 (band-aid).

Co-authored-by: bussyjd <bussyjd@users.noreply.github.com>
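The double-strip sequence in steps 1 to 4 can be reproduced with a small stand-in. This sketch assumes the removed helper simply dropped the first `provider/` segment, which the commit message implies but does not show; `strip_provider_prefix` and the endpoint name `myendpoint` are hypothetical.

```shell
# Hypothetical stand-in for the removed helper: drop the first "segment/".
strip_provider_prefix() {
  case "$1" in
    */*) printf '%s\n' "${1#*/}" ;;
    *) printf '%s\n' "$1" ;;
  esac
}

# Old shape written by AddCustomEndpoint (step 1).
registered="custom/myendpoint/qwen36-fast"

# Pre-strip before delegating to the ranker (step 2).
once=$(strip_provider_prefix "$registered")

# Second strip on the primary before returning (step 4).
twice=$(strip_provider_prefix "$once")

echo "registered with LiteLLM: $registered"
echo "sent by the agent:       $twice"
```

Because the two strings differ, a router that matches on the exact registered name has nothing to route to, which is consistent with the 400 described above.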
…es wallet import command
OisinKyne
left a comment
Not an immediate yes, expecting my 6 prs merged then this one merged to main, pre-emptively giving the ✅ so you don't get blocked
Flow-validation update: flow-14 GREEN on PR #397 stack tip

I ran flow-14 against …

This run surfaced (and PR #397 fixes) two latent bugs that pre-date Oisin's surgery on this branch but were exposed by the Group A …

Plus a chart drift fix: Hermes was pinned to …

Per project rule (no main merge without external reviewer + flows green), the receipts are now in place; this PR + #397 are ready for human review whenever you're ready, Oisin.
Full investigation + fixes report: 2026-04-29

This comment dumps the complete trail of what was done on this branch since Oisin's review surgery merged, the bugs flow-14 surfaced, the fixes that landed in PR #397, and the on-chain receipts that confirm the end-to-end OBOL Permit2 path is now green on Base Sepolia.

TL;DR

Architecture under test (flow-14)

Oisin's surgery: what landed

The 6 commits fast-forwarded cleanly ( …

Group A ( …

The 4 bugs flow-14 surfaced (each fixed in PR #397)

1. …
| PR | Outcome |
|---|---|
| #357 | CLOSED: proposed sending all SignTxRequest numeric fields as integers |
| #359 | MERGED (commit b9495b8 / 499e0e4): switched obol-stack to canonical-string contract |
| #374 (chart 0.3.1) | MERGED: bumps OpenClaw chart pin to 0.3.1, ships image v0.2.0 (accepts strings) |
But #374 was scoped to OpenClaw only:
internal/openclaw/openclaw.go:54 remoteSignerChartVersion = "0.3.1"
internal/hermes/hermes.go:34 remoteSignerChartVersion = "0.3.0" ← MISSED
Chart 0.3.0 ships image v0.1.0 (legacy u64 contract). Hermes-routed obol sell register always hit the 422.
Fix in 3954852: bump Hermes pin to 0.3.1 + add TestRemoteSignerChartVersionConsistency (reads both source files at test time, asserts agreement, fails CI on future drift).
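The idea behind the consistency test (read both pins at check time, fail on disagreement) can be approximated in shell. The project's test is written in Go; this is only an analogue. It assumes each file declares the pin as `remoteSignerChartVersion = "<version>"`, as in the two lines quoted above, and the function names are hypothetical.

```shell
# extract_chart_pin FILE: print the quoted value assigned to the constant.
extract_chart_pin() {
  sed -n 's/^.*remoteSignerChartVersion[[:space:]]*=[[:space:]]*"\([^"]*\)".*$/\1/p' "$1" |
    head -n 1
}

# check_chart_pins FILE_A FILE_B
# Returns 0 when both pins exist and agree, 1 on drift, 2 when a pin is missing.
check_chart_pins() {
  local a b
  a=$(extract_chart_pin "$1")
  b=$(extract_chart_pin "$2")
  if [ -z "$a" ] || [ -z "$b" ]; then
    echo "chart pin not found in $1 or $2" >&2
    return 2
  fi
  if [ "$a" != "$b" ]; then
    echo "remote-signer chart drift: $1=$a $2=$b" >&2
    return 1
  fi
}
```

Run against `internal/openclaw/openclaw.go` and `internal/hermes/hermes.go` as quoted above, a check like this would have reported 0.3.1 versus 0.3.0.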
PR #397 final state — 5 commits, all green
3954852 fix(hermes): bump remote-signer chart 0.3.0 → 0.3.1 + consistency test
769086d test(hermes): unit tests + flow-14 guard for wallet-import cluster wiring
b17995a fix(hermes,flow-14): roll remote-signer after import + protect register from set -e
a214050 fix(hermes): honor ApplyCluster — helmfile-sync after wallet import
6c5106a fix(hermes): bookend wallet-import archival with k3d ownership flip
Plus 7 new unit tests in internal/hermes/wallet_import_test.go (mock seams via var syncFn, restartHermesRemoteSignerFn, ensureVolumeWritableFn, fixRuntimeVolumeOwnershipFn) and 1 chart-consistency test in internal/hermes/chart_consistency_test.go. Plus a flow-level guard in flows/flow-14-live-obol-base-sepolia.sh that asserts the remote-signer pod is <120s old after wallet import — fast-fail diagnostic for any future drop of the rollout-restart.
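The pod-age guard described above could be sketched as follows. This is not the code in flow-14: the function names are hypothetical, GNU `date -d` is assumed for parsing the RFC 3339 timestamp, and the start time is expected to come from something like `kubectl get pod ... -o jsonpath='{.items[0].status.startTime}'`.

```shell
# pod_age_seconds START_TIME [NOW_EPOCH]
# START_TIME is an RFC 3339 timestamp such as a pod's .status.startTime.
pod_age_seconds() {
  local start_epoch now_epoch
  start_epoch=$(date -u -d "$1" +%s) || return 2
  now_epoch="${2:-$(date -u +%s)}"
  echo $((now_epoch - start_epoch))
}

# assert_pod_fresh START_TIME MAX_AGE_SECONDS [NOW_EPOCH]
# Fails fast when the pod is not younger than MAX_AGE_SECONDS, which is the
# signal that the rollout-restart after wallet import did not happen.
assert_pod_fresh() {
  local age
  age=$(pod_age_seconds "$1" "${3:-}") || return 2
  if [ "$age" -ge "$2" ]; then
    echo "wallet import did not roll the deployment (pod age ${age}s)" >&2
    return 1
  fi
}
```

Passing the current time as an optional argument keeps the check deterministic when it is exercised outside a cluster.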
Final flow-14 run — 60/0 PASS/FAIL
| Field | Value |
|---|---|
| Commit | 3954852 (PR #397 tip = 5 commits on top of 3ee8073) |
| Chart | remote-signer 0.3.1 / image ghcr.io/obolnetwork/remote-signer:v0.2.0 |
| Agent ID | 5274 |
| Tunnel | https://land-movement-refrigerator-databases.trycloudflare.com |
| Inference endpoint | spark1 vLLM qwen36-fast (http://192.168.18.23:8000/v1) |
| Total steps | 55 |
| PASS lines | 60 (some steps assert multiple invariants) |
| FAIL lines | 0 |
On-chain receipts (Base Sepolia, chain 84532)
| Tx | Purpose | Basescan |
|---|---|---|
| 0xad68b982…f3760 | ERC-8004 register (agentId 5274) | link |
| 0x481fb33a…0506a | SetMetadata | link |
| 0xc2faae06…3a63 | Funding (OBOL → Bob signer) | link |
| 0x7baead9a…d14b44 | Settlement Bob signer → Alice, exactly 1e15 wei (= 0.001 OBOL) | link |
Balance deltas asserted exact on both sides:
- Alice `0x58aA1bB7…`: +1000000000000000 wei
- Bob signer `0x2627b9D7…`: −1000000000000000 wei
Plus inference correctness on the paid path (step 48): the paid/qwen36-fast reply was a 26-char coherent answer, not the parrot regex from the colleague's earlier flow-13 screenshot.
What's NOT done — stays on the integration branch
Two scoped follow-ups that came up in review but are out of scope for #386 + #397:
- Add provider smokes and model preference #379 (`obol model prefer`) needs a rebase against this branch and a reconciliation with `model.Rank` so prefer wins over rank. Real conflict: can't be done until #379 rebases.
- rc3 prep: OBOL on Ethereum mainnet (and possibly Base mainnet), facilitator gas monitoring (so we don't run dry), 6 open Dependabot vulnerabilities (4 high, 2 moderate) on default branch.
Plus three lower-priority enhancements the user and I discussed:
- Pre-broadcast wallet-address assertion in `obol sell register` (proper layer fix for the stale-pod scenario; the flow guard catches it at the boundary but the CLI itself should refuse to broadcast against a stale signer).
- Chart redesign: keystore password as PVC sidecar (`<uuid>.password` next to `<uuid>.json`, mode 0400). Removes the Secret coordination entirely; chart change, not Go change. Out of scope for current rc.
- `obol sell` demo flow: Oisin still wants this; PR #355 "First commit towards a demo sell command" (worktree-sell-demo) state needs review.
Hand-off
This branch + PR #397 are flows-validated. Per project rule (feedback_main_merge_gates.md), I do not press merge to main from my side — both PRs are queued for human review. When ready, suggested merge order:
1. Merge #397 → integration/pr377-pr381 (this branch)
2. CI re-run on #386
3. Human review approval on #386
4. Merge #386 → main
5. Tag v0.9.0-rc3 (after rc3 backlog items above are addressed)
…397)

* fix(hermes): bookend wallet-import archival with k3d ownership flip

archiveReplacedHermesKeystore stat/mkdir/rename the host-path PVC directly, but provisionKeystoreToVolume's last step (fixRuntimeVolumeOwnership) leaves the keystores dir as mode 700 owned by the container's uid 10000. The host-side process (uid 1000) then cannot traverse the dir, so os.Stat returns EACCES and the wrapping caller surfaces "failed to archive replaced keystore: stat …: permission denied".

Mirror the pattern provisionKeystoreToVolume already uses: call ensureVolumeWritable up front (chowns to host uid via k3d node-exec), defer fixRuntimeVolumeOwnership so all return paths restore container ownership for the remote-signer pod.

The bug pre-dates the obol-wallet-import flow rewrite; flow-14 only started exercising the path on Alice once the --private-key-file escape hatch was removed.

* fix(hermes): honor ApplyCluster, helmfile-sync after wallet import

ImportPrivateKeyWalletOptions.ApplyCluster has been plumbed all the way from cmd/obol/wallet.go since the OpenClaw → Hermes routing fix, but ImportPrivateKeyWalletCmd never actually consumed it. Effect: `obol wallet import` against a live cluster wrote the new keystore to the host-path PVC and updated values-remote-signer.yaml on disk, but the running remote-signer pod kept decrypting with the old chart-bootstrap keystore-password Secret and signed with the chart's throwaway address (e.g. 0xb6aF…). On a flow-14 register tx that surfaced as "gas required exceeds allowance (0)", because the chart key has no funds.

Mirror OpenClaw's finalizeWalletProvision pattern: when the cluster is reachable, run hermes.Sync to helmfile-sync the deployment. helmfile reapplies the keystore-password Secret with the new value and helm rolls the remote-signer deployment, so the pod restarts against the freshly-imported keystore.
Failure to sync is best-effort: it emits a warning and a recovery hint instead of failing the import outright (cluster might come up later).

* fix(hermes,flow-14): roll remote-signer after import + protect register from set -e

Two follow-ups to the helmfile-sync addition (a214050):

1. helm doesn't roll a Deployment when only a Secret's data changed. The Deployment template still references the same Secret name, so helm patches the Secret in-place and leaves the pod running with the stale env. After Sync, run an explicit `kubectl rollout restart deployment/remote-signer` and wait up to 120s for the new pod to be ready. Mirrors OpenClaw's restartRemoteSigner semantics.
2. flow-14 step 23 ran `register_out=$(timeout 300 obol sell register …)` under set -e from lib.sh. obol sell register correctly exits 1 on chain failure, but the assignment-with-command-substitution under errexit kills the script before the if-check can fire fail() and emit_metrics, so the run looked like a silent death at "STEP: [23]" instead of a clean FAIL with metrics. Wrap in set +e/-e the same way step 22 (wallet import) already does.

Together with 6c5106a (archive bookend) and a214050 (Sync on ApplyCluster), `obol wallet import` against a live Hermes cluster now fully replaces the chart bootstrap key end-to-end without flow-level workarounds.

* test(hermes): unit tests + flow-14 guard for wallet-import cluster wiring

Tests cover the regression classes surfaced in this PR:

- TestArchiveReplacedHermesKeystore_NilExisting / SameUUID: happy short-circuit paths must NOT call the k3d node-exec helpers.
- TestArchiveReplacedHermesKeystore_BookendOrder: guards 6c5106a; the (ensureVolumeWritable → fixRuntimeVolumeOwnership) bookend MUST run in order, and the deferred fix MUST fire on every return path including the os.Stat ENOENT early-return.
- TestArchiveReplacedHermesKeystore_RenamesToReplaced: happy-path archival writes the file under <dir>/replaced/<uuid>-<ts>.json and removes the original.
- TestImportPrivateKeyWalletCmd_ApplyClusterFalseSkipsCluster: guards the inverse of a214050; the pre-cluster bootstrap path must NOT helmfile-sync or rollout-restart.
- TestImportPrivateKeyWalletCmd_ApplyClusterTrueRollsPod: primary guard; ApplyCluster=true must invoke both Sync AND restartHermesRemoteSigner (helm doesn't roll on Secret-data changes, so the rollout-restart is non-optional).
- TestImportPrivateKeyWalletCmd_SyncFailureSkipsRestart: best-effort contract; Sync error → skip restart, do NOT fail the import as a whole; on-disk artifacts let a later `obol hermes sync` finish.

Tests use indirection seams (var syncFn, restartHermesRemoteSignerFn, ensureVolumeWritableFn, fixRuntimeVolumeOwnershipFn) to spy/replace without standing up a real k3d cluster.

Flow-level guard: a new step between 22 (wallet import) and 23 (register) asserts the remote-signer pod's startTime is within 120s of now. If a regression drops the explicit kubectl rollout-restart, the pod stays old → assertion fails fast with a clear "wallet import did not roll the deployment" diagnostic, instead of falling through to the 5-minute "gas required exceeds allowance (0)" symptom.

* fix(hermes): bump remote-signer chart 0.3.0 → 0.3.1 + consistency test

Chart 0.3.1 was published 2026-04-23 with appVersion `v0.2.0`, which accepts the canonical-string signer contract (chain_id, value, etc. serialized as JSON strings) introduced by PR #359 / commit b9495b8. Chart 0.3.0 ships `v0.1.0` which only accepts the legacy u64 contract and rejects every signing call from current obol-stack with HTTP 422 "chain_id: invalid type: string \"84532\", expected u64".

OpenClaw was bumped to 0.3.1 in PR #374 but Hermes was missed: the two charts are pinned in independent constants and Renovate only updated one. flow-14 step 23 (Alice ERC-8004 register via remote-signer) reproduced the failure on every run against current main.
TestRemoteSignerChartVersionConsistency reads both source files at test time and asserts the two pins agree, so future chart bumps either touch both files together or fail CI.

Pairs with: PR #357 (closed in favour of #359), task #46.

* refactor(charts): single source of truth for remote-signer chart pin

Both Hermes and OpenClaw deploy the same `remote-signer` Helm chart but each held its own private constant + Renovate annotation. PR #374 bumped only OpenClaw to 0.3.1; Hermes stayed on 0.3.0 and shipped image v0.1.0 which rejects the canonical-string signer contract, exactly the drift class TestRemoteSignerChartVersionConsistency was added to catch.

Promote the pin to a single exported constant in `internal/agentruntime/charts.go` (the package both consumers already import for Namespace/Hostname/KeystoreVolumePath, no new dep edge), move the Renovate annotation to live alongside it, and delete the consistency test: drift is now structurally impossible.

Mirrors the OPENCLAW_VERSION pattern (single source-of-truth file + TestOpenClawVersionConsistency over its three consumers); future shared chart pins follow the same shape under internal/agentruntime/.

---------

Co-authored-by: bussyjd <bussyjd@users.noreply.github.com>
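The errexit pitfall from the flow-14 step 23 fix in the squashed commit message above is easy to reproduce in isolation. The sketch below is a hedged illustration: `failing_register` is a hypothetical stand-in for a command that prints diagnostics and exits 1, and `run_captured` is a hypothetical wrapper, not a helper from lib.sh.

```shell
set -e

# Stand-in for a command that reports a chain failure and exits non-zero.
failing_register() {
  echo "gas required exceeds allowance (0)"
  return 1
}

# run_captured CMD [ARGS...]
# Captures combined output and exit status without letting errexit abort
# the script on the assignment. Note: it re-enables errexit unconditionally,
# so it assumes the caller was running under set -e.
run_captured() {
  set +e
  CAPTURED_OUT=$("$@" 2>&1)
  CAPTURED_RC=$?
  set -e
}

run_captured failing_register
if [ "$CAPTURED_RC" -ne 0 ]; then
  echo "FAIL: register exited $CAPTURED_RC: $CAPTURED_OUT"
fi
```

Without the `set +e` / `set -e` bracket, a bare `out=$(failing_register)` under `set -e` ends the script at the assignment, so the `if` check never runs.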
Summary
This PR integrates #377 (OBOL Permit2) and #381 (Hermes default agent runtime) into a single testable branch and lands the hardening needed to keep both flows green together against real infrastructure (live Base Sepolia, real Cloudflare tunnel, two independent obol stacks on the same host).
Two end-to-end test flows now pass cleanly:
- `flow-11-dual-stack.sh`
- `flow-13-dual-stack-obol.sh` (new)

PASS-count > step-count because some steps assert multiple invariants (e.g. step 21 archives 2 receipts + 1 Agent ID).
Architecture under test
flow-11 — live Base Sepolia, USDC, ERC-8004
flow-13 — dual-stack Anvil fork, OBOL Permit2
The two clusters share one Anvil fork through `host.k3d.internal:$ANVIL_PORT`, so settlements are observable from both sides via the SAME ERC-20 contract address. The cross-cluster discovery still goes over a real Cloudflare tunnel; we do not bypass cloudflared.

On-chain receipts

flow-11: Base Sepolia (live)

Latest green run (commit 234b05f, agent ID 5250), all four artifacts archived under `.tmp/flow-11-<ts>/*-receipt.json`:

- 0x78087f828d42c14d8fbf1f0bcfc6589350109e05d27bcb1840a990bd8d78dd7a
- 0xae255c4ad4ce3919645e5c0cbacbd5fcb16e2995ef6c09fa36101cde06846e1b
- 0xccb35434c6a78450b70632ee5ea72795bbf31d167c9052aa428a743f8e3d39d1
- 0xe9a406f57fea24f795801880eb58e4ceea5040203accd5579f154bb36c4da55c

flow-13: Anvil fork (local, ephemeral)

Latest green run (commit d3ce623):

- … 0x31e3fc9D98Cf9A62980755C741ae069F7150De07 is deterministic from deployer nonce)
- 0x2b7fe02cc509341fa36ce819b5a17b69fe37e3027368a37a68b4f530a23bc3b0
- 0xc1c81ed61ced7d8343bc0eb2d6f7114bc4e2375da9eccdf1d5eec91caa04728e

Anvil hashes are not on the live chain by design (this flow is meant to exercise OBOL Permit2 without depending on a public bridged OBOL deployment).
What changed
Code (Go)
- `internal/tunnel/tunnel.go`: `EnsureRunning` refactored to `WaitReady(cfg, *ui.UI) (string, error)` with a 5-min budget (override via `FLOW_TUNNEL_TIMEOUT`). Polls `deploy/cloudflared` rollout AND `obol tunnel status` URL in the same loop; returns a clear error naming both subjects on timeout.
- `cmd/obol/sell.go`: when `obol sell http` registers, calls `tunnel.EnsureTunnelForSell` BEFORE the `kubectlApplyOutput` so the controller's first reconcile sees a populated `tunnelURL` ConfigMap. Tunnel failure is fatal on the registered path; `--no-register` keeps best-effort behavior.
- `internal/hermes/hermes.go`: injects `GATEWAY_ALLOW_ALL_USERS=true` on the `hermes-dashboard` container only, with an inline justification (local k3d clusters do not expose messaging integrations to the public internet; production must override via values overlay). Without this, the dashboard CrashLoopBackOff blocks pod readiness.
- `internal/openclaw/monetize_integration_test.go`: emits `FLOW12_SETTLEMENT_TX=<hash>` markers so flow-12's shell harness can build a `receipt-summary.json` analogous to flow-11.

Flows

- `flows/flow-13-dual-stack-obol.sh` (new): 1262 lines. Two obol stacks, one shared Anvil fork via `host.k3d.internal`, x402-rs facilitator with `eip2612_gas_sponsoring=true`, ForkObolToken deployed by `forge create`, real Cloudflare tunnel for cross-cluster discovery. Trap-based cleanup tears down anvil + facilitator + port-forwards on any exit.
- `flows/run-detached.sh` (new): launcher that survives SSH disconnect; tries `tmux` → `screen` → `setsid -f` in that order. Prints the log path; tail it from a fresh SSH session.
- `flows/lib.sh`: adds 4 helpers reused by flow-08/11/12/13:
  - `PATH` export for non-login shells (so `cast`/`kubectl`/`k3d` resolve under nohup/cron)
  - `detect_buyer_runtime <runner>`: sets `BOB_AGENT_{NS,DEPLOY,CONTAINER,SERVICE,REMOTE_PORT,LABEL,RUNTIME}` based on which agent namespace exists (Hermes or OpenClaw)
  - Receipt helpers (`find_usdc_transfer`, `archive_receipt`, `wait_usdc_transfer_receipt`, `receipt_status_ok`, `extract_tx_hash`): generic ERC-20, the USDC name is historical
  - `ensure_image_in_k3d <image> <cluster>`: `docker save` + `ctr -n k8s.io images import` fallback for hosts where the registry mirror stalls (we hit this on aarch64 with cloudflared:2026.1.2)
- `flows/flow-11-dual-stack.sh`: 4 hardening fixes (anchored `.env` grep, runtime-aware buyer vars, drop natural-language assertions on agent responses, numeric-only Agent ID extraction) + step-28 changed to poll the API-server container's `ready=true` instead of pod-summary STATUS.
- `flows/flow-08-buy.sh`: captures `BUY_START_BLOCK`, emits `PAID_AMOUNT_USDC` from the signing Python, archives the on-chain settlement receipt via `wait_usdc_transfer_receipt`. Balance delta kept as defense in depth.
- `flows/flow-12-obol-payment.sh`: pipes `go test` output through `tee`, parses `FLOW12_*` markers, writes `receipt-summary.json` mirroring flow-11.
- `flows/README.md`: flow inventory + "Running a flow detached over SSH" section pointing at `run-detached.sh`.

Skill

- `.agents/skills/obol-stack-dev/SKILL.md`: adds a 120-line "Running Flows on Remote Hosts" section distilling everything that bit us this session: tmux launcher, distroless probe pattern, multi-container readiness check, anvil `--host 0.0.0.0`, eRPC route pinning, x402-rs scheme config (no `permit2` scheme), cloudflared lazy deploy, `obol sell http --namespace` overload, ERC-8004 registration prereqs, and the setMetadata revert recipe (see below).

Issues found and fixed
Each issue surfaced through real test runs on two aarch64 Linux test hosts (testbed-A and testbed-B). Both run docker + k3d 5.8 + foundry. Fixes are in this PR unless flagged otherwise.
| Issue | Fix |
|---|---|
| `.env` grep matched comment lines containing `REMOTE_SIGNER_PRIVATE_KEY` → multi-line value to `cast wallet address` → parse error | `^[[:space:]]*REMOTE_SIGNER_PRIVATE_KEY=` + `cut -d= -f2-` |
| … `~/.foundry/bin` / `~/.local/bin`; `cast`/`kubectl` not found | `flows/lib.sh` exports canonical PATH at source time |
| … `openclaw-obol-agent` namespace + `app.kubernetes.io/name=openclaw` label; broke on Hermes runtime (#381) | `detect_buyer_runtime` helper + step 28 poll specific container's `ready=true` |
| … `purchase complete\|...`) didn't match Hermes wording → false FAIL | … `Ready=True` poll as structural proof |
| `awk '{print $3}'` captured `(not` from "Agent ID: (not yet registered)" → Python int() crash | … `^[0-9]+$` validation |
| … `obol sell http` silently tolerated → ServiceOffer registers with empty tunnel URL → controller stuck in `AwaitingExternalRegistration` forever | `tunnel.WaitReady` 5-min budget; called BEFORE `kubectlApplyOutput` on the registered path; failure is fatal |
| … | … `wait_usdc_transfer_receipt` archival |
| … | … `FLOW12_*` markers; shell parses + writes `receipt-summary.json` |
| `hermes-dashboard` container CrashLoopBackOff in dev clusters: "No user allowlists configured" → exit 1 → pod `Ready=False` → port-forward fails | `GATEWAY_ALLOW_ALL_USERS=true` on the dashboard container only, with inline justification |
| … | `ensure_image_in_k3d` helper in `flows/lib.sh`: docker save + ctr import as a fallback |
| … | `flows/run-detached.sh` launcher: tmux → screen → setsid; documented as the canonical entrypoint |

Two additional brittlenesses we noticed but did not block on:
- … `kubectl rollout status` for the Hermes deployment instead of polling.
- `cast send` text-output format drifted between foundry releases so regex-based tx-hash extraction was unreliable for the OBOL mint step. Switched to `cast send --json` + `jq`-style Python parser.

setMetadata revert investigation (live Base Sepolia)

A colleague hit `! failed to set x402 metadata: erc8004: setMetadata tx: execution reverted` on agent ID 5196, wallet `0x2FbFe6cF…`. Their on-chain analysis:

- `eth_getTransactionCount` for the wallet went 0→1 after the run → only the `register` tx broadcast.
- … `setMetadata`.
- … `eth_estimateGas`/`eth_call`), not on-chain.

We reproduced the revert and decoded it:
So the simulation hit state where token 5196 didn't yet exist, even though it WAS minted on-chain (verified: `ownerOf(5196) → 0x2FbFe6cF…` on live Base Sepolia today).

Most plausible cause: a stale eRPC `base-sepolia` route pinned to a parallel/dead Anvil fork from an earlier flow-12/13 run that didn't unwind. The CLI's `obol sell register` path:

- … `register` through the chain's WRITE upstream (lands on live Base Sepolia)
- … `setMetadata` via the chain's READ upstream

If READ is pinned to a fork (live or dead), the fork has its own ERC-721 storage where 5196 was never minted → `ERC721NonexistentToken` revert. The answer to the colleague's question ("could a concurrent Anvil fork cause this?") is yes, transitively, via a leftover `obol network add base-sepolia --endpoint http://...:anvil-port --allow-writes`. (A killed Anvil would surface as RPC-connect error, not a revert.)

Proposed fix (separate PR, not in this one):

- … `Register`, the CLI should `bind.WaitMined` the receipt and re-pin the next call's block to `latest` so the simulation gets a fresh fetch.
- `obol network status` should warn when a custom upstream is unreachable (would have surfaced the leftover fork pin).
- … `obol network remove base-sepolia` as a teardown step in flow-12/13 cleanup traps. (flow-13's trap already kills its anvil + facilitator; the eRPC-side route pin lingers in the cluster.)
obol-stack-dev/SKILL.mdso future contributors can short-circuit diagnosis with a singlecast call --from <signer>reproduction.How to reproduce
On any aarch64 Linux host with docker, foundry (cast/anvil/forge), kubectl, helm, helmfile, k3d, ollama, and a
.envwithREMOTE_SIGNER_PRIVATE_KEY(funded on Base Sepolia: ~0.05 ETH for gas, ~5 USDC for buyer):Receipt artifacts land under
.tmp/flow-{11,13}-<timestamp>/*.jsonwith areceipt-summary.jsonindex.Out of scope / follow-ups
bind.Transactblock-pinning — separate PR with theWaitMined+ re-resolve fix described above.cast sendtext-format brittleness inflows/lib.sh::extract_tx_hash— flow-13 now uses--json; flow-11 still uses the regex path oncast sendoutput and could be migrated for symmetry.flow-13-registered-obol.shthat includes registration onceobol sell httpexposes the OBOL Permit2 asset metadata flags.Test plan
go build ./...cleango test ./cmd/obol/... ./internal/tunnel/... ./internal/serviceoffercontroller/... ./internal/stack/... ./internal/hermes/...cleanbash -non all flow scripts.env)