Skip to content

feat(executor): install agent CLI plugins into the executor sandbox#3

Open
nickwinder wants to merge 5 commits into
mainfrom
nick/agentic-usability/executor-cli-plugins
Open

feat(executor): install agent CLI plugins into the executor sandbox#3
nickwinder wants to merge 5 commits into
mainfrom
nick/agentic-usability/executor-cli-plugins

Conversation

@nickwinder
Copy link
Copy Markdown
Contributor

@nickwinder nickwinder commented May 14, 2026

Why

Want to A/B-test whether shipping a plugin (skills, slash commands, marketplace bundles) measurably improves an agent's ability to use the SDK under evaluation. Same suite, two runs — one without executorPlugins, one with — then compare per-test-case judge scores.

Summary

  • New executorPlugins config field seeds Claude Code / Codex / Gemini plugins into the executor's agent CLI inside its sandbox before each run.
  • Plugins are installed in the executor sandbox only — the judge sandbox stays plugin-free so its scoring is independent of the executor's tooling.
  • Per-adapter install layout (each adapter owns its own logic, matching how installCommand and extractLog already work):
    • claude — extracts the plugin dir to $HOME/.claude/plugins/<name>/ and loads it via the documented --plugin-dir <path> CLI flag at each invocation. Marketplace registration is intentionally skipped (the marketplace trust prompt can't be answered in --print mode). Requires .claude-plugin/plugin.json at the plugin root.
    • codex — walks the plugin's skills/ subtree, extracts each SKILL.md-bearing dir to $CODEX_HOME/skills/<name>/ (auto-discovered). Requires .codex-plugin/plugin.json; rejects duplicate skill names across plugins.
    • gemini — extracts the whole plugin dir to $HOME/.gemini/extensions/<name>/. Requires gemini-extension.json at the plugin root.
    • custom — throws a clear install-time error; silently no-op'ing would make an A/B run meaningless.
  • Validation: plugin names are slug-safe and unique within the array; discriminator-required fields (path for local, url for git) checked at config load time. Adapter-compatibility is enforced at install time (depends on the CLI mode the executor actually runs in).
  • Plumbing: extracted shared uploadDirToSandbox() helper in scaffolding.ts and reused it for uploadSources(); new resolveExecutorPlugins() reuses the existing source resolver so git-sourced plugins share the same cache/repos/ cache. Failures write plugin-install-error.log to the run dir.
  • Also fixes a pre-existing framework bug surfaced along the way: the Claude adapter's spawnWithSchema now wraps non-object JSON schemas under a single result property and unwraps envelope.structured_output.result before returning. Claude's --json-schema rejects top-level non-object schemas — same envelope dance the Codex adapter already does.

Docs: README + skills/_reference/config-schema.md document the new field and the cross-runtime support matrix.

Smoke test

Verified end-to-end against real microsandbox VMs (node:20-slim), beyond the unit suite.

Plugin install matrix — booted sandbox VMs and ran installPluginsInSandbox for every adapter × plugin-source combination, inspecting the VM filesystem afterwards:

Adapter no plugins local git (file://) 2× parallel
claude ✅ no --plugin-dir ✅ extracted + --plugin-dir emitted ✅ host-side clone → install ✅ both, 2 flags
codex ✅ multi-skill extract ✅ clone → skill install ✅ parallel
gemini ✅ extension installed ✅ clone → install
custom ✅ rejects with a clear "does not support executorPlugins" error

Error / validation cases — all reject as expected: missing .claude-plugin/plugin.json, missing skills/ directory for codex, missing gemini-extension.json, duplicate plugin names, and duplicate skill names across codex plugins (the error names both contributing plugins).

End-to-end execute runs — full pipeline against a sample test case with real agent CLIs:

  • claude executor + local plugin → exit 0; agent-cmd.log confirms the plugin was wired into the invocation via --plugin-dir; solution produced; no plugin-install-error.log.
  • codex executor + local plugin → exit 0; skills installed into the auto-discovered dir; solution produced.

Design point verified by the smoke test: node:20-slim exports HOME=/ for the root user, so on that image plugins install under /.claude/plugins, /.codex/skills, /.gemini/extensions. This is correct, not a bug. Codex and gemini auto-discover from their $HOME-relative directories — verified in-VM: codex creates /.codex, and the gemini CLI itself reports /.gemini/settings.json. The install must therefore resolve $HOME exactly the way the CLI does; deriving the install dir purely from $HOME guarantees it always matches the CLI's discovery dir, on any image. (claude is decoupled from this — it loads plugins via an explicit --plugin-dir flag.)

348 tests pass; type-check + lint clean.

Adds an `executorPlugins` config field that seeds Claude Code / Codex /
Gemini plugins into the executor's agent CLI before each run. Lets you
A/B-test whether shipping a plugin (skills, slash commands, marketplace
bundles) measurably improves an agent's ability to use the SDK under
evaluation.

Plugins are installed in the executor sandbox only — the judge sandbox
stays plugin-free so its scoring is independent of the executor's
tooling. Run the same suite twice (once without `executorPlugins`,
once with) and compare per-test-case judge scores.

Per-adapter install layout (each adapter owns its own logic, matching
how `installCommand` and `extractLog` already work):

  - claude: extracts the plugin dir to `\$HOME/.claude/plugins/<name>/`
    and loads it for each session via the documented `--plugin-dir`
    CLI flag. Requires `.claude-plugin/plugin.json` at the plugin root.
    Marketplace registration is intentionally skipped — Claude Code's
    marketplace flow prompts for trust acceptance, which can't be
    answered in `--print` mode.
  - codex: walks the plugin's `skills/` subtree and extracts each
    `SKILL.md`-bearing dir to `\$CODEX_HOME/skills/<name>/`
    (auto-discovered). Requires `.codex-plugin/plugin.json` and rejects
    duplicate skill names across plugins.
  - gemini: extracts the whole plugin dir to
    `\$HOME/.gemini/extensions/<name>/`. Requires `gemini-extension.json`
    at the plugin root.
  - custom: throws a clear error — silently no-op'ing would make an
    A/B run meaningless.

Plumbing:
  - Extracted shared `uploadDirToSandbox()` helper in `scaffolding.ts`;
    `uploadSources()` now uses it too.
  - New `resolveExecutorPlugins()` reuses the existing source resolver
    so git-sourced plugins share the same `cache/repos/` cache.
  - Wired between agent-CLI install and agent run in `execute.ts`.
    Failures write `plugin-install-error.log` to the run dir.

Validation: plugin names are slug-safe (letters/digits/\`.\`/\`_\`/\`-\`),
unique within the array, and each entry's discriminator-required field
(\`path\` for local, \`url\` for git) is checked at config-load time.
Adapter-compatibility is enforced at install time, not load time, since
support depends on the CLI mode the executor actually runs in.

Also fixes a pre-existing framework bug surfaced along the way: the
Claude adapter's \`spawnWithSchema\` now wraps non-object JSON schemas
under a single \`result\` property and unwraps
\`envelope.structured_output.result\` before returning. Claude's
\`--json-schema\` rejects top-level non-object schemas; same envelope
dance the Codex adapter already does.

Docs: README + \`skills/_reference/config-schema.md\` document the new
field and the cross-runtime support matrix. Type-check + lint clean;
348 tests pass.
@nickwinder nickwinder self-assigned this May 14, 2026
…p work

- Validate manifests and upload plugin dirs concurrently across plugins
- Drop pre-emptive `mkdir -p` calls: uploadDirToSandbox already mkdirs the
  destination as part of the tar-extract command
- Hardcode `/root/.codex/skills` and `/root/.gemini/extensions` (matching
  ClaudeAdapter); removes a per-install `printf HOME` RPC roundtrip in each
- Collapse resolveExecutorPlugins to a single SourceConfig-mapping branch
- Rename the `stat as fsStat` import alias back to `stat` by renaming the
  local var (skillStat → md)
- Trim narrative block comments
…kill collision

- Restore runtime $HOME / $CODEX_HOME probe in each adapter so install
  destinations match the documented `$HOME/.claude/plugins/<name>`,
  `$CODEX_HOME/skills/<name>`, `$HOME/.gemini/extensions/<name>` layout.
  Pre-cleanup, codex/gemini already used the probe; the cleanup hardcoded
  /root for them and applied the same hardcoding to claude. This restores
  $HOME derivation for all three with a `/root` shell-side fallback.
- Codex: replace the seenSkillNames Set with a Map tracking the
  originating plugin per skill name; on collision the error now names
  both plugins instead of only the latest. Reviewer-flagged.
Some base images (e.g. node:20-slim) export HOME=/ for the root user.
Deriving the plugin install dir straight from $HOME then lands plugins
in top-level dot-dirs (/.claude/plugins, /.codex/skills,
/.gemini/extensions). Treat a probed $HOME of '/' (or empty) as
degenerate and fall back to /root, so plugins install under
/root/.{claude,codex,gemini} as expected. Codex now probes $CODEX_HOME
and $HOME separately so $CODEX_HOME is still honoured when set.

Verified end-to-end against microsandbox node:20-slim VMs.
…o-discovery

Reverts the /root fallback for degenerate $HOME. Verification on a
node:20-slim microsandbox VM showed both codex and gemini resolve their
home from the $HOME env var (which the image sets to '/' for root), and
auto-discover skills/extensions from there — codex created /.codex,
gemini reported '/.gemini/settings.json'. The /root fallback installed
plugins where those CLIs never look, silently breaking discovery.

The correct invariant: for an auto-discovery adapter, the install dir
must equal the CLI's $HOME resolution. Since both the install probe and
the agent CLI run in the same VM, deriving the install dir purely from
$HOME guarantees they match — on any image. claude is unaffected either
way (it loads plugins via an explicit --plugin-dir flag) but is reverted
too so all three adapters stay consistent.
@nickwinder nickwinder requested a review from HungKNguyen May 15, 2026 06:56
@nickwinder nickwinder marked this pull request as ready for review May 15, 2026 06:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant