feat(executor): install agent CLI plugins into the executor sandbox#3
Open
nickwinder wants to merge 5 commits into
Open
feat(executor): install agent CLI plugins into the executor sandbox#3nickwinder wants to merge 5 commits into
nickwinder wants to merge 5 commits into
Conversation
Adds an `executorPlugins` config field that seeds Claude Code / Codex /
Gemini plugins into the executor's agent CLI before each run. Lets you
A/B-test whether shipping a plugin (skills, slash commands, marketplace
bundles) measurably improves an agent's ability to use the SDK under
evaluation.
Plugins are installed in the executor sandbox only — the judge sandbox
stays plugin-free so its scoring is independent of the executor's
tooling. Run the same suite twice (once without `executorPlugins`,
once with) and compare per-test-case judge scores.
Per-adapter install layout (each adapter owns its own logic, matching
how `installCommand` and `extractLog` already work):
- claude: extracts the plugin dir to `\$HOME/.claude/plugins/<name>/`
and loads it for each session via the documented `--plugin-dir`
CLI flag. Requires `.claude-plugin/plugin.json` at the plugin root.
Marketplace registration is intentionally skipped — Claude Code's
marketplace flow prompts for trust acceptance, which can't be
answered in `--print` mode.
- codex: walks the plugin's `skills/` subtree and extracts each
`SKILL.md`-bearing dir to `\$CODEX_HOME/skills/<name>/`
(auto-discovered). Requires `.codex-plugin/plugin.json` and rejects
duplicate skill names across plugins.
- gemini: extracts the whole plugin dir to
`\$HOME/.gemini/extensions/<name>/`. Requires `gemini-extension.json`
at the plugin root.
- custom: throws a clear error — silently no-op'ing would make an
A/B run meaningless.
Plumbing:
- Extracted shared `uploadDirToSandbox()` helper in `scaffolding.ts`;
`uploadSources()` now uses it too.
- New `resolveExecutorPlugins()` reuses the existing source resolver
so git-sourced plugins share the same `cache/repos/` cache.
- Wired between agent-CLI install and agent run in `execute.ts`.
Failures write `plugin-install-error.log` to the run dir.
Validation: plugin names are slug-safe (letters/digits/\`.\`/\`_\`/\`-\`),
unique within the array, and each entry's discriminator-required field
(\`path\` for local, \`url\` for git) is checked at config-load time.
Adapter-compatibility is enforced at install time, not load time, since
support depends on the CLI mode the executor actually runs in.
Also fixes a pre-existing framework bug surfaced along the way: the
Claude adapter's \`spawnWithSchema\` now wraps non-object JSON schemas
under a single \`result\` property and unwraps
\`envelope.structured_output.result\` before returning. Claude's
\`--json-schema\` rejects top-level non-object schemas; same envelope
dance the Codex adapter already does.
Docs: README + \`skills/_reference/config-schema.md\` document the new
field and the cross-runtime support matrix. Type-check + lint clean;
348 tests pass.
…p work - Validate manifests and upload plugin dirs concurrently across plugins - Drop pre-emptive `mkdir -p` calls: uploadDirToSandbox already mkdirs the destination as part of the tar-extract command - Hardcode `/root/.codex/skills` and `/root/.gemini/extensions` (matching ClaudeAdapter); removes a per-install `printf HOME` RPC roundtrip in each - Collapse resolveExecutorPlugins to a single SourceConfig-mapping branch - Rename the `stat as fsStat` import alias back to `stat` by renaming the local var (skillStat → md) - Trim narrative block comments
…kill collision - Restore runtime $HOME / $CODEX_HOME probe in each adapter so install destinations match the documented `$HOME/.claude/plugins/<name>`, `$CODEX_HOME/skills/<name>`, `$HOME/.gemini/extensions/<name>` layout. Pre-cleanup, codex/gemini already used the probe; the cleanup hardcoded /root for them and applied the same hardcoding to claude. This restores $HOME derivation for all three with a `/root` shell-side fallback. - Codex: replace the seenSkillNames Set with a Map tracking the originating plugin per skill name; on collision the error now names both plugins instead of only the latest. Reviewer-flagged.
Some base images (e.g. node:20-slim) export HOME=/ for the root user.
Deriving the plugin install dir straight from $HOME then lands plugins
in top-level dot-dirs (/.claude/plugins, /.codex/skills,
/.gemini/extensions). Treat a probed $HOME of '/' (or empty) as
degenerate and fall back to /root, so plugins install under
/root/.{claude,codex,gemini} as expected. Codex now probes $CODEX_HOME
and $HOME separately so $CODEX_HOME is still honoured when set.
Verified end-to-end against microsandbox node:20-slim VMs.
…o-discovery Reverts the /root fallback for degenerate $HOME. Verification on a node:20-slim microsandbox VM showed both codex and gemini resolve their home from the $HOME env var (which the image sets to '/' for root), and auto-discover skills/extensions from there — codex created /.codex, gemini reported '/.gemini/settings.json'. The /root fallback installed plugins where those CLIs never look, silently breaking discovery. The correct invariant: for an auto-discovery adapter, the install dir must equal the CLI's $HOME resolution. Since both the install probe and the agent CLI run in the same VM, deriving the install dir purely from $HOME guarantees they match — on any image. claude is unaffected either way (it loads plugins via an explicit --plugin-dir flag) but is reverted too so all three adapters stay consistent.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
Want to A/B-test whether shipping a plugin (skills, slash commands, marketplace bundles) measurably improves an agent's ability to use the SDK under evaluation. Same suite, two runs — one without
executorPlugins, one with — then compare per-test-case judge scores.Summary
executorPluginsconfig field seeds Claude Code / Codex / Gemini plugins into the executor's agent CLI inside its sandbox before each run.installCommandandextractLogalready work):claude— extracts the plugin dir to$HOME/.claude/plugins/<name>/and loads it via the documented--plugin-dir <path>CLI flag at each invocation. Marketplace registration is intentionally skipped (the marketplace trust prompt can't be answered in--printmode). Requires.claude-plugin/plugin.jsonat the plugin root.codex— walks the plugin'sskills/subtree, extracts eachSKILL.md-bearing dir to$CODEX_HOME/skills/<name>/(auto-discovered). Requires.codex-plugin/plugin.json; rejects duplicate skill names across plugins.gemini— extracts the whole plugin dir to$HOME/.gemini/extensions/<name>/. Requiresgemini-extension.jsonat the plugin root.custom— throws a clear install-time error; silently no-op'ing would make an A/B run meaningless.pathforlocal,urlforgit) checked at config load time. Adapter-compatibility is enforced at install time (depends on the CLI mode the executor actually runs in).uploadDirToSandbox()helper inscaffolding.tsand reused it foruploadSources(); newresolveExecutorPlugins()reuses the existing source resolver so git-sourced plugins share the samecache/repos/cache. Failures writeplugin-install-error.logto the run dir.spawnWithSchemanow wraps non-object JSON schemas under a singleresultproperty and unwrapsenvelope.structured_output.resultbefore returning. Claude's--json-schemarejects top-level non-object schemas — same envelope dance the Codex adapter already does.Docs: README +
skills/_reference/config-schema.mddocument the new field and the cross-runtime support matrix.Smoke test
Verified end-to-end against real
microsandboxVMs (node:20-slim), beyond the unit suite.Plugin install matrix — booted sandbox VMs and ran
installPluginsInSandboxfor every adapter × plugin-source combination, inspecting the VM filesystem afterwards:file://)claude--plugin-dir--plugin-diremittedcodexgeminicustomError / validation cases — all reject as expected: missing
.claude-plugin/plugin.json, missingskills/directory for codex, missinggemini-extension.json, duplicate plugin names, and duplicate skill names across codex plugins (the error names both contributing plugins).End-to-end
executeruns — full pipeline against a sample test case with real agent CLIs:claudeexecutor + local plugin → exit 0;agent-cmd.logconfirms the plugin was wired into the invocation via--plugin-dir; solution produced; noplugin-install-error.log.codexexecutor + local plugin → exit 0; skills installed into the auto-discovered dir; solution produced.Design point verified by the smoke test:
node:20-slimexportsHOME=/for the root user, so on that image plugins install under/.claude/plugins,/.codex/skills,/.gemini/extensions. This is correct, not a bug. Codex and gemini auto-discover from their$HOME-relative directories — verified in-VM: codex creates/.codex, and the gemini CLI itself reports/.gemini/settings.json. The install must therefore resolve$HOMEexactly the way the CLI does; deriving the install dir purely from$HOMEguarantees it always matches the CLI's discovery dir, on any image. (claude is decoupled from this — it loads plugins via an explicit--plugin-dirflag.)348 tests pass; type-check + lint clean.