feat(executor): install agent CLI plugins into the executor sandbox by nickwinder · Pull Request #3 · PSPDFKit-labs/agentic-usability

nickwinder · 2026-05-14T20:51:27Z

Why

Want to A/B-test whether shipping a plugin (skills, slash commands, marketplace bundles) measurably improves an agent's ability to use the SDK under evaluation. Same suite, two runs — one without executorPlugins, one with — then compare per-test-case judge scores.

Summary

New executorPlugins config field seeds Claude Code / Codex / Gemini plugins into the executor's agent CLI inside its sandbox before each run.
Plugins are installed in the executor sandbox only — the judge sandbox stays plugin-free so its scoring is independent of the executor's tooling.
Per-adapter install layout (each adapter owns its own logic, matching how installCommand and extractLog already work):
- claude — extracts the plugin dir to $HOME/.claude/plugins/<name>/ and loads it via the documented --plugin-dir <path> CLI flag at each invocation. Marketplace registration is intentionally skipped (the marketplace trust prompt can't be answered in --print mode). Requires .claude-plugin/plugin.json at the plugin root.
- codex — walks the plugin's skills/ subtree, extracts each SKILL.md-bearing dir to $CODEX_HOME/skills/<name>/ (auto-discovered). Requires .codex-plugin/plugin.json; rejects duplicate skill names across plugins.
- gemini — extracts the whole plugin dir to $HOME/.gemini/extensions/<name>/. Requires gemini-extension.json at the plugin root.
- custom — throws a clear install-time error; silently no-op'ing would make an A/B run meaningless.
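The per-adapter manifest requirements above could be checked along these lines. This is a minimal sketch, not the PR's code: `checkPlugin`, `REQUIRED_MANIFEST`, and the `files` set are hypothetical names, and only the manifest paths and the `custom` rejection come from the description.

```typescript
// Hypothetical helper; only the manifest paths are taken from the PR description.
type Adapter = "claude" | "codex" | "gemini" | "custom";

// Manifest each supported adapter requires at the plugin root.
const REQUIRED_MANIFEST: Record<Exclude<Adapter, "custom">, string> = {
  claude: ".claude-plugin/plugin.json",
  codex: ".codex-plugin/plugin.json",
  gemini: "gemini-extension.json",
};

// `files` holds POSIX-style paths relative to the plugin root.
// Throws on an unsupported adapter or a missing manifest; returns nothing otherwise.
function checkPlugin(adapter: Adapter, pluginName: string, files: ReadonlySet<string>): void {
  if (adapter === "custom") {
    // Fail loudly: a silent no-op would invalidate the A/B comparison.
    throw new Error(`adapter "custom" does not support executorPlugins`);
  }
  const manifest = REQUIRED_MANIFEST[adapter];
  if (!files.has(manifest)) {
    throw new Error(`plugin "${pluginName}" is missing ${manifest}`);
  }
}
```

Throwing rather than returning a flag keeps the `custom` case from being ignored by a caller that forgets to inspect the result.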
Validation: plugin names are slug-safe and unique within the array; discriminator-required fields (path for local, url for git) checked at config load time. Adapter-compatibility is enforced at install time (depends on the CLI mode the executor actually runs in).
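A config-load check of this kind might look like the sketch below. The function name, the error wording, and the `type` discriminator field are assumptions for illustration; the slug character set, the uniqueness rule, and the `path`/`url` requirement are the parts taken from this PR.

```typescript
// Slug rule: letters, digits, `.`, `_`, `-`.
const SLUG = /^[A-Za-z0-9._-]+$/;

// Returns a list of problems; an empty list means the array is valid.
// The `type` discriminator name is an assumption; `path` and `url` come from the PR text.
function validateExecutorPlugins(plugins: unknown[]): string[] {
  const errors: string[] = [];
  const seen = new Set<string>();

  plugins.forEach((raw, index) => {
    if (typeof raw !== "object" || raw === null) {
      errors.push(`executorPlugins[${index}]: expected an object`);
      return;
    }
    const entry = raw as Record<string, unknown>;
    const name = entry.name;
    if (typeof name !== "string" || !SLUG.test(name)) {
      errors.push(`executorPlugins[${index}]: name must be slug-safe`);
      return;
    }
    if (seen.has(name)) {
      errors.push(`executorPlugins[${index}]: duplicate name "${name}"`);
    }
    seen.add(name);

    // Each source kind has exactly one required field.
    const required = entry.type === "local" ? "path" : entry.type === "git" ? "url" : null;
    if (required === null) {
      errors.push(`executorPlugins[${index}]: unknown type`);
    } else if (typeof entry[required] !== "string" || entry[required] === "") {
      errors.push(`executorPlugins[${index}]: "${required}" is required for ${String(entry.type)} plugins`);
    }
  });
  return errors;
}
```

Collecting every problem instead of stopping at the first lets a config author fix the whole array in one pass. Adapter compatibility is deliberately absent here, since the PR enforces it at install time.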
Plumbing: extracted shared uploadDirToSandbox() helper in scaffolding.ts and reused it for uploadSources(); new resolveExecutorPlugins() reuses the existing source resolver so git-sourced plugins share the same cache/repos/ cache. Failures write plugin-install-error.log to the run dir.
Also fixes a pre-existing framework bug surfaced along the way: the Claude adapter's spawnWithSchema now wraps non-object JSON schemas under a single result property and unwraps envelope.structured_output.result before returning. Claude's --json-schema rejects top-level non-object schemas — same envelope dance the Codex adapter already does.
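The envelope handling described above can be sketched roughly as follows. `wrapSchema`, `unwrapResult`, and the `JsonSchema` type are hypothetical names rather than the adapter's actual code; only the `result` property and the `structured_output.result` path come from the description.

```typescript
type JsonSchema = { type?: string; [key: string]: unknown };

// Wrap any non-object schema under a single `result` property so the
// top-level schema is always an object.
function wrapSchema(schema: JsonSchema): { schema: JsonSchema; wrapped: boolean } {
  if (schema.type === "object") {
    return { schema, wrapped: false };
  }
  return {
    schema: {
      type: "object",
      properties: { result: schema },
      required: ["result"],
    },
    wrapped: true,
  };
}

// Undo the wrapping on the way out so callers see the value they asked for.
function unwrapResult(envelope: { structured_output?: unknown }, wrapped: boolean): unknown {
  const output = envelope.structured_output;
  if (!wrapped) {
    return output;
  }
  if (typeof output !== "object" || output === null || !("result" in output)) {
    throw new Error("structured_output is missing the `result` property");
  }
  return (output as { result: unknown }).result;
}
```

Tracking `wrapped` explicitly avoids mistaking a caller's own object schema that happens to have a `result` property for the envelope.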

Docs: README + skills/_reference/config-schema.md document the new field and the cross-runtime support matrix.

Smoke test

Verified end-to-end against real microsandbox VMs (node:20-slim), beyond the unit suite.

Plugin install matrix — booted sandbox VMs and ran installPluginsInSandbox for every adapter × plugin-source combination, inspecting the VM filesystem afterwards:

| Adapter | no plugins | local | git (`file://`) | 2× parallel |
| --- | --- | --- | --- | --- |
| `claude` | ✅ no `--plugin-dir` | ✅ extracted + `--plugin-dir` emitted | ✅ host-side clone → install | ✅ both, 2 flags |
| `codex` | — | ✅ multi-skill extract | ✅ clone → skill install | ✅ parallel |
| `gemini` | — | ✅ extension installed | ✅ clone → install | — |
| `custom` | ✅ rejects with a clear "does not support executorPlugins" error | | | |

Error / validation cases — all reject as expected: missing .claude-plugin/plugin.json, missing skills/ directory for codex, missing gemini-extension.json, duplicate plugin names, and duplicate skill names across codex plugins (the error names both contributing plugins).

End-to-end execute runs — full pipeline against a sample test case with real agent CLIs:

claude executor + local plugin → exit 0; agent-cmd.log confirms the plugin was wired into the invocation via --plugin-dir; solution produced; no plugin-install-error.log.
codex executor + local plugin → exit 0; skills installed into the auto-discovered dir; solution produced.

Design point verified by the smoke test: node:20-slim exports HOME=/ for the root user, so on that image plugins install under /.claude/plugins, /.codex/skills, /.gemini/extensions. This is correct, not a bug. Codex and gemini auto-discover from their $HOME-relative directories — verified in-VM: codex creates /.codex, and the gemini CLI itself reports /.gemini/settings.json. The install must therefore resolve $HOME exactly the way the CLI does; deriving the install dir purely from $HOME guarantees it always matches the CLI's discovery dir, on any image. (claude is decoupled from this — it loads plugins via an explicit --plugin-dir flag.)
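As an illustration of that invariant, a pure-`$HOME` derivation could look like the sketch below. `installDir` and its `env` parameter are hypothetical, and the assumption that `$CODEX_HOME` defaults to `$HOME/.codex` when unset is inferred from the in-VM observation above rather than confirmed against the Codex CLI.

```typescript
import { posix } from "node:path";

type PluginAdapter = "claude" | "codex" | "gemini";

// Derive the install directory purely from the sandbox's own environment,
// with no fallback, so it matches wherever the CLI itself looks.
function installDir(
  adapter: PluginAdapter,
  name: string,
  env: { HOME: string; CODEX_HOME?: string },
): string {
  switch (adapter) {
    case "claude":
      return posix.join(env.HOME, ".claude", "plugins", name);
    case "codex":
      // Assumption: CODEX_HOME defaults to $HOME/.codex when unset.
      return posix.join(env.CODEX_HOME ?? posix.join(env.HOME, ".codex"), "skills", name);
    case "gemini":
      return posix.join(env.HOME, ".gemini", "extensions", name);
  }
}

// On an image that exports HOME=/ the result is a top-level dot-dir.
console.log(installDir("gemini", "demo", { HOME: "/" })); // prints "/.gemini/extensions/demo"
```

The same call with `HOME` set to `/root` yields `/root/.gemini/extensions/demo`, so no image-specific branch is needed.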

348 tests pass; type-check + lint clean.

Adds an `executorPlugins` config field that seeds Claude Code / Codex / Gemini plugins into the executor's agent CLI before each run. Lets you A/B-test whether shipping a plugin (skills, slash commands, marketplace bundles) measurably improves an agent's ability to use the SDK under evaluation.

Plugins are installed in the executor sandbox only; the judge sandbox stays plugin-free so its scoring is independent of the executor's tooling. Run the same suite twice (once without `executorPlugins`, once with) and compare per-test-case judge scores.

Per-adapter install layout (each adapter owns its own logic, matching how `installCommand` and `extractLog` already work):
- claude: extracts the plugin dir to `$HOME/.claude/plugins/<name>/` and loads it for each session via the documented `--plugin-dir` CLI flag. Requires `.claude-plugin/plugin.json` at the plugin root. Marketplace registration is intentionally skipped: Claude Code's marketplace flow prompts for trust acceptance, which can't be answered in `--print` mode.
- codex: walks the plugin's `skills/` subtree and extracts each `SKILL.md`-bearing dir to `$CODEX_HOME/skills/<name>/` (auto-discovered). Requires `.codex-plugin/plugin.json` and rejects duplicate skill names across plugins.
- gemini: extracts the whole plugin dir to `$HOME/.gemini/extensions/<name>/`. Requires `gemini-extension.json` at the plugin root.
- custom: throws a clear error, since silently no-op'ing would make an A/B run meaningless.

Plumbing:
- Extracted shared `uploadDirToSandbox()` helper in `scaffolding.ts`; `uploadSources()` now uses it too.
- New `resolveExecutorPlugins()` reuses the existing source resolver so git-sourced plugins share the same `cache/repos/` cache.
- Wired between agent-CLI install and agent run in `execute.ts`. Failures write `plugin-install-error.log` to the run dir.

Validation: plugin names are slug-safe (letters/digits/`.`/`_`/`-`), unique within the array, and each entry's discriminator-required field (`path` for local, `url` for git) is checked at config-load time. Adapter-compatibility is enforced at install time, not load time, since support depends on the CLI mode the executor actually runs in.

Also fixes a pre-existing framework bug surfaced along the way: the Claude adapter's `spawnWithSchema` now wraps non-object JSON schemas under a single `result` property and unwraps `envelope.structured_output.result` before returning. Claude's `--json-schema` rejects top-level non-object schemas; same envelope dance the Codex adapter already does.

Docs: README + `skills/_reference/config-schema.md` document the new field and the cross-runtime support matrix.

Type-check + lint clean; 348 tests pass.

…p work

- Validate manifests and upload plugin dirs concurrently across plugins
- Drop pre-emptive `mkdir -p` calls: uploadDirToSandbox already mkdirs the destination as part of the tar-extract command
- Hardcode `/root/.codex/skills` and `/root/.gemini/extensions` (matching ClaudeAdapter); removes a per-install `printf HOME` RPC roundtrip in each
- Collapse resolveExecutorPlugins to a single SourceConfig-mapping branch
- Rename the `stat as fsStat` import alias back to `stat` by renaming the local var (skillStat → md)
- Trim narrative block comments

…kill collision

- Restore runtime $HOME / $CODEX_HOME probe in each adapter so install destinations match the documented `$HOME/.claude/plugins/<name>`, `$CODEX_HOME/skills/<name>`, `$HOME/.gemini/extensions/<name>` layout. Pre-cleanup, codex/gemini already used the probe; the cleanup hardcoded /root for them and applied the same hardcoding to claude. This restores $HOME derivation for all three with a `/root` shell-side fallback.
- Codex: replace the seenSkillNames Set with a Map tracking the originating plugin per skill name; on collision the error now names both plugins instead of only the latest. Reviewer-flagged.
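The Set-to-Map change for skill collisions can be pictured with a small sketch. `registerSkills` and the input shape are hypothetical and simplified; the commit's real code walks the plugin's `skills/` subtree instead of taking a prepared list.

```typescript
// Maps each skill name to the plugin that first contributed it, so a
// collision error can name both plugins rather than only the latest one.
function registerSkills(plugins: { name: string; skills: string[] }[]): Map<string, string> {
  const ownerBySkill = new Map<string, string>();
  for (const plugin of plugins) {
    for (const skill of plugin.skills) {
      const firstOwner = ownerBySkill.get(skill);
      if (firstOwner !== undefined) {
        throw new Error(
          `duplicate skill "${skill}": provided by plugins "${firstOwner}" and "${plugin.name}"`,
        );
      }
      ownerBySkill.set(skill, plugin.name);
    }
  }
  return ownerBySkill;
}
```

A plain Set can only report that a name was seen before; storing the owner as the Map value is what makes the two-plugin error message possible.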

Some base images (e.g. node:20-slim) export HOME=/ for the root user. Deriving the plugin install dir straight from $HOME then lands plugins in top-level dot-dirs (/.claude/plugins, /.codex/skills, /.gemini/extensions). Treat a probed $HOME of '/' (or empty) as degenerate and fall back to /root, so plugins install under /root/.{claude,codex,gemini} as expected. Codex now probes $CODEX_HOME and $HOME separately so $CODEX_HOME is still honoured when set. Verified end-to-end against microsandbox node:20-slim VMs.

…o-discovery

Reverts the /root fallback for degenerate $HOME. Verification on a node:20-slim microsandbox VM showed both codex and gemini resolve their home from the $HOME env var (which the image sets to '/' for root), and auto-discover skills/extensions from there: codex created /.codex, gemini reported '/.gemini/settings.json'. The /root fallback installed plugins where those CLIs never look, silently breaking discovery.

The correct invariant: for an auto-discovery adapter, the install dir must equal the CLI's $HOME resolution. Since both the install probe and the agent CLI run in the same VM, deriving the install dir purely from $HOME guarantees they match on any image. claude is unaffected either way (it loads plugins via an explicit --plugin-dir flag) but is reverted too so all three adapters stay consistent.

nickwinder self-assigned this May 14, 2026

nickwinder added 4 commits May 15, 2026 12:58

nickwinder requested a review from HungKNguyen May 15, 2026 06:56

nickwinder marked this pull request as ready for review May 15, 2026 06:56

nickwinder wants to merge 5 commits into main from nick/agentic-usability/executor-cli-plugins
