Skip to content

test(e2e): add sandbox learn+recall test#233

Merged
visahak merged 8 commits intoAgentToolkit:mainfrom
vinodmut:test-e2e-sandbox-learn-recall
Apr 29, 2026
Merged

test(e2e): add sandbox learn+recall test#233
visahak merged 8 commits intoAgentToolkit:mainfrom
vinodmut:test-e2e-sandbox-learn-recall

Conversation

@vinodmut
Copy link
Copy Markdown
Contributor

@vinodmut vinodmut commented Apr 29, 2026

Relates to #186 — a regression test for the Claude Code plugin end-to-end: does the Stop hook extract a guideline from a failed session, and does recall use it in a follow-up?

Summary

  • tests/e2e/test_sandbox_learn_recall.py runs two Claude Code sessions inside the Docker sandbox back-to-back:

    1. Asks for EXIF metadata on a sample photo. The sandbox lacks exiftool and PIL, so Claude hits dead ends and recovers via Python stdlib. The Stop hook chain fires: save-trajectory writes the transcript, then learn extracts a guideline.
    2. Asks a similar follow-up question. UserPromptSubmit recall injects the guideline; Claude should skip the failed tools entirely.
  • Asserts session 1 produces a .md file under .evolve/entities/ and that session 2's bash commands do not invoke exiftool / PIL / piexif / exifread.

  • Uses logging so progress lines stream live under pytest --log-cli-level=INFO.

  • justfile — drop the obsolete trace=true / learn=true options from the claude-prompt target now that the Stop hooks handle transcript saving and guideline learning automatically.

  • sandbox/README.md — document how to run the E2E test, the env vars forwarded into the container, and an example env file.

Why

With the save-trajectory and learn hooks (PRs #229 and #230) both landed, the full learn + recall loop now works end-to-end. This test locks that behavior in so future plugin changes can't silently break it.

Test plan

just sandbox-build claude
export ANTHROPIC_API_KEY=sk-ant-...   # or source an env file
uv run pytest tests/e2e/test_sandbox_learn_recall.py --run-e2e -m e2e -v --log-cli-level=INFO
  • Passes locally (~3m 34s)
  • Skipped by default (marked @pytest.mark.e2e; pyproject.toml's addopts = "-m 'not llm and not e2e'" excludes it from CI)
  • Skips cleanly if Docker, the sandbox image, or ANTHROPIC_API_KEY is missing

Summary by CodeRabbit

  • Tests

    • Added an end-to-end test validating the learn→recall flow across two sandboxed sessions.
  • Documentation

    • Added sandbox test docs with setup steps, prerequisites, example env and run commands.
  • Chores

    • Updated security baseline metadata and recorded a non-secret scan entry.
    • Simplified sandbox run configuration by removing optional trace/learn flags.

Add tests/e2e/test_sandbox_learn_recall.py which runs two Claude Code
sessions in the Docker sandbox back-to-back: the first hits dead ends
on a photo-EXIF task (exiftool/PIL unavailable), so the Stop hook
extracts a guideline; the second asks a similar question, and recall
should inject the guideline and steer Claude away from the failed
tools. Uses pytest live logging for real-time progress.

Also simplify the claude-prompt justfile target — drop the obsolete
trace/learn options now that the Stop hooks handle transcript saving
and guideline learning automatically — and document how to run the
test in sandbox/README.md.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 29, 2026

Warning

Rate limit exceeded

@vinodmut has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 29 minutes and 52 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dafaecbd-aea8-4b5a-8d62-1225cbaa0bc7

📥 Commits

Reviewing files that changed from the base of the PR and between 8ed92eb and e7b5dfc.

📒 Files selected for processing (1)
  • tests/e2e/test_sandbox_learn_recall.py
📝 Walkthrough

Walkthrough

Adds an E2E test and docs for the evolve-lite learn→recall flow, removes trace/learn variables from the justfile and related conditional logic, and updates .secrets.baseline with a new secret-scan entry and timestamp.

Changes

Cohort / File(s) Summary
Configuration & Secrets
\.secrets.baseline
Updates baseline timestamp and records a new secret-scan entry for sandbox/README.md (not secret, not verified). Also adds trailing newline.
Build / CLI config
justfile
Removes trace and learn variables and strips conditional logic from the claude-prompt target; container invocation simplified to always run claude with SANDBOX_PROMPT.
Documentation
sandbox/README.md
Adds documentation for the new sandbox E2E test, prerequisites, example env, and commands to run the test with live logging; describes session 1 learn and session 2 recall behavior.
E2E Tests
tests/e2e/test_sandbox_learn_recall.py
Introduces a new E2E test with session-scoped fixtures that skip when Docker/image/credentials are unavailable; runs two sequential sandbox sessions, asserts guideline artifacts and transcripts from session 1, and asserts session 2 avoids banned EXIF-related tool invocations by inspecting JSONL transcripts.

Sequence Diagram(s)

sequenceDiagram
  participant Test as Test Runner (pytest)
  participant Docker as Docker Engine
  participant Claude as claude-sandbox Container
  participant FS as Workspace / .evolve filesystem
  Test->>Docker: start container (session 1) with mounts & env
  Docker->>Claude: run claude with SANDBOX_PROMPT (ask for photo location)
  Claude->>FS: write session1 transcript JSONL
  Claude->>FS: create .evolve/entities/ guideline.md
  FS-->>Docker: session1 artifacts available
  Test->>Docker: start container (session 2) with mounts & env
  Docker->>Claude: run claude with SANDBOX_PROMPT (ask for focal length)
  Claude->>FS: write session2 transcript JSONL
  Test->>FS: read session1/session2 transcripts
  Test->>Test: assert guideline exists and session2 commands avoid banned tools
Loading

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly Related PRs

Suggested Reviewers

  • gaodan-fang

Poem

🐰
I hopped through tests and wrote a clue,
Session one learns, then session two,
Guidelines tucked in carrot-pouch tight,
Commands now skip the EXIF fright,
A tiny hop for code — how bright! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'test(e2e): add sandbox learn+recall test' directly and concisely describes the primary change: adding an E2E test for the learn+recall functionality in the sandbox environment.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 0/1 reviews remaining, refill in 29 minutes and 52 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

Fixes failing CI checks: check-formatting, check-linting
Fixes failing CI check: tekton/code-detect-secrets

The example env file snippet in sandbox/README.md tripped IBM's
detect-secrets scanner. Add the finding to .secrets.baseline — the
repo's existing convention (14 entries already present) — rather than
polluting the doc with inline pragmas. Change the placeholder to the
canonical sk-ant-xxxx form used elsewhere.
@vinodmut vinodmut force-pushed the test-e2e-sandbox-learn-recall branch from 1fd2211 to 2fef74b Compare April 29, 2026 16:15
vinodmut and others added 2 commits April 29, 2026 11:34
Fixes failing CI check: tekton/pr-code-checks/code-detect-secrets

The baseline entry for sandbox/README.md needed `is_secret: false` to
pass tekton's `detect-secrets audit --fail-on-unaudited` check.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/e2e/test_sandbox_learn_recall.py`:
- Around line 153-156: The test currently picks session2_transcript
nondeterministically via session2_transcripts[0]; change this to
deterministically select the intended file by using a stable selector — e.g.,
compute the newest file by modification time from session2_transcripts (use
Path.stat().st_mtime) or filter by a known session identifier in the
filename/content, then assign that path to session2_transcript; ensure you still
assert session2_transcripts is not empty before selecting. Use the existing
variables trajectories_dir, transcripts, and session2_transcripts to locate and
choose the correct transcript.
- Around line 166-170: The assertion that checks for "exiftool$" is ineffective
because "$" is a literal char in substring checks; update the test in
tests/e2e/test_sandbox_learn_recall.py to detect whole-word invocations of
"exiftool" (use joined and commands to build the failure message) by replacing
the substring check with a regex word-boundary search (e.g.,
re.search(r"\bexiftool\b", joined)) or by tokenizing joined and checking tokens,
and keep the existing failure message that includes "\n".join(commands) so any
match still prints the commands for debugging.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bf940c87-73b0-436c-b462-ea80f1955412

📥 Commits

Reviewing files that changed from the base of the PR and between e51a7f2 and 5dcf498.

📒 Files selected for processing (4)
  • .secrets.baseline
  • justfile
  • sandbox/README.md
  • tests/e2e/test_sandbox_learn_recall.py

Comment thread tests/e2e/test_sandbox_learn_recall.py
Comment thread tests/e2e/test_sandbox_learn_recall.py Outdated
@visahak visahak self-requested a review April 29, 2026 17:33
Addresses CodeRabbit review finding: Session 2 transcript selection is
nondeterministic and can yield false passes

`glob()` order isn't guaranteed; if multiple new `*.jsonl` files ever
appear between session 1 and session 2 (e.g. a future change adds a
second writer), `[0]` could pick the wrong one. Use
`max(key=mtime)` to pick the newest deterministically.
Addresses CodeRabbit review finding: `exiftool$` check is ineffective
as written

The `"exiftool$" not in joined` check treated `$` as a literal
character — never matched anything. Replace with
`re.search(r"\bexiftool\b", joined)` which catches every bash
invocation of exiftool regardless of trailing character.
visahak
visahak previously approved these changes Apr 29, 2026
The banned list previously included PIL/piexif/exifread, which are all
pip-installable. A valid recall guideline can legitimately recommend
"pip install <pkg>" for any of them, causing the assertion to fail on
correct behavior. Only exiftool is definitively absent in the sandbox
(not installed, not pip-installable), so it's the only reliable
marker that recall did NOT steer Claude away from the unavailable
tool.
@visahak visahak merged commit ea24a31 into AgentToolkit:main Apr 29, 2026
16 checks passed
@vinodmut vinodmut deleted the test-e2e-sandbox-learn-recall branch April 29, 2026 18:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants