test(e2e): add sandbox learn+recall test #233
Conversation
Add tests/e2e/test_sandbox_learn_recall.py which runs two Claude Code sessions in the Docker sandbox back-to-back: the first hits dead ends on a photo-EXIF task (exiftool/PIL unavailable), so the Stop hook extracts a guideline; the second asks a similar question, and recall should inject the guideline and steer Claude away from the failed tools. Uses pytest live logging for real-time progress. Also simplify the claude-prompt justfile target — drop the obsolete trace/learn options now that the Stop hooks handle transcript saving and guideline learning automatically — and document how to run the test in sandbox/README.md.
📝 Walkthrough

Adds an E2E test and docs for the evolve-lite learn→recall flow, and removes the obsolete trace/learn options from the `claude-prompt` justfile target.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Test as Test Runner (pytest)
    participant Docker as Docker Engine
    participant Claude as claude-sandbox Container
    participant FS as Workspace / .evolve filesystem
    Test->>Docker: start container (session 1) with mounts & env
    Docker->>Claude: run claude with SANDBOX_PROMPT (ask for photo location)
    Claude->>FS: write session1 transcript JSONL
    Claude->>FS: create .evolve/entities/ guideline.md
    FS-->>Docker: session1 artifacts available
    Test->>Docker: start container (session 2) with mounts & env
    Docker->>Claude: run claude with SANDBOX_PROMPT (ask for focal length)
    Claude->>FS: write session2 transcript JSONL
    Test->>FS: read session1/session2 transcripts
    Test->>Test: assert guideline exists and session2 commands avoid banned tools
```
Fixes failing CI checks: check-formatting, check-linting
Fixes failing CI check: tekton/code-detect-secrets The example env file snippet in sandbox/README.md tripped IBM's detect-secrets scanner. Add the finding to .secrets.baseline — the repo's existing convention (14 entries already present) — rather than polluting the doc with inline pragmas. Change the placeholder to the canonical sk-ant-xxxx form used elsewhere.
Force-pushed 1fd2211 to 2fef74b
Fixes failing CI check: tekton/pr-code-checks/code-detect-secrets The baseline entry for sandbox/README.md needed `is_secret: false` to pass tekton's `detect-secrets audit --fail-on-unaudited` check.
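For reference, an audited entry in `.secrets.baseline` looks roughly like the sketch below (the hash and line number are placeholders, not the repo's actual values):

```json
{
  "results": {
    "sandbox/README.md": [
      {
        "type": "Secret Keyword",
        "filename": "sandbox/README.md",
        "hashed_secret": "0000000000000000000000000000000000000000",
        "is_verified": false,
        "is_secret": false,
        "line_number": 42
      }
    ]
  }
}
```

With `is_secret: false` present, `detect-secrets audit --fail-on-unaudited` treats the finding as audited and passes.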
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/e2e/test_sandbox_learn_recall.py`:
- Around line 153-156: The test currently picks session2_transcript
nondeterministically via session2_transcripts[0]; change this to
deterministically select the intended file by using a stable selector — e.g.,
compute the newest file by modification time from session2_transcripts (use
Path.stat().st_mtime) or filter by a known session identifier in the
filename/content, then assign that path to session2_transcript; ensure you still
assert session2_transcripts is not empty before selecting. Use the existing
variables trajectories_dir, transcripts, and session2_transcripts to locate and
choose the correct transcript.
- Around line 166-170: The assertion that checks for "exiftool$" is ineffective
because "$" is a literal char in substring checks; update the test in
tests/e2e/test_sandbox_learn_recall.py to detect whole-word invocations of
"exiftool" (use joined and commands to build the failure message) by replacing
the substring check with a regex word-boundary search (e.g.,
re.search(r"\bexiftool\b", joined)) or by tokenizing joined and checking tokens,
and keep the existing failure message that includes "\n".join(commands) so any
match still prints the commands for debugging.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bf940c87-73b0-436c-b462-ea80f1955412
📒 Files selected for processing (4)
- .secrets.baseline
- justfile
- sandbox/README.md
- tests/e2e/test_sandbox_learn_recall.py
Addresses CodeRabbit review finding: Session 2 transcript selection is nondeterministic and can yield false passes `glob()` order isn't guaranteed; if multiple new `*.jsonl` files ever appear between session 1 and session 2 (e.g. a future change adds a second writer), `[0]` could pick the wrong one. Use `max(key=mtime)` to pick the newest deterministically.
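The deterministic selection can be sketched as follows (the directory and filenames are illustrative stand-ins, not the test's actual paths):

```python
import time
from pathlib import Path
from tempfile import TemporaryDirectory

# Stand-in for the test's trajectories dir, populated with two transcripts
# written at different times.
with TemporaryDirectory() as tmp:
    trajectories_dir = Path(tmp)
    (trajectories_dir / "a.jsonl").write_text("{}")
    time.sleep(0.05)  # ensure distinct modification times
    newer = trajectories_dir / "b.jsonl"
    newer.write_text("{}")

    session2_transcripts = list(trajectories_dir.glob("*.jsonl"))
    # Still assert non-emptiness before selecting.
    assert session2_transcripts, "session 2 produced no transcript"

    # glob() order is unspecified, so pick the newest file by mtime
    # instead of trusting session2_transcripts[0].
    session2_transcript = max(
        session2_transcripts, key=lambda p: p.stat().st_mtime
    )
```

`max(..., key=...)` over `st_mtime` is stable regardless of how the filesystem orders `glob()` results.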
Addresses CodeRabbit review finding: `exiftool$` check is ineffective as written The `"exiftool$" not in joined` check treated `$` as a literal character — never matched anything. Replace with `re.search(r"\bexiftool\b", joined)` which catches every bash invocation of exiftool regardless of trailing character.
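A minimal sketch of the broken check versus the fix (the command strings are illustrative; in the test, `commands` comes from the session 2 transcript):

```python
import re

# Illustrative session-2 bash commands pulled from the transcript.
commands = [
    "python3 extract_exif.py photo.jpg",
    "ls -la photos/",
]
joined = "\n".join(commands)

# Broken: `$` is a literal character in a substring check, so this
# condition is effectively always true and never catches anything.
assert "exiftool$" not in joined

# Fixed: a word-boundary regex matches every whole-word invocation of
# exiftool, whatever character follows it, and keeps the same failure
# message listing the offending commands.
assert re.search(r"\bexiftool\b", joined) is None, "\n".join(commands)

# Sanity check: the regex does catch a real invocation.
assert re.search(r"\bexiftool\b", "exiftool -GPSLatitude photo.jpg")
```

Note that `\b` prevents false positives on substrings like `extract_exif.py` while still matching `exiftool` followed by a space, flag, or end of line.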
The banned list previously included PIL/piexif/exifread, which are all pip-installable. A valid recall guideline can legitimately recommend "pip install <pkg>" for any of them, causing the assertion to fail on correct behavior. Only exiftool is definitively absent in the sandbox (not installed, not pip-installable), so it's the only reliable marker that recall did NOT steer Claude away from the unavailable tool.
Relates to #186 — a regression test for the Claude Code plugin end-to-end: does the Stop hook extract a guideline from a failed session, and does recall use it in a follow-up?
Summary
- `tests/e2e/test_sandbox_learn_recall.py` — runs two Claude Code sessions inside the Docker sandbox back-to-back. Session 1 asks a photo-EXIF question with `exiftool` and `PIL` unavailable, so Claude hits dead ends and recovers via the Python stdlib. The Stop hook chain fires: save-trajectory writes the transcript, then learn extracts a guideline. Asserts session 1 produces a `.md` file under `.evolve/entities/` and that session 2's bash commands do not invoke `exiftool`/`PIL`/`piexif`/`exifread`. Uses `logging` so progress lines stream live under `pytest --log-cli-level=INFO`.
- `justfile` — drop the obsolete `trace=true`/`learn=true` options from the `claude-prompt` target now that the Stop hooks handle transcript saving and guideline learning automatically.
- `sandbox/README.md` — document how to run the E2E test, the env vars forwarded into the container, and an example env file.

Why
With the save-trajectory and learn hooks (PRs #229 and #230) both landed, the full learn + recall loop now works end-to-end. This test locks that behavior in so future plugin changes can't silently break it.
Test plan
- Marked `@pytest.mark.e2e`; `pyproject.toml`'s `addopts = "-m 'not llm and not e2e'"` excludes it from CI.
- Skips when `ANTHROPIC_API_KEY` is missing.
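The marker wiring can be sketched as a `pyproject.toml` fragment (the marker descriptions are illustrative; only the `addopts` line is quoted from the repo):

```toml
[tool.pytest.ini_options]
# Default run excludes both slow suites; CI inherits this, so the e2e test
# only runs when selected explicitly, e.g. `pytest -m e2e`.
addopts = "-m 'not llm and not e2e'"
markers = [
    "e2e: end-to-end tests that need Docker and a real API key",
    "llm: tests that call a live LLM",
]
```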