test(e2e): add sandbox learn+recall test by vinodmut · Pull Request #233 · AgentToolkit/altk-evolve

vinodmut · 2026-04-29T15:58:56Z

Relates to #186 — a regression test for the Claude Code plugin end-to-end: does the Stop hook extract a guideline from a failed session, and does recall use it in a follow-up?

Summary

tests/e2e/test_sandbox_learn_recall.py runs two Claude Code sessions inside the Docker sandbox back-to-back:
1. Asks for EXIF metadata on a sample photo. The sandbox lacks exiftool and PIL, so Claude hits dead ends and recovers via Python stdlib. The Stop hook chain fires: save-trajectory writes the transcript, then learn extracts a guideline.
2. Asks a similar follow-up question. UserPromptSubmit recall injects the guideline; Claude should skip the failed tools entirely.
Asserts session 1 produces a .md file under .evolve/entities/ and that session 2's bash commands do not invoke exiftool / PIL / piexif / exifread.
Uses logging so progress lines stream live under pytest --log-cli-level=INFO.
justfile — drop the obsolete trace=true / learn=true options from the claude-prompt target now that the Stop hooks handle transcript saving and guideline learning automatically.
sandbox/README.md — document how to run the E2E test, the env vars forwarded into the container, and an example env file.

Why

With the save-trajectory and learn hooks (PRs #229 and #230) both landed, the full learn + recall loop now works end-to-end. This test locks that behavior in so future plugin changes can't silently break it.

Test plan

just sandbox-build claude
export ANTHROPIC_API_KEY=sk-ant-...   # or source an env file
uv run pytest tests/e2e/test_sandbox_learn_recall.py --run-e2e -m e2e -v --log-cli-level=INFO

Passes locally (~3m 34s)
Skipped by default (marked @pytest.mark.e2e; pyproject.toml's addopts = "-m 'not llm and not e2e'" excludes it from CI)
Skips cleanly if Docker, the sandbox image, or ANTHROPIC_API_KEY is missing

Summary by CodeRabbit

Tests
- Added an end-to-end test validating the learn→recall flow across two sandboxed sessions.
Documentation
- Added sandbox test docs with setup steps, prerequisites, example env and run commands.
Chores
- Updated security baseline metadata and recorded a non-secret scan entry.
- Simplified sandbox run configuration by removing optional trace/learn flags.

Add tests/e2e/test_sandbox_learn_recall.py which runs two Claude Code sessions in the Docker sandbox back-to-back: the first hits dead ends on a photo-EXIF task (exiftool/PIL unavailable), so the Stop hook extracts a guideline; the second asks a similar question, and recall should inject the guideline and steer Claude away from the failed tools. Uses pytest live logging for real-time progress. Also simplify the claude-prompt justfile target — drop the obsolete trace/learn options now that the Stop hooks handle transcript saving and guideline learning automatically — and document how to run the test in sandbox/README.md.

coderabbitai · 2026-04-29T15:59:04Z

Warning

Rate limit exceeded

@vinodmut has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 29 minutes and 52 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dafaecbd-aea8-4b5a-8d62-1225cbaa0bc7

📥 Commits

Reviewing files that changed from the base of the PR and between 8ed92eb and e7b5dfc.

📒 Files selected for processing (1)

tests/e2e/test_sandbox_learn_recall.py

📝 Walkthrough

Walkthrough

Adds an E2E test and docs for the evolve-lite learn→recall flow, removes trace/learn variables from the justfile and related conditional logic, and updates .secrets.baseline with a new secret-scan entry and timestamp.

Changes

Cohort / File(s)	Summary
Configuration & Secrets `\.secrets.baseline`	Updates baseline timestamp and records a new secret-scan entry for `sandbox/README.md` (not secret, not verified). Also adds trailing newline.
Build / CLI config `justfile`	Removes `trace` and `learn` variables and strips conditional logic from the `claude-prompt` target; container invocation simplified to always run `claude` with `SANDBOX_PROMPT`.
Documentation `sandbox/README.md`	Adds documentation for the new sandbox E2E test, prerequisites, example env, and commands to run the test with live logging; describes session 1 learn and session 2 recall behavior.
E2E Tests `tests/e2e/test_sandbox_learn_recall.py`	Introduces a new E2E test with session-scoped fixtures that skip when Docker/image/credentials are unavailable; runs two sequential sandbox sessions, asserts guideline artifacts and transcripts from session 1, and asserts session 2 avoids banned EXIF-related tool invocations by inspecting JSONL transcripts.

Sequence Diagram(s)

sequenceDiagram
  participant Test as Test Runner (pytest)
  participant Docker as Docker Engine
  participant Claude as claude-sandbox Container
  participant FS as Workspace / .evolve filesystem
  Test->>Docker: start container (session 1) with mounts & env
  Docker->>Claude: run claude with SANDBOX_PROMPT (ask for photo location)
  Claude->>FS: write session1 transcript JSONL
  Claude->>FS: create .evolve/entities/ guideline.md
  FS-->>Docker: session1 artifacts available
  Test->>Docker: start container (session 2) with mounts & env
  Docker->>Claude: run claude with SANDBOX_PROMPT (ask for focal length)
  Claude->>FS: write session2 transcript JSONL
  Test->>FS: read session1/session2 transcripts
  Test->>Test: assert guideline exists and session2 commands avoid banned tools

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly Related PRs

codex example #165 — Related to evolve-lite learn/recall workflow and entity handling/contract.
docs(repo): reorganize documentation structure #131 — Also modifies .secrets.baseline entries.

Suggested Reviewers

gaodan-fang

Poem

🐰
I hopped through tests and wrote a clue,
Session one learns, then session two,
Guidelines tucked in carrot-pouch tight,
Commands now skip the EXIF fright,
A tiny hop for code — how bright! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title 'test(e2e): add sandbox learn+recall test' directly and concisely describes the primary change: adding an E2E test for the learn+recall functionality in the sandbox environment.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Review rate limit: 0/1 reviews remaining, refill in 29 minutes and 52 seconds.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Fixes failing CI checks: check-formatting, check-linting

Fixes failing CI check: tekton/code-detect-secrets The example env file snippet in sandbox/README.md tripped IBM's detect-secrets scanner. Add the finding to .secrets.baseline — the repo's existing convention (14 entries already present) — rather than polluting the doc with inline pragmas. Change the placeholder to the canonical sk-ant-xxxx form used elsewhere.

Fixes failing CI check: tekton/pr-code-checks/code-detect-secrets The baseline entry for sandbox/README.md needed `is_secret: false` to pass tekton's `detect-secrets audit --fail-on-unaudited` check.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/e2e/test_sandbox_learn_recall.py`:
- Around line 153-156: The test currently picks session2_transcript
nondeterministically via session2_transcripts[0]; change this to
deterministically select the intended file by using a stable selector — e.g.,
compute the newest file by modification time from session2_transcripts (use
Path.stat().st_mtime) or filter by a known session identifier in the
filename/content, then assign that path to session2_transcript; ensure you still
assert session2_transcripts is not empty before selecting. Use the existing
variables trajectories_dir, transcripts, and session2_transcripts to locate and
choose the correct transcript.
- Around line 166-170: The assertion that checks for "exiftool$" is ineffective
because "$" is a literal char in substring checks; update the test in
tests/e2e/test_sandbox_learn_recall.py to detect whole-word invocations of
"exiftool" (use joined and commands to build the failure message) by replacing
the substring check with a regex word-boundary search (e.g.,
re.search(r"\bexiftool\b", joined)) or by tokenizing joined and checking tokens,
and keep the existing failure message that includes "\n".join(commands) so any
match still prints the commands for debugging.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bf940c87-73b0-436c-b462-ea80f1955412

📥 Commits

Reviewing files that changed from the base of the PR and between e51a7f2 and 5dcf498.

📒 Files selected for processing (4)

.secrets.baseline
justfile
sandbox/README.md
tests/e2e/test_sandbox_learn_recall.py

Addresses CodeRabbit review finding: Session 2 transcript selection is nondeterministic and can yield false passes `glob()` order isn't guaranteed; if multiple new `*.jsonl` files ever appear between session 1 and session 2 (e.g. a future change adds a second writer), `[0]` could pick the wrong one. Use `max(key=mtime)` to pick the newest deterministically.

Addresses CodeRabbit review finding: `exiftool$` check is ineffective as written The `"exiftool$" not in joined` check treated `$` as a literal character — never matched anything. Replace with `re.search(r"\bexiftool\b", joined)` which catches every bash invocation of exiftool regardless of trailing character.

The banned list previously included PIL/piexif/exifread, which are all pip-installable. A valid recall guideline can legitimately recommend "pip install <pkg>" for any of them, causing the assertion to fail on correct behavior. Only exiftool is definitively absent in the sandbox (not installed, not pip-installable), so it's the only reliable marker that recall did NOT steer Claude away from the unavailable tool.

vinodmut added 2 commits April 29, 2026 11:10

fix(tests): ruff format and remove extraneous f-string prefix

3c6bd3e

Fixes failing CI checks: check-formatting, check-linting

vinodmut force-pushed the test-e2e-sandbox-learn-recall branch from 1fd2211 to 2fef74b Compare April 29, 2026 16:15

vinodmut and others added 2 commits April 29, 2026 11:34

fix(sandbox): audit README placeholder as non-secret

967acaa

Fixes failing CI check: tekton/pr-code-checks/code-detect-secrets The baseline entry for sandbox/README.md needed `is_secret: false` to pass tekton's `detect-secrets audit --fail-on-unaudited` check.

Merge branch 'main' into test-e2e-sandbox-learn-recall

5dcf498

coderabbitai Bot reviewed Apr 29, 2026

View reviewed changes

Comment thread tests/e2e/test_sandbox_learn_recall.py

Comment thread tests/e2e/test_sandbox_learn_recall.py Outdated

visahak self-requested a review April 29, 2026 17:33

vinodmut added 2 commits April 29, 2026 12:42

visahak previously approved these changes Apr 29, 2026

View reviewed changes

vinodmut dismissed visahak’s stale review via e7b5dfc April 29, 2026 18:12

vinodmut requested review from illeatmyhat and visahak April 29, 2026 18:26

visahak approved these changes Apr 29, 2026

View reviewed changes

visahak merged commit ea24a31 into AgentToolkit:main Apr 29, 2026
16 checks passed

vinodmut deleted the test-e2e-sandbox-learn-recall branch April 29, 2026 18:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(e2e): add sandbox learn+recall test#233

test(e2e): add sandbox learn+recall test#233
visahak merged 8 commits intoAgentToolkit:mainfrom
vinodmut:test-e2e-sandbox-learn-recall

vinodmut commented Apr 29, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 29, 2026 •

edited

Loading

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram(s)

Estimated Code Review Effort

Possibly Related PRs

Suggested Reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

vinodmut commented Apr 29, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram(s)

Estimated Code Review Effort

Possibly Related PRs

Suggested Reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vinodmut commented Apr 29, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 29, 2026 •

edited

Loading