
feat: Add evaluations support to ManagedAgent.run()#153

Open
jsonbailey wants to merge 3 commits into jb/aic-2174/langchain-graph-runner from jb/aic-2174/agent-evaluations

Conversation

@jsonbailey (Contributor) commented Apr 28, 2026

Summary

  • Wires judge evaluations into ManagedAgent.run() via asyncio.Task, mirroring ManagedModel.run() (PR 7 / PR 8)
  • run() returns immediately; await result.evaluations guarantees both evaluation and tracker.track_judge_result() complete
  • Uses ai_config.evaluator.evaluate(input, content) — resolves to empty list with Evaluator.noop()
  • Failed judge results (success=False) do NOT call track_judge_result()
  • Adds 6 new tests covering the full evaluations contract

Depends on

Test plan

  • All existing tests pass (uv run pytest packages/sdk/server-ai/tests/)
  • New TestManagedAgentEvaluations tests: run returns before evaluations resolve, collect results, tracking fires on await, noop evaluator returns empty list, failed results not tracked

🤖 Generated with Claude Code


Note

Medium Risk
Introduces new async evaluation/telemetry side-effects in ManagedAgent.run() via background tasks; risk is moderate due to potential concurrency/lifecycle issues (unawaited tasks, exception handling) affecting tracking reliability rather than core auth/data safety.

Overview
ManagedAgent.run() now kicks off judge evaluations via ai_config.evaluator.evaluate(input, output) and returns a ManagedResult that includes an evaluations asyncio.Task alongside the normal content/metrics.

Awaiting result.evaluations runs per-judge tracking (tracker.track_judge_result) for successful results, logs failures/exceptions without raising, and returns the collected JudgeResult list; tests were expanded to cover the non-blocking behavior, result collection, tracking-on-await contract, noop evaluator behavior, and failed-result handling.
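One corner of that contract — the noop evaluator resolving to an empty list — can be shown with a minimal sketch. The `Evaluator` class below is a hypothetical stand-in; only the `noop()`/`evaluate()` names come from the discussion above:

```python
import asyncio

class Evaluator:
    """Hypothetical sketch of the noop-evaluator contract."""

    def __init__(self, judges: list) -> None:
        self._judges = judges

    @classmethod
    def noop(cls) -> "Evaluator":
        # No judges configured: evaluations resolve to an empty list,
        # so awaiting result.evaluations is always safe.
        return cls(judges=[])

    async def evaluate(self, input_text: str, output_text: str) -> list:
        return [await judge(input_text, output_text) for judge in self._judges]

async def main() -> None:
    results = await Evaluator.noop().evaluate("question", "answer")
    assert results == []  # noop evaluator yields no judge results

asyncio.run(main())
```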

Reviewed by Cursor Bugbot for commit 9f9c880. Bugbot is set up for automated code reviews on this repo. Configure here.

@jsonbailey force-pushed the jb/aic-2174/agent-evaluations branch from 4f29d99 to 0ea4a04 on April 28, 2026 23:56
@jsonbailey changed the base branch from jb/aic-2388/enrich-metrics to jb/aic-2174/langchain-graph-runner on April 28, 2026 23:57
@jsonbailey force-pushed the jb/aic-2174/langchain-graph-runner branch from 0539ba1 to 404670d on April 29, 2026 13:15
@jsonbailey force-pushed the jb/aic-2174/agent-evaluations branch from 0ea4a04 to 04e80a8 on April 29, 2026 13:15
@jsonbailey force-pushed the jb/aic-2174/langchain-graph-runner branch from 404670d to f132154 on April 29, 2026 13:19
@jsonbailey force-pushed the jb/aic-2174/agent-evaluations branch from 04e80a8 to 29ced10 on April 29, 2026 13:19
@jsonbailey force-pushed the jb/aic-2174/langchain-graph-runner branch from f132154 to eb1004c on April 29, 2026 13:22
@jsonbailey force-pushed the jb/aic-2174/agent-evaluations branch from 29ced10 to c343602 on April 29, 2026 13:23
@jsonbailey force-pushed the jb/aic-2174/langchain-graph-runner branch from eb1004c to 8a049e2 on April 29, 2026 13:53
@jsonbailey force-pushed the jb/aic-2174/agent-evaluations branch from c343602 to 1a24a4f on April 29, 2026 13:55
@jsonbailey force-pushed the jb/aic-2174/langchain-graph-runner branch from 8a049e2 to cea3780 on April 29, 2026 13:57
@jsonbailey force-pushed the jb/aic-2174/agent-evaluations branch from 1a24a4f to 78a7ded on April 29, 2026 13:57
@jsonbailey force-pushed the jb/aic-2174/langchain-graph-runner branch from cea3780 to f27f9b8 on April 29, 2026 14:39
@jsonbailey force-pushed the jb/aic-2174/agent-evaluations branch from 38951a6 to 52756c7 on April 29, 2026 14:39
@jsonbailey force-pushed the jb/aic-2174/langchain-graph-runner branch from f27f9b8 to d892533 on April 29, 2026 16:34
@jsonbailey force-pushed the jb/aic-2174/agent-evaluations branch from 52756c7 to ff2de9a on April 29, 2026 16:34
@jsonbailey force-pushed the jb/aic-2174/langchain-graph-runner branch from d892533 to 13ee088 on April 30, 2026 14:06
@jsonbailey force-pushed the jb/aic-2174/agent-evaluations branch from ff2de9a to 1054ef7 on April 30, 2026 14:07
@jsonbailey force-pushed the jb/aic-2174/langchain-graph-runner branch from 13ee088 to 3159524 on April 30, 2026 14:24
@jsonbailey force-pushed the jb/aic-2174/agent-evaluations branch from 1054ef7 to 4a0923d on April 30, 2026 14:25
@jsonbailey force-pushed the jb/aic-2174/langchain-graph-runner branch from 3159524 to 2c5671d on April 30, 2026 14:48
jsonbailey and others added 3 commits April 30, 2026 09:48
Wire judge evaluations into ManagedAgent.run() via an asyncio.Task, mirroring
ManagedModel.run(). Awaiting result.evaluations guarantees both evaluation and
tracker.track_judge_result() complete. run() returns immediately; the
evaluations task resolves asynchronously.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Mirror the managed_model.py fix in managed_agent.py: wrap
tracker.track_judge_result() in try/except so a tracking failure
does not destroy successfully computed evaluation results, and log
a warning when a judge evaluation fails (r.success is False) so
failures are visible rather than silently skipped.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jsonbailey force-pushed the jb/aic-2174/agent-evaluations branch from 4a0923d to 9f9c880 on April 30, 2026 14:48
@jsonbailey marked this pull request as ready for review May 1, 2026 18:05
@jsonbailey requested a review from a team as a code owner May 1, 2026 18:05

@cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.



The comment below is anchored on this excerpt:

```python
        log.warning("Judge evaluation failed: %s", r.error_message)
    return results

return asyncio.create_task(_run_and_track(evaluator_task))
```
Duplicated _track_judge_results logic across managed classes

Low Severity

The _track_judge_results method in ManagedAgent is a character-for-character duplicate of the same method in ManagedModel. Both take tracker, input_text, output_text, call evaluator.evaluate(), wrap it in an async task that iterates results, tracks successful ones, and logs failures. This duplicated logic increases maintenance burden — a bug fix or behavior change in one would need to be manually replicated in the other.



Contributor Author

We will consider a refactor in the future if needed. It's light enough that we will leave it as is for the moment.
