Iter 7: Eval framework, architecture extraction, and tool behavior refactor by echo-xiao · Pull Request #7 · RocketChat/Agentic.Code.Analyzer

echo-xiao · 2026-06-08T21:22:09Z

Summary

Architecture extraction: Added architecture.json and refactored registry.ts/retriever.ts for graph-native code analysis
Eval framework: Added layered evaluation (layer0-baseline, layer1-tool, layer2-agent) with compare utility and Claude judge scoring
Gemini answers: Updated all 27+ answer logs with improved formatting and refined content
Cleanup: Renamed eval files to layer-based naming, added executive report, removed stale agent-eval.md

Executive Report

$(cat logs/executive-report-flash.md)

Layer 0 — Baseline Eval (No Tools)

$(cat logs/layer0-baseline-eval.md)

Layer 1 — Tool Eval

$(cat logs/layer1-tool-eval.md)

Layer 2 — Agent Eval (Summary)

The full report is too large for the PR description; see logs/layer2-agent-eval.md for complete details.

| Metric | Value |
| --- | --- |
| Good answers (3+ file paths) | 22/34 (64.7%) |
| Weak answers (has content, <3 paths) | 4/34 |
| Empty answers | 8/34 |
| File hit rate (avg, string match) | 33.8% |
| Symbol coverage (avg, string match) | 63.5% |
| Avg tool calls / question | 7.2 |
| Avg tokens / question | 31,295 |
| Total tokens (all 34) | 1,064,042 |
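
The good / weak / empty buckets and the string-match file hit rate in the table above could be computed along the following lines. This is only a minimal sketch: the regular expression and the names `countFilePaths`, `classifyAnswer`, and `fileHitRate` are hypothetical, and the PR's actual eval scripts may detect paths differently.

```typescript
// Hypothetical sketch, not the PR's eval code: bucket an answer by the number
// of distinct file paths it cites, and score file hits by plain string match.

type AnswerQuality = "good" | "weak" | "empty";

// Assumed path shape: at least one "/" and a short file extension,
// e.g. "src/retriever.ts".
const FILE_PATH_PATTERN = /[\w.-]+(?:\/[\w.-]+)+\.[A-Za-z]{1,5}\b/g;

function countFilePaths(answer: string): number {
  const matches = answer.match(FILE_PATH_PATTERN) ?? [];
  return new Set(matches).size; // each distinct path counts once
}

function classifyAnswer(answer: string): AnswerQuality {
  if (answer.trim().length === 0) return "empty";
  // Threshold taken from the table: 3 or more paths is "good".
  return countFilePaths(answer) >= 3 ? "good" : "weak";
}

// Fraction of expected files that appear verbatim in the answer.
function fileHitRate(answer: string, expectedFiles: string[]): number {
  if (expectedFiles.length === 0) return 0;
  const hits = expectedFiles.filter((file) => answer.includes(file)).length;
  return hits / expectedFiles.length;
}
```

Plain string matching is cheap and deterministic, but it undercounts answers that cite a file by a different relative path, which may be one reason the file hit rate sits well below the symbol coverage.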

By Question Type

| Type | Count | Passed | Rate |
| --- | --- | --- | --- |
| architecture | 9 | 0 | 0.0% |
| call-chain | 4 | 0 | 0.0% |
| pattern | 6 | 0 | 0.0% |
| locate | 8 | 1 | 12.5% |
| routing | 4 | 1 | 25.0% |
| impact | 3 | 0 | 0.0% |

By Difficulty

| Difficulty | Count | Passed | Rate |
| --- | --- | --- | --- |
| medium | 17 | 1 | 5.9% |
| hard | 17 | 1 | 5.9% |
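
The Count / Passed / Rate columns in the two breakdowns above are a simple group-by over per-question results. A minimal sketch of that aggregation follows, assuming each result carries a `type`, a `difficulty`, and a `passed` flag; the `EvalResult` shape and the `passRateBy` helper are hypothetical and not taken from the PR.

```typescript
// Hypothetical sketch: aggregate per-question results into Count / Passed / Rate rows.

interface EvalResult {
  type: string;
  difficulty: string;
  passed: boolean;
}

interface GroupStats {
  count: number;
  passed: number;
  rate: string; // percentage with one decimal place, e.g. "12.5%"
}

function passRateBy(
  results: EvalResult[],
  key: "type" | "difficulty",
): Map<string, GroupStats> {
  const groups = new Map<string, GroupStats>();
  for (const result of results) {
    const stats = groups.get(result[key]) ?? { count: 0, passed: 0, rate: "0.0%" };
    stats.count += 1;
    if (result.passed) stats.passed += 1;
    stats.rate = `${((stats.passed / stats.count) * 100).toFixed(1)}%`;
    groups.set(result[key], stats);
  }
  return groups;
}
```

For instance, eight `locate` results with one pass yield a row with count 8, passed 1, and rate `12.5%`, matching the table.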

🤖 Generated with Claude Code

… eval

Major changes:
- Extract architecture knowledge from AGENTS.md to architecture.json (30 entries, source-verified)
- AGENTS.md stripped to pure rules (tool order, answer format, navigation strategy)
- implement: class skeleton mode (10K+ → ~500 tokens), ClassName.methodName support
- implement: enforce search/graph before implement (session tracking)
- search/graph/implement: navigation hints in responses
- graph: architecture context injection from architecture.json
- grep: sorted by relevance, limited to top 10
- Callee skeletons removed (graph(down) replaces at 1/10th cost)

Bug fixes:
- retriever.ts: callee skeleton bug (objects treated as strings)
- Remove 5 unused deps, unused exports, stale params

Results:
- L1: 25/34 (unchanged, no artificial inflation)
- L2 tokens: 69K → 29K avg/question (-58%)
- L2 implement: 3,070 → 544 avg tokens (-82%)
- Claude judge: 6 GOOD, 16 ACCEPTABLE, 9 WEAK, 3 WRONG (65% usable)

Eval renames: tool-eval → layer1-tool-eval, agent-eval → layer2-agent-eval

New: compare.ts generates 3-way comparison report, comparison-report.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
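
The "enforce search/graph before implement (session tracking)" item in the commit message above can be pictured as a small per-session gate. The sketch below is illustrative only: `SessionTracker`, `GateResult`, and the hint text are hypothetical names, and it assumes that a single prior `search` or `graph` call is enough to unlock `implement`, which the commit message does not spell out.

```typescript
// Hypothetical sketch of a per-session gate: "implement" is refused until
// "search" or "graph" has been called in the same session.
// Names and messages are assumptions, not the PR's registry.ts code.

type ToolName = "search" | "graph" | "grep" | "implement";

interface GateResult {
  allowed: boolean;
  hint?: string; // navigation hint returned to the agent when a call is refused
}

class SessionTracker {
  private readonly called = new Set<ToolName>();

  // Check the gate first, then record the call only if it is allowed.
  request(tool: ToolName): GateResult {
    const explored = this.called.has("search") || this.called.has("graph");
    if (tool === "implement" && !explored) {
      return { allowed: false, hint: "Call search or graph before implement." };
    }
    this.called.add(tool);
    return { allowed: true };
  }
}
```

Returning a hint rather than throwing keeps the refusal inside the normal tool response, which fits the "navigation hints in responses" item in the same commit.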

Remove unused eval layers (layer0-baseline, layer1-tool, compare) and their scripts. Add executive-report.md with manual quality assessment: 70% of answers rated GOOD/ACCEPTABLE. Update gemini-answers with latest eval run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


CLAassistant · 2026-06-08T21:22:18Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Echo Xiao seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

…swers with improved formatting
- Add layer0-baseline-eval and layer1-tool-eval scripts and logs
- Add compare.ts eval utility
- Update all gemini answer logs with refined content
- Update layer2-agent-eval with latest results
- Replace executive-report.md with executive-report-flash.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Echo Xiao and others added 3 commits June 8, 2026 11:31

Move executive-report.md to logs/

641eb50

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

echo-xiao changed the title ~~Iter 7: Graph-native code analyzer with eval framework~~ Iter 7: Eval framework, architecture extraction, and tool behavior refactor Jun 9, 2026

echo-xiao wants to merge 4 commits into RocketChat:main from echo-xiao:initial-codebase
