Iter 7: Eval framework, architecture extraction, and tool behavior refactor #7

Open
echo-xiao wants to merge 4 commits into RocketChat:main from echo-xiao:initial-codebase

Conversation

@echo-xiao echo-xiao commented Jun 8, 2026

Summary

  • Architecture extraction: Added architecture.json and refactored registry.ts/retriever.ts for graph-native code analysis
  • Eval framework: Added layered evaluation (layer0-baseline, layer1-tool, layer2-agent) with compare utility and Claude judge scoring
  • Gemini answers: Updated all 27+ answer logs with improved formatting and refined content
  • Cleanup: Renamed eval files to layer-based naming, added executive report, removed stale agent-eval.md

Executive Report

See logs/executive-report-flash.md.


Layer 0 — Baseline Eval (No Tools)

See logs/layer0-baseline-eval.md.


Layer 1 — Tool Eval

See logs/layer1-tool-eval.md.


Layer 2 — Agent Eval (Summary)

Full report is too large for PR description. See logs/layer2-agent-eval.md for complete details.

| Metric | Value |
|---|---|
| Good answers (3+ file paths) | 22/34 (64.7%) |
| Weak answers (has content, <3 paths) | 4/34 |
| Empty answers | 8/34 |
| File hit rate (avg, string match) | 33.8% |
| Symbol coverage (avg, string match) | 63.5% |
| Avg tool calls / question | 7.2 |
| Avg tokens / question | 31,295 |
| Total tokens (all 34) | 1,064,042 |
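
The answer buckets and the string-match hit rate in the table could be computed roughly as follows. This is a minimal sketch, not the PR's eval code: `extractFilePaths`, `classifyAnswer`, `fileHitRate`, and the path regex are hypothetical names and heuristics. Only the thresholds (3+ file paths is good, fewer is weak, no content is empty) and the use of plain string matching come from the table.

```typescript
// Hypothetical sketch: function names and the path regex are assumptions,
// not taken from the PR's eval scripts.
type AnswerQuality = "good" | "weak" | "empty";

// Assumed heuristic for a "file path": slash-separated segments ending in a short extension.
const FILE_PATH_PATTERN = /[\w.-]+(?:\/[\w.-]+)+\.[A-Za-z]{1,4}\b/g;

// Returns the distinct file paths mentioned in an answer.
function extractFilePaths(answer: string): string[] {
  const matches = answer.match(FILE_PATH_PATTERN) ?? [];
  return Array.from(new Set(matches));
}

// Buckets an answer using the thresholds stated in the table.
function classifyAnswer(answer: string): AnswerQuality {
  if (answer.trim().length === 0) return "empty";
  return extractFilePaths(answer).length >= 3 ? "good" : "weak";
}

// Fraction of expected files that appear verbatim in the answer (string match).
function fileHitRate(answer: string, expectedFiles: string[]): number {
  if (expectedFiles.length === 0) return 0;
  const hits = expectedFiles.filter((file) => answer.includes(file)).length;
  return hits / expectedFiles.length;
}
```

Averaging `fileHitRate` over all questions would give a figure comparable to the "File hit rate" row, assuming each question carries a list of expected files.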

By Question Type

| Type | Count | Passed | Rate |
|---|---|---|---|
| architecture | 9 | 0 | 0.0% |
| call-chain | 4 | 0 | 0.0% |
| pattern | 6 | 0 | 0.0% |
| locate | 8 | 1 | 12.5% |
| routing | 4 | 1 | 25.0% |
| impact | 3 | 0 | 0.0% |

By Difficulty

| Difficulty | Count | Passed | Rate |
|---|---|---|---|
| medium | 17 | 1 | 5.9% |
| hard | 17 | 1 | 5.9% |
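
The two breakdowns above are the same aggregation grouped by different keys. A sketch of that grouping, assuming a hypothetical `EvalResult` record per question (the field names and `passRateBy` are illustrative, not from the PR):

```typescript
// Hypothetical result shape; field names are assumptions for illustration.
interface EvalResult {
  type: string;
  difficulty: string;
  passed: boolean;
}

interface GroupStats {
  count: number;
  passed: number;
  ratePercent: number; // rounded to one decimal place
}

// Groups results by the given key and computes count, passes, and pass rate per group.
function passRateBy(
  results: EvalResult[],
  key: "type" | "difficulty",
): Map<string, GroupStats> {
  const groups = new Map<string, GroupStats>();
  for (const result of results) {
    const stats = groups.get(result[key]) ?? { count: 0, passed: 0, ratePercent: 0 };
    stats.count += 1;
    if (result.passed) stats.passed += 1;
    groups.set(result[key], stats);
  }
  groups.forEach((stats) => {
    stats.ratePercent = Math.round((stats.passed / stats.count) * 1000) / 10;
  });
  return groups;
}
```

For example, 1 pass out of 8 `locate` questions yields 12.5, and 1 pass out of 17 `medium` questions yields 5.9, matching the tables.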

🤖 Generated with Claude Code

Echo Xiao and others added 3 commits June 8, 2026 11:31
… eval

Major changes:
- Extract architecture knowledge from AGENTS.md to architecture.json (30 entries, source-verified)
- AGENTS.md stripped to pure rules (tool order, answer format, navigation strategy)
- implement: class skeleton mode (10K+ → ~500 tokens), ClassName.methodName support
- implement: enforce search/graph before implement (session tracking)
- search/graph/implement: navigation hints in responses
- graph: architecture context injection from architecture.json
- grep: sorted by relevance, limited to top 10
- Callee skeletons removed (graph(down) replaces at 1/10th cost)

Bug fixes:
- retriever.ts: callee skeleton bug (objects treated as strings)
- Remove 5 unused deps, unused exports, stale params

Results:
- L1: 25/34 (unchanged, no artificial inflation)
- L2 tokens: 69K → 29K avg/question (-58%)
- L2 implement: 3,070 → 544 avg tokens (-82%)
- Claude judge: 6 GOOD, 16 ACCEPTABLE, 9 WEAK, 3 WRONG (65% usable)

Eval renames: tool-eval → layer1-tool-eval, agent-eval → layer2-agent-eval
New: compare.ts generates 3-way comparison report, comparison-report.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
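
The "enforce search/graph before implement (session tracking)" change in the commit above could work roughly like this. The tool names come from the commit message. The `ToolSession` class, its `call` method, and the hint text are hypothetical, meant only to illustrate per-session gating with a navigation hint in the response.

```typescript
// Hypothetical sketch of session tracking; only the tool names come from the commit message.
type ToolName = "search" | "graph" | "grep" | "implement";

interface ToolCallResult {
  ok: boolean;
  hint?: string; // navigation hint returned when a call is rejected
}

class ToolSession {
  // Tools that have been called successfully in this session.
  private readonly called = new Set<ToolName>();

  // Records a tool call, rejecting implement until search or graph has run.
  call(tool: ToolName): ToolCallResult {
    const hasNavigated = this.called.has("search") || this.called.has("graph");
    if (tool === "implement" && !hasNavigated) {
      return { ok: false, hint: "Run search or graph before implement." };
    }
    this.called.add(tool);
    return { ok: true };
  }
}
```

In this sketch a rejected `implement` call is not recorded, so the session state only reflects tools that actually ran.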
Remove unused eval layers (layer0-baseline, layer1-tool, compare) and
their scripts. Add executive-report.md with manual quality assessment:
70% of answers rated GOOD/ACCEPTABLE. Update gemini-answers with latest
eval run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Echo Xiao seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

…swers with improved formatting

- Add layer0-baseline-eval and layer1-tool-eval scripts and logs
- Add compare.ts eval utility
- Update all gemini answer logs with refined content
- Update layer2-agent-eval with latest results
- Replace executive-report.md with executive-report-flash.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
echo-xiao changed the title from "Iter 7: Graph-native code analyzer with eval framework" to "Iter 7: Eval framework, architecture extraction, and tool behavior refactor" on Jun 9, 2026