Iter 7: Eval framework, architecture extraction, and tool behavior refactor #7
Open
echo-xiao wants to merge 4 commits into
… eval

Major changes:
- Extract architecture knowledge from AGENTS.md to architecture.json (30 entries, source-verified)
- AGENTS.md stripped to pure rules (tool order, answer format, navigation strategy)
- implement: class skeleton mode (10K+ → ~500 tokens), ClassName.methodName support
- implement: enforce search/graph before implement (session tracking)
- search/graph/implement: navigation hints in responses
- graph: architecture context injection from architecture.json
- grep: sorted by relevance, limited to top 10
- Callee skeletons removed (graph(down) replaces at 1/10th cost)

Bug fixes:
- retriever.ts: callee skeleton bug (objects treated as strings)
- Remove 5 unused deps, unused exports, stale params

Results:
- L1: 25/34 (unchanged, no artificial inflation)
- L2 tokens: 69K → 29K avg/question (-58%)
- L2 implement: 3,070 → 544 avg tokens (-82%)
- Claude judge: 6 GOOD, 16 ACCEPTABLE, 9 WEAK, 3 WRONG (65% usable)

Eval renames: tool-eval → layer1-tool-eval, agent-eval → layer2-agent-eval
New: compare.ts generates 3-way comparison report, comparison-report.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
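The percentages in the Results list can be sanity-checked from the raw figures quoted alongside them. The sketch below is illustrative only: the helper names `percentChange` and `percentShare` are assumptions made for this example and do not come from the PR, and "usable" is assumed to mean GOOD plus ACCEPTABLE.

```typescript
// Recompute the percentages quoted in the Results list from the raw figures.
// Helper names are hypothetical; they are not part of the PR's code.

/** Relative change from `before` to `after`, rounded to a whole percent. */
function percentChange(before: number, after: number): number {
  return Math.round(((after - before) / before) * 100);
}

/** Share of `part` in `total`, rounded to a whole percent. */
function percentShare(part: number, total: number): number {
  return Math.round((part / total) * 100);
}

// Figures copied from the Results list.
const l2TokensChange = percentChange(69_000, 29_000); // L2 tokens per question
const l2ImplementChange = percentChange(3_070, 544);  // L2 implement tokens

// Claude judge verdict counts; "usable" is assumed to be GOOD + ACCEPTABLE.
const judge = { good: 6, acceptable: 16, weak: 9, wrong: 3 };
const judged = judge.good + judge.acceptable + judge.weak + judge.wrong;
const usable = percentShare(judge.good + judge.acceptable, judged);

console.log({ l2TokensChange, l2ImplementChange, judged, usable });
```

Under these assumptions the recomputed values agree with the quoted -58%, -82%, and 65%, and the four judge verdicts sum to the same 34 questions used for the L1 score.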
Remove unused eval layers (layer0-baseline, layer1-tool, compare) and their scripts. Add executive-report.md with manual quality assessment: 70% of answers rated GOOD/ACCEPTABLE. Update gemini-answers with latest eval run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Echo Xiao seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
…swers with improved formatting

- Add layer0-baseline-eval and layer1-tool-eval scripts and logs
- Add compare.ts eval utility
- Update all gemini answer logs with refined content
- Update layer2-agent-eval with latest results
- Replace executive-report.md with executive-report-flash.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
architecture.json and refactored registry.ts/retriever.ts for graph-native code analysis
agent-eval.md
Executive Report
$(cat logs/executive-report-flash.md)
Layer 0 — Baseline Eval (No Tools)
$(cat logs/layer0-baseline-eval.md)
Layer 1 — Tool Eval
$(cat logs/layer1-tool-eval.md)
Layer 2 — Agent Eval (Summary)
By Question Type
By Difficulty
🤖 Generated with Claude Code