feat: add experimental BinEval evaluation support #42100
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
|
@copilot run inference in AWF and use /reflect to find a suitable inference endpoint.
|
Hey A few things that stand out positively:
This looks ready for review. 🚀
Warning: Firewall blocked 1 domain
The following domain was blocked by the firewall during workflow execution:
network:
  allowed:
    - defaults
    - "patchdiff.githubusercontent.com"
See Network Configuration for more information.
|
- Remove direct GitHub Models API calls from eval_harness.cjs; keep only shared utility functions (readEvalSpec, buildEvalPrompt, aggregateResults, renderMarkdownSummary, sanitizeEvalError)
- Add actions/setup/md/eval.md: eval prompt template instructing the engine to output EVAL_RESULT:{...json...}
- Add actions/setup/js/setup_eval.cjs: prompt setup script (mirrors setup_threat_detection.cjs)
- Add actions/setup/js/parse_eval_results.cjs: result parser extracting EVAL_RESULT from engine log
- Update compiler_evals.go: eval job now follows the detection job pattern — pulls AWF containers, clears MCP config, installs the agentic engine, runs it inside AWF, parses results
- Add EvalLogPath, EvalDir, DefaultEvalMaxAICredits constants to pkg/constants
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
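For reviewers who want a concrete picture of the EVAL_RESULT extraction described in this commit, here is a minimal sketch. It is not the code in parse_eval_results.cjs: the function name `extractEvalResult`, the "last valid marker wins" rule, and the plain-text-only handling are all assumptions made for illustration. Stream-json logs are not handled here.

```javascript
"use strict";

// Marker the eval prompt asks the engine to emit, per the commit message.
const MARKER = "EVAL_RESULT:";

// Hypothetical helper (not from the PR): scans a plain-text engine log for
// the last line that carries the marker followed by a valid JSON object.
function extractEvalResult(logText) {
  const lines = String(logText).split(/\r?\n/);
  // Walk backwards so a later result wins over an earlier one (assumption).
  for (let i = lines.length - 1; i >= 0; i--) {
    const idx = lines[i].indexOf(MARKER);
    if (idx === -1) continue;
    const payload = lines[i].slice(idx + MARKER.length).trim();
    try {
      const parsed = JSON.parse(payload);
      if (parsed !== null && typeof parsed === "object" && !Array.isArray(parsed)) {
        return parsed;
      }
    } catch {
      // Malformed JSON on this line: keep looking at earlier lines.
    }
  }
  // No usable marker line was found.
  return null;
}

module.exports = { extractEvalResult };
```

Returning `null` instead of throwing leaves it to the caller to decide whether a missing result should fail the job. The payload fields are left opaque here because the commit message does not spell out the result schema.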
|
@copilot keep working
|
Add missing test files for the two BinEval JS modules introduced in the AWF-engine refactor:
- setup_eval.test.cjs: 14 tests covering prompt template rendering, missing/empty context files, eval spec parsing, and step summary output
- parse_eval_results.test.cjs: 25 tests covering EVAL_RESULT extraction from plain and stream-json logs, main() error/success paths, and result normalisation
Mirrors the test coverage pattern of setup_threat_detection.test.cjs and parse_threat_detection_results.test.cjs.
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
🤖 PR Triage — §28332039983
Rationale: Large experimental draft (+2181 lines). Adds a BinEval-style LLM evaluation harness in workflow frontmatter: a significant new capability, but high scope/risk. No CI results yet (merge state: UNSTABLE). Defer until the draft is promoted to ready and CI passes.
Labels applied:
|
🤖 PR Triage — §28342769269
Score breakdown: Impact 25 · Urgency 6 · Quality 7
Rationale: Experimental BinEval LLM evaluation harness (+2181 lines). Large addition, still draft with no CI checks passing. ~10h old with no new activity. Defer until promoted to ready and CI established.
|
🤖 PR Triage — §28357644191
Carried over — 17.3h old. Experimental BinEval LLM evaluation harness (+2181 lines). Large addition, no CI yet, draft. Defer until promoted to ready and CI validates.
|
🤖 PR Triage — §28376613466
Score breakdown: Impact 20 + Urgency 5 + Quality 5
Rationale: Adds native BinEval-style evaluations: a large experimental addition (2181+/0−, 19 files), draft, no CI. High risk, no reviewer engagement yet. Defer until promoted from draft, CI passes, and the feature scope is narrowed or approved.
|
🔍 PR Triage — §28395315609
Score breakdown: impact 15 + urgency 3 + quality 8
|
Adds native BinEval-style evaluations to gh-aw — small, binary questions declared in workflow frontmatter, executed post-run via an LLM harness, with results aggregated and reported as CI artifacts.
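To make the "aggregated and reported" step concrete, here is a rough sketch of how binary results could be rolled up into a markdown summary. The result shape `{ id, passed }` and the function names `summarize` and `toMarkdown` are assumptions for illustration only; they are not the signatures of the harness utilities in this PR.

```javascript
"use strict";

// Assumed result shape (not taken from the PR): { id: string, passed: boolean }.
// Counts how many binary evals passed and failed.
function summarize(results) {
  const total = results.length;
  const passed = results.filter(r => r.passed === true).length;
  return { total, passed, failed: total - passed };
}

// Renders a headline plus one table row per eval, as plain markdown text.
function toMarkdown(results) {
  const { total, passed, failed } = summarize(results);
  const rows = results.map(r => `| ${r.id} | ${r.passed ? "yes" : "no"} |`);
  return [
    `Evals: ${passed}/${total} passed, ${failed} failed`,
    "",
    "| id | passed |",
    "| --- | --- |",
    ...rows,
  ].join("\n");
}

module.exports = { summarize, toMarkdown };
```

Anything other than a strict `true` counts as a failure in this sketch, so a missing or malformed answer cannot silently inflate the pass count.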
Schema (evals frontmatter)
- evals array with id + question fields; validated for unique IDs and non-empty questions
Evaluation model
- EvalDefinition, EvalResult, EvalSummary types in frontmatter_types.go
- WorkflowData.Evals []EvalDefinition for downstream consumers
Eval job
- eval job injected after agent + detection jobs in the compiled workflow
- Harness (eval_harness.cjs) calls GitHub Models API (gpt-4o-mini) per question independently (no MCPs, no checkout)
- eval artifact with a markdown step summary
Not included
Phase 8 (persisting results to a git branch, à la experiments) is deferred.
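As an illustration of the two schema rules listed above (unique IDs, non-empty questions), the check could look roughly like the sketch below. The PR puts the evaluation types in frontmatter_types.go; this JavaScript version is only a hypothetical mirror of the rules. The name `validateEvals`, the error wording, and the extra requirement that `id` be non-empty are assumptions.

```javascript
"use strict";

// Hypothetical validator for the evals frontmatter array.
// Returns a list of human-readable errors; an empty list means valid.
function validateEvals(evals) {
  if (!Array.isArray(evals)) {
    return ["evals must be an array"];
  }
  const errors = [];
  const seen = new Set();
  evals.forEach((entry, index) => {
    const id = entry && typeof entry.id === "string" ? entry.id.trim() : "";
    const question =
      entry && typeof entry.question === "string" ? entry.question.trim() : "";
    if (id === "") {
      // Assumption: an empty or missing id is rejected as well.
      errors.push(`evals[${index}]: id is required`);
    } else if (seen.has(id)) {
      errors.push(`evals[${index}]: duplicate id "${id}"`);
    } else {
      seen.add(id);
    }
    if (question === "") {
      errors.push(`evals[${index}]: question must not be empty`);
    }
  });
  return errors;
}

module.exports = { validateEvals };
```

Collecting every error rather than stopping at the first one lets a workflow author fix all frontmatter problems in a single pass.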