feat: vitest assertions #926
Description
Status: P3 — Research complete, defer implementation
Research into Vitest assertion patterns for AI eval is complete. Conclusion: AgentV's existing SDK already surpasses what Vitest matchers would offer. A Vitest bridge is a CI/DX convenience, not a capability gap.
Why AgentV Doesn't Need This Yet
AgentV already has two grading layers that cover the same ground:
@agentv/core — In-process TypeScript SDK
```typescript
await evaluate({
  tests: [{
    id: 'test-1',
    input: 'Capital of France?',
    assert: [
      { type: 'contains', value: 'Paris' },                                  // 22+ built-in types
      { type: 'llm-grader', prompt: './rubric.md' },                         // LLM-as-judge
      ({ output }) => ({ name: 'len', score: output.length > 10 ? 1 : 0 }),  // inline fn
    ],
  }],
  task: async (input) => callMyAgent(input),
});
```
@agentv/eval — Subprocess graders (stdin/stdout)
```typescript
// .agentv/assertions/sentiment.ts — auto-discovered by filename
import { defineAssertion } from '@agentv/eval';

export default defineAssertion(({ output }) => ({
  pass: true,
  score: 0.85,
  assertions: [{ text: 'Professional tone', passed: true }],
}));
```
Advantages over Vitest expect.extend() matchers
- Continuous scoring (0–1) — Vitest matchers return a binary `pass: boolean`
- Multi-assertion results per evaluator — an `assertions[]` array with per-criterion verdicts
- Subprocess isolation — `defineCodeGrader`/`defineAssertion` run in separate processes
- Auto-discovery — drop a `.ts` file into `.agentv/assertions/`; the assertion type name comes from the filename
- YAML-first — AI agents can read/write EVAL.yaml without TypeScript
- Composite evaluators — weighted average, threshold, and code/LLM aggregation built in
What a Vitest Bridge Would Look Like (If Implemented)
A thin @agentv/vitest package — NOT a rewrite of the assertion engine:
```typescript
import { installAgentVMatchers } from '@agentv/vitest';

installAgentVMatchers();

test('my eval', async () => {
  const { summary } = await evaluate({
    specFile: './evals/EVAL.yaml',
    task: async (input) => myAgent(input),
  });
  expect(summary.passed).toBe(summary.total);
});
```
Or wrapping individual assertions as custom matchers:
```typescript
// Uses expect.extend() under the hood
await expect(response).toPassAgentVAssertion({ type: 'llm-grader', prompt: './rubric.md' });
await expect(response).toPassAgentVAssertion({ type: 'contains', value: 'Paris' });
```
Industry Context
Frameworks that have Vitest/Jest integration:
- Promptfoo: `installMatchers()` → 4 custom matchers (`toMatchSemanticSimilarity`, `toPassLLMRubric`, etc.)
- LangSmith: `ls.describe()`/`ls.test()` wrappers around native Vitest
- vitest-evals (Sentry): `describeEval()` with multi-scorer + threshold
- Evalite (Matt Pocock): Vitest runner + web UI + trace capture
These serve users who want eval assertions inside existing Vitest test suites. AgentV's evaluate() API serves users who want a dedicated eval framework. Different audiences.
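For reference, a `toPassAgentVAssertion` matcher would be registered through Vitest's `expect.extend()`. A minimal sketch of the matcher body, where `runAssertion` is a hypothetical stand-in for AgentV's assertion engine (not a real export):

```typescript
// Sketch of a matcher body that expect.extend({ toPassAgentVAssertion })
// could register. `runAssertion` is a hypothetical stub, not AgentV's API.
type Assertion = { type: string; value?: string };
type MatcherResult = { pass: boolean; message: () => string };

function runAssertion(output: string, a: Assertion): number {
  // Stub: handles only 'contains'; the real engine covers 22+ types.
  if (a.type === 'contains') return output.includes(a.value ?? '') ? 1 : 0;
  throw new Error(`unsupported assertion type: ${a.type}`);
}

function toPassAgentVAssertion(received: string, a: Assertion): MatcherResult {
  const score = runAssertion(received, a);
  return {
    pass: score >= 0.5,
    message: () => `AgentV assertion '${a.type}' scored ${score}`,
  };
}

// With Vitest installed, registration would be:
//   expect.extend({ toPassAgentVAssertion });
```

Vitest matchers must return `{ pass, message }`, which is exactly where the continuous score gets flattened to a boolean.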
Decision
Defer. Focus on:
- feat: configurable threshold + naming taxonomy cleanup #925 — Per-assertion configurable thresholds (P0)
- Shared assertion templates (P1)
- AND/OR logic operators (P2)
Revisit the Vitest bridge if user demand emerges. The integration surface is small (~100 LOC), so it can be built quickly when needed.
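The deferred threshold and AND/OR items above can be sketched in a few lines over 0–1 scores. Everything here is hypothetical shape, not the design #925 will actually land:

```typescript
// Hypothetical sketch of per-assertion thresholds plus AND/OR composition
// over continuous scores. None of these names exist in AgentV today.
type Scored = { score: number; threshold?: number };

const passes = (a: Scored): boolean => a.score >= (a.threshold ?? 0.5);

const allOf = (...as: Scored[]): boolean => as.every(passes); // AND
const anyOf = (...as: Scored[]): boolean => as.some(passes);  // OR

const llmGrade = { score: 0.72, threshold: 0.8 }; // fails its own bar
const contains = { score: 1.0 };                  // default threshold 0.5

console.log(allOf(llmGrade, contains)); // false: llm grade misses 0.8
console.log(anyOf(llmGrade, contains)); // true
```

Because assertions already carry continuous scores, both features compose without touching the grading layers described above.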
Research
Full findings: agentevals-research/research/findings/test-framework-assertion-patterns/README.md