
feat: configurable threshold + naming taxonomy cleanup #925

@christso

Problem

AgentV has threshold/scoring naming inconsistencies across surfaces, and PASS_THRESHOLD = 0.8 is hardcoded in places that should respect the configurable execution.threshold.

Naming Taxonomy Audit

Concept A — Pass/fail boundary ("what score counts as passing?")

| Surface | Current name | Consistent? |
| --- | --- | --- |
| CLI | --threshold | Yes |
| YAML | execution.threshold | Yes |
| Orchestrator | threshold param | Yes |
| Studio config | studio.pass_threshold | No — should be threshold |
| Core constant | PASS_THRESHOLD | No — should be DEFAULT_THRESHOLD |
| Composite aggregator | { type: 'threshold', threshold: 0.7 } | Yes (different concept — aggregation strategy) |

Industry comparison: DeepEval, Promptfoo, vitest-evals all use threshold. Nobody uses pass_threshold.

Concept B — Required gate ("if this evaluator fails, zero the aggregate")

| Surface | Current name | Consistent? |
| --- | --- | --- |
| Assertion | required: true | Yes |
| Rubric | required: true | Yes |

This is fine — boolean required is clear.

Concept C — Minimum score ("what's the minimum acceptable score?")

| Surface | Current name | Scale | Consistent? |
| --- | --- | --- | --- |
| Assertion | required: 0.7 | 0-1 | No — overloads boolean required with completely different semantics |
| Rubric | required_min_score: 7 | 0-10 | No — different name, different scale from everything else |

Both should be min_score on a 0-1 scale.

Scoring scale principle: Everything the user configures as a threshold or boundary is 0-1. The only place 0-10 appears in YAML is score_ranges on rubric criteria — those aren't thresholds, they're label definitions describing what each integer band means to the LLM grader.

Agent Skills format: no threshold concept at all. Assertions are strings promoted to llm-grader, and pass_rate is a computed output metric, not a configurable boundary, so there is no naming conflict.

Hardcoded Threshold Bugs

  1. scoreToVerdict(score) ignores configurable threshold — always uses hardcoded 0.8
  2. evaluate() API computeSummary() — uses PASS_THRESHOLD directly
  3. required: true fallback: const minScore = typeof entry.required === 'number' ? entry.required : PASS_THRESHOLD falls back to the hardcoded constant; it should use the test threshold instead
  4. Per-test execution.threshold — schema allows it but yaml-parser only reads skip_defaults from per-test execution, ignores threshold

Proposed Changes

1. Rename PASS_THRESHOLD → DEFAULT_THRESHOLD

// packages/core/src/evaluation/evaluators/scoring.ts
export const DEFAULT_THRESHOLD = 0.8;
/** @deprecated Use DEFAULT_THRESHOLD */
export const PASS_THRESHOLD = DEFAULT_THRESHOLD;

2. Make scoreToVerdict() threshold-aware

export function scoreToVerdict(score: number, threshold = DEFAULT_THRESHOLD): EvaluationVerdict {
  return score >= threshold ? 'pass' : 'fail';
}

Thread caseThreshold (already available in orchestrator) to all call sites.

3. Wire per-test execution.threshold in yaml-parser

The schema already allows execution.threshold on test cases, but the parser ignores it. Wire it through so each test can override the suite threshold.

Resolution order: --threshold (CLI) > test execution.threshold > suite execution.threshold > DEFAULT_THRESHOLD (0.8)
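
The resolution order above can be sketched as a single fallback chain. This is a minimal illustration, not the orchestrator's actual code: resolveThreshold and the ThresholdSources shape are hypothetical names introduced here.

```typescript
// Hypothetical helper; names and shapes are illustrative, not AgentV's API.
const DEFAULT_THRESHOLD = 0.8;

interface ThresholdSources {
  readonly cli?: number;   // --threshold
  readonly test?: number;  // per-test execution.threshold
  readonly suite?: number; // suite-level execution.threshold
}

function resolveThreshold(sources: ThresholdSources): number {
  // `??` only skips null/undefined, so an explicit threshold of 0 is honoured.
  return sources.cli ?? sources.test ?? sources.suite ?? DEFAULT_THRESHOLD;
}

console.log(resolveThreshold({ suite: 0.6, test: 0.9 }));
```

Using `??` rather than `||` matters here: with `||`, a deliberately permissive threshold of 0 would be treated as unset and fall through to the next source.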

4. Fix evaluate() API

export interface EvalConfig {
  readonly threshold?: number;  // NEW
}

function computeSummary(results, durationMs, threshold = DEFAULT_THRESHOLD): EvalSummary { ... }
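
As a rough sketch of what a threshold-aware summary could look like: the ResultLike and SummaryLike shapes and the summarizeWithThreshold name below are assumptions made for illustration, since the real EvalSummary fields are not shown in this issue.

```typescript
const DEFAULT_THRESHOLD = 0.8;

// Hypothetical shapes; the real result and EvalSummary types may differ.
interface ResultLike {
  readonly score: number;
}

interface SummaryLike {
  readonly total: number;
  readonly passed: number;
  readonly failed: number;
  readonly durationMs: number;
}

function summarizeWithThreshold(
  results: readonly ResultLike[],
  durationMs: number,
  threshold: number = DEFAULT_THRESHOLD,
): SummaryLike {
  // A result passes when its score reaches the caller-supplied threshold.
  const passed = results.filter((r) => r.score >= threshold).length;
  return { total: results.length, passed, failed: results.length - passed, durationMs };
}

const sample = [{ score: 0.75 }, { score: 0.9 }];
console.log(summarizeWithThreshold(sample, 10));      // default threshold
console.log(summarizeWithThreshold(sample, 10, 0.7)); // caller override
```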

5. Split required: number → required: boolean + min_score: number on assertions

# Before (confusing — is required boolean or number?)
assertions:
  - type: llm-grader
    prompt: ./safety.md
    required: 0.9

# After (clear)
assertions:
  - type: llm-grader
    prompt: ./safety.md
    required: true
    min_score: 0.9
  • required: boolean — gate semantics (if fails, aggregate → 0)
  • min_score: number (0-1) — minimum acceptable score for this evaluator
  • min_score without required: true still sets the score floor but doesn't gate the aggregate
  • Deprecation: required: number continues to work, parsed as required: true + min_score: <value>
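
The deprecation rule in the last bullet could be handled by a small normalization step at parse time. The sketch below is hypothetical (normalizeGate, RawAssertion, and NormalizedGate are not names from the codebase), and it assumes an explicit min_score wins when both forms are present, which this issue does not specify.

```typescript
// Hypothetical raw shape as it might appear after YAML parsing.
interface RawAssertion {
  required?: boolean | number;
  min_score?: number;
}

interface NormalizedGate {
  required: boolean;
  minScore: number | undefined;
}

function normalizeGate(raw: RawAssertion): NormalizedGate {
  if (typeof raw.required === 'number') {
    // Deprecated form: required: 0.9 is read as required: true + min_score: 0.9.
    // Assumption: an explicit min_score takes precedence over the legacy number.
    return { required: true, minScore: raw.min_score ?? raw.required };
  }
  return { required: raw.required === true, minScore: raw.min_score };
}

console.log(normalizeGate({ required: 0.9 }));
console.log(normalizeGate({ min_score: 0.7 }));
```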

6. Rename required_min_score → min_score on rubrics, change to 0-1 scale

# Before (0-10 scale, verbose name)
rubrics:
  - id: accuracy
    outcome: "Factually correct"
    required_min_score: 7
    score_ranges:
      - score_range: [0, 5]
        outcome: "Incomplete"
      - score_range: [6, 10]
        outcome: "Satisfactory"

# After (0-1 scale, consistent name)
rubrics:
  - id: accuracy
    outcome: "Factually correct"
    min_score: 0.7
    score_ranges:            # stays 0-10 — these describe LLM integer output bands
      - score_range: [0, 5]
        outcome: "Incomplete"
      - score_range: [6, 10]
        outcome: "Satisfactory"
  • min_score is 0-1 (internally multiplied by 10 for comparison against LLM's 0-10 output)
  • score_ranges stay 0-10 — they're label definitions for the LLM's integer scoring, not user thresholds
  • required_min_score continues to work as deprecated alias (value interpreted as 0-10, converted to 0-1)
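
A minimal sketch of the scale handling described in these bullets, assuming hypothetical helper names (rubricMinScore, meetsRubricMin) and an assumed precedence of min_score over the deprecated alias when both are set:

```typescript
// Hypothetical raw rubric fields as they might appear after YAML parsing.
interface RawRubric {
  min_score?: number;          // new field, 0-1
  required_min_score?: number; // deprecated alias, 0-10
}

// Returns the minimum score on the 0-1 scale, or undefined if none is set.
function rubricMinScore(raw: RawRubric): number | undefined {
  if (raw.min_score !== undefined) return raw.min_score;
  if (raw.required_min_score !== undefined) return raw.required_min_score / 10;
  return undefined;
}

// Compares the LLM's 0-10 integer score against a 0-1 minimum.
function meetsRubricMin(llmScore: number, minScore: number): boolean {
  // The small epsilon guards against floating-point error in minScore * 10.
  return llmScore >= minScore * 10 - 1e-9;
}

console.log(rubricMinScore({ required_min_score: 7 }));
console.log(meetsRubricMin(7, 0.7), meetsRubricMin(6, 0.7));
```

The epsilon is a defensive choice for this sketch; rounding the scaled value or comparing on the 0-1 side would work equally well.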

7. Rename studio pass_threshold → threshold

# .agentv/config.yaml
# Before
studio:
  pass_threshold: 0.8

# After
studio:
  threshold: 0.8
  • Read pass_threshold as fallback for backward compat
  • Write threshold on save
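
The read-fallback and write-new-name behaviour could look roughly like this. The function names are hypothetical, and dropping the legacy key on save is an assumption of this sketch rather than something the issue requires.

```typescript
// Hypothetical view of the studio section of .agentv/config.yaml after parsing.
interface RawStudioConfig {
  threshold?: number;
  pass_threshold?: number; // deprecated name, still read for backward compat
  [key: string]: unknown;
}

function readStudioThreshold(raw: RawStudioConfig, fallback = 0.8): number {
  // Prefer the new name, then the deprecated one, then the default.
  return raw.threshold ?? raw.pass_threshold ?? fallback;
}

function toSavedStudioConfig(raw: RawStudioConfig, threshold: number): Record<string, unknown> {
  // Assumption: the legacy key is dropped on save so only `threshold` is written.
  const { pass_threshold: _legacy, ...rest } = raw;
  return { ...rest, threshold };
}

const legacyConfig: RawStudioConfig = { pass_threshold: 0.6 };
console.log(readStudioThreshold(legacyConfig));
console.log(toSavedStudioConfig(legacyConfig, readStudioThreshold(legacyConfig)));
```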

Scoring Scale Summary

Everything the user configures is 0-1. The only 0-10 in YAML is score_ranges (LLM output band labels, not a threshold).

| Level | Field | Scale |
| --- | --- | --- |
| Suite/test | execution.threshold | 0-1 |
| CLI | --threshold | 0-1 |
| All graders | min_score | 0-1 |
| Rubric criteria | min_score | 0-1 (internally × 10 for LLM comparison) |
| Rubric criteria | score_ranges | 0-10 (LLM integer band labels — NOT a threshold) |

Files to modify

| File | Change |
| --- | --- |
| packages/core/src/evaluation/evaluators/scoring.ts | Rename constant, add threshold param to scoreToVerdict() |
| packages/core/src/evaluation/orchestrator.ts | Thread threshold to scoreToVerdict(), fix required: true fallback |
| packages/core/src/evaluation/evaluate.ts | Accept threshold in EvalConfig, pass to computeSummary() |
| packages/core/src/evaluation/yaml-parser.ts | Read per-test execution.threshold, pass to orchestrator |
| packages/core/src/evaluation/validation/eval-file.schema.ts | Add min_score to EvaluatorCommonSchema and RubricItemSchema (0-1 scale) |
| packages/core/src/evaluation/loaders/evaluator-parser.ts | Parse min_score, handle deprecated required: number and required_min_score (with 0-10 → 0-1 conversion) |
| packages/core/src/evaluation/evaluators/llm-grader.ts | Convert rubric min_score from 0-1 to 0-10 for internal LLM comparison |
| packages/core/src/evaluation/types.ts | Update type definitions |
| packages/core/src/evaluation/evaluators/index.ts | Re-export DEFAULT_THRESHOLD + deprecated PASS_THRESHOLD |
| apps/cli/src/commands/results/studio-config.ts | Read/write threshold, fallback to pass_threshold |
| apps/cli/src/commands/eval/benchmark-writer.ts | Import DEFAULT_THRESHOLD instead of local duplicate |

Backward compatibility

All changes use deprecation aliases — no breaking changes:

  • PASS_THRESHOLD re-exported as deprecated
  • required: number parsed as required: true + min_score
  • required_min_score accepted alongside min_score (value in 0-10 converted to 0-1 internally)
  • pass_threshold read as fallback in studio config

Non-goals

  • Changing the default value (0.8 stays)
  • Changing score_ranges to 0-1 (they describe LLM integer output bands)
  • Renaming threshold on latency evaluator (different concept — max ms)
  • Renaming composite aggregator type: 'threshold' (different concept — aggregation strategy)
  • Adding threshold to Agent Skills format (they don't have it, no conflict)

Documentation Requirements

The implementer MUST update documentation to make the scoring scale and naming clear:

  1. eval-schema.json — regenerate after schema changes (bun run generate:schema)
  2. README.md — update any threshold/scoring examples to use new field names
  3. YAML schema JSDoc — add clear comments on min_score explaining 0-1 scale at all levels
  4. Migration note — document deprecated fields (required: number, required_min_score, pass_threshold) and their new equivalents in a visible location (CHANGELOG or migration section)
  5. Scoring scale principle — document in CLAUDE.md or schema comments: "All user-configurable score thresholds use 0-1 scale. The only 0-10 values in YAML are score_ranges which define LLM integer output band labels."

min_score Behavior (Clarification)

min_score and required are independent:

| Config | Evaluator verdict | Aggregate effect |
| --- | --- | --- |
| min_score: 0.7 only | fail if score < 0.7, pass if >= 0.7 | Score contributes to weighted average normally |
| required: true only | fail if score < threshold (suite/test level) | If fails, entire aggregate → 0 |
| min_score: 0.7 + required: true | fail if score < 0.7 | If score < 0.7, entire aggregate → 0 |
| Neither | fail if score < threshold (suite/test level) | Score contributes normally |

min_score controls the evaluator's own verdict. required controls whether failing gates the aggregate. They compose independently.
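
The composition rule can be sketched in a few lines. This is an illustration under assumptions: the EvaluatorResult shape, the per-evaluator weight field, and both function names are hypothetical, and the weighted average stands in for whatever aggregation the orchestrator actually uses.

```typescript
type Verdict = 'pass' | 'fail';

// Hypothetical per-evaluator result; field names are illustrative.
interface EvaluatorResult {
  readonly score: number;     // 0-1
  readonly weight: number;
  readonly required?: boolean;
  readonly minScore?: number; // 0-1, overrides the suite/test threshold
}

function evaluatorVerdict(r: EvaluatorResult, threshold: number): Verdict {
  // min_score, when set, is the bar for this evaluator's own verdict.
  const bar = r.minScore ?? threshold;
  return r.score >= bar ? 'pass' : 'fail';
}

function aggregateScore(results: readonly EvaluatorResult[], threshold: number): number {
  // required only decides whether a failing verdict zeroes the aggregate.
  const gated = results.some((r) => r.required === true && evaluatorVerdict(r, threshold) === 'fail');
  if (gated) return 0;
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight === 0) return 0;
  return results.reduce((sum, r) => sum + r.score * r.weight, 0) / totalWeight;
}

const ungated = [{ score: 0.5, weight: 1, minScore: 0.7 }, { score: 1, weight: 1 }];
const gatedCase = [{ score: 0.5, weight: 1, minScore: 0.7, required: true }, { score: 1, weight: 1 }];
console.log(aggregateScore(ungated, 0.8), aggregateScore(gatedCase, 0.8));
```

In the first case the evaluator fails its own verdict but still contributes its score to the average; in the second, the same failure zeroes the aggregate because required is set.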

Acceptance Signals

  1. Threshold threading: scoreToVerdict() accepts and uses configurable threshold at all call sites
  2. Per-test threshold: execution.threshold on a test case overrides suite threshold — verified by test
  3. Resolution order: CLI --threshold > test execution.threshold > suite execution.threshold > DEFAULT_THRESHOLD (0.8) — verified by test
  4. evaluate() API: EvalConfig.threshold is respected in computeSummary() — verified by test
  5. min_score on assertions: new field accepted in YAML, controls per-evaluator verdict — verified by test
  6. min_score on rubrics: new field accepted in YAML (0-1 scale), internally converted to 0-10 — verified by test
  7. Deprecation aliases work: required: 0.7 parsed as required: true + min_score: 0.7; required_min_score: 7 parsed as min_score: 0.7 — verified by test
  8. Studio config: reads threshold, falls back to pass_threshold — verified by test
  9. Schema regenerated: bun run generate:schema produces updated eval-schema.json
  10. All existing tests pass: no regressions from rename/deprecation
  11. Documentation updated: per Documentation Requirements section above
