feat: configurable threshold + naming taxonomy cleanup #925

## Problem

AgentV has threshold/scoring naming inconsistencies across surfaces, and `PASS_THRESHOLD = 0.8` is hardcoded in places that should respect the configurable `execution.threshold`.

### Naming Taxonomy Audit

**Concept A — Pass/fail boundary** ("what score counts as passing?")

| Surface | Current name | Consistent? |
|---------|-------------|-------------|
| CLI | `--threshold` | Yes |
| YAML | `execution.threshold` | Yes |
| Orchestrator | `threshold` param | Yes |
| Studio config | `studio.pass_threshold` | **No** — should be `threshold` |
| Core constant | `PASS_THRESHOLD` | **No** — should be `DEFAULT_THRESHOLD` |
| Composite aggregator | `{ type: 'threshold', threshold: 0.7 }` | Yes (different concept — aggregation strategy) |

Industry comparison: DeepEval, Promptfoo, and vitest-evals all use `threshold`; none of them uses `pass_threshold`.

**Concept B — Required gate** ("if this evaluator fails, zero the aggregate")

| Surface | Current name | Consistent? |
|---------|-------------|-------------|
| Assertion | `required: true` | Yes |
| Rubric | `required: true` | Yes |

This is fine — boolean `required` is clear.

**Concept C — Minimum score** ("what's the minimum acceptable score?")

| Surface | Current name | Scale | Consistent? |
|---------|-------------|-------|-------------|
| Assertion | `required: 0.7` | 0-1 | **No** — overloads boolean `required` with completely different semantics |
| Rubric | `required_min_score: 7` | 0-10 | **No** — different name, different scale from everything else |

Both should be `min_score` on a **0-1 scale**. 

**Scoring scale principle**: Everything the user configures as a threshold or boundary is **0-1**. The only place 0-10 appears in YAML is `score_ranges` on rubric criteria — those aren't thresholds, they're label definitions describing what each integer band means to the LLM grader.

**Agent Skills format**: this format has no threshold concept at all. Assertions are strings promoted to `llm-grader`, and `pass_rate` is a computed output metric, not a configurable boundary, so there is no naming conflict.

### Hardcoded Threshold Bugs

1. **`scoreToVerdict(score)` ignores the configurable threshold**: it always compares against the hardcoded 0.8.
2. **`computeSummary()` in the `evaluate()` API** uses `PASS_THRESHOLD` directly.
3. **`required: true` fallback**: `const minScore = typeof entry.required === 'number' ? entry.required : PASS_THRESHOLD` should fall back to the resolved test threshold, not the constant.
4. **Per-test `execution.threshold`**: the schema allows it, but yaml-parser reads only `skip_defaults` from per-test `execution` and ignores `threshold`.

## Proposed Changes

### 1. Rename `PASS_THRESHOLD` → `DEFAULT_THRESHOLD`

```typescript
// packages/core/src/evaluation/evaluators/scoring.ts
export const DEFAULT_THRESHOLD = 0.8;
/** @deprecated Use DEFAULT_THRESHOLD */
export const PASS_THRESHOLD = DEFAULT_THRESHOLD;
```

### 2. Make `scoreToVerdict()` threshold-aware

```typescript
export function scoreToVerdict(score: number, threshold = DEFAULT_THRESHOLD): EvaluationVerdict {
  return score >= threshold ? 'pass' : 'fail';
}
```

Thread `caseThreshold` (already available in the orchestrator) through to every `scoreToVerdict()` call site.

### 3. Wire per-test `execution.threshold` in yaml-parser

The schema already allows `execution.threshold` on test cases, but the parser ignores it. Wire it through so each test can override the suite threshold.

Resolution order: `--threshold` (CLI) > test `execution.threshold` > suite `execution.threshold` > `DEFAULT_THRESHOLD` (0.8)
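
The resolution order can be sketched as a nullish-coalescing chain. This is a minimal illustration, not the actual parser or orchestrator code: the `ThresholdSources` shape and the `resolveThreshold` name are hypothetical.

```typescript
const DEFAULT_THRESHOLD = 0.8;

// Hypothetical container for the three places a threshold can be configured.
interface ThresholdSources {
  readonly cli?: number;   // --threshold
  readonly test?: number;  // per-test execution.threshold
  readonly suite?: number; // suite-level execution.threshold
}

// First defined value wins: CLI > test > suite > DEFAULT_THRESHOLD.
// `??` is used instead of `||` so an explicit threshold of 0 is respected.
function resolveThreshold(sources: ThresholdSources): number {
  return sources.cli ?? sources.test ?? sources.suite ?? DEFAULT_THRESHOLD;
}
```

Using `??` rather than `||` matters here because `0` is a legitimate value on a 0-1 scale and would otherwise be skipped as falsy.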

### 4. Fix `evaluate()` API

```typescript
export interface EvalConfig {
  readonly threshold?: number;  // NEW
}

function computeSummary(results, durationMs, threshold = DEFAULT_THRESHOLD): EvalSummary { ... }
```
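
As a rough illustration of how the threshold would reach the summary, the sketch below counts passes against a caller-supplied threshold. The `ScoredResult` and `SummarySketch` shapes and the `computeSummarySketch` name are hypothetical; the real `EvalSummary` likely carries more fields.

```typescript
const DEFAULT_THRESHOLD = 0.8;

// Hypothetical minimal result shape: only the score matters for this sketch.
interface ScoredResult {
  readonly score: number; // 0-1
}

// Hypothetical summary shape, not the real EvalSummary.
interface SummarySketch {
  readonly total: number;
  readonly passed: number;
  readonly failed: number;
  readonly durationMs: number;
}

// A result passes when its score meets the threshold; the threshold defaults
// to DEFAULT_THRESHOLD when the caller does not supply one.
function computeSummarySketch(
  results: readonly ScoredResult[],
  durationMs: number,
  threshold: number = DEFAULT_THRESHOLD,
): SummarySketch {
  const passed = results.filter((r) => r.score >= threshold).length;
  return { total: results.length, passed, failed: results.length - passed, durationMs };
}
```

With scores of 0.9, 0.75, and 0.5, the default threshold counts one pass, while a threshold of 0.7 counts two.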

### 5. Split `required: number` → `required: boolean` + `min_score: number` on assertions

```yaml
# Before (confusing — is required boolean or number?)
assertions:
  - type: llm-grader
    prompt: ./safety.md
    required: 0.9

# After (clear)
assertions:
  - type: llm-grader
    prompt: ./safety.md
    required: true
    min_score: 0.9
```

- `required: boolean` — gate semantics (if fails, aggregate → 0)
- `min_score: number` (0-1) — minimum acceptable score for this evaluator
- `min_score` without `required: true` still sets the score floor but doesn't gate the aggregate
- Deprecation: `required: number` continues to work, parsed as `required: true` + `min_score: <value>`
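
One way the parser could normalize both forms is sketched below. The type names and `normalizeAssertionGate` are hypothetical, and the precedence when a numeric `required` and an explicit `min_score` both appear is an assumption that the proposal does not specify.

```typescript
// Hypothetical raw shape of the gate-related fields on an assertion entry.
interface RawAssertionGate {
  readonly required?: boolean | number;
  readonly min_score?: number;
}

// Hypothetical normalized shape used after parsing.
interface NormalizedAssertionGate {
  readonly required: boolean;
  readonly minScore: number | undefined; // 0-1; undefined means "use the test threshold"
}

function normalizeAssertionGate(raw: RawAssertionGate): NormalizedAssertionGate {
  if (typeof raw.required === 'number') {
    // Deprecated form: `required: 0.9` is read as `required: true` + `min_score: 0.9`.
    // Assumption: an explicit `min_score` takes precedence if both are present.
    return { required: true, minScore: raw.min_score ?? raw.required };
  }
  return { required: raw.required === true, minScore: raw.min_score };
}
```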

### 6. Rename `required_min_score` → `min_score` on rubrics, change to 0-1 scale

```yaml
# Before (0-10 scale, verbose name)
rubrics:
  - id: accuracy
    outcome: "Factually correct"
    required_min_score: 7
    score_ranges:
      - score_range: [0, 5]
        outcome: "Incomplete"
      - score_range: [6, 10]
        outcome: "Satisfactory"

# After (0-1 scale, consistent name)
rubrics:
  - id: accuracy
    outcome: "Factually correct"
    min_score: 0.7
    score_ranges:            # stays 0-10 — these describe LLM integer output bands
      - score_range: [0, 5]
        outcome: "Incomplete"
      - score_range: [6, 10]
        outcome: "Satisfactory"
```

- `min_score` is 0-1 (internally multiplied by 10 for comparison against the LLM's 0-10 output)
- `score_ranges` stay 0-10: they're label definitions for the LLM's integer scoring, not user thresholds
- `required_min_score` continues to work as a deprecated alias (value interpreted as 0-10, converted to 0-1)
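
The scale conversion can be sketched as follows. The function and type names are hypothetical, the precedence of `min_score` over `required_min_score` is an assumption, and the small epsilon is an added guard against floating-point rounding that the proposal itself does not mention.

```typescript
// Hypothetical raw rubric fields relevant to the minimum score.
interface RawRubricItem {
  readonly min_score?: number;          // new field, 0-1
  readonly required_min_score?: number; // deprecated field, 0-10
}

// Returns the minimum score on the 0-1 scale, or undefined if none is set.
// Assumption: `min_score` wins when both fields are present.
function resolveRubricMinScore(item: RawRubricItem): number | undefined {
  if (item.min_score !== undefined) return item.min_score;
  if (item.required_min_score !== undefined) return item.required_min_score / 10;
  return undefined;
}

// Compares the LLM's integer 0-10 score against a 0-1 minimum by scaling the
// minimum up to 0-10. The epsilon absorbs rounding error in the multiplication.
function meetsRubricMinScore(llmScore: number, minScore: number): boolean {
  return llmScore >= minScore * 10 - 1e-9;
}
```

Under this sketch, `required_min_score: 7` and `min_score: 0.7` resolve to the same floor, so an LLM score of 7 passes and a score of 6 fails in both cases.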

### 7. Rename studio `pass_threshold` → `threshold`

```yaml
# .agentv/config.yaml
# Before
studio:
  pass_threshold: 0.8

# After
studio:
  threshold: 0.8
```

- Read `pass_threshold` as fallback for backward compat
- Write `threshold` on save
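
A minimal sketch of the read/write behavior, assuming the `studio` section has already been parsed from YAML into a plain object. `RawStudioConfig`, `readStudioThreshold`, and `serializeStudioConfig` are hypothetical names, and falling back to `DEFAULT_THRESHOLD` when neither key is set is an assumption.

```typescript
const DEFAULT_THRESHOLD = 0.8;

// Hypothetical shape of the parsed `studio` section in .agentv/config.yaml.
interface RawStudioConfig {
  readonly threshold?: number;
  readonly pass_threshold?: number; // deprecated key
}

// Prefer the new key, fall back to the deprecated one, then to the default.
function readStudioThreshold(raw: RawStudioConfig): number {
  return raw.threshold ?? raw.pass_threshold ?? DEFAULT_THRESHOLD;
}

// On save only the new key is emitted, so legacy files migrate on first write.
function serializeStudioConfig(threshold: number): { threshold: number } {
  return { threshold };
}
```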

## Scoring Scale Summary

Everything the user configures is **0-1**. The only 0-10 in YAML is `score_ranges` (LLM output band labels, not a threshold).

| Level | Field | Scale |
|-------|-------|-------|
| Suite/test | `execution.threshold` | 0-1 |
| CLI | `--threshold` | 0-1 |
| All graders | `min_score` | 0-1 |
| Rubric criteria | `min_score` | 0-1 (internally × 10 for LLM comparison) |
| Rubric criteria | `score_ranges` | 0-10 (LLM integer band labels — NOT a threshold) |

## Files to modify

| File | Change |
|------|--------|
| `packages/core/src/evaluation/evaluators/scoring.ts` | Rename constant, add threshold param to `scoreToVerdict()` |
| `packages/core/src/evaluation/orchestrator.ts` | Thread threshold to `scoreToVerdict()`, fix `required: true` fallback |
| `packages/core/src/evaluation/evaluate.ts` | Accept threshold in `EvalConfig`, pass to `computeSummary()` |
| `packages/core/src/evaluation/yaml-parser.ts` | Read per-test `execution.threshold`, pass to orchestrator |
| `packages/core/src/evaluation/validation/eval-file.schema.ts` | Add `min_score` to `EvaluatorCommonSchema` and `RubricItemSchema` (0-1 scale) |
| `packages/core/src/evaluation/loaders/evaluator-parser.ts` | Parse `min_score`, handle deprecated `required: number` and `required_min_score` (with 0-10 → 0-1 conversion) |
| `packages/core/src/evaluation/evaluators/llm-grader.ts` | Convert rubric `min_score` from 0-1 to 0-10 for internal LLM comparison |
| `packages/core/src/evaluation/types.ts` | Update type definitions |
| `packages/core/src/evaluation/evaluators/index.ts` | Re-export `DEFAULT_THRESHOLD` + deprecated `PASS_THRESHOLD` |
| `apps/cli/src/commands/results/studio-config.ts` | Read/write `threshold`, fallback to `pass_threshold` |
| `apps/cli/src/commands/eval/benchmark-writer.ts` | Import `DEFAULT_THRESHOLD` instead of local duplicate |

## Backward compatibility

Every renamed field or constant keeps a deprecated alias, so existing configs and imports continue to work:
- `PASS_THRESHOLD` re-exported as deprecated
- `required: number` parsed as `required: true` + `min_score`
- `required_min_score` accepted alongside `min_score` (value in 0-10 converted to 0-1 internally)
- `pass_threshold` read as fallback in studio config

## Non-goals

- Changing the default value (0.8 stays)
- Changing `score_ranges` to 0-1 (they describe LLM integer output bands)
- Renaming `threshold` on latency evaluator (different concept — max ms)
- Renaming composite aggregator `type: 'threshold'` (different concept — aggregation strategy)
- Adding threshold to Agent Skills format (it has no threshold concept, so there is no conflict)

## Documentation Requirements

The implementer MUST update documentation to make the scoring scale and naming clear:

1. **eval-schema.json** — regenerate after schema changes (`bun run generate:schema`)
2. **README.md** — update any threshold/scoring examples to use new field names
3. **YAML schema JSDoc** — add clear comments on `min_score` explaining 0-1 scale at all levels
4. **Migration note** — document deprecated fields (`required: number`, `required_min_score`, `pass_threshold`) and their new equivalents in a visible location (CHANGELOG or migration section)
5. **Scoring scale principle** — document in CLAUDE.md or schema comments: "All user-configurable score thresholds use 0-1 scale. The only 0-10 values in YAML are `score_ranges` which define LLM integer output band labels."

## `min_score` Behavior (Clarification)

`min_score` and `required` are independent:

| Config | Evaluator verdict | Aggregate effect |
|--------|------------------|-----------------|
| `min_score: 0.7` only | fail if score < 0.7, pass if >= 0.7 | Score contributes to weighted average normally |
| `required: true` only | fail if score < threshold (suite/test level) | If fails, entire aggregate → 0 |
| `min_score: 0.7` + `required: true` | fail if score < 0.7 | If score < 0.7, entire aggregate → 0 |
| Neither | fail if score < threshold (suite/test level) | Score contributes normally |

`min_score` controls the evaluator's **own verdict**. `required` controls whether failing **gates the aggregate**. They compose independently.
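
The table can be read as two independent functions, sketched below. `EvaluatorResult`, `evaluatorVerdict`, and `aggregateScore` are hypothetical names, and the plain weighted average is an assumption about how non-gated scores combine.

```typescript
type Verdict = 'pass' | 'fail';

// Hypothetical per-evaluator result used only for this sketch.
interface EvaluatorResult {
  readonly score: number;     // 0-1
  readonly weight: number;
  readonly required: boolean;
  readonly minScore?: number; // 0-1
}

// `min_score` (if set) decides the evaluator's own verdict; otherwise the
// resolved suite/test threshold does.
function evaluatorVerdict(result: EvaluatorResult, threshold: number): Verdict {
  const floor = result.minScore ?? threshold;
  return result.score >= floor ? 'pass' : 'fail';
}

// Weighted average of scores, zeroed if any required evaluator fails.
function aggregateScore(results: readonly EvaluatorResult[], threshold: number): number {
  const gated = results.some((r) => r.required && evaluatorVerdict(r, threshold) === 'fail');
  if (gated) return 0;
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight === 0) return 0;
  return results.reduce((sum, r) => sum + r.score * r.weight, 0) / totalWeight;
}
```

Keeping the verdict and the gate in separate functions mirrors the table: a failing evaluator with `min_score` only still contributes its score, while the same failure with `required: true` zeroes the aggregate.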

## Acceptance Signals

1. **Threshold threading**: `scoreToVerdict()` accepts and uses configurable threshold at all call sites
2. **Per-test threshold**: `execution.threshold` on a test case overrides suite threshold — verified by test
3. **Resolution order**: CLI `--threshold` > test `execution.threshold` > suite `execution.threshold` > `DEFAULT_THRESHOLD` (0.8) — verified by test
4. **`evaluate()` API**: `EvalConfig.threshold` is respected in `computeSummary()` — verified by test
5. **`min_score` on assertions**: new field accepted in YAML, controls per-evaluator verdict — verified by test
6. **`min_score` on rubrics**: new field accepted in YAML (0-1 scale), internally converted to 0-10 — verified by test
7. **Deprecation aliases work**: `required: 0.7` parsed as `required: true` + `min_score: 0.7`; `required_min_score: 7` parsed as `min_score: 0.7` — verified by test
8. **Studio config**: reads `threshold`, falls back to `pass_threshold` — verified by test
9. **Schema regenerated**: `bun run generate:schema` produces updated `eval-schema.json`
10. **All existing tests pass**: no regressions from rename/deprecation
11. **Documentation updated**: per Documentation Requirements section above

