
feat(judges): continuous-value inter-rater agreement (ICC + weighted κ)#50

Merged
drewstone merged 1 commit into main from feat/continuous-agreement on May 16, 2026

Conversation

@tangletools
Contributor

Summary

  • The original calibrateJudge rounded judge scores to ints before computing Cohen's κ. For fine-grained [0,1] judges that's lossy — 0.78 vs 0.81 both round to "1" and the integer κ claims perfect agreement when raters actually disagree by 3pp.
  • Adds continuousAgreement(scores, opts?) returning Cohen's κ_w (quadratic or linear weights on raw scores), ICC(2,1) per Shrout & Fleiss (two-way random effects, absolute agreement, single rater), Pearson + Spearman averaged over rater pairs, and bootstrap 95% percentile CIs on κ_w and ICC (default n=1000, seeded for reproducibility).
  • Adds calibrateJudgeContinuous(golden, candidate, opts?) as a drop-in superset of calibrateJudge — every legacy field is preserved, new fields (weightedKappaContinuous, icc, spearman, ci) layered on top. calibrateJudge is unchanged.
  • Bumps to 0.26.0 (non-breaking add). Python client version pinned in lockstep.

ICC catches the bug Pearson hides: if judge B = 2× judge A, Pearson stays ≈ 1 while ICC drops. The evolve scripts can now report κ_w + ICC per judge dimension with CIs — calibration findings become quantitative.
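For reference, the ICC(2,1) named above (Shrout & Fleiss: two-way random effects, absolute agreement, single rater) is the standard mean-squares ratio; the symbols below are the usual ANOVA terms, not identifiers from this codebase:

ICC(2,1) = (MS_R − MS_E) / (MS_R + (k − 1)·MS_E + (k / n)·(MS_C − MS_E))

where MS_R is the between-items mean square, MS_C the between-raters mean square, MS_E the residual mean square, k the number of raters, and n the number of items. A pure 2× scaling inflates the rater and residual mean squares, so ICC falls, while Pearson, being scale-invariant, stays near 1.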

API

export interface ContinuousAgreement {
  weightedKappa: number       // Cohen's κ_w, quadratic weights on raw scores
  icc: number                 // ICC(2,1): two-way random effects, single rater
  pearson: number             // averaged over rater pairs when N ≥ 2
  spearman: number
  ci: {
    icc: [number, number]     // 95% bootstrap percentile CI
    weightedKappa: [number, number]
  }
  n: number
  raters: number
}

export function continuousAgreement(
  scores: number[][],
  opts?: { bootstrap?: number; weights?: 'linear' | 'quadratic'; seed?: number; ciLevel?: number },
): ContinuousAgreement

export function calibrateJudgeContinuous(
  golden: GoldenItem[],
  candidate: CandidateScore[],
  opts?: ContinuousAgreementOptions,
): ContinuousCalibrationResult  // CalibrationResult ⊕ { weightedKappaContinuous, icc, spearman, ci }
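A minimal usage sketch (the import path, the items-by-raters orientation of scores, and the literal numbers are illustrative assumptions, not taken from the shipped docs):

import { continuousAgreement } from './judge-calibration'  // import path assumed

// scores: one row per item, one column per rater (orientation assumed)
const scores = [
  [0.78, 0.81],
  [0.40, 0.35],
  [0.92, 0.95],
  [0.10, 0.20],
]

const result = continuousAgreement(scores, {
  weights: 'quadratic',  // 'linear' also supported
  bootstrap: 1000,
  seed: 42,              // seeded so the CIs are reproducible
})

console.log(result.weightedKappa, result.icc, result.spearman)
console.log(result.ci.icc)  // [lower, upper] 95% percentile bounds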

Test plan

  • pnpm typecheck clean.
  • pnpm test — 1065 / 1065 pass (was 1051).
  • pnpm build succeeds; new exports surface in dist/index.d.ts.
  • Identical raters → ICC = 1.0, κ_w = 1.0.
  • Independent uniform raters (n=400) → ICC, κ_w, Pearson all within ±0.15 of 0.
  • Rater B = 2× Rater A + small noise → Pearson > 0.95 AND ICC clearly lower than Pearson (the systematic-bias regression assertion; sketched after this list).
  • Bootstrap CI brackets the truth on synthetic data (n=200, B=1000) and on identical-rater controls.
  • N=3 raters with shared truth + noise → ICC and κ_w both > 0.8 (mean-pairwise aggregation works).
  • NaN / Infinity rows dropped, not coerced.
  • calibrateJudgeContinuous preserves every legacy field (n, pearson, kappa, mae, worstItems) and flags the 2× scaling bias that integer-rounded κ misses.
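A sketch of the systematic-bias regression assertion (the test framework, Vitest here, and the exact margins are assumptions; the shipped suite in tests/judge-calibration.test.ts may phrase it differently):

import { describe, expect, it } from 'vitest'  // framework assumed
import { continuousAgreement } from '../src/judge-calibration'

describe('continuousAgreement', () => {
  it('flags the 2x scaling bias that Pearson hides', () => {
    // Rater B doubles rater A's scores, plus tiny deterministic "noise".
    const a = Array.from({ length: 200 }, (_, i) => (i % 50) / 100)
    const b = a.map((x, i) => 2 * x + ((i % 7) - 3) * 0.002)

    const result = continuousAgreement(a.map((x, i) => [x, b[i]]), { seed: 7 })

    expect(result.pearson).toBeGreaterThan(0.95)           // scale-invariant, stays high
    expect(result.icc).toBeLessThan(result.pearson - 0.1)  // absolute agreement drops
  })
})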

Files

  • src/judge-calibration.ts — new public API + internals (continuous κ_w, ICC(2,1), Spearman with tie-handling, seeded Mulberry32 bootstrap, percentile quantile); the bootstrap PRNG is sketched after this list.
  • src/index.ts — re-exports.
  • tests/judge-calibration.test.ts — extended with 11 new tests (no new file; one canonical suite).
  • docs/concepts.md — new "Judge calibration" section.
  • README.md — primitive table row.
  • CHANGELOG.md, package.json, clients/python/{pyproject.toml,src/agent_eval_rpc/__init__.py} — version bump to 0.26.0.
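For context, Mulberry32 is a small, fast 32-bit PRNG; a typical implementation looks like the sketch below (the standard algorithm, not necessarily the exact helper shipped in src/judge-calibration.ts):

// Standard Mulberry32: returns a function yielding floats in [0, 1).
// Seeding it makes the bootstrap resamples, and therefore the CIs, reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) | 0
    let t = Math.imul(a ^ (a >>> 15), a | 1)
    t = (t + Math.imul(t ^ (t >>> 7), t | 61)) ^ t
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Example: draw one bootstrap resample of item indices.
const rand = mulberry32(42)
const resample = (n: number) => Array.from({ length: n }, () => Math.floor(rand() * n))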

The original calibrateJudge rounded scores to ints before computing
Cohen's κ. For fine-grained [0,1] judges that's lossy — 0.78 vs 0.81
both round to "1" and the integer κ claims perfect agreement when the
raters actually disagree by 3pp. This release adds continuous-value
agreement metrics so calibration findings on continuous judges become
quantitative.

- continuousAgreement(scores, opts?): κ_w (quadratic or linear weights
  on raw scores), ICC(2,1) per Shrout & Fleiss (two-way random effects,
  absolute agreement, single rater), Pearson + Spearman averaged over
  rater pairs, bootstrap 95% percentile CIs on ICC and κ_w (default
  n=1000, seeded).
- calibrateJudgeContinuous(golden, candidate, opts?): drop-in superset
  of calibrateJudge — preserves all legacy fields and adds the
  continuous metrics + CIs.
- calibrateJudge is unchanged for backwards compat.

ICC catches systematic bias Pearson misses: if judge B = 2× judge A,
Pearson stays ≈ 1 while ICC drops. Tests assert this exact failure
mode (identical raters → 1.0, independent uniform → ≈ 0, 2× scaling →
high Pearson + lower ICC, bootstrap CI brackets truth on synthetic
ground-truth data).

Version bump to 0.26.0 (non-breaking add).
@tangletools
Contributor Author

tangletools commented May 16, 2026

⚠️ Review Failed — dc2efcbf

All review passes errored. No verdict published. Push a new commit to trigger a re-review.

tangletools · 2026-05-16T19:18:30Z

drewstone merged commit da4e642 into main on May 16, 2026
1 check passed