
feat(judges): continuous-value inter-rater agreement (ICC + weighted κ)#50

Merged
drewstone merged 1 commit into main from feat/continuous-agreement on May 16, 2026

Conversation

@tangletools
Contributor

Summary

  • The original calibrateJudge rounded judge scores to ints before computing Cohen's κ. For fine-grained [0,1] judges that's lossy — 0.78 vs 0.81 both round to "1" and the integer κ claims perfect agreement when raters actually disagree by 3pp.
  • Adds continuousAgreement(scores, opts?) returning Cohen's κ_w (quadratic or linear weights on raw scores), ICC(2,1) per Shrout & Fleiss (two-way random effects, absolute agreement, single rater), Pearson + Spearman averaged over rater pairs, and bootstrap 95% percentile CIs on κ_w and ICC (default n=1000, seeded for reproducibility).
  • Adds calibrateJudgeContinuous(golden, candidate, opts?) as a drop-in superset of calibrateJudge — every legacy field is preserved, new fields (weightedKappaContinuous, icc, spearman, ci) layered on top. calibrateJudge is unchanged.
  • Bumps to 0.26.0 (non-breaking add). Python client version pinned in lockstep.

ICC catches the bug Pearson hides: if judge B = 2× judge A, Pearson stays ≈ 1 while ICC drops. The evolve scripts can now report κ_w + ICC per judge dimension with CIs — calibration findings become quantitative.
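For reference, the ICC(2,1) named above (Shrout & Fleiss: two-way random effects, absolute agreement, single rater) is the standard mean-squares ratio; the symbols below are the usual ANOVA terms, not identifiers from this codebase:

ICC(2,1) = (MS_R − MS_E) / (MS_R + (k − 1)·MS_E + (k / n)·(MS_C − MS_E))

where MS_R is the between-items mean square, MS_C the between-raters mean square, MS_E the residual mean square, k the number of raters, and n the number of items. A pure 2× scaling inflates the rater and residual mean squares, so ICC falls, while Pearson, being scale-invariant, stays near 1.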

API

export interface ContinuousAgreement {
  weightedKappa: number       // Cohen's κ_w, quadratic weights on raw scores
  icc: number                 // ICC(2,1): two-way random effects, single rater
  pearson: number             // averaged over rater pairs when N ≥ 2
  spearman: number
  ci: {
    icc: [number, number]     // 95% bootstrap percentile CI
    weightedKappa: [number, number]
  }
  n: number
  raters: number
}

export function continuousAgreement(
  scores: number[][],
  opts?: { bootstrap?: number; weights?: 'linear' | 'quadratic'; seed?: number; ciLevel?: number },
): ContinuousAgreement

export function calibrateJudgeContinuous(
  golden: GoldenItem[],
  candidate: CandidateScore[],
  opts?: ContinuousAgreementOptions,
): ContinuousCalibrationResult  // CalibrationResult ⊕ { weightedKappaContinuous, icc, spearman, ci }
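A minimal usage sketch (the import path, the items-by-raters orientation of scores, and the literal numbers are illustrative assumptions, not taken from the shipped docs):

import { continuousAgreement } from './judge-calibration'  // import path assumed

// scores: one row per item, one column per rater (orientation assumed)
const scores = [
  [0.78, 0.81],
  [0.40, 0.35],
  [0.92, 0.95],
  [0.10, 0.20],
]

const result = continuousAgreement(scores, {
  weights: 'quadratic',  // 'linear' also supported
  bootstrap: 1000,
  seed: 42,              // seeded so the CIs are reproducible
})

console.log(result.weightedKappa, result.icc, result.spearman)
console.log(result.ci.icc)  // [lower, upper] 95% percentile bounds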

Test plan

  • pnpm typecheck clean.
  • pnpm test — 1065 / 1065 pass (was 1051).
  • pnpm build succeeds; new exports surface in dist/index.d.ts.
  • Identical raters → ICC = 1.0, κ_w = 1.0.
  • Independent uniform raters (n=400) → ICC, κ_w, Pearson all within ±0.15 of 0.
  • Rater B = 2× Rater A + small noise → Pearson > 0.95 AND ICC clearly lower than Pearson (the systematic-bias regression assertion; sketched after this list).
  • Bootstrap CI brackets the truth on synthetic data (n=200, B=1000) and on identical-rater controls.
  • N=3 raters with shared truth + noise → ICC and κ_w both > 0.8 (mean-pairwise aggregation works).
  • NaN / Infinity rows dropped, not coerced.
  • calibrateJudgeContinuous preserves every legacy field (n, pearson, kappa, mae, worstItems) and flags the 2× scaling bias that integer-rounded κ misses.
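A sketch of the systematic-bias regression assertion (the test framework, Vitest here, and the exact margins are assumptions; the shipped suite in tests/judge-calibration.test.ts may phrase it differently):

import { describe, expect, it } from 'vitest'  // framework assumed
import { continuousAgreement } from '../src/judge-calibration'

describe('continuousAgreement', () => {
  it('flags the 2x scaling bias that Pearson hides', () => {
    // Rater B doubles rater A's scores, plus tiny deterministic "noise".
    const a = Array.from({ length: 200 }, (_, i) => (i % 50) / 100)
    const b = a.map((x, i) => 2 * x + ((i % 7) - 3) * 0.002)

    const result = continuousAgreement(a.map((x, i) => [x, b[i]]), { seed: 7 })

    expect(result.pearson).toBeGreaterThan(0.95)           // scale-invariant, stays high
    expect(result.icc).toBeLessThan(result.pearson - 0.1)  // absolute agreement drops
  })
})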

Files

  • src/judge-calibration.ts — new public API + internals (continuous κ_w, ICC(2,1), Spearman with tie-handling, seeded Mulberry32 bootstrap, percentile quantile); the bootstrap PRNG is sketched after this list.
  • src/index.ts — re-exports.
  • tests/judge-calibration.test.ts — extended with 11 new tests (no new file; one canonical suite).
  • docs/concepts.md — new "Judge calibration" section.
  • README.md — primitive table row.
  • CHANGELOG.md, package.json, clients/python/{pyproject.toml,src/agent_eval_rpc/__init__.py} — version bump to 0.26.0.
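For context, Mulberry32 is a small, fast 32-bit PRNG; a typical implementation looks like the sketch below (the standard algorithm, not necessarily the exact helper shipped in src/judge-calibration.ts):

// Standard Mulberry32: returns a function yielding floats in [0, 1).
// Seeding it makes the bootstrap resamples, and therefore the CIs, reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) | 0
    let t = Math.imul(a ^ (a >>> 15), a | 1)
    t = (t + Math.imul(t ^ (t >>> 7), t | 61)) ^ t
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Example: draw one bootstrap resample of item indices.
const rand = mulberry32(42)
const resample = (n: number) => Array.from({ length: n }, () => Math.floor(rand() * n))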

The original calibrateJudge rounded scores to ints before computing
Cohen's κ. For fine-grained [0,1] judges that's lossy — 0.78 vs 0.81
both round to "1" and the integer κ claims perfect agreement when the
raters actually disagree by 3pp. This release adds continuous-value
agreement metrics so calibration findings on continuous judges become
quantitative.

- continuousAgreement(scores, opts?): κ_w (quadratic or linear weights
  on raw scores), ICC(2,1) per Shrout & Fleiss (two-way random effects,
  absolute agreement, single rater), Pearson + Spearman averaged over
  rater pairs, bootstrap 95% percentile CIs on ICC and κ_w (default
  n=1000, seeded).
- calibrateJudgeContinuous(golden, candidate, opts?): drop-in superset
  of calibrateJudge — preserves all legacy fields and adds the
  continuous metrics + CIs.
- calibrateJudge is unchanged for backwards compat.

ICC catches systematic bias Pearson misses: if judge B = 2× judge A,
Pearson stays ≈ 1 while ICC drops. Tests assert this exact failure
mode (identical raters → 1.0, independent uniform → ≈ 0, 2× scaling →
high Pearson + lower ICC, bootstrap CI brackets truth on synthetic
ground-truth data).

Version bump to 0.26.0 (non-breaking add).
@tangletools
Contributor Author

tangletools commented May 16, 2026

⚠️ Review Failed — dc2efcbf

All review passes errored. No verdict published. Push a new commit to trigger a re-review.

tangletools · 2026-05-16T19:18:30Z

drewstone merged commit da4e642 into main on May 16, 2026
1 check passed