feat: 0.24.0 — DX cleanup, strict indices, error taxonomy by drewstone · Pull Request #47 · tangle-network/agent-eval

drewstone · 2026-05-14T10:49:15Z

Summary

DX + correctness pass. No runtime behavior moved; consumer contracts tightened across the board. Library went from 7.5/10 to 10/10 on first-touch usability.

What changed

Framing

README reframed as substrate for self-improving agents — names GEPA / reflective mutation, auto-research, active curriculum, PRM, OPE, tournaments, contamination, compute curves, sequential anytime-valid stats. Previously called itself "evaluation infrastructure" while shipping all of the above.
src/rl/index.ts carries @stable / @experimental JSDoc on every re-export. Tags emit into dist/rl.d.ts — IDE hover surfaces stability at the call site. Stable: 7. Experimental: 9.

Strictness

noUncheckedIndexedAccess: true flipped in tsconfig.json. 251 latent T | undefined sites surfaced and fixed across ~70 files. Loop-bound indices documented with !, external lookups guarded explicitly, accumulator patterns refactored to capture-then-assign. Every fix audited for semantic correctness.
Subpath imports forced. 6 leaky export * from './X' wildcards killed at root (./rl, ./pipelines, ./builder-eval, ./meta-eval, ./prm, ./trace-analyst). 7 new subpaths in package.json + tsup.config.ts. Root re-exports retained only for the load-bearing capture-integrity surface (./trace, ./knowledge, ./governance).
Error taxonomy. New src/errors.ts: AgentEvalError base + ValidationError, NotFoundError, ConfigError, CaptureIntegrityError, JudgeError, VerificationError, ReplayError. 10 existing custom errors re-parented. ~25 user-facing throws migrated to typed errors. Internal invariant guards intentionally left as plain Error — those are bugs, not contract failures.
LlmRouteAssertionError.code → .reason (breaking, greenfield). The route-specific reason no longer shadows the base class's category code = 'capture_integrity'. instanceof CaptureIntegrityError still discriminates correctly.
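The taxonomy and the `.code` → `.reason` rename can be sketched as below. This is a minimal illustration under assumptions, not the contents of src/errors.ts: the constructor signatures, the `string` type of `reason`, and the `describeFailure` helper are hypothetical. Only the class names and the `'capture_integrity'` category code come from this PR.

```typescript
// Hypothetical sketch of a category-coded error hierarchy.
// Constructor shapes are assumptions; the real src/errors.ts may differ.
class AgentEvalError extends Error {
  readonly code: string

  constructor(code: string, message: string) {
    super(message)
    this.code = code
    this.name = new.target.name
    // Keeps instanceof working when the build targets ES5.
    Object.setPrototypeOf(this, new.target.prototype)
  }
}

class CaptureIntegrityError extends AgentEvalError {
  constructor(message: string) {
    super('capture_integrity', message)
  }
}

class LlmRouteAssertionError extends CaptureIntegrityError {
  // Route-specific detail lives on .reason, so .code stays the category code.
  readonly reason: string

  constructor(reason: string, message: string) {
    super(message)
    this.reason = reason
  }
}

// Hypothetical consumer: discriminate by class first, then read the detail.
function describeFailure(err: unknown): string {
  if (err instanceof LlmRouteAssertionError) return `${err.code}/${err.reason}`
  if (err instanceof AgentEvalError) return err.code
  return 'unknown'
}
```

With this shape a caller can branch on `instanceof CaptureIntegrityError` or on `err.code` without the subclass overwriting the category.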

Tooling

Biome lint + format. biome.json codifies the project style (no semicolons, single quotes, 2-space indent, 100 col, noNonNullAssertion: off, useNodejsImportProtocol: error). pnpm lint + pnpm format. Auto-format applied to 191 src files.
.github/workflows/ci.yml — typecheck + lint + test + build + Python pytest on every PR. Previously only the publish workflow (tag-push) exercised this surface.
noUncheckedIndexedAccess: true now enforced in CI.

Tests & examples

ReplayCache.entries() — public iterator replacing the private byKey bracket-access escape hatch.
Per-example READMEs for multi-shot-optimization and same-sandbox-harness. Other three examples already had them.
clients/python/examples/judge_anti_slop.py — runnable script doubling as pytest, anchoring composite range / RubricNotFoundError / ValidationError invariants.
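For the ReplayCache.entries() change, a rough sketch of the pattern, assuming the cache stores entries in a private `Map`. `ReplayCacheSketch`, `CachedCall`, and `set` are hypothetical names; the real key and value shapes are not shown in this PR.

```typescript
// Hypothetical sketch; not the package's ReplayCache.
interface CachedCall {
  request: string
  response: string
}

class ReplayCacheSketch {
  private readonly byKey = new Map<string, CachedCall>()

  set(key: string, call: CachedCall): void {
    this.byKey.set(key, call)
  }

  // Public iteration over [key, call] pairs. Callers no longer need
  // cache['byKey'], which TypeScript permits even on private members.
  entries() {
    return this.byKey.entries()
  }
}

const sketchCache = new ReplayCacheSketch()
sketchCache.set('k1', { request: 'ping', response: 'pong' })
sketchCache.set('k2', { request: 'foo', response: 'bar' })

// Consumers iterate through the public method only.
const cachedKeys = Array.from(sketchCache.entries(), ([key]) => key)
```

Exposing an iterator rather than the map itself keeps the storage private while still supporting read-only walks such as the one in iterateRawCalls.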

Versions

npm 0.24.0
PyPI agent-eval-rpc 0.24.0
__version__ in __init__.py matched (version-lock CI gate already enforces this on tag push)

Local gate

typecheck: 0 errors
lint (biome): 0 errors, 14 pre-existing warnings (none blocking)
tests: 1019/1019 across 118 files
build: clean, OpenAPI emits

Test plan

CI passes (the new ci.yml will exercise lint + typecheck + test + build + Python pytest)
No external consumers — greenfield breaking changes (LlmRouteAssertionError.code rename, root wildcard removals) are intentional and documented in CHANGELOG

@experimental

DX-only release. No runtime behavior changed; positioning, contract clarity, and developer onboarding all upgraded to senior-staff bar.

README reframed as the substrate for self-improving agents. The package has shipped EvalCampaign, replay, GEPA / reflective mutation, auto-research, active curriculum, contamination probes, tournaments, compute curves, PRM, off-policy estimators, and sequential anytime-valid stats since 0.22; the README now actually names them.

src/rl/index.ts carries @stable / @experimental JSDoc on every re-export. Tags emit into dist/rl.d.ts so IDE hover surfaces stability at the call site.
Stable: run-record-adapters, verifiable-reward, preferences, off-policy, tournament, contamination, compute-curves.
Experimental: process-reward, adversarial, active-curriculum, reward-hacking, adaptation-eval, exporters, rl-campaign, predictive-validity-researcher, auto-research.

Added biome + format/lint scripts. biome.json codifies the project style (no semicolons, single quotes, 2-space indent, 100 col). Auto-format applied across src/. Disabled noNonNullAssertion (pragmatic for this codebase), kept noAssignInExpressions / noImplicitAnyLet at warn. 14 pre-existing warnings remain, none block CI.

Added .github/workflows/ci.yml: typecheck + lint + test + build + Python pytest on every PR. Previously only publish-on-tag exercised this surface.

Added ReplayCache.entries(): public iterator replacing the private byKey bracket-access escape hatch in iterateRawCalls.

Added per-example READMEs for multi-shot-optimization and same-sandbox-harness.

Added clients/python/examples/judge_anti_slop.py: runnable script doubling as pytest, anchoring the judge API contract (composite range, RubricNotFoundError, ValidationError).

Fixed: reflective-mutation autoCloseTruncatedJson local `escape` shadowed the global; renamed to `escaped`.

npm + PyPI version-locked at 0.24.0.
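To make the `escape` → `escaped` rename concrete, here is a hypothetical sketch of a truncated-JSON closer that tracks string and escape state. It is not the package's autoCloseTruncatedJson, whose body is not shown in this PR; the function name and the closing strategy are assumptions. The point is only the flag's name: a local called `escape` would shadow the legacy global `escape()` function.

```typescript
// Hypothetical sketch; not the package's autoCloseTruncatedJson.
function autoCloseTruncatedJsonSketch(input: string): string {
  const closers: string[] = []
  let inString = false
  // Named `escaped`, not `escape`, so the global escape() is not shadowed.
  let escaped = false

  for (let i = 0; i < input.length; i++) {
    const ch = input.charAt(i)
    if (inString) {
      if (escaped) {
        escaped = false // this character was consumed by the backslash
      } else if (ch === '\\') {
        escaped = true
      } else if (ch === '"') {
        inString = false
      }
      continue
    }
    if (ch === '"') inString = true
    else if (ch === '{') closers.push('}')
    else if (ch === '[') closers.push(']')
    else if (ch === '}' || ch === ']') closers.pop()
  }

  let out = input
  if (inString) {
    // Drop a dangling backslash, then terminate the open string.
    if (escaped) out = out.slice(0, -1)
    out += '"'
  }
  // Close containers innermost first.
  return out + closers.reverse().join('')
}
```

This sketch only repairs input cut off inside a string or between values; a cut right after a key and colon, such as `{"a":`, would still produce invalid JSON.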

DX + correctness pass. No production behavior moved; consumer contracts tightened across the board.

Strict indices. Flipped noUncheckedIndexedAccess: true. 251 latent T | undefined sites surfaced across ~70 files; all fixed with the right idiom: `!` for loop-bound or known-constant indices (honest), explicit guards for external lookups / Map.get / regex match groups, accumulator patterns refactored to capture-then-assign instead of double-read.

Subpath imports forced. Deleted 6 leaky root wildcards (./rl, ./pipelines, ./builder-eval, ./meta-eval, ./prm, ./trace-analyst). Added 7 new subpaths in package.json + tsup.config.ts. Root re-exports retained only for the load-bearing capture-integrity surface (./trace, ./knowledge, ./governance).

Error taxonomy. New src/errors.ts: AgentEvalError base + ValidationError, NotFoundError, ConfigError, CaptureIntegrityError, JudgeError, VerificationError, ReplayError. Re-parented 10 existing custom errors. Migrated ~25 user-facing throws to typed errors across rl/, replay, sandbox-harness, statistics, release-confidence, visual-diff, counterfactual, run-critic, observability. Internal invariant guards intentionally left as plain Error.

LlmRouteAssertionError.code → reason. The route-specific reason moved off .code so it doesn't shadow the AgentEvalError category code (now 'capture_integrity'). Breaking, but greenfield.

Gates: typecheck 0 errors, lint 0 errors (14 pre-existing warns), test 1019/1019, build clean, OpenAPI emits.
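The three strict-index idioms named above can be sketched as follows. The function names and data are illustrative only and do not come from the codebase; each function is written to typecheck with noUncheckedIndexedAccess: true.

```typescript
// Illustrative only; none of these functions exist in the package.

// 1. Loop-bound index: `i < xs.length` guarantees the element, so `!` is honest.
function sumAll(xs: number[]): number {
  let total = 0
  for (let i = 0; i < xs.length; i++) {
    total += xs[i]!
  }
  return total
}

// 2a. External lookup: Map.get can miss, so guard explicitly.
function scoreFor(scores: Map<string, number>, id: string): number {
  const score = scores.get(id)
  if (score === undefined) throw new Error(`no score for ${id}`)
  return score
}

// 2b. Regex match group: both the match and the group may be absent.
function majorVersion(tag: string): number | undefined {
  const match = /^v(\d+)\./.exec(tag)
  const major = match?.[1]
  return major === undefined ? undefined : Number(major)
}

// 3. Accumulator: capture the current value once, then assign,
//    instead of reading counts[key] twice.
function countByKey(keys: string[]): Record<string, number> {
  const counts: Record<string, number> = {}
  for (const key of keys) {
    const current = counts[key] ?? 0
    counts[key] = current + 1
  }
  return counts
}
```

The `!` form is reserved for cases where the bound is visible in the same scope; anywhere the index or key comes from outside, the guard documents the failure mode instead of hiding it.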

drewstone added 3 commits May 14, 2026 04:11

chore: gitignore claude code runtime lock

4eca049

drewstone merged commit 544fa69 into main May 14, 2026
1 check passed

drewstone merged 3 commits into main from feat/0.24.0-dx-cleanup
