In Benchmark._run_task_repetition (maseval/core/benchmark.py, around lines 1210–1239), collect_all_usage() is called before evaluate() runs. Evaluators that hold their own ModelAdapter (LLM judges) only make their LLM calls during evaluate(), so their _usage_records are still empty when the usage snapshot is taken. The snapshot is written into report["usage"], and clear_registry() runs immediately after the report is built, discarding the judge adapters along with the tokens they accumulated.

Net effect: every entry under report["usage"]["models"] that corresponds to an evaluator-owned model shows input_tokens=0, output_tokens=0, cost=0.0, even though the judges clearly ran (their outputs are present under report["eval"]). The persistent benchmark.usage_by_component / total_usage aggregates inherit the same gap because they are populated from the same pre-eval collect_usage() call. Reproducible with LiteLLMModelAdapter judges registered via self.register("models", ...) in setup_evaluators; agent-side models are unaffected because they finish executing before step 3.

Suggested fix: call collect_all_usage() after evaluate() (or call it a second time post-eval and merge), so judge token counts make it into both the per-task report and the run-level aggregates. This matters more and more as multi-judge eval stacks (uncertainty / response-type / faithfulness / per-family tracking judges / LLM-graded goals) push judge token volume well past agent token volume.
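
A minimal sketch of both variants, assuming a repetition flow roughly shaped like the snippet below. Only collect_all_usage(), evaluate(), clear_registry(), report["usage"], and report["eval"] are names from the actual code; run_agents_sketch, _merge_usage, the snapshot dict shape, and the surrounding function are illustrative assumptions, not the real implementation.

```python
from collections import defaultdict


def _merge_usage(base: dict, extra: dict) -> dict:
    """Merge two per-model usage snapshots, summing tokens and cost.

    Assumes snapshots shaped like
    {"models": {name: {"input_tokens": int, "output_tokens": int, "cost": float}}};
    the real structure in maseval may differ.
    """
    merged = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0, "cost": 0.0})
    for snapshot in (base, extra):
        for model, usage in snapshot.get("models", {}).items():
            merged[model]["input_tokens"] += usage.get("input_tokens", 0)
            merged[model]["output_tokens"] += usage.get("output_tokens", 0)
            merged[model]["cost"] += usage.get("cost", 0.0)
    return {"models": dict(merged)}


def run_task_repetition_sketch(benchmark, task):
    """Illustrative ordering only -- not the real _run_task_repetition body."""
    benchmark.run_agents_sketch(task)  # hypothetical: agent adapters accumulate usage here

    # Option A (simplest): snapshot usage only after evaluate(), so the judge
    # adapters have already made their LLM calls by the time it is taken.
    eval_results = benchmark.evaluate(task)
    usage = benchmark.collect_all_usage()

    # Option B: keep a pre-eval snapshot (if some records are cleared before
    # evaluate() runs) and merge in a second, post-eval snapshot:
    #   usage = _merge_usage(pre_eval_usage, benchmark.collect_all_usage())

    report = {"eval": eval_results, "usage": usage}
    benchmark.clear_registry()  # safe now; judge tokens are already in the snapshot
    return report
```

With option A, the single post-eval snapshot covers both agent-owned and evaluator-owned adapters, and the run-level aggregates can keep being fed from the same call. Option B is only needed if some component's usage records have to be captured (or are cleared) before evaluate() runs.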