Bug: evaluator token usage is silently dropped from per-task reports #60

@cemde

Description

In `Benchmark._run_task_repetition` (`maseval/core/benchmark.py`, around lines 1210–1239), `collect_all_usage()` is called before `evaluate()` runs. Evaluators that hold their own `ModelAdapter` (LLM judges) only make their LLM calls during `evaluate()`, so when the usage snapshot is taken their `_usage_records` are still empty. The snapshot is written into `report["usage"]`, and `clear_registry()` runs immediately after the report is built, discarding the judge adapters along with the tokens they accumulated.

Net effect: every entry under `report["usage"]["models"]` that corresponds to an evaluator-owned model shows `input_tokens=0`, `output_tokens=0`, `cost=0.0`, even though the judges clearly ran (their outputs are present under `report["eval"]`). The persistent `benchmark.usage_by_component` / `total_usage` aggregates inherit the same gap because they are populated from the same pre-eval `collect_usage()` call.

Reproducible with `LiteLLMModelAdapter` judges registered via `self.register("models", ...)` in `setup_evaluators`; agent-side models are unaffected because they finish executing before step 3.

Suggested fix: call `collect_all_usage()` after `evaluate()` (or call it a second time post-eval and merge), so judge token counts make it into both the per-task report and the run-level aggregates. This matters increasingly as multi-judge eval stacks (uncertainty / response-type / faithfulness / per-family tracking judges / LLM-graded goals) drive judge token volume well past agent token volume.
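For the "call it a second time post-eval and merge" variant, the merge could look roughly like the sketch below. This is a minimal illustration, not the real maseval API: `merge_usage`, the snapshot shape (`{model_name: {"input_tokens", "output_tokens", "cost"}}`), and the model names are all assumptions based on the issue text.

```python
# Hedged sketch: take a second usage snapshot after evaluate() and merge it
# with the pre-eval one. Assumes counters are cumulative per adapter, so the
# larger value of each counter reflects the later snapshot.
# All names and the snapshot structure are hypothetical.

def merge_usage(pre_eval: dict, post_eval: dict) -> dict:
    """Merge two usage snapshots keyed by model name, keeping the larger
    value of each cumulative counter."""
    merged = {name: dict(counters) for name, counters in pre_eval.items()}
    for name, counters in post_eval.items():
        slot = merged.setdefault(name, {})
        for key, value in counters.items():
            slot[key] = max(slot.get(key, 0), value)
    return merged


# Illustrative snapshots: the judge model is still at zero pre-eval
# (the bug), but has real counts in the post-eval snapshot.
pre = {
    "agent-model": {"input_tokens": 1200, "output_tokens": 400, "cost": 0.02},
    "judge-model": {"input_tokens": 0, "output_tokens": 0, "cost": 0.0},
}
post = {
    "agent-model": {"input_tokens": 1200, "output_tokens": 400, "cost": 0.02},
    "judge-model": {"input_tokens": 900, "output_tokens": 150, "cost": 0.015},
}
usage = merge_usage(pre, post)  # judge counters no longer zeroed out
```

Simply moving the single `collect_all_usage()` call to after `evaluate()` (before `clear_registry()`) would avoid the merge entirely, at the cost of not capturing usage for tasks whose evaluation raises.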

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working)
    core (In regards to the core package `maseval/core`)

    Projects

    No projects

    Milestone

    No milestone
