In Benchmark._run_task_repetition (maseval/core/benchmark.py, around lines 1210–1239), collect_all_usage() is called before evaluate() runs. Evaluators that hold their own ModelAdapter (LLM judges) only make their LLM calls during evaluate(), so their _usage_records are still empty when the usage snapshot is taken. The snapshot is written into report["usage"], and clear_registry() runs immediately after the report is built, discarding the judge adapters along with the tokens they accumulated.

Net effect: every entry under report["usage"]["models"] that corresponds to an evaluator-owned model shows input_tokens=0, output_tokens=0, cost=0.0, even though the judges clearly ran (their outputs are present under report["eval"]). The persistent benchmark.usage_by_component / total_usage aggregates inherit the same gap because they are populated from the same pre-eval collect_usage() call. Reproducible with LiteLLMModelAdapter judges registered via self.register("models", ...) in setup_evaluators; agent-side models are unaffected because they finish executing before step 3.

Suggested fix: call collect_all_usage() after evaluate() (or call it a second time post-eval and merge), so judge token counts make it into both the per-task report and the run-level aggregates. This matters more and more as multi-judge eval stacks (uncertainty / response-type / faithfulness / per-family tracking judges / LLM-graded goals) push judge token volume well past agent token volume.
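
A minimal sketch of both variants, assuming a repetition flow roughly shaped like the snippet below. Only collect_all_usage(), evaluate(), clear_registry(), report["usage"], and report["eval"] are names from the actual code; run_agents_sketch, _merge_usage, the snapshot dict shape, and the surrounding function are illustrative assumptions, not the real implementation.

```python
from collections import defaultdict


def _merge_usage(base: dict, extra: dict) -> dict:
    """Merge two per-model usage snapshots, summing tokens and cost.

    Assumes snapshots shaped like
    {"models": {name: {"input_tokens": int, "output_tokens": int, "cost": float}}};
    the real structure in maseval may differ.
    """
    merged = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0, "cost": 0.0})
    for snapshot in (base, extra):
        for model, usage in snapshot.get("models", {}).items():
            merged[model]["input_tokens"] += usage.get("input_tokens", 0)
            merged[model]["output_tokens"] += usage.get("output_tokens", 0)
            merged[model]["cost"] += usage.get("cost", 0.0)
    return {"models": dict(merged)}


def run_task_repetition_sketch(benchmark, task):
    """Illustrative ordering only -- not the real _run_task_repetition body."""
    benchmark.run_agents_sketch(task)  # hypothetical: agent adapters accumulate usage here

    # Option A (simplest): snapshot usage only after evaluate(), so the judge
    # adapters have already made their LLM calls by the time it is taken.
    eval_results = benchmark.evaluate(task)
    usage = benchmark.collect_all_usage()

    # Option B: keep a pre-eval snapshot (if some records are cleared before
    # evaluate() runs) and merge in a second, post-eval snapshot:
    #   usage = _merge_usage(pre_eval_usage, benchmark.collect_all_usage())

    report = {"eval": eval_results, "usage": usage}
    benchmark.clear_registry()  # safe now; judge tokens are already in the snapshot
    return report
```

With option A, the single post-eval snapshot covers both agent-owned and evaluator-owned adapters, and the run-level aggregates can keep being fed from the same call. Option B is only needed if some component's usage records have to be captured (or are cleared) before evaluate() runs.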