
feat: add evaluation-level expectedOutput to EvaluationItem#1387

Merged
Chibionos merged 4 commits into main from feat/eval-level-expected-output
Feb 27, 2026

Conversation


@Chibionos Chibionos commented Feb 27, 2026

What

Adds an optional expectedOutput field at the evaluation item level in the v1.0 schema. Output-based evaluators (ExactMatch, JsonSimilarity, LLMJudgeOutput) can now read the expected output from the evaluation itself, instead of requiring it to be duplicated in every evaluator's evaluationCriterias entry.

Why

Today, when three evaluators need the same expected output, you write it three times:

"evaluationCriterias": {
  "exact-match":       { "expectedOutput": { "result": 4 } },
  "json-similarity":   { "expectedOutput": { "result": 4 } },
  "llm-judge-output":  { "expectedOutput": { "result": 4 } }
}

With this change you write it once:

"expectedOutput": { "result": 4 },
"evaluationCriterias": {
  "exact-match": null,
  "json-similarity": null,
  "llm-judge-output": null
}

Per-evaluator override still works — if a criteria entry has its own expectedOutput, it wins.

How it works

Model — One new optional field on EvaluationItem:

expected_output: dict[str, Any] | str | None = Field(default=None, alias="expectedOutput")

Runtime — 15 lines of merge logic in _execute_eval() before criteria is passed to evaluators:

  1. If the evaluator's criteria type extends OutputEvaluationCriteria and the evaluation has expectedOutput:
    • Null criteria → inject {"expectedOutput": eval_item.expected_output}
    • Criteria without expectedOutput → merge it in
    • Criteria with expectedOutput → keep as-is (per-evaluator wins)
  2. Non-output evaluators (Contains, ToolCall*, Trajectory) are completely untouched.

Backward compatibility

  • The new field defaults to None — existing eval-set JSONs parse and run without any changes.
  • Legacy evaluation sets are unaffected (different model, different migration path).
  • No changes needed in downstream repos (uipath-agents, uipath-langchain, uipath-runtime).

Tests

25 new tests across 7 test classes covering:

  • Model parsing (dict, string, null, missing, serialization roundtrip)
  • Runtime merge logic (null criteria, override, injection, non-output evaluators, edge cases)
  • Evaluator integration (ExactMatch, JsonSimilarity, LLMJudgeOutput)
  • Legacy migration compatibility
  • End-to-end flow simulation (mixed evaluator types, override precedence)

All 1574 tests pass (25 new + 1549 existing, zero regressions).

Jira

AE-1066

Spec

Confluence: Evaluation-Level ExpectedOutput Schema Enhancement

🤖 Generated with Claude Code

@github-actions github-actions bot added the test:uipath-langchain (triggers tests in the uipath-langchain-python repository) and test:uipath-llamaindex (triggers tests in the uipath-llamaindex-python repository) labels on Feb 27, 2026
Chibi Vikram and others added 2 commits February 27, 2026 00:03
Add an optional `expectedOutput` field at the evaluation level so
output-based evaluators can share a common expected output instead
of duplicating it in every evaluator's criteria entry.

Resolution order:
1. Per-evaluator criteria expectedOutput (highest priority)
2. Evaluation-level expectedOutput (fallback)
3. Evaluator config default / error (existing behavior)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Chibionos Chibionos force-pushed the feat/eval-level-expected-output branch from 6b5e80e to f73fda2 on February 27, 2026 at 08:04
Chibi Vikram and others added 2 commits February 27, 2026 01:12
Add evaluation set JSON files exercising the new expectedOutput field on
EvaluationItem and a matching testcase with run.sh + assert.py that
validates scores from deterministic and LLM judge evaluators.
Bump version to 2.11.0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
uipath-langchain==0.7.11 pins uipath>=2.10.0,<2.11.0, so bumping
to 2.11.0 breaks the cross-compatibility testcase. The version bump
should be coordinated with a uipath-langchain release.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Chibionos Chibionos merged commit 40de0d3 into main Feb 27, 2026
104 of 117 checks passed
@Chibionos Chibionos deleted the feat/eval-level-expected-output branch February 27, 2026 09:49