Fix case-sensitive metric name matching in RAI safety evaluators #46177
w-javed wants to merge 3 commits into Azure:main
Conversation
The _parse_eval_result method in RaiServiceEvaluatorBase used
case-sensitive string comparison when matching metric names from the
RAI service response against SDK metric constants. The sync_evals
endpoint returns capitalized metric names (e.g., 'Violence') while
SDK constants use lowercase (e.g., 'violence'), causing the match
to fail silently and return empty {'evaluation_per_turn': {}}.
This change makes the metric name comparison case-insensitive by
normalizing both sides to lowercase before comparing. Added 5 new
test cases covering capitalized, fully uppercased, mixed-case, and
builtin-prefixed metric name variants.
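The failure mode can be reproduced with a minimal sketch (function and variable names here are hypothetical, not the SDK's actual parsing code): an exact string comparison silently drops the service's capitalized metric name, while lowercasing both sides recovers the match.

```python
# Hypothetical reproduction of the bug; not the SDK's actual parser.
SDK_METRIC = "violence"  # SDK constant (lowercase)

def parse_case_sensitive(service_results: dict) -> dict:
    # Old behavior: exact string match fails for "Violence".
    return {m: s for m, s in service_results.items() if m == SDK_METRIC}

def parse_case_insensitive(service_results: dict) -> dict:
    # Fixed behavior: normalize both sides to lowercase before comparing.
    return {m: s for m, s in service_results.items() if m.lower() == SDK_METRIC.lower()}

response = {"Violence": 2}  # sync_evals returns capitalized names

print(parse_case_sensitive(response))    # -> {} (silent mismatch)
print(parse_case_insensitive(response))  # -> {'Violence': 2}
```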
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR fixes metric-name matching in RaiServiceEvaluatorBase._parse_eval_result() to be case-insensitive so sync_evals responses like "Violence" can be correctly matched against SDK metric constants like "violence".
Changes:
- Lowercase-normalizes service metric names and expected metric names before comparing in `_parse_eval_result`.
- Adds new unit tests to validate case-insensitive parsing across several metric-name casing variants.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_base_rai_svc_eval.py | Updates metric-name matching logic to use case-insensitive comparisons during sync_evals result parsing. |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_content_safety_rai_script.py | Adds unit tests intended to validate the new case-insensitive metric-name matching behavior. |
Address PR review feedback: lowercase + strip underscores so both snake_case (self_harm) and PascalCase (SelfHarm) service responses match the SDK constants. Also strip optional builtin. prefix before comparing. Add PascalCase test cases for SelfHarm, HateUnfairness, ProtectedMaterial, and builtin.* prefixed variants. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
46c6693 to d95f329
```python
def _normalize(name: str) -> str:
    """Lowercase and remove underscores for flexible metric matching."""
    return name.lower().replace("_", "")
```
Nit: _normalize is redefined on every iteration of the for result_item in results loop. Since it doesn't close over any loop variable, consider hoisting it to a @staticmethod on the class or a module-level helper. Same applies to expected_metric and normalized_expected — they're invariant across iterations and can be computed once before the loop.
Suggested shape:

```python
# As a static method on the class or module-level helper:
def _normalize_metric(name: str) -> str:
    return name.lower().replace('_', '')

# Before the 'for result_item in results:' loop:
expected_metric = self._eval_metric.value if hasattr(self._eval_metric, 'value') else self._eval_metric
normalized_expected = self._normalize_metric(expected_metric)
```

The next comment refers to this branch of the condition:

```python
== _normalize(self._eval_metric if isinstance(self._eval_metric, str) else "")
):
```
This third branch is dead code. EvaluationMetrics extends str, Enum, so isinstance(self._eval_metric, str) is always True. That means _normalize(self._eval_metric) just normalizes the enum's string value — which is identical to normalized_expected (computed from self._eval_metric.value, which returns the same string for a str Enum). So this condition is always equivalent to the first condition (normalized_metric_no_prefix == normalized_expected).
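The reviewer's claim can be verified standalone: for an Enum that also subclasses str, every member is a str instance and compares equal to its own `.value`, so normalizing the member and normalizing its `.value` yield the same string. This quick check uses a stand-in enum, not the SDK's actual `EvaluationMetrics` class.

```python
from enum import Enum

class Metrics(str, Enum):  # stand-in for EvaluationMetrics
    VIOLENCE = "violence"

m = Metrics.VIOLENCE
assert isinstance(m, str)            # always True for a str-based Enum
assert m == m.value == "violence"    # the member compares equal to its .value
assert m.lower() == m.value.lower()  # so normalizing either yields the same result
```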
Suggest removing this branch entirely and simplifying to:

```python
if (
    normalized_metric_no_prefix == normalized_expected
    or normalized_metric == normalized_expected
):
```
Overall: the core fix is correct and well-tested; nice work catching this case-sensitivity gap. A few things to address before merge:

- CI is red.
- Code cleanup (non-blocking but recommended): see the inline comments above.

Tests are thorough and cover the important variants (PascalCase, builtin prefix, ALL_CAPS, label-type evaluators). 👍
Problem

The `_parse_eval_result` method in `RaiServiceEvaluatorBase` uses case-sensitive string comparison when matching metric names from the RAI service response. The sync_evals endpoint returns capitalized metric names (e.g., `"Violence"`) while SDK constants use lowercase (e.g., `"violence"`), causing the match to fail silently and return empty results: `{'evaluation_per_turn': {}}`.

Root Cause
In `_base_rai_svc_eval.py`, all three comparisons in the metric matching logic fail because of the case mismatch, so `_parse_eval_result` returns an empty dict, which propagates to `_aggregate_results`, producing `{'evaluation_per_turn': {}}`.

Fix
Normalize both sides of the comparison to lowercase before matching.
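Per the follow-up commit, the final normalization also strips underscores and an optional builtin. prefix so that both snake_case (`self_harm`) and PascalCase (`SelfHarm`) service responses match the SDK constants. A sketch of that logic follows; the helper names here are illustrative, not the exact SDK code.

```python
def _normalize_metric(name: str) -> str:
    """Lowercase and drop underscores so 'self_harm' and 'SelfHarm' compare equal."""
    return name.lower().replace("_", "")

def matches(service_metric: str, expected_metric: str) -> bool:
    """Compare a service-reported metric name against an SDK constant."""
    normalized = _normalize_metric(service_metric)
    # Strip an optional "builtin." prefix before comparing.
    prefix = "builtin."
    if normalized.startswith(prefix):
        normalized = normalized[len(prefix):]
    return normalized == _normalize_metric(expected_metric)

print(matches("Violence", "violence"))               # True
print(matches("SelfHarm", "self_harm"))              # True
print(matches("Builtin.Sexual", "sexual"))           # True
print(matches("HateUnfairness", "hate_unfairness"))  # True
```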
Tests

Added 5 new test cases to `TestParseEvalResult`, covering case-insensitive matching for:

- Capitalized metric names (`"Violence"`)
- Builtin-prefixed metric names (`"Builtin.Sexual"`)
- Fully uppercased metric names (`"SELF_HARM"`)
- Mixed-case metric names (`"Protected_Material"`)

All 29 tests in `test_content_safety_rai_script.py` pass.