
Fix case-sensitive metric name matching in RAI safety evaluators#46177

Open
w-javed wants to merge 3 commits into Azure:main from w-javed:fix/eval-metric-case-insensitive-match

Conversation

w-javed (Contributor) commented Apr 7, 2026

Problem

The _parse_eval_result method in RaiServiceEvaluatorBase uses case-sensitive string comparison when matching metric names from the RAI service response. The sync_evals endpoint returns capitalized metric names (e.g., "Violence") while SDK constants use lowercase (e.g., "violence"), causing the match to fail silently and return empty results:

{'evaluation_per_turn': {}}

Root Cause

In _base_rai_svc_eval.py, the metric matching logic:

if (
    metric_name == expected_metric        # "Violence" != "violence" → False
    or metric_name == f"builtin.{expected_metric}"  # False
    or metric_name == self._eval_metric   # False
):

All three comparisons fail because of case mismatch, so _parse_eval_result returns an empty dict, which propagates to _aggregate_results producing {'evaluation_per_turn': {}}.
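The failure mode above can be reproduced in isolation. This is a minimal sketch; `parse_results` is a hypothetical stand-in for `_parse_eval_result`, not the SDK method:

```python
# Hypothetical stand-in reproducing the buggy case-sensitive matching.
def parse_results(results, expected_metric):
    parsed = {}
    for item in results:
        name = item["metric"]
        # Case-sensitive comparison, as in the pre-fix code.
        if name == expected_metric or name == f"builtin.{expected_metric}":
            parsed[name] = item["score"]
    return parsed  # nothing matched -> silently empty

service_response = [{"metric": "Violence", "score": 2}]
print(parse_results(service_response, "violence"))  # {}
```

With the capitalized service response, the dict comes back empty with no error raised, which is exactly the silent-failure symptom described above.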

Fix

Normalize both sides of the comparison to lowercase before matching:

metric_name_lower = metric_name.lower()
expected_metric_lower = expected_metric.lower()
if (
    metric_name_lower == expected_metric_lower
    or metric_name_lower == f"builtin.{expected_metric_lower}"
    ...
):
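The effect of the normalization can be sketched with a small self-contained helper; `matches` is a hypothetical function mirroring the fixed comparison, not the SDK's actual method:

```python
# Sketch of the fixed, case-insensitive metric matching.
def matches(metric_name: str, expected_metric: str) -> bool:
    metric_name_lower = metric_name.lower()
    expected_metric_lower = expected_metric.lower()
    return (
        metric_name_lower == expected_metric_lower
        or metric_name_lower == f"builtin.{expected_metric_lower}"
    )

print(matches("Violence", "violence"))      # True
print(matches("Builtin.Sexual", "sexual"))  # True
print(matches("violence", "sexual"))        # False
```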

Tests

Added 5 new test cases to TestParseEvalResult:

  • Capitalized metric name ("Violence")
  • Capitalized builtin prefix ("Builtin.Sexual")
  • Fully uppercased ("SELF_HARM")
  • Label-type evaluator with mixed case ("Protected_Material")
  • Multiple mixed-case variants in a loop

All 29 tests in test_content_safety_rai_script.py pass.

Commit message

The _parse_eval_result method in RaiServiceEvaluatorBase used case-sensitive string comparison when matching metric names from the RAI service response against SDK metric constants. The sync_evals endpoint returns capitalized metric names (e.g., 'Violence') while SDK constants use lowercase (e.g., 'violence'), causing the match to fail silently and return empty {'evaluation_per_turn': {}}.

This change makes the metric name comparison case-insensitive by normalizing both sides to lowercase before comparing. Added 5 new test cases covering capitalized, fully uppercased, mixed-case, and builtin-prefixed metric name variants.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 7, 2026 03:39
@w-javed w-javed requested a review from a team as a code owner April 7, 2026 03:39
@github-actions github-actions bot added the Evaluation Issues related to the client library for Azure AI Evaluation label Apr 7, 2026
Copilot AI (Contributor) left a comment


Pull request overview

This PR fixes metric-name matching in RaiServiceEvaluatorBase._parse_eval_result() to be case-insensitive so sync_evals responses like "Violence" can be correctly matched against SDK metric constants like "violence".

Changes:

  • Lowercase-normalizes service metric names and expected metric names before comparing in _parse_eval_result.
  • Adds new unit tests to validate case-insensitive parsing across several metric-name casing variants.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Files reviewed:

  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_base_rai_svc_eval.py: Updates metric-name matching logic to use case-insensitive comparisons during sync_evals result parsing.
  • sdk/evaluation/azure-ai-evaluation/tests/unittests/test_content_safety_rai_script.py: Adds unit tests intended to validate the new case-insensitive metric-name matching behavior.

w-javed and others added 2 commits April 7, 2026 07:44
Address PR review feedback: lowercase + strip underscores so both
snake_case (self_harm) and PascalCase (SelfHarm) service responses
match the SDK constants. Also strip optional builtin. prefix before
comparing.

Add PascalCase test cases for SelfHarm, HateUnfairness,
ProtectedMaterial, and builtin.* prefixed variants.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
w-javed force-pushed the fix/eval-metric-case-insensitive-match branch from 46c6693 to d95f329 on April 7, 2026 at 16:31
def _normalize(name: str) -> str:
"""Lowercase and remove underscores for flexible metric matching."""
return name.lower().replace("_", "")

Member commented:

Nit: _normalize is redefined on every iteration of the for result_item in results loop. Since it doesn't close over any loop variable, consider hoisting it to a @staticmethod on the class or a module-level helper. Same applies to expected_metric and normalized_expected — they're invariant across iterations and can be computed once before the loop.

Suggested shape:

# As a static method on the class or module-level helper:
def _normalize_metric(name: str) -> str:
    return name.lower().replace('_', '')

# Before the 'for result_item in results:' loop:
expected_metric = self._eval_metric.value if hasattr(self._eval_metric, 'value') else self._eval_metric
normalized_expected = self._normalize_metric(expected_metric)

Comment on lines +292 to 293
== _normalize(self._eval_metric if isinstance(self._eval_metric, str) else "")
):
Member commented:

This third branch is dead code. EvaluationMetrics extends str, Enum, so isinstance(self._eval_metric, str) is always True. That means _normalize(self._eval_metric) just normalizes the enum's string value — which is identical to normalized_expected (computed from self._eval_metric.value, which returns the same string for a str Enum). So this condition is always equivalent to the first condition (normalized_metric_no_prefix == normalized_expected).

Suggest removing this branch entirely and simplifying to:

if (
    normalized_metric_no_prefix == normalized_expected
    or normalized_metric == normalized_expected
):
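The str-Enum claim behind this suggestion can be checked directly. A minimal sketch with a stand-in class mirroring the SDK's pattern (not the real EvaluationMetrics):

```python
from enum import Enum

# Stand-in mirroring the SDK's str-Enum pattern (not the real class).
class EvaluationMetrics(str, Enum):
    VIOLENCE = "violence"

m = EvaluationMetrics.VIOLENCE
print(isinstance(m, str))            # True: the isinstance guard always holds
print(m.lower() == m.value.lower())  # True: normalizing the member equals
                                     # normalizing its .value
```

Because str methods on a mixed-in enum member operate on the underlying string value, normalizing the member and normalizing its `.value` produce the same token, which is why the third branch can never match anything the first branch does not.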

@slister1001 (Member) commented:

Overall: the core fix is correct and well-tested — nice work catching this case-sensitivity gap. A few things to address before merge:

CI is red:

  • Build Analyze is failing — this is likely the black formatting gate. This repo requires running tox run -e black -c ../../../eng/tox/tox.ini --root . from the azure-ai-evaluation directory before committing.
  • Merge state is dirty — there are conflicts with main that need to be resolved.

Code cleanup (non-blocking but recommended):

  1. Hoist _normalize out of the loop (see inline comment) — it's recreated per result item unnecessarily.
  2. Remove the redundant third branch in the if condition (see inline comment) — it's dead code given EvaluationMetrics is a str Enum.

Tests are thorough and cover the important variants (PascalCase, builtin prefix, ALL_CAPS, label-type evaluators). 👍
