
Fix case-sensitive metric name matching in RAI safety evaluators#46177

Open
w-javed wants to merge 3 commits into Azure:main from w-javed:fix/eval-metric-case-insensitive-match

Conversation

w-javed (Contributor) commented Apr 7, 2026

Problem

The _parse_eval_result method in RaiServiceEvaluatorBase uses case-sensitive string comparison when matching metric names from the RAI service response. The sync_evals endpoint returns capitalized metric names (e.g., "Violence") while SDK constants use lowercase (e.g., "violence"), causing the match to fail silently and return empty results:

{'evaluation_per_turn': {}}

Root Cause

In _base_rai_svc_eval.py, the metric matching logic:

if (
    metric_name == expected_metric        # "Violence" != "violence" → False
    or metric_name == f"builtin.{expected_metric}"  # False
    or metric_name == self._eval_metric   # False
):

All three comparisons fail because of case mismatch, so _parse_eval_result returns an empty dict, which propagates to _aggregate_results producing {'evaluation_per_turn': {}}.
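The failure mode above can be reproduced in isolation. This is a minimal sketch; `parse_results` is a hypothetical stand-in for `_parse_eval_result`, not the SDK method:

```python
# Hypothetical stand-in reproducing the buggy case-sensitive matching.
def parse_results(results, expected_metric):
    parsed = {}
    for item in results:
        name = item["metric"]
        # Case-sensitive comparison, as in the pre-fix code.
        if name == expected_metric or name == f"builtin.{expected_metric}":
            parsed[name] = item["score"]
    return parsed  # nothing matched -> silently empty

service_response = [{"metric": "Violence", "score": 2}]
print(parse_results(service_response, "violence"))  # {}
```

With the capitalized service response, the dict comes back empty with no error raised, which is exactly the silent-failure symptom described above.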

Fix

Normalize both sides of the comparison to lowercase before matching:

metric_name_lower = metric_name.lower()
expected_metric_lower = expected_metric.lower()
if (
    metric_name_lower == expected_metric_lower
    or metric_name_lower == f"builtin.{expected_metric_lower}"
    ...
):
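The effect of the normalization can be sketched with a small self-contained helper; `matches` is a hypothetical function mirroring the fixed comparison, not the SDK's actual method:

```python
# Sketch of the fixed, case-insensitive metric matching.
def matches(metric_name: str, expected_metric: str) -> bool:
    metric_name_lower = metric_name.lower()
    expected_metric_lower = expected_metric.lower()
    return (
        metric_name_lower == expected_metric_lower
        or metric_name_lower == f"builtin.{expected_metric_lower}"
    )

print(matches("Violence", "violence"))      # True
print(matches("Builtin.Sexual", "sexual"))  # True
print(matches("violence", "sexual"))        # False
```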

Tests

Added 5 new test cases to TestParseEvalResult:

  • Capitalized metric name ("Violence")
  • Capitalized builtin prefix ("Builtin.Sexual")
  • Fully uppercased ("SELF_HARM")
  • Label-type evaluator with mixed case ("Protected_Material")
  • Multiple mixed-case variants in a loop

All 29 tests in test_content_safety_rai_script.py pass.

Commit message

The _parse_eval_result method in RaiServiceEvaluatorBase used case-sensitive string comparison when matching metric names from the RAI service response against SDK metric constants. The sync_evals endpoint returns capitalized metric names (e.g., 'Violence') while SDK constants use lowercase (e.g., 'violence'), causing the match to fail silently and return empty {'evaluation_per_turn': {}}.

This change makes the metric name comparison case-insensitive by normalizing both sides to lowercase before comparing. Added 5 new test cases covering capitalized, fully uppercased, mixed-case, and builtin-prefixed metric name variants.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 7, 2026 03:39
@w-javed w-javed requested a review from a team as a code owner April 7, 2026 03:39
@github-actions github-actions bot added the Evaluation Issues related to the client library for Azure AI Evaluation label Apr 7, 2026
Copilot AI (Contributor) left a comment


Pull request overview

This PR fixes metric-name matching in RaiServiceEvaluatorBase._parse_eval_result() to be case-insensitive so sync_evals responses like "Violence" can be correctly matched against SDK metric constants like "violence".

Changes:

  • Lowercase-normalizes service metric names and expected metric names before comparing in _parse_eval_result.
  • Adds new unit tests to validate case-insensitive parsing across several metric-name casing variants.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Files reviewed:

  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_base_rai_svc_eval.py: Updates metric-name matching logic to use case-insensitive comparisons during sync_evals result parsing.
  • sdk/evaluation/azure-ai-evaluation/tests/unittests/test_content_safety_rai_script.py: Adds unit tests intended to validate the new case-insensitive metric-name matching behavior.

w-javed and others added 2 commits April 7, 2026 07:44
Address PR review feedback: lowercase + strip underscores so both
snake_case (self_harm) and PascalCase (SelfHarm) service responses
match the SDK constants. Also strip optional builtin. prefix before
comparing.

Add PascalCase test cases for SelfHarm, HateUnfairness,
ProtectedMaterial, and builtin.* prefixed variants.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
w-javed force-pushed the fix/eval-metric-case-insensitive-match branch from 46c6693 to d95f329 on April 7, 2026 at 16:31
def _normalize(name: str) -> str:
"""Lowercase and remove underscores for flexible metric matching."""
return name.lower().replace("_", "")

Member commented:

Nit: _normalize is redefined on every iteration of the for result_item in results loop. Since it doesn't close over any loop variable, consider hoisting it to a @staticmethod on the class or a module-level helper. Same applies to expected_metric and normalized_expected — they're invariant across iterations and can be computed once before the loop.

Suggested shape:

# As a static method on the class or module-level helper:
def _normalize_metric(name: str) -> str:
    return name.lower().replace('_', '')

# Before the 'for result_item in results:' loop:
expected_metric = self._eval_metric.value if hasattr(self._eval_metric, 'value') else self._eval_metric
normalized_expected = self._normalize_metric(expected_metric)

Comment on lines +292 to 293
== _normalize(self._eval_metric if isinstance(self._eval_metric, str) else "")
):
Member commented:

This third branch is dead code. EvaluationMetrics extends str, Enum, so isinstance(self._eval_metric, str) is always True. That means _normalize(self._eval_metric) just normalizes the enum's string value — which is identical to normalized_expected (computed from self._eval_metric.value, which returns the same string for a str Enum). So this condition is always equivalent to the first condition (normalized_metric_no_prefix == normalized_expected).

Suggest removing this branch entirely and simplifying to:

if (
    normalized_metric_no_prefix == normalized_expected
    or normalized_metric == normalized_expected
):
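The str-Enum claim behind this suggestion can be checked directly. A minimal sketch with a stand-in class mirroring the SDK's pattern (not the real EvaluationMetrics):

```python
from enum import Enum

# Stand-in mirroring the SDK's str-Enum pattern (not the real class).
class EvaluationMetrics(str, Enum):
    VIOLENCE = "violence"

m = EvaluationMetrics.VIOLENCE
print(isinstance(m, str))            # True: the isinstance guard always holds
print(m.lower() == m.value.lower())  # True: normalizing the member equals
                                     # normalizing its .value
```

Because str methods on a mixed-in enum member operate on the underlying string value, normalizing the member and normalizing its `.value` produce the same token, which is why the third branch can never match anything the first branch does not.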

@slister1001 (Member) commented:

Overall: the core fix is correct and well-tested — nice work catching this case-sensitivity gap. A few things to address before merge:

CI is red:

  • Build Analyze is failing — this is likely the black formatting gate. This repo requires running tox run -e black -c ../../../eng/tox/tox.ini --root . from the azure-ai-evaluation directory before committing.
  • Merge state is dirty — there are conflicts with main that need to be resolved.

Code cleanup (non-blocking but recommended):

  1. Hoist _normalize out of the loop (see inline comment) — it's recreated per result item unnecessarily.
  2. Remove the redundant third branch in the if condition (see inline comment) — it's dead code given EvaluationMetrics is a str Enum.

Tests are thorough and cover the important variants (PascalCase, builtin prefix, ALL_CAPS, label-type evaluators). 👍
