From b2f9db15c09aabb2d03a9036c125494a74b148b0 Mon Sep 17 00:00:00 2001
From: Ahmad Nader
Date: Tue, 7 Apr 2026 18:22:20 +0200
Subject: [PATCH] docs(evaluation): add custom evaluator bug bash guide

Add bug-bash instructions and an AGENTS guide for the custom evaluator
evaluation samples.

Authored-by: GitHub Copilot Coding Agent
Model: GPT-5.4
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .../samples/custom_evaluators/AGENTS.md   |  46 +++
 .../samples/custom_evaluators/Bug-Bash.md | 292 ++++++++++++++++++
 2 files changed, 338 insertions(+)
 create mode 100644 sdk/evaluation/azure-ai-evaluation/samples/custom_evaluators/AGENTS.md
 create mode 100644 sdk/evaluation/azure-ai-evaluation/samples/custom_evaluators/Bug-Bash.md

diff --git a/sdk/evaluation/azure-ai-evaluation/samples/custom_evaluators/AGENTS.md b/sdk/evaluation/azure-ai-evaluation/samples/custom_evaluators/AGENTS.md
new file mode 100644
index 000000000000..ba169ca52a03
--- /dev/null
+++ b/sdk/evaluation/azure-ai-evaluation/samples/custom_evaluators/AGENTS.md
@@ -0,0 +1,46 @@
+You are a bug bash automation assistant for the Azure AI Evaluation custom evaluators samples.
+
+Your job is to help the user run the bug bash in [Bug-Bash.md](./Bug-Bash.md) end to end.
+ +When helping with this bug bash: + +- treat upload validation and evaluation-result correctness as equally important +- use the bug bash document as the source of truth for scenarios, prerequisites, and expected outcomes +- guide the user through environment setup, authentication, sample configuration, execution, and result validation +- use these sample entry points when the user wants the provided samples: + - `sample_custom_eval_upload_simple.py` + - `sample_custom_eval_upload_advanced.py` +- if the user provides a custom evaluator, confirm it follows these naming rules: + - class name format: `CustomNameEvaluator` + - file name format: `custom_name_evaluator.py` +- if the user does not have their own project, suggest requesting access to the shared `np-int` project and using it for the bug bash +- when relevant, provide the shared project URL: `https://ai.azure.com/nextgen/r/e0PPodqSSMyGXVSZRms7XA,naposani,,np-int,default/home` +- when relevant, provide the shared project endpoint: `https://np-int.services.ai.azure.com/api/projects/default` +- instruct the user to fetch the API key from the project URL before running the samples +- instruct the user to explicitly fill in `FOUNDRY_MODEL_NAME` and `OPENAI_MODEL`; do not assume default model values +- ask for or identify expected outputs before execution so result correctness can be validated after the run +- verify not only that the evaluator uploads and runs, but that scores, labels, thresholds, reasoning, and custom properties match the evaluator definition +- if the user wants automation, help run the SDK steps and summarize the observed results against the expected results +- produce a concise bug bash report with: setup status, executed scenarios, pass/fail results, mismatches, and bugs to file + +Constraints: + +- do not claim success based only on upload or run completion +- do not treat UI visibility as sufficient validation without checking the returned evaluation results +- do not invent expected 
outputs; ask the user for them or derive them from the evaluator definition they provide +- do not modify product code unless the user explicitly asks for code changes + +When asked to run the bug bash automatically, follow this sequence: + +1. Confirm prerequisites from the bug bash document. +2. Confirm the project is in the INT environment hosted in Central US EUAP. +3. If the user does not have their own project, suggest the shared `np-int` project and provide its URL and endpoint. +4. Determine whether the user is using the provided samples or a user-defined custom evaluator. +5. If using the provided samples, choose between `sample_custom_eval_upload_simple.py` and `sample_custom_eval_upload_advanced.py`. +6. Instruct the user to fetch the API key from the project URL and explicitly fill in `FOUNDRY_MODEL_NAME` and `OPENAI_MODEL` before running the samples. +7. If using a custom evaluator, verify the class and file naming pattern. +8. Collect the expected outputs for a small validation dataset. +9. Run the upload workflow. +10. Run evaluation workflows through SDK and UI when requested. +11. Compare actual outputs to expected outputs. +12. Return a clear pass/fail summary and list any discrepancies. \ No newline at end of file diff --git a/sdk/evaluation/azure-ai-evaluation/samples/custom_evaluators/Bug-Bash.md b/sdk/evaluation/azure-ai-evaluation/samples/custom_evaluators/Bug-Bash.md new file mode 100644 index 000000000000..4aa6a797e4a2 --- /dev/null +++ b/sdk/evaluation/azure-ai-evaluation/samples/custom_evaluators/Bug-Bash.md @@ -0,0 +1,292 @@ +## Welcome to Bug Bash for Azure AI Evaluation SDK + +### Bug Bash: Custom & Friendly Evaluator Upload (`azure-ai-projects` SDK) + +### Important Region Constraint +- This feature only works for projects in the `INT` environment, which is hosted in the `Central US EUAP` region. +- Projects in other regions are not supported for evaluator upload at this time. 
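Since the region gate is easy to trip over, a quick endpoint sanity check can catch obvious misconfiguration before anything is uploaded. This is an illustrative sketch, not part of the samples; it only compares the hostname against the shared `np-int` endpoint used later in this guide and cannot verify the project's actual region:

```python
from urllib.parse import urlparse

# Shared INT project endpoint referenced later in this guide.
EXPECTED_INT_HOST = "np-int.services.ai.azure.com"

def check_endpoint(endpoint: str) -> bool:
    """Return True if the endpoint points at the shared INT project host.

    A convenience check for the bug bash only; a passing check does not
    prove the project itself is hosted in Central US EUAP.
    """
    host = urlparse(endpoint).hostname or ""
    return host == EXPECTED_INT_HOST

print(check_endpoint("https://np-int.services.ai.azure.com/api/projects/default"))  # True
print(check_endpoint("https://myproj.services.ai.azure.com/api/projects/default"))  # False
```

If you are using your own project rather than the shared one, confirm the region in the portal instead; the hostname check above does not apply.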
+
+### Feature Overview
+This feature enables users to:
+
+- Upload custom evaluators with user-defined evaluation logic.
+- Upload friendly evaluators defined through evaluator metadata and prompt/scoring configuration.
+- Run evaluations via the SDK after uploading evaluators through the SDK.
+- Upload evaluators through the SDK, then select and run them from the Azure AI Studio UI.
+- Create an evaluation definition through the Evaluations REST API and monitor the run through the API and/or Azure AI Studio.
+- Validate that evaluation outputs match the scoring logic, labels, thresholds, and other result fields defined by the evaluator.
+
+### Supported Workflows
+- `SDK -> SDK`: Upload the evaluator using the SDK and run evaluations using the SDK.
+- `SDK -> UI`: Upload the evaluator using the SDK, then select and run it from the Azure AI Studio UI.
+- `API -> API/UI`: Create an evaluation definition, including testing criteria, using the Evaluations REST API, then monitor results via the API and/or Azure AI Studio.
+
+### Primary Validation Goal
+This bug bash is not limited to upload and registration. The main validation target is end-to-end correctness:
+
+- the evaluator uploads successfully
+- the evaluator can be selected and invoked successfully
+- the evaluation run completes successfully
+- the returned evaluation results match the evaluator definition
+- labels, scores, thresholds, reasoning, and other evaluator-defined properties are preserved correctly in the final results
+
+### Note
+- Uploading evaluators via the UI is not supported; evaluator upload must be done through the SDK.
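Because result correctness is the primary validation goal, it helps to compare expected and actual result payloads mechanically rather than by eye. Below is a minimal sketch, assuming results are available as plain dictionaries; the field names are illustrative and are not the SDK's result schema:

```python
def diff_result(expected: dict, actual: dict) -> list:
    """Compare an expected evaluator result against the actual run output.

    Returns a list of human-readable mismatch descriptions; an empty list
    means every expected field was present with the expected value.
    """
    problems = []
    for field, want in expected.items():
        if field not in actual:
            problems.append(f"missing field: {field}")
        elif actual[field] != want:
            problems.append(f"{field}: expected {want!r}, got {actual[field]!r}")
    return problems

# Illustrative expected vs. actual payloads for one dataset row.
expected = {"score": 4, "label": "pass", "threshold": 3}
actual = {"score": 4, "label": "fail", "threshold": 3}
print(diff_result(expected, actual))  # ["label: expected 'pass', got 'fail'"]
```

A helper like this makes "run succeeded but values are wrong" failures visible, which upload status and run status alone never show.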
+
+### References
+The bug bash scenarios are based on the following SDK samples:
+
+- Simple custom evaluator upload sample:
+  [sample_custom_eval_upload_simple.py](https://github.com/Azure/azure-sdk-for-python/blob/feature/azure-ai-projects/2.0.2/sdk/ai/azure-ai-projects/samples/evaluations/sample_custom_eval_upload_simple.py)
+- Advanced custom evaluator upload sample:
+  [sample_custom_eval_upload_advanced.py](https://github.com/Azure/azure-sdk-for-python/blob/feature/azure-ai-projects/2.0.2/sdk/ai/azure-ai-projects/samples/evaluations/sample_custom_eval_upload_advanced.py)
+
+## Instructions
+
+### 1. Set Up a Virtual Environment
+
+#### Create the environment
+```bash
+python -m venv .bugbashenv
+```
+
+#### Activate (Linux/macOS)
+```bash
+source .bugbashenv/bin/activate
+```
+
+#### Activate (Windows)
+```bash
+.bugbashenv\Scripts\activate
+```
+
+### 2. Check Out the Branch and Locate the Samples
+```bash
+git clone https://github.com/Azure/azure-sdk-for-python.git
+cd azure-sdk-for-python
+git checkout feature/azure-ai-projects/2.0.2
+cd sdk/ai/azure-ai-projects/samples/evaluations
+```
+
+### 3. Install Dependencies
+
+Authenticate with Azure:
+
+```bash
+az login
+```
+
+Install the SDK package:
+
+```bash
+pip install azure-ai-projects
+```
+
+### 4. Confirm Project Prerequisites
+Make sure all of the following are true:
+
+- You have an Azure subscription with access to Azure AI Projects.
+- Your Azure AI Project is in the `INT` environment hosted in the `Central US EUAP` region.
+- The project is visible in `https://ai.azure.com`.
+- You are using Python 3.9 or later.
+- Your authentication method is ready, such as Azure CLI login or a service principal.
+
+### 4.1 Shared Project Option
+If you do not have your own Azure AI Project available for the bug bash, you can use the shared `np-int` project that the team has been using.
+
+- ask the team for access to the `np-int` project if you do not already have it
+- use `np-int` as your project for upload, evaluation runs, and result validation if you do not have your own project
+- `np-int` is in the `INT` environment hosted in `Central US EUAP`
+- project URL: `https://ai.azure.com/nextgen/r/e0PPodqSSMyGXVSZRms7XA,naposani,,np-int,default/home`
+- project endpoint: `https://np-int.services.ai.azure.com/api/projects/default`
+- fetch the API key from the project URL before running the samples
+
+### 5. Azure AI Project Configuration
+Configure the sample with your project connection details before running it.
+
+At minimum, verify:
+
+- project endpoint or project connection settings
+- API key fetched from the project URL
+- model names are explicitly filled in for the samples you run
+- evaluator name
+- evaluator version
+- evaluator definition or evaluator logic
+- test inputs with known expected outputs
+
+If you are using the shared project, configure the samples against `np-int` once access has been granted: verify that you can open the project URL, use `https://np-int.services.ai.azure.com/api/projects/default` as the endpoint, and explicitly fill in the model variables required by the sample you are running.
+
+### 5.1 Required Variables for the Provided Samples
+For the provided samples, configure these values:
+
+- `FOUNDRY_PROJECT_ENDPOINT=https://np-int.services.ai.azure.com/api/projects/default`
+- `FOUNDRY_MODEL_NAME=`
+- `OPENAI_MODEL=`
+- fetch the API key from the shared project URL and use it for the sample that requires `OPENAI_API_KEY`
+
+Do not rely on defaults for model configuration. Fill in `FOUNDRY_MODEL_NAME` and `OPENAI_MODEL` explicitly before running the samples.
+
+### 5.2 Using Your Own Custom Evaluator
+Bug bash participants can use their own custom evaluators in addition to the provided samples.
+
+Use the following naming format:
+
+- evaluator class name: `CustomNameEvaluator`
+- evaluator file name: `custom_name_evaluator.py`
+
+Validate that:
+
+- the class name follows the `CustomNameEvaluator` pattern
+- the Python file name follows the `custom_name_evaluator.py` pattern
+- the evaluator definition is consistent with the expected output fields you want to validate during the run
+
+### 6. Prepare Validation Inputs
+Before running the bug bash, prepare a small set of prompts or dataset rows where the expected evaluator output is known ahead of time.
+
+Examples:
+
+- an input that should clearly pass
+- an input that should clearly fail
+- an input that should produce a specific label
+- an input that should exercise threshold boundaries
+- an input that should produce evaluator-specific metadata or properties
+
+## Bug Bash Scenarios
+
+### Scenario 1: Upload Custom Evaluator via SDK
+
+#### Goal
+Validate that a custom evaluator uploads and registers successfully, and confirm that evaluation runs return the expected outputs defined by that evaluator.
+
+This scenario applies both to the provided sample evaluator and to a user-defined custom evaluator that follows the required class and file naming format.
+
+#### Steps
+Configure the following in the sample:
+
+- project connection details
+- evaluator name and version
+- custom evaluator logic
+- validation inputs with known expected results
+
+Then run it:
+
+```bash
+python sample_custom_eval_upload_simple.py
+```
+
+#### Expected Results
+- Evaluator upload completes without error.
+- Evaluator appears in SDK list APIs.
+- Evaluator appears in the Azure AI Studio UI under Evaluators.
+- Evaluation runs using the uploaded evaluator complete successfully.
+- Output fields returned by the run match the evaluator definition.
+- Any evaluator-defined score, label, threshold, reasoning, or custom properties are returned with the expected values.
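To make the expected result fields above concrete, here is a minimal self-contained sketch of a custom evaluator that follows the required `CustomNameEvaluator` class-naming pattern. The class, its scoring logic, and the returned field names are illustrative inventions for the bug bash, not code from the samples; per the file-naming rule, a file holding it would be named `answer_length_evaluator.py`:

```python
class AnswerLengthEvaluator:
    """Toy custom evaluator: scores a response by its word count.

    Follows the CustomNameEvaluator class-name pattern; the returned fields
    (score, label, threshold, reason) mirror what this scenario validates.
    """

    def __init__(self, threshold: int = 5):
        self.threshold = threshold

    def __call__(self, *, response: str) -> dict:
        score = len(response.split())
        passed = score >= self.threshold
        return {
            "score": score,
            "label": "pass" if passed else "fail",
            "threshold": self.threshold,
            "reason": f"response contains {score} word(s)",
        }

evaluator = AnswerLengthEvaluator(threshold=3)
print(evaluator(response="Paris is the capital of France."))
# {'score': 6, 'label': 'pass', 'threshold': 3, 'reason': 'response contains 6 word(s)'}
```

Because the scoring is deterministic, every validation input has a known expected output, which is exactly what the result-correctness checks in this scenario need.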
+
+#### What to Test
+- invalid evaluator schema
+- duplicate evaluator names
+- versioning behavior
+- deterministic inputs that should return known results
+- mismatches between evaluator logic and actual returned outputs
+- missing or incorrectly mapped result fields
+
+### Scenario 2: Upload Advanced Custom Evaluator via SDK
+
+#### Goal
+Validate that a more advanced custom evaluator definition uploads successfully, and confirm that evaluation runs using it return the expected outputs from the configured definition.
+
+#### Steps
+Configure the following in the sample:
+
+- advanced evaluator metadata or configuration
+- evaluator logic and output definition
+- validation inputs with known expected results
+
+Then run it:
+
+```bash
+python sample_custom_eval_upload_advanced.py
+```
+
+#### Expected Results
+- Advanced custom evaluator uploads successfully.
+- Evaluator is selectable in the Azure AI Studio UI.
+- Evaluation runs complete successfully.
+- Returned labels, scores, thresholds, and reasoning align with the evaluator configuration.
+- Any additional evaluator-defined output fields are present and correct.
+
+#### What to Test
+- missing required fields
+- invalid evaluator configuration
+- evaluator discoverability in the UI
+- incorrect scoring or labeling behavior
+- incorrect threshold application
+- result payloads that do not match the evaluator definition
+
+### Scenario 3: Run Evaluation via UI with SDK-Uploaded Evaluator
+
+#### Goal
+Ensure evaluators uploaded via the SDK can be used in the Azure AI Studio UI and that the final evaluation results are correct.
+
+#### Steps
+- Navigate to `https://ai.azure.com`.
+- Open the project in the `INT` environment hosted in `Central US EUAP`.
+- Go to Evaluations.
+- Create a new evaluation run.
+- Select the SDK-uploaded evaluator.
+- Choose a dataset or inputs.
+- Run the evaluation.
+- Compare the resulting outputs against the expected values from the evaluator definition.
+
+#### Expected Results
+- Evaluation run starts successfully.
+- Results are generated and visible in the UI. +- Metrics match expectations from evaluator logic. +- Result fields are not missing, renamed incorrectly, or assigned incorrect values. + +### Scenario 4: Validate Result Correctness End to End + +#### Goal +Confirm that evaluator execution returns the proper results defined in the evaluator definition, not just that the run succeeds. + +#### Validation Checklist +- score values are correct for the given test inputs +- labels are correct for the given test inputs +- thresholds are applied correctly +- pass/fail outcomes are correct where applicable +- reasoning or explanation fields are returned when defined +- custom properties or evaluator-specific fields are returned when defined +- results are consistent between SDK and UI views of the same evaluation run + +#### Failure Cases to Capture +- run succeeds but returned values are wrong +- expected fields are missing from results +- field names differ from the evaluator definition +- values are present but assigned to the wrong result fields +- UI and SDK show different outputs for the same run + +## Known Limitations +- Evaluator upload is SDK-only. +- Feature is `INT` region only. +- Uploading evaluators directly through the UI is not supported. + +## Feedback to Capture +During bug bash, please report: + +- SDK usability issues +- error message clarity +- UI discoverability of evaluators +- documentation gaps +- region-related confusion or failures +- support issues for user-defined evaluators authored and uploaded by customers +- result correctness issues where the evaluation completed but the outputs did not match the evaluator definition +- field mapping issues for labels, scores, thresholds, explanations, or custom properties + +## Reporting Bugs +When filing bugs, include: + +- project region +- evaluator used +- SDK version +- full error messages or stack traces +- repro steps \ No newline at end of file