From 8296b10fa42ae65cee7f770f7973e089b0b498cf Mon Sep 17 00:00:00 2001
From: Omkar Gaikwad <omkargaikwad@google.com>
Date: Fri, 8 May 2026 10:46:34 +0000
Subject: [PATCH 1/2] docs: add documentation for automated EvalBench
 evaluation workflow and test configuration

---
 DEVELOPER.md | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)
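
Note for reviewers: the scenario fields documented in this patch can be
sanity-checked locally before applying the `ci:run-evals` label. The block
below is a minimal sketch, not part of EvalBench or of this repository.
`check_scenarios` is a hypothetical helper, and it assumes that a dataset
file holds a JSON list of scenario objects, which this patch does not confirm.
Only the four field names are taken from the DEVELOPER.md text.

```python
import json

# Field names taken from the DEVELOPER.md text added in this patch.
REQUIRED_FIELDS = ("id", "starting_prompt", "conversation_plan", "expected_trajectory")


def check_scenarios(scenarios):
    """Return a list of problems found; an empty list means none were found.

    Assumption: `scenarios` is a list of dicts, one per scenario.
    """
    problems = []
    seen_ids = set()
    for index, scenario in enumerate(scenarios):
        # Report every documented field that the scenario lacks.
        missing = [name for name in REQUIRED_FIELDS if name not in scenario]
        if missing:
            problems.append(f"scenario {index}: missing {', '.join(missing)}")
        # The documentation asks for a unique `id` per scenario.
        scenario_id = scenario.get("id")
        if scenario_id in seen_ids:
            problems.append(f"scenario {index}: duplicate id {scenario_id!r}")
        elif scenario_id is not None:
            seen_ids.add(scenario_id)
    return problems


def check_dataset_file(path):
    """Load a dataset file (assumed to be a JSON list) and check it."""
    with open(path, encoding="utf-8") as handle:
        return check_scenarios(json.load(handle))
```

Under those assumptions, `check_dataset_file("evals/gemini_dataset.json")`
would return one message per missing field group or duplicated `id`.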

diff --git a/DEVELOPER.md b/DEVELOPER.md
index 69599c2..e8b4f40 100644
--- a/DEVELOPER.md
+++ b/DEVELOPER.md
@@ -48,6 +48,31 @@ All tools are currently tested in the [MCP Toolbox GitHub](https://github.com/go
 
 The skills themselves are validated using the `skills-validate.yml` workflow.
 
+### Automated Skill Evaluations (EvalBench)
+
+This repository uses the [EvalBench framework](https://github.com/GoogleCloudPlatform/evalbench) to automatically evaluate the quality, multi-turn conversational capabilities, and skill execution of the extension.
+
+Evaluations run in Cloud Build (`cloudbuild.yaml`) on pull requests that carry the `ci:run-evals` or `autorelease: pending` label. Because the tests run against a live Cloud SQL instance, credentials are injected from Secret Manager during CI.
+
+#### Understanding Evaluation Files
+
+All evaluation configurations and datasets are located in the [`evals/`](evals/) directory:
+
+*   **Conversational Datasets (`*_dataset.json`):** Define test scenarios for different models (e.g., `gemini_dataset.json`, `claude_dataset.json`). Each scenario contains:
+    *   `starting_prompt`: The initial prompt sent to the agent.
+    *   `conversation_plan`: Instructions for the simulated user LLM to drive multi-turn interactions.
+    *   `expected_trajectory`: The sequence of tool/skill calls expected to successfully complete the task.
+*   **Run Configurations (`*_run_config.yaml`):** Configure the EvalBench orchestrator, target model configs, and qualitative/performance scorers (e.g., goal completion, behavioral metrics, latency, token consumption).
+
+#### Maintaining and Adding Scenarios
+
+When adding new skills or modifying existing behavior, you should add or update corresponding scenarios in the dataset files:
+
+1.  Open `evals/gemini_dataset.json` (and/or `evals/claude_dataset.json`).
+2.  Add a new scenario block with a unique `id`, a clear `starting_prompt`, a detailed `conversation_plan`, and the `expected_trajectory` of tool calls.
+3.  Apply the `ci:run-evals` label while creating your pull request to trigger the evaluation pipeline.
+4.  Evaluation metrics and outcomes are uploaded to BigQuery and can be monitored on the team's centralized evaluation dashboards.
+
 ### Other GitHub Checks
 
 *   **License Header Check:** A workflow ensures all necessary files contain the

From 0dedfcc7fd61b0260e620d82dd078f6a4fc864dc Mon Sep 17 00:00:00 2001
From: Omkar Gaikwad <omkargaikwad@google.com>
Date: Fri, 8 May 2026 12:42:11 +0000
Subject: [PATCH 2/2] docs: update evaluation pipeline instructions to
 reference Cloud Build and maintainer review process

---
 DEVELOPER.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/DEVELOPER.md b/DEVELOPER.md
index e8b4f40..7abbd8a 100644
--- a/DEVELOPER.md
+++ b/DEVELOPER.md
@@ -71,7 +71,7 @@ When adding new skills or modifying existing behavior, you should add or update
 1.  Open `evals/gemini_dataset.json` (and/or `evals/claude_dataset.json`).
 2.  Add a new scenario block with a unique `id`, a clear `starting_prompt`, a detailed `conversation_plan`, and the `expected_trajectory` of tool calls.
 3.  Apply the `ci:run-evals` label while creating your pull request to trigger the evaluation pipeline.
-4.  Evaluation metrics and outcomes are uploaded to BigQuery and can be monitored on the team's centralized evaluation dashboards.
+4.  The evaluation pipeline runs securely in Cloud Build. A maintainer will review the internal logs and results to verify that your scenarios pass.
 
 ### Other GitHub Checks