
Commit 8296b10

docs: add documentation for automated EvalBench evaluation workflow and test configuration
1 parent 24b2db3 commit 8296b10

1 file changed

Lines changed: 25 additions & 0 deletions


DEVELOPER.md

@@ -48,6 +48,31 @@ All tools are currently tested in the [MCP Toolbox GitHub](https://github.com/go

The skills themselves are validated using the `skills-validate.yml` workflow.

### Automated Skill Evaluations (EvalBench)
This repository uses the [EvalBench framework](https://github.com/GoogleCloudPlatform/evalbench) to automatically evaluate the quality, multi-turn conversational capabilities, and skill execution of the extension.
Evaluations run automatically via Cloud Build (`cloudbuild.yaml`) on pull requests when the `ci:run-evals` or `autorelease: pending` label is applied. Because tests run against a live Cloud SQL instance, credentials are securely injected by Secret Manager during CI.
#### Understanding Evaluation Files
All evaluation configurations and datasets are located in the [`evals/`](evals/) directory:
* **Conversational Datasets (`*_dataset.json`):** Define test scenarios for different models (e.g., `gemini_dataset.json`, `claude_dataset.json`). Each scenario contains:
    * `starting_prompt`: The initial prompt sent to the agent.
    * `conversation_plan`: Instructions for the simulated user LLM to drive multi-turn interactions.
    * `expected_trajectory`: The sequence of tool/skill calls expected to successfully complete the task.
* **Run Configurations (`*_run_config.yaml`):** Configure the EvalBench orchestrator, target model configs, and qualitative/performance scorers (e.g., goal completion, behavioral metrics, latency, token consumption).
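
As a rough illustration of the scenario fields listed above, the following Python sketch builds one scenario and prints it as JSON. Only the field names (`id`, `starting_prompt`, `conversation_plan`, `expected_trajectory`) come from this document. The value shapes (a string plan, a list of tool names), the scenario ID, and the tool name `list_instances` are assumptions made for illustration; consult the existing dataset files for the actual schema.

```python
import json

# Hypothetical scenario. Field names follow the list above; the value shapes,
# the ID, and the tool name `list_instances` are illustrative assumptions,
# not the confirmed EvalBench schema.
scenario = {
    "id": "list-instances-basic",
    "starting_prompt": "List the Cloud SQL instances in my project.",
    "conversation_plan": (
        "If the agent asks for a project ID, provide one, "
        "then confirm that the listing looks correct."
    ),
    "expected_trajectory": ["list_instances"],
}

# Serialize it the way it might appear inside a `*_dataset.json` file.
print(json.dumps(scenario, indent=2))
```

Copying the structure of an existing scenario in `evals/` is the safer starting point, since the real files are what the pipeline actually reads.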
#### Maintaining and Adding Scenarios
When adding new skills or modifying existing behavior, you should add or update corresponding scenarios in the dataset files:
1. Open `evals/gemini_dataset.json` (and/or `evals/claude_dataset.json`).
2. Add a new scenario block with a unique `id`, a clear `starting_prompt`, a detailed `conversation_plan`, and the `expected_trajectory` of tool calls.
3. Apply the `ci:run-evals` label while creating your pull request to trigger the evaluation pipeline.
4. Evaluation metrics and outcomes are uploaded to BigQuery and can be monitored on the team's centralized evaluation dashboards.
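
Before applying the label, a quick local check can catch duplicate `id` values or missing fields in a dataset file. The sketch below is one possible way to do that and is not part of the repository. It assumes the dataset file's top level is a JSON array of scenario objects and that all four fields named in step 2 are required; adjust it if the real files are structured differently.

```python
import json
from collections import Counter

# Field names come from step 2 above; treating all four as required is an assumption.
REQUIRED_FIELDS = ("id", "starting_prompt", "conversation_plan", "expected_trajectory")


def check_scenarios(scenarios):
    """Return a list of human-readable problems found in a list of scenario dicts."""
    problems = []
    for index, scenario in enumerate(scenarios):
        for field in REQUIRED_FIELDS:
            # A missing key and an empty value are both reported.
            if not scenario.get(field):
                problems.append(f"scenario {index}: missing or empty `{field}`")
    # Count non-empty IDs and report any that occur more than once.
    id_counts = Counter(s.get("id") for s in scenarios if s.get("id"))
    for scenario_id, count in id_counts.items():
        if count > 1:
            problems.append(f"duplicate id `{scenario_id}` appears {count} times")
    return problems


def check_dataset_file(path):
    """Load a dataset file, assuming its top level is a JSON array of scenarios."""
    with open(path, encoding="utf-8") as handle:
        return check_scenarios(json.load(handle))
```

For example, `check_dataset_file("evals/gemini_dataset.json")` returns an empty list when no problems are found under these assumptions.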
### Other GitHub Checks
* **License Header Check:** A workflow ensures all necessary files contain the
