DEVELOPER.md (+25 lines changed: 25 additions & 0 deletions)
@@ -48,6 +48,31 @@ All tools are currently tested in the [MCP Toolbox GitHub](https://github.com/go
The skills themselves are validated using the `skills-validate.yml` workflow.
### Automated Skill Evaluations (EvalBench)
This repository uses the [EvalBench framework](https://github.com/GoogleCloudPlatform/evalbench) to automatically evaluate the quality, multi-turn conversational capabilities, and skill execution of the extension.

Evaluations run automatically via Cloud Build (`cloudbuild.yaml`) on pull requests when the `ci:run-evals` or `autorelease: pending` label is applied. Because tests run against a live Cloud SQL instance, credentials are securely injected by Secret Manager during CI.
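The label rule above can be pictured with a small Python sketch. This is only an illustration of the condition, not how the Cloud Build trigger is actually implemented; the function name `should_run_evals` is hypothetical, and only the two label strings come from this document.

```python
# Label names are taken from the paragraph above.
EVAL_TRIGGER_LABELS = frozenset({"ci:run-evals", "autorelease: pending"})


def should_run_evals(pr_labels):
    """Return True when at least one trigger label is present on the pull request.

    `pr_labels` is any iterable of label strings (hypothetical input shape).
    """
    return not EVAL_TRIGGER_LABELS.isdisjoint(pr_labels)


if __name__ == "__main__":
    print(should_run_evals(["bug", "ci:run-evals"]))
    print(should_run_evals(["documentation"]))
```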
#### Understanding Evaluation Files
All evaluation configurations and datasets are located in the [`evals/`](evals/) directory:

* **Conversational Datasets (`*_dataset.json`):** Define test scenarios for different models (e.g., `gemini_dataset.json`, `claude_dataset.json`). Each scenario contains:
  * `starting_prompt`: The initial prompt sent to the agent.
  * `conversation_plan`: Instructions for the simulated user LLM to drive multi-turn interactions.
  * `expected_trajectory`: The sequence of tool/skill calls expected to successfully complete the task.
* **Run Configurations (`*_run_config.yaml`):** Configure the EvalBench orchestrator, target model configs, and qualitative/performance scorers (e.g., goal completion, behavioral metrics, latency, token consumption).
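To make the scenario fields more concrete, here is a minimal Python sketch that builds one scenario and prints it as JSON. The field names (`id`, `starting_prompt`, `conversation_plan`, `expected_trajectory`) come from this document; the value shapes are assumptions (in particular, that `expected_trajectory` is an ordered list of tool names), and the id, prompt text, and tool name are hypothetical. Check an existing entry in `evals/` for the real schema before copying this.

```python
import json

# Field names come from the list above; the value shapes are assumptions.
REQUIRED_FIELDS = ("starting_prompt", "conversation_plan", "expected_trajectory")


def make_scenario(scenario_id, starting_prompt, conversation_plan, expected_trajectory):
    """Build one scenario dict with the fields described above."""
    return {
        "id": scenario_id,
        "starting_prompt": starting_prompt,
        "conversation_plan": conversation_plan,
        # Assumption: the trajectory is an ordered list of tool/skill names.
        "expected_trajectory": list(expected_trajectory),
    }


example = make_scenario(
    "list-instances-basic",  # hypothetical id
    "List the Cloud SQL instances in my project.",  # hypothetical prompt
    "If the agent asks which project to use, answer 'my-project'. "
    "Stop once the instances have been listed.",
    ["list_instances"],  # hypothetical tool name
)

if __name__ == "__main__":
    print(json.dumps(example, indent=2))
```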
#### Maintaining and Adding Scenarios
When adding new skills or modifying existing behavior, add or update the corresponding scenarios in the dataset files:

1. Open `evals/gemini_dataset.json` (and/or `evals/claude_dataset.json`).
2. Add a new scenario block with a unique `id`, a clear `starting_prompt`, a detailed `conversation_plan`, and the `expected_trajectory` of tool calls.
3. Apply the `ci:run-evals` label while creating your pull request to trigger the evaluation pipeline.
4. Evaluation metrics and outcomes are uploaded to BigQuery and can be monitored on the team's centralized evaluation dashboards.
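Before applying the label, it can help to sanity-check an edited dataset locally. The following is a rough sketch, not part of the repository or of EvalBench: the helper names are hypothetical, and it assumes the dataset file is a top-level JSON list of scenario objects, which may not match the real layout.

```python
import json
from collections import Counter

# Field names come from step 2 above.
REQUIRED_KEYS = {"id", "starting_prompt", "conversation_plan", "expected_trajectory"}


def load_scenarios(path):
    """Load scenarios, assuming the file holds a top-level JSON list (an assumption)."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError(f"{path}: expected a top-level list of scenarios")
    return data


def check_scenarios(scenarios):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    id_counts = Counter(scenario.get("id") for scenario in scenarios)
    for index, scenario in enumerate(scenarios):
        # Report any required field that is absent from this scenario.
        missing = sorted(REQUIRED_KEYS - set(scenario))
        if missing:
            problems.append(f"scenario {index}: missing {', '.join(missing)}")
        # Flag ids that appear more than once, since each `id` must be unique.
        scenario_id = scenario.get("id")
        if scenario_id is not None and id_counts[scenario_id] > 1:
            problems.append(f"scenario {index}: duplicate id {scenario_id!r}")
    return problems
```

A typical use would be `check_scenarios(load_scenarios("evals/gemini_dataset.json"))`, fixing anything it reports before opening the pull request.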
### Other GitHub Checks
* **License Header Check:** A workflow ensures all necessary files contain the