Author eval suite for azure-cost-estimator skill by Copilot · Pull Request #168 · Azure/git-ape

Copilot · 2026-06-09T07:33:58Z

Adds the expanded-tier evaluation suite for the azure-cost-estimator skill, mirroring the established azure-stack-deploy / azure-stack-destroy layout.

Suite (`.github/evals/azure-cost-estimator/`)

eval.yaml: copilot-sdk executor, claude-sonnet-4.6, 2 trials/task, a single trigger_precision metric (threshold 0.6), and a behavior budget grader (30 tool calls / 240s). There is no eval-level skill_invocation grader, because it scores a deterministic 0.0 on negative tasks and only adds noise (see commit 2f699c79).
Positive tasks: answer_quality (LLM-judge) graders are scoped per-task with continue_session: true:
- positive-arm-template-estimate — B1ls + 30 GB Standard_LRS disk + static IP in southeastasia. Judge checks: names the Retail Prices API, uses ×730 hourly→monthly, flags free resources (VNet/NSG/NIC), produces a per-resource breakdown.
- positive-sku-compare — B2s vs B1ls in eastus. Judge checks: Retail Prices API, correct OData $filter shape (serviceName eq 'Virtual Machines' + armRegionName + armSkuName + priceType eq 'Consumption'), ×730 conversion, reports per-SKU figure and delta.
Negative tasks: trigger grader only (mode: negative), so refusals and out-of-scope acknowledgements are scored against the SKILL.md scope (persona lock):
- negative-off-topic — Linux CFS scheduler question.
- negative-billing-history — actual invoiced usage via Cost Management (belongs to a billing/cost-analysis flow, not the retail-prices estimator).
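
For reference, the query shape and the ×730 conversion that the positive-task judges look for can be sketched roughly as below. This is a minimal illustration, not the skill's actual implementation: the helper names are hypothetical, the `Standard_` prefix on the SKU name is an assumption about how `armSkuName` values are spelled, and the hourly rates in the usage note are placeholders rather than real prices.

```python
from urllib.parse import urlencode

# Public Azure Retail Prices endpoint (no authentication required).
RETAIL_PRICES_ENDPOINT = "https://prices.azure.com/api/retail/prices"
# Hours-per-month factor the judges expect for hourly -> monthly conversion.
HOURS_PER_MONTH = 730


def build_vm_price_filter(region: str, sku: str) -> str:
    """Build the OData $filter for a pay-as-you-go VM price lookup."""
    clauses = [
        "serviceName eq 'Virtual Machines'",
        f"armRegionName eq '{region}'",
        f"armSkuName eq '{sku}'",
        "priceType eq 'Consumption'",
    ]
    return " and ".join(clauses)


def build_query_url(region: str, sku: str) -> str:
    """Return the full, URL-encoded Retail Prices API query."""
    query = urlencode({"$filter": build_vm_price_filter(region, sku)})
    return f"{RETAIL_PRICES_ENDPOINT}?{query}"


def hourly_to_monthly(hourly_rate: float) -> float:
    """Convert an hourly retail price to a monthly estimate (x730)."""
    return round(hourly_rate * HOURS_PER_MONTH, 2)


def monthly_delta(hourly_a: float, hourly_b: float) -> float:
    """Monthly cost difference between two hourly rates."""
    return round((hourly_a - hourly_b) * HOURS_PER_MONTH, 2)
```

For the sku-compare task, one would call `build_query_url("eastus", "Standard_B2s")` and `build_query_url("eastus", "Standard_B1ls")`, read the hourly price from each response, and then report `hourly_to_monthly(...)` per SKU plus `monthly_delta(...)`.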

Manifest

.github/evals/manifest.yaml — appended at tier: expanded (2-model fan-out: claude-sonnet-4.6, gpt-5.3-codex).

  - name: azure-stack-destroy
    tier: expanded
  - name: azure-cost-estimator
    tier: expanded

Mock CI runs automatically on this PR; a maintainer will dispatch the real-model run before merge.

sendtoshailesh · 2026-06-15T07:56:23Z

On whether delivering these skills via MCP would be better:

Short version: MCP doesn't really apply to this particular skill — but the question points at something worth checking, so I'm adding a test for it.

What "MCP" is here. Think of a skill as a recipe card the AI reads ("how to estimate Azure costs, step by step"). MCP isn't a different way to hand the AI the recipe — it's a toolbox the recipe tells the AI to reach into (e.g. a live Azure lookup service). So "skills via MCP" mixes two separate things:

- how the AI gets the recipe → a markdown file (unchanged either way), and
- what tool the recipe tells it to grab → MCP, or a plain command.

Our policy is already "use the MCP toolbox first, fall back to a plain command" (skills.md:155).

Why it doesn't apply to azure-cost-estimator. This skill doesn't touch the MCP toolbox at all — it calls a free, public Microsoft pricing API (prices.azure.com) with a simple curl. There's no MCP equivalent for retail prices, so there's nothing to A/B against here.

The real question underneath — and the useful one — is: "Is the skill actually looking up live prices, or making them up from memory?" Today the suite only grades the wording of the answer. So I'm adding a small, deterministic check (no second AI judging) that fails the run unless the agent actually called the live pricing API:

- type: tool_constraint
  name: grounded_via_retail_prices_api
  config:
    expect_tools:
      - tool: bash
        command_pattern: 'prices\.azure\.com'   # must actually call this

It's scoped to the positive tasks only, so it avoids the eval-root skill_invocation 0.0-on-negatives noise we removed in 2f699c79. (I didn't use a skill_invocation grader — that only confirms the skill activated, which the existing positive trigger grader already does.)
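
As a rough illustration of what the grounded_via_retail_prices_api check asserts, here is a small sketch of the matching logic. It is not the eval framework's implementation: the transcript shape (a list of dicts with `tool` and `command` keys) and the function name are assumptions made up for this example.

```python
import re

# Same pattern as command_pattern in the grader config.
PRICING_HOST = re.compile(r"prices\.azure\.com")


def called_retail_prices_api(tool_calls: list[dict]) -> bool:
    """True if at least one bash call hit the live pricing host.

    `tool_calls` is a hypothetical transcript format:
    [{"tool": "bash", "command": "curl ..."}, ...]
    """
    return any(
        call.get("tool") == "bash"
        and PRICING_HOST.search(call.get("command", "")) is not None
        for call in tool_calls
    )


# Hypothetical transcripts: one grounded run, one answered from memory.
grounded_run = [
    {"tool": "bash", "command": "curl -s 'https://prices.azure.com/api/retail/prices'"},
]
ungrounded_run = [
    {"tool": "bash", "command": "echo 'B2s costs about the same as last year'"},
]

print(called_retail_prices_api(grounded_run))
print(called_retail_prices_api(ungrounded_run))
```

Because the check only looks at recorded tool calls, a run that quotes plausible prices without ever calling the API fails regardless of how the answer is worded.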

For skills that genuinely do use an Azure MCP tool (e.g. azure-security-analyzer), the same check just points at the MCP tool name instead — and that's where an MCP-vs-CLI comparison would actually be meaningful.

Initial plan

bfe7985

Copilot AI assigned Copilot and sendtoshailesh Jun 9, 2026

Copilot AI linked an issue Jun 9, 2026 that may be closed by this pull request

Author eval suite for skill azure-cost-estimator #99

Open

8 tasks

Copilot started work on behalf of sendtoshailesh June 9, 2026 07:34

Author eval suite for azure-cost-estimator skill

908f270

Copilot AI changed the title ~~[WIP] Add eval suite for azure-cost-estimator skill~~ Author eval suite for azure-cost-estimator skill Jun 9, 2026

Copilot finished work on behalf of sendtoshailesh June 9, 2026 07:38

Copilot AI requested a review from sendtoshailesh June 9, 2026 07:38

Copilot wants to merge 2 commits into main from copilot/author-eval-suite-azure-cost-estimator

2 participants
