
Author eval suite for azure-cost-estimator skill #168

Draft
Copilot wants to merge 2 commits into main from copilot/author-eval-suite-azure-cost-estimator

Conversation

Copilot AI commented Jun 9, 2026

Contributor

Adds the expanded-tier evaluation suite for the azure-cost-estimator skill, mirroring the established azure-stack-deploy / azure-stack-destroy layout.

Suite (.github/evals/azure-cost-estimator/)

  • eval.yaml: copilot-sdk executor, claude-sonnet-4.6, 2 trials/task, single trigger_precision metric (threshold 0.6), and a behavior budget grader (30 tool calls / 240s). No eval-level skill_invocation (deterministic 0.0 noise on negatives; see commit 2f699c79).
  • Positive tasks — answer_quality (LLM-judge) graders are scoped per-task with continue_session: true:
    • positive-arm-template-estimate — B1ls + 30 GB Standard_LRS disk + static IP in southeastasia. Judge checks: names the Retail Prices API, uses ×730 hourly→monthly, flags free resources (VNet/NSG/NIC), produces a per-resource breakdown.
    • positive-sku-compare — B2s vs B1ls in eastus. Judge checks: Retail Prices API, correct OData $filter shape (serviceName eq 'Virtual Machines' + armRegionName + armSkuName + priceType eq 'Consumption'), ×730 conversion, reports per-SKU figure and delta.
  • Negative tasks: trigger grader only (mode: negative), so refusals/out-of-scope acks score against SKILL.md scope (persona lock):
    • negative-off-topic — Linux CFS scheduler question.
    • negative-billing-history — actual invoiced usage via Cost Management (belongs to a billing/cost-analysis flow, not the retail-prices estimator).
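
For reviewers who want to see what the positive-sku-compare judge is looking for, here is a minimal Python sketch of the query shape and the ×730 conversion. It is not taken from the skill itself (which uses curl); the helper names are hypothetical, and the `Standard_B2s` SKU string in the usage note is my assumption about the armSkuName value. It only builds the request URL and does the arithmetic, so it makes no network call.

```python
from urllib.parse import quote, urlencode

# Public Retail Prices endpoint referenced in the judge criteria.
RETAIL_PRICES_URL = "https://prices.azure.com/api/retail/prices"
# Hourly-to-monthly factor the judge checks for.
HOURS_PER_MONTH = 730


def build_filter(region: str, sku: str) -> str:
    """Return an OData $filter with the four clauses the judge expects."""
    clauses = [
        "serviceName eq 'Virtual Machines'",
        f"armRegionName eq '{region}'",
        f"armSkuName eq '{sku}'",
        "priceType eq 'Consumption'",
    ]
    return " and ".join(clauses)


def build_url(region: str, sku: str) -> str:
    """Return the full request URL with the filter percent-encoded."""
    query = urlencode({"$filter": build_filter(region, sku)}, quote_via=quote)
    return f"{RETAIL_PRICES_URL}?{query}"


def monthly_cost(hourly_price: float) -> float:
    """Convert an hourly retail price to a monthly estimate (x730)."""
    return round(hourly_price * HOURS_PER_MONTH, 2)


def monthly_delta(hourly_a: float, hourly_b: float) -> float:
    """Monthly cost difference between two SKUs, as the compare task reports."""
    return round(monthly_cost(hourly_a) - monthly_cost(hourly_b), 2)
```

Calling `build_url("eastus", "Standard_B2s")` gives a URL that can be passed to curl; the hourly prices fed to `monthly_cost` should come from the API response, not from memory.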

Manifest

.github/evals/manifest.yaml — appended at tier: expanded (2-model fan-out: claude-sonnet-4.6, gpt-5.3-codex).

  - name: azure-stack-destroy
    tier: expanded
  - name: azure-cost-estimator
    tier: expanded

Mock CI runs automatically on this PR; a maintainer will dispatch the real-model run before merge.

Copilot AI linked an issue Jun 9, 2026 that may be closed by this pull request
8 tasks

sendtoshailesh commented Jun 15, 2026

Contributor

On whether delivering these skills via MCP would be better:

Short version: MCP doesn't really apply to this particular skill — but the question points at something worth checking, so I'm adding a test for it.

What "MCP" is here. Think of a skill as a recipe card the AI reads ("how to estimate Azure costs, step by step"). MCP isn't a different way to hand the AI the recipe — it's a toolbox the recipe tells the AI to reach into (e.g. a live Azure lookup service). So "skills via MCP" mixes two separate things:

  • how the AI gets the recipe → a markdown file (unchanged either way), and
  • what tool the recipe tells it to grab → MCP, or a plain command.

Our policy is already "use the MCP toolbox first, fall back to a plain command" (skills.md:155).

Why it doesn't apply to azure-cost-estimator. This skill doesn't touch the MCP toolbox at all — it calls a free, public Microsoft pricing API (prices.azure.com) with a simple curl. There's no MCP equivalent for retail prices, so there's nothing to A/B against here.

The real question underneath — and the useful one — is: "Is the skill actually looking up live prices, or making them up from memory?" Today the suite only grades the wording of the answer. So I'm adding a small, deterministic check (no second AI judging) that fails the run unless the agent actually called the live pricing API:

- type: tool_constraint
  name: grounded_via_retail_prices_api
  config:
    expect_tools:
      - tool: bash
        command_pattern: 'prices\.azure\.com'   # must actually call this

It's scoped to the positive tasks only, so it avoids the eval-root skill_invocation 0.0-on-negatives noise we removed in 2f699c79. (I didn't use a skill_invocation grader — that only confirms the skill activated, which the existing positive trigger grader already does.)
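
To make the intent of that check concrete, here is a rough Python sketch of the logic I have in mind. It is not the eval framework's actual implementation: the function name and the shape of the tool-call trace (a list of dicts with `tool` and `command` keys) are assumptions for illustration only.

```python
import re
from typing import Iterable, Mapping


def called_pricing_api(
    tool_calls: Iterable[Mapping[str, str]],
    tool: str = "bash",
    command_pattern: str = r"prices\.azure\.com",
) -> bool:
    """Return True if any call to `tool` ran a command matching the pattern.

    The trace format is hypothetical: each entry is assumed to carry the
    tool name under "tool" and the executed command under "command".
    """
    pattern = re.compile(command_pattern)
    return any(
        call.get("tool") == tool and pattern.search(call.get("command", ""))
        for call in tool_calls
    )


# A grounded run: the agent really queried the live pricing API.
grounded_trace = [
    {"tool": "bash", "command": "curl -s 'https://prices.azure.com/api/retail/prices'"},
]

# An ungrounded run: the agent only echoed a number it recalled.
ungrounded_trace = [
    {"tool": "bash", "command": "echo 'roughly 30 USD per month'"},
]
```

Under these assumptions the first trace passes and the second fails, which is the distinction the wording-only judge cannot make.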

For skills that genuinely do use an Azure MCP tool (e.g. azure-security-analyzer), the same check just points at the MCP tool name instead — and that's where an MCP-vs-CLI comparison would actually be meaningful.


Development

Successfully merging this pull request may close these issues.

Author eval suite for skill azure-cost-estimator

2 participants