Author eval suite for azure-cost-estimator skill #168
On whether delivering these skills via MCP would be better:

Short version: MCP doesn't really apply to this particular skill, but the question points at something worth checking, so I'm adding a test for it.

What "MCP" is here. Think of a skill as a recipe card the AI reads ("how to estimate Azure costs, step by step"). MCP isn't a different way to hand the AI the recipe; it's a toolbox the recipe tells the AI to reach into (e.g. a live Azure lookup service). So "skills via MCP" mixes two separate things:
Our policy is already "use the MCP toolbox first, fall back to a plain command" (

Why it doesn't apply to

The real question underneath, and the useful one, is: "Is the skill actually looking up live prices, or making them up from memory?" Today the suite only grades the wording of the answer. So I'm adding a small, deterministic check (no second AI judging) that fails the run unless the agent actually called the live pricing API:

    - type: tool_constraint
      name: grounded_via_retail_prices_api
      config:
        expect_tools:
          - tool: bash
            command_pattern: 'prices\.azure\.com'  # must actually call this

It's scoped to the positive tasks only, so it avoids the eval-root

For skills that genuinely do use an Azure MCP tool (e.g.
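To make the intent of the `tool_constraint` check concrete, here is a minimal Python sketch of the matching logic it describes. It is an illustration only: the transcript shape (a list of dicts with `tool` and `command` fields) and the helper name `called_expected_tool` are assumptions, and the real grader's internals are not shown in this thread.

```python
import re

def called_expected_tool(tool_calls, tool, command_pattern):
    """Return True if any call to `tool` ran a command matching the regex.

    `tool_calls` is a hypothetical transcript: one dict per tool call,
    with assumed keys "tool" and "command".
    """
    pattern = re.compile(command_pattern)
    return any(
        call.get("tool") == tool
        and pattern.search(call.get("command", "")) is not None
        for call in tool_calls
    )

# A run that really hit the live pricing endpoint.
grounded = [
    {"tool": "bash", "command": "curl -s 'https://prices.azure.com/api/retail/prices'"},
]
# A run that answered from memory without calling the API.
from_memory = [
    {"tool": "bash", "command": "echo 'B1ls costs a few dollars a month'"},
]

PATTERN = r"prices\.azure\.com"  # same pattern as in the YAML config
print(called_expected_tool(grounded, "bash", PATTERN))     # True
print(called_expected_tool(from_memory, "bash", PATTERN))  # False
```

Because this is a plain regex match over recorded tool calls, the result is the same on every run for a given transcript, which is what makes the check deterministic.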
Adds the expanded-tier evaluation suite for the azure-cost-estimator skill, mirroring the established azure-stack-deploy / azure-stack-destroy layout.

Suite (.github/evals/azure-cost-estimator/)

- eval.yaml: copilot-sdk executor, claude-sonnet-4.6, 2 trials/task, single trigger_precision metric (threshold 0.6), and a behavior budget grader (30 tool calls / 240s). No eval-level skill_invocation (deterministic 0.0 noise on negatives; see commit 2f699c79).
- continue_session: true:
  - positive-arm-template-estimate: B1ls + 30 GB Standard_LRS disk + static IP in southeastasia. Judge checks: names the Retail Prices API, uses ×730 hourly→monthly, flags free resources (VNet/NSG/NIC), produces a per-resource breakdown.
  - positive-sku-compare: B2s vs B1ls in eastus. Judge checks: Retail Prices API, correct OData $filter shape (serviceName eq 'Virtual Machines' + armRegionName + armSkuName + priceType eq 'Consumption'), ×730 conversion, reports per-SKU figure and delta.
- trigger grader only (mode: negative), so refusals/out-of-scope acks score against SKILL.md scope (persona lock):
  - negative-off-topic: Linux CFS scheduler question.
  - negative-billing-history: actual invoiced usage via Cost Management (belongs to a billing/cost-analysis flow, not the retail-prices estimator).

Manifest

- .github/evals/manifest.yaml: appended at tier: expanded (2-model fan-out: claude-sonnet-4.6, gpt-5.3-codex).

Mock CI runs automatically on this PR; a maintainer will dispatch the real-model run before merge.
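For reviewers unfamiliar with what the positive-task judge looks for, the following Python sketch shows the OData `$filter` shape and the ×730 hourly-to-monthly conversion named in the task descriptions. It is a sketch under stated assumptions, not the skill's implementation: the helper names are hypothetical, the `Standard_` SKU prefix is assumed, the hourly prices are made-up placeholders rather than real quotes, and no network call is made.

```python
from urllib.parse import quote, urlencode

RETAIL_PRICES_URL = "https://prices.azure.com/api/retail/prices"
HOURS_PER_MONTH = 730  # the x730 hourly-to-monthly convention from the judge checks

def build_filter(region, arm_sku_name):
    """Assemble the four-clause OData filter listed in positive-sku-compare."""
    return (
        "serviceName eq 'Virtual Machines'"
        f" and armRegionName eq '{region}'"
        f" and armSkuName eq '{arm_sku_name}'"
        " and priceType eq 'Consumption'"
    )

def build_url(region, arm_sku_name):
    """Percent-encode spaces but keep `$` and quotes readable in the query."""
    query = urlencode(
        {"$filter": build_filter(region, arm_sku_name)},
        safe="$'",
        quote_via=quote,
    )
    return f"{RETAIL_PRICES_URL}?{query}"

def monthly_cost(hourly_price):
    """Convert an hourly retail price to a monthly estimate."""
    return round(hourly_price * HOURS_PER_MONTH, 2)

# Placeholder hourly prices for illustration only; real values come from the API.
hourly = {"Standard_B2s": 0.04, "Standard_B1ls": 0.005}

for sku in hourly:
    print(build_url("eastus", sku))

monthly = {sku: monthly_cost(price) for sku, price in hourly.items()}
delta = round(monthly["Standard_B2s"] - monthly["Standard_B1ls"], 2)
print(monthly, "delta:", delta)
```

Reporting the per-SKU monthly figure alongside the delta mirrors the two outputs the positive-sku-compare judge expects.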