feat(skills): add landing zone discovery skill with cross-shell scripts #177
arnaudlh wants to merge 4 commits into
- add azure-landing-zone-discovery skill, evals, fixtures, and docs
- ship discover-lz/inject-lz in both bash and PowerShell parity ports
- document dual-shell helper-script convention in authoring docs
- wire landing-zone context into agents and copilot-instructions

🧭 - Generated by Copilot
🤖 Waza agent evals (advisory)
Ran 0 agent evals against
📊 Agent file token comparison vs
🧪 Waza skill evals (advisory)
Ran 16 matrix legs in parallel (skills × models). Results are non-blocking — investigate failures via the workflow logs and the per-leg
📊 Token comparison vs
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | ✅ | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 0.89 | ❌ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Positive — "What do I need to install?": 67% pass rate, score=0.89±0.16
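The flaky-task figure can be reproduced with a small sketch. It assumes, since the log does not state it, that a run's score is the unweighted mean of its grader scores, that the ± value is a population standard deviation, and that the two passing runs scored 1.00 on every grader. The function name `summarize_runs` is hypothetical.

```python
from statistics import mean, pstdev


def summarize_runs(runs):
    """Summarize repeated runs of one task.

    `runs` is a list of (passed, grader_scores) pairs. Treating the run
    score as the unweighted mean of its grader scores, and the spread as a
    population standard deviation, are assumptions made for this sketch.
    """
    pass_rate = sum(1 for passed, _ in runs if passed) / len(runs)
    run_scores = [mean(scores) for _, scores in runs]
    return (
        f"{pass_rate:.0%} pass rate, "
        f"score={mean(run_scores):.2f}±{pstdev(run_scores):.2f}"
    )


# Grader order: answer_quality, budget, trigger_relevance_positive.
install_runs = [
    (False, [0.00, 1.00, 1.00]),  # run 1/3, as reported in the failure details
    (True, [1.00, 1.00, 1.00]),   # assumed: a passing run scores 1.00 everywhere
    (True, [1.00, 1.00, 1.00]),   # assumed, as above
]

print(summarize_runs(install_runs))  # 67% pass rate, score=0.89±0.16
```

Under these assumptions a single errored run (answer_quality at 0.00) is enough to produce the reported 67% pass rate and 0.89±0.16.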
Failed Task Details
Positive — "What do I need to install?"
Run 1/3 (error):
- ❌ answer_quality (0.00): fail: Assistant never produced a response to grade: There is no prior assistant response in this session addressing the user's question about Git-Ape onboarding prerequisites. With no response to evaluate, none of the four PASS criteria can be met: (1) no listing of az/gh/jq/git, (2) no authentication notes (az login/gh auth login), (3) no version guidance, and (4) no install commands or verification script reference.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Benchmark: prereq-check-eval | Skill: prereq-check | Model: claude-opus-4.6
Results saved to: .waza-results/prereq-check-claude-opus-4.6.json
Model: claude-sonnet-4.6
Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers
✓ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
[ERROR] session error: Execution failed: CAPIError: 422 422 422 Unprocessable Entity
(Request ID: A82A:3E39BF:356F293:3D0E1DD:6A2F6E8B)
[ERROR] session error: Execution failed: CAPIError: 422 422 422 Unprocessable Entity
(Request ID: A829:FB2C:356B6F3:3D057AB:6A2F6E87)
✗ [4/4] Positive — "What do I need to install?"
[ERROR] session error: Execution failed: CAPIError: 422 422 422 Unprocessable Entity
(Request ID: A829:FB2C:3573D45:3D0ECFA:6A2F6EA7)
✗ [3/4] Positive — "command not found" failure
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.71 | Duration: 1m37.055s
- Tests: 4 total, 2 passed, 2 failed, 0 errors
- Success Rate: 50.0%
- Score Range: 0.57 - 0.89 (σ=0.1303)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | ✅ | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 0.78 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 0.89 | ❌ | answer_quality, budget, trigger_relevance_positive |
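
The summary line for this run (score 0.71, range 0.57 - 0.89, σ=0.1303) appears to be derived from the four task scores in the table. A minimal sketch, assuming the overall score is the plain mean of the task scores and σ is a population standard deviation; the short task keys and the `aggregate` helper are hypothetical:

```python
from statistics import mean, pstdev


def aggregate(task_scores):
    """Aggregate per-task scores into overall score, range, and spread.

    Assumes an unweighted mean and a population standard deviation;
    neither is confirmed by the log.
    """
    scores = list(task_scores.values())
    return {
        "score": mean(scores),
        "low": min(scores),
        "high": max(scores),
        "sigma": pstdev(scores),
    }


# Scores as displayed (rounded to two decimals) in the task table.
task_scores = {
    "arm_template": 0.57,
    "service_concept": 0.60,
    "command_not_found": 0.78,
    "what_to_install": 0.89,
}

summary = aggregate(task_scores)
print(
    f"Score: {summary['score']:.2f} | "
    f"Range: {summary['low']:.2f} - {summary['high']:.2f} | "
    f"sigma: {summary['sigma']:.2f}"
)
```

Computed from the rounded table values, σ comes out near 0.13, close to the reported 0.1303; the small gap is presumably because the log works from unrounded scores.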
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Positive — "command not found" failure: 33% pass rate, score=0.78±0.16
- Positive — "What do I need to install?": 67% pass rate, score=0.89±0.16
Failed Task Details
Positive — "command not found" failure
Run 2/3 (error):
- ❌ answer_quality (0.00): fail: : The assistant's previous response did not contain any answer to the user. The tool calls returned errors ("unexpected user permission response") and no chat content listing the required tools (az, gh, jq, git), install commands for az, version verification guidance, or a verdict was produced. All four PASS criteria are missing.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Run 3/3 (error):
- ❌ answer_quality (0.00): fail: Assistant never produced a user-facing answer: The assistant invoked the prereq-check skill and attempted tool calls, but all bash/view calls returned "unexpected user permission response" errors. The assistant never produced any answer text listing the required tools (az, gh, jq, git), never provided install commands for az, never recommended version verification, and never reached a verdict. All four PASS criteria are missing.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Positive — "What do I need to install?"
Run 2/3 (error):
- ❌ answer_quality (0.00): fail: Assistant never produced a user-facing answer: The assistant invoked the prereq-check skill and attempted tool calls, but all tool calls returned "unexpected user permission response" errors, and the assistant never delivered any response listing required tools (az, gh, jq, git), auth requirements, versions, or install commands. None of the four PASS criteria are met because no substantive content was produced.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Benchmark: prereq-check-eval | Skill: prereq-check | Model: claude-sonnet-4.6
Results saved to: .waza-results/prereq-check-claude-sonnet-4.6.json
Model: gpt-5.3-codex
Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers
✓ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
✗ [4/4] Positive — "What do I need to install?"
✗ [3/4] Positive — "command not found" failure
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.74 | Duration: 1m48.583s
- Tests: 4 total, 2 passed, 2 failed, 0 errors
- Success Rate: 50.0%
- Score Range: 0.57 - 0.89 (σ=0.1519)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | ✅ | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 0.89 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 0.89 | ❌ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Positive — "command not found" failure: 67% pass rate, score=0.89±0.16
- Positive — "What do I need to install?": 67% pass rate, score=0.89±0.16
Failed Task Details
Positive — "command not found" failure
Run 1/3 (failed):
- ❌ answer_quality (0.00): fail: : Criterion 3 missing: the response does not recommend verifying versions after install (no `az --version`/`gh --version` verification step). Criteria 1 (all four tools named), 2 (curl install command for az on Linux), and 4 (verdict "⚠️ TOOLS MISSING" plus next steps) are met.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Positive — "What do I need to install?"
Run 3/3 (failed):
- ❌ answer_quality (0.00): fail: Missing criterion 4: The response lists az/gh/jq/git (1✓), mentions az login and gh auth login (2✓), and provides minimum versions (3✓), but does NOT provide install commands or point the user to a verification script/skill invocation (4✗). The tool-check script failed to run and the assistant did not fall back to printing install commands from references/install-commands.md or directing the user to run the prereq-check skill.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Benchmark: prereq-check-eval | Skill: prereq-check | Model: gpt-5.3-codex
Results saved to: .waza-results/prereq-check-gpt-5.3-codex.json
Model: gpt-5.4 *(baseline — A/B mode)*
Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: gpt-5.4
Judge Model: claude-opus-4.7
Parallel: 4 workers
════════════════════════════════════════════════════════════════
PASS 1: Skills-Enabled Run
════════════════════════════════════════════════════════════════
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: 801A:3E39BF:3531F3B:3CCA6D3:6A2F6DAD)
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: 801A:3E39BF:3531F80:3CCA749:6A2F6DAF)
[ERROR] session error: Execution failed: CAPIError: 422 422 Unprocessable Entity
(Request ID: 8019:185E1:36D9430:3E5D883:6A2F6DC3)
✓ [2/4] Negative — Azure service concept question
[ERROR] session error: Execution failed: CAPIError: 422 422 Unprocessable Entity
(Request ID: 801A:3E39BF:353EB07:3CD88D7:6A2F6DDD)
✗ [3/4] Positive — "command not found" failure
✗ [4/4] Positive — "What do I need to install?"
[ERROR] waiting for session.idle: context deadline exceeded
✗ [1/4] Negative — Editing an ARM template
════════════════════════════════════════════════════════════════
PASS 2: Skills Baseline (skills stripped)
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
✗ [3/4] Positive — "command not found" failure
✗ [4/4] Positive — "What do I need to install?"
════════════════════════════════════════════════════════════════
SKILL IMPACT ANALYSIS
════════════════════════════════════════════════════════════════
Overall Performance Delta:
With Skills: 25.0% (1/4 tasks passed)
Without Skills: 50.0% (2/4 tasks passed)
Impact: -25.0 percentage points
Per-Task Breakdown:
• Negative — Editing an ARM template [REGRESSED] 100% → 67% (-33pp)
• Negative — Azure service concept question [NEUTRAL] 100% → 100% (+0pp)
• Positive — "command not found" failure [REGRESSED] 33% → 0% (-33pp)
• Positive — "What do I need to install?" [REGRESSED] 67% → 33% (-33pp)
Verdict: Skills have NEGATIVE IMPACT (regressed 3/4 tasks)
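The figures in this analysis are consistent with a simple percentage-point comparison. The sketch below assumes three runs per task, reads each arrow as baseline → with skills, and counts a task as passed only when every run passed; these rules are inferred from the numbers above, not confirmed by the tool. The `skill_impact` helper, the short task keys, and the `IMPROVED` label are hypothetical.

```python
def skill_impact(runs_passed, runs_per_task=3):
    """Compare baseline and skills-enabled pass counts per task.

    `runs_passed` maps task -> (baseline_passed, with_skills_passed).
    """
    lines = []
    regressed = 0
    for task, (base, skilled) in runs_passed.items():
        delta_pp = round(100 * (skilled - base) / runs_per_task)
        if delta_pp < 0:
            label = "REGRESSED"
            regressed += 1
        elif delta_pp > 0:
            label = "IMPROVED"  # hypothetical label; not shown in the log
        else:
            label = "NEUTRAL"
        lines.append(
            f"{task} [{label}] {base / runs_per_task:.0%} → "
            f"{skilled / runs_per_task:.0%} ({delta_pp:+d}pp)"
        )

    total = len(runs_passed)
    # Assumption: a task counts as passed only if all of its runs passed.
    base_rate = sum(1 for b, _ in runs_passed.values() if b == runs_per_task) / total
    with_rate = sum(1 for _, s in runs_passed.values() if s == runs_per_task) / total
    return {
        "lines": lines,
        "with_skills": with_rate,
        "without_skills": base_rate,
        "overall_delta_pp": round(100 * (with_rate - base_rate)),
        "regressed": regressed,
    }


# Pass counts out of 3 runs, read off the per-task breakdown above.
result = skill_impact({
    "arm_template": (3, 2),
    "service_concept": (3, 3),
    "command_not_found": (1, 0),
    "what_to_install": (2, 1),
})
for line in result["lines"]:
    print(line)
print(f"Impact: {result['overall_delta_pp']:+d} percentage points")
```

Note that several skills-enabled runs in PASS 1 ended in rate-limit, 422, or timeout errors, so a delta computed this way reflects infrastructure failures as well as skill quality.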
════════════════════════════════════════════════════════════════
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.65 | Duration: 1m36.025s
- Tests: 4 total, 1 passed, 3 failed, 0 errors
- Success Rate: 25.0%
- Score Range: 0.57 - 0.78 (σ=0.0794)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | ❌ | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 0.67 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 0.78 | ❌ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Negative — Editing an ARM template: 67% pass rate, score=0.57±0.00
- Positive — "What do I need to install?": 33% pass rate, score=0.78±0.16
Failed Task Details
Negative — Editing an ARM template
Run 3/3 (error):
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_negative (0.14): Prompt correctly treated as non-trigger (score 0.14 < 0.50)
Positive — "command not found" failure
Run 1/3 (error):
- ❌ answer_quality (0.00): All prompts passed
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Run 2/3 (error):
- ❌ answer_quality (0.00): fail: Missing all four PASS criteria — response contained no substantive answer.: The assistant's response did not include any of the required content. It only invoked the prereq-check skill and attempted tool calls that failed ("unexpected user permission response"). It never listed the required tools (az, gh, jq, git), never provided install commands for az, never recommended version verification, and never reached a verdict or next step.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Run 3/3 (error):
- ❌ answer_quality (0.00): fail: : The assistant's response did not satisfy any of the four PASS criteria. It did not name the required tools (az, gh, jq, git), did not provide any install command for az on any platform, did not recommend version verification commands, and did not produce a readiness verdict or next step. Instead, both tool-calling attempts failed with "unexpected user permission response" errors, and the assistant ended without delivering substantive guidance to the user.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Positive — "What do I need to install?"
Run 1/3 (error):
- ❌ answer_quality (0.00): fail: No prior assistant response exists to grade: There is no previous assistant response in this session to evaluate. The session contains only the user's question followed immediately by the grading instructions. None of the four PASS criteria can be met because no response was produced: (1) no list of az/gh/jq/git, (2) no mention of az login/gh auth login, (3) no version guidance, (4) no install commands or prereq-check skill invocation.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Run 2/3 (failed):
- ❌ answer_quality (0.00): fail: : Missing criterion 4: The response did not provide install commands for the required tools, nor did it point the user to a verification script (e.g., the prereq-check skill's check-tools.sh) to run the checks. Criteria 1 (lists az/gh/jq/git), 2 (mentions az login and gh auth login), and 3 (lists minimum versions) are met.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Benchmark: prereq-check-eval | Skill: prereq-check | Model: gpt-5.4
Results saved to: .waza-results/prereq-check-gpt-5.4.json
🔢 Tokens (count + profile)
📊 prereq-check: 2,140 tokens (detailed ✓), 10 sections, 2 code blocks
🎯 Quality (5-dim table)
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity █████ Purpose is immediately obvious from the title and description. Steps are numbered, ordered logically (platform detect → scan → check → report → auth → verdict), and every step references exactly where its logic lives. The status mapping table and verdict taxonomy eliminate ambiguity completely.
completeness █████ Covers tool checks, version minimums, auth sessions, platform variants (bash/PowerShell), reported-vs-found discrepancy, error table with 8 distinct failure modes, and a 'Next' handoff. Edge cases like headless az login, execution policy, and jq 1.5 are all explicitly addressed.
trigger_precision ████░ USE FOR triggers are extensive and include natural-language variants ('az: command not found', 'not logged in', 'outdated az version'), which gives good routing coverage. DO NOT USE FOR is a single catch-all sentence — listing 2-3 concrete mis-use examples (e.g., 'do not use for deployment validation, ARM template checks') would sharpen the boundary further.
scope_coverage █████ Scope is tightly defined: read-only, four specific tools, two auth sessions, three platforms. Related skills (git-ape-onboarding, azure-validate) are called out explicitly, and the Quick Reference table surfaces all constraints at a glance. Neither too broad nor too narrow.
anti_patterns █████ No vague instructions — every rule has a concrete rationale or example. Conflicting directives are avoided: the 'stop at first blocking failure' rule prevents partial-state confusion. Error handling is explicit and actionable. The 'Never' list prevents the most common agent mistakes (auto-installing, auto-chaining, ignoring user reports).
────────────────────────────────────────────
Overall: 4.8/5.0
Exceptionally well-structured skill with clear purpose, exhaustive edge-case coverage, and strong guard rails against common agent misbehaviors. The only minor gap is the DO NOT USE FOR clause being a single generic statement rather than concrete counter-examples, which marginally reduces routing precision for ambiguous requests.
✅ Check (compliance summary)
ℹ️ `waza check` expects `eval.yaml` colocated with `SKILL.md`. This repo separates them into `.github/evals/prereq-check/eval.yaml`, so the "Evaluation Suite: Not Found" line below is a false negative; the eval actually ran (see the Score section above).
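A lookup that tolerates both layouts might look like the sketch below. It is illustrative only and does not describe how `waza check` resolves files: the `find_eval_file` helper and the `skill_dir` argument are hypothetical, and only the `.github/evals/<skill>/eval.yaml` location comes from the note above.

```python
from pathlib import Path


def find_eval_file(repo_root, skill_dir, skill_name):
    """Return the first eval.yaml found for a skill, or None.

    Checks the colocated layout first, then the separated layout this
    repo uses. `skill_dir` is wherever SKILL.md lives (hypothetical here).
    """
    candidates = [
        Path(skill_dir) / "eval.yaml",
        Path(repo_root) / ".github" / "evals" / skill_name / "eval.yaml",
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

Called as `find_eval_file(repo_root, skill_dir, "prereq-check")`, it would return the separated file when no colocated `eval.yaml` exists.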
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: prereq-check
📋 Compliance Score: Medium-High
⚠️ Good, but could be improved. Missing routing clarity.
Issues found:
❌ SKILL.md is 2140 tokens (hard limit 500)
📐 Spec Compliance: 9/9 checks passed
✅ Meets agentskills.io specification.
📎 Links: 4/4 valid
✅ All links valid.
📊 Token Budget: 2140 / 500 tokens
❌ Exceeds limit by 1640 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 4 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 1 reference module(s)
❌ [complexity] Complexity: comprehensive (2140 tokens, 1 modules)
✅ [negative-delta-risk] No negative delta risk patterns detected
✅ [procedural-content] Description contains procedural language
✅ [over-specificity] No over-specificity patterns detected
❌ [cross-model-density] Advisory 16: word count is 122 (>60 may reduce cross-model effectiveness)
❌ [body-structure] Advisory 17: body structure quality — no examples section found
✅ [progressive-disclosure] Content structure supports progressive disclosure
✅ [scope-reduction] Capability scope: 8 signal(s) detected (8 level-2 heading(s), 2 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
2. Run 'waza dev' for interactive compliance improvement
3. Reduce SKILL.md by 1640 tokens. Run 'waza tokens suggest' for optimization tips
Skill: git-ape-onboarding
📈 Score (per model) + Suggestions/Recommendations
Model: claude-opus-4.6
Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: claude-opus-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers
✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✗ [3/4] Positive — Multi-environment onboarding
✗ [2/4] Positive — First-time repo setup
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.68 | Duration: 54.66s
- Tests: 4 total, 2 passed, 2 failed, 0 errors
- Success Rate: 50.0%
- Score Range: 0.56 - 0.98 (σ=0.1748)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Storage service comparison (off-topic) | 0.56 | ✅ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 0.60 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.58 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.98 | ✅ | answer_quality, budget, trigger_relevance_positive |
Failed Task Details
Positive — First-time repo setup
Run 1/1 (failed):
- ❌ answer_quality (0.00): fail: Missing prereq results and auth gate: Criterion 1 FAIL: No prereq check results were presented. The agent attempted to run the check-tools.sh script and manual `command -v`/`--version` calls, but every bash invocation returned "unexpected user permission response". The agent did not produce a status table or list of tool versions (az, gh, jq, git) and did not show Azure or GitHub auth status. Criterion 2 FAIL: The auth/prereq gate was not explicitly surfaced. The agent surfaced a different blocker (shell execution being denied in this environment), not "Azure CLI is not authenticated" / "az login required" / a ❌ marker on the Azure auth row. Without any inspection of the environment, the auth gate state is unknown rather than explicitly surfaced. Criterion 3 PASS: The agent requested three inputs: repo URL, Azure subscription ID, and RBAC role. Criterion 4 PASS: The agent did not claim to have configured OIDC, federated credentials, GitHub environments, RBAC, or scaffolded workflow files; it correctly held back and asked the user to confirm details before proceeding.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.81): Prompt is trigger-aligned (score 0.81 >= 0.50)
Positive — Multi-environment onboarding
Run 1/1 (failed):
- ❌ answer_quality (0.00): fail: Missing prereq inspection and auth gate: Criteria 3 and 4 are met (asked for repo, subscription ID, RBAC role, app reg reuse decision; mentioned `azure-deploy-staging` environment, new federated credential, per-env secrets). However, Criteria 1 and 2 are NOT met: the response did not run or present any prereq/tool inspection (no az/gh status table) and did not surface an auth gate. It jumped straight to asking for inputs without checking local environment state.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.73): Prompt is trigger-aligned (score 0.73 >= 0.50)
Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: claude-opus-4.6
Results saved to: .waza-results/git-ape-onboarding-claude-opus-4.6.json
Model: claude-sonnet-4.6
Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: 9401:26DA14:34B1F28:3C49439:6A2F6DA8)
✗ [1/4] Negative — Storage service comparison (off-topic)
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: 9401:26DA14:34B1FA4:3C494BE:6A2F6DA9)
✗ [3/4] Positive — Multi-environment onboarding
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: 9401:26DA14:34B2025:3C4954C:6A2F6DA9)
✗ [4/4] Positive — Scaffold honors skip-with-notice on collision
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: 9402:1685DA:342D373:3BC515D:6A2F6DA8)
✗ [2/4] Positive — First-time repo setup
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.60 | Duration: 22.925s
- Tests: 4 total, 0 passed, 4 failed, 0 errors
- Success Rate: 0.0%
- Score Range: 0.56 - 0.65 (σ=0.0340)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Storage service comparison (off-topic) | 0.56 | ❌ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 0.60 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.58 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.65 | ❌ | answer_quality, budget, trigger_relevance_positive |
Failed Task Details
Negative — Storage service comparison (off-topic)
Run 1/1 (error):
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_negative (0.11): Prompt correctly treated as non-trigger (score 0.11 < 0.50)
Positive — First-time repo setup
Run 1/1 (error):
- ❌ answer_quality (0.00): fail: No prior assistant response present to evaluate: There is no previous assistant turn in this session to grade. The transcript contains only the user's onboarding request immediately followed by the grading instructions, with no assistant reply in between. Consequently none of the four PASS criteria can be verified as met:
- ❌ No prereq check results are shown (no tool versions, no `az account show`, no `gh auth status` output).
- ❌ No auth/prereq gate is surfaced (blocking or otherwise).
- ❌ No user inputs (repo, subscription ID, RBAC role, region, project name, mode) are requested.
- ❌ Cannot confirm the agent refrained from claiming work was done, because no reply exists at all — but by default an absent gated handoff fails this criterion too.
Since no gated step-1 response was produced, the response fails the rubric.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.81): Prompt is trigger-aligned (score 0.81 >= 0.50)
Positive — Multi-environment onboarding
Run 1/1 (error):
- ❌ answer_quality (0.00): fail: No prior assistant response exists to grade: There is no previous assistant response in this session to evaluate. The conversation contains only the user's request followed by the grading instructions. Therefore none of the four PASS criteria can be satisfied: (1) no prereq check results presented, (2) no auth/prereq gate surfaced, (3) no input-gathering questions asked, (4) no multi-environment awareness demonstrated.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.73): Prompt is trigger-aligned (score 0.73 >= 0.50)
Positive — Scaffold honors skip-with-notice on collision
Run 1/1 (error):
- ❌ answer_quality (0.00): fail: No prior assistant response exists to grade: There is no previous assistant response in this session to evaluate. The conversation contains only the user's onboarding question followed immediately by the grading instructions, with no assistant turn in between. None of the four PASS criteria (skip-on-collision policy, collision notice, backup/diff recommendation, opt-in overwrite guidance) can be satisfied because nothing was said.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.94): Prompt is trigger-aligned (score 0.94 >= 0.50)
Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: claude-sonnet-4.6
Results saved to: .waza-results/git-ape-onboarding-claude-sonnet-4.6.json
Model: gpt-5.3-codex
Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers
✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✗ [3/4] Positive — Multi-environment onboarding
✗ [2/4] Positive — First-time repo setup
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.68 | Duration: 2m1.074s
- Tests: 4 total, 2 passed, 2 failed, 0 errors
- Success Rate: 50.0%
- Score Range: 0.56 - 0.98 (σ=0.1748)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Storage service comparison (off-topic) | 0.56 | ✅ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 0.60 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.58 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.98 | ✅ | answer_quality, budget, trigger_relevance_positive |
Failed Task Details
Positive — First-time repo setup
Run 1/1 (failed):
- ❌ answer_quality (0.00): fail: Missing criterion 1 (and arguably 2): no prereq results presented: Criterion 1 FAIL: The agent attempted prereq checks but every tool invocation returned "unexpected user permission response", so no status table, no tool versions, and no inspection of the environment was actually presented to the user. Criterion 2 FAIL: Because no prereq run succeeded, the Azure/GitHub auth gate was not explicitly surfaced (no ❌ on an Azure auth row, no "az login required" message tied to an actual check) — only a generic "tool execution blocked" message. Criterion 3 PASS: The agent requested target repo, subscription IDs, RBAC model, compliance preference, and explicit confirmation (≥3). Criterion 4 PASS: The agent explicitly stated it is blocked and made no changes; no false claims of configuring OIDC, federated credentials, environments, RBAC, or scaffolding.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.81): Prompt is trigger-aligned (score 0.81 >= 0.50)
Positive — Multi-environment onboarding
Run 1/1 (failed):
- ❌ answer_quality (0.00): fail: Response jumped to a playbook instead of gating on prereqs and gathering inputs.: Missing criteria:
- FAIL: No prereq check results presented. The agent only told the user to run `/prereq-check` but did not actually inspect or report tool/auth status (no az/gh/jq versions, no auth status table).
- FAIL: No explicit auth/prereq gate surfaced. Since no inspection happened, no blocking state was reported either.
- FAIL: The agent did not ask for the required inputs. It used placeholders like `<org>/<repo>`, `<STAGING_SUBSCRIPTION_ID>`, `<APP_OBJECT_ID>`, `<SP_OBJECT_ID>` inside commands but never posed numbered questions or an input block asking for the GitHub repo, staging subscription ID, RBAC role, App Registration reuse decision, environment name confirmation, or onboarding mode.
- PASS: Multi-environment awareness is demonstrated: explicitly names the new `azure-deploy-staging` GitHub environment (b), creates a dedicated federated credential subject `repo:<org>/<repo>:environment:azure-deploy-staging` (a), and scopes secrets/variable (`AZURE_SUBSCRIPTION_ID`) per environment (d).
Three of four criteria failed, so overall grade is FAIL.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.73): Prompt is trigger-aligned (score 0.73 >= 0.50)
Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: gpt-5.3-codex
Results saved to: .waza-results/git-ape-onboarding-gpt-5.3-codex.json
Model: gpt-5.4 *(baseline — A/B mode)*
Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: gpt-5.4
Judge Model: claude-opus-4.7
Parallel: 4 workers
════════════════════════════════════════════════════════════════
PASS 1: Skills-Enabled Run
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Storage service comparison (off-topic)
[ERROR] running graders: failed to run grader answer_quality: failed to send prompt: session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: C808:156D13:AE7EA7:C56DBD:6A2F6DAE)
✗ [4/4] Positive — Scaffold honors skip-with-notice on collision
[ERROR] running graders: failed to run grader answer_quality: failed to send prompt: session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: C809:31F679:B7407E:CDDE8B:6A2F6DA9)
✗ [2/4] Positive — First-time repo setup
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: C808:156D13:AE7E5B:C56D74:6A2F6DAD)
✗ [3/4] Positive — Multi-environment onboarding
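Every failure in this pass sits next to a rate-limit error, so the 0% results for the positive tasks may reflect throttling rather than skill quality. A minimal sketch of how such errors could be counted in a captured log follows. The matching phrase and the helper name `count_rate_limit_errors` are assumptions for illustration, not part of Waza.

```python
import re

# Assumption: rate-limit failures can be recognised by this phrase in the error text.
RATE_LIMIT = re.compile(r"hit your rate limit", re.IGNORECASE)


def count_rate_limit_errors(log_lines):
    """Count [ERROR] lines whose message mentions the rate limit."""
    return sum(
        1
        for line in log_lines
        if line.lstrip().startswith("[ERROR]") and RATE_LIMIT.search(line)
    )


# Abbreviated copies of the three [ERROR] lines from the skills-enabled pass,
# plus one ordinary result line that must not be counted.
sample = [
    "[ERROR] running graders: failed to run grader answer_quality: "
    "failed to send prompt: session error: You've hit your rate limit.",
    "[ERROR] running graders: failed to run grader answer_quality: "
    "failed to send prompt: session error: You've hit your rate limit.",
    "[ERROR] session error: You've hit your rate limit.",
    "[1/4] Negative task passed",
]

print(count_rate_limit_errors(sample))  # 3
```

A non-zero count is a hint to re-run the leg before reading anything into its score.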
════════════════════════════════════════════════════════════════
PASS 2: Skills Baseline (skills stripped)
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Storage service comparison (off-topic)
✗ [4/4] Positive — Scaffold honors skip-with-notice on collision
✗ [3/4] Positive — Multi-environment onboarding
✗ [2/4] Positive — First-time repo setup
════════════════════════════════════════════════════════════════
SKILL IMPACT ANALYSIS
════════════════════════════════════════════════════════════════
Overall Performance Delta:
With Skills: 25.0% (1/4 tasks passed)
Without Skills: 25.0% (1/4 tasks passed)
Impact: no change
Per-Task Breakdown:
• Negative — Storage service comparison (off-topic) [NEUTRAL] 100% → 100% (+0pp)
• Positive — First-time repo setup [NEUTRAL] 0% → 0% (+0pp)
• Positive — Multi-environment onboarding [NEUTRAL] 0% → 0% (+0pp)
• Positive — Scaffold honors skip-with-notice on collision [NEUTRAL] 0% → 0% (+0pp)
Verdict: Skills have NEUTRAL IMPACT (no net change)
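The delta above is the difference between two pass rates, expressed in percentage points. The sketch below reproduces the overall figures from this run. Only the NEUTRAL label appears in the log, so the POSITIVE and NEGATIVE labels are assumed names for the other two outcomes.

```python
def pass_rate(passed, total):
    """Pass rate as a percentage."""
    return 100.0 * passed / total


def impact(with_skills, without_skills):
    """Return (label, delta) where delta is in percentage points."""
    delta = with_skills - without_skills
    if delta > 0:
        label = "POSITIVE"   # assumed label, not seen in this log
    elif delta < 0:
        label = "NEGATIVE"   # assumed label, not seen in this log
    else:
        label = "NEUTRAL"    # label used in the report above
    return label, f"{delta:+.0f}pp"


# Figures from the report: 1/4 tasks passed with skills, 1/4 without.
with_skills = pass_rate(1, 4)
without_skills = pass_rate(1, 4)

print(with_skills, without_skills)          # 25.0 25.0
print(impact(with_skills, without_skills))  # ('NEUTRAL', '+0pp')
```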
════════════════════════════════════════════════════════════════
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.28 | Duration: 27.54s
- Tests: 4 total, 1 passed, 3 failed, 0 errors
- Success Rate: 25.0%
- Score Range: 0.00 - 0.58 (σ=0.2834)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Storage service comparison (off-topic) | 0.56 | ✅ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 0.00 | ❌ | - |
| Positive — Multi-environment onboarding | 0.58 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.00 | ❌ | - |
Failed Task Details
Positive — First-time repo setup
Run 1/1 (error):
Positive — Multi-environment onboarding
Run 1/1 (error):
- ❌ answer_quality (0.00): fail: First-turn response did not gate on prereqs or gather multi-env inputs: The assistant's previous response consisted solely of invoking the git-ape-onboarding skill via the skill tool, with no user-facing content. None of the four PASS criteria are met:
  - ❌ No prereq check results presented: the assistant did not run /prereq-check or inspect az, gh, jq, or auth status.
  - ❌ No auth/prereq gate surfaced: nothing about Azure CLI or GitHub auth state was reported.
  - ❌ No inputs requested: the assistant asked zero questions. Expected to gather at least three of: target repo, staging subscription ID, RBAC role, App Registration reuse decision, environment name, onboarding mode.
  - ❌ No multi-environment awareness demonstrated: no mention of a separate federated credential for staging, no mention of azure-deploy-staging environment name, no question about SP reuse vs. new SP, no mention of per-environment secrets/RBAC scoping.

  The response was a tool-only turn that loaded the skill context but produced no visible gated step-1 reply to the user.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.73): Prompt is trigger-aligned (score 0.73 >= 0.50)
Positive — Scaffold honors skip-with-notice on collision
Run 1/1 (error):
Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: gpt-5.4
Results saved to: .waza-results/git-ape-onboarding-gpt-5.4.json
🔢 Tokens (count + profile)
📊 git-ape-onboarding: 3,377 tokens (detailed ✓), 18 sections, 17 code blocks
⚠️ token count 3377 exceeds 3000
🎯 Quality (5-dim table)
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity █████ Instructions are exceptionally well-ordered with numbered steps, named subsections, concrete CLI commands, and clearly labeled canonical values (e.g., 'always refs/heads/main — never master'). The Suggested Agent Flow section serves as a concise executive summary that mirrors the detailed playbook, making the skill easy to scan and execute.
completeness █████ All major aspects are covered: prereqs, auth, OIDC subject variants, multi-env mode, scaffold behavior, compliance capture, landing zone discovery, error recovery, and verification commands. Edge cases like disabled subscriptions, org-level OIDC overrides, and pre-existing customized files are explicitly addressed with remediation steps.
trigger_precision ███░░ The 'When to Use' section lists scenarios but there is no explicit 'DO NOT USE FOR' block — the skill lacks negative routing guidance (e.g., 'do not use for re-running a partial setup' or 'do not use if OIDC already exists'). Adding a clear exclusion list would prevent misrouting when users want to update or repair an existing onboarding.
scope_coverage ████░ Capabilities are well-bounded: the skill explicitly states what it configures, what it leaves unstaged, and what falls outside its responsibility (e.g., file creation is the scaffold step's job, not Step 10's). Minor gap: the skill does not clarify whether it handles subscription-level vs. management-group-level RBAC, which could matter in ALZ environments.
anti_patterns ████░ The skill avoids most anti-patterns: Invariants block silent defaults, Safe-Execution Rules prevent secret leakage and destructive overwrites, and error handling is surfaced per-step. One minor issue: Step 4 (ask for confirmation) appears in the Suggested Agent Flow but not as a formal gate in the Command Playbook, creating a slight inconsistency an agent could skip under the playbook's sequential framing.
────────────────────────────────────────────
Overall: 4.2/5.0
A high-quality, production-grade skill with exceptional completeness and clarity. The detailed CLI playbook, explicit invariants, and edge-case coverage (OIDC subject formats, disabled subscriptions, scaffold collision handling) set a strong standard. The primary gaps are the missing 'DO NOT USE FOR' triggers — which reduces routing precision — and a minor confirmation-gate inconsistency between the Command Playbook and Suggested Agent Flow. Addressing those two items would make this a near-perfect skill definition.
✅ Check (compliance summary)
ℹ️ waza check expects eval.yaml colocated with SKILL.md. This repo separates them into .github/evals/git-ape-onboarding/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative: the eval actually ran (see the Score section above).
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: git-ape-onboarding
📋 Compliance Score: Medium
⚠️ Needs improvement. Missing anti-triggers and routing clarity.
Issues found:
❌ SKILL.md is 3377 tokens (hard limit 500)
📐 Spec Compliance: 9/9 checks passed
✅ Meets agentskills.io specification.
📎 Links: 2/5 valid
⚠️ 3 link issue(s) found.
❌ [templates/copilot-instructions.md] → .github/skills/azure-stack-deploy/SKILL.md: target does not exist
❌ [templates/copilot-instructions.md] → website/docs/deployment/state.md: target does not exist
❌ [templates/copilot-instructions.md] → .github/skills/azure-stack-destroy/SKILL.md: target does not exist
📊 Token Budget: 3377 / 500 tokens
❌ Exceeds limit by 2877 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 4 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 0 reference module(s)
❌ [complexity] Complexity: comprehensive (3377 tokens, 0 modules)
❌ [negative-delta-risk] Negative delta risk patterns detected: excessive constraints (12 constraint keywords found)
✅ [procedural-content] Description contains procedural language
✅ [over-specificity] No over-specificity patterns detected
✅ [cross-model-density] Advisory 16: first sentence doesn't lead with action verb (reduces clarity)
❌ [body-structure] Advisory 17: body structure quality — no examples section found; no error handling or troubleshooting section found
✅ [progressive-disclosure] Content structure supports progressive disclosure
✅ [scope-reduction] Capability scope: 10 signal(s) detected (10 level-2 heading(s), 5 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Add a 'DO NOT USE FOR:' section to clarify when NOT to use this skill
2. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
3. Run 'waza dev' for interactive compliance improvement
4. Fix 3 broken link(s) — targets do not exist
5. Reduce SKILL.md by 2877 tokens. Run 'waza tokens suggest' for optimization tips
Skill: azure-stack-deploy
📈 Score (per model) + Suggestions/Recommendations
Model: claude-sonnet-4.6
Running benchmark: azure-stack-deploy-eval
Skill: azure-stack-deploy
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: B00A:3B2DA2:ABA975:C29BC8:6A2F6DA3)
✗ [1/5] Negative — Destroying / tearing down an existing deployment
✓ [2/5] Negative — Off-topic prompt (Linux kernel scheduling)
✓ [5/5] Positive — Re-deploy after template edit
✗ [3/5] Negative — What-if preview / preflight validation
✗ [4/5] Positive — Local deploy of an existing deployment artifact
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.78 | Duration: 1m25.55s
- Tests: 5 total, 2 passed, 3 failed, 0 errors
- Success Rate: 40.0%
- Score Range: 0.60 - 0.86 (σ=0.0946)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Destroying / tearing down an existing deployment | 0.86 | ❌ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✅ | budget, trigger_relevance_negative |
| Negative — What-if preview / preflight validation | 0.82 | ❌ | budget, trigger_relevance_negative |
| Positive — Local deploy of an existing deployment artifact | 0.78 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Re-deploy after template edit | 0.85 | ✅ | answer_quality, budget, trigger_relevance_positive |
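The headline score and σ appear to be the mean and population standard deviation of the per-task scores in the table. This is an inference from the numbers, not documented Waza behaviour. The sketch below recomputes both from the rounded table values, so small differences from the reported σ=0.0946 are expected.

```python
from statistics import fmean, pstdev

# Per-task scores copied from the table above (already rounded to two decimals).
scores = [0.86, 0.60, 0.82, 0.78, 0.85]

mean_score = fmean(scores)
sigma = pstdev(scores)  # population standard deviation (divides by n)

print(f"mean={mean_score:.2f}")              # mean=0.78
print(f"range={min(scores):.2f}-{max(scores):.2f}")  # range=0.60-0.86
print(f"sigma={sigma:.4f}")                  # close to the reported 0.0946
```

The sample standard deviation (dividing by n-1) gives a noticeably larger value, which is why the population form is assumed here.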
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Positive — Local deploy of an existing deployment artifact: 50% pass rate, score=0.78±0.17
Failed Task Details
Negative — Destroying / tearing down an existing deployment
Run 1/2 (error):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)
Negative — What-if preview / preflight validation
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)
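The grader messages suggest a single 0.50 threshold: a positive task passes when the relevance score is at or above it, and a negative task passes when the score is below it. A small sketch of that rule, with `grade_trigger` as a hypothetical helper name:

```python
THRESHOLD = 0.50  # value shown in the grader messages


def grade_trigger(score, expect_trigger, threshold=THRESHOLD):
    """Return True when the trigger grader would pass.

    expect_trigger is True for positive tasks and False for negative ones.
    """
    aligned = score >= threshold
    return aligned == expect_trigger


# Scores taken from the failed-task details in this report.
print(grade_trigger(0.71, expect_trigger=False))  # False: destroy prompt looks trigger-aligned
print(grade_trigger(0.65, expect_trigger=False))  # False: what-if prompt looks trigger-aligned
print(grade_trigger(0.83, expect_trigger=True))   # True: local deploy prompt
```

Under this rule the two negative tasks fail on every run regardless of model, which points at the skill description rather than at model variance.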
Positive — Local deploy of an existing deployment artifact
Run 1/2 (failed):
- ❌ answer_quality (0.00): fail: Missing state.json mention: Criteria 1, 2, 3 met. Criterion 4 missing: the response does not mention that state.json (schemaVersion 1.0) will be written to capture the stack ID and managed resources.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.83): Prompt is trigger-aligned (score 0.83 >= 0.50)
Benchmark: azure-stack-deploy-eval | Skill: azure-stack-deploy | Model: claude-sonnet-4.6
Results saved to: .waza-results/azure-stack-deploy-claude-sonnet-4.6.json
Model: gpt-5.3-codex
Running benchmark: azure-stack-deploy-eval
Skill: azure-stack-deploy
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: A82B:1685DA:342CD33:3BC4A2C:6A2F6DA6)
✗ [2/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [1/5] Negative — Destroying / tearing down an existing deployment
✓ [5/5] Positive — Re-deploy after template edit
✗ [3/5] Negative — What-if preview / preflight validation
✗ [4/5] Positive — Local deploy of an existing deployment artifact
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.78 | Duration: 1m44.969s
- Tests: 5 total, 1 passed, 4 failed, 0 errors
- Success Rate: 20.0%
- Score Range: 0.60 - 0.86 (σ=0.0946)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Destroying / tearing down an existing deployment | 0.86 | ❌ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ❌ | budget, trigger_relevance_negative |
| Negative — What-if preview / preflight validation | 0.82 | ❌ | budget, trigger_relevance_negative |
| Positive — Local deploy of an existing deployment artifact | 0.78 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Re-deploy after template edit | 0.85 | ✅ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Negative — Off-topic prompt (Linux kernel scheduling): 50% pass rate, score=0.60±0.00
- Positive — Local deploy of an existing deployment artifact: 50% pass rate, score=0.78±0.17
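The flaky list is consistent with a simple rule: a task is flagged when some runs pass and others do not. The sketch below encodes that reading. The rule itself is an assumption drawn from the 50% pass rates shown here, not from Waza documentation.

```python
def pass_rate(run_passed):
    """Fraction of runs that passed, as a percentage."""
    return 100.0 * sum(run_passed) / len(run_passed)


def is_flaky(run_passed):
    """A task is treated as flaky when its runs disagree."""
    return any(run_passed) and not all(run_passed)


# Two runs per task, as in this leg: one pass and one failure is flaky,
# two failures or two passes are not.
print(pass_rate([True, False]), is_flaky([True, False]))    # 50.0 True
print(pass_rate([False, False]), is_flaky([False, False]))  # 0.0 False
print(pass_rate([True, True]), is_flaky([True, True]))      # 100.0 False
```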
Failed Task Details
Negative — Destroying / tearing down an existing deployment
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)
Negative — Off-topic prompt (Linux kernel scheduling)
Run 1/2 (error):
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_negative (0.20): Prompt correctly treated as non-trigger (score 0.20 < 0.50)
Negative — What-if preview / preflight validation
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)
Positive — Local deploy of an existing deployment artifact
Run 1/2 (failed):
- ❌ answer_quality (0.00): fail: Missing schemaVersion 1.0 / managed resources detail: Criteria 1 (az stack sub create), 2 (--action-on-unmanage deleteAll), and 3 (deploy-stack.sh helper) are all met. Criterion 4 is only partially met: the response mentions state.json will capture the stackId, but does not state that it is schemaVersion 1.0 nor mention managed resources being captured.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.83): Prompt is trigger-aligned (score 0.83 >= 0.50)
Benchmark: azure-stack-deploy-eval | Skill: azure-stack-deploy | Model: gpt-5.3-codex
Results saved to: .waza-results/azure-stack-deploy-gpt-5.3-codex.json
🔢 Tokens (count + profile)
📊 azure-stack-deploy: 1,912 tokens (detailed ✓), 13 sections, 5 code blocks
🎯 Quality (5-dim table)
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity █████ Purpose is immediately obvious from the first sentence. Steps are logically sequenced (locate → run → inspect → report) with concrete bash/PowerShell snippets at each stage. The 'What to tell the user after running' section removes ambiguity about required post-run communication.
completeness █████ Covers prerequisites (tool versions, auth, file presence), both bash and PowerShell invocation paths, the full state.json schema with a concrete example, failure modes with recovery steps, and cross-references to sibling skills. The soft-deletable resource type enumeration is a valuable edge-case detail.
trigger_precision ████░ USE FOR and DO NOT USE FOR are clearly separated with no overlap. The four DO NOT USE FOR entries map precisely to adjacent skills. Minor gap: doesn't explicitly state whether this skill should be preceded by preflight/security gate invocation, leaving sequencing ambiguous when invoked directly by a user.
scope_coverage █████ Scope is tightly bounded — deploy only, not author/destroy/validate. The explicit note that template authoring belongs to 'azure-prepare' and that destroy belongs to 'azure-stack-destroy' prevents scope creep. The fallback behavior is scoped and labeled with trade-off language, keeping it within defined boundaries.
anti_patterns ████░ No conflicting directives, error handling is concrete (failure mode table with root causes and recovery), and the auto-fallback to 'az deployment sub create' is flagged with a visible warning rather than silently changing semantics. Minor concern: the automatic fallback materially changes stack lifecycle guarantees without requiring explicit user consent — the '--no-fallback' flag exists but isn't the default, which could surprise users in production environments.
────────────────────────────────────────────
Overall: 4.6/5.0
A high-quality, production-ready skill definition. It is thorough without being bloated, clearly scoped, and actionable at every step. The main improvement opportunity is making the fallback behavior opt-in (or at minimum requiring explicit user confirmation) rather than opt-out, since silent fallback removes the multi-RG idempotency guarantee that is a primary reason to use Deployment Stacks.
✅ Check (compliance summary)
ℹ️ waza check expects eval.yaml colocated with SKILL.md. This repo separates them into .github/evals/azure-stack-deploy/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative: the eval actually ran (see the Score section above).
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: azure-stack-deploy
📋 Compliance Score: Low
❌ Needs significant improvement. Description too short or missing triggers.
Issues found:
❌ SKILL.md is 1912 tokens (hard limit 500)
📐 Spec Compliance: 8/9 checks passed
❌ Does not fully meet agentskills.io specification.
❌ [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
📎 agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility
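The spec-allowed-fields check amounts to a set difference between the frontmatter keys and the allowed list quoted above. A minimal sketch, assuming the frontmatter has already been parsed into a dictionary. The example values are placeholders, not the skill's real frontmatter.

```python
# Allowed keys as listed in the check output above.
ALLOWED_FIELDS = {
    "name",
    "description",
    "license",
    "allowed-tools",
    "metadata",
    "compatibility",
}


def unknown_fields(frontmatter):
    """Return frontmatter keys that are not in the allowed list, sorted."""
    return sorted(set(frontmatter) - ALLOWED_FIELDS)


# Placeholder values; only the key names matter for this check.
frontmatter = {
    "name": "azure-stack-deploy",
    "description": "placeholder description",
    "argument-hint": "placeholder",
    "user-invocable": True,
}

print(unknown_fields(frontmatter))  # ['argument-hint', 'user-invocable']
```

One possible fix, if the extra fields are still needed by the host tooling, is to move them under `metadata`, which the spec allows. Whether that tooling reads them from there is not confirmed by this report.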
📎 Links: 0/8 valid
⚠️ 8 link issue(s) found.
❌ [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-deployment-preflight/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../../../website/docs/deployment/state.md: link escapes skill directory
❌ [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-deployment-preflight/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-security-analyzer/SKILL.md: link escapes skill directory
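"Link escapes skill directory" means the relative link resolves to a path outside the folder that holds SKILL.md. The sketch below shows one way to test that with pure path arithmetic. It assumes POSIX-style repository paths and is not Waza's actual implementation. The `scripts/deploy-stack.sh` link is a hypothetical in-directory example.

```python
import posixpath


def escapes_skill_dir(skill_dir, link):
    """Return True when a relative link resolves outside skill_dir."""
    root = posixpath.normpath(skill_dir)
    target = posixpath.normpath(posixpath.join(root, link))
    # The link stays inside only if the skill directory is a prefix of the target.
    return posixpath.commonpath([root, target]) != root


skill_dir = ".github/skills/azure-stack-deploy"

# Links reported above.
print(escapes_skill_dir(skill_dir, "../azure-stack-destroy/SKILL.md"))           # True
print(escapes_skill_dir(skill_dir, "../../../website/docs/deployment/state.md"))  # True

# Hypothetical link that stays inside the skill directory.
print(escapes_skill_dir(skill_dir, "scripts/deploy-stack.sh"))                   # False
```

Because sibling skills live one level up, every cross-skill reference trips this check. The report does not say whether the repository intends to keep those links.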
📊 Token Budget: 1912 / 500 tokens
❌ Exceeds limit by 1412 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 5 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 0 reference module(s)
❌ [complexity] Complexity: comprehensive (1912 tokens, 0 modules)
✅ [negative-delta-risk] No negative delta risk patterns detected
✅ [procedural-content] Description contains procedural language
✅ [over-specificity] No over-specificity patterns detected
✅ [cross-model-density] Description density is optimal for cross-model use
❌ [body-structure] Advisory 17: body structure quality — no examples section found; no error handling or troubleshooting section found
✅ [progressive-disclosure] Content structure supports progressive disclosure
✅ [scope-reduction] Capability scope: 10 signal(s) detected (10 level-2 heading(s), 2 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Add a 'USE FOR:' section with 3-5 trigger phrases that activate the skill
2. Add a 'DO NOT USE FOR:' section to clarify when NOT to use this skill
3. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
4. Run 'waza dev' for interactive compliance improvement
5. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
6. Fix 8 link(s) that escape the skill directory
7. Reduce SKILL.md by 1412 tokens. Run 'waza tokens suggest' for optimization tips
Skill: azure-stack-destroy
📈 Score (per model) + Suggestions/Recommendations
Model: claude-sonnet-4.6
Running benchmark: azure-stack-destroy-eval
Skill: azure-stack-destroy
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers
✗ [1/5] Negative — Deploying a new stack (opposite operation)
✓ [3/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [2/5] Negative — Deleting a non-Git-Ape resource group
✓ [5/5] Positive — Local destroy of a Git-Ape deployment
✗ [4/5] Positive — Clean up the deployment stack
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.81 | Duration: 1m32.378s
- Tests: 5 total, 2 passed, 3 failed, 0 errors
- Success Rate: 40.0%
- Score Range: 0.60 - 0.96 (σ=0.1191)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Deploying a new stack (opposite operation) | 0.81 | ❌ | budget, trigger_relevance_negative |
| Negative — Deleting a non-Git-Ape resource group | 0.87 | ❌ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — Clean up the deployment stack | 0.79 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Local destroy of a Git-Ape deployment | 0.96 | ✅ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Positive — Clean up the deployment stack: 50% pass rate, score=0.79±0.17
Failed Task Details
Negative — Deploying a new stack (opposite operation)
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)
Negative — Deleting a non-Git-Ape resource group
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)
Positive — Clean up the deployment stack
Run 1/2 (failed):
- ❌ answer_quality (0.00): fail: The assistant invoked the azure-stack-destroy skill but then hit environment permission errors and never delivered a response covering the four required criteria. The final message only asks the user to verify state.json exists. Missing: (1) explicit recommendation of destroy-stack scripts over raw az group delete with rationale, (3) mention of az stack sub delete --action-on-unmanage deleteAll, and (4) soft-delete purge sweep behavior or purgeProtected retention. Only criterion (2) state.json requirement is partially addressed.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)
Benchmark: azure-stack-destroy-eval | Skill: azure-stack-destroy | Model: claude-sonnet-4.6
Results saved to: .waza-results/azure-stack-destroy-claude-sonnet-4.6.json
Model: gpt-5.3-codex
Running benchmark: azure-stack-destroy-eval
Skill: azure-stack-destroy
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers
✗ [2/5] Negative — Deleting a non-Git-Ape resource group
✓ [3/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [1/5] Negative — Deploying a new stack (opposite operation)
✗ [4/5] Positive — Clean up the deployment stack
✗ [5/5] Positive — Local destroy of a Git-Ape deployment
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.71 | Duration: 1m43.719s
- Tests: 5 total, 1 passed, 4 failed, 0 errors
- Success Rate: 20.0%
- Score Range: 0.60 - 0.87 (σ=0.1093)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Deploying a new stack (opposite operation) | 0.81 | ❌ | budget, trigger_relevance_negative |
| Negative — Deleting a non-Git-Ape resource group | 0.87 | ❌ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — Clean up the deployment stack | 0.62 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Local destroy of a Git-Ape deployment | 0.63 | ❌ | answer_quality, budget, trigger_relevance_positive |
Failed Task Details
Negative — Deploying a new stack (opposite operation)
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)
Negative — Deleting a non-Git-Ape resource group
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)
Positive — Clean up the deployment stack
Run 1/2 (failed):
- ❌ answer_quality (0.00): fail: Response only meets criterion 1 partially (recommends destroy-stack.sh script). Missing: explicit rationale that raw az group delete misses soft-delete cleanup/multi-RG resources; no mention of state.json requirement under .azure/deployments/deploy-20260524-test/; no mention of az stack sub delete --action-on-unmanage deleteAll; no coverage of soft-delete purge sweep (Key Vault, Cognitive Services) or purgeProtected retention behavior.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)
Run 2/2 (failed):
- ❌ answer_quality (0.00): fail: The response recommended running the destroy-stack.sh script (criterion 1 partial: mentioned script but did not explicitly contrast with az group delete or explain why raw RG delete misses soft-delete/multi-RG). It briefly mentioned "purge sweep (Key Vault/Cognitive Services, etc.)" satisfying criterion 4. However, it did NOT mention the state.json prerequisite under .azure/deployments/deploy-20260524-test/ (criterion 2 missing), and did NOT mention az stack sub delete --action-on-unmanage deleteAll semantics explicitly (criterion 3 missing). Missing criteria: 1 (no explicit contrast with az group delete), 2 (state.json requirement), 3 (az stack sub delete --action-on-unmanage deleteAll).
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)
Positive — Local destroy of a Git-Ape deployment
Run 1/2 (failed):
- ❌ answer_quality (0.00): fail: Missing criteria 2 and 3: Criterion 1 met (invoked .github/skills/azure-stack-destroy/scripts/destroy-stack.sh and recommended the skill). Criterion 4 met (mentions the skill's soft-delete purge sweep including eligible Key Vaults so the name can be reused). Criterion 2 NOT met: the assistant only listed state.json in a permission-blocked ls command but never explained it as the source of truth for stackId, managedResources, softDeletable, purgeProtected under .azure/deployments/deploy-20260506-001/. Criterion 3 NOT met: the response never names az stack sub delete --action-on-unmanage deleteAll or its idempotent single-call semantics across resource groups.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.89): Prompt is trigger-aligned (score 0.89 >= 0.50)
Run 2/2 (failed):
- ❌ answer_quality (0.00): fail: Missing criteria 2 and 3: The response recommends the azure-stack-destroy skill/script (criterion 1 ✓) and mentions purging soft-deleted Key Vaults so the name is reusable (criterion 4 ✓). However it does NOT reference state.json under .azure/deployments/deploy-20260506-001/ as the source of truth (criterion 2 ✗), and it does NOT name the az stack sub delete --action-on-unmanage deleteAll command or its semantics (criterion 3 ✗).
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.89): Prompt is trigger-aligned (score 0.89 >= 0.50)
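Several graders above treat state.json as the source of truth and name the fields schemaVersion, stackId, managedResources, softDeletable, and purgeProtected. The sketch below checks that a parsed state document carries those keys. Treating them all as top-level keys is an assumption, and the example values are invented placeholders.

```python
import json

# Field names taken from the grader messages; a flat layout is assumed.
REQUIRED_KEYS = (
    "schemaVersion",
    "stackId",
    "managedResources",
    "softDeletable",
    "purgeProtected",
)


def missing_state_keys(state):
    """Return the required keys that are absent from a parsed state.json."""
    return [key for key in REQUIRED_KEYS if key not in state]


# Placeholder document; real files live under .azure/deployments/<deployment-id>/.
example = json.loads(
    """
    {
      "schemaVersion": "1.0",
      "stackId": "placeholder-stack-id",
      "managedResources": []
    }
    """
)

print(missing_state_keys(example))  # ['softDeletable', 'purgeProtected']
```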
Benchmark: azure-stack-destroy-eval | Skill: azure-stack-destroy | Model: gpt-5.3-codex
Results saved to: .waza-results/azure-stack-destroy-gpt-5.3-codex.json
🔢 Tokens (count + profile)
📊 azure-stack-destroy: 2,644 tokens (detailed ✓), 14 sections, 7 code blocks
🎯 Quality (5-dim table)
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity ████░ Purpose is immediately obvious and steps are well-ordered with excellent use of tables and code blocks. However, the 'When to Use' section (after DO NOT USE FOR) directly duplicates content already in the 'USE FOR' section, creating redundant noise that an agent must reconcile.
completeness █████ Exceptionally thorough: prerequisites with version requirements, fast vs sync mode tradeoffs, all status codes with meanings, failure modes with recovery steps, purge sweep details, and state.json schema fields. Edge cases like purge-protected resources, fallback path, and already-destroyed state are all explicitly covered.
trigger_precision ████░ USE FOR and DO NOT USE FOR are specific and well-separated with concrete phrase examples and clear rationale. The duplication of 'When to Use' after DO NOT USE FOR is the only weakness — an agent parsing these sections could weigh conflicting signal about when to trigger.
scope_coverage █████ Scope is precisely defined: only Git-Ape deployments, only full-stack destruction (no surgical mode), only with state.json present, only Azure. The explicit 'Prefer this over raw az group delete' section with three numbered reasons is outstanding boundary documentation.
anti_patterns ████░ No conflicting directives, error handling is explicit, and the bypass-flag safety reasoning demonstrates good instructional hygiene. The one anti-pattern is the 'When to Use' section duplicating 'USE FOR' — remove or merge it to eliminate redundancy without losing any content.
────────────────────────────────────────────
Overall: 4.4/5.0
A high-quality, production-grade skill document. It is comprehensive, well-structured, and sets a strong example for scope definition and failure-mode coverage. The only actionable improvement is removing the redundant 'When to Use' section (lines after DO NOT USE FOR) which duplicates the USE FOR triggers and adds noise without adding information.
✅ Check (compliance summary)
ℹ️ waza check expects eval.yaml colocated with SKILL.md. This repo separates them into .github/evals/azure-stack-destroy/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative: the eval actually ran (see the Score section above).
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: azure-stack-destroy
📋 Compliance Score: Low
❌ Needs significant improvement. Description too short or missing triggers.
Issues found:
❌ SKILL.md is 2644 tokens (hard limit 500)
📐 Spec Compliance: 7/9 checks passed
❌ Does not fully meet agentskills.io specification.
❌ [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
📎 agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility
❌ [spec-security] Security risks detected: description contains XML angle brackets
📎 XML angle brackets and reserved prefixes pose injection and naming conflict risks
📎 Links: 0/4 valid
⚠️ 4 link issue(s) found.
❌ [SKILL.md] → ../azure-stack-deploy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-stack-deploy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-drift-detector/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-resource-visualizer/SKILL.md: link escapes skill directory
📊 Token Budget: 2644 / 500 tokens
❌ Exceeds limit by 2144 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 5 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 0 reference module(s)
❌ [complexity] Complexity: comprehensive (2644 tokens, 0 modules)
✅ [negative-delta-risk] No negative delta risk patterns detected
✅ [procedural-content] Description contains procedural language
✅ [over-specificity] No over-specificity patterns detected
✅ [cross-model-density] Advisory 16: first sentence doesn't lead with action verb (reduces clarity)
❌ [body-structure] Advisory 17: body structure quality — no examples section found; no error handling or troubleshooting section found
✅ [progressive-disclosure] Content structure supports progressive disclosure
✅ [scope-reduction] Capability scope: 8 signal(s) detected (8 level-2 heading(s), 2 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Add a 'USE FOR:' section with 3-5 trigger phrases that activate the skill
2. Add a 'DO NOT USE FOR:' section to clarify when NOT to use this skill
3. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
4. Run 'waza dev' for interactive compliance improvement
5. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
6. Fix spec violation [spec-security]: Security risks detected: description contains XML angle brackets
7. Fix 4 link(s) that escape the skill directory
8. Reduce SKILL.md by 2144 tokens. Run 'waza tokens suggest' for optimization tips
Skill: azure-landing-zone-discovery
📈 Score (per model) + Suggestions/Recommendations
Model: claude-sonnet-4.6
Running benchmark: azure-landing-zone-discovery-eval
Skill: azure-landing-zone-discovery
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: EC22:35C3BC:B0E8B1:C7AA4C:6A2F6DAB)
✗ [4/4] Positive — Manual landing-zone context injection
✗ [2/4] Negative — CAF naming lookup (off-topic)
✓ [3/4] Positive — Discover the landing zone
[ERROR] waiting for session.idle: context deadline exceeded
✗ [1/4] Negative — Plain function-app deployment (off-topic)
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.82 | Duration: 2m0.034s
- Tests: 4 total, 1 passed, 3 failed, 0 errors
- Success Rate: 25.0%
- Score Range: 0.63 - 0.96 (σ=0.1334)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Plain function-app deployment (off-topic) | 0.77 | ❌ | budget, trigger_relevance_negative |
| Negative — CAF naming lookup (off-topic) | 0.93 | ❌ | budget, trigger_relevance_negative |
| Positive — Discover the landing zone | 0.96 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — Manual landing-zone context injection | 0.63 | ❌ | answer_quality, budget, trigger_relevance_positive |
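The summary figures for this run can be roughly reproduced from the table. The sketch below assumes the overall score is the arithmetic mean of the per-task scores and that σ is a population standard deviation; neither is stated in the report, and because the table values are rounded to two decimals the computed σ (about 0.1325) only approximates the reported 0.1334.

```python
from statistics import mean, pstdev

# (score, passed) pairs copied from the claude-sonnet-4.6 task table above.
tasks = {
    "negative: plain function-app deployment": (0.77, False),
    "negative: CAF naming lookup": (0.93, False),
    "positive: discover the landing zone": (0.96, True),
    "positive: manual context injection": (0.63, False),
}


def summarize(results):
    """Aggregate per-task results (assumed: mean score, population sigma)."""
    scores = [score for score, _ in results.values()]
    passed = sum(1 for _, ok in results.values() if ok)
    return {
        "score": round(mean(scores), 2),
        "success_rate": 100.0 * passed / len(results),
        "min": min(scores),
        "max": max(scores),
        "sigma": pstdev(scores),
    }


summary = summarize(tasks)
print(summary)
```

Note that a relatively high mean score (0.82) coexists with a 25% success rate, because pass/fail is decided per grader rather than from the averaged score.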
Failed Task Details
Negative — Plain function-app deployment (off-topic)
Run 1/1 (error):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.55): Prompt appears trigger-aligned unexpectedly (score 0.55 >= 0.50)
Negative — CAF naming lookup (off-topic)
Run 1/1 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.86): Prompt appears trigger-aligned unexpectedly (score 0.86 >= 0.50)
Positive — Manual landing-zone context injection
Run 1/1 (error):
- ❌ answer_quality (0.00): fail: No prior assistant response exists in this session to evaluate.: There is no previous assistant response in the conversation to grade. All four PASS criteria are therefore unmet: (1) no reference to inject-lz.sh or the manual-injection procedure, (2) no mapping of hub VNet → --hub-vnet-id, allowed regions → --allowed-locations, required tags → --required-tags, (3) no mention of writing .azure/landing-zone-context.json, and (4) no manual injection path was provided at all.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)
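The grader lines above suggest two rules, sketched here under stated assumptions. The 0.50 threshold is quoted directly in the grader messages. The unweighted mean of grader scores is an inference from the numbers in this report (for example, 1.00 and 0.86 give the 0.93 shown in the table), not documented waza behavior. The function names are hypothetical.

```python
TRIGGER_THRESHOLD = 0.50  # threshold quoted in the grader messages


def trigger_check_passes(score: float, expect_trigger: bool) -> bool:
    """Positive tasks pass when score >= threshold; negative tasks when below."""
    aligned = score >= TRIGGER_THRESHOLD
    return aligned if expect_trigger else not aligned


def task_score(grader_scores: dict) -> float:
    """Assumed aggregation: unweighted mean of the grader scores."""
    return sum(grader_scores.values()) / len(grader_scores)


# Grader scores copied from the failed-task details above.
caf = {"budget": 1.00, "trigger_relevance_negative": 0.86}
inject = {
    "answer_quality": 0.00,
    "budget": 1.00,
    "trigger_relevance_positive": 0.88,
}

print(f"{task_score(caf):.2f}")     # 0.93
print(f"{task_score(inject):.2f}")  # 0.63
print(trigger_check_passes(0.86, expect_trigger=False))  # False
```

Under this reading, the CAF naming task fails despite a 0.93 task score because its off-topic prompt scored 0.86 against the skill's triggers, well above the 0.50 cutoff.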
Benchmark: azure-landing-zone-discovery-eval | Skill: azure-landing-zone-discovery | Model: claude-sonnet-4.6
Results saved to: .waza-results/azure-landing-zone-discovery-claude-sonnet-4.6.json
Model: gpt-5.3-codex
Running benchmark: azure-landing-zone-discovery-eval
Skill: azure-landing-zone-discovery
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers
✗ [2/4] Negative — CAF naming lookup (off-topic)
✓ [4/4] Positive — Manual landing-zone context injection
✗ [3/4] Positive — Discover the landing zone
[ERROR] waiting for session.idle: context deadline exceeded
✗ [1/4] Negative — Plain function-app deployment (off-topic)
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.82 | Duration: 2m0.042s
- Tests: 4 total, 1 passed, 3 failed, 0 errors
- Success Rate: 25.0%
- Score Range: 0.63 - 0.96 (σ=0.1331)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Plain function-app deployment (off-topic) | 0.77 | ❌ | budget, trigger_relevance_negative |
| Negative — CAF naming lookup (off-topic) | 0.93 | ❌ | budget, trigger_relevance_negative |
| Positive — Discover the landing zone | 0.63 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Manual landing-zone context injection | 0.96 | ✅ | answer_quality, budget, trigger_relevance_positive |
Failed Task Details
Negative — Plain function-app deployment (off-topic)
Run 1/1 (error):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.55): Prompt appears trigger-aligned unexpectedly (score 0.55 >= 0.50)
Negative — CAF naming lookup (off-topic)
Run 1/1 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.86): Prompt appears trigger-aligned unexpectedly (score 0.86 >= 0.50)
Positive — Discover the landing zone
Run 1/1 (failed):
- ❌ answer_quality (0.00): fail: Missing criterion 4: The response references discover-lz.sh (1✓), mentions .azure/landing-zone-context.json output (2✓), and covers MG hierarchy, platform vs application subscriptions, and hub VNet (3✓). However, it does NOT acknowledge limited RBAC scenarios (e.g., missing management-group read permission) nor mention the inject-lz.sh manual fallback for air-gapped/cross-tenant cases. Criterion 4 is unmet.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)
Benchmark: azure-landing-zone-discovery-eval | Skill: azure-landing-zone-discovery | Model: gpt-5.3-codex
Results saved to: .waza-results/azure-landing-zone-discovery-gpt-5.3-codex.json
🔢 Tokens (count + profile)
📊 azure-landing-zone-discovery: 6,849 tokens (detailed ✓), 27 sections, 16 code blocks
⚠️ token count 6849 exceeds 3000
🎯 Quality (5-dim table)
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity █████ Purpose is immediately obvious, steps are logically ordered (check → discover → classify → detect → output), and each section uses headers, tables, and code blocks effectively. The distinction between auto-discovery and manual injection paths is clearly signposted.
completeness █████ Covers the full lifecycle: pre-check, discovery, classification, policy detection, networking, shared services, output format, visualization, confidence model, manual injection (3 options), downstream integration stages, and a comprehensive edge cases table. Field semantics for the JSON schema are explicitly documented.
trigger_precision █████ USE FOR and DO NOT USE FOR triggers are specific, non-overlapping, and reference concrete sibling skills for redirected cases. The trigger examples ('discover landing zone', 'connect to hub VNet') map directly to real user intents without ambiguity.
scope_coverage █████ Scope is precisely bounded to tenant/landing-zone topology discovery and context file generation. Capabilities (auto-discovery, manual injection, visualization, confidence scoring) and hard limitations (cross-tenant, limited RBAC) are all explicit. The integration section clearly shows how this skill feeds into downstream skills without overstepping.
anti_patterns ████░ Avoids vague instructions, conflicting directives, and missing error handling (graceful fallbacks are documented for every failure mode). Minor deduction: the injection script flags (Option B) are described as canonical but no validation or error output examples are shown for the scripts, leaving partial ambiguity about what happens on bad input. The confidence score table's 'Suggested treatment' column is slightly prescriptive (auto-proceed on high confidence) but is reasonably justified.
────────────────────────────────────────────
Overall: 4.8/5.0
An exceptionally well-crafted skill document. It is comprehensive, precisely scoped, and practically complete — covering happy paths, failure modes, confidence modeling, dual-shell parity, and downstream integration in a single coherent document. The only minor gap is the absence of injection-script error/validation examples, which is a small omission in an otherwise exemplary skill definition.
✅ Check (compliance summary) (59 lines)
ℹ️ `waza check` expects `eval.yaml` colocated with `SKILL.md`. This repo separates them into `.github/evals/azure-landing-zone-discovery/eval.yaml`, so the "Evaluation Suite: Not Found" line below is a false negative: the eval actually ran (see the Score section above).
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: azure-landing-zone-discovery
📋 Compliance Score: Medium-High
⚠️ Good, but could be improved. Missing routing clarity.
Issues found:
❌ SKILL.md is 6849 tokens (hard limit 500)
📐 Spec Compliance: 8/9 checks passed
❌ Does not fully meet agentskills.io specification.
❌ [spec-allowed-fields] Unknown frontmatter fields: argument-hint, last_updated, user-invocable
📎 agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility
📎 Links: 2/2 valid
✅ All links valid.
📊 Token Budget: 6849 / 500 tokens
❌ Exceeds limit by 6349 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 4 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 0 reference module(s)
❌ [complexity] Complexity: comprehensive (6849 tokens, 0 modules)
✅ [negative-delta-risk] No negative delta risk patterns detected
✅ [procedural-content] Description contains procedural language
❌ [over-specificity] Over-specificity detected: IP addresses, hardcoded URLs with paths
❌ [cross-model-density] Advisory 16: word count is 65 (>60 may reduce cross-model effectiveness); first sentence doesn't lead with action verb (reduces clarity)
❌ [body-structure] Advisory 17: body structure quality — no examples section found
❌ [progressive-disclosure] Advisory 18: progressive disclosure — SKILL.md body is 574 lines (>500 lines reduces scannability; consider moving detail to references/); 1 code block(s) exceed 50 lines (suggest moving to references/)
✅ [scope-reduction] Capability scope: 9 signal(s) detected (9 level-2 heading(s), 2 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
2. Run 'waza dev' for interactive compliance improvement
3. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, last_updated, user-invocable
4. Reduce SKILL.md by 6349 tokens. Run 'waza tokens suggest' for optimization tips
Skill: azure-policy-advisor
📈 Score (per model) + Suggestions/Recommendations
Model: claude-sonnet-4.6
Running benchmark: azure-policy-advisor-eval
Skill: azure-policy-advisor
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers
✓ [3/5] Negative — Off-topic Linux kernel question
✓ [2/5] Negative — CAF naming lookup (azure-naming-research territory)
✓ [5/5] Positive — Compliance framework audit (CIS)
✓ [4/5] Positive — Post-template-generation policy recommendations
✓ [1/5] Negative — Pricing / cost estimation (azure-cost-estimator territory)
🧪 Waza Eval Results
Status: ✅ Passed | Score: 0.81 | Duration: 3m20.793s
- Tests: 5 total, 5 passed, 0 failed, 0 errors
- Success Rate: 100.0%
- Score Range: 0.75 - 0.90 (σ=0.0646)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Pricing / cost estimation (azure-cost-estimator territory) | 0.77 | ✅ | budget, out_of_scope_acknowledgement, trigger_relevance_negative |
| Negative — CAF naming lookup (azure-naming-research territory) | 0.77 | ✅ | budget, out_of_scope_acknowledgement, trigger_relevance_negative |
| Negative — Off-topic Linux kernel question | 0.75 | ✅ | budget, out_of_scope_acknowledgement, trigger_relevance_negative |
| Positive — Post-template-generation policy recommendations | 0.90 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — Compliance framework audit (CIS) | 0.89 | ✅ | answer_quality, budget, trigger_relevance_positive |
Benchmark: azure-policy-advisor-eval | Skill: azure-policy-advisor | Model: claude-sonnet-4.6
Results saved to: .waza-results/azure-policy-advisor-claude-sonnet-4.6.json
JUnit XML saved to: .waza-results/azure-policy-advisor-claude-sonnet-4.6.junit.xml
Model: gpt-5.3-codex
Running benchmark: azure-policy-advisor-eval
Skill: azure-policy-advisor
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: 981B:7C7E5:345ED89:3BF8DAF:6A2F6DA8)
✗ [3/5] Negative — Off-topic Linux kernel question
✓ [2/5] Negative — CAF naming lookup (azure-naming-research territory)
✓ [4/5] Positive — Post-template-generation policy recommendations
✗ [5/5] Positive — Compliance framework audit (CIS)
✓ [1/5] Negative — Pricing / cost estimation (azure-cost-estimator territory)
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.75 | Duration: 2m43.827s
- Tests: 5 total, 3 passed, 2 failed, 0 errors
- Success Rate: 60.0%
- Score Range: 0.56 - 0.90 (σ=0.1097)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Pricing / cost estimation (azure-cost-estimator territory) | 0.77 | ✅ | budget, out_of_scope_acknowledgement, trigger_relevance_negative |
| Negative — CAF naming lookup (azure-naming-research territory) | 0.77 | ✅ | budget, out_of_scope_acknowledgement, trigger_relevance_negative |
| Negative — Off-topic Linux kernel question | 0.75 | ❌ | budget, out_of_scope_acknowledgement, trigger_relevance_negative |
| Positive — Post-template-generation policy recommendations | 0.90 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — Compliance framework audit (CIS) | 0.56 | ❌ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Negative — Off-topic Linux kernel question: 50% pass rate, score=0.75±0.00
Failed Task Details
Negative — Off-topic Linux kernel question
Run 1/2 (error):
- ✅ budget (1.00): All behavior checks passed
- ✅ out_of_scope_acknowledgement (1.00): All prompts passed
- ✅ trigger_relevance_negative (0.24): Prompt correctly treated as non-trigger (score 0.24 < 0.50)
Positive — Compliance framework audit (CIS)
Run 1/2 (failed):
- ❌ answer_quality (0.00): fail: Missing verification caveat for CIS initiative ID/name: Criteria 2, 3, and 4 are satisfied: the response makes the initiative-vs-individual trade-off explicit, recommends Audit/DoNotEnforce first with phased move to Deny, and covers Storage (HTTPS-only, public access, TLS, soft delete), SQL (auditing, TDE, AAD, public network access), and VMSS/VMs (disk encryption, no public 22/3389, NSG/JIT, Defender) with named controls, well above the 3-of-4 bar (monitoring/diagnostics is also touched). However, criterion 1 is only partially met: the response recommends the CIS Azure Foundations regulatory compliance initiative but does NOT note that the current initiative ID / display name should be verified against Microsoft Learn or `az policy set-definition list --query "[?contains(displayName, 'CIS')]"` before assignment. The skill context explicitly calls this out ("Always verify policy and initiative definition IDs from Microsoft Learn ... Do not rely on memorized IDs"), and the grading rubric requires this caveat. Missing: criterion 1 (verification note).
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.67): Prompt is trigger-aligned (score 0.67 >= 0.50)
Run 2/2 (failed):
- ❌ answer_quality (0.00): fail: Missing explicit note about verifying CIS initiative IDs/names: Criteria 2, 3, and 4 are met: the response covers the initiative-vs-individual trade-off (broad coverage vs stricter/faster control), explicitly recommends an Audit-first rollout with phased promotion to Deny, and addresses storage (HTTPS-only, TLS 1.2, public access), SQL (TDE, auditing, Defender, public access), VM/compute (NSG/management ports, disk encryption, Defender), plus monitoring (diagnostic logs). Criterion 1 is only partially met: the response recommends assigning the CIS Azure Foundations initiative, but does NOT note that the current initiative ID / display name should be verified from Microsoft Learn or via `az policy set-definition list` before assignment, a required element of PASS criterion 1.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.67): Prompt is trigger-aligned (score 0.67 >= 0.50)
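The verification step the grader asks for is a substring filter on `displayName`. As a minimal sketch, the same filter can be applied to JSON that has already been fetched (for example the parsed output of `az policy set-definition list -o json`). The helper name and the sample records below are hypothetical placeholders, not real initiative IDs, and treating a missing `displayName` as a non-match is an assumption of this sketch.

```python
def filter_by_display_name(set_definitions, needle):
    """Keep entries whose displayName contains `needle` (case-sensitive).

    Mirrors the intent of the JMESPath filter
    [?contains(displayName, 'CIS')]; entries without a displayName are
    skipped here, which is an assumption of this sketch.
    """
    return [
        definition
        for definition in set_definitions
        if needle in (definition.get("displayName") or "")
    ]


# Placeholder records standing in for parsed CLI output (hypothetical data).
sample = [
    {"name": "placeholder-1", "displayName": "CIS example benchmark (placeholder)"},
    {"name": "placeholder-2", "displayName": "Unrelated example initiative"},
    {"name": "placeholder-3"},
]

matches = filter_by_display_name(sample, "CIS")
for definition in matches:
    print(definition["name"], "-", definition["displayName"])
```

Listing current matches before assignment is what lets a response cite an initiative ID and display name without relying on memorized values.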
Benchmark: azure-policy-advisor-eval | Skill: azure-policy-advisor | Model: gpt-5.3-codex
Results saved to: .waza-results/azure-policy-advisor-gpt-5.3-codex.json
🔢 Tokens (count + profile)
📊 azure-policy-advisor: 5,118 tokens (detailed ✓), 29 sections, 14 code blocks
⚠️ token count 5118 exceeds 3000
🎯 Quality (5-dim table)
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity ████░ The skill is well-structured with a logical step progression and clear headings. However, the heavy use of external reference files (references/classification-rules.yaml, references/per-resource-policy-priorities.md, etc.) creates dependency on files that may not exist, making the skill non-self-contained and harder to follow without them.
completeness █████ Exceptional coverage: prerequisites, graceful degradation when az is unavailable, landing zone context handling, confidence gating, output artifacts with full JSON schema, troubleshooting table, and concrete examples. Edge cases like multi-subscription environments, Government/China clouds, and initiative-member expansion gaps are all addressed.
trigger_precision █████ USE FOR and DO NOT USE FOR are precise, non-overlapping, and include specific alternative skill names for each exclusion. The routing logic is unambiguous — a reader would correctly route any policy-adjacent request without guessing.
scope_coverage █████ Scope boundaries are explicit both in the description header and in a dedicated Scope callout. The two-track model (template vs. subscription-level) cleanly partitions responsibilities. Limitations like initiative-member expansion are honestly documented as known gaps.
anti_patterns ████░ Avoids most anti-patterns well: error handling is explicit, steps are ordered, and conflicting directives are absent. Minor issue: Step 3 instructs 'always verify definition IDs' but Step 4 already fetches them from Microsoft Learn — the ordering slightly implies verification should come before research, which could cause confusion. Consolidating or reordering Steps 3-4 would tighten this.
────────────────────────────────────────────
Overall: 4.6/5.0
A high-quality, production-ready skill document. It excels at completeness, trigger precision, and scope coverage. The main improvement opportunity is reducing reliance on external reference files that may not always be present, and clarifying the Step 3/4 ordering around definition ID verification. The two-track (template vs. subscription) framework is a standout design choice that adds real clarity for operators.
✅ Check (compliance summary) (57 lines)
ℹ️ `waza check` expects `eval.yaml` colocated with `SKILL.md`. This repo separates them into `.github/evals/azure-policy-advisor/eval.yaml`, so the "Evaluation Suite: Not Found" line below is a false negative: the eval actually ran (see the Score section above).
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: azure-policy-advisor
📋 Compliance Score: High
✅ Excellent! Your skill meets all compliance requirements.
Issues found:
❌ SKILL.md is 5118 tokens (hard limit 500)
📐 Spec Compliance: 9/9 checks passed
✅ Meets agentskills.io specification.
🔌 MCP Integration: 4/4
✅ All MCP integration checks passed.
📎 Links: 5/5 valid
✅ All links valid.
📊 Token Budget: 5118 / 500 tokens
❌ Exceeds limit by 4618 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 5 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 3 reference modules (2-3 is optimal)
❌ [complexity] Complexity: comprehensive (5118 tokens, 3 modules)
✅ [negative-delta-risk] No negative delta risk patterns detected
✅ [procedural-content] Description contains procedural language
✅ [over-specificity] No over-specificity patterns detected
❌ [cross-model-density] Advisory 16: word count is 113 (>60 may reduce cross-model effectiveness); first sentence doesn't lead with action verb (reduces clarity)
✅ [body-structure] Advisory 17: body structure quality
✅ [progressive-disclosure] Content structure supports progressive disclosure
✅ [scope-reduction] Capability scope: 12 signal(s) detected (12 level-2 heading(s), 4 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Reduce SKILL.md by 4618 tokens. Run 'waza tokens suggest' for optimization tips
- add a Do NOT use for block and tighten the description for routing
- replace the Step 8 cat stub with assembly summary and report-back guidance
- make manual injection prescriptive about inject-lz flags
🧭 - Generated by Copilot
The LZ feature added a Landing Zone Context section to the mirror .github/copilot-instructions.md but not its canonical onboarding template, tripping the template mirror sync CI gate. Port the section into the template so the two stay byte-identical.
🧭 - Generated by Copilot
…-zone-discovery # Conflicts: # .github/evals/manifest.yaml
Summary
Adds the `azure-landing-zone-discovery` skill, which auto-discovers enterprise Azure landing zone topology (management groups, platform vs. application subscriptions, policy assignments, hub-spoke networking, and shared services) and writes a machine-readable `.azure/landing-zone-context.json`. Git-Ape agents read that context to route workloads to the correct subscription, connect to shared services, and avoid policy conflicts.
What's included
- `.github/skills/azure-landing-zone-discovery/SKILL.md` (user-invocable), with a weighted ALZ confidence scorer (high/medium/low/none) referencing the ALZ accelerator.
- `discover-lz` and `inject-lz` ship in both bash (`.sh`) and PowerShell (`.ps1`) parity ports. Both produce a byte-compatible `landing-zone-context.json`, so Windows users without `git-bash` are first-class.
- `eval.yaml` plus 4 tasks (2 positive, 2 negative) registered in `.github/evals/manifest.yaml`.
- `tests/fixtures/landing-zone/` (flat tenant, hub-spoke tenant, skipped-network).
- `git-ape` agent, plus `copilot-instructions.md`.
- `website/docs/authoring/skills.md` now documents the dual-shell helper-script rule (user-invocable skills SHOULD ship `.sh` + `.ps1`; CI-only skills MAY ship `.sh` only).
- `landing-zone-context.md`), and a use-case walkthrough (`landing-zone-aware-deployment.md`).
Validation
- `.sh` ports pass `bash -n`; both `.ps1` ports pass the PowerShell language parser.
- `inject-lz.ps1` functionally tested for schema parity vs `inject-lz.sh`, including cross-shell `-Merge` (bash creates base → pwsh merges correctly).
- `scripts/generate-docs.js`; unrelated drift reverted to keep the PR scoped.
🧭 - Generated by Copilot