Add optional drift-detector onboarding to git-ape-onboarding#188
Conversation
Onboarding now optionally provisions COPILOT_GITHUB_TOKEN so the agentic drift-detection workflow (git-ape-drift.lock.yml) can run. That workflow runs on the Copilot engine and fails its preflight gate without this token, with no fallback to the built-in GITHUB_TOKEN. - SKILL.md: optional Step 10 to provision the token (gated on user consent), renumbered compliance/verify steps, and updated the config list, suggested agent flow, and verification commands - git-ape-verify.yml: non-blocking warning when the token is absent (emits ::warning:: but never counts toward MISSING) - docs: optional "Enable drift detection" subsection in the getting-started guide; regenerated skill + verify workflow doc pages Tracks #187 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
🧪 Waza skill evals (advisory)
Ran 4 matrix legs in parallel (skills × models). Results are non-blocking — investigate failures via the workflow logs and the per-leg
📊 Token comparison vs
|
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Storage service comparison (off-topic) | 0.56 | ✅ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.91 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.98 | ✅ | answer_quality, budget, trigger_relevance_positive |
Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: claude-opus-4.6
Results saved to: .waza-results/git-ape-onboarding-claude-opus-4.6.json
JUnit XML saved to: .waza-results/git-ape-onboarding-claude-opus-4.6.junit.xml
Model: claude-sonnet-4.6
Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers
✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✓ [3/4] Positive — Multi-environment onboarding
✓ [2/4] Positive — First-time repo setup
🧪 Waza Eval Results
Status: ✅ Passed | Score: 0.86 | Duration: 1m24.144s
- Tests: 4 total, 4 passed, 0 failed, 0 errors
- Success Rate: 100.0%
- Score Range: 0.56 - 1.00 (σ=0.1798)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Storage service comparison (off-topic) | 0.56 | ✅ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.91 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.98 | ✅ | answer_quality, budget, trigger_relevance_positive |
Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: claude-sonnet-4.6
Results saved to: .waza-results/git-ape-onboarding-claude-sonnet-4.6.json
JUnit XML saved to: .waza-results/git-ape-onboarding-claude-sonnet-4.6.junit.xml
Model: gpt-5.3-codex
Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers
✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✓ [3/4] Positive — Multi-environment onboarding
✗ [2/4] Positive — First-time repo setup
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.78 | Duration: 49.603s
- Tests: 4 total, 3 passed, 1 failed, 0 errors
- Success Rate: 75.0%
- Score Range: 0.56 - 0.98 (σ=0.1736)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Storage service comparison (off-topic) | 0.56 | ✅ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 0.67 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.91 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.98 | ✅ | answer_quality, budget, trigger_relevance_positive |
Failed Task Details
Positive — First-time repo setup
Run 1/1 (failed):
- ❌ answer_quality (0.00): fail: : Criterion 1 not met: the agent did not present actual prereq check results (tool versions, auth status). Shell execution failed with "unexpected user permission response" and the agent surfaced a blank checklist for the user to fill in manually rather than inspected versions/auth states. Criteria 2 (gate surfaced as "blocked at the prereq gate"), 3 (5 inputs requested in numbered list), and 4 (no false claims of configuration) are satisfied.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: gpt-5.3-codex
Results saved to: .waza-results/git-ape-onboarding-gpt-5.3-codex.json
Model: gpt-5.4 *(baseline — A/B mode)*
Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: gpt-5.4
Judge Model: claude-opus-4.7
Parallel: 4 workers
════════════════════════════════════════════════════════════════
PASS 1: Skills-Enabled Run
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✓ [2/4] Positive — First-time repo setup
✓ [3/4] Positive — Multi-environment onboarding
════════════════════════════════════════════════════════════════
PASS 2: Skills Baseline (skills stripped)
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Storage service comparison (off-topic)
✗ [4/4] Positive — Scaffold honors skip-with-notice on collision
✗ [3/4] Positive — Multi-environment onboarding
✗ [2/4] Positive — First-time repo setup
════════════════════════════════════════════════════════════════
SKILL IMPACT ANALYSIS
════════════════════════════════════════════════════════════════
Overall Performance Delta:
With Skills: 100.0% (4/4 tasks passed)
Without Skills: 25.0% (1/4 tasks passed)
Impact: +75.0 percentage points
Per-Task Breakdown:
• Negative — Storage service comparison (off-topic) [NEUTRAL] 100% → 100% (+0pp)
• Positive — First-time repo setup [IMPROVED] 0% → 100% (+100pp)
• Positive — Multi-environment onboarding [IMPROVED] 0% → 100% (+100pp)
• Positive — Scaffold honors skip-with-notice on collision [IMPROVED] 0% → 100% (+100pp)
Verdict: Skills have POSITIVE IMPACT (improved 3/4 tasks)
════════════════════════════════════════════════════════════════
🧪 Waza Eval Results
Status: ✅ Passed | Score: 0.86 | Duration: 41.489s
- Tests: 4 total, 4 passed, 0 failed, 0 errors
- Success Rate: 100.0%
- Score Range: 0.56 - 1.00 (σ=0.1798)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Storage service comparison (off-topic) | 0.56 | ✅ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.91 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.98 | ✅ | answer_quality, budget, trigger_relevance_positive |
Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: gpt-5.4
Results saved to: .waza-results/git-ape-onboarding-gpt-5.4.json
JUnit XML saved to: .waza-results/git-ape-onboarding-gpt-5.4.junit.xml
🔢 Tokens (count + profile)
📊 git-ape-onboarding: 4,396 tokens (detailed ✓), 18 sections, 18 code blocks
⚠️ token count 4396 exceeds 3000
🎯 Quality (5-dim table)
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity █████ Instructions are exceptionally well-ordered with numbered steps, named invariants, a canonical command playbook, and clear visual markers (✓/⊝). The 'first-turn rule' in Suggested Agent Flow eliminates ambiguity about when the agent may proceed.
completeness █████ Covers prereq validation, single/multi-env modes, OIDC subject format variance, idempotency on re-run, optional drift detector, compliance preferences, safe-execution rules, verification commands, and two known gotchas with remediation steps. Edge cases (disabled subscriptions, org OIDC overrides, non-main branches, partial failures) are all explicitly addressed.
trigger_precision ████░ USE FOR and DO NOT USE FOR triggers in the description and body are specific and non-overlapping. Minor gap: 'rotating or updating an existing secret or federated credential' is mentioned as out-of-scope but no alternative skill is named for that path, leaving routing slightly incomplete.
scope_coverage █████ Scope is tightly defined — bootstrapping only, not deploying. Capabilities list in 'What It Configures' is exhaustive and numbered. Limitations are explicit (no overwrite, no git operations, no deployment). The boundary between this skill and git-ape is clearly articulated.
anti_patterns ████░ Avoids vague instructions, conflicting directives, and missing error handling. The hard gate on prereq-check before collecting inputs is excellent. Minor: Step 11's conditional logic for copilot-instructions.md (exists/doesn't exist/exists-but-no-section) is slightly complex and could benefit from a decision table, but the prose handles it correctly.
────────────────────────────────────────────
Overall: 4.6/5.0
Exceptionally well-structured skill with strong invariants, idempotency guarantees, explicit edge case handling, and a clear agent execution model. The hard prereq gate, canonical command playbook, and dual-shell scaffold scripts demonstrate production-grade thinking. Minor improvements: name an alternative skill for secret rotation, and consider a decision table for the copilot-instructions.md update logic to reduce cognitive load.
✅ Check (compliance summary) (60 lines — click to expand)
ℹ️
waza checkexpectseval.yamlcolocated withSKILL.md. This repo separates them into.github/evals/git-ape-onboarding/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: git-ape-onboarding
📋 Compliance Score: Medium-High
⚠️ Good, but could be improved. Missing routing clarity.
Issues found:
❌ SKILL.md is 4396 tokens (hard limit 500)
📐 Spec Compliance: 9/9 checks passed
✅ Meets agentskills.io specification.
📎 Links: 3/6 valid
⚠️ 3 link issue(s) found.
❌ [templates/copilot-instructions.md] → .github/skills/azure-stack-deploy/SKILL.md: target does not exist
❌ [templates/copilot-instructions.md] → website/docs/deployment/state.md: target does not exist
❌ [templates/copilot-instructions.md] → .github/skills/azure-stack-destroy/SKILL.md: target does not exist
📊 Token Budget: 4396 / 500 tokens
❌ Exceeds limit by 3896 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 4 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 0 reference module(s)
❌ [complexity] Complexity: comprehensive (4396 tokens, 0 modules)
❌ [negative-delta-risk] Negative delta risk patterns detected: excessive constraints (17 constraint keywords found)
✅ [procedural-content] Description contains procedural language
✅ [over-specificity] No over-specificity patterns detected
❌ [cross-model-density] Advisory 16: word count is 61 (>60 may reduce cross-model effectiveness); first sentence doesn't lead with action verb (reduces clarity)
❌ [body-structure] Advisory 17: body structure quality — no examples section found; no error handling or troubleshooting section found
✅ [progressive-disclosure] Content structure supports progressive disclosure
✅ [scope-reduction] Capability scope: 10 signal(s) detected (10 level-2 heading(s), 6 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
2. Run 'waza dev' for interactive compliance improvement
3. Fix 3 broken link(s) — targets do not exist
4. Reduce SKILL.md by 3896 tokens. Run 'waza tokens suggest' for optimization tips
…d trigger precision - Add first-turn rule: response must be gated handoff (prereq + inputs), not a walkthrough - Strengthen Step 1: surface full results table; explicit checklist fallback when CLI unavailable; hard gate before advancing - Strengthen Step 2: enumerate all 5 required inputs (repo URL, subscription ID, RBAC role, mode, branch) - Add DO NOT USE FOR to When to Use (re-deploy, secret rotation, drift-detection alone) - Update frontmatter description with USE FOR / DO NOT USE FOR trigger phrases - Add Safe-Execution Rule 8: idempotency on re-run (surface existing resources as ⊝ Already exists) Eval delta: 0.68 → 0.70 (+0.016); first-time repo setup +0.064
sendtoshailesh
left a comment
There was a problem hiding this comment.
Review: solid core change, two things to address before merge
Thanks for this — the core feature is correct and maps cleanly to #187.
✅ What's right
git-ape-verify.ymladds theCOPILOT_GITHUB_TOKENcheck as optional + non-blocking (::warning::, never incrementsMISSING) — plan/deploy/destroy/verify stay green without it. actionlint-clean.- SKILL.md Step 10 is well-designed: repo secret, gated on asking the user, PAT-with-Copilot-seat requirement, never echo the token.
- Evals green across all 4 models.
- The two unrelated lock-doc touches (
daily-repo-status,issue-triage) are pure gh-awv0.78.3 → v0.79.8version noise — pre-existing, correctly excluded, and disclosed. 👍
The committed docs are out of sync with the final SKILL.md. Running node scripts/generate-docs.js produces a deterministic diff on two tracked files:
website/docs/skills/git-ape-onboarding.mdwebsite/docs/skills/overview.md
Root cause: the docs were regenerated after the Step-10 drift edits (those are present), but not after the later SKILL.md edits — First-turn rule, Collect the required inputs, and Idempotency on re-run are absent from the published doc page. So the public skill page misrepresents the skill's description and agent flow. check-docs is advisory/non-blocking, so it won't gate this.
Fix: re-run the generator and commit only those two onboarding files (leave the unrelated lock docs out, as you already did).
Beyond the drift-detector work, the SKILL.md diff also silently includes:
- A full frontmatter
descriptionrewrite (USE FOR / DO NOT USE FOR clauses). - A new Safe-Execution rule #8 — idempotency on re-run.
- A near-total "Suggested Agent Flow" rewrite — first-turn gated handoff, 5 mandatory inputs, hard gates.
None of these are in #187 (scoped to the token) or the PR description. They look like reasonable, framework-aligned improvements and the evals pass — but they materially change agent behavior and are hard to spot under a PR titled "add drift-detector onboarding." Could you either document + justify them in the PR body, or split them into their own PR? (Note: these are the same edits missing from the regenerated docs in Finding 1 — which is what flagged that they were added after the last doc regen.)
Verdict: core feature is mergeable and correct; neither finding mechanically blocks (mergeable + advisory check), but both matter for doc accuracy and clean history. Happy to re-review once the docs are regenerated.
Re-runs node scripts/generate-docs.js so the published skill pages mirror the final SKILL.md. The earlier doc regen captured only the Step-10 drift edits; the later SKILL.md edits (frontmatter description rewrite, the DO NOT USE FOR block, Safe-Execution rule 8 idempotency, and the Suggested Agent Flow first-turn rule + required-inputs rewrite) were missing from the published pages. Scoped to the two onboarding files; the unrelated gh-aw v0.78.3 -> v0.79.8 lock-doc churn (daily-repo-status, issue-triage) is left out, matching the existing PR scope. Addresses review Finding 1 on #188. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
@sendtoshailesh thanks for the thorough review — both findings are addressed in Finding 1 (stale doc mirror) — fixed. Re-ran
They now carry the previously-missing Finding 2 (undisclosed scope) — disclosed. Updated the PR body with an "Additional scope beyond #187" section documenting the three behavioral changes in commit Re-requesting review 🙏 |
sendtoshailesh
left a comment
There was a problem hiding this comment.
Re-review @ f2e8459 — both findings resolved ✅
Thanks for the fast turnaround and the clear disclosure.
Finding 1 — stale doc mirror → fixed. Independently verified: a fresh node scripts/generate-docs.js now leaves git-ape-onboarding.md and overview.md untouched (deterministic, zero drift). The previously-missing First-turn rule, Collect the required inputs, DO NOT USE FOR block, Safe-Execution rule #8, and the frontmatter description are all present in the published pages now. The two unrelated gh-aw v0.78.3 → v0.79.8 lock docs remain correctly excluded.
Finding 2 — scope → disclosed. The new "Additional scope beyond #187" section documents all three behavioral changes (commit aff5c52) with rationale and eval impact (0.68 → 0.70, +0.064 on first-time setup). Isolating them in their own commit + disclosing is a reasonable call given they're scoped to this skill and eval-gated — no need to split for me.
Also confirmed: all CI green on f2e8459 (incl. evals across all 4 models, scaffold parity, template↔docs sync); mergeable (the BLOCKED state is just the pending required approval).
One optional, non-blocking note for later (not for this PR): the SKILL.md additions push it further over the framework's ~500-line / ~5k-token guidance — the advisory flags ~3.9k tokens of headroom. Worth considering moving some L1 prose (e.g. the Step-10 token-requirements detail) into an references/ L2 file in a future pass. Purely housekeeping; the eval gain shows the current content is pulling its weight.
LGTM from my side. 👍
Closes #187
Summary
The
/git-ape-onboardingskill scaffolds the drift-detection workflow (git-ape-drift.lock.yml) but never provisions the credential it needs, so the workflow fails its first preflight gate:git-ape-driftis a gh-aw GitHub Agentic Workflow running on the Copilot engine; its compiled.lock.ymlhas a hard "Validate COPILOT_GITHUB_TOKEN secret" gate with no fallback toGITHUB_TOKEN. This PR makes onboarding optionally provision that token, as a skippable step.Changes
Source
SKILL.md— new optional Step 10: Onboard the drift detector workflow (provisionCOPILOT_GITHUB_TOKENas a repository secret, PAT with an active Copilot seat, gated on asking the user first). Renumbered compliance→11 / verify→12. Added to "What It Configures", Suggested Agent Flow, and Verification Commands.templates/workflows/git-ape-verify.yml— non-blockingCOPILOT_GITHUB_TOKENcheck: emits a::warning::if absent but never counts towardMISSING, so plan/deploy/destroy verification still passes when drift detection isn't enabled.Docs
website/docs/getting-started/onboarding.md— hand-written (Optional) Enable drift detection subsection with a smoke test.website/docs/skills/git-ape-onboarding.md,website/docs/skills/overview.md+website/docs/workflows/git-ape-verify.md— regenerated vianode scripts/generate-docs.jsto mirror the sources.Additional scope beyond #187 — onboarding agent-flow hardening (commit
aff5c52)Disclosing per review Finding 2. Separately from the token work, commit
aff5c52also hardens the behavioral quality of thegit-ape-onboardingskill. These are isolated in their own commit and gated by the eval suite:descriptionrewrite — adds explicitUSE FOR:/DO NOT USE FOR:trigger phrases so the agent router fires the skill on the right intents (first-time setup, multi-env onboarding) and not on adjacent ones (re-deploy, secret rotation, drift-detection alone).⊝ Already existsinstead of creating duplicate Entra apps / federated credentials / role assignments.Why keep them here: they are framework-aligned trigger-precision + safe-execution improvements to the same skill this PR already edits, and they move the onboarding eval 0.68 → 0.70 (+0.016) overall, +0.064 on first-time repo setup, green across all 4 models. Per reviewer's offer ("disclose or split"), keeping them in-PR and disclosing here.
Review follow-ups
node scripts/generate-docs.js; committed only the two onboarding pages (git-ape-onboarding.md,overview.md) that the finalSKILL.mdrequires. The unrelated gh-awv0.78.3 → v0.79.8lock-doc churn (daily-repo-status,issue-triage) is left out, keeping PR scope clean.Validation
actionlintpasses clean on the editedgit-ape-verify.ymltemplate.SKILL.mdpreserved.git-ape-onboardingevals green across all 4 models.Note on scope
The generator also re-touches two unrelated lock docs (
daily-repo-status-lock.md,issue-triage-agent-lock.md) — pre-existing staleness (lock files already at gh-aw setupv0.79.8, committed docs atv0.78.3). Those are reverted to keep this PR focused. TheDocs Checkworkflow is advisory/non-blocking and may flag them; happy to fold those regenerations in if preferred.