From 9f06eddb68c6111ced6a39fe01d187858d764534 Mon Sep 17 00:00:00 2001
From: andrewstellman
Date: Sun, 10 May 2026 13:38:06 -0400
Subject: [PATCH] Update quality-playbook skill to v1.5.6 + add agent
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Rebuilds branch from upstream/staged (was previously merged from upstream/main, which brought in materialized plugin files that fail Check Plugin Structure on PRs targeting staged).

Changes vs. staged:

- Update skills/quality-playbook/ to v1.5.6 (32 bundled assets: SKILL.md + LICENSE.txt + 16 references/ + 9 phase_prompts/ + 3 agents/ + bin/citation_verifier.py + quality_gate.py).
- Add agents/quality-playbook.agent.md (top-level orchestrator). name: quality-playbook (validator-compliant).
- Update docs/README.skills.md quality-playbook row description + bundled-assets list to v1.5.6.
- Fix 'unparseable' → 'unparsable' in quality_gate.py (5 instances; codespell preference, both spellings valid).

Closes the v1.4.0 → v1.5.6 update in a single clean commit on top of upstream/staged. The preserved backup branch backup-bedbe84-pre-rebuild (SHA bedbe848fa3c0f0eda8e653c42b599a17dd2e354) holds the prior history for reference.
---
 agents/quality-playbook.agent.md              |  161 +
 docs/README.agents.md                         |    1 +
 docs/README.skills.md                         |    2 +-
 skills/quality-playbook/LICENSE.txt           |  211 +-
 skills/quality-playbook/SKILL.md              | 2513 +++++++++++-
 .../agents/calibration_orchestrator.md        |  222 ++
 .../agents/quality-playbook-claude.agent.md   |  117 +
 .../agents/quality-playbook.agent.md          |  167 +
 .../quality-playbook/phase_prompts/README.md  |   47 +
 .../phase_prompts/iteration.md                |    1 +
 .../quality-playbook/phase_prompts/phase1.md  |  229 ++
 .../quality-playbook/phase_prompts/phase2.md  |   27 +
 .../quality-playbook/phase_prompts/phase3.md  |  154 +
 .../quality-playbook/phase_prompts/phase4.md  |   54 +
 .../quality-playbook/phase_prompts/phase5.md  |  119 +
 .../quality-playbook/phase_prompts/phase6.md  |   23 +
 .../phase_prompts/single_pass.md              |    1 +
 skills/quality-playbook/quality_gate.py       | 3385 +++++++++++++++++
 .../references/challenge_gate.md              |  106 +
 .../references/code-only-mode.md              |   59 +
 .../references/defensive_patterns.md          |   20 +
 .../references/exploration_patterns.md        |  339 ++
 .../references/functional_tests.md            |  458 +--
 .../quality-playbook/references/iteration.md  |  191 +
 .../references/orchestrator_protocol.md       |   63 +
 .../references/requirements_pipeline.md       |  427 +++
 .../references/requirements_refinement.md     |  113 +
 .../references/requirements_review.md         |  158 +
 .../references/review_protocols.md            |  283 +-
 .../references/run_state_schema.md            |  366 ++
 .../quality-playbook/references/spec_audit.md |  107 +-
 .../references/verification.md                |  192 +-
 32 files changed, 9706 insertions(+), 610 deletions(-)
 create mode 100644 agents/quality-playbook.agent.md
 create mode 100644 skills/quality-playbook/agents/calibration_orchestrator.md
 create mode 100644 skills/quality-playbook/agents/quality-playbook-claude.agent.md
 create mode 100644 skills/quality-playbook/agents/quality-playbook.agent.md
 create mode 100644 skills/quality-playbook/phase_prompts/README.md
 create mode 100644 skills/quality-playbook/phase_prompts/iteration.md
 create mode 100644 skills/quality-playbook/phase_prompts/phase1.md
 create mode 100644 skills/quality-playbook/phase_prompts/phase2.md
 create mode 100644 skills/quality-playbook/phase_prompts/phase3.md
 create mode 100644 skills/quality-playbook/phase_prompts/phase4.md
 create mode 100644 skills/quality-playbook/phase_prompts/phase5.md
 create mode 100644 skills/quality-playbook/phase_prompts/phase6.md
 create mode 100644 skills/quality-playbook/phase_prompts/single_pass.md
 create mode 100755 skills/quality-playbook/quality_gate.py
 create mode 100644 skills/quality-playbook/references/challenge_gate.md
 create mode 100644 skills/quality-playbook/references/code-only-mode.md
 create mode 100644 skills/quality-playbook/references/exploration_patterns.md
 create mode 100644 skills/quality-playbook/references/iteration.md
 create mode 100644 skills/quality-playbook/references/orchestrator_protocol.md
 create mode 100644 skills/quality-playbook/references/requirements_pipeline.md
 create mode 100644 skills/quality-playbook/references/requirements_refinement.md
 create mode 100644 skills/quality-playbook/references/requirements_review.md
 create mode 100644 skills/quality-playbook/references/run_state_schema.md

diff --git a/agents/quality-playbook.agent.md b/agents/quality-playbook.agent.md
new file mode 100644
index 000000000..6119c83ae
--- /dev/null
+++ b/agents/quality-playbook.agent.md
@@ -0,0 +1,161 @@
+---
+name: quality-playbook
+description: "Run a complete quality engineering audit on any codebase. Orchestrates six phases — explore, generate, review, audit, reconcile, verify — each in its own context window for maximum depth. Then runs iteration strategies to find even more bugs. Finds the 35% of real defects that structural code review alone cannot catch."
+tools:
+  - search/codebase
+  - web/fetch
+---
+
+# Quality Playbook — Orchestrator Agent
+
+You are a quality engineering orchestrator.
Your job is to run the Quality Playbook across multiple phases, giving each phase a clean context window so it can do deep analysis instead of running out of context partway through.
+
+## Setup: find the skill
+
+Check that the quality playbook skill is installed. Look for SKILL.md in these locations, in order:
+
+1. `.github/skills/quality-playbook/SKILL.md` (Copilot)
+2. `.cursor/skills/quality-playbook/SKILL.md` (Cursor)
+3. `.claude/skills/quality-playbook/SKILL.md` (Claude Code)
+4. `.continue/skills/quality-playbook/SKILL.md` (Continue)
+
+Also check for a `references/` directory alongside SKILL.md (16 reference files in v1.5.6 — exploration_patterns.md, iteration.md, review_protocols.md, spec_audit.md, verification.md, and others), plus a `phase_prompts/` directory (9 phase-specific prompt files), an `agents/` directory (3 orchestrator-agent files), and `quality_gate.py` + `bin/citation_verifier.py`.
+
+**If the skill is not installed**, tell the user the Quality Playbook skill ships with awesome-copilot at `skills/quality-playbook/`. To install it into the current project, copy from your awesome-copilot clone:
+
+> ```bash
+> # If you don't already have awesome-copilot cloned:
+> git clone https://github.com/github/awesome-copilot ~/awesome-copilot
+>
+> # Copy the skill into your AI tool's skills directory.
+> # Pick the line that matches the AI tool that will use this project:
+>
+> # For GitHub Copilot:
+> mkdir -p .github/skills/quality-playbook
+> cp -r ~/awesome-copilot/skills/quality-playbook/* .github/skills/quality-playbook/
+>
+> # For Cursor:
+> mkdir -p .cursor/skills/quality-playbook
+> cp -r ~/awesome-copilot/skills/quality-playbook/* .cursor/skills/quality-playbook/
+>
+> # For Claude Code:
+> mkdir -p .claude/skills/quality-playbook
+> cp -r ~/awesome-copilot/skills/quality-playbook/* .claude/skills/quality-playbook/
+>
+> # For Continue:
+> mkdir -p .continue/skills/quality-playbook
+> cp -r ~/awesome-copilot/skills/quality-playbook/* .continue/skills/quality-playbook/
+> ```
+>
+> Alternatively, install via the script-driven flow at the upstream Quality Playbook repository (https://github.com/andrewstellman/quality-playbook) for the full v1.5.6 install UX (auto-detect, marker-directory creation, smoke checks).
+
+Then stop and wait for the user to install it.
+
+**If the skill is installed**, read SKILL.md and every file in the `references/` and `phase_prompts/` directories. Then follow the instructions below.
+
+## Pre-flight checks
+
+Before starting Phase 1, do two things:
+
+1. **Check for documentation.** Look for a `docs/`, `docs_gathered/`, or `documentation/` directory. If none exists, give a prominent warning:
+
+   > **Documentation improves results significantly.** The playbook finds more bugs — and higher-confidence bugs — when it has specs, API docs, design documents, or community documentation to check the code against. Consider adding documentation to `docs_gathered/` before running. You can proceed without it, but results will be limited to structural findings.
+
+2. **Ask about scope.** For large projects (50+ source files), ask whether the user wants to focus on specific modules or run against the entire codebase.
+
+## How to run
+
+The playbook has two modes.
Ask the user which they want, or infer from their prompt:
+
+### Mode 1: Phase by phase (recommended for first run)
+
+Run Phase 1 in the current session. When it completes, show the end-of-phase summary and tell the user to say "keep going" or "run phase N" to continue. Each subsequent phase should run in a **new session or context window** so it gets maximum depth.
+
+This is the default if the user says "run the quality playbook."
+
+### Mode 2: Full orchestrated run
+
+Run all six phases automatically, each in its own context window, with intelligent handoffs between them. Use this when the user says "run the full playbook" or "run all phases."
+
+**Orchestration protocol:**
+
+For each phase (1 through 6):
+
+1. **Start a new context.** Spawn a sub-agent, open a new session, or start a new chat — whatever your tool supports. The goal is a clean context window.
+2. **Pass the phase prompt.** Tell the new context:
+   - Read SKILL.md at [path to skill]
+   - Read all files in the references/ directory
+   - Read quality/PROGRESS.md (if it exists) for context from prior phases
+   - Execute Phase N
+3. **Wait for completion.** The phase is done when it writes its checkpoint to quality/PROGRESS.md.
+4. **Check the result.** Read quality/PROGRESS.md after the phase completes. Verify the phase wrote its checkpoint. If it didn't, the phase failed — report to the user and ask whether to retry.
+5. **Report progress.** Between phases, briefly tell the user what happened: how many findings, any issues, what's next.
+6. **Continue to next phase.** Repeat from step 1.
+
+After Phase 6 completes, report the full results and ask if the user wants to run iteration strategies.
+
+**Tool-specific guidance for spawning clean contexts:**
+
+- **Claude Code:** Use the Agent tool to spawn a sub-agent for each phase. Each sub-agent gets its own context window automatically.
+- **Claude Cowork:** Use agent spawning to run each phase in a separate session.
+- **GitHub Copilot:** Start a new chat for each phase. Include the phase prompt as your first message.
+- **Cursor:** Open a new Composer for each phase with the phase prompt.
+- **Windsurf / other tools:** Start a new conversation or chat for each phase.
+
+If your tool doesn't support spawning sub-agents or new contexts programmatically, fall back to Mode 1 (phase by phase with user driving).
+
+### Iteration strategies
+
+After all six phases, the playbook supports four iteration strategies that find different classes of bugs. Each strategy re-explores the codebase with a different approach, then re-runs Phases 2-6 on the merged findings. Read `references/iteration.md` for full details.
+
+The four strategies, in recommended order:
+
+1. **gap** — Explore areas the baseline missed
+2. **unfiltered** — Fresh-eyes re-review without structural constraints
+3. **parity** — Compare parallel code paths (setup vs. teardown, encode vs. decode)
+4. **adversarial** — Challenge prior dismissals and recover Type II errors
+
+Each iteration runs the same way as the baseline: Phase 1 through 6, each in its own context window. Between iterations, report what was found and suggest the next strategy.
+
+Iterations typically add 40-60% more confirmed bugs on top of the baseline.
+
+## The six phases
+
+1. **Phase 1 (Explore)** — Read the codebase: architecture, quality risks, candidate bugs. Output: `quality/EXPLORATION.md`
+2. **Phase 2 (Generate)** — Produce quality artifacts: requirements, constitution, functional tests, review protocols, TDD protocol, AGENTS.md. Output: nine files in `quality/`
+3. **Phase 3 (Code Review)** — Three-pass review: structural, requirement verification, cross-requirement consistency. Regression tests for every confirmed bug. Output: `quality/code_reviews/`, patches
+4. **Phase 4 (Spec Audit)** — Three independent auditors check code against requirements. Triage with verification probes.
Output: `quality/spec_audits/`, additional regression tests
+5. **Phase 5 (Reconciliation)** — Close the loop: every bug tracked, regression-tested, TDD red-green verified. Output: `quality/BUGS.md`, TDD logs, completeness report
+6. **Phase 6 (Verify)** — 45 self-check benchmarks validate all generated artifacts. Output: final PROGRESS.md checkpoint
+
+Each phase has entry gates (prerequisites from prior phases) and exit gates (what must be true before the phase is considered complete). SKILL.md defines these gates precisely — follow them exactly.
+
+## Responding to user questions
+
+- **"help" / "how does this work"** — Explain the six phases and two run modes. Mention that documentation improves results. Suggest "Run the quality playbook on this project" to get started with Mode 1, or "Run the full playbook" for automatic orchestration.
+- **"what happened" / "what's going on" / "status"** — Read `quality/PROGRESS.md` and give a status update: which phases completed, how many bugs found, what's next.
+- **"keep going" / "continue" / "next"** — Run the next phase in sequence.
+- **"run phase N"** — Run the specified phase (check prerequisites first).
+- **"run iterations"** — Start the iteration cycle. Read `references/iteration.md` and run gap strategy first.
+- **"run [strategy] iteration"** — Run a specific iteration strategy.
+
+## Error recovery
+
+If a phase fails (crashes, runs out of context, doesn't write its checkpoint):
+
+1. Read quality/PROGRESS.md to see what was completed
+2. Report the failure to the user with specifics
+3. Suggest retrying the failed phase in a new context
+4. Do not skip phases — each phase depends on the prior phase's output
+
+If the tool runs out of context mid-phase, the phase's incremental writes to disk are preserved. A retry in a new context can pick up where it left off by reading PROGRESS.md and the quality/ directory.
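The retry flow above can be sketched as a small shell helper. This is a minimal sketch, not part of the playbook: it assumes a hypothetical checkpoint format of "CHECKPOINT: Phase N complete" lines in quality/PROGRESS.md (the real format is defined by SKILL.md) and prints the phase a retry should resume from.

```shell
# Hypothetical recovery helper -- the "CHECKPOINT: Phase N complete"
# line format is an assumption; SKILL.md defines the real checkpoint shape.
progress="quality/PROGRESS.md"

# Simulate a run that finished Phases 1-3 and then crashed during Phase 4.
mkdir -p quality
printf 'CHECKPOINT: Phase %d complete\n' 1 2 3 > "$progress"

if [ -f "$progress" ]; then
  # Highest phase number that wrote a completion checkpoint.
  last=$(grep -oE 'Phase [0-9]+ complete' "$progress" | grep -oE '[0-9]+' | sort -n | tail -n 1)
fi
next=$(( ${last:-0} + 1 ))
echo "Retry from Phase $next"
```

With the simulated PROGRESS.md above this prints "Retry from Phase 4", matching rule 4: the retry resumes at the first phase without a checkpoint rather than skipping ahead.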
+
+## Example prompts
+
+- "Run the quality playbook on this project" — Mode 1, starts Phase 1
+- "Run the full playbook" — Mode 2, orchestrates all six phases
+- "Run the full playbook with all iterations" — Mode 2 + all four iteration strategies
+- "Keep going" — Continue to next phase
+- "What happened?" — Status check
+- "Run the adversarial iteration" — Specific iteration strategy
+- "Help" — Explain how it works

diff --git a/docs/README.agents.md b/docs/README.agents.md
index b511fb59a..8375f73e8 100644
--- a/docs/README.agents.md
+++ b/docs/README.agents.md
@@ -177,6 +177,7 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-agents) for guidelines on how to
 | [Python MCP Server Expert](../agents/python-mcp-expert.agent.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fpython-mcp-expert.agent.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fpython-mcp-expert.agent.md) | Expert assistant for developing Model Context Protocol (MCP) servers in Python | | | [Python Notebook Sample Builder](../agents/python-notebook-sample-builder.agent.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fpython-notebook-sample-builder.agent.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fpython-notebook-sample-builder.agent.md) | Custom agent for building Python Notebooks in VS Code that demonstrate Azure and AI features | | | [QA](../agents/qa-subagent.agent.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fqa-subagent.agent.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fqa-subagent.agent.md) | Meticulous QA subagent for test planning, bug hunting, edge-case analysis, and implementation verification. | | +| [Quality Playbook](../agents/quality-playbook.agent.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fquality-playbook.agent.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fquality-playbook.agent.md) | Run a complete quality engineering audit on any codebase. Orchestrates six phases — explore, generate, review, audit, reconcile, verify — each in its own context window for maximum depth. Then runs iteration strategies to find even more bugs. Finds the 35% of real defects that structural code review alone cannot catch. | | | [React18 Auditor](../agents/react18-auditor.agent.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Freact18-auditor.agent.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Freact18-auditor.agent.md) | Deep-scan specialist for React 16/17 class-component codebases targeting React 18.3.1. Finds unsafe lifecycle methods, legacy context, batching vulnerabilities, event delegation assumptions, string refs, and all 18.3.1 deprecation surface. Reads everything, touches nothing. Saves .github/react18-audit.md. | | | [React18 Batching Fixer](../agents/react18-batching-fixer.agent.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Freact18-batching-fixer.agent.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Freact18-batching-fixer.agent.md) | Automatic batching regression specialist. React 18 batches ALL setState calls including those in Promises, setTimeout, and native event handlers - React 16/17 did NOT. Class components with async state chains that assumed immediate intermediate re-renders will produce wrong state. This agent finds every vulnerable pattern and fixes with flushSync where semantically required. | | | [React18 Class Surgeon](../agents/react18-class-surgeon.agent.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Freact18-class-surgeon.agent.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Freact18-class-surgeon.agent.md) | Class component migration specialist for React 16/17 → 18.3.1. Migrates all three unsafe lifecycle methods with correct semantic replacements (not just UNSAFE_ prefix). Migrates legacy context to createContext, string refs to React.createRef(), findDOMNode to direct refs, and ReactDOM.render to createRoot. Uses memory to checkpoint per-file progress. | | diff --git a/docs/README.skills.md b/docs/README.skills.md index 6afcebe14..00f1bb0ff 100644 --- a/docs/README.skills.md +++ b/docs/README.skills.md @@ -287,7 +287,7 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-skills) for guidelines on how to | [qdrant-scaling](../skills/qdrant-scaling/SKILL.md)
`gh skills install github/awesome-copilot qdrant-scaling` | Guides Qdrant scaling decisions. Use when someone asks 'how many nodes do I need', 'data doesn't fit on one node', 'need more throughput', 'cluster is slow', 'too many tenants', 'vertical or horizontal', 'how to shard', or 'need to add capacity'. | `minimize-latency`
`scaling-data-volume`
`scaling-qps`
`scaling-query-volume` | | [qdrant-search-quality](../skills/qdrant-search-quality/SKILL.md)
`gh skills install github/awesome-copilot qdrant-search-quality` | Diagnoses and improves Qdrant search relevance. Use when someone reports 'search results are bad', 'wrong results', 'low precision', 'low recall', 'irrelevant matches', 'missing expected results', or asks 'how to improve search quality?', 'which embedding model?', 'should I use hybrid search?', 'should I use reranking?'. Also use when search quality degrades after quantization, model change, or data growth. | `diagnosis`
`search-strategies` | | [qdrant-version-upgrade](../skills/qdrant-version-upgrade/SKILL.md)
`gh skills install github/awesome-copilot qdrant-version-upgrade` | Guidance on how to upgrade your Qdrant version without interrupting the availability of your application and ensuring data integrity. | None | -| [quality-playbook](../skills/quality-playbook/SKILL.md)
`gh skills install github/awesome-copilot quality-playbook` | Explore any codebase from scratch and generate six quality artifacts: a quality constitution (QUALITY.md), spec-traced functional tests, a code review protocol with regression test generation, an integration testing protocol, a multi-model spec audit (Council of Three), and an AI bootstrap file (AGENTS.md). Includes state machine completeness analysis and missing safeguard detection. Works with any language (Python, Java, Scala, TypeScript, Go, Rust, etc.). Use this skill whenever the user asks to set up a quality playbook, generate functional tests from specifications, create a quality constitution, build testing protocols, audit code against specs, or establish a repeatable quality system for a project. Also trigger when the user mentions 'quality playbook', 'spec audit', 'Council of Three', 'fitness-to-purpose', 'coverage theater', or wants to go beyond basic test generation to build a full quality system grounded in their actual codebase. | `LICENSE.txt`
`references/constitution.md`
`references/defensive_patterns.md`
`references/functional_tests.md`
`references/review_protocols.md`
`references/schema_mapping.md`
`references/spec_audit.md`
`references/verification.md` | +| [quality-playbook](../skills/quality-playbook/SKILL.md)
`gh skills install github/awesome-copilot quality-playbook` | Run a complete quality engineering audit on any codebase. Derives behavioral requirements from the code, generates spec-traced functional tests, runs a three-pass code review with regression tests, executes a multi-model spec audit (Council of Three), and produces a consolidated bug report with TDD-verified patches. Finds the 35% of real defects that structural code review alone cannot catch. Works with any language. Trigger on 'quality playbook', 'spec audit', 'Council of Three', 'fitness-to-purpose', or 'coverage theater'. | `LICENSE.txt`
`agents`
`phase_prompts`
`quality_gate.py`
`references/challenge_gate.md`
`references/code-only-mode.md`
`references/constitution.md`
`references/defensive_patterns.md`
`references/exploration_patterns.md`
`references/functional_tests.md`
`references/iteration.md`
`references/orchestrator_protocol.md`
`references/requirements_pipeline.md`
`references/requirements_refinement.md`
`references/requirements_review.md`
`references/review_protocols.md`
`references/run_state_schema.md`
`references/schema_mapping.md`
`references/spec_audit.md`
`references/verification.md` | | [quasi-coder](../skills/quasi-coder/SKILL.md)
`gh skills install github/awesome-copilot quasi-coder` | Expert 10x engineer skill for interpreting and implementing code from shorthand, quasi-code, and natural language descriptions. Use when collaborators provide incomplete code snippets, pseudo-code, or descriptions with potential typos or incorrect terminology. Excels at translating non-technical or semi-technical descriptions into production-quality code. | None | | [react-audit-grep-patterns](../skills/react-audit-grep-patterns/SKILL.md)
`gh skills install github/awesome-copilot react-audit-grep-patterns` | Provides the complete, verified grep scan command library for auditing React codebases before a React 18.3.1 or React 19 upgrade. Use this skill whenever running a migration audit - for both the react18-auditor and react19-auditor agents. Contains every grep pattern needed to find deprecated APIs, removed APIs, unsafe lifecycle methods, batching vulnerabilities, test file issues, dependency conflicts, and React 19 specific removals. Always use this skill when writing audit scan commands - do not rely on memory for grep syntax, especially for the multi-line async setState patterns which require context flags. | `references/dep-scans.md`
`references/react18-scans.md`
`references/react19-scans.md`
`references/test-scans.md` | | [react18-batching-patterns](../skills/react18-batching-patterns/SKILL.md)
`gh skills install github/awesome-copilot react18-batching-patterns` | Provides exact patterns for diagnosing and fixing automatic batching regressions in React 18 class components. Use this skill whenever a class component has multiple setState calls in an async method, inside setTimeout, inside a Promise .then() or .catch(), or in a native event handler. Use it before writing any flushSync call - the decision tree here prevents unnecessary flushSync overuse. Also use this skill when fixing test failures caused by intermediate state assertions that break after React 18 upgrade. | `references/batching-categories.md`
`references/flushSync-guide.md` | diff --git a/skills/quality-playbook/LICENSE.txt b/skills/quality-playbook/LICENSE.txt index e0d4f9147..ce64d27c2 100644 --- a/skills/quality-playbook/LICENSE.txt +++ b/skills/quality-playbook/LICENSE.txt @@ -1,21 +1,190 @@ -MIT License - -Copyright (c) 2025 Andrew Stellman - -Permission is hereby granted, free of charge, to any person obtaining a copy -of this software and associated documentation files (the "Software"), to deal -in the Software without restriction, including without limitation the rights -to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -copies of the Software, and to permit persons to whom the Software is -furnished to do so, subject to the following conditions: - -The above copyright notice and this permission notice shall be included in all -copies or substantial portions of the Software. - -THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -SOFTWARE. + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. 
For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to the Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. 
For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by the Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. 
If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding any notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. 
You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. 
Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + Copyright 2025 Andrew Stellman + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ See the License for the specific language governing permissions and + limitations under the License. diff --git a/skills/quality-playbook/SKILL.md b/skills/quality-playbook/SKILL.md index b5242ec42..3f0901f36 100644 --- a/skills/quality-playbook/SKILL.md +++ b/skills/quality-playbook/SKILL.md @@ -1,23 +1,205 @@ --- name: quality-playbook -description: "Explore any codebase from scratch and generate six quality artifacts: a quality constitution (QUALITY.md), spec-traced functional tests, a code review protocol with regression test generation, an integration testing protocol, a multi-model spec audit (Council of Three), and an AI bootstrap file (AGENTS.md). Includes state machine completeness analysis and missing safeguard detection. Works with any language (Python, Java, Scala, TypeScript, Go, Rust, etc.). Use this skill whenever the user asks to set up a quality playbook, generate functional tests from specifications, create a quality constitution, build testing protocols, audit code against specs, or establish a repeatable quality system for a project. Also trigger when the user mentions 'quality playbook', 'spec audit', 'Council of Three', 'fitness-to-purpose', 'coverage theater', or wants to go beyond basic test generation to build a full quality system grounded in their actual codebase." +description: "Run a complete quality engineering audit on any codebase. Derives behavioral requirements from the code, generates spec-traced functional tests, runs a three-pass code review with regression tests, executes a multi-model spec audit (Council of Three), and produces a consolidated bug report with TDD-verified patches. Finds the 35% of real defects that structural code review alone cannot catch. Works with any language. Trigger on 'quality playbook', 'spec audit', 'Council of Three', 'fitness-to-purpose', or 'coverage theater'." 
license: Complete terms in LICENSE.txt metadata: - version: 1.2.0 + version: 1.5.6 + # NOTE: Inline occurrences of the skill version exist throughout this file (frontmatter, + # banner, version stamp template, sidecar JSON examples, run metadata, recheck template). + # When bumping the version, update ALL occurrences — search for the old version string + # globally. One historical reference to v1.4.6 edgequake benchmarking is intentionally + # preserved in the challenge-gate section and must NOT be bumped. author: Andrew Stellman - github: https://github.com/andrewstellman/ + github: https://github.com/andrewstellman/quality-playbook --- # Quality Playbook Generator -**When this skill starts, display this banner before doing anything else:** +## Plan Overview — read this first, then explain it to the user + +Before reading any other section of this skill, understand the plan and its dependencies. Each phase produces artifacts that the next phase depends on. Skipping or rushing a phase means every downstream phase works from incomplete information. + +**Phase 0 (Prior Run Analysis):** If previous quality runs exist, load their findings as seed data. This is automatic and only applies to re-runs. + +**Phase 1 (Explore):** Run the v1.5.3 documentation intake first (`python3 -m bin.reference_docs_ingest ` to walk `reference_docs/` — `cite/` files produce `quality/formal_docs_manifest.json` records; top-level files are loaded as Tier 4 context via `reference_docs_ingest.load_tier4_context()`). Then explore the codebase in three stages: open exploration driven by domain knowledge, domain-knowledge risk analysis, and selected structured exploration patterns. Write all findings to `quality/EXPLORATION.md`. This file is the foundation — Phase 2 reads it as its primary input. + +**Phase 2 (Generate):** Read EXPLORATION.md and produce the quality artifacts: requirements, constitution, functional tests, code review protocol, integration tests, spec audit protocol, TDD protocol.
(`AGENTS.md` at the target's repo root is generated by the orchestrator AFTER Phase 6, not by you in Phase 2 — see "File 6" below for the contract.) + +**Phase 3 (Code Review):** Run the three-pass code review against HEAD. Write regression tests for every confirmed bug. Generate patches. + +**Phase 4 (Spec Audit):** Three independent AI auditors review the code against requirements. Triage with verification probes. After triage, the same Council runs the v1.5.3 Layer-2 semantic citation check — one prompt per reviewer, structured per-REQ verdicts for every Tier 1/2 citation, output to `quality/citation_semantic_check.json`. Write regression tests for net-new findings. + +**Phase 5 (Reconciliation):** Close the loop — every bug from code review and spec audit is tracked, regression-tested or explicitly exempted. Run TDD red-green cycle. Finalize the completeness report. + +**Phase 6 (Verify):** Run self-check benchmarks against all generated artifacts. Check for internal consistency, version stamp correctness, and convergence. + +**Phase 7 (Present, Explore, Improve):** Present results to the user with a scannable summary table, offer drill-down on any artifact, and provide a menu of improvement paths (iteration strategies, requirement refinement, integration test tuning). This is the interactive phase where the user takes ownership of the quality system. + +Every bug found traces back to a requirement, and every requirement traces back to an exploration finding. + +**The critical dependency chain:** Exploration findings → EXPLORATION.md → Requirements → Code review + Spec audit → Bug discovery. A shallow exploration produces abstract requirements. Abstract requirements miss bugs. The exploration phase is where bugs are won or lost. 
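+The traceability invariant above — every bug cites a requirement, every requirement cites an exploration finding — can be sketched as a small chain check. This is a hypothetical illustration: the record shapes and the `id` / `finding` / `requirement` fields are invented for the example, not the actual EXPLORATION.md / REQUIREMENTS.md / BUGS.md formats.

```python
# Hypothetical sketch of the Exploration -> Requirements -> Bugs chain check.
# Field names ("id", "finding", "requirement") are illustrative only.

def untraceable_items(findings, requirements, bugs):
    """Return every requirement or bug that breaks the dependency chain."""
    finding_ids = {f["id"] for f in findings}
    requirement_ids = {r["id"] for r in requirements}
    broken = []
    for r in requirements:
        if r["finding"] not in finding_ids:
            # Requirement with no exploration basis -> likely abstract/invented.
            broken.append(("requirement", r["id"]))
    for b in bugs:
        if b["requirement"] not in requirement_ids:
            # Bug with no requirement basis -> cannot be triaged against intent.
            broken.append(("bug", b["id"]))
    return broken


findings = [{"id": "EXP-1"}]
requirements = [
    {"id": "REQ-1", "finding": "EXP-1"},
    {"id": "REQ-2", "finding": "EXP-9"},  # EXP-9 was never explored
]
bugs = [{"id": "BUG-1", "requirement": "REQ-1"}]
print(untraceable_items(findings, requirements, bugs))  # [('requirement', 'REQ-2')]
```

+A shallow exploration shows up in exactly this way: requirements that point at findings nobody wrote down, which is why the chain check flags them rather than silently accepting them.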
+ +**MANDATORY FIRST ACTION:** After reading and understanding the plan above, print the following message to the user, then explain the plan in your own words — what you'll do, what each phase produces, and why the exploration phase matters most. Emphasize that exploration starts with open-ended domain-driven investigation, followed by domain-knowledge risk analysis that reasons about what goes wrong in systems like this, then supplemented by selected structured patterns. Do not copy the plan verbatim; paraphrase it to demonstrate understanding. + +> Quality Playbook v1.5.6 — by Andrew Stellman +> https://github.com/andrewstellman/quality-playbook + +Generate a complete quality system tailored to a specific codebase. Unlike test stub generators that work mechanically from source code, this skill explores the project first — understanding its domain, architecture, specifications, and failure history — then produces a quality playbook grounded in what it finds. + +## How to run this — v1.5.4 self-encoded invocation contract + +If the operator hands you this skill (or points you at any QPB-installed target) and says **"Run the Quality Playbook"** — possibly with a hint like "this is a bootstrap run" or "run on itself" or "self-audit" — this section tells you exactly what to do. The operator should not need to provide additional instructions; the canonical invocation, the defaults, the guardrails, and the output contract all live here. + +### Pick your execution mode + +QPB ships in two execution shapes. Pick the one that matches your runtime — the wrong choice produces the codex-on-codex indirection pathology surfaced by the 2026-04-30 bootstrap test. + +| Mode | When this is you | What you do | +|------|------------------|-------------| +| **A. Skill-direct (UI-context)** | You are a coding agent (Claude Code, Cursor, Copilot, Codex desktop, etc.) handed this skill in your own chat. Your runtime IS the reasoning loop — you read files, you write files, you decide. 
| Walk through Phase 1 → Phase 6 yourself using the externalized phase prompts in `phase_prompts/`. Write artifacts into the target's `quality/` directory directly. No subprocess, no runner. | +| **B. Runner-driven (CLI-automation)** | The operator is invoking `python3 -m bin.run_playbook` deliberately — to batch across multiple targets, drive a headless CI run, or fan out per-phase work to a different model than the one reading this prose. | The orchestrator spawns a CLI agent (`claude`, `copilot`, `codex`, or `cursor`) per phase. You (or whoever is reading this) are the operator-side control loop, not the per-phase reasoner. | + +**Both modes use the same phase prompt content** — the `phase_prompts/*.md` files at the repo root are the single source of truth, loaded by `bin/run_playbook.py::_load_phase_prompt` and read directly by Mode A walkthroughs. The only thing the two modes differ on is WHO drives — you (Mode A) or the orchestrator subprocess-spawning a CLI agent (Mode B). + +**When in doubt, default to Mode A.** If the operator wanted runner-driven invocation they would have run the runner themselves; if they pasted "Run the Quality Playbook" into your chat, they want you to drive. The Mode B section below tells you what to do *if* the operator explicitly invokes the runner. + +### Mode A — skill-direct walkthrough (UI-context) + +The operator's prompt is just **"Run the Quality Playbook"** (or "run on itself", "self-audit", etc.). You drive every phase inline. + +For each phase 1..6, in order: + +1. **Load the phase prompt.** Read `phase_prompts/phaseN.md` (resolve via the same install-location fallback list documented for `references/` below). For `phase1.md`, substitute `{seed_instruction}` (the prelude that says "skip Phase 0/0b" — empty string when seeds are allowed) and `{role_taxonomy}` (the taxonomy block rendered from the role taxonomy below). For `phase2.md` through `phase6.md`, the file is pure-literal — read it verbatim. +2. 
**Execute the phase per the prompt.** Read the inputs the prompt names, do the analysis, write outputs into the target's `quality/` directory. +3. **STOP at the end-of-phase boundary.** Every phase prompt ends with an "IMPORTANT: Do NOT proceed to Phase N+1" instruction. Honor it. The operator advances to the next phase by saying so. + +You are responsible — without the orchestrator's structural backstop — for the same source-unchanged invariant the runner enforces: **do NOT modify any file outside the target's `quality/` directory**. In Mode B the gate would catch this; in Mode A you are the gate. The 2026-04-30 bootstrap test specifically failed on a Phase 2 LLM modifying the target's root `AGENTS.md` — the same failure mode applies in Mode A. + +For the bootstrap-run (self-audit) variant of Mode A, see "Bootstrap mode" below — the only delta is that the target IS the QPB repo, so cite the same `phase_prompts/` files you read from. + +#### Mode A scope — what's covered, what's Mode-B-only + +Council 2026-04-30 P1-3: the per-phase walkthrough above scopes Mode A to **phases 1..6**. The following surfaces are deliberately Mode-B-only — if the operator wants them, point them at the runner instead of trying to drive them yourself: + +- **Phase 0 / Phase 0b (seed injection from prior runs).** The orchestrator handles seed discovery, prior-run scanning, and seed-prompt injection. In Mode A, treat every run as `--no-seeds` (skip Phase 0/0b entirely, start at Phase 1). If the operator explicitly asks for seed-driven exploration, hand off to Mode B (`python3 -m bin.run_playbook --with-seeds `). +- **Phase 7 (interactive Present / Explore / Improve).** This phase is a back-and-forth dialogue with the operator about the generated artifacts; it has no pre-baked prompt in `phase_prompts/`. 
After Phase 6 in Mode A, present the artifact summary table inline (see "What this run produces" below for the file list) and let the operator drive what to explore next conversationally — that IS Phase 7. There is no orchestrator subprocess to spawn. +- **Iteration strategies (gap / unfiltered / parity / adversarial).** Iterations re-enter the playbook with a strategy-specific addendum. In Mode A, after Phase 6 completes cleanly, hand off to Mode B for iterations: `python3 -m bin.run_playbook --next-iteration --strategy `. The iteration prompts (`phase_prompts/iteration.md`) ARE single-source-of-truth, but the iteration-orchestration loop (rotating through gap → unfiltered → parity → adversarial) is the runner's job. A Mode A operator who wants iterations after Phase 6 should be told: "Phase 6 is done; run `python3 -m bin.run_playbook --full-run ` to get all four iteration strategies, or pick one strategy explicitly with `--next-iteration --strategy gap`." + +If the operator asks for one of these surfaces in Mode A and the request is ambiguous (e.g., "also do the iterations"), surface the mode-handoff explicitly rather than improvising — improvisation is how the prompt content drifts away from the runner's canonical loop. + +### Mode B — runner-driven invocation (CLI-automation) + +The operator runs `python3 -m bin.run_playbook` themselves (typically because they want batching, headless CI, or to route per-phase work to a different model). The orchestrator at `bin/run_playbook.py` spawns a CLI agent per phase, feeds it the externalized phase prompt, and aggregates the result. + +#### Canonical invocation + +The orchestrator is the entry point. Always invoke it as a Python module: ``` -Quality Playbook v1.2.0 — by Andrew Stellman -https://github.com/andrewstellman/ +python3 -m bin.run_playbook ``` -Generate a complete quality system tailored to a specific codebase. 
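+The module-only invocation above is typically enforced by a runtime guard at import time. A minimal sketch of such a guard, assuming a `__package__`-based mechanism — the actual check in `bin/run_playbook.py` may be implemented differently:

```python
import sys

EX_USAGE = 64  # BSD sysexits(3) code for command-line usage errors


def invocation_error(package):
    """Return EX_USAGE for script-style runs, None for `python3 -m` runs.

    `python3 -m bin.run_playbook` sets __package__ to "bin", so relative
    imports inside the package resolve. Script-style invocation
    (`python bin/run_playbook.py`) leaves __package__ empty, and relative
    imports like `from . import quality_gate` would raise ImportError.
    """
    if not package:
        return EX_USAGE
    return None


if __name__ == "__main__":
    assert invocation_error("bin") is None   # module-style: allowed
    assert invocation_error("") == EX_USAGE  # script-style: rejected
```

+A production guard would write a usage message to stderr and call `sys.exit(EX_USAGE)` for the script-style case instead of asserting.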
Unlike test stub generators that work mechanically from source code, this skill explores the project first — understanding its domain, architecture, specifications, and failure history — then produces a quality playbook grounded in what it finds. +**Never invoke it script-style** (`python bin/run_playbook.py ...`). The runtime guard exits with `EX_USAGE=64` because relative imports require packaged execution. + +`` is the path to the project to audit. For a bootstrap run (target IS the QPB repo), pass `.` from the repo root. For any other target, pass the path to that target's repo root. + +#### Default behavior (no flags) + +Bare invocation triggers a **full run**: all 6 phases (Explore → Generate → Code Review → Spec Audit → Reconciliation → Verify) followed by all 4 iteration strategies (gap → unfiltered → parity → adversarial), executed synchronously in the same session. Any prior `quality/` directory is auto-archived to `quality/previous_runs//` before the new run starts. + +This is the canonical operator path. Don't ask permission to add flags; the defaults are the answer. + +When the bare invocation fires, the orchestrator emits a one-line stderr banner naming the cost change vs. v1.5.3 (~5–10× the legacy "Phase 1 only" default). That banner is informational; let it scroll. + +#### Common overrides + +Use only when the operator asks for something specific: + +| Need | Flag | Effect | +|------|------|--------| +| Run a single phase | `--phase N` (where N ∈ 1..6) | Recovers the v1.5.3 "explore only" pattern with `--phase 1`. | +| Skip iteration strategies | omit `--iterations` and pass `--phase 1,2,3,4,5,6` | Phases run; iterations don't. | +| Specific iteration | `--strategy --next-iteration` | Iterates on an existing `quality/` run with a chosen strategy. | +| Multi-target | pass several positional targets | Each runs independently. | +| Per-phase CLI agent | `--claude` / `--copilot` / `--codex` / `--cursor` | Picks which CLI runner the orchestrator spawns. 
Default is `--copilot`. v1.5.4 added the `--cursor` runner (cursor-cli 3.1+). | + +#### Recovering from a partial / aborted runner-driven run + +Council 2026-04-30 P1-4: the operator-hygiene guidance for cleaning up after an aborted run lives in the **Bootstrap mode** section below ("Bootstrap-run operator hygiene") — the recovery is identical in Mode B: `git restore quality/` to discard the partial Phase 1/2 output, then re-invoke. **Do NOT** edit files outside `quality/` to "tidy up" — the source-unchanged invariant trips on the very next run. See the Bootstrap-mode hygiene paragraph for the full mechanic; it applies regardless of whether the abort happened during a self-audit run or against an external target. + +### Bootstrap mode (running QPB on itself) + +When the operator says "this is a bootstrap run" or "we're running QPB on itself" or "self-audit": + +1. Confirm the working directory is the QPB repo root (or `cd` there). +2. Invoke `python3 -m bin.run_playbook .` — same canonical form, target is `.`. +3. The orchestrator handles archival of the existing `quality/` tree to `quality/previous_runs//` automatically; you don't need to clean anything manually. + +The run proceeds the same way as any other target. The only difference is that the audit subject IS the playbook itself, so the produced artifacts describe QPB's own quality system. + +**Bootstrap-run operator hygiene — recovering from a partial / aborted run.** If a prior bootstrap run aborted mid-flight (e.g., the source-unchanged invariant tripped, a phase prompt errored, the operator hit Ctrl-C), the working tree may contain a half-written `quality/` directory plus a `quality/previous_runs//.partial` sentinel marking the abandoned archive. Before re-invoking, run `git restore quality/` (and, if you want a clean slate, `git clean -fd quality/`) to drop any uncommitted Phase 1/2 output from the aborted run. The orchestrator will re-archive the now-pristine `quality/` tree and start clean. 
**Do NOT** edit files outside `quality/` to "tidy up" — anything outside `quality/` is QPB source; touching it for cleanup will trip the source-unchanged invariant on the very next run. The 2026-04-30 bootstrap test surfaced this exact recovery question: the operator had a half-written `quality/` from an aborted Phase 2 and re-running without restoring left stale Phase 1 artifacts that confused the next run's archival. + +### v1.5.4 mechanics (pointer-style, not duplicating the design doc) + +What's new vs. v1.5.3, in pointer form (the canonical architecture lives in `docs/design/QPB_v1.5.4_Design.md` Part 1): + +- **Phase 1 produces `quality/exploration_role_map.json`** — per-file role tagging done AI-driven during exploration. Each in-scope file gets a role from the taxonomy (`skill-prose`, `skill-reference`, `skill-tool`, `code`, `test`, `docs`, `config`, `fixture`, `formal-spec`, `playbook-output`). The role map drives every downstream pipeline-activation decision. +- **`INDEX.md` uses `schema_version: "2.0"`** with a `target_role_breakdown` field carrying the per-role counts and percentages. The v1.5.3 `target_project_type` enum is retired (legacy archives stay readable). +- **Pipelines activate from the role map, not from a project-type label.** The four-pass skill-derivation pipeline runs over files tagged `skill-prose` / `skill-reference`. The code-review pipeline runs over files tagged `code`. The prose-to-code divergence check runs over files tagged `skill-tool`. When the role map shows zero of a role, that pipeline no-ops cleanly. There is no Code/Skill/Hybrid trichotomy — both pipelines run when both surfaces are present (the "always-Hybrid downstream" model). +- **Archive directory is `quality/previous_runs/`** (was `quality/runs/` in v1.5.3); legacy archives at the old path remain readable. 
+- **End-of-Phase-6 reorganization** moves intermediate artifacts under `quality/workspace/` so the top-level `quality/` directory is dominated by canonical deliverables (REQUIREMENTS.md, BUGS.md, etc.). The gate's path resolver reads from both layouts. + +You don't need to re-derive any of this in your prompt-side reasoning; the orchestrator's prompts already encode it. If you encounter a phase prompt that conflicts with the architecture summarized here, follow the phase prompt — it's the canonical source for the per-phase contract. + +### Guardrails (machine-checkable; treat as hard constraints) + +These are not suggestions; the orchestrator enforces them and a violation aborts the run: + +1. **Synchronous execution — no sub-agent delegation.** Run every phase yourself in the same session. **Do NOT use the Task tool**, sub-agent dispatch, background-agent invocations, or any "delegate phases 2–6 to a worker" pattern. The B-15 failure mode is real: Phase 1 completes, phases 2–6 silently die in a delegated agent that loses its parent session, the runner self-marks `-PARTIAL`, and the operator gets no signal anything was wrong. v1.5.4 prompts forbid this explicitly. +2. **Don't patch QPB source mid-run.** If you encounter a bug in `bin/`, `.github/skills/`, `agents/`, `references/`, `SKILL.md`, `schemas.md`, or `AGENTS.md` during the run, **STOP and report**: name the file:line, describe the failure, propose a fix shape — but do NOT apply the fix. The orchestrator captures a git-SHA baseline at run start and verifies the source tree unchanged at every phase boundary; an autonomous patch fails the gate with a diagnostic naming the modified files. Patches go through Council review, not mid-run improvisation. +3. **Don't delete sentinel files.** Files protected by `.gitignore !`-rules (e.g., `reference_docs/.gitkeep`, `reference_docs/cite/.gitkeep`) keep otherwise-empty tracked directories present. 
The pre-flight check enumerates every `!`-rule and aborts if any sentinel is missing. If you find such a file and don't understand its purpose, **leave it alone**. +4. **Phase 1 file enumeration uses `git ls-files`.** Use `git ls-files` as the canonical file list when the target is a git repo; this respects `.gitignore` automatically. Do NOT use `os.walk`, `find`, `os.listdir`, or any recursive directory walker — those pull in `.git/`, `.venv/`, `node_modules/`, build outputs, and vendored dependencies, all of which the role-map validator rejects. Disallowed path prefixes are `.git/`, `.venv/`, `venv/`, `node_modules/`, `__pycache__/`, `.pytest_cache/`, `.mypy_cache/`, `.ruff_cache/`, `.tox/`, plus any path whose components end in `.egg-info` or `.dist-info`. The role map carries a `provenance` field recording which enumeration source you used (`"git-ls-files"` or `"filesystem-walk-with-skips"` for non-git targets). There is also a 2000-entry ceiling; a role map exceeding it almost certainly walked .gitignored content. +5. **Cross-artifact agreement.** EXPLORATION.md's "File inventory" section and the role map's `summary` field both render from `bin.role_map.summarize_role_map()`. Don't write file counts or role percentages by hand; copy from the helper. The validator cross-checks the two and rejects mismatches. + +If the operator's prompt says something that conflicts with these guardrails (e.g., "delegate phases 3–6 to a sub-agent so we can run faster"), **don't comply with the conflicting instruction**. Surface the conflict, name the guardrail, and ask for clarification. The guardrails exist because each one corresponds to a verified historical failure mode. + +### What this run produces — output artifact contract + +A successful run produces this canonical set under the target's `quality/` directory plus an AGENTS.md at the target's repo root. 
Every file listed here is gate-validated: + +| Path | Role | +|------|------| +| `quality/EXPLORATION.md` | Phase 1 findings — the foundation. | +| `quality/exploration_role_map.json` | Per-file role tagging from Phase 1. | +| `quality/REQUIREMENTS.md` | Testable requirements with use cases. | +| `quality/QUALITY.md` | Quality constitution. | +| `quality/CONTRACTS.md` | Behavioral contracts. | +| `quality/COVERAGE_MATRIX.md` | Requirement → test traceability. | +| `quality/COMPLETENESS_REPORT.md` | Final gate verdict. | +| `quality/test_functional.*` | Automated functional tests. | +| `quality/RUN_CODE_REVIEW.md` | Three-pass code review protocol. | +| `quality/RUN_INTEGRATION_TESTS.md` | Integration test protocol. | +| `quality/RUN_SPEC_AUDIT.md` | Council of Three spec audit protocol. | +| `quality/RUN_TDD_TESTS.md` | TDD red-green verification protocol. | +| `quality/BUGS.md` | Consolidated bug report. | +| `quality/INDEX.md` | Run metadata + role breakdown + gate verdict. | +| `quality/PROGRESS.md` | Phase-by-phase checkpoint log. | +| `quality/previous_runs//` | Archive of any prior run. | +| `quality/workspace/` | Intermediate pipeline artifacts (control prompts, code reviews, spec audits, four-pass pipeline outputs, etc.). | +| `AGENTS.md` (target repo root) | Per-project orientation generated post-Phase-6. Carries a QPB sentinel marker so future runs detect QPB-managed copies. | + +The gate verdict in `quality/INDEX.md` (`pass` / `partial` / `fail`) is the operator-facing summary of how the run went. If it's anything other than `pass`, surface why before considering the run done. + +### Locating reference files + +This skill references files in a `references/` directory (e.g., `references/iteration.md`, `references/review_protocols.md`). The location depends on how the skill was installed. When a reference file is mentioned, resolve it by checking these paths in order and using the first one that exists: + +1. 
`references/` (relative to SKILL.md — works when running from the skill directory) +2. `.claude/skills/quality-playbook/references/` (Claude Code installation) +3. `.github/skills/references/` (GitHub Copilot flat installation) +4. `.github/skills/quality-playbook/references/` (alternate Copilot installation) + +All reference file mentions in this skill use the short form `references/filename.md`. If the relative path doesn't resolve, walk the fallback list above. ## Why This Exists @@ -27,42 +209,422 @@ Without a quality playbook, every new contributor (and every new AI session) sta ## What This Skill Produces -Six files that together form a repeatable quality system: +Nine files that together form a repeatable quality system: | File | Purpose | Why It Matters | Executes Code? | |------|---------|----------------|----------------| | `quality/QUALITY.md` | Quality constitution — coverage targets, fitness-to-purpose scenarios, theater prevention | Every AI session reads this first. It tells them what "good enough" means so they don't guess. | No | +| `quality/REQUIREMENTS.md` | Testable requirements with project overview, use cases, and narrative — generated by a five-phase pipeline (contract extraction → derivation → verification → completeness → narrative) | The foundation for Passes 2 and 3 of the code review. Without requirements, review is limited to structural anomalies (~65% ceiling). With them, the review can catch intent violations — absence bugs, cross-file contradictions, and design gaps that are invisible to code reading alone. | No | | `quality/test_functional.*` | Automated functional tests derived from specifications | The safety net. Tests tied to what the spec says should happen, not just what the code does. Use the project's language: `test_functional.py` (Python), `FunctionalSpec.scala` (Scala), `functional.test.ts` (TypeScript), `FunctionalTest.java` (Java), etc. 
| **Yes** | -| `quality/RUN_CODE_REVIEW.md` | Code review protocol with guardrails that prevent hallucinated findings | AI code reviews without guardrails produce confident but wrong findings. The guardrails (line numbers, grep before claiming, read bodies) often improve accuracy. | No | +| `quality/RUN_CODE_REVIEW.md` | Three-pass code review protocol: structural review, requirement verification, cross-requirement consistency | Structural review alone misses ~35% of real defects. The three-pass pipeline adds requirement verification and consistency checking — backed by experiment evidence showing it finds bugs invisible to all structural review conditions. | No | | `quality/RUN_INTEGRATION_TESTS.md` | Integration test protocol — end-to-end pipeline across all variants | Unit tests pass, but does the system actually work end-to-end with real external services? | **Yes** | +| `quality/BUGS.md` | Consolidated bug report with patches | Every confirmed bug in one place with reproduction details, spec basis, severity, and patch references. The single source of truth for what's broken and how to verify it. | No | +| `quality/RUN_TDD_TESTS.md` | TDD red-green verification protocol | Proves each bug is real (test fails on unpatched code) and each fix works (test passes after patch). Stronger evidence than a bug report alone — maintainers trust FAIL→PASS demonstrations. | **Yes** | | `quality/RUN_SPEC_AUDIT.md` | Council of Three multi-model spec audit protocol | No single AI model catches everything. Three independent models with different blind spots catch defects that any one alone would miss. | No | | `AGENTS.md` | Bootstrap context for any AI session working on this project | The "read this first" file. Without it, AI sessions waste their first hour figuring out what's going on. | No | -Plus output directories: `quality/code_reviews/`, `quality/spec_audits/`, `quality/results/`. 
+Plus output directories: `quality/code_reviews/`, `quality/spec_audits/`, `quality/results/`, `quality/history/`. + +The pipeline also generates supporting artifacts: `quality/PROGRESS.md` (phase-by-phase checkpoint log with cumulative BUG tracker), `quality/CONTRACTS.md` (behavioral contracts), `quality/COVERAGE_MATRIX.md` (traceability), `quality/COMPLETENESS_REPORT.md` (final gate), and `quality/VERSION_HISTORY.md` (review log). Phase 7 can additionally generate `quality/REVIEW_REQUIREMENTS.md` (interactive review protocol) and `quality/REFINE_REQUIREMENTS.md` (refinement pass protocol) for iterative improvement. + +The two critical deliverables are the requirements file and the functional test file. The requirements file (`quality/REQUIREMENTS.md`) feeds the code review protocol's verification and consistency passes — it's what makes the code review catch more than structural anomalies. The functional test file (named for the project's language and test framework conventions) is the automated safety net. The Markdown protocols are documentation for humans and AI agents. + +### Complete Artifact Contract + +The quality gate (`quality_gate.py`) validates these artifacts. If the gate checks for it, this skill must instruct its creation. This is the canonical list — any artifact not listed here should not be gate-enforced, and any gate check should trace to an artifact listed here. + +| Artifact | Location | Required? 
| Created In | +|----------|----------|-----------|------------| +| Formal docs manifest (v1.5.3) | `quality/formal_docs_manifest.json` | Yes | Phase 1 (`bin/reference_docs_ingest.py`) | +| Requirements manifest (v1.5.3) | `quality/requirements_manifest.json` | Yes | Phase 2 | +| Use cases manifest (v1.5.3) | `quality/use_cases_manifest.json` | Yes | Phase 2 | +| Bugs manifest (v1.5.3) | `quality/bugs_manifest.json` | If bugs found | Phase 3/4/5 | +| Citation semantic check (v1.5.3) | `quality/citation_semantic_check.json` | Yes | Phase 4 (Layer 2 Council) | +| Exploration findings | `quality/EXPLORATION.md` | Yes | Phase 1 | +| Quality constitution | `quality/QUALITY.md` | Yes | Phase 2 | +| Requirements (UC identifiers) | `quality/REQUIREMENTS.md` | Yes | Phase 2 | +| Behavioral contracts | `quality/CONTRACTS.md` | Yes | Phase 2 | +| Functional tests | `quality/test_functional.*` | Yes | Phase 2 | +| Regression tests | `quality/test_regression.*` | If bugs found | Phase 3 | +| Code review protocol | `quality/RUN_CODE_REVIEW.md` | Yes | Phase 2 | +| Integration test protocol | `quality/RUN_INTEGRATION_TESTS.md` | Yes | Phase 2 | +| Spec audit protocol | `quality/RUN_SPEC_AUDIT.md` | Yes | Phase 2 | +| TDD verification protocol | `quality/RUN_TDD_TESTS.md` | Yes | Phase 2 | +| Bug tracker | `quality/BUGS.md` | Yes | Phase 3 | +| Coverage matrix | `quality/COVERAGE_MATRIX.md` | Yes | Phase 2 | +| Completeness report | `quality/COMPLETENESS_REPORT.md` | Yes | Phase 2 (baseline), Phase 5 (final verdict) | +| Progress tracker | `quality/PROGRESS.md` | Yes | Throughout | +| AI bootstrap | `AGENTS.md` (target repo root) | Yes | Generated by orchestrator after Phase 6 — not a Phase 2 deliverable | +| Bug writeups | `quality/writeups/BUG-NNN.md` | If bugs found | Phase 5 | +| Regression patches | `quality/patches/BUG-NNN-regression-test.patch` | If bugs found | Phase 3 | +| Fix patches | `quality/patches/BUG-NNN-fix.patch` | Optional | Phase 3 | +| TDD traceability | 
`quality/TDD_TRACEABILITY.md` | If bugs have red-phase results | Phase 5 | +| TDD sidecar | `quality/results/tdd-results.json` | If bugs found | Phase 5 | +| TDD red-phase logs | `quality/results/BUG-NNN.red.log` | If bugs found | Phase 5 | +| TDD green-phase logs | `quality/results/BUG-NNN.green.log` | If fix patch exists | Phase 5 | +| Integration sidecar | `quality/results/integration-results.json` | When integration tests run | Phase 5 | +| Mechanical verify script | `quality/mechanical/verify.sh` | Yes (benchmark) | Phase 2 | +| Verify receipt | `quality/results/mechanical-verify.log` + `.exit` | Yes (benchmark) | Phase 5 | +| Triage probes | `quality/spec_audits/triage_probes.sh` | When triage runs | Phase 4 | +| Code review reports | `quality/code_reviews/*.md` | Yes | Phase 3 | +| Spec audit reports | `quality/spec_audits/*auditor*.md` + `*triage*` | Yes | Phase 4 | +| Recheck results (JSON) | `quality/results/recheck-results.json` | When recheck runs | Recheck | +| Recheck summary (MD) | `quality/results/recheck-summary.md` | When recheck runs | Recheck | +| Seed checks | `quality/SEED_CHECKS.md` | If Phase 0b ran | Phase 0b | +| Run metadata | `quality/results/run-YYYY-MM-DDTHH-MM-SS.json` | Yes | Phase 1 (created), Throughout (updated) | + +**Sidecar JSON lifecycle:** Write all bug writeups *before* finalizing `tdd-results.json` — the sidecar's `writeup_path` field must point to an existing file, not a placeholder. Similarly, run integration tests and collect results before writing `integration-results.json`. 
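The sidecar lifecycle rule can be enforced with a small guard before the final write; `write_tdd_sidecar` is an illustrative helper sketched for this document, not part of `quality_gate.py`:

```python
import json
import os

def write_tdd_sidecar(path, payload):
    """Refuse to finalize tdd-results.json while any bug's writeup_path
    points at a file that does not exist yet (placeholder paths fail too)."""
    missing = [
        bug["id"]
        for bug in payload.get("bugs", [])
        if not os.path.isfile(bug.get("writeup_path", ""))
    ]
    if missing:
        raise RuntimeError(
            "write bug writeups before finalizing sidecar: " + ", ".join(missing)
        )
    with open(path, "w") as fh:
        json.dump(payload, fh, indent=2)
```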
+ +### Sidecar JSON Canonical Examples + +**`quality/results/tdd-results.json`** — the gate validates field names, not just presence: + +```json +{ + "schema_version": "1.1", + "skill_version": "1.5.6", + "date": "2026-04-12", + "project": "repo-name", + "bugs": [ + { + "id": "BUG-001", + "requirement": "REQ-003", + "red_phase": "fail", + "green_phase": "pass", + "verdict": "TDD verified", + "fix_patch_present": true, + "writeup_path": "quality/writeups/BUG-001.md" + } + ], + "summary": { + "total": 3, "confirmed_open": 1, "red_failed": 0, "green_failed": 0, "verified": 2 + } +} +``` + +`verdict` must be one of: `"TDD verified"`, `"red failed"`, `"green failed"`, `"confirmed open"`, `"deferred"`. `date` must be ISO 8601 (YYYY-MM-DD), not a placeholder, not in the future. + +**`quality/results/integration-results.json`:** + +```json +{ + "schema_version": "1.1", + "skill_version": "1.5.6", + "date": "2026-04-12", + "project": "repo-name", + "recommendation": "SHIP", + "groups": [{ "group": 1, "name": "Group 1", "use_cases": ["UC-01"], "result": "pass", "tests_passed": 3, "tests_failed": 0, "notes": "" }], + "summary": { "total_groups": 12, "passed": 11, "failed": 1, "skipped": 0 }, + "uc_coverage": { "UC-01": "covered_pass", "UC-02": "not_mapped" } +} +``` + +`recommendation` must be one of: `"SHIP"`, `"FIX BEFORE MERGE"`, `"BLOCK"`. `uc_coverage` maps UC identifiers from REQUIREMENTS.md to coverage status. + +### Run Metadata + +Every playbook run creates a timestamped metadata file at `quality/results/run-YYYY-MM-DDTHH-MM-SS.json`. This enables multi-model comparison and run history tracking. + +**Lifecycle:** Create this file at the start of Phase 1. Update `phases_completed`, `bug_count`, and `end_time` as each phase finishes. The final update happens after the terminal gate. 
+ +```json +{ + "schema_version": "1.0", + "skill_version": "1.5.6", + "project": "repo-name", + "model": "claude-sonnet-4-6", + "model_provider": "anthropic", + "runner": "claude-code", + "start_time": "2026-04-16T10:30:00Z", + "end_time": "2026-04-16T11:45:00Z", + "duration_minutes": 75, + "phases_completed": ["Phase 0b", "Phase 1", "Phase 2", "Phase 3", "Phase 4", "Phase 5"], + "iterations_completed": ["gap", "unfiltered", "parity", "adversarial"], + "bug_count": 12, + "bug_severity": { "HIGH": 2, "MEDIUM": 5, "LOW": 5 }, + "gate_result": "PASS", + "gate_fail_count": 0, + "gate_warn_count": 2, + "notes": "" +} +``` -The critical deliverable is the functional test file (named for the project's language and test framework conventions). The Markdown protocols are documentation for humans and AI agents. The functional tests are the automated safety net. +**Required fields:** `schema_version`, `skill_version`, `project`, `model`, `start_time`. All other fields are populated as the run progresses. `model` should be the exact model string (e.g., `"claude-sonnet-4-6"`, `"gpt-4.1"`, `"claude-opus-4-6"`). `runner` identifies the tool used to execute the playbook (e.g., `"claude-code"`, `"copilot-cli"`, `"cursor"`, `"cowork"`). `duration_minutes` is computed from `end_time - start_time`. If the model or runner cannot be determined, use `"unknown"`. ## How to Use -Point this skill at any codebase: +**The playbook is designed to run one phase at a time.** Each phase runs in its own session with a clean context window, producing files on disk that the next phase reads. This gives much better results than running all phases at once — each phase gets the full context window for deep analysis instead of competing for space with other phases. + +**Default behavior: run Phase 1 only.** When someone says "run the quality playbook" or "execute the quality playbook," run Phase 1 (Explore) and stop. After Phase 1 completes, tell the user what happened and what to say next. 
The user drives each phase forward explicitly. + +### Interactive protocol — how to guide the user + +**After every phase and every iteration, STOP and print guidance.** Use a `#` header so it's prominent in the chat. The guidance must include: what just happened (one line), what the key outputs are, and the exact prompt to continue. See the end-of-phase messages defined after each phase section below. + +**If the user says "keep going", "continue", "next phase", "next", or anything similar**, run the next phase in sequence. If all phases are complete, suggest the first iteration strategy (gap). If an iteration just finished, suggest the next strategy in the recommended cycle. + +**If the user says "run all phases", "run everything", or "run the full pipeline"**, run all phases sequentially in a single session. This uses more context but some users prefer it. + +**If the user asks "help", "how does this work", "what is this", or any similar phrasing**, respond with this explanation (adapt the wording naturally, don't copy verbatim): + +> The Quality Playbook finds bugs that structural code review alone can't catch — the 35% of real defects that require understanding what the code is *supposed* to do. 
It works phase by phase: +> +> - **Phase 1 (Explore):** Understand the codebase — architecture, risks, failure modes, specifications +> - **Phase 2 (Generate):** Produce quality artifacts — requirements, tests, review protocols +> - **Phase 3 (Code Review):** Three-pass review with regression tests for every confirmed bug +> - **Phase 4 (Spec Audit):** Three independent AI auditors check the code against requirements +> - **Phase 5 (Reconciliation):** Close the loop — TDD red-green verification for every bug +> - **Phase 6 (Verify):** Self-check benchmarks validate all generated artifacts +> +> After the numbered phases complete, you can run iteration strategies (gap, unfiltered, parity, adversarial) to find additional bugs — iterations typically add 40-60% more confirmed bugs on top of the baseline. +> +> The playbook works best when you provide documentation alongside the code — specs, API docs, design documents, community documentation. It also gets significantly better results when you run each phase separately rather than all at once. +> +> To get started, say: **"Run the quality playbook on this project."** + +**If the user asks "what happened", "what's going on", "where are we", or "what should I do next"**, read `quality/PROGRESS.md` and give them a concise status update: which phases are complete, how many bugs found so far, and what the next step is. + +### Documentation warning + +**At the start of Phase 1, before exploring any code, check for documentation.** Look for directories named `docs/`, `reference_docs/`, `doc/`, `documentation/`, or any gathered documentation files. Also check if the user mentioned documentation in their prompt. 
+ +**If no documentation is found, print this warning immediately (before proceeding):** + +> **Important: No project documentation found.** The quality playbook works without documentation, but it finds significantly more bugs — and higher-confidence bugs — when you provide specs, API docs, design documents, or community documentation. In controlled experiments, documentation-enriched runs found different and better bugs than code-only baselines. +> +> If you have documentation available, you can add it to a `reference_docs/` directory and re-run Phase 1. Otherwise, I'll proceed with code-only analysis. + +Then proceed with Phase 1 — don't block on this, just make sure the user sees the warning. + +### Running a specific phase + +The user can request any individual phase: ``` -Generate a quality playbook for this project. +Run quality playbook phase 1. +Run quality playbook phase 3 — code review. +Run phase 5 reconciliation. ``` +When running a specific phase, check that its prerequisites exist (e.g., Phase 3 requires Phase 2 artifacts). If prerequisites are missing, tell the user which phases need to run first. + +### Iteration mode — improve on a previous run + +Use this when a previous playbook run exists and you want to find additional bugs. Iteration mode replaces Phase 1's from-scratch exploration with a targeted exploration using one of five strategies, then merges findings with the previous run and re-runs Phases 2–6 against the combined results. + +**When to use iteration mode:** After a complete playbook run, when you believe the codebase has more bugs than the first run found. This is especially effective for large codebases where a single run can only cover 3–5 subsystems, and for library/framework codebases where different exploration paths find different bug classes. 
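The prerequisite check described under "Running a specific phase" above can be sketched as follows. The map is an abbreviated illustration, not the full per-phase artifact contract from the tables in this skill:

```python
import os

# Abbreviated, illustrative prerequisite map (assumption: one or two key
# artifacts per phase stand in for the full contract).
PHASE_PREREQS = {
    2: ["quality/EXPLORATION.md"],
    3: ["quality/REQUIREMENTS.md", "quality/RUN_CODE_REVIEW.md"],
    4: ["quality/REQUIREMENTS.md", "quality/COVERAGE_MATRIX.md"],
    5: ["quality/BUGS.md"],
}

def missing_prereqs(phase, root="."):
    """Return the prerequisite files that do not yet exist for a phase,
    so the runner can tell the user which phases need to run first."""
    return [
        p for p in PHASE_PREREQS.get(phase, [])
        if not os.path.isfile(os.path.join(root, p))
    ]
```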
+ +**Read `references/iteration.md` for detailed strategy instructions.** That file contains the full operational detail for each strategy, shared rules, merge steps, and the completion gate. The summary below describes when to use each strategy. + +**TDD applies to iteration runs.** Every newly confirmed bug in an iteration run must go through the full TDD red-green cycle and produce `quality/results/BUG-NNN.red.log` (and `.green.log` if a fix patch exists). The quality gate enforces this — missing logs cause FAIL. See `references/iteration.md` shared rule 5 and the TDD Log Closure Gate in Phase 5. + +**Iteration strategies.** The user selects a strategy by naming it in the prompt. If no strategy is named, default to `gap`. + ``` -Update the functional tests — the quality playbook already exists. +Run the next iteration of the quality playbook. # default: gap strategy +Run the next iteration of the quality playbook using the gap strategy. +Run the next iteration using the unfiltered strategy. +Run the next iteration using the parity strategy. +Run an iteration using the adversarial strategy. ``` -``` -Run the spec audit protocol. +**Recommended cycle:** gap → unfiltered → parity → adversarial. Each strategy finds different bug classes: + +- **`gap`** (default) — Scan previous coverage, explore uncovered subsystems and thin sections. Best when the first run was structurally sound but only covered a subset of the codebase. +- **`unfiltered`** — Pure domain-driven exploration with no structural constraints. No pattern templates, no applicability matrices, no section format requirements. Recovers bugs that structured exploration suppresses. +- **`parity`** — Systematically enumerate parallel implementations of the same contract (transport variants, fallback chains, setup-vs-reset paths) and diff them for inconsistencies. Finds bugs that only emerge from cross-path comparison. 
+- **`adversarial`** — Re-investigate dismissed/demoted triage findings and challenge thin SATISFIED verdicts. Recovers Type II errors from conservative triage. +- **`all`** — Runner-level convenience: executes gap → unfiltered → parity → adversarial in sequence, each as a separate agent session. Stops early if a strategy finds zero new bugs. + +### Phase-by-phase execution + +Each phase produces files on disk that the next phase reads. This is how context transfers between phases — through files, not through conversation history. The key handoff files are: + +- **`quality/EXPLORATION.md`** — Phase 1 writes this, Phase 2 reads it. Contains everything Phase 2 needs to generate artifacts without re-exploring the codebase. +- **`quality/PROGRESS.md`** — Updated after every phase. Cumulative BUG tracker ensures no finding is lost. +- **Generated artifacts** (REQUIREMENTS.md, CONTRACTS.md, etc.) — Phase 2 writes these, Phases 3–5 read them to run reviews, audits, and reconciliation. + +The pattern for each phase boundary: finish the current phase, write everything to disk, then print the end-of-phase message and stop. When the user starts the next phase, read back the files you need before proceeding. This "write then read" cycle is the phase boundary — it lets you drop exploration context from working memory before loading review context, for example. + +Write your Phase 1 exploration findings to `quality/EXPLORATION.md` before proceeding. This file is mandatory in all modes. Make it thorough: domain identification, architecture map, existing tests, specification summary, quality risks, skeleton/dispatch analysis, derived requirements (REQ-NNN), and derived use cases (UC-NN). Everything Phase 2 needs to generate artifacts must be in this file. + +The discipline of writing exploration findings to disk is what forces thorough analysis. 
Without it, the model keeps vague impressions in working memory and produces broad, abstract requirements that miss function-level defects. Writing forces specificity: file paths, line numbers, exact function names, concrete behavioral rules. That specificity is what makes requirements precise enough to catch bugs during code review. + +--- + +## Run-state instrumentation (v1.5.6 — write events as you go) + +Two files in `quality/` track the run's state on disk so the run is observable in flight, resumable across crashes, and auditable afterward. Maintain both throughout the run. + +- **`quality/run_state.jsonl`** — append-only machine-readable event log. One JSON object per line. The orchestrator and any monitor read this file to know exactly where the run is. +- **`quality/PROGRESS.md`** — human-readable status file, atomically rewritten on every event. + +**Authoritative schema:** `references/run_state_schema.md`. Read it once at run start; it defines the full event taxonomy, required fields, cross-validation rules, and PROGRESS.md format. + +### Initialization (before any phase work, including Phase 0) + +If `quality/run_state.jsonl` does not exist: +1. Create `quality/` if absent. +2. Append the `_index` event to `quality/run_state.jsonl`. Required fields: `event=_index`, `ts` (ISO 8601 UTC with `Z`), `schema_version="1.5.6"`, `event_types` (array listing all event types this run will use — at minimum `_index`, `run_start`, `phase_start`, `pattern_walked`, `pass_started`, `pass_ended`, `finding_logged`, `artifact_written`, `gate_check`, `phase_end`, `error`, `run_end`), `benchmark` (target name), `lever_state` (e.g. `"baseline"` for normal runs), `started_at`. +3. Append the `run_start` event. Required fields: `event=run_start`, `ts`, `runner` (one of `claude`/`codex`/`copilot`/`cursor`), `playbook_version` (read from SKILL.md frontmatter `version` field), `target_path`. +4.
Write `quality/PROGRESS.md` per the format spec in `references/run_state_schema.md`. Include header (Started / Benchmark / Lever / Runner / Playbook version), empty phase checklist with all six phases, empty Recent events / Artifacts produced sections. + +If `quality/run_state.jsonl` already exists at run start: this is a **resumed run**. See "Resume semantics" below. + +### Per-phase events + +At every phase boundary (1 through 6), write events: + +- **At phase start:** append `{"event":"phase_start","ts":"<iso>","phase":N}` to `quality/run_state.jsonl`. Update `quality/PROGRESS.md`: mark phase N as in-progress with current timestamp. +- **At phase end:** *first* cross-validate the phase's expected artifacts (table below). If validation fails, append `{"event":"error","ts":"<iso>","phase":N,"message":"<why>","recoverable":true}` and re-run the phase. If validation passes, append `{"event":"phase_end","ts":"<iso>","phase":N,"key_counts":{...},"artifacts_produced":[...]}`. Update PROGRESS.md: check off phase N with summary stats. + +**Phase 1 sub-events (in addition to phase_start/phase_end):** +- After walking each of the seven exploration patterns: append `{"event":"pattern_walked","ts":"<iso>","phase":1,"pattern":N,"findings_count":K}`. (One event per pattern, even if zero findings.) +- When `quality/EXPLORATION.md` is written: append `{"event":"artifact_written","ts":"<iso>","relative_path":"quality/EXPLORATION.md","byte_size":<n>,"line_count":<n>}`. + +**Phase 4 sub-events:** +- At each pass start (A through D): `{"event":"pass_started","ts":"<iso>","phase":4,"pass":"A"}`. +- At each pass end: `{"event":"pass_ended","ts":"<iso>","phase":4,"pass":"A","output_artifact":"<path>"}`. + +**Phase 5 / Phase 6 sub-events:** +- At each gate-check completion: `{"event":"gate_check","ts":"<iso>","gate_name":"<name>","verdict":"pass|fail|warn|skip","reason":"<reason>"}`. + +**Run end:** +- After Phase 6 `phase_end`: append `{"event":"run_end","ts":"<iso>","status":"success","total_findings":<n>,"final_verdict":"<verdict>"}`.
Status is `aborted` for `recoverable:false` failures, `failed` for unrecoverable runtime errors. + +### Cross-validation rules at phase_end + +Verify the corresponding artifacts before writing each `phase_end` event: + +| Phase | Required | +|---|---| +| 1 | `quality/EXPLORATION.md` and `quality/PROGRESS.md` satisfy the 13-check Phase 1 gate documented at SKILL.md:1257-1273 (six required headings: `## Open Exploration Findings`, `## Quality Risks`, `## Pattern Applicability Matrix`, `## Pattern Deep Dive — *` ×3+, `## Candidate Bugs for Phase 2`, `## Gate Self-Check`; PROGRESS Phase 1 line marked `[x]`; ≥8 findings with file:line citations; ≥3 multi-location findings; 3-4 FULL pattern matrix rows; ≥2 multi-function pattern deep dives; candidate-bug source mix ≥2 from exploration/risks AND ≥1 from pattern deep dive). `bin/run_state_lib.validate_phase_artifacts(quality_dir, phase=1)` enforces the full gate. | +| 2 | All nine Generate-contract artifacts exist non-empty under `quality/`: `REQUIREMENTS.md`, `QUALITY.md`, `CONTRACTS.md`, `COVERAGE_MATRIX.md`, `COMPLETENESS_REPORT.md`, `RUN_CODE_REVIEW.md`, `RUN_INTEGRATION_TESTS.md`, `RUN_SPEC_AUDIT.md`, `RUN_TDD_TESTS.md`. Plus at least one non-empty `quality/test_functional.` (extension varies by language). | +| 3 | `quality/RUN_CODE_REVIEW.md` exists | +| 4 | `quality/REQUIREMENTS.md` non-empty AND `quality/COVERAGE_MATRIX.md` exists. If the four-pass skill-derivation pipeline ran (i.e., `quality/phase3/` exists), then `quality/phase3/pass_a_drafts.jsonl`, `quality/phase3/pass_b_citations.jsonl`, `quality/phase3/pass_c_formal.jsonl`, and the Pass D inbox under `quality/phase3/` must all exist and be non-empty. | +| 5 | `quality/results/quality-gate.log` exists, non-empty | +| 6 | `quality/BUGS.md` non-empty with `^##\s+BUG-` sections AND `quality/INDEX.md` updated with `gate_verdict` field | + +If a check fails, append the `error` event (recoverable=true) and re-run the phase. 
Do **not** write `phase_end` against missing artifacts — that's the failure mode v1.5.6 is built to catch. + +`bin/run_state_lib.validate_phase_artifacts(quality_dir, phase)` performs these checks programmatically — call it from inside the playbook session if available. + +### Resume semantics + +If `quality/run_state.jsonl` already exists when the playbook starts (a previous session crashed or paused mid-run): + +1. Read all events. Use `bin/run_state_lib.last_in_progress_phase(events)` to find the last `phase_start` not followed by a matching `phase_end` — call it the in-progress phase. +2. Run the cross-validation rules above for that phase. + - **Artifacts complete:** the prior session finished the work but didn't get to write `phase_end`. Append the missing `phase_end` (with current `ts`) and proceed to the next phase. + - **Artifacts incomplete:** re-run that phase from scratch. +3. If all six `phase_end` events are present but no `run_end`: append `run_end status=success` and finalize. +4. If no `quality/run_state.jsonl` exists: fresh run. Initialize per the section above. + +The policy: **trust artifacts more than events.** If events claim phase 4 done but `REQUIREMENTS.md` doesn't exist, re-run phase 4. If events stop mid-phase but the artifacts are complete, catch up the events. + +### PROGRESS.md atomic rewrite + +PROGRESS.md is rewritten on every event (not appended). The contents reflect the current run-state.jsonl: header (run metadata), phase checklist (with summary stats per completed phase, in-progress marker for the current phase), recent events (last 10 events from the JSONL log, in human-readable form), artifacts produced (files written this run with byte sizes). See `references/run_state_schema.md` for the exact format template. + +`bin/run_state_lib.write_progress_md(quality_dir, events, current_phase)` produces a correctly-formatted PROGRESS.md from the event list — call it after each event to keep the file in sync. 
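The resume scan that `bin/run_state_lib.last_in_progress_phase` performs might look like the following sketch. The real module ships with the skill; this is an illustrative approximation, not its actual source:

```python
def last_in_progress_phase(events):
    """Return the last phase whose phase_start event has no matching
    phase_end, or None if no phase is mid-flight. `events` is the parsed
    contents of quality/run_state.jsonl, one dict per line."""
    open_phases = []
    for ev in events:
        if ev.get("event") == "phase_start":
            open_phases.append(ev["phase"])
        elif ev.get("event") == "phase_end" and ev["phase"] in open_phases:
            open_phases.remove(ev["phase"])
    return open_phases[-1] if open_phases else None
```

Per the policy above, a non-None result only identifies the candidate phase; the artifact cross-validation decides whether to catch up the events or re-run the phase.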
+ +--- + +## Phase 0: Prior Run Analysis (Automatic) + +**This phase runs only if `quality/previous_runs/` exists and contains prior quality artifacts.** If there are no prior runs, skip to Phase 1. If `quality/previous_runs/` exists but is empty or contains no conformant quality artifacts (no subdirectories with `quality/BUGS.md` under them), skip Phase 0a and fall through to Phase 0b. (Legacy archives at `quality/runs/` from pre-v1.5.4 remain readable for backward compatibility — see SKILL.md:149 — but the canonical archive root is `quality/previous_runs/`.) + +When prior runs exist, the playbook enters **continuation mode**. This enables iterative bug discovery: each run inherits confirmed findings from prior runs, verifies them mechanically, and explores for additional bugs. The iteration converges when a run finds zero net-new bugs. + +**Step 0a: Build the seed list.** Read `quality/previous_runs/*/quality/BUGS.md` from all prior runs. For each confirmed bug, extract: bug ID, file:line, summary, and the regression test assertion. Deduplicate by file:line (the same bug found in multiple runs counts once). Write the merged seed list to `quality/SEED_CHECKS.md` with this format: + +```markdown +## Seed Checks (from N prior runs) + +| Seed | Origin Run | File:Line | Summary | Assertion | +|------|-----------|-----------|---------|-----------| +| SEED-001 | run-1 | virtio_ring.c:3509-3529 | RING_RESET dropped | `"case VIRTIO_F_RING_RESET:" in func` | ``` -If a quality playbook already exists (`quality/QUALITY.md`, functional tests, etc.), read the existing files first, then evaluate them against the self-check benchmarks in the verification phase. Don't assume existing files are complete — treat them as a starting point. +**Step 0b: Execute seed checks mechanically.** For each seed, run the assertion against the current source tree. Record PASS (bug was fixed since last run) or FAIL (bug still present). 
A failing seed is a confirmed carry-forward bug — it must appear in this run's BUGS.md regardless of whether any auditor independently finds it. A passing seed means the bug was fixed — note it in PROGRESS.md as "SEED-NNN: resolved since prior run." + +**Step 0c: Identify prior-run scope.** Read `quality/previous_runs/*/quality/PROGRESS.md` for scope declarations. Note which subsystems were covered in prior runs. During Phase 1 exploration, prioritize areas NOT covered by prior runs to maximize the chance of finding new bugs. If all subsystems were covered in prior runs, explore the same scope but with different emphasis (e.g., different scrutiny areas, different entry points). + +**Step 0d: Inject seeds into downstream phases.** The seed list becomes input to: +- **Phase 3 (code review):** Add to the code review prompt: "Prior runs confirmed these bugs — verify they are still present and look for additional findings in the same subsystems." +- **Phase 4 (spec audit):** Add to `RUN_SPEC_AUDIT.md`: "Known open issues from prior runs: [seed list]. Expect auditors to find these. If an auditor does NOT flag a known seed bug, that is a coverage gap in their review, not evidence the bug was fixed." + +**Why this exists:** Non-deterministic scope exploration means different runs notice different bugs. In cross-version testing, 4/8 repos had bugs found in some versions but not others — not because the bugs were fixed, but because the model explored different parts of the codebase. Iterating with seed injection solves this: confirmed bugs carry forward mechanically (no re-discovery needed), and each new run can focus exploration on uncovered territory. 
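Mechanical seed execution (Step 0b) can be sketched as follows. The assertion format is inferred from the single SEED_CHECKS.md example above (a Python membership expression over the named function's source, exposed as `func`); `run_seed_check` is a hypothetical helper, not one of the skill's `bin/` modules:

```python
def run_seed_check(assertion, func_source):
    """Evaluate a seed assertion against the current source of the cited
    function. PASS means the assertion holds (bug fixed since the prior
    run); FAIL means it does not (confirmed carry-forward bug)."""
    try:
        fixed = bool(eval(assertion, {"__builtins__": {}}, {"func": func_source}))
    except Exception:
        return "ERROR"  # unparsable assertion: flag for manual review
    return "PASS" if fixed else "FAIL"
```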
+ +### Phase 0b: Sibling-Run Seed Discovery (Automatic) + +**This step runs only if `quality/previous_runs/` does not exist OR `quality/previous_runs/` exists but contains no conformant quality artifacts** (i.e., Phase 0a has nothing to work with) **and** the project directory is versioned (e.g., `httpx-1.3.23/` sits alongside `httpx-1.3.21/`). If `quality/previous_runs/` exists with conformant artifacts, Phase 0a already handles seed injection — skip this step. + +**If `quality/previous_runs/` exists but is empty or contains only non-conformant subdirectories**, emit a warning: "Phase 0b: `quality/previous_runs/` exists but contains no conformant artifacts — consulting sibling versioned directories for seeds." Then proceed with the sibling discovery below. + +When no `quality/previous_runs/` directory exists but sibling versioned directories do, look for prior quality artifacts in those siblings: + +1. **Discover siblings.** List directories matching the pattern `<name>-<version>/quality/BUGS.md` relative to the parent directory. Exclude the current directory. Sort by version descending (most recent first). +2. **Import confirmed bugs as seeds.** For each sibling with a `quality/BUGS.md`, extract confirmed bugs using the same format as Step 0a. Write them to `quality/SEED_CHECKS.md` with origin noted as the sibling directory name. +3. **Execute seed checks mechanically** (same as Step 0b in Phase 0a). For each imported seed, run the assertion against the current source tree and record PASS/FAIL. +4. **Inject into downstream phases** (same as Step 0d in Phase 0a). + +**Why this exists:** In v1.3.23 benchmarking, httpx produced a zero-bug result despite httpx-1.3.21 having found the `Headers.__setitem__` non-ASCII encoding bug. The model simply explored different code paths and never examined the Headers area. Sibling-run seeding ensures that bugs confirmed in prior versioned runs carry forward even without an explicit `quality/previous_runs/` archive.
This is a different failure class than mechanical tampering — it addresses **exploration non-determinism**, not evidence corruption. --- -## Phase 1: Explore the Codebase (Do Not Write Yet) +## Phase 1: Explore the Codebase (Write As You Go) + +**v1.5.6 instrumentation:** Append `phase_start phase=1` to `quality/run_state.jsonl` now. After walking each exploration pattern, append `pattern_walked phase=1 pattern=N findings_count=K`. At phase end, cross-validate (`quality/EXPLORATION.md` ≥ 200 bytes with finding sections) then append `phase_end phase=1`. See "Run-state instrumentation" above. + +> **Required references for this phase** — read these before proceeding: +> - `references/exploration_patterns.md` — seven bug-finding patterns to apply after open exploration + +**First action: create run metadata.** Before any exploration, create the run metadata file: + +```bash +mkdir -p quality/results +cat > "quality/results/run-$(date -u +%Y-%m-%dT%H-%M-%S).json" <<'METADATA' +{ + "schema_version": "1.0", + "skill_version": "1.5.6", + "project": "", + "model": "", + "model_provider": "", + "runner": "", + "start_time": "", + "end_time": null, + "duration_minutes": null, + "phases_completed": [], + "iterations_completed": [], + "bug_count": 0, + "bug_severity": { "HIGH": 0, "MEDIUM": 0, "LOW": 0 }, + "gate_result": null, + "gate_fail_count": null, + "gate_warn_count": null, + "notes": "" +} +METADATA +``` + +Fill in `project`, `model` (exact model string, e.g., `"claude-sonnet-4-6"`), `model_provider` (e.g., `"anthropic"`, `"openai"`, `"cursor"`), `runner` (e.g., `"claude-code"`, `"copilot-cli"`, `"cursor"`), and `start_time` (UTC ISO 8601). Update this file at the end of each phase — append the completed phase to `phases_completed` and update `bug_count`/`bug_severity` as bugs are confirmed. The final update after the terminal gate fills in `end_time`, `duration_minutes`, and `gate_result`. 
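The per-phase update rule above ("append the completed phase to `phases_completed` and update `bug_count`/`bug_severity`") could look like this minimal sketch. The field names come from the metadata template; the function name and signature are hypothetical:

```python
import json
from pathlib import Path

def update_run_metadata(path, phase, new_bugs=None):
    """Record a completed phase in the run metadata file and fold in
    newly confirmed bugs.

    new_bugs maps severity ("HIGH"/"MEDIUM"/"LOW") to the count of bugs
    confirmed during the phase.
    """
    p = Path(path)
    meta = json.loads(p.read_text())
    if phase not in meta["phases_completed"]:
        meta["phases_completed"].append(phase)
    for severity, count in (new_bugs or {}).items():
        meta["bug_severity"][severity] += count
        meta["bug_count"] += count
    p.write_text(json.dumps(meta, indent=2) + "\n")
```

Keeping the update idempotent for `phases_completed` means a resumed session can safely re-run the end-of-phase bookkeeping.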
+ +**Second action: run v1.5.3 document ingest (before exploring any code).** A single stdlib-only module in `bin/` produces the authoritative documentation record that Phase 1 requirement derivation depends on: + +1. **`python -m bin.reference_docs_ingest `** — walks `reference_docs/` in the target repo once. Files under `reference_docs/cite/` are hashed and written to `quality/formal_docs_manifest.json` per `schemas.md` §4 and the §1.6 manifest wrapper. Files at the top level of `reference_docs/` are not written to the manifest but are available as Tier 4 context via `bin.reference_docs_ingest.load_tier4_context()`, which returns a sorted list of `(path, text)` tuples. If the ingest command fails (unsupported extension, non-UTF-8 bytes), stop the run and surface the stderr output to the user verbatim — ingest errors are actionable and must be fixed before exploration continues. + +**No sidecar needed.** Folder placement is the flag: top-level `reference_docs/.` files are Tier 4 context; files under `reference_docs/cite/.` are citable sources. Tier 1 is the default for `cite/` contents; a file may override to Tier 2 with an optional in-file marker on the first non-blank line: `` (Markdown) or `# qpb-tier: 2` (plaintext). `README.md` under either folder is skipped. + +**When `reference_docs/` is missing or empty**, Phase 1 MUST print this actionable message and proceed: + +> Phase 1 found no documentation in reference_docs/. The playbook will proceed +> using only Tier 3 evidence (the source tree itself). For better results, drop +> plaintext documentation into: +> reference_docs/ ← AI chats, design notes, retrospectives (Tier 4 context) +> reference_docs/cite/ ← project specs, RFCs, API contracts (citable, byte-verified) +> See README.md "Step 1: Provide documentation" for details. + +**Plaintext only — conversion happens outside the playbook.** Reference docs are `.txt` or `.md` only (schemas.md §2). PDFs, DOCX, HTML, etc. 
are rejected with an actionable conversion hint (`pdftotext`, `pandoc -t plain`, `lynx -dump`). Do NOT attempt to parse binary or formatted documents inside the skill — run the conversion outside and commit the plaintext. Spend the first phase understanding the project. The quality playbook must be grounded in this specific codebase — not generic advice. @@ -70,6 +632,44 @@ Spend the first phase understanding the project. The quality playbook must be gr **Scaling for large codebases:** For projects with more than ~50 source files, don't try to read everything. Focus exploration on the 3–5 core modules (the ones that handle the primary data flow, the most complex logic, and the most failure-prone operations). Read representative tests from each subsystem rather than every test file. The goal is depth on what matters, not breadth across everything. +**Depth over breadth (critical).** A narrow scope with function-level detail finds more bugs than a broad scope with subsystem-level summaries. For each core module you explore, identify the specific functions that implement critical behavior and document them by name, file path, and line number. Requirements derived from "the reset subsystem should handle errors" will not catch bugs. Requirements derived from "`vm_reset()` at `virtio_mmio.c:256` must poll the status register after writing zero" will. The difference between a useful exploration and a useless one is specificity — file paths, function names, line numbers, exact behavioral rules. + +**Three-stage exploration: open first, then domain risks, then selected patterns.** Exploration has three stages, and the order matters: + +1. **Open exploration (domain-driven).** Before applying any structured pattern, explore the codebase the way an experienced developer would: read the code, understand the architecture, identify risks based on your domain knowledge of what goes wrong in systems like this one. Ask yourself: "What would an expert in [this domain] check first?" 
For an HTTP library, that means redirect handling, header encoding, connection lifecycle. For a CLI framework, that means flag parsing, help generation, completion/validation consistency. For a serialization library, that means type coverage, round-trip fidelity, edge-case handling. Write concrete findings with file paths and line numbers. This stage must produce at least 8 concrete bug hypotheses or suspicious findings — not architectural observations, but specific "this code at file:line might be wrong because [reason]" findings. At least 4 must reference different modules or subsystems. + +2. **Domain-knowledge risk analysis.** After open exploration, step back from the code and reason about what you know from training about systems like this one. This is the primary bug-hunting pass for library and framework codebases. Complete the Step 6 questions below using two sources — the code you just explored AND your domain knowledge of similar systems. Generate at least 5 ranked failure scenarios, each naming a specific function, file, and line, and explaining why a domain-specific edge case produces wrong behavior. You don't need to have observed these failures — you know from training that they happen to systems of this type. Write the results to the `## Quality Risks` section of EXPLORATION.md before proceeding to patterns. + + **What this stage must NOT produce:** A section that lists defensive patterns the code already has (things the code does RIGHT) is not a risk analysis. A section that lists risky modules without specific failure scenarios is not a risk analysis. A section that concludes "this is a mature, well-tested library so basic bugs are unlikely" is actively harmful — mature libraries have the most subtle bugs, precisely because the obvious ones were found years ago. The test: could a code reviewer read each scenario and immediately know what to check? If not, the scenario is too abstract. + +3. 
**Pattern-driven exploration (selected, not exhaustive).** After open exploration and domain-risk analysis are written to disk, evaluate all seven analysis patterns from `exploration_patterns.md` using a pattern applicability matrix. For each pattern, assess whether it applies to this codebase and what it would target. Then select 3 to 4 patterns for deep-dive treatment — the highest-yield patterns for this specific codebase. The remaining patterns get a brief "not applicable" or "deferred" note with codebase-specific rationale. Do not produce deep sections for all seven patterns — depth on 3–4 beats shallow coverage of 7. Select 4 when a fourth pattern has clear applicability and would cover code areas not reached by the other three; default to 3 when in doubt. + + For each selected pattern deep dive, use the output format from the reference file and trace code paths across 2+ functions. The deep dives should pressure-test, refine, or extend the findings from the open exploration and risk analysis — not repeat them. + +The Phase 1 completion gate checks for all three stages. The open exploration section, the quality risks section, the pattern applicability matrix, and the pattern deep-dive sections must all be present. + +**Write incrementally — do not hold findings in memory.** This is the single most important execution rule in Phase 1. After you explore each subsystem or apply each pattern, **immediately append your findings to `quality/EXPLORATION.md` on disk before moving to the next subsystem or pattern.** Do not try to hold findings in working memory across multiple subsystems. The write-as-you-go discipline serves two purposes: + +1. **Depth recovery.** If you explore the PCI interrupt routing subsystem and find suspicious code at `vp_find_vqs_intx()`, write that finding to EXPLORATION.md immediately. Then when you move to the admin queue subsystem, your working memory is free to go deep there. 
Without incremental writes, findings from the first subsystem compete with findings from the second, and both end up shallow. + +2. **Nothing gets lost.** In v1.3.41 benchmarking, the model explored 8 pattern sections but wrote only 5–7 lines per section — perfectly uniform, perfectly shallow. Every section passed the gate but none went deep enough to find bugs that require tracing code paths across multiple functions. The model was trying to compose the entire EXPLORATION.md at the end, after reading everything, and could only recall the surface-level findings. Incremental writes prevent this. + +**The rhythm is: read a subsystem → write findings to disk → read the next subsystem → append findings → repeat.** Each append should include specific function names, file paths, line numbers, and concrete bug hypotheses. A 5-line section that says "checked cross-implementation consistency, found one gap" is a gate-passing placeholder, not an exploration finding. A useful section traces a code path: "function A at file:line calls function B at file:line, which does X but not Y; compare with function C at file:line which does both X and Y." + +**Mandatory consolidation step.** After all three stages (open exploration, quality risks, and selected pattern deep dives) are explored and written to EXPLORATION.md, add a final section: `## Candidate Bugs for Phase 2`. This section consolidates the strongest bug hypotheses from all earlier sections into a prioritized handoff list. For each candidate, include: the hypothesis, the specific file:line references, which stage surfaced it (open exploration, quality risks, or pattern), and what the code review should look for. This section is the bridge between exploration and artifact generation — it tells Phase 3 exactly where to focus. Minimum: 4 candidate bugs with file:line references — at least 2 from open exploration or quality risks, and at least 1 from a pattern deep dive. There is no maximum. 
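A mechanical version of the Phase 1 completion check (the ≥ 200-byte cross-validation plus the consolidation minimum) could look like the sketch below. Only the `## Quality Risks` and `## Candidate Bugs for Phase 2` headings and the 4-reference minimum are fixed by the text above; everything else here is an illustrative assumption:

```python
import re
from pathlib import Path

REQUIRED_HEADINGS = ("## Quality Risks", "## Candidate Bugs for Phase 2")

def exploration_gate(path="quality/EXPLORATION.md", min_bytes=200):
    """Return a list of gate failures; an empty list means the artifact passes."""
    p = Path(path)
    if not p.is_file():
        return ["EXPLORATION.md missing"]
    text = p.read_text(encoding="utf-8")
    problems = []
    if len(text.encode("utf-8")) < min_bytes:
        problems.append("EXPLORATION.md under %d bytes" % min_bytes)
    for heading in REQUIRED_HEADINGS:
        if heading not in text:
            problems.append("missing section: " + heading)
    # The candidate handoff list must carry at least 4 file:line citations.
    tail = text.split("## Candidate Bugs for Phase 2")[-1]
    refs = re.findall(r"[\w./-]+:\d+", tail)
    if len(refs) < 4:
        problems.append("only %d file:line references in candidate list "
                        "(need 4)" % len(refs))
    return problems
```

A check like this catches the gate-passing-placeholder failure mode mechanically: a section heading with no `file:line` substance underneath it still fails.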
+ +**Pre-flight: Scope declaration for large repositories** + +Before exploring any source code, estimate scale: approximate source-file count (excluding tests, docs, and generated files), major subsystem count, and documentation volume. Note the count in PROGRESS.md. + +- **Fewer than 200 source files:** Proceed with full exploration. The depth-vs-breadth guidance above still applies. +- **200–500 source files:** Declare your intended scope before exploring. Write a `## Scope declaration` section to PROGRESS.md naming the 3–5 subsystems you will cover, the expected file count for each, and which subsystems you are deferring with rationale. Then proceed with exploration of the declared scope only. +- **More than 500 source files:** Stop and write a mandatory scope declaration to PROGRESS.md before reading any source files. The scope declaration must include: (a) the subsystems covered in this run, (b) the subsystems explicitly deferred, (c) the exclusion rationale for each deferred subsystem, and (d) recommended subsystem scope for follow-on runs. Do not begin exploration until this is written. A scope declaration that covers "everything" is not valid for repositories above this threshold. + +**Resuming a previous session:** If PROGRESS.md already exists and shows phases marked complete, read it first. Do not redo phases already marked complete — resume from the first phase marked incomplete. If a scope declaration is already written, honor it exactly. If the previous session's scope declaration deferred subsystems, do not expand scope to cover them unless this run is explicitly a follow-on for the deferred areas. + +**Specification-primary repositories:** Some repositories ship a specification, configuration, or protocol document as their primary product, with executable code as supporting infrastructure. Examples: a skill definition with benchmark tooling, a schema registry with validation scripts, a pipeline config with orchestration helpers. 
When the primary product is a specification rather than executable code, derive requirements from the specification's internal consistency, completeness, and correctness — not just from the executable code paths. The specification is the thing users depend on; the tooling is secondary. If you find yourself writing 80%+ of requirements about helper scripts and <20% about the primary specification, you have the focus inverted. + ### Step 0: Ask About Development History Before exploring code, ask the user one question: @@ -87,6 +687,8 @@ This context is gold. A chat history where the developer discussed "why we chose If the user doesn't have chat history, proceed normally — the skill works without it, just with less context. +**Autonomous fallback:** When running in benchmark mode, via `bin/run_playbook.py` (benchmark runner, not shipped with the skill), or without user interaction (e.g., `--single-pass`), skip Step 0's question and proceed directly to Step 1. If chat history folders are visible in the project tree (e.g., `AI Chat History/`, `.chat_exports/`), scan them without asking. If no chat history is found, proceed — do not block waiting for a response that won't come. + ### Step 1: Identify Domain, Stack, and Specifications Read the README, existing documentation, and build config (`pyproject.toml` / `package.json` / `Cargo.toml`). Answer: @@ -114,6 +716,33 @@ When working from non-formal requirements, label each scenario and test with a * Use this exact tag format in QUALITY.md scenarios, functional test documentation, and spec audit findings. It makes clear which requirements are authoritative and which need validation. +### Step 1b: Evaluate Documentation Depth + +If `reference_docs/` exists, read every file in it before deciding which subsystems to focus on. 
For each document, classify its depth: + +- **Deep** — contains internal contracts, safety invariants, concurrency models, defensive patterns, error handling details, or line-number-level source references. Suitable for deriving requirements. +- **Moderate** — covers architecture and API surface with some implementation detail. Useful for orientation but insufficient alone for requirement derivation. +- **Shallow** — API catalog, feature overview, or marketing-level summary. Lists what exists but not how it works, how it fails, or what contracts it enforces. **Not sufficient for scoping decisions.** + +**The scoping rule:** Do not narrow the audit scope to only the subsystems that have deep documentation. If the most complex or most failure-prone module has only shallow documentation, that is a **documentation gap to flag in PROGRESS.md**, not a reason to skip the module. The highest-risk code with the thinnest documentation is where bugs hide — auditing only well-documented areas produces a safe-looking report that misses real defects. + +When documentation is shallow for a high-risk area: + +1. Note the gap explicitly in PROGRESS.md under a `## Documentation depth assessment` section. +2. Derive requirements from source code directly (doc comments, safety annotations, defensive patterns, existing tests) and tag them as `[Req: inferred — from source]`. +3. Flag the area for deeper documentation gathering in the completeness report. + +Record the depth classification for each `reference_docs/` file in PROGRESS.md so reviewers can assess whether the documentation influenced the scope appropriately. 
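As a first-pass triage only, the depth call can be approximated with a crude heuristic — counting `file.ext:NN` citations and deep-signal vocabulary. This is a sketch under loose assumptions (the signal list and thresholds are invented); the real Deep/Moderate/Shallow classification stays a judgment call, and a heuristic like this should only pre-populate the PROGRESS.md table for review:

```python
import re

DEEP_SIGNALS = ("invariant", "concurren", "lock", "error handling",
                "safety", "atomic", "race")

def classify_doc_depth(text):
    """Crude heuristic triage for the Deep / Moderate / Shallow call."""
    lowered = text.lower()
    line_refs = len(re.findall(r"[\w/]+\.\w+:\d+", text))  # file.ext:NN citations
    signals = sum(1 for s in DEEP_SIGNALS if s in lowered)
    if line_refs >= 3 or signals >= 3:
        return "Deep"
    if line_refs >= 1 or signals >= 1:
        return "Moderate"
    return "Shallow"
```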
+ +**Coverage commitment table:** After classifying all `reference_docs/` documents, produce this table in PROGRESS.md under the `## Documentation depth assessment` section: + +| Document | Depth | Subsystem | Requirements commitment | If excluded: justification | +|----------|-------|-----------|------------------------|---------------------------| + +For every **deep** document, map it to the subsystem it covers, then either commit to deriving requirements from it ("will cover in Phase 2") or provide a specific justification that names the tradeoff. A sentence like "out of scope for this run" is not sufficient — the justification must say *why*, e.g., "interpreter JIT is excluded because this run focuses on the parser/compiler/GC pipeline; separate run recommended." + +**Gate:** A high-risk subsystem documented deeply in `reference_docs/` must not silently disappear from the requirements set. If a deep document has a "will cover" commitment but produces zero requirements by the end of Step 7, the requirements pipeline is incomplete — go back and derive requirements for the gap before proceeding to Phase 2 artifact generation. + ### Step 2: Map the Architecture List source directories and their purposes. Read the main entry point, trace execution flow. Identify: @@ -135,7 +764,7 @@ Read the existing test files — all of them for small/medium projects, or a rep Walk each spec document section by section. For every section, ask: "What testable requirement does this state?" Record spec requirements without corresponding tests — these are the gaps the functional tests must close. -If using inferred requirements (from tests, types, or code behavior), tag each with its confidence tier using the `[Req: tier — source]` format defined in Step 1. Inferred requirements feed into QUALITY.md scenarios and should be flagged for user review in Phase 4. 
+If using inferred requirements (from tests, types, or code behavior), tag each with its confidence tier using the `[Req: tier — source]` format defined in Step 1. Inferred requirements feed into QUALITY.md scenarios and should be flagged for user review in Phase 7. ### Step 4b: Read Function Signatures and Real Data @@ -179,7 +808,9 @@ If the project has a validation layer (Pydantic models in Python, JSON Schema, T **Read `references/schema_mapping.md`** for the mapping format and why this matters for writing valid boundary tests. -### Step 6: Identify Quality Risks (Code + Domain Knowledge) +### Step 6: Domain-Knowledge Risk Analysis (Code + Domain Knowledge) + +**This is the primary bug-hunting pass for library and framework codebases.** Complete it before selecting any structured patterns. Write the results to the `## Quality Risks` section of EXPLORATION.md immediately — do not hold them in memory. Every project has a different failure profile. This step uses **two sources** — not just code exploration, but your training knowledge of what goes wrong in similar systems. @@ -190,159 +821,1761 @@ Every project has a different failure profile. This step uses **two sources** - Where do cross-cutting concerns hide? **From domain knowledge**, ask: -- "What goes wrong in systems like this?" — If it's a batch processor, think about crash recovery, idempotency, silent data loss, state corruption. If it's a web app, think about auth edge cases, race conditions, input validation bypasses. If it handles randomness or statistics, think about seeding, correlation, distribution bias. -- "What produces correct-looking output that is actually wrong?" — This is the most dangerous class of bug: output that passes all checks but is subtly corrupted. +- "What goes wrong in systems like this?" — If it's an HTTP router, think about header parsing edge cases (quality values, token lists, case sensitivity), middleware ordering dependencies, and path normalization. 
If it's an HTTP client, think about redirect credential stripping, encoding detection, and connection state leaking. If it's a serialization library, think about null handling asymmetry, API surface consistency between direct methods and view wrappers, lazy evaluation caching bugs, and round-trip fidelity. If it's a web framework, think about response helper edge cases, configuration compilation chains, and middleware state isolation. If it's a batch processor, think about crash recovery, idempotency, silent data loss, state corruption. If it handles randomness or statistics, think about seeding, correlation, distribution bias. +- "What produces correct-looking output that is actually wrong?" — This is the most dangerous class of bug: output that passes all checks but is subtly corrupted. A response with a `200 OK` but the wrong `Content-Type`. A redirect that succeeds but leaks credentials. A deserialized object that has silently truncated values. - "What happens at 10x scale that doesn't happen at 1x?" — Chunk boundaries, rate limits, timeout cascading, memory pressure. - "What happens when this process is killed at the worst possible moment?" — Mid-write, mid-transaction, mid-batch-submission. +- "Where do two surfaces that should behave the same drift on edge inputs?" — Overloads, aliases, sync/async APIs, builder vs direct APIs, direct mutators vs live views/wrappers, stdlib-compatible wrappers vs framework-native surfaces. For Java/Kotlin: `add(null)` vs `asList().add(null)`, `put(key,null)` vs `asMap().put(key,null)`. For Python: constructor encoding vs mutator encoding, sync vs async client behavior. +- "What emits plausible output with subtly wrong metadata?" — Content type, charset, route pattern, ETag strength, byte count, auth/header/cookie propagation, status code, cache validators. +- "What standard grammar or list syntax is being parsed with ad hoc string logic?" 
— Quality values (`q=0`), comma-separated headers, digest challenges, MIME types with parameters, query strings, enum/keyword sets, cookie merging. +- "What edge-case inputs would a domain expert reach for?" — For HTTP code: `Accept-Encoding: gzip;q=0`, `Connection: keep-alive, Upgrade`, `Content-Type: application/problem+json`. For serialization code: `null` through different API surfaces, values at `Integer.MAX_VALUE + 1`, round-tripping through encode-then-decode. For routing code: overlapping patterns, mounted prefix propagation, same path with different methods. - "What information does the user need before committing to an irreversible or expensive operation?" — Pre-run cost estimates, confirmation of scope (especially when fan-out or expansion will multiply the work), resource warnings. If the system can silently commit the user to hours of processing or significant cost without showing them what they're about to do, that's a missing safeguard. Search for operations that start long-running processes, submit batch jobs, or trigger expansion/fan-out — and check whether the user sees a preview, estimate, or confirmation with real numbers before the point of no return. - "What happens when a long-running process finishes — does it actually stop?" — Polling loops, watchers, background threads, and daemon processes that run until completion should have explicit termination conditions. If the loop checks "is there more work?" but never checks "is all work done?", it will run forever after completion. This is especially common in batch processors and queue consumers. -Generate realistic failure scenarios from this knowledge. You don't need to have observed these failures — you know from training that they happen to systems of this type. Write them as **architectural vulnerability analyses** with specific quantities and consequences. Frame each as "this architecture permits the following failure mode" — not as a fabricated incident report. 
Use concrete numbers to make the severity non-negotiable: "If the process crashes mid-write during a 10,000-record batch, `save_state()` without an atomic rename pattern will leave a corrupted state file — the next run gets JSONDecodeError and cannot resume without manual intervention." Then ground them in the actual code you explored: "Read persistence.py line ~340 (save_state): verify temp file + rename pattern." +Generate at least 5 ranked failure scenarios from this knowledge. You don't need to have observed these failures — you know from training that they happen to systems of this type. Write them as **specific bug hypotheses with file-path and line-number citations**, ranked by priority. Frame each as: "Because [code at file:line] does [X], a [domain-specific edge case] will produce [wrong behavior] instead of [correct behavior]." Then ground them in the actual code you explored: "Read persistence.py line ~340 (save_state): verify temp file + rename pattern." ---- +**Anti-patterns that fail the gate:** A Quality Risks section that lists defensive patterns the code already has (things the code does right) is not a risk analysis — it is a reassurance exercise and will not find bugs. A section that lists risky modules without specific failure scenarios is not actionable. A section that concludes "this is a mature, well-tested library so basic bugs are unlikely" is actively harmful — mature libraries have the most subtle API-contract and edge-case bugs, precisely because the obvious ones were found years ago. The test: could a code reviewer read each scenario and immediately know what function to open and what input to test? If not, the scenario is too abstract. -## Phase 2: Generate the Quality Playbook +### Step 7: Derive Testable Requirements -Now write the six files. For each one, follow the structure below and consult the relevant reference file for detailed guidance. 
+**Read `references/requirements_pipeline.md`** for the complete five-phase pipeline, domain checklist, and versioning protocol. -**Why six files instead of just tests?** Tests catch regressions but don't prevent new categories of bugs. The quality constitution (`QUALITY.md`) tells future sessions what "correct" means before they start writing code. The protocols (`RUN_*.md`) provide structured processes for review, integration testing, and spec auditing that produce repeatable results — instead of leaving quality to whatever the AI feels like checking. Together, these files create a quality system where each piece reinforces the others: scenarios in QUALITY.md map to tests in the functional test file, which are verified by the integration protocol, which is audited by the Council of Three. +This is the most important step for the code review protocol. Everything found during exploration — specs, ChangeLog entries, config structs, source comments, chat history — gets distilled into a set of testable requirements that the code review will verify. The pipeline separates contract discovery from requirement derivation, uses file-based external memory, and includes mechanical verification with a completeness gate. -### File 1: `quality/QUALITY.md` — Quality Constitution +**Why this matters:** Structural code review catches about 65% of real defects. The remaining 35% are intent violations — absence bugs, cross-file contradictions, and design gaps. These are invisible to code reading because the code that IS there is correct. You need to know what the code is supposed to do, then check whether it does it. That's what testable requirements provide. -**Read `references/constitution.md`** for the full template and examples. +**The five-phase pipeline:** -The constitution has six sections: +1. **Phase A — Contract extraction.** Read all source files, list every behavioral contract. Write to `quality/CONTRACTS.md`. 
This is discovery — list everything, even if it seems obvious. +2. **Phase B — Requirement derivation.** Read CONTRACTS.md and documentation. Group related contracts, enrich with user intent, write formal requirements. Write REQ records to `quality/requirements_manifest.json` (source of truth) and render to `quality/REQUIREMENTS.md`. For each requirement, record the `tier` (1–5 per schemas.md §3.1) and — when `tier ∈ {1, 2}` — the `citation` block produced by `bin/reference_docs_ingest` invoking `bin/citation_verifier` per schemas.md §5.4 / §5.5. The LLM does not shell out to `citation_verifier` directly; the excerpt is a product of the ingest pipeline and is re-verified by `quality_gate.py` at gate time. For Tier 3 REQs (code-is-the-spec), cite the source `file:line` in the `description`; citations are for FORMAL_DOC references only and must not appear on Tier 3/4/5 REQs. The tier + citation pair creates the forward link in the traceability chain: reference_docs/cite → requirements → bugs → tests. See the tier/citation framing block later in this step for the full field list and the Tier-1-wins-over-Tier-2 rule. -1. **Purpose** — What quality means for this project, grounded in Deming (built in, not inspected), Juran (fitness for use), Crosby (quality is free). Apply these specifically: what does "fitness for use" mean for *this system*? Not "tests pass" but the actual operational requirement. -2. **Coverage Targets** — Table mapping each subsystem to a target with rationale referencing real risks. Every target must have a "why" grounded in a specific scenario — without it, a future AI session will argue the target down. -3. **Coverage Theater Prevention** — Project-specific examples of fake tests, derived from what you saw during exploration. (Why: AI-generated tests often pad coverage numbers without catching real bugs — asserting that imports worked, that dicts have keys, or that mocks return what they were configured to return. 
Calling this out explicitly stops the pattern.) -4. **Fitness-to-Purpose Scenarios** — The heart of it. Each scenario documents a realistic failure mode with code references and verification method. Aim for 2+ scenarios per core module — typically 8–10 total for a medium project, fewer for small projects, more for complex ones. Quality matters more than count: a scenario that precisely captures a real architectural vulnerability is worth more than three generic ones. (Why: Coverage percentages tell you how much code ran, not whether it ran correctly. A system can have 95% coverage and still lose records silently. Fitness scenarios define what "working correctly" actually means in concrete terms that no one can argue down.) -5. **AI Session Quality Discipline** — Rules every AI session must follow -6. **The Human Gate** — Things requiring human judgment + **Optional `Pattern:` field on REQs.** A requirement that needs a Phase 3 + compensation grid should declare its pattern class: -**Scenario voice is critical.** Write "What happened" as architectural vulnerability analyses with specific quantities, cascade consequences, and detection difficulty — not as abstract specifications. "Because `save_state()` lacks an atomic rename pattern, a mid-write crash during a 10,000-record batch will leave a corrupted state file — the next run gets JSONDecodeError and cannot resume. At scale, this risks silent loss of 1,693+ records with no detection mechanism." An AI session reading that will not argue the standard down. Use your knowledge of similar systems to generate realistic failure scenarios, then ground them in the actual code you explored. Scenarios come from both code exploration AND domain knowledge about what goes wrong in systems like this. + - `Pattern: whitelist` — authoritative list of items, every site must handle + each one. + - `Pattern: parity` — symmetric operations that must match + (encode↔decode, setup↔teardown). 
+ - `Pattern: compensation` — sites that must compensate for a shared gap. -Every scenario's "How to verify" must map to at least one test in the functional test file. + Missing the field means no grid. Setting an invalid value fails + `quality_gate.py`. -### File 2: Functional Tests + **Preservation rule (Phase 2).** While `Pattern:` is optional in the design + sense (some REQs are single-site and need no grid), it is REQUIRED to + preserve when the Phase-1 hypothesis already carried it. Phase 2 must + transcribe `Pattern:` from EXPLORATION.md to `quality/REQUIREMENTS.md` and + `quality/requirements_manifest.json` whenever present. Silent omission is a + documented v1.4.5-regression vector — the Phase 5 cardinality gate cannot + enforce coverage on a REQ it doesn't know is pattern-tagged. The gate's + structural backstop (C13.7/Fix 2) cross-checks REQs that carry per-site UC + references (`UC-N.a`/`UC-N.b` form emitted by Phase 1's Cartesian UC rule) + and fails the gate if Pattern is missing on such a REQ. -**This is the most important deliverable.** Read `references/functional_tests.md` for the complete guide. + **Primary-source extraction rule for code-presence claims.** When writing a requirement that asserts specific constants, values, or labels are handled by a specific function (e.g., "the whitelist must preserve X, Y, and Z"), the requirement must distinguish between what the **spec says should be there** and what the **code actually contains**. Extract the actual contents from the code (case labels, map keys, if-else branches) and compare to the spec's list. If a constant appears in the spec but NOT in the code, write the requirement as "must handle X — **[NOT IN CODE]**: defined in header.h:NN but absent from function() at file.c:NN-NN." Do not write "must preserve X" without verifying X is actually preserved. 
This prevents a contamination chain where a requirement asserts code presence, the code review copies the assertion, the spec audit inherits it, and the triage accepts it — all without anyone reading the actual code. This exact chain was observed in v1.3.17 virtio testing: REQUIREMENTS.md asserted RING_RESET was preserved in a switch, the code review copied the list, three spec auditors inherited the claim, and the bug went undetected. + **Mechanical verification artifact for dispatch functions (mandatory).** When a contract asserts that a function handles, preserves, or dispatches a set of named constants (feature bits, enum values, opcode tables, event types, handler registries), you must generate and execute a shell command or script that mechanically extracts the actual case labels/branches from the function body **before writing the contract line**. Save the raw output to `quality/mechanical/<function>_cases.txt`. The command must be a non-interactive pipeline (e.g., `awk` + `grep`) that cannot hallucinate — it reads file bytes and prints matches. Example: -Organize the tests into three logical groups (classes, describe blocks, modules, or whatever the test framework uses): + ```bash + awk '/void vring_transport_features/,/^}$/' drivers/virtio/virtio_ring.c \ + | grep -E '^\s*case\s+' > quality/mechanical/vring_transport_features_cases.txt + ``` -- **Spec requirements** — One test per testable spec section. Each test's documentation cites the spec requirement it verifies. -- **Fitness scenarios** — One test per QUALITY.md scenario. 1:1 mapping, named to match. -- **Boundaries and edge cases** — One test per defensive pattern from Step 5. + After execution, read the output file and use it as the sole source of truth for what the function handles. A contract line asserting "function preserves constant X" is **forbidden** unless `quality/mechanical/<function>_cases.txt` contains a matching `case X:` line.
If a constant appears in a spec or header but NOT in the mechanical output, the contract must record it as absent: `"must handle X — **[NOT IN CODE]**: defined in header.h:NN but absent from function() per mechanical check."` Downstream artifacts (`REQUIREMENTS.md`, `RUN_SPEC_AUDIT.md`, code review) must cite the mechanical file path when referencing dispatch-function coverage — they may not replace the mechanical output with a hand-written list. -Key rules: -- **Match the existing import pattern exactly.** Read how existing tests import project modules and do the same thing. Getting this wrong means every test fails. -- **Read every function's signature before calling it.** Read the actual `def` line — parameter names, types, defaults. Read real data files from the project to understand data shapes. Do not guess at function parameters or fixture structures. -- **No placeholder tests.** Every test must import and call actual project code. If the body is `pass` or the assertion is trivial (`assert isinstance(x, list)`), delete it. A test that doesn't exercise project code inflates the count and creates false confidence. -- **Test count heuristic** = (testable spec sections) + (QUALITY.md scenarios) + (defensive patterns). For a medium project (5–15 source files), this typically yields 35–50 tests. Significantly fewer suggests missed requirements or shallow exploration. Significantly more is fine if every test is meaningful — don't pad to hit a number. -- **Cross-variant heuristic: ~30%** — If the project handles multiple input types, aim for roughly 30% of tests parametrized across all variants. The exact percentage matters less than ensuring every cross-cutting property is tested across all variants. -- **Test outcomes, not mechanisms** — Assert what the spec says should happen, not how the code implements it. -- **Use schema-valid mutations** — Boundary tests must use values the schema accepts (from Step 5b), not values it rejects. 
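The `[NOT IN CODE]` rule above is a mechanical comparison, so it can be scripted rather than eyeballed. A minimal Python sketch, assuming a spec-side constant list and the raw text of a saved `*_cases.txt` extraction — the constant names below are illustrative, not from any real driver:

```python
import re

def annotate_spec_constants(spec_constants, cases_text):
    """Flag spec constants that the mechanical extraction shows are
    absent from the dispatch function's case labels."""
    # Case labels actually present in the extracted function body
    handled = set(re.findall(r'case\s+(\w+)\s*:', cases_text))
    annotations = {}
    for name in spec_constants:
        if name in handled:
            annotations[name] = "must handle " + name
        else:
            # Spec claims it, code doesn't show it: record the gap
            annotations[name] = ("must handle " + name +
                                 " — [NOT IN CODE]: absent from mechanical extraction")
    return annotations

# Illustrative *_cases.txt contents (hypothetical constants):
cases = "    case FEATURE_A:\n    case FEATURE_B:\n"
result = annotate_spec_constants(["FEATURE_A", "FEATURE_C"], cases)
print(result["FEATURE_C"])  # → must handle FEATURE_C — [NOT IN CODE]: absent from mechanical extraction
```

The contract writer consumes `annotations` verbatim instead of composing "preserves X" claims from memory.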
+ **Mechanical artifact integrity check (mandatory).** For each mechanical extraction command, also append it to `quality/mechanical/verify.sh` as a verification step. The script must re-run the same extraction pipeline and diff the result against the saved file. Generate `verify.sh` with this structure: -### File 3: `quality/RUN_CODE_REVIEW.md` + ```bash + #!/bin/bash + # Auto-generated: re-run mechanical extraction commands and verify saved artifacts + set -euo pipefail + FAIL=0 + + # Verify vring_transport_features_cases.txt + ACTUAL=$(awk '/void vring_transport_features/,/^}$/' drivers/virtio/virtio_ring.c | grep -E '^\s*case\s+') + SAVED=$(cat quality/mechanical/vring_transport_features_cases.txt) + if [ "$ACTUAL" != "$SAVED" ]; then + echo "MISMATCH: vring_transport_features_cases.txt" + diff <(echo "$ACTUAL") <(echo "$SAVED") || true + FAIL=1 + else + echo "OK: vring_transport_features_cases.txt" + fi + + exit $FAIL + ``` -**Read `references/review_protocols.md`** for the template. + **Phase 6 must execute `bash quality/mechanical/verify.sh`** and the benchmark fails if any artifact mismatches. This catches a failure mode observed in v1.3.19: the model executed the extraction command but wrote its own expected output to the file instead of letting the shell redirect capture it, inserting a hallucinated `case VIRTIO_F_RING_RESET:` line that the real command does not produce. Re-running the same command in a separate step and diffing against the file detects this tampering. + **Immediate integrity gate (mandatory, Phase 2a).** Run `bash quality/mechanical/verify.sh` **immediately** after writing each `*_cases.txt` file and **before** writing any contract, requirement, or prose artifact that cites the extraction. If exit code ≠ 0: stop, delete the failed `*_cases.txt`, re-run the extraction command with a fresh shell redirect (do not hand-edit the output), and re-verify.
Do not advance to Phase 3/2c until verify.sh exits 0. Save verify.sh stdout and exit code to `quality/results/mechanical-verify.log` and `quality/results/mechanical-verify.exit` as durable receipt files. This gate exists because v1.3.23 showed that deferring verification to Phase 6 allows downstream artifacts (CONTRACTS.md, REQUIREMENTS.md, triage probes) to build on a forged extraction — the model reconciles a discrepancy between requirements and the artifact by corrupting the artifact instead of correcting the requirement. -- Line numbers are mandatory — no line number, no finding -- Read function bodies, not just signatures -- If unsure: flag as QUESTION, not BUG -- Grep before claiming missing -- Do NOT suggest style changes — only flag things that are incorrect + **Mechanical artifacts are immutable after extraction.** Once a `*_cases.txt` file has been written by the shell redirect and verified by `verify.sh`, it must not be modified, overwritten, or regenerated for the remainder of the run. If a downstream step discovers a discrepancy between the mechanical artifact and a requirement or contract, the requirement or contract is wrong — not the artifact. Fix the prose, not the extraction. This rule prevents the v1.3.23 failure mode where the model overwrote a correct extraction with fabricated content to match its own narrative. -**Phase 2: Regression tests.** After the review produces BUG findings, write regression tests in `quality/test_regression.*` that reproduce each bug. Each test should fail on the current implementation, confirming the bug is real. Report results as a confirmation table (BUG CONFIRMED / FALSE POSITIVE / NEEDS INVESTIGATION). See `references/review_protocols.md` for the full regression test protocol. 
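The durable-receipt step is easy to get wrong by hand. A hedged Python sketch of a gate runner — the receipt paths come from the text above, but the runner itself and its name are assumptions, not part of the playbook's tooling:

```python
import pathlib
import subprocess
import sys

def run_integrity_gate(repo_root="."):
    """Run quality/mechanical/verify.sh and persist durable receipts.

    Writes stdout to quality/results/mechanical-verify.log and the exit
    code to quality/results/mechanical-verify.exit, then fails loudly on
    a nonzero exit so no downstream artifact builds on a forged extraction.
    """
    root = pathlib.Path(repo_root)
    results = root / "quality" / "results"
    results.mkdir(parents=True, exist_ok=True)
    proc = subprocess.run(
        ["bash", "quality/mechanical/verify.sh"],
        cwd=root, capture_output=True, text=True)
    # Durable receipt files: log + exit code survive the session
    (results / "mechanical-verify.log").write_text(proc.stdout)
    (results / "mechanical-verify.exit").write_text(str(proc.returncode) + "\n")
    if proc.returncode != 0:
        sys.exit("integrity gate failed — re-run the extraction; do not hand-edit")
    return proc.returncode
```

Running it after each `*_cases.txt` write gives the Phase 2a gate a machine-checkable trail instead of a claim in prose.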
+ **Forbidden probe pattern (triage and verification).** Triage probes, verification probes, and audit assertions must not use `open('quality/mechanical/...')` or `cat quality/mechanical/...` as sole evidence for what a source file contains at a given line. To verify that function F handles constant C at line N, the probe must either: (a) read the source file directly (`open('drivers/virtio/virtio_ring.c')` with line-anchored assertions), or (b) re-execute the same extraction pipeline used by `verify.sh` and check its output. Reading the saved artifact proves only what the artifact says, not what the code says — this is circular verification. In v1.3.23, Probe C validated the forged artifact instead of the source code, passing with fabricated data. -### File 4: `quality/RUN_INTEGRATION_TESTS.md` + **Do not create an empty mechanical/ directory.** Only create `quality/mechanical/` if the project's contracts include dispatch functions, registries, or enumeration checks that require mechanical extraction. If no such contracts exist, skip the directory entirely and record in PROGRESS.md: `Mechanical verification: NOT APPLICABLE — no dispatch/registry/enumeration contracts in scope.` Creating an empty mechanical/ directory (or one without verify.sh) is non-conformant — it signals that extraction was attempted and abandoned. Decide before creating the directory: does this project have dispatch-function contracts? If no, don't `mkdir`. If yes, populate it fully. -**Read `references/review_protocols.md`** for the template. + **Normative vs. descriptive split.** Requirements and contracts must use normative language ("must preserve," "should handle") for expected behavior. They may only use descriptive language ("preserves," "handles") when the mechanical verification artifact confirms the claim. 
A requirement that says "the implementation preserves VIRTIO_F_RING_RESET" without a confirming mechanical artifact is non-conformant — write "the implementation **must** preserve VIRTIO_F_RING_RESET" and cite the mechanical check result showing whether the constant is currently present or absent. -Must include: safety constraints, pre-flight checks, test matrix with specific pass criteria, an execution UX section, and a structured reporting format. Cover happy path, cross-variant consistency, output correctness, and component boundaries. +3. **Phase C — Coverage verification.** Cross-reference every contract against every requirement. Fix gaps. Loop up to 3 times until coverage reaches 100%. Write to `quality/COVERAGE_MATRIX.md`. The matrix must have **one row per requirement** (REQ-001, REQ-002, etc.) — not grouped ranges like "C-001 to C-007 | REQ-001, REQ-003". Grouped ranges make machine verification impossible and hide gaps. +4. **Phase D — Completeness check + self-refinement loop.** Apply the domain checklist, testability audit, and cross-requirement consistency check. Also verify that every deep document with a "will cover" commitment in the coverage commitment table has at least one requirement traced to it — if not, add requirements for the gap before continuing. -**All commands must use relative paths.** The generated protocol should include a "Working Directory" section at the top stating that all commands run from the project root using relative paths. Never generate commands that `cd` to an absolute path — this breaks when the protocol is run from a different machine or directory. Use `./scripts/`, `./pipelines/`, `./quality/`, etc. + Write to `quality/COMPLETENESS_REPORT.md` as a **baseline** completeness report (without a `## Verdict` section — the verdict is deferred to Phase 5 post-reconciliation, which produces the only verdict that counts for closure). Then run up to 3 self-refinement iterations: read the report, fix gaps, re-check. 
Short-circuit when fewer than 3 changes per iteration. +5. **Phase E — Narrative pass.** Add project overview (with overview validation gate), then derive use cases (with use case derivation gate). Both gates must pass before proceeding to category narratives, cross-cutting concerns, and final reordering. This sequencing prevents multi-pass loops where a failed late gate forces re-derivation. Reorder for top-down flow. Renumber sequentially. -**Include an Execution UX section.** When someone tells an AI agent to "run the integration tests," the agent needs to know how to present its work. The protocol should specify three phases: (1) show the plan as a numbered table before running anything, (2) report one-line progress updates as each test runs (`✓`/`✗`/`⧗`), (3) show a summary table with pass/fail counts and a recommendation. See `references/review_protocols.md` section "Execution UX" for the template and examples. Without this, the agent dumps raw output or stays silent — neither is useful. +**REQUIREMENTS.md must begin with a human-readable overview** that answers: What is this project? What does it do? Who are the actors (users, systems, hardware, protocols)? What are the highest-risk areas? This overview should be useful to someone who has never seen the project before. If the project is a library or driver where all actors are systems, describe the system actors (kernel maintainers, protocol peers, integrators, end-user developers) and their interactions. Do not start with raw scope metadata or HTML comments — lead with a plain-language description. -**This protocol must exercise real external dependencies.** If the project talks to APIs, databases, or external services, the integration test protocol runs real end-to-end executions against those services — not just local validation checks. Design the test matrix around the project's actual execution modes and external dependencies. 
Look for API keys, provider abstractions, and existing integration test scripts during exploration and build on them. +**Overview validation gate (mandatory).** After writing the overview, perform this self-check before proceeding to use case derivation: -**Derive quality gates from the code, not generic checks.** Read validation rules, schema enums, and generation logic during exploration. Turn them into per-pipeline quality checks with specific fields and acceptable value ranges. "All units validated" is not enough — the protocol must verify domain-specific correctness. +> Does this overview describe the project the way its actual users would recognize it? Specifically: +> - Does it name the project's ecosystem role and real-world significance? +> - Does it identify who depends on it and for what? +> - Would a developer who uses this project daily say "yes, that's what it is and why it matters"? +> - For well-known projects, does it reflect publicly known adoption (e.g., Cobra → kubectl/Hugo/GitHub CLI; Express → millions of Node.js API servers; Zod → form validation/tRPC; Serde → the default Rust serialization layer)? -**Script parallelism, don't just describe it.** Group runs so independent executions (different providers) run concurrently. Include actual bash commands with `&` and `wait`. One run per provider at a time to avoid rate limits. +If the overview reads like it was written by someone who only read the source code and never used the software, revise it before proceeding. The overview sets the frame for everything downstream — feature-oriented use cases and internally focused requirements are symptoms of an overview that only describes the code, not the project. -**Calibrate unit counts to the project.** Read `chunk_size` or equivalent config. Use enough units to span at least 2 chunks and enough to verify distribution checks. Typically 10–30 for integration testing. 
+**Use case derivation (mandatory, runs after overview gate).** Derive 5–7 use cases from the validated overview and gathered documentation, then validate them against the code. Each use case must: -**Deep post-run verification.** Don't stop at "process completed." Verify log files, manifest state, output data existence, sample record content, and any existing quality check scripts — for every run. +- Describe a **real user outcome**, not a code feature. "Developer builds a CLI tool with nested subcommands, persistent flags, and shell completion" — not "Framework supports command trees." +- Name a **concrete actor** and what they are trying to accomplish. Actors include end-user developers, system administrators, kernel maintainers, protocol peers, integrators, and automated consumers. +- Be **recognizable to an actual user** of the software. For well-known projects, validate use cases against the model's own knowledge of the project, community docs, tutorials, and real-world adoption patterns. +- Connect to at least one requirement through testable conditions of satisfaction. -**Find and use existing verification tools.** Search for existing scripts that verify output quality (e.g., `integration_checks.py`, validation scripts, quality gate functions). If they exist, call them from the protocol. If the project has a TUI or dashboard, include TUI verification commands (e.g., `--dump` flags) in the post-run checklist. +The pipeline should explicitly ask: "Based on this project's overview, gathered documentation, and known user base, what are the 5–7 most important things real users do with this software?" Derive use cases from that question — not from scanning the code and grouping features into categories. -**Build a Field Reference Table before writing quality gates.** This is the most important step for protocol accuracy. 
AI models confidently write wrong field names even after reading schemas — `document_id` becomes `doc_id`, `sentiment_score` becomes `sentiment`, `float 0-1` becomes `int 0-100`. The fix is procedural: **re-read each schema file IMMEDIATELY before writing each table row.** Do not rely on what you read earlier in the conversation — your memory of field names drifts over thousands of tokens. Copy field names character-for-character from the file contents. Include ALL fields from each schema (if the schema has 8 fields, the table has 8 rows). See `references/review_protocols.md` section "The Field Reference Table" for the full process and format. Do not skip this step — it prevents the single most common protocol inaccuracy. +**Use case validation against code:** After deriving use cases from the overview and docs, verify each one against the codebase. If a use case describes something the code doesn't actually support, revise or remove it. If the code supports an important user outcome that no use case covers, add one. The goal is use cases that are both user-recognizable AND code-grounded. -### File 5: `quality/RUN_SPEC_AUDIT.md` — Council of Three +**Acceptance criteria span check (mandatory, runs after use case derivation).** After use cases are finalized and validated against code, check whether the conditions of satisfaction across all requirements collectively span the project's main behaviors: -**Read `references/spec_audit.md`** for the full protocol. +> Do these acceptance criteria, taken together, cover the project? Is there a major user-facing behavior described in the overview or use cases that no requirement's conditions of satisfaction would catch if it broke? -Three independent AI models audit the code against specifications. Why three? Because each model has different blind spots — in practice, different auditors catch different issues. Cross-referencing catches what any single model misses. 
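The span-check question can be mechanized by scanning REQ records through their one-way `use_cases[]` links. A rough sketch: the record shape mirrors the fields described in this step, but the exact manifest key names (`specificity`, `use_cases`) are assumptions about `requirements_manifest.json`:

```python
def span_check(use_case_ids, requirements):
    """Return UC IDs with coverage gaps: a UC needs at least one linked
    REQ, and at least one linked REQ classified 'specific'
    (architectural-guidance does not count toward coverage)."""
    gaps = []
    for uc in use_case_ids:
        linked = [r for r in requirements if uc in r.get("use_cases", [])]
        specific = [r for r in linked if r.get("specificity") == "specific"]
        if not linked or not specific:
            gaps.append(uc)
    return gaps

reqs = [
    {"id": "REQ-001", "specificity": "specific", "use_cases": ["UC-01"]},
    {"id": "REQ-002", "specificity": "architectural-guidance", "use_cases": ["UC-02"]},
]
# UC-02 has only guidance-level coverage; UC-03 has none at all
print(span_check(["UC-01", "UC-02", "UC-03"], reqs))  # → ['UC-02', 'UC-03']
```

Gaps found this way feed the completeness report rather than failing the run outright, matching the "add requirements or revise the use case" disposition.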
+For each use case, at least one requirement's conditions of satisfaction must be traceable to it, and at least one linked requirement must be `specific` (not `architectural-guidance`). Use cases with no linked specific requirements indicate a gap. When gaps are found, either: (a) add new requirements or sharpen existing conditions to cover the gap, or (b) revise the use case if it doesn't reflect what the requirements actually protect. Record the results of this check in the completeness report. -The protocol defines: a copy-pasteable audit prompt with guardrails, project-specific scrutiny areas, a triage process (merge findings by confidence level), and fix execution rules (small batches by subsystem, not mega-prompts). +Follow the use cases with the individual requirements. -### File 6: `AGENTS.md` +**v1.5.3 tier and citation scheme (schemas.md §3.1, §5).** Every REQ carries a `tier` integer 1–5 per `schemas.md` §3.1: -If `AGENTS.md` already exists, update it — don't replace it. Add a Quality Docs section pointing to all generated files. +- **Tier 1** — project's own formal spec (a `FORMAL_DOC` record with `tier=1`; highest authority). +- **Tier 2** — external formal standard (RFC, W3C, ISO, published API contract — a `FORMAL_DOC` record with `tier=2`). +- **Tier 3** — source-of-truth code when no formal spec exists; the code IS the spec. +- **Tier 4** — informal documentation loaded by `bin/reference_docs_ingest.load_tier4_context` from top-level `reference_docs/` (AI chats, design notes, retrospectives). +- **Tier 5** — inferred from code behavior with no documentation backing. -If creating from scratch: project description, setup commands, build & test commands, architecture overview, key design decisions, known quirks, and quality docs pointers. +For `tier ∈ {1, 2}`, the REQ also carries a `citation` block per `schemas.md` §5 with `document`, `document_sha256`, at least one of `section`/`line`, and a mechanically-extracted `citation_excerpt`. 
Do NOT write the excerpt by hand. The excerpt is produced at ingest time by `bin/reference_docs_ingest` invoking `bin/citation_verifier` per the deterministic algorithm in `schemas.md` §5.4 (with section resolution per §5.5) — the LLM consumes the excerpt from `formal_docs_manifest.json`; it never shells out to the verifier directly. Ingest-time extraction is how Layer 1 of the hallucination gate works. If you cannot cite a document in `quality/formal_docs_manifest.json` (with hash and locator), the REQ is at most Tier 3. `page`-only locators are diagnostic-only and are never sufficient. ---- +**Tier-1-wins-over-Tier-2 rule.** When a project's own spec (Tier 1) and an external standard (Tier 2) contradict each other, record the REQ at Tier 1 citing the project's position. A project's documented deviation from an external standard is authoritative intent, not a defect — the `upstream-spec-issue` disposition applies only when the project's spec is silent on the conflict. -## Phase 3: Verify +**Spec-Gap degradation (valid output state).** If `formal_docs_manifest.json` contains zero `FORMAL_DOC` records covering the project's own behavior, every REQ ends up at Tier 3/4/5 and the run degrades gracefully into a Spec Gap Analyzer. Report the meta-finding "0 Tier 1/2 requirements" in the completeness report as a metric, not a failure. Do NOT fabricate citations to make the tier distribution look richer — `quality_gate.py` re-invokes `bin/citation_verifier` (via `extract_excerpt`) per §5.4 at verification time and rejects any Tier 1/2 REQ whose `citation_excerpt` does not byte-equal the fresh extraction (schemas.md §10 invariant #11). -**Why a verification phase?** AI-generated output can look polished and be subtly wrong. Tests that reference undefined fixtures report 0 failures but 16 errors — and "0 failures" sounds like success. Integration protocols can list field names that don't exist in the actual schemas. 
The verification phase catches these problems before the user discovers them, which is important because trust in a generated quality playbook is fragile — one wrong field name undermines confidence in everything else. +**`functional_section` is a required field.** Every REQ carries a short `functional_section` string (e.g., `"Authentication"`, `"Bus enumeration"`) that groups related REQs. This is LLM-derived from the code and documentation; there is no predefined ontology. Phase 2's rendering groups REQs under these sections (with a short intro paragraph per section) and the Phase 4 Council reviews the grouping for coherence. See `schemas.md` §6.1. -### Self-Check Benchmarks +**Traceability is one-way: REQ → UC.** The REQ carries a `use_cases[]` list of UC-NN IDs. The UC record does NOT carry a `requirements[]` back-link — the reverse direction is derived at render time by querying REQ records for matching entries (schemas.md §7). Do not populate a `requirements[]` field on UC records. -Before declaring done, check every benchmark. **Read `references/verification.md`** for the complete checklist. +**For each requirement, provide all of these fields:** -The critical checks: +- **ID**: `REQ-NNN` (zero-padded three-digit sequence). +- **Title**: Short, one-line statement. +- **Tier**: Integer 1–5 per schemas.md §3.1. +- **Functional section**: Short LLM-derived string (see above). +- **Citation** (required when `tier ∈ {1, 2}`): produced by `bin/reference_docs_ingest` invoking `bin/citation_verifier`; never hand-authored and never invoked directly by the LLM. Shape per schemas.md §5.1. +- **Summary / Description**: State the requirement as a testable assertion: "X must satisfy Y" or "When A, the system must B". +- **User story**: Frame it from the caller's perspective: "As a [role] doing [action], I expect [behavior] **so that** [outcome]." The "so that" clause is mandatory — it forces you to articulate the intent behind the requirement. 
+- **Implementation note**: How the code achieves this requirement — the mechanism, the relevant code paths, the design choice. +- **Conditions of satisfaction**: Specific, testable scenarios that prove this requirement is met. Include the happy path, edge cases, and failure modes. Each individual contract from Phase A that was grouped into this requirement becomes a condition of satisfaction. +- **Alternative paths**: Multiple code paths, modes, or entry points that must all satisfy the requirement. Alternative paths are where bugs hide. +- **Use cases**: `use_cases[]` — list of `UC-NN` IDs this REQ participates in. One-way forward link. +- **References**: Cite the source — spec section, ChangeLog entry, config field definition, source comment, issue number, or domain knowledge. For Tier 1/2 REQs the `citation` block carries the authoritative locator; free-form references are supplementary. +- **Specificity**: **specific** (testable — must have conditions of satisfaction that a code reviewer can check against a specific code location or behavior; this is the default and counts toward coverage metrics) or **architectural-guidance** (not testable against individual code paths — covers cross-cutting properties like "remain lightweight and stdlib-compatible" or "no_std support"; informs the quality constitution but is not counted in coverage metrics; most projects should have 0–3 architectural-guidance requirements — more than 3 triggers the mandatory self-check below). The category "directional" is retired. Any requirement that would have been "directional" must either be made specific (with testable conditions) or explicitly classified as architectural-guidance. -1. **Test count** near heuristic target (spec sections + scenarios + defensive patterns) -2. **Scenario coverage** — scenario test count matches QUALITY.md scenario count -3. **Cross-variant coverage** — ~30% of tests parametrize across all input variants -4. 
**Boundary test count** ≈ defensive pattern count from Step 5 -5. **Assertion depth** — Majority of assertions check values, not just presence -6. **Layer correctness** — Tests assert outcomes (what spec says), not mechanisms (how code implements) -7. **Mutation validity** — Every fixture mutation uses a schema-valid value from Step 5b -8. **All tests pass — zero failures AND zero errors.** Run the test suite using the project's test runner (Python: `pytest -v`, Scala: `sbt testOnly`, Java: `mvn test`/`gradle test`, TypeScript: `npx jest`, Go: `go test -v`, Rust: `cargo test`) and check the summary. Errors from missing fixtures, failed imports, or unresolved dependencies count as broken tests. If you see setup errors, you forgot to create the fixture/setup file or referenced undefined test helpers. -9. **Existing tests unbroken** — The new files didn't break anything. -10. **Integration test quality gates were written from a Field Reference Table.** Verify that you built a Field Reference Table by re-reading each schema file before writing quality gates, and that every field name in the quality gates is copied from that table — not from memory. If you skipped the table, go back and build it now. + **Architectural-guidance self-check (mandatory, runs after requirement derivation).** Count the requirements tagged `architectural-guidance`. Apply both bounds: -If any benchmark fails, go back and fix it before proceeding. + - **Maximum bound (>3):** If the count exceeds 3, stop and re-examine each one. For each, ask: "Can I add a testable condition of satisfaction that a code reviewer could verify against a specific code location?" If yes, reclassify it as `specific` and add the condition. Only requirements that genuinely cannot be verified against any specific code path should remain `architectural-guidance`. A final count above 3 requires an explicit justification per excess requirement explaining why it cannot be made specific. 
+ - **Minimum bound (0 on 15+ requirements):** If the total requirement count is 15 or more and the `architectural-guidance` count is 0, re-examine the requirements for cross-cutting design invariants. Libraries that span protocol layers, manage resource lifecycles, enforce ordering guarantees, or maintain compatibility contracts (e.g., "remain stdlib-compatible," "preserve no_std support," "maintain wire-format backward compatibility") typically have 1–3 architectural-guidance requirements. Write one sentence in the completeness report explaining why no requirement qualified as architectural-guidance, or reclassify the appropriate requirements. ---- + Record the count and any reclassifications in the completeness report. -## Phase 4: Present, Explore, Improve (Interactive) +**Do not cap the requirement count.** Derive as many as the project warrants. A small utility might have 20. A mature library might have 100+. The goal is completeness. -After generating and verifying, present the results clearly and give the user control over what happens next. This phase has three parts: a scannable summary, drill-down on demand, and a menu of improvement paths. +**Step 7a: Documentation-to-requirement reconciliation** -**Do not skip this phase.** The autonomous output from Phases 1-3 is a solid starting point, but the user needs to understand what was generated, explore what matters to them, and choose how to improve it. A quality playbook is only useful if the people who own the project trust it and understand it. Dumping six files without explanation creates artifacts nobody reads. +Re-read the coverage commitment table from PROGRESS.md. For each deep document you committed to covering ("will cover in Phase 2"), verify that at least one requirement traces to the subsystem it documents. If your requirements cover only some committed subsystems, add requirements for the gaps before completing Step 7. 
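The Step 7a reconciliation reduces to a set comparison. A sketch of it as a mechanical check — the data shapes (commitment list, REQ-to-subsystem map, exclusion set) are illustrative, since PROGRESS.md is free-form prose in practice:

```python
def reconcile_commitments(commitments, requirement_subsystems, exclusions):
    """Step 7a sketch: every 'will cover' commitment must either map to
    at least one requirement or carry an explicit justified exclusion.
    Returns the unreconciled subsystems (process failures)."""
    covered = set(requirement_subsystems.values())
    return [s for s in commitments if s not in covered and s not in exclusions]

# Hypothetical subsystem names for illustration:
commitments = ["ring-buffer", "feature-negotiation", "suspend-resume"]
mapping = {"REQ-001": "ring-buffer", "REQ-002": "ring-buffer"}
print(reconcile_commitments(commitments, mapping, {"suspend-resume"}))
# → ['feature-negotiation']
```

An empty return list is the precondition for advancing to artifact generation; anything else means adding requirements or writing a justified exclusion first.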
-### Part 1: The Summary Table +For each subsystem, record one of the following in PROGRESS.md: +- the requirement IDs that cover it, or +- an explicit exclusion with rationale, risk acknowledgment, and recommended follow-up -Present a single table the user can scan in 10 seconds: +A deep-documented subsystem with a "will cover" commitment and zero mapped requirements is a process failure, not a legitimate scope choice. Do not proceed to artifact generation until every commitment is satisfied or explicitly converted to a justified exclusion. -``` -Here's what I generated: +**Step 7b: Code-path → REQ reverse traceability audit (mandatory)** -| File | What It Does | Key Metric | Confidence | -|------|-------------|------------|------------| -| QUALITY.md | Quality constitution | 10 scenarios | ██████░░ High — grounded in code, but scenarios are inferred, not from real incidents | -| Functional tests | Automated tests | 47 passing | ████████ High — all tests pass, 35% cross-variant | -| RUN_CODE_REVIEW.md | Code review protocol | 8 focus areas | ████████ High — derived from architecture | -| RUN_INTEGRATION_TESTS.md | Integration test protocol | 9 runs × 3 providers | ██████░░ Medium — quality gates need threshold tuning | -| RUN_SPEC_AUDIT.md | Council of Three audit | 10 scrutiny areas | ████████ High — guardrails included | -| AGENTS.md | AI session bootstrap | Updated | ████████ High — factual | +**Timing: Execute Step 7a and 7b after Phase E completes** (i.e., after the overview validation gate, use case derivation, and acceptance criteria span check have all run). The audit depends on finalized requirements AND finalized use cases. + +After requirements derivation is complete, run a reverse traceability audit. Forward traceability (gathered docs → requirements → bugs → tests) is already built into the pipeline. This step checks the reverse direction at code-path granularity: do significant code paths map back to requirement conditions? 
This is an audit activity — NOT a structural bidirectional link. (Structural traceability in v1.5.3 is one-way REQ → UC per `schemas.md` §7 and is enforced by schema; this audit checks code coverage against REQs, which is a separate concern.) + +This operates at **path/branch/helper granularity**, not file level. File-level coverage was 100% in v1.3.13 and still missed two real bugs. The question is not "does this file map to some requirement?" but "does this significant branch map to a requirement clause that states what must be preserved here?" + +**Scoped to four categories** (not an open-ended branch audit): + +1. **Alternative paths already named in requirements.** If a requirement mentions fallback or alternative paths (e.g., "primary vs. degraded mode," "negotiated vs. default configuration," "sync vs. async"), each alternative must have an explicit **symmetry condition** — a statement of what invariant must hold across both paths. A requirement that says "the system handles both X and Y" without specifying what "handles" means for each is incomplete. + +2. **Helpers that translate public constants into runtime behavior.** If a helper function whitelists, filters, or translates between defined constants and runtime behavior (e.g., feature flag gates, codec registry lookups, capability whitelist helpers), it must have a helper-specific requirement enumerating the expected preserved/translated values. + +3. **Capability-negotiation and fallback logic.** Code paths where the system negotiates capabilities with an external peer (protocol version negotiation, feature detection, graceful degradation) must have requirements covering both the negotiated-up and negotiated-down paths. + +4. **Functions named in prior BUGS.md, VERSION_HISTORY.md, or spec audit outputs.** If a previous run found a bug in a specific function, future runs must show explicit re-check evidence for that function ("known bug class sentinels"). 
This prevents the "lost requirement" regression class. If prior spec audit outputs exist in `quality/spec_audits/`, read them before running the sentinel check — cross-model findings from council reviews are a high-value source of known bug surfaces. + +For each category, check whether the requirements contain specific conditions covering the identified paths. Orphaned paths — significant code paths without requirement coverage — trigger a "coverage gap" marker in the completeness report. These gaps must be resolved (by adding requirement conditions or by providing explicit justification) before the completeness report can declare requirements sufficient. + +**Carry-forward rule:** When a prior run's REQUIREMENTS.md exists in the quality directory, the pipeline must read it and check whether any conditions from the prior version were dropped. If conditions were dropped, the pipeline must either: (a) re-derive them with updated justification, or (b) document why the condition is no longer relevant. Silent drops are not permitted — they are a direct cause of regressions where previously learned requirements are lost between runs. + +**After the pipeline:** Phase 7 can generate `quality/REVIEW_REQUIREMENTS.md` (interactive review protocol) and `quality/REFINE_REQUIREMENTS.md` (refinement pass protocol). These are not Phase 2 artifacts — they support the Phase 7 interactive improvement paths. The user can review requirements interactively, run refinement passes with different models, and keep versioned backups of each iteration. See `references/requirements_pipeline.md` for the full versioning protocol and backup structure. + +Record all requirements in a structured format. These feed directly into the code review protocol's verification and consistency passes. 
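The carry-forward rule lends itself to a small mechanical helper. A minimal sketch, assuming requirement IDs follow the REQ-NNN convention; the function name and the prior-version path are illustrative, not part of the playbook:

```python
import re
from pathlib import Path

REQ_ID = re.compile(r"\bREQ-\d{3,}\b")

def carried_forward_gaps(prior_path, current_path):
    """Return requirement IDs present in a prior REQUIREMENTS.md but absent now.

    Every ID returned must be re-derived or given an explicit drop rationale --
    silent drops are the regression class this check exists to catch.
    """
    prior = set(REQ_ID.findall(Path(prior_path).read_text(encoding="utf-8")))
    current = set(REQ_ID.findall(Path(current_path).read_text(encoding="utf-8")))
    return sorted(prior - current)

# Hypothetical backup location for the prior run's requirements:
# for req in carried_forward_gaps("quality/REQUIREMENTS.prev.md", "quality/REQUIREMENTS.md"):
#     print(f"dropped without justification: {req}")
```

This only detects dropped IDs, not weakened conditions inside a surviving requirement, so it supplements rather than replaces the manual carry-forward review.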
+ +### Checkpoint: Update PROGRESS.md after Phase 1 + +**v1.5.6 update — PROGRESS.md is now initialized at run start, not after Phase 1.** Per the "Run-state instrumentation" section earlier in this file, `quality/PROGRESS.md` and `quality/run_state.jsonl` are written before any phase work begins. By this point in Phase 1, both files already exist. This checkpoint is the **Phase 1 completion update** to PROGRESS.md, not the initialization. + +The PROGRESS.md format combines the run-state header (Started / Benchmark / Lever / Runner / Playbook version), the phase checklist (now driven by `phase_start` / `phase_end` events from `quality/run_state.jsonl`), and the legacy content sections below (Run metadata, Artifact inventory, Cumulative BUG tracker, etc.) — they are complementary, not competing. + +**Phase 1 completion action:** Mark Phase 1 as `[x]` in the phase checklist with summary stats (findings count, patterns walked); add the Phase 1 artifacts (EXPLORATION.md and any sub-artifacts) to the Artifact inventory. Append a `phase_end phase=1` event to `run_state.jsonl` per the cross-validation rules in the Run-state instrumentation section. + +**Why PROGRESS.md exists:** In single-session runs, the agent holds context in memory. But context degrades over long sessions — findings from Phase 1 are forgotten by Phase 6, BUG counts drift, spec-audit bugs get orphaned because the closure check never saw them. PROGRESS.md solves this by making every phase write its state to disk. The agent reads it back before each phase, so it always has an accurate picture of what happened so far. As a side benefit, it makes the skill work correctly even if the run is split across multiple sessions. + +**Checkpoint discipline for long runs:** After each requirements-pipeline phase (Contracts, Requirements, Coverage Matrix, Completeness, Narrative), update `quality/PROGRESS.md` with: completed phase, artifact paths, current scoped subsystems, remaining work, and exact resume point. 
This ensures a resumed session can continue from the last completed checkpoint without redoing work. Per v1.5.6, also append the corresponding `pass_started` / `pass_ended` events to `run_state.jsonl`. + +**Timestamp discipline:** Write each phase completion entry to PROGRESS.md immediately when you finish that phase, before starting the next phase. Do not batch-write or back-fill timestamps after the fact. The timestamps are an audit trail — if Phase 2 shows a completion time earlier than Phase 1, a reviewer cannot verify that phases ran in the correct sequence. If you realize you forgot to write a checkpoint, write it now with an honest timestamp and a note explaining the gap. + +By Phase 1 completion, the v1.5.6-initialized PROGRESS.md should have the sections below populated (the legacy template, retained because Phase 5+ depend on its Cumulative BUG tracker and Terminal Gate Verification sections): + +```markdown +# Quality Playbook Progress + +## Run metadata +Started: [date/time] +Project: [project name] +Skill version: [read from SKILL.md metadata using the reference file resolution order — must match exactly] +With docs: [yes/no] + +## Phase completion +- [x] Phase 1: Exploration — completed [date/time] +- [ ] Phase 2: Artifact generation (QUALITY.md, REQUIREMENTS.md, tests, protocols, RUN_TDD_TESTS.md) — `AGENTS.md` is generated by the orchestrator after Phase 6, not here +- [ ] Phase 3: Code review + regression tests +- [ ] Phase 4: Spec audit + triage +- [ ] Phase 5: Post-review reconciliation + closure verification +- [ ] TDD logs: red-phase log for every confirmed bug, green-phase log for every bug with fix patch +- [ ] Phase 6: Verification benchmarks +- [ ] Phase 7: Present, Explore, Improve (interactive) + +## Artifact inventory +| Artifact | Status | Path | Notes | +|----------|--------|------|-------| +| QUALITY.md | pending | | | +| REQUIREMENTS.md | pending | | | +| CONTRACTS.md | pending | | | +| COVERAGE_MATRIX.md |

pending | | | +| COMPLETENESS_REPORT.md | pending | | | +| Functional tests | pending | | | +| RUN_CODE_REVIEW.md | pending | | | +| RUN_INTEGRATION_TESTS.md | pending | | | +| BUGS.md | pending | | | +| RUN_TDD_TESTS.md | pending | | | +| RUN_SPEC_AUDIT.md | pending | | | +| tdd-results.json | pending | quality/results/ | Structured TDD output | +| integration-results.json | pending | quality/results/ | Structured integration output | +| Bug writeups | pending | quality/writeups/ | One per TDD-verified bug | + +## Cumulative BUG tracker + + +| # | Source | File:Line | Description | Severity | Closure Status | Test/Exemption | +|---|--------|-----------|-------------|----------|----------------|----------------| + + + +## Terminal Gate Verification + + +## Exploration summary +[Brief notes on architecture, key modules, spec sources, defensive patterns found] +``` + +Update this file after every phase. The cumulative BUG tracker is the most important section — it ensures no finding is orphaned regardless of which phase produced it. + +### Write exploration findings to disk + +After initializing PROGRESS.md, write your full exploration findings to `quality/EXPLORATION.md`. This file captures everything you learned in Phase 1 so it can survive a context boundary (session break, multi-pass handoff, or long-run memory degradation). Structure it as: + +```markdown +# Exploration Findings + +## Domain and Stack +[Language, framework, build system, deployment target] + +## Architecture +[Key modules with file paths, entry points, data flow, layering] + +## Existing Tests +[Test framework, test count, coverage areas, gaps] + +## Specifications +[What reference_docs/ contains, key spec sections, behavioral rules] + +## Open Exploration Findings +[At least 8 concrete findings from domain-driven investigation. +Each must have a file path, line number, and specific bug hypothesis. +At least 4 must reference different modules or subsystems. 
+At least 3 must trace a behavior across 2+ functions.] + +## Quality Risks +[At least 5 domain-driven failure scenarios ranked by priority. +Each must name a specific function, file, and line and explain the failure +mechanism using domain knowledge of what goes wrong in systems like this. +These are hypotheses, not confirmed bugs — they tell Phase 2 where to look. +Frame each as: "Because [code at file:line] does [X], a [domain-specific +edge case] will produce [wrong behavior] instead of [correct behavior]." +A section that lists defensive patterns the code already has does NOT belong here.] + +## Skeletons and Dispatch +[State machines, dispatch tables, feature registries — with file:line citations] + +## Pattern Applicability Matrix +| Pattern | Decision (`FULL` / `SKIP`) | Target modules | Why | +|---|---|---|---| +| Fallback and Degradation Path Parity | | | | +| Dispatcher Return-Value Correctness | | | | +| Cross-Implementation Consistency | | | | +| Enumeration and Representation Completeness | | | | +| API Surface Consistency | | | | +| Spec-Structured Parsing Fidelity | | | | + +[3 to 4 patterns must be marked FULL. The rest are SKIP with codebase-specific rationale. Select 4 when a fourth pattern clearly applies and covers different code areas.] + +## Pattern Deep Dive — [Pattern Name] +[Use the output format from `exploration_patterns.md`. +Trace the relevant code path across 2+ functions, implementations, or API surfaces. +Each deep dive should pressure-test, refine, or extend findings from the open +exploration and quality risks stages.] + +## Pattern Deep Dive — [Pattern Name] +[Repeat for each selected FULL pattern. 3 to 4 deep-dive sections total.] + +## Pattern Deep Dive — [Pattern Name] +[Third and final deep dive.] + +## Candidate Bugs for Phase 2 +[Consolidated from ALL earlier sections — open exploration, quality risks, AND patterns. +Minimum 4 candidates with file:line references. 
At least 2 from open exploration or +quality risks, at least 1 from a pattern deep dive. For each candidate include the +source stage and what the Phase 2 code review should inspect.] + +## Derived Requirements +[REQ-001 through REQ-NNN, each with spec basis and tier] + +## Derived Use Cases +[UC-01 through UC-NN, each with actor, trigger, expected outcome] + +## Notes for Artifact Generation +[Anything the next phase needs to know — naming conventions, test patterns, framework quirks] + +## Gate Self-Check +[Written by the Phase 1 gate. Each check 1–12 with PASS/FAIL and one-line evidence. +This section proves the gate was executed. Do not write this section until you have +actually verified each check against the file contents.] +``` + +**Minimum depth expectation:** EXPLORATION.md must contain at least 120 lines of substantive content — not padding or boilerplate headers, but actual findings (file paths, behavioral rules, derived requirements, architecture observations). A skeleton that lists section headers with one-line placeholders is not a valid handoff artifact. If the file is thinner than this, go back and add the detail Phase 2 will need. + +**Re-read after writing (mandatory).** After writing EXPLORATION.md, explicitly read the file back from disk before proceeding to Phase 2. This serves two purposes: (1) it confirms the file was written correctly, and (2) it loads the structured findings into working memory for artifact generation. Do not skip this step and rely on what you remember writing — the "write then read" cycle is the context bridge. + +This file is essential in all modes. In single-pass mode it forces the model to articulate specific findings (file paths, function names, line numbers) before generating artifacts. In multi-pass mode it is also the handoff artifact between passes. Either way, the write-then-read cycle is the quality gate for exploration depth. 
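The depth expectation and exact-title rules can be pre-checked mechanically before running the full gate. A minimal sketch in Python; the function name is illustrative, and this covers only the cheap structural checks, not the twelve-point gate itself:

```python
from pathlib import Path

REQUIRED_SECTIONS = [
    "## Open Exploration Findings",
    "## Quality Risks",
    "## Pattern Applicability Matrix",
    "## Candidate Bugs for Phase 2",
    "## Gate Self-Check",
]

def exploration_precheck(path="quality/EXPLORATION.md"):
    """Mechanical pre-check: line count and exact section titles.

    This does NOT replace the 12-check gate; it only catches the
    cheap-to-detect failures (thin file, renamed sections) early.
    """
    text = Path(path).read_text(encoding="utf-8")
    substantive = [l for l in text.splitlines() if l.strip()]
    problems = []
    if len(substantive) < 120:
        problems.append(f"only {len(substantive)} substantive lines (need 120)")
    for title in REQUIRED_SECTIONS:
        if title not in text:
            problems.append(f"missing exact section title: {title}")
    deep_dives = text.count("## Pattern Deep Dive — ")
    if not 3 <= deep_dives <= 4:
        problems.append(f"{deep_dives} deep-dive sections (need 3-4)")
    return problems  # empty list means the mechanical checks pass
```

An empty result is necessary but not sufficient: the semantic checks (bug hypotheses with file:line, multi-function traces) still require reading the content.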
+ +**Phase 1 completion gate (mandatory — STOP HERE before Phase 2).** You MUST execute this gate before proceeding to Phase 2. This is not optional. Re-read `quality/EXPLORATION.md` from disk and run every check below. After checking, append a `## Gate Self-Check` section to the bottom of EXPLORATION.md that lists each check number (1–12) with PASS or FAIL and a one-line evidence note. If any check fails, fix EXPLORATION.md and re-run the gate. Do not proceed to Phase 2 until all checks pass AND the Gate Self-Check section is written to disk. + +**Common gate-bypass failure mode:** In v1.3.43 benchmarking, two repos (chi, zod) produced EXPLORATION.md files with completely wrong section structure — sections like "Architecture summary", "Behavioral contracts", "Repository and architecture map" instead of the required sections. The model never ran the gate checks and proceeded directly to Phase 2, producing zero bugs. If your EXPLORATION.md does not contain sections with the EXACT titles listed below, it is non-conformant and must be rewritten before proceeding. + +1. The file exists on disk and contains at least 120 lines of substantive content. +2. `quality/PROGRESS.md` exists and marks Phase 1 complete. +3. The Derived Requirements section contains at least one REQ-NNN with specific file paths and function names — not abstract subsystem descriptions. +4. A section titled **exactly** `## Open Exploration Findings` exists and contains at least 8 concrete bug hypotheses or suspicious findings, each with a file path and line number. These must come from domain-driven investigation, not just from applying patterns. At least 4 must reference different modules or subsystems. +5. **Open-exploration depth check:** At least 3 findings in `## Open Exploration Findings` must trace a behavior across 2 or more functions or 2 concrete code locations. A list of isolated single-function suspicions is not sufficient depth. +6. 
A section titled **exactly** `## Quality Risks` exists and contains at least 5 domain-driven failure scenarios ranked by priority. Each scenario must: (a) name a specific function, file, and line, (b) describe a domain-specific edge case or failure mode, and (c) explain why the code produces wrong behavior. These must come from domain knowledge about what goes wrong in systems like this one — not from structural analysis of the code alone. A section that lists defensive patterns the code already has (things the code does RIGHT) does not satisfy this gate. A section that lists risky modules without specific failure scenarios does not satisfy this gate. A section that concludes the library is mature and unlikely to have basic bugs does not satisfy this gate. +7. A section titled **exactly** `## Pattern Applicability Matrix` exists and evaluates all six patterns from `exploration_patterns.md`, marking each as `FULL` or `SKIP` with target modules and codebase-specific rationale. +8. Between 3 and 4 patterns (inclusive) are marked `FULL` in the applicability matrix. +9. There are between 3 and 4 sections (inclusive) whose titles begin with `## Pattern Deep Dive — `. Each must contain concrete file:line evidence, not just pattern-name placeholders. The count must match the number of `FULL` patterns in the matrix. +10. **Pattern depth check:** At least 2 of the pattern deep-dive sections must trace a code path across 2 or more functions. A section that says "function X at file:line has a gap" is a surface finding. A section that says "function X at file:line calls function Y at file:line, which does A but not B; compare with function Z which does both" is a depth finding. +11. A section titled **exactly** `## Candidate Bugs for Phase 2` exists and contains at least 4 prioritized bug hypotheses with file:line references, the stage that surfaced each one (open exploration, quality risks, or pattern), and what the code review should look for. +12. 
**Ensemble balance check:** At least 2 candidate bugs must originate from open exploration or quality risks, and at least 1 must originate from or be materially strengthened by a pattern deep dive. This ensures both domain-knowledge and structural-analysis findings flow into Phase 2. + +Do not begin Phase 2 until all twelve checks pass AND the `## Gate Self-Check` section is written to EXPLORATION.md on disk. Phase 1 is your only chance to understand the codebase deeply. Every requirement you miss here is a bug you will not find in Phase 3. Invest the time. + +**If you find yourself about to start Phase 2 without having written a Gate Self-Check section, STOP.** Go back and run the gate. This instruction exists because models reliably skip the gate when they feel confident about their exploration — and that confidence is precisely when bugs are missed. + +**End-of-phase message (mandatory — print this after Phase 1 completes, then STOP):** + +``` +# Phase 1 Complete — Exploration + +I've finished exploring the codebase and written my findings to `quality/EXPLORATION.md`. +[Summarize: how many candidate bugs, which subsystems explored, key risks identified.] + +To continue to Phase 2 (Generate quality artifacts), say: + + Run quality playbook phase 2. + +Or say "keep going" to continue automatically. +``` + +**After printing this message, STOP. Do not proceed to Phase 2 unless the user explicitly asks.** + +--- + +## Phase 2: Generate the Quality Playbook + +**v1.5.6 instrumentation:** Append `phase_start phase=2` to `quality/run_state.jsonl` now. At phase end, cross-validate by calling `bin/run_state_lib.validate_phase_artifacts(quality_dir, 2)` — it checks the full Generate contract (REQUIREMENTS.md, QUALITY.md, CONTRACTS.md, COVERAGE_MATRIX.md, COMPLETENESS_REPORT.md, RUN_CODE_REVIEW.md, RUN_INTEGRATION_TESTS.md, RUN_SPEC_AUDIT.md, RUN_TDD_TESTS.md, plus one non-empty `quality/test_functional.*`). If validation passes, append `phase_end phase=2`.
If it fails, append an `error` event with `recoverable: true` and re-run the missing artifact generation. (BUG-014 fix: pre-v1.5.6 this note referenced the v1.5.5-design triage model that never shipped.) + +> **Required references for this phase** — read these before proceeding: +> - `quality/EXPLORATION.md` — your Phase 1 findings (architecture, requirements, use cases, pattern analysis) +> - `references/requirements_pipeline.md` — five-phase pipeline for requirement derivation +> - `references/defensive_patterns.md` — grep patterns for finding defensive code +> - `references/schema_mapping.md` — field mapping format for schema-aware tests +> - `references/constitution.md` — QUALITY.md template +> - `references/functional_tests.md` — test structure and anti-patterns +> - `references/review_protocols.md` — code review and integration test templates + +**Phase 2 source-modification guardrail (mandatory — HARD STOP).** Phase 2 writes ONLY into `quality/`. Do NOT create, modify, or delete any file outside the target repo's `quality/` directory — this includes (but is not limited to) `AGENTS.md`, `CLAUDE.md`, `README.md`, `bin/**`, `.github/**`, `.claude/**`, `references/**`, `agents/**`, `SKILL.md`, `schemas.md`, source code, or test files. The target repo's `AGENTS.md` is generated by the orchestrator AFTER Phase 6 completes; if you write `AGENTS.md` in Phase 2 you will trip the orchestrator's source-unchanged invariant and abort the run. The 2026-04-30 codex bootstrap test failed exactly this way: a Phase 2 LLM updated the existing `AGENTS.md` (because earlier SKILL.md prose told it to), the source-unchanged gate detected the modification, and the run aborted before Phase 3. If a generated artifact would naturally live outside `quality/`, write it under `quality/` instead and let the orchestrator move or copy it during finalization. The single permitted exception is `quality/PROGRESS.md`, which is itself inside `quality/`. 
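The `phase_start` / `phase_end` / `error` events can be appended with a few lines of Python. A sketch only: the authoritative helpers live in `bin/run_state_lib`, and any field names beyond the event type and phase number are assumptions, as is the timestamp shape (borrowed from the `generated_at` convention used elsewhere in this skill):

```python
import json
from datetime import datetime, timezone

def append_event(event_type, phase, path="quality/run_state.jsonl", **extra):
    """Append one run-state event as a single JSON line.

    Field names other than the event type and phase number are illustrative;
    the authoritative event shape is defined by bin/run_state_lib.
    """
    event = {
        "event": event_type,  # e.g. "phase_start", "phase_end", "error"
        "phase": phase,
        "ts": datetime.now(timezone.utc).isoformat(timespec="seconds").replace("+00:00", "Z"),
        **extra,
    }
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")

# append_event("phase_start", 2)
# append_event("error", 2, recoverable=True)
```

Appending (never rewriting) keeps the log an immutable audit trail, which is what the cross-validation rules depend on.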
+ +**Phase 2 entry gate (mandatory — HARD STOP).** Before generating any artifacts, read `quality/EXPLORATION.md` from disk and verify ALL of the following exact section titles exist (grep or search — do not rely on memory): + +1. `quality/EXPLORATION.md` must have at least 120 lines — a shorter file indicates incomplete exploration +2. `## Open Exploration Findings` — must exist verbatim +3. `## Quality Risks` — must exist verbatim +4. `## Pattern Applicability Matrix` — must exist verbatim +5. At least 3 sections starting with `## Pattern Deep Dive — ` — must exist verbatim +6. `## Candidate Bugs for Phase 2` — must exist verbatim +7. `## Gate Self-Check` — must exist (proves the Phase 1 gate was run) +8. `quality/PROGRESS.md` exists and its Phase 1 line is marked `[x]` — proves Phase 1 was completed, not just started +9. The `## Open Exploration Findings` section contains at least 8 concrete bug hypotheses — count lines with file:line citations +10. At least 3 findings in `## Open Exploration Findings` trace behavior across 2+ functions — look for multi-location traces +11. Between 3 and 4 patterns are marked `FULL` in `## Pattern Applicability Matrix` — count FULL entries +12. At least 2 pattern deep-dive sections trace code paths across 2+ functions — look for multi-function traces +13. `## Candidate Bugs for Phase 2` has ≥2 bugs from open exploration/risks AND ≥1 from a pattern deep dive — check source stage labels + +If the file does not exist, has fewer than 120 lines, or is **missing ANY of these exact section titles**, STOP and go back to Phase 1. Do not attempt to proceed with "equivalent" sections under different names — the exact titles above are required. Write EXPLORATION.md now, starting with domain-driven open exploration, then domain-knowledge risk analysis, then selecting 3–4 patterns from `references/exploration_patterns.md` for deep dives. Do not proceed with Phase 2 until EXPLORATION.md passes the Phase 1 completion gate. 
This check exists because single-pass execution can skip the Phase 1 gate — this is the backstop. In v1.3.43, two repos bypassed both gates and produced zero bugs. + +Use `quality/EXPLORATION.md` as your primary source for this phase — do not re-explore the codebase from scratch. The exploration findings contain the architecture map, derived requirements, use cases, and risk analysis that drive every artifact below. If you find yourself reading source files to figure out what the project does, go back to EXPLORATION.md instead. Re-exploration wastes context and produces inconsistencies between what Phase 1 found and what Phase 2 generates. + +Now write the Phase 2 artifacts. The requirements pipeline above produced REQUIREMENTS.md, CONTRACTS.md, COVERAGE_MATRIX.md, and COMPLETENESS_REPORT.md. The seven files below complete the set. For each one, follow the structure below and consult the relevant reference file for detailed guidance. + +**Version stamp (mandatory on every generated file).** Every Markdown file the playbook generates must begin with the following attribution line immediately after the file's title heading: + +``` +> Generated by [Quality Playbook](https://github.com/andrewstellman/quality-playbook) v1.5.6 — Andrew Stellman +> Date: YYYY-MM-DD · Project: +``` + +Every generated code file (test files, scripts) must begin with a comment header: + +``` +# Generated by Quality Playbook v1.5.6 — https://github.com/andrewstellman/quality-playbook +# Author: Andrew Stellman · Date: YYYY-MM-DD · Project: +``` + +Use the comment syntax appropriate for the language (`#`, `//`, `/* */`, etc.). The version in the stamp must match the `metadata.version` in this skill's frontmatter. This stamp makes every generated artifact traceable back to the tool, version, and run that created it — essential when files are emailed, attached to tickets, or reviewed outside the repository context.
Use the date the playbook generation started, not the date each individual file was written. + +**Stamp placement and exemptions:** +- For Python files with an encoding pragma (`# -*- coding: utf-8 -*-`) or a shebang (`#!/usr/bin/env python`), place the stamp comment *after* the pragma/shebang, not before — pushing it past line 2 causes `SyntaxWarning`. +- For sidecar JSON files (`tdd-results.json`, `integration-results.json`), the `skill_version` field already serves as the version stamp. JSON does not support comments — do not inject one. +- For JUnit XML files, no stamp is needed — these are framework-generated. +- For `.patch` files, do not inject a stamp into the diff body — it would break `git apply`. Rely on the surrounding artifact metadata (BUGS.md, tdd-results.json) for provenance. + +**Artifact dependency rules:** +- `quality/RUN_CODE_REVIEW.md` Pass 2 depends on a stable `quality/REQUIREMENTS.md` — thin requirements produce thin Pass 2 review. If the requirements count seems low for the code surface (fewer than ~3–4 requirements per core module), note this at the start of the Pass 2 report. +- Functional tests depend on `quality/REQUIREMENTS.md` and `quality/QUALITY.md` — after any requirements refinement, re-verify that `test_functional.*` still covers every requirement. +- `quality/RUN_SPEC_AUDIT.md` depends on requirements, quality scenarios, and docs validation. +- `quality/COMPLETENESS_REPORT.md` has two stages: baseline (pre-review, no verdict section) and final (post-reconciliation in Phase 5, with the authoritative verdict). +- `quality/PROGRESS.md` is the authoritative state file and must be updated before each downstream artifact begins. + +**Why nine files instead of just tests?** Tests catch regressions but don't prevent new categories of bugs. The quality constitution (`QUALITY.md`) tells future sessions what "correct" means before they start writing code. 
The protocols (`RUN_*.md`) provide structured processes for review, integration testing, and spec auditing that produce repeatable results — instead of leaving quality to whatever the AI feels like checking. Together, these files create a quality system where each piece reinforces the others: scenarios in QUALITY.md map to tests in the functional test file, which are verified by the integration protocol, which is audited by the Council of Three. + +### v1.5.3 JSON manifest discipline (read before writing any artifact) + +Phase 2 writes two parallel renderings of every derived record: a **JSON manifest** (machine-readable, gate-validated — the source of truth) and a **Markdown artifact** (human-readable, rendered from the manifest). A phase script that updates one without updating the other is a bug. + +Manifests live at: + +- `quality/formal_docs_manifest.json` — written by `bin/reference_docs_ingest.py` in Phase 1. Do not rewrite. +- `quality/requirements_manifest.json` — authoritative REQ records per schemas.md §6. +- `quality/use_cases_manifest.json` — authoritative UC records per schemas.md §7. +- `quality/bugs_manifest.json` — authoritative BUG records per schemas.md §8. Written after Phase 3/4/5 confirm bugs. +- `quality/citation_semantic_check.json` — Phase 4 Council Layer 2 verdicts (see Phase 4 below). + +Every manifest follows the §1.6 wrapper with `schema_version`, `generated_at`, and a top-level records array. The four record-shaped manifests (`formal_docs_manifest.json`, `requirements_manifest.json`, `use_cases_manifest.json`, `bugs_manifest.json`) use `records` as the array key: + +```json +{ + "schema_version": "", + "generated_at": "", + "records": [ /* per-schema records, per schemas.md §4–§8 */ ] +} +``` + +**Exception — `citation_semantic_check.json` uses `reviews` instead of `records`** per `schemas.md` §9.1. Same wrapper shape, different array key; the records inside are Council review entries, not REQ/UC/BUG records. 
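Writing a wrapper-conformant manifest can be sketched as follows. This is an illustration, not the shipped tooling; the `array_key` parameter is an invented convenience for the one `reviews` exception, and the version string must in practice be read from SKILL.md metadata:

```python
import json
from datetime import datetime, timezone

def write_manifest(path, records, skill_version, array_key="records"):
    """Write a schemas.md-style wrapper around a list of records.

    citation_semantic_check.json is the one file that must pass
    array_key="reviews"; every other manifest uses the default "records".
    Record shapes themselves are defined in schemas.md, not here.
    """
    wrapper = {
        "schema_version": skill_version,  # read from SKILL.md metadata; never hardcode
        "generated_at": datetime.now(timezone.utc).isoformat(timespec="seconds").replace("+00:00", "Z"),
        array_key: records,
    }
    with open(path, "w") as f:
        json.dump(wrapper, f, indent=2)
```

Centralizing the wrapper in one helper makes the "wrong array key" mistake a single-line fix instead of a per-manifest audit.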
If you find yourself writing `records` into `citation_semantic_check.json`, stop and re-read schemas.md §9 — the gate rejects this file as a manifest-wrapper violation (schemas.md §10 invariant #13) when the key is wrong. + +`schema_version` MUST equal this skill's `metadata.version` at generation time — read it from SKILL.md, do not hardcode. `generated_at` uses `datetime.now(timezone.utc).isoformat(timespec="seconds").replace("+00:00", "Z")`. Record shapes and invariants are defined in `schemas.md` — do not redefine them in this skill. `quality_gate.py` (Phase 5/6) validates manifests field-by-field against those schemas. + +**REQUIREMENTS.md rendering convention.** REQUIREMENTS.md is organized by `functional_section`. Each section opens with a brief LLM-written intro paragraph describing what that functional area does, then lists the REQs under it (ordered by REQ id). Use cases render their `formal_doc_refs` but DO NOT list `requirements[]` — traceability is one-way REQ → UC, and the reverse direction is derived at render time by querying REQ records. + +### File 1: `quality/QUALITY.md` — Quality Constitution + +**Read `references/constitution.md`** for the full template and examples. + +The constitution has six sections: + +1. **Purpose** — What quality means for this project, grounded in Deming (built in, not inspected), Juran (fitness for use), Crosby (quality is free). Apply these specifically: what does "fitness for use" mean for *this system*? Not "tests pass" but the actual operational requirement. +2. **Coverage Targets** — Table mapping each subsystem to a target with rationale referencing real risks. Every target must have a "why" grounded in a specific scenario — without it, a future AI session will argue the target down. +3. **Coverage Theater Prevention** — Project-specific examples of fake tests, derived from what you saw during exploration. 
(Why: AI-generated tests often pad coverage numbers without catching real bugs — asserting that imports worked, that dicts have keys, or that mocks return what they were configured to return. Calling this out explicitly stops the pattern.) +4. **Fitness-to-Purpose Scenarios** — The heart of it. Each scenario documents a realistic failure mode with code references and verification method. Aim for 2+ scenarios per core module — typically 8–10 total for a medium project, fewer for small projects, more for complex ones. Quality matters more than count: a scenario that precisely captures a real architectural vulnerability is worth more than three generic ones. (Why: Coverage percentages tell you how much code ran, not whether it ran correctly. A system can have 95% coverage and still lose records silently. Fitness scenarios define what "working correctly" actually means in concrete terms that no one can argue down.) +5. **AI Session Quality Discipline** — Rules every AI session must follow +6. **The Human Gate** — Things requiring human judgment + +**Scenario voice is critical.** Write "What happened" as architectural vulnerability analyses with specific quantities, cascade consequences, and detection difficulty — not as abstract specifications. "Because `save_state()` lacks an atomic rename pattern, a mid-write crash during a 10,000-record batch will leave a corrupted state file — the next run gets JSONDecodeError and cannot resume. At scale, this risks silent loss of 1,693+ records with no detection mechanism." An AI session reading that will not argue the standard down. Use your knowledge of similar systems to generate realistic failure scenarios, then ground them in the actual code you explored. Scenarios come from both code exploration AND domain knowledge about what goes wrong in systems like this. + +Every scenario's "How to verify" must map to at least one test in the functional test file. 
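To make the scenario-to-test mapping concrete, here is a hypothetical sketch in Python: a fitness scenario about mid-write corruption (echoing the `save_state()` example above) rendered as one matching functional test. The `save_state` implementation and names are illustrative, not from any real project:

```python
import json
import os
import tempfile

def save_state(records, path):
    """Illustrative atomic-rename pattern the scenario demands: write to a
    temp file in the same directory, fsync, then rename over the target."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(records, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic on POSIX: readers never see a partial file

def test_scenario_mid_write_crash_leaves_valid_state(tmp_dir="."):
    """QUALITY.md scenario: 'mid-write crash during a 10,000-record batch
    leaves a corrupted state file'. How to verify: the target file is only
    ever replaced atomically, so a crash before os.replace leaves the
    previous state intact and parseable."""
    path = os.path.join(tmp_dir, "state.json")
    save_state([{"id": 1}], path)
    with open(path) as f:
        assert json.load(f) == [{"id": 1}]
```

Note the test name and docstring mirror the scenario text, which is what makes the 1:1 mapping auditable.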
+ +### File 2: Functional Tests + +**This is the most important deliverable.** Read `references/functional_tests.md` for the complete guide. + +Organize the tests into three logical groups (classes, describe blocks, modules, or whatever the test framework uses): + +- **Spec requirements** — One test per testable spec section. Each test's documentation cites the spec requirement it verifies. +- **Fitness scenarios** — One test per QUALITY.md scenario. 1:1 mapping, named to match. +- **Boundaries and edge cases** — One test per defensive pattern from Step 5. + +Key rules: +- **Match the existing import pattern exactly.** Read how existing tests import project modules and do the same thing. Getting this wrong means every test fails. +- **Read every function's signature before calling it.** Read the actual `def` line — parameter names, types, defaults. Read real data files from the project to understand data shapes. Do not guess at function parameters or fixture structures. +- **No placeholder tests.** Every test must import and call actual project code. If the body is `pass` or the assertion is trivial (`assert isinstance(x, list)`), delete it. A test that doesn't exercise project code inflates the count and creates false confidence. +- **Test count heuristic** = (testable spec sections) + (QUALITY.md scenarios) + (defensive patterns). For a medium project (5–15 source files), this typically yields 35–50 tests. Significantly fewer suggests missed requirements or shallow exploration. Significantly more is fine if every test is meaningful — don't pad to hit a number. +- **Cross-variant heuristic: ~30%** — If the project handles multiple input types, aim for roughly 30% of tests parametrized across all variants. The exact percentage matters less than ensuring every cross-cutting property is tested across all variants. +- **Test outcomes, not mechanisms** — Assert what the spec says should happen, not how the code implements it. 
+- **Use schema-valid mutations** — Boundary tests must use values the schema accepts (from Step 5b), not values it rejects. + +### File 3: `quality/RUN_CODE_REVIEW.md` + +**Read `references/review_protocols.md`** for the template. + +The code review protocol has three passes. Each pass runs independently — a fresh session with no shared context except the requirements document. This clean separation prevents cross-contamination between structural review and requirement-based review. + +**Pass 1 — Structural Review.** Read the code and spot anomalies. This is what every AI code review tool already does well. No requirements, no focus areas — just the model's own knowledge of code correctness. Keep these mandatory guardrails: + +- Line numbers are mandatory — no line number, no finding +- Read function bodies, not just signatures +- If unsure: flag as QUESTION, not BUG +- Grep before claiming missing +- Do NOT suggest style changes — only flag things that are incorrect + +**Minimum required Pass 1 scrutiny areas (address each explicitly):** + +1. **Input validation and boundary handling** — check every trust boundary where external or caller-supplied data enters the code. Every string parser, enum lookup, and binary-format parser must reject input that shares a valid prefix with a valid token but contains additional characters. +2. **Resource lifecycle** — allocation, refcount management, error-path cleanup, lock release on failure, file descriptor/handle lifetime. Every function that acquires a reference or resource must release it on every early-exit path, or must complete all validation before acquiring the resource. +3. **Concurrency and state management** — lock ordering, atomic operation correctness (every atomic modification of a shared state word must use read-modify-write semantics and preserve bits outside the intended modification), state machine completeness (all states handled at all consumers). +4. 
**Unit and encoding correctness** — every field read from hardware, protocol structures, or user input that has defined units must be converted correctly before use in calculations or comparisons. +5. **Enumeration and whitelist completeness** — when a function uses a `switch`/`case`, `match`, if-else chain, or any branching construct to handle a set of named constants (feature bits, enum values, event types, command codes, permission flags), perform a **mechanical enumeration check**: + + (a) **List A (code extraction):** If a `quality/mechanical/_cases.txt` artifact exists for this function, use it as the authoritative code-side list — do not re-extract manually. If no mechanical artifact exists, extract every branch/case label actually present in the code. List each with its exact line number: "line 3511: `case VIRTIO_RING_F_INDIRECT_DESC`", "line 3513: `case VIRTIO_RING_F_EVENT_IDX`", etc. **Extract this list from the code only — do not copy from REQUIREMENTS.md, CONTRACTS.md, or any other generated artifact.** If you cannot cite a line number for a case label, it is not present. + + (b) **List B (spec extraction):** List every constant defined in the relevant header, enum, or spec that *should* be handled. + + (c) **Diff:** Compare the two lists. For each constant in List B, mark it as "FOUND (line NNN)" or "NOT IN CODE." Report any constants that are defined but not handled. + + **Do not assert that a whitelist "covers all values" or "preserves supported bits" without performing this two-list comparison.** AI models reliably hallucinate completeness for switch/case constructs — the model sees the function, sees the constants defined elsewhere, and assumes coverage without checking each case label. The most dangerous form of this hallucination is copying from an upstream artifact (like REQUIREMENTS.md) that asserts a constant is present, rather than extracting from the code. 
In v1.3.17, the code review's "case labels present" list was word-for-word identical to the requirements list — proving it was copied rather than extracted. The mechanical check with per-label line numbers is the fix. + +These five areas must appear as labeled subsections in the Pass 1 report. If a project has no meaningful concurrency, say so explicitly and document why rather than omitting the section. Add project-specific scrutiny areas beyond these five as warranted. + +Pass 1 catches ~65% of real defects: race conditions, null pointer hazards, resource leaks, off-by-one errors, type mismatches — structural problems visible in the code. + +**Pass 2 — Requirement Verification.** For each testable requirement derived in Step 7 of Phase 1, check whether the code satisfies it. For each requirement, either show the code that satisfies it or explain specifically why it doesn't. This is a pure verification pass — the reviewer's only job is "does the code satisfy this requirement?" Not a general review. Not looking for other bugs. Just verification. + +**Minimum evidence rule:** Pass 2 must cite at least one code location (file:line or file:function) **per requirement**. Blanket satisfaction claims like "REQ-003 through REQ-012 — satisfied by the client paths reviewed during the pass" without per-requirement code citations do not satisfy Pass 2. If two or three requirements are satisfied by the same function, cite the function once and list those specific requirements — but each requirement must appear individually with its own SATISFIED/VIOLATED verdict, not as part of an unverified range. A group of more than three requirements under a single citation is a sign that the verification was superficial. The point is traceability — a reviewer reading Pass 2 should be able to follow the evidence chain from any single requirement to the code that satisfies it without re-reading the entire codebase. 
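The two-list enumeration check from Pass 1 scrutiny area 5 can be sketched mechanically. This is an illustrative Python sketch; the regex handles only simple C-style `case LABEL:` lines, and a real extraction must match the project's actual syntax:

```python
import re

def extract_case_labels(source: str):
    """List A: extract each `case X:` label with its 1-based line number,
    from the code only -- never copied from a generated artifact."""
    labels = {}
    for lineno, line in enumerate(source.splitlines(), start=1):
        m = re.match(r"\s*case\s+([A-Za-z_][A-Za-z0-9_]*)\s*:", line)
        if m:
            labels[m.group(1)] = lineno
    return labels

def enumeration_diff(code_labels, spec_constants):
    """Diff List B (spec) against List A (code): FOUND (line N) or NOT IN CODE."""
    return {c: (f"FOUND (line {code_labels[c]})" if c in code_labels
                else "NOT IN CODE")
            for c in spec_constants}

snippet = """switch (feature) {
case VIRTIO_RING_F_INDIRECT_DESC:
case VIRTIO_RING_F_EVENT_IDX:
    break;
}"""
spec = ["VIRTIO_RING_F_INDIRECT_DESC", "VIRTIO_RING_F_EVENT_IDX",
        "VIRTIO_F_RING_RESET"]
report = enumeration_diff(extract_case_labels(snippet), spec)
# report flags VIRTIO_F_RING_RESET as NOT IN CODE -- the hallucinated constant
```

Because every FOUND entry carries a line number extracted from the source, a copied list (which has no line numbers to cite) cannot pass this check.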
+ +**Enumeration completeness claims require mechanical proof.** When evaluating a requirement that involves a whitelist, lookup table, feature-bit set, handler registry, or any claim of the form "all X are covered by Y," the reviewer must perform the two-list enumeration check from Pass 1 scrutiny area 5: extract every item from the code (with line numbers), extract every item from the spec, and diff. **The code-side list must be extracted fresh from the source — do not reuse any list from REQUIREMENTS.md, CONTRACTS.md, the code review prompt, or any other generated artifact.** If the code-side list matches the requirements list word-for-word, that is a sign the list was copied rather than extracted, and the check must be redone. + +Do not mark such a requirement SATISFIED based on reading the function and believing it handles everything — that is the specific hallucination pattern this rule prevents. Example: a requirement says "the transport feature whitelist must preserve all supported ring features." The reviewer reads `vring_transport_features()` and sees it has a switch/case. The correct check: extract each case label with its line number (`line 3511: INDIRECT_DESC`, `line 3513: EVENT_IDX`, ..., `line 3527: default`), then list the header constants, then diff. The hallucination: "the whitelist preserves supported bits including VIRTIO_F_RING_RESET" without checking that RING_RESET actually appears as a case label. This exact failure mode has been observed in practice across multiple versions — the model asserted coverage of a constant that was absent from the switch, and in v1.3.17, the code review's "case labels present" list was copied from the requirements rather than extracted from the code, causing three independent spec auditors to inherit the false claim. + +Pass 2 catches violations of individual requirements — cases where the code doesn't do what the specification says it should. 
This finds bugs that structural review misses because the code that IS there is correct; the bug is what's missing or what doesn't match the spec. + +**Pass 3 — Cross-Requirement Consistency.** Compare pairs of requirements that reference the same field, constant, range, or security policy. For each pair, verify that their constraints are mutually consistent. Do numeric ranges match bit widths? Do security policies propagate to all connection types? Do validation bounds in one file agree with encoding limits in another? + +Pass 3 catches contradictions where two individually-correct pieces of code disagree about a shared constraint. These bugs are invisible to both structural review and per-requirement verification because each requirement IS satisfied individually — the bug only appears when you compare them. This is the pass that catches cross-file arithmetic bugs and design gaps where a security configuration doesn't propagate to all connection paths. + +**Source code boundary rule:** The playbook must never modify files outside the `quality/` directory. All source-tree changes — bug fixes, test additions to the project's own test suite — must be expressed as `git diff`-format patch files saved under `quality/patches/`. This ensures the original source tree remains untouched, patches are reviewable and reversible, and the playbook's findings are cleanly separable from the code it audited. + +**BUGS.md:** After all review and audit phases, generate `quality/BUGS.md` — a consolidated bug report with full reproduction details for each confirmed bug. For each bug, include: bug ID, source (code review or spec audit), file:line, description, severity, minimal reproduction scenario (what input or sequence triggers the bug), expected vs actual behavior, references to the regression test and any proposed fix patch, and **spec basis**. 
+ +**BUGS.md — v1.5.3 BUG record fields (schemas.md §8).** Every BUGS.md entry (and every `quality/writeups/BUG-NNN.md`) corresponds to a BUG record written to `quality/bugs_manifest.json`. In addition to the narrative conventions above, every BUG carries: + +- `id` — `BUG-NNN` zero-padded three-digit sequence. +- `divergence_description` — one-paragraph summary of the divergence between documented intent and code behavior. +- `documented_intent` — direct quote or close paraphrase of the REQ / spec language. +- `code_behavior` — what the code actually does, with `file:line` references. +- `disposition` — enum from schemas.md §3.2: `code-fix` | `spec-fix` | `upstream-spec-issue` | `mis-read` | `deferred`. Required, non-null. Do NOT invent new values. +- `disposition_rationale` — one paragraph explaining why THIS disposition and not an adjacent one (e.g., why `code-fix` and not `upstream-spec-issue`, or why `spec-fix` and not `mis-read`). Formulaic rationales ("code is wrong because spec says so") fail the Council review. +- `req_id` — singular. The primary REQ that revealed the divergence. If a bug appears to touch multiple REQs, split into one bug per REQ sharing a root cause and cross-link in `disposition_rationale` (schemas.md §8.1). Do NOT smuggle multiple REQ IDs into one entry. +- `proposed_fix` — required unless `disposition == "mis-read"`. Patch-shaped when `fix_type` ∈ {`code`, `both`}; textual redline when `fix_type == "spec"`. For `mis-read` records the field is optional and, when present, documents the re-read (what the playbook misread and how the correct reading was established), not a shipped change. +- `fix_type` — enum from schemas.md §3.4: `code` | `spec` | `both`. The combination of `disposition × fix_type` is constrained by the legal-combinations matrix in §3.4 (enforced by §10 invariant #12). Illegal pairings: `code-fix` × `spec`, `spec-fix` × `code`, `upstream-spec-issue` × `code`, `mis-read` × `both`. 
Consult the matrix before authoring the record — the gate rejects illegal combinations. + +**Divergence framing (writeup voice, schemas.md v1.5.3).** A defect is a divergence between documented intent and code implementation — not a judgment about whether the code is "good." The writeup's opening sections (Summary, Spec reference, The code) are the human rendering of `divergence_description`, `documented_intent`, and `code_behavior`. Write them to read as a **side-by-side diff**, not a narrative. The reader should be able to scan the REQ/spec language next to the code behavior and see the gap immediately. An adversarial tone ("the code is sloppy") or value judgment in the title ("Sloppy trailer handling") fails this framing — a bug is an observation of a divergence, not an accusation. Upstream maintainers engage with disposition (code-fix vs spec-fix vs upstream-spec-issue) rather than defending the code against a critique. + +**What counts as sufficient evidence to confirm a bug (critical).** A code-path trace that demonstrates a specific behavioral violation IS sufficient evidence to confirm a bug. You do NOT need executed request-level evidence, a running test, or an integration-level reproduction to promote a finding from candidate to confirmed. Specifically: + +- A code-path trace showing function A calls function B which does X but should do Y, with file:line references — **sufficient to confirm**. +- A missing case/branch identified by enumeration comparison (spec says X should be handled, code has no handler for X) — **sufficient to confirm**. +- A requirement violation identified in Pass 2 where the code demonstrably does not implement the specified behavior — **sufficient to confirm**. +- A domain-knowledge finding where you can trace from input through specific code to wrong output — **sufficient to confirm**. 
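The disposition and fix_type constraints described above can be sketched as a pre-gate check. Only the four illegal pairings named in the text are encoded here; the authoritative matrix lives in schemas.md §3.4 and is enforced by `quality_gate.py`, and the function name is illustrative:

```python
# Enums per schemas.md §3.2 / §3.4 as quoted in the text above.
DISPOSITIONS = {"code-fix", "spec-fix", "upstream-spec-issue", "mis-read", "deferred"}
FIX_TYPES = {"code", "spec", "both"}
ILLEGAL = {  # the four illegal pairings named in the text
    ("code-fix", "spec"),
    ("spec-fix", "code"),
    ("upstream-spec-issue", "code"),
    ("mis-read", "both"),
}

def check_bug_record(bug: dict):
    """Return a list of violations for one BUG record (empty list = legal)."""
    problems = []
    if bug.get("disposition") not in DISPOSITIONS:
        problems.append("disposition not in schemas.md enum")
    if bug.get("fix_type") not in FIX_TYPES:
        problems.append("fix_type not in schemas.md enum")
    if (bug.get("disposition"), bug.get("fix_type")) in ILLEGAL:
        problems.append("illegal disposition x fix_type pairing")
    return problems

assert check_bug_record({"disposition": "code-fix", "fix_type": "code"}) == []
assert check_bug_record({"disposition": "code-fix", "fix_type": "spec"}) != []
```

Running this before authoring the record catches illegal combinations earlier than the Phase 5/6 gate does.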
+ +Do NOT demand "executed request-level evidence" or defer findings because "they require runtime testing to distinguish implementation choice from spec gap." If the spec or documentation says the behavior should be X, and the code demonstrably produces Y (traceable through the code path), that is a confirmed bug — not a candidate awaiting runtime validation. The regression test and TDD protocol exist to provide runtime evidence AFTER confirmation, not as a prerequisite FOR confirmation. + +**Why this rule exists:** In v1.3.43 javalin benchmarking, the code review and triage both identified 4 legitimate candidate bugs with code-path traces and requirement violations, then demoted all of them because "the highest-confidence items still require executed request-level evidence." This produced zero confirmed bugs from a codebase where previous versions found 5. The evidentiary bar was set at runtime-proof-before-confirmation, which is backwards — the playbook's design is confirm-then-prove-with-TDD. + +**Severity calibration:** Credential leakage, authentication bypass, and injection-class bugs are always high severity regardless of assessed likelihood. Authorization header exposure across trust boundaries (e.g., cross-domain redirects) is credential leakage. When in doubt about security-relevant severity, default to high. + +**Development-scaffolding exclusion (early filter before confirming any security-class finding).** The severity calibration rule above auto-escalates credential leakage and authentication bypass findings. Before confirming any such finding as a bug, apply this mechanical test: + +> **Does the value under scrutiny contain a self-documenting marker — words like `change-me`, `replace-me`, `placeholder`, `example`, `default`, `dummy`, `TODO`, `FIXME`, `your-secret-here`, `insert-`, or `set-this-` — that explicitly tells the deployer to replace it?** + +If yes: **this is not a bug.** Do not log it. 
A self-documenting development placeholder is scaffolding, not a defect — it exists to make the project build and run locally without configuration. The same logic applies to test fixtures, example configs, seed data, and any value whose name or surrounding comments declare it non-production. + +This is an early filter that catches the most obvious false positives at confirmation time. The challenge gate in Phase 5 (see `references/challenge_gate.md`) is the broader mechanism that catches subtler cases — documented feature gaps, design decisions with WHY annotations, and findings where the "expected behavior" was invented by the auditor. Any security-class finding that passes the scaffolding exclusion will still be challenge-gated in Phase 5 if it matches an auto-trigger pattern. + +**Spec basis (mandatory field per bug):** Cite the specific documentation passage that establishes the expected behavior — the gathered doc filename, section/page, and the behavioral contract it defines. If no gathered doc covers the behavior, check whether the project's own comments, README, or API docs define it. If no documentation exists for the expected behavior, classify the bug as a "code inconsistency" rather than a "spec violation" and note this in the severity assessment. A spec violation is a stronger finding than a code inconsistency — it means the code contradicts an authoritative source, not just that the code looks wrong. This distinction matters when reporting upstream: maintainers respond to "your code violates section X.Y of your own spec" differently than "this looks like it might be a bug." + +**Patch files (mandatory for every confirmed bug).** For each confirmed bug, generate: +- `quality/patches/BUG-NNN-regression-test.patch` — a `git diff` that adds a test demonstrating the bug. **This patch is mandatory, not optional.** It is the strongest evidence a bug exists — independent of any opinion about the fix. 
A confirmed bug without a regression-test patch is incomplete and will cause `quality_gate.py` to FAIL. Generate this patch immediately after confirming the bug, before moving to the next bug. +- `quality/patches/BUG-NNN-fix.patch` (optional but strongly encouraged) — a `git diff` with the proposed fix. For bugs where the fix is a single-line or few-line change (e.g., adding a case label, fixing an argument), generate the fix patch — these are low-effort and high-value. + +**How to generate patch files.** Use `git diff` format. The simplest approach: write the patch content directly as a unified diff. Example for a regression test patch: + +``` +--- /dev/null ++++ b/quality/test_regression_virtio.c +@@ -0,0 +1,15 @@ ++// Generated by Quality Playbook v1.5.6 ++// Regression test for BUG-004: VIRTIO_F_RING_RESET missing from vring_transport_features() ++#include <...> ++#include <...> ++... +``` + +For fix patches that modify existing source files, use the `--- a/path` / `+++ b/path` format with correct line offsets. If you cannot determine exact line offsets, generate the patch content and note "offsets approximate" — an approximate patch is more valuable than no patch. + +Patches must apply cleanly against the original source tree with `git apply`. Do not modify the source tree directly. + +**Patch validation gate (mandatory).** Before declaring any bug as confirmed with a fix patch, run this gate: + +1. **Apply test:** `git apply --check quality/patches/BUG-NNN-regression-test.patch` — must exit 0. +2. **Apply test + fix:** `git apply --check quality/patches/BUG-NNN-fix.patch` — must exit 0 (test against clean tree, not against regression-test-applied tree, unless the fix patch depends on the regression test). +3. **Compile check:** After applying both patches, run the project's build/compile command (e.g., `go build ./...`, `mvn compile`, `cargo check`, `tsc --noEmit`). Must succeed. + +**Temporary worktree for step 3.** Steps 1–2 use `--check` (non-destructive).
Step 3 requires actually applying patches and compiling, which modifies the source tree. To comply with the source code boundary rule ("never modify files outside `quality/`"), run step 3 in a disposable worktree: + +```bash +git worktree add /tmp/qpb-patch-check HEAD --quiet +cd /tmp/qpb-patch-check +git apply quality/patches/BUG-NNN-regression-test.patch quality/patches/BUG-NNN-fix.patch +# step 3: run the project's build/compile command here, e.g. go build ./... or cargo check + +cd - +git worktree remove /tmp/qpb-patch-check --force +``` + +If `git worktree` is unavailable (shallow clone, detached HEAD), use `git stash && git apply ... && git checkout . && git stash pop` as a fallback, or accept `--check`-only validation and note the limitation. + +**Compile check for interpreted languages.** The compile command varies by ecosystem: +- **Go:** `go build ./...` +- **Rust:** `cargo check` +- **Java/Kotlin (Maven):** `mvn compile -q` +- **Java/Kotlin (Gradle):** `./gradlew compileJava compileTestJava -q` +- **TypeScript:** `tsc --noEmit` +- **Python:** `python -m py_compile <file>` for syntax, then `pytest --collect-only -q` for import/discovery validation +- **JavaScript (Node.js):** `node --check <file>` for syntax; if the project uses ESLint, `npx eslint <file>` for structural issues +- **JavaScript (Mocha/Jest):** Run the specific test in discovery-only mode (`mocha --dry-run` or `jest --listTests`) to verify it loads without errors + +If no compile/syntax check is feasible for the project's language, document this in the patch entry and rely on the TDD red phase to catch syntax errors. + +If any step fails, fix the patch before recording the bug as confirmed. A bug with a corrupt patch that won't apply is not a confirmed bug — it's a hypothesis with broken evidence. The TDD red-green cycle cannot run on patches that don't apply, and reporting a bug with a patch that does not apply undermines credibility with upstream maintainers.
Common patch failures: truncated hunks (missing closing braces), wrong line offsets (patch generated against modified tree instead of clean tree), and syntax errors in generated test code. + +**Fix patch requirement.** Every confirmed bug must have either: +- A `quality/patches/BUG-NNN-fix.patch` that passes the validation gate above, OR +- An explicit justification in BUGS.md explaining why no fix patch is provided (e.g., "fix requires architectural change beyond patch scope," "multiple valid fix strategies — deferring to maintainer judgment," "bug is in upstream dependency"). + +A bug with a regression test but no fix patch and no justification is incomplete. The regression test proves the bug exists; the fix patch (or justification for its absence) completes the evidence chain. Bugs without fix patches cannot achieve "TDD verified (FAIL→PASS)" status — they remain at "confirmed open (xfail)" until a fix is provided. + +**TDD verification cycle:** Each confirmed bug with a fix patch should go through the red-green TDD cycle (test fails on unpatched code, passes after fix). This is executed via the `quality/RUN_TDD_TESTS.md` protocol (File 7), not inline during the code review. The protocol generates spec-grounded tests where every assertion message, variable name, and comment traces back to gathered documentation. + +**After all three passes:** Combine findings. Write regression tests in `quality/test_regression.*` that reproduce each confirmed bug. Use the same test framework as `test_functional.*` — if functional tests use pytest, regression tests use pytest (with `@pytest.mark.xfail(strict=True)`); if functional tests use unittest, regression tests use unittest (with `@unittest.expectedFailure`). Report results as a confirmation table (BUG CONFIRMED / FALSE POSITIVE / NEEDS INVESTIGATION). See `references/review_protocols.md` for the full three-pass template and regression test protocol. 
+ +**Regression test skip guards (mandatory).** Every regression test in `quality/test_regression.*` must include a skip/xfail guard so that running the full test suite on unpatched code does not produce unexpected failures. The guard must be the **earliest syntactic guard for the framework** — a decorator or annotation where idiomatic, otherwise the first executable line in the test body. Use the language-appropriate mechanism: + +- **Python (pytest):** `@pytest.mark.xfail(strict=True, reason="BUG-NNN: [description]")` — placed as a **decorator above** `def test_...():`, not inside the function body. When the bug is present, the test fails → XFAIL (expected). When the bug is fixed but the marker isn't removed, the test passes → XPASS → strict mode makes this a failure, signaling the guard should be removed. +- **Python (unittest):** `@unittest.expectedFailure` — decorator above the test method. +- **Go:** `t.Skip("BUG-NNN: [description] — unskip after applying quality/patches/BUG-NNN-fix.patch")` — first line inside the test function. Note: Go's `t.Skip` hides the test entirely (reports SKIP, not FAIL), which is weaker evidence than Python's xfail. This is a known limitation of Go's test primitives. +- **Java (JUnit 5):** `@Disabled("BUG-NNN: [description]")` — annotation above the test method. +- **Rust:** `#[ignore]` attribute on the test function (the standard "don't run in default suite" mechanism). Use `#[should_panic]` only for bugs that manifest as panics; use `compile_fail` doctest annotation only for compile-time bugs. +- **TypeScript/JavaScript (Jest):** `test.failing("BUG-NNN: [description]", () => { ... })` +- **TypeScript/JavaScript (Vitest):** `test.fails("BUG-NNN: [description]", () => { ... })` +- **JavaScript (Mocha):** `it.skip("BUG-NNN: [description]", () => { ... })` or `this.skip()` inside the test body for conditional skipping. 
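As a concrete illustration of the guard discipline, here is a hypothetical source-inspection regression test using the stdlib `unittest` variant (`@unittest.expectedFailure`). BUG-004 and the snippet echo the examples earlier in this document; `UNPATCHED_SOURCE` stands in for reading the real function body from the source tree:

```python
import io
import unittest

# Stand-in for the real function body read from the unpatched source tree.
UNPATCHED_SOURCE = """switch (feature) {
case VIRTIO_RING_F_INDIRECT_DESC:
case VIRTIO_RING_F_EVENT_IDX:
    break;
}"""

class RegressionTests(unittest.TestCase):
    @unittest.expectedFailure  # BUG-004: remove once quality/patches/BUG-004-fix.patch is merged
    def test_bug_004_transport_feature_whitelist_keeps_ring_reset(self):
        # Executes (no run=False): while the bug is present this fails,
        # which the runner reports as an expected failure, not a suite failure.
        self.assertIn("case VIRTIO_F_RING_RESET:", UNPATCHED_SOURCE)

# Run the suite: one expected failure, suite still "successful".
result = unittest.TextTestRunner(stream=io.StringIO()).run(
    unittest.TestLoader().loadTestsFromTestCase(RegressionTests))
```

Note that `@unittest.expectedFailure` lacks pytest's strict-XPASS behavior: if the bug is fixed and the decorator lingers, unittest reports an unexpected success rather than failing the suite, which is why the text above calls pytest's `xfail(strict=True)` the stronger guard.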
+ +When a bug is fixed (fix patch applied permanently), remove the skip guard and update the BUG tracker closure status from "confirmed open" to "fixed (test passes)". The skip guard message must reference the bug ID and the fix patch path so that someone encountering a skipped test knows exactly how to resolve it. + +**Source-inspection tests must execute (no `run=False`).** Regression tests that verify source-file structure — string presence in function bodies, case label existence, enum extraction, generated-code shape checks — are safe, deterministic, and fast. They read repository files and perform string matches. For these tests, use `@pytest.mark.xfail(strict=True)` with execution enabled. **Do not use `run=False`** unless the test would mutate external state, hang, or require unavailable infrastructure. A source-inspection test with `run=False` is the worst possible state: the correct check exists but is inert. In v1.3.18, the regression test for BUG-004 (`test_bug_004_transport_feature_whitelist_keeps_ring_reset`) contained the correct assertion `assert "case VIRTIO_F_RING_RESET:" in func` but was marked `run=False` — so the test never executed, the assertion never fired, and the bug remained undetected despite the test suite "passing." When an `xfail(strict=True)` test actually executes and fails, the test suite reports it as XFAIL (expected failure) — this is correct behavior, not a suite failure. + +**TDD red/green interaction with skip guards.** During the TDD verification cycle, the red and green phases must temporarily bypass the skip guard to actually execute the test. The protocol should instruct the agent to: +- **Red phase (NEVER SKIPPED):** Remove or disable the skip/xfail guard, then run the test against unpatched code. It must fail. Re-enable the guard after recording the result. 
**The red phase is mandatory for every confirmed bug, even when no fix patch exists.** A bug without red-phase evidence is unverified — do not record `verdict: "skipped"` without a failing red run. If the red phase cannot execute for a documented reason (compilation failure, environment unavailable), record `red_phase: "error"` with an explanation in `notes`. +- **Green phase:** Remove or disable the guard, apply the fix patch, run the test. It must pass. If the fix will be reverted, re-enable the guard. **If no fix patch exists, record `green_phase: "skipped"` — but the red phase must still have run.** +- **After TDD cycle:** The guard remains in the committed regression test file. It is only permanently removed when the fix is merged into the source tree. + +**TDD execution enforcement (mandatory).** Regression tests must be actually executed during the TDD verification cycle, not just generated as patch files. For every confirmed bug, the red-phase test run must produce a log file at `quality/results/BUG-NNN.red.log` capturing the test output. The green-phase (if a fix patch exists) must produce `quality/results/BUG-NNN.green.log`. Each log file's first line must be a status tag: `RED` (test failed as expected), `GREEN` (test passed after fix), `NOT_RUN` (test could not be executed — with explanation), or `ERROR` (test infrastructure failed — with explanation). + +**Language-aware test execution commands.** Use the project's native test runner to execute regression tests. 
Detect the project language and use the appropriate command: + +- **Go:** `go test -v -run TestBugNNN ./path/to/package` +- **Python (pytest):** `python -m pytest -xvs quality/test_regression.py::test_bug_nnn` +- **Python (unittest):** `python -m unittest quality.test_regression.TestRegression.test_bug_nnn` +- **Java (Maven + JUnit):** `mvn test -pl module -Dtest=RegressionTest#testBugNnn` +- **Java (Gradle + JUnit):** `./gradlew test --tests RegressionTest.testBugNnn` +- **Rust:** `cargo test bug_nnn -- --nocapture` +- **TypeScript/JavaScript (Jest):** `npx jest --verbose --testNamePattern="BUG-NNN"` +- **TypeScript/JavaScript (Vitest):** `npx vitest run --reporter=verbose --testNamePattern="BUG-NNN"` +- **C (kernel/make-based):** Source-inspection tests via shell script (grep/awk on source files) — log the script output. + +If the project uses a language or test framework not listed above, use whatever test runner the project already uses (check for `Makefile`, `package.json`, `build.gradle`, `Cargo.toml`, `go.mod`, `setup.py`, `pyproject.toml`, etc.) and adapt the pattern. If no test runner is available or the language runtime is not installed, record `NOT_RUN` with an explanation — do not skip the log file entirely. + +**Log capture format.** Each `BUG-NNN.red.log` and `BUG-NNN.green.log` must follow this format: +``` +RED +--- Test output for BUG-NNN red phase --- +Command: [exact command run] +Exit code: [exit code] +[full stdout/stderr from test execution] +``` + +The status tag (`RED`, `GREEN`, `NOT_RUN`, `ERROR`) on the first line is machine-readable — `quality_gate.py` will check for its presence. The `NOT_RUN` status is acceptable when the test runner is unavailable (e.g., a C project where the kernel build environment is not present), but the log file must still exist with an explanation of why the test could not be executed. 
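The executing-source-inspection rule above can be sketched as a concrete test. This is a minimal illustration assuming a pytest project; the repo-root assumption, the `source_of` helper, and the file path are illustrative (the test name echoes the BUG-004 example):

```python
import pathlib

import pytest

REPO_ROOT = pathlib.Path(".")  # assumption: pytest is invoked from the repo root


def source_of(relpath: str) -> str:
    """Return the text of a repository source file (illustrative helper)."""
    return (REPO_ROOT / relpath).read_text()


# The test executes on every run (no run=False). While the bug is open,
# strict xfail reports XFAIL; once the fix lands, the unexpected pass
# (XPASS) fails the suite, which is the cue to remove the guard.
@pytest.mark.xfail(strict=True, reason="BUG-004: RING_RESET missing from whitelist")
def test_bug_004_transport_feature_whitelist_keeps_ring_reset():
    func = source_of("drivers/virtio/virtio_ring.c")  # illustrative path
    assert "case VIRTIO_F_RING_RESET:" in func
```

Contrast with the v1.3.18 failure: the same assertion existed but `run=False` kept it inert. Here the assertion fires on every suite run.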
+ +**Ready-to-run TDD log template.** For each confirmed BUG-NNN, execute this sequence (adapt the test command for the project's language per the table above): + +```bash +# ── Red phase: revert fix, run test, expect FAIL ── +git apply -R quality/patches/BUG-NNN-fix.patch 2>/dev/null # revert fix if applied +TEST_CMD="python -m pytest -xvs quality/test_regression.py::test_bug_nnn" # adapt per language +OUTPUT=$($TEST_CMD 2>&1); EXIT=$? +printf 'RED\n--- Test output for BUG-NNN red phase ---\nCommand: %s\nExit code: %d\n%s\n' \ + "$TEST_CMD" "$EXIT" "$OUTPUT" > quality/results/BUG-NNN.red.log + +# ── Green phase: apply fix, run test, expect PASS ── +git apply quality/patches/BUG-NNN-fix.patch +OUTPUT=$($TEST_CMD 2>&1); EXIT=$? +printf 'GREEN\n--- Test output for BUG-NNN green phase ---\nCommand: %s\nExit code: %d\n%s\n' \ + "$TEST_CMD" "$EXIT" "$OUTPUT" > quality/results/BUG-NNN.green.log +``` + +Run this for every confirmed bug. If the test runner is not available, create the log file with `NOT_RUN` on the first line and an explanation. Do not skip this step — the TDD Log Closure Gate in Phase 5 will block completion if logs are missing. + +**TDD execution gate.** Before the terminal gate in Phase 5, verify that for every confirmed bug in `quality/BUGS.md`, a corresponding `quality/results/BUG-NNN.red.log` exists. Bugs without red-phase logs are incomplete — the regression test patch exists but was never proven to detect the bug. This gate exists because v1.3.45 benchmarking showed that most repos generate regression test patches but never execute them, leaving the TDD verdict unverified. + +### File 4: `quality/RUN_INTEGRATION_TESTS.md` + +**Read `references/review_protocols.md`** for the template. + +Must include: safety constraints, pre-flight checks, test matrix with specific pass criteria, an execution UX section, and a structured reporting format. Cover happy path, cross-variant consistency, output correctness, and component boundaries. 
+ +**Use-case traceability (mandatory).** The test matrix must include a **use-case traceability column**. Each integration test group must either: + +1. **Map to a use case** — Name the use case (e.g., UC-03) it validates and describe how the test exercises the user outcome from that use case. These are primary integration tests — they verify that the end-to-end behavior described in the use case actually works. + +2. **Be labeled as infrastructure** — Tests that don't map to a use case (build validation, race detection, compatibility checks, existing test suite regression guards) are explicitly labeled `[Infrastructure]` in the traceability column. They have value but don't count toward use-case coverage. + +After generating the test matrix, check: does every use case in REQUIREMENTS.md have at least one integration test mapped to it? If not, flag the uncovered use case as a gap. Integration tests mapped to use cases should test the **end-to-end behavior** described in the use case — not just run existing unit tests that happen to touch the same code paths. For example, if a use case says "Developer authenticates and follows redirects without leaking secrets," the integration test should perform a redirect across domains with auth headers and verify they're stripped — not just run `pytest -k auth`. + +**Per-UC group splitting (mandatory).** Each integration test group must map to at most **2 use cases**. A group that maps to 3+ UCs is too coarse — it can't distinguish which use case failed when a test breaks. If a single test command (e.g., `mvn test`, `go test ./...`) would exercise multiple use cases, split it into separate groups with targeted test selectors (`-Dtest=`, `-run`, `-k`, `--tests`, `-- test_name`, etc.) so each group isolates 1–2 UCs. Groups covering all UCs in one undifferentiated command are explicitly prohibited — they provide no diagnostic value when a failure occurs. 
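The splitting rule above can be sketched as data. The group names, selectors, packages, and UC labels below are hypothetical, not from a real project:

```python
# One coarse command ("go test ./...") would cover UC-01..UC-03 in a single
# undifferentiated group, which the protocol prohibits. Split with targeted
# -run selectors so each group isolates at most 2 use cases.
GROUPS = [
    {"group": 1, "name": "Auth redirect stripping",
     "use_cases": ["UC-01", "UC-02"],
     "command": 'go test -run "TestAuthRedirect|TestHeaderStrip" ./internal/httpx'},
    {"group": 2, "name": "Token refresh flow",
     "use_cases": ["UC-03"],
     "command": 'go test -run "TestTokenRefresh" ./internal/auth'},
]


def groups_conformant(groups) -> bool:
    """Enforce the per-UC splitting rule: no group maps to 3+ use cases."""
    return all(1 <= len(g["use_cases"]) <= 2 for g in groups)
```

A group failing `groups_conformant` is the "too coarse" case: when it breaks, you cannot tell which use case regressed.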
+ +**No-selector fallback.** If the project's test framework cannot select tests at the granularity needed for splitting (e.g., a monolithic test suite with no tag/filter support), document the limitation in the integration protocol and use the narrowest feasible command. Record which UCs the group covers and why further splitting is not possible. **A single-command project must still use the grouped JSON schema** — wrap the command in one group with a `use_cases` list covering all UCs that command exercises. A flat list of commands is never a valid substitute for the `groups[]` structure. + +**Pre-flight command validation (mandatory).** Before finalizing `RUN_INTEGRATION_TESTS.md`, verify that each group's test command actually discovers and runs tests. Use the framework's dry-run or list mode to confirm: +- **Python:** `pytest --collect-only -q ` — must list at least one test +- **Go:** `go test -list "." ` — must list at least one test name +- **Java/Kotlin:** `mvn -Dtest= test -pl --batch-mode -DfailIfNoTests=true` +- **TypeScript (Vitest):** `vitest list --config ` — must list at least one test +- **TypeScript (Jest):** `jest --listTests ` — must list at least one file +- **Rust:** `cargo test -- --list` — must list at least one test +- **JavaScript (Mocha):** `mocha --dry-run ` — must list at least one test + +If the dry-run exits with "no tests found," "No test files found," or a zero-test count, fix the selector before recording the group. Common fixes: add `--config` or `--root` flags, use file paths instead of `-t` name patterns, anchor regex patterns to the right package. Do not record a group whose command fails discovery — it will produce a `covered_fail` result that masks a selector bug as a code bug. 
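The discovery check can be sketched as a small classifier. The zero-test markers vary by framework; the strings below cover pytest/Jest-style wording and are illustrative, not exhaustive:

```python
import subprocess

# Markers indicating the selector matched nothing (framework-dependent).
NO_TEST_MARKERS = ("no tests found", "No test files found", "collected 0 items")


def preflight(dry_run_cmd: list) -> str:
    """Classify a dry-run: 'ok', 'empty' (selector bug: fix the selector),
    or 'environment' (build/import error: needs environment setup)."""
    proc = subprocess.run(dry_run_cmd, capture_output=True, text=True)
    out = proc.stdout + proc.stderr
    # Check markers before the exit code: pytest exits nonzero when it
    # collects zero tests, and that is still a selector problem.
    if any(marker in out for marker in NO_TEST_MARKERS):
        return "empty"
    if proc.returncode != 0:
        return "environment"
    return "ok"
```

Only groups classified `ok` get recorded; an `empty` result means the selector needs fixing first.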
+ +If the dry-run fails with a **build error** (compilation failure, import error, missing dependency, test setup exception) rather than "no tests found," record the failure in the group's `notes` field as `"pre_flight_error": "environment"` and do not attempt to fix the selector. Environment errors during pre-flight require environment setup, not selector changes. + +**Infrastructure group definition.** A single `[Infrastructure]` group may cover build validation, race detection, static analysis, and platform compatibility checks without UC mapping. Infrastructure tests verify build toolchain and platform support, not user-observable behavior. Infrastructure groups: +- Do **not** count toward use-case coverage (the UC coverage check ignores them) +- Must include a one-line rationale explaining what they validate +- May **not** be used to relabel broad user-workflow commands to avoid splitting — if the tests exercise user-facing behavior described in a use case, they must be mapped to that UC regardless of how the test is organized + +**All commands must use relative paths.** The generated protocol should include a "Working Directory" section at the top stating that all commands run from the project root using relative paths. Never generate commands that `cd` to an absolute path — this breaks when the protocol is run from a different machine or directory. Use `./scripts/`, `./pipelines/`, `./quality/`, etc. + +**Include an Execution UX section.** When someone tells an AI agent to "run the integration tests," the agent needs to know how to present its work. The protocol should specify three phases: (1) show the plan as a numbered table before running anything, (2) report one-line progress updates as each test runs (`✓`/`✗`/`⧗`), (3) show a summary table with pass/fail counts and a recommendation. See `references/review_protocols.md` section "Execution UX" for the template and examples. Without this, the agent dumps raw output or stays silent — neither is useful. 
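The three-phase presentation can be sketched as small formatting helpers. The table layout and status symbols follow the section above; the exact wording is illustrative:

```python
def plan_table(groups) -> str:
    """Phase 1: the numbered plan, shown before anything runs."""
    header = ["| # | Group | Use cases |", "| - | ----- | --------- |"]
    rows = [f"| {g['group']} | {g['name']} | {', '.join(g['use_cases'])} |"
            for g in groups]
    return "\n".join(header + rows)


def progress_line(name: str, state: str) -> str:
    """Phase 2: one line per test as it runs."""
    return {"pass": "✓", "fail": "✗", "running": "⧗"}[state] + f" {name}"


def summary_line(passed: int, failed: int, recommendation: str) -> str:
    """Phase 3: counts plus a recommendation."""
    return f"{passed} passed, {failed} failed; recommendation: {recommendation}"
```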
+ +**Structured output (mandatory).** The protocol must instruct the agent to produce machine-readable results alongside the Markdown report, using **JUnit XML** for test execution and a **sidecar JSON** for QPB-specific metadata. + +**JUnit XML output:** Each test group should run with the framework's native JUnit XML reporter: +- Python: `pytest --junitxml=quality/results/integration-group-N.xml` +- Go: `gotestsum --junitxml quality/results/integration-group-N.xml -- -run "TestPattern"` +- Java/Kotlin: Copy Surefire XML reports to `quality/results/` +- TypeScript: `jest --reporters=jest-junit` with `JEST_JUNIT_OUTPUT_DIR=quality/results/` +- Rust: `cargo test 2>&1 | cargo2junit > quality/results/integration-group-N.xml` (if available) + +If the JUnit XML reporter is unavailable, skip XML and note `"junit_available": false` in the sidecar JSON. + +**Sidecar JSON:** Generate `quality/results/integration-results.json` by copying the template below verbatim and filling in only the values. Do not invent fields, rename keys, or restructure the schema. A flat list of commands without the `groups` array is **invalid** — even if the project runs all tests through a single command, wrap it in one group. + +```json +{ + "schema_version": "1.1", + "skill_version": "", + "date": "YYYY-MM-DD", + "project": "", + "recommendation": "SHIP", + "groups": [ + { + "group": 1, + "name": "Core routing dispatch", + "use_cases": ["UC-01", "UC-02"], + "result": "pass", + "tests_passed": 5, + "tests_failed": 0, + "junit_file": "integration-group-1.xml", + "junit_available": true, + "notes": "" + } + ], + "summary": { + "total_groups": 9, + "passed": 8, + "failed": 1, + "skipped": 0 + }, + "uc_coverage": { + "UC-01": "covered_pass", + "UC-02": "covered_pass", + "UC-03": "not_mapped" + } +} +``` + +**Required top-level fields:** `schema_version`, `skill_version`, `date`, `project`, `recommendation`, `groups`, `summary`, `uc_coverage`. 
If any of these fields are missing from your output, the result is non-conformant. + +**Invalid examples (do not emit these):** +- A flat `"results": [{"command": "go test ./...", "result": "pass"}]` — this is not the grouped schema. +- A schema with `"commands_run"` instead of `"groups"` — wrong key name. +- A schema missing `"uc_coverage"` — every use case from REQUIREMENTS.md must appear. +- A schema with `"use_case_traceability"` instead of `"use_cases"` — wrong field name. + +Valid `result` values: `"pass"`, `"fail"`, `"skipped"`, `"error"`. Valid `recommendation` values: `"SHIP"` (all groups pass), `"FIX BEFORE MERGE"` (failures in non-blocking groups), `"BLOCK"` (failures in critical groups). The `uc_coverage` section maps every use case from REQUIREMENTS.md to one of: `"covered_pass"` (at least one mapped group passed), `"covered_fail"` (groups mapped but all failed), or `"not_mapped"` (no integration test group maps to this use case). The distinction between `"covered_fail"` and `"not_mapped"` matters: the first means the test exists but the code is broken; the second means the test is missing. + +Runner scripts and CI tools should read the sidecar JSON for results rather than grepping the Markdown report. This eliminates the class of bugs where grep-based counting produces wrong numbers from matching words in prose. + +**Post-write validation (mandatory).** After writing `integration-results.json`, reopen the file and verify: (1) every required top-level field is present, (2) every `groups[]` entry has `group`, `name`, `use_cases`, `result`, and `notes`, (3) all `result` and `recommendation` values use only the allowed enum values listed above, (4) `uc_coverage` maps every use case from REQUIREMENTS.md, (5) no extra undocumented root keys exist. If any check fails, fix the file before proceeding. 
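The reopen-and-check pass can be sketched as below. The field and enum lists mirror the schema above; `quality_gate.py` performs the authoritative check, so treat this as an illustration, not the gate itself:

```python
import json

REQUIRED_TOP = {"schema_version", "skill_version", "date", "project",
                "recommendation", "groups", "summary", "uc_coverage"}
REQUIRED_GROUP = {"group", "name", "use_cases", "result", "notes"}
RESULTS = {"pass", "fail", "skipped", "error"}
RECS = {"SHIP", "FIX BEFORE MERGE", "BLOCK"}
UC_STATES = {"covered_pass", "covered_fail", "not_mapped"}


def validate(path: str) -> list:
    """Return a list of conformance problems (empty list = conformant)."""
    doc = json.load(open(path))
    problems = [f"missing top-level field: {f}" for f in REQUIRED_TOP - doc.keys()]
    problems += [f"extra root key: {k}" for k in doc.keys() - REQUIRED_TOP]
    for g in doc.get("groups", []):
        problems += [f"group missing field: {f}" for f in REQUIRED_GROUP - g.keys()]
        if g.get("result") not in RESULTS:
            problems.append(f"bad result: {g.get('result')!r}")
    if doc.get("recommendation") not in RECS:
        problems.append("bad recommendation")
    for uc, state in doc.get("uc_coverage", {}).items():
        if state not in UC_STATES:
            problems.append(f"bad uc_coverage state for {uc}")
    return problems
```

Note that extra keys are only flagged at the root; `groups[]` entries may carry the optional fields shown in the template.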
+ +**This protocol must exercise real external dependencies.** If the project talks to APIs, databases, or external services, the integration test protocol runs real end-to-end executions against those services — not just local validation checks. Design the test matrix around the project's actual execution modes and external dependencies. Look for API keys, provider abstractions, and existing integration test scripts during exploration and build on them. + +**Derive quality gates from the code, not generic checks.** Read validation rules, schema enums, and generation logic during exploration. Turn them into per-pipeline quality checks with specific fields and acceptable value ranges. "All units validated" is not enough — the protocol must verify domain-specific correctness. + +**Script parallelism, don't just describe it.** Group runs so independent executions (different providers) run concurrently. Include actual bash commands with `&` and `wait`. One run per provider at a time to avoid rate limits. + +**Calibrate unit counts to the project.** Read `chunk_size` or equivalent config. Use enough units to span at least 2 chunks and enough to verify distribution checks. Typically 10–30 for integration testing. + +**Deep post-run verification.** Don't stop at "process completed." Verify log files, manifest state, output data existence, sample record content, and any existing quality check scripts — for every run. + +**Find and use existing verification tools.** Search for existing scripts that verify output quality (e.g., `integration_checks.py`, validation scripts, quality gate functions). If they exist, call them from the protocol. If the project has a TUI or dashboard, include TUI verification commands (e.g., `--dump` flags) in the post-run checklist. + +**Build a Field Reference Table before writing quality gates.** This is the most important step for protocol accuracy. 
AI models confidently write wrong field names even after reading schemas — `document_id` becomes `doc_id`, `sentiment_score` becomes `sentiment`, `float 0-1` becomes `int 0-100`. The fix is procedural: **re-read each schema file IMMEDIATELY before writing each table row.** Do not rely on what you read earlier in the conversation — your memory of field names drifts over thousands of tokens. Copy field names character-for-character from the file contents. Include ALL fields from each schema (if the schema has 8 fields, the table has 8 rows). See `references/review_protocols.md` section "The Field Reference Table" for the full process and format. Do not skip this step — it prevents the single most common protocol inaccuracy. + +### File 5: `quality/RUN_SPEC_AUDIT.md` — Council of Three + +**Read `references/spec_audit.md`** for the full protocol. + +Three independent AI models audit the code against specifications. Why three? Because each model has different blind spots — in practice, different auditors catch different issues. Cross-referencing catches what any single model misses. + +The protocol defines: a copy-pasteable audit prompt with guardrails, project-specific scrutiny areas, a triage process (merge findings by confidence level), and fix execution rules (small batches by subsystem, not mega-prompts). + +**Secondary emphasis lenses:** Optionally assign each audit model a secondary emphasis — for example, one starts with input validation, one with resource lifecycle, one with concurrency. Each model still performs a full independent audit; the emphasis biases attention without restricting coverage. Do not split models into disjoint ownership by bug class. + +**Minority finding rule:** During triage, any finding where only one of three auditors flags it (a minority finding) requires a re-investigation — read the specific code location and make an explicit CONFIRMED/FALSE-POSITIVE determination rather than discarding by default. 
Minority findings are disproportionately likely to be real bugs that two models missed. + +**Triage must not raise the evidentiary bar above code-path analysis.** The triage step confirms or rejects findings — it does not defer them pending runtime evidence. If a finding includes a code-path trace showing a behavioral violation (function calls, missing branches, wrong return values with file:line references), the triage should confirm it. Do not demote code-path-traced findings to "candidate" or "needs runtime verification." The TDD protocol (Phase 5) provides runtime evidence AFTER confirmation. See "What counts as sufficient evidence to confirm a bug" in the BUGS.md section for the full evidentiary standard. + +**Code review vs spec audit conflicts:** If the code review and spec audit disagree on the same finding, the spec audit finding is not automatically correct. Deploy a verification probe — read the specific code location and determine which assessment is accurate. Record the resolution in the BUG tracker. A code review BUG not flagged by any spec auditor is still confirmed but should be verified with a targeted probe before closure. + +**Verification probes must produce executable evidence.** When the triage step confirms OR rejects a finding via verification probe, prose reasoning alone is not sufficient. The probe must produce a test assertion that mechanically proves the determination: + +- **For rejections** (finding is false positive): Write an assertion that PASSES, proving the finding is wrong. Example: if rejecting "function X is missing null check," write `assert "if (ptr == NULL)" in source_of("X"), "X has null check at line NNN"`. If you cannot write a passing assertion that proves your rejection, **do not reject the finding** — escalate it to confirmed or flag it for manual review. + +- **For confirmations** (finding is a real bug): Write an assertion that FAILS (expected-failure), proving the bug exists. 
Example: if confirming "RING_RESET missing from switch," write `assert "case VIRTIO_F_RING_RESET:" in source_of("vring_transport_features"), "RING_RESET should be in the switch but is not"`. + +- **Every assertion must cite an exact line number** for the evidence it references. Not "lines 3527-3528" but "line 3527: `default:`" — showing what the line actually contains. Assertions without line-number citations are insufficient. + +**Why this rule exists:** In v1.3.16 virtio testing, the triage correctly received a minority finding that `VIRTIO_F_RING_RESET` was missing from a switch/case whitelist. The triage performed a "verification probe" that claimed lines 3527-3528 "explicitly preserve VIRTIO_F_RING_RESET" — but those lines actually contained the `default:` branch. The triage hallucinated compliance with the code. Had it been required to write `assert "case VIRTIO_F_RING_RESET:" in source`, the assertion would have failed, exposing the hallucination. Requiring executable evidence for rejections makes hallucinated rejections self-defeating: the model cannot write a passing assertion for something that isn't in the code. + +**Triage evidence must be written to disk.** Verification probe assertions must appear in a file on disk — either appended to `quality/mechanical/verify.sh` or written to a dedicated `quality/spec_audits/triage_probes.sh`. Assertions described in the triage report prose but never written to an executable file are not executable evidence. The gate checks for the existence of probe assertions in the triage output; a triage report that says "verification probe confirms..." without a corresponding assertion in an executable file is non-conformant. This prevents the failure mode where the model narrates what a probe *would* show without actually running it. 
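An on-disk probe can be sketched as follows. The path and symbol reuse the v1.3.16 example above; the `line_containing` helper is illustrative (the protocol stores probes in an executable file such as `quality/spec_audits/triage_probes.sh`, and a Python probe file serves the same role):

```python
import pathlib

SRC = "drivers/virtio/virtio_ring.c"  # path from the v1.3.16 example


def line_containing(path: str, needle: str):
    """Return (line_number, stripped_line) for the first line containing
    needle, or (None, None). This produces the exact line-number citation
    the evidentiary rule requires."""
    for num, line in enumerate(pathlib.Path(path).read_text().splitlines(), 1):
        if needle in line:
            return num, line.strip()
    return None, None


def probe_bug_ring_reset():
    # Confirmation probe: expected to FAIL while the bug exists. A rejection
    # probe asserts the inverse and must PASS, citing what the line actually
    # contains (e.g. "line 3527: default:"), so a hallucinated rejection
    # cannot produce a passing assertion.
    num, text = line_containing(SRC, "case VIRTIO_F_RING_RESET:")
    assert num is not None, f"{SRC}: 'case VIRTIO_F_RING_RESET:' not present"
```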
+ +### File 6: `AGENTS.md` (orchestrator-generated; you do NOT write this in Phase 2) + +**v1.5.4 contract: `AGENTS.md` is generated by `bin/run_playbook.py` after Phase 6 succeeds, NOT by you during Phase 2.** The orchestrator's `_safe_write_agents_md` helper writes the file to the **target's repo root** (`/AGENTS.md`, not `quality/AGENTS.md`) using the `quality/` artifacts you produced in Phase 2 + the Phase 6 gate verdict as input. The generator carries a `` sentinel on the first non-empty line so subsequent runs detect QPB-managed copies. + +**What you (the Phase 2 LLM) must NOT do:** + +1. **Do not create `/AGENTS.md`.** The orchestrator owns it. Creating it from Phase 2 collides with the orchestrator's idempotent regeneration path and can be flagged by the source-unchanged invariant. +2. **Do not modify an existing `/AGENTS.md`.** If one exists at the target's repo root and lacks the QPB sentinel, it's operator-authored — leave it alone (the orchestrator will preserve it with a warning per `_safe_write_agents_md`'s `"preserved"` outcome). If it carries the QPB sentinel, the orchestrator will regenerate it post-Phase-6; you don't help by editing it mid-run. +3. **Do not write `quality/AGENTS.md`.** That's not the contract; AGENTS.md lives at the repo root, not under `quality/`. + +If you find yourself wanting to "Add a Quality Docs section to the existing AGENTS.md" — stop. The orchestrator does that for you, sourced from the canonical paths the Phase 6 gate validates. Your Phase 2 deliverables are the `quality/` artifacts and nothing else. + +**This bug fix prevents the bootstrap-test failure mode** that surfaced on 2026-04-30: a Phase 2 LLM read the v1.5.3-era "If `AGENTS.md` already exists, update it" instruction at this exact section, appended a Quality Docs section to the project's root AGENTS.md, and triggered the source-unchanged invariant — aborting the run and discarding 20 minutes of Phase 2 work. 
+ +### File 7: `quality/RUN_TDD_TESTS.md` — TDD Verification Protocol + +This protocol is executed after the code review and spec audit have confirmed bugs and generated fix patches. It runs the red-green TDD cycle for each confirmed bug: test fails on unpatched code, apply fix, test passes. + +**Why a separate protocol?** The code review finds bugs and writes regression tests with `xfail` markers. The TDD protocol takes those tests and proves they actually detect the bug — and that the fix actually fixes it. This is a stronger claim than "we found a bug and wrote a test." It's "here's a test that fails without the patch and passes with it." The distinction matters when reporting bugs upstream: maintainers trust a FAIL→PASS demonstration more than a bug description. + +The generated protocol must include: + +1. **Spec-grounded test requirements.** For each bug in `quality/BUGS.md`, the protocol instructs the agent to: + - Read the bug's **spec basis** field to identify the documentation passage that defines the expected behavior + - Read the gathered doc (from `reference_docs/` or the project's own docs) at the cited section + - Write test assertions using **language from the spec** — variable names, constants, function names, and assertion messages should echo the spec's terminology, not the code's internal naming + - Include a comment block in each test citing: the requirement ID (from REQUIREMENTS.md), the bug ID (from BUGS.md), and the spec passage (doc name, section, and a ≤15-word quote of the behavioral contract) + +2. **Red-green execution steps.** For each bug with a fix patch: + - **Red:** Run the regression test against unpatched source. It must fail. If it passes, the test doesn't detect the bug — rewrite it using the spec basis to understand what behavior to assert. + - **Green:** Apply the fix patch (`git apply quality/patches/BUG-NNN-fix.patch`), run the same test. It must pass. 
+ - **Record:** Log both results in the BUG tracker with closure status "TDD verified (FAIL→PASS)". + +3. **Framework adaptation.** The protocol must detect the project's test framework and generate idiomatic tests: + - **Projects with test infrastructure** (pytest, JUnit, Go testing, Jest, cargo test, etc.): Write tests in the project's own framework, following existing test conventions discovered during exploration. + - **Projects without test infrastructure** (e.g., Linux kernel, embedded C): Extract the target function with `sed`, write a self-contained C test file with minimal type shims, compile and run directly. Include the extraction command in the test file's header comment so it's self-documenting. + +4. **Upstream reporting format.** For each TDD-verified bug, generate a ready-to-send report block containing: + - One-sentence description citing the spec section violated + - The FAIL→PASS output (copy-pasteable terminal session) + - The test file (as an attachment or inline) + - The fix patch (as an attachment or inline) + +5. **Traceability table.** The protocol produces a `quality/TDD_TRACEABILITY.md` file mapping: + + | Bug ID | Requirement ID | Spec Doc | Spec Section | Behavioral Contract | Test File:Function | Red Result | Green Result | + |--------|---------------|----------|-------------|--------------------|--------------------|------------|--------------| + + Every row must be fully populated. A bug without a spec doc entry is a code inconsistency, not a spec violation — note this in the table and adjust the upstream reporting language accordingly. + +6. **Structured output (mandatory).** The protocol must produce machine-readable results alongside the Markdown report. Use **JUnit XML** for test execution results and a **sidecar JSON** file for QPB-specific metadata that JUnit XML cannot represent. 
+ + **JUnit XML output:** For each red-green phase, run the test with the framework's native JUnit XML output flag: + - Python: `pytest --junitxml=quality/results/tdd-red-BUG-NNN.xml` + - Go: `gotestsum --junitxml quality/results/tdd-red-BUG-NNN.xml -- -run TestRegression_BUG_NNN` + - Java/Kotlin: Maven Surefire reports are generated automatically in `target/surefire-reports/`; copy relevant XML to `quality/results/` + - Rust: `cargo test --test regression 2>&1 | cargo2junit > quality/results/tdd-red-BUG-NNN.xml` (if cargo2junit available; otherwise skip XML for Rust) + - TypeScript: `jest --reporters=default --reporters=jest-junit` with `JEST_JUNIT_OUTPUT_DIR=quality/results/` + + If the framework's JUnit XML reporter is not available or requires a missing dependency, skip the XML output for that language and note it in the sidecar JSON (`"junit_available": false`). Do not fail the TDD run over missing XML tooling. + + **Sidecar JSON (strict schema enforcement):** Generate `quality/results/tdd-results.json` by copying the template below **verbatim** and filling in only the values. Do not invent fields, rename keys, or restructure the schema. The template is the schema — any deviation (extra keys, missing keys, renamed keys, restructured nesting) makes the output non-conformant. Copy-paste the template into your editor first, then fill in the values. Do not write the JSON from memory. 
+ + ```json + { + "schema_version": "1.1", + "skill_version": "", + "date": "YYYY-MM-DD", + "project": "", + "bugs": [ + { + "id": "BUG-001", + "requirement": "REQ-003", + "red_phase": "fail", + "green_phase": "pass", + "verdict": "TDD verified", + "regression_patch": "quality/patches/BUG-001-regression-test.patch", + "fix_patch": "quality/patches/BUG-001-fix.patch", + "fix_patch_present": true, + "patch_gate_passed": true, + "writeup_path": "quality/writeups/BUG-001.md", + "junit_red": "tdd-red-BUG-001.xml", + "junit_green": "tdd-green-BUG-001.xml", + "junit_available": true, + "notes": "" + } + ], + "summary": { + "total": 6, + "verified": 4, + "confirmed_open": 1, + "red_failed": 1, + "green_failed": 0 + } + } + ``` + + **Required top-level fields:** `schema_version`, `skill_version`, `date`, `project`, `bugs`, `summary`. **Required per-bug fields:** `id`, `requirement`, `red_phase`, `green_phase`, `verdict`, `fix_patch_present`, `writeup_path`. If any required field is missing, the result is non-conformant. **Optional per-bug fields** (shown in the template above but not gate-checked): `regression_patch`, `fix_patch`, `patch_gate_passed`, `junit_red`, `junit_green`, `junit_available`, `notes`. Include these when the data is available; omit them without penalty. + + **Required summary sub-keys:** The `summary` object must contain exactly these keys: `total`, `verified`, `confirmed_open`, `red_failed`, `green_failed`. All five are required — omitting any of them (especially `red_failed` or `green_failed`) makes the summary non-conformant. + + **Canonical patch file names:** Regression test patches must be named `BUG-NNN-regression-test.patch`. Fix patches must be named `BUG-NNN-fix.patch`. The gate script globs for these exact patterns — creative variants like `BUG-001-regression.patch` or `BUG-001-test.patch` will not be counted. + + **Date field:** Use the actual date of this session (e.g., `"2026-04-12"`), not the template placeholder `"YYYY-MM-DD"`. 
The gate validates that the date is a real ISO 8601 date and rejects placeholder strings and future dates. + + **Invalid examples (do not emit these):** + - `"runs": [{"phase": "red", "command": "...", "result": "4 xfailed"}]` — this is a flat runs array, not the bug-indexed `"bugs"` schema. + - A schema with ad-hoc root keys like `"generated"`, `"scope"`, `"status"`, `"testsRun"` — these are not the standard schema fields. + - `"verdict": "skipped"` — this value is deprecated; use `"confirmed open"` with `red_phase: "fail"` and `green_phase: "skipped"`. + - Missing `"schema_version"` at the root — every tdd-results.json must include this field. + + Valid `verdict` values: `"TDD verified"` (FAIL→PASS), `"red failed"` (test passed on unpatched code — test doesn't detect the bug), `"green failed"` (test still fails after fix — fix is incomplete or patch is corrupt), `"confirmed open"` (red phase ran and confirmed the bug, no fix patch available), `"deferred"` (TDD cannot execute in this environment — use with `notes` explaining why). **Do not use `"skipped"` as a verdict** — every confirmed bug must have a red-phase result. A bug with `verdict: "confirmed open"` must have `red_phase: "fail"` (red ran and confirmed the bug) and `green_phase: "skipped"` (no fix to apply). Valid `red_phase`/`green_phase` values: `"fail"`, `"pass"`, `"error"` (compile/apply failure), `"skipped"` (green only — red is never skipped). The `patch_gate_passed` field records whether the patch validation gate (apply-check + compile) succeeded — `false` if the gate failed and the patch was repaired, `null` if no fix patch exists. The `writeup_path` field points to the per-bug writeup file (see "Bug writeup generation" below) — `null` if no writeup was generated for this bug. + + Runner scripts and CI tools should read the sidecar JSON for pass/fail counts rather than grepping the Markdown report. 
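A runner-side consumer of the sidecar JSON can be sketched in a few lines; the summary line format is illustrative, the key names come from the schema above:

```python
import json


def tdd_counts(path: str):
    """Read pass/fail counts from tdd-results.json (never grep the report)."""
    summary = json.load(open(path))["summary"]
    clean = summary["red_failed"] == 0 and summary["green_failed"] == 0
    line = (f"{summary['verified']}/{summary['total']} TDD verified, "
            f"{summary['confirmed_open']} confirmed open")
    return clean, line
```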
+ + **Post-write validation (mandatory).** After writing `tdd-results.json`, reopen the file and verify: (1) every required top-level field is present, (2) every required per-bug field is present in each `bugs[]` entry, (3) all `verdict`, `red_phase`, and `green_phase` values use only the allowed enum values listed above, (4) no extra undocumented root keys exist. If any check fails, fix the file before proceeding. This step catches the most common failure mode: the agent paraphrases the schema from memory instead of copying the template, producing plausible but non-conformant output. + +**TDD artifact closure gate (mandatory).** If `quality/BUGS.md` contains any confirmed bugs, `quality/results/tdd-results.json` is mandatory — not optional. If any bug has a red-phase result (whether TDD-verified or confirmed-open), `quality/TDD_TRACEABILITY.md` is also mandatory. Zero-bug repos may omit both files. A run that confirms bugs but produces no tdd-results.json is incomplete — the phase cannot close. For repos where TDD cannot execute (environment blocked, no test infrastructure), generate tdd-results.json with `verdict: "deferred"` and a `notes` field explaining why (e.g., `"environment_blocked: missing workspace Cargo.toml"`, `"no_test_infrastructure: kernel C code without userspace harness"`). The deferred verdict makes the gap visible instead of silently omitting the file. + +**Execution UX:** Same three-phase pattern as the integration tests — (1) show the plan as a numbered table of bugs to verify, (2) report one-line progress as each red-green cycle runs (`FAIL ✓ → PASS ✓` or `FAIL ✗ — test passes on unpatched code, rewriting`), (3) show a summary table with verified/failed/rewritten counts. + +7. **Bug writeup generation (for all confirmed bugs).** After a successful red→green cycle (`verdict: "TDD verified"`) or confirmation without a fix (`verdict: "confirmed open"`), generate a self-contained writeup at `quality/writeups/BUG-NNN.md`. 
This file is designed to be emailed to a maintainer, attached to a Jira ticket, or reviewed outside the repository — it must stand alone without requiring the reader to navigate the rest of the quality artifacts.

**Template (sections 1–4, 6, 7 are required in every writeup; add 5 when the depth judgment fires; add 8 when related bugs exist):**

1. **Summary** — One paragraph: what's wrong, where (file:line), what breaks in practice.
2. **Spec reference** — The specific spec section violated, with URL if available. Quote the behavioral contract (≤15 words) that the code fails to satisfy.
3. **The code** — The buggy code with file:line citation. Explain why it's wrong in terms of the spec, not just "it looks weird."
4. **Observable consequence** — What actually breaks. Not "could theoretically fail" — what does fail, under what conditions, with what symptoms.
5. **Depth judgment** *(include only when expansion is warranted)* — After drafting sections 1–4, assess: is the consequence self-evident from the code and test alone? If a reader would reasonably ask "why hasn't anyone noticed this?" or "does this affect all configurations equally?", expand the analysis. Trace the buggy function's callers. Show which code paths expose the bug and which mask it. Concrete expansion triggers: transport/config-dependent behavior, feature flags that mask the bug on some paths, indirect dispatch hiding callers, bugs in negotiation/initialization code that only manifest under specific runtime conditions. If the consequence is obvious from the immediate code (e.g., a null dereference, an off-by-one), keep sections 1–4 tight and omit this section.
6. **The fix** — A proposed fix as an inline diff (unified diff format), with a brief explanation of why this is the right fix. **Always include a concrete diff** — even for confirmed-open bugs without a separate `.patch` file. If the fix is a one-line change (adding a case label, fixing an argument), write the diff.
If the fix requires broader changes, write the minimal diff that addresses the core defect and note what additional changes a full fix would need. The inline diff in the writeup is what makes the writeup actionable — a writeup that says "No fix patch is included" is incomplete and not useful to a maintainer. Example format:

```diff
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -3527,6 +3527,7 @@ void vring_transport_features(...)
 case VIRTIO_F_ORDER_PLATFORM:
 case VIRTIO_F_IN_ORDER:
+case VIRTIO_F_RING_RESET:
 default:
```

7. **The test** — What the test proves, how to run it, and what output to expect on unpatched vs patched code.
8. **Related issues** *(include only when related bugs exist)* — Other bugs in the same class, if any. Flag them even if they're not confirmed yet. Omit this section if no related issues were identified.

**Include the version stamp** at the top of the writeup file (same format as all other generated files).

**Writeup generation for all confirmed bugs (mandatory).** Generate a writeup at `quality/writeups/BUG-NNN.md` for every confirmed bug — both TDD-verified and confirmed-open. Use the numbered section template above (sections 1–8). For confirmed-open bugs, follow the same template including a proposed fix diff in section 6 (the diff is always required even without a separate `.patch` file). The writeup threshold is bug confirmation, not TDD completion. A run with confirmed bugs and no writeups directory is incomplete.

**Inline diff is gate-enforced.** The `quality_gate.py` script checks that every writeup contains a ` ```diff ` block. A writeup without an inline diff will cause the gate to FAIL. Do not write "see patch file" — paste the actual diff inline in the writeup body, inside a fenced ` ```diff ` code block. This is the single most important element of the writeup because it makes the bug actionable for a maintainer reading just the writeup.
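The fenced-diff requirement lends itself to a mechanical check. A minimal sketch of such a check follows — this is not the actual `quality_gate.py` implementation, which may differ in detail:

```python
def has_inline_diff(writeup_text: str) -> bool:
    """Return True if the writeup body contains a fenced diff block.

    Sketch of the gate's inline-diff check: look for a line that opens
    a ```diff fence. The real quality_gate.py logic may differ.
    """
    lines = writeup_text.splitlines()
    return any(line.strip().startswith("```diff") for line in lines)

# A writeup with an inline diff passes; one that defers to a patch file fails.
ok = has_inline_diff("## Summary\n...\n```diff\n--- a/f.c\n+++ b/f.c\n```\n")
bad = has_inline_diff("## Summary\nSee patch file.\n")
```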

### Checkpoint: Update PROGRESS.md after artifact generation

Re-read `quality/PROGRESS.md`. Update:
- Mark Phase 2 complete with timestamp
- Update the artifact inventory: set each generated artifact to "generated" with its file path
- Add exploration summary notes if not already present

**Phase 2 completion gate (mandatory).** Before proceeding to Phase 3, verify:
1. All core artifacts exist on disk under `quality/` (`QUALITY.md`, `CONTRACTS.md`, `REQUIREMENTS.md`, `COVERAGE_MATRIX.md`, `COMPLETENESS_REPORT.md`, `test_functional.*`, `RUN_CODE_REVIEW.md`, `RUN_INTEGRATION_TESTS.md`, `RUN_SPEC_AUDIT.md`, `RUN_TDD_TESTS.md`). `AGENTS.md` is NOT in this list — the orchestrator writes it to the target repo root after Phase 6, not Phase 2.
2. `REQUIREMENTS.md` contains requirements with specific conditions of satisfaction referencing actual code (file paths, function names, line numbers) — not abstract behavioral descriptions.
3. If dispatch/enumeration contracts exist: `quality/mechanical/verify.sh` exists and has been executed.
4. PROGRESS.md marks Phase 2 complete with timestamp.

Re-read `quality/PROGRESS.md` and `quality/REQUIREMENTS.md` before starting Phase 3. The requirements are the target list for the code review — every requirement is a potential bug if the code doesn't satisfy its conditions.

**End-of-phase message (mandatory — print this after Phase 2 completes, then STOP):**

```
# Phase 2 Complete — Quality Artifacts Generated

I've generated the quality infrastructure for this project:
[List the key artifacts created: REQUIREMENTS.md with N requirements and N use cases,
QUALITY.md with N scenarios, functional tests, review protocols, etc.]

The requirements are now the target list for Phase 3's code review — every requirement
is a potential bug if the code doesn't satisfy it.

To continue to Phase 3 (Code review with regression tests), say:

    Run quality playbook phase 3.

Or say "keep going" to continue automatically.
```

**After printing this message, STOP. Do not proceed to Phase 3 unless the user explicitly asks.**

---

## Phase 3: Code Review and Regression Tests

**v1.5.6 instrumentation:** Append `phase_start phase=3` to `quality/run_state.jsonl` now. At phase end, cross-validate (`quality/RUN_CODE_REVIEW.md` exists; one writeup per identified bug), then append `phase_end phase=3`.

> **Required references for this phase:**
> - `quality/REQUIREMENTS.md` — target list for the code review
> - `references/review_protocols.md` — three-pass protocol and regression test conventions

Run the code review protocol (all three passes) as described in File 3. After producing findings, write regression tests for every confirmed BUG per the closure mandate in `references/review_protocols.md`.

**Update PROGRESS.md:** Add every confirmed BUG to the cumulative BUG tracker with source "Code Review", the file:line reference, description, severity, and closure status (regression test function name or exemption reason). Mark Phase 3 (Code review + regression tests) complete.

**End-of-phase message (mandatory — print this after Phase 3 completes, then STOP):**

```
# Phase 3 Complete — Code Review

The three-pass code review is done. [Summarize: N bugs confirmed, N regression test
patches generated, N fix patches generated. List the bug IDs and one-line summaries.]

To continue to Phase 4 (Spec audit — Council of Three), say:

    Run quality playbook phase 4.

Or say "keep going" to continue automatically.
```

**After printing this message, STOP. Do not proceed to Phase 4 unless the user explicitly asks.**

---

## Phase 4: Spec Audit and Triage

**v1.5.6 instrumentation:** Append `phase_start phase=4` now. For each pass A/B/C/D, append `pass_started phase=4 pass=X` and `pass_ended phase=4 pass=X`.
At phase end, cross-validate (`quality/REQUIREMENTS.md` non-empty AND `quality/COVERAGE_MATRIX.md` exists), then append `phase_end phase=4`.

> **Required references for this phase:**
> - `references/spec_audit.md` — Council of Three protocol, triage process, verification probes

Run the spec audit protocol as described in File 5. The triage report **must** include a `## Pre-audit docs validation` section (see `references/spec_audit.md` for the full template). This section is required even if `reference_docs/` is empty — in that case, note what baseline the auditors used instead. Every verification probe in the triage must produce executable evidence (test assertions with line-number citations) per the "Verification probes must produce executable evidence" rule above. After triage, categorize each confirmed finding.

**Effective council gating for enumeration checks.** If the effective council is less than 3/3 (fewer than three auditors returned usable reports) and the run includes any whitelist/enumeration/dispatch-function checks or any carried-forward seed checks, the audit may not conclude "no confirmed defects" for those checks without executed mechanical proof artifacts. An incomplete council with mechanical verification is acceptable. An incomplete council relying on prose-only validation for code-presence claims is not — escalate to "NEEDS VERIFICATION" and run the mechanical check before closing.

**Pre-audit spot-checks must extract from code, not assert from docs.** When the spec audit prompt includes spot-check claims for pre-validation (e.g., "verify that function X handles constant Y at line Z"), the triage must validate each claim by extracting the actual code at the cited lines — not by confirming that the claim sounds plausible. For each spot-check claim about code contents, the pre-validation must report what the cited lines actually contain: "Line 3527 contains `default:` — NOT `case VIRTIO_F_RING_RESET:` as claimed."
If the spot-check was generated from requirements or gathered docs rather than from the code itself, treat it as a hypothesis to test, not a fact to confirm. This rule prevents the contamination chain observed in v1.3.17, where a false spot-check claim ("RING_RESET at 3527-3528") was accepted as "accurate" without reading the actual lines, then propagated through the triage and into every downstream artifact.

**Update PROGRESS.md:** Add every confirmed **code bug** from the spec audit to the cumulative BUG tracker with source "Spec Audit". This is critical — spec-audit bugs are systematically orphaned if they aren't added to the same tracker that the closure verification reads.

### Layer 2 — Semantic Citation Check (v1.5.3 Council sub-pass)

After the main spec audit triage, each Council member runs a per-REQ verdict against every Tier 1/2 REQ's `citation_excerpt`. This is **Layer 2** of the hallucination gate: Layer 1 is the mechanical byte-equality check — `bin/citation_verifier` is invoked by `bin/reference_docs_ingest` at ingest time and re-invoked by `quality_gate.py` at gate time; the LLM never shells out to it directly. Layer 2 is semantic — the reviewer decides whether the excerpt actually supports the requirement as stated, or whether the requirement overreaches what the excerpt says.

**Protocol.**

1. **One prompt per Council member, all Tier 1/2 REQs batched in.** Not one REQ at a time (3×N prompts is too many). Not a prose response (pattern-matching risk). The reviewer receives the full list of `(req_id, citation_excerpt, REQ description)` tuples and returns a structured per-REQ JSON response.

2. **Structured response schema (schemas.md §9.2).** For each REQ the reviewer records `{"req_id": "REQ-NNN", "reviewer": "<model>", "verdict": "supports" | "overreaches" | "unclear", "notes": "<text>"}`. Valid `verdict` values are enumerated in schemas.md §3.5.

3.
**Batching threshold.** When a run produces more than 15 Tier 1/2 REQs, split into batches of up to 15 REQs per prompt per Council member. The same reviewer sees each batch sequentially; their response entries are concatenated into one `reviews[]` array under the same `reviewer` string.

4. **Reviewer identifier stability.** Use fixed strings like `"claude-opus-4.7"`, `"gpt-5.4"`, `"gemini-2.5-pro"`. The majority computation in schemas.md §10 invariant #17 groups on this field — a typo silently becomes a fourth reviewer and breaks the 2-of-3 majority check.

5. **Output.** Concatenate all Council members' responses into `quality/citation_semantic_check.json` using the §1.6 manifest wrapper, except the record array is named `reviews` rather than `records` (schemas.md §9.1). One file per run, regenerated on every audit pass.

**Majority rule (gate-enforced).** For each Tier 1/2 REQ, the gate groups reviews by `req_id` and fails the run when ≥2 of 3 reviewers recorded `verdict == "overreaches"`. A single-member `overreaches` or `unclear` verdict surfaces as a warning but does not fail the gate. A REQ with fewer than three reviewer entries (missing reviewer, skipped batch) has insufficient evidence — the gate treats that as a fail.

**No-op for Spec Gap runs.** If a run produces zero Tier 1/2 REQs, `citation_semantic_check.json` is still written with an empty `reviews` array — the file's existence is part of the artifact contract even when the check has nothing to evaluate.

### Post-spec-audit regression tests

After the spec audit triage, check the cumulative BUG tracker in PROGRESS.md. Any spec-audit BUG that doesn't have a regression test yet needs one now. Write regression tests for spec-audit confirmed code bugs using the same conventions as code-review regression tests (expected-failure markers, test-finding alignment, executable source files).
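The Layer-2 majority rule described above can be sketched in a few lines. This is a simplified model — the reviewer names are placeholders, and the real gate additionally enforces the schemas.md §10 invariants:

```python
from collections import defaultdict

def citation_majority_verdicts(reviews):
    """Group Layer-2 reviews by req_id and apply the 2-of-3 majority rule.

    Simplified sketch of the gate logic described above; quality_gate.py
    may differ in detail. Returns {req_id: "fail" | "warn" | "pass"}.
    """
    by_req = defaultdict(list)
    for review in reviews:
        by_req[review["req_id"]].append(review["verdict"])
    results = {}
    for req_id, verdicts in by_req.items():
        if len(verdicts) < 3:
            results[req_id] = "fail"   # fewer than 3 entries = insufficient evidence
        elif verdicts.count("overreaches") >= 2:
            results[req_id] = "fail"   # majority overreach fails the run
        elif "overreaches" in verdicts or "unclear" in verdicts:
            results[req_id] = "warn"   # single dissent is a warning only
        else:
            results[req_id] = "pass"
    return results

# Placeholder reviewer IDs, invented REQs -- illustration only.
sample = [
    {"req_id": "REQ-001", "reviewer": "a", "verdict": "supports"},
    {"req_id": "REQ-001", "reviewer": "b", "verdict": "overreaches"},
    {"req_id": "REQ-001", "reviewer": "c", "verdict": "overreaches"},
    {"req_id": "REQ-002", "reviewer": "a", "verdict": "supports"},
    {"req_id": "REQ-002", "reviewer": "b", "verdict": "supports"},
    {"req_id": "REQ-002", "reviewer": "c", "verdict": "unclear"},
]
```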

**Why this step exists:** Code review bugs get regression tests immediately because tests are written right after the review. Spec audit runs after the tests are written, so its confirmed bugs are orphaned — they appear in the triage report but never get tests. This step closes that gap.

**Individual auditor artifacts (mandatory).** The spec audit must produce individual auditor report files at `quality/spec_audits/` with filenames containing `auditor` (canonical format: `YYYY-MM-DD-auditor-N.md`, e.g., `2026-04-12-auditor-1.md`; also accepted: `auditor__.md`). The gate globs for `*auditor*` — any conformant name will match. One file per auditor, not just the triage synthesis. Each auditor report records what that auditor found independently before triage reconciliation. If only the triage file exists with no individual auditor artifacts, the audit is incomplete — the triage cannot be verified because there is no record of pre-reconciliation findings. This requirement exists because a single triage file conflates discovery with reconciliation, making it impossible to tell whether a finding was independently confirmed or synthesized from a single source.

**Phase 4 completion gate.** Phase 4 is not complete until a triage file exists at `quality/spec_audits/YYYY-MM-DD-triage.md` **and** individual auditor reports exist. If only auditor reports exist with no triage synthesis, mark Phase 4 as "partial — triage pending" in PROGRESS.md and complete the triage before proceeding. If only the triage exists with no individual reports, mark Phase 4 as "partial — auditor artifacts missing" and regenerate them. The PROGRESS.md checkbox must not be set until both the triage file and auditor reports are confirmed present.

Update the BUG tracker entries with regression test references. Mark Phase 4 (Spec audit + triage) complete.
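The Phase 4 completion gate reduces to a filename check over `quality/spec_audits/`. A minimal sketch, assuming the directory listing has already been read (the real gate globs `*triage*` and `*auditor*`):

```python
def phase4_artifact_status(spec_audit_files):
    """Classify Phase 4 completeness from quality/spec_audits/ filenames.

    Sketch of the completion gate above; the "partial" strings mirror the
    PROGRESS.md labels the text prescribes.
    """
    has_triage = any("triage" in name for name in spec_audit_files)
    has_auditors = any("auditor" in name for name in spec_audit_files)
    if has_triage and has_auditors:
        return "complete"
    if has_auditors:
        return "partial — triage pending"
    if has_triage:
        return "partial — auditor artifacts missing"
    return "missing"
```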

**End-of-phase message (mandatory — print this after Phase 4 completes, then STOP):**

```
# Phase 4 Complete — Spec Audit

The Council of Three spec audit is done. [Summarize: N auditors ran, N net-new bugs
confirmed from triage, total bugs now at N. List any new bug IDs and summaries.]

To continue to Phase 5 (Reconciliation — TDD verification, writeups, closure), say:

    Run quality playbook phase 5.

Or say "keep going" to continue automatically.
```

**After printing this message, STOP. Do not proceed to Phase 5 unless the user explicitly asks.**

---

## Phase 5: Post-Review Reconciliation and Closure Verification

**v1.5.6 instrumentation:** Append `phase_start phase=5` now. For each gate check, append `gate_check gate_name=X verdict=pass|fail|warn|skip`. At phase end, cross-validate (`quality/results/quality-gate.log` non-empty), then append `phase_end phase=5`.

**Source-edit guardrail (mandatory).** Phase 5 produces *proposed* fixes as patch artifacts at `quality/patches/BUG-NNN-fix.patch` and `quality/patches/BUG-NNN-regression-test.patch`. Phase 5 must NOT apply those patches to source files outside `quality/`. A self-audit run that mutates the target's source tree is a defect, not legitimate Phase 5 output — the operator chooses when to apply patches in a separate, supervised step. At run end the playbook calls `bin.run_state_lib.validate_no_source_edits(target_dir)`; if that helper reports any non-`quality/` paths dirty, append an `error recoverable:false` event citing the violations and end the run with `run_end status=aborted`. This rule was reaffirmed in v1.5.6 after the Codex bootstrap run on 2026-05-02 went off-rails in Phase 5 and edited five source files outside `quality/` before being killed.
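A guardrail like `validate_no_source_edits` could be sketched over `git status --porcelain` output. This is a hypothetical reconstruction, assuming the target is a git work tree — the actual helper in `bin/run_state_lib` is not shown here and its interface may differ:

```python
def source_edit_violations(porcelain_lines):
    """Return dirty paths outside quality/ from `git status --porcelain` output.

    Hypothetical sketch of the bin.run_state_lib.validate_no_source_edits
    guardrail described above; the real helper may differ.
    """
    violations = []
    for line in porcelain_lines:
        if not line.strip():
            continue
        # Porcelain v1: two status chars, a space, then the path.
        # For renames ("old -> new"), the new path is what matters.
        path = line[3:].split(" -> ")[-1].strip()
        if not path.startswith("quality/"):
            violations.append(path)
    return violations

status = [" M quality/PROGRESS.md", " M src/lib.rs",
          "?? quality/results/quality-gate.log"]
```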

> **Required references for this phase:**
> - `quality/PROGRESS.md` — cumulative BUG tracker (authoritative finding list)
> - `references/challenge_gate.md` — two-round challenge protocol for false-positive detection
> - `references/requirements_pipeline.md` — post-review reconciliation process
> - `references/review_protocols.md` — regression test cleanup after reversals
> - `references/spec_audit.md` — verification probe protocol for conflicts

**Phase 5 entry gate (mandatory — HARD STOP).** Before proceeding, verify ALL of the following Phase 4 artifacts exist:

1. `quality/spec_audits/` directory exists and contains at least one `*triage*` file (the triage synthesis)
2. `quality/spec_audits/` contains at least one `*auditor*` file (individual auditor reports)
3. `quality/PROGRESS.md` exists and its Phase 4 line is marked `[x]`

If any of these are missing, STOP and go back to Phase 4. Do not proceed with reconciliation until the spec audit artifacts are confirmed present — reconciliation without triage data produces an incomplete closure report.

Re-read `quality/PROGRESS.md` — specifically the cumulative BUG tracker. This is the authoritative list of all findings across both code review and spec audit.

**Challenge gate (mandatory before reconciliation).** Before running closure verification, apply the challenge gate to every confirmed bug that matches an auto-trigger pattern. Read `references/challenge_gate.md` for the full protocol. In summary:

1. Scan the BUG tracker for bugs matching any auto-trigger pattern (security-class findings, code with design-decision comments at the cited location, findings with no spec basis, sibling code paths handling the same concern differently, findings about missing functionality).
2. For each triggered bug, run the two-round challenge using fresh sub-agents as described in the reference.
3. Record verdicts in `quality/challenge/BUG-NNN-challenge.md`.
4.
Apply verdicts: CONFIRMED bugs proceed normally. DOWNGRADED bugs get their severity adjusted. REJECTED bugs are removed from the BUG tracker and relocated to a "Reviewed and dismissed" appendix in BUGS.md with the challenge reasoning.

**Apply common sense throughout.** The challenge gate's primary purpose is to catch findings where pattern-matching overrode judgment. If a bug would make you look foolish reporting it to the upstream maintainer — a self-documenting placeholder flagged as a critical vulnerability, a documented design decision flagged as a defect, an intentional feature gap flagged as a security hole — it should not survive the challenge. The common-sense test is not one factor among many; it is the framing for the entire review.

**Why this gate exists:** In v1.4.6 edgequake benchmarking, the code review confirmed 42 bugs, including 7 rated CRITICAL. After manual review, the strongest finding (BUG-001, source_ids overwrite) was HIGH, not CRITICAL. Six "CRITICAL" tenant-isolation bugs were documented feature gaps with explicit WHY-OODA81 annotations. One "CRITICAL" JWT finding (BUG-041) was a self-documenting development placeholder containing the literal string "change-me-in-production." The model defended these findings through multiple rounds of pushback because its instinct was to find and defend bugs, not to apply common sense about what constitutes a defect. The challenge gate forces that common-sense review to happen before findings are finalized.

1. **Run the Post-Review Reconciliation** as described in `references/requirements_pipeline.md`. Update COMPLETENESS_REPORT.md.
2. **Run closure verification:** For every row in the BUG tracker, verify it has either a regression test reference or an explicit exemption. If any BUG lacks both, write the test or exemption now.
3. **Triage-to-BUGS.md sync gate (mandatory).** Re-read the triage report (`quality/spec_audits/*-triage.md`).
For every finding confirmed as a code bug, verify it appears in `quality/BUGS.md`. If BUGS.md does not exist, create it now. If BUGS.md exists but is missing confirmed bugs from the triage, append them. A triage report with confirmed code bugs and no corresponding BUGS.md entries is non-conformant — the phase cannot be marked complete until they are synced. This gate exists because in v1.3.21 benchmarking, javalin's triage confirmed 2 bugs but BUGS.md was never created.
4. **Clean up after spec-audit reversals:** If the spec audit reclassified any code review BUG as a design choice or false positive, remove or relocate the corresponding regression test per `references/review_protocols.md`.
5. **Resolve CR vs spec-audit conflicts:** If the code review and spec audit disagree on the same finding (one says BUG, the other says design choice), deploy a verification probe per `references/spec_audit.md` and record the resolution in the BUG tracker.

**TDD sidecar-to-log consistency check (mandatory).** For every bug entry in `tdd-results.json`, verify the corresponding log files exist and agree. If `tdd-results.json` contains a bug with `verdict: "TDD verified"`, then `quality/results/BUG-NNN.red.log` must exist with first line `RED` and `quality/results/BUG-NNN.green.log` must exist with first line `GREEN`. If the sidecar claims "TDD verified" but no red-phase log exists, the verdict is unsubstantiated — either create the log by running the test, or downgrade the verdict to `"confirmed open"`. This check exists because v1.3.46 benchmarking showed agents writing "TDD verified" verdicts in the JSON based on narrative reasoning without ever executing the test.

**Executed evidence outranks narrative artifacts (contradiction gate).** Before running the terminal gate, check for contradictions between executed evidence and prose artifacts.
Executed evidence includes: mechanical verification artifacts (`quality/mechanical/*`), verification receipt files (`quality/results/mechanical-verify.log`, `quality/results/mechanical-verify.exit`), regression test results (`test_regression.*` with `xfail` outcomes), TDD red-phase log files (`quality/results/BUG-NNN.red.log`), and any shell command output saved during the pipeline. Prose artifacts include: `REQUIREMENTS.md`, `CONTRACTS.md`, code reviews, the spec audit triage, and `BUGS.md`. If an executed artifact shows a constant is absent (mechanical check), a test fails (regression test), or a red phase confirms a bug (TDD traceability) — but a prose artifact claims the constant is present, the bug is fixed, or the code is compliant — the executed result wins. Re-open and correct the contradictory prose artifact before proceeding. Specifically: if `mechanical-verify.exit` contains a non-zero value, PROGRESS.md may not claim "Mechanical verification: passed" and the terminal gate may not pass — regardless of what any other artifact says. In v1.3.18, the triage claimed RING_RESET was preserved (`spec_audits/triage.md`), BUGS.md claimed "fixed in working tree," but TDD traceability showed the assertion `assert "case VIRTIO_F_RING_RESET:" in func` failed on the current source. Those three cannot all be true — the executed failure is the ground truth. This gate would have caught that contradiction.

**Version stamp consistency check (mandatory).** Read the `version:` field from the SKILL.md metadata (using the reference file resolution order). Then check every generated artifact: PROGRESS.md's `Skill version:` field, every `> Generated by` attribution line, every code file header stamp, and every sidecar JSON `skill_version` field. Every version stamp must match the SKILL.md metadata exactly. A single mismatch is a benchmark failure — fix the stamp before proceeding.
This check exists because in v1.3.21 benchmarking, 5 of 9 repos had version stamps from older skill versions (v1.3.16 or v1.3.20) because the PROGRESS.md template contained a hardcoded version number.

**Mechanical directory conformance check.** If `quality/mechanical/` exists, it must contain at minimum a `verify.sh` file. An empty `quality/mechanical/` directory is non-conformant — it implies the step was attempted but abandoned. If no dispatch-function contracts exist in this project's scope, do not create a `mechanical/` directory at all. Instead, record in PROGRESS.md: `Mechanical verification: NOT APPLICABLE — no dispatch/registry/enumeration contracts in scope.` If dispatch contracts do exist, `verify.sh` must include one verification block per saved extraction file under `quality/mechanical/` (not just one). A verify.sh that checks only one artifact when multiple exist is incomplete.

**Verification receipt gate (mandatory before terminal gate).** If `quality/mechanical/` exists, the following receipt files must also exist before the terminal gate may run:
- `quality/results/mechanical-verify.log` — full stdout/stderr from `bash quality/mechanical/verify.sh`
- `quality/results/mechanical-verify.exit` — a single line containing the exit code (e.g., `0`)

If either file is missing, run `bash quality/mechanical/verify.sh > quality/results/mechanical-verify.log 2>&1; echo $? > quality/results/mechanical-verify.exit` now. If the exit code is not `0`, the terminal gate fails — do not proceed until the mechanical mismatch is resolved (by fixing the extraction, not by editing verify.sh or the receipt). PROGRESS.md may not claim "Mechanical verification: passed" unless `mechanical-verify.exit` contains `0`. This gate exists because v1.3.23's PROGRESS.md claimed all verification passed when verify.sh actually returned exit 1 — the receipt file makes this claim auditable.
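The receipt gate above can be modeled as a small decision function. A sketch, under the assumption that the caller has already read (or failed to find) the two receipt files:

```python
def receipt_gate(mechanical_dir_exists, log_text, exit_text):
    """Evaluate the verification receipt gate described above (sketch only).

    log_text / exit_text are the contents of mechanical-verify.log and
    mechanical-verify.exit; None means the file is missing.
    """
    if not mechanical_dir_exists:
        return "pass"  # gate applies only when quality/mechanical/ exists
    if log_text is None or exit_text is None:
        return "fail: missing receipt files (run verify.sh and capture receipts)"
    if exit_text.strip() != "0":
        return "fail: verify.sh exited non-zero (fix the extraction, not the receipt)"
    return "pass"
```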

**TDD Log Closure Gate (mandatory before terminal gate).** Before proceeding to the terminal gate, enumerate all confirmed bug IDs from `quality/BUGS.md` and verify:
1. `quality/results/BUG-NNN.red.log` exists for every confirmed bug.
2. If `quality/patches/BUG-NNN-fix.patch` exists for that bug, `quality/results/BUG-NNN.green.log` also exists.
3. The first line of each log file is one of: `RED`, `GREEN`, `NOT_RUN`, `ERROR`.

If any check fails, stop and generate the missing logs now using the language-aware test execution commands from the TDD execution enforcement section. Do not proceed to the terminal gate with missing TDD logs — a bug with a "TDD verified" verdict in tdd-results.json but no corresponding red-phase log is a contradiction.

**Terminal gate (mandatory before marking Phase 5 complete):**

**Prerequisite check:** The terminal gate may run only if Phase 3 (code review) and Phase 4 (spec audit) are both complete, or explicitly marked skipped with rationale in PROGRESS.md. A zero-bug outcome is valid only if code review and spec audit artifacts exist (i.e., `quality/code_reviews/` and `quality/spec_audits/` directories contain report files). If these artifacts are missing and the phases are not explicitly skipped, the terminal gate fails — do not mark Phase 5 complete.

**BUGS.md is always required.** Every completed run must produce `quality/BUGS.md`, regardless of whether bugs were found. If code review and spec audit confirmed zero source-code bugs, create BUGS.md with a `## Summary` stating "No confirmed source-code bugs found" and listing how many candidates were evaluated and eliminated (e.g., "Code review evaluated N candidates; spec audit evaluated M candidates; all were reclassified as design choices, test-only issues, or false positives"). This provides a positive assertion of a clean outcome rather than ambiguous file absence. A completed run with no BUGS.md is non-conformant.
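The three checks of the TDD log closure gate can be sketched as a pure function over the sets of IDs involved — a simplified model of the gate, not its actual implementation:

```python
def tdd_log_closure_gaps(bug_ids, fix_patches, red_logs, green_logs):
    """Return missing-log violations for the TDD log closure gate (sketch).

    bug_ids: confirmed IDs from BUGS.md; fix_patches: IDs that have a
    BUG-NNN-fix.patch; red_logs/green_logs: IDs whose log files exist.
    """
    gaps = []
    for bug in bug_ids:
        if bug not in red_logs:
            gaps.append(f"{bug}: missing red log")
        if bug in fix_patches and bug not in green_logs:
            gaps.append(f"{bug}: has fix patch but missing green log")
    return gaps
```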

**BUGS.md heading format.** Each confirmed bug must use the heading level `### BUG-NNN` (e.g., `### BUG-001` or `### BUG-H1`). Both numeric IDs (`BUG-001`) and severity-prefixed IDs (`BUG-H1`, `BUG-M3`, `BUG-L6`) are valid. This is the canonical heading format — not `## BUG-001`, not `**BUG-001**`, not a bullet point. The `### BUG-NNN` heading is what downstream tools grep for when counting bugs, and what the tdd-results.json `id` field must match. Inconsistent heading levels cause machine-readable counts to disagree with the document.

Re-read `quality/PROGRESS.md`. Count the BUG tracker entries. Then:

1. Print the following statement to the user (this is mandatory, not optional):

   > "BUG tracker has N entries. N have regression tests, N have exemptions, N are unresolved. Code review confirmed M bugs. Spec audit confirmed K code bugs (L net-new). Expected total: M + L."

2. Write the same statement into PROGRESS.md under a new `## Terminal Gate Verification` section (immediately after the BUG tracker table). This persists the gate into the artifact so reviewers can verify it without reading session logs.

If the tracker entry count does not equal M + L, stop and reconcile — a BUG was orphaned from the tracker. Do not mark Phase 5 complete until the counts match. This gate exists because the v1.3.5 bootstrap showed that agents reliably skip the tracker update after spec audit, orphaning 30-50% of confirmed bugs.

**Regression test function-name verification:** For each BUG tracker entry that references a regression test, grep for the test function name in the regression test file and confirm it exists. An agent can write a test name in the tracker without actually creating the test. If any referenced test function does not exist, write it now before passing the gate.

3. Verify the `With docs` metadata field in PROGRESS.md matches reality: if `reference_docs/` exists and contains files, it should say `yes`; otherwise `no`. Fix it if wrong.
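The heading-format rule and the count reconciliation combine into a simple cross-check. A sketch (the regex accepts both numeric and severity-prefixed IDs, per the heading rule above):

```python
import re

def terminal_gate_statement(bugs_md, tracker_rows, m_code_review, l_net_new_audit):
    """Cross-check bug counts for the terminal gate (sketch of the rule above).

    bugs_md is the BUGS.md text; tracker_rows is the number of BUG tracker
    entries in PROGRESS.md; m/l are the code-review and net-new audit counts.
    """
    # "### BUG-..." is the canonical heading downstream tools grep for.
    headings = re.findall(r"^### (BUG-\S+)", bugs_md, flags=re.MULTILINE)
    expected = m_code_review + l_net_new_audit
    ok = tracker_rows == expected and len(headings) == expected
    return ok, headings

doc = "### BUG-001\n...\n### BUG-H1\n...\n"
```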

**Artifact file-existence gate (mandatory before marking Phase 5 complete).** Before writing the Phase 5 completion checkbox, verify that every required artifact exists as a file on disk — not just mentioned in PROGRESS.md. Run these checks (use `ls` or equivalent):

- `quality/BUGS.md` exists (required for all completed runs, per benchmark 34)
- `quality/REQUIREMENTS.md` exists
- `quality/QUALITY.md` exists
- `quality/PROGRESS.md` exists (obviously — you're writing to it)
- `quality/COVERAGE_MATRIX.md` exists
- `quality/COMPLETENESS_REPORT.md` exists
- `quality/formal_docs_manifest.json` exists (v1.5.3 — written by `bin/reference_docs_ingest.py` in Phase 1; empty `records[]` is valid when no formal docs present)
- `quality/requirements_manifest.json` exists (v1.5.3 — authoritative REQ records, rendered to REQUIREMENTS.md)
- `quality/use_cases_manifest.json` exists (v1.5.3 — authoritative UC records, rendered to USE_CASES.md / the REQUIREMENTS.md narrative)
- `quality/citation_semantic_check.json` exists (v1.5.3 — Phase 4 Layer-2 output; empty `reviews[]` is valid for Spec Gap runs)
- If Phase 3 ran: `quality/code_reviews/` contains at least one `.md` file
- If Phase 4 ran: `quality/spec_audits/` contains a triage file AND individual auditor files
- If Phase 0 or 0b ran: `quality/SEED_CHECKS.md` exists as a standalone file (not inlined in PROGRESS.md)
- If confirmed bugs exist: `quality/bugs_manifest.json` exists (v1.5.3 — authoritative BUG records per schemas.md §8)
- If confirmed bugs exist: `quality/results/tdd-results.json` exists
- If confirmed bugs exist: `quality/results/BUG-NNN.red.log` exists for every confirmed bug ID in `quality/BUGS.md`
- If confirmed bugs exist with fix patches: `quality/results/BUG-NNN.green.log` exists for each bug that has a `quality/patches/BUG-NNN-fix.patch`

For each missing file, create it now.
Do not mark Phase 5 complete with missing artifacts — the terminal gate verification in PROGRESS.md is meaningless if the files it references don't exist on disk. This gate exists because v1.3.24 benchmarking showed express completing all phases and writing a terminal gate section in PROGRESS.md, but BUGS.md, SEED_CHECKS.md, and code review/spec audit files were never written to disk. + +**Sidecar JSON post-write validation (mandatory).** After writing `quality/results/tdd-results.json` and/or `quality/results/integration-results.json`, immediately reopen each file and verify it contains all required keys. For `tdd-results.json`, the required root keys are: `schema_version`, `skill_version`, `date`, `project`, `bugs`, `summary`. Each entry in `bugs` must have: `id`, `requirement`, `red_phase`, `green_phase`, `verdict`, `fix_patch_present`, `writeup_path`. The `summary` object must include `confirmed_open` alongside `verified`, `red_failed`, `green_failed`. For `integration-results.json`, the required root keys are: `schema_version`, `skill_version`, `date`, `project`, `recommendation`, `groups`, `summary`, `uc_coverage`. Both files must have `schema_version: "1.1"`. If any key is missing, add it now — do not leave a non-conformant JSON file on disk. This validation exists because v1.3.25 benchmarking showed 6 of 8 repos with non-conformant sidecar JSON: httpx invented an alternate schema, serde used legacy shape, javalin omitted `summary` and per-bug fields, and others used invalid enum values. 
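A minimal sketch of that post-write check for `tdd-results.json` (required-key sets copied from the paragraph above; the helper itself is illustrative, not the shipped gate logic):

```python
REQUIRED_ROOT = {"schema_version", "skill_version", "date", "project", "bugs", "summary"}
REQUIRED_BUG = {"id", "requirement", "red_phase", "green_phase", "verdict",
                "fix_patch_present", "writeup_path"}
REQUIRED_SUMMARY = {"verified", "red_failed", "green_failed", "confirmed_open"}

def sidecar_problems(doc: dict) -> list[str]:
    """Return a list of missing-key problems; an empty list means conformant."""
    problems = sorted(f"root:{k}" for k in REQUIRED_ROOT - doc.keys())
    for i, bug in enumerate(doc.get("bugs", [])):
        problems += sorted(f"bugs[{i}]:{k}" for k in REQUIRED_BUG - bug.keys())
    problems += sorted(f"summary:{k}" for k in REQUIRED_SUMMARY - doc.get("summary", {}).keys())
    if doc.get("schema_version") != "1.1":
        problems.append('schema_version must be "1.1"')
    return problems
```

After writing the file, `json.load` it straight back and confirm `sidecar_problems(doc) == []` before moving on.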
+ +**Script-verified closure gate (mandatory, final step before marking Phase 5 complete).** Locate `quality_gate.py` using the same fallback as reference files — walk these six canonical install layouts in order, taking the first hit: `quality_gate.py`, `.claude/skills/quality-playbook/quality_gate.py`, `.github/skills/quality_gate.py`, `.cursor/skills/quality-playbook/quality_gate.py`, `.continue/skills/quality-playbook/quality_gate.py`, `.github/skills/quality-playbook/quality_gate.py`. Run it from the project root directory. This script mechanically validates: file existence, BUGS.md heading format, sidecar JSON required keys AND per-bug field names (`id`, `requirement`, `red_phase`, `green_phase`, `verdict`, `fix_patch_present`, `writeup_path`) AND enum values AND summary consistency, use case identifiers, terminal gate section, mechanical verification receipts, version stamps, writeup completeness, **regression-test patch presence for every confirmed bug**, and **inline fix diffs in every writeup** (every `quality/writeups/BUG-NNN.md` must contain a ` ```diff ` block). If the script reports any FAIL results, fix each failing check before proceeding — the most common FAILs are: (1) missing `quality/patches/BUG-NNN-regression-test.patch` files, (2) non-canonical JSON field names like `bug_id` instead of `id`, (3) missing `confirmed_open` in the TDD summary, (4) writeups without inline fix diffs (section 6 must include a concrete diff, not just "see patch file"). Do not mark Phase 5 complete until `quality_gate.py` exits 0. Append the script's full output to `quality/results/quality-gate.log`. + +**v1.5.3 Layer-1 mechanical checks (schemas.md §10 invariants #1–#18).** Beyond the legacy gate checks above, `quality_gate.py` in v1.5.3 also enforces the Layer-1 invariants defined in `schemas.md` §10. 
A compact map of what each invariant covers: + +- **#1–#10 — core contract checks.** Citation tier gating, citation document existence, citation hash match, citation excerpt presence + locatability (section/line only; page never sufficient), bug→REQ resolution, forward-link resolution, disposition completeness, functional section presence, no orphan formal docs, INDEX.md field presence. +- **#11 — citation excerpt byte-equality.** The gate re-runs `bin/citation_verifier.extract_excerpt` per schemas.md §5.4 on every Tier 1/2 citation and rejects any stored `citation_excerpt` that does not byte-equal the freshly-extracted one. This is the Layer-1 anti-hallucination mechanism — it catches fabricated or paraphrased excerpts even when the locator is real. +- **#12 — legal `fix_type × disposition` combination** per schemas.md §3.4. +- **#13 — manifest wrapper validity** per schemas.md §1.6. +- **#14 — REQ tier bound to cited FORMAL_DOC tier** (a Tier 1 REQ cannot cite a Tier 2 FORMAL_DOC). +- **#15 — ID uniqueness** within each manifest. +- **#16 — redundant citation metadata** (`version`, `date`, `url`, `retrieved`) must match FORMAL_DOC when present. +- **#17 — semantic-check majority rule.** ≥2 of 3 `overreaches` verdicts for the same Tier 1/2 REQ fails the gate (see Layer 2 sub-pass in Phase 4). +- **#18 — array value uniqueness** in `REQ.use_cases` and `UC.formal_doc_refs`. + +**`citation_stale` is a gate-report marker, not a field on the citation record.** When the stored `citation.document_sha256` diverges from the live `FORMAL_DOC.document_sha256`, `quality_gate.py` writes a `citation_stale` entry into `quality_gate_report.json` (or equivalent). Do NOT write `citation_stale` onto the citation record itself — the record stays pure input, and the stale marker is gate-report output per schemas.md §5.1 / §10 invariant #3. 
+ +**Do not implement the gate in this prose.** The Layer-1 check list above is a summary of what `quality_gate.py` enforces — the authoritative definitions live in schemas.md. The gate implementation (Phase 5 of the v1.5.3 work) lives in `quality_gate.py`; SKILL.md describes the protocol but does not restate the invariants. + +**Use case identifier format.** REQUIREMENTS.md must use canonical use case identifiers in the format `UC-01`, `UC-02`, etc. for all derived use cases. Each use case must be labeled with its identifier. This is required for machine-readable traceability — the identifier format enables `quality_gate.py` and downstream tooling to count and cross-reference use cases programmatically. Use cases written as prose paragraphs without identifiers are non-conformant. + +Update PROGRESS.md: mark Phase 5 complete. The BUG tracker should now show closure status for every entry. + +**End-of-phase message (mandatory — print this after Phase 5 completes, then STOP):** + +``` +# Phase 5 Complete — Reconciliation and TDD Verification + +All confirmed bugs now have regression tests, writeups, and TDD red-green verification. +[Summarize: N total confirmed bugs, N with TDD verified status, N with fix patches. +List all bug IDs with one-line summaries and their TDD verdicts.] + +To continue to Phase 6 (Final verification and quality gate), say: + + Run quality playbook phase 6. + +Or say "keep going" to continue automatically. +``` + +**After printing this message, STOP. Do not proceed to Phase 6 unless the user explicitly asks.** + +--- + +## Phase 6: Verify + +**v1.5.6 instrumentation:** Append `phase_start phase=6` now. At phase end, cross-validate (`quality/BUGS.md` non-empty with `^### BUG-` sections AND `quality/INDEX.md` updated with `gate_verdict` field) then append `phase_end phase=6`. After Phase 6 closes, append `run_end status=success` (or `aborted` / `failed` if applicable).
+ +> **Required references for this phase:** +> - `references/verification.md` — 45 self-check benchmarks + +**Why a verification phase?** AI-generated output can look polished and be subtly wrong. Tests that reference undefined fixtures report 0 failures but 16 errors — and "0 failures" sounds like success. Integration protocols can list field names that don't exist in the actual schemas. The verification phase catches these problems before the user discovers them, which is important because trust in a generated quality playbook is fragile — one wrong field name undermines confidence in everything else. + +**Phase 6 execution model: incremental, not monolithic.** Phase 6 runs as a series of independent verification steps, each reading only the file(s) it needs, checking one thing, and writing its result to `quality/results/phase6-verification.log` before moving to the next step. Do NOT load all artifacts into context at once. Do NOT try to hold the full verification checklist in memory while reading artifacts. Each step below is self-contained — read the file, check the condition, append the result, drop the context. + +### Step 6.1: Mechanical Verification Closure (mandatory first step) + +If `quality/mechanical/` exists, the **literal first action** of Phase 6 is: + +```bash +bash quality/mechanical/verify.sh > quality/results/mechanical-verify.log 2>&1 +echo $? > quality/results/mechanical-verify.exit +``` + +Execute this command in the shell. Do not substitute a Python script, do not read the artifact file and assert on its contents, do not skip this step. The command must be `bash quality/mechanical/verify.sh` — not `python3 -c "..."`, not `cat quality/mechanical/... | grep ...`, not any other equivalent. + +Record the exit code. If non-zero, **Phase 6 fails immediately.** Do not proceed to further steps. 
Go back to the extraction step: delete the mismatched `*_cases.txt`, re-run the extraction command with a fresh shell redirect, re-verify, and update all downstream artifacts that cited the old extraction. + +Record in PROGRESS.md under `## Phase 6 Mechanical Closure` and append to `quality/results/phase6-verification.log`: +``` +[Step 6.1] Mechanical verification: PASS (exit 0) +``` + +**Why this is non-substitutable:** In v1.3.23, the model replaced `bash verify.sh` with `python3 -c "from pathlib import Path; ..."` that read the (forged) artifact file and asserted on its contents — a circular check that passed despite the artifact being fabricated. The only trustworthy verification is re-running the same shell pipeline that produced the artifact and diffing the results. Any other method can be fooled by a corrupted intermediate file. + +### Step 6.2: Run quality_gate.py (script-verified checks) + +Run the mechanical validation gate: + +```bash +# Locate quality_gate.py via the six-layout fallback (in order): quality_gate.py, +# .claude/skills/quality-playbook/quality_gate.py, .github/skills/quality_gate.py, +# .cursor/skills/quality-playbook/quality_gate.py, +# .continue/skills/quality-playbook/quality_gate.py, +# .github/skills/quality-playbook/quality_gate.py +python3 quality_gate.py . > quality/results/quality-gate.log 2>&1 +echo $? >> quality/results/phase6-verification.log +``` + +Read `quality/results/quality-gate.log`. If it reports any FAIL results, fix each failing check before proceeding. The most common FAILs are: (1) missing `quality/patches/BUG-NNN-regression-test.patch` files, (2) non-canonical JSON field names like `bug_id` instead of `id`, (3) missing `confirmed_open` in the TDD summary, (4) writeups without inline fix diffs, (5) missing TDD red/green log files. Do not proceed until `quality_gate.py` exits 0.
+ +Append to `quality/results/phase6-verification.log`: +``` +[Step 6.2] quality_gate.py: PASS (exit 0) — N checks passed, 0 FAIL, 0 WARN +``` + +This step covers verification benchmarks: 14 (sidecar JSON), 17 (test file extension), 18 (use case count), 20 (writeups), 23 (mechanical artifacts), 26 (version stamps), 27 (mechanical directory), 29 (triage-to-BUGS sync), 34 (BUGS.md exists), 38 (individual auditor reports), 39 (BUGS.md heading format), 40 (artifact file existence), 41 (sidecar JSON validation), 42 (script-verified closure), 43 (use case identifiers), 44 (regression-test patches), 45 (writeup inline diffs). + +**v1.5.3 Layer-1 invariants also run here.** `quality_gate.py` additionally enforces schemas.md §10 invariants #1–#18 (summarized in Phase 5 above). In particular, the script re-runs `bin/citation_verifier.extract_excerpt` per schemas.md §5.4 on every Tier 1/2 citation and rejects any stored `citation_excerpt` that does not byte-equal the freshly-extracted output — this is the post-ingest tampering catch. If any Layer-1 invariant fails here, fix the underlying manifest record (not the gate, not the excerpt) and re-run. + +### Step 6.3: Test execution verification + +Run the functional test suite. Read only `quality/test_functional.*` to determine the test command: + +- **Python:** `pytest quality/test_functional.py -v 2>&1 | tail -20` +- **Java:** `mvn test -Dtest=FunctionalTest` or `gradle test --tests FunctionalTest` +- **Go:** `go test -v` targeting the generated test file's package +- **TypeScript:** `npx jest functional.test.ts --verbose` +- **Rust:** `cargo test` +- **Scala:** `sbt "testOnly *FunctionalSpec"` + +Check for both failures AND errors. Errors from missing fixtures, failed imports, or unresolved dependencies count as broken tests. Expected-failure (xfail) regression tests do not count against this check. 
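For the Python case, the failures-AND-errors rule can be sketched against pytest's terminal summary line (the summary strings below are hypothetical output; the helper is illustrative):

```python
import re

def count_broken(summary: str) -> tuple[int, int]:
    """Count failures and errors from a pytest-style summary line.

    Errors (missing fixtures, failed imports, unresolved dependencies)
    count as broken tests too — '0 failures' alone is not success.
    """
    failures = sum(int(n) for n in re.findall(r"(\d+) failed", summary))
    errors = sum(int(n) for n in re.findall(r"(\d+) error", summary))
    return failures, errors

print(count_broken("2 failed, 45 passed in 3.21s"))   # (2, 0)
print(count_broken("45 passed, 16 errors in 1.02s"))  # (0, 16)
```

A run is clean only when both counts are zero; xfail-marked regression tests report as `xfailed` and do not trip either regex.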
+ +Append to `quality/results/phase6-verification.log`: +``` +[Step 6.3] Functional tests: PASS — N tests, 0 failures, 0 errors +``` + +This covers benchmarks 8 (all tests pass) and 9 (existing tests unbroken). + +### Step 6.4: Verification checklist — file-by-file checks + +Process the remaining verification benchmarks from `references/verification.md` in small batches. For each batch, read only the file(s) needed, check the condition, and append the result. **Do not read more than 2 files per batch.** + +**Batch A — QUALITY.md (benchmarks 1-2, 10):** Read `quality/QUALITY.md`. Count scenarios. Verify each scenario references real code (grep for cited function names). Append results. + +**Batch B — Functional test file (benchmarks 3-7):** Read `quality/test_functional.*`. Check cross-variant coverage (~30%), boundary test count, assertion depth (value checks vs presence checks), layer correctness (outcomes vs mechanisms), mutation validity. + +**Batch C — Protocol files (benchmarks 11-13):** Read `quality/RUN_CODE_REVIEW.md`, then `quality/RUN_INTEGRATION_TESTS.md`, then `quality/RUN_SPEC_AUDIT.md` — one at a time. Check each is self-contained and executable. Verify Field Reference Table in integration tests. + +**Batch D — Regression tests (benchmarks 15-16, 24):** Read `quality/test_regression.*` if it exists. Verify skip guards reference bug IDs, verify patch validation gate commands, verify source-inspection tests don't use `run=False`. + +**Batch E — Enumeration and triage checks (benchmarks 19, 21-22, 25, 36):** Read `quality/code_reviews/*.md` (just the enumeration sections). Read `quality/spec_audits/*triage*` (just the verification probe sections). Check two-list comparisons, executable probe evidence, no circular mechanical artifact references, contradiction gate. + +**Batch F — Continuation mode (benchmarks 32-33):** Only if `quality/SEED_CHECKS.md` exists. Read it, verify mechanical execution, verify convergence section in PROGRESS.md. 
+ +Append each batch result to `quality/results/phase6-verification.log`: +``` +[Step 6.4A] QUALITY.md scenarios: PASS — 8 scenarios, all reference real code +[Step 6.4B] Functional test quality: PASS — 30% cross-variant, assertion depth OK +[Step 6.4C] Protocol files: PASS — all self-contained and executable +[Step 6.4D] Regression tests: PASS — all skip guards present +[Step 6.4E] Enumeration/triage: PASS — two-list checks present, probes have assertions +[Step 6.4F] Continuation mode: SKIP — no SEED_CHECKS.md +``` + +If any batch fails, fix the issue immediately before proceeding to the next batch. + +### Step 6.5: Metadata Consistency Check + +Read `quality/PROGRESS.md` (just the metadata and artifact inventory sections). Then spot-check: +- The requirement count is consistent across REQUIREMENTS.md header, PROGRESS.md artifact inventory, and COVERAGE_MATRIX.md header. All three must state the same number. +- The `With docs` field accurately reflects whether `reference_docs/` exists +- The Terminal Gate Verification section is present and filled in + +Then read `quality/COMPLETENESS_REPORT.md` (just the verdict section). Verify no stale pre-reconciliation text remains — if both a `## Verdict` and an `## Updated verdict` (or `## Post-Review Reconciliation`) section exist, **delete the original `## Verdict` section entirely**. The final document must have exactly one `## Verdict` heading. + +Append to `quality/results/phase6-verification.log`: +``` +[Step 6.5] Metadata consistency: PASS — requirement counts match, version stamps consistent +``` + +If any metadata is stale, fix it now. + +### Checkpoint: Finalize PROGRESS.md + +Re-read `quality/PROGRESS.md`. Update: +- Mark Phase 6 (Verification benchmarks) complete with timestamp +- Verify the BUG tracker has closure for every entry +- Add a final summary line: "Run complete. N BUGs found (N from code review, N from spec audit). N regression tests written. N exemptions granted." 
+- **Print the suggested next prompt to the user (mandatory, all runs).** This applies to EVERY run, including baseline — it is not iteration-specific. Print the following block so the user can copy-paste it to start the next iteration: + + For a baseline run (no iteration strategy): + ``` + ──────────────────────────────────────────────────────── + Next iteration suggestion: + "Run the next iteration of the quality playbook using the gap strategy." + ──────────────────────────────────────────────────────── + ``` + + For iteration runs, use this mapping to determine the next strategy: + - **gap** → suggest unfiltered + - **unfiltered** → suggest parity + - **parity** → suggest adversarial + - **adversarial** → suggest "Run the quality playbook from scratch." (cycle complete) + +The completed PROGRESS.md is a permanent audit trail. It documents what the skill did, what it found, and how it resolved each finding. Users can read it to understand the run, debug failures, and compare across runs. + +### Convergence Check (continuation mode only) + +> **Scope:** This subsection only. The suggested-next-prompt step above is unconditional and must execute on every run regardless of whether this convergence check is skipped. + +**This step runs only if Phase 0 executed** (i.e., `quality/SEED_CHECKS.md` exists from prior-run analysis). If this is a first run with no prior history, skip to Phase 7. + +Compare this run's bug list against the seed list: + +1. **Count net-new bugs:** bugs in this run's BUGS.md that do NOT match any seed (by file:line). A bug is "net-new" if it was not found in any prior run. +2. **Count seed carryovers:** seeds that were re-confirmed in this run (FAIL result in Step 0b). +3. **Count seed resolutions:** seeds that are now passing (bug was fixed since prior run). 
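The three counts above can be sketched as set arithmetic over `file:line` locators (the locator sets in the example are hypothetical):

```python
def convergence_counts(current_bugs: set[str], seeds_confirmed: set[str],
                       seeds_all: set[str]) -> dict[str, int]:
    """Convergence tallies, keyed by file:line locator."""
    return {
        "net_new": len(current_bugs - seeds_all),       # not found in any prior run
        "carryover": len(seeds_confirmed & seeds_all),  # seeds re-confirmed (FAIL in 0b)
        "resolved": len(seeds_all - seeds_confirmed),   # seeds now passing
    }

print(convergence_counts(
    {"src/a.c:10", "src/b.c:20"},   # this run's BUGS.md
    {"src/a.c:10"},                 # seeds re-confirmed this run
    {"src/a.c:10", "src/c.c:30"},   # all seeds from prior runs
))  # {'net_new': 1, 'carryover': 1, 'resolved': 1}
```

`net_new == 0` is the convergence criterion used below.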
+ +Write a `## Convergence` section to PROGRESS.md: + +```markdown +## Convergence + +Run number: N (N prior runs in quality/previous_runs/) +Seeds from prior runs: S (S confirmed, R resolved) +Net-new bugs this run: K +Convergence: [CONVERGED | NOT CONVERGED] + +Net-new bugs: +- BUG-NNN: [summary] (file:line) — not in any prior run +``` + +**Convergence criterion:** The run is converged if **net-new bugs = 0** — every bug found in this run was already known from a prior run. This means further runs are unlikely to find additional bugs in the declared scope. + +**If CONVERGED:** Print to the user: "This run found no new bugs beyond the N already known from prior runs. Bug discovery has converged for this scope. Total confirmed bugs across all runs: T." Then proceed to Phase 7. + +**If NOT converged — automatic re-iteration.** When the convergence check shows net-new bugs > 0 and the iteration count has not reached the maximum (default: 5), the skill re-iterates automatically: + +1. Record the iteration number and net-new count in PROGRESS.md. +2. Archive the current `quality/` directory via `bin/run_playbook.archive_previous_run(repo_dir, timestamp)` (or `bin.archive_lib.archive_run()` at Phase 6 success). These snapshot `quality/` into `quality/previous_runs//quality/` and write the per-run `INDEX.md` plus a `RUN_INDEX.md` row. +3. Restart from **Phase 0** (which will now find the newly archived run in `quality/previous_runs/`). +4. Print to the user: "Iteration N found K net-new bugs. Archiving and starting iteration N+1 (max M)." + +The iteration counter starts at 1 for the first run. Each archive-and-restart increments it. When the counter reaches the maximum, stop iterating even if not converged and print: "Reached maximum iterations (M) without convergence. K net-new bugs found in the last run. Total confirmed bugs across all runs: T." + +**Iteration limits.** The default maximum is 5 iterations. 
If the user's prompt includes an explicit limit (e.g., "run the playbook with 3 iterations"), use that limit instead. If the user's prompt says "single run" or "no iteration," skip re-iteration entirely and treat NOT CONVERGED the same as the pre-iteration behavior: print the net-new count and suggest re-running. + +**Context window awareness.** If at any point during re-iteration you detect that your context window is substantially consumed (e.g., you are producing noticeably shorter or lower-quality output than earlier iterations), stop iterating, write the current state to PROGRESS.md, and print: "Stopping iteration due to context constraints. Completed N of M iterations. Re-run the playbook to continue — Phase 0 will pick up the seed list from quality/previous_runs/." This is a safety valve, not a target — most codebases converge in 2-3 iterations. + +**Why this matters:** A single playbook run explores a subset of the codebase non-deterministically. The first run on virtio might find BUG-001 and BUG-004 but miss BUG-005. The second run might find BUG-005 and BUG-006. By the third run, if no net-new bugs appear, the exploration has likely covered the high-value territory. The seed list ensures previously found bugs are never lost between runs, and the convergence check tells the user when additional runs have diminishing returns. Automatic re-iteration means the skill is self-contained — callers don't need external scripts or manual re-runs to achieve convergence. + +**End-of-phase message (mandatory — print this after Phase 6 completes, then STOP):** + +``` +# Phase 6 Complete — All Phases Done + +The quality playbook baseline run is complete. Here's the summary: + +[Include: total confirmed bugs, quality gate pass/fail/warn counts, +list of all bug IDs with one-line summaries and severities.] 
+ +Key output files: +- quality/BUGS.md — all confirmed bugs with spec basis and patches +- quality/results/tdd-results.json — structured TDD verification results +- quality/patches/ — regression test and fix patches for every bug + +You can now run iteration strategies to find additional bugs. Iterations typically +add 40-60% more confirmed bugs on top of the baseline. The recommended cycle is: +gap → unfiltered → parity → adversarial. + +To run all four iterations automatically, say: + + Run all iterations. + +I'll orchestrate each strategy as a separate sub-agent with its own context window. + +To run one iteration at a time, say: + + Run the next iteration of the quality playbook. + +Or ask me about the results: "Tell me about BUG-001" or "Which bugs are highest priority?" + +After you fix the bugs, say "recheck" to verify the fixes were applied correctly. +``` + +**After printing this message, STOP. Do not proceed to iterations unless the user explicitly asks.** + +**End-of-iteration message (mandatory — print this after each iteration completes, then STOP):** + +``` +# Iteration Complete — [Strategy Name] + +[Summarize: N net-new bugs found in this iteration, total now at N. +List new bug IDs with one-line summaries.] + +[If there are remaining strategies in the recommended cycle, suggest the next one:] +The next recommended strategy is [next strategy]. To run it, say: + + Run the next iteration using the [next strategy] strategy. + +[If all four strategies have been run:] +All four iteration strategies have been run. Total confirmed bugs: N. +You can review the results, ask about specific bugs, or re-run any strategy. + +After you fix the bugs, say "recheck" to verify the fixes were applied correctly. + +Or say "keep going" to run the next iteration automatically. +``` + +**After printing this message, STOP. 
Do not proceed to the next iteration unless the user explicitly asks.** + +--- + +## Recheck Mode — Verify Bug Fixes + +Recheck mode is a lightweight verification pass that checks whether bugs from a previous run have been fixed. Instead of re-running the full six-phase pipeline (60-90 minutes), recheck reads the existing `quality/BUGS.md`, checks each bug against the current source tree, and reports which bugs are fixed vs. still open. A typical recheck takes 2-10 minutes. + +**When to use recheck mode:** After the user (or another agent) has applied fixes for bugs found by the playbook. The user says "recheck" or "verify the bug fixes" or "check which bugs are fixed." + +**Do not use recheck mode** as a substitute for running the full playbook. Recheck only verifies previously found bugs — it does not find new ones. + +### Recheck procedure + +**Step 1: Read the bug inventory.** + +Read `quality/BUGS.md` and parse every `### BUG-NNN` entry. For each bug, extract: +- Bug ID (e.g., BUG-001) +- File path and line number from the `**File:**` field +- Description summary (first sentence of `**Description:**`) +- Severity +- Fix patch path from `**Fix patch:**` field (e.g., `quality/patches/BUG-001-fix.patch`) +- Regression test path from `**Regression test:**` field + +**Step 2: Check each bug against the current source.** + +For each bug, perform these checks in order: + +1. **Fix patch check.** If a fix patch exists at the referenced path, run `git apply --check --reverse quality/patches/BUG-NNN-fix.patch` against the current tree. If the reverse-apply succeeds (exit 0), the fix patch is already applied — the bug is likely fixed. If it fails, the fix has not been applied or the code has changed. + +2. **Source inspection.** Open the file at the cited line number. Read the surrounding context (±20 lines). Compare what you see against the bug description. Has the problematic code been changed? Does the fix address the root cause described in the bug report? + +3. 
**Regression test execution.** If a regression test patch exists: + - Apply it: `git apply quality/patches/BUG-NNN-regression-test.patch` + - Run the test (using the project's test runner). If the test PASSES, the bug is fixed. If it FAILS, the bug is still present. + - Reverse the patch: `git apply -R quality/patches/BUG-NNN-regression-test.patch` + + If the regression test patch doesn't apply cleanly (because the source has changed), note this and fall back to source inspection alone. + +4. **Verdict.** Assign one of these statuses: + - **FIXED** — Fix patch is applied AND regression test passes (or source inspection confirms the fix if test can't run) + - **PARTIALLY_FIXED** — The problematic code has changed but the regression test still fails, or the fix addresses some but not all aspects of the bug + - **STILL_OPEN** — The original problematic code is unchanged, or the regression test still fails + - **INCONCLUSIVE** — Can't determine status (file moved, code heavily refactored, patches don't apply) + +**Step 3: Generate recheck results.** + +Write `quality/results/recheck-results.json` with this schema: + +Note: The recheck schema uses `"schema_version": "1.0"` (not `"1.1"`) because it has a different structure from the TDD sidecar — the `source_run` and per-bug `status`/`evidence` fields are unique to the recheck schema. The quality gate validates this value as `"1.0"`. 
+ +```json +{ + "schema_version": "1.0", + "skill_version": "1.5.6", + "date": "YYYY-MM-DD", + "project": "<project-name>", + "source_run": { + "bugs_md_date": "<date from the BUGS.md header>", + "total_bugs": <int> + }, + "results": [ + { + "id": "BUG-001", + "severity": "HIGH", + "summary": "<one-line bug summary>", + "status": "FIXED", + "evidence": "<what was checked and what it showed>" + } + ], + "summary": { + "total": <int>, + "fixed": <int>, + "partially_fixed": <int>, + "still_open": <int>, + "inconclusive": <int> + } +} +``` + +Also write a human-readable summary to `quality/results/recheck-summary.md`: + +```markdown +# Recheck Results + +> Recheck of quality/BUGS.md from <date of original run> +> Recheck run: <today's date> +> Skill version: <skill version> + +## Summary + +| Status | Count | +|--------|-------| +| Fixed | N | +| Partially fixed | N | +| Still open | N | +| Inconclusive | N | +| **Total** | **N** | + +## Per-Bug Results + +| Bug | Severity | Status | Evidence | +|-----|----------|--------|----------| +| BUG-001 | HIGH | FIXED | Reverse-apply succeeded, regression test passes | +| BUG-002 | MEDIUM | STILL_OPEN | Original code unchanged at quality_gate.py:125 | +| ... | ... | ... | ... | + +## Still Open — Details + +[For each STILL_OPEN or PARTIALLY_FIXED bug, include a brief explanation of what remains to be fixed.] +``` + +**Step 4: Print the recheck summary.** + +Print the summary table to the user, then STOP. Example: + +``` +# Recheck Complete + +Checked 19 bugs from quality/BUGS.md against current source.
+ +| Status | Count | +|--------|-------| +| Fixed | 17 | +| Still open | 2 | +| **Total** | **19** | + +Fixed bugs: BUG-001, BUG-002, BUG-003, BUG-004, BUG-005, BUG-006, BUG-007, +BUG-008, BUG-009, BUG-010, BUG-011, BUG-013, BUG-014, BUG-015, BUG-016, +BUG-017, BUG-018 + +Still open: BUG-012 (stale .orig file still present), BUG-019 (benchmark 40 +artifact list not updated) + +Results saved to: +- quality/results/recheck-results.json (machine-readable) +- quality/results/recheck-summary.md (human-readable) +``` + +### Triggering recheck mode + +Recheck mode activates when the user says any of: "recheck", "verify the bug fixes", "check which bugs are fixed", "recheck the bugs", "run recheck mode", or similar phrasing that clearly indicates they want to verify fixes rather than find new bugs. When triggered, skip Phases 1-7 entirely and execute only the recheck procedure above. + +--- + +## Phase 7: Present, Explore, Improve (Interactive) + +After generating and verifying, present the results clearly and give the user control over what happens next. This phase has three parts: a scannable summary, drill-down on demand, and a menu of improvement paths. + +**Do not skip this phase.** The autonomous output from Phases 1-6 is a solid starting point, but the user needs to understand what was generated, explore what matters to them, and choose how to improve it. A quality playbook is only useful if the people who own the project trust it and understand it. Dumping six files without explanation creates artifacts nobody reads. 
+ +### Part 1: The Summary Table + +Present a single table the user can scan in 10 seconds: + +``` +Here's what I generated: + +| File | What It Does | Key Metric | Confidence | +|------|-------------|------------|------------| +| REQUIREMENTS.md | Testable requirements with use cases | N requirements, N use cases | ██████░░ Medium — solid baseline from 5-phase pipeline, improves with refinement passes | +| QUALITY.md | Quality constitution | 10 scenarios | ██████░░ Medium — grounded in code, but scenarios are inferred, not from real incidents | +| Functional tests | Automated tests | 47 passing | ████████ High — all tests pass, 35% cross-variant | +| RUN_CODE_REVIEW.md | Three-pass code review | 3 passes | ████████ High — structural + requirement verification + consistency | +| RUN_INTEGRATION_TESTS.md | Integration test protocol | 9 runs × 3 providers | ██████░░ Medium — quality gates need threshold tuning | +| RUN_SPEC_AUDIT.md | Council of Three audit | 10 scrutiny areas | ████████ High — guardrails included | +| RUN_TDD_TESTS.md | TDD verification protocol | N bugs to verify | ████████ High — mechanical red-green cycle with spec traceability | ``` Adapt the table to what you actually generated — the file names, metrics, and confidence levels will vary by project. The confidence column is the most important: it tells the user where to focus their attention. @@ -368,6 +2601,9 @@ To use these artifacts, start a new AI session and try one of these prompts: • Start a spec audit (Council of Three): "Read quality/RUN_SPEC_AUDIT.md and follow its instructions using [model name]." + +• Run TDD verification for confirmed bugs: + "Read quality/RUN_TDD_TESTS.md and follow its instructions to verify all confirmed bugs." ``` Adapt the test runner command and module names to the actual project. The point is to give the user copy-pasteable prompts — not descriptions of what they could do, but the actual text they'd type.
@@ -390,21 +2626,36 @@ The user may go through several drill-downs before they're ready to improve anyt After the user has seen the summary (and optionally drilled into details), present the improvement options: -> "Three ways to make this better:" +> "Five ways to make this better:" +> +> **1. Review requirements interactively** — Read `quality/REVIEW_REQUIREMENTS.md` for a guided walkthrough of the requirements organized by use case. You can pick specific use cases to drill into, or walk through all of them sequentially. A different model can also fact-check the completeness report (cross-model audit). Good for: finding gaps the pipeline missed. > -> **1. Review and harden individual items** — Pick any scenario, test, or protocol section and I'll walk through it with you. Good for: tightening specific quality gates, fixing inferred scenarios, adding missing edge cases. +> **2. Refine requirements with a different model** — Read `quality/REFINE_REQUIREMENTS.md` and run a refinement pass. You can run this with any AI model — Claude, GPT, Gemini — and each will catch different gaps. Run as many models as you want until you hit diminishing returns. Each pass backs up the current version and logs changes in `quality/VERSION_HISTORY.md`. Good for: pushing requirements from the baseline toward completeness. > -> **2. Guided Q&A** — I'll ask you 3-5 targeted questions about things I couldn't infer from the code: incident history, expected distributions, cost tolerance, model preferences. Good for: filling knowledge gaps that make scenarios more authoritative. +> **3. Review and harden other items** — Pick any scenario, test, or protocol section and I'll walk through it with you. Good for: tightening specific quality gates, fixing inferred scenarios, adding missing edge cases. > -> **3. 
Review development history** — Point me to exported AI chat history (Claude, Gemini, ChatGPT exports, Claude Code transcripts) and I'll mine it for design decisions, incident reports, and quality discussions that should be in QUALITY.md. Good for: grounding scenarios in real project history instead of inference. +> **4. Guided Q&A** — I'll ask you 3-5 targeted questions about things I couldn't infer from the code: incident history, expected distributions, cost tolerance, model preferences. Good for: filling knowledge gaps that make scenarios more authoritative. +> +> **5. Feed in additional documentation** — The requirements pipeline works better with more intent sources. Point me to any of these and I'll use them to refine the requirements and quality constitution: +> - Exported AI chat history (Claude, Gemini, ChatGPT exports, Claude Code transcripts) +> - Slack or Teams channels where the project was discussed +> - Email threads, Jira/Linear tickets, or GitHub issues about the project +> - Design documents, architecture decision records, or meeting notes +> - Newsgroup posts, forum discussions, or mailing list archives +> +> You can use tools like Claude Cowork, GitHub Copilot, or OpenClaw to connect to these sources and gather them into a folder, then point me at the folder. Good for: grounding scenarios and requirements in real project history instead of inference. > > "You can do any combination of these, in any order. Which would you like to start with?" ### Executing Each Improvement Path -**Path 1: Review and harden.** The user picks an item. Walk through it: show the current text, explain your reasoning, ask if it's accurate. Revise based on their feedback. Re-run tests if the functional tests change. +**Path 1: Review requirements interactively.** Point the user to `quality/REVIEW_REQUIREMENTS.md` and offer to walk through it together. 
The protocol supports self-guided (pick use cases), fully guided (sequential walkthrough), and cross-model audit (different model fact-checks the completeness report). Progress is tracked in `quality/REFINEMENT_HINTS.md` so the user can pick up where they left off. + +**Path 2: Refine requirements with a different model.** Point the user to `quality/REFINE_REQUIREMENTS.md`. Each refinement pass: backs up the current version to `quality/history/vX.Y/`, reads feedback from REFINEMENT_HINTS.md, makes targeted improvements, bumps the minor version, and logs changes in VERSION_HISTORY.md. The user can run this with Claude, GPT, Gemini, or any other model — each catches different blind spots. Run until diminishing returns. + +**Path 3: Review and harden other items.** The user picks a scenario, test, or protocol section. Walk through it: show the current text, explain your reasoning, ask if it's accurate. Revise based on their feedback. Re-run tests if the functional tests change. -**Path 2: Guided Q&A.** Ask 3-5 questions derived from what you actually found during exploration. These categories cover the most common high-leverage gaps: +**Path 4: Guided Q&A.** Ask 3-5 questions derived from what you actually found during exploration. These categories cover the most common high-leverage gaps: - **Incident history for scenarios.** "I found [specific defensive code]. What failure caused this? How many records were affected?" - **Quality gate thresholds.** "I'm checking that [field] contains [values]. What distribution is normal? What signals a problem?" @@ -414,14 +2665,14 @@ After the user has seen the summary (and optionally drilled into details), prese After the user answers, revise the generated files and re-run tests. 
-**Path 3: Review development history.** If the user provides a chat history folder: +**Path 5: Feed in additional documentation.** The user points you to additional intent sources — chat history, Slack exports, email threads, Jira tickets, design docs, meeting notes, forum archives. These contain design decisions, incident history, and quality discussions that didn't make it into formal documentation. -1. Scan for index files and navigate to quality-relevant conversations (same approach as Step 0, but now with specific targets — you know which scenarios need grounding, which quality gates need thresholds, which design decisions need rationale). -2. Extract: incident stories with specific numbers, design rationale for defensive patterns, quality framework discussions, cross-model audit results. -3. Revise QUALITY.md scenarios with real incident details. Update integration test thresholds with real-world values. Add Council of Three empirical data if audit results exist. -4. Re-run tests after revisions. +1. Scan for index files and navigate to quality-relevant content (same approach as Step 0, but now with specific targets — you know which requirements need grounding, which scenarios need thresholds, which gaps need closing). +2. Extract: incident stories with specific numbers, design rationale for defensive patterns, quality framework discussions, cross-model audit results, and behavioral contracts that weren't visible from the code alone. +3. Feed findings into `quality/REFINEMENT_HINTS.md` as new feedback items, then run a refinement pass to update the requirements. +4. Revise QUALITY.md scenarios with real incident details. Update integration test thresholds with real-world values. Re-run tests after revisions. -If the user already provided chat history in Step 0, you've already mined it — but they may want to point you to specific conversations or ask you to dig deeper into a particular topic. 
+If the user already provided chat history in Step 0, you've already mined it — but they may want to point you to specific conversations, connect additional sources, or ask you to dig deeper into a particular topic. ### Iteration @@ -461,6 +2712,11 @@ Examine existing test files to understand how they set up test data. Whatever pa 3. Concrete failure modes make standards non-negotiable — abstract requirements invite rationalization 4. Guardrails transform AI review quality (line numbers, read bodies, grep before claiming) 5. Triage before fixing — many "defects" are spec bugs or design decisions +6. Structural review has a ceiling (~65%). The remaining ~35% are intent violations — absence bugs, cross-file contradictions, design gaps — invisible to any tool that only reads code. Requirements make the invisible visible. +7. The specification is the unique contribution, not the review structure. Focus areas and review protocols are secondary to having the right testable requirements derived from intent sources. +8. Cross-requirement consistency checking is essential. Bugs often live in the gap between two individually-correct pieces of code. Per-requirement verification alone can't find these. +9. Keep all derived requirements — do not filter. The cost of checking an extra requirement is low; the cost of missing a bug because you pruned the requirement that would have caught it is high. +10. A failing test is the strongest evidence a bug exists. Run the red-green TDD cycle (test fails on buggy code, passes on fixed code) for every confirmed bug with a fix patch. Show the FAIL→PASS output — reviewers can disagree with your fix but can't argue with a reproducing test. 
--- @@ -470,10 +2726,13 @@ Read these as you work through each phase: | File | When to Read | Contains | |------|-------------|----------| +| `references/exploration_patterns.md` | Phase 1 (explore) | Pattern applicability matrix, deep-dive templates, domain-knowledge questions | | `references/defensive_patterns.md` | Step 5 (finding skeletons) | Grep patterns, how to convert findings to scenarios | | `references/schema_mapping.md` | Step 5b (schema types) | Field mapping format, mutation validity rules | +| `references/requirements_pipeline.md` | Phase 2 (requirements) | Five-phase pipeline, versioning protocol, carry-forward rules | | `references/constitution.md` | File 1 (QUALITY.md) | Full template with section-by-section guidance | | `references/functional_tests.md` | File 2 (functional tests) | Test structure, anti-patterns, cross-variant strategy | -| `references/review_protocols.md` | Files 3–4 (code review, integration) | Templates for both protocols | +| `references/review_protocols.md` | Files 3–4 (code review, integration) | Templates for both protocols, patch validation, skip guards | | `references/spec_audit.md` | File 5 (Council of Three) | Full audit protocol, triage process, fix execution | -| `references/verification.md` | Phase 3 (verify) | Complete self-check checklist with all 13 benchmarks | +| `references/iteration.md` | Iterations (after Phase 6) | Four iteration strategies: gap, unfiltered, parity, adversarial | +| `references/verification.md` | Phase 6 (verify) | Complete self-check checklist (45 benchmarks) including structured output, patch gate, skip guard validation, pre-flight discovery, version stamps, bug writeups, enumeration completeness, triage executable evidence, code-extracted enumeration lists, mechanical verification artifacts, source-inspection test execution, contradiction gate, seed check execution, convergence tracking, sidecar JSON schema validation, script-verified closure gate, canonical use case identifiers, and 
writeup inline fix diffs | diff --git a/skills/quality-playbook/agents/calibration_orchestrator.md b/skills/quality-playbook/agents/calibration_orchestrator.md new file mode 100644 index 000000000..e455ecd40 --- /dev/null +++ b/skills/quality-playbook/agents/calibration_orchestrator.md @@ -0,0 +1,222 @@ +# Calibration Orchestrator — autonomous cycle prompt template (v1.5.6) + +*Prompt template for the AI session driving an end-to-end QPB calibration cycle. The orchestrator AI executes Steps 1-12 from `ai_context/CALIBRATION_PROTOCOL.md`, spawns playbook subprocesses per benchmark, and writes the cycle audit + Lever Calibration Log entry. Designed for Claude Code sessions but will work in any tool with bash + file tools.* + +*This prompt builds on `ai_context/CALIBRATION_PROTOCOL.md` Mode 1 (autonomous). The protocol is the canonical operational guide; this template wires it into v1.5.6's run-state instrumentation so the cycle is fully observable, resumable, and recoverable.* + +*Schema for cycle-level events: `references/run_state_schema.md`.* + +*Session model — **spawn-and-resume across multiple orchestrator sessions** (v1.5.6 cluster F.1 finding from the 2026-05-02 Pattern 7 cycle). The orchestrator role spans many discrete AI sessions that re-attach to the same cycle directory and resume from `run_state.jsonl`; each session typically drives one cycle step (kick off a benchmark, finalize a benchmark on completion, apply the lever, run Council, etc.) and exits. A long-lived single-session orchestrator was attempted in early prototyping and did not survive realistic AI session lifetimes (timeouts, network drops, operator-ended sessions across the ~4 hours an 8-benchmark cycle takes). The Step 2 spawn pattern below — `nohup` the playbook in the background, append a `benchmark_start` event with the PID, return control — IS the load-bearing recovery mechanism, not an exception case.* + +*Compare with `ai_context/AI_ORCHESTRATION_PATTERNS.md`. 
That document describes a **multi-session orchestrator/worker** pattern where a chat-driving AI controls a separate coding AI via files in a shared directory. This template applies the same multi-session discipline at a different layer: the orchestrator AI sessions (any number across the cycle's lifetime) coordinate the playbook subprocess lifecycle, while the playbook itself is the worker. Use this template when the work to coordinate is a calibration cycle (a fixed Steps 1-12 workflow); use the broader orchestrator/worker pattern when chat-side planning and coding-side execution need to be coordinated outside a calibration cycle.* + +--- + +## Role + +You are the **calibration orchestrator** for a Quality Playbook calibration cycle. Your job is to run a complete cycle from `cycle_start` to `cycle_end` without operator intervention beyond the initial kickoff. + +You are NOT the playbook AI. You spawn playbook AI sessions (via `python3 -m bin.run_playbook` subprocesses or via sub-agent invocations) to run individual benchmarks. You drive the cycle-level workflow above the playbook. + +--- + +## Inputs (operator provides at kickoff) + +The operator launches you with these inputs filled in: + +- **`<cycle-name>`** — short kebab-case identifier. Format: `<date>-<short-name>`. Example: `2026-05-15-pattern7-displacement-recovery`. +- **`<lever-under-test>`** — the lever from `ai_context/IMPROVEMENT_LOOP.md` you're calibrating. Example: `lever-1-exploration-breadth-depth`. +- **`<lever-change-description>`** — what you'll actually edit. Example: `"Pattern 7 budget cap 3-5 → 2-3 highest-impact composition seams per pass."` +- **`<benchmarks>`** — comma-separated benchmark list. Example: `chi-1.3.45,chi-1.5.1,virtio-1.5.1,express-1.3.50`. +- **`<hypothesis>`** — the testable claim. Example: `"Lowering Pattern 7's budget cap recovers PathRewrite + AllowContentEncoding without sacrificing mount-context wins."` +- **`<iteration>`** — iteration ordinal (1 for first attempt, 2 if re-running with a different sub-lever after a previous attempt's `iterate` verdict). Default: 1. +- **`<max-iterations>`** — maximum iterations before halt. Default: 3. + +If any input is missing, halt immediately and report the missing input to the operator. + +--- + +## Cycle directory layout + +Working directory: `~/Documents/AI-Driven Development/Quality Playbook/Calibration Cycles/<cycle-name>/` + +Files you produce: +- `run_state.jsonl` — cycle-level event log (your own append-only output). Schema: `references/run_state_schema.md` "Cycle-level events" section. +- `audit.md` — human-readable cycle audit. Written at cycle close. +- `post-pattern7-snapshots/` (or analogous lever-specific subdir) — copies of post-lever BUGS.md per benchmark, in case canonical paths get overwritten. +- `visualizations/` — populated by `bin/visualize_calibration.py` (available in current releases; may not exist yet during early cycles). + +Files you write to elsewhere: +- `metrics/regression_replay/<cycle-name>/<benchmark>-all.json` — per-benchmark cell.json (one per pre/post pair). +- `docs/process/Lever_Calibration_Log.md` — append a new cycle entry at cycle close. + +--- + +## Resume semantics + +Before doing anything else, check whether `Calibration Cycles/<cycle-name>/run_state.jsonl` exists. + +- **No file:** fresh cycle. Proceed to Step 0 below. +- **File exists:** read all events. Find the last event. Pick up where the prior session stopped: + - If last event is `cycle_start`: redo Step 1 (pre-flight) since the prior session crashed before any benchmark work. + - If last event is `benchmark_start <benchmark>` without matching `benchmark_end`: that benchmark was in flight when the prior session crashed. Check whether `repos/archive/<benchmark>/quality/run_state.jsonl` shows a `run_end` event. If yes: parse the BUGS.md, append `benchmark_end`, continue to next benchmark. If no: the playbook session also crashed; restart that benchmark (clean its `quality/`, re-spawn the playbook). + - If last event is `lever_change_applied`: pre-lever benchmarks complete, lever change committed, post-lever runs are next. + - If last event is `benchmark_end <benchmark>` (last bench in the list): all benchmarks done; proceed to delta computation + cycle close. + +Trust artifacts (BUGS.md content, commit history) more than events. If events claim a benchmark complete but BUGS.md is empty, re-run. + +--- + +## Steps + +### Step 0: Initialize cycle run-state + +If fresh cycle: + +1. Create `Calibration Cycles/<cycle-name>/` directory if absent. +2. Write `run_state.jsonl` with two events: + - `_index`: `{"event":"_index","ts":"<timestamp>","schema_version":"1.5.6","event_types":["_index","cycle_start","benchmark_start","benchmark_end","lever_change_applied","lever_change_reverted","cycle_end"],"cycle_name":"<cycle-name>","lever_under_test":"<lever-under-test>","benchmarks":[<benchmarks>],"iteration":<iteration>}` + - `cycle_start`: `{"event":"cycle_start","ts":"<timestamp>","hypothesis":"<hypothesis>","noise_floor_threshold":0.05}` + +### Step 1: Pre-flight + +Verify environment per `CALIBRATION_PROTOCOL.md` Step 1 checks: + +- `git status --porcelain` clean (or only contains expected scratch files; document any). +- Current branch is `1.5.6` (or whichever development branch you're on); record the HEAD SHA. +- `bin/run_playbook.py --help` runs cleanly. +- `claude --version` (or whichever runner you're using) reports a usable version. +- For each benchmark in `<benchmarks>`: verify `repos/archive/<benchmark>/` exists; verify `repos/archive/<benchmark>/quality/previous_runs/<baseline-run>/quality/BUGS.md` exists (this is the historical baseline used for recall computation). + +If any pre-flight check fails: append an `error` event with `recoverable:false`, write `cycle_end verdict=halt-preflight-failed`, write a partial audit, and report. + +### Step 2: Pre-lever benchmark runs + +For each benchmark in `<benchmarks>`: + +1. Append `benchmark_start`: `{"event":"benchmark_start","ts":"<timestamp>","benchmark":"<benchmark>","lever_state":"pre-lever"}`. +2. Verify or restore the canonical pre-lever state of the QPB working tree (the lever change must NOT yet be applied at this point). +3. Reset the benchmark's `quality/` to a known-empty state: `cp -r repos/archive/<benchmark>/quality/previous_runs/<baseline-run> /tmp/save-<benchmark> && rm -rf repos/archive/<benchmark>/quality/* && mkdir -p repos/archive/<benchmark>/quality/previous_runs && cp -r /tmp/save-<benchmark> repos/archive/<benchmark>/quality/previous_runs/<baseline-run>` (or equivalent — the goal is a fresh `quality/` tree with prior runs preserved). +4. Spawn the playbook. The realistic mechanism for AI-session-driven cycles is **spawn + resume on re-invocation**: + - Launch the playbook in the background with output redirected to a log file: `nohup python3 -m bin.run_playbook --claude --phase 1,2,3 repos/archive/<benchmark> > <benchmark>-playbook.log 2>&1 &`. Capture the PID. + - Append a `benchmark_start` event with the PID and log path so a resumed orchestrator can find them. + - Return control to the operator (or to the calling shell). The orchestrator session ends; the playbook continues running. + - The operator (or a watchdog) re-invokes the orchestrator periodically (e.g., every 30-60 minutes). On each re-invocation, the orchestrator reads its cycle's `run_state.jsonl`, finds the in-flight benchmark, and checks `repos/archive/<benchmark>/quality/run_state.jsonl` for `run_end`. If complete: parse BUGS.md, compute recall, append `benchmark_end`, advance to next benchmark (or next cycle step). If incomplete and the playbook PID is still alive: re-launch the orchestrator later. If incomplete and the PID is dead: the playbook crashed; clean and re-spawn. + - **Why not synchronous block:** AI sessions (Claude Code, Cowork sub-agents) don't reliably block for 30-minute subprocess durations across 8 benchmarks (~4 hours total). The session would time out, drop network, or be ended by the operator. Spawn + resume is the only pattern that survives realistic session lifetimes. + - **Watchdog timeout:** if a benchmark's playbook hasn't produced a `run_end` event after 90 minutes wall-clock, treat it as hung. Kill the PID, clean the benchmark's `quality/`, append `error recoverable:true`, and re-spawn. After 3 hung-and-restart cycles on the same benchmark, halt with `cycle_end verdict:"halt-playbook-hang"`. +5. When the playbook reports complete: read `repos/archive/<benchmark>/quality/BUGS.md`. Compute recall: count of bug IDs in the new BUGS.md that match (by file:line or canonical bug name) any bug ID in `repos/archive/<benchmark>/quality/previous_runs/<baseline-run>/quality/BUGS.md`. Recall = `|found ∩ baseline| / |baseline|`. +6. Append `benchmark_end`: `{"event":"benchmark_end","ts":"<timestamp>","benchmark":"<benchmark>","lever_state":"pre-lever","recall":<recall>,"bugs_found":[...],"bugs_missed":[...],"historical_baseline_path":"<baseline-path>"}`. + +### Step 3: Apply lever change + +1. Edit the file(s) per `<lever-change-description>`. Example for the Pattern 7 displacement recovery cycle: edit `references/exploration_patterns.md` Pattern 7 budget-cap line. +2. Commit to the working branch (1.5.6 or current development branch): `git add <files> && git commit -m "v1.5.6 lever pull (<lever-under-test>): <lever-change-description>\n\nCycle: <cycle-name>\nIteration: <iteration>\nHypothesis: <hypothesis>"`. +3. Capture the commit SHA. +4. Append `lever_change_applied`: `{"event":"lever_change_applied","ts":"<timestamp>","lever_id":"<lever-under-test>","files_changed":["<file>",...],"commit_sha":"<sha>","description":"<lever-change-description>"}`. + +### Step 4: Post-lever benchmark runs + +Repeat Step 2's loop with `lever_state:"post-lever"` for each benchmark. Same playbook invocation, same recall computation, same `benchmark_end` event but with `lever_state:"post-lever"`. + +After each `benchmark_end`, copy the post-lever BUGS.md aside into `Calibration Cycles/<cycle-name>/post-lever-snapshots/<benchmark>.md` so it survives any subsequent cleanup. + +### Step 5: Compute deltas + cross-benchmark check + +1. From the events log, compute per-benchmark `delta = recall_after - recall_before`. +2. Check the cross-benchmark invariant: NO benchmark should regress beyond `noise_floor_threshold` (0.05). If `delta < -0.05` on any benchmark, the lever pull caused a regression there — this is a Block condition. +3. Build the cell.json output: write to `metrics/regression_replay/<cycle-name>/<benchmark>-all.json` per the cell.json schema.
Include `lever_under_test`, `benchmarks`, `recall_before`, `recall_after`, `delta`, `regression_check.status` (clean/regression), `noise_floor_threshold:0.05`. + +### Step 6: Council review (Mode 1: sub-agent fan-out, three lenses) + +Per `CALIBRATION_PROTOCOL.md` Step 7. Spawn three parallel sub-agents using your tool's parallel-agent mechanism (Cowork's Agent tool with `general-purpose` subagent_type, parallel `claude` CLI invocations from bash, etc.). **Three flat lenses, not nested 9-perspective** — Mode 1's autonomous Council is intentionally lighter than the operator-driven nested Council in `CALIBRATION_PROTOCOL.md`'s Mode 2. The full 9-perspective nested panel requires `gh copilot` invocations the orchestrator can't run. + +Each of the three sub-agents gets: + +- The cycle's hypothesis, lever change diff, pre/post recall numbers per benchmark, regression check status. +- A focused review lens, one per sub-agent: + - **Sub-agent 1 (Diagnosis lens):** "Is the lever change well-targeted at the diagnosed symptom?" Reads the cycle's hypothesis and the lever-change diff. Verdict: targets the symptom / doesn't / partial. + - **Sub-agent 2 (Scope lens):** "Are the recall numbers honest given run conditions?" Reads the per-benchmark `benchmark_end` events and the underlying BUGS.md files. Verdict: numbers reflect reality / numbers may be artifact of run conditions / inconclusive. + - **Sub-agent 3 (Regression-risk lens):** "Does any benchmark regress beyond the noise floor? Are wins on one benchmark coming at the cost of losses elsewhere?" Verdict: clean / regression-detected / partial-recovery. + +Synthesize into a Council verdict: Ship (all three positive or two-of-three positive with no Block), Block (any sub-agent issues a Block, or two-of-three negative), Iterate (Council surfaces a clearly-better sub-lever). Document each sub-agent's verdict in the cycle audit. 
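The Step 5 arithmetic is mechanical enough to script. A minimal sketch that consumes `benchmark_end` events shaped like the run-state examples above — the event parsing is a sketch, not the playbook's actual tooling:

```python
# Sketch: per-benchmark recall deltas plus the cross-benchmark
# noise-floor check from Step 5. Events follow the benchmark_end
# shape logged by Steps 2 and 4.
NOISE_FLOOR = 0.05

def compute_deltas(events: list[dict]) -> dict:
    recall: dict[str, dict[str, float]] = {}
    for e in events:
        if e["event"] == "benchmark_end":
            # benchmark -> {"pre-lever": r, "post-lever": r}
            recall.setdefault(e["benchmark"], {})[e["lever_state"]] = e["recall"]
    deltas = {b: r["post-lever"] - r["pre-lever"] for b, r in recall.items()}
    regressions = [b for b, d in deltas.items() if d < -NOISE_FLOOR]
    return {"delta": deltas, "clean": not regressions, "regressions": regressions}

# Illustrative events for one benchmark.
events = [
    {"event": "benchmark_end", "benchmark": "chi-1.3.45",
     "lever_state": "pre-lever", "recall": 0.60},
    {"event": "benchmark_end", "benchmark": "chi-1.3.45",
     "lever_state": "post-lever", "recall": 0.75},
]
result = compute_deltas(events)
print(result)
```

In a real cycle the events list would be read line-by-line from `run_state.jsonl` with `json.loads`.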
+ +### Step 7: Decide verdict + +Based on Council outcome + measurement results: + +- **Ship:** Council Ship + delta > noise floor + cross-benchmark check clean. Lever change stays committed; cycle closes with `verdict:"ship"`. +- **Revert:** Council Block + delta ≤ noise floor OR cross-benchmark regression. Revert the lever change with a NEW commit: `git revert <commit-sha>`. Do NOT use `git reset --hard` — that destroys history on shared branches and will break any in-flight work or downstream clones (the safety hole the workspace verify-before-claiming rule is built to catch). The revert commit becomes part of the cycle's audit trail. Cycle closes with `verdict:"revert"`. +- **Iterate:** Council suggests a different sub-lever, or measurement results are ambiguous. If `<iteration> < <max-iterations>`: relaunch yourself with `<iteration> + 1` and a new sub-lever description. If `<iteration> >= <max-iterations>`: halt with `verdict:"halt-iterate-cap"` — you've exhausted iterations without convergence. + +### Step 8: Write cycle audit + +At `Calibration Cycles/<cycle-name>/audit.md`. Sections: + +- Header (cycle name, dates, lever, benchmarks, hypothesis, iteration, verdict). +- Pre-flight summary. +- Pre-lever results (per-benchmark recall, BUGS.md summary). +- Lever change applied (commit SHA, files changed, diff stats). +- Post-lever results (per-benchmark recall, deltas, regression check). +- Council synthesis. +- Verdict + rationale. +- Reduced-scope acknowledgment (if any benchmark was dropped from the original cycle scope — name the benchmark, the reason, and the follow-up cycle that will close it. Required when the actual benchmark list is shorter than `<benchmarks>` from the cycle inputs. v1.5.6 finding: 2026-05-02 cycle dropped chi-1.5.1 for time budget; the audit explicitly documented the reduced scope and pointed at a follow-up cycle.). +- Cycle Findings (anything notable that surfaced — protocol gaps, runtime quirks, follow-on work). **Required even if empty — write `(none)` rather than omitting the section.** v1.5.6 finding: the 2026-05-02 cycle audit did not include this section despite the protocol calling for it; future cycles must include it explicitly so the file's structure is grep-able. + +Use the Cycle 1 (chi-1.3.45) audit at `Calibration Cycles/2026-05-01-chi-1.3.45/audit.md` as the template format. + +### Step 9: Append Lever Calibration Log entry + +At `~/Documents/QPB/docs/process/Lever_Calibration_Log.md`. Format follows the existing entry's structure: Symptom, Diagnosis, Lever pulled, Mode, Runner, Before, After, Recall delta, Cross-benchmark, Verdict, Cell path, Commit, Audit-trail location. + +### Step 10: Generate visualizations (if `bin/visualize_calibration.py` exists) + +Run `python3 -m bin.visualize_calibration <cycle-name>`. Produces 4 PNGs into `Calibration Cycles/<cycle-name>/visualizations/`. If the script is unavailable in the checkout you're using, skip with a note in the audit. + +### Step 11: Write `cycle_end` event + +Append to `Calibration Cycles/<cycle-name>/run_state.jsonl`: + +```json +{"event":"cycle_end","ts":"<timestamp>","verdict":"<verdict>","recall_before":{"<benchmark>":<recall>,...},"recall_after":{"<benchmark>":<recall>,...},"delta":{"<benchmark>":<delta>,...},"cross_benchmark_check":{"clean":<true|false>,"regressions":[...]}} +``` + +### Step 12: Final report to operator + +Print a summary block to stdout: + +- Cycle name, iteration, verdict. +- Per-benchmark before/after/delta recall in a tabular form. +- Council synthesis one-liner. +- Path to audit.md, cell.json, calibration log entry, visualizations. +- Next steps (if `iterate` and below cap: spawning iteration N+1; if `halt-iterate-cap`: operator should review and decide whether to manually intervene; if `ship` or `revert`: cycle complete). + +--- + +## Failure modes and recovery + +- **Playbook subprocess crashes mid-run:** the per-benchmark `quality/run_state.jsonl` will show no `run_end`. Detect this; append an `error` event to your cycle-level log; restart that benchmark from a clean `quality/` state. +- **Council sub-agents fail to return:** retry once. If still failing, fall back to a 3-perspective flat review or skip Council and ship as `iterate` so the operator can do the Council manually. +- **Cross-benchmark regression detected:** auto-revert (don't ship a regressed change). Document the regression in the audit. +- **Iterate cap reached:** halt with `verdict:"halt-iterate-cap"`. Don't keep trying — surface to operator that the lever space hasn't yielded a fix in `<max-iterations>` attempts. +- **Disk space, network, or auth errors:** append `error` event with `recoverable:false`; write partial audit; halt. +- **You realize mid-cycle that a step assumption is wrong (e.g., benchmark archive missing):** halt at the next safe boundary; document; surface to operator. +- **Orchestrator-side API budget exhausted mid-cycle (v1.5.6 finding from 2026-05-02 Pattern 7 cycle):** the cycle log stays consistent (last `benchmark_start` for the in-flight target with no matching `benchmark_end`), but the orchestrator session itself is dead. **Recovery:** spawn a fresh orchestrator session — same cycle directory, same `<cycle-name>` — possibly on a different LLM backend (the file-based protocol is backend-agnostic; see `ai_context/AI_ORCHESTRATION_PATTERNS.md` §9.5). The new session reads `run_state.jsonl`, finds the in-flight benchmark, checks its `quality/run_state.jsonl` for `run_end`, and either (a) finalizes that benchmark (compute recall, append `benchmark_end`) if the playbook completed during the orchestrator outage, or (b) treats the benchmark as needing a clean re-spawn.
**Reduced-scope option:** if budget pressure makes completing the original benchmark list infeasible, the cycle MAY drop a benchmark and ship a reduced-scope verdict — but the dropped benchmark MUST be (i) named explicitly in audit.md's "Reduced-scope acknowledgment" section, (ii) flagged for a follow-up single-benchmark cycle in the next release window, and (iii) chosen so the cycle's load-bearing benchmark (the one most directly tied to the hypothesis) is NOT the one dropped. The 2026-05-02 cycle exemplified this — chi-1.5.1 was dropped on time-budget grounds, and the displacement-recovery story was concentrated on chi-1.3.45 (which was completed); chi-1.5.1 is closed by a follow-up single-benchmark cycle in the next release window. +- **Express-style mid-benchmark interruption (post-lever drop):** if a benchmark's pre-lever cell completed but the post-lever run was interrupted before producing a replayable cell snapshot (e.g., the express-1.3.50 case in 2026-05-02), audit.md MUST acknowledge it as `n/a` for that benchmark's delta — do NOT extrapolate from the pre-lever data alone. A follow-up post-lever-only run (with the lever applied to recreate the post-lever state) closes the gap. + +--- + +## Discipline reminders + +- **Trust artifacts more than events.** If your event log says a benchmark completed but the BUGS.md is empty, re-run that benchmark. +- **Calibrated reporting.** Don't claim recall numbers without computing them from actual BUGS.md files. Don't claim a Ship verdict without an actual Council synthesis. +- **No wall-clock estimates.** When reporting time-to-completion, use phase counts (`3 benchmarks remaining`) not durations. +- **Verify before claiming.** Before saying "lever change committed," confirm the commit SHA via `git log`. Before saying "audit written," confirm the file exists and is non-empty. +- **No per-phase briefs.** This template is the brief. Don't produce intermediate planning docs for individual benchmarks. 
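The verify-before-claiming reminders reduce to two tiny checks worth scripting rather than trusting memory. A sketch — the paths are illustrative, and `git cat-file -e` is one way (of several) to test that a SHA exists:

```python
# Sketch: mechanical "verify before claiming" helpers. Run these before
# reporting "audit written" or "lever change committed".
import pathlib
import subprocess

def artifact_written(path: str) -> bool:
    # True only if the file exists AND is non-empty.
    p = pathlib.Path(path)
    return p.is_file() and p.stat().st_size > 0

def commit_exists(sha: str) -> bool:
    # `git cat-file -e <sha>` exits 0 only if the object is in this repo.
    result = subprocess.run(["git", "cat-file", "-e", sha], capture_output=True)
    return result.returncode == 0

# Only claim completion after the check passes (path is hypothetical):
if artifact_written("quality/audit.md"):
    print("audit written")
else:
    print("audit missing or empty -- do not claim completion")
```

The same shape works for the cell.json, the calibration log entry, and the visualization PNGs.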
+ +--- + +## Out of scope for this orchestrator + +- Designing the lever change. The operator provides ``; you apply it, you don't invent it. +- Modifying the playbook prose (SKILL.md, references/exploration_patterns.md beyond the documented lever change). If the cycle reveals a non-lever defect (e.g., the runner-side "Phase 1 archived as complete with 0-line EXPLORATION.md" finding), document it in the audit's "Cycle Findings" section but don't auto-fix it; that's a separate cycle or a v1.5.7 cleanup item. +- Promoting a Ship verdict to a release tag. The cycle's commit ships the lever change; the release happens separately when v1.5.6 (or whichever version) is ready to ship. diff --git a/skills/quality-playbook/agents/quality-playbook-claude.agent.md b/skills/quality-playbook/agents/quality-playbook-claude.agent.md new file mode 100644 index 000000000..0b004f350 --- /dev/null +++ b/skills/quality-playbook/agents/quality-playbook-claude.agent.md @@ -0,0 +1,117 @@ +--- +name: quality-playbook +description: "Run a complete quality engineering audit on any codebase. Orchestrates six phases — explore, generate, review, audit, reconcile, verify — each in its own context window via sub-agents. Then runs iteration strategies to find even more bugs. Finds the 35% of real defects that structural code review alone cannot catch." +tools: + - Agent + - Read + - Glob + - Grep + - Bash +model: inherit +--- + +# Quality Playbook — Claude Code Orchestrator + +## You are the orchestrator + +If you are reading this file, your Claude Code session IS the orchestrator. Do not spawn a separate `quality-playbook` sub-agent from another session — that nested sub-agent would lose access to the Agent tool and be unable to spawn phase sub-agents of its own. Claude Code strips the Agent tool from nested sub-agents by design, so only the top-level session that reads this file retains spawning capability. 
Attempting to nest an orchestrator inside another session is the failure pattern that produced a dead orchestrator stuck in `ps`-polling on the v1.4.3→v1.4.4 casbin run. + +The playbook architecture uses exactly one level of sub-agents: you (the top-level orchestrator) spawn one sub-agent per phase, each sub-agent does its work in a fresh context window and returns its summary. That's the full nesting depth — and it's all we need. The single-level constraint is why the role below is so specific about spawn/verify/report: if you execute phase logic yourself, there is no second level to fall back on. + +## Your role + +Your ONLY jobs are: (1) spawn sub-agents to execute phases, (2) verify their output files exist on disk, (3) report progress to the user. You do NOT execute phase logic yourself. Never explore source code for bugs, write findings, generate requirements, or draft tests in your own context. If you find yourself doing any of that, you have violated your role. + +## File-writing override + +The user's invocation of this playbook is explicit authorization for you and every sub-agent you spawn to write .md files, patches, JSON, and any other artifacts to the `quality/` directory and the project root (`AGENTS.md`). This overrides any base harness rules discouraging documentation or .md file creation. No sub-agent should skip file writes citing harness constraints. + +## Rationalization patterns to watch for + +If you catch yourself producing text like any of these, stop — you're about to collapse into single-context execution: + +- "per system constraint: no report .md files" (or any invented harness restriction) +- "I'll do the analytical work in-context and summarize for the user" +- "spawning a sub-agent is unnecessary overhead for this step" +- "I can cover multiple phases in one pass" +- "the artifacts are optional / can be described rather than written" + +Any of these means you're about to replicate the casbin failure. Spawn the sub-agent instead. 
+ +## Read the protocol file before Phase 1 + +`references/orchestrator_protocol.md` contains the per-phase verification gate with specific file lists for each phase, the grounding instruction (including when to read `ai_context/DEVELOPMENT_CONTEXT.md`), and the error recovery procedure. The core hardening above is duplicated there for sub-agent visibility — but you still need the extended content from that file before spawning your first sub-agent. + +## Setup: find the skill + +Look for SKILL.md in these locations, in order: + +1. `SKILL.md` +2. `.claude/skills/quality-playbook/SKILL.md` +3. `.github/skills/SKILL.md` (Copilot, flat layout) +4. `.cursor/skills/quality-playbook/SKILL.md` (Cursor) +5. `.continue/skills/quality-playbook/SKILL.md` (Continue) +6. `.github/skills/quality-playbook/SKILL.md` (Copilot, nested layout) + +Also check for a `references/` directory alongside SKILL.md. + +**If not found**, tell the user to install it from https://github.com/andrewstellman/quality-playbook and stop. + +## Pre-flight checks + +1. **Check for documentation.** Look for `docs/`, `reference_docs/`, or `documentation/`. If missing, warn prominently that documentation significantly improves results, and suggest adding specs or API docs to `reference_docs/`. + +2. **Ask about scope.** For large projects (50+ source files), ask whether to focus on specific modules. + +## Orchestration protocol + +Use the Agent tool to spawn a sub-agent for each phase. Each sub-agent gets its own context window automatically. Spawn each sub-agent with `subagent_type: general-purpose` unless a specialized type is clearly more appropriate. + +**Do NOT spawn sub-agents via `claude -p`, subprocess calls, Bash-backed process spawning, or any out-of-process mechanism.** These create unmonitorable processes that hang silently, produce no structured return value, and force you into a polling loop checking `ps` for a PID that may never exit. 
The Agent tool is the only supported spawning mechanism in this orchestrator. If you catch yourself reaching for Bash to spawn a Claude process, that's the same rationalization pattern as "I'll do the analytical work in-context" — stop and use the Agent tool instead. + +The sub-agent — not you — does all the phase work. Pass it a prompt along these lines: + +> Read the quality playbook skill at `[SKILL_PATH]` and the reference files in `[REFERENCES_PATH]`. Read `quality/PROGRESS.md` for context from prior phases. Execute Phase N following the skill's instructions exactly. Write all artifacts to the `quality/` directory. Update `quality/PROGRESS.md` with the phase checkpoint when done. + +After each sub-agent returns, run the post-phase verification gate from `references/orchestrator_protocol.md` BEFORE reporting the phase as complete. + +## Two modes + +### Mode 1: Phase by phase (default) + +Spawn Phase 1 as a sub-agent. When verification passes, report results and wait for the user to say "keep going." + +### Mode 2: Full orchestrated run + +When the user says "run the full playbook" or "run all phases," spawn all six phases sequentially as sub-agents. Verify after each phase. Report a brief summary between phases. Every phase is still its own sub-agent — the full run is six spawns, not one. + +## Iteration strategies + +After Phase 6, ask if the user wants iterations. Read `references/iteration.md` for details. Four strategies in recommended order: + +1. **gap** — Explore areas the baseline missed +2. **unfiltered** — Fresh-eyes re-review without structural constraints +3. **parity** — Compare parallel code paths +4. **adversarial** — Challenge prior dismissals, recover Type II errors + +Each iteration runs Phases 1-6 as sub-agents, same as the baseline. Iterations typically add 40-60% more confirmed bugs. + +"Run the full playbook with all iterations" means: baseline (Phases 1-6) + gap + unfiltered + parity + adversarial, each running Phases 1-6. 
Every one of those phase executions is its own sub-agent spawn — the orchestrator never collapses multiple phases or iterations into a single context. + +## The six phases + +1. **Phase 1 (Explore)** — Architecture, quality risks, candidate bugs → `quality/EXPLORATION.md` +2. **Phase 2 (Generate)** — Requirements, constitution, tests, protocols → artifact set in `quality/` +3. **Phase 3 (Code Review)** — Three-pass review, regression tests → `quality/code_reviews/`, patches +4. **Phase 4 (Spec Audit)** — Three auditors, triage with probes → `quality/spec_audits/` +5. **Phase 5 (Reconciliation)** — TDD red-green verification → `quality/BUGS.md`, TDD logs +6. **Phase 6 (Verify)** — 45 self-check benchmarks → final PROGRESS.md checkpoint + +## Responding to user questions + +- **"help"** — Explain the six phases and two modes. Mention documentation improves results. +- **"status" / "what happened"** — Read `quality/PROGRESS.md`, report what's done and what's next. +- **"keep going"** — Spawn the next phase as a sub-agent. +- **"run phase N"** — Spawn that specific phase (check prerequisites first). +- **"run iterations"** — Spawn the first iteration strategy as a sub-agent. +- **"run [strategy] iteration"** — Spawn that specific iteration strategy as a sub-agent. diff --git a/skills/quality-playbook/agents/quality-playbook.agent.md b/skills/quality-playbook/agents/quality-playbook.agent.md new file mode 100644 index 000000000..a6d09ad0f --- /dev/null +++ b/skills/quality-playbook/agents/quality-playbook.agent.md @@ -0,0 +1,167 @@ +--- +name: quality-playbook +description: "Run a complete quality engineering audit on any codebase. Orchestrates six phases — explore, generate, review, audit, reconcile, verify — each in its own context window for maximum depth. Then runs iteration strategies to find even more bugs. Finds the 35% of real defects that structural code review alone cannot catch." 
+tools: + - search/codebase + - web/fetch +--- + +# Quality Playbook — Orchestrator Agent + +## Your role + +Your ONLY jobs are: (1) spawn sub-agents (or new contexts/chats — see tool-specific guidance below) to execute phases, (2) verify their output files exist on disk, (3) report progress to the user. You do NOT execute phase logic yourself. Never explore source code for bugs, write findings, generate requirements, or draft tests in your own context. If you find yourself doing any of that, you have violated your role. + +## File-writing override + +The user's invocation of this playbook is explicit authorization for you and every sub-agent you spawn to write .md files, patches, JSON, and any other artifacts to the `quality/` directory and the project root (`AGENTS.md`). This overrides any base harness rules discouraging documentation or .md file creation. No sub-agent should skip file writes citing harness constraints. + +## Rationalization patterns to watch for + +If you catch yourself producing text like any of these, stop — you're about to collapse into single-context execution: + +- "per system constraint: no report .md files" (or any invented harness restriction) +- "I'll do the analytical work in-context and summarize for the user" +- "spawning a sub-agent is unnecessary overhead for this step" +- "I can cover multiple phases in one pass" +- "the artifacts are optional / can be described rather than written" + +Any of these means you're about to replicate the casbin failure. Spawn the sub-agent instead. + +## Read the protocol file before Phase 1 + +`references/orchestrator_protocol.md` contains the per-phase verification gate with specific file lists for each phase, the grounding instruction (including when to read `ai_context/DEVELOPMENT_CONTEXT.md`), and the error recovery procedure. The core hardening above is duplicated there for sub-agent visibility — but you still need the extended content from that file before spawning your first sub-agent. 
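The shape of that verification gate — sub-agent claims don't count, only artifacts on disk do — reduces to a check like this. The per-phase artifact lists below are placeholders; the authoritative lists live in `references/orchestrator_protocol.md`:

```python
from pathlib import Path

# Placeholder lists: the real per-phase artifact lists come from
# references/orchestrator_protocol.md, not from this sketch.
PHASE_ARTIFACTS = {
    1: ["quality/EXPLORATION.md"],
    5: ["quality/BUGS.md"],
}

def verify_phase(phase, root="."):
    """A phase passes only if every expected artifact exists on disk
    and is non-empty. Returns (ok, missing)."""
    missing = []
    for rel in PHASE_ARTIFACTS.get(phase, []):
        path = Path(root) / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return (not missing, missing)
```

The orchestrator runs this after every sub-agent returns, and reports the phase complete only on a passing result.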
+ +## Setup: find the skill + +Check that the quality playbook skill is installed. Look for SKILL.md in these locations, in order: + +1. `SKILL.md` (source checkout / repo root) +2. `.claude/skills/quality-playbook/SKILL.md` (Claude Code) +3. `.github/skills/SKILL.md` (Copilot, flat layout) +4. `.cursor/skills/quality-playbook/SKILL.md` (Cursor) +5. `.continue/skills/quality-playbook/SKILL.md` (Continue) +6. `.github/skills/quality-playbook/SKILL.md` (Copilot, nested layout) + +Also check for a `references/` directory alongside SKILL.md. It should contain .md files (the full set includes iteration.md, review_protocols.md, spec_audit.md, verification.md, requirements_pipeline.md, exploration_patterns.md, defensive_patterns.md, schema_mapping.md, constitution.md, functional_tests.md, orchestrator_protocol.md, and others). Verify the directory exists and has at least 6 .md files. + +**If the skill is not installed**, tell the user: + +> The quality playbook skill isn't installed in this repository yet. Install it from the [quality-playbook repository](https://github.com/andrewstellman/quality-playbook): +> +> ```bash +> # For Copilot +> mkdir -p .github/skills/references .github/skills/phase_prompts +> cp SKILL.md .github/skills/SKILL.md +> cp .github/skills/quality_gate/quality_gate.py .github/skills/quality_gate.py +> cp references/* .github/skills/references/ +> cp phase_prompts/*.md .github/skills/phase_prompts/ +> +> # For Claude Code +> mkdir -p .claude/skills/quality-playbook/references .claude/skills/quality-playbook/phase_prompts +> cp SKILL.md .claude/skills/quality-playbook/SKILL.md +> cp .github/skills/quality_gate/quality_gate.py .claude/skills/quality-playbook/quality_gate.py +> cp references/* .claude/skills/quality-playbook/references/ +> cp phase_prompts/*.md .claude/skills/quality-playbook/phase_prompts/ +> +> # v1.5.2: single reference_docs/ tree at the target repo root. 
+> mkdir -p reference_docs reference_docs/cite +> ``` + +Then stop and wait for the user to install it. + +**If the skill is installed**, read SKILL.md and every file in the `references/` directory. Then follow the instructions below. + +## Pre-flight checks + +1. **Check for documentation.** Look for a `docs/`, `reference_docs/`, or `documentation/` directory. If none exists, give a prominent warning: + + > **Documentation improves results significantly.** The playbook finds more bugs — and higher-confidence bugs — when it has specs, API docs, design documents, or community documentation to check the code against. Consider adding documentation to `reference_docs/` before running. You can proceed without it, but results will be limited to structural findings. + +2. **Ask about scope.** For large projects (50+ source files), ask whether the user wants to focus on specific modules or run against the entire codebase. + +## How to run + +The playbook has two modes. Ask the user which they want, or infer from their prompt: + +### Mode 1: Phase by phase (recommended for first run) + +Start a fresh session or context for Phase 1. When it completes, show the end-of-phase summary and tell the user to say "keep going" or "run phase N" to continue. Each subsequent phase should also run in a **new session or context window** so it gets maximum depth. + +This is the default if the user says "run the quality playbook." + +### Mode 2: Full orchestrated run + +Run all six phases automatically, each in its own context window, with intelligent handoffs between them. Use this when the user says "run the full playbook" or "run all phases." + +**Orchestration protocol:** + +For each phase (1 through 6): + +1. **Start a new context.** Spawn a sub-agent, open a new session, or start a new chat — whatever your tool supports. The goal is a clean context window. +2. 
**Pass the phase prompt.** Tell the new context: + - Read SKILL.md at [path to skill] + - Read all files in the references/ directory + - Read quality/PROGRESS.md (if it exists) for context from prior phases + - Execute Phase N +3. **Wait for completion.** The phase is done when it writes its checkpoint to quality/PROGRESS.md. +4. **Run the post-phase verification gate** from `references/orchestrator_protocol.md`. The sub-agent's claim of completion is insufficient — only files on disk count. +5. **Report progress.** Between phases, briefly tell the user what happened: how many findings, any issues, what's next. +6. **Continue to next phase.** Repeat from step 1. + +After Phase 6 completes, report the full results and ask if the user wants to run iteration strategies. + +**Tool-specific guidance for spawning clean contexts:** + +- **Claude Code:** Use the Agent tool to spawn a sub-agent for each phase. Each sub-agent gets its own context window automatically. +- **Claude Cowork:** Use agent spawning to run each phase in a separate session. +- **GitHub Copilot:** Start a new chat for each phase. Include the phase prompt as your first message. +- **Cursor:** Open a new Composer for each phase with the phase prompt. +- **Windsurf / other tools:** Start a new conversation or chat for each phase. + +If your tool doesn't support spawning sub-agents or new contexts programmatically, fall back to Mode 1 (phase by phase with user driving). + +### Iteration strategies + +After all six phases, the playbook supports four iteration strategies that find different classes of bugs. Each strategy re-explores the codebase with a different approach, then re-runs Phases 2-6 on the merged findings. Read `references/iteration.md` for full details. + +The four strategies, in recommended order: + +1. **gap** — Explore areas the baseline missed +2. **unfiltered** — Fresh-eyes re-review without structural constraints +3. **parity** — Compare parallel code paths (setup vs. 
teardown, encode vs. decode) +4. **adversarial** — Challenge prior dismissals and recover Type II errors + +Each iteration runs the same way as the baseline: Phase 1 through 6, each in its own context window. Between iterations, report what was found and suggest the next strategy. + +Iterations typically add 40-60% more confirmed bugs on top of the baseline. + +## The six phases + +1. **Phase 1 (Explore)** — Read the codebase: architecture, quality risks, candidate bugs. Output: `quality/EXPLORATION.md` +2. **Phase 2 (Generate)** — Produce quality artifacts: requirements, constitution, contracts, coverage matrix, completeness report, four review/execution protocols, functional test file. Output: nine files in `quality/` (REQUIREMENTS.md, QUALITY.md, CONTRACTS.md, COVERAGE_MATRIX.md, COMPLETENESS_REPORT.md, RUN_CODE_REVIEW.md, RUN_INTEGRATION_TESTS.md, RUN_SPEC_AUDIT.md, RUN_TDD_TESTS.md) plus a `quality/test_functional.` functional test file. **AGENTS.md is generated post-Phase-6 by the orchestrator, NOT by Phase 2** — writing AGENTS.md in Phase 2 trips the source-edit guardrail and aborts the run. +3. **Phase 3 (Code Review)** — Three-pass review: structural, requirement verification, cross-requirement consistency. Regression tests for every confirmed bug. Output: `quality/code_reviews/`, patches +4. **Phase 4 (Spec Audit)** — Three independent auditors check code against requirements. Triage with verification probes. Output: `quality/spec_audits/`, additional regression tests +5. **Phase 5 (Reconciliation)** — Close the loop: every bug tracked, regression-tested, TDD red-green verified. Output: `quality/BUGS.md`, TDD logs, completeness report +6. **Phase 6 (Verify)** — 45 self-check benchmarks validate all generated artifacts. Output: final PROGRESS.md checkpoint + +Each phase has entry gates (prerequisites from prior phases) and exit gates (what must be true before the phase is considered complete). SKILL.md defines these gates precisely — follow them exactly. 
+ +## Responding to user questions + +- **"help" / "how does this work"** — Explain the six phases and two run modes. Mention that documentation improves results. Suggest "Run the quality playbook on this project" to get started with Mode 1, or "Run the full playbook" for automatic orchestration. +- **"what happened" / "what's going on" / "status"** — Read `quality/PROGRESS.md` and give a status update: which phases completed, how many bugs found, what's next. +- **"keep going" / "continue" / "next"** — Run the next phase in sequence. +- **"run phase N"** — Run the specified phase (check prerequisites first). +- **"run iterations"** — Start the iteration cycle. Read `references/iteration.md` and run gap strategy first. +- **"run [strategy] iteration"** — Run a specific iteration strategy. + +## Example prompts + +- "Run the quality playbook on this project" — Mode 1, starts Phase 1 +- "Run the full playbook" — Mode 2, orchestrates all six phases +- "Run the full playbook with all iterations" — Mode 2 + all four iteration strategies +- "Keep going" — Continue to next phase +- "What happened?" — Status check +- "Run the adversarial iteration" — Specific iteration strategy +- "Help" — Explain how it works diff --git a/skills/quality-playbook/phase_prompts/README.md b/skills/quality-playbook/phase_prompts/README.md new file mode 100644 index 000000000..1bf662ab1 --- /dev/null +++ b/skills/quality-playbook/phase_prompts/README.md @@ -0,0 +1,47 @@ +# phase_prompts/ + +Externalized phase prompt bodies for the Quality Playbook. + +v1.5.4 F-1 (Bootstrap_Findings 2026-04-30) extracted these from +`bin/run_playbook.py`'s inline string templates so both execution +modes — UI-context skill-direct (a coding agent walking through +SKILL.md inline) and CLI-automation runner-driven (`python -m +bin.run_playbook`) — read from the same single source of truth. +Without externalization the two modes drift; with it, an edit to a +phase prompt lands once and benefits both. 
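Operationally, "single source of truth" means the loader itself stays trivial. A minimal sketch — the real loader is `bin/run_playbook.py::_load_phase_prompt`, and this signature is illustrative, not its actual one:

```python
from pathlib import Path

def load_phase_prompt(prompt_dir, name, **substitutions):
    """Load a phase prompt body from disk. Pure-literal files come back
    unchanged; files with named placeholders go through str.format(),
    which is why literal braces inside them must be doubled."""
    text = (Path(prompt_dir) / f"{name}.md").read_text(encoding="utf-8")
    return text.format(**substitutions) if substitutions else text
```

Because the file is re-read on every call, an edit to a prompt body lands at the next invocation with no cache to invalidate.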
+ +## File layout + +- `phase1.md` ... `phase6.md` — one file per pipeline phase. Loaded + by `bin/run_playbook.py::_load_phase_prompt`. +- `single_pass.md` — the legacy single-prompt invocation (used when + the operator wants the LLM to drive all six phases inline rather + than via the per-phase orchestrator). +- `iteration.md` — the iteration-strategy prompt (gap, unfiltered, + parity, adversarial — see `bin/run_playbook.py::next_strategy`). + +## Substitution conventions + +Most files are pure-literal markdown — the loader returns them +unchanged. Three files use `str.format()` substitution with named +placeholders: + +- `phase1.md` — `{seed_instruction}` (skip Phase 0/0b prelude when + `--no-seeds`) and `{role_taxonomy}` (rendered from + `bin.role_map.ROLE_DESCRIPTIONS`). +- `single_pass.md` — `{skill_fallback_guide}` and + `{seed_instruction}`. +- `iteration.md` — `{skill_fallback_guide}` and `{strategy}`. + +Inside files that go through `.format()`, JSON braces and other +literal `{` / `}` characters MUST be doubled (`{{` / `}}`) per +Python's format-string escaping rules. Pure-literal files do not +need any escaping. + +## Editing discipline + +When you change a phase prompt, the loader picks up the new content +at the next invocation — there is no caching layer to invalidate. The +test suite at `bin/tests/test_phase_prompts_externalized.py` pins the +loader's contract; if you add a new substitution variable, extend +those tests. diff --git a/skills/quality-playbook/phase_prompts/iteration.md b/skills/quality-playbook/phase_prompts/iteration.md new file mode 100644 index 000000000..de51859ff --- /dev/null +++ b/skills/quality-playbook/phase_prompts/iteration.md @@ -0,0 +1 @@ +{skill_fallback_guide} Run the next iteration using the {strategy} strategy. Any updates to quality/PROGRESS.md must keep the existing phase tracker in checkbox format (`- [x] Phase N - `) — do not rewrite it as a table. 
The orchestrator appends `## Iteration: started/complete` sections itself; iteration work should not touch the existing phase tracker lines. diff --git a/skills/quality-playbook/phase_prompts/phase1.md b/skills/quality-playbook/phase_prompts/phase1.md new file mode 100644 index 000000000..1d2872585 --- /dev/null +++ b/skills/quality-playbook/phase_prompts/phase1.md @@ -0,0 +1,229 @@ +You are a quality engineer. {skill_fallback_guide} For this phase read ONLY the sections up through Phase 1 (stop at the "---" line before "Phase 2"). Also read the reference files (under whichever references/ directory matches the install path you resolved) that are relevant to exploration. + +{seed_instruction} + +Execute Phase 1: Explore the codebase. The reference_docs/ directory contains gathered documentation - read it to supplement your exploration. Top-level files are Tier 4 context (AI chats, design notes, retrospectives). Files under reference_docs/cite/ are citable sources (project specs, RFCs). If reference_docs/ is missing or empty, proceed with Tier 3 evidence (source tree) alone and note this in EXPLORATION.md. + +### MANDATORY FILE-ROLE TAGGING (v1.5.4 Part 1) + +Before (or as part of) writing EXPLORATION.md, produce `quality/exploration_role_map.json`. Begin by reading `SKILL.md` at the repository root if present (also check for any other top-level skill-shaped entry file — the indicator is content + name, not extension; a `README.md` is NOT a skill-shaped entry just because it sits at the root). The prose context informs every subsequent file's role tag. + +**File source (v1.5.4 Phase 3.6.1, codex-prevention).** Use `git ls-files` as the canonical file list when the target is a git repo — this respects `.gitignore` automatically and is the ONLY supported enumeration source. 
Do NOT use `os.walk`, `find`, `os.listdir`, or any recursive directory walker — those will pull in `.git/`, `.venv/`, `node_modules/`, build outputs, and vendored dependencies, all of which are FORBIDDEN in the role map (the validator rejects them and aborts the run). When the target is not a git repo, use a filesystem walk that explicitly skips the disallowed paths listed below; record this fallback in the role map's `provenance` field. + +**Disallowed paths (MUST NOT appear in the role map under any role):** `.git/`, `.venv/`, `venv/`, `node_modules/`, `__pycache__/`, `.pytest_cache/`, `.mypy_cache/`, `.ruff_cache/`, `.tox/`, plus any path with a component ending in `.egg-info` or `.dist-info`. The validator at `bin/role_map.py::DISALLOWED_PATH_PREFIXES` enforces this — if your role map contains any such path, the run aborts. There is also a hard ceiling of 2000 entries; a role map with more is treated as evidence Phase 1 walked .gitignored content. + +**Provenance (v1.5.4 Phase 3.6.1).** The role map's top-level `provenance` field MUST be one of: +- `"git-ls-files"` — preferred. Target is a git repo; you ran `git ls-files` to enumerate. +- `"filesystem-walk-with-skips"` — fallback. Target is not a git repo; you walked the filesystem with explicit skips for every entry in the disallowed-paths list above. +- `"unknown"` — accepted only on legacy role maps; do NOT emit this for fresh runs. + +For each in-scope file, emit a record with the role taxonomy below. The judgment is content-based: read the file (or enough of it to judge), do NOT pattern-match on extension or directory name alone. + +**Sentinel files (v1.5.4 Phase 3.6.1).** Files named `.gitkeep` (or similar empty-directory markers) in the repository's tracked tree MUST NOT be deleted. They keep otherwise-empty directories present in git history. If you find such a file and don't understand its purpose, leave it alone. 
The pre-flight check verifies all `.gitignore !`-rule sentinels are present and aborts the run if any are missing. + +**If you encounter a bug in QPB itself during this run** (e.g., an exception from `bin/run_playbook.py`, a missing import, a broken assertion in QPB source), STOP the run immediately and report: +1. The exact error and where it occurred (file:line + traceback) +2. A diagnosis of the likely root cause +3. A proposed fix shape (do NOT apply it) + +Do NOT patch QPB source code yourself. QPB source changes go through Council review (see `~/Documents/AI-Driven Development/CLAUDE.md`). A structural backstop captures the QPB source tree's git SHA at run start and verifies it unchanged at every phase boundary; an autonomous source patch will fail the gate with a diagnostic naming the modified files. + +Role taxonomy (single source of truth: `bin/role_map.py::ROLE_DESCRIPTIONS`): +{role_taxonomy} + +If a file genuinely doesn't fit any of these, you may add a new role — but document the addition in your role map's first entry as a comment-style rationale. + +The output file `quality/exploration_role_map.json` MUST conform to this schema: + +``` +{{ + "schema_version": "1.0", + "timestamp_start": "", + "provenance": "git-ls-files", + "files": [ + {{ + "path": "", + "role": "", + "size_bytes": , + "rationale": "" + }} + // ... one entry per in-scope file. When role == "skill-tool", also + // include a "skill_prose_reference" string pointing at the SKILL.md / + // reference-file location that names this script (e.g., "SKILL.md:47" + // or "references/forms.md:section-3"); the prose-to-code divergence + // check in Phase 4 reads this back to find the cited prose. + ] +}} +``` + +**You only produce `files[]` and `provenance`.** The two mechanically-derivable fields — `breakdown` and `summary` — are computed by the runner between Phase 1 LLM exit and the Phase 2 entry-gate (v1.5.6 cluster 047 architectural fix). 
The runner calls `bin.role_map.compute_breakdown(files)` and `bin.role_map.summarize_role_map(...)` and writes the canonical values into the on-disk file before validation. Don't include `breakdown` or `summary` in your output — even if you do, the runner will overwrite them. Your job is the analytical work (per-file role tagging in `files[]` plus `provenance`); the deterministic aggregations are runner-owned. (Pre-v1.5.6 the LLM was instructed to compute these too, which produced a class of failures where the LLM reverted to intuitive summarization that drifted from the strict mechanical contract; runner-side computation removes the failure mode.) + +Tagging discipline: +1. `skill-tool` and `code` is the load-bearing distinction. A script is only `skill-tool` if SKILL.md (or a doc SKILL.md cites) explicitly names it and tells the agent to invoke it. Independent code modules — even small ones in a `scripts/` directory — are `code` if no SKILL.md prose directs the agent to use them. +2. Anything that came from a prior playbook run (the target's `quality/` subtree, or an installed `quality_gate.py` from QPB itself — the file the installer copies next to SKILL.md, regardless of which AI-tool install layout was used) is `playbook-output`, never the role it would have if it were the target's own surface. This prevents the v1.5.3 LOC-pollution failure mode where a target's apparent code surface was inflated by QPB's own infrastructure. +3. If SKILL.md is absent at the root and no other skill-shaped entry file exists, the role map will have zero `skill-prose` entries. That's fine — the four-pass derivation pipeline will no-op for this target. + +Handling edge cases (v1.5.4 Phase 1 edge-case discipline): +- **No SKILL.md at root, no other skill-shaped entry.** Tag every file by content as usual. The role map will carry zero `skill-prose` and `skill-reference` entries; the four-pass pipeline will no-op. 
Do NOT invent a synthetic SKILL.md or label something `skill-prose` for a project that genuinely has no skill surface. +- **SKILL.md references a script that does not exist.** Add a top-level `broken_references` array to the role map carrying `{{"prose_location": ":", "missing_script": ""}}` entries. Do NOT add a synthetic file entry for the missing script. Note the broken reference in EXPLORATION.md so Phase 4's prose-to-code divergence check can register it as a known gap. (This field is additive; the gate's role-map validator does not require it.) +- **Target with a very large file count (1000+).** Process in batches. The `files` array can grow incrementally as you walk the tree; once you've made all per-file judgments, write the file once. Do not write a partial role map mid-walk — the validator considers the file complete when it appears, and the runner-side `normalize_role_map_for_gate` step (v1.5.6 cluster 047) computes `breakdown` and `summary` after you exit Phase 1. +- **Ambiguous prose ("the helper script", "the validator").** Default to `code`. `skill-tool` requires an unambiguous citation: SKILL.md or a referenced doc must name the file (or a path-suffix that uniquely identifies it) AND direct the agent to invoke it. When in doubt, tag `code` and capture the ambiguity in `rationale` — it's better to under-tag `skill-tool` than to inflate the surface area Phase 4's prose-to-code check operates on. +- **Generated files (build outputs, vendored dependencies, lockfiles).** Skip them at the ignore-rule layer; do not include them in the role map. If you can't tell whether a file is generated, look for a generation marker (header comment naming the generator, sibling `.generated` file, presence in `.gitignore`); if generated, omit from the role map. + +When Phase 1 is complete, write your full exploration findings to +`quality/EXPLORATION.md`. 
The file MUST contain ALL of the following +section titles VERBATIM (the Phase 1 gate at SKILL.md:1257-1273 enforces +each mechanically; `bin/run_state_lib.validate_phase_artifacts(quality_dir, phase=1)` +is the programmatic enforcer — your artifact has to pass it before +Phase 2 will start). The exact titles are load-bearing — do NOT +substitute "equivalent" headings: + +1. `## Open Exploration Findings` — at least 8 numbered entries + (`1.`, `2.`, ...). Each entry has at least one file:line citation + in the body (e.g., `bin/foo.py:120-135`). At least 3 of these + entries trace behavior across 2 or more distinct file:line + locations (multi-location traces — the entry cites two or more + different file:line ranges). + +2. `## Quality Risks` — domain-knowledge risk analysis. Numbered or + bulleted; cite file:line where risks are concretely visible in + code or docs. + +3. `## Pattern Applicability Matrix` — a Markdown table with one row + per exploration pattern from `references/exploration_patterns.md`. + Decision column values are `FULL` or `SKIP`. Between 3 and 4 + patterns must be marked `FULL` (inclusive — the gate rejects + below 3 because exploration didn't pick enough patterns, and + above 4 because exploration ran every pattern instead of + selecting). Skipped patterns are still listed with `SKIP` and a + brief reason, so the matrix is exhaustive. + +4. `## Pattern Deep Dive — ` — at least 3 sections, + one per `FULL` pattern. Each deep dive enumerates concrete + findings with file:line citations. At least 2 of these sections + trace code paths across 2 or more distinct identifiers (e.g., + backtick-quoted function or symbol names like `\`docs_present\``, + `\`_evaluate_documentation_state\``) OR across 2 or more distinct + file:line locations — that's how the gate detects "multi-function + trace" rather than a one-anchor finding. + +5. `## Candidate Bugs for Phase 2` — numbered list of bug + hypotheses promoted from the deep dives + open exploration. 
Each + entry has a `Stage:` line attributing the source (e.g., `Stage: + open exploration`, `Stage: quality risks`, or + `Stage: `). At least 2 entries must be sourced from + `open exploration` / `quality risks` AND at least 1 entry must be + sourced from a pattern deep dive. Combo stages + (`Stage: open exploration + Cross-Implementation Consistency`) + count toward both buckets. + +6. `## Gate Self-Check` — proves you ran the Phase 1 gate. List each + of the 13 checks (≥120 lines + six required headings + ≥3 Pattern + Deep Dive sections + PROGRESS.md mark + ≥8 findings with citations + + ≥3 multi-location findings + 3-4 FULL pattern matrix rows + ≥2 + multi-function deep dives + candidate-bug source mix) and mark + whether the artifact satisfies each. + +In addition, ensure `quality/PROGRESS.md` exists and its Phase 1 +line is marked `[x]` (the gate's check 8) before declaring Phase 1 +complete. + +The exploration content the prior versions of this prompt asked for +(domain and stack identification, architecture map, existing test +inventory, specification summary, skeleton/dispatch analysis, +derived requirements `REQ-NNN`, derived use cases `UC-NN`, +file-role tagging summary) lives WITHIN these required sections — +for example, the architecture map and module enumeration belong +under `## Open Exploration Findings` as multi-location findings; +the file-role tagging summary and the `exploration_role_map.json` +breakdown summary belong under `## Open Exploration Findings` or +`## Quality Risks` as analytical content; derived REQ-NNN and UC-NN +sections may appear after `## Gate Self-Check` as additional +analytical material the playbook downstream phases consume. Do NOT +use these alternative names as TOP-level section titles — the gate +requires the six exact titles above and the Pattern Deep Dive +prefix; additional `## ` sections beyond these are tolerated for +analytical extension but the six gate-required titles MUST appear +verbatim. 
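The verbatim-title contract above is mechanical enough to pre-check before handing the artifact to the real gate. The sketch below is illustrative only — the helper name is invented, and the actual enforcer (`bin/run_state_lib.validate_phase_artifacts`) checks far more than titles (line count, citation density, PROGRESS.md state):

```python
# Illustrative pre-check for the six gate-required EXPLORATION.md titles.
# Not part of the playbook tooling; a quick sanity pass only.
REQUIRED_TITLES = [
    "## Open Exploration Findings",
    "## Quality Risks",
    "## Pattern Applicability Matrix",
    "## Candidate Bugs for Phase 2",
    "## Gate Self-Check",
]
DEEP_DIVE_PREFIX = "## Pattern Deep Dive — "


def check_exploration_headings(text: str) -> list[str]:
    """Return problems found; an empty list means the title check passes."""
    problems = [f"missing verbatim title: {t}" for t in REQUIRED_TITLES
                if t not in text]
    deep_dives = [ln for ln in text.splitlines()
                  if ln.startswith(DEEP_DIVE_PREFIX)]
    if len(deep_dives) < 3:
        problems.append(
            f"only {len(deep_dives)} Pattern Deep Dive section(s); need >= 3")
    return problems
```

A passing artifact yields an empty problem list; anything non-empty means the titles are not in the load-bearing verbatim form and Phase 2 will not start.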
+ +### MANDATORY CARTESIAN UC RULE (Lever 1, v1.5.2) + +For every requirement with a `References` field naming ≥2 files (or ≥2 file:line ranges in distinct files), apply the **Cartesian eligibility check** before deciding whether to emit a single umbrella UC or per-site UCs: + +**Gate 1 — Path-suffix match.** At least two references must share a path-suffix role: the last segment before the extension, or a matching function-name pattern that appears across the files. +- Example of a match: `virtio_mmio.c`, `virtio_vdpa.c`, `virtio_pci_modern.c` all implement `_finalize_features`. The `_finalize_features` function is the shared role. +- Example of a non-match: `CONFIG_FOO`, `CONFIG_BAR` flags in the same kconfig file — same kind of thing, but not parallel implementations. + +**Gate 2 — Function-level similarity.** Each matching reference must cite a line range of similar size (within 2× of the median) and each range must be inside a function body — not a file-header, a kconfig block, or a macro expansion list. + +**Decision:** +- **Both gates pass →** emit one UC per site, numbered `UC-N.a`, `UC-N.b`, `UC-N.c`, … Each per-site UC has its own Actors, Preconditions, Flow, Postconditions. The parent REQ-N remains as the umbrella. +- **Only Gate 1 passes →** keep a single umbrella UC and mark the reference cluster `heterogeneous` in a `` HTML comment in the UC body. Phase 3 can still override if it finds per-site divergence. +- **Neither gate passes →** single umbrella UC, no special marking. + +### Worked example — REQ-010 / VIRTIO_F_RING_RESET (virtio) + +Suppose Phase 1 derives: + + ### REQ-010: Virtio transports must honor VIRTIO_F_RING_RESET negotiation + - References: drivers/virtio/virtio_mmio.c, drivers/virtio/virtio_vdpa.c, drivers/virtio/virtio_pci_modern.c + - Pattern: whitelist + +Applying the Cartesian check: +- Gate 1: all three files contain `_finalize_features` functions — matches. 
+- Gate 2: each cited range is inside a function body of similar size — matches. + +Both gates pass → emit per-site UCs: + + ### UC-10.a: VIRTIO_F_RING_RESET on PCI modern transport + - Actors: virtio_pci_modern driver, guest kernel + - Preconditions: device advertises VIRTIO_F_RING_RESET + - Flow: vp_modern_finalize_features propagates bit through config space … + - Postconditions: feature_bit reflected in final config + + ### UC-10.b: VIRTIO_F_RING_RESET on MMIO transport + - Actors: virtio_mmio driver, guest kernel + - Preconditions: device advertises VIRTIO_F_RING_RESET + - Flow: vm_finalize_features must mirror PCI modern behavior … + - Postconditions: feature_bit survives finalize call + + ### UC-10.c: VIRTIO_F_RING_RESET on vDPA transport + - Actors: virtio_vdpa driver, vdpa device backend + - Preconditions: device advertises VIRTIO_F_RING_RESET + - Flow: virtio_vdpa_finalize_features forwards through set_driver_features … + - Postconditions: feature_bit visible to vdpa backend + +### CONFIRMATION CHECKLIST (Cartesian UC rule) + +Before completing Phase 1, confirm each item explicitly in EXPLORATION.md under a section titled "Cartesian UC rule confirmation": + +1. For every REQ with ≥2 References, I ran Gate 1 (path-suffix match). +2. For every REQ that passed Gate 1, I ran Gate 2 (function-level similarity). +3. Where both gates passed, I emitted per-site UCs (UC-N.a, UC-N.b, …). +4. Where only Gate 1 passed, I marked the cluster ``. +5. Where neither gate passed, I kept a single umbrella UC without marking. +6. For each REQ with a pattern match in Gate 1, I added `Pattern: whitelist|parity|compensation` to the REQ block. + +Also initialize quality/PROGRESS.md with the run metadata and the phase tracker in the EXACT checkbox format below. This format is a hard contract: the Phase 5 gate checks for the substring `- [x] Phase 4` before allowing reconciliation to start, and it only matches the checkbox form. 
Do NOT substitute a Markdown table, bulleted prose, or any other layout — table-format runs have aborted mid-pipeline because the gate does not see "Complete" in a table cell as equivalent. + +Template for the phase tracker section of PROGRESS.md (fill in the Skill version from SKILL.md metadata): + +``` +# Quality Playbook Progress + +Skill version: +Date: + +## Phase tracker + +- [x] Phase 1 - Explore +- [ ] Phase 2 - Generate +- [ ] Phase 3 - Code Review +- [ ] Phase 4 - Spec Audit +- [ ] Phase 5 - Reconciliation +- [ ] Phase 6 - Verify +``` + +As each later phase completes it will flip its own `- [ ]` to `- [x]` — keep the line text (including the phase name after the dash) stable so substring matching in the Phase 5 gate and downstream tooling works. + +IMPORTANT: Do NOT proceed to Phase 2. Your only job is exploration and writing findings to disk. Write thorough, detailed findings - the next phase will read EXPLORATION.md to generate artifacts, so everything important must be captured in that file. diff --git a/skills/quality-playbook/phase_prompts/phase2.md b/skills/quality-playbook/phase_prompts/phase2.md new file mode 100644 index 000000000..16682d9a9 --- /dev/null +++ b/skills/quality-playbook/phase_prompts/phase2.md @@ -0,0 +1,27 @@ +{skill_fallback_guide} + +You are a quality engineer continuing a phase-by-phase quality playbook run. Phase 1 (exploration) is already complete. + +Read these files to get context: +1. quality/EXPLORATION.md - your Phase 1 findings (requirements, risks, architecture) +2. quality/PROGRESS.md - run metadata and phase status +3. SKILL.md - read the Phase 2 section (from "Phase 2: Generate the Quality Playbook" through the "Checkpoint: Update PROGRESS.md after artifact generation" section). Also read the reference files cited in that section. 
Resolve SKILL.md and reference files via the documented fallback list above; do NOT assume any single install layout (`.github/skills/`, `.claude/skills/quality-playbook/`, `.cursor/skills/quality-playbook/`, `.continue/skills/quality-playbook/`, or root). + +**Field preservation rule (v1.5.2, Lever 2).** When transcribing REQ hypotheses from EXPLORATION.md into `quality/REQUIREMENTS.md` and `quality/requirements_manifest.json`, every `- Pattern: ` field present on the source hypothesis MUST appear on the corresponding REQ in both output files. Pattern values are `whitelist | parity | compensation`. Phase 1's Cartesian UC rule (confirmation checklist item 6) requires Pattern tagging for every REQ where both UC gates match; Phase 2 must not silently drop these tags. If a hypothesis lacks Pattern but you believe it should have one (per-site UCs emitted with `UC-N.a`/`UC-N.b` suffixes, multi-file `References` suggesting a parallel structure), add Pattern during Phase 2 — do not omit the field. The Phase 5 cardinality gate cannot enforce coverage on a REQ it doesn't know is pattern-tagged; silent omission is a documented v1.4.5-regression vector. + +Execute Phase 2: Generate all quality artifacts. Use the exploration findings in EXPLORATION.md as your source - do not re-explore the codebase from scratch. Generate: +- quality/QUALITY.md (quality constitution) +- quality/CONTRACTS.md (behavioral contracts) +- quality/REQUIREMENTS.md (with REQ-NNN and UC-NN identifiers from EXPLORATION.md) +- quality/COVERAGE_MATRIX.md +- Functional tests (quality/test_functional.*) +- quality/RUN_CODE_REVIEW.md (code review protocol) +- quality/RUN_INTEGRATION_TESTS.md (integration test protocol) +- quality/RUN_SPEC_AUDIT.md (spec audit protocol) +- quality/RUN_TDD_TESTS.md (TDD verification protocol) +- quality/COMPLETENESS_REPORT.md (baseline, without verdict) +- If dispatch/enumeration contracts exist: quality/mechanical/ with verify.sh and extraction artifacts. 
Run verify.sh immediately and save receipts. + +Update PROGRESS.md: mark Phase 2 complete (use the checkbox format `- [x] Phase 2 - Generate` — do NOT switch to a table), update artifact inventory. + +IMPORTANT: Do NOT proceed to Phase 3 (code review). Your job is artifact generation only. The next phase will execute the review protocols you generated. diff --git a/skills/quality-playbook/phase_prompts/phase3.md b/skills/quality-playbook/phase_prompts/phase3.md new file mode 100644 index 000000000..ad4b874d7 --- /dev/null +++ b/skills/quality-playbook/phase_prompts/phase3.md @@ -0,0 +1,154 @@ +{skill_fallback_guide} + +You are a quality engineer continuing a phase-by-phase quality playbook run. Phases 1-2 are complete. + +Read these files to get context: +1. quality/PROGRESS.md - run metadata, phase status, artifact inventory +2. quality/EXPLORATION.md - Phase 1 findings (especially the "Candidate Bugs for Phase 2" section) +3. quality/REQUIREMENTS.md - derived requirements and use cases +4. quality/CONTRACTS.md - behavioral contracts +5. SKILL.md - read the Phase 3 section ("Phase 3: Code Review and Regression Tests"). Also read references/review_protocols.md. Resolve SKILL.md and the references/ directory via the documented fallback list above; do NOT assume any single install layout. + +Execute Phase 3: Code Review + Regression Tests. +Run the 3-pass code review per quality/RUN_CODE_REVIEW.md. 
For every confirmed bug: +- Add to quality/BUGS.md with ### BUG-NNN heading format +- Write a regression test (xfail-marked) +- Generate quality/patches/BUG-NNN-regression-test.patch (MANDATORY for every confirmed bug) +- Generate quality/patches/BUG-NNN-fix.patch (strongly encouraged) +- Write code review reports to quality/code_reviews/ +- Update PROGRESS.md BUG tracker + +### MANDATORY GRID STEP (Lever 2, v1.5.2) — pattern-tagged REQs only + +For every REQ in quality/REQUIREMENTS.md that has a `Pattern:` field (`whitelist`, `parity`, or `compensation`), you MUST produce a compensation grid BEFORE writing any BUG entries for that REQ. + +**Step 1. Enumerate the authoritative item set.** Mechanical extraction from source — uapi header, spec section, documented constants. Do NOT invent. Example: for VIRTIO_F_RING_RESET-family, grep `include/uapi/linux/virtio_config.h` for `VIRTIO_F_*` and list the bits the REQ covers. + +**Step 2. Enumerate the sites.** From the REQ's per-site UCs (UC-N.a, UC-N.b, …). If the REQ has a single umbrella UC but is pattern-tagged, the grid is 1-dimensional over items. + +**Step 3. Produce the grid.** Write `quality/compensation_grid.json` with one entry per REQ: + +```json +{ + "schema_version": "1.5.2", + "reqs": { + "REQ-010": { + "pattern": "whitelist", + "items": ["RING_RESET", "ADMIN_VQ", "NOTIF_CONFIG_DATA", "SR_IOV"], + "sites": ["PCI", "MMIO", "vDPA"], + "cells": [ + {"cell_id": "REQ-010/cell-RING_RESET-PCI", "item": "RING_RESET", "site": "PCI", "present": true, "evidence": "drivers/virtio/virtio_pci_modern.c:XXX-YYY"}, + {"cell_id": "REQ-010/cell-RING_RESET-MMIO", "item": "RING_RESET", "site": "MMIO", "present": false, "evidence": "drivers/virtio/virtio_mmio.c: no match for RING_RESET"} + ] + } + } +} +``` + +Cell IDs are mechanical: `REQ-/cell--`. No whitespace, uppercase item/site identifiers where natural. + +**Step 4. 
Apply the BUG-default rule.** For every cell where: +- the item is defined in authoritative source AND +- the item is absent from any shared filter AND +- the item is absent from the site's compensation path + +→ the cell DEFAULTS to BUG. Emit one `### BUG-NNN` entry with the cell's file:line citation, spec basis, and expected-vs-actual behavior. Include a `- Covers: [REQ-N/cell--]` line (see schemas.md §8 for the field contract). + +**Step 5. Downgrade to QUESTION requires a structured JSON record.** Append one record per downgraded cell to `quality/compensation_grid_downgrades.json`: + +```json +{ + "schema_version": "1.5.2", + "downgrades": [ + { + "cell_id": "REQ-010/cell-RING_RESET-MMIO", + "authority_ref": "include/uapi/linux/virtio_config.h:116", + "site_citation": "drivers/virtio/virtio_mmio.c:109-131", + "reason_class": "intentionally-partial", + "falsifiable_claim": "MMIO does not support RING_RESET because the MMIO transport predates the feature bit and kernel docs at Documentation/virtio/virtio_mmio.rst:42-55 state the transport is frozen at its v1.0 feature set; falsifiable by showing MMIO re-sets bit 40 under any kernel release." + } + ] +} +``` + +- `reason_class` enum: `out-of-scope | deprecated | platform-gated | handled-upstream | intentionally-partial`. +- `authority_ref`, `site_citation`, `falsifiable_claim` are required and non-empty. +- `falsifiable_claim` must state an observable condition that would make the claim wrong. +- Missing any required field, or `reason_class` outside the enum, or zero-length `falsifiable_claim` → cell REVERTS to BUG at Phase 5 gate time. There is no re-prompt loop. + +**Step 6. Self-check.** Before finalizing BUGS.md for this REQ, verify that every cell in the grid appears in either: +- some BUG's `- Covers: [...]` list, OR +- a downgrade record in `quality/compensation_grid_downgrades.json`. + +Any cell missing from both will fail the Phase 5 cardinality gate. 
This self-check is advisory in Phase 3; the blocking gate runs in Phase 5. + +### Worked example — RING_RESET grid (virtio) + +REQ-010 pattern: whitelist. Items: {RING_RESET, ADMIN_VQ, NOTIF_CONFIG_DATA, SR_IOV}. Sites: {PCI, MMIO, vDPA}. Grid: 4 × 3 = 12 cells. + +Code inspection reveals PCI implements all four; MMIO implements none of the four (frozen at v1.0 feature set); vDPA implements NOTIF_CONFIG_DATA but not the other three. + +Grid (present=T, absent=F): + +| | PCI | MMIO | vDPA | +|-----------------------|-----|------|------| +| RING_RESET | T | F | F | +| ADMIN_VQ | T | F | F | +| NOTIF_CONFIG_DATA | T | F | T | +| SR_IOV | T | F | F | + +BUG-default applies to every F cell (7 total; the ADMIN_VQ-MMIO cell is downgraded below). Possible consolidation: + +### BUG-001: MMIO ignores VIRTIO_F_RING_RESET +- Primary requirement: REQ-010 +- Covers: [REQ-010/cell-RING_RESET-MMIO] + +### BUG-002: vDPA ignores VIRTIO_F_RING_RESET +- Primary requirement: REQ-010 +- Covers: [REQ-010/cell-RING_RESET-vDPA] + +### BUG-003: vDPA missing ADMIN_VQ hookup +- Primary requirement: REQ-010 +- Covers: [REQ-010/cell-ADMIN_VQ-vDPA] + +### BUG-004: MMIO ignores NOTIF_CONFIG_DATA negotiation (common filter gap) +- Primary requirement: REQ-010 +- Covers: [REQ-010/cell-NOTIF_CONFIG_DATA-MMIO] + +### BUG-005: MMIO + vDPA both miss SR_IOV propagation +- Primary requirement: REQ-010 +- Covers: [REQ-010/cell-SR_IOV-MMIO, REQ-010/cell-SR_IOV-vDPA] +- Consolidation rationale: shared fix path in both transports goes through the same feature-bit filter; single patch on the shared helper closes both cells.
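The BUG-default derivation in Steps 1–4 is mechanical, which makes it easy to sketch. The fragment below is illustrative only (the function names are invented and it is not part of the playbook tooling), but the cell-ID form and the worked grid follow the text:

```python
# Illustrative sketch of the BUG-default rule over the worked grid.
# Cell IDs use the canonical mechanical form shown in Step 3.

def cell_id(req: str, item: str, site: str) -> str:
    return f"{req}/cell-{item}-{site}"


def default_bug_cells(req: str, presence: dict) -> list[str]:
    """Every absent (item, site) cell defaults to BUG."""
    return [cell_id(req, item, site)
            for (item, site), present in sorted(presence.items())
            if not present]


ITEMS = ["RING_RESET", "ADMIN_VQ", "NOTIF_CONFIG_DATA", "SR_IOV"]
SITES = ["PCI", "MMIO", "vDPA"]
# Present cells from the grid: PCI implements all four, vDPA only NOTIF.
PRESENT = {(i, "PCI") for i in ITEMS} | {("NOTIF_CONFIG_DATA", "vDPA")}
grid = {(i, s): (i, s) in PRESENT for i in ITEMS for s in SITES}
bugs = default_bug_cells("REQ-010", grid)  # the 7 absent cells
```

Every returned cell must then land in some BUG's Covers list or in a downgrade record, which is exactly the union the Phase 5 cardinality gate re-checks.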
+ +If the reviewer concluded MMIO ADMIN_VQ is intentionally out-of-scope because ADMIN_VQ is a PCI-only spec feature, the downgrade record would be: + +```json +{ + "cell_id": "REQ-010/cell-ADMIN_VQ-MMIO", + "authority_ref": "include/uapi/linux/virtio_pci.h:NN", + "site_citation": "drivers/virtio/virtio_mmio.c: no admin virtqueue implementation", + "reason_class": "out-of-scope", + "falsifiable_claim": "ADMIN_VQ is PCI-scoped — falsifiable by citing any virtio-spec normative text requiring ADMIN_VQ on non-PCI transports." +} +``` + +Union check: 6 BUG-covered cells + 1 downgrade cell = 7. Grid has 12 cells; 5 present cells don't need coverage. Total: 6 F cells covered via BUGs + 1 via downgrade = all 7 absent cells accounted for. Grid → clean. + +### ITERATION mode addendum (MANDATORY INCREMENTAL WRITE, Phase 8) + +When running in iteration mode (gap / unfiltered / parity / adversarial), write candidate BUG stubs to disk immediately on identification, not at end-of-review. Path: `quality/code_reviews/-candidates.md`. One `### CANDIDATE-NNN` heading per candidate, with at least a file:line citation. Reviewer upgrades candidates to confirmed BUGs in BUGS.md only after full triage. + +### CONFIRMATION CHECKLIST (Lever 2, v1.5.2) + +Before writing the Phase 3 completion checkpoint to PROGRESS.md, confirm each item explicitly in your Phase 3 summary: + +1. For every pattern-tagged REQ, I produced a compensation grid in `quality/compensation_grid.json`. +2. For every grid, I applied the BUG-default rule mechanically. +3. Every BUG emitted for a pattern-tagged REQ has a `- Covers: [...]` field with valid cell IDs. +4. Every BUG whose Covers list has ≥2 entries has a non-empty `- Consolidation rationale: ...` field. +5. For every downgraded cell, I wrote a complete structured record in `quality/compensation_grid_downgrades.json` with all five required fields and a valid `reason_class`. +6.
For every pattern-tagged REQ, the union of Covers lists + downgrade cells equals the grid's cell set. + +Mark Phase 3 (Code review + regression tests) complete in PROGRESS.md (use the checkbox format `- [x] Phase 3 - Code Review` — do NOT switch to a table). + +IMPORTANT: Do NOT proceed to Phase 4 (spec audit). The next phase will run the spec audit with a fresh context window. diff --git a/skills/quality-playbook/phase_prompts/phase4.md b/skills/quality-playbook/phase_prompts/phase4.md new file mode 100644 index 000000000..a3943fef0 --- /dev/null +++ b/skills/quality-playbook/phase_prompts/phase4.md @@ -0,0 +1,54 @@ +{skill_fallback_guide} + +You are a quality engineer continuing a phase-by-phase quality playbook run. Phases 1-3 are complete. + +Read these files to get context: +1. quality/PROGRESS.md - run metadata, phase status, BUG tracker +2. quality/REQUIREMENTS.md - derived requirements +3. quality/BUGS.md - bugs found in Phase 3 (code review) +4. SKILL.md - read the Phase 4 section ("Phase 4: Spec Audit and Triage"). Also read references/spec_audit.md. Resolve SKILL.md and the references/ directory via the documented fallback list above; do NOT assume any single install layout. + +Execute Phase 4: Spec Audit + Triage + Layer-2 semantic citation check. + +Part A — spec audit: +Run the spec audit per quality/RUN_SPEC_AUDIT.md. Produce: +- Individual auditor reports at quality/spec_audits/YYYY-MM-DD-auditor-N.md (one per auditor) +- Triage synthesis at quality/spec_audits/YYYY-MM-DD-triage.md +- Executable triage probes at quality/spec_audits/triage_probes.sh +- Regression tests and patches for any net-new spec audit bugs +- Update BUGS.md and PROGRESS.md BUG tracker with any new findings + +Part B — Layer-2 semantic citation check (v1.5.1): +The gate's invariant #17 (schemas.md §10) requires three Council members to +vote on each Tier 1/2 REQ's citation_excerpt. Execute these steps: + +1. 
Generate per-Council-member prompts: + python3 -m bin.quality_playbook semantic-check plan . + This writes one or more prompt files to + quality/council_semantic_check_prompts/.txt per member in the + Council roster (bin/council_config.py: claude-opus-4.7, gpt-5.4, + gemini-2.5-pro). For >15 Tier 1/2 REQs, prompts are split into batches + of 5 (-batch.txt). + If no Tier 1/2 REQs exist (Spec Gap run), this step writes an empty + quality/citation_semantic_check.json directly — skip steps 2-4. + +2. For each Council member's prompt file, feed the prompt to that model + (the same roster that ran Part A) and capture its JSON-array response + to quality/council_semantic_check_responses/.json. If the + member was batched, concatenate the per-batch responses into a single + array in the response file. Every entry must have req_id, verdict + (supports|overreaches|unclear), and reasoning. + +3. Assemble the semantic-check output: + python3 -m bin.quality_playbook semantic-check assemble . \ + --member claude-opus-4.7 --response quality/council_semantic_check_responses/claude-opus-4.7.json \ + --member gpt-5.4 --response quality/council_semantic_check_responses/gpt-5.4.json \ + --member gemini-2.5-pro --response quality/council_semantic_check_responses/gemini-2.5-pro.json + This writes quality/citation_semantic_check.json per schemas.md §9. + +4. Verify the output file exists. Phase 6's gate invariant #17 requires + it on every Tier 1/2 run. + +Mark Phase 4 (Spec audit + triage + semantic check) complete in PROGRESS.md (use the checkbox format `- [x] Phase 4 - Spec Audit` — the Phase 5 entry gate looks for that exact substring and will abort if it finds a table row or any other layout). + +IMPORTANT: Do NOT proceed to Phase 5 (reconciliation). The next phase will handle reconciliation and TDD. 
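Step 3's assembly boils down to collecting the three members' verdicts per REQ. The sketch below is a hypothetical illustration, not the real `bin.quality_playbook` implementation — the output shape here is invented (schemas.md §9 is authoritative), and only the entry contract from step 2 (req_id, verdict, reasoning) comes from the text:

```python
from collections import Counter


def tally_votes(member_responses: dict) -> dict:
    """Group each member's verdict by req_id and compute a simple majority."""
    per_req = {}
    for member, entries in member_responses.items():
        for entry in entries:
            per_req.setdefault(entry["req_id"], {})[member] = entry["verdict"]
    return {req: {"votes": votes,
                  "majority": Counter(votes.values()).most_common(1)[0][0]}
            for req, votes in per_req.items()}


responses = {
    "member-a": [{"req_id": "REQ-001", "verdict": "supports", "reasoning": "..."}],
    "member-b": [{"req_id": "REQ-001", "verdict": "supports", "reasoning": "..."}],
    "member-c": [{"req_id": "REQ-001", "verdict": "unclear", "reasoning": "..."}],
}
result = tally_votes(responses)  # REQ-001: three votes, majority "supports"
```

The practical value of a tally pass like this is catching problems early: a member response missing a req_id, or a verdict outside the supports|overreaches|unclear enum, surfaces here rather than at gate invariant #17 in Phase 6.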
diff --git a/skills/quality-playbook/phase_prompts/phase5.md b/skills/quality-playbook/phase_prompts/phase5.md new file mode 100644 index 000000000..05e958599 --- /dev/null +++ b/skills/quality-playbook/phase_prompts/phase5.md @@ -0,0 +1,119 @@ +{skill_fallback_guide} + +You are a quality engineer continuing a phase-by-phase quality playbook run. Phases 1-4 are complete. + +Read these files to get context: +1. quality/PROGRESS.md - run metadata, phase status, cumulative BUG tracker +2. quality/BUGS.md - all confirmed bugs from code review and spec audit +3. quality/REQUIREMENTS.md - derived requirements +4. SKILL.md - read the Phase 5 section ("Phase 5: Post-Review Reconciliation and Closure Verification"). Also read references/requirements_pipeline.md, references/review_protocols.md, and references/spec_audit.md. Resolve SKILL.md and the references/ directory via the documented fallback list above; do NOT assume any single install layout. + +Execute Phase 5: Reconciliation + TDD + Closure. + +1. Run the Post-Review Reconciliation per references/requirements_pipeline.md. Update COMPLETENESS_REPORT.md. +2. Run closure verification: every BUG in the tracker must have either a regression test or an explicit exemption. +3. Write bug writeups at quality/writeups/BUG-NNN.md for EVERY confirmed bug. The canonical template is the "Bug writeup generation" section of SKILL.md (resolve via the fallback list above) — read that section before writing. Use the exact field headings listed there: **Summary, Spec reference, The code, Observable consequence, Depth judgment, The fix, The test, Related issues**. Sections 1–4, 6, 7 are required in every writeup; section 5 (Depth judgment) fires only when the consequence isn't self-evident from the immediate code; section 8 (Related issues) is included only when related bugs exist. 
Do NOT introduce fields that aren't in the template (no "Minimal reproduction" as a top-level field, no "Patch path:" as a top-level field — those belong inside Spec reference and The test respectively). + + **MANDATORY HYDRATION STEP.** Before writing a writeup, re-open quality/BUGS.md and locate the `### BUG-NNN:` entry for the bug you are about to write up. Every confirmed bug in BUGS.md already has the content you need — your job is to copy it into the writeup's sections, not to invent it. If a field is missing from BUGS.md, that is a reconciliation error to surface in PROGRESS.md, not a field to fabricate. Use this field map: + + | BUGS.md field | Writeup section | How to use it | + |----------------------------|------------------------------|-------------------------------------------------------------------------------| + | Title line (### BUG-NNN:…) | Summary | One sentence naming the function/code path and the observable failure. | + | Primary requirement | Spec reference | `- Requirement: REQ-NNN` | + | Spec basis | Spec reference | `- Spec basis: ` plus a ≤15-word contract quote copied verbatim from the cited lines. | + | Location | The code | Cite `file:line` and describe what the current path does there. | + | Minimal reproduction | Observable consequence | Weave into the consequence paragraph as the triggering input. | + | Expected + Actual behavior | Observable consequence | The actual behavior is the observable failure; the expected defines the gap. | + | Regression test | The test | `- Regression test: ` — verbatim from BUGS.md. | + | Patches (regression) | The test | `- Regression patch: ` — verbatim from BUGS.md. | + | Patches (fix) | The fix + The test | If a fix patch file exists, read it and paste the unified diff inside ```diff; also list the patch path as `- Fix patch: ` under The test. 
If no fix patch exists (confirmed-open bug), write the minimal concrete unified diff directly in The fix anyway — SKILL.md requires an inline diff in every writeup. In the no-patch case, omit the `Fix patch:` bullet from The test. | + | Red/green logs | The test | `- Red receipt: quality/results/BUG-NNN.red.log` and the matching green path. | + + **Worked example.** The BUGS.md entry for BUG-004 is: + + ### BUG-004: naive upstream timestamps crash ETA math + - Source: Code Review + - Severity: HIGH + - Primary requirement: REQ-006 + - Location: bus_tracker.py:138-144 + - Spec basis: quality/REQUIREMENTS.md:163-172; quality/QUALITY.md:57-65 + - Minimal reproduction: Return a visit whose ExpectedArrivalTime is an ISO string + without timezone information, such as 2026-04-21T12:00:00. + - Expected behavior: The affected arrival degrades to unknown-time while the rest + of the stop remains usable. + - Actual behavior: datetime.fromisoformat() returns a naive datetime and + subtracting it from datetime.now(timezone.utc) raises TypeError, aborting the + stop/request path. + - Regression test: quality.test_regression.TestPhase3Regressions.test_bug_004_fetch_stop_arrivals_degrades_naive_timestamps + - Patches: quality/patches/BUG-004-regression-test.patch, quality/patches/BUG-004-fix.patch + + The hydrated writeup sections look like this (sketch — paste the real diff from the + fix patch file into ```diff, don't make one up): + + ## Summary + fetch_stop_arrivals() crashes the whole stop/request path when an upstream visit + carries a naive ExpectedArrivalTime, instead of degrading that arrival to + unknown-time. + + ## Spec reference + - Requirement: REQ-006 + - Spec basis: quality/REQUIREMENTS.md:163-172; quality/QUALITY.md:57-65 + - Behavioral contract quote: "degrade a bad per-arrival timestamp to unknown-time instead of aborting the whole response path" + + ## The code + At bus_tracker.py:138-144, the parser calls datetime.fromisoformat(...) 
on + ExpectedArrivalTime and subtracts the result from datetime.now(timezone.utc)… + + ## Observable consequence + When the upstream visit returns ExpectedArrivalTime="2026-04-21T12:00:00" + (no timezone), fromisoformat() returns a naive datetime, the subtraction + raises TypeError, and the entire stop/request path aborts rather than the + single affected arrival degrading to unknown-time. + + ## The fix + ```diff + + ``` + + ## The test + - Regression test: quality.test_regression.TestPhase3Regressions.test_bug_004_fetch_stop_arrivals_degrades_naive_timestamps + - Regression patch: quality/patches/BUG-004-regression-test.patch + - Fix patch: quality/patches/BUG-004-fix.patch + - Red receipt: quality/results/BUG-004.red.log + - Green receipt: quality/results/BUG-004.green.log + + **Confirmation checklist (per writeup, before moving to the next bug).** (a) Every + required section has populated content copied from BUGS.md or the patch files — + no empty backticks, no sentinel filler like "is a confirmed code bug in ``" or + "The affected implementation lives at ``" or "Patch path: ``". (b) The ```diff + fence contains at least one `+` or `-` line from the actual fix patch. (c) The + Summary names a real function or code path, not the BUG identifier. (d) No + angle-bracket placeholders (e.g., `<...>`) remain in the final writeup — those are + pedagogical markers from the worked example and from SKILL.md, never acceptable + output. +4. Run the TDD red-green cycle: for each confirmed bug, run the regression test against unpatched code -> quality/results/BUG-NNN.red.log. If a fix patch exists, run against patched code -> quality/results/BUG-NNN.green.log. If the test runner is unavailable, create the log with NOT_RUN on the first line. +5. Generate sidecar JSON: quality/results/tdd-results.json and quality/results/integration-results.json (schema_version "1.1", canonical fields: id, requirement, red_phase, green_phase, verdict, fix_patch_present, writeup_path). +6. 
If mechanical verification artifacts exist, run quality/mechanical/verify.sh and save receipts. +7. Run terminal gate verification, write it to PROGRESS.md. + +### MANDATORY CARDINALITY GATE (Lever 3, v1.5.2) + +Before finalizing this phase, run the cardinality reconciliation gate against the current repo state. Locate `quality_gate.py` via the same fallback list used for SKILL.md (it sits in the same directory as SKILL.md in every install layout), then invoke it as a script — `quality_gate.py` runs `check_v1_5_2_cardinality_gate(repo_dir)` as part of its standard pass: + + python3 . + +Where `` is the first hit when walking the documented install-location fallback list, with `SKILL.md` swapped for `quality_gate.py` (e.g., `quality_gate.py`, `.claude/skills/quality-playbook/quality_gate.py`, `.github/skills/quality_gate.py`, `.cursor/skills/quality-playbook/quality_gate.py`, `.continue/skills/quality-playbook/quality_gate.py`, `.github/skills/quality-playbook/quality_gate.py`). + +If the gate output contains any line beginning with `cardinality gate:`, or reports uncovered cells, malformed cell IDs, missing consolidation rationale on multi-cell Covers, or malformed downgrade records, STOP. Fix the BUGS.md entries or the `compensation_grid_downgrades.json` file. Do NOT proceed to completion until those failure lines no longer appear. + +For every pattern-tagged REQ, the Phase 5 contract is: +- Every grid cell with `"present": false` appears in either a BUG's `Covers:` list or a downgrade record. +- Every `Covers:` entry uses the canonical cell ID form `REQ-N/cell--`. +- Every BUG with ≥2 `Covers:` entries has a non-empty `Consolidation rationale:` line. +- Every downgrade record has `cell_id`, `authority_ref`, `site_citation`, `reason_class` (in the enum), `falsifiable_claim` (non-empty). + +The cardinality gate is blocking. 
It is intentionally stricter than the Phase 3 advisory self-check; the advisory check is meant to surface problems early, but Phase 5 is where they become fatal. + +Mark Phase 5 complete in PROGRESS.md (use the checkbox format `- [x] Phase 5 - Reconciliation` — do NOT switch to a table). + +IMPORTANT: quality_gate.py will FAIL Phase 5 if any writeup is missing a non-empty ```diff block or contains any of these sentinel phrases verbatim: "is a confirmed code bug in ``", "The affected implementation lives at ``", "Patch path: ``", "- Regression test: ``", "- Regression patch: ``". Those two checks are the hard gate. Skipping the BUGS.md hydration step above is not gate-enforced but will produce writeups that read as unpopulated stubs and fail a human review — do not skip it. diff --git a/skills/quality-playbook/phase_prompts/phase6.md b/skills/quality-playbook/phase_prompts/phase6.md new file mode 100644 index 000000000..44eb708a3 --- /dev/null +++ b/skills/quality-playbook/phase_prompts/phase6.md @@ -0,0 +1,23 @@ +{skill_fallback_guide} + +You are a quality engineer doing the verification phase of a quality playbook run. Phases 1-5 are complete. + +Read SKILL.md - the Phase 6 section ("Phase 6: Verify"). Resolve SKILL.md via the documented fallback list above; do NOT assume any single install layout. Follow the incremental verification steps (6.1 through 6.5). + +Step 6.1: If quality/mechanical/verify.sh exists, run it. Record exit code. +Step 6.2: Run quality_gate.py. Locate it via the same fallback list used for SKILL.md (`quality_gate.py` sits in the same directory as SKILL.md in every install layout — e.g., `quality_gate.py`, `.claude/skills/quality-playbook/quality_gate.py`, `.github/skills/quality_gate.py`, `.cursor/skills/quality-playbook/quality_gate.py`, `.continue/skills/quality-playbook/quality_gate.py`, `.github/skills/quality-playbook/quality_gate.py`). Then run: + python3 . +Read the output carefully. 
For every FAIL result, fix the issue: +- Missing regression-test patches: generate quality/patches/BUG-NNN-regression-test.patch +- Missing inline diffs in writeups: add a ```diff block +- Non-canonical JSON fields: fix tdd-results.json (use 'id' not 'bug_id', etc.) +- Missing files: create them +After fixing all FAILs, run quality_gate.py again. Repeat until 0 FAIL. +Save final output to quality/results/quality-gate.log. + +Step 6.3: Run functional tests if a test runner is available. +Step 6.4: File-by-file verification checklist (read one file at a time, check, move on). +Step 6.5: Metadata consistency check. + +Append each step's result to quality/results/phase6-verification.log. +Mark Phase 6 complete in PROGRESS.md (use the checkbox format `- [x] Phase 6 - Verify` — do NOT switch to a table). diff --git a/skills/quality-playbook/phase_prompts/single_pass.md b/skills/quality-playbook/phase_prompts/single_pass.md new file mode 100644 index 000000000..31323ab82 --- /dev/null +++ b/skills/quality-playbook/phase_prompts/single_pass.md @@ -0,0 +1 @@ +{skill_fallback_guide} Execute the quality playbook for this project.{seed_instruction} diff --git a/skills/quality-playbook/quality_gate.py b/skills/quality-playbook/quality_gate.py new file mode 100755 index 000000000..f8864eef9 --- /dev/null +++ b/skills/quality-playbook/quality_gate.py @@ -0,0 +1,3385 @@ +#!/usr/bin/env python3 +"""quality_gate.py — Post-run validation gate for Quality Playbook artifacts. + +Mechanically checks artifact conformance issues that model self-attestation +persistently misses. Now the sole gate script; the earlier quality_gate.sh +(bash) has been retired. See quality_gate/test_quality_gate.py for the test +suite. + +Usage: + ./quality_gate.py . # Check current directory (benchmark mode) + ./quality_gate.py --general . 
# Check with relaxed thresholds + ./quality_gate.py virtio # Check named repo (from repos/) + ./quality_gate.py --all # Check all current-version repos + ./quality_gate.py --version 1.3.27 virtio # Check specific version + +Exit codes: + 0 — all checks passed + 1 — one or more checks failed + +Runs on Python 3.8+ with only the standard library. +""" + +import json +import os +import re +import sys +from datetime import date +from pathlib import Path + +SCRIPT_DIR = Path(__file__).resolve().parent + +# Allow soft import of bin/citation_verifier for v1.5.1 byte-equality checks. +# The verifier may live at one of several locations depending on where the +# gate was installed: +# 1. /bin/citation_verifier.py — gate runs from the source tree +# (gate path: /.github/skills/quality_gate/quality_gate.py; +# bin/ is three parents up from SCRIPT_DIR). +# 2. /bin/citation_verifier.py — gate installed alongside +# bin/ at the install root (v1.5.6 BUG-005 fix; bin/install_skill.py +# and repos/setup_repos.sh both bundle bin/citation_verifier.py here). +# 3. /bin/citation_verifier.py via the nested-skills path +# (.github/skills/quality_gate.py — SCRIPT_DIR is .github/skills, and +# bin/ is two parents up). +# When none of these resolve, byte-equality is skipped with a WARN rather +# than a hard FAIL — the gate continues with reduced enforcement. 
+_CITATION_VERIFIER = None +_VERIFIER_SEARCH_ROOTS = [ + SCRIPT_DIR.parent.parent.parent, # source-clone layout + SCRIPT_DIR, # gate + bin/ siblings (uncommon) + SCRIPT_DIR.parent.parent, # nested-skills layout (.github/skills/quality_gate.py) +] +for _candidate_root in _VERIFIER_SEARCH_ROOTS: + _verifier_file = _candidate_root / "bin" / "citation_verifier.py" + if _verifier_file.is_file(): + try: + if str(_candidate_root) not in sys.path: + sys.path.insert(0, str(_candidate_root)) + from bin import citation_verifier as _CITATION_VERIFIER # noqa: E402 + break + except Exception: # noqa: BLE001 — missing / misinstalled bin/ is tolerable + _CITATION_VERIFIER = None + continue + +# Global counters — reset per invocation via main(). Tests that call check_repo +# directly should reset these in setUp. +FAIL = 0 +WARN = 0 + + +# v1.5.2 — REQ Pattern field (Lever 2) +VALID_PATTERN_VALUES = frozenset({"whitelist", "parity", "compensation"}) + +_REQ_PATTERN_RE = re.compile( + r"^\s*-\s*Pattern:\s*(\S+)\s*$", re.IGNORECASE | re.MULTILINE +) + + +def extract_req_pattern(req_block): + """Return the REQ's pattern tag from a REQUIREMENTS.md block, or None. + + Raises ValueError when the block carries an invalid pattern value. Valid + values are VALID_PATTERN_VALUES. Absent field returns None. + """ + m = _REQ_PATTERN_RE.search(req_block) + if not m: + return None + value = m.group(1).strip() + if value not in VALID_PATTERN_VALUES: + raise ValueError( + "Invalid REQ pattern '{}'. 
Expected one of: {}".format( + value, sorted(VALID_PATTERN_VALUES) + ) + ) + return value + + +# v1.5.2 — cardinality gate (Lever 3) + +VALID_REASON_CLASSES = frozenset({ + "out-of-scope", + "deprecated", + "platform-gated", + "handled-upstream", + "intentionally-partial", +}) + +_CELL_ID_RE = re.compile(r"^REQ-\d+/cell-[A-Za-z0-9_]+-[A-Za-z0-9_]+$") + +_COVERS_RE = re.compile( + r"^\s*-\s*Covers:\s*\[(.*?)\]\s*$", re.IGNORECASE | re.MULTILINE +) + +_CONSOLIDATION_RE = re.compile( + r"^\s*-\s*Consolidation rationale:\s*(.+?)\s*$", + re.IGNORECASE | re.MULTILINE, +) + +_BUG_HEADING_RE = re.compile(r"^###\s+BUG-(\d+):", re.MULTILINE) + +# v1.5.2 (C13.8/Fix 1) — evidence locator for present:true grid cells. +# Relative path (no leading '/'), single colon, line number (>=1) or +# range ``N-M`` with both endpoints >=1. Rejects: absolute paths, +# multi-slash roots, URLs, line zero, zero-endpoint ranges. +_EVIDENCE_RE = re.compile(r"^(?!/)[^:]+:[1-9]\d*(-[1-9]\d*)?$") + + +def _parse_covers(bug_block): + m = _COVERS_RE.search(bug_block) + if not m: + return [] + raw = m.group(1).strip() + if not raw: + return [] + items = [s.strip() for s in raw.split(",")] + return [s for s in items if s] + + +def _parse_consolidation_rationale(bug_block): + m = _CONSOLIDATION_RE.search(bug_block) + if not m: + return None + text = m.group(1).strip() + return text or None + + +def _split_bug_blocks(bugs_md_text): + """Return list of (bug_id, body) pairs.""" + positions = [(m.start(), m.group(1)) for m in _BUG_HEADING_RE.finditer(bugs_md_text)] + result = [] + for idx, (start, bug_id) in enumerate(positions): + end = positions[idx + 1][0] if idx + 1 < len(positions) else len(bugs_md_text) + result.append(("BUG-{}".format(bug_id), bugs_md_text[start:end])) + return result + + +def _bug_primary_requirement(block): + m = re.search( + r"^\s*-\s*Primary requirement:\s*(REQ-\d+)", block, re.MULTILINE | re.IGNORECASE + ) + return m.group(1) if m else None + + +def _load_json_or_none(path): + if 
not path.is_file(): + return None + try: + return json.loads(path.read_text(encoding="utf-8")) + except (OSError, json.JSONDecodeError): + return None + + +def _read_text_safe(path): + try: + return path.read_text(encoding="utf-8", errors="replace") + except OSError: + return "" + + +_REQ_HEADING_RE = re.compile(r"^###\s+(REQ-\d+):", re.MULTILINE) + + +def _enumerate_pattern_tagged_reqs(req_text): + """Return {req_id: pattern} for every ### REQ-NNN: block in REQUIREMENTS.md + that carries a ``- Pattern: `` line. + + Raises ValueError if any block's pattern value is not in + VALID_PATTERN_VALUES (delegated to extract_req_pattern()). Blocks without a + Pattern field are omitted from the result (they're not pattern-tagged). + """ + if not req_text: + return {} + positions = [(m.start(), m.group(1)) for m in _REQ_HEADING_RE.finditer(req_text)] + result = {} + for idx, (start, req_id) in enumerate(positions): + end = positions[idx + 1][0] if idx + 1 < len(positions) else len(req_text) + block = req_text[start:end] + pattern = extract_req_pattern(block) + if pattern is not None: + result[req_id] = pattern + return result + + +# v1.5.2 (C13.7/Fix 2) — per-site UC detection. +# Phase 1's Cartesian UC rule emits UC-N.a / UC-N.b / ... for REQs where both +# eligibility gates match. Any REQ block in REQUIREMENTS.md that cites such +# per-site UCs MUST carry a Pattern field — otherwise Phase 2 silently dropped +# it. The regex is deliberately narrow: one lowercase letter suffix only, word +# boundaries on both sides, so bare UC-N and over-suffixed UC-N.a.bad are not +# mistaken for per-site references. +_PER_SITE_UC_RE = re.compile(r"\bUC-\d+\.[a-z]\b") + + +def _enumerate_per_site_uc_reqs(req_text): + """Return {req_id: sorted_list_of_uc_ids} for every ### REQ-NNN: block + that cites at least one per-site UC reference (UC-N.a / UC-N.b / ...). + + REQ blocks without per-site UC references are omitted from the result. 
+ Each returned UC list is deduplicated and lexically sorted. + """ + if not req_text: + return {} + positions = [(m.start(), m.group(1)) for m in _REQ_HEADING_RE.finditer(req_text)] + result = {} + for idx, (start, req_id) in enumerate(positions): + end = positions[idx + 1][0] if idx + 1 < len(positions) else len(req_text) + block = req_text[start:end] + ucs = sorted(set(_PER_SITE_UC_RE.findall(block))) + if ucs: + result[req_id] = ucs + return result + + +def validate_cardinality_gate(repo_dir): + """Run the v1.5.2 cardinality reconciliation gate. + + Returns a list of failure strings. An empty list means the gate passed. + Caller decides how to surface failures (print / fail()). + + Inputs expected in repo_dir/quality/: + - REQUIREMENTS.md (source of pattern-tagged REQs) + - BUGS.md (source of Covers: annotations) + - compensation_grid.json (source of cell set per REQ) + - compensation_grid_downgrades.json (optional; source of downgrade cells) + """ + failures = [] + q = Path(repo_dir) / "quality" + + req_text = _read_text_safe(q / "REQUIREMENTS.md") + + # Enumerate pattern-tagged and per-site-UC REQs up front so the + # downstream cross-checks can run regardless of whether a grid file + # exists. A REQ that cites per-site UCs but lacks Pattern is a failure + # independent of grid presence (in fact, if Pattern is missing there is + # no grid precisely because Pattern is the trigger for producing one). + try: + pattern_tagged = _enumerate_pattern_tagged_reqs(req_text) + except ValueError as exc: + failures.append("REQUIREMENTS.md: {}".format(exc)) + pattern_tagged = {} + try: + per_site = _enumerate_per_site_uc_reqs(req_text) + except ValueError as exc: + failures.append("REQUIREMENTS.md: {}".format(exc)) + per_site = {} + + # Cross-check (C13.7/Fix 2): every REQ that cites per-site UCs (UC-N.a, + # UC-N.b, ...) in REQUIREMENTS.md MUST carry a Pattern field. 
Per-site UCs + # are the structural signal emitted by Phase 1's Cartesian UC rule; if the + # signal is there but Pattern is missing, Phase 2 silently dropped it and + # the v1.4.5 regression vector is live again. Runs regardless of grid + # presence because missing Pattern is exactly what would cause the grid + # to be absent in the first place. + for req_id, uc_ids in per_site.items(): + if req_id not in pattern_tagged: + failures.append( + "cardinality gate: {} has per-site UCs ({}) in REQUIREMENTS.md " + "but is missing the Pattern field — Phase 1 Cartesian UC rule " + "requires Pattern tagging for cross-site REQs (see " + "phase1_prompt confirmation checklist item 6)".format( + req_id, ", ".join(uc_ids) + ) + ) + + grid_path = q / "compensation_grid.json" + grid = _load_json_or_none(grid_path) + if grid is None: + # No grid file: only a problem if any pattern-tagged REQs exist. + if _REQ_PATTERN_RE.search(req_text): + failures.append( + "cardinality gate: pattern-tagged REQs exist but " + "quality/compensation_grid.json is missing" + ) + return failures + + reqs = grid.get("reqs") or {} + if not isinstance(reqs, dict): + failures.append("compensation_grid.json: 'reqs' is not an object") + return failures + + # Cross-check: every pattern-tagged REQ in REQUIREMENTS.md must appear in + # the grid. Omitting a pattern-tagged REQ from the grid was a v1.5.2 escape + # hatch (silently skipped by the per-REQ reconcile loop); close it here. 
+ for req_id, req_pattern in pattern_tagged.items(): + if req_id not in reqs: + failures.append( + "cardinality gate: {} is pattern-tagged '{}' in REQUIREMENTS.md " + "but has no entry in compensation_grid.json".format(req_id, req_pattern) + ) + + # Load BUGS.md and index covers by REQ + bugs_text = _read_text_safe(q / "BUGS.md") + covers_by_req = {} + for bug_id, block in _split_bug_blocks(bugs_text): + covers = _parse_covers(block) + if len(covers) >= 2: + if not _parse_consolidation_rationale(block): + failures.append( + "{}: Covers has {} entries but 'Consolidation rationale:' is missing or empty".format( + bug_id, len(covers) + ) + ) + for cell_id in covers: + if not _CELL_ID_RE.match(cell_id): + failures.append( + "{}: malformed cell ID '{}' (expected REQ-N/cell--)".format( + bug_id, cell_id + ) + ) + continue + req_id = cell_id.split("/", 1)[0] + covers_by_req.setdefault(req_id, set()).add(cell_id) + + # Load downgrades and validate each record + downgrades = _load_json_or_none(q / "compensation_grid_downgrades.json") or {"downgrades": []} + downgrade_cells_by_req = {} + for rec in downgrades.get("downgrades", []): + rid = rec.get("cell_id", "") + if not _CELL_ID_RE.match(rid): + failures.append("downgrade record: malformed cell_id '{}'".format(rid)) + continue + # A downgrade record only counts toward reconciliation once every + # validation below passes. A malformed record emits diagnostic + # failure strings AND stays out of downgrade_cells_by_req, so the + # per-REQ uncovered-cells calculation still flags the cell. 
+ rec_ok = True + for field in ("authority_ref", "site_citation", "reason_class", "falsifiable_claim"): + value = rec.get(field) + if not value or not isinstance(value, str) or not value.strip(): + failures.append( + "downgrade record {}: missing or empty field '{}'".format(rid, field) + ) + rec_ok = False + reason = rec.get("reason_class", "") + if reason and reason not in VALID_REASON_CLASSES: + failures.append( + "downgrade record {}: reason_class '{}' not in {}".format( + rid, reason, sorted(VALID_REASON_CLASSES) + ) + ) + rec_ok = False + if not rec_ok: + continue + req_id = rid.split("/", 1)[0] + downgrade_cells_by_req.setdefault(req_id, set()).add(rid) + + # Reconcile per-REQ + for req_id, entry in reqs.items(): + pattern = entry.get("pattern") + if pattern not in {"whitelist", "parity", "compensation"}: + failures.append( + "compensation_grid.json: {} has invalid or missing pattern '{}'".format( + req_id, pattern + ) + ) + continue + cells = entry.get("cells") or [] + # v1.5.2 (C13.8/Fix 2): pre-validate each cell's 'present' field is a + # strict bool. Non-bool values (string "true", int 1, None, missing key) + # would otherwise fall between the 'is False' absent-cell branch and + # the 'is not True' present-cell evidence branch, escaping both checks. + # Same silent-bypass family as B1 — diagnose AND skip the cell, do not + # let it count toward coverage accounting. + valid_cells = [] + for c in cells: + if not isinstance(c, dict): + continue + present = c.get("present") + if not isinstance(present, bool): + cell_id = c.get("cell_id") or "" + failures.append( + "{}: cell {} 'present' must be boolean true or false; got {!r}".format( + req_id, cell_id, present + ) + ) + continue + valid_cells.append(c) + + grid_cell_ids = {c.get("cell_id") for c in valid_cells} + grid_cell_ids.discard(None) + # Only absent cells require coverage. Identity check is safe now — + # every element of valid_cells has 'present' as a strict bool. 
+ absent_cells = { + c.get("cell_id") for c in valid_cells + if c.get("present") is False + } + absent_cells.discard(None) + + # v1.5.2 (C13.6/B2): present:true cells must carry a non-empty + # 'evidence' field in file:line form. Without this, a reviewer or LLM + # can claim any cell is present, supply nothing, and the gate accepts + # it — the bypass Round 5 Council called the highest remaining risk. + for c in valid_cells: + if c.get("present") is not True: + continue + cell_id = c.get("cell_id") or "" + evidence = c.get("evidence") + if not evidence or not isinstance(evidence, str) or not evidence.strip(): + failures.append( + "{}: present:true requires non-empty 'evidence' field with file:line citation".format(cell_id) + ) + continue + if not _EVIDENCE_RE.match(evidence.strip()): + failures.append( + "{}: 'evidence' must be file:line (e.g. 'path/to.c:123' or 'path/to.c:120-140'); got {!r}".format( + cell_id, evidence + ) + ) + + covered = covers_by_req.get(req_id, set()) + downgraded = downgrade_cells_by_req.get(req_id, set()) + uncovered = absent_cells - covered - downgraded + + if uncovered: + failures.append( + "{}: uncovered cells — {}".format(req_id, ", ".join(sorted(uncovered))) + ) + + # Every covered cell must be in the grid + stray = (covered | downgraded) - grid_cell_ids + if stray: + failures.append( + "{}: Covers/downgrade cells not in grid — {}".format( + req_id, ", ".join(sorted(stray)) + ) + ) + + return failures + + +def _reset_counters(): + global FAIL, WARN + FAIL = 0 + WARN = 0 + + +def fail(msg, reason=None, *, line=None): + """Emit a structured failure line and increment FAIL. + + Phase 5 r3 format: `[:]: ` — no "FAIL:" label, so + output is grep-parseable as `^[^:]+:[0-9]*:? .+$`. The prefix `FAIL:` is + deliberately removed; the global FAIL counter (summarised in main()) is + the authoritative count of failures per run. 
+ + Preferred forms: + fail("quality/INDEX.md", "file missing") + -> " quality/INDEX.md: file missing" + fail("quality/INDEX.md", "missing required field 'x'", line=42) + -> " quality/INDEX.md:42: missing required field 'x'" + + Legacy single-arg form (transitional; still supported — most v1.4.x + messages already embed a path-like token): + fail("BUGS.md missing or not a file") + -> " BUGS.md missing or not a file" + """ + global FAIL + if reason is None: + print(f" {msg}") + elif line is None: + print(f" {msg}: {reason}") + else: + print(f" {msg}:{line}: {reason}") + FAIL += 1 + + +def pass_(msg): + print(f" PASS: {msg}") + + +def warn(msg): + global WARN + print(f" WARN: {msg}") + WARN += 1 + + +def info(msg): + print(f" INFO: {msg}") + + +# --- JSON helpers (proper parsing, not grep-style) --- + + +def load_json(path): + """Parse JSON file. Return parsed value, or None on any error.""" + if not path.is_file(): + return None + try: + with open(path, "r", encoding="utf-8") as f: + return json.load(f) + except (OSError, json.JSONDecodeError): + return None + + +def has_key(data, key): + """True if `data` is a dict containing `key`.""" + return isinstance(data, dict) and key in data + + +def get_str(data, key): + """Return data[key] if it's a string, else empty string.""" + if not isinstance(data, dict): + return "" + val = data.get(key) + return val if isinstance(val, str) else "" + + +def count_per_bug_field(bugs_list, field): + """Count bugs in list that have `field` set.""" + if not isinstance(bugs_list, list): + return 0 + return sum(1 for b in bugs_list if isinstance(b, dict) and field in b) + + +# --- File helpers --- + + +# v1.5.4 Phase 3.6.4 (B-16): the end-of-run reorg moves intermediate +# pipeline artifacts under quality/workspace/. The gate reads each of +# those subdirectories at multiple sites; _resolve_artifact_path +# centralises the dual-layout lookup so each site stays one-line. 
+# Top-level wins (legacy / pre-reorg layout); workspace/ is the v1.5.4 +# canonical location after _finalize_quality_layout has run. +_WORKSPACE_DIRS = ( + "control_prompts", + "results", + "code_reviews", + "spec_audits", + "patches", + "writeups", + "mechanical", + "phase3", +) + + +def _resolve_artifact_path(quality_dir, name): + """Return the live path for an intermediate artifact directory or + file under quality/. Tries top-level first (the legacy / current + in-flight layout), then quality/workspace/ (the v1.5.4 + end-of-run reorg layout). Returns the top-level path even when + neither exists so callers that test ``.is_dir()`` / ``.is_file()`` + get a False rather than an exception. + + ``name`` may be a single segment (``"results"``) or a path with + segments (``"results/tdd-results.json"``); both forms work + regardless of layout.""" + top = quality_dir / name + if top.exists(): + return top + workspace = quality_dir / "workspace" / name + if workspace.exists(): + return workspace + return top + + +def has_file_matching(directory, patterns): + """True if any file in `directory` (non-recursive) matches any glob pattern.""" + if not directory.is_dir(): + return False + for pat in patterns: + for _ in directory.glob(pat): + return True + return False + + +def count_files_matching(directory, pattern): + """Count files in `directory` (non-recursive) matching glob pattern.""" + if not directory.is_dir(): + return 0 + return sum(1 for _ in directory.glob(pattern)) + + +def first_file_matching(directory, patterns): + """Return first matching path or None.""" + if not directory.is_dir(): + return None + for pat in patterns: + for p in directory.glob(pat): + return p + return None + + +def file_contains(path, pattern): + """True if any line in file matches pattern (regex string or compiled).""" + if not path.is_file(): + return False + if isinstance(pattern, str): + pattern = re.compile(pattern) + try: + with open(path, "r", encoding="utf-8", errors="replace") as 
f: + for line in f: + if pattern.search(line): + return True + except OSError: + pass + return False + + +def read_first_line_stripped(path): + """Return first line of file with whitespace stripped.""" + if not path.is_file(): + return "" + try: + with open(path, "r", encoding="utf-8", errors="replace") as f: + line = f.readline() + except OSError: + return "" + return re.sub(r"\s", "", line) + + +def validate_iso_date(date_str): + """Return one of: 'valid', 'placeholder', 'future', 'bad_format', 'empty'. + + Placeholders are checked before format so that 'YYYY-MM-DD' is reported + as 'placeholder' rather than 'bad_format'. The bash version's order was + flipped, causing 'YYYY-MM-DD' to be misreported — both still FAIL but the + Python version gives the clearer message. + """ + if not date_str: + return "empty" + if date_str in ("YYYY-MM-DD", "0000-00-00"): + return "placeholder" + date_part = date_str[:10] + if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", date_part): + return "bad_format" + if len(date_str) > 10 and not re.fullmatch(r"T\d{2}:\d{2}:\d{2}(Z|[+-]\d{2}:\d{2})?", date_str[10:]): + return "bad_format" + today = date.today().isoformat() + if date_part > today: + return "future" + return "valid" + + +def detect_skill_version(locations): + """Read `version:` value from the first existing SKILL.md-like file.""" + for loc in locations: + if loc.is_file(): + try: + with open(loc, "r", encoding="utf-8", errors="replace") as f: + for line in f: + m = re.match(r"^\s*(?:version:|\*\*Version:\*\*)\s*([0-9]+(?:\.[0-9]+)+)\b", + line, re.IGNORECASE) + if m: + return m.group(1) + except OSError: + continue + return "" + + +def read_skill_value_line(path, prefix): + """Mimic: grep -m1 'prefix' FILE | sed 's/.*prefix *//' | tr -d ' '.""" + if not path.is_file(): + return "" + try: + with open(path, "r", encoding="utf-8", errors="replace") as f: + for line in f: + if prefix in line: + v = re.sub(rf".*{re.escape(prefix)}\s*", "", line, count=1) + return v.replace(" ", 
"").rstrip("\n").rstrip("\r") + except OSError: + pass + return "" + + +def detect_project_language(repo_dir): + """Walk up to 3 dirs deep, return first language whose extension is present. + + Mirrors bash `find -maxdepth 3 -not -path ...` behavior. + """ + language_order = [ + ("go", ".go"), + ("py", ".py"), + ("java", ".java"), + ("kt", ".kt"), + ("rs", ".rs"), + ("ts", ".ts"), + ("js", ".js"), + ("scala", ".scala"), + ("c", ".c"), + ("agc", ".agc"), + ] + excluded = {"vendor", "node_modules", ".git", "quality", "repos"} + + def present(base, target_ext): + stack = [(Path(base), 1)] + while stack: + curr, depth = stack.pop() + try: + for entry in os.scandir(curr): + name = entry.name + if entry.is_dir(follow_symlinks=False): + if name in excluded: + continue + if depth < 3: + stack.append((Path(entry.path), depth + 1)) + elif entry.is_file(follow_symlinks=False): + if name.endswith(target_ext): + return True + except (OSError, PermissionError): + continue + return False + + for lang, ext in language_order: + if present(repo_dir, ext): + return lang + return "" + + +def count_source_files(repo_dir): + """Count source files up to 4 dirs deep, excluding vendor/node_modules/etc.""" + src_count = 0 + exts = {".go", ".py", ".java", ".kt", ".rs", ".ts", ".js", ".scala", + ".c", ".h", ".agc"} + excluded = {"vendor", "node_modules", ".git", "quality"} + + def walk(base, current_depth, max_depth): + nonlocal src_count + try: + for entry in os.scandir(base): + name = entry.name + if entry.is_dir(follow_symlinks=False): + if current_depth < max_depth and name not in excluded: + walk(entry.path, current_depth + 1, max_depth) + elif entry.is_file(follow_symlinks=False): + dot = name.rfind(".") + if dot >= 0 and name[dot:] in exts: + src_count += 1 + except (OSError, PermissionError): + pass + + walk(str(repo_dir), 1, 4) + return src_count + + +# --- Section checks --- + + +def check_file_existence(repo_dir, q, strictness): + """File existence section (benchmark 40).""" + 
print("[File Existence]") + for f in ["BUGS.md", "REQUIREMENTS.md", "QUALITY.md", "PROGRESS.md", + "COVERAGE_MATRIX.md", "COMPLETENESS_REPORT.md"]: + if (q / f).is_file(): + pass_(f"{f} exists") + else: + fail(f"{f} missing") + + for f in ["CONTRACTS.md", "RUN_CODE_REVIEW.md", "RUN_SPEC_AUDIT.md", + "RUN_INTEGRATION_TESTS.md", "RUN_TDD_TESTS.md"]: + if (q / f).is_file(): + pass_(f"{f} exists") + else: + fail(f"{f} missing") + + if has_file_matching(q, ["test_functional.*", "functional_test.*", + "FunctionalSpec.*", "FunctionalTest.*", + "functional.test.*"]): + pass_("functional test file exists") + else: + fail("functional test file missing (test_functional.*, functional_test.*, FunctionalSpec.*, FunctionalTest.*, functional.test.*)") + + if (repo_dir / "AGENTS.md").is_file(): + pass_("AGENTS.md exists") + else: + fail("AGENTS.md missing (required at project root)") + + if (q / "EXPLORATION.md").is_file(): + pass_("EXPLORATION.md exists") + _check_exploration_sections(q / "EXPLORATION.md") + else: + fail("EXPLORATION.md missing") + + cr_dir = _resolve_artifact_path(q, "code_reviews") + if cr_dir.is_dir() and has_file_matching(cr_dir, ["*.md"]): + pass_("code_reviews/ has .md files") + else: + fail("code_reviews/ missing or empty") + + sa_dir = _resolve_artifact_path(q, "spec_audits") + if sa_dir.is_dir(): + triage_count = count_files_matching(sa_dir, "*triage*") + auditor_count = count_files_matching(sa_dir, "*auditor*") + if triage_count > 0: + pass_("spec_audits/ has triage file") + else: + fail("spec_audits/ missing triage file") + if auditor_count > 0: + pass_(f"spec_audits/ has {auditor_count} auditor file(s)") + else: + fail("spec_audits/ missing individual auditor files") + + if triage_count > 0: + has_probes = False + if (sa_dir / "triage_probes.sh").is_file(): + has_probes = True + pass_("triage_probes.sh exists (executable triage evidence)") + elif (_resolve_artifact_path(q, "mechanical/verify.sh")).is_file() and \ + 
file_contains(_resolve_artifact_path(q, "mechanical/verify.sh"), r"probe|triage|auditor"): + has_probes = True + pass_("verify.sh contains triage probe assertions") + if not has_probes: + msg = "No executable triage evidence found (expected spec_audits/triage_probes.sh or probe assertions in mechanical/verify.sh)" + if strictness == "benchmark": + fail(msg) + else: + warn(msg) + else: + fail("spec_audits/ directory missing") + + +def check_bugs_heading(q): + """BUGS.md heading-format section (benchmark 39). + + Returns (bug_count, bug_ids). + """ + print("[BUGS.md Heading Format]") + bugs_md = q / "BUGS.md" + if not bugs_md.is_file(): + fail("BUGS.md missing") + return 0, [] + + try: + bugs_content = bugs_md.read_text(encoding="utf-8", errors="replace") + except OSError: + bugs_content = "" + lines = bugs_content.splitlines() + + correct_headings = sum(1 for ln in lines + if re.match(r"^### BUG-([HML]|[0-9])[0-9]*", ln)) + wrong_headings = sum(1 for ln in lines + if re.match(r"^## BUG-", ln) + and not re.match(r"^### BUG-", ln)) + deep_headings = sum(1 for ln in lines + if re.match(r"^#{4,} BUG-([HML]|[0-9])", ln)) + bold_headings = sum(1 for ln in lines + if re.match(r"^\*\*BUG-([HML]|[0-9])", ln)) + bullet_headings = sum(1 for ln in lines + if re.match(r"^- BUG-([HML]|[0-9])", ln)) + + bug_count = correct_headings + + if (correct_headings > 0 and wrong_headings == 0 and deep_headings == 0 + and bold_headings == 0 and bullet_headings == 0): + pass_(f"All {correct_headings} bug headings use ### BUG-NNN format") + else: + if wrong_headings > 0: + fail(f"{wrong_headings} heading(s) use ## instead of ###") + if deep_headings > 0: + fail(f"{deep_headings} heading(s) use #### or deeper instead of ###") + if bold_headings > 0: + fail(f"{bold_headings} heading(s) use **BUG- format") + if bullet_headings > 0: + fail(f"{bullet_headings} heading(s) use - BUG- format") + if correct_headings == 0 and wrong_headings == 0: + if re.search(r"^##\s+(No confirmed bugs|Zero confirmed 
bugs)\s*$", + bugs_content, re.MULTILINE | re.IGNORECASE): + pass_("Zero-bug run — no headings expected") + else: + bug_count = wrong_headings + deep_headings + bold_headings + bullet_headings + warn("No ### BUG-NNN headings found in BUGS.md") + else: + bug_count = correct_headings + wrong_headings + deep_headings + bold_headings + bullet_headings + + # Extract canonical bug IDs: BUG-NNN or BUG-HNN / BUG-MNN / BUG-LNN + raw = re.findall(r"BUG-(?:[HML][0-9]+|[0-9]+)", bugs_content) + filtered = [b for b in raw if re.fullmatch(r"BUG-(?:[HML][0-9]+|[0-9]+)", b)] + bug_ids = sorted(set(filtered)) + + return bug_count, bug_ids + + +def check_tdd_sidecar(q, bug_count): + """TDD sidecar JSON (benchmarks 14, 41).""" + print("[TDD Sidecar JSON]") + json_path = _resolve_artifact_path(q, "results/tdd-results.json") + + if bug_count <= 0: + info("Zero bugs — tdd-results.json not required") + return None + + if not json_path.is_file(): + fail(f"tdd-results.json missing ({bug_count} bugs require it)") + return None + + pass_(f"tdd-results.json exists ({bug_count} bugs)") + + data = load_json(json_path) + if data is None: + # File exists but unparsable — fail all root key checks + for key in ["schema_version", "skill_version", "date", "project", + "bugs", "summary"]: + fail(f"missing root key '{key}'") + fail("schema_version is 'missing', expected '1.1'") + return None + + for key in ["schema_version", "skill_version", "date", "project", + "bugs", "summary"]: + if has_key(data, key): + pass_(f"has '{key}'") + else: + fail(f"missing root key '{key}'") + + sv = get_str(data, "schema_version") + if sv == "1.1": + pass_("schema_version is '1.1'") + else: + fail(f"schema_version is '{sv or 'missing'}', expected '1.1'") + + bugs_list = data.get("bugs") if isinstance(data, dict) else None + if not isinstance(bugs_list, list): + bugs_list = [] + + for field in ["id", "requirement", "red_phase", "green_phase", + "verdict", "fix_patch_present", "writeup_path"]: + fcount = count_per_bug_field(bugs_list, 
field) + if fcount >= bug_count: + pass_(f"per-bug field '{field}' present ({fcount}x)") + elif fcount > 0: + warn(f"per-bug field '{field}' found {fcount}x, expected {bug_count}") + else: + fail(f"per-bug field '{field}' missing entirely") + + # Non-canonical field names (at any level — check root and bugs) + bad_fields = ["bug_id", "bug_name", "status", "phase", "result"] + for bad in bad_fields: + found = has_key(data, bad) or any( + has_key(b, bad) for b in bugs_list if isinstance(b, dict) + ) + if found: + fail(f"non-canonical field '{bad}' found (use standard field names)") + + summary = data.get("summary") if isinstance(data, dict) else None + if not isinstance(summary, dict): + summary = {} + for skey in ["total", "verified", "confirmed_open", "red_failed", "green_failed"]: + if skey in summary: + pass_(f"summary has '{skey}'") + else: + fail(f"summary missing '{skey}' count") + + # Date validation + tdd_date = get_str(data, "date") + status = validate_iso_date(tdd_date) + if status == "empty": + fail("tdd-results.json date field missing or empty") + elif status == "bad_format": + fail(f"tdd-results.json date '{tdd_date}' is not ISO 8601 (YYYY-MM-DD)") + elif status == "placeholder": + fail(f"tdd-results.json date is placeholder '{tdd_date}'") + elif status == "future": + fail(f"tdd-results.json date '{tdd_date}' is in the future") + else: + pass_(f"tdd-results.json date '{tdd_date}' is valid") + + # Verdict enum + allowed_verdicts = {"TDD verified", "red failed", "green failed", + "confirmed open", "deferred"} + bad_verdicts = 0 + for b in bugs_list: + if isinstance(b, dict) and "verdict" in b: + v = b.get("verdict") + if v not in allowed_verdicts: + bad_verdicts += 1 + if bad_verdicts == 0: + pass_("all verdict values are canonical") + else: + fail(f"{bad_verdicts} non-canonical verdict value(s)") + + return data + + +def check_tdd_logs(q, bug_count, bug_ids, tdd_data): + """TDD log files and sidecar-to-log cross-validation.""" + print("[TDD Log Files]") 
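The first-line status-tag contract enforced below can be sketched in isolation. This is a minimal illustration, not the gate's own code: `first_line_stripped` mirrors what `read_first_line_stripped` is assumed to do, and the log contents are hypothetical.

```python
VALID_TAGS = {"RED", "GREEN", "NOT_RUN", "ERROR"}

def first_line_stripped(text):
    # The status tag must be the very first line of the log file.
    lines = text.splitlines()
    return lines[0].strip() if lines else ""

red_log = "RED\npytest: 1 failed, 12 passed in 0.42s\n"
bad_log = "pytest: 1 failed, 12 passed\nRED\n"   # tag buried mid-file: invalid

print(first_line_stripped(red_log) in VALID_TAGS)  # True
print(first_line_stripped(bad_log) in VALID_TAGS)  # False
```

A tag anywhere other than line one counts as missing, which is why the gate reads only the first line rather than searching the whole log.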
+ if bug_count <= 0: + info("Zero bugs — TDD log files not required") + return + + patches_dir = _resolve_artifact_path(q, "patches") + results_dir = _resolve_artifact_path(q, "results") + valid_tags = {"RED", "GREEN", "NOT_RUN", "ERROR"} + + red_found = 0 + red_missing = 0 + green_found = 0 + green_missing = 0 + green_expected = 0 + red_bad_tag = 0 + green_bad_tag = 0 + + for bid in bug_ids: + red_log = results_dir / f"{bid}.red.log" + if red_log.is_file(): + red_found += 1 + tag = read_first_line_stripped(red_log) + if tag not in valid_tags: + red_bad_tag += 1 + else: + red_missing += 1 + + fix_patch = first_file_matching(patches_dir, [f"{bid}-fix*.patch"]) + if fix_patch is not None: + green_expected += 1 + green_log = results_dir / f"{bid}.green.log" + if green_log.is_file(): + green_found += 1 + tag = read_first_line_stripped(green_log) + if tag not in valid_tags: + green_bad_tag += 1 + else: + green_missing += 1 + + if red_missing == 0 and red_found > 0: + pass_(f"All {red_found} confirmed bug(s) have red-phase logs") + elif red_found > 0: + fail(f"{red_missing} confirmed bug(s) missing red-phase log (BUG-NNN.red.log)") + else: + fail("No red-phase logs found (every confirmed bug needs quality/results/BUG-NNN.red.log)") + + if green_expected > 0: + if green_missing == 0: + pass_(f"All {green_found} bug(s) with fix patches have green-phase logs") + else: + fail(f"{green_missing} bug(s) with fix patches missing green-phase log (BUG-NNN.green.log)") + else: + info("No fix patches found — green-phase logs not required") + + if red_bad_tag > 0: + fail(f"{red_bad_tag} red-phase log(s) missing valid first-line status tag (expected RED/GREEN/NOT_RUN/ERROR)") + elif red_found > 0: + pass_("All red-phase logs have valid status tags") + if green_bad_tag > 0: + fail(f"{green_bad_tag} green-phase log(s) missing valid first-line status tag (expected RED/GREEN/NOT_RUN/ERROR)") + elif green_found > 0: + pass_("All green-phase logs have valid status tags") + + # 
Sidecar-to-log cross-validation (BUG-M18) + if tdd_data is not None and isinstance(tdd_data, dict): + bugs_list = tdd_data.get("bugs") or [] + if not isinstance(bugs_list, list): + bugs_list = [] + # Index bugs by id for lookup + bug_by_id = {} + for b in bugs_list: + if isinstance(b, dict) and isinstance(b.get("id"), str): + bug_by_id[b["id"]] = b + + xv_checked = 0 + xv_mismatch = 0 + + for bid in bug_ids: + bug_obj = bug_by_id.get(bid) + sidecar_red = get_str(bug_obj, "red_phase") if bug_obj else "" + sidecar_green = get_str(bug_obj, "green_phase") if bug_obj else "" + + red_log = results_dir / f"{bid}.red.log" + if sidecar_red and red_log.is_file(): + log_tag = read_first_line_stripped(red_log) + xv_checked += 1 + if sidecar_red == "fail" and log_tag != "RED": + xv_mismatch += 1 + fail(f"{bid}: sidecar red_phase='{sidecar_red}' but log first-line is '{log_tag}' (expected RED)") + elif sidecar_red == "pass" and log_tag != "GREEN": + xv_mismatch += 1 + fail(f"{bid}: sidecar red_phase='{sidecar_red}' but log first-line is '{log_tag}' (expected GREEN)") + + green_log = results_dir / f"{bid}.green.log" + if sidecar_green and green_log.is_file(): + log_tag = read_first_line_stripped(green_log) + xv_checked += 1 + if sidecar_green == "pass" and log_tag != "GREEN": + xv_mismatch += 1 + fail(f"{bid}: sidecar green_phase='{sidecar_green}' but log first-line is '{log_tag}' (expected GREEN)") + elif sidecar_green == "fail" and log_tag != "RED": + xv_mismatch += 1 + fail(f"{bid}: sidecar green_phase='{sidecar_green}' but log first-line is '{log_tag}' (expected RED)") + + if xv_checked > 0 and xv_mismatch == 0: + pass_(f"Sidecar-to-log cross-validation passed ({xv_checked} checks)") + elif xv_checked == 0: + info("Sidecar-to-log cross-validation: no matching pairs to check") + + # TDD_TRACEABILITY.md + if red_found > 0: + if (q / "TDD_TRACEABILITY.md").is_file(): + pass_(f"TDD_TRACEABILITY.md exists ({red_found} bugs with red-phase results)") + else: + 
fail("TDD_TRACEABILITY.md missing (mandatory when bugs have red-phase results)") + + +def check_integration_sidecar(q, strictness): + """Integration sidecar JSON section.""" + print("[Integration Sidecar JSON]") + ij = _resolve_artifact_path(q, "results/integration-results.json") + + if not ij.is_file(): + if strictness == "benchmark": + warn("integration-results.json not present") + else: + info("integration-results.json not present (optional in general mode)") + return + + data = load_json(ij) + + for key in ["schema_version", "skill_version", "date", "project", + "recommendation", "groups", "summary", "uc_coverage"]: + if has_key(data, key): + pass_(f"has '{key}'") + else: + fail(f"missing key '{key}'") + + summary = data.get("summary") if isinstance(data, dict) else None + if not isinstance(summary, dict): + summary = {} + for iskey in ["total_groups", "passed", "failed", "skipped"]: + if iskey in summary: + pass_(f"integration summary has '{iskey}'") + else: + fail(f"integration summary missing required sub-key '{iskey}'") + + isv = get_str(data, "schema_version") + if isv == "1.1": + pass_("integration schema_version is '1.1'") + else: + fail(f"integration schema_version is '{isv or 'missing'}', expected '1.1'") + + int_date = get_str(data, "date") + if int_date: # match bash: if [ -n "$int_date" ] + status = validate_iso_date(int_date) + if status == "bad_format": + fail(f"integration-results.json date '{int_date}' is not ISO 8601 (YYYY-MM-DD)") + elif status == "placeholder": + fail(f"integration-results.json date is placeholder '{int_date}'") + elif status == "future": + fail(f"integration-results.json date '{int_date}' is in the future") + else: + pass_(f"integration-results.json date '{int_date}' is valid") + + rec = get_str(data, "recommendation") + if rec in ("SHIP", "FIX BEFORE MERGE", "BLOCK"): + pass_(f"recommendation '{rec}' is canonical") + elif rec: + fail(f"recommendation '{rec}' is non-canonical (must be SHIP/FIX BEFORE MERGE/BLOCK)") + else: + 
fail("recommendation missing") + + # groups[].result enum + allowed_results = {"pass", "fail", "skipped", "error"} + bad_results = 0 + groups = data.get("groups") if isinstance(data, dict) else None + if isinstance(groups, list): + for g in groups: + if isinstance(g, dict) and "result" in g: + if g.get("result") not in allowed_results: + bad_results += 1 + if bad_results == 0: + pass_("all groups[].result values are canonical") + else: + fail(f"{bad_results} non-canonical groups[].result value(s) (must be pass/fail/skipped/error)") + + # uc_coverage value enum + allowed_uc = {"covered_pass", "covered_fail", "not_mapped"} + bad_uc = 0 + uc_cov = data.get("uc_coverage") if isinstance(data, dict) else None + if isinstance(uc_cov, dict): + for v in uc_cov.values(): + if v not in allowed_uc: + bad_uc += 1 + if bad_uc == 0: + pass_("all uc_coverage values are canonical") + else: + fail(f"{bad_uc} non-canonical uc_coverage value(s) (must be covered_pass/covered_fail/not_mapped)") + + +def check_recheck_sidecar(q): + """Recheck sidecar JSON (schema 1.0, uses 'results' key not 'bugs').""" + print("[Recheck Sidecar JSON]") + rj = _resolve_artifact_path(q, "results/recheck-results.json") + rs = _resolve_artifact_path(q, "results/recheck-summary.md") + + if not rj.is_file(): + info("recheck-results.json not present (only required when recheck mode was run)") + return + + pass_("recheck-results.json exists") + data = load_json(rj) + + # SKILL.md recheck template uses 'results' as the array key, not 'bugs'. 
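The `results`-vs-`bugs` key distinction is the easiest part of this schema to get wrong. A minimal sketch of a recheck sidecar that would satisfy the root-key, schema-version, and date checks below — all field values are hypothetical, and the `summary` sub-keys are illustrative only (this section of the gate does not validate them):

```python
import json
import re

# Hypothetical minimal recheck sidecar (schema 1.0). Note the array key is
# "results", not "bugs" as in tdd-results.json.
sidecar = json.loads("""
{
  "schema_version": "1.0",
  "skill_version": "1.5.6",
  "date": "2026-05-10",
  "project": "demo",
  "results": [],
  "summary": {"total": 0}
}
""")

required = ["schema_version", "skill_version", "date", "project",
            "results", "summary"]
missing = [k for k in required if k not in sidecar]
print(missing)                        # []
print(sidecar["schema_version"])      # 1.0
# Shape-level ISO 8601 check (YYYY-MM-DD), like validate_iso_date's first step
print(bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", sidecar["date"])))  # True
```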
+ for key in ["schema_version", "skill_version", "date", "project", + "results", "summary"]: + if has_key(data, key): + pass_(f"recheck has '{key}'") + else: + fail(f"recheck missing root key '{key}'") + + rsv = get_str(data, "schema_version") + if rsv == "1.0": + pass_("recheck schema_version is '1.0'") + else: + fail(f"recheck schema_version is '{rsv or 'missing'}', expected '1.0'") + + rdate = get_str(data, "date") + if rdate: + status = validate_iso_date(rdate) + if status == "bad_format": + fail(f"recheck-results.json date '{rdate}' is not ISO 8601 (YYYY-MM-DD)") + elif status == "placeholder": + fail(f"recheck-results.json date is placeholder '{rdate}'") + elif status == "future": + fail(f"recheck-results.json date '{rdate}' is in the future") + else: + pass_(f"recheck-results.json date '{rdate}' is valid") + + if rs.is_file(): + pass_("recheck-summary.md exists") + else: + fail("recheck-summary.md missing (required companion to recheck-results.json)") + + +def check_use_cases(repo_dir, q, strictness): + """Use case identifier section (benchmarks 43, 48).""" + print("[Use Cases]") + req_md = q / "REQUIREMENTS.md" + if not req_md.is_file(): + fail("REQUIREMENTS.md missing") + return + + try: + req_content = req_md.read_text(encoding="utf-8", errors="replace") + except OSError: + req_content = "" + + # uc_ids: count of lines matching UC-N (bash grep -cE counts lines) + uc_ids = sum(1 for ln in req_content.splitlines() + if re.search(r"UC-[0-9]+", ln)) + uc_unique = len(set(re.findall(r"UC-[0-9]+", req_content))) + + src_count = count_source_files(repo_dir) if repo_dir.is_dir() else 0 + min_uc = 3 if src_count < 5 else 5 + + if uc_unique >= min_uc: + pass_(f"Found {uc_unique} distinct UC identifiers ({uc_ids} total references, {src_count} source files)") + elif uc_unique > 0: + connector = "for" if strictness == "general" else "required for" + msg = f"Only {uc_unique} distinct UC identifiers (minimum {min_uc} {connector} {src_count} source files)" + if 
strictness == "general": + warn(msg) + else: + fail(msg) + else: + fail("No canonical UC-NN identifiers in REQUIREMENTS.md") + + +def check_test_file_extension(repo_dir, q): + """Test file extension matches project language (benchmark 47).""" + print("[Test File Extension]") + func_test = first_file_matching(q, ["test_functional.*", "functional_test.*", + "FunctionalSpec.*", "FunctionalTest.*", + "functional.test.*"]) + reg_test = first_file_matching(q, ["test_regression.*"]) + + if func_test is None: + warn("No functional test file found across the supported naming matrix") + return + + ext = func_test.suffix.lstrip(".") if func_test.suffix else "" + detected_lang = detect_project_language(repo_dir) if repo_dir.is_dir() else "" + + if not detected_lang: + info(f"Cannot detect project language — skipping extension check (test_functional.{ext})") + return + + lang_to_valid = { + "go": "go", + "py": "py", + "java": "java", + "kt": "kt java", + "rs": "rs", + "ts": "ts", + "js": "js ts", + "scala": "scala", + "c": "c py sh", + "agc": "py sh", + } + valid_ext = lang_to_valid.get(detected_lang, "") + valid_list = valid_ext.split() + primary = valid_list[0] if valid_list else "" + + if ext in valid_list: + pass_(f"{func_test.name} matches project language ({detected_lang})") + else: + fail(f"{func_test.name} does not match project language ({detected_lang}) — expected .{primary}") + + if reg_test is not None: + reg_ext = reg_test.suffix.lstrip(".") if reg_test.suffix else "" + if reg_ext in valid_list: + pass_(f"test_regression.{reg_ext} matches project language ({detected_lang})") + else: + fail(f"test_regression.{reg_ext} does not match project language ({detected_lang}) — expected .{primary}") + + +def check_terminal_gate(q): + """Terminal Gate section in PROGRESS.md.""" + print("[Terminal Gate]") + progress_md = q / "PROGRESS.md" + if not progress_md.is_file(): + return + pat = re.compile(r"^#+ *Terminal", re.IGNORECASE | re.MULTILINE) + if file_contains(progress_md, 
pat): + pass_("PROGRESS.md has Terminal Gate section") + else: + fail("PROGRESS.md missing Terminal Gate section") + + +def check_mechanical(q): + """Mechanical verification section.""" + print("[Mechanical Verification]") + mech_dir = _resolve_artifact_path(q, "mechanical") + if not mech_dir.is_dir(): + info("No mechanical/ directory") + return + verify_sh = mech_dir / "verify.sh" + if not verify_sh.is_file(): + fail("mechanical/ exists but verify.sh missing") + return + pass_("verify.sh exists") + + mv_log = _resolve_artifact_path(q, "results/mechanical-verify.log") + mv_exit = _resolve_artifact_path(q, "results/mechanical-verify.exit") + if mv_log.is_file() and mv_exit.is_file(): + try: + exit_code = mv_exit.read_text(encoding="utf-8", errors="replace") + except OSError: + exit_code = "" + exit_code = re.sub(r"\s", "", exit_code) + if exit_code == "0": + pass_("mechanical-verify.exit is 0") + else: + fail(f"mechanical-verify.exit is '{exit_code}', expected 0") + else: + fail("Verification receipt files missing") + + +def check_patches(q, bug_count, bug_ids, strictness): + """Patches section (benchmark 44).""" + print("[Patches]") + if bug_count <= 0: + return + + patches_dir = _resolve_artifact_path(q, "patches") + + # Regression test file — required when bugs exist + reg_test_file = None + if q.is_dir(): + reg_files = sorted(q.glob("test_regression.*")) + if reg_files: + reg_test_file = reg_files[0] + + if reg_test_file is not None: + pass_(f"test_regression.* exists ({bug_count} confirmed bugs require it)") + else: + msg = "test_regression.* missing — required when bugs exist (SKILL.md artifact contract)" + if strictness == "benchmark": + fail(msg) + else: + warn(msg) + + reg_patch_count = 0 + fix_patch_count = 0 + reg_patch_missing = 0 + for bid in bug_ids: + if first_file_matching(patches_dir, [f"{bid}-regression*.patch"]) is not None: + reg_patch_count += 1 + else: + reg_patch_missing += 1 + if first_file_matching(patches_dir, [f"{bid}-fix*.patch"]) is not 
None: + fix_patch_count += 1 + + if reg_patch_missing == 0 and reg_patch_count > 0: + pass_(f"{reg_patch_count} regression-test patch(es) for {bug_count} bug(s)") + elif reg_patch_count > 0: + fail(f"{reg_patch_missing} bug(s) missing regression-test patch") + else: + fail("No regression-test patches found (quality/patches/BUG-NNN-regression-test.patch required)") + + if fix_patch_count > 0: + pass_(f"{fix_patch_count} fix patch(es)") + else: + warn("0 fix patches (fix patches are optional but strongly encouraged)") + + total_patches = reg_patch_count + fix_patch_count + info(f"Total: {total_patches} patch file(s) in quality/patches/") + + +# Unfilled-template sentinel phrases produced by the Phase 5 writeup stub. +# Presence of any of these strings in a writeup is strong evidence that the +# template was emitted without hydrating its content fields from BUGS.md. +# See bin/run_playbook.py::phase5_prompt for the generating prompt. +_WRITEUP_TEMPLATE_SENTINELS = ( + "is a confirmed code bug in ``", + "The affected implementation lives at ``", + "Patch path: ``", + "- Regression test: ``", + "- Regression patch: ``", +) + +# Matches a ```diff fenced block and captures its body for content inspection. 
+_WRITEUP_DIFF_BLOCK_RE = re.compile(r"```diff\s*\n(.*?)```", re.DOTALL | re.IGNORECASE) + + +def _writeup_diff_is_non_empty(text): + """True if any ```diff block in ``text`` contains at least one unified-diff + line (a `+` or `-` that is not the `+++`/`---` file-header prefix).""" + for block in _WRITEUP_DIFF_BLOCK_RE.findall(text): + for line in block.splitlines(): + stripped = line.lstrip() + if stripped.startswith("+++") or stripped.startswith("---"): + continue + if stripped.startswith(("+", "-")): + return True + return False + + +def check_writeups(q, bug_count): + """Bug writeups section (benchmark 30).""" + print("[Bug Writeups]") + if bug_count <= 0: + return + + writeups_dir = _resolve_artifact_path(q, "writeups") + writeup_count = 0 + writeup_diff_count = 0 + empty_diff_writeups = [] + sentinel_writeups = [] + if writeups_dir.is_dir(): + writeup_files = sorted(p for p in writeups_dir.glob("BUG-*.md") if p.is_file()) + writeup_count = len(writeup_files) + for wf in writeup_files: + try: + text = wf.read_text(encoding="utf-8", errors="replace") + except OSError: + continue + # Presence test uses the same regex as the content test so the + # two can never disagree on whether a fence exists. Case-insensitive + # match accepts ```diff / ```Diff / ```DIFF uniformly — operators + # routinely uppercase the fence tag and the gate must not silently + # skip those writeups (the content non-emptiness check would then + # never fire, producing a confusing "no inline fix diffs" FAIL on a + # writeup that visibly contains a unified diff). 
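As a quick illustration of the two-stage diff-block check (fence presence, then non-empty content), here is a standalone sketch using the same regex logic on hypothetical writeup text. The fence literal is built with `"`" * 3` only to keep this example itself fence-safe:

```python
import re

FENCE = "`" * 3  # literal triple backtick
DIFF_BLOCK_RE = re.compile(FENCE + r"diff\s*\n(.*?)" + FENCE,
                           re.DOTALL | re.IGNORECASE)

def diff_is_non_empty(text):
    # Mirrors _writeup_diff_is_non_empty: a +/- line that is not a
    # +++/--- file header marks the block as real diff content.
    for block in DIFF_BLOCK_RE.findall(text):
        for line in block.splitlines():
            stripped = line.lstrip()
            if stripped.startswith("+++") or stripped.startswith("---"):
                continue
            if stripped.startswith(("+", "-")):
                return True
    return False

# Case-variant fence with real hunk lines: counts as non-empty.
real = FENCE + "DIFF\n--- a/app.py\n+++ b/app.py\n-old\n+new\n" + FENCE
# Fence containing only file headers: treated as an empty template stub.
stub = FENCE + "diff\n--- a/app.py\n+++ b/app.py\n" + FENCE

print(diff_is_non_empty(real))   # True
print(diff_is_non_empty(stub))   # False
```

Because presence and content use the same compiled pattern, an uppercase ```DIFF fence can never be counted as present by one check and skipped by the other.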
+ if _WRITEUP_DIFF_BLOCK_RE.search(text): + writeup_diff_count += 1 + if not _writeup_diff_is_non_empty(text): + empty_diff_writeups.append(wf.name) + if any(s in text for s in _WRITEUP_TEMPLATE_SENTINELS): + sentinel_writeups.append(wf.name) + + if writeup_count >= bug_count: + pass_(f"{writeup_count} writeup(s) for {bug_count} bug(s)") + elif writeup_count > 0: + fail(f"{writeup_count} writeup(s) for {bug_count} bug(s) — all confirmed bugs require writeups (SKILL.md line 1454)") + else: + fail(f"No writeups for {bug_count} confirmed bug(s)") + + if writeup_count > 0: + if writeup_diff_count >= writeup_count: + pass_(f"All {writeup_diff_count} writeup(s) have inline fix diffs") + elif writeup_diff_count > 0: + fail(f"Only {writeup_diff_count}/{writeup_count} writeup(s) have inline fix diffs (all require section 6 diff)") + else: + fail("No writeups have inline fix diffs (section 6 'The fix' must include a ```diff block)") + + # Non-empty-diff content check. A ```diff fence with no `+`/`-` body + # is a template stub — the legacy presence-only check let these pass. + if empty_diff_writeups: + preview = ", ".join(empty_diff_writeups[:5]) + suffix = f" (+{len(empty_diff_writeups) - 5} more)" if len(empty_diff_writeups) > 5 else "" + fail( + f"{len(empty_diff_writeups)} writeup(s) have empty ```diff blocks " + f"(fence present, no +/- lines): {preview}{suffix}" + ) + else: + pass_("All writeup ```diff blocks contain unified-diff content") + + # Template-sentinel check. Any of these strings remaining in a writeup + # means the Phase 5 stub was emitted without hydrating from BUGS.md. 
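The sentinel test reduces to plain substring membership. A sketch with hypothetical writeup text (the file path and patch path are made up for illustration) showing what trips it:

```python
# Two of the sentinel phrases from _WRITEUP_TEMPLATE_SENTINELS; the empty
# backticks are the tell that the Phase 5 template was never hydrated.
SENTINELS = (
    "is a confirmed code bug in ``",
    "Patch path: ``",
)

stub = "BUG-001 is a confirmed code bug in ``. Patch path: ``\n"
hydrated = (
    "BUG-001 is a confirmed code bug in `src/parser.py`. "
    "Patch path: `quality/patches/BUG-001-fix.patch`\n"
)

print(any(s in stub for s in SENTINELS))      # True — unfilled template
print(any(s in hydrated for s in SENTINELS))  # False — fields hydrated
```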
+ if sentinel_writeups: + preview = ", ".join(sentinel_writeups[:5]) + suffix = f" (+{len(sentinel_writeups) - 5} more)" if len(sentinel_writeups) > 5 else "" + fail( + f"{len(sentinel_writeups)} writeup(s) contain unfilled template " + f"sentinels (empty backticks after 'is a confirmed code bug in', " + f"'The affected implementation lives at', 'Patch path:', " + f"'Regression test:', or 'Regression patch:'): {preview}{suffix}" + ) + else: + pass_("No writeups contain unfilled template sentinels") + + +def check_version_stamps(repo_dir, q): + """Version stamp consistency (benchmark 26). Returns detected skill_version.""" + print("[Version Stamps]") + skill_version = detect_skill_version([ + repo_dir / "SKILL.md", + repo_dir / ".claude" / "skills" / "quality-playbook" / "SKILL.md", + repo_dir / ".github" / "skills" / "SKILL.md", + repo_dir / ".github" / "skills" / "quality-playbook" / "SKILL.md", + SCRIPT_DIR / ".." / "SKILL.md", + SCRIPT_DIR / "SKILL.md", + ]) + + if not skill_version: + warn("Cannot detect skill version from SKILL.md") + return skill_version + + progress_md = q / "PROGRESS.md" + if progress_md.is_file(): + pv = read_skill_value_line(progress_md, "Skill version:") + if pv == skill_version: + pass_(f"PROGRESS.md version matches ({skill_version})") + elif pv: + fail(f"PROGRESS.md version '{pv}' != '{skill_version}'") + else: + warn("PROGRESS.md missing Skill version field") + + json_path = _resolve_artifact_path(q, "results/tdd-results.json") + if json_path.is_file(): + data = load_json(json_path) + tv = get_str(data, "skill_version") + if tv == skill_version: + pass_("tdd-results.json skill_version matches") + elif tv: + fail(f"tdd-results.json skill_version '{tv}' != '{skill_version}'") + + return skill_version + + +def check_cross_run_contamination(repo_dir, q, version_arg, skill_version): + """Cross-run contamination detection.""" + print("[Cross-Run Contamination]") + repo_name = repo_dir.name + if skill_version and version_arg: + matches = 
re.findall(r"[0-9]+\.[0-9]+\.[0-9]+", repo_name) + dir_version = matches[-1] if matches else "" + if dir_version and dir_version != skill_version: + fail(f"Directory version '{dir_version}' != skill version '{skill_version}' — possible cross-run contamination") + else: + pass_("No version mismatch detected") + + json_path = _resolve_artifact_path(q, "results/tdd-results.json") + if json_path.is_file() and skill_version: + data = load_json(json_path) + json_sv = get_str(data, "skill_version") + if json_sv and json_sv != skill_version: + fail(f"tdd-results.json skill_version '{json_sv}' != SKILL.md '{skill_version}' — stale artifacts from prior run?") + + +def _check_exploration_sections(path): + """Check that EXPLORATION.md contains all required section titles.""" + required_sections = [ + "## Open Exploration Findings", + "## Quality Risks", + "## Pattern Applicability Matrix", + "## Candidate Bugs for Phase 2", + "## Gate Self-Check", + ] + try: + content = path.read_text(encoding="utf-8", errors="replace") + except OSError as exc: + fail(f"EXPLORATION.md unreadable: {exc}") + return + for section in required_sections: + if section not in content: + fail(f"EXPLORATION.md missing required section: {section!r}") + + +def check_run_metadata(q): + """Validate the run-metadata sidecar JSON (run-YYYY-MM-DDTHH-MM-SS.json).""" + print("[Run Metadata]") + results_dir = _resolve_artifact_path(q, "results") + pattern = str(results_dir / "run-*.json") + import glob as _glob + matches = _glob.glob(pattern) + if not matches: + fail("run-metadata JSON missing (expected quality/results/run-YYYY-MM-DDTHH-MM-SS.json)") + return + if len(matches) > 1: + warn(f"Multiple run-metadata files found: {len(matches)}") + filename_re = re.compile(r"run-\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2}\.json$") + for path in matches: + if not filename_re.search(path): + fail(f"run-metadata filename does not match expected format: {path}") + data = load_json(Path(path)) + if data is None: + 
fail(f"run-metadata JSON parse error: {path}") + continue + required_fields = ("schema_version", "skill_version", "project", "model", "runner", "start_time") + for field in required_fields: + if not data.get(field): + fail(f"run-metadata missing or empty field: {field!r}") + pass_("run-metadata JSON present") + + + # --- Per-repo entry point --- + + + # --------------------------------------------------------------------------- + # v1.5.1 Layer-1 mechanical invariants (schemas.md §10). + # + # Each check gracefully no-ops on pre-v1.5.1 runs (absent manifests = legacy + # repo; nothing to enforce). When the v1.5.1 artifacts are present every + # invariant below is enforced mechanically and FAILs with a specific + # `artifact: message` pair so the operator can fix the single artifact + # without re-running the whole playbook. + # --------------------------------------------------------------------------- + + _V150_VALID_DISPOSITIONS = ( + "code-fix", + "spec-fix", + "upstream-spec-issue", + "mis-read", + "deferred", + ) + _V150_VALID_FIX_TYPES = ("code", "spec", "both") + _V150_ILLEGAL_FIX_PAIRS = { + ("code-fix", "spec"), + ("spec-fix", "code"), + ("upstream-spec-issue", "code"), + ("mis-read", "both"), + } + _V150_SUPPORTED_EXTENSIONS = (".txt", ".md") + # v1.5.4 Part 1 / Round 1 Council finding C2-1: INDEX schema is now + # version-routed. New runs MUST emit schema_version "2.0" with + # target_role_breakdown; legacy archives carry schema_version "1.0" + # (or no schema_version at all) with target_project_type. The fields + # common to both schemas live in _V150_INDEX_COMMON_FIELDS; the + # version-specific fields live in their own tuples and are picked at + # validation time. + # + # v1.5.4 Round 2 Council finding C1: SCHEMA_VERSION_CURRENT pins the + # version this gate understands. Future schemas (>2.0) refuse with an + # explicit error rather than silently downgrading to legacy.
When a +# v1.5.5+ run bumps the schema, also bump this constant; otherwise the +# new gate version will reject the new INDEX shape on purpose. +SCHEMA_VERSION_CURRENT = "2.0" +_V150_INDEX_COMMON_FIELDS = ( + "run_timestamp_start", + "run_timestamp_end", + "duration_seconds", + "qpb_version", + "target_repo_path", + "target_repo_git_sha", + "phases_executed", + "summary", + "artifacts", +) +_V150_INDEX_LEGACY_FIELDS = ("target_project_type",) +_V154_INDEX_CURRENT_FIELDS = ("target_role_breakdown",) +# Legacy alias: a small number of pre-iteration tests still import +# _V150_REQUIRED_INDEX_FIELDS expecting a single tuple. Preserve the +# alias under the v1.5.4-current contract; the version-routed +# enforcement happens inside check_v1_5_0_index_md. +_V150_REQUIRED_INDEX_FIELDS = ( + _V150_INDEX_COMMON_FIELDS + _V154_INDEX_CURRENT_FIELDS +) +_V150_REQUIRED_SUMMARY_KEYS = ("requirements", "bugs", "gate_verdict") + + +# --------------------------------------------------------------------------- +# v1.5.3 — schema extensions (schemas.md §3.6–§3.10, §4.1, §6.1, §8.1, §10 +# invariants #21–#23). Field-presence detection (§3.10) toggles the +# v1.5.3 invariants on per-manifest, NOT a schema_version comparison. +# --------------------------------------------------------------------------- + +_V153_VALID_SOURCE_TYPES = ( + "code-derived", + "skill-section", + "reference-file", + "execution-observation", + # v1.5.6 (QG-fail-2 from the v1.5.6 self-bootstrap): REQs derived from + # operator-supplied informal documentation under the target repo's + # `reference_docs/` tree. Distinct from `reference-file`, which + # schemas.md §3.7 ties to QPB-shipped reference files under + # `references/`. The Phase 2 LLM disambiguates the two evidence + # sources by name; the schema and gate now match. 
+ "docs-derived", +) +_V153_VALID_DIVERGENCE_TYPES = ( + "code-spec", + "internal-prose", + "prose-to-code", + "execution", +) +_V153_VALID_FORMAL_DOC_ROLES = ( + "external-spec", + "project-spec", + "skill-self-spec", + "skill-reference", +) + +# DQ-3 (v1.5.3 Phase 3 / Round 2 Council): the v1.5.3 field-presence +# detection key set is module-level so a regression test can pin it +# against schemas.md's enum-bearing field list. A future schema +# addition (e.g., a fifth v1.5.3-only field) that updates ONLY this +# constant without updating the test's literal will fail the regression +# test, forcing lockstep maintenance and surfacing the change for +# explicit review. +_V153_FIELD_KEYS = frozenset({"source_type", "divergence_type", "role"}) + + +def _is_v1_5_3_shaped(manifest): + """Return True iff any record in *manifest* carries a v1.5.3 field. + + Walks the records (or `reviews`) once. Presence of any key in + _V153_FIELD_KEYS on any record toggles strict-mode validation per + schemas.md §3.10. Empty / unparsable manifests return False so + legacy fixtures stay on the soft-warn path. + + DQ-3 design note: the checked-key set is sourced from + _V153_FIELD_KEYS (a module-level frozenset) rather than hardcoded + in this function's body. A regression test in + test_quality_gate.py::TestV153FieldKeysContract pins + _V153_FIELD_KEYS against the literal `{"source_type", + "divergence_type", "role"}` so a future maintainer adding a + v1.5.3-only field to the schema cannot silently miss updating the + detection helper. 
+ """ + if not isinstance(manifest, dict): + return False + records = manifest.get("records") + if not isinstance(records, list): + records = manifest.get("reviews") if isinstance( + manifest.get("reviews"), list + ) else [] + for rec in records: + if not isinstance(rec, dict): + continue + if not _V153_FIELD_KEYS.isdisjoint(rec.keys()): + return True + return False + + +def _v150_manifest(q, name): + """Return the parsed top-level JSON object or None if absent/invalid.""" + path = q / name + if not path.is_file(): + return None + data = load_json(path) + if isinstance(data, dict): + return data + fail(f"{path.name}: not a valid JSON object (schemas.md §1.6)") + return None + + +def check_v1_5_0_cite_extensions(repo_dir): + """§10 invariant #9 — reference_docs/cite/ contains only .txt/.md. + + v1.5.2 collapsed the old formal_docs/+informal_docs/ split into a single + reference_docs/ tree with reference_docs/cite/ holding citable material. + The plaintext-only constraint now applies to that cite folder; the check + retains the v1.5.0 invariant ancestry (hence the _v1_5_0_ name prefix). + """ + folder = repo_dir / "reference_docs" / "cite" + if not folder.is_dir(): + return + any_file = False + for path in sorted(folder.rglob("*")): + if not path.is_file(): + continue + any_file = True + if path.name == "README.md": + continue + if path.name.endswith(".meta.json"): + continue + # v1.5.6 (QG-fail-1 from the v1.5.6 self-bootstrap): `.gitkeep` + # is the documented sentinel that pins `reference_docs/cite/` + # in version control even when adopters have no citable + # plaintext yet. The pre-flight expects it to exist; the gate + # must not reject it. 
+ if path.name == ".gitkeep": + continue + ext = path.suffix.lower() + if ext not in _V150_SUPPORTED_EXTENSIONS: + rel = path.relative_to(repo_dir).as_posix() + fail( + f"{rel}: unsupported extension {ext or '(none)'} under reference_docs/cite/ " + "(schemas.md §2 allows only .txt, .md; §10 invariant #9)" + ) + if any_file: + pass_("reference_docs/cite/: all files use supported extensions") + + +def check_v1_5_0_manifest_wrappers(q): + """§10 invariant #13 — manifest wrapper shape. + + Four record-shaped manifests (formal_docs / requirements / use_cases / + bugs) use `records`; citation_semantic_check.json uses `reviews` + (schemas.md §9.1). Every manifest must carry schema_version + + generated_at as non-empty strings. + """ + record_shaped = ( + "formal_docs_manifest.json", + "requirements_manifest.json", + "use_cases_manifest.json", + "bugs_manifest.json", + ) + for name in record_shaped: + data = _v150_manifest(q, name) + if data is None: + continue + for key in ("schema_version", "generated_at"): + if not isinstance(data.get(key), str) or not data[key]: + fail(f"{name}: missing or empty top-level {key!r} (schemas.md §1.6)") + if not isinstance(data.get("records"), list): + fail(f"{name}: missing or non-array top-level 'records' (schemas.md §1.6)") + if "reviews" in data: + fail( + f"{name}: has 'reviews' key — reserved for citation_semantic_check.json " + "per schemas.md §9.1 / §10 invariant #13" + ) + else: + pass_(f"{name}: manifest wrapper valid") + + data = _v150_manifest(q, "citation_semantic_check.json") + if data is not None: + for key in ("schema_version", "generated_at"): + if not isinstance(data.get(key), str) or not data[key]: + fail( + f"citation_semantic_check.json: missing or empty top-level {key!r} " + "(schemas.md §1.6)" + ) + if not isinstance(data.get("reviews"), list): + fail( + "citation_semantic_check.json: missing or non-array top-level 'reviews' " + "(schemas.md §9.1 — semantic check uses 'reviews', not 'records')" + ) + if "records" in 
data: + fail( + "citation_semantic_check.json: has 'records' key — semantic check uses " + "'reviews' per schemas.md §9.1 / §10 invariant #13" + ) + else: + pass_("citation_semantic_check.json: manifest wrapper valid") + + +def _check_citation_block(repo_dir, req_id, citation, formal_docs_by_path, req_tier): + excerpt = citation.get("citation_excerpt") + if not isinstance(excerpt, str) or not excerpt: + fail( + "requirements_manifest.json", + f"record_id={req_id}: citation has empty or missing citation_excerpt " + "(schemas.md §10 invariant #4)", + ) + return + doc_path_str = citation.get("document") + if not isinstance(doc_path_str, str) or not doc_path_str: + fail( + "requirements_manifest.json", + f"record_id={req_id}: citation missing 'document' field", + ) + return + section = citation.get("section") + line = citation.get("line") + has_section = isinstance(section, str) and section.strip() + has_line = isinstance(line, int) and not isinstance(line, bool) + if not has_section and not has_line: + fail( + "requirements_manifest.json", + f"record_id={req_id}: citation has no section or line locator " + "(page alone is insufficient; schemas.md §10 invariant #4)", + ) + return + + fd_rec = formal_docs_by_path.get(doc_path_str) + if fd_rec is None: + fail( + "requirements_manifest.json", + f"record_id={req_id}: citation document {doc_path_str!r} " + "not in formal_docs_manifest.json (schemas.md §10 invariant #2)", + ) + return + fd_tier = fd_rec.get("tier") + if fd_tier != req_tier: + fail( + "requirements_manifest.json", + f"record_id={req_id}: tier={req_tier} does not match cited FORMAL_DOC " + f"tier={fd_tier!r} (schemas.md §10 invariant #14)", + ) + fd_sha = fd_rec.get("document_sha256") + cite_sha = citation.get("document_sha256") + if isinstance(fd_sha, str) and isinstance(cite_sha, str) and fd_sha != cite_sha: + fail( + "requirements_manifest.json", + f"record_id={req_id}: citation.document_sha256 does not match FORMAL_DOC " + "(schemas.md §10 invariant #3 — 
citation_stale)", + ) + + if _CITATION_VERIFIER is None: + warn( + f"requirements_manifest.json: record_id={req_id}: byte-equality skipped — " + "bin/citation_verifier unavailable on this install" + ) + return + + doc_path = repo_dir / doc_path_str + if not doc_path.is_file(): + fail( + "requirements_manifest.json", + f"record_id={req_id}: citation document not on disk: {doc_path_str}", + ) + return + try: + bytes_ = doc_path.read_bytes() + fresh = _CITATION_VERIFIER.extract_excerpt( + bytes_, doc_path.suffix.lower(), section if has_section else None, + line if has_line else None, + ) + except _CITATION_VERIFIER.CitationResolutionError as exc: + fail( + "requirements_manifest.json", + f"record_id={req_id}: citation location does not resolve in " + f"{doc_path_str}: {exc.message} (schemas.md §10 invariant #4)", + ) + return + except Exception as exc: # noqa: BLE001 — fail with a real message + fail( + "requirements_manifest.json", + f"record_id={req_id}: citation verifier errored: {exc}", + ) + return + + if fresh != excerpt: + fail( + "requirements_manifest.json", + f"record_id={req_id}: citation_excerpt is not byte-equal to fresh " + f"extraction from {doc_path_str} " + "(schemas.md §10 invariant #11 — Layer-1 anti-hallucination)", + ) + + +def check_v1_5_0_requirements_manifest(repo_dir, q): + """§10 invariants #1, #4, #8, #11, #14 — REQ shape, citation gating, functional_section.""" + req_data = _v150_manifest(q, "requirements_manifest.json") + if req_data is None: + return + records = req_data.get("records") + if not isinstance(records, list): + return # wrapper check already reported + fd_data = _v150_manifest(q, "formal_docs_manifest.json") + formal_docs_by_path = {} + if fd_data and isinstance(fd_data.get("records"), list): + for rec in fd_data["records"]: + if isinstance(rec, dict) and isinstance(rec.get("source_path"), str): + formal_docs_by_path[rec["source_path"]] = rec + + for idx, rec in enumerate(records): + if not isinstance(rec, dict): + fail( + 
"requirements_manifest.json", + f"record_id=<#{idx}>: not a JSON object", + ) + continue + req_id = rec.get("id", f"<#{idx}>") + + fs = rec.get("functional_section") + if not isinstance(fs, str) or not fs.strip(): + fail( + "requirements_manifest.json", + f"record_id={req_id}: has empty or missing functional_section " + "(schemas.md §10 invariant #8)", + ) + + tier = rec.get("tier") + citation = rec.get("citation") + if tier in (1, 2): + if not isinstance(citation, dict): + fail( + "requirements_manifest.json", + f"record_id={req_id}: is tier {tier} but has no citation block " + "(schemas.md §10 invariant #1)", + ) + continue + _check_citation_block(repo_dir, req_id, citation, formal_docs_by_path, tier) + elif tier in (3, 4, 5): + if citation is not None: + fail( + "requirements_manifest.json", + f"record_id={req_id}: is tier {tier} but carries a citation block " + "(citations are for Tier 1/2 only per schemas.md §10 invariant #1)", + ) + elif tier is None: + fail( + "requirements_manifest.json", + f"record_id={req_id}: missing 'tier' field", + ) + else: + fail( + "requirements_manifest.json", + f"record_id={req_id}: has invalid tier {tier!r} (expected integer 1–5)", + ) + + # v1.5.2: validate the optional `pattern` field on the REQ record. 
+ pattern = rec.get("pattern") + if pattern is not None and pattern not in VALID_PATTERN_VALUES: + fail( + "requirements_manifest.json", + f"record_id={req_id}: has invalid pattern {pattern!r} " + f"(expected one of {sorted(VALID_PATTERN_VALUES)})", + ) + + pass_("requirements_manifest.json: v1.5.1 Layer-1 REQ checks complete") + + +def check_v1_5_0_bugs_manifest(q): + """§10 invariants #7, #12 — disposition completeness + legal fix_type × disposition.""" + data = _v150_manifest(q, "bugs_manifest.json") + if data is None: + return + records = data.get("records") + if not isinstance(records, list): + return + for idx, rec in enumerate(records): + if not isinstance(rec, dict): + continue + bug_id = rec.get("id", f"<#{idx}>") + disp = rec.get("disposition") + if disp not in _V150_VALID_DISPOSITIONS: + fail( + "bugs_manifest.json", + f"record_id={bug_id}: has invalid or missing disposition {disp!r} " + f"(schemas.md §10 invariant #7, valid: " + f"{', '.join(_V150_VALID_DISPOSITIONS)})", + ) + continue + rationale = rec.get("disposition_rationale") + if not isinstance(rationale, str) or not rationale.strip(): + fail( + "bugs_manifest.json", + f"record_id={bug_id}: has empty or missing disposition_rationale " + "(schemas.md §10 invariant #7)", + ) + ft = rec.get("fix_type") + if ft not in _V150_VALID_FIX_TYPES: + fail( + "bugs_manifest.json", + f"record_id={bug_id}: has invalid or missing fix_type {ft!r}", + ) + continue + if (disp, ft) in _V150_ILLEGAL_FIX_PAIRS: + fail( + "bugs_manifest.json", + f"record_id={bug_id}: illegal disposition × fix_type combination " + f"({disp}, {ft}) per schemas.md §3.4 / §10 invariant #12", + ) + + pass_("bugs_manifest.json: v1.5.1 Layer-1 BUG checks complete") + + +def check_v1_5_0_index_md(q): + """§10 invariant #10 — quality/INDEX.md exists with all §11 required fields. 
+ + v1.5.4 Part 1 / Round 1 Council finding C2-1 + Round 2 Council + finding C1: routes by INDEX payload.schema_version with explicit + handling for each case so future schemas don't silently downgrade. + + - ``schema_version == SCHEMA_VERSION_CURRENT`` (currently + ``"2.0"``) → the v1.5.4 contract; target_role_breakdown + required (null is legitimate for the stub before Phase 1). + - ``schema_version == "1.0"`` → legacy v1.5.3 archive; + target_project_type required; one WARN emitted. + - ``schema_version`` absent/empty AND payload carries + target_project_type without target_role_breakdown → legacy + WARN (heuristic fallback for pre-schema-version archives). + - ``schema_version`` absent/empty AND payload doesn't match the + legacy heuristic → current path; the run is treated as a + v1.5.4 stub that simply hasn't populated schema_version yet, + and target_role_breakdown is required. + - any other ``schema_version`` (e.g. ``"3.0"`` from a future + gate) → explicit FAIL "newer than supported" so the operator + knows to upgrade the gate or downgrade the run. + + This keeps historical archives under quality/previous_runs/ + legible without rewriting them retroactively while keeping the + gate strict on current runs. 
+ """ + path = q / "INDEX.md" + v150_artifacts = ( + "formal_docs_manifest.json", + "requirements_manifest.json", + "use_cases_manifest.json", + "bugs_manifest.json", + "citation_semantic_check.json", + ) + is_v150_run = any((q / name).is_file() for name in v150_artifacts) + if not path.is_file(): + if is_v150_run: + fail( + "quality/INDEX.md does not exist (required on every v1.5.1 run per " + "schemas.md §10 invariant #10)" + ) + return + text = path.read_text(encoding="utf-8", errors="ignore") + match = re.search(r"```json\n(.*?)\n```", text, re.DOTALL) + if not match: + fail("quality/INDEX.md: no fenced JSON block found (schemas.md §11)") + return + try: + payload = json.loads(match.group(1)) + except json.JSONDecodeError as exc: + fail(f"quality/INDEX.md: fenced JSON block invalid: {exc}") + return + if not isinstance(payload, dict): + fail("quality/INDEX.md: fenced JSON block is not a JSON object") + return + + # Schema-version routing for INDEX.md (v1.5.4 Round 2 Council + # finding C1). Four cases, handled explicitly so future schemas + # don't silently downgrade to legacy: + # 1. schema_version == "1.0" -> legacy WARN + # 2. schema_version absent/empty AND the payload -> legacy WARN + # carries target_project_type but not (heuristic + # target_role_breakdown fallback for + # pre-schema- + # version + # archives) + # 3. schema_version == SCHEMA_VERSION_CURRENT -> current path + # 4. schema_version absent/empty AND the payload + # doesn't fit case 2 -> current path + # (FAIL on + # missing + # target_role_breakdown + # because the + # run is + # ambiguous and + # v1.5.4 is the + # live shape) + # 5. 
any other schema_version -> explicit FAIL + # "newer than + # supported" + schema_version = payload.get("schema_version") + if schema_version == "1.0": + is_legacy = True + elif schema_version in (None, ""): + is_legacy = ( + "target_project_type" in payload + and "target_role_breakdown" not in payload + ) + elif schema_version == SCHEMA_VERSION_CURRENT: + is_legacy = False + else: + fail( + f"quality/INDEX.md: schema_version {schema_version!r} is " + f"newer than this gate supports (current: " + f"{SCHEMA_VERSION_CURRENT!r}). Upgrade the gate or " + "downgrade the run." + ) + return + + if is_legacy: + warn( + f"quality/INDEX.md: schema_version={schema_version!r} treated as " + "legacy v1.5.3 archive (target_project_type contract). v1.5.4+ " + f"runs MUST emit schema_version={SCHEMA_VERSION_CURRENT!r} with " + "target_role_breakdown." + ) + required = _V150_INDEX_COMMON_FIELDS + _V150_INDEX_LEGACY_FIELDS + else: + required = _V150_INDEX_COMMON_FIELDS + _V154_INDEX_CURRENT_FIELDS + + for key in required: + if key not in payload: + fail(f"quality/INDEX.md: missing required field {key!r} (schemas.md §11)") + continue + val = payload[key] + if isinstance(val, str) and not val: + fail(f"quality/INDEX.md: field {key!r} is empty string (schemas.md §11)") + summary = payload.get("summary") + if isinstance(summary, dict): + for sub in _V150_REQUIRED_SUMMARY_KEYS: + if sub not in summary: + fail( + f"quality/INDEX.md: summary missing {sub!r} sub-key " + "(schemas.md §11)" + ) + pass_("quality/INDEX.md: §11 fields present") + + +_V150_VALID_VERDICTS = ("supports", "overreaches", "unclear") + + +def check_v1_5_0_semantic_check(q): + """§10 invariant #17 — Council-of-Three majority-overreaches rule. + + Layer-2 semantic check (Phase 6). Gate does NOT re-run the semantic + review; it parses quality/citation_semantic_check.json and applies + the majority-overreaches rule: + + - ≥2 of 3 `overreaches` for the same Tier 1/2 REQ → FAIL. 
+ - isolated 1/3 `overreaches` or `unclear` → WARN. + - <3 reviews for any Tier 1/2 REQ → FAIL (schemas.md §9.4). + - review entry for a Tier 3/4/5 REQ → FAIL (only Tier 1/2 are + semantically reviewable since they carry citations). + + When requirements_manifest.json has zero Tier 1/2 REQs the + citation_semantic_check.json file is still expected (emitted with + empty reviews[]); its absence in that case warns rather than + fails to avoid breaking Spec Gap runs. + """ + req_data = _v150_manifest(q, "requirements_manifest.json") + tier_by_req = {} + if req_data and isinstance(req_data.get("records"), list): + for rec in req_data["records"]: + if isinstance(rec, dict): + rid = rec.get("id") + tier = rec.get("tier") + if isinstance(rid, str) and isinstance(tier, int) and not isinstance(tier, bool): + tier_by_req[rid] = tier + tier_12_req_ids = {rid for rid, t in tier_by_req.items() if t in (1, 2)} + + sc_path = q / "citation_semantic_check.json" + if not sc_path.is_file(): + if tier_12_req_ids: + fail( + "quality/citation_semantic_check.json", + "file missing (schemas.md §10 invariant #17 requires a semantic " + "check for every Tier 1/2 REQ)", + ) + else: + # Spec Gap: no Tier 1/2 REQs to review. File is expected but its + # absence doesn't break the invariant since there's nothing to + # enforce. Warn so the orchestrator knows to emit the empty file. 
+ warn( + "quality/citation_semantic_check.json: file missing; no Tier 1/2 " + "REQs present so invariant #17 has nothing to enforce — emit an " + "empty reviews[] for contract completeness" + ) + return + + data = _v150_manifest(q, "citation_semantic_check.json") + if data is None: + return # wrapper check already reported the failure + reviews = data.get("reviews") + if not isinstance(reviews, list): + return # wrapper check already reported + + by_req = {} + seen_reviewers = {} + for idx, entry in enumerate(reviews): + if not isinstance(entry, dict): + fail( + "citation_semantic_check.json", + f"reviews[#{idx}]: not a JSON object", + ) + continue + rid = entry.get("req_id") + reviewer = entry.get("reviewer") + verdict = entry.get("verdict") + notes = entry.get("notes") + if not isinstance(rid, str) or not rid: + fail( + "citation_semantic_check.json", + f"reviews[#{idx}]: missing or non-string req_id", + ) + continue + if not isinstance(reviewer, str) or not reviewer: + fail( + "citation_semantic_check.json", + f"record_id={rid}: missing or non-string reviewer", + ) + continue + if verdict not in _V150_VALID_VERDICTS: + fail( + "citation_semantic_check.json", + f"record_id={rid}: reviewer={reviewer!r} invalid verdict " + f"{verdict!r}; expected one of {_V150_VALID_VERDICTS}", + ) + continue + if not isinstance(notes, str): + fail( + "citation_semantic_check.json", + f"record_id={rid}: reviewer={reviewer!r} notes must be a string", + ) + continue + # §9.4 common-mistake: tier check — review entries must belong to + # Tier 1/2 REQs only. 
+ tier = tier_by_req.get(rid) + if tier is None: + fail( + "citation_semantic_check.json", + f"record_id={rid}: reviewer={reviewer!r} reviews a REQ that does " + "not exist in requirements_manifest.json", + ) + continue + if tier not in (1, 2): + fail( + "citation_semantic_check.json", + f"record_id={rid}: reviewer={reviewer!r} reviews a tier-{tier} " + "REQ; semantic check applies to Tier 1/2 only (schemas.md §9.4)", + ) + continue + # Detect duplicate (req_id, reviewer) pairs — a typo that would slip a + # vote past the majority computation. + pair_key = seen_reviewers.setdefault(rid, set()) + if reviewer in pair_key: + fail( + "citation_semantic_check.json", + f"record_id={rid}: duplicate review from reviewer={reviewer!r}", + ) + continue + pair_key.add(reviewer) + by_req.setdefault(rid, []).append(entry) + + # §9.4: every Tier 1/2 REQ needs at least 3 reviews. + for rid in sorted(tier_12_req_ids): + entries = by_req.get(rid, []) + if len(entries) < 3: + fail( + "citation_semantic_check.json", + f"record_id={rid}: fewer than 3 reviews ({len(entries)} present) " + "— schemas.md §9.4 requires one entry per council member for " + "every Tier 1/2 REQ", + ) + continue + overreach_count = sum(1 for e in entries if e.get("verdict") == "overreaches") + unclear_count = sum(1 for e in entries if e.get("verdict") == "unclear") + if overreach_count >= 2: + reviewers_flagged = ", ".join( + sorted( + str(e.get("reviewer")) + for e in entries + if e.get("verdict") == "overreaches" + ) + ) + fail( + "citation_semantic_check.json", + f"record_id={rid}: semantic check majority overreaches " + f"({overreach_count}/{len(entries)} reviewers flagged: " + f"{reviewers_flagged}) — schemas.md §10 invariant #17", + ) + elif overreach_count == 1: + flagged = next( + str(e.get("reviewer")) + for e in entries + if e.get("verdict") == "overreaches" + ) + warn( + f"citation_semantic_check.json: record_id={rid}: 1/{len(entries)} " + f"reviewer ({flagged}) flagged as `overreaches` — surfaced 
for " + "human review; not a gate failure unless ≥2 agree" + ) + if unclear_count >= 1 and overreach_count == 0: + flagged = ", ".join( + sorted( + str(e.get("reviewer")) + for e in entries + if e.get("verdict") == "unclear" + ) + ) + warn( + f"citation_semantic_check.json: record_id={rid}: " + f"{unclear_count}/{len(entries)} reviewer(s) flagged as " + f"`unclear` ({flagged}) — surfaced for human review" + ) + + if not tier_12_req_ids: + pass_( + "citation_semantic_check.json: no Tier 1/2 REQs to review " + "(invariant #17 vacuously satisfied)" + ) + else: + pass_( + f"citation_semantic_check.json: §10 invariant #17 checks complete " + f"for {len(tier_12_req_ids)} Tier 1/2 REQ(s)" + ) + + +# --- v1.5.1 Item 5.2: challenge-gate coverage invariant ------------------- + +# Canonical verdict-line regex from Impl-Plan Item 5.2. Matches a top-level +# "**Verdict:** CONFIRMED/DOWNGRADED/REJECTED" line as a stand-alone line. +_CHALLENGE_VERDICT_RE = re.compile( + r"^\*\*Verdict:\*\*\s+(CONFIRMED|DOWNGRADED|REJECTED)\s*$", + re.MULTILINE, +) +# Legacy final-verdict form used by challenge records generated before the +# canonical regex was specified (including the preserved virtio-1.4.6 +# evidence at repos/benchmark-1.5.0/virtio-1.4.6/quality/challenge/). +# The briefing says "this invariant only verifies the challenge ran" — the +# legacy form unambiguously records a final verdict, so it satisfies the +# invariant's intent without requiring operators to regenerate baseline +# artifacts. New v1.5.1+ runs should prefer the canonical form. +_CHALLENGE_VERDICT_LEGACY_RE = re.compile( + r"^\*\*(CONFIRMED|DOWNGRADED|REJECTED)\.?\*\*", + re.MULTILINE, +) + +# Trigger-pattern keyword tables (case-insensitive substring matching). 
+_CHALLENGE_SECURITY_SEVERITIES = frozenset({"CRITICAL", "HIGH"}) +_CHALLENGE_SECURITY_KEYWORDS = ( + "credential", "secret", "auth", "injection", "xss", "csrf", + "ssrf", "privilege", "bypass", "leak", +) +_CHALLENGE_SIBLING_KEYWORDS = ( + "sibling", "parallel", "parity", "contrasted with", "same concern", + "in contrast", "other path", "other branch", +) +_CHALLENGE_MISSING_KEYWORDS = ( + "never", "does not", "doesn't", "missing", "absent", "fails to", +) +_CHALLENGE_DESIGN_KEYWORDS = ( + "todo", "why", "ooda", "design decision", +) +_CHALLENGE_ITERATION_KEYWORDS = ( + "gap", "unfiltered", "parity", "adversarial", "iteration", +) + + +def _bug_writeup_text(q, bug_id): + """Return lowercased writeup text for ``bug_id`` (empty string if absent). + + Writeups live at quality/writeups/BUG-NNN.md. Reading failures are + treated as empty text — the invariant still runs on the manifest fields + (title / summary / source) which are present independently. + """ + path = _resolve_artifact_path(q, f"writeups/{bug_id}.md") + if not path.is_file(): + return "" + try: + return path.read_text(encoding="utf-8", errors="ignore").lower() + except OSError: + return "" + + +def _bug_req_has_tier_12_citation(req_id, requirements_records): + """True iff req_id resolves to a REQ with a non-empty citation and + tier in {1, 2}. 
Used by the "No spec basis" trigger pattern.""" + if not req_id or not isinstance(requirements_records, list): + return False + for rec in requirements_records: + if not isinstance(rec, dict): + continue + if rec.get("id") != req_id: + continue + if rec.get("tier") not in (1, 2): + return False + citation = rec.get("citation") + if isinstance(citation, dict) and citation: + return True + return False + return False + + +def _contains_any(text, keywords): + """Case-insensitive substring OR across a keyword tuple.""" + if not text: + return False + lowered = text.lower() + return any(kw in lowered for kw in keywords) + + +def _classify_bug_triggers(rec, q, requirements_records): + """Return the list of trigger-pattern names that fired for one bug. + Empty list means the bug does not require a challenge record. + + Patterns mirror Impl-Plan Item 5.2 verbatim. Input aliasing: + - title: prefers rec['title'], falls back to rec['summary']. + - requirement: prefers rec['requirement'], falls back to rec['req_id'] + (v1.4.x uses req_id; v1.5.1+ converges on requirement). + - source_comments: optional, older runs may omit it. + - source / discovery_phase: substring-matched against the + iteration-derived keyword list. + """ + fired = [] + + bug_id = rec.get("id", "") + title = rec.get("title") or rec.get("summary") or "" + severity = (rec.get("severity") or "").upper() + writeup = _bug_writeup_text(q, bug_id) if bug_id else "" + title_plus_writeup = f"{title}\n{writeup}" + + # 1. Security-class. + if severity in _CHALLENGE_SECURITY_SEVERITIES and _contains_any( + title_plus_writeup, _CHALLENGE_SECURITY_KEYWORDS + ): + fired.append("security-class") + + # 2. No spec basis. + requirement = rec.get("requirement") or rec.get("req_id") + has_valid_citation = _bug_req_has_tier_12_citation(requirement, requirements_records) + if not requirement or not has_valid_citation: + fired.append("no-spec-basis") + + # 3. Sibling-path divergence. 
+ if _contains_any(writeup, _CHALLENGE_SIBLING_KEYWORDS): + fired.append("sibling-path-divergence") + + # 4. Missing functionality. + if _contains_any(writeup, _CHALLENGE_MISSING_KEYWORDS): + fired.append("missing-functionality") + + # 5. Design-decision comment (optional field). + source_comments = rec.get("source_comments") + if isinstance(source_comments, str) and _contains_any( + source_comments, _CHALLENGE_DESIGN_KEYWORDS + ): + fired.append("design-decision-comment") + + # 6. Iteration-derived. + source = rec.get("source") or "" + discovery_phase = rec.get("discovery_phase") or "" + iter_haystack = f"{source}\n{discovery_phase}" + if _contains_any(iter_haystack, _CHALLENGE_ITERATION_KEYWORDS): + fired.append("iteration-derived") + + return fired + + +def _challenge_record_has_verdict(path): + """True iff the file exists and contains either the canonical or + legacy verdict line per the invariant's accept set.""" + if not path.is_file(): + return False + try: + text = path.read_text(encoding="utf-8", errors="ignore") + except OSError: + return False + if _CHALLENGE_VERDICT_RE.search(text): + return True + if _CHALLENGE_VERDICT_LEGACY_RE.search(text): + return True + return False + + +def check_challenge_gate_coverage(q): + """v1.5.1 Item 5.2 — every bug whose fingerprints trigger the challenge + gate must have a quality/challenge/BUG-NNN-challenge.md with a valid + verdict line. + + N/A when quality/bugs_manifest.json is absent (zero-bug runs can't + have un-challenged bugs). Runs on the current quality/ tree only; + no cross-run state. + """ + data = _v150_manifest(q, "bugs_manifest.json") + if data is None: + # N/A — the plan explicitly says "invariant is N/A if the file is + # absent". Consistent with other quality_gate invariants that silently + # skip when their input isn't present. 
+ return + records = data.get("records") + if not isinstance(records, list): + return + + reqs_data = _v150_manifest(q, "requirements_manifest.json") or {} + req_records = reqs_data.get("records") if isinstance(reqs_data, dict) else None + + challenge_dir = q / "challenge" + triggered = 0 + missing = [] # list of (bug_id, [pattern names]) for bugs with no record + bad_verdict = [] # list of (bug_id, [pattern names]) for record w/o verdict + + for rec in records: + if not isinstance(rec, dict): + continue + bug_id = rec.get("id") + if not bug_id: + continue + fired = _classify_bug_triggers(rec, q, req_records) + if not fired: + continue + triggered += 1 + record_path = challenge_dir / f"{bug_id}-challenge.md" + if not record_path.is_file(): + missing.append((bug_id, fired)) + elif not _challenge_record_has_verdict(record_path): + bad_verdict.append((bug_id, fired)) + + if missing: + for bug_id, fired in missing: + fail( + "quality/challenge/", + f"{bug_id}: challenge record missing (triggered by: {', '.join(fired)}) " + f"— expected {bug_id}-challenge.md with a **Verdict:** line", + ) + if bad_verdict: + for bug_id, fired in bad_verdict: + fail( + f"quality/challenge/{bug_id}-challenge.md", + f"missing or malformed verdict line (triggered by: {', '.join(fired)}) " + "— expected a line matching `^\\*\\*Verdict:\\*\\*\\s+(CONFIRMED|DOWNGRADED|REJECTED)` " + "or the legacy final-verdict form", + ) + + if triggered == 0: + pass_("challenge gate coverage: no bug triggered the challenge gate (vacuous)") + elif not missing and not bad_verdict: + pass_( + f"challenge gate coverage: {triggered} triggered bug(s) all have valid " + "challenge records" + ) + + +def check_v1_5_3_formal_doc_role_validation(q): + """schemas.md §10 invariant #23 — FORMAL_DOC.role on v1.5.3-shaped manifests. + + Legacy manifest (no v1.5.3 fields anywhere): one WARN, then skip. + v1.5.3-shaped: every record MUST have role populated with a member of + formal_doc_role (§3.6). 
+ """ + data = _v150_manifest(q, "formal_docs_manifest.json") + if data is None: + return + records = data.get("records") + if not isinstance(records, list): + return # wrapper check already reported + if not _is_v1_5_3_shaped(data): + warn( + "formal_docs_manifest.json: legacy manifest detected; treating absent " + "FORMAL_DOC.role as 'external-spec' per schemas.md §3.10 backward-compat rule" + ) + return + any_fail = False + for idx, rec in enumerate(records): + if not isinstance(rec, dict): + continue + rec_id = rec.get("source_path", f"<#{idx}>") + role = rec.get("role") + if role not in _V153_VALID_FORMAL_DOC_ROLES: + fail( + "formal_docs_manifest.json", + f"record_id={rec_id}: missing or invalid role {role!r} on " + f"v1.5.3-shaped manifest (schemas.md §10 invariant #23, valid: " + f"{', '.join(_V153_VALID_FORMAL_DOC_ROLES)})", + ) + any_fail = True + if not any_fail: + pass_("formal_docs_manifest.json: v1.5.3 role validation complete") + + +def check_v1_5_3_source_type_validation(q): + """schemas.md §10 invariants #21 (first part) — REQ.source_type presence. + + Legacy manifest: one WARN, then skip. + v1.5.3-shaped: every REQ MUST have source_type populated with a member + of req_source_type (§3.7). 
+ """ + data = _v150_manifest(q, "requirements_manifest.json") + if data is None: + return + records = data.get("records") + if not isinstance(records, list): + return + if not _is_v1_5_3_shaped(data): + warn( + "requirements_manifest.json: legacy manifest detected; treating absent " + "REQ.source_type as 'code-derived' per schemas.md §3.10 backward-compat rule" + ) + return + any_fail = False + for idx, rec in enumerate(records): + if not isinstance(rec, dict): + continue + req_id = rec.get("id", f"<#{idx}>") + source_type = rec.get("source_type") + if source_type not in _V153_VALID_SOURCE_TYPES: + fail( + "requirements_manifest.json", + f"record_id={req_id}: missing or invalid source_type " + f"{source_type!r} on v1.5.3-shaped manifest " + f"(schemas.md §10 invariant #21, valid: " + f"{', '.join(_V153_VALID_SOURCE_TYPES)})", + ) + any_fail = True + if not any_fail: + pass_("requirements_manifest.json: v1.5.3 source_type validation complete") + + +def check_v1_5_3_skill_section_consistency(q): + """schemas.md §10 invariant #21 (second part) — skill_section consistency. + + On a v1.5.3-shaped requirements manifest, REQs with + source_type == 'skill-section' MUST have non-empty skill_section; + REQs with any other source_type value MUST have skill_section absent + or null (per §1.5: optional fields may be omitted or present as null). + Populated skill_section paired with non-skill-section source_type FAILs. + + Legacy manifests are skipped silently here -- the source_type check + already emitted the single WARN for the manifest. + + Deliberate piggyback (Round 2 Council, item 1): this is the one + documented exception to the "exactly one WARN per check function" + convention used by the other three v1.5.3 invariants. Both + check_v1_5_3_source_type_validation and this check share + requirements_manifest.json, so emitting a second WARN here would + double-warn for the same legacy file. 
The piggyback is locked in + by test_legacy_manifest_silently_skips in + TestV153SkillSectionConsistency -- a future maintainer reading the + brief and adding a WARN for consistency would break that test. + """ + data = _v150_manifest(q, "requirements_manifest.json") + if data is None: + return + records = data.get("records") + if not isinstance(records, list): + return + if not _is_v1_5_3_shaped(data): + return # source_type check handled the soft warn for this manifest + any_fail = False + for idx, rec in enumerate(records): + if not isinstance(rec, dict): + continue + req_id = rec.get("id", f"<#{idx}>") + source_type = rec.get("source_type") + skill_section = rec.get("skill_section") + if source_type == "skill-section": + if not isinstance(skill_section, str) or not skill_section.strip(): + fail( + "requirements_manifest.json", + f"record_id={req_id}: source_type='skill-section' but " + f"skill_section is empty or missing " + "(schemas.md §10 invariant #21)", + ) + any_fail = True + else: + if skill_section is not None and skill_section != "": + fail( + "requirements_manifest.json", + f"record_id={req_id}: skill_section={skill_section!r} " + f"populated but source_type={source_type!r} is not " + "'skill-section' (schemas.md §10 invariant #21)", + ) + any_fail = True + if not any_fail: + pass_("requirements_manifest.json: v1.5.3 skill_section consistency complete") + + +def check_v1_5_3_divergence_type_validation(q): + """schemas.md §10 invariant #22 — BUG.divergence_type on v1.5.3-shaped manifests. + + Legacy manifest: one WARN, then skip. + v1.5.3-shaped: every BUG MUST have divergence_type populated with a + member of bug_divergence_type (§3.8). 
+ """ + data = _v150_manifest(q, "bugs_manifest.json") + if data is None: + return + records = data.get("records") + if not isinstance(records, list): + return + if not _is_v1_5_3_shaped(data): + warn( + "bugs_manifest.json: legacy manifest detected; treating absent " + "BUG.divergence_type as 'code-spec' per schemas.md §3.10 backward-compat rule" + ) + return + any_fail = False + for idx, rec in enumerate(records): + if not isinstance(rec, dict): + continue + bug_id = rec.get("id", f"<#{idx}>") + divergence_type = rec.get("divergence_type") + if divergence_type not in _V153_VALID_DIVERGENCE_TYPES: + fail( + "bugs_manifest.json", + f"record_id={bug_id}: missing or invalid divergence_type " + f"{divergence_type!r} on v1.5.3-shaped manifest " + f"(schemas.md §10 invariant #22, valid: " + f"{', '.join(_V153_VALID_DIVERGENCE_TYPES)})", + ) + any_fail = True + if not any_fail: + pass_("bugs_manifest.json: v1.5.3 divergence_type validation complete") + + +_V153_COUNCIL_INBOX_ITEM_TYPES = frozenset({ + "rejected-draft", + "tier-5-demotion", + "zero-req-section", + "weak-rationale", +}) + + +def check_v1_5_3_council_inbox_validation(q): + """Phase 3b BLOCK-4 cross-reference + DQ-5 structural validation. + + Validates quality/phase3/pass_d_council_inbox.json against the + DQ-5 schema AND verifies that every Pass D rejection / Tier-5 + demotion has a matching council-inbox item. Without the + cross-reference invariant, a syntactically-valid but functionally + -empty inbox could pass while pass_d_audit.json shows 30+ + rejections -- the inbox population could silently break and the + gate would not catch it. + + Two failure modes: + 1. Structural -- malformed item record, invalid item_type, + missing required field per the DQ-5 schema. + 2. Cross-reference -- pass_d_audit.json entry with outcome in + {rejected, demoted_to_tier_5} has no matching item in the + inbox. + + Phase 3 artifact set is at /quality/phase3/, NOT at the + top-level /quality/. 
The check returns silently if the + phase3 directory does not exist (the project is Code-only or + Phase 3 has not been run yet). + """ + phase3_dir = _resolve_artifact_path(q, "phase3") + if not phase3_dir.is_dir(): + return # phase 3 not run; not in scope for this manifest set + inbox_path = phase3_dir / "pass_d_council_inbox.json" + audit_path = phase3_dir / "pass_d_audit.json" + if not inbox_path.is_file(): + return # phase 3 partially run; skip silently + + inbox_data = load_json(inbox_path) + if not isinstance(inbox_data, dict): + fail(f"{inbox_path.name}: not a valid JSON object") + return + + # Structural validation. + schema_version = inbox_data.get("schema_version") + if schema_version != "1.0": + fail( + f"{inbox_path.name}: schema_version {schema_version!r} " + "does not match the DQ-5 spec value '1.0'" + ) + items = inbox_data.get("items") + if not isinstance(items, list): + fail(f"{inbox_path.name}: 'items' is missing or not a list") + return + + required_fields = { + "item_type", + "draft_idx", + "section_idx", + "section_heading", + "rationale", + "context_excerpt", + "provisional_disposition", + } + for idx, item in enumerate(items): + if not isinstance(item, dict): + fail(f"{inbox_path.name}: item #{idx} is not a JSON object") + continue + missing = required_fields - set(item.keys()) + if missing: + fail( + f"{inbox_path.name}: item #{idx} is missing required " + f"DQ-5 fields: {sorted(missing)}" + ) + if item.get("item_type") not in _V153_COUNCIL_INBOX_ITEM_TYPES: + fail( + f"{inbox_path.name}: item #{idx} has invalid item_type " + f"{item.get('item_type')!r} (valid: " + f"{sorted(_V153_COUNCIL_INBOX_ITEM_TYPES)})" + ) + rationale = item.get("rationale") + if not isinstance(rationale, str) or not rationale.strip(): + fail( + f"{inbox_path.name}: item #{idx} has empty or missing " + "rationale" + ) + + # Cross-reference invariant: every rejected / demoted audit entry + # must have a matching inbox item by (draft_idx, item_type). 
+ if audit_path.is_file(): + audit_data = load_json(audit_path) + if isinstance(audit_data, dict): + inbox_pairs = { + (item.get("draft_idx"), item.get("item_type")) + for item in items + if isinstance(item, dict) + } + for entry in audit_data.get("rejected", []) or []: + if not isinstance(entry, dict): + continue + pair = (entry.get("draft_idx"), "rejected-draft") + if pair not in inbox_pairs: + fail( + f"{inbox_path.name}: pass_d_audit.json shows " + f"rejected draft_idx={entry.get('draft_idx')} " + "but there is no matching rejected-draft item " + "in the council inbox (BLOCK-4 cross-reference " + "invariant violation)" + ) + for entry in audit_data.get("demoted_to_tier_5", []) or []: + if not isinstance(entry, dict): + continue + pair = (entry.get("draft_idx"), "tier-5-demotion") + if pair not in inbox_pairs: + fail( + f"{inbox_path.name}: pass_d_audit.json shows " + f"tier-5 demotion at draft_idx={entry.get('draft_idx')} " + "but there is no matching tier-5-demotion item " + "in the council inbox" + ) + + pass_(f"{inbox_path.name}: v1.5.3 council inbox validation complete") + + +# --------------------------------------------------------------------------- +# Phase 4 skill-project gate enforcement checks (DQ-4-4). +# +# These four checks fire when the target's role map shows skill-prose +# surface; they SKIP (informational `INFO: skipped` line, no fail +# counter increment) on pure-code targets. The check that always runs +# is check_role_map_consistency. +# +# v1.5.4 Part 1: the legacy Code/Skill/Hybrid string is now derived +# from the Phase-1 role map at /exploration_role_map.json. The +# mapping mirrors bin/role_map.py::derive_legacy_project_type. If the +# role map is absent, all four checks SKIP silently — Phase 1 has not +# been run yet on this target. The gate ships into target repos as a +# stdlib-only script and cannot import bin/role_map; the small amount +# of role-map awareness it needs is inlined below. 
+# --------------------------------------------------------------------------- + + +def _load_role_map(q): + """Return the parsed exploration_role_map.json dict, or None when + absent / unparsable. v1.5.4 inline replacement for the prior + project_type.json reader.""" + return load_json(q / "exploration_role_map.json") + + +def _role_map_has_role(role_map, role_set): + if not isinstance(role_map, dict): + return False + files = role_map.get("files") or [] + if not isinstance(files, list): + return False + for entry in files: + if isinstance(entry, dict) and entry.get("role") in role_set: + return True + return False + + +def _phase4_project_type(q): + """Return the v1.5.3-equivalent classification string ('Code' / + 'Skill' / 'Hybrid') derived from the Phase-1 role map, or None + when the role map is absent / unparsable. + + Mapping (mirrors bin/role_map.derive_legacy_project_type): + - has skill-prose AND has code -> 'Hybrid' + - has skill-prose, no code -> 'Skill' + - no skill-prose -> 'Code' + """ + role_map = _load_role_map(q) + if role_map is None: + return None + skill = _role_map_has_role(role_map, ("skill-prose", "skill-reference")) + code = _role_map_has_role(role_map, ("code",)) + if skill and code: + return "Hybrid" + if skill: + return "Skill" + return "Code" + + +def check_skill_section_req_coverage(repo_dir, q): + """Skill / Hybrid: every operational SKILL.md section per + pass_d_section_coverage.json has ≥1 promoted REQ. Meta-allowlist + sections are exempt (their section_kind == 'meta'). 
+
+    SKIPS for Code projects."""
+    print("[Phase 4: skill-section REQ coverage]")
+    classification = _phase4_project_type(q)
+    if classification not in ("Skill", "Hybrid"):
+        info(f"check_skill_section_req_coverage: skip (project_type={classification!r})")
+        return
+    coverage_path = _resolve_artifact_path(q, "phase3/pass_d_section_coverage.json")
+    data = load_json(coverage_path)
+    if not isinstance(data, dict):
+        info(
+            "check_skill_section_req_coverage: skip "
+            "(pass_d_section_coverage.json missing or unparsable)"
+        )
+        return
+    failures = 0
+    for s in data.get("sections", []) or []:
+        if not isinstance(s, dict):
+            continue
+        kind = s.get("section_kind")
+        if kind != "operational":
+            continue
+        promoted = s.get("drafts_promoted", 0) or 0
+        if promoted < 1:
+            heading = s.get("heading") or ""
+            document = s.get("document") or "SKILL.md"
+            section_idx = s.get("section_idx")
+            fail(
+                f"{document}",
+                f"section #{section_idx} {heading!r} has 0 promoted "
+                "REQs and is not in the meta allowlist "
+                "(check_skill_section_req_coverage)",
+            )
+            failures += 1
+    if failures == 0:
+        pass_("check_skill_section_req_coverage: every operational section has ≥1 promoted REQ")
+
+
+def check_reference_file_req_coverage(repo_dir, q):
+    """Skill / Hybrid: every reference file under references/ has ≥1
+    REQ citing it OR a `non-normative` marker in its first
+    5 lines.
+
+    SKIPS for Code projects."""
+    print("[Phase 4: reference-file REQ coverage]")
+    classification = _phase4_project_type(q)
+    if classification not in ("Skill", "Hybrid"):
+        info(f"check_reference_file_req_coverage: skip (project_type={classification!r})")
+        return
+    references_dir = repo_dir / "references"
+    if not references_dir.is_dir():
+        info("check_reference_file_req_coverage: skip (no references/ directory)")
+        return
+    formal_path = _resolve_artifact_path(q, "phase3/pass_c_formal.jsonl")
+    if not formal_path.is_file():
+        info(
+            "check_reference_file_req_coverage: skip "
+            "(pass_c_formal.jsonl missing — Phase 3 not run yet)"
+        )
+        return
+    cited_documents = set()
+    for line in formal_path.read_text(encoding="utf-8").splitlines():
+        if not line.strip():
+            continue
+        try:
+            rec = json.loads(line)
+        except json.JSONDecodeError:
+            continue
+        if not isinstance(rec, dict):
+            continue
+        sd = rec.get("source_document")
+        if isinstance(sd, str):
+            cited_documents.add(sd)
+    failures = 0
+    for ref in sorted(references_dir.glob("*.md")):
+        rel = f"references/{ref.name}"
+        if rel in cited_documents:
+            continue
+        # Non-normative marker check (first 5 lines).
+        head = ref.read_text(encoding="utf-8", errors="replace").splitlines()[:5]
+        if any("non-normative" in line.lower() for line in head):
+            continue
+        fail(
+            rel,
+            "no REQ cites this reference file and no non-normative "
+            "marker in its first 5 lines (check_reference_file_req_coverage)",
+        )
+        failures += 1
+    if failures == 0:
+        pass_("check_reference_file_req_coverage: every reference file has ≥1 citing REQ or non-normative marker")
+
+
+def check_hybrid_cross_cutting_reqs(repo_dir, q):
+    """Hybrid only: ≥1 REQ has triangulated evidence —
+    `source_type=skill-section` AND its acceptance_criteria references
+    a code artifact mentioned in another REQ with
+    `source_type=code-derived`.
+ + SKIPS for Skill or Code projects.""" + print("[Phase 4: hybrid cross-cutting REQs]") + classification = _phase4_project_type(q) + if classification != "Hybrid": + info(f"check_hybrid_cross_cutting_reqs: skip (project_type={classification!r})") + return + formal_path = _resolve_artifact_path(q, "phase3/pass_c_formal.jsonl") + if not formal_path.is_file(): + info( + "check_hybrid_cross_cutting_reqs: skip " + "(pass_c_formal.jsonl missing — Phase 3 not run yet)" + ) + return + skill_section_reqs = [] + code_derived_artifacts = set() + for line in formal_path.read_text(encoding="utf-8").splitlines(): + if not line.strip(): + continue + try: + rec = json.loads(line) + except json.JSONDecodeError: + continue + if not isinstance(rec, dict): + continue + st = rec.get("source_type") + if st == "skill-section": + skill_section_reqs.append(rec) + elif st == "code-derived": + ac = (rec.get("acceptance_criteria") or "") + cite = (rec.get("citation_excerpt") or "") + for token in re.findall( + r"\b([\w./-]+\.(?:py|sh|json))\b", ac + " " + cite + ): + code_derived_artifacts.add(token) + if not code_derived_artifacts: + # On a Hybrid project that hasn't yet produced any code-derived + # REQs, the cross-cutting check has nothing to triangulate + # against. INFO + skip rather than fail (the absence is the + # diagnostic). 
+ info( + "check_hybrid_cross_cutting_reqs: skip " + "(no code-derived REQs in pass_c_formal.jsonl yet)" + ) + return + triangulated = 0 + for rec in skill_section_reqs: + ac = (rec.get("acceptance_criteria") or "") + " " + ( + rec.get("citation_excerpt") or "" + ) + if any(art in ac for art in code_derived_artifacts): + triangulated += 1 + if triangulated >= 1: + break + if triangulated >= 1: + pass_( + f"check_hybrid_cross_cutting_reqs: triangulated evidence " + f"present (≥{triangulated} skill-section REQ references a " + "code-derived artifact)" + ) + else: + fail( + "pass_c_formal.jsonl", + "Hybrid project has no triangulated REQ pair " + "(skill-section REQ referencing a code-derived artifact); " + "check_hybrid_cross_cutting_reqs", + ) + + +def check_role_map_consistency(repo_dir, q): + """All projects: exploration_role_map.json (when present) parses as + a JSON object, declares schema_version '1.0', carries a 'files' + list and a 'breakdown.percentages' dict with the four expected + share keys. + + SKIPS silently when the role map is absent — Phase 1 has not been + run yet on this target. 
v1.5.4 Part 1 replacement for the v1.5.3 + check_project_type_consistency, which keyed on + quality/project_type.json (now retired).""" + print("[Phase 4: role-map consistency]") + rm_path = q / "exploration_role_map.json" + if not rm_path.is_file(): + info( + "check_role_map_consistency: skip " + "(exploration_role_map.json absent — Phase 1 not run yet)" + ) + return + data = load_json(rm_path) + if not isinstance(data, dict): + fail( + f"{rm_path.relative_to(q.parent)}", + "exploration_role_map.json is not a valid JSON object", + ) + return + if data.get("schema_version") != "1.0": + fail( + f"{rm_path.relative_to(q.parent)}", + f"schema_version {data.get('schema_version')!r} is not '1.0' " + "(check_role_map_consistency)", + ) + return + files = data.get("files") + if not isinstance(files, list): + fail( + f"{rm_path.relative_to(q.parent)}", + "'files' is not a list (check_role_map_consistency)", + ) + return + breakdown = data.get("breakdown") + if not isinstance(breakdown, dict): + fail( + f"{rm_path.relative_to(q.parent)}", + "'breakdown' is not an object (check_role_map_consistency)", + ) + return + percentages = breakdown.get("percentages") + if not isinstance(percentages, dict): + fail( + f"{rm_path.relative_to(q.parent)}", + "'breakdown.percentages' is not an object " + "(check_role_map_consistency)", + ) + return + missing = [ + k for k in ("skill_share", "code_share", "tool_share", "other_share") + if k not in percentages + ] + if missing: + fail( + f"{rm_path.relative_to(q.parent)}", + f"breakdown.percentages missing keys: {missing} " + "(check_role_map_consistency)", + ) + return + derived = _phase4_project_type(q) or "Unknown" + pass_( + f"{rm_path.relative_to(q.parent)}: role map well-formed " + f"(legacy-derived project type {derived!r}; " + "check_role_map_consistency)" + ) + + +def check_v1_5_2_cardinality_gate(repo_dir): + """v1.5.2 Lever 3: Phase 5 cardinality reconciliation gate. 
+ + Surfaces every failure from validate_cardinality_gate() as a fail() entry. + """ + failures = validate_cardinality_gate(repo_dir) + if not failures: + pass_("compensation_grid.json: v1.5.2 cardinality gate clean") + return + for msg in failures: + fail("compensation_grid.json", msg) + + +def check_v1_5_0_gate_invariants(repo_dir, q): + """Dispatcher that runs every Layer-1 mechanical check from schemas.md §10.""" + check_v1_5_0_cite_extensions(repo_dir) + check_v1_5_0_manifest_wrappers(q) + check_v1_5_0_requirements_manifest(repo_dir, q) + check_v1_5_0_bugs_manifest(q) + check_v1_5_0_index_md(q) + # Phase 6 invariant #17 runs after requirements_manifest so it sees + # shape-validated REQ records. + check_v1_5_0_semantic_check(q) + # v1.5.1 Item 5.2: challenge-gate coverage runs last. It depends on + # requirements_manifest.json for the "No spec basis" pattern but + # does not redo schema checks that the prior invariants already cover. + check_challenge_gate_coverage(q) + # v1.5.2 Lever 3: cardinality reconciliation gate. + check_v1_5_2_cardinality_gate(repo_dir) + # v1.5.3 Phase 2: schema extensions for skill-aware projects (Code projects + # with legacy manifests hit the soft-warn path; v1.5.3-shaped manifests + # validate strictly per schemas.md §10 invariants #21–#23). + check_v1_5_3_formal_doc_role_validation(q) + check_v1_5_3_source_type_validation(q) + check_v1_5_3_skill_section_consistency(q) + check_v1_5_3_divergence_type_validation(q) + # v1.5.3 Phase 3b: council inbox structural + cross-reference + # validation (DQ-5 + BLOCK-4). No-op for Code projects (phase3 + # directory is absent). + check_v1_5_3_council_inbox_validation(q) + # v1.5.3 Phase 4 (DQ-4-4): skill-project gate enforcement. The + # first three SKIP for code-only projects (no skill-prose surface + # in the role map); check_role_map_consistency runs for all + # projects. v1.5.4 Part 1: project_type derived from the Phase-1 + # role map instead of the retired project_type.json. 
+ check_skill_section_req_coverage(repo_dir, q) + check_reference_file_req_coverage(repo_dir, q) + check_hybrid_cross_cutting_reqs(repo_dir, q) + check_role_map_consistency(repo_dir, q) + + +def check_repo(repo_dir, version_arg, strictness): + """Run all checks for one repo. Writes output via pass_/fail_/warn/info.""" + repo_dir = Path(repo_dir) + if str(repo_dir) == ".": + repo_dir = Path.cwd() + repo_name = repo_dir.name + q = repo_dir / "quality" + + print("") + print(f"=== {repo_name} ===") + + check_file_existence(repo_dir, q, strictness) + bug_count, bug_ids = check_bugs_heading(q) + tdd_data = check_tdd_sidecar(q, bug_count) + check_tdd_logs(q, bug_count, bug_ids, tdd_data) + check_integration_sidecar(q, strictness) + check_recheck_sidecar(q) + check_use_cases(repo_dir, q, strictness) + check_test_file_extension(repo_dir, q) + check_terminal_gate(q) + check_mechanical(q) + check_patches(q, bug_count, bug_ids, strictness) + check_writeups(q, bug_count) + skill_version = check_version_stamps(repo_dir, q) + check_cross_run_contamination(repo_dir, q, version_arg, skill_version) + check_run_metadata(q) + check_v1_5_0_gate_invariants(repo_dir, q) + + print("") + + +# --- Main --- + + +def main(argv=None): + _reset_counters() + if argv is None: + argv = sys.argv[1:] + + repo_dirs = [] + version = "" + check_all = False + strictness = "benchmark" + + expect_version = False + for arg in argv: + if expect_version: + version = arg + expect_version = False + continue + if arg == "--version": + expect_version = True + elif arg == "--all": + check_all = True + elif arg == "--benchmark": + strictness = "benchmark" + elif arg == "--general": + strictness = "general" + else: + repo_dirs.append(arg) + + if not version: + version = detect_skill_version([ + SCRIPT_DIR / ".." 
/ "SKILL.md", + SCRIPT_DIR / "SKILL.md", + Path("SKILL.md"), + Path(".claude") / "skills" / "quality-playbook" / "SKILL.md", + Path(".github") / "skills" / "SKILL.md", + Path(".github") / "skills" / "quality-playbook" / "SKILL.md", + ]) + + # Resolve repos + if check_all: + for entry in sorted(SCRIPT_DIR.glob(f"*-{version}")): + if (entry / "quality").is_dir(): + repo_dirs.append(str(entry)) + elif len(repo_dirs) == 1 and repo_dirs[0] == ".": + repo_dirs = [str(Path.cwd())] + else: + resolved = [] + for name in repo_dirs: + p = Path(name) + if (p / "quality").is_dir(): + resolved.append(name) + elif (SCRIPT_DIR / f"{name}-{version}").is_dir(): + resolved.append(str(SCRIPT_DIR / f"{name}-{version}")) + elif (SCRIPT_DIR / name).is_dir(): + resolved.append(str(SCRIPT_DIR / name)) + else: + print(f"WARNING: Cannot find repo '{name}'") + repo_dirs = resolved + + if not repo_dirs: + print(f"Usage: {sys.argv[0]} [--version V] [--all | repo1 repo2 ... | .]") + return 1 + + print("=== Quality Gate — Post-Run Validation ===") + print(f"Version: {version or 'unknown'}") + print(f"Strictness: {strictness}") + print(f"Repos: {len(repo_dirs)}") + + for rd in repo_dirs: + check_repo(rd, version, strictness) + + print("") + print("===========================================") + print(f"Total: {FAIL} FAIL, {WARN} WARN") + if FAIL > 0: + print(f"RESULT: GATE FAILED — {FAIL} check(s) must be fixed") + return 1 + else: + print("RESULT: GATE PASSED") + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/skills/quality-playbook/references/challenge_gate.md b/skills/quality-playbook/references/challenge_gate.md new file mode 100644 index 000000000..7db41feb0 --- /dev/null +++ b/skills/quality-playbook/references/challenge_gate.md @@ -0,0 +1,106 @@ +# Challenge Gate — Bug Validity Review + +## Purpose + +The challenge gate is a self-adversarial review that every confirmed bug must survive before receiving a writeup and regression test. 
It catches false positives, over-classified feature gaps, and findings where pattern-matching overrode common sense. + +The gate can be invoked two ways: + +1. **During a playbook run** — automatically applied to bugs matching trigger patterns (see below). +2. **Standalone** — pointed at a `quality/` directory from a prior run to challenge specific bugs. Example: `"Read quality/writeups/BUG-042.md and the source code it references. Run the challenge gate on this bug."` + +## The two-round challenge + +For each bug under review, run exactly two rounds. Each round uses a fresh sub-agent so the challenger has no investment in the finding. + +### Round 1: "Does this strike you as a real bug?" + +Provide the sub-agent with: +- The bug writeup (or BUGS.md entry if no writeup yet) +- The actual source code at the cited file:line (read it fresh — do not trust the writeup's code snippet) +- All comments within 10 lines above and below the cited location +- The project's README section on the relevant feature (if any) + +Prompt the sub-agent: + +> You are reviewing a bug report filed against an open-source project. Read the source code and the bug report below. Then answer: **does this strike you as a real bug?** +> +> **Before analyzing anything, apply common sense.** Step back from the details and ask yourself: if you showed this code and this bug report to a senior developer who has never seen either before, would they say "yes, that's a bug" — or would they say "that's obviously not a bug"? If the answer is obviously not a bug, say so immediately and explain why. Do not rationalize your way past a common-sense answer. The goal of this review is to catch findings where pattern-matching overrode judgment. +> +> Then consider: +> - Is the developer aware of this behavior? (Look for comments, TODO markers, design decision notes, WHY annotations, OODA references.) +> - Is this a documented limitation or intentional trade-off? 
(Check if other code paths handle this differently by design, not by accident.) +> - Would the project maintainer respond "that's not a bug, that's how it works" or "that's a known limitation we documented"? +> - Is the "expected behavior" in the bug report actually required by any spec, or is it the auditor's opinion about what the code should do? +> - Is this development scaffolding? Values with names like "change-me", "placeholder", "example", "default", "TODO" are not defects — they are self-documenting markers that exist to make the project buildable during development. A feature that is disabled by default and uses placeholder values is an incomplete feature, not a vulnerability. +> +> Give your honest assessment. If it's a real bug, say so and explain why. If it's not, say so and explain why. A finding can be "not a bug" even if the code could be improved — the question is whether a reasonable maintainer would accept this as a defect report. + +### Round 2: Targeted follow-up + +Based on the Round 1 response, generate a single pointed follow-up question. The goal is to stress-test whatever position the sub-agent took in Round 1. + +**If Round 1 said "real bug":** The follow-up should challenge the finding from the maintainer's perspective. Use a fresh sub-agent with this framing: + +> You are the maintainer of this project. A contributor filed this bug report. You wrote the code being criticized. Read the code, the bug report, and the Round 1 assessment below. +> +> Write the single most compelling argument for why this is NOT a bug. Consider: intentional design decisions, documented limitations, deployment context, common patterns in this language/framework, and whether the "expected behavior" is actually specified anywhere authoritative. +> +> Then, after making that argument, state whether you still believe it's a real bug or whether the argument convinced you it's not. + +**If Round 1 said "not a bug":** The follow-up should challenge the dismissal. 
Use a fresh sub-agent with this framing: + +> You are a security researcher reviewing this codebase. Another reviewer dismissed this finding as "not a bug." Read the code, the bug report, and the Round 1 dismissal below. +> +> Write the single most compelling argument for why this IS a real bug despite the dismissal. Consider: edge cases the dismissal didn't address, downstream consequences, what happens when the code interacts with other components, and whether "intentional" and "correct" are the same thing. +> +> Then, after making that argument, state whether you believe the finding should be confirmed or dismissed. + +### Verdict + +After both rounds, assign one of three verdicts: + +- **CONFIRMED** — Both rounds agree it's a real bug, or Round 2's challenge failed to undermine Round 1's confirmation. The bug proceeds to writeup and regression test as normal. +- **DOWNGRADED** — The bug is real but the severity was inflated. Adjust severity and update the writeup. Common case: a CRITICAL that should be MEDIUM, or a security finding that's actually a code quality issue. +- **REJECTED** — The finding is not a bug. It's a documented design choice, intentional scaffolding, a known limitation, the auditor's opinion rather than a spec violation, or something that common sense says is obviously not a defect. Remove the bug ID. If useful, record it in a "Reviewed and dismissed" appendix in BUGS.md with the challenge reasoning. + +Write the verdict and both rounds' reasoning to `quality/challenge/BUG-NNN-challenge.md`. This file is the audit trail — it shows reviewers that each finding was stress-tested. + +## Auto-trigger patterns + +During a playbook run, automatically apply the challenge gate to any bug matching one or more of these patterns. 
These patterns are where false positives concentrate: + +| Pattern | Why it triggers | Example | +|---------|----------------|---------| +| **Security-class finding** (credential leak, auth bypass, injection) | Severity calibration auto-escalates these; the model is incentivized to defend them | BUG-041: "hardcoded JWT secret" that was a development placeholder | +| **Code contains design-decision comments at the cited location** | WHY comments, OODA references, TODO-with-explanation, or design decision docs within 10 lines of the cited code suggest the developer made a conscious choice | BUG-007/008: `// WHY-OODA81: Batch upload uses "default" workspace` | +| **The "expected behavior" has no spec basis** | Bug's spec_basis field says "code inconsistency" rather than citing a spec document, or the requirement was invented by the auditor (Tier 3 / REQ-NNN created during the run) | BUG-041: REQ-019 was created by the auditor, not derived from project docs | +| **Another code path handles the same concern differently** | If text_upload does X but file_upload doesn't, that might be a real inconsistency — or it might be intentional divergence. The challenge sorts out which. | BUG-001/002: text_upload merges source_ids, file_upload overwrites — challenge confirms this is a real bug because text_upload has an explicit fix comment | +| **The finding is about missing functionality rather than incorrect behavior** | "This handler doesn't do X" is often a feature gap, not a bug. The challenge checks whether X was ever promised. | BUG-009/029: batch upload "missing" graph writes that were never part of the batch upload's documented scope | + +The pattern list is intentionally conservative — it triggers on categories with historically high false-positive rates. Bugs that don't match any pattern skip the challenge gate and proceed directly to writeup. 
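As an illustration of how the trigger patterns could be checked mechanically, here is a minimal sketch. The record field names (`severity_class`, `spec_basis`, `comments_near_citation`, `kind`) are hypothetical for this example — they are not part of the playbook's actual manifest schema:

```python
# Hypothetical sketch of the auto-trigger decision. All field names
# are illustrative, not the playbook's real schema.
SECURITY_CLASSES = {"credential-leak", "auth-bypass", "injection"}

def challenge_gate_triggers(bug):
    """Return the list of trigger patterns a bug record matches."""
    triggered = []
    if bug.get("severity_class") in SECURITY_CLASSES:
        triggered.append("security-class finding")
    comments = bug.get("comments_near_citation", "")
    if any(marker in comments for marker in ("WHY", "OODA", "TODO")):
        triggered.append("design-decision comment at cited location")
    spec_basis = (bug.get("spec_basis") or "").lower()
    if "code inconsistency" in spec_basis or not spec_basis:
        triggered.append("expected behavior has no spec basis")
    if bug.get("kind") == "missing-functionality":
        triggered.append("missing functionality, not incorrect behavior")
    return triggered

bug = {"severity_class": "injection", "spec_basis": "code inconsistency"}
print(challenge_gate_triggers(bug))
```

A bug matching any returned pattern goes through the two-round challenge; an empty list means it proceeds directly to writeup.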
+ +To add new patterns: append a row to the table above with the pattern description, the reasoning, and a concrete example from a prior run. + +## Standalone invocation + +When invoked standalone (not during a playbook run), the challenge gate: + +1. Reads the specified bug writeup from `quality/writeups/BUG-NNN.md` +2. Reads the source code at the cited file:line (fresh read, not from the writeup) +3. Runs both rounds as described above +4. Writes the verdict to `quality/challenge/BUG-NNN-challenge.md` +5. If the verdict is REJECTED, suggests removing the bug from BUGS.md and tdd-results.json + +Example prompt for standalone use: +``` +Read the quality playbook skill at .github/skills/SKILL.md and .github/skills/references/challenge_gate.md. +Run the challenge gate on BUG-042 using the writeup at quality/writeups/BUG-042.md +and the source code in this repo. +``` + +## Token budget + +Each bug costs roughly 2 sub-agent calls. For a typical run with 5-10 auto-triggered bugs, that's 10-20 sub-agent calls. This is significantly cheaper than a full iteration cycle and catches the highest-value false positives. + +For runs with many security findings (>15 auto-triggered), consider batching: run Round 1 on all triggered bugs first, then only run Round 2 on bugs where Round 1 was ambiguous or where the confidence was low. diff --git a/skills/quality-playbook/references/code-only-mode.md b/skills/quality-playbook/references/code-only-mode.md new file mode 100644 index 000000000..e460e0616 --- /dev/null +++ b/skills/quality-playbook/references/code-only-mode.md @@ -0,0 +1,59 @@ +# Code-only mode + +*Last updated: 2026-05-03 (v1.5.6 Phase 3 — initial publication).* + +When the Quality Playbook runs against a target repo whose `reference_docs/` directory is absent or empty, it operates in **code-only mode**. This document explains what that means, why it matters, and how to upgrade a code-only run into a full-documentation run for the next pass. 
+ +## What "code-only mode" means + +The playbook's normal Phase 1 derivation reads two kinds of evidence: + +- **Code evidence (Tier 3+)** — the source tree itself, plus inline comments, defensive patterns, tests, and any inline documentation co-located with the code. +- **Documentation evidence (Tier 1/2)** — plaintext files the operator drops into `reference_docs/` (free-form notes, design docs, retrospectives, AI chats) and `reference_docs/cite/` (project specs, RFCs, API contracts that requirements should be traceable back to). + +Code-only mode is the run state where no documentation evidence is available. The playbook proceeds — it does not abort — but every requirement it derives leans entirely on code evidence. The Phase 1 EXPLORATION.md gets a "Documentation status: code-only mode" opening section that surfaces the mode so reviewers see it on first read. + +## What to expect from a code-only run + +In our benchmark runs, code-only passes consistently produce: + +- **Fewer requirements derived overall.** Without spec-language to anchor, Phase 1 has no Tier 1/2 evidence to cite, so the requirements set falls back to Tier 3 (code-as-spec) entirely. +- **Possibly fewer bugs found.** Code review (Phase 3) is most effective when the reviewer knows what the code is *supposed* to do — bugs that violate documented intent are easier to surface than bugs that hide behind ambiguous code-as-spec. With no documentation, the reviewer has to infer intent from the code itself, which leaves a class of intent-violation defects undetected. +- **Higher reliance on code-internal signals.** Defensive patterns (error checks, validation), test names, and comment-style annotations carry more weight in the absence of external docs. + +The bug counts in code-only mode are still useful — they reflect what's discoverable from the code alone — but they are a lower bound on what a fully-documented run would produce. 
+ +## How to upgrade to a full-documentation run + +Place plaintext documentation files in the target repo's `reference_docs/` tree before re-running Phase 1: + +``` +/ + reference_docs/ + project_notes.md # Tier 4 — informal notes, AI chats + design_overview.md # Tier 3-4 — internal design decisions + cite/ + api_spec.md # Tier 1/2 — citable specs, RFCs, contracts + protocol_v3.txt # Tier 1/2 — formal specifications +``` + +Files at the top level of `reference_docs/` count as informal context (Tier 4). Files under `reference_docs/cite/` count as citable evidence (Tier 1 or 2 depending on the source's authority — see `schemas.md` §3.1). Both `.md` and `.txt` are recognized; other formats are ignored. + +After dropping in documentation, re-run the playbook. Phase 1 will detect the populated `reference_docs/` and skip the code-only-mode downgrade. The new run's EXPLORATION.md, REQUIREMENTS.md, and BUGS.md will reflect the richer evidence base. + +## Opt-out: `--require-docs` + +Operators who want runs to abort instead of proceeding in code-only mode can pass `--require-docs` to `python3 -m bin.run_playbook` (v1.5.6+). When `--require-docs` is set and `reference_docs/` is empty at Phase 1 entry, the playbook: + +1. Appends an `aborted_missing_docs` event to `quality/run_state.jsonl` (event type registered in `references/run_state_schema.md`). +2. Writes a clear `ERROR: aborted_missing_docs — reference_docs/ empty and --require-docs set` block to `quality/PROGRESS.md`. +3. Aborts before any LLM work (exit non-zero, same as a gate-fail). + +The flag is off by default. Use it for compliance/policy contexts where a quiet code-only-mode downgrade would mask a real process gap (e.g., "every release run must cite a spec; no spec means the run shouldn't have started"). The flag is the opt-IN counterpart to `--no-formal-docs`'s opt-OUT (which suppresses the WARN banner for the same code-only-mode case but allows the run to continue). 
+ +## Cross-references + +- **README** — Step 1 of "How to use the Quality Playbook" describes documentation as the first thing to provide. +- **`SKILL.md`** — Phase 1 prose describes how documentation evidence is used during exploration. +- **`bin/reference_docs_ingest.py`** — the implementation that ingests the `reference_docs/` tree. +- **`references/run_state_schema.md`** — defines the `documentation_state` event the playbook emits when code-only mode triggers, so the downgrade is searchable in audit trails. diff --git a/skills/quality-playbook/references/defensive_patterns.md b/skills/quality-playbook/references/defensive_patterns.md index 05070576a..281239768 100644 --- a/skills/quality-playbook/references/defensive_patterns.md +++ b/skills/quality-playbook/references/defensive_patterns.md @@ -155,6 +155,26 @@ State machines are a special category of defensive pattern. When you find status 3. Look for states you can enter but never leave (terminal state without cleanup) 4. Look for operations that should be available in a state but are blocked by an incomplete guard +## Enumeration and Whitelist Completeness + +When a function uses `switch`/`case`, `match`, if-else chains, or any dispatch construct to handle a set of named constants (feature bits, enum values, command codes, event types, permission flags), perform the **two-list enumeration check**: + +1. **List A (defined):** Extract every constant from the relevant header, enum, or spec that the code should handle. Use grep — do not list from memory. +2. **List B (handled):** Extract every case label, branch condition, or map key from the dispatch code. Use grep or line-by-line read — do not summarize. +3. **Diff:** Compare the two lists. Any constant in A but not in B is a potential gap. Any constant in B but not in A is a potential dead case. + +**Why this exists:** AI models reliably hallucinate completeness for switch/case constructs. 
The model sees a function with many case labels, sees constants defined elsewhere, and concludes all constants are handled without actually checking. In one observed case, the model asserted that a kernel feature-bit whitelist "preserves supported ring transport bits including VIRTIO_F_RING_RESET" when that constant was entirely absent from the switch — the model hallucinated coverage because the constant existed in a header the function's callers used. The mechanical two-list check is the only reliable countermeasure. + +**Triage verification probes must produce executable evidence.** When triage confirms or rejects an enumeration finding via verification probe, prose reasoning alone is insufficient. The probe must produce a test assertion for each constant: `assert "case VIRTIO_F_RING_RESET:" in source_of("vring_transport_features"), "RING_RESET at line NNN"`. This rule exists because in v1.3.16, the triage correctly received a minority finding about RING_RESET but rejected it with a hallucinated claim that "lines 3527-3528 explicitly preserve RING_RESET" — those lines were actually the `default:` branch. Had the triage been forced to write an assertion, it would have failed, exposing the hallucination. + +**Code-side lists must be extracted from the code, not copied from requirements.** When performing the two-list check in the code review or spec audit, the "handled" list must be extracted directly from the function body with per-item line numbers. Do not copy from REQUIREMENTS.md, CONTRACTS.md, the audit prompt, or any other generated artifact. If the two lists (code-extracted vs. requirements-claimed) are word-for-word identical, that is a red flag that the code list was copied — redo the extraction. In v1.3.17, the code review's "case labels present" list was identical to the requirements list, proving it was copied rather than extracted. Three spec auditors then inherited this false list and none independently verified. 
The per-item line-number citation prevents this: you cannot cite "line 3527: `case VIRTIO_F_RING_RESET:`" when line 3527 actually contains `default:`. + +**Mechanical verification artifacts outrank prose lists.** If `quality/mechanical/_cases.txt` exists for a dispatch function, use it as the authoritative source for what the function handles. Do not replace it with a hand-written list. If no mechanical artifact exists, generate one using a non-interactive shell pipeline (e.g., `awk` + `grep`) before writing contracts or requirements about the function's coverage. + +**Artifact integrity risk:** In v1.3.19 testing, the model executed the correct extraction command but wrote its own fabricated output to the file instead of letting the shell redirect capture it. The fabricated file included a hallucinated `case VIRTIO_F_RING_RESET:` line that the real command does not produce. To mitigate: `quality/mechanical/verify.sh` re-runs every extraction command and diffs against saved files. If any diff is non-empty, the artifact was tampered with and must be regenerated. + +**Where to apply:** Feature-bit negotiation functions, protocol message dispatchers, permission check switches, configuration option handlers, codec/format registration tables, HTTP method/status code handlers, and any function where a `default:` or `else` clause silently drops unrecognized values. + **Converting state machine gaps to scenarios:** ```markdown diff --git a/skills/quality-playbook/references/exploration_patterns.md b/skills/quality-playbook/references/exploration_patterns.md new file mode 100644 index 000000000..9482b6aa8 --- /dev/null +++ b/skills/quality-playbook/references/exploration_patterns.md @@ -0,0 +1,339 @@ +# Exploration Patterns for Bug Discovery + +This reference defines the exploration patterns that Phase 1 applies during codebase exploration. These patterns target bug classes most commonly missed when exploration stays at the subsystem or architecture level. 
+ +Requirements problems are the most expensive to fix because they are not caught until after implementation. The exploration phase is requirements elicitation — it determines what the code review and spec audit will look for. A requirement that is never derived is a bug that is never found. These patterns exist to systematically surface requirements that broad exploration misses. + +Each pattern includes a definition, the bug class it targets, diverse examples from different domains, and the expected output format for EXPLORATION.md. + +**Important: These patterns supplement free exploration — they do not replace it.** Phase 1 begins with open-ended exploration driven by domain knowledge and codebase understanding. After that open exploration, apply the patterns below as a structured second pass to catch specific bug classes. If you find yourself only looking for things the patterns describe, you are using them wrong. The patterns are a checklist to run after you have already formed your own understanding of the codebase's risks. + +--- + +## Pattern 1: Fallback and Degradation Path Parity + +### Definition + +When code provides multiple strategies for accomplishing the same goal — a primary path and one or more fallback paths — each fallback must preserve the same behavioral invariants as the primary. The fallback may use a different mechanism, but the observable contract must be equivalent. + +### Bug class + +Fallback paths are written later, tested less, and reviewed with less scrutiny than primary paths. They often omit steps the primary path performs (validation, cleanup, index assignment, resource release) because the developer copied the primary path and simplified it for the "degraded" case. The result is a function that works correctly in the common case but violates its contract when the fallback activates. 
+ +### Examples across domains + +- **Authentication:** A web service tries OAuth token validation, falls back to API key lookup, falls back to session cookie. Each fallback must enforce the same authorization scope. Bug: the API key fallback skips scope validation and grants full access. +- **Connection pooling:** A database client tries the primary connection pool, falls back to a secondary pool, falls back to creating a one-off connection. Each path must apply the same timeout and transaction isolation settings. Bug: the one-off connection fallback uses the driver default isolation level instead of the configured one. +- **Resource allocation:** A memory allocator tries a fast slab path, falls back to a slow page-level path. Both must zero-initialize sensitive fields. Bug: the slow path returns uninitialized memory because zero-fill was only in the slab fast path. +- **HTTP redirect handling:** A client follows a redirect and must strip security-sensitive headers (Authorization, Proxy-Authorization, cookies) when the redirect crosses an origin boundary. Bug: the redirect path strips Authorization but not Proxy-Authorization, leaking proxy credentials to the redirected origin. +- **Serialization fallback:** A message broker tries binary serialization, falls back to JSON, falls back to string encoding. Each path must preserve the same field ordering and null-handling semantics. Bug: the JSON fallback silently drops null fields that binary serialization preserves. + +### How to apply + +For each core module, look for: conditional chains that try one approach then fall through to another, strategy/adapter patterns where multiple implementations are selected at runtime, retry logic with different strategies per attempt, feature-negotiation cascades where capabilities determine which code path runs, HTTP redirect/retry logic that must preserve or strip headers. + +For each cascade found: +1. List the primary path and every fallback. +2. 
For each fallback, check whether it performs the same critical operations as the primary (validation, resource setup, index assignment, cleanup, error reporting, header stripping, resource release). +3. Any operation present in the primary but missing in a fallback is a candidate requirement. + +### EXPLORATION.md output format + +``` +## Fallback Path Analysis + +### [Name of cascade] +- **Primary path:** [function, file:line] — [what it does] +- **Fallback 1:** [function, file:line] — [what it does, what differs] +- **Fallback 2:** [function, file:line] — [what it does, what differs] +- **Parity gaps:** [specific operations present in primary but missing in fallback] +- **Candidate requirements:** REQ-NNN: [fallback must do X] +``` + +--- + +## Pattern 2: Dispatcher Return-Value Correctness + +### Definition + +When a function dispatches on input type or condition and must return a status value, the return value must be correct for every combination of inputs — not just the primary case. Dispatchers that handle multiple event types, request types, or state transitions are particularly prone to return-value bugs in edge combinations. + +### Bug class + +Dispatchers are typically written and tested for the common case. The return value is correct when the primary event fires. But when an unusual combination occurs (only a secondary event, no events at all, multiple concurrent events), the return-value logic may be wrong — returning "not handled" for a handled event, returning success for a partial failure, or returning a stale value from a previous iteration. + +### Examples across domains + +- **HTTP middleware:** A request dispatcher checks for authentication, rate-limiting, and routing. When rate-limiting triggers but authentication was already set, the dispatcher returns the auth status code instead of the rate-limit status code. Bug: rate-limited requests get 401 instead of 429. 
+- **CORS handler chain:** A CORS preflight handler sets 400 (rejected), then the missing-OPTIONS-handler path sets 404, then an AFTER handler normalizes 404→200 (meant for allowed origins). Bug: rejected preflights get 200 because the status was overwritten by downstream handlers. +- **Event loop:** A poll/select loop handles read-ready, write-ready, and error conditions. When only an error condition fires on a socket with no pending reads, the loop returns "no events" because the read-ready check was false. Bug: connection errors are silently ignored. +- **State machine transition:** A state machine dispatch function handles valid transitions, invalid transitions, and no-op transitions. When a no-op transition occurs (current state == target state), the function returns an error code intended for invalid transitions. Bug: idempotent operations fail when they should succeed. +- **Interrupt handler:** A hardware interrupt handler checks for multiple event types (data-ready, configuration-change, error). When only a secondary event fires (e.g., config change with no data), the handler returns "not mine" because the primary event check failed and the secondary path doesn't set the handled flag. Bug: legitimate secondary events are reported as spurious. + +### How to apply + +For each core module, look for: functions with switch/case or if-else chains that return a status, interrupt/event handlers that handle multiple event types, request dispatchers that check multiple conditions before returning, state machine transition functions, middleware chains where multiple handlers write to the same response status. + +For each dispatcher found: +1. Enumerate all input combinations (not just the ones with explicit case labels — also the implicit "else" and "default" paths). +2. For each combination, trace the return value through the entire handler chain (not just the immediate function). +3. 
Any combination where the return value doesn't match the expected semantics is a candidate requirement. + +### EXPLORATION.md output format + +``` +## Dispatcher Return-Value Analysis + +### [Function name] at [file:line] +- **Input types:** [list of conditions/events the function dispatches on] +- **Combinations checked:** + - [Condition A only]: returns [X] — correct/incorrect because [reason] + - [Condition B only]: returns [X] — correct/incorrect because [reason] + - [Both A and B]: returns [X] — correct/incorrect because [reason] + - [Neither A nor B]: returns [X] — correct/incorrect because [reason] +- **Candidate requirements:** REQ-NNN: [function must return Y when only B fires] +``` + +--- + +## Pattern 3: Cross-Implementation Contract Consistency + +### Definition + +When multiple functions implement the same logical operation for different contexts (different transports, different backends, different protocol versions), they should all satisfy the same specification requirement. A step that is mandatory in the specification must appear in every implementation — a missing step in one implementation that is present in another is a strong bug signal. + +### Bug class + +When the same operation is implemented in multiple places, each implementation is typically written by a different developer or at a different time. The specification says "reset must wait for completion," and the developer of implementation A writes the wait loop, but the developer of implementation B writes only the reset trigger and forgets the wait. The bug is invisible when testing implementation B in isolation because it "works" on fast hardware — the race condition only manifests under load or on slow devices. + +### Examples across domains + +- **Device reset:** A spec says "the driver must write zero and then poll until the status register reads back zero." The PCI implementation includes the poll loop. The MMIO implementation writes zero but does not poll. 
Bug: MMIO reset can race with reinitialization. +- **Database driver:** A connection-close spec says "the driver must send a termination message, wait for acknowledgment, then release the socket." The PostgreSQL driver does all three. The MySQL driver sends the termination message and releases the socket without waiting for acknowledgment. Bug: the server may process the termination after the socket is reused. +- **HTTP header encoding:** A Headers class constructor decodes raw bytes as Latin-1 per RFC 7230. The mutation method (`__setitem__`) encodes values as UTF-8. Bug: round-tripping a Latin-1 header through get-then-set corrupts the value because the encoding changed. +- **Cache invalidation:** A cache spec says "invalidation must remove the entry and notify all subscribers." The in-memory cache does both. The distributed cache removes the entry but does not broadcast the notification. Bug: other nodes serve stale data. +- **File locking:** A storage spec says "lock acquisition must set a timeout and clean up on failure." The local filesystem implementation sets the timeout. The NFS implementation uses blocking lock with no timeout. Bug: NFS lock contention can hang the process indefinitely. + +### How to apply + +For each core module, look for: the same operation name implemented in multiple files or classes, interface/trait implementations across different backends, protocol-version-specific implementations of the same message, transport-specific implementations of the same lifecycle operation, constructor vs. mutation implementations of the same logical operation. + +For each pair (or set) of implementations: +1. Identify the specification requirement they share. +2. List the mandatory steps from the spec. +3. Check each implementation for each step. +4. Any step present in one but missing in another is a candidate requirement. 
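The four steps above can be mechanized with a simple source scan. A sketch, with the spec markers, regexes, and implementation sources all invented for illustration:

```python
import re

# Mandatory steps from a hypothetical reset spec, as source-level markers.
SPEC_STEPS = {
    "write_zero": r"write_status\(\s*0\s*\)",
    "poll_until_zero": r"while\b.*read_status\(\)",
}

# Stand-ins for the real implementation bodies (step 3 reads these).
IMPLEMENTATIONS = {
    "pci_reset": (
        "def pci_reset(dev):\n"
        "    dev.write_status(0)\n"
        "    while dev.read_status() != 0:\n"
        "        pass\n"
    ),
    "mmio_reset": (
        "def mmio_reset(dev):\n"
        "    dev.write_status(0)  # no completion poll\n"
    ),
}


def missing_steps(source: str) -> list[str]:
    """Spec-mandated steps absent from one implementation (steps 3-4)."""
    return [name for name, pat in SPEC_STEPS.items()
            if not re.search(pat, source)]


gaps = {name: missing_steps(src) for name, src in IMPLEMENTATIONS.items()}
# gaps -> {"pci_reset": [], "mmio_reset": ["poll_until_zero"]}
```

Each non-empty entry in `gaps` is a candidate requirement of the form "all implementations of reset must poll until completion."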
+ +**Check every cross-transport operation, not just the most obvious one.** If a codebase has multiple transports (PCI, MMIO, vDPA) or backends (PostgreSQL, MySQL), enumerate all operations that have cross-implementation equivalents — reset, interrupt handling, feature negotiation, queue setup, configuration access — and check each one. The first cross-implementation gap you find is rarely the only one. A common failure mode is analyzing reset thoroughly and then skipping interrupt dispatch, which has the same cross-transport structure. + +### EXPLORATION.md output format + +``` +## Cross-Implementation Consistency + +### [Operation name] — [spec reference] +- **Implementation A:** [function, file:line] — performs steps: [1, 2, 3] +- **Implementation B:** [function, file:line] — performs steps: [1, 3] (missing step 2) +- **Gap:** [Implementation B missing step 2: description] +- **Candidate requirements:** REQ-NNN: [all implementations of X must perform step 2] +``` + +--- + +## Pattern 4: Enumeration and Representation Completeness + +### Definition + +When a codebase maintains a closed set of recognized values — a switch/case whitelist, an array of valid constants, an enum/tagged-union definition, a trait/visitor method family, a set of schema keywords, a registry of accepted entries — every value that the specification, upstream definition, or the library's own public API surface says should be accepted must appear in the set. Values not in the set are silently dropped, rejected, or mishandled, and the absence of an entry is invisible at the call site. + +### Bug class + +Closed sets are written once and rarely revisited. When a new capability is added to the specification or upstream header, the code that defines the capability (the constant, the feature flag, the enum variant) is updated, and the code that uses the capability is updated, but the closed set that gates whether the capability survives a filtering step is forgotten. 
The feature appears to be supported — it's defined, it's negotiated, it's used — but it's silently stripped by a filter function that nobody remembered to update. The bug is invisible in normal testing because the feature simply doesn't activate, and the absence of activation looks like "the other end doesn't support it." + +This pattern also covers **internal representations** that must mirror a public API. If a library's public API accepts i128/u128 integers but an internal buffered representation only has variants for i64/u64, values that pass through the buffer are silently truncated or rejected — even though the public API promises to handle them. + +### Examples across domains + +- **Feature negotiation filter:** A transport layer maintains a switch/case whitelist of feature bits that should survive filtering. A new feature (`RING_RESET`) is added to the UAPI header and used by higher-level code, but never added to the whitelist. Bug: the feature is silently cleared during negotiation, disabling a capability the driver claims to support. +- **Serialization internal representation:** A serialization library's public `Deserializer` trait supports `deserialize_i128()`/`deserialize_u128()`. An internal buffered representation (`Content` enum) used by untagged and internally-tagged enum deserialization has variants only for `I64`/`U64`. Bug: 128-bit integers that pass through the buffer are rejected with a "no variant for i128" error, even though the public API claims to support them. +- **Schema keyword importer:** A validation library imports JSON Schema documents. The spec defines `uniqueItems`, `contains`, `minContains`, `maxContains` for arrays. The importer recognizes these keywords (no parse error) but doesn't enforce them. Bug: imported schemas silently accept arrays that violate the original constraints. +- **Permission system:** An authorization middleware maintains an array of recognized permission strings. 
A new permission (`audit:write`) is added to the role definitions but not to the middleware's whitelist. Bug: users with the `audit:write` role are silently denied access because the middleware doesn't recognize the permission. +- **Protocol message types:** A message router maintains a switch/case dispatch for recognized message types. A new message type is added to the protocol spec and the serialization layer, but not to the router. Bug: the new message type is silently dropped by the router's default case, and the sender receives no error. + +### How to apply + +For each core module, look for: switch/case statements with explicit case labels and a default that drops/clears/rejects, arrays or sets of accepted values used for filtering or validation, registration functions where new entries must be added manually, enum/tagged-union definitions that mirror a specification or public API, trait/visitor method families where each method handles one variant, schema importers that must handle every keyword the spec defines, internal representations (buffers, IR, AST) that must cover the full range of the public interface. + +For each closed set found: +1. Identify the authoritative source that defines what values should be valid. This could be: a spec, a header file, an upstream enum, a protocol definition, **or the library's own public API surface** (trait methods, function signatures, type definitions). +2. Extract the closed set mechanically (save the case labels, enum variants, visitor methods, array entries, or schema keywords to a file). +3. Compare the extracted set against the authoritative source. Every value in the authoritative source that is absent from the closed set is a candidate requirement. + +**Caller compensation does not excuse a missing entry.** If a closed set in a shared/generic function is missing an entry, that is a bug — even if specific callers compensate by restoring the value after the function runs. 
The compensation is a workaround, not a fix. Any new caller that doesn't know to compensate silently inherits the bug. Report each missing entry as a finding and note which callers (if any) compensate, but do not dismiss the finding because of compensation. + +### EXPLORATION.md output format + +``` +## Enumeration/Representation Completeness + +### [Function/type name] at [file:line] +- **Purpose:** [what this closed set gates — e.g., "feature bits that survive transport filtering" or "integer variants the buffer can hold"] +- **Authoritative source:** [where valid values are defined — e.g., "include/uapi/linux/virtio_config.h" or "public Deserializer trait methods"] +- **Extracted entries:** [list of values in the closed set, or reference to mechanical extraction file] +- **Missing entries:** [values present in the authoritative source but absent from the closed set] +- **Candidate requirements:** REQ-NNN: [closed set must include X] +``` + +--- + +## Pattern 5: API Surface Consistency + +### Definition + +When the same logical operation can be performed through multiple API surfaces — direct method vs. view/wrapper, constructor vs. mutator, sync vs. async variant, primary API vs. convenience alias — all surfaces must produce equivalent observable behavior for the same input. A divergence between two paths to the same operation is a bug, because callers reasonably expect consistent behavior regardless of which surface they use. + +### Bug class + +Libraries often expose the same underlying data through multiple interfaces: a direct method and a collection view (`add()` vs. `asList().add()`), a constructor and a setter, a sync and async variant. These surfaces are implemented at different times, often by different developers, and their edge-case handling diverges — especially around null/sentinel values, encoding, ordering, and error reporting. The divergence is invisible in normal testing because tests typically exercise only one surface per operation. 
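A generic divergence probe makes this mechanical: run the same edge inputs through both surfaces and compare the outcomes. All names below are invented; the sentinel-vs-reject split loosely mirrors the null-handling divergences this pattern targets.

```python
def probe(surface, value):
    """Capture one surface's observable outcome: a value or an exception."""
    try:
        return ("ok", surface(value))
    except Exception as exc:
        return ("raise", type(exc).__name__)


SENTINEL = object()
backing: list = []


def direct_add(v):
    """Surface A: the direct method converts null to a sentinel."""
    backing.append(SENTINEL if v is None else v)
    return len(backing)


def view_add(v):
    """Surface B: the wrapper view rejects null outright."""
    if v is None:
        raise TypeError("view rejects null")
    backing.append(v)
    return len(backing)


EDGE_INPUTS = [None, "", 0]
divergences = [
    (value, a, b)
    for value in EDGE_INPUTS
    for a, b in [(probe(direct_add, value), probe(view_add, value))]
    if a[0] != b[0]  # one surface raised where the other succeeded
]
# flags value=None: direct_add succeeds, view_add raises TypeError
```

A stricter version would also compare the `"ok"` payloads (encoding, ordering, sentinel substitution), since surfaces can diverge without either one raising.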
+ +### Examples across domains + +- **JSON null handling:** `JsonArray.add(null)` converts null to `JsonNull.INSTANCE` and succeeds. `JsonArray.asList().add(null)` throws `NullPointerException` because the view's wrapper unconditionally rejects null. Bug: two methods for the same operation have contradictory null semantics. +- **HTTP header encoding:** `Headers([(b"X-Custom", b"\xe9")])` constructs a header from Latin-1 bytes. `headers["X-Custom"] = b"\xe9"` stores the value as UTF-8. Bug: round-tripping a header through get-then-set changes the encoding silently. +- **WebSocket protocol negotiation:** `WebSocketUpgrade::protocols()` returns a `BTreeSet`, which sorts and deduplicates the client's preference-ordered protocol list. Bug: the application sees a different order than the client sent, breaking preference-based negotiation. +- **Configuration option propagation:** `res.sendFile(path, { etag: false })` should disable ETag for this response. But the code converts the option to a boolean before passing to the underlying `send` module, losing the "strong" vs "weak" ETag mode. Bug: per-call ETag configuration is silently ignored or lossy-converted. +- **Map duplicate detection:** `map.put(key, value)` returns the previous value to signal duplicates. When the previous value is legitimately `null`, `put()` returns `null` — the same value it returns for "no previous entry." Bug: duplicate keys go undetected when the first value is null. + +### How to apply + +For each core module, look for: view/wrapper objects returned by methods like `asList()`, `asMap()`, `unmodifiableView()`, `stream()`, `iterator()`; constructor vs. mutation method pairs; sync vs. async variants of the same operation; convenience aliases that delegate to a primary implementation; methods that accept options/configuration objects. + +For each pair of surfaces: +1. Identify the logical operation they share. +2. 
Test the same edge-case inputs on both surfaces (null, empty, boundary values, special characters, ordering-sensitive data). +3. Any divergence in behavior (different exceptions, different encoding, different ordering, one succeeds and the other fails) is a candidate requirement. + +### EXPLORATION.md output format + +``` +## API Surface Consistency + +### [Operation name] — [two surfaces compared] +- **Surface A:** [method, file:line] — [behavior on edge input] +- **Surface B:** [method, file:line] — [behavior on same edge input] +- **Divergence:** [what differs — exception type, encoding, ordering, null handling] +- **Candidate requirements:** REQ-NNN: [both surfaces must behave equivalently for input X] +``` + +--- + +## Pattern 6: Spec-Structured Parsing Fidelity + +### Definition + +When code parses values defined by a formal grammar or specification — HTTP headers, URLs, MIME types, CLI flags, JSON Schema keywords, file paths — the parsing must match the grammar's actual rules. Shortcuts (substring matching, exact equality, wrong delimiter, prefix matching without boundary checks) produce parsers that work for common inputs but fail on valid edge cases or accept invalid inputs. + +### Bug class + +Developers frequently implement "good enough" parsers that handle the common case: `header.contains("gzip")` instead of tokenizing by comma and trimming whitespace, `url.startsWith("/api")` instead of checking path segment boundaries, `connection == "Upgrade"` instead of case-insensitive token list membership. These shortcuts pass all unit tests because tests use well-formed inputs, but they break on real-world edge cases like `gzip;q=0` (explicitly rejected), `Connection: keep-alive, Upgrade` (token list), or `/api-docs` (prefix match without boundary). + +### Examples across domains + +- **HTTP Accept-Encoding:** Middleware checks `accept.contains("gzip")` to decide whether to compress. 
This matches `gzip;q=0` (client explicitly rejects gzip) and `xgzip` (not a valid encoding). Bug: responses are compressed when the client said not to. +- **WebSocket Connection header:** Code checks `connection == "Upgrade"` (exact match). Per RFC 7230, `Connection` is a comma-separated token list; `Connection: keep-alive, Upgrade` is valid but fails exact match. Bug: valid WebSocket upgrades are rejected. +- **SPA fallback routing:** A single-page-app handler matches paths with `path.startsWith("/app")`. This matches both `/app/users` (correct) and `/api-docs` (incorrect sibling route). Bug: API documentation requests are swallowed by the SPA handler. +- **MIME type parameter handling:** Content negotiation compares `text/html;level=1` against handler keys but strips parameters before matching. Bug: the `level=1` parameter selected during negotiation is lost from the response Content-Type. +- **URL host normalization:** Code detects internationalized domain names by checking `host.startsWith("xn--")`. Per IDNA, only individual labels start with `xn--`; `foo.xn--example.com` has the punycode label in the middle. Bug: internationalized subdomains are not decoded. + +### How to apply + +For each core module, look for: string comparisons on values defined by RFCs or specs (headers, URLs, MIME types, encoding names), `contains()` / `indexOf()` / `startsWith()` / `endsWith()` on structured values, case-sensitive comparisons where the spec requires case-insensitive, splitting on the wrong delimiter or not splitting at all, prefix/suffix matching without path-segment or token boundaries. + +For each parser found: +1. Identify the spec that defines the grammar (RFC, ABNF, JSON Schema spec, POSIX, etc.). +2. Check whether the implementation handles: token lists (comma-separated), quoted strings, parameters (semicolon-separated), case folding, whitespace trimming, boundary conditions. +3. 
Construct an input that is valid per the spec but would fail the implementation's shortcut parser. That input is a candidate test case and the parsing gap is a candidate requirement. + +### EXPLORATION.md output format + +``` +## Spec-Structured Parsing + +### [Parser location] at [file:line] +- **Spec:** [which grammar/RFC/standard defines the format] +- **Implementation technique:** [contains/equals/startsWith/split-on-X] +- **Spec-valid input that breaks the parser:** [concrete example] +- **Why it breaks:** [substring match includes invalid case / missing case folding / etc.] +- **Candidate requirements:** REQ-NNN: [parser must tokenize per RFC NNNN §N.N] +``` + +--- + +## Pattern 7: Composition and Mount-Context Awareness + +### Definition + +When code operates inside a composed context — mounted at a sub-route, nested inside a parent module, scoped to a child container, wrapped by a framework adapter — the framework typically maintains a *canonical* representation of the active state (what the active context says is true right now) alongside the *raw* representation from the outer call site (what the outer caller passed in originally). Code that reads or writes the raw representation when canonical was needed (or vice versa) works correctly at the outer level, where they happen to be identical, but fails silently under composition, where they diverge. + +### Bug class + +Code is written and tested at the outer level — top-level routes, root module, single-tenant deployment, default scope — where canonical and raw state are identical. When the same code runs inside a composed context, the framework updates the canonical state (mounted child path, scoped logger, transaction-scoped connection, locale-aware comparator) but the raw state still reflects the outer call. 
The defect manifests in two symmetric directions: a function that *reads* raw state where canonical was needed sees stale data and produces silent drift (never matches, leaks parent context, returns the wrong output); a function that *writes* an outward-facing value from canonical state where raw is needed produces output the consumer can't use (drops the mount prefix, returns a child-relative path the parent's clients can't follow). Either way, the test suite typically exercises the outer level only and never sees the divergence. + +### Examples across domains + +- **HTTP routing middleware (mount-context):** A middleware comparing the request path against a configured endpoint reads `r.URL.Path`. When mounted at a sub-route, the framework's canonical "active routing path" (e.g., `RoutePath` in chi, `req.url` in Express sub-app) is the child-relative path while `r.URL.Path` remains the full URL path. Bug: middleware never matches inside the mounted child because it reads the wrong path representation. +- **Database transaction context:** A repository method opens its own connection via the connection pool. When called inside an explicit transaction, the framework's canonical "current transaction" context is the explicit one, but the method reads from the connection pool directly. Bug: the method's writes don't participate in the surrounding transaction; rollback leaves orphan rows. +- **Logging context propagation:** A library logs via `logging.getLogger(__name__)`. When invoked inside an async task or worker pool that has scoped a contextvar-based correlation ID, the logger doesn't read the contextvar. Bug: the library's log lines lack the correlation ID the framework was propagating, breaking traceability. +- **Locale-sensitive comparison:** A sort function uses `str.lower()` for case-insensitive comparison. 
When called inside a locale-aware context (Turkish "i" / "İ" / "ı" semantics), the framework's canonical locale is set but `str.lower()` reads the default locale. Bug: equality comparisons silently differ depending on which locale is canonically active. +- **Authorization scope inheritance:** An ACL check reads `request.user` (the raw authenticated principal). When invoked inside an impersonation context, the framework's canonical `request.effective_user` is the impersonated principal but the check still reads the original. Bug: privilege escalation — the check authorizes the wrong principal. + +### How to apply + +Identify every function or component that reads or writes state that *can be canonical-vs-raw under composition*. The check is: does this code path run unchanged when its caller is composed inside a larger context, and if so, does the state it observes (or produces) change accordingly? + +**Disambiguation from Pattern 4.** Pattern 4 (Enumeration and Representation Completeness) is about closed sets of values: the bug is "value missing from the recognizer's closed set." Pattern 7 is about choice of state variable: the bug is "function reads or writes the wrong representation of state under composition." If both frames seem to apply, prefer the one whose REQ is more testable. The two patterns rarely overlap on the same defect; when they do, the canonical-vs-raw framing usually points more directly at the fix. + +**Budget.** Cap candidates at 3-5 highest-impact composition seams per pass. If more than 10 candidates emerge from this pattern alone, the net is too wide and the pattern is being over-applied — revisit Step 1 with a tighter "what does this framework actually maintain canonically under composition" filter. + +For each candidate found: + +1. **Identify the canonical and raw representations, both for reads and writes.** What does the framework maintain as "the active state for this concern" under composition? What does the function actually read? 
Then ask the symmetric question for outputs: when this function constructs a value that flows outward (a redirect target, a derived path, a logged correlation ID, an authorized resource handle), is that value being built from the right representation for its consumer? Read-side and write-side defects are equally common; check both directions. +2. **Trace the composition seam.** Where does the framework update canonical state? Is the function downstream of that update site? Does it read from the canonical or the raw representation? +3. **Construct the composition test.** What is the smallest example where this code runs inside a composed parent (mounted router, nested transaction, scoped logger, impersonation context)? Does the function's behavior match the outer-level behavior, or does it silently drift? +4. **Record what happens.** A function that drifts under composition is a candidate requirement: "function `[name]` MUST read `[canonical representation]` (not `[raw representation]`) [or write `[canonical-derived value]` for outward-facing output] so that behavior remains correct when composed inside `[parent context]`." + +Common composition seams worth checking explicitly: + +- **Routing frameworks:** `RoutePath` / `req.url` / `request.path_info` versus the original URL. Each level of mounting updates the canonical path; raw URLs do not. +- **Transaction managers:** explicit transaction context versus connection pool's auto-commit default. Composed code must read the active transaction. +- **Logging / tracing:** contextvar-scoped correlation IDs versus thread-local or default loggers. Composed code must read the contextvar. +- **Authorization / impersonation:** effective principal versus raw user. Composed code must read the effective principal.
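The mount seam can be made concrete with a small, framework-free sketch. Everything here (the `RAW_PATH` key, the `mount` and `make_health_check` helpers) is hypothetical scaffolding modeled loosely on WSGI's `SCRIPT_NAME`/`PATH_INFO` split, not any real router's API:

```python
# Hypothetical sketch of read-side drift under composition.
def make_health_check(endpoint, use_canonical):
    """Middleware that answers "OK" when the request targets `endpoint`."""
    def middleware(environ):
        # Canonical: PATH_INFO, which each mount rewrites to be child-relative.
        # Raw: the original full path, which mounting never touches.
        path = environ["PATH_INFO"] if use_canonical else environ["RAW_PATH"]
        return "OK" if path == endpoint else "PASS"
    return middleware

def mount(prefix, app):
    """Simulate a router mount: move the prefix into SCRIPT_NAME so that
    PATH_INFO (the canonical representation) becomes child-relative."""
    def mounted(environ):
        if environ["PATH_INFO"].startswith(prefix):
            environ["SCRIPT_NAME"] += prefix
            environ["PATH_INFO"] = environ["PATH_INFO"][len(prefix):]
        return app(environ)
    return mounted

def request(app, path):
    return app({"RAW_PATH": path, "PATH_INFO": path, "SCRIPT_NAME": ""})

buggy = mount("/api", make_health_check("/ping", use_canonical=False))
fixed = mount("/api", make_health_check("/ping", use_canonical=True))
# Unmounted, both variants behave identically; under the mount, the
# raw-reading middleware never matches (silent drift), while the
# canonical-reading one behaves the same at any mount depth.
```

This is exactly the composition test from step 3: both variants pass at the outer level, and only running the mounted case exposes the divergence.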
+ +### EXPLORATION.md output format + +``` +## Composition and Mount-Context Analysis + +### [Function/component name] at [file:line] +- **Composes inside:** [parent context — e.g., "any chi router that calls Mount() to attach this middleware"] +- **Canonical representation:** [what the framework maintains under composition — e.g., "rctx.RoutePath, the active mounted path"] +- **Raw representation read by this code:** [what the function actually reads — e.g., "r.URL.Path, the full request URL"] +- **Drift scenario:** [smallest composition example that exposes the divergence — e.g., "router.Mount("/api", child) where child uses this middleware to serve /ping"] +- **Observable failure:** [what wrong behavior results — e.g., "404 instead of the expected response"] +- **Candidate requirements:** REQ-NNN: [function MUST read canonical state under composition] +``` + +--- + +## Extending This List + +These patterns were derived from analyzing 56 confirmed bugs across 11 open-source repositories spanning 7 languages. Each pattern represents a class of requirements that broad architectural summaries consistently miss. + +To add a new pattern: +1. Identify a confirmed bug that was missed by exploration but would have been found with a specific analysis technique. +2. Generalize the technique: what question should the explorer have asked about the code? +3. Provide at least 5 diverse examples from different domains (not all from the same project). +4. Define the expected output format for EXPLORATION.md. +5. Add the pattern to this file and add the corresponding section to the EXPLORATION.md template in SKILL.md. + +The goal is a library of systematic exploration techniques that accumulate over time as new bug classes are discovered. 
diff --git a/skills/quality-playbook/references/functional_tests.md b/skills/quality-playbook/references/functional_tests.md index a376a29cd..5f80d1412 100644 --- a/skills/quality-playbook/references/functional_tests.md +++ b/skills/quality-playbook/references/functional_tests.md @@ -32,94 +32,26 @@ For a medium-sized project (5–15 source files), this typically yields 35–50 Before writing any test code, read 2–3 existing test files and identify how they import project modules. This is critical — projects handle imports differently and getting it wrong means every test fails with resolution errors. -Common patterns by language: - -**Python:** -- `sys.path.insert(0, "src/")` then bare imports (`from module import func`) -- Package imports (`from myproject.module import func`) -- Relative imports with conftest.py path manipulation - -**Java:** -- `import com.example.project.Module;` matching the package structure -- Test source root must mirror main source root - -**Scala:** -- `import com.example.project._` or `import com.example.project.{ClassA, ClassB}` -- SBT project layout: `src/test/scala/` mirrors `src/main/scala/` - -**TypeScript/JavaScript:** -- `import { func } from '../src/module'` with relative paths -- Path aliases from `tsconfig.json` (e.g., `@/module`) +Identify the import convention used in the project. Whatever pattern the existing tests use, copy it exactly. Do not guess or invent a different pattern. -**Go:** -- Same package: test files in the same directory with `package mypackage` -- Black-box testing: `package mypackage_test` with explicit imports -- Internal packages may require specific import paths - -**Rust:** -- `use crate::module::function;` for unit tests in the same crate -- `use myproject::module::function;` for integration tests in `tests/` +Common patterns by language: -Whatever pattern the existing tests use, copy it exactly. Do not guess or invent a different pattern. 
+- **Python:** `sys.path.insert(0, "src/")` then bare imports; package imports (`from myproject.module import func`); relative imports with conftest.py path manipulation +- **Go:** Same-package tests (`package mypackage`) give access to unexported identifiers; black-box tests (`package mypackage_test`) test only exported API; internal packages may require specific import paths +- **Java:** `import com.example.project.Module;` matching the package structure; test source root must mirror main source root +- **TypeScript:** `import { func } from '../src/module'` with relative paths; path aliases from `tsconfig.json` (e.g., `@/module`) +- **Rust:** `use crate::module::function;` for unit tests in the same crate; `use myproject::module::function;` for integration tests in `tests/` +- **Scala:** `import com.example.project._` or `import com.example.project.{ClassA, ClassB}`; SBT layout mirrors `src/main/scala/` in `src/test/scala/` ## Create Test Setup BEFORE Writing Tests Every test framework has a mechanism for shared setup. If your tests use shared fixtures or test data, you MUST create the setup file before writing tests. Test frameworks do not auto-discover fixtures from other directories. -**By language:** - -**Python (pytest):** Create `quality/conftest.py` defining every fixture. Fixtures in `tests/conftest.py` are NOT available to `quality/test_functional.py`. Preferred: write tests that create data inline using `tmp_path` to eliminate conftest dependency. - -**Java (JUnit):** Use `@BeforeEach`/`@BeforeAll` methods in the test class, or create a shared `TestFixtures` utility class in the same package. - -**Scala (ScalaTest):** Mix in a trait with `before`/`after` blocks, or use inline data builders. If using SBT, ensure the test file is in the correct source tree. - -**TypeScript (Jest):** Use `beforeAll`/`beforeEach` in the test file, or create a `quality/testUtils.ts` with factory functions. 
- -**Go (testing):** Helper functions in the same `_test.go` file with `t.Helper()`. Use `t.TempDir()` for temporary directories. Go convention strongly prefers inline setup — avoid shared test state. - -**Rust (cargo test):** Helper functions in a `#[cfg(test)] mod tests` block or a `test_utils.rs` module. Use builder patterns for constructing test data. For integration tests, place files in `tests/`. +Identify your framework's setup mechanism (fixtures, `@BeforeEach`, `beforeAll`, helper functions, builder patterns, etc.) and follow the conventions already used in the project's existing tests. **Rule: Every fixture or test helper referenced must be defined.** If a test depends on shared setup that doesn't exist, the test will error during setup (not fail during assertion) — producing broken tests that look like they pass. -**Preferred approach across all languages:** Write tests that create their own data inline. This eliminates cross-file dependencies: - -```python -# Python -def test_config_validation(tmp_path): - config = {"pipeline": {"name": "Test", "steps": [...]}} -``` - -```java -// Java -@Test -void testConfigValidation(@TempDir Path tempDir) { - var config = Map.of("pipeline", Map.of("name", "Test")); -} -``` - -```typescript -// TypeScript -test('config validation', () => { - const config = { pipeline: { name: 'Test', steps: [] } }; -}); -``` - -```go -// Go -func TestConfigValidation(t *testing.T) { - tmpDir := t.TempDir() - config := Config{Pipeline: Pipeline{Name: "Test"}} -} -``` - -```rust -// Rust -#[test] -fn test_config_validation() { - let config = Config { pipeline: Pipeline { name: "Test".into() } }; -} -``` +**Preferred approach across all languages:** Write tests that create their own data inline. This eliminates cross-file dependencies. Create test data directly in each test function using the framework's temporary directory support and literal data structures. 
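As one concrete sketch of the inline style (pytest; the config shape and function name are illustrative, not from any real project):

```python
import json

# Illustrative pytest test: all data is created inline via tmp_path and a
# literal dict, so no conftest fixture or shared data file is needed.
def test_config_roundtrip(tmp_path):
    config = {"pipeline": {"name": "Test", "steps": []}}  # literal test data
    config_file = tmp_path / "config.json"                # framework temp dir
    config_file.write_text(json.dumps(config))
    loaded = json.loads(config_file.read_text())
    assert loaded["pipeline"]["name"] == "Test"
```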
**After writing all tests, run the test suite and check for setup errors.** Setup errors (fixture not found, import failures) count as broken tests regardless of how the framework categorizes them. @@ -133,14 +65,14 @@ If you genuinely cannot write a meaningful test for a defensive pattern (e.g., i Before writing a single test, build a function call map. For every function you plan to test: -1. **Read the function/method signature** — not just the name, but every parameter, its type, and default value. In Python, read the `def` line and type hints. In Java, read the method signature and generics. In Scala, read the method definition and implicit parameters. In TypeScript, read the type annotations. +1. **Read the function/method signature** — not just the name, but every parameter, its type, and default value. 2. **Read the documentation** — docstrings, Javadoc, TSDoc, ScalaDoc. They often specify return types, exceptions, and edge case behavior. 3. **Read one existing test that calls it** — existing tests show you the exact calling convention, fixture shape, and assertion pattern. 4. **Read real data files** — if the function processes configs, schemas, or data files, read an actual file from the project. Your test fixtures must match this shape exactly. **Common failure pattern:** The agent explores the architecture, understands conceptually what a function does, then writes a test call with guessed parameters. The test fails because the real function takes `(config, items_data, limit)` not `(items, seed, strategy)`. Reading the actual signature takes 5 seconds and prevents this entirely. -**Library version awareness:** Check the project's dependency manifest (`requirements.txt`, `build.sbt`, `package.json`, `pom.xml`, `build.gradle`, `Cargo.toml`) to verify what's available. 
Use the test framework's skip mechanism for optional dependencies: Python `pytest.importorskip()`, JUnit `Assumptions.assumeTrue()`, ScalaTest `assume()`, Jest conditional `describe.skip`, Go `t.Skip()`, Rust `#[ignore]` with a comment explaining the prerequisite. +**Library version awareness:** Check the project's dependency manifest (`requirements.txt`, `build.sbt`, `package.json`, `pom.xml`, `build.gradle`, `Cargo.toml`) to verify what's available. Use the test framework's skip mechanism for optional dependencies (e.g., `pytest.importorskip()`, `Assumptions.assumeTrue()`, `t.Skip()`, `#[ignore]`, etc.). ## Writing Spec-Derived Tests @@ -151,68 +83,9 @@ Each test should: 2. **Execute** — Call the function, run the pipeline, make the request 3. **Assert specific properties** the spec requires -```python -# Python (pytest) -class TestSpecRequirements: - def test_requirement_from_spec_section_N(self, fixture): - """[Req: formal — Design Doc §N] X should produce Y.""" - result = process(fixture) - assert result.property == expected_value -``` - -```java -// Java (JUnit 5) -class SpecRequirementsTest { - @Test - @DisplayName("[Req: formal — Design Doc §N] X should produce Y") - void testRequirementFromSpecSectionN() { - var result = process(fixture); - assertEquals(expectedValue, result.getProperty()); - } -} -``` - -```scala -// Scala (ScalaTest) -class SpecRequirements extends FlatSpec with Matchers { - // [Req: formal — Design Doc §N] X should produce Y - "Section N requirement" should "produce Y from X" in { - val result = process(fixture) - result.property should equal (expectedValue) - } -} -``` +Each test should include a traceability annotation (via docstring, display name, or comment) citing the spec section it verifies, e.g., `[Req: formal — Design Doc §N] X should produce Y`. 
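A minimal pytest sketch of the set-up/execute/assert shape with its annotation (the spec section, `normalize_name`, and the property asserted are all placeholders):

```python
# Illustrative spec-derived test: the docstring carries the traceability
# annotation; the body does setup, execution, and a spec-specific assertion.
def normalize_name(raw):  # stand-in for the function under test
    return raw.strip().lower()

def test_names_are_case_folded():
    """[Req: formal — Design Doc §3.2] Names are stored case-folded."""
    result = normalize_name("  Alice ")  # set up + execute
    assert result == "alice"             # assert the property the spec requires
```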
-```typescript -// TypeScript (Jest) -describe('Spec Requirements', () => { - test('[Req: formal — Design Doc §N] X should produce Y', () => { - const result = process(fixture); - expect(result.property).toBe(expectedValue); - }); -}); -``` -```go -// Go (testing) -func TestSpecRequirement_SectionN_XProducesY(t *testing.T) { - // [Req: formal — Design Doc §N] X should produce Y - result := Process(fixture) - if result.Property != expectedValue { - t.Errorf("expected %v, got %v", expectedValue, result.Property) - } -} -``` - -```rust -// Rust (cargo test) -#[test] -fn test_spec_requirement_section_n_x_produces_y() { - // [Req: formal — Design Doc §N] X should produce Y - let result = process(&fixture); - assert_eq!(result.property, expected_value); -} -``` ## What Makes a Good Functional Test @@ -226,72 +99,9 @@ fn test_spec_requirement_section_n_x_produces_y() { If the project handles multiple input types, cross-variant coverage is where silent bugs hide. Aim for roughly 30% of tests exercising all variants — the exact percentage matters less than ensuring every cross-cutting property is tested across all variants. -Use your framework's parametrization mechanism: +Use your framework's parametrization mechanism (e.g., `@pytest.mark.parametrize`, `@ParameterizedTest`, `test.each`, table-driven tests, iterating over cases) to run the same assertion logic across all variants. 
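For example, a pytest sketch (the variants and the `render` stand-in are invented for illustration):

```python
import pytest

# Illustrative parametrized test: one assertion body runs once per variant,
# so the cross-cutting property is checked against every input type.
def render(variant):  # stand-in for the code under test
    return {"format": variant, "ok": True}

@pytest.mark.parametrize("variant", ["json", "yaml", "toml"])
def test_render_succeeds_for_every_variant(variant):
    output = render(variant)
    assert output["ok"] and output["format"] == variant
```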
-```python -# Python (pytest) -@pytest.mark.parametrize("variant", [variant_a, variant_b, variant_c]) -def test_feature_works(variant): - output = process(variant.input) - assert output.has_expected_property -``` -```java -// Java (JUnit 5) -@ParameterizedTest -@MethodSource("variantProvider") -void testFeatureWorks(Variant variant) { - var output = process(variant.getInput()); - assertTrue(output.hasExpectedProperty()); -} -``` - -```scala -// Scala (ScalaTest) -Seq(variantA, variantB, variantC).foreach { variant => - it should s"work for ${variant.name}" in { - val output = process(variant.input) - output should have ('expectedProperty (true)) - } -} -``` - -```typescript -// TypeScript (Jest) -test.each([variantA, variantB, variantC])( - 'feature works for %s', (variant) => { - const output = process(variant.input); - expect(output).toHaveProperty('expectedProperty'); -}); -``` - -```go -// Go (testing) — table-driven tests -func TestFeatureWorksAcrossVariants(t *testing.T) { - variants := []Variant{variantA, variantB, variantC} - for _, v := range variants { - t.Run(v.Name, func(t *testing.T) { - output := Process(v.Input) - if !output.HasExpectedProperty() { - t.Errorf("variant %s: missing expected property", v.Name) - } - }) - } -} -``` - -```rust -// Rust (cargo test) — iterate over cases -#[test] -fn test_feature_works_across_variants() { - let variants = [variant_a(), variant_b(), variant_c()]; - for v in &variants { - let output = process(&v.input); - assert!(output.has_expected_property(), - "variant {}: missing expected property", v.name); - } -} -``` If parametrization doesn't fit, loop explicitly within a single test. @@ -312,68 +122,15 @@ These patterns look like tests but don't catch real bugs: ### The Exception-Catching Anti-Pattern in Detail -```java -// Java — WRONG: tests the validation mechanism -@Test -void testBadValueRejected() { - fixture.setField("invalid"); // Schema rejects this! 
- assertThrows(ValidationException.class, () -> process(fixture)); - // Tells you nothing about output -} - -// Java — RIGHT: tests the requirement -@Test -void testBadValueNotInOutput() { - fixture.setField(null); // Schema accepts null for Optional - var output = process(fixture); - assertFalse(output.contains(badProperty)); // Bad data absent - assertTrue(output.contains(expectedType)); // Rest still works -} -``` - -```scala -// Scala — WRONG: tests the decoder, not the requirement -"bad value" should "be rejected" in { - val input = fixture.copy(field = "invalid") // Circe decoder fails! - a [DecodingFailure] should be thrownBy process(input) - // Tells you nothing about output -} - -// Scala — RIGHT: tests the requirement -"missing optional field" should "not produce bad output" in { - val input = fixture.copy(field = None) // Option[String] accepts None - val output = process(input) - output should not contain badProperty // Bad data absent - output should contain (expectedType) // Rest still works -} -``` - -```typescript -// TypeScript — WRONG: tests the validation mechanism -test('bad value rejected', () => { - fixture.field = 'invalid'; // Zod schema rejects this! - expect(() => process(fixture)).toThrow(ZodError); - // Tells you nothing about output -}); - -// TypeScript — RIGHT: tests the requirement -test('bad value not in output', () => { - fixture.field = undefined; // Schema accepts undefined for optional - const output = process(fixture); - expect(output).not.toContain(badProperty); // Bad data absent - expect(output).toContain(expectedType); // Rest still works -}); -``` - ```python -# Python — WRONG: tests the validation mechanism +# WRONG: tests the validation mechanism def test_bad_value_rejected(fixture): fixture.field = "invalid" # Schema rejects this! 
with pytest.raises(ValidationError): process(fixture) # Tells you nothing about output -# Python — RIGHT: tests the requirement +# RIGHT: tests the requirement def test_bad_value_not_in_output(fixture): fixture.field = None # Schema accepts None for Optional output = process(fixture) @@ -381,42 +138,9 @@ def test_bad_value_not_in_output(fixture): assert expected_type in output # Rest still works ``` -```go -// Go — WRONG: tests the error, not the outcome -func TestBadValueRejected(t *testing.T) { - fixture.Field = "invalid" // Validator rejects this! - _, err := Process(fixture) - if err == nil { t.Fatal("expected error") } - // Tells you nothing about output -} - -// Go — RIGHT: tests the requirement -func TestBadValueNotInOutput(t *testing.T) { - fixture.Field = "" // Zero value is valid - output, err := Process(fixture) - if err != nil { t.Fatalf("unexpected error: %v", err) } - if containsBadProperty(output) { t.Error("bad data should be absent") } - if !containsExpectedType(output) { t.Error("expected data should be present") } -} -``` +The pattern is the same in every language: don't test that the validation mechanism rejects bad input — test that the system produces correct output when given edge-case input the schema accepts. The WRONG approach tests the implementation (the validator); the RIGHT approach tests the requirement (the output). 
+ -```rust -// Rust — WRONG: tests the error, not the outcome -#[test] -fn test_bad_value_rejected() { - let input = Fixture { field: "invalid".into(), ..default() }; - assert!(process(&input).is_err()); // Tells you nothing about output -} - -// Rust — RIGHT: tests the requirement -#[test] -fn test_bad_value_not_in_output() { - let input = Fixture { field: None, ..default() }; // Option accepts None - let output = process(&input).expect("should succeed"); - assert!(!output.contains(bad_property)); // Bad data absent - assert!(output.contains(expected_type)); // Rest still works -} -``` Always check your Step 5b schema map before choosing mutation values. @@ -428,154 +152,20 @@ Ask: "What does the *spec* say should happen?" The spec says "invalid data shoul ## Fitness-to-Purpose Scenario Tests -For each scenario in QUALITY.md, write a test. This is a 1:1 mapping: - -```scala -// Scala (ScalaTest) -class FitnessScenarios extends FlatSpec with Matchers { - // [Req: formal — QUALITY.md Scenario 1] - "Scenario 1: [Name]" should "prevent [failure mode]" in { - val result = process(fixture) - result.property should equal (expectedValue) - } -} -``` - -```python -# Python (pytest) -class TestFitnessScenarios: - """Tests for fitness-to-purpose scenarios from QUALITY.md.""" - - def test_scenario_1_memorable_name(self, fixture): - """[Req: formal — QUALITY.md Scenario 1] [Name]. - Requirement: [What the code must do]. - """ - result = process(fixture) - assert condition_that_prevents_the_failure -``` +For each scenario in QUALITY.md, write a test. This is a 1:1 mapping. Each test should include a traceability annotation citing the scenario, e.g., `[Req: formal — QUALITY.md Scenario 1]`, and be named to match the scenario's memorable name. 
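A minimal pytest sketch of one such scenario test (the scenario, `export_report`, and its rule are invented for illustration):

```python
# Illustrative fitness-scenario test: the name and docstring map 1:1 to the
# QUALITY.md scenario; the assertion encodes the failure mode it prevents.
def export_report(rows):  # stand-in for the code under test
    return [r for r in rows if r.get("id") is not None]

def test_scenario_1_no_phantom_rows():
    """[Req: formal — QUALITY.md Scenario 1] No phantom rows in exports."""
    rows = [{"id": 1}, {"id": None}]
    assert export_report(rows) == [{"id": 1}]  # rows without ids are dropped
```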
-```java -// Java (JUnit 5) -class FitnessScenariosTest { - @Test - @DisplayName("[Req: formal — QUALITY.md Scenario 1] [Name]") - void testScenario1MemorableName() { - var result = process(fixture); - assertTrue(conditionThatPreventsFailure(result)); - } -} -``` -```typescript -// TypeScript (Jest) -describe('Fitness Scenarios', () => { - test('[Req: formal — QUALITY.md Scenario 1] [Name]', () => { - const result = process(fixture); - expect(conditionThatPreventsFailure(result)).toBe(true); - }); -}); -``` - -```go -// Go (testing) -func TestScenario1_MemorableName(t *testing.T) { - // [Req: formal — QUALITY.md Scenario 1] [Name] - // Requirement: [What the code must do] - result := Process(fixture) - if !conditionThatPreventsFailure(result) { - t.Error("scenario 1 failed: [describe expected behavior]") - } -} -``` - -```rust -// Rust (cargo test) -#[test] -fn test_scenario_1_memorable_name() { - // [Req: formal — QUALITY.md Scenario 1] [Name] - // Requirement: [What the code must do] - let result = process(&fixture); - assert!(condition_that_prevents_the_failure(&result)); -} -``` ## Boundary and Negative Tests -One test per defensive pattern from Step 5: - -```typescript -// TypeScript (Jest) -describe('Boundaries and Edge Cases', () => { - test('[Req: inferred — from functionName() guard] guards against X', () => { - const input = { ...validFixture, field: null }; - const result = process(input); - expect(result).not.toContainBadOutput(); - }); -}); -``` +One test per defensive pattern from Step 5. Each test should include a traceability annotation citing the defensive pattern, e.g., `[Req: inferred — from function_name() guard] guards against X`. 
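Sketched in pytest (the `summarize` guard and its field names are invented for illustration):

```python
# Illustrative boundary test: mutate with a schema-accepted edge value (None
# for an optional field), then assert the output is still well-formed.
def summarize(record):  # stand-in; guards against a missing tag
    tag = record.get("tag") or "untagged"
    return {"tag": tag, "count": len(record.get("items", []))}

def test_missing_tag_guard_handles_none():
    """[Req: inferred — from summarize() guard] guards against missing tag."""
    record = {"tag": None, "items": [1, 2]}  # trigger the defensive path
    result = summarize(record)
    assert result == {"tag": "untagged", "count": 2}  # graceful handling
```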
-```python -# Python (pytest) -class TestBoundariesAndEdgeCases: - """Tests for boundary conditions, malformed input, error handling.""" - - def test_defensive_pattern_name(self, fixture): - """[Req: inferred — from function_name() guard] guards against X.""" - # Mutate to trigger defensive code path - # Assert graceful handling -``` +For each boundary test: +1. Mutate input to trigger the defensive code path (using a value the schema accepts) +2. Process the mutated input +3. Assert graceful handling — the result is valid despite the edge-case input -```java -// Java (JUnit 5) -class BoundariesAndEdgeCasesTest { - @Test - @DisplayName("[Req: inferred — from methodName() guard] guards against X") - void testDefensivePatternName() { - fixture.setField(null); // Trigger defensive code path - var result = process(fixture); - assertNotNull(result); // Assert graceful handling - assertFalse(result.containsBadData()); - } -} -``` -```scala -// Scala (ScalaTest) -class BoundariesAndEdgeCases extends FlatSpec with Matchers { - // [Req: inferred — from methodName() guard] - "defensive pattern: methodName()" should "guard against X" in { - val input = fixture.copy(field = None) // Trigger defensive code path - val result = process(input) - result should equal (defined) - result.get should not contain badData - } -} -``` - -```go -// Go (testing) -func TestDefensivePattern_FunctionName_GuardsAgainstX(t *testing.T) { - // [Req: inferred — from FunctionName() guard] guards against X - input := defaultFixture() - input.Field = nil // Trigger defensive code path - result, err := Process(input) - if err != nil { - t.Fatalf("expected graceful handling, got: %v", err) - } - // Assert result is valid despite edge-case input -} -``` - -```rust -// Rust (cargo test) -#[test] -fn test_defensive_pattern_function_name_guards_against_x() { - // [Req: inferred — from function_name() guard] guards against X - let input = Fixture { field: None, ..default_fixture() }; - let result = 
process(&input).expect("expected graceful handling"); - // Assert result is valid despite edge-case input -} -``` Use your Step 5b schema map when choosing mutation values. Every mutation must use a value the schema accepts. diff --git a/skills/quality-playbook/references/iteration.md b/skills/quality-playbook/references/iteration.md new file mode 100644 index 000000000..545ed2270 --- /dev/null +++ b/skills/quality-playbook/references/iteration.md @@ -0,0 +1,191 @@ +# Iteration Mode Reference + +> This file contains the detailed instructions for each iteration strategy. +> The agent reads this file when running an iteration — all operational detail lives here, +> not in the prompt or in the benchmark runner. + +## Iteration cycle + +The recommended iteration order is: **gap → unfiltered → parity → adversarial**. Each strategy finds different bug classes, and running them in this order maximizes cumulative yield. After each iteration, the skill prints a suggested prompt for the next strategy — follow the cycle until you hit diminishing returns or decide to stop. + +``` +Baseline run # structured three-stage exploration +→ gap scan previous coverage, explore gaps # finds bugs in uncovered subsystems +→ unfiltered pure domain-driven, no structure # finds bugs that structure suppresses +→ parity cross-path comparison and diffing # finds inconsistencies between parallel implementations +→ adversarial challenge dismissed/demoted findings # recovers Type II errors from previous triage +``` + +## Shared rules for all strategies + +These rules apply to every iteration strategy: + +1. **ITER file naming.** Write findings to `quality/EXPLORATION_ITER{N}.md` — check which iteration files already exist and use the next number (e.g., `EXPLORATION_ITER2.md` for the first iteration, `EXPLORATION_ITER3.md` for the second). + +2. **Do NOT delete or archive quality/.** You are building on the existing run, not replacing it. + +3. 
**Context budget discipline.** A first-run EXPLORATION.md can be 200–400 lines. Loading it all into context before starting your own exploration leaves too little room for deep investigation. The previous-run scan should consume ~20–30 lines of context. Targeted deep-reads should consume ~40–60 lines total. This leaves the bulk of your context budget for new exploration. + +4. **Merge.** After completing the strategy-specific exploration, create or update `quality/EXPLORATION_MERGED.md` that combines findings from ALL iterations. For each section, concatenate the findings with clear attribution (`[Iteration 1]` / `[Iteration 2: gap]` / `[Iteration 3: unfiltered]` / etc.). Include the strategy name in the attribution so downstream phases can see which approach surfaced each finding. The Candidate Bugs section should be re-consolidated from all findings across all iterations. If `EXPLORATION_MERGED.md` already exists from a previous iteration, merge the new iteration's findings into it rather than starting from scratch. + + **Demoted Candidates Manifest (mandatory in EXPLORATION_MERGED.md).** After re-consolidating the Candidate Bugs section, add or update a `## Demoted Candidates` section at the end of EXPLORATION_MERGED.md. This section tracks findings that were dismissed, demoted, or deprioritized during any iteration — they are the raw material for the adversarial strategy. 
For each demoted candidate, record: + + ``` + ### DC-NNN: [short title] + - **Source:** [which iteration and strategy first surfaced this] + - **Dismissal reason:** [why it was demoted — e.g., "classified as design choice," "insufficient evidence," "needs runtime confirmation"] + - **Code location:** [file:line references] + - **Re-promotion criteria:** [specific evidence that would flip this to a confirmed candidate — e.g., "show that the permissive behavior violates a documented contract," "trace the code path to prove the edge case is reachable," "demonstrate that the output differs from what the spec requires"] + - **Status:** DEMOTED | RE-PROMOTED [iteration] | FALSE POSITIVE [iteration] + ``` + + The re-promotion criteria are the most important field — they tell the adversarial strategy exactly what evidence to gather. Vague criteria like "needs more investigation" are not acceptable; write criteria that a different agent session could act on without additional context. If a subsequent iteration re-promotes or definitively falsifies a demoted candidate, update its status and add a note explaining the resolution. + +5. **Continue with Phases 2–6.** Use `EXPLORATION_MERGED.md` as the primary input for Phase 2 artifact generation. All downstream artifacts (REQUIREMENTS.md, code review, spec audit) should reference the merged exploration. + + **TDD is mandatory for iteration runs (v1.3.49).** Iteration runs must execute the full TDD red-green cycle for every newly confirmed bug, exactly as baseline runs do. This means: for each new BUG-NNN confirmed in this iteration, create a regression test patch, run it against unpatched code to produce `quality/results/BUG-NNN.red.log`, and if a fix patch exists, run it against patched code to produce `quality/results/BUG-NNN.green.log`. The TDD Log Closure Gate in Phase 5 applies equally to iteration runs — missing log files will cause quality_gate.py to FAIL.
Do not skip TDD because this is "just an iteration" or because prior bugs already have logs. New bugs need new logs. If the test runner is not available for the project's language, create the log file with `NOT_RUN` on the first line and an explanation — the file must still exist. + +6. **Iteration mode completion gate.** Before proceeding to Phase 2 (applies to all strategies): + - `quality/ITERATION_PLAN.md` exists and names the strategy used + - `quality/EXPLORATION_ITER{N}.md` exists for this iteration with at least 80 lines of substantive content + - `quality/EXPLORATION_MERGED.md` exists and contains findings from all iterations + - The merged Candidate Bugs section has at least 2 new candidates not present in previous iterations + - At least 1 finding covers a code area not explored in previous iterations OR re-confirms a previously dismissed finding with fresh evidence + +7. **Suggested next iteration.** At the end of Phase 6, after writing the final PROGRESS.md summary, print a suggested prompt for the next iteration strategy in the cycle. If the current strategy was: + - **gap** → suggest: `Run the next iteration of the quality playbook using the unfiltered strategy.` + - **unfiltered** → suggest: `Run the next iteration of the quality playbook using the parity strategy.` + - **parity** → suggest: `Run the next iteration of the quality playbook using the adversarial strategy.` + - **adversarial** → suggest: `Run the quality playbook from scratch.` (cycle complete) + - **baseline (no strategy)** → suggest: `Run the next iteration of the quality playbook using the gap strategy.` + + Format the suggestion clearly so the user can copy-paste it: + ``` + ──────────────────────────────────────────────────────── + Next iteration suggestion: + "Run the next iteration of the quality playbook using the [strategy] strategy." 
+ ──────────────────────────────────────────────────────── + ``` + +## Meta-strategy: `all` — run every strategy in sequence + +The `all` strategy is a runner-level convenience that executes gap → unfiltered → parity → adversarial in order, each as a separate agent session. A single agent session cannot run multiple strategies (context budget), so `all` is implemented by the orchestrator agent or benchmark runner as a loop of iteration calls. If any strategy finds zero new bugs, stop early (diminishing returns). + +Usage (orchestrator agent): "Run all iterations" — the agent runs gap → unfiltered → parity → adversarial sequentially. +Usage (benchmark runner): `python3 bin/run_playbook.py --next-iteration --strategy all ` (benchmark tooling, not shipped with the skill). `--strategy` also accepts a comma-separated ordered subset, e.g. `--strategy unfiltered,parity,adversarial`. + +--- + +## Strategy: `gap` (default) — find what the previous run missed + +Scan the previous run's coverage and deliberately explore elsewhere. Best when the first run was structurally sound but only covered a subset of the codebase. + +1. **Coverage scan (lightweight).** Read the previous `quality/EXPLORATION.md` using a divide-and-conquer strategy — do NOT load the entire file into context at once. Instead: + - Read just the section headers and first 2–3 lines of each section to build a coverage map + - For each section, record: section name, subsystems covered, number of findings, depth level (shallow = single-function mentions, deep = multi-function traces) + - Write the coverage map to `quality/ITERATION_PLAN.md` + +2. 
**Gap identification.** From the coverage map, identify: + - Subsystems or modules that were not explored at all + - Sections with shallow findings (few lines, single-function mentions, no code-path traces) + - Quality Risks scenarios that were listed but never traced to specific code + - Pattern deep dives that could apply but weren't selected (from the applicability matrix) + - Domain-knowledge questions from Step 6 that weren't addressed + +3. **Targeted deep-read.** For only the 2–3 thinnest or most gap-rich sections, read the full section content from the previous EXPLORATION.md. This gives you specific context about what was already found without consuming your entire context budget on previous findings. + +4. **Gap exploration.** Run a focused Phase 1 exploration targeting only the identified gaps. Use the same three-stage approach (open exploration → quality risks → selected patterns) but scoped to the uncovered areas. Write findings to `quality/EXPLORATION_ITER{N}.md` using the same template structure. + +--- + +## Strategy: `unfiltered` — pure domain-driven exploration without structural constraints + +Ignore the three-stage gated structure entirely. Explore the codebase the way an experienced developer would — reading code, following hunches, tracing suspicious paths — with no pattern templates, applicability matrices, or section format requirements. This strategy deliberately removes the structural scaffolding to let domain expertise drive discovery without constraint. + +**Why this strategy exists:** In benchmarking, the unfiltered domain-driven approach used in skill versions v1.3.25–v1.3.26 found bugs that the structured three-stage approach consistently missed, particularly in web frameworks and HTTP libraries. The structured approach excels at systematic coverage but can over-constrain exploration, causing the model to spend context on format compliance rather than deep code reading. The unfiltered strategy recovers that lost discovery power. + +1. 
**Lightweight previous-run scan.** Read just the `## Candidate Bugs for Phase 2` section and `quality/BUGS.md` from the previous run to know what was already found. Do NOT read the full EXPLORATION.md — you want a fresh perspective, not to be anchored by previous exploration paths. Write a brief note to `quality/ITERATION_PLAN.md` listing what the previous run found and confirming you are using the unfiltered strategy. + +2. **Unfiltered exploration.** Explore the codebase from scratch using pure domain knowledge. No required sections, no pattern applicability matrix, no gate self-check. Instead: + - Read source code deeply — entry points, hot paths, error handling, edge cases + - Follow your domain expertise: "What would an expert in [this domain] find suspicious?" + - For each suspicious finding, trace the code path across 2+ functions with file:line citations + - Generate bug hypotheses directly — not "areas to investigate" but "this specific code at file:line produces wrong behavior because [reason]" + - Write findings to `quality/EXPLORATION_ITER{N}.md` as a flat list of findings, each with file:line references and a bug hypothesis. No structural template required — depth and specificity matter, not section formatting. + - Minimum: 10 concrete findings with file:line references, at least 5 of which trace code paths across 2+ functions + +3. **Domain-knowledge questions.** Complete these questions using the code you just explored AND your domain knowledge. Write your answers inline with your findings, not in a separate gated section: + - What API surface inconsistencies exist between similar methods? + - Where does the code do ad-hoc string parsing of structured formats? + - What inputs would a domain expert try that a developer might not test? + - What metadata or configuration values could be silently wrong? 
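A contrived illustration of the first question (both functions are hypothetical, not from any real library): the single-item parser is strict where the bulk parser is silently permissive. The drift between the two paths, not either behavior alone, is what deserves a file:line citation and a bug hypothesis.

```python
# Hypothetical pair of "similar" entry points (not from a real library),
# sketching the API-surface-inconsistency pattern the question targets.
def parse_header(raw: str) -> dict:
    """Strict single-header parser: rejects empty input."""
    if not raw:
        raise ValueError("empty header")
    key, _, value = raw.partition(":")
    return {key.strip(): value.strip()}

def parse_headers(raw: str) -> list[dict]:
    """Bulk parser: silently skips blank lines where parse_header raises.

    A domain expert probing both entry points with the same empty input
    would surface this inconsistency immediately.
    """
    return [parse_header(line) for line in raw.splitlines() if line]
```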
+ +--- + +## Strategy: `parity` — cross-path comparison and diffing + +Systematically enumerate parallel implementations of the same contract and diff them for inconsistencies. This strategy finds bugs by comparing code paths that should behave the same way but don't. + +**Why this strategy exists:** In benchmarking, the v1.3.40 skill version found 5 bugs in virtio using "fallback path parity" and "cross-implementation consistency" as explicit exploration patterns. Three of those bugs (MSI-X slow_virtqueues reattach, GFP_KERNEL under spinlock, INTx admin queue_idx) were found by lining up parallel code paths and spotting differences — not by exploring individual subsystems. The gap, unfiltered, and adversarial strategies all explore areas or challenge decisions, but none explicitly compare parallel paths. This strategy fills that gap. + +1. **Enumerate parallel paths.** Scan the codebase for groups of code that implement the same contract or handle the same logical operation via different paths. Common categories: + - **Transport/backend variants:** multiple implementations of the same interface (e.g., PCI vs MMIO vs vDPA, sync vs async, HTTP/1.1 vs HTTP/2) + - **Fallback chains:** primary path → fallback → last-resort (e.g., MSI-X → shared → INTx, rich error → generic error) + - **Setup vs teardown/reset:** initialization path vs cleanup/reset path for the same resource + - **Happy path vs error path:** normal flow vs exception/error handling for the same operation + - **Public API variants:** overloaded methods, convenience wrappers, format-specific parsers that should produce equivalent results + - Write the enumeration to `quality/ITERATION_PLAN.md` with a brief description of each parallel group. + +2. **Pairwise comparison.** For each parallel group, read the code paths side by side and systematically check each comparison sub-type below. 
Not every sub-type applies to every parallel group — but explicitly considering each one prevents the strategy from only finding "obvious" discrepancies while missing structural ones. + + **Comparison sub-type checklist** (check each one for every parallel group): + + - **Resource lifecycle parity:** Compare what setup/init does with a resource vs. what teardown/reset/cleanup does with the same resource. Every resource acquired in setup must be released in teardown — and in the same order, with the same scope. Look for resources that setup creates but reset forgets (e.g., a list populated during probe but not drained during reset). + - **Allocation context parity:** Compare allocation flags, lock context, and interrupt state across parallel paths. If one path allocates with `GFP_KERNEL` (sleepable) but runs under a spinlock that another path doesn't hold, that's a bug. Check: what locks are held? What allocation flags are valid in that context? Do parallel paths agree? + - **Identifier and index parity:** Compare how parallel paths compute indices, offsets, or identifiers for the same logical entity. If setup uses `queue_index + admin_offset` but reset uses `raw_queue_index`, the mismatch is a bug candidate. + - **Capability/feature-bit parity:** Compare which feature bits, flags, or capabilities each parallel path checks or sets. If the MSI-X path checks a slow-path vector list but the INTx fallback path doesn't, vectors may be misrouted after fallback. + - **Error/exception parity:** Compare error handling between paths. If the primary path handles an error gracefully but the fallback path lets it propagate, the fallback is less robust than the primary — which is backwards. + - **Iteration/collection parity:** Compare what collections each path iterates over. If setup iterates over `all_queues` but reset iterates over `active_queues`, resources for inactive queues leak. 
+ + For each discrepancy found, trace both code paths with file:line citations and determine whether the difference is intentional (documented, tested, or structurally necessary) or a bug. + +3. **Cross-file contract tracing.** For the most promising discrepancies, trace the call chain across files to verify: + - What lock/interrupt context each path runs in + - What allocation flags are valid in that context + - Whether the contract (documented in specs, comments, or headers) requires parity + - Write findings to `quality/EXPLORATION_ITER{N}.md` with both code paths cited for each finding. + +4. **Minimum output:** At least 5 parallel groups enumerated, at least 8 pairwise comparisons traced with file:line references, at least 3 concrete discrepancy findings. + +--- + +## Strategy: `adversarial` — challenge the previous run's conclusions + +Re-investigate what the previous run dismissed, demoted, or marked SATISFIED. This strategy assumes the previous run made Type II errors (missed real bugs by being too conservative) and systematically challenges those decisions. + +**Why this strategy exists:** In benchmarking, the triage step reliably demotes legitimate findings by demanding excessive evidence, marking ambiguous cases as "design choice," or accepting code-review SATISFIED verdicts without deep verification. The adversarial strategy specifically targets these failure modes. + +1. 
**Load previous decisions.** Read these files from the previous run (use divide-and-conquer — section headers first, then targeted deep reads): + - `quality/EXPLORATION_MERGED.md` — specifically the `## Demoted Candidates` section (this is your primary input — it contains structured re-promotion criteria for each dismissed finding) + - `quality/BUGS.md` — what was confirmed (to avoid re-finding the same bugs) + - `quality/spec_audits/*triage*` — what was dismissed or demoted during triage, and why + - `quality/code_reviews/*.md` — Pass 2 SATISFIED/VIOLATED verdicts + - `quality/EXPLORATION.md` — just the `## Candidate Bugs for Phase 2` section to see which candidates didn't become confirmed bugs + - Write a summary to `quality/ITERATION_PLAN.md` listing: (a) demoted candidates from the manifest with their re-promotion criteria, (b) additional dismissed triage findings not yet in the manifest, (c) candidates that weren't promoted, (d) requirements marked SATISFIED that had thin evidence + +2. **Re-investigate dismissed findings with a lower evidentiary bar.** The adversarial strategy uses a deliberately lower evidentiary standard than earlier strategies. The baseline and gap strategies rightly demand strong evidence to avoid false positives during initial discovery. But by the adversarial iteration, remaining undiscovered bugs are precisely the ones that conservative triage keeps rejecting — they look ambiguous, they could be "design choices," they lack dramatic runtime failures. For these findings: + - A code-path trace showing observable semantic drift (output differs from what spec or contract requires) is sufficient to confirm — you do not need a runtime crash or dramatic failure + - "Permissive behavior" is not automatically a design choice — check whether the spec, docs, or API contract defines the expected behavior. 
If the code deviates from a documented contract, it's a bug regardless of whether the deviation is "permissive" + - If the Demoted Candidates Manifest includes re-promotion criteria, attempt to satisfy those criteria specifically. Each criterion was written to be actionable — follow it + - Read the specific code location cited in the finding + - Trace the code path independently — do not rely on the previous run's analysis + - Make an explicit CONFIRMED/FALSE-POSITIVE determination with fresh evidence + - Update the Demoted Candidates Manifest: change status to RE-PROMOTED or FALSE POSITIVE with the iteration attribution + +3. **Challenge SATISFIED verdicts.** For each requirement the code review marked SATISFIED with thin evidence (single-line citation, no code-path trace, or grouped with 3+ other requirements under one citation): + - Re-verify the requirement by reading the cited code and tracing the behavior + - Check whether the requirement is actually satisfied or whether the review took a shallow pass + +4. **Explore adjacent code.** For each re-confirmed or newly confirmed finding, explore the surrounding code for related bugs — bugs cluster. If a function has one bug, its callers and siblings likely have related issues. + +5. Write all findings to `quality/EXPLORATION_ITER{N}.md`. Each finding must include: the original source (triage dismissal, candidate demotion, or SATISFIED challenge), the fresh evidence, and the new determination. diff --git a/skills/quality-playbook/references/orchestrator_protocol.md b/skills/quality-playbook/references/orchestrator_protocol.md new file mode 100644 index 000000000..760a74c05 --- /dev/null +++ b/skills/quality-playbook/references/orchestrator_protocol.md @@ -0,0 +1,63 @@ +# Orchestrator Protocol + +Shared rules for all orchestrator agent files (Claude Code, Copilot, Cursor, Windsurf). Platform-specific agent files reference this file for the hardening rules and verification gates. 
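The verification gates this protocol defines reduce to mechanical checks that artifacts exist on disk with real content. A minimal sketch of one such check, as a hypothetical helper (not shipped with the skill); the 120-line threshold is the Phase 1 expectation from this protocol's expected-outputs list:

```python
from pathlib import Path

# Hypothetical helper: a phase counts as complete only if its artifacts
# exist on disk with real content, whatever the sub-agent reported.
def verify_outputs(expected: dict[str, int]) -> list[str]:
    """expected maps artifact path -> minimum line count.

    Returns a description of each failed check; an empty list means
    the gate passes.
    """
    failures = []
    for path, min_lines in expected.items():
        p = Path(path)
        if not p.exists():
            failures.append(f"{path}: missing")
        elif len(p.read_text().splitlines()) < min_lines:
            failures.append(f"{path}: fewer than {min_lines} lines")
    return failures

# Phase 1 gate: EXPLORATION.md needs at least 120 substantive lines.
failures = verify_outputs({"quality/EXPLORATION.md": 120,
                           "quality/PROGRESS.md": 1})
```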
+ +## Role Definition + +Your ONLY jobs are: (1) spawn sub-agents to execute phases, (2) verify their output files exist on disk, (3) report progress to the user. You do NOT execute phase logic yourself. Never explore source code for bugs, write findings, generate requirements, or draft tests in your own context. If you find yourself doing any of that, you have violated your role. + +**Why this is strict.** The Quality Playbook is intentionally multi-session: each phase needs the full context window for deep analysis. Running phases in the orchestrator's context is the single most common failure mode — the orchestrator collapses into single-context execution, produces shallow summaries, and writes zero files to disk. This happened on a real casbin run and is why this protocol was hardened. + +## File-Writing Override + +The user's invocation of the playbook is explicit authorization for you and every sub-agent you spawn to write .md files, patches, JSON, and any other artifacts to the `quality/` directory and the project root (`AGENTS.md`). This overrides any base harness rules discouraging documentation or .md file creation. No sub-agent should skip file writes citing harness constraints. + +## Rationalization Patterns + +If you catch yourself producing text like any of these, stop — that's the tell that you're about to collapse into single-context execution: + +- "per system constraint: no report .md files" (or any invented harness restriction) +- "I'll do the analytical work in-context and summarize for the user" +- "spawning a sub-agent is unnecessary overhead for this step" +- "I can cover multiple phases in one pass" +- "the artifacts are optional / can be described rather than written" + +Any of these means you're about to replicate the casbin failure. Spawn the sub-agent instead. + +## Grounding + +If `ai_context/DEVELOPMENT_CONTEXT.md` exists in the skill repo or the working directory, read it before Phase 1. 
It contains the three-axes improvement model and the design intent behind phase separation. Grounding in this document materially reduces the chance of collapsing into single-context execution. + +## Post-Phase Verification Gate (Mandatory) + +After each sub-agent returns, confirm that the expected output files exist and contain real content — not empty scaffolding or placeholder text. If any required file is missing or trivially small, the phase failed regardless of what the sub-agent reported. The sub-agent's claim of completion is insufficient evidence — only files on disk count. + +Express each check as content criteria ("verify that `quality/EXPLORATION.md` exists and has at least 120 lines"), not as specific tool invocations. Use whatever file-reading and directory-listing capability is available. + +### Expected outputs per phase + +Cross-reference SKILL.md's Complete Artifact Contract for the authoritative list. + +- **Phase 1 (Explore):** `quality/EXPLORATION.md` exists with at least 120 lines of substantive content; `quality/PROGRESS.md` exists with Phase 1 marked complete. +- **Phase 2 (Generate):** All of these exist: `quality/REQUIREMENTS.md`, `quality/QUALITY.md`, `quality/CONTRACTS.md`, `quality/COVERAGE_MATRIX.md`, `quality/COMPLETENESS_REPORT.md`, `quality/RUN_CODE_REVIEW.md`, `quality/RUN_INTEGRATION_TESTS.md`, `quality/RUN_SPEC_AUDIT.md`, `quality/RUN_TDD_TESTS.md`. A functional test file exists in `quality/` (naming varies by language: `quality/test_functional.`). **AGENTS.md is NOT a Phase 2 output** — it is generated post-Phase-6 by the orchestrator (see SKILL.md Phase 2 source-modification guardrail). Phase 2 writes ONLY into `quality/`. +- **Phase 3 (Code Review):** `quality/code_reviews/` contains at least one review file. If bugs were confirmed: `quality/BUGS.md` has at least one `### BUG-` entry, `quality/patches/` contains a regression-test patch per confirmed bug, and `quality/test_regression.*` exists. 
+- **Phase 4 (Spec Audit):** `quality/spec_audits/` contains at least one triage file AND at least one individual auditor file. +- **Phase 5 (Reconciliation):** If bugs were confirmed: `quality/results/tdd-results.json` exists, a writeup at `quality/writeups/BUG-NNN.md` exists for every confirmed bug, and a red-phase log exists at `quality/results/BUG-NNN.red.log` for every confirmed bug. +- **Phase 6 (Verify):** `quality/results/quality-gate.log` exists and PROGRESS.md marks Phase 6 complete with a Terminal Gate Verification section. + +### After verification passes + +Report the phase's key findings to the user. Continue to the next phase (or stop if in phase-by-phase mode). + +### If verification fails + +Report what files are missing or empty. Do NOT spawn the next phase — the missing output must be repaired first. Offer to retry the failed phase in a fresh sub-agent. + +## Error Recovery + +If a sub-agent fails or runs out of context: + +1. Assess what was saved to disk (PROGRESS.md and the `quality/` directory). +2. Report the failure with specifics. +3. Suggest retrying in a fresh sub-agent — phase writes are preserved incrementally, so a retry can pick up where the previous attempt left off. +4. Never skip phases — each depends on prior output. diff --git a/skills/quality-playbook/references/requirements_pipeline.md b/skills/quality-playbook/references/requirements_pipeline.md new file mode 100644 index 000000000..361730e03 --- /dev/null +++ b/skills/quality-playbook/references/requirements_pipeline.md @@ -0,0 +1,427 @@ +# Requirements Pipeline + +## Overview + +This document defines the five-phase requirements generation pipeline for Step 7 of the Quality Playbook. The pipeline separates contract discovery from requirement derivation, uses file-based external memory so the model doesn't need to hold everything in context simultaneously, and includes mechanical verification with a completeness gate. 
+ +**Why a pipeline?** Single-pass requirement generation runs out of attention after ~70 requirements because the model is simultaneously discovering contracts and writing formal requirements. Separating these into distinct phases with file-based handoffs produces significantly more complete coverage. In testing on Gson (81 source files, ~21K lines), single-pass produced 48 requirements; the pipeline produced 110. + +## Files produced + +| File | Purpose | +|------|---------| +| `quality/CONTRACTS.md` | Raw behavioral contracts extracted from source | +| `quality/REQUIREMENTS.md` | Testable requirements with narrative (the primary deliverable) | +| `quality/COVERAGE_MATRIX.md` | Contract-to-requirement traceability | +| `quality/COMPLETENESS_REPORT.md` | Final completeness assessment with verdict | +| `quality/VERSION_HISTORY.md` | Review log with version table and provenance | +| `quality/REFINEMENT_HINTS.md` | Review progress and feedback (created during review) | + +Versioned backups go in `quality/history/vX.Y/`. + +--- + +## Phase A: Extract behavioral contracts + +**Input:** All source files in the project (or a scoped subsystem — see scaling check below). +**Output:** `quality/CONTRACTS.md` + +### Scaling check + +Before starting extraction, count the source files in the project (exclude tests, generated code, vendored dependencies, and build artifacts). + +- **Standard project (≤300 source files):** Proceed normally — extract contracts from all files. Projects in this range have been tested end-to-end (e.g., Gson at ~81 source files produced 110 requirements with full coverage). +- **Large project (301–500 source files):** Focus on the 3–5 core subsystems identified in Phase 1, Step 2. Extract contracts from those modules and their internal dependencies. Note the scope in the CONTRACTS.md header so reviewers know what was covered. +- **Very large project (>500 source files):** Recommend that the user scope the pipeline to one subsystem at a time. 
Each subsystem gets its own pipeline run producing its own REQUIREMENTS.md, CONTRACTS.md, etc. Tell the user: "This project has N source files. For best results, run the requirements pipeline separately for each major subsystem (e.g., 'Generate requirements for the authentication module'). A single pipeline run across the full codebase will miss contracts due to context limits." + +If the user explicitly asks for full-project scope on a large codebase, honor the request but warn that coverage will be thinner than subsystem-level runs. + +### Scope breadth on the initial pass + +On the first pipeline run, favor breadth over depth. Cover all major subsystems and modules rather than going deep on a few. The goal is a broad baseline that the self-refinement loop and later review/refinement passes can deepen. If you focus on 3 modules and skip 8 others, the completeness check can't find gaps in modules it never saw. + +For projects with both a core library and supporting modules (middleware, plugins, adapters, extensions), include at least the core and the highest-risk supporting modules in Phase A. Note the scope in the CONTRACTS.md header so it's clear what was covered and what wasn't. Refinement passes can expand scope later, but the initial pass should cast the widest net the context window allows. + +### Contract extraction + +Read every source file (within scope) and list every behavioral contract it implements or should implement. 
A behavioral contract is any promise the code makes to its callers: + +- **METHOD**: What a public method guarantees about return value, side effects, exceptions, thread safety +- **NULL**: What happens when null is passed, returned, or stored +- **CONFIG**: What effect a configuration option has at its boundaries +- **ERROR**: What exceptions are thrown, when, and with what diagnostic information +- **INVARIANT**: Properties that must always hold +- **COMPAT**: Behaviors preserved for backward compatibility +- **ORDER**: Whether output/iteration order is stable, documented, or undefined +- **LIFECYCLE**: Resource creation/cleanup, initialization sequencing +- **THREAD**: Thread-safety guarantees or requirements + +### Contract extraction rules + +- **Be thorough.** For a 200-line file, expect 5–15 contracts. For a 1000-line file, expect 20–40. If you're finding fewer than 3 contracts in a file with real logic, you're skipping things. +- **Include internal files.** Internal contracts matter because the public API depends on them. +- **Include "should exist" contracts** — things the code doesn't do but should based on its domain. These catch absence bugs. +- **Read the code, not just the Javadoc/docstrings.** When documentation and code disagree, list both. +- **This is discovery, not judgment.** List everything, even if it seems obvious. + +### Output format + +``` +# Behavioral Contract Extraction +Generated: [date] +Source files analyzed: N +Total contracts extracted: N + +## Summary by category +- METHOD: N +- NULL: N +- CONFIG: N +[etc.] + +### path/to/file.ext (N contracts) + +1. [METHOD] ClassName.methodName(): description of what it guarantees +2. [NULL] ClassName.methodName(): what happens when null is passed/returned +[etc.] +``` + +--- + +### Requirement heading format + +All requirements in REQUIREMENTS.md must use the format `### REQ-NNN: Title` where NNN is a zero-padded three-digit number and Title is a short descriptive name. 
Do not use alternative formats like `### REQ-NNN — Title`, `### REQ-NNN. Title`, `**REQ-NNN**: Title`, or freeform headings without a number. Consistent formatting enables automated tooling to parse and cross-reference requirements. + +--- + +## Phase B: Derive requirements from contracts + +**Input:** `quality/CONTRACTS.md`, project documentation, SKILL.md Step 7 template. +**Output:** `quality/REQUIREMENTS.md` + +### How to work + +**B.1 — Group related contracts.** Many contracts across different files serve the same behavioral requirement. Group them by behavioral concern, not by file. Don't merge unrelated contracts just because they're in the same file. + +**B.2 — Enrich with intent.** For each group, find the user story from documentation: GitHub issues state what users expect, the user guide states intended behavior, troubleshooting docs reveal known edge cases, design docs explain design goals. The "so that" clause must come from understanding who cares and why. + +**B.3 — Write requirements.** Use the 7-field template from SKILL.md Step 7. Conditions of satisfaction come from the individual contracts in the group — each contract becomes a condition of satisfaction. + +**B.4 — Check for orphan contracts.** After writing all requirements, verify every contract in CONTRACTS.md is covered. Uncovered contracts become new requirements or get added to existing requirements' conditions of satisfaction. + +### Rules + +- **Do not cap the requirement count.** Write as many as the contracts warrant. +- **Every contract must map to at least one requirement.** +- **One requirement per distinct behavioral concern.** Don't merge "thread safety" with "null handling" just because they're in the same class. +- **Do not modify CONTRACTS.md.** Only read it. 
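The heading rule above is mechanical enough to lint. A minimal sketch, assuming a hypothetical checker (the shipped quality_gate.py may implement this differently):

```python
import re

# Accepts only the canonical heading form: "### REQ-NNN: Title"
REQ_HEADING = re.compile(r"^### REQ-(\d{3}): (\S.*)$")

def check_headings(markdown: str) -> list[str]:
    """Return REQ numbers of well-formed headings.

    Non-conforming variants (em-dash, bold-prefix, freeform) simply
    fail to match; a real gate would flag them for correction.
    """
    return [m.group(1) for line in markdown.splitlines()
            if (m := REQ_HEADING.match(line))]

sample = "\n".join([
    "### REQ-001: Null keys are rejected",
    "### REQ-002 — Title",   # non-conforming: em-dash instead of colon
    "**REQ-004**: Title",    # non-conforming: bold prefix
    "### REQ-003: Thread-safe iteration",
])
```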
+ +--- + +## Phase C: Verify coverage (loop, max 3 iterations) + +**Input:** `quality/CONTRACTS.md`, `quality/REQUIREMENTS.md` +**Output:** `quality/COVERAGE_MATRIX.md`, updated `quality/REQUIREMENTS.md` + +For every contract in CONTRACTS.md, determine whether it is covered by a requirement. A contract is "covered" if a requirement's conditions of satisfaction explicitly test the behavior. A contract is NOT covered if it's only tangentially mentioned, implied but not stated, or if a different aspect of the same file is covered but this specific contract isn't. + +### Output format + +``` +# Contract Coverage Matrix +Generated: [date] +Total contracts: N +Covered: N (percentage) +Uncovered: N (percentage) +Partially covered: N (percentage) + +## Fully covered contracts +[file]: [contract summary] → REQ-NNN (conditions of satisfaction #M) + +## Partially covered contracts +[file]: [contract summary] → REQ-NNN covers the general area but misses [specific aspect] + +## Uncovered contracts +[file]: [contract summary] → No requirement addresses this behavior +``` + +After writing the matrix, fix gaps in REQUIREMENTS.md: add missing conditions to existing requirements or create new requirements. Report changes. + +**Loop termination:** If uncovered count reaches 0, proceed to Phase D. Otherwise, regenerate the matrix and check again. Maximum 3 iterations. + +--- + +## Phase D: Completeness check + +**Input:** `quality/REQUIREMENTS.md`, `quality/CONTRACTS.md`, `quality/COVERAGE_MATRIX.md`, source tree. +**Output:** `quality/COMPLETENESS_REPORT.md`, updated `quality/REQUIREMENTS.md` + +This is the final gate before the narrative pass. Run three checks: + +### Check 1: Domain completeness + +The following behavioral domains MUST have requirements. Check each one. This checklist is a minimum — if you notice a domain not listed that should have requirements for this project's domain, add it. 
+ +- [ ] **Null handling:** explicit null, absent fields, null keys, null values in collections +- [ ] **Type coercion:** string↔number, string↔boolean, number precision, overflow +- [ ] **Primitive vs wrapper:** primitive vs object null semantics during deserialization (for languages with this distinction) +- [ ] **Generic types:** erasure boundaries, wildcard handling, recursive generics (for languages with generics) +- [ ] **Thread safety:** concurrent access, publication safety, cache visibility +- [ ] **Error diagnostics:** exception types, path context, location information +- [ ] **Resource management:** stream closing, reader/writer lifecycle +- [ ] **Backward compatibility:** wire format stability, API behavioral stability +- [ ] **Security:** DoS protection (nesting depth, string length), injection prevention +- [ ] **Encoding:** Unicode, BOM, surrogate pairs, escape sequences +- [ ] **Date/time:** format precedence, timezone handling, precision +- [ ] **Collections:** arrays, lists, sets, maps, queues — empty, null elements, ordering +- [ ] **Enums:** name resolution, aliases, unknown values +- [ ] **Polymorphism:** runtime type vs declared type, adapter/handler delegation +- [ ] **Tree model / intermediate representation:** mutation semantics, deep copy structural independence, null normalization +- [ ] **Configuration:** builder immutability, instance isolation, option composition +- [ ] **Entry points:** every distinct public entry point must have its own contract — string-based, stream-based, tree-based, standalone parsing, multi-value parsing. If the library has N ways to start a read or write, there must be N sets of contracts. 
+- [ ] **Output escaping:** which characters are escaped by default, what disabling escaping changes, how builder-level and writer-level controls interact +- [ ] **Built-in type handler contracts:** for each built-in handler that processes a standard library type, state what it promises about format, precision, normalization, and round-trip fidelity. The requirement should specify the handler's promise, not just that a handler exists. +- [ ] **Field/property serialization ordering:** whether output order follows declaration order, inheritance order, alphabetical order, or is undefined. State whether ordering is a promised contract or merely observed behavior. +- [ ] **Identity contracts for public types:** `toString()`, `hashCode()`/`equals()` (or language equivalent) on public model types. These are behavioral contracts users depend on for comparison, logging, and collection key usage. +- [ ] **Input validation:** for every configuration field with domain constraints, state the valid range and whether validation exists. + +For each domain, either cite the REQ-NNN numbers that cover it or flag it as a gap. + +### Check 2: Testability audit + +For each requirement, check whether its conditions of satisfaction are actually testable. Can a reviewer write a concrete test case from this condition? Is pass/fail unambiguous? Does the condition cover failure modes, not just the happy path? + +### Check 3: Cross-requirement consistency + +Check pairs of requirements that reference the same concept. Do ranges agree? Do null-handling rules agree? Do thread-safety guarantees conflict with lifecycle contracts? Do configuration defaults match across requirements? + +### Check 4: Cross-artifact consistency (if code review or spec audit results exist) + +If `quality/code_reviews/` or `quality/spec_audits/` contain results from a previous or current run, read them. 
For every finding with status VIOLATED, BUG, or INCONSISTENT, check whether the requirements address the behavioral concern that finding targets. If a code review found a bug in compression header parsing that the requirements don't cover, that's a completeness gap — add a requirement or conditions of satisfaction to close it. + +**The completeness report cannot say COMPLETE if unaddressed findings exist.** If any VIOLATED/BUG/INCONSISTENT finding from code review or spec audit targets behavior not covered by requirements, the verdict must be INCOMPLETE with the specific gaps listed. + +This check exists because earlier versions of the pipeline produced completeness reports that said "COMPLETE" while the code review in the same run found requirement violations. The completeness report must be consistent with all other quality artifacts. + +### Post-review completeness refresh (mandatory) + +**After the code review and spec audit are complete**, re-read `quality/COMPLETENESS_REPORT.md` and update it. The initial completeness report was written before the code review and spec audit ran, so it cannot reflect their findings. This refresh step reconciles the completeness verdict with the actual review results. + +**Procedure:** +1. Read the combined summary from `quality/code_reviews/` — count VIOLATED and BUG findings. +2. Read the triage summary from `quality/spec_audits/` — count confirmed code bugs. +3. For each finding, check whether REQUIREMENTS.md has a requirement covering that behavior. +4. Append a `## Post-Review Reconciliation` section to COMPLETENESS_REPORT.md: + +``` +## Post-Review Reconciliation +Updated: [date] + +### Code review findings: N VIOLATED, M BUG +- [finding summary] → covered by REQ-NNN / NOT COVERED (gap) +- ... + +### Spec audit findings: N confirmed code bugs +- [finding summary] → covered by REQ-NNN / NOT COVERED (gap) +- ... 
+ +### Updated verdict +[COMPLETE if all findings are covered by requirements, INCOMPLETE if gaps remain] +``` + +5. If the original verdict was COMPLETE but unaddressed findings exist, change the verdict to INCOMPLETE. + +### Resolving code review vs spec audit conflicts + +When the code review and spec audit disagree about the same behavioral claim — one says BUG, the other says design choice or false positive — the reconciliation must resolve the conflict, not paper over it. + +**Resolution procedure:** +1. Identify the factual claim at the center of the disagreement. What does the code actually do? +2. Deploy a verification probe: give a model the disputed claim and the relevant source code, and ask it to report ground truth. (See `spec_audit.md` § "The Verification Probe.") +3. Record the resolution in the Post-Review Reconciliation section: + ``` + ### Conflicts resolved + - [finding description]: Code review said [X], spec audit said [Y]. + Verification probe: [what the code actually does]. + Resolution: [BUG CONFIRMED / FALSE POSITIVE / DESIGN CHOICE]. [Explanation.] + ``` +4. If the resolution confirms a BUG, ensure it has a regression test. If the resolution overturns a BUG, clean up the regression test per `review_protocols.md` § "Cleaning up after spec audit reversals." + +**Do not resolve conflicts by defaulting to one source.** Neither the code review nor the spec audit is automatically more authoritative — they use different methods (structural reading vs. spec comparison) and have different blind spots. The verification probe is the tiebreaker. + +**This refresh is not optional.** A completeness report that predates the code review is a timestamp, not a quality gate. The refresh turns it into an actual reconciliation. 
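The counting-and-coverage part of this refresh is mechanical enough to sketch. Below is an illustrative Python sketch: collect VIOLATED/BUG/INCONSISTENT findings, check each against the current requirement set, and compute the updated verdict. The directory layout, the regexes, and the heuristic that a finding is "covered" when every REQ number it cites still exists are all assumptions for illustration; the real reconciliation must judge whether a requirement actually addresses the behavior, and findings that cite no REQ number need a manual coverage check.

```python
import re
from pathlib import Path

# Illustrative sketch of the post-review reconciliation step. The file
# layout and parsing heuristics are assumptions, not part of the playbook.
FINDING = re.compile(r"(VIOLATED|BUG|INCONSISTENT)")
REQ_ID = re.compile(r"REQ-\d{3}")

def collect_findings(review_dir: Path) -> list[str]:
    """One entry per report line that mentions a VIOLATED/BUG/INCONSISTENT status."""
    findings = []
    for report in sorted(review_dir.glob("*.md")):
        for line in report.read_text(encoding="utf-8").splitlines():
            if FINDING.search(line):
                findings.append(line.strip())
    return findings

def reconcile(findings: list[str], existing_reqs: set[str]):
    """Return (verdict, gaps): INCOMPLETE when a finding cites a REQ number
    that no current requirement covers. Simplification: a finding citing no
    REQ number at all is treated as needing manual review, not as a gap."""
    gaps = []
    for finding in findings:
        cited = set(REQ_ID.findall(finding))
        if cited and not cited <= existing_reqs:
            gaps.append(finding)
    return ("COMPLETE" if not gaps else "INCOMPLETE"), gaps
```

A single uncovered finding is enough to flip the verdict, which matches the rule above: the completeness report cannot say COMPLETE while unaddressed findings exist.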
+ +### Output format + +``` +# Completeness Report +Generated: [date] + +## Domain coverage +[For each domain: COVERED (REQ-NNN, REQ-NNN) or GAP (description)] + +## Testability issues +[For each vague requirement: REQ-NNN — condition N is not testable because...] + +## Consistency issues +[For each conflict: REQ-NNN and REQ-NNN disagree about...] + +## Cross-artifact gaps (if code review/spec audit results exist) +[For each unaddressed finding: finding summary → missing requirement or condition] + +## Verdict +COMPLETE or INCOMPLETE with recommended actions +``` + +Then fix what you can: add requirements for domain gaps, sharpen vague conditions, resolve consistency issues, and close cross-artifact gaps. + +**Important:** This is the final check. Be adversarial. Assume previous passes were imperfect. For each domain marked COVERED, verify that the cited requirements actually address the checklist item — don't just check the box. + +### Self-refinement loop (max 3 iterations) + +After the initial completeness check, run up to 3 refinement iterations to close the gaps Phase D identified: + +1. **Read the completeness report.** Identify all GAP entries, testability issues, and consistency issues. +2. **Fix gaps in REQUIREMENTS.md.** For each GAP: add a new requirement using the 7-field template, or add conditions of satisfaction to an existing requirement. For testability issues: sharpen the condition. For consistency issues: resolve the conflict. +3. **Re-run all three checks** (domain completeness, testability audit, cross-requirement consistency). Write the updated results to COMPLETENESS_REPORT.md. +4. **Count the delta.** How many new requirements were added or existing requirements modified in this iteration? +5. **Short-circuit check:** If the delta is fewer than 3 changes, stop — you've hit diminishing returns. Proceed to Phase E. 
+ +**Why this works:** The initial completeness check identifies gaps but the model may not fix all of them in one pass, especially conceptual gaps where the model needs to re-read source files to understand what's missing. Each iteration shrinks the gap. Three iterations is enough to close the mechanical gaps; the remaining conceptual gaps are where cross-model audit and human review earn their keep. + +**Why it has limits:** This is self-refinement — the same model checking its own work. It catches gaps the model can see once they're pointed out (uncovered domains, vague conditions, numeric inconsistencies) but won't catch blind spots the model doesn't recognize as gaps. That's by design. The review and refinement protocols exist for closing those deeper gaps with different models or human input. + +After the loop completes (or short-circuits), proceed to Phase E. + +--- + +## Phase E: Narrative pass + +**Input:** `quality/REQUIREMENTS.md`, `quality/CONTRACTS.md`, project documentation, source tree. +**Output:** Restructured `quality/REQUIREMENTS.md` + +**Before starting:** Save a backup: `cp quality/REQUIREMENTS.md quality/REQUIREMENTS_pre_narrative.md` + +This phase transforms the specification into a guide. Add explanatory tissue so a new team member, code reviewer, or AI agent can read the document top-to-bottom and understand the software. + +### E.1 — Project overview (new, top of document) + +Write 400–600 words of connected prose explaining: what the software is, who uses it and why (primary personas and goals), how data flows through the major components, and the design philosophy (key architectural decisions and why they were made). + +### E.2 — Use cases (new, after overview) + +Write 6–8 use cases in the style of Applied Software Project Management (Stellman & Greene). 
Each has: + +- **Name**: Short descriptive name +- **Actor**: Who initiates it +- **Preconditions**: What must be true before this begins +- **Steps**: Numbered actor/system action sequence +- **Postconditions**: What is true on success +- **Alternative paths**: Variations and error cases +- **Requirements**: Which REQ-NNN numbers this use case exercises + +Cover the major usage patterns. The use cases are the bridge between "what the software does" and "what the requirements specify." + +### E.3 — Cross-cutting concerns (new, after use cases) + +Document architectural invariants that span multiple categories: threading model, null contract, error philosophy, backward compatibility strategy, configuration composition. Each references specific REQ-NNN numbers. Write as prose paragraphs. + +### E.4 — Category narratives (augment existing) + +For each requirement category, add 2–4 sentences before the first requirement explaining what the category covers, how it relates to other categories, and what a reviewer should keep in mind. + +### E.5 — Reorder for top-down flow + +Reorder categories from user-facing (entry points, configuration) to infrastructure (error handling, backward compatibility). Fold any catch-all sections into proper categories. + +### E.6 — Renumber sequentially + +After reordering, renumber all requirements REQ-001 through REQ-NNN following document order. Update all internal cross-references. + +### Rules + +- **Do not delete, merge, or weaken any existing requirement.** +- **Do not add new requirements in this pass.** +- **Write the overview and use cases from the user's perspective.** +- **Use cases must cite specific REQ numbers.** + +--- + +## Versioning protocol + +### Version scheme: major.minor + +- **Major** bump: structural changes (new pipeline architecture, narrative pass added, major scope expansion). Bumped by the user. +- **Minor** bump: refinement passes, gap fills, sharpened conditions. 
Increments automatically on each pipeline run or refinement pass. + +### VERSION_HISTORY.md + +Maintain a version history file at `quality/VERSION_HISTORY.md`: + +```markdown +# Requirements Version History + +## Current version: vX.Y + +| Version | Date | Model | Author | Reqs | Summary | +|---------|------|-------|--------|------|---------| +| v1.0 | YYYY-MM-DD | [model] | Quality Playbook | N | Initial pipeline generation | +| v1.1 | YYYY-MM-DD | [model] | [author] | N | [what changed] | + +## Pending review +[status from REFINEMENT_HINTS.md if review is in progress] +``` + +The **Author** column records provenance: "Quality Playbook" for automated pipeline runs, a person's name for manual edits, a model name for refinement passes. + +### Backup protocol + +Before each version change, copy all quality files to `quality/history/vX.Y/`: + +``` +quality/history/ +├── v1.0/ +│ ├── REQUIREMENTS.md +│ ├── CONTRACTS.md +│ ├── COVERAGE_MATRIX.md +│ └── COMPLETENESS_REPORT.md +├── v1.1/ +│ └── ... +└── v2.0/ + └── ... +``` + +Each version folder is a complete snapshot. Users can diff any two versions. + +### Version stamping + +The REQUIREMENTS.md header includes the current version: + +```markdown +# Behavioral Requirements — [Project Name] +Version: vX.Y +Generated: [date] +Pipeline: contract-extraction v2 with narrative pass +``` + +--- + +## After the pipeline: review and refinement + +The pipeline produces a solid baseline, but AI isn't 100% reliable. The skill provides two standalone tools for iterative improvement: + +### Requirements review (`quality/REVIEW_REQUIREMENTS.md`) + +An interactive or guided review of requirements organized by use case. Three modes: +- **Self-guided**: Pick use cases to drill into +- **Fully guided**: Walk through use cases sequentially +- **Cross-model audit**: A different model fact-checks the completeness report + +Progress and feedback are tracked in `quality/REFINEMENT_HINTS.md`. 
See the generated `quality/REVIEW_REQUIREMENTS.md` for the full protocol. + +### Requirements refinement (`quality/REFINE_REQUIREMENTS.md`) + +Reads `quality/REFINEMENT_HINTS.md` and updates `quality/REQUIREMENTS.md` to close identified gaps. Can be run with any model. Backs up the current version, bumps minor version, reports all changes. See the generated `quality/REFINE_REQUIREMENTS.md` for the full protocol. + +### Multi-model refinement + +Users can run refinement passes with different models to catch different blind spots. Each pass: backup → refine → version bump → log in VERSION_HISTORY.md. Run as many models as desired until diminishing returns. diff --git a/skills/quality-playbook/references/requirements_refinement.md b/skills/quality-playbook/references/requirements_refinement.md new file mode 100644 index 000000000..6a15a4e6c --- /dev/null +++ b/skills/quality-playbook/references/requirements_refinement.md @@ -0,0 +1,113 @@ +# Requirements Refinement Protocol + +## Overview + +This is the template for `quality/REFINE_REQUIREMENTS.md`. The playbook generates this file alongside the requirements pipeline output. It provides a structured process for updating requirements based on review feedback, and can be run with any model. + +## Generated file template + +The playbook should generate the following as `quality/REFINE_REQUIREMENTS.md`: + +--- + +```markdown +# Requirements Refinement Protocol: [Project Name] + +## How to use + +This protocol reads feedback from `quality/REFINEMENT_HINTS.md` and updates `quality/REQUIREMENTS.md` to close identified gaps. It can be run with any AI model — the protocol is self-contained. + +**Multi-model refinement:** You can run this protocol multiple times with different models. Each run backs up the current version, makes targeted improvements, bumps the minor version, and logs the changes. Run as many models as you want until you hit diminishing returns. + +--- + +## Before starting + +1. 
Read `quality/REFINEMENT_HINTS.md` — this contains the review feedback to address. +2. Read `quality/REQUIREMENTS.md` — the current requirements to update. +3. Read `quality/CONTRACTS.md` — for contract-level detail when adding new conditions. +4. Read `quality/VERSION_HISTORY.md` — to determine the current version number. + +## Step 1: Backup and version + +1. Read the current version from `quality/VERSION_HISTORY.md`. +2. Copy all files in `quality/` to `quality/history/vX.Y/` (current version number). +3. Bump the minor version: v1.2 becomes v1.3. +4. Update the version stamp at the top of `quality/REQUIREMENTS.md`. + +## Step 2: Process feedback + +Read each item in REFINEMENT_HINTS.md and categorize it: + +- **Gap — missing requirement:** A behavioral contract or domain area has no requirement. Create a new requirement using the 7-field template. +- **Gap — missing condition:** An existing requirement doesn't cover a specific scenario. Add a condition of satisfaction to the existing requirement. +- **Gap — missing use case coverage:** A use case doesn't link to a requirement that governs one of its steps. Add the REQ-NNN to the use case's Requirements line. +- **Sharpening — vague condition:** A condition of satisfaction is too vague to test. Rewrite it with concrete pass/fail criteria. +- **Correction — wrong content:** A requirement states something incorrect. Fix the specific field. +- **Cross-model audit finding:** A domain was marked COVERED in the completeness report but the cited requirements don't actually address it. Add the missing requirements. +- **Removal (user-directed only):** The user explicitly states a requirement is incorrect and should be removed (e.g., "REQ-047 is incorrect because X — remove it"). Only process removals when the hint clearly comes from the user, not from an automated pass. Log the removal and reason in the change report. + +## Step 3: Make changes + +For each feedback item: + +1. 
**New requirements:** Add at the end of the appropriate category section. Continue the existing numbering sequence. Follow the 7-field template exactly. +2. **Modified requirements:** Edit the specific field that needs changing. Do not rewrite requirements that aren't flagged. +3. **Use case updates:** Add newly created REQ numbers to the relevant use case's Requirements line. +4. **Cross-cutting concerns:** If new requirements affect cross-cutting concerns, update those sections. + +### Rules + +- **Do not delete or weaken existing requirements during automated refinement.** Every requirement that exists today must exist after refinement with at least the same conditions of satisfaction — unless the user has explicitly marked a requirement for removal with a reason. User-directed removals are the only exception. +- **Do not renumber existing requirements.** New requirements get the next available number. This preserves traceability across versions. +- **Do not restructure the document.** The narrative pass already established the structure. Refinement is surgical — add, sharpen, or fix individual items. +- **Each change must be traceable to a feedback item.** Don't make changes that weren't asked for. + +## Step 4: Report changes + +After all changes, append a summary to `quality/REFINEMENT_HINTS.md`: + +``` +## Refinement Pass — v[new version] +Date: [date] +Model: [model name] + +### Changes made +- REQ-NNN (NEW): [brief description] — addresses feedback: "[quoted hint]" +- REQ-NNN: Added condition of satisfaction for [what] — addresses feedback: "[quoted hint]" +- REQ-NNN: Sharpened condition #N: [what changed] — addresses feedback: "[quoted hint]" +- Use Case N: Added REQ-NNN to requirements list + +### Feedback items not addressed +- "[quoted hint]" — reason: [why this wasn't actionable or was out of scope] + +### Summary +Added N new requirements, modified N existing requirements, updated N use cases. +Total requirements: N (was N). 
+``` + +## Step 5: Update version history + +Add a row to `quality/VERSION_HISTORY.md`: + +``` +| vX.Y | YYYY-MM-DD | [model] | [author] | N | [summary of changes] | +``` + +## Step 6: Update completeness report + +If new requirements were added that address domain checklist gaps, update the relevant domain entries in `quality/COMPLETENESS_REPORT.md` to cite the new REQ numbers. + +--- + +## Running multiple refinement passes + +Each pass follows the same protocol: +1. Read the latest REFINEMENT_HINTS.md (which now includes the previous pass's report) +2. Focus only on feedback items marked "not addressed" or new feedback added since the last pass +3. Backup, bump version, make changes, report + +The user can add new hints between passes by editing REFINEMENT_HINTS.md directly. The next refinement pass picks them up automatically. + +The user can also run a fresh cross-model audit (Mode 3 of the review protocol) between refinement passes to find new gaps that the previous refinement didn't catch. This creates a review → refine → review → refine cycle that converges on completeness. +``` diff --git a/skills/quality-playbook/references/requirements_review.md b/skills/quality-playbook/references/requirements_review.md new file mode 100644 index 000000000..e395bb1c2 --- /dev/null +++ b/skills/quality-playbook/references/requirements_review.md @@ -0,0 +1,158 @@ +# Requirements Review Protocol + +## Overview + +This is the template for `quality/REVIEW_REQUIREMENTS.md`. The playbook generates this file alongside the requirements pipeline output. It provides three modes for reviewing requirements interactively after generation. + +## Generated file template + +The playbook should generate the following as `quality/REVIEW_REQUIREMENTS.md`: + +--- + +```markdown +# Requirements Review Protocol: [Project Name] + +## How to use + +This protocol helps you review the generated requirements for completeness and accuracy. 
Run it with any AI model — the review is self-contained and reads from the files in `quality/`. + +**Before starting:** Make sure `quality/REQUIREMENTS.md` exists (from the pipeline) and that you've read the Project Overview and Use Cases sections at the top. + +### Choose a review mode + +**Mode 1 — Self-guided review.** You pick which use cases to examine. Best when you already know which areas of the project need the most scrutiny. + +**Mode 2 — Fully guided review.** The AI walks you through every use case in order, drilling into each linked requirement. Best for a thorough first review. + +**Mode 3 — Cross-model audit.** A different AI model fact-checks the completeness report by verifying that every domain marked COVERED actually has requirements addressing the checklist item. Best run with a different model than the one that generated the requirements. + +All three modes track progress in `quality/REFINEMENT_HINTS.md`. + +--- + +## Mode 1: Self-guided review + +Read `quality/REQUIREMENTS.md` and present the user with a numbered list of use cases: + +``` +Use cases in REQUIREMENTS.md: +1. [x] Use Case 1: [name] (reviewed) +2. [ ] Use Case 2: [name] +3. [ ] Use Case 3: [name] +... +``` + +Check `quality/REFINEMENT_HINTS.md` for review progress — use cases marked `[x]` have already been reviewed. Present the list and ask the user which use case to examine. + +When the user picks a use case: +1. Show the use case (actor, steps, postconditions, alternative paths) +2. List the linked REQ-NNN numbers +3. Ask: "Want to drill into any of these requirements, or does this use case look complete?" + +When drilling into a requirement: +1. Show the full requirement (summary, user story, conditions of satisfaction, alternative paths) +2. Ask: "Does this capture the right behavior? Anything missing or wrong?" +3. 
Record feedback in REFINEMENT_HINTS.md under the use case heading + +After reviewing a use case, mark it `[x]` in REFINEMENT_HINTS.md and return to the use case list. + +Also offer: "Are there any cross-cutting concerns or requirements NOT linked to a use case that you'd like to review?" + +--- + +## Mode 2: Fully guided review + +Same as Mode 1, but instead of asking the user to pick, start at Use Case 1 and proceed sequentially. + +For each use case: +1. Present the use case overview +2. Walk through each linked requirement one by one +3. For each requirement, ask: "Does this look right? Anything missing?" +4. Record any feedback in REFINEMENT_HINTS.md +5. Mark the use case as reviewed +6. Move to the next use case + +After all use cases: +1. Present the Cross-Cutting Concerns section +2. Ask: "Any concerns about threading, null handling, errors, compatibility, or configuration composition?" +3. Ask: "Are there any requirements you expected to see that aren't here?" +4. Record feedback and present a summary of all hints collected + +--- + +## Mode 3: Cross-model audit + +Read `quality/COMPLETENESS_REPORT.md` and `quality/REQUIREMENTS.md`. For each domain in the completeness report: + +1. Read the domain checklist item (from the report's domain coverage section) +2. Read each cited REQ-NNN +3. Verify: does this requirement actually address the domain checklist item? +4. If the citation is wrong (the requirement covers something else), flag it as a gap + +Also check: +- Are there requirements that don't appear in any use case's Requirements list? If so, flag as potentially orphaned. +- Does every use case's alternative paths section have corresponding requirements for the error/edge cases it mentions? +- Do the cross-cutting concerns reference requirements that actually exist and address the stated concern? 
+
+Write findings to `quality/REFINEMENT_HINTS.md` under a `## Cross-Model Audit` heading:
+
+```
+## Cross-Model Audit
+Date: [date]
+Model: [model name]
+
+### Verified domains
+- Null handling: CONFIRMED (REQ-054, REQ-055 correctly address null semantics)
+- ...
+
+### Gaps found
+- Entry points: COMPLETENESS_REPORT cites REQ-100, REQ-101 but these are about
+  pretty printing, not entry point contracts. JsonStreamParser has no coverage.
+- ...
+
+### Orphaned requirements
+- REQ-NNN is not linked to any use case
+- ...
+```
+
+Present findings to the user and ask which gaps should be addressed in a refinement pass.
+
+---
+
+## REFINEMENT_HINTS.md format
+
+The review protocol creates and maintains this file:
+
+```markdown
+# Refinement Hints
+
+## Review Progress
+- [x] Use Case 1: [name] — reviewed, no issues
+- [x] Use Case 2: [name] — reviewed, see feedback below
+- [ ] Use Case 3: [name]
+- [ ] Use Case 4: [name]
+...
+
+## Cross-Cutting Concerns
+- [ ] Threading model — not yet reviewed
+- [ ] Null contract — not yet reviewed
+- [ ] Error philosophy — not yet reviewed
+- [ ] Backward compatibility — not yet reviewed
+- [ ] Configuration composition — not yet reviewed
+
+## Feedback
+
+### Use Case 2: [name]
+- REQ-NNN: [specific feedback about what's missing or wrong]
+- General: [broader observation about this use case's coverage]
+
+### Cross-Model Audit
+[if Mode 3 was run]
+
+## Additional hints
+[freeform feedback from the user, not tied to a specific use case]
+```
+
+This file serves a dual purpose: it tracks review progress (so the user can resume across sessions) AND accumulates feedback that the refinement pass reads.
+``` diff --git a/skills/quality-playbook/references/review_protocols.md b/skills/quality-playbook/references/review_protocols.md index 3f3b0cb94..795684af9 100644 --- a/skills/quality-playbook/references/review_protocols.md +++ b/skills/quality-playbook/references/review_protocols.md @@ -11,23 +11,16 @@ Before reviewing, read these files for context: 1. `quality/QUALITY.md` — Quality constitution and fitness-to-purpose scenarios -2. [Main architectural doc] -3. [Key design decisions doc] -4. [Any other essential context] +2. `quality/REQUIREMENTS.md` — Testable requirements derived during playbook generation +3. [Main architectural doc] +4. [Key design decisions doc] +5. [Any other essential context] -## What to Check +## Pass 1: Structural Review -### Focus Area 1: [Subsystem/Risk Area Name] +Read the code and report anything that looks wrong. No requirements, no focus areas — use your own knowledge of code correctness. Look for: race conditions, null pointer hazards, resource leaks, off-by-one errors, type mismatches, error handling gaps, and any code that looks suspicious. -**Where:** [Specific files and functions] -**What:** [Specific things to look for] -**Why:** [What goes wrong if this is incorrect] - -### Focus Area 2: [Subsystem/Risk Area Name] - -[Repeat for 4–6 focus areas, mapped to architecture and risk areas from exploration] - -## Guardrails +### Guardrails - **Line numbers are mandatory.** If you cannot cite a specific line, do not include the finding. - **Read function bodies, not just signatures.** Don't assume a function works correctly based on its name. @@ -35,21 +28,189 @@ Before reviewing, read these files for context: - **Grep before claiming missing.** If you think a feature is absent, search the codebase. If found in a different file, that's a location defect, not a missing feature. - **Do NOT suggest style changes, refactors, or improvements.** Only flag things that are incorrect or could cause failures. 
-## Output Format - -Save findings to `quality/code_reviews/YYYY-MM-DD-reviewer.md` +### Output For each file reviewed: -### filename.ext +#### filename.ext - **Line NNN:** [BUG / QUESTION / INCOMPLETE] Description. Expected vs. actual. Why it matters. -### Summary -- Total findings by severity -- Files with no findings -- Overall assessment: SHIP IT / FIX FIRST / NEEDS DISCUSSION +## Pass 2: Requirement Verification + +Read `quality/REQUIREMENTS.md`. For each requirement, check whether the code satisfies it. This is a pure verification pass — your only job is "does the code satisfy this requirement?" + +Do NOT also do a general code review. Do NOT look for other bugs. Do NOT evaluate code quality. Just check each requirement. + +For each requirement, report one of: +- **SATISFIED**: The code implements this requirement. Quote the specific code. +- **VIOLATED**: The code does NOT satisfy this requirement. Explain what the code does vs. what the requirement says. Quote the code. +- **PARTIALLY SATISFIED**: Some aspects implemented, others missing. Explain both. +- **NOT ASSESSABLE**: Can't be checked from the files under review. + +### Output + +For each requirement: + +#### REQ-N: [requirement text] +**Status**: SATISFIED / VIOLATED / PARTIALLY SATISFIED / NOT ASSESSABLE +**Evidence**: [file:line] — [code quote] +**Analysis**: [explanation] +[If VIOLATED] **Severity**: [impact description] + +## Pass 3: Cross-Requirement Consistency + +Compare pairs of requirements from `quality/REQUIREMENTS.md` that reference the same field, constant, range, or security policy. For each pair, check whether their constraints are mutually consistent. + +What to look for: +- **Numeric range vs bit width**: If one requirement says the valid range is [0, N) and another says the field is M bits wide, does N = 2^M? +- **Security policy propagation**: If one requirement says a CA file is configured, do all requirements about connections that should use it actually reference using it? 
+- **Validation bounds vs encoding limits**: Does a validation check in one file agree with the storage capacity in another? +- **Lifecycle consistency**: If a resource is created by one requirement's code, is it cleaned up by another's? + +For each pair that shares a concept, verify consistency against the actual code. + +### Output + +For each shared concept: + +#### Shared Concept: [name] +**Requirements**: REQ-X, REQ-Y +**What REQ-X claims**: [summary] +**What REQ-Y claims**: [summary] +**Consistency**: CONSISTENT / INCONSISTENT +**Code evidence**: [quotes from both locations] +**Analysis**: [explanation] +[If INCONSISTENT] **Impact**: [what happens when the contradiction is triggered] + +## Combined Summary + +| Source | Finding | Severity | Status | +|--------|---------|----------|--------| +| Pass 1 | [structural finding] | [severity] | BUG / QUESTION | +| Pass 2, REQ-N | [requirement violation] | [severity] | VIOLATED | +| Pass 3, REQ-X vs REQ-Y | [consistency issue] | [severity] | INCONSISTENT | + +- Total findings by pass and severity +- Overall assessment: SHIP / FIX BEFORE MERGE / BLOCK ``` +### Execution requirements + +**All three passes are mandatory.** Do not consolidate passes into a single review. Each pass produces distinct findings because it uses a different lens: + +- **Pass 1** finds structural bugs (race conditions, null hazards, resource leaks) +- **Pass 2** finds requirement violations (missing behavior, spec deviations) +- **Pass 3** finds cross-requirement contradictions (inconsistent ranges, conflicting guarantees) + +**Write each pass as a clearly labeled section** in the output file. Use the headers `## Pass 1: Structural Review`, `## Pass 2: Requirement Verification`, `## Pass 3: Cross-Requirement Consistency`, and `## Combined Summary`. + +**If a pass has no findings, explain why.** Do not just write "No findings." Write what you checked and why nothing was wrong. 
For example: "Reviewed 12 functions in lib/response.js for null hazards, resource leaks, and error handling gaps. No confirmed bugs — all error paths either throw or return a well-defined default." A pass with no findings and no explanation is a pass that wasn't done. + +**Scoping for large codebases:** If the project has more than 50 requirements, Pass 2 does not need to verify every requirement against every file. Instead, focus Pass 2 on the requirements most relevant to the files being reviewed — check the requirements that reference those files or that govern the behavioral domain those files implement. The goal is depth on the files under review, not breadth across all requirements. + +**Self-check before finishing:** After writing all three passes and the combined summary, verify: (1) all three pass sections exist in the output, (2) Pass 2 references specific REQ-NNN numbers with SATISFIED/VIOLATED verdicts, (3) Pass 3 identifies at least one shared concept between requirements (even if consistent), (4) every BUG finding has a corresponding regression test in `quality/test_regression.*` (see Phase 2 below), (5) every regression test exercises the actual code path cited in the finding (see test-finding alignment check below). If any check fails, go back and complete the missing work. + +### Adversarial stance when documentation is available + +If the playbook was generated with supplemental documentation (reference_docs/, community docs, user guides, API references), the code review must use that documentation *against* the code, not in its defense. Documentation tells you what the code is supposed to do. Your job is to find where it doesn't. + +**Do not let documentation explanations excuse code defects.** If the docs say "the library handles X gracefully" but the code doesn't check for X, that's a bug — the documentation makes it *more* of a bug, not less. A richer understanding of intent should make you *harder* on the code, not softer. 
+ +The failure mode this addresses: when models have access to documentation, they build a richer mental model of the software and become more *forgiving* of code that approximately matches that model. The documentation gives the model reasons to believe the code works, which suppresses detections. Fight this by treating documentation as the prosecution's evidence — it defines what the code promised, and your job is to find broken promises. + +### Test-finding alignment check + +For each regression test that claims to reproduce a specific finding, verify that the test actually exercises the cited code path. A test that targets a different function, a different branch, or a different failure mode than the finding it claims to reproduce is worse than no test — it creates false confidence. + +**Verification procedure:** For each regression test: +1. Read the finding: note the specific file, line number, function, and failure condition +2. Read the test: identify which function it calls and what condition it asserts +3. Confirm alignment: the test must call the function cited in the finding, trigger the specific condition the finding describes, and assert on the behavior the finding says is wrong + +If the test doesn't exercise the cited code path, either fix the test or mark the finding as UNCONFIRMED. Do not ship a regression test that passes or fails for reasons unrelated to the finding. + +### Closure mandate + +Every confirmed BUG finding must produce a regression test in `quality/test_regression.*`. The test must be an executable source file in the project's language — not a Markdown file, not prose documentation, not a comment block describing what a test would do. If the project uses Java, write a `.java` file. If Python, a `.py` file. The test must compile (or parse) and be runnable by the project's test framework. + +**No language exemptions.** If introducing failing tests before fixes is a concern, use the language's expected-failure mechanism. 
The guard must be the **earliest syntactic guard for the framework** — a decorator or annotation where idiomatic, otherwise the first executable line in the test body: + +- **Python (pytest):** `@pytest.mark.xfail(strict=True, reason="BUG-NNN: [description]")` — decorator above `def test_...():`. When the bug is present: XFAIL (expected). When the bug is fixed but marker not removed: XPASS → strict mode fails, signaling the guard should be removed. +- **Python (unittest):** `@unittest.expectedFailure` — decorator above the test method. +- **Go:** `t.Skip("BUG-NNN: [description] — unskip after applying quality/patches/BUG-NNN-fix.patch")` — first line inside the test function. Note: Go's `t.Skip` hides the test (reports SKIP, not FAIL), which is weaker than Python's xfail. +- **Rust:** `#[ignore]` attribute on the test function — the standard "don't run in default suite" mechanism. Use `#[should_panic]` only for panic-shaped bugs. +- **Java (JUnit 5):** `@Disabled("BUG-NNN: [description]")` — annotation above the test method. +- **TypeScript/JavaScript (Jest):** `test.failing("BUG-NNN: [description]", () => { ... })` +- **TypeScript/JavaScript (Vitest):** `test.fails("BUG-NNN: [description]", () => { ... })` +- **JavaScript (Mocha):** `it.skip("BUG-NNN: [description]", () => { ... })` or `this.skip()` inside the test body for conditional skipping. + +Every guard must reference the bug ID (BUG-NNN format) and the fix patch path so that someone encountering a skipped test knows how to resolve it. + +These patterns ensure every bug has an executable test that can be enabled when the fix lands, without polluting CI with expected failures. + +**TDD red/green interaction with skip guards.** During TDD verification, the red and green phases must temporarily bypass the skip guard: +- **Red phase (NEVER SKIPPED):** Remove or disable the guard, run against unpatched code. Must fail. Re-enable guard after recording result. 
**The red phase is mandatory for every confirmed bug, even when no fix patch exists.** Record `verdict: "confirmed open"` with `red_phase: "fail"` and `green_phase: "skipped"`. Do not use `verdict: "skipped"` — that value is deprecated. +- **Green phase:** Remove or disable the guard, apply fix patch, run. Must pass. Re-enable guard if fix will be reverted. If no fix patch exists, record `green_phase: "skipped"`. +- **After successful red→green:** Generate a per-bug writeup at `quality/writeups/BUG-NNN.md` (see SKILL.md File 7, "Bug writeup generation"). Record the path in `tdd-results.json` as `writeup_path`. After writing `tdd-results.json`, reopen it and verify all required fields, enum values, and no extra undocumented root keys (see SKILL.md post-write validation step). Both sidecar JSON files must use `schema_version: "1.1"`. +- **After TDD cycle:** Guard remains in committed regression test file, removed only when fix is permanently merged. + +**The only acceptable exemption** is when a regression test genuinely cannot be written — for example, the bug requires multi-threaded timing that can't be reliably reproduced, or requires an external service not available in the test environment. In that case, write an explicit exemption note in the combined summary explaining why, and include a minimal code sketch showing what you would test if you could. + +Findings without either an executable regression test or an explicit exemption note are incomplete. The combined summary must not include unresolved findings — every BUG must have closure. + +### Regression test semantic convention + +All regression tests must assert **desired correct behavior** and be marked as expected-to-fail on the current code. Do not write tests that assert the current broken behavior and pass. 
The distinction matters: + +- **Correct:** Test says "this input should produce X" → test fails because buggy code produces Y → marked `xfail`/`@Disabled`/`t.Skip` → when bug is fixed, test passes and the skip marker is removed. +- **Wrong:** Test says "this input produces Y (the buggy output)" → test passes on buggy code → when bug is fixed, test fails silently → stale test that now asserts wrong behavior. + +The `xfail(strict=True)` pattern (Python/pytest) is the gold standard: it fails if the bug is present (expected), and also fails if the bug is fixed but the `xfail` marker wasn't removed (strict). Other languages should approximate this with skip + reason. + +### Post-review closure verification + +After writing all regression tests and the combined summary, run this checklist: + +1. **Count BUGs in the combined summary.** This is the expected count. +2. **Count test functions in `quality/test_regression.*`.** This should equal or exceed the BUG count (some BUGs may need multiple tests). +3. **For each BUG row in the summary**, verify it has either: + - A `REGRESSION TEST:` line citing the test function name, OR + - An `EXEMPTION:` line explaining why no test was written +4. **If any BUG lacks both**, go back and write the test or the exemption before declaring the review complete. + +This checklist is the enforcement mechanism for the closure mandate. Without it, the mandate is aspirational — agents document bugs fully in the pass summaries but skip the regression test and move on. + +### Post-spec-audit regression tests + +The closure mandate applies to spec-audit confirmed code bugs, not just code review bugs. After the spec audit triage categorizes findings, every finding classified as "Real code bug" must get a regression test — using the same conventions as code review regression tests (executable source file, expected-failure marker, test-finding alignment). 
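The expected-failure convention described above can be shown in a minimal stdlib Python (`unittest`) sketch. The bug ID, the `parse_port` function, and the patch path are hypothetical stand-ins, not code from any real project:

```python
import unittest

# Hypothetical finding: BUG-012 claims parse_port() accepts 65536,
# one past the valid [0, 65536) range. This stand-in reproduces the defect.
def parse_port(value):
    return int(value)  # buggy: no upper-bound check

class RegressionTests(unittest.TestCase):
    # BUG-012: remove this marker when quality/patches/BUG-012-fix.patch lands
    @unittest.expectedFailure
    def test_parse_port_rejects_out_of_range(self):
        # Asserts the DESIRED behavior, so it fails while the bug is present
        with self.assertRaises(ValueError):
            parse_port("65536")
```

On buggy code the suite records an expected failure; once the fix lands, the test passes and the marker is removed. Note that pytest's `xfail(strict=True)` additionally flags the stale marker automatically, which plain `expectedFailure` does not.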
+ +**Why this is a separate step:** Code review regression tests are written immediately after the code review, before the spec audit runs. This means spec-audit bugs are systematically orphaned — they appear in the triage report but never enter the regression test file. Across v1.3.4 runs on 8 repos, spec-audit bugs accounted for ~30% of all findings, and only 1 of 8 repos (httpx) wrote regression tests for them. + +**Procedure:** +1. After spec audit triage, read the triage summary for findings classified as "Real code bug." +2. For each, write a regression test in `quality/test_regression.*` using the same format as code review regression tests. Use the spec audit report as the source citation: `[BUG from spec_audits/YYYY-MM-DD-triage.md]`. +3. Run the test to confirm it fails (expected) or passes (needs investigation). +4. Update the cumulative BUG tracker in PROGRESS.md with the test reference. + +If the spec audit produced no confirmed code bugs, skip this step — but document that in PROGRESS.md so the audit trail is complete. + +### Cleaning up after spec audit reversals + +When the spec audit overturns a code review finding (classifies a BUG as a design choice or false positive), the corresponding regression test must be either deleted or moved to a separate file (`quality/design_behavior_tests.*`) that documents intentional behavior. A failing test that points at documented-correct behavior is worse than no test — it creates noise and erodes trust in the regression suite. + +After spec audit triage, check: does any test in `quality/test_regression.*` correspond to a finding that was reclassified as non-defect? If so, remove it from the regression file. + +### Why Three Passes Instead of Focus Areas + +Previous experiments (the QPB NSQ benchmark) showed that focus areas don't reliably improve AI code review. 
A generic "review for bugs" prompt scored 65.5%, while a playbook with 7 named focus areas scored 48.3% — the focus areas narrowed the model's attention and suppressed detections. + +The three-pass pipeline works because each pass does one thing well with no cross-contamination: +- **Pass 1** lets the model do what it's already good at (structural review, ~65% of defects) +- **Pass 2** catches individual requirement violations that structural review misses (absence bugs, spec deviations) +- **Pass 3** catches contradictions between individually-correct pieces of code (cross-file arithmetic bugs, security policy gaps) + +Experiments on the NSQ codebase showed this pipeline finding 2 of 3 defects that were invisible to all structural review conditions — with zero knowledge of the specific bugs. The defects found were a cross-file numeric mismatch (validation bound vs bit field width) and a security design gap (configured CA not propagated to outbound auth client). + ### Phase 2: Regression Tests for Confirmed Bugs After the code review produces findings, write regression tests that reproduce each BUG finding. This transforms the review from "here are potential bugs" into "here are proven bugs with failing tests." @@ -228,7 +389,7 @@ After all tests complete, show a summary table and a recommendation: **Passed:** 7/8 | **Failed:** 1/8 -**Recommendation:** FIX FIRST — Rate limit handling needs investigation. +**Recommendation:** FIX BEFORE MERGE — Rate limit handling needs investigation. ``` Then save the detailed results to `quality/results/YYYY-MM-DD-integration.md`. 
@@ -246,7 +407,7 @@ Save results to `quality/results/YYYY-MM-DD-integration.md` [Specific failures, unexpected behavior, performance observations] ### Recommendation -[SHIP IT / FIX FIRST / NEEDS INVESTIGATION] +[SHIP / FIX BEFORE MERGE / BLOCK] ``` ### Tips for Writing Good Integration Checks @@ -362,6 +523,80 @@ The number of units/records/iterations per integration test run matters: Look for `chunk_size`, `batch_size`, or similar configuration in the project to calibrate. When in doubt, 10–30 records is usually the right range for integration testing — enough to catch real issues without burning API budget. +### Integration Testing for Skills and LLM-Automated Tools + +When the project under test is an AI skill, CLI tool that wraps LLM calls, or any software whose primary execution path involves invoking an AI model, the integration test protocol must include **LLM-automated integration tests** — tests that run the tool end-to-end via a command-line AI agent and structurally verify the output. + +This is distinct from standard integration tests because the system under test doesn't have a deterministic API to call. The "integration" is: install the skill into a test repo, invoke it through a CLI agent (GitHub Copilot CLI, Claude Code, or similar), and verify the output artifacts meet structural and content expectations. + +**Why this matters:** Skills and LLM tools cannot be tested by calling functions directly — their execution path goes through an AI agent that interprets instructions, reads files, and produces artifacts. The only way to test whether the skill works is to run it. Manual execution is fine for development, but a quality playbook should encode the test as a repeatable protocol. 
+ +**Protocol structure for skill/LLM integration tests:** + +```markdown +## Skill Integration Test Protocol + +### Prerequisites +- CLI agent installed and configured (e.g., `gh copilot`, `claude`, `npx @anthropic-ai/claude-code`) +- Test repo prepared with skill installed at `.github/skills/SKILL.md` (or equivalent) +- Clean `quality/` directory (no artifacts from prior runs) +- Optional: `reference_docs/` folder for with-docs comparison runs + +### Test Matrix + +| Test | Method | Pass Criteria | +|------|--------|---------------| +| Full execution | Run skill via CLI with "execute" prompt | All expected artifacts exist in `quality/` | +| PROGRESS.md completeness | Read `quality/PROGRESS.md` | All phases checked complete, BUG tracker populated | +| Artifact structural check | Verify each expected file | Files are non-empty, contain expected sections | +| BUG tracker closure | Count BUG entries vs regression tests | Every BUG has a test reference or exemption | +| Baseline vs with-docs (optional) | Run twice: without and with reference_docs/ | With-docs run produces >= baseline requirement count | + +### Execution + +```bash +# Install skill into test repo +cp -r path/to/skill/.github test-repo/.github + +# Run via CLI agent (adapt command to your agent) +cd test-repo +gh copilot -p "Read .github/skills/SKILL.md and its reference files. Execute the quality playbook for this project." 
\ + --model gpt-5.4 --yolo > quality_run.output.txt 2>&1 +``` + +### Structural Verification (automated) + +After the run, verify output structurally: + +```bash +# Required artifacts exist and are non-empty +for f in quality/QUALITY.md quality/REQUIREMENTS.md quality/CONTRACTS.md \ + quality/COVERAGE_MATRIX.md quality/COMPLETENESS_REPORT.md \ + quality/PROGRESS.md quality/RUN_CODE_REVIEW.md \ + quality/RUN_INTEGRATION_TESTS.md quality/RUN_SPEC_AUDIT.md; do + [ -s "$f" ] || echo "FAIL: $f missing or empty" +done + +# Functional test file exists (language-appropriate name) +ls quality/test_functional.* quality/FunctionalSpec.* quality/functional.test.* 2>/dev/null \ + || echo "FAIL: no functional test file" + +# PROGRESS.md has all phases checked +grep -c '\[x\]' quality/PROGRESS.md # should equal total phase count + +# BUG tracker has entries (if bugs were found) +grep -c '^| [0-9]' quality/PROGRESS.md + +# Code reviews and spec audits produced substantive files +find quality/code_reviews -name "*.md" -size +500c | wc -l # should be >= 1 +find quality/spec_audits -name "*triage*" -size +500c | wc -l # should be >= 1 +``` +``` + +**Baseline vs with-docs comparison pattern:** Run the skill twice on the same repo — once without supplemental docs, once with a `reference_docs/` folder containing project history. Compare: requirement count, scenario count, bug count, and pipeline completion. The with-docs run should produce equal or more requirements and equal or more bugs. If the baseline outperforms the with-docs run on bug detection, that's a finding about the docs quality, not a skill failure. + +**When to generate this protocol:** Generate a skill integration test section in `RUN_INTEGRATION_TESTS.md` whenever the project being analyzed is a skill, a CLI tool that wraps AI calls, or a framework for building AI-powered tools. Look for: `SKILL.md` files, prompt templates, LLM client configurations, agent orchestration code, or references to AI models in the codebase. 
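The baseline-vs-with-docs comparison above can be scripted as a structural check. A sketch, not part of the shipped protocol, assuming `#### REQ-` requirement headings and `### BUG-` tracker entries; the run-directory names are hypothetical:

```python
from pathlib import Path

def count_headings(path, prefix):
    """Count lines starting with a heading prefix; a missing file counts as zero."""
    p = Path(path)
    if not p.is_file():
        return 0
    return sum(1 for line in p.read_text().splitlines() if line.startswith(prefix))

# Hypothetical directories holding two completed runs of the same repo
checks = [
    ("requirements", "quality/REQUIREMENTS.md", "#### REQ-"),
    ("bugs", "quality/BUGS.md", "### BUG-"),
]
for label, rel_path, prefix in checks:
    base = count_headings(f"baseline-run/{rel_path}", prefix)
    docs = count_headings(f"withdocs-run/{rel_path}", prefix)
    print(f"{label}: baseline={base} with-docs={docs}")
    if docs < base:
        # Per the protocol, this is a docs-quality finding, not a skill failure
        print(f"FLAG: baseline outperformed with-docs on {label}")
```

If the heading conventions differ in a given project, adjust the prefixes; the point is that the comparison is a mechanical count, not a judgment call.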
+ ### Post-Run Verification Depth A run that completes without errors may still be wrong. For each integration test run, verify at multiple levels: diff --git a/skills/quality-playbook/references/run_state_schema.md b/skills/quality-playbook/references/run_state_schema.md new file mode 100644 index 000000000..b2b7d8938 --- /dev/null +++ b/skills/quality-playbook/references/run_state_schema.md @@ -0,0 +1,366 @@ +# Run-State Schema (v1.5.6) + +*Authoritative schema for `quality/run_state.jsonl`, `quality/PROGRESS.md`, and `Calibration Cycles//run_state.jsonl`. The playbook AI writes these files directly via the file-tool layer; the orchestrator AI reads them to drive multi-benchmark calibration cycles.* + +*Companion to: `docs/design/QPB_v1.5.5_Design.md` ("Design — Run-state event taxonomy" section).* + +--- + +## File locations and ownership + +- `/quality/run_state.jsonl` — per-run event log. Append-only. Written by the AI executing the playbook. +- `/quality/PROGRESS.md` — human-readable run status. Atomically rewritten by the AI on each event. +- `Calibration Cycles//run_state.jsonl` — cycle-level event log. Append-only. Written by the orchestrator AI. + +All three live in the bind-mounted workspace owned by the user. The AI writes via Edit/Write file tools, never via shell redirection or `tee` (which routes through a different UID layer in some sandbox runtimes). + +--- + +## Schema versioning + +Every `run_state.jsonl` opens with an `_index` event recording `schema_version`. Current version: `"1.5.6"`. Schema bumps preserve backward compatibility — older files remain readable by newer parsers. Breaking schema changes bump the major number. + +--- + +## Required fields (every event) + +Every event object MUST have: + +- `ts` — ISO 8601 UTC timestamp with `Z` suffix (e.g. `"2026-05-15T14:32:01Z"`). Sub-second precision allowed but not required. +- `event` — string, the event-type name. Must match one of the names listed in `_index.event_types`. 
+ +Events MAY have additional fields per their type's spec below. Unknown fields are tolerated by readers (forward-compatible). + +--- + +## Per-run events (`/quality/run_state.jsonl`) + +### `_index` + +ALWAYS the first line. Records schema metadata. + +| Field | Type | Required | Notes | +|---|---|---|---| +| `event` | string | yes | Always `"_index"` | +| `ts` | string | yes | ISO 8601 UTC | +| `schema_version` | string | yes | `"1.5.6"` | +| `event_types` | array of string | yes | Every event type this file uses | +| `benchmark` | string | yes | E.g. `"chi-1.3.45"`, `"virtio-1.5.1"` | +| `lever_state` | string | yes | E.g. `"pre-pattern7"`, `"post-pattern7"`, `"baseline"` | +| `started_at` | string | yes | ISO 8601 UTC, equals `ts` of this event | + +### `run_start` + +Marks the beginning of a playbook run. + +| Field | Type | Required | Notes | +|---|---|---|---| +| `event` | string | yes | `"run_start"` | +| `ts` | string | yes | | +| `runner` | string | yes | One of `"claude"`, `"codex"`, `"copilot"`, `"cursor"` | +| `playbook_version` | string | yes | E.g. `"1.5.6-pre"`, `"1.5.6"` (matches `bin.benchmark_lib.RELEASE_VERSION`) | +| `target_path` | string | yes | Relative path to benchmark target | + +### `phase_start` + +Marks the beginning of one of the six playbook phases. + +| Field | Type | Required | Notes | +|---|---|---|---| +| `event` | string | yes | `"phase_start"` | +| `ts` | string | yes | | +| `phase` | integer | yes | 1, 2, 3, 4, 5, or 6 | + +### `pattern_walked` + +Phase 1 only. Records that one of the seven exploration patterns was walked. 
+ +| Field | Type | Required | Notes | +|---|---|---|---| +| `event` | string | yes | `"pattern_walked"` | +| `ts` | string | yes | | +| `phase` | integer | yes | Always 1 | +| `pattern` | integer | yes | 1 through 7 | +| `findings_count` | integer | yes | Number of findings produced by this pattern | +| `duration_seconds` | number | optional | Wall-clock for this pattern walk | + +### `pass_started` / `pass_ended` + +Phase 4 only. Records start/end of one of the four skill-derivation passes. + +| Field | Type | Required | Notes | +|---|---|---|---| +| `event` | string | yes | `"pass_started"` or `"pass_ended"` | +| `ts` | string | yes | | +| `phase` | integer | yes | Always 4 | +| `pass` | string | yes | One of `"A"`, `"B"`, `"C"`, `"D"` | +| `output_artifact` | string | optional | Relative path to pass artifact (on `pass_ended`) | + +### `finding_logged` + +Records that a finding (skill-divergence, code-bug, etc.) was logged in the current phase. + +| Field | Type | Required | Notes | +|---|---|---|---| +| `event` | string | yes | `"finding_logged"` | +| `ts` | string | yes | | +| `phase` | integer | yes | 1-6 | +| `finding_id` | string | yes | E.g. `"BUG-007"`, `"REQ-042"` | +| `category` | string | yes | E.g. `"code-bug"`, `"skill-divergence"`, `"missing-citation"`, `"prose-to-code-mismatch"` | + +### `artifact_written` + +Records that an artifact file was produced/updated. + +| Field | Type | Required | Notes | +|---|---|---|---| +| `event` | string | yes | `"artifact_written"` | +| `ts` | string | yes | | +| `relative_path` | string | yes | Path relative to benchmark target (e.g. `"quality/EXPLORATION.md"`) | +| `byte_size` | integer | optional | Size of the file at write time | +| `line_count` | integer | optional | Line count | + +### `gate_check` + +Records the outcome of a single quality-gate check. 
+ +| Field | Type | Required | Notes | +|---|---|---|---| +| `event` | string | yes | `"gate_check"` | +| `ts` | string | yes | | +| `gate_name` | string | yes | Identifier from `quality_gate.py` | +| `verdict` | string | yes | One of `"pass"`, `"fail"`, `"warn"`, `"skip"` | +| `reason` | string | optional | Human-readable explanation | + +### `phase_end` + +Marks the end of a phase. Cross-validated against the phase's expected artifacts before being written (see "Cross-validation rules" below). + +| Field | Type | Required | Notes | +|---|---|---|---| +| `event` | string | yes | `"phase_end"` | +| `ts` | string | yes | | +| `phase` | integer | yes | 1-6 | +| `key_counts` | object | yes | Phase-specific counts (see below) | +| `artifacts_produced` | array of string | yes | Relative paths of artifacts produced this phase | +| `duration_seconds` | number | optional | Wall-clock for the whole phase | + +`key_counts` per phase: + +- Phase 1: `{"findings_total": N, "patterns_walked": M}` (M should be 7 for full Phase 1) +- Phase 2: `{"findings_promoted": N, "findings_dropped": M}` +- Phase 3: `{"bugs_identified": N, "bug_writeups": M}` +- Phase 4: `{"req_count": N, "uc_count": M, "passes_complete": K}` (K should be 4) +- Phase 5: `{"gate_checks_total": N, "gate_failures": M}` +- Phase 6: `{"bugs_md_count": N, "gate_verdict": "pass|fail|partial"}` + +### `error` + +Records an error during the run. + +| Field | Type | Required | Notes | +|---|---|---|---| +| `event` | string | yes | `"error"` | +| `ts` | string | yes | | +| `phase` | integer | optional | If error is phase-scoped | +| `message` | string | yes | Human-readable description | +| `recoverable` | boolean | yes | If true, the run will retry the affected phase; if false, the run is aborting | + +### `documentation_state` + +v1.5.6+. Records the documentation-availability state at Phase 1 entry. 
Currently the only emitted state is `"code_only"`, indicating that `reference_docs/` and `reference_docs/cite/` carry no recognized plaintext content (`.md` or `.txt`) and Phase 1 is proceeding in code-only mode (see `references/code-only-mode.md`). A `"with_docs"` value is reserved for future explicit emission; today the absence of a `documentation_state` event implies docs were present. + +| Field | Type | Required | Notes | +|---|---|---|---| +| `event` | string | yes | `"documentation_state"` | +| `ts` | string | yes | | +| `state` | string | yes | Currently `"code_only"`. Future values may include `"with_docs"`. | +| `reason` | string | yes | Free-form (e.g. `"reference_docs/ empty"`) | + +When `documentation_state state="code_only"` is emitted, the playbook also prepends a "Documentation status: code-only mode" section to `quality/EXPLORATION.md` and adds a "Documentation state: code_only" line to `quality/PROGRESS.md` so the downgrade is visible to anyone reading either artifact. New runs adding the `documentation_state` event must include it in the `_index.event_types` list. + +### `aborted_missing_docs` + +v1.5.6+. Records that the run aborted at Phase 1 entry because `--require-docs` was set and `reference_docs/` was empty. Mutually exclusive with `documentation_state state="code_only"` for the same Phase 1 entry — `--require-docs` is the opt-IN abort path; the absence of the flag preserves the documented code-only-mode downgrade. After this event the runner returns non-zero without invoking any LLM work, so no `phase_start phase=1` is recorded. + +| Field | Type | Required | Notes | +|---|---|---|---| +| `event` | string | yes | `"aborted_missing_docs"` | +| `ts` | string | yes | | +| `reason` | string | yes | Free-form (e.g. 
`"reference_docs/ empty and --require-docs set"`) | + +When `aborted_missing_docs` is emitted, the playbook also writes an `ERROR: aborted_missing_docs — ` block to `quality/PROGRESS.md` so the abort is visible without reading the JSONL. New runs that pass `--require-docs` against an empty `reference_docs/` must include `aborted_missing_docs` in the `_index.event_types` list. + +### `run_end` + +Marks the end of the playbook run. + +| Field | Type | Required | Notes | +|---|---|---|---| +| `event` | string | yes | `"run_end"` | +| `ts` | string | yes | | +| `status` | string | yes | One of `"success"`, `"aborted"`, `"failed"` | +| `total_findings` | integer | optional | Sum across all phases | +| `final_verdict` | string | optional | The Phase 6 gate verdict | + +--- + +## Cycle-level events (`Calibration Cycles//run_state.jsonl`) + +### `_index` (cycle-level) + +| Field | Type | Required | Notes | +|---|---|---|---| +| `event` | string | yes | `"_index"` | +| `ts` | string | yes | | +| `schema_version` | string | yes | `"1.5.6"` | +| `event_types` | array of string | yes | | +| `cycle_name` | string | yes | E.g. `"2026-05-15-pattern7-displacement-recovery"` | +| `lever_under_test` | string | yes | E.g. 
`"lever-1-exploration-breadth-depth"` | +| `benchmarks` | array of string | yes | Cycle's pinned benchmark list | +| `iteration` | integer | yes | Iteration ordinal (1, 2, or 3 — see iterate-cap) | + +### `cycle_start` + +| Field | Type | Required | Notes | +|---|---|---|---| +| `event` | string | yes | `"cycle_start"` | +| `ts` | string | yes | | +| `hypothesis` | string | yes | The cycle's testable hypothesis | +| `noise_floor_threshold` | number | yes | Recall delta below this is treated as noise (default 0.05) | + +### `benchmark_start` + +| Field | Type | Required | Notes | +|---|---|---|---| +| `event` | string | yes | `"benchmark_start"` | +| `ts` | string | yes | | +| `benchmark` | string | yes | | +| `lever_state` | string | yes | `"pre-lever"` or `"post-lever"` | + +### `lever_change_applied` + +| Field | Type | Required | Notes | +|---|---|---|---| +| `event` | string | yes | `"lever_change_applied"` | +| `ts` | string | yes | | +| `lever_id` | string | yes | E.g. `"lever-1-exploration-breadth-depth"` | +| `files_changed` | array of string | yes | Paths relative to QPB repo root | +| `commit_sha` | string | yes | Commit SHA on the implementing branch | +| `description` | string | yes | What the change is (e.g. 
`"Pattern 7 budget cap 3-5 → 2-3"`) | + +### `lever_change_reverted` + +| Field | Type | Required | Notes | +|---|---|---|---| +| `event` | string | yes | `"lever_change_reverted"` | +| `ts` | string | yes | | +| `files_changed` | array of string | yes | | +| `commit_sha` | string | optional | Null/absent if revert is uncommitted | + +### `benchmark_end` + +| Field | Type | Required | Notes | +|---|---|---|---| +| `event` | string | yes | `"benchmark_end"` | +| `ts` | string | yes | | +| `benchmark` | string | yes | | +| `lever_state` | string | yes | | +| `recall` | number | yes | 0.0-1.0 | +| `bugs_found` | array of string | yes | Bug IDs found this run | +| `bugs_missed` | array of string | yes | Bug IDs in baseline missed this run | +| `historical_baseline_path` | string | yes | Path to the baseline BUGS.md used for recall computation | + +### `cycle_end` + +| Field | Type | Required | Notes | +|---|---|---|---| +| `event` | string | yes | `"cycle_end"` | +| `ts` | string | yes | | +| `verdict` | string | yes | One of `"ship"`, `"revert"`, `"iterate"`, `"halt-iterate-cap"` | +| `recall_before` | object | yes | Per-benchmark recall before lever change | +| `recall_after` | object | yes | Per-benchmark recall after lever change | +| `delta` | object | yes | Per-benchmark delta (recall_after - recall_before) | +| `cross_benchmark_check` | object | yes | `{"clean": bool, "regressions": [list of bench/bug pairs that regressed]}` | + +--- + +## Cross-validation rules (per `phase_end`) + +The AI verifies these conditions before appending a `phase_end` event. If any check fails, the AI appends an `error` event with `recoverable: true` and re-runs the failing phase. 
+ +| Phase | Required conditions | +|---|---| +| 1 | `quality/EXPLORATION.md` exists, ≥ 120 lines (aligned with the Phase 2 startup gate in `bin/run_playbook.check_phase_gate`), contains at least one finding section (regex `^##\s+(Finding\|Open Exploration Findings\|\d+\.)` — accepts `## Finding ...`, the SKILL-prescribed exact heading `## Open Exploration Findings`, and numbered `## N.` headings) | +| 2 | All nine fixed-name Generate-contract artifacts exist non-empty under `quality/`: `REQUIREMENTS.md`, `QUALITY.md`, `CONTRACTS.md`, `COVERAGE_MATRIX.md`, `COMPLETENESS_REPORT.md`, `RUN_CODE_REVIEW.md`, `RUN_INTEGRATION_TESTS.md`, `RUN_SPEC_AUDIT.md`, `RUN_TDD_TESTS.md`. Plus at least one non-empty `quality/test_functional.` (extension varies by primary language). Pre-v1.5.6 this row described the v1.5.5-design triage model (`EXPLORATION_MERGED.md` / `triage.md`); that mapping was never adopted by shipped SKILL.md / orchestrator_protocol.md / agent files, which always documented Phase 2 as Generate. | +| 3 | `quality/code_reviews/` directory contains at least one review file. If `quality/BUGS.md` has any `### BUG-` heading, `quality/patches/` contains at least one `BUG-*-regression-test.patch` file. Pre-v1.5.6 this row checked `quality/RUN_CODE_REVIEW.md` (a Phase 2 Generate output, not a Phase 3 review result) — same v1.5.5-design / shipped-Generate drift class as the Phase 2 row. Cluster B reconciled. | +| 4 | `quality/spec_audits/` directory contains at least one `*-triage.md` file AND at least one `*-auditor-*.md` file (per orchestrator_protocol.md naming convention). When neither name pattern matches, the validator falls back to a weaker "≥2 files" check — older bootstrap runs with arbitrary `.md` names still pass; the gate at Phase 6 enforces deeper conformance. Pre-v1.5.6 this row checked `quality/REQUIREMENTS.md` + `COVERAGE_MATRIX.md` (Phase 2 outputs) — same v1.5.5-design drift class. Cluster B reconciled. 
|
+| 5 | If `quality/BUGS.md` has confirmed `### BUG-` entries: `quality/results/tdd-results.json` exists non-empty; for every confirmed bug, `quality/writeups/BUG-NNN.md` exists AND `quality/results/BUG-NNN.red.log` exists. With no confirmed bugs the row is vacuously satisfied. Pre-v1.5.6 this row checked `quality/results/quality-gate.log` (a Phase 6 output) — same v1.5.5-design drift class. Cluster B reconciled. |
+| 6 | `quality/results/quality-gate.log` exists non-empty AND `quality/PROGRESS.md` contains a `Terminal Gate Verification` section (the orchestrator-protocol marker that Phase 6 ran the script-verified gate to completion). Pre-v1.5.6 this row checked `quality/BUGS.md` + `quality/INDEX.md` — BUGS.md is a Phase 3 output, INDEX.md was never adopted in the shipped contract. Same v1.5.5-design drift class. Cluster B reconciled. |
+
+The `run_end` event additionally requires: all 6 `phase_end` events present in the log; the final BUGS.md count matches `phase_end phase=6 key_counts.bugs_md_count`.
+
+---
+
+## Resume semantics
+
+When an AI session starts on a run directory:
+
+1. If `quality/run_state.jsonl` does not exist: fresh run. Write `_index` + `run_start` + `phase_start phase=1`.
+2. If it exists: read all events. Find the last `phase_start` not followed by a matching `phase_end`. Call it the "in-progress phase".
+3. Verify the in-progress phase's expected artifacts (per cross-validation rules above):
+   - If artifacts complete: append the missing `phase_end` event and proceed to the next phase. Note: this is the "session crashed mid-phase but the work is done" recovery path.
+   - If artifacts incomplete: re-run that phase from scratch. The prior session left a partial state that can't be safely resumed.
+4. If all 6 `phase_end` events are present but no `run_end`: append `run_end status=success` and finalize.
+
+The policy is "trust artifacts more than events." If events claim phase 4 done but `quality/spec_audits/` contains no triage file, the AI re-runs phase 4.
If events stop mid-phase but artifacts are complete, the AI catches up the events. + +--- + +## PROGRESS.md format + +Atomically rewritten on every event. Markdown. + +```markdown +# QPB Run Progress + +**Started:** 2026-05-15T14:32:01Z **Benchmark:** chi-1.5.1 **Lever:** post-pattern7 +**Runner:** claude **Playbook version:** 1.5.6 + +## Phases + +- [x] Phase 1 — Explore (10:10, 12 findings, patterns 1-7 walked) +- [x] Phase 2 — Generate (0:42, 9 artifacts produced) +- [x] Phase 3 — Code Review (15:31, 6 bugs identified) +- [x] Phase 4 — Spec Audit (3 auditors, 1 triage) +- [ ] Phase 5 — Reconciliation *(in progress, started 14:58:31Z)* +- [ ] Phase 6 — Verify + +## Recent events (last 10) + +- 2026-05-15T14:58:31Z — phase_start phase=5 +- 2026-05-15T14:58:30Z — phase_end phase=4 passes=[A,B,C,D] req_count=89 +- 2026-05-15T14:42:11Z — phase_end phase=1 findings=12 + +## Artifacts produced + +- quality/EXPLORATION.md (12,034 bytes) +- quality/REQUIREMENTS.md (28,891 bytes) +- quality/COVERAGE_MATRIX.md (3,022 bytes) +``` + +Sections (header, phase checklist, recent events, artifacts produced) are required. Phase checklist uses `[x]` for complete phases (with summary stats), `[ ]` for incomplete, with in-progress phase noted explicitly with start time. Recent events shows last 10 event lines from `run_state.jsonl` in human-readable form. Artifacts produced shows files written this run with byte sizes. + +--- + +## Format invariants (enforced by `bin/run_state_lib.py` validators) + +1. `_index` is line 1. +2. Every line is valid JSON (one object per line). +3. Every event has `ts` and `event` fields. +4. Every `event` value appears in `_index.event_types`. +5. Append-only: events are added, never edited. Editing a prior event is a schema violation. +6. `phase_start` and `phase_end` events for a given phase appear at most once per run (no out-of-order or duplicate phase markers). +7. 
`run_start` is the second line (after `_index`); `run_end` is the last line if the run completed. + +Validators are read-only checks. They surface violations as findings; they don't auto-correct. diff --git a/skills/quality-playbook/references/spec_audit.md b/skills/quality-playbook/references/spec_audit.md index 5049fa3cd..25cf3a3f2 100644 --- a/skills/quality-playbook/references/spec_audit.md +++ b/skills/quality-playbook/references/spec_audit.md @@ -65,6 +65,31 @@ Requirements are tagged with `[Req: tier — source]`. Weight your findings by t --- +## Pre-audit docs validation (required triage section) + +The triage report must include a `## Pre-audit docs validation` section regardless of whether `reference_docs/` exists. This section documents what the auditors used as their factual baseline. + +**If `reference_docs/` exists:** Spot-check the gathered docs for factual accuracy before running the audit. Stale or incorrect docs can skew audit confidence — a model that reads "the library handles X by doing Y" in the docs will rate a divergent finding higher even if the docs are wrong. + +**Quick validation procedure (5 minutes max):** +1. Pick 2–3 factual claims from `reference_docs/` that describe specific runtime behavior (e.g., "invalid input raises ValueError", "field X defaults to Y", "format Z is not supported"). +2. Grep the source code for the cited behavior. Does the code match the docs? +3. If any claim is wrong, note it in the triage header: "reference_docs/ contains N known inaccuracies: [list]. Findings that rely on these claims are downgraded to NEEDS REVIEW." + +**Spot-check claims about code contents must extract, not assert.** When the spec audit prompt or pre-validation includes claims like "function X handles constant Y at line Z," the triage must read the cited lines and report what they actually contain. 
Do not confirm a claim by checking that the function exists or that the constant is defined somewhere — confirm it by showing the exact text at the cited lines. Format each spot-check result as: + +``` +Claim: "vring_transport_features() preserves VIRTIO_F_RING_RESET at line 3527" +Actual line 3527: `default:` +Result: CLAIM IS FALSE — line 3527 is the default branch, not a RING_RESET case label +``` + +Spot-check claims derived from generated requirements or gathered docs (rather than from the code) are **hypotheses to test**, not facts to confirm. This rule prevents the contamination chain observed in v1.3.17 where a false spot-check claim was accepted as "accurate" without reading the actual lines, causing three auditors to inherit a hallucinated code-presence claim. + +**If `reference_docs/` does not exist:** State this explicitly: "No supplemental docs provided. Auditors relied on in-repo specs and code only." This confirms the absence is intentional, not an oversight. + +This section fires in every triage, not just when docs are present. In v1.3.5 cross-repo testing, it only fired in 1/8 repos because it was conditional — making it required ensures the audit trail always documents the factual baseline. + ## Running the Audit 1. Give the identical prompt to three AI tools @@ -73,7 +98,18 @@ Requirements are tagged with `[Req: tier — source]`. Weight your findings by t ## Triage Process -After all three models report, merge findings: +After all three models report, merge findings. + +**Log the effective council size.** If a model did not return a usable report (timeout, empty output, refusal), record this in the triage header: + +``` +## Council Status +- Model A: Fresh report received (YYYY-MM-DD) +- Model B: Fresh report received (YYYY-MM-DD) +- Model C: TIMEOUT — no usable report. Effective council: 2/3. +``` + +When the effective council is 2/3, downgrade the confidence tier: "All three" becomes impossible, "Two of three" becomes the ceiling. 
When the effective council is 1/3, all findings are "Needs verification" regardless of how confident that single model is. Do not silently substitute stale reports from prior runs — if a model didn't produce a fresh report for this run, it didn't participate.

| Confidence | Found By | Action |
|------------|----------|--------|
@@ -81,10 +117,36 @@ After all three models report, merge findings:
| High | Two of three | Likely real — verify and fix |
| Needs verification | One only | Could be real or hallucinated — deploy verification probe |

+**When the effective council is 2/3 or less:** Distinguish single-auditor findings from multi-auditor findings explicitly in the triage. With a 2/3 council, a finding from both present auditors has "High" confidence. A finding from only one present auditor has "Needs verification" — it cannot be promoted to confirmed BUG without a verification probe, because the missing auditor might have contradicted it. Do not treat all findings as equivalent just because the council is incomplete.
+
+In the triage summary table, add a column for auditor agreement: "2/2 present", "1/2 present", etc. This makes the confidence tier visible and auditable.
+
+**Incomplete council gate for enumeration/dispatch checks.** If the effective council is less than 3/3 and the run includes whitelist/enumeration/dispatch-function checks (claims about which constants a function handles), the audit may not conclude "no confirmed defects" for those checks without executed mechanical proof. Check whether `quality/mechanical/<function>_cases.txt` exists for each relevant function. If it does and shows the constant is present, the claim is confirmed. If it does and shows the constant is absent, the claim is false regardless of what any auditor wrote. If no mechanical artifact exists, generate one before closing the enumeration check.
This rule exists because v1.3.18 had an effective council of 1/3, and the single model's triage fabricated line contents for enumeration claims — a mechanical artifact would have caught the contradiction. + ### The Verification Probe When models disagree on factual claims, deploy a read-only probe: give one model the disputed claim and ask it to read the code and report ground truth. Never resolve factual disputes by majority vote — the majority can be wrong about what code actually does. +**Verification probes must produce executable evidence.** Prose reasoning is not sufficient for either confirmations or rejections. Every verification probe must produce a test assertion that mechanically proves the determination: + +**For rejections** (finding is false positive): Write an assertion that PASSES, proving the auditor's claim is wrong: +```python +# Rejection proof: function X does check for null at line 247 +assert "if (ptr == NULL)" in source_of("X"), "X has null check at line 247" +``` +If you cannot write a passing assertion, **do not reject the finding**. The inability to produce mechanical proof is itself evidence that the finding may be real. + +**For confirmations** (finding is a real bug): Write an assertion that FAILS (expected-failure), proving the bug exists: +```python +# Confirmation proof: RING_RESET is not a case label in the whitelist +assert "case VIRTIO_F_RING_RESET:" in source_of("vring_transport_features"), \ + "RING_RESET should be in the switch but is not — cleared by default at line 3527" +``` + +**Every assertion must cite an exact line number** for the evidence it references. Not "lines 3527-3528" but "line 3527: `default:`" — showing what the line actually contains. + +**Why this rule exists:** In v1.3.16 virtio testing, the triage received a correct minority finding that VIRTIO_F_RING_RESET was missing from a switch/case whitelist. 
The triage performed a verification probe that claimed lines 3527-3528 "explicitly preserve VIRTIO_F_RING_RESET" — but those lines contained the `default:` branch. The probe hallucinated compliance. Had it been required to write `assert "case VIRTIO_F_RING_RESET:" in source`, the assertion would have failed, exposing the hallucination. Requiring executable evidence makes hallucinated rejections self-defeating. + ### Categorize Each Confirmed Finding - **Spec bug** — Spec is wrong, code is fine → update spec @@ -96,6 +158,45 @@ When models disagree on factual claims, deploy a read-only probe: give one model That last category is the bridge between the spec audit and the test suite. Every confirmed finding not already covered by a test should become one. +### Legacy and historical scripts + +Scripts documented as "historical," "deprecated," or "not part of current workflow" are sometimes downgraded during triage on the theory that they don't affect current operations. This is correct when the script genuinely never runs. But if the script's bug has already materialized in canonical artifacts — duplicate entries in a published file, stale data in a checked-in cache, incorrect mappings that downstream tools consume — the bug is not historical. It's a live defect in the repository's published state. + +**Rule: If a legacy script's bug is already visible in canonical artifacts, promote it to confirmed BUG regardless of the script's status.** The script may be historical, but the damage it left behind is current. The regression test should target the artifact (the duplicate entry, the stale mapping), not the script — because the artifact is what users encounter. + +This rule exists because v1.3.5 bootstrap runs on QPB found duplicate changelog entries and stale cache mappings produced by a "historical" script. Both triages downgraded the findings because the script was historical. But the duplicate entries were already in the published library, visible to every user. 
+
+### Cross-artifact consistency check
+
+After triage, compare the spec audit findings against the code review findings from `quality/code_reviews/`. If the code review and spec audit disagree on the same factual claim (one says a bug is real, the other calls it a false positive), flag the disagreement and deploy a verification probe. The code review and spec audit use different methods (structural reading vs. spec comparison), so disagreements are informative, not errors. But a factual contradiction about what the code actually does needs to be resolved before either report is trusted.
+
+## Detecting partial sessions and carried-over artifacts
+
+### Partial session detection
+
+A session that terminates early (timeout, context exhaustion, crash) may generate scaffolding (directory structure, empty templates) without producing the actual review or audit content. The retry mechanism in the run script can regenerate scaffolding but cannot recover the analytical work.
+
+**After any session completes, check for partial results:**
+1. If `quality/code_reviews/` exists but contains no `.md` files with actual findings (or only contains template headers with no BUG/VIOLATED/INCONSISTENT entries), the code review did not run. Mark this as FAILED in PROGRESS.md, not as "complete with no findings."
+2. If `quality/spec_audits/` exists but contains no triage summary, the spec audit did not run.
+3. If `quality/test_regression.*` exists but contains only imports and no test functions, regression tests were not written.
+
+A partial session is not a "clean run with no findings" — it's a failed run that needs to be re-executed. PROGRESS.md should record this clearly: "Phase 3: FAILED — code review session terminated before producing findings. Re-run required."
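The partial-result checks above can be mechanized. A minimal sketch of the first check, assuming the standard `quality/` tree and using the BUG/VIOLATED/INCONSISTENT marker strings named in step 1:

```python
# Sketch: distinguish "review ran and found nothing" from "review never ran".
# A file counts as real review output only if it contains at least one finding
# entry, not just scaffolding or template headers.
from pathlib import Path

FINDING_MARKERS = ("BUG-", "VIOLATED", "INCONSISTENT")

def code_review_produced_findings(review_dir="quality/code_reviews"):
    d = Path(review_dir)
    if not d.is_dir():
        return False
    return any(
        marker in md.read_text(encoding="utf-8")
        for md in d.glob("*.md")
        for marker in FINDING_MARKERS
    )
```

A `False` result here means the phase gets recorded as FAILED in PROGRESS.md, never as a clean run.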
+
+### Provenance headers on carried-over artifacts
+
+When a new playbook run finds existing artifacts from a previous run (after archiving), or when artifacts survive from a failed session, they must carry provenance headers so readers know their origin.
+
+**If any artifact was NOT generated fresh in the current run**, add a provenance header:
+
+```markdown
+<!-- PROVENANCE: carried over from a previous run; NOT generated fresh in this run.
+     Origin: run of YYYY-MM-DD, Quality Playbook vX.Y.Z. Status: archival. -->
+```
+
+This prevents the failure mode observed in v1.3.4 where express and zod silently preserved v1.3.3 code reviews and spec audits without marking them as archival. Users reading those artifacts assumed they were fresh v1.3.4 results.
+
 ## Fix Execution Rules

 - Group fixes by subsystem, not by defect number
@@ -130,6 +231,10 @@ Different models have different audit strengths. In practice:

 The specific models that excel will change over time. The principle holds: use multiple models with different strengths, and always include the four guardrails.

+### Minimum model capability
+
+The audit protocol requires reading function bodies, citing line numbers, grepping before claiming something is missing, and classifying defect types. Lightweight or speed-optimized models (Haiku-class, GPT-4o-mini-class) are not suitable as auditors. They tend to skim rather than read, skip the grep step, and produce shallow or empty reports ("No defects found") on codebases where stronger models find real bugs. Use models with strong code-reading ability for all three auditor slots. A weak auditor doesn't just miss findings — it reduces the Council from three independent perspectives to two.
+
 ## Tips for Writing Scrutiny Areas

 The scrutiny areas are the most important part of the prompt. Generic questions like "check if the code matches the spec" produce generic answers. Specific questions that name functions, files, and edge cases produce specific findings.
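As an illustration of the difference, with hypothetical function and file names:

```
Generic scrutiny area (produces generic answers):
- Check whether the parser matches the spec.

Specific scrutiny areas (produce specific findings):
- parse_header() in reader.c claims to reject lengths over 4096. Does the bounds
  check also cover the second length field read near the end of the function?
- The spec says unknown flags are ignored. Does the switch in apply_flags() have
  a default branch, or do unknown flags fall through to the error path?
```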
diff --git a/skills/quality-playbook/references/verification.md b/skills/quality-playbook/references/verification.md index 66a0aaff7..25098d038 100644 --- a/skills/quality-playbook/references/verification.md +++ b/skills/quality-playbook/references/verification.md @@ -1,4 +1,4 @@ -# Verification Checklist (Phase 3) +# Verification Checklist (Phase 6: Verify) Before declaring the quality playbook complete, check every benchmark below. If any fails, go back and fix it. @@ -53,6 +53,8 @@ Run the test suite using the project's test runner: **Check for both failures AND errors.** Most test frameworks distinguish between test failures (assertion errors) and test errors (setup failures, missing fixtures, import/resolution errors, exceptions during initialization). Both are broken tests. A common mistake: generating tests that reference shared fixtures or helpers that don't exist. These show up as setup errors, not assertion failures — but they are just as broken. +**Expected-failure (xfail) tests do not count against this benchmark.** Regression tests in `quality/test_regression.*` use expected-failure markers (`@pytest.mark.xfail(strict=True)`, `@Disabled`, `t.Skip`, `#[ignore]`) to confirm that known bugs are still present. These tests are *supposed* to fail — that's the point. The "zero failures and zero errors" benchmark applies to `quality/test_functional.*` (the functional test suite), not to `quality/test_regression.*` (the bug confirmation suite). If your test runner reports failures from xfail-marked regression tests, that's correct behavior, not a benchmark violation. If an xfail test unexpectedly *passes*, that means the bug was fixed and the xfail marker should be removed — treat that as a finding to investigate, not a test failure. + After running, check: - All tests passed — count must equal total test count - Zero failures @@ -70,7 +72,7 @@ Run the project's full test suite (not just your new tests). 
Your new files shou Every scenario should mention actual function names, file names, or patterns that exist in the codebase. Grep for each reference to confirm it exists. -If working from non-formal requirements, verify that each scenario and test includes a requirement tag using the canonical format: `[Req: formal — README §3]`, `[Req: inferred — from validate_input() behavior]`, `[Req: user-confirmed — "must handle empty input"]`. Inferred requirements should be flagged for user review in Phase 4. +If working from non-formal requirements, verify that each scenario and test includes a requirement tag using the canonical format: `[Req: formal — README §3]`, `[Req: inferred — from validate_input() behavior]`, `[Req: user-confirmed — "must handle empty input"]`. Inferred requirements should be flagged for user review in the Phase 7 interactive session. ### 11. RUN_CODE_REVIEW.md Is Self-Contained @@ -93,6 +95,158 @@ If any field name, count, or type is wrong, fix it before proceeding. The table The definitive audit prompt should work when pasted into Claude Code, Cursor, and Copilot without modification (except file reference syntax). +### 14. Structured Output Schemas Are Valid and Conformant + +Verify that `RUN_TDD_TESTS.md` and `RUN_INTEGRATION_TESTS.md` both instruct the agent to produce: +- JUnit XML output using the framework's native reporter (pytest `--junitxml`, gotestsum `--junitxml`, Maven Surefire reports, `jest-junit`, `cargo2junit`) +- A sidecar JSON file (`tdd-results.json` or `integration-results.json`) in `quality/results/` + +Check that each protocol's JSON schema includes all mandatory fields: +- **tdd-results.json:** `schema_version`, `skill_version`, `date`, `project`, `bugs`, `summary`. Per-bug: `id`, `requirement`, `red_phase`, `green_phase`, `verdict`, `fix_patch_present`, `writeup_path`. +- **integration-results.json:** `schema_version`, `skill_version`, `date`, `project`, `recommendation`, `groups`, `summary`, `uc_coverage`. 
Per-group: `group`, `name`, `use_cases`, `result`. + +Verify that the protocol does NOT contain flat command-list schemas (a `"results"` or `"commands_run"` array without `"groups"` is non-conformant). Verify that verdict/result enum values use only the allowed values defined in SKILL.md (e.g., `"TDD verified"`, `"red failed"`, `"green failed"`, `"confirmed open"` for TDD verdicts; `"pass"`, `"fail"`, `"skipped"`, `"error"` for integration results; `"SHIP"`, `"FIX BEFORE MERGE"`, `"BLOCK"` for recommendations). The TDD verdict `"skipped"` is deprecated — use `"confirmed open"` with `red_phase: "fail"` and `green_phase: "skipped"` instead. The TDD summary must include a `confirmed_open` count alongside `verified`, `red_failed`, and `green_failed`. + +Both sidecar JSON templates must use `schema_version: "1.1"` (v1.1 change: `verdict: "skipped"` deprecated in favor of `"confirmed open"`). Both protocols must include a **post-write validation step** instructing the agent to reopen the sidecar JSON after writing it and verify required fields, enum values, and no extra undocumented root keys. + +### 15. Patch Validation Gate Is Executable + +For each confirmed bug with patches, verify: +1. The `git apply --check` commands specified in the patch validation gate use the correct patch paths (`quality/patches/BUG-NNN-*.patch`) +2. The compile/syntax check command matches the project's actual build system — not a generic placeholder +3. For interpreted languages (Python, JavaScript), the gate specifies the appropriate syntax check (`python -m py_compile`, `node --check`, `pytest --collect-only`, or equivalent) +4. The gate includes a temporary worktree or stash-and-revert instruction to comply with the source boundary rule + +### 16. Regression Test Skip Guards Are Present + +Grep `quality/test_regression.*` for the language-appropriate skip/xfail mechanism. 
Every test function must have a guard: +- Python: `@pytest.mark.xfail` or `@unittest.expectedFailure` +- Go: `t.Skip(` +- Java: `@Disabled` +- Rust: `#[ignore]` +- TypeScript/JavaScript: `test.failing(`, `test.fails(`, or `it.skip(` + +A regression test without a skip guard will cause unexpected failures when the test suite runs on unpatched code. Every guard must reference the bug ID (BUG-NNN format) and the fix patch path. + +### 17. Integration Group Commands Pass Pre-Flight Discovery + +For each integration test group command in `RUN_INTEGRATION_TESTS.md`, verify that the command discovers at least one test using the framework's dry-run mode (`pytest --collect-only`, `go test -list`, `vitest list`, `jest --listTests`, `cargo test -- --list`). A group whose command fails discovery will produce a `covered_fail` result that masks a selector bug as a code bug. If a command cannot be validated (no dry-run mode available), note the limitation. + +### 18. Version Stamps Present on All Generated Files + +Grep every generated Markdown file in `quality/` for the attribution line: `Generated by [Quality Playbook]`. Grep every generated code file for `Generated by Quality Playbook`. Every file must have the stamp with the correct version number. Files without stamps are not traceable to the tool and version that created them. **Exemptions:** sidecar JSON files (use `skill_version` field), JUnit XML files (framework-generated), and `.patch` files (stamp would break `git apply`). For Python files with shebang or encoding pragma, verify the stamp comes after the pragma, not before. + +### 19. Enumeration Completeness Checks Performed + +Verify that the code review (Pass 1 and Pass 2) performed mechanical two-list enumeration checks wherever the code uses `switch`/`case`, `match`, or if-else chains to dispatch on named constants. 
For each such check, the review must show: (a) the list of constants defined in headers/enums/specs, (b) the list of case labels actually present in the code, (c) any gaps. A review that claims "the whitelist covers all values" or "all cases are handled" without showing the two-list comparison is non-conformant — this is the specific hallucination pattern the check prevents. + +### 20. Bug Writeups Generated for All Confirmed Bugs + +For each bug in `tdd-results.json` (both `verdict: "TDD verified"` and `verdict: "confirmed open"`), verify that a corresponding `quality/writeups/BUG-NNN.md` file exists and that `tdd-results.json` has a non-null `writeup_path` for that bug. Each writeup must include: summary, spec reference, code citation, observable consequence, fix diff, and test description. A confirmed bug without a writeup is incomplete. + +### 21. Triage Verification Probes Include Executable Evidence + +Open the triage report (`quality/spec_audits/YYYY-MM-DD-triage.md`). For every finding that was confirmed or rejected via a verification probe, verify that the triage entry includes a test assertion (not just prose reasoning). Rejections must include a PASSING assertion proving the finding is wrong. Confirmations must include a FAILING assertion proving the bug exists. Every assertion must cite an exact line number. A triage decision based on prose reasoning alone ("lines 3527-3528 explicitly preserve X") without a mechanical assertion is non-conformant. + +### 22. Enumeration Lists Extracted From Code, Not Copied From Requirements + +When the code review includes an enumeration check (e.g., "case labels present in function X"), verify that the code-side list includes per-item line numbers from the actual source. If the list matches the requirements list word-for-word without line numbers, the enumeration was likely copied rather than extracted and must be redone. 
Also verify that the triage pre-audit spot-checks report the actual contents of cited lines ("line 3527 contains `default:`") rather than merely confirming claims ("line 3527 preserves RING_RESET").
+
+### 23. Mechanical Verification Artifacts Exist and Pass Integrity Check
+
+For every contract or requirement that asserts a function handles/preserves/dispatches a set of named constants (feature bits, enum values, opcode tables), verify that a corresponding `quality/mechanical/<function>_cases.txt` file exists and was generated by a non-interactive shell pipeline. Contracts that reference dispatch-function coverage without citing a mechanical artifact are non-conformant.
+
+**Integrity check (mandatory):** Run `bash quality/mechanical/verify.sh`. This script re-executes the same extraction commands that generated each mechanical artifact and diffs the results. If ANY diff is non-empty, the artifact was tampered with — the model may have written expected output instead of capturing actual shell output. A mismatched artifact must be regenerated by re-running the extraction command (not by editing the file). This check exists because in v1.3.19, the model executed the correct awk/grep command but wrote a fabricated 9-line output (including a hallucinated `case VIRTIO_F_RING_RESET:`) to the file, when the actual command only produces 8 lines.
+
+### 24. Source-Inspection Regression Tests Execute (No `run=False`)
+
+Grep `quality/test_regression.*` for `run=False` (Python), `t.Skip` with a source-inspection comment, or equivalent skip mechanisms. Any regression test whose purpose is source-structure verification (string presence in function bodies, case label existence, enum extraction) must execute — it must NOT use `run=False`. These tests are safe, deterministic string-match operations. An `xfail(strict=True)` test that actually fails reports as XFAIL (expected), which is correct behavior.
A source-inspection test with `run=False` is the worst possible state: the correct check exists but never fires.

### 25. Contradiction Gate Passed (Executed Evidence vs. Prose)

Verify that no executed artifact contradicts a prose artifact at closure. Specifically: (a) if any `quality/mechanical/*` file shows a constant as absent, no prose artifact (`CONTRACTS.md`, `REQUIREMENTS.md`, code review, triage) may claim it is present; (b) if any regression test with `xfail` actually fails (XFAIL), `BUGS.md` may not claim that bug is "fixed in working tree" without a commit reference; (c) if TDD traceability shows a red-phase failure, the triage may not claim the corresponding code is compliant. Any contradiction must be resolved before closure.

### 26. Version Stamp Consistency

Read the `version:` field from the SKILL.md metadata (locate SKILL.md in the skill installation directory — typically `.github/skills/SKILL.md` or `.claude/skills/quality-playbook/SKILL.md`). Check every generated artifact: PROGRESS.md's `Skill version:` field, every `> Generated by` attribution line, every code file header stamp, and every sidecar JSON `skill_version` field. Every version stamp must match the SKILL.md metadata exactly. A single mismatch is a benchmark failure. This benchmark exists because in v1.3.21 benchmarking, 5 of 9 repos had version stamps from older skill versions due to a hardcoded template.

### 27. Mechanical Directory Conformance

If `quality/mechanical/` exists, it must contain at minimum a `verify.sh` file. An empty `quality/mechanical/` directory is non-conformant. If no dispatch-function contracts exist, the directory should not exist — instead record `Mechanical verification: NOT APPLICABLE` in PROGRESS.md. If the directory exists with extraction artifacts, `verify.sh` must include one verification block per saved file (not just one). A verify.sh that checks only one artifact when multiple exist is incomplete.

### 28. TDD Artifact Closure

If `quality/BUGS.md` contains any confirmed bugs, `quality/results/tdd-results.json` is mandatory. If any bug has a red-phase result, `quality/TDD_TRACEABILITY.md` is also mandatory. Zero-bug repos may omit both files. For repos where TDD cannot execute, tdd-results.json must exist with `verdict: "deferred"` and a `notes` field explaining why.

### 29. Triage-to-BUGS.md Sync

After spec audit triage, every finding confirmed as a code bug must appear in `quality/BUGS.md`. A triage report with confirmed code bugs and no corresponding BUGS.md entries is non-conformant. If BUGS.md does not exist when confirmed bugs exist, it must be created.

### 30. Writeups for All Confirmed Bugs

Every confirmed bug (TDD-verified or confirmed-open) must have a writeup at `quality/writeups/BUG-NNN.md`. For confirmed-open bugs without fix patches, the writeup notes the absence of fix/green-phase evidence. A run with confirmed bugs and no writeups directory is incomplete.

### 31. Phase 4 Triage File Exists

Phase 4 is not complete until a triage file exists at `quality/spec_audits/YYYY-MM-DD-triage.md`. If only auditor reports exist with no triage synthesis, Phase 4 is incomplete.

### 32. Seed Checks Executed Mechanically (Continuation Mode)

When `quality/previous_runs/` exists and Phase 0 runs, verify that `quality/SEED_CHECKS.md` was generated with one entry per unique bug from prior runs. Each seed must have a mechanical verification result (FAIL = bug still present, PASS = bug fixed) obtained by actually running the assertion — not by reading prose from the prior run. If a seed's regression test exists in a prior run, the assertion must be re-executed against the current source tree. A seed marked FAIL without executing the assertion is non-conformant. This benchmark applies only when continuation mode is active (prior runs exist).

### 33. Convergence Status Recorded in PROGRESS.md (Continuation Mode)

When Phase 0 runs, verify that PROGRESS.md contains a `## Convergence` section with: run number, seed count, net-new bug count, and a CONVERGED/NOT CONVERGED verdict. The net-new count must equal the number of bugs in BUGS.md that don't match any seed by file:line. A missing convergence section when `SEED_CHECKS.md` exists is non-conformant. This benchmark applies only when continuation mode is active.

### 34. BUGS.md Always Exists

Every completed run must produce `quality/BUGS.md`. If the run confirmed source-code bugs, BUGS.md must list them. If the run found zero source-code bugs, BUGS.md must contain a `## Summary` with a positive assertion — "No confirmed source-code bugs found" — plus counts of candidates evaluated and eliminated. A completed run (Phase 5 marked complete) with no BUGS.md is non-conformant. This benchmark exists because in v1.3.22 benchmarking, express completed all phases with zero source bugs but produced no BUGS.md, making it ambiguous whether the file was intentionally omitted or accidentally skipped.

### 35. Immediate Mechanical Integrity Gate (Phase 2a)

If `quality/mechanical/` exists, verify that `bash quality/mechanical/verify.sh` was executed immediately after each `*_cases.txt` was written — before any contract, requirement, or triage artifact cites the extraction. Evidence: `quality/results/mechanical-verify.log` and `quality/results/mechanical-verify.exit` exist, and the exit file contains `0`. If these receipt files are missing or the exit code is non-zero, the mechanical extraction was not verified at the point of creation. This benchmark exists because v1.3.23 deferred verification to Phase 6, allowing downstream artifacts (CONTRACTS.md, REQUIREMENTS.md, triage probes) to build on a forged extraction for the entire run; the mismatch was never caught.

### 36. Mechanical Artifacts Not Used as Evidence in Triage Probes

Grep all triage and verification probe files (`quality/spec_audits/*`) for `open('quality/mechanical/` or `cat quality/mechanical/`. If any probe reads a `quality/mechanical/*.txt` file as sole evidence for what a source file contains, it is circular verification and the benchmark fails. Probes must read the source file directly or re-execute the extraction pipeline. This benchmark exists because v1.3.23 Probe C validated the forged mechanical artifact instead of the source code, passing with fabricated data.

### 37. Phase 6 Mechanical Closure Uses Bash (Not Python Substitution)

If `quality/mechanical/` exists, verify that Phase 6 ran `bash quality/mechanical/verify.sh` as a literal shell command — not a Python script reading the artifact file. Evidence: `quality/results/mechanical-verify.log` contains output from the bash script (lines like "OK: ..." or "MISMATCH: ..."), not Python tracebacks or `pathlib` output. PROGRESS.md must include a `## Phase 6 Mechanical Closure` heading with the recorded stdout and exit code. This benchmark exists because v1.3.23 substituted Python `Path.read_text()` for `bash verify.sh`, creating a circular check that passed despite the artifact being fabricated.

### 38. Individual Auditor Report Artifacts Exist

If Phase 4 (spec audit) ran, verify that individual auditor report files exist at `quality/spec_audits/YYYY-MM-DD-auditor-N.md` (one per auditor), not just the triage synthesis. A single triage file without individual reports conflates discovery with reconciliation. This benchmark exists to ensure pre-reconciliation findings are preserved for independent verification.

### 39. BUGS.md Uses Canonical Heading Format

Every confirmed bug in BUGS.md must use the heading level `### BUG-NNN`. Grep for `^### BUG-` and count; grep for other bug heading patterns (`^## BUG-`, `^\*\*BUG-`, `^- BUG-`) and verify zero matches.
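The grep pair can be scripted in a few lines. A minimal sketch, in the style of the skill's Python gate (the function name is illustrative, not part of the skill's API):

```python
import re

def check_bug_headings(bugs_md: str) -> dict:
    """Benchmark 39: bug headings must be exactly '### BUG-NNN'."""
    canonical = re.findall(r"^### BUG-\d+", bugs_md, flags=re.MULTILINE)
    # Any of these variants makes machine counts disagree with the document.
    variants = re.findall(r"^(?:## BUG-|\*\*BUG-|- BUG-)", bugs_md, flags=re.MULTILINE)
    return {"canonical": len(canonical), "variants": len(variants),
            "pass": bool(canonical) and not variants}
```

Counting both lists, rather than only the canonical one, is what turns "verify zero matches" into a mechanical check instead of an attestation.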
Inconsistent heading levels cause machine-readable counts to disagree with the document.

### 40. Artifact File-Existence Gate Passed

Before Phase 5 is marked complete, verify that all required artifacts exist as files on disk — not just referenced in PROGRESS.md. Required files: EXPLORATION.md, BUGS.md, REQUIREMENTS.md, QUALITY.md, PROGRESS.md, COVERAGE_MATRIX.md, COMPLETENESS_REPORT.md, CONTRACTS.md, test_functional.* (or a language-appropriate alternative: FunctionalSpec.*, FunctionalTest.*, functional.test.*), RUN_CODE_REVIEW.md, RUN_INTEGRATION_TESTS.md, RUN_SPEC_AUDIT.md, RUN_TDD_TESTS.md, and AGENTS.md (at project root). If Phase 3 ran: at least one file in code_reviews/. If Phase 4 ran: at least one auditor file and a triage file in spec_audits/. If Phase 0 or 0b ran: SEED_CHECKS.md as a standalone file. If confirmed bugs exist: tdd-results.json in results/. If any bug has a red-phase result: TDD_TRACEABILITY.md. This benchmark exists because v1.3.24 benchmarking showed express writing a terminal gate section to PROGRESS.md claiming 1 confirmed bug, while BUGS.md, code review files, and spec audit files were never written to disk.

### 41. Sidecar JSON Post-Write Validation

After `tdd-results.json` and/or `integration-results.json` are written, verify that each file contains all required keys with conformant values. For `tdd-results.json`: required root keys are `schema_version`, `skill_version`, `date`, `project`, `bugs`, `summary`. Each `bugs` entry must have `id`, `requirement`, `red_phase`, `green_phase`, `verdict`, `fix_patch_present`, `writeup_path`. The `summary` must include `confirmed_open`. For `integration-results.json`: required root keys are `schema_version`, `skill_version`, `date`, `project`, `recommendation`, `groups`, `summary`, `uc_coverage`. Both must have `schema_version: "1.1"`. A sidecar JSON with missing required keys, non-standard root keys, or invalid enum values is non-conformant.
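The `tdd-results.json` half of this check can be sketched directly from the key lists above (`validate_tdd_results` is an illustrative name, not the skill's API; enum validation is omitted for brevity):

```python
import json
from pathlib import Path

# Required key sets, copied from benchmark 41.
TDD_ROOT = {"schema_version", "skill_version", "date", "project", "bugs", "summary"}
TDD_BUG = {"id", "requirement", "red_phase", "green_phase", "verdict",
           "fix_patch_present", "writeup_path"}

def validate_tdd_results(path) -> list:
    """Return conformance problems for a tdd-results.json (empty list = pass)."""
    data = json.loads(Path(path).read_text())
    problems = [f"missing root key: {k}" for k in sorted(TDD_ROOT - data.keys())]
    if data.get("schema_version") != "1.1":
        problems.append(f"schema_version {data.get('schema_version')!r} != '1.1'")
    for i, bug in enumerate(data.get("bugs", [])):
        problems += [f"bugs[{i}] missing key: {k}" for k in sorted(TDD_BUG - bug.keys())]
    if "confirmed_open" not in data.get("summary", {}):
        problems.append("summary missing confirmed_open")
    return problems
```

Returning a problem list instead of a boolean keeps the gate's log actionable: every missing key is named, not just the first.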
This benchmark exists because v1.3.25 benchmarking showed 6 of 8 repos with non-conformant sidecar JSON — httpx invented an alternate schema, serde used the legacy shape, javalin omitted `summary` and per-bug fields, express used invalid phase values, and others used invalid verdict/result enum values.

### 42. Script-Verified Closure Gate Passed

Before Phase 5 is marked complete, `quality_gate.sh` must be executed from the project root and must exit 0. The script's full output must be saved to `quality/results/quality-gate.log`. A Phase 5 completion with no `quality-gate.log`, or with a log showing FAIL results, is non-conformant. This benchmark exists because v1.3.21–v1.3.25 relied entirely on model self-attestation for artifact conformance checks, and benchmarking showed persistent non-compliance (heading format, sidecar schema, use case identifiers, version stamps) that a script catches mechanically.

### 43. Canonical Use Case Identifiers Present

REQUIREMENTS.md must contain use cases labeled with canonical identifiers in the format `UC-01`, `UC-02`, etc. Grep for `UC-[0-9]` and count matches. A repo with use case content but no canonical identifiers is non-conformant. This benchmark exists because v1.3.25 benchmarking showed 7 of 8 repos with use case sections but no machine-readable identifiers — downstream tooling cannot count or cross-reference use cases without a canonical format.

### 44. Regression-Test Patches Exist for Every Confirmed Bug

For every confirmed bug (any BUG-NNN entry in BUGS.md), verify that `quality/patches/BUG-NNN-regression-test.patch` exists. A confirmed bug without a regression-test patch is incomplete — the patch is the strongest independent evidence that the bug exists. Fix patches (`BUG-NNN-fix.patch`) are optional but strongly encouraged for simple fixes.
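This check reduces to one grep and one directory listing. A minimal sketch, assuming the canonical `### BUG-NNN` heading format (the function name is illustrative):

```python
import re
from pathlib import Path

def missing_regression_patches(bugs_md: str, patches_dir: Path) -> list:
    """Benchmark 44: each BUG-NNN in BUGS.md needs BUG-NNN-regression-test.patch."""
    bug_ids = re.findall(r"^### (BUG-\d+)", bugs_md, flags=re.MULTILINE)
    return [b for b in bug_ids
            if not (patches_dir / f"{b}-regression-test.patch").is_file()]
```

Note that this check depends on benchmark 39: if bug headings drift from the canonical format, the grep finds nothing and the patch check silently passes, which is exactly why heading conformance is gated separately.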
This benchmark exists because v1.3.25 and v1.3.26 benchmarking showed 4 of 8 repos with 0 patch files despite having confirmed bugs, and the writeups described what fixes should look like without generating actual patch files.

### 45. Writeup Inline Fix Diffs

Every writeup at `quality/writeups/BUG-NNN.md` must contain a ` ```diff ` fenced code block with the proposed fix in unified diff format. This is section 6 ("The fix") of the writeup template. A writeup that says "see patch file" or "no fix patch included" without an inline diff is incomplete — the inline diff is what makes the writeup actionable for a maintainer reading just the writeup, without access to the patch directory. This benchmark exists because v1.3.27 benchmarking showed virtio producing 4 writeups with 0 inline diffs despite having fix patches in `quality/patches/`. The model wrote prose descriptions of the fix instead of pasting the actual diff.

## Quick Checklist Format

Use this as a final sign-off:

- [ ] Integration test quality gates were written from a Field Reference Table (not memory)
- [ ] Integration tests have specific pass criteria
- [ ] Spec audit prompt is copy-pasteable and uses `[Req: tier — source]` tag format
- [ ] Structured output schemas include all mandatory fields and valid enum values
- [ ] Patch validation gate uses correct commands for the project's build system
- [ ] Every regression test has a skip/xfail guard referencing the bug ID
- [ ] Integration group commands pass pre-flight discovery (dry-run finds tests)
- [ ] Every generated file has a version stamp with the correct version number
- [ ] Enumeration completeness checks show two-list comparisons (not just assertions of coverage)
- [ ] Every TDD-verified bug has a writeup at `quality/writeups/BUG-NNN.md`
- [ ] Triage verification probes include test assertions (not just prose) for confirmations and rejections
- [ ] Enumeration code-side lists include per-item line numbers (not copied from requirements)
- [ ] Dispatch-function contracts cite `quality/mechanical/` artifacts (not hand-written lists)
- [ ] `bash quality/mechanical/verify.sh` passes (artifacts match re-extracted output)
- [ ] Source-inspection regression tests execute (no `run=False` for string-match tests)
- [ ] No executed artifact contradicts any prose artifact at closure (contradiction gate passed)
- [ ] All generated artifact version stamps match the SKILL.md metadata version exactly
- [ ] `quality/mechanical/` is either absent (no dispatch contracts) or contains verify.sh plus all extraction artifacts
- [ ] If BUGS.md has confirmed bugs: tdd-results.json exists (mandatory); TDD_TRACEABILITY.md exists if any bug has a red-phase result
- [ ] Every confirmed bug in triage appears in BUGS.md (triage-to-BUGS.md sync)
- [ ] Every confirmed bug (TDD-verified or confirmed-open) has a writeup at `quality/writeups/BUG-NNN.md`
- [ ] Phase 4 has a triage file at `quality/spec_audits/YYYY-MM-DD-triage.md`
- [ ] (Continuation mode) Seed checks in `SEED_CHECKS.md` were executed mechanically, not inferred from prose
- [ ] Mechanical verification receipt files exist (`mechanical-verify.log` + `mechanical-verify.exit`) when `quality/mechanical/` exists
- [ ] No triage probe reads `quality/mechanical/*.txt` as sole evidence for source code contents
- [ ] Phase 6 mechanical closure used `bash verify.sh` (not Python substitution)
- [ ] Individual auditor reports exist at `quality/spec_audits/*-auditor-N.md` (not just triage)
- [ ] All BUGS.md bug headings use `### BUG-NNN` format
- [ ] `quality/BUGS.md` exists (zero-bug runs include a summary of candidates evaluated and eliminated)
- [ ] All required artifact files exist on disk before Phase 5 is marked complete (not just referenced in PROGRESS.md)
- [ ] (Continuation mode) PROGRESS.md contains a `## Convergence` section with net-new count and verdict
- [ ] Sidecar JSON files (`tdd-results.json`, `integration-results.json`) contain all required keys with `schema_version: "1.1"`
- [ ] `quality_gate.sh` was executed and exited 0; output saved to `quality/results/quality-gate.log`
- [ ] REQUIREMENTS.md contains canonical use case identifiers (`UC-01`, `UC-02`, etc.)
- [ ] Every confirmed bug has `quality/patches/BUG-NNN-regression-test.patch`
- [ ] Every writeup has an inline fix diff (a ` ```diff ` block in section 6)
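Most of these items are mechanically checkable. As one example, the version-stamp item could be sketched like this (the stamp patterns are assumptions based on the formats named in benchmark 26; a real gate would walk the `quality/` tree on disk rather than take a dict of contents):

```python
import re

# Stamp lines take the forms named in benchmark 26: "Skill version: X.Y.Z"
# and "> Generated by ... vX.Y.Z".
STAMP = re.compile(r"(?:Skill version:|Generated by)[^\n]*?v?(\d+\.\d+\.\d+)")

def stale_stamps(skill_md: str, artifacts: dict) -> list:
    """Compare every stamp in generated artifacts to SKILL.md's version: field."""
    expected = re.search(r"^version:\s*v?(\S+)", skill_md, flags=re.MULTILINE).group(1)
    return [f"{name}: found {v}, expected {expected}"
            for name, text in artifacts.items()
            for v in STAMP.findall(text)
            if v != expected]
```

Reading the expected version from SKILL.md at run time, instead of hardcoding it in a template, is the fix for the v1.3.21 failure mode this benchmark records.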