Merged
30 changes: 16 additions & 14 deletions docs/scenarios.md
@@ -152,26 +152,28 @@ requirements (D8), code behavior not traced to any requirement (D9),
and constraint violations in the implementation (D10), with
implementation coverage metrics and specific code locations.

---

## Future Scenarios (Roadmap)

These scenarios describe capabilities that are planned but not yet
implemented. See the [roadmap](roadmap.md) for details.

### "Do our tests actually test what the plan says they should?"

Your validation plan specifies 58 test cases. Your test suite has
tests. But are they the same tests? Do the assertions match the
acceptance criteria?
acceptance criteria? Are there test cases in the plan that have no
automated test at all?

**Planned template:** `audit-test-compliance` ·
**Taxonomy:** `specification-drift` (D11–D13)
**Template:** `audit-test-compliance` · **Persona:** `specification-analyst` ·
**Protocol:** `test-compliance-audit` · **Taxonomy:** `specification-drift` (D11–D13)

**What you'd get:** A report mapping validation plan test cases to
actual test implementations, identifying unimplemented test cases,
tests with wrong assertions, and coverage gaps between the plan and
reality.
**What you get:** An investigation report mapping validation plan test
cases to actual test implementations, identifying unimplemented test
cases (D11), missing acceptance criterion assertions (D12), and
assertion mismatches where the test checks different conditions than
the plan specifies (D13).

---

## Future Scenarios (Roadmap)

These scenarios describe capabilities that are planned but not yet
implemented. See the [roadmap](roadmap.md) for details.

### "Extract the invariants from this RFC"

20 changes: 20 additions & 0 deletions manifest.yaml
@@ -160,6 +160,14 @@ protocols:
and classifies findings using the specification-drift taxonomy
(D8–D10).

- name: test-compliance-audit
path: protocols/reasoning/test-compliance-audit.md
description: >
Systematic protocol for auditing test code against a validation
plan and requirements document. Maps test case definitions to
test implementations and classifies findings using the
specification-drift taxonomy (D11–D13).

formats:
- name: requirements-doc
path: formats/requirements-doc.md
@@ -340,6 +348,18 @@ templates:
format: investigation-report
requires: requirements-document

- name: audit-test-compliance
path: templates/audit-test-compliance.md
description: >
Audit test code against a validation plan and requirements
document. Detects unimplemented test cases, missing acceptance
criterion assertions, and assertion mismatches.
persona: specification-analyst
protocols: [anti-hallucination, self-verification, operational-constraints, test-compliance-audit]
taxonomies: [specification-drift]
format: investigation-report
requires: [requirements-document, validation-plan]

investigation:
- name: investigate-bug
path: templates/investigate-bug.md
175 changes: 175 additions & 0 deletions protocols/reasoning/test-compliance-audit.md
@@ -0,0 +1,175 @@
<!-- SPDX-License-Identifier: MIT -->
<!-- Copyright (c) PromptKit Contributors -->

---
name: test-compliance-audit
type: reasoning
description: >
Systematic protocol for auditing test code against a validation plan
and requirements document. Maps test case definitions to test
implementations, verifies assertions match acceptance criteria, and
classifies findings using the specification-drift taxonomy (D11–D13).
applicable_to:
- audit-test-compliance
---

# Protocol: Test Compliance Audit

Apply this protocol when auditing test code against a validation plan
and requirements document to determine whether the automated tests
implement what the validation plan specifies. The goal is to find every
gap between planned and actual test coverage — missing tests,
incomplete assertions, and mismatched expectations.

## Phase 1: Validation Plan Inventory

Extract the complete set of test case definitions from the validation
plan.

1. **Test cases** — for each TC-NNN, extract:
- The test case ID and title
- The linked requirement(s) (REQ-XXX-NNN)
- The test steps (inputs, actions, sequence)
- The expected results and pass/fail criteria
- The test level (unit, integration, system, etc.)
- Any preconditions or environmental assumptions

2. **Requirements cross-reference** — for each linked REQ-ID, look up
its acceptance criteria in the requirements document. These are the
ground truth for what the test should verify.

3. **Test scope classification** — classify each test case as:
- **Automatable**: Can be implemented as an automated test
- **Manual-only**: Requires human judgment, physical interaction,
or platform-specific behavior that cannot be automated
- **Deferred**: Explicitly marked as not-yet-implemented in the
validation plan
Restrict the audit to automatable test cases. Report manual-only
and deferred counts in the coverage summary.
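The inventory this phase produces can be sketched as a small data structure. This is a minimal illustration only — the field names and the example records are assumptions of this sketch, not prescribed by the protocol:

```python
from dataclasses import dataclass

# Illustrative record for one validation-plan test case.
# Field names are assumptions for this sketch, not part of the protocol.
@dataclass
class TestCase:
    tc_id: str                   # e.g. "TC-042"
    title: str
    requirements: list           # linked REQ-XXX-NNN IDs
    steps: list                  # inputs, actions, sequence
    expected_results: list
    level: str                   # "unit", "integration", "system", ...
    scope: str = "automatable"   # or "manual-only" / "deferred"

plan = [
    TestCase("TC-001", "Login rejects bad password",
             ["REQ-AUTH-003"], ["POST /login with wrong password"],
             ["response code is 403"], "integration"),
    TestCase("TC-002", "Manual UX review", ["REQ-UI-001"], [], [],
             "system", scope="manual-only"),
]

# Restrict the audit to automatable cases (step 3); report the rest
# in the coverage summary.
audited = [tc for tc in plan if tc.scope == "automatable"]
excluded = len(plan) - len(audited)
print(len(audited), excluded)
```

Keeping scope as an explicit field makes the Phase 6 manual/deferred count a one-line aggregation later.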

## Phase 2: Test Code Inventory

Survey the test code to understand its structure.

1. **Test organization**: Identify the test framework (e.g., pytest,
   JUnit, Rust's `#[test]` attribute, Jest), test file structure, and
   naming conventions.
2. **Test function catalog**: List all test functions/methods with
their names, locations (file, line), and any identifying markers
(TC-NNN in name or comment, requirement references).
3. **Test helpers and fixtures**: Identify shared setup, teardown,
mocking, and assertion utilities — these affect what individual
tests can verify.

Do NOT attempt to understand every test's implementation in detail.
Build the catalog first, then trace specific tests in Phase 3.
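A lightweight catalog pass might look like the following sketch. It assumes pytest-style `def test_*` naming and TC/REQ markers in comments — both are assumptions for illustration; other frameworks need different patterns:

```python
import re

# Regexes assume pytest-style naming and TC-NNN / REQ-XXX-NNN markers
# in names or comments; adapt for JUnit, Jest, etc.
TEST_DEF = re.compile(r"^def (test_\w+)", re.MULTILINE)
MARKER = re.compile(r"\b(TC-\d{3}|REQ-[A-Z]+-\d{3})\b")

def catalog(source, path):
    entries = []
    for m in TEST_DEF.finditer(source):
        line = source.count("\n", 0, m.start()) + 1
        # Take the function body as the text up to the next test def.
        nxt = TEST_DEF.search(source, m.end())
        body = source[m.start():nxt.start() if nxt else len(source)]
        entries.append({"name": m.group(1), "file": path, "line": line,
                        "markers": sorted(set(MARKER.findall(body)))})
    return entries

src = '''
def test_login_rejects_bad_password():
    # TC-001 / REQ-AUTH-003
    assert login("u", "wrong").status == 403

def test_misc_regression():
    assert parse("") == []
'''
cat = catalog(src, "tests/test_auth.py")
print([e["name"] for e in cat])
print(cat[0]["markers"])
```

The catalog deliberately records only names, locations, and markers — detailed tracing is deferred to Phase 3, as the text above directs.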

## Phase 3: Forward Traceability (Validation Plan → Test Code)

For each automatable test case in the validation plan:

1. **Find the implementing test**: Search the test code for a test
function that implements TC-NNN. Match by:
- Explicit TC-NNN reference in test name or comments
- Behavioral equivalence (test steps and assertions match the
validation plan's specification, even without an ID reference)
- Requirement reference (test references the same REQ-ID)

2. **Assess implementation completeness**: For each matched test:

a. **Step coverage**: Does the test execute the steps described in
the validation plan? Are inputs, actions, and sequences present?

b. **Assertion coverage**: Does the test assert the expected results
from the validation plan? Check each expected result individually.

c. **Acceptance criteria alignment**: Cross-reference the linked
requirement's acceptance criteria. Does the test verify ALL
criteria, or only a subset? Flag missing criteria as
D12_UNTESTED_ACCEPTANCE_CRITERION.

d. **Assertion correctness**: Do the test's assertions match the
expected behavior? Check for:
- Wrong thresholds (plan says 200ms, test checks for non-null)
- Wrong error codes (plan says 403, test checks not-200)
- Missing negative assertions (plan says "MUST NOT", test only
checks positive path)
- Structural assertions that don't verify semantics (checking
"response exists" instead of "response contains expected data")
Flag mismatches as D13_ASSERTION_MISMATCH.

3. **Classify the result**:
- **IMPLEMENTED**: Test fully implements the validation plan's
test case with correct assertions. Record the test location.
- **PARTIALLY IMPLEMENTED**: Test exists but is incomplete.
Classify based on *what* is missing:
- Missing acceptance criteria assertions →
D12_UNTESTED_ACCEPTANCE_CRITERION
- Wrong assertions or mismatched expected results →
D13_ASSERTION_MISMATCH
- **NOT IMPLEMENTED**: No test implements this test case (no
matching test function found in the provided code). Flag as
D11_UNIMPLEMENTED_TEST_CASE. Note: a test stub with an empty
body or skip annotation is NOT an implementation — classify it
as D13 (assertions don't match because there are none) and
record its code location.
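The classification rules above can be sketched as a decision function. The labels follow the protocol; the inputs and the tie-break (reporting D13 when both D12 and D13 conditions hold) are simplifying assumptions of this sketch:

```python
# Classify one validation-plan test case against its matched test.
# matched_test is None when no implementing test was found.
def classify(tc_id, matched_test, missing_criteria, assertion_mismatches):
    if matched_test is None:
        # NOT IMPLEMENTED: no matching test function found.
        return "D11_UNIMPLEMENTED_TEST_CASE"
    if matched_test.get("is_stub"):
        # A stub or skipped test is not an implementation; per the
        # protocol it is classified as D13 (no matching assertions).
        return "D13_ASSERTION_MISMATCH"
    if assertion_mismatches:
        # PARTIALLY IMPLEMENTED: wrong thresholds, codes, or semantics.
        # If D12 conditions also hold, this sketch reports the
        # higher-risk D13 label.
        return "D13_ASSERTION_MISMATCH"
    if missing_criteria:
        # PARTIALLY IMPLEMENTED: acceptance criteria not asserted.
        return "D12_UNTESTED_ACCEPTANCE_CRITERION"
    return "IMPLEMENTED"

print(classify("TC-001", None, [], []))
print(classify("TC-002", {"name": "test_stub", "is_stub": True}, [], []))
print(classify("TC-003", {"name": "test_ok"}, ["AC-2"], []))
```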

## Phase 4: Backward Traceability (Test Code → Validation Plan)

Identify tests that don't trace to the validation plan.

1. **For each test function** in the test code, determine whether it
maps to a TC-NNN in the validation plan.

2. **Classify unmatched tests**:
- **Regression tests**: Tests added for specific bugs, not part of
the validation plan. These are expected and not findings.
- **Exploratory tests**: Tests that cover scenarios not in the
validation plan. Note these but do not flag as drift — they may
indicate validation plan gaps (candidates for new test cases).
- **Orphaned tests**: Tests that reference TC-NNN IDs or REQ-IDs
that do not exist in the validation plan or requirements. These
may be stale after a renumbering. Report orphaned tests as
observations in the coverage summary (Phase 6), not as D11–D13
findings — they don't fit the taxonomy since no valid TC-NNN
is involved.
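A minimal sketch of this backward classification, assuming the Phase 1 inventory supplies the valid ID sets and regression status comes from commit history or test markers (both assumptions of this sketch):

```python
# Classify a test function with no confirmed TC mapping.
# plan_ids / req_ids are the valid identifiers from Phase 1.
def classify_unmatched(markers, plan_ids, req_ids, is_regression):
    stale = [m for m in markers if m not in plan_ids and m not in req_ids]
    if stale:
        # References IDs that no longer exist — report as an
        # observation in the coverage summary, not a D11-D13 finding.
        return "orphaned"
    if is_regression:
        return "regression"    # expected, not a finding
    # Covers a scenario the plan lacks — a candidate new test case.
    return "exploratory"

plan_ids = {"TC-001"}
req_ids = {"REQ-AUTH-003"}
print(classify_unmatched(["TC-999"], plan_ids, req_ids, False))
print(classify_unmatched([], plan_ids, req_ids, True))
print(classify_unmatched([], plan_ids, req_ids, False))
```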

## Phase 5: Classification and Reporting

Classify every finding using the specification-drift taxonomy.

1. Assign exactly one drift label (D11, D12, or D13) to each finding.
2. Assign severity using the taxonomy's severity guidance.
3. For each finding, provide:
- The drift label and short title
- The validation plan location (TC-NNN, section) and test code
location (file, function, line). For D11 findings, the test code
location is "None — no implementing test found" with a description
of what was searched.
- The linked requirement and its acceptance criteria
- Evidence: what the validation plan specifies and what the test
does (or doesn't)
- Impact: what could go wrong
- Recommended resolution
4. Order findings primarily by severity, then by taxonomy ranking
within each severity tier.
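The two-level ordering in step 4 can be sketched as a composite sort key. The tier numbers mirror the taxonomy's ranking criteria for D11–D13; the findings themselves are illustrative:

```python
# Order findings by severity first, then by taxonomy risk tier.
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}
TAXONOMY_TIER = {"D13": 1, "D12": 2, "D11": 3}  # lower tier = higher risk

findings = [
    {"label": "D11", "tc": "TC-004", "severity": "medium"},
    {"label": "D13", "tc": "TC-001", "severity": "high"},
    {"label": "D12", "tc": "TC-002", "severity": "high"},
]
findings.sort(key=lambda f: (SEVERITY_ORDER[f["severity"]],
                             TAXONOMY_TIER[f["label"]]))
print([f["label"] for f in findings])
```

High-severity D13 and D12 findings surface first; the medium-severity D11 sorts last even though D11 precedes D13 numerically.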

## Phase 6: Coverage Summary

After reporting individual findings, produce aggregate metrics:

1. **Test implementation rate**: automatable test cases with
implementing tests / total automatable test cases.
2. **Assertion coverage**: test cases with complete assertion
coverage / total implemented test cases.
3. **Acceptance criteria coverage**: individual acceptance criteria
verified by test assertions / total acceptance criteria across
all linked requirements.
4. **Manual/deferred test count**: count of test cases classified as
manual-only or deferred (excluded from the audit).
5. **Unmatched test count**: count of test functions in the test code
with no corresponding TC-NNN in the validation plan (regression,
exploratory, or orphaned).
6. **Overall assessment**: a summary judgment of test compliance
(e.g., "High compliance — 2 missing tests" or "Low compliance —
systemic assertion gaps across the test suite").
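The metrics reduce to simple ratios over the Phase 1–4 counts. A sketch with illustrative numbers (all counts here are invented for the example):

```python
# Aggregate coverage metrics; all counts are illustrative.
automatable, implemented = 20, 18
complete_assertions = 15          # implemented cases with full assertions
criteria_verified, criteria_total = 52, 60
manual_or_deferred, unmatched = 4, 7

print(f"implementation rate: {implemented / automatable:.0%}")
print(f"assertion coverage:  {complete_assertions / implemented:.0%}")
print(f"criteria coverage:   {criteria_verified / criteria_total:.0%}")
print(f"excluded (manual/deferred): {manual_or_deferred}")
print(f"unmatched tests (regression/exploratory/orphaned): {unmatched}")
```

Note the different denominators: implementation rate is over automatable cases only, assertion coverage is over implemented cases, and criteria coverage counts individual acceptance criteria, not test cases.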
88 changes: 73 additions & 15 deletions taxonomies/specification-drift.md
@@ -13,6 +13,7 @@ domain: specification-traceability
applicable_to:
- audit-traceability
- audit-code-compliance
- audit-test-compliance
---

# Taxonomy: Specification Drift
@@ -200,32 +201,89 @@ safety-critical, security-related, or regulatory. High for performance
or functional constraints. Assess based on the constraint itself,
not the code's complexity.

## Reserved Labels (Future Use)
## Test Compliance Labels

The following label range is reserved for future specification drift
categories involving test code:
### D11_UNIMPLEMENTED_TEST_CASE

- **D11–D13**: Reserved for **test compliance** drift (validation plan
vs. test code). Example: D11_UNIMPLEMENTED_TEST_CASE — a test case in
the validation plan has no corresponding automated test.
A test case is defined in the validation plan but has no corresponding
automated test in the test code.

These labels will be defined when the `audit-test-compliance` template
is added to the library.
**Pattern**: TC-NNN is specified in the validation plan with steps,
inputs, and expected results. No test function, test class, or test
file in the test code implements this test case — either by name
reference, by TC-NNN identifier, or by behavioral equivalence.

**Risk**: The validation plan claims coverage that does not exist in
the automated test suite. The requirement linked to this test case
is effectively untested in CI, even though the validation plan says
it is covered.

**Severity guidance**: High when the linked requirement is
safety-critical or security-related. Medium for functional
requirements. Note: test cases classified as manual-only or deferred
in the validation plan are excluded from D11 findings and reported
only in the coverage summary.

### D12_UNTESTED_ACCEPTANCE_CRITERION

A test implementation exists for a test case, but it does not assert
one or more acceptance criteria specified for the linked requirement.

**Pattern**: TC-NNN is implemented as an automated test. The linked
requirement (REQ-XXX-NNN) has multiple acceptance criteria. The test
implementation asserts some criteria but omits others — for example,
it checks the happy-path output but does not verify error handling,
boundary conditions, or timing constraints specified in the acceptance
criteria.

**Risk**: The test passes but does not verify the full requirement.
Defects in the untested acceptance criteria will not be caught by CI.
This is the test-code equivalent of D7 (acceptance criteria mismatch
in the validation plan) but at the implementation level.

**Severity guidance**: High when the missing criterion is a security
or safety property. Medium for functional criteria. Assess based on
what the missing criterion protects, not on the test's overall
coverage.

### D13_ASSERTION_MISMATCH

A test implementation exists for a test case, but its assertions do
not match the expected behavior specified in the validation plan.

**Pattern**: TC-NNN is implemented as an automated test. The test
asserts different conditions, thresholds, or outcomes than what the
validation plan specifies — for example, the plan says "verify
response within 200ms" but the test asserts "response is not null",
or the plan says "verify error code 403" but the test asserts "status
is not 200".

**Risk**: The test passes but does not verify what the validation plan
says it should. This creates illusory coverage — the traceability
matrix shows the requirement as tested, but the actual test checks
something different. More dangerous than D11 (missing test) because
it is invisible without comparing test code to the validation plan.

**Severity guidance**: High. This is the most dangerous test
compliance drift type because it creates false confidence. Severity
should be assessed based on the gap between what is asserted and what
should be asserted.

## Ranking Criteria

Within a given severity level, order findings by impact on specification
integrity:

1. **Highest risk**: D6 (constraint violation in design), D7 (illusory
test coverage), and D10 (constraint violation in code) — these
indicate active conflicts between artifacts.
2. **High risk**: D2 (untested requirement), D5 (assumption drift), and
D8 (unimplemented requirement) — these indicate silent gaps that
will surface late.
test coverage), D10 (constraint violation in code), and D13
(assertion mismatch) — these indicate active conflicts between
artifacts.
2. **High risk**: D2 (untested requirement), D5 (assumption drift),
D8 (unimplemented requirement), and D12 (untested acceptance
criterion) — these indicate silent gaps that will surface late.
3. **Medium risk**: D1 (untraced requirement), D3 (orphaned design),
and D9 (undocumented behavior) — these indicate incomplete
traceability that needs human resolution.
D9 (undocumented behavior), and D11 (unimplemented test case) —
these indicate incomplete traceability that needs human resolution.
4. **Lowest risk**: D4 (orphaned test case) — effort misdirection but
no safety or correctness impact.
