Skip to content

Introduce SpecResult to group test results by spec#4289

Merged
habdelra merged 15 commits intomainfrom
consolidate-test-specs
Apr 1, 2026
Merged

Introduce SpecResult to group test results by spec#4289
habdelra merged 15 commits intomainfrom
consolidate-test-specs

Conversation

@habdelra
Copy link
Copy Markdown
Contributor

@habdelra habdelra commented Mar 31, 2026

Summary

  • Introduce SpecResult FieldDef as an intermediate layer between TestRun and TestResultEntry, grouping test results by spec file
  • TestRun.specResults replaces TestRun.results + TestRun.specRef — each SpecResult has its own specRef (CodeRefField) and results (containsMany TestResultEntry)
  • TestRun.passedCount / failedCount are now computed roll-ups across all SpecResults
  • Playwright top-level suites are mapped to individual SpecResult entries during parsing
  • Strip ambient HOST_URL from process.env in the test harness to prevent shell env vars from breaking the hermetic test seal
image

Try it out

Prerequisites

  • mise run dev-all running
  • PR #4265 merged or its branch checked out

Run the smoke test

cd packages/software-factory

MATRIX_URL=http://localhost:8008 \
MATRIX_USERNAME=your-username \
MATRIX_PASSWORD=your-password \
pnpm smoke:test-realm -- \
  --target-realm-url http://localhost:4201/your-username/smoke-test-realm/

What to expect for TestRun artifacts

After the smoke test completes, navigate to your smoke-test-realm workspace in the Boxel app:

  1. Test Runs/ folder — contains TestRun cards like hello-smoke-1 and hello-fail-1
  2. Fitted view of a TestRun shows: #1 sequence number, status badge (passed/failed), aggregated pass/fail counts across all specs, and duration
  3. Isolated view of a TestRun now shows results organized by spec:
    • Each spec appears as a section header with its name (from specRef.module — the Playwright suite title / spec file path) and its own pass/total count
    • Individual test results are listed under their spec with status icons, test names, and durations
    • Failures section still shows a flat list of all failures across specs with error messages and stack traces
  4. Test artifacts realm — card instances created during spec execution (test data) land in a separate auto-created realm ({ProjectName} Test Artifacts), keeping test data separate from implementation artifacts

Card structure

TestRun
├── sequenceNumber, status, runAt, completedAt, durationMs
├── passedCount (computed: sum of specResults[].passedCount)
├── failedCount (computed: sum of specResults[].failedCount)
└── specResults: SpecResult[]
    ├── specRef: CodeRefField { module: "Tests/hello-smoke.spec.ts", name: "default" }
    ├── passedCount (computed from results)
    ├── failedCount (computed from results)
    └── results: TestResultEntry[]
        ├── testName, status, message, stackTrace, durationMs
        └── ...

Test plan

  • pnpm test:node — 321/321 unit tests pass
  • pnpm test:playwright — 14/14 Playwright tests pass (including HOST_URL sanitization fix)
  • npx tsc --noEmit — no type errors in software-factory files

🤖 Generated with Claude Code

TestRun previously held a flat list of TestResultEntry and a single
specRef. This restructures TestRun to contain multiple SpecResults,
each grouping test results under a spec reference. This enables
multi-spec test runs where results are organized by spec file.

- Add SpecResult FieldDef with specRef, results, passedCount/failedCount
- TestRun.specResults replaces TestRun.results + TestRun.specRef
- TestRun passedCount/failedCount aggregate across all SpecResults
- Parsing groups Playwright top-level suites into SpecResult entries
- Templates show results organized by spec in isolated view
- Strip ambient HOST_URL from process.env in harness to prevent
  shell env vars from breaking the hermetic test seal

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 61f1df810b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@github-actions
Copy link
Copy Markdown

Realm Server Test Results

  1 files  ±0    1 suites  ±0   11m 28s ⏱️ -39s
804 tests +2  804 ✅ +2  0 💤 ±0  0 ❌ ±0 
875 runs  +2  875 ✅ +2  0 💤 ±0  0 ❌ ±0 

Results for commit 61f1df8. ± Comparison against base commit 20a185a.

@github-actions
Copy link
Copy Markdown

Host Test Results

    1 files  ±0      1 suites  ±0   2h 12m 52s ⏱️ - 1m 32s
2 062 tests ±0  2 047 ✅ ±0  15 💤 ±0  0 ❌ ±0 
2 077 runs  ±0  2 062 ✅ ±0  15 💤 ±0  0 ❌ ±0 

Results for commit 61f1df8. ± Comparison against base commit 20a185a.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new SpecResult grouping layer for test runs so TestRun results can be organized by spec/suite, and updates parsing + card/document wiring to use specResults instead of a flat results list. It also hardens the test harness environment by sanitizing an ambient HOST_URL that could break hermetic test execution.

Changes:

  • Add SpecResult FieldDef and switch TestRun to specResults with roll-up passedCount/failedCount.
  • Update test-run parsing/execution/card document creation to emit and consume specResults.
  • Sanitize process.env.HOST_URL in the harness to prevent inherited shell env from leaking into child processes.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
packages/software-factory/tests/factory-test-realm.test.ts Updates unit tests to assert against specResults instead of results.
packages/software-factory/src/harness/shared.ts Removes ambient HOST_URL from the harness process environment.
packages/software-factory/scripts/lib/test-run-types.ts Replaces results with specResults in the serialized TestRun types; adds SpecResultData.
packages/software-factory/scripts/lib/test-run-parsing.ts Parses Playwright JSON into grouped specResults; updates failure/status/count derivation accordingly.
packages/software-factory/scripts/lib/test-run-execution.ts Updates resume logic and error completion paths to use specResults.
packages/software-factory/scripts/lib/test-run-cards.ts Builds TestRun documents with pre-populated specResults (pending entries) and writes completion with specResults.
packages/software-factory/scripts/lib/factory-test-realm.ts Re-exports SpecResultData from the library entrypoint.
packages/software-factory/realm/test-results.gts Adds SpecResult FieldDef, updates TestRun card schema/UI to render “Results by Spec”, and rolls up counts.
packages/software-factory/docs/software-factory-testing-strategy.md Documents the new specResults structure and grouping behavior.
packages/software-factory/docs/one-shot-factory-go-plan.md Updates the card definition overview to include SpecResult and the roll-up model.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

habdelra and others added 13 commits March 31, 2026 18:12
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ment, port timeout

- Gate `delete process.env.HOST_URL` behind NODE_ENV === 'test'
- SpecResult.Embedded.total uses results.length instead of passedCount + failedCount
  (shows correct total during running state when tests are pending)
- Update stale doc comment on TestRunAttributes to reference specResults
- Add comment on extractGroupedPlaywrightResults noting single-project assumption
- Bump waitForPortFree timeout from 10s to 30s to reduce CI port contention flakes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ENV guard

- SpecResult.Embedded shows "running..." instead of pass counts while
  any test result is still pending; shows counts once all tests complete
- Revert NODE_ENV guard on HOST_URL deletion — harness/shared.ts is only
  imported by test infrastructure, and NODE_ENV is not guaranteed to be
  'test' when the harness runs (Playwright global setup doesn't set it)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…stRun fetch retry

- Smoke test now runs both passing and failing specs in a single
  executeTestRunFromRealm call, producing one TestRun with two SpecResults
- Add retry logic to completeTestRun read — after a long spawnSync
  (Playwright), TCP connections can be stale causing the first fetch to
  fail with "fetch failed"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nt as realm name

- createRealm now calls addRealmToMatrixAccountData even when the realm
  already exists, ensuring it always appears in the Boxel dashboard
- Smoke test uses the URL endpoint segment as the realm display name
  instead of a hardcoded 'Smoke Test Realm'

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The fitted Project card and failure pre blocks could overflow their
sections, causing visual clipping of the "Failures" heading.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents the embedded card view from overflowing its section.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wrap fitted Project/Ticket cards in a max-height container to prevent
them from overflowing into the Failures section below.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The fitted format renders in a fixed-height container that clips content.
Embedded format flows naturally and shows the full card content.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Playwright error messages contain terminal color codes (ESC[31m etc.)
that render as garbled text in the card UI. Strip them during parsing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@habdelra habdelra requested a review from a team April 1, 2026 00:11
@habdelra habdelra merged commit 9137c09 into main Apr 1, 2026
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants