Introduce SpecResult to group test results by spec#4289
Conversation
TestRun previously held a flat list of TestResultEntry and a single specRef. This restructures TestRun to contain multiple SpecResults, each grouping test results under a spec reference. This enables multi-spec test runs where results are organized by spec file. - Add SpecResult FieldDef with specRef, results, passedCount/failedCount - TestRun.specResults replaces TestRun.results + TestRun.specRef - TestRun passedCount/failedCount aggregate across all SpecResults - Parsing groups Playwright top-level suites into SpecResult entries - Templates show results organized by spec in isolated view - Strip ambient HOST_URL from process.env in harness to prevent shell env vars from breaking the hermetic test seal Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 61f1df810b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Pull request overview
This PR introduces a new SpecResult grouping layer for test runs so TestRun results can be organized by spec/suite, and updates parsing + card/document wiring to use specResults instead of a flat results list. It also hardens the test harness environment by sanitizing an ambient HOST_URL that could break hermetic test execution.
Changes:
- Add
SpecResultFieldDef and switchTestRuntospecResultswith roll-uppassedCount/failedCount. - Update test-run parsing/execution/card document creation to emit and consume
specResults. - Sanitize
process.env.HOST_URLin the harness to prevent inherited shell env from leaking into child processes.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/software-factory/tests/factory-test-realm.test.ts | Updates unit tests to assert against specResults instead of results. |
| packages/software-factory/src/harness/shared.ts | Removes ambient HOST_URL from the harness process environment. |
| packages/software-factory/scripts/lib/test-run-types.ts | Replaces results with specResults in the serialized TestRun types; adds SpecResultData. |
| packages/software-factory/scripts/lib/test-run-parsing.ts | Parses Playwright JSON into grouped specResults; updates failure/status/count derivation accordingly. |
| packages/software-factory/scripts/lib/test-run-execution.ts | Updates resume logic and error completion paths to use specResults. |
| packages/software-factory/scripts/lib/test-run-cards.ts | Builds TestRun documents with pre-populated specResults (pending entries) and writes completion with specResults. |
| packages/software-factory/scripts/lib/factory-test-realm.ts | Re-exports SpecResultData from the library entrypoint. |
| packages/software-factory/realm/test-results.gts | Adds SpecResult FieldDef, updates TestRun card schema/UI to render “Results by Spec”, and rolls up counts. |
| packages/software-factory/docs/software-factory-testing-strategy.md | Documents the new specResults structure and grouping behavior. |
| packages/software-factory/docs/one-shot-factory-go-plan.md | Updates the card definition overview to include SpecResult and the roll-up model. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ment, port timeout - Gate `delete process.env.HOST_URL` behind NODE_ENV === 'test' - SpecResult.Embedded.total uses results.length instead of passedCount + failedCount (shows correct total during running state when tests are pending) - Update stale doc comment on TestRunAttributes to reference specResults - Add comment on extractGroupedPlaywrightResults noting single-project assumption - Bump waitForPortFree timeout from 10s to 30s to reduce CI port contention flakes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ENV guard - SpecResult.Embedded shows "running..." instead of pass counts while any test result is still pending; shows counts once all tests complete - Revert NODE_ENV guard on HOST_URL deletion — harness/shared.ts is only imported by test infrastructure, and NODE_ENV is not guaranteed to be 'test' when the harness runs (Playwright global setup doesn't set it) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…stRun fetch retry - Smoke test now runs both passing and failing specs in a single executeTestRunFromRealm call, producing one TestRun with two SpecResults - Add retry logic to completeTestRun read — after a long spawnSync (Playwright), TCP connections can be stale causing the first fetch to fail with "fetch failed" Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nt as realm name - createRealm now calls addRealmToMatrixAccountData even when the realm already exists, ensuring it always appears in the Boxel dashboard - Smoke test uses the URL endpoint segment as the realm display name instead of a hardcoded 'Smoke Test Realm' Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The fitted Project card and failure pre blocks could overflow their sections, causing visual clipping of the "Failures" heading. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents the embedded card view from overflowing its section. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wrap fitted Project/Ticket cards in a max-height container to prevent them from overflowing into the Failures section below. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The fitted format renders in a fixed-height container that clips content. Embedded format flows naturally and shows the full card content. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Playwright error messages contain terminal color codes (ESC[31m etc.) that render as garbled text in the card UI. Strip them during parsing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
SpecResultFieldDef as an intermediate layer betweenTestRunandTestResultEntry, grouping test results by spec fileTestRun.specResultsreplacesTestRun.results+TestRun.specRef— eachSpecResulthas its ownspecRef(CodeRefField) andresults(containsMany TestResultEntry)TestRun.passedCount/failedCountare now computed roll-ups across all SpecResultsHOST_URLfromprocess.envin the test harness to prevent shell env vars from breaking the hermetic test sealTry it out
Prerequisites
mise run dev-allrunningRun the smoke test
cd packages/software-factory MATRIX_URL=http://localhost:8008 \ MATRIX_USERNAME=your-username \ MATRIX_PASSWORD=your-password \ pnpm smoke:test-realm -- \ --target-realm-url http://localhost:4201/your-username/smoke-test-realm/What to expect for TestRun artifacts
After the smoke test completes, navigate to your
smoke-test-realmworkspace in the Boxel app:Test Runs/folder — contains TestRun cards likehello-smoke-1andhello-fail-1#1sequence number, status badge (passed/failed), aggregated pass/fail counts across all specs, and durationspecRef.module— the Playwright suite title / spec file path) and its own pass/total count{ProjectName} Test Artifacts), keeping test data separate from implementation artifactsCard structure
Test plan
pnpm test:node— 321/321 unit tests passpnpm test:playwright— 14/14 Playwright tests pass (includingHOST_URLsanitization fix)npx tsc --noEmit— no type errors in software-factory files🤖 Generated with Claude Code