Introduce SpecResult to group test results by spec by habdelra · Pull Request #4289 · cardstack/boxel

habdelra · 2026-03-31T15:23:18Z

Summary

Introduce SpecResult FieldDef as an intermediate layer between TestRun and TestResultEntry, grouping test results by spec file
TestRun.specResults replaces TestRun.results + TestRun.specRef — each SpecResult has its own specRef (CodeRefField) and results (containsMany TestResultEntry)
TestRun.passedCount / failedCount are now computed roll-ups across all SpecResults
Playwright top-level suites are mapped to individual SpecResult entries during parsing
Strip ambient HOST_URL from process.env in the test harness to prevent shell env vars from breaking the hermetic test seal

Try it out

Prerequisites

mise run dev-all running
PR #4265 merged or its branch checked out

Run the smoke test

cd packages/software-factory

MATRIX_URL=http://localhost:8008 \
MATRIX_USERNAME=your-username \
MATRIX_PASSWORD=your-password \
pnpm smoke:test-realm -- \
  --target-realm-url http://localhost:4201/your-username/smoke-test-realm/

What to expect for TestRun artifacts

After the smoke test completes, navigate to your smoke-test-realm workspace in the Boxel app:

Test Runs/ folder — contains TestRun cards like hello-smoke-1 and hello-fail-1
Fitted view of a TestRun shows: #1 sequence number, status badge (passed/failed), aggregated pass/fail counts across all specs, and duration
Isolated view of a TestRun now shows results organized by spec:
- Each spec appears as a section header with its name (from specRef.module — the Playwright suite title / spec file path) and its own pass/total count
- Individual test results are listed under their spec with status icons, test names, and durations
- Failures section still shows a flat list of all failures across specs with error messages and stack traces
Test artifacts realm — card instances created during spec execution (test data) land in a separate auto-created realm ({ProjectName} Test Artifacts), keeping test data separate from implementation artifacts

Card structure

TestRun
├── sequenceNumber, status, runAt, completedAt, durationMs
├── passedCount (computed: sum of specResults[].passedCount)
├── failedCount (computed: sum of specResults[].failedCount)
└── specResults: SpecResult[]
    ├── specRef: CodeRefField { module: "Tests/hello-smoke.spec.ts", name: "default" }
    ├── passedCount (computed from results)
    ├── failedCount (computed from results)
    └── results: TestResultEntry[]
        ├── testName, status, message, stackTrace, durationMs
        └── ...

Test plan

pnpm test:node — 321/321 unit tests pass
pnpm test:playwright — 14/14 Playwright tests pass (including HOST_URL sanitization fix)
npx tsc --noEmit — no type errors in software-factory files

🤖 Generated with Claude Code

TestRun previously held a flat list of TestResultEntry and a single specRef. This restructures TestRun to contain multiple SpecResults, each grouping test results under a spec reference. This enables multi-spec test runs where results are organized by spec file. - Add SpecResult FieldDef with specRef, results, passedCount/failedCount - TestRun.specResults replaces TestRun.results + TestRun.specRef - TestRun passedCount/failedCount aggregate across all SpecResults - Parsing groups Playwright top-level suites into SpecResult entries - Templates show results organized by spec in isolated view - Strip ambient HOST_URL from process.env in harness to prevent shell env vars from breaking the hermetic test seal Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 61f1df810b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

packages/software-factory/scripts/lib/test-run-parsing.ts

github-actions · 2026-03-31T15:47:29Z

Realm Server Test Results

1 files ±0 1 suites ±0 11m 28s ⏱️ -39s
804 tests +2 804 ✅ +2 0 💤 ±0 0 ❌ ±0
875 runs +2 875 ✅ +2 0 💤 ±0 0 ❌ ±0

Results for commit 61f1df8. ± Comparison against base commit 20a185a.

github-actions · 2026-03-31T15:48:44Z

Host Test Results

1 files ±0 1 suites ±0 2h 12m 52s ⏱️ - 1m 32s
2 062 tests ±0 2 047 ✅ ±0 15 💤 ±0 0 ❌ ±0
2 077 runs ±0 2 062 ✅ ±0 15 💤 ±0 0 ❌ ±0

Results for commit 61f1df8. ± Comparison against base commit 20a185a.

Copilot

Pull request overview

This PR introduces a new SpecResult grouping layer for test runs so TestRun results can be organized by spec/suite, and updates parsing + card/document wiring to use specResults instead of a flat results list. It also hardens the test harness environment by sanitizing an ambient HOST_URL that could break hermetic test execution.

Changes:

Add SpecResult FieldDef and switch TestRun to specResults with roll-up passedCount/failedCount.
Update test-run parsing/execution/card document creation to emit and consume specResults.
Sanitize process.env.HOST_URL in the harness to prevent inherited shell env from leaking into child processes.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
packages/software-factory/tests/factory-test-realm.test.ts	Updates unit tests to assert against `specResults` instead of `results`.
packages/software-factory/src/harness/shared.ts	Removes ambient `HOST_URL` from the harness process environment.
packages/software-factory/scripts/lib/test-run-types.ts	Replaces `results` with `specResults` in the serialized TestRun types; adds `SpecResultData`.
packages/software-factory/scripts/lib/test-run-parsing.ts	Parses Playwright JSON into grouped `specResults`; updates failure/status/count derivation accordingly.
packages/software-factory/scripts/lib/test-run-execution.ts	Updates resume logic and error completion paths to use `specResults`.
packages/software-factory/scripts/lib/test-run-cards.ts	Builds TestRun documents with pre-populated `specResults` (pending entries) and writes completion with `specResults`.
packages/software-factory/scripts/lib/factory-test-realm.ts	Re-exports `SpecResultData` from the library entrypoint.
packages/software-factory/realm/test-results.gts	Adds `SpecResult` FieldDef, updates `TestRun` card schema/UI to render “Results by Spec”, and rolls up counts.
packages/software-factory/docs/software-factory-testing-strategy.md	Documents the new `specResults` structure and grouping behavior.
packages/software-factory/docs/one-shot-factory-go-plan.md	Updates the card definition overview to include `SpecResult` and the roll-up model.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

packages/software-factory/src/harness/shared.ts

packages/software-factory/realm/test-results.gts

packages/software-factory/scripts/lib/test-run-types.ts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ment, port timeout - Gate `delete process.env.HOST_URL` behind NODE_ENV === 'test' - SpecResult.Embedded.total uses results.length instead of passedCount + failedCount (shows correct total during running state when tests are pending) - Update stale doc comment on TestRunAttributes to reference specResults - Add comment on extractGroupedPlaywrightResults noting single-project assumption - Bump waitForPortFree timeout from 10s to 30s to reduce CI port contention flakes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ENV guard - SpecResult.Embedded shows "running..." instead of pass counts while any test result is still pending; shows counts once all tests complete - Revert NODE_ENV guard on HOST_URL deletion — harness/shared.ts is only imported by test infrastructure, and NODE_ENV is not guaranteed to be 'test' when the harness runs (Playwright global setup doesn't set it) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…stRun fetch retry - Smoke test now runs both passing and failing specs in a single executeTestRunFromRealm call, producing one TestRun with two SpecResults - Add retry logic to completeTestRun read — after a long spawnSync (Playwright), TCP connections can be stale causing the first fetch to fail with "fetch failed" Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…nt as realm name - createRealm now calls addRealmToMatrixAccountData even when the realm already exists, ensuring it always appears in the Boxel dashboard - Smoke test uses the URL endpoint segment as the realm display name instead of a hardcoded 'Smoke Test Realm' Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The fitted Project card and failure pre blocks could overflow their sections, causing visual clipping of the "Failures" heading. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Prevents the embedded card view from overflowing its section. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Wrap fitted Project/Ticket cards in a max-height container to prevent them from overflowing into the Failures section below. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The fitted format renders in a fixed-height container that clips content. Embedded format flows naturally and shows the full card content. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Playwright error messages contain terminal color codes (ESC[31m etc.) that render as garbled text in the card UI. Strip them during parsing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector bot reviewed Mar 31, 2026

View reviewed changes

packages/software-factory/scripts/lib/test-run-parsing.ts Show resolved Hide resolved

Merge remote-tracking branch 'origin/main' into consolidate-test-specs

daf0d73

habdelra requested a review from Copilot March 31, 2026 22:01

Copilot started reviewing on behalf of habdelra March 31, 2026 22:02 View session

Copilot AI reviewed Mar 31, 2026

View reviewed changes

packages/software-factory/src/harness/shared.ts Show resolved Hide resolved

packages/software-factory/realm/test-results.gts Outdated Show resolved Hide resolved

packages/software-factory/scripts/lib/test-run-types.ts Show resolved Hide resolved

habdelra and others added 13 commits March 31, 2026 18:12

Fix prettier formatting in test-results.gts

2a5b9ec

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Use 2 Playwright workers consistently (local and CI)

6a1cb4c

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix CSS overflow in TestRun isolated view

177aa6d

The fitted Project card and failure pre blocks could overflow their sections, causing visual clipping of the "Failures" heading. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Use fitted format for Project/Ticket in TestRun isolated view

f0c6376

Prevents the embedded card view from overflowing its section. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Constrain linked card height in TestRun isolated view

cf85daf

Wrap fitted Project/Ticket cards in a max-height container to prevent them from overflowing into the Failures section below. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix linked card container: use margin instead of max-height clipping

fdd599c

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Use embedded format for linked cards in TestRun isolated view

6e1ea96

The fitted format renders in a fixed-height container that clips content. Embedded format flows naturally and shows the full card content. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix prettier: single quotes in template format attributes

f6bbdb8

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Strip ANSI escape codes from Playwright error messages

bc18a85

Playwright error messages contain terminal color codes (ESC[31m etc.) that render as garbled text in the card UI. Strip them during parsing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

habdelra requested a review from a team April 1, 2026 00:11

lukemelia approved these changes Apr 1, 2026

View reviewed changes

habdelra merged commit 9137c09 into main Apr 1, 2026
19 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce SpecResult to group test results by spec#4289

Introduce SpecResult to group test results by spec#4289
habdelra merged 15 commits intomainfrom
consolidate-test-specs

habdelra commented Mar 31, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

Uh oh!

github-actions bot commented Mar 31, 2026

Uh oh!

github-actions bot commented Mar 31, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

habdelra commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Try it out

Prerequisites

Run the smoke test

What to expect for TestRun artifacts

Card structure

Test plan

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

github-actions bot commented Mar 31, 2026

Realm Server Test Results

Uh oh!

github-actions bot commented Mar 31, 2026

Host Test Results

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

habdelra commented Mar 31, 2026 •

edited

Loading