braintrustdata · Luca Forstner (lforst) · May 20, 2026 · May 18, 2026 · May 18, 2026 · May 19, 2026
diff --git a/.agents/skills/e2e-tests/SKILL.md b/.agents/skills/e2e-tests/SKILL.md
@@ -60,6 +60,7 @@ Cassettes mock provider HTTP responses (OpenAI, Anthropic, ...) so external-prov
 - Run every scenario through `withScenarioHarness(...)`.
 - Keep reusable logic in `e2e/helpers/`. Keep one-off fixtures and scenario-specific files inside the scenario directory.
 - Snapshot stable contracts, not raw noise. Use `normalizeForSnapshot(...)` before inline snapshots and `formatJsonFileSnapshot(...)` plus file snapshots for larger payloads or version matrices.
+- For span-tree snapshots, use `matchSpanTreeSnapshot(...)`. It writes and asserts paired `.span-tree.json` and `.span-tree.txt` files from the same normalized span tree. The JSON file is the structural contract that is easiest to parse mechanically; the TXT file is the ASCII tree that is easiest to review by eye. Keep them in sync by updating both through the e2e update/record commands, and never hand-edit only one side of the pair.
 - When a scenario family already has `assertions.ts`, keep version- or provider-specific test setup in `scenario.test.ts` and reuse the shared assertions file.
 - Keep the CI e2e summary up to date. If a scenario version matrix or `variantKey` changes, update `e2e/config/pr-comment-scenarios.json` in the same change and follow the established pattern used by other versioned scenarios: one summary row per version, not separate wrapped/auto rows unless that pattern already exists for the scenario family.
 - Run new or updated scenarios three times in a row before considering snapshots stable.
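
Review note on the paired span-tree files: the added guidance says both files come from the same normalized span tree, which is why hand-editing one side breaks the pair. The sketch below illustrates that idea only. It is not the implementation of `matchSpanTreeSnapshot(...)`; the `SpanNode` shape, `renderAsciiTree`, and `renderSpanTreePair` are hypothetical names introduced for illustration, and the real tree carries many more fields (input, output, scores, metadata, and so on).

```typescript
// Hypothetical, minimal span shape (an assumption for this sketch only).
interface SpanNode {
  name: string;
  type?: string;
  children: SpanNode[];
}

// Render one node and its descendants as plain-ASCII tree lines.
function renderAsciiTree(
  node: SpanNode,
  prefix = "",
  isLast = true,
  isRoot = true,
): string[] {
  const label = node.type ? `${node.name} [${node.type}]` : node.name;
  // The root has no branch marker; children get "|-- " or "\-- " (last child).
  const line = isRoot
    ? label
    : `${prefix}${isLast ? "\\-- " : "|-- "}${label}`;
  // Descendants of a non-last child keep a vertical guide in their prefix.
  const childPrefix = isRoot ? "" : `${prefix}${isLast ? "    " : "|   "}`;
  const lines = [line];
  node.children.forEach((child, index) => {
    const childIsLast = index === node.children.length - 1;
    lines.push(...renderAsciiTree(child, childPrefix, childIsLast, false));
  });
  return lines;
}

// Derive both snapshot bodies from the same tree so they cannot disagree.
function renderSpanTreePair(root: SpanNode): { json: string; txt: string } {
  return {
    json: `${JSON.stringify(root, null, 2)}\n`,
    txt: `${renderAsciiTree(root).join("\n")}\n`,
  };
}
```

Because both strings are derived from one input, regenerating through the update or record commands keeps them consistent, whereas editing only the `.txt` or only the `.json` by hand would leave the other assertion failing.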

diff --git a/AGENTS.md b/AGENTS.md
@@ -59,9 +59,14 @@ Each scenario runs the SDK in a subprocess against a mock Braintrust server and
 
 ```bash
 pnpm run test:e2e                 # Run all e2e scenarios (from repo root)
+pnpm run test:e2e:update          # Update e2e snapshots without re-recording cassettes
 pnpm run test:e2e:record          # Re-record provider cassettes and update snapshots
 ```
 
+When adding or modifying e2e tests, run the relevant e2e verification twice before stopping so flakes are caught proactively. After running `pnpm run test:e2e:update` or `pnpm run test:e2e:record`, always run the normal e2e tests afterward to verify there is no snapshot drift or unstable output.
+
+Span-tree snapshots are paired: `*.span-tree.json` is the structural contract, and `*.span-tree.txt` is the human-readable ASCII tree generated from the same normalized spans. Both files are asserted and should be updated together through `pnpm run test:e2e:update` or `pnpm run test:e2e:record`; do not hand-edit only one side of the pair.
+
 **From repo root:**
 
 ```bash

diff --git a/e2e/README.md b/e2e/README.md
@@ -34,7 +34,7 @@ Any extra files needed only by one scenario stay in that scenario folder. Anythi
 - `scenario-installer.ts` - Installs optional scenario-local dependencies from a colocated `package.json` into a shared cache and links them into prepared scenario copies.
 - `mock-braintrust-server.ts` - Captures requests, merged log payloads, and parsed span-like events.
 - `normalize.ts` - Makes snapshots deterministic by normalizing ids, timestamps, paths, and mock-server URLs.
-- `trace-selectors.ts` / `trace-summary.ts` - Helpers for finding spans and snapshotting only the relevant shape.
+- `trace-selectors.ts` / `span-tree.ts` / `trace-summary.ts` - Helpers for finding spans and snapshotting stable, human-readable trace trees.
 - `scenario-runtime.ts` - Shared runtime utilities used by scenario entrypoints.
 - `openai.ts` - Shared scenario lists and assertions for OpenAI wrapper and hook coverage across v4/v5/v6.
 - `wrapper-contract.ts` - Helpers for snapshotting wrapper span contracts and filtering payload rows by root span id.
@@ -66,7 +66,7 @@ The main utilities you'll use in test files:
 - `events()`, `payloads()`, `requestCursor()`, `requestsAfter()` - Lower-level access for ingestion payloads and HTTP request flow assertions.
 - `testRunId` - Useful when a scenario or assertion needs the exact run marker.
 
-Use `normalizeForSnapshot(...)` before snapshotting. It replaces timestamps and ids with stable tokens and strips machine-specific paths and localhost ports.
+Prefer `matchSpanTreeSnapshot(...)` for span snapshots. It asserts both a structural `.span-tree.json` snapshot and a human-readable `.span-tree.txt` tree beside it. Both files are generated from the same normalized span tree and include stable span attributes, input, output, expected values, scores, tags, metadata, metrics, and errors. Use `normalizeForSnapshot(...)` for non-span JSON snapshots; it replaces timestamps and ids with stable tokens and strips machine-specific paths and localhost ports.
 
 ### Provider scenario cassettes
 
@@ -79,7 +79,7 @@ Wrapper scenarios often create a root span with `testRunId` metadata and then le
 - Use `events()` rather than `testRunEvents()` to inspect the full trace tree.
 - Find the scenario root span first.
 - Scope raw payload snapshots by `root_span_id` using `payloadRowsForRootSpan(...)`.
-- Pair a normalized `span-events` snapshot with a normalized `log-payloads` snapshot.
+- Prefer normalized span-tree snapshots from `matchSpanTreeSnapshot(...)`. The `.json` sibling is the structural contract, and the `.txt` sibling is the ASCII tree for review; both are asserted and should be updated together.
 - If the wrapper has an explicit support matrix, reuse one shared test across version-specific scenario entries instead of duplicating the assertions. The AI SDK wrapper scenario uses this for supported v3-v6 package combinations.
 
 ### Runner-wrapper scenario pattern
@@ -122,6 +122,7 @@ Scenario-local manifests are optional and should stay slim. They are only for sc
 
 ```bash
 pnpm run test:e2e # Run all e2e tests
+pnpm run test:e2e:update # Update snapshots in cassette replay mode
 pnpm run test:e2e:record # Re-record provider cassettes and update snapshots
 pnpm run test:e2e:record -- <name> # Re-record one scenario from the repo root
 pnpm run test:e2e:canary # Run canary e2e tests

diff --git a/e2e/helpers/normalize.ts b/e2e/helpers/normalize.ts
@@ -20,12 +20,23 @@ const UUID_REGEX =
   /^[0-9a-f]{8}-[0-9a-f]{4}-[1-8][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;
 const UUID_SUBSTRING_REGEX =
   /[0-9a-f]{8}-[0-9a-f]{4}-[1-8][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}/gi;
-const TIME_KEYS = new Set(["created", "date", "start", "end"]);
+const TIME_KEYS = new Set([
+  "completed_at",
+  "created",
+  "created_at",
+  "date",
+  "end",
+  "expires_at",
+  "start",
+  "started_at",
+  "updated_at",
+]);
 const SPAN_ID_KEYS = new Set(["id", "span_id", "root_span_id"]);
 const ZERO_NUMBER_KEYS = new Set([
   "avgLogprobs",
   "caller_lineno",
   "duration",
+  "github_copilot.context_window.current",
   "time_to_first_token",
 ]);
 const XACT_VERSION_KEYS = new Set([
@@ -46,7 +57,13 @@ const DYNAMIC_HEADER_KEYS = new Set([
   "x-ratelimit-reset-tokens",
   "x-request-id",
 ]);
-const PROVIDER_ID_KEYS = new Set(["itemId", "responseId", "toolCallId"]);
+const PROVIDER_ID_KEYS = new Set([
+  "agentId",
+  "claude_agent_sdk.task_id",
+  "itemId",
+  "responseId",
+  "toolCallId",
+]);
 const PROJECT_ID_KEYS = new Set(["project_id", "projectId"]);
 const PROJECT_NAME_KEYS = new Set(["project_name", "projectName"]);
 const HELPERS_DIR = path.dirname(fileURLToPath(import.meta.url));
@@ -219,7 +236,12 @@ function normalizeValue(
   }
 
   if (typeof value === "number") {
-    if (currentKey && ZERO_NUMBER_KEYS.has(currentKey)) {
+    if (
+      currentKey &&
+      (ZERO_NUMBER_KEYS.has(currentKey) ||
+        currentKey.endsWith("_ms") ||
+        currentKey.endsWith("Ms"))
+    ) {
       return 0;
     }
     if (currentKey && TIME_KEYS.has(currentKey)) {
@@ -240,6 +262,14 @@ function normalizeValue(
       return normalizeCallerFilename(value);
     }
 
+    if (currentKey === "openai_codex.working_directory") {
+      const normalizedPath = value.replace(/\\/g, "/");
+      const match = normalizedPath.match(
+        /\/braintrust-codex-e2e-[^/]+\/([^/]+)$/,
+      );
+      return match ? `<tmp>/braintrust-codex-e2e/${match[1]}` : "<tmp>";
+    }
+
     if (currentKey === "_xact_id") {
       return tokenFor(tokenMaps.xacts, value, "xact");
     }
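
Review note on the two new `normalizeValue` branches: the standalone sketch below restates them outside the real function so their behavior can be checked in isolation. `normalizeNumber` and `normalizeCodexWorkingDirectory` are hypothetical names used only here; the actual logic lives inline in `normalizeValue(...)`, which also handles `TIME_KEYS`, token maps, and other cases that this sketch omits.

```typescript
// Subset of the keys from the diff; the Set contents mirror the changed lines.
const ZERO_NUMBER_KEYS = new Set<string>([
  "avgLogprobs",
  "caller_lineno",
  "duration",
  "github_copilot.context_window.current",
  "time_to_first_token",
]);

// Numeric branch: listed keys, plus any key ending in "_ms" or "Ms",
// collapse to 0. Other numbers pass through (the real function goes on
// to check TIME_KEYS, which is left out of this sketch).
function normalizeNumber(key: string | undefined, value: number): number {
  if (
    key &&
    (ZERO_NUMBER_KEYS.has(key) || key.endsWith("_ms") || key.endsWith("Ms"))
  ) {
    return 0;
  }
  return value;
}

// String branch for "openai_codex.working_directory": convert backslashes
// to forward slashes, then keep only the final path segment under the
// randomized temp directory. Anything that does not match becomes "<tmp>".
function normalizeCodexWorkingDirectory(value: string): string {
  const normalizedPath = value.replace(/\\/g, "/");
  const match = normalizedPath.match(/\/braintrust-codex-e2e-[^/]+\/([^/]+)$/);
  return match ? `<tmp>/braintrust-codex-e2e/${match[1]}` : "<tmp>";
}
```

Two consequences follow from the code as written. The suffix check applies to every numeric key ending in `_ms` or `Ms`, so a deterministic value such as a configured timeout would also be snapshotted as `0`. The path regex requires exactly one segment after the temp directory with no trailing slash, so deeper or slash-terminated paths fall back to `<tmp>`.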