Skip to content

feat(otel): Add distributed tracing across app and workspace server#2668

Draft
pauldambra wants to merge 5 commits into
mainfrom
posthog-code/otel-tracing
Draft

feat(otel): Add distributed tracing across app and workspace server#2668
pauldambra wants to merge 5 commits into
mainfrom
posthog-code/otel-tracing

Conversation

@pauldambra

Copy link
Copy Markdown
Member

Problem

We lacked end-to-end visibility into requests flowing between the Electron main process and the workspace server. This made it difficult to diagnose latency and trace tRPC calls across process boundaries.

Changes

  • Added OpenTelemetry distributed tracing across the app and workspace server.
  • Introduced a new otel-trace.ts utility in the main process exposing initOtelTracing, getMainTracer, and shutdownOtelTracing, backed by the workspace server's node-tracing module.
  • Initialized OTEL tracing at app startup in main/index.ts and wired graceful shutdown of the trace exporter into the app lifecycle service.
  • Added a tRPC tracing middleware (tracingMonitor) that wraps each call via traceTrpcCall, composed alongside the existing call-rate monitor on publicProcedure.
  • Tracing is gated on PostHog configuration (VITE_POSTHOG_API_KEY / VITE_POSTHOG_API_HOST) and returns null when not configured.
  • Added the required OpenTelemetry dependencies (@opentelemetry/api, context-zone, exporter-trace-otlp-http, sdk-trace-base, sdk-trace-node, sdk-trace-web).
  • Added unit tests (otel-trace.test.ts) covering initialization gating, provider registration, and shutdown flush behavior.

How did you test this?

  • Added and ran unit tests in apps/code/src/main/utils/otel-trace.test.ts covering:
    • Returning null when the API key or host is missing/invalid.
    • Registering a provider and returning a tracer when configured.
    • Flushing and shutting down the provider on shutdown.
    • No-op shutdown when the provider was never created.

Automatic notifications

  • Publish to changelog?
  • Alert Sales and Marketing teams?

Created with PostHog Code

@github-actions

github-actions Bot commented Jun 14, 2026

Copy link
Copy Markdown

React Doctor found no issues in the changed files. 🎉

Reviewed by React Doctor for commit 4e9526a.

@greptile-apps

greptile-apps Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Comments Outside Diff (1)

  1. packages/workspace-server/src/serve.ts, line 30-33 (link)

    P1 Fire-and-forget shutdown loses spans

    void shutdownOtelTracing() launches the forceFlush + provider.shutdown() async chain but never awaits it. The process can exit (either immediately on the process.exit(0) path or after SHUTDOWN_GRACE_MS = 3 000 ms) before the flush completes, silently dropping the last batch of spans. The main-process counterpart in app-lifecycle/service.ts correctly awaits shutdownOtelTracing() — the workspace-server shutdown should do the same.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: packages/workspace-server/src/serve.ts
    Line: 30-33
    
    Comment:
    **Fire-and-forget shutdown loses spans**
    
    `void shutdownOtelTracing()` launches the `forceFlush` + `provider.shutdown()` async chain but never awaits it. The process can exit (either immediately on the `process.exit(0)` path or after `SHUTDOWN_GRACE_MS = 3 000 ms`) before the flush completes, silently dropping the last batch of spans. The main-process counterpart in `app-lifecycle/service.ts` correctly `await`s `shutdownOtelTracing()` — the workspace-server shutdown should do the same.
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
packages/workspace-server/src/serve.ts:30-33
**Fire-and-forget shutdown loses spans**

`void shutdownOtelTracing()` launches the `forceFlush` + `provider.shutdown()` async chain but never awaits it. The process can exit (either immediately on the `process.exit(0)` path or after `SHUTDOWN_GRACE_MS = 3 000 ms`) before the flush completes, silently dropping the last batch of spans. The main-process counterpart in `app-lifecycle/service.ts` correctly `await`s `shutdownOtelTracing()` — the workspace-server shutdown should do the same.

### Issue 2 of 3
packages/workspace-server/src/with-span.test.ts:1-62
**Missing success-path test with an active tracer**

The test suite covers "no tracer → pass-through", "no tracer → error propagation", and "tracer + error path", but never exercises the happy path where a tracer *is* present and `fn` resolves successfully. That path is where attribute injection, `span.end()` on success, and the return value threading all need verification. A fourth case — e.g. confirming `span.end` is called and the resolved value is returned unchanged — would complete coverage.

### Issue 3 of 3
packages/workspace-server/src/services/agent/agent.ts:687-689
The span wraps the entire `agent.run()` async operation (connection, streaming, etc.) but is named `"agent.run.start"`. A "start" suffix implies the span records only the initiation phase, which will mislead anyone reading traces. `"agent.run"` matches the pattern used by the surrounding `git.sync.*` spans and more accurately represents the scope.

```suggestion
      const acpConnection = await withSpan(
        "agent.run",
        {
```

Reviews (1): Last reviewed commit: "feat(otel): Add distributed tracing acro..." | Re-trigger Greptile

Comment thread packages/workspace-server/src/with-span.test.ts Outdated
Comment thread packages/workspace-server/src/services/agent/agent.ts

@pauldambra pauldambra left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QA Swarm review complete — see inline comments. (Diff base note: an earlier pass was run against a stale local main; this review reflects the authoritative PR diff against origin/main. Two findings about an agent.ts error-masking change and a duplicate initializePostHog were stale-diff artifacts and have been dropped.)

Comment thread packages/workspace-server/src/serve.ts Outdated
Comment thread apps/code/src/renderer/utils/otel-trace.ts
Comment thread packages/workspace-server/src/node-tracing.ts Outdated
Comment thread packages/workspace-server/src/with-span.ts Outdated
Comment thread apps/code/src/renderer/utils/otel-trace.ts Outdated
Comment thread apps/code/src/renderer/utils/otel-trace.ts Outdated
Comment thread packages/workspace-client/src/client.ts
Comment thread packages/workspace-server/src/services/git/service.ts
Comment thread packages/workspace-server/src/node-tracing.ts Outdated
@pauldambra

Copy link
Copy Markdown
Member Author

Note

🤖 Automated comment by QA Swarm — not written by a human

Multi-perspective review: qa-team, paul-reviewer, xp-reviewer, security-audit

Verdict: ⚠️ REQUEST CHANGES

Well-structured, well-gated tracing plumbing that no-ops cleanly when unconfigured, with a genuine DRY win (the shared node factory) and good parameterised tests. One real correctness bug holds it back from a clean pass: buffered spans are dropped on shutdown.

Key findings

🟠 HIGH

  • serve.ts:39 — shutdown drops spans. void shutdownOtelTracing() is fire-and-forget before a synchronous process.exit(0); the flush never completes, so every clean workspace-server shutdown loses up to 2s of buffered spans. Await the flush within the grace window.

🟡 MEDIUM

  • renderer otel-trace.ts:128 — renderer flush is unreliable (pagehide-only, fire-and-forget, no shutdown()); recent renderer spans routinely lost.
  • node-tracing.ts:67 ↔ renderer — exporter/sampler shape duplicated across node + web; /i/v1/traces and scheduledDelayMillis: 2000 are magic numbers not shared even though the ratio is. (convergent: paul + xp)
  • with-span.ts:9 — inconsistent tracer acquisition: withSpan reaches into the singleton, traceTrpcCall takes the tracer as a param. Pick one.
  • renderer otel-trace.ts — no colocated test for the most branch-heavy init site (gating, url-guard, longtask, navigation start/end, profiler threshold).

🟢 LOW / ⚪ NIT

  • Gating asymmetry (renderer defaults host + needs only key; node needs both) → possible half-instrumented traces.
  • git.remote exported (name only today — confirm a resolved URL can't reach it).
  • traceparent injected unconditionally on every workspace call (add a "no header when disabled" test).
  • URL build: double-slash on trailing-slash host; new URL() doesn't assert https: scheme.
  • No log line at init → at 2% sampling, hard to tell "tracing flowing" from "silently broken."
  • Renderer→main traces are not linked to main/workspace-server (electron-trpc IPC has no traceparent hop) — already documented as a deliberate deferral in this PR.

Convergence (highest confidence)

  • node↔web provider duplication — flagged independently by paul and xp.
  • renderer→main propagation gap — flagged by qa-team and security-audit (both note it's an acknowledged, non-blocking gap).

Reviewer summaries

Reviewer Assessment
🔍 qa-team Correct gating + sampling; the shutdown-flush handling (server + renderer) is the one real defect — request changes on that, rest is LOW/NIT.
👤 paul No correctness blockers in his read; "don't let it drift" (dup provider setup, no init log) + a "confirm no header when disabled" test. Stamp-with-comments.
📐 xp Genuine DRY win; tighten withSpan/traceTrpcCall consistency and add the missing renderer test. "Make it right" polish on working code.
🛡 security-audit No exploitable findings. Bearer key is a public phc_ ingestion key; span attributes deliberately low-sensitivity; inbound traceparent is behind the shared-secret auth. Two LOW hardening notes only.

Diff-base note: an earlier pass ran against a stale local main and surfaced two findings (an agent.ts error-masking change, a duplicate initializePostHog) that are not in the authoritative PR diff — those were dropped from this review.


Automated by QA Swarm — not a human review

- serve.ts: await the span flush (bounded by the grace timer) before
  process.exit so a clean workspace-server shutdown no longer drops
  buffered spans
- shared: buildTracesEndpoint() strips trailing slash and asserts https
  (localhost http allowed for dev); OTEL_BATCH_DELAY_MS shares the batch
  delay across node + web so the magic number lives once
- align renderer gating to require both key AND host (matches node + the
  log transport) so a key-only build can't half-instrument
- init log line (enabled/disabled) on all three tracer inits
- harden renderer flush: also flush on visibilitychange -> hidden
- withSpan now takes the tracer explicitly, matching traceTrpcCall
- tests: buildTracesEndpoint, renderer otel-trace gating + navigation +
  profiler threshold, workspace-client no-traceparent-when-disabled

Generated-By: PostHog Code
Task-Id: afd9489f-f2f8-4582-ab17-ace0d331ca0b
Covers the happy path: tracer present, fn resolves, span ends, value
returned unchanged, no exception/error-status recorded.

Generated-By: PostHog Code
Task-Id: afd9489f-f2f8-4582-ab17-ace0d331ca0b
Comment thread packages/shared/src/constants.ts Fixed
Strip trailing slashes with a linear loop instead of /\/+$/, which
CodeQL flagged as a polynomial-time ReDoS on uncontrolled input.

Generated-By: PostHog Code
Task-Id: afd9489f-f2f8-4582-ab17-ace0d331ca0b
react-doctor's Socket check flags vitest@4.0.10 (known CVEs); adding it
to a package that had no test runner registered a new baseline
diagnostic. Revert the test-runner addition and keep
tracePropagationHeaders private. The no-op-when-disabled behaviour is
guaranteed by OTel's default NoopTextMapPropagator regardless.

Generated-By: PostHog Code
Task-Id: afd9489f-f2f8-4582-ab17-ace0d331ca0b
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants