feat: Sessions - bidirectional durable agent streams by ericallam · Pull Request #3417 · triggerdotdev/trigger.dev

ericallam · 2026-04-20T14:18:13Z

⚠️ Not released yet. This PR is the server-side foundation only. The SDK changes that customers will actually use (chat.agent migration, chat.createStartSessionAction, useTriggerChatTransport updates) live on a separate branch and ship together in an upcoming @trigger.dev/sdk prerelease. Until that prerelease is published, this surface is reachable only via direct HTTP.

What this gives Trigger.dev users

A new first-class primitive, Session, for durable, task-bound, bidirectional I/O that outlives any single run. Sessions are the run manager for chat.agent going forward, and they unblock anything else that needs "one identifier, many runs over time" with a stable channel pair the client can write to and subscribe to.

Use cases unblocked

Chat agents that persist across many runs. One session per chat (keyed on your own chatId via externalId): turns 1..N attach to the same Session, and the UI subscribes once and keeps receiving output as new runs take over.
Approval loops and long-running tasks with user feedback. The task waits on .in, the client writes to .in, the server enforces no-writes-after-close.
Workflow progress streams that live past the run. Subscribe to .out after the task finishes to replay history.
Resume-next-day flows. A session is a durable row, not a transient stream. Send a message a day later and the server triggers a fresh run on the same session.

How it works (Session-as-run-manager)

A Session row is task-bound (taskIdentifier + triggerConfig are required) and owns its current run via currentRunId + currentRunVersion for optimistic claim. Three trigger paths:

Session create — POST /api/v1/sessions creates the row and triggers the first run synchronously.
Append-time probe — POST /realtime/v1/sessions/:session/in/append checks if the current run is alive; if it has terminated (idle exit, crash, etc.), the server triggers a new run before processing the append.
End-and-continue handoff — POST /api/v1/sessions/:session/end-and-continue, called by the running agent, triggers a fresh run and atomically swaps currentRunId. Used by chat.requestUpgrade() for version handoffs.

Every triggered run is recorded in the SessionRun audit table with a reason (initial, continuation, upgrade, manual).

Public API surface

Control plane

POST /api/v1/sessions — create. Idempotent on (env, externalId). Triggers the first run, returns the session and a session-scoped public access token. Returns 409 if the upserted row is already closed.
GET /api/v1/sessions/:session — retrieve by friendlyId (session_abc...) or by your own externalId (server disambiguates by prefix).
GET /api/v1/sessions — list with filters (type, tag, taskIdentifier, externalId, derived status ACTIVE/CLOSED/EXPIRED, created-at range) and cursor pagination. Backed by ClickHouse.
PATCH /api/v1/sessions/:session — update tags / metadata / externalId.
POST /api/v1/sessions/:session/close — terminate. Idempotent, hard-blocks new server-brokered writes.
POST /api/v1/sessions/:session/end-and-continue — agent-only handoff to a fresh run.
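
The derived status filter on the list endpoint is computed from the two terminal markers, closedAt and expiresAt. A minimal sketch of that derivation follows. It assumes a closed session reports CLOSED even when its expiry has also passed (this PR does not state the precedence). `deriveSessionStatus` and `SessionMarkers` are illustrative names, not the server's code.

```typescript
type SessionStatus = "ACTIVE" | "CLOSED" | "EXPIRED";

// Only the two write-once terminal markers matter for the derived status.
type SessionMarkers = {
  closedAt: Date | null;
  expiresAt: Date | null;
};

function deriveSessionStatus(session: SessionMarkers, now: Date): SessionStatus {
  // Assumption: an explicit close wins over expiry.
  if (session.closedAt !== null) return "CLOSED";
  if (session.expiresAt !== null && session.expiresAt.getTime() <= now.getTime()) {
    return "EXPIRED";
  }
  return "ACTIVE";
}
```

The same function could serve as the in-memory fallback for rows that have not replicated to ClickHouse yet, since it needs nothing beyond the Postgres columns.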

Realtime

PUT /realtime/v1/sessions/:session/:io — initialize a channel. Returns S2 credentials in headers so high-throughput clients can write directly to S2.
GET /realtime/v1/sessions/:session/:io — SSE subscribe. Supports Last-Event-ID resume and an opt-in X-Peek-Settled: 1 header that fast-closes the stream when the upstream is already settled (trigger:turn-complete), eliminating long-poll wait on reconnect-on-reload paths.
POST /realtime/v1/sessions/:session/:io/append — server-side appends.
POST /api/v1/runs/:runFriendlyId/session-streams/wait — runs wait on a session stream as a waitpoint, with a race-check to avoid suspending if data already landed.
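
As a rough illustration of the subscribe route, the sketch below assembles the request a client might send, including the resume and peek-settled headers. It only builds the request and does not send it. The `Authorization: Bearer` header, the base URL, and the names `buildSubscribeRequest` and `SubscribeOptions` are assumptions for illustration; only the path and the `Last-Event-ID` / `X-Peek-Settled` headers come from this PR.

```typescript
type SubscribeOptions = {
  baseUrl: string; // assumption: the API origin you are targeting
  session: string; // friendlyId (session_...) or your own externalId
  io: "in" | "out";
  accessToken: string; // session-scoped public access token
  lastEventId?: string; // resume point from a previous connection
  peekSettled?: boolean; // opt in to the fast-close behaviour
};

type PreparedRequest = {
  method: "GET";
  url: string;
  headers: Record<string, string>;
};

function buildSubscribeRequest(opts: SubscribeOptions): PreparedRequest {
  const headers: Record<string, string> = {
    Accept: "text/event-stream",
    // Assumption: the token is presented as a bearer token.
    Authorization: `Bearer ${opts.accessToken}`,
  };
  if (opts.lastEventId !== undefined) headers["Last-Event-ID"] = opts.lastEventId;
  if (opts.peekSettled) headers["X-Peek-Settled"] = "1";

  const url =
    `${opts.baseUrl}/realtime/v1/sessions/` +
    `${encodeURIComponent(opts.session)}/${opts.io}`;
  return { method: "GET", url, headers };
}
```

A reconnect-on-reload path would pass both `lastEventId` and `peekSettled: true`, then treat an `X-Session-Settled: true` response header as a terminal close rather than a cue to reconnect.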

Auth scopes

sessions is a new resource type. read:sessions:{id}, write:sessions:{id}, admin:sessions:{id} flow through the existing JWT validator. Session-scoped public access tokens minted by the server replace browser-held trigger-task tokens for chat-style flows — the browser never sees a run identifier or a run-scoped token in steady state.
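
To make the scope shapes concrete, here is a hedged sketch of a scope check. It is not the existing JWT validator. It assumes exact string matching on `{action}:sessions:{id}`, plus the super-scopes mentioned in the commit notes (`read:sessions`, `read:all`, `admin`) generalized to any action. `hasSessionScope` is a hypothetical name.

```typescript
type SessionAction = "read" | "write" | "admin";

// `ids` holds every identifier the session answers to (URL param, friendlyId,
// externalId), so a token minted for one form can authorize the other.
function hasSessionScope(scopes: string[], action: SessionAction, ids: string[]): boolean {
  const granted = new Set(scopes);

  // Assumption: these super-scopes authorize any session for the given action.
  if (granted.has("admin") || granted.has(`${action}:all`) || granted.has(`${action}:sessions`)) {
    return true;
  }
  return ids.some((id) => granted.has(`${action}:sessions:${id}`));
}
```

For example, a token carrying `read:sessions:chat-42` would pass a read check on a session whose externalId is `chat-42` even when the URL uses the `session_...` friendlyId, but would fail a write check.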

What's coming after this PR

SDK + chat.agent migration: separate branch, separate PR, ships in the next @trigger.dev/sdk prerelease alongside this server deploy. Customers using the prerelease chat.agent will follow the upgrade guide.
Dashboard surfaces: dedicated agent list, agent playground, agent view on the run dashboard. Tracking separately.

Implementation notes

Postgres Session table: scalar scoping columns (projectId, runtimeEnvironmentId, environmentType, organizationId) without FKs, matching the January TaskRun FK-removal decision. Point-lookup indexes only — list queries go to ClickHouse. Terminal markers (closedAt, expiresAt) are write-once.
ClickHouse sessions_v1: ReplacingMergeTree, partitioned by month, ordered by (org_id, project_id, environment_id, created_at, session_id). Tags indexed via tokenbf_v1 skip index.
SessionsReplicationService: mirrors RunsReplicationService exactly — leader-locked logical replication consumer, ConcurrentFlushScheduler, retry with exponential backoff + jitter, identical metric shape. Dedicated slot + publication so the two consume independently.
S2 keys: sessions/{addressingKey}/{out|in}. The existing runs/{runId}/{streamId} key format for run-scoped streams is untouched.
Optimistic claim: ensureRunForSession triggers a run upfront (cheap to cancel if it loses the race), then attempts an updateMany keyed on currentRunVersion. Loser cancels its triggered run and reuses the winner's. No DB lock held across the trigger.
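
The optimistic claim can be sketched with an in-memory stand-in for the Session row. This is a simplified model, not `ensureRunForSession` itself: `InMemorySessionStore`, `claimRunSketch`, and the `triggerRun` callback are hypothetical, and cancellation is only recorded rather than performed.

```typescript
type SessionRow = { currentRunId: string | null; currentRunVersion: number };

// Hypothetical stand-in for the Postgres row.
class InMemorySessionStore {
  private row: SessionRow = { currentRunId: null, currentRunVersion: 0 };

  read(): SessionRow {
    return { currentRunId: this.row.currentRunId, currentRunVersion: this.row.currentRunVersion };
  }

  // Mimics an updateMany keyed on currentRunVersion: returns rows updated (0 or 1).
  claim(expectedVersion: number, runId: string): number {
    if (this.row.currentRunVersion !== expectedVersion) return 0;
    this.row = { currentRunId: runId, currentRunVersion: expectedVersion + 1 };
    return 1;
  }
}

type ClaimOutcome = { runId: string; cancelledRunIds: string[] };

function claimRunSketch(
  store: InMemorySessionStore,
  observed: SessionRow,
  triggerRun: () => string
): ClaimOutcome {
  // Trigger first, with no lock held; the run is cheap to cancel if we lose.
  const candidate = triggerRun();
  if (store.claim(observed.currentRunVersion, candidate) === 1) {
    return { runId: candidate, cancelledRunIds: [] };
  }
  // Lost the race: cancel our own run and reuse the winner's.
  const winner = store.read();
  if (winner.currentRunId === null) {
    throw new Error("lost the claim but no winning run is recorded");
  }
  return { runId: winner.currentRunId, cancelledRunIds: [candidate] };
}
```

Two callers that both observed version 0 end up on the same run: the first claim bumps the version to 1, so the second claim matches zero rows and falls back to the winner.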

What did NOT change

Run-scoped streams.pipe / streams.input and the existing /realtime/v1/streams/{runId}/... routes are unchanged. Sessions are net-new — not a reshaping of the current streams API.

Deploy notes

Set SESSION_REPLICATION_CLICKHOUSE_URL and SESSION_REPLICATION_ENABLED=1 to enable the replication consumer.
The Session table needs REPLICA IDENTITY FULL set on the prod source DB before the publication is created (same one-time DDL we did for TaskRun). Required for delete events to carry full column values.
The GET /api/v1/sessions/:session loader uses cross-form authorization: a JWT minted for either form (friendlyId or externalId) authorizes both URL forms. Action routes are URL-form-specific, matching how the SDK mints PATs.

Verification

Webapp typecheck clean (10/10).
apps/webapp/test/sessionsReplicationService.test.ts — round-trip tests for insert/update/delete through Postgres logical replication into ClickHouse via testcontainers.
Live end-to-end against local dev: create + retrieve (both forms) + update + close, .out.initialize + .out.append x2 + .in.send + .out.subscribe over SSE, list with all filter combinations + pagination, end-and-continue swap, X-Peek-Settled fast-close (verified in browser via reconnect-on-reload and via curl). Replicated row lands in ClickHouse within ~1s.
Multi-round Devin + CodeRabbit review feedback addressed (read-after-write paths use prisma writer, info-leak on auth-routes masked as 403, peek-settled discriminator parsing fix, etc.).

Test plan

pnpm run typecheck --filter webapp
pnpm run test --filter webapp ./test/sessionsReplicationService.test.ts --run
Start the webapp with SESSION_REPLICATION_CLICKHOUSE_URL and SESSION_REPLICATION_ENABLED=1. Confirm the slot and publication auto-create on boot.
POST /api/v1/sessions and verify the row replicates to trigger_dev.sessions_v1 within a couple of seconds.
POST /api/v1/sessions/:id/close, then confirm POST /realtime/v1/sessions/:id/out/append returns 400.
Reuse a closed session's externalId on POST /api/v1/sessions and confirm 409.
GET /realtime/v1/sessions/:id/out with X-Peek-Settled: 1 after a turn completes and confirm X-Session-Settled: true response header + immediate close.

changeset-bot · 2026-04-20T14:18:22Z

🦋 Changeset detected

Latest commit: 188fa43

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 29 packages

Name	Type
@trigger.dev/core	Patch
@trigger.dev/build	Patch
trigger.dev	Patch
@trigger.dev/python	Patch
@trigger.dev/redis-worker	Patch
@trigger.dev/schema-to-json	Patch
@trigger.dev/sdk	Patch
@internal/cache	Patch
@internal/clickhouse	Patch
@internal/llm-model-catalog	Patch
@internal/redis	Patch
@internal/replication	Patch
@internal/run-engine	Patch
@internal/schedule-engine	Patch
@internal/testcontainers	Patch
@internal/tracing	Patch
@internal/tsql	Patch
@internal/zod-worker	Patch
d3-chat	Patch
references-d3-openai-agents	Patch
references-nextjs-realtime	Patch
references-realtime-hooks-test	Patch
references-realtime-streams	Patch
references-telemetry	Patch
@internal/sdk-compat-tests	Patch
@trigger.dev/react-hooks	Patch
@trigger.dev/rsc	Patch
@trigger.dev/database	Patch
@trigger.dev/otlp-importer	Patch


coderabbitai · 2026-04-20T14:18:33Z

Walkthrough

Introduces a durable Session primitive end-to-end: a new Prisma Session model and migration, a ClickHouse sessions_v1 table and query/insert helpers, ClickHouse-backed SessionsRepository, a SessionsReplicationService that streams Postgres logical replication into ClickHouse (with retry/ack/flush/leader-lock logic), session-friendly ID export (SessionId) and API Zod schemas, multiple REST and realtime routes for session CRUD, streaming and append, session-stream waitpoint support with Redis-backed pending sets, environment config and startup wiring, helper utilities, and end-to-end replication tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 35.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat: Sessions - bidirectional durable agent streams' clearly summarizes the main change, specifying the new Sessions feature with its core capability of bidirectional streaming.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The pull request provides a comprehensive description covering objectives, use cases, public API surface, implementation notes, and verification steps.


Durable, typed, bidirectional I/O primitive that outlives a single run. Ship target is agent/chat use cases; run-scoped streams.pipe/streams.input are untouched and do not create Session rows.
Postgres
- New Session table: id, friendlyId, externalId, type (plain string), denormalised project/environment/organization scalar columns (no FKs), taskIdentifier, tags String[], metadata Json, closedAt, closedReason, expiresAt, timestamps
- Point-lookup indexes only (friendlyId unique, (env, externalId) unique, expiresAt). List queries are served from ClickHouse so Postgres stays minimal and insert-heavy.
Control-plane API
- POST /api/v1/sessions create (idempotent via externalId)
- GET /api/v1/sessions list with filters (type, tag, taskIdentifier, externalId, status ACTIVE|CLOSED|EXPIRED, period/from/to) and cursor pagination, ClickHouse-backed
- GET /api/v1/sessions/:session retrieve, polymorphic: `session_` prefix hits friendlyId, otherwise externalId
- PATCH /api/v1/sessions/:session update tags/metadata/externalId
- POST /api/v1/sessions/:session/close terminal close (idempotent)
Realtime (S2-backed)
- PUT /realtime/v1/sessions/:session/:io returns S2 creds
- GET /realtime/v1/sessions/:session/:io SSE subscribe
- POST /realtime/v1/sessions/:session/:io/append server-side append
- S2 key format: sessions/{friendlyId}/{out|in}
Auth
- sessions added to ResourceTypes. read:sessions:{id}, write:sessions:{id}, admin:sessions:{id} scopes work via existing JWT validation.
ClickHouse
- sessions_v1 ReplacingMergeTree table
- SessionsReplicationService mirrors RunsReplicationService exactly: logical replication with leader-locked consumer, ConcurrentFlushScheduler, retry with exponential backoff + jitter, identical metric shape. Dedicated slot + publication (sessions_to_clickhouse_v1[_publication]).
- SessionsRepository + ClickHouseSessionsRepository expose list, count, tags with cursor pagination keyed by (created_at DESC, session_id DESC).
- Derived status (ACTIVE/CLOSED/EXPIRED) computed from closed_at + expires_at; in-memory fallback on list results to catch pre-replication writes.
Verification
- Webapp typecheck 10/10
- Core + SDK build 3/3
- sessionsReplicationService.test.ts integration tests 2/2 (insert + update round-trip via testcontainers)
- Live round-trip against local dev: create -> retrieve (friendlyId and externalId) -> out.initialize -> out.append x2 -> in.send -> out.subscribe (receives records) -> close -> ClickHouse sessions_v1 shows the replicated row with closed_reason
- Live list smoke: tag, type, status CLOSED, externalId, and cursor pagination

…te/update The session_ prefix identifies internal friendlyIds. Allowing it in a user-supplied externalId would misroute subsequent GET/PATCH/close requests through resolveSessionByIdOrExternalId to a friendlyId lookup, returning null or the wrong session. Reject at the schema boundary so both routes surface a clean 422.
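
The prefix rule described above can be sketched in two small functions: one that routes a URL parameter to the right lookup, and one that rejects a user-supplied externalId carrying the reserved prefix. This is plain TypeScript rather than the route's actual Zod schema, and `classifySessionParam` and `validateExternalId` are hypothetical names.

```typescript
const FRIENDLY_ID_PREFIX = "session_";

type SessionLookup =
  | { by: "friendlyId"; value: string }
  | { by: "externalId"; value: string };

// Routes a `:session` URL parameter to the column it should be matched against.
function classifySessionParam(param: string): SessionLookup {
  return param.startsWith(FRIENDLY_ID_PREFIX)
    ? { by: "friendlyId", value: param }
    : { by: "externalId", value: param };
}

type ValidationResult = { ok: true } | { ok: false; status: 422; message: string };

// Rejects externalIds that would later be misrouted to a friendlyId lookup.
function validateExternalId(externalId: string): ValidationResult {
  if (externalId.startsWith(FRIENDLY_ID_PREFIX)) {
    return {
      ok: false,
      status: 422,
      message: `externalId must not start with "${FRIENDLY_ID_PREFIX}"`,
    };
  }
  return { ok: true };
}
```

Validating on create and update keeps the two lookups disjoint, so `classifySessionParam` never has to guess.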

Without allowJWT/corsStrategy, frontend clients holding public access tokens hit 401 on GET /api/v1/sessions and browser preflights fail. Matches the single-session GET/PATCH/close routes and the runs list endpoint.

- Derive isCached from the upsert result (id mismatch = pre-existing row) instead of doing a separate findFirst first. The pre-check was racy: two concurrent first-time POSTs could both return 201 with isCached: false. Using the returned row's id is atomic and saves a round-trip.
- Scope the list endpoint's authorization to the standard action/resource pattern (matches api.v1.runs.ts): task-scoped JWTs can list sessions filtered by their task, and broader super-scopes (read:sessions, read:all, admin) authorize unfiltered listing.
- Log and swallow unexpected errors on POST rather than returning the raw error.message. Prisma/internal messages can leak column names and query fragments.

Give Session channels run-engine waitpoint semantics so a task can suspend while idle on a session channel and resume when an external client sends a record, parallel to what streams.input offers run-scoped streams.
Webapp
- POST /api/v1/runs/:runFriendlyId/session-streams/wait: creates a manual waitpoint attached to {sessionId, io} and race-checks the S2 stream starting at lastSeqNum so pre-arrived data fires it immediately. Mirrors the existing input-stream waitpoint route.
- sessionStreamWaitpointCache.server.ts: Redis set keyed on {sessionFriendlyId, io}, drained atomically on each append so concurrent multi-tab waiters all wake together.
- realtime.v1.sessions.$session.$io.append now drains pending waitpoints after every record lands and completes each with the appended body.
- S2RealtimeStreams.readSessionStreamRecords: session-channel parallel of readRecords, feeds the race-check path.
Core
- CreateSessionStreamWaitpoint request/response schemas alongside the existing Session CRUD schemas. Server API contract only; the client ApiClient + SDK wrapper ship on the AI-chat branch.

Two fixes needed by browser clients hitting the public session API (TriggerChatTransport's direct accessToken path, WebSocket-less session drivers, anything origin'd off the dashboard):
- POST /api/v1/sessions: allowJWT: true + corsStrategy: "all" on the action. Pre-fix, the create endpoint only accepted secret-key auth, so any browser-originated sessions.create(...) 401'd. The loader (list) already had these; matches that shape.
- POST /realtime/v1/sessions/:session/:io/append: export both { action, loader } so Remix routes the OPTIONS preflight to the route builder's CORS handler. With only { action } exported, the preflight returns 400 'No loader for route' and Chrome surfaces the follow-up POST as net::ERR_FAILED. Same pattern as /api/v1/tasks/:id/trigger (which already exports both).
Validated by an end-to-end UI smoke on references/ai-chat: new chat → send → streamed assistant reply in ~4s → second turn reuses the same session + run, lastEventId advances 10 → 21.

Nine fixes from CodeRabbit + Devin review:
- api.v1.sessions.$session.close.ts:
  - Export { action, loader } so CORS preflight reaches the route builder's OPTIONS handler. Same fix already applied to the append route; Devin caught that I'd missed this one. Without the loader, browser clients hitting POST /close fail preflight.
  - Switch to `prisma.session.updateMany({ where: { id, closedAt: null }, ... })` so concurrent closes can't overwrite the original `closedAt` / `closedReason`. Loser hits count === 0 and re-reads the winning row; closedness is write-once at the DB level. (CodeRabbit: TOCTOU.)
- entry.server.tsx: Wrap the async `sessionsReplicationInstance.shutdown` in a sync handler with `.catch(...)`. SIGTERM/SIGINT fire during process teardown and a rejection from `_replicationClient.stop()` would become an unhandled promise rejection. Matches the pattern in `dynamicFlushScheduler.server.ts`. (CodeRabbit: unhandled rejection risk.)
- api.v1.runs.$runFriendlyId.session-streams.wait.ts:
  - Swallowed race-check catch now logs `warn` with sessionFriendlyId / io / waitpointId / error. Silent failures in the S2-read / engine-complete / cache-remove path were indistinguishable from the expected cache-drain-on-append fast path.
  - Outer 500 path no longer forwards `error.message` (Prisma / engine / S2 internals could leak). Logs server-side and returns a generic "Something went wrong"; 422 ServiceValidationError path unchanged. (CodeRabbit: info-leak + logging gap.)
- realtime.v1.sessions.$session.$io.ts: Add `method: "PUT"` to the route config so the route builder enforces method validation before the handler runs. Removed the now-redundant `request.method !== "PUT"` check inside the handler. (CodeRabbit: defense-in-depth.)
- services/sessionsRepository/sessionsRepository.server.ts: `ISessionsRepository` is now a `type` alias, per repo coding guideline ("use types over interfaces"). Structural-typing means implementing classes don't need source changes. (CodeRabbit.)
- services/sessionStreamWaitpointCache.server.ts: Replace separate SADD + PEXPIRE with a single atomic Lua script. Solves two distinct concerns at once:
  1. Partial-failure window (CodeRabbit): if SADD succeeded and PEXPIRE failed, the key would persist with no TTL. The Lua script fails both or succeeds both.
  2. TTL-race (Devin, twice): each waitpoint registers with its own `ttlMs` derived from the caller's timeout. The old code called PEXPIRE unconditionally, so a short-TTL registration would shrink the shared key's TTL below a longer-TTL sibling, evicting the sibling from Redis and degrading the append-path fast drain to engine-timeout-only. The script only PEXPIREs if the new TTL is greater than the current PTTL (or the key has no TTL yet), so the key lives as long as the longest-TTL member.
Outstanding: one unresolved thread asking to rename `CloseSessionRequestBody.reason` → `closedReason` for symmetry with the DB column. Holding that for an API-taste call; will follow up.
Validated: `pnpm run typecheck --filter webapp` clean.
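
The TTL rule the Lua script enforces can be modelled as a pure function. The sketch below is not the script itself and ignores the passage of time between registrations. It relies on the Redis convention that PTTL returns -1 for a key without an expiry. `ttlToApply` and `simulateRegistrations` are hypothetical names.

```typescript
// Redis PTTL convention: -1 means the key exists but has no expiry.
const NO_EXPIRY = -1;

// Returns the TTL (ms) to set via PEXPIRE, or null to leave the key's TTL alone.
function ttlToApply(currentPttlMs: number, requestedTtlMs: number): number | null {
  if (currentPttlMs === NO_EXPIRY) return requestedTtlMs;
  return requestedTtlMs > currentPttlMs ? requestedTtlMs : null;
}

// Applies successive registrations to one shared key and returns its final TTL.
function simulateRegistrations(ttlsMs: number[]): number {
  let keyTtlMs = NO_EXPIRY;
  for (const requested of ttlsMs) {
    const next = ttlToApply(keyTtlMs, requested);
    if (next !== null) keyTtlMs = next;
  }
  return keyTtlMs;
}
```

With the old unconditional PEXPIRE, registering 60000 ms and then 5000 ms would leave the key at 5000 ms. Under this rule the key keeps the longer TTL regardless of registration order.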

Devin catch on #3417: the ClickHouse sessions list was slicing `sessionIds.slice(1, size + 1)` on the backward path, which skipped the item closest to the cursor and surfaced the sentinel (the `size+1`th item that proves hasMore=true) to the user.
Trace, with items c01…c11 and cursor=c07 (page size 3):
- Backward query: `session_id > c07 ORDER BY ASC LIMIT 4` → `[c08, c09, c10, c11]`. Legitimate content is the first three (`[c08, c09, c10]`); `c11` is the sentinel.
- Previous slice: `[c09, c10, c11]` → displayed DESC `[c11, c10, c09]`. The user never sees c08 and sees sentinel c11 instead.
Fix: collapse both directions to `sessionIds.slice(0, size)`. The sentinel is always the last item regardless of direction, so the two branches had no reason to diverge. Cursor computations (`previousCursor = reversed.at(1)`, `nextCursor = reversed.at(size)`) already line up with the corrected slice, so no change is needed there.
Verified: webapp typecheck clean.
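
The trace above can be reproduced in a few lines. This sketch covers only the slicing step on the `LIMIT size + 1` result; `pageFromRows` is a hypothetical helper, not the repository method.

```typescript
type Page = { items: string[]; hasMore: boolean };

// `rows` is the result of a LIMIT size + 1 query; the extra row, if present,
// is the sentinel that proves another page exists.
function pageFromRows(rows: string[], size: number): Page {
  return { items: rows.slice(0, size), hasMore: rows.length > size };
}

// Backward query result for cursor=c07, page size 3 (ascending order).
const backwardRows = ["c08", "c09", "c10", "c11"];

const fixedPage = pageFromRows(backwardRows, 3);
// Displayed newest-first: ["c10", "c09", "c08"]
const displayed = fixedPage.items.slice().reverse();

// The previous slice, for comparison: drops c08 and leaks the sentinel c11.
const buggyItems = backwardRows.slice(1, 3 + 1); // ["c09", "c10", "c11"]
```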

/realtime/v1/sessions/:session/:io=out now peeks the tail record in S2 at connection time. When the tail chunk is trigger:turn-complete, the agent has finished a turn and is either idle-waiting on .in or has exited; either way no more chunks will arrive without further user action. In that case the downstream S2 read switches to wait=0 so the SSE drains and closes in ~1s instead of long-polling for 60s, and the response carries X-Session-Settled: true so the client can tell the close is terminal rather than a normal 60s cycle.
Mid-turn tails (streaming UIMessageChunks in flight) fall through to the existing wait=60 long-poll. Crashed-mid-turn is indistinguishable from live-streaming at this point and gets the same 60s retry loop as today; that's a separate hardening, not in scope here.
The peek uses GET /records?tail_offset=1&count=1&wait=0 (single-digit ms on S2), then unwraps the agent-side envelope written by StreamsWriterV2: record.body parses to {data: <chunk>, id}, where <chunk> is the raw UIMessageChunk object. No double-parse on data.
404 / 416 from the peek (stream never written / empty stream) short-circuit to settled=false so first-connect on a freshly-created session keeps the long-poll semantics the agent's first chunks depend on.
Verified end-to-end against an idle chat-agent-smoke session: caught-up reconnect (Last-Event-ID = tail) closes in 1.08s with the header; behind reconnect (Last-Event-ID < tail) drains remaining records then closes in 0.94s with the header; empty-stream reconnect keeps the 60s long-poll behavior unchanged.

Session is now the run manager for chat.agent and any future task-bound session. Atomically creates the row + triggers the first run + tracks the current run via optimistic claim, with a SessionRun audit log for provenance.
Schema:
- Session gains `taskIdentifier`, `triggerConfig` (JSON), `currentRunId` (non-FK), `currentRunVersion` (monotonic int for optimistic claim).
- New SessionRun audit table: one row per run a session triggers, with `reason: "initial" | "continuation" | "upgrade" | "manual"`.
Lifecycle:
- `POST /api/v1/sessions`: idempotent on `(env, externalId)`, refreshes triggerConfig on cache hit, runs `ensureRunForSession` (probe + optimistic claim), returns a session-scoped PAT. JWT auth path dropped; secret-key only. The customer's server is the only entry point for session creation.
- `POST /api/v1/sessions/:s/end-and-continue`: server-orchestrated handoff (cancels current run, triggers a fresh one, swaps currentRunId via `updateMany where currentRunVersion`). Powers `chat.requestUpgrade()` from inside the agent runtime.
- `POST /realtime/v1/sessions/:s/:io/append`: probe + ensureRunForSession before append so messages arriving while no run is alive boot one transparently.
Cross-form addressing on write paths:
- `createActionApiRoute` now runs `findResource` before `authorization`, matching `createLoaderApiRoute`. Action routes get an optional `resource` argument on `authorization.resource()`, which is backwards-compatible (existing 4-arg callbacks unchanged).
- Append + end-and-continue use the new ordering to authorize against `{paramSession, friendlyId, externalId}` so a JWT minted for either form authorizes either URL form.
Helpers:
- `mintSessionToken.server.ts`: server-side session-PAT factory (`read:sessions:{key} + write:sessions:{key}`, 1h TTL).
- `sessionRunManager.server.ts`: `ensureRunForSession` (probe + claim) and `swapSessionRun` (force handoff with optimistic claim + cancel-on-loss).
Pre-mutation existence reads switched to `$replica` (close, end-and-continue, PATCH).

Three fixes after pushing the Sessions-as-run-manager commit:
- `api.v1.sessions.$session.end-and-continue.ts` was destructuring only `{ action }` from `createActionApiRoute`, which means Remix had no handler for OPTIONS preflight on this route. Browser CORS would 405. Sibling routes (`close.ts`) already export `{ action, loader }`. Fix: destructure and export both.
- `ensureRunForSession`'s pathological "lost the claim race AND the winner's run was already terminal" branch recursed without bound. In practice progress through the run engine bounds it, but a misconfigured task that crashes before being dequeued could blow the stack. Add a hidden `_attempt` counter, throw `SessionRunManagerError` once it exceeds 3.
- `sessionsReplicationService.test.ts` was failing in CI because the sessions-as-run-manager schema migration made `taskIdentifier` and `triggerConfig` required on `Session`. The two `prisma.session.create` calls in the test predate the migration. Add the now-required fields to both fixtures.

Two fixes from Devin review on the sessions-as-run-manager commit:
- `SessionItem.currentRunId`'s contract is the `run_*` friendlyId, but `serializeSession` returns the raw Prisma cuid. The `POST /sessions` create path overrides correctly via a TaskRun lookup, but GET, PATCH, and the three return paths in close.ts were passing the cuid through. A consumer using `currentRunId` from those endpoints in a downstream `GET /api/v1/runs/:runId` call would 404. Add a `serializeSessionWithFriendlyRunId` helper next to `serializeSession` that resolves via `$replica.taskRun.findFirst` (TaskRun friendlyIds are immutable, so replica lag is harmless), and switch the five affected return sites to use it. List endpoints stay on `serializeSession` to avoid N+1 lookups when paginating. The create endpoint keeps its existing manual lookup because it also needs the friendlyId for the response's `runId` field, and `session.currentRunId` is stale relative to the post-`ensureRunForSession` claim outcome.
- Drop dead `lastChunkType` recomputation in `streamResponseFromSessionStream`. The variable was bound but never used; the conditional below it re-evaluated the same expression. Use the bound value in the condition.

Collapse `session-out-settled-signal.md` and `sessions-public-api-cors.md` into the single `session-primitive.md`, and rewrite that one to a high- level two-sentence summary that covers everything actually shipping in this PR (sessions-as-run-manager, end-and-continue, waitpoints, etc.). The CORS/JWT-on-create story is also out of date now that POST /api/v1/sessions is secret-key only.

…friendlyId Switch the two read-after-write taskRun lookups (POST /api/v1/sessions and POST /api/v1/sessions/:s/end-and-continue) from $replica back to prisma. Both reads happen immediately after triggering a run on the writer; replica lag would null the result and turn a successful create into a 500, or fall back to leaking the internal cuid in the end-and-continue response.

…n sessionRunManager The lost-race re-read in ensureRunForSession and swapSessionRun reads the Session row that the winner just wrote on the writer. Reading from $replica could return pre-race state and either (1) cause ensureRunForSession to recurse with a stale currentRunVersion, fail the next claim, and waste runs until max-attempts; or (2) cause swapSessionRun to return swapped: false with the calling run's own id, misleading the caller into thinking it is still authoritative.

The S2 record envelope wraps the agent-written chunk as {data: <chunkAsString>, id: partId} because StreamsWriterV2 hands appendPart an already-stringified chunk. The peek-settled check treated envelope.data as an object, so typeof === 'object' always returned false and the trigger:turn-complete sentinel was never matched. Reconnect-on-reload silently degraded to the full long-poll path. Parse envelope.data once more so the type discriminator is surfaced.
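
A minimal sketch of the corrected peek check follows, assuming the chunk carries its discriminator in a `type` field and that the tail record body is a JSON string. `isSettledTail` is a hypothetical name; the real check lives in the session stream response path.

```typescript
const TURN_COMPLETE = "trigger:turn-complete";

// record.body is JSON for { data: "<stringified chunk>", id }, so the chunk
// needs a second parse before its type can be read.
function isSettledTail(recordBody: string): boolean {
  try {
    const envelope: unknown = JSON.parse(recordBody);
    if (typeof envelope !== "object" || envelope === null) return false;

    const data = (envelope as { data?: unknown }).data;
    if (typeof data !== "string") return false;

    const chunk: unknown = JSON.parse(data);
    return (
      typeof chunk === "object" &&
      chunk !== null &&
      (chunk as { type?: unknown }).type === TURN_COMPLETE
    );
  } catch {
    // Malformed JSON at either layer is treated as "not settled".
    return false;
  }
}
```

Returning false on any parse failure keeps the fallback on the existing long-poll path, which is the safe default when the tail cannot be classified.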

… run lookup Same read-after-write pattern as the other lost-race re-reads: the run was just triggered on the writer milliseconds before, so a $replica.findFirst can return null due to replication lag. The null silently no-ops the cancellation and leaks an orphan run that no session will ever claim.

When the upsert path returns a previously-closed row, return 409 before ensureRunForSession fires. Otherwise we'd trigger a fresh run on a closed session that can't receive .in input (append handler rejects writes to closed sessions), wasting compute on a run that exits the moment it tries to read. close is one-way; callers must use a different externalId to start a new session.

The race-check in api.v1.runs.$runFriendlyId.session-streams.wait was selecting the realtime stream instance via run.realtimeStreamsVersion, but session streams are always v2 (S2) — the writer (appendPartToSessionStream) and the SSE subscribe both hardcode v2. For a v1 run the race-check silently fell back to a non-S2 instance, the instanceof check missed, and the optimization was skipped. Hardcode v2 for parity with the rest of the session surface.

…ized routes createActionApiRoute now runs findResource before authorization so the auth scope check can expand to alternate identifiers of the resolved resource (Sessions are addressable by both friendlyId and externalId). Side-effect: an authenticated-but-underscoped caller could probe resource existence by observing 404 vs 403. Mask the 404 as 403 with the same response shape as the auth-failed branch when the route declares authorization, so the two cases are indistinguishable to callers without scopes. Routes without authorization keep returning 404.

Previous fix unconditionally returned 403 when findResource was null on a route with authorization, breaking PRIVATE-key callers (e.g. server SDK) hitting the existing api.v2.runs.cancel route: they always pass authorization but the new code returned 403 with a factually wrong message ('Unauthorized: missing required scopes') even though they had full permissions.
New ordering: run authorization first (with the resolved resource as the 5th arg, so cross-form session auth still works), then check resource-null → 404. This gives:
- PRIVATE key + missing resource: auth passes → 404 (correct)
- Underscoped JWT + missing resource: auth fails (resource not in scope) → 403 (no info leak vs existing resource)
- Underscoped JWT + existing resource: auth fails → 403 (unchanged)
Only auth callbacks that destructure the resource (loader for realtime.v1.sessions.$session.$io) need to handle null; they all already do, since findResource was already nullable in pre-PR loaders.
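
The resulting decision order can be sketched as a small function. This is a simplified model of the route builder, not its code: the `Caller` and `Resource` shapes, `authorize`, and `respond` are hypothetical, and a JWT is treated as a plain list of session identifiers it is scoped to.

```typescript
type Caller = { kind: "private" } | { kind: "jwt"; sessionIds: string[] };
type Resource = { friendlyId: string; externalId: string | null } | null;

// Authorization sees the resolved resource (possibly null) so it can match
// either addressing form.
function authorize(caller: Caller, param: string, resource: Resource): boolean {
  if (caller.kind === "private") return true;
  const scopedIds = caller.sessionIds;
  const candidates = [param];
  if (resource !== null) {
    candidates.push(resource.friendlyId);
    if (resource.externalId !== null) candidates.push(resource.externalId);
  }
  return candidates.some((id) => scopedIds.indexOf(id) !== -1);
}

// Authorization first, then the null-resource check.
function respond(caller: Caller, param: string, resource: Resource): 200 | 403 | 404 {
  if (!authorize(caller, param, resource)) return 403;
  if (resource === null) return 404;
  return 200;
}
```

An underscoped caller gets 403 whether or not the resource exists, so the status code reveals nothing about existence, while a fully privileged caller still sees an honest 404.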

## Summary

8 new features, 18 improvements, 11 bug fixes.

## Breaking changes

- Add server-side deprecation gate for deploys from v3 CLI versions (gated by `DEPRECATE_V3_CLI_DEPLOYS_ENABLED`). v4 CLI deploys are unaffected. ([#3415](#3415))

## Improvements

- Add `--no-browser` flag to `init` and `login` to skip auto-opening the browser during authentication. Also error loudly when `init` is run without `--yes` under non-TTY stdin (previously default-and-exited silently, leaving the project half-initialized). Both commands now show an `Examples` section in `--help`. ([#3483](#3483))
- Add `isReplay` boolean to the run context (`ctx.run.isReplay`), derived from the existing `replayedFromTaskRunFriendlyId` database field. Defaults to `false` for backwards compatibility. ([#3454](#3454))
- Redact the `resolveWaitpoint` runtime log so it only emits `id` and `type` instead of the full completed waitpoint. Previously the log printed the entire waitpoint (including `output`) to stdout in production runs, which could leak sensitive payloads. The value returned by `wait.forToken()` is unchanged. ([#3490](#3490))
- Add `SessionId` friendly ID generator and schemas for the new durable Session primitive. Exported from `@trigger.dev/core/v3/isomorphic` alongside `RunId`, `BatchId`, etc. Ships the `CreateSessionStreamWaitpoint` request/response schemas alongside the main Session CRUD. ([#3417](#3417))
- Truncate large error stacks and messages to prevent OOM crashes. Stack traces are capped at 50 frames (keeping top 5 + bottom 45 with an omission notice), individual stack lines at 1024 chars, and error messages at 1000 chars. Applied in parseError, sanitizeError, and OTel span recording. ([#3405](#3405))

## Server changes

These changes affect the self-hosted Docker image and Trigger.dev Cloud:

- Add a "Back office" tab to `/admin` and a per-organization detail page at `/admin/back-office/orgs/:orgId`. The first action available on that page is editing the org's API rate limit: admins can save a `tokenBucket` override (refill rate, interval, max tokens) and see a plain-English preview of the resulting sustained rate and burst allowance. Writes are audit-logged via the server logger. ([#3434](#3434))
- Optional `DEPLOY_REGISTRY_ECR_DEFAULT_REPOSITORY_POLICY` env var to apply a default repository policy when the webapp creates new ECR repos ([#3467](#3467))
- Ship the Errors page to all users, with a polish + bug-fix pass: pinned "No channel" item in the Slack alert channel picker, viewer-timezone alert timestamps via Slack's `<!date^>` token, Activity sparkline peak tooltip, centered loading spinner and bug-icon empty state on the error detail page, ellipsis on the Configure alerts trigger. ([#3477](#3477))
- Configure the set of machine presets to build boot snapshots for at deploy time via `COMPUTE_TEMPLATE_MACHINE_PRESETS` (CSV of preset names, default `small-1x`). Use `COMPUTE_TEMPLATE_MACHINE_PRESETS_REQUIRED` (CSV, default = full PRESETS list) to scope which preset failures fail a required-mode deploy. Optional preset failures are logged and don't block the deploy. ([#3492](#3492))
- Regenerating a RuntimeEnvironment API key no longer invalidates the previous key immediately. The old key is recorded in a new `RevokedApiKey` table with a 24 hour grace window, and `findEnvironmentByApiKey` falls back to it when the submitted key doesn't match any live environment. The grace window can be ended early (or extended) by updating `expiresAt` on the row. ([#3420](#3420))
- Add the `Session` primitive, a durable, task-bound, bidirectional I/O channel that outlives a single run and acts as the run manager for `chat.agent`. Ships the Postgres `Session` + `SessionRun` tables, ClickHouse `sessions_v1` + replication service, the `sessions` JWT scope, and the public CRUD + realtime routes (`/api/v1/sessions`, `/realtime/v1/sessions/:session/:io`) including `end-and-continue` for server-orchestrated run handoffs and session-stream waitpoints. ([#3417](#3417))
- Add `KUBERNETES_POD_DNS_NDOTS_OVERRIDE_ENABLED` flag (off by default) that overrides the cluster default and sets `dnsConfig.options.ndots` on runner pods (defaulting to 2, configurable via `KUBERNETES_POD_DNS_NDOTS`). Kubernetes defaults pods to `ndots: 5`, so any name with fewer than 5 dots (including typical external domains like `api.example.com`) is first walked through every entry in the cluster search list (`<ns>.svc.cluster.local`, `svc.cluster.local`, `cluster.local`) before being tried as-is, turning one resolution into 4+ CoreDNS queries (×2 with A+AAAA). Using a lower `ndots` value reduces DNS query amplification in the `cluster.local` zone. Note: before enabling, make sure no code path relies on search-list expansion for names with dots ≥ the configured value; those names will hit their as-is form first and could resolve externally before falling back to the cluster search path. ([#3441](#3441))
- Vercel integration option to disable auto promotions ([#3376](#3376))
- Make it clear in the admin that feature flags are global and should rarely be changed. ([#3408](#3408))
- Admin worker groups API: add GET loader and expose more fields on POST. ([#3390](#3390))
- Add 60s fresh / 60s stale SWR cache to `getEntitlement` in `platform.v3.server.ts`. Eliminates a synchronous billing-service HTTP round trip on every trigger. Reuses the existing `platformCache` (LRU memory + Redis) pattern already used for `limits` and `usage`. Cache key is `${orgId}`. Errors return a permissive `{ hasAccess: true }` fallback (existing behavior) and are also cached to prevent thundering-herd on billing outages.
([#3388](#3388)) - Show a `MicroVM` badge next to the region name on the regions page. ([#3407](#3407)) - Increase default maximum project count per organization from 10 to 25 ([#3409](#3409)) - Merge execution snapshot creation into the dequeue taskRun.update transaction, reducing 2 DB commits to 1 per dequeue operation ([#3395](#3395)) - Add per-worker Node.js heap metrics to the OTel meter — `nodejs.memory.heap.used`, `nodejs.memory.heap.total`, `nodejs.memory.heap.limit`, `nodejs.memory.external`, `nodejs.memory.array_buffers`, `nodejs.memory.rss`. Host-metrics only publishes RSS, which overstates V8 heap by the external + native footprint; these give direct heap visibility per cluster worker so `NODE_MAX_OLD_SPACE_SIZE` can be sized against observed heap peaks rather than RSS. ([#3437](#3437)) - Tag Prisma spans with `db.datasource: "writer" | "replica"` so monitors and trace queries can distinguish the writer pool from the replica pool. Applies to all `prisma:engine:*` spans (including `prisma:engine:connection` used by the connection-pool monitors) and the outer `prisma:client:operation` span. ([#3422](#3422)) - Clarify the cross-region intent in the Terraform and AI-prompt helpers on the Add Private Connection page. Both already default `supported_regions` to `["us-east-1", "eu-central-1"]`; added an inline comment / parenthetical so the user understands why both regions are listed (Trigger.dev runs in both, so the service must be consumable from either). ([#3465](#3465)) - Add `RUN_ENGINE_READ_REPLICA_SNAPSHOTS_SINCE_ENABLED` flag (default off) to route the Prisma reads inside `RunEngine.getSnapshotsSince` through the read-only replica client. Offloads the snapshot polling queries (fired by every running task runner) from the primary. When disabled, behavior is unchanged. ([#3423](#3423)) - Stop creating TaskRunTag records and _TaskRunToTaskRunTag join table entries during task triggering. 
The denormalized runTags string array on TaskRun already stores tag names, making the M2M relation redundant write overhead. ([#3369](#3369)) - Stop writing per-tick state (`lastScheduledTimestamp`, `nextScheduledTimestamp`, `lastRunTriggeredAt`) on `TaskSchedule` and `TaskScheduleInstance`. The schedule engine now carries the previous fire time forward via the worker queue payload, eliminating ~270K dead-tuple-driven autovacuums per year on these hot tables and the associated `IO:XactSync` mini-spikes on the writer. Customer-facing `payload.lastTimestamp` semantics are unchanged. ([#3476](#3476)) - Replace the expensive DISTINCT query for task filter dropdowns with a dedicated TaskIdentifier registry table backed by Redis. Environments migrate automatically on their next deploy, with a transparent fallback to the legacy query for unmigrated environments. Also fixes duplicate dropdown entries when a task changes trigger source, and adds active/archived grouping for removed tasks. Moves BackgroundWorkerTask reads in the trigger hot path to the read replica. ([#3368](#3368)) - Public Access Tokens (PATs) minted before an API key rotation now keep working during the 24h grace window. `validatePublicJwtKey` falls back to any non-expired `RevokedApiKey` rows for the signing environment when the primary signature check against the env's current `apiKey` fails. The fallback query only runs on the failure path, so the hot success path is unchanged. ([#3464](#3464)) - Batch items that hit the environment queue size limit now fast-fail without retries and without creating pre-failed TaskRuns. ([#3352](#3352)) - Show the cancel button in the runs list for runs in `DEQUEUED` status. `DEQUEUED` was missing from `NON_FINAL_RUN_STATUSES` so the list hid the button even though the single run page allowed it. 
([#3421](#3421)) - Reduce 5xx feedback loops on hot debounce keys by quantizing `delayUntil`, adding an unlocked fast-path skip, and gracefully handling redlock contention in `handleDebounce` so the SDK no longer retries into a herd. ([#3453](#3453)) - Fix RSS memory leak in the realtime proxy routes. `/realtime/v1/runs`, `/realtime/v1/runs/:id`, and `/realtime/v1/batches/:id` called `fetch()` into Electric with no abort signal, so when a client disconnected mid long-poll, undici kept the upstream socket open and buffered response chunks that would never be consumed — retained only in RSS, invisible to V8 heap tooling. Thread `getRequestAbortSignal()` through `RealtimeClient.streamRun/streamRuns/streamBatch` to `longPollingFetch` and cancel the upstream body in the error path. Isolated reproducer showed ~44 KB retained per leaked request; signal propagation releases it cleanly. ([#3442](#3442)) - Fix memory leak where every aborted SSE connection pinned the full request/response graph on Node 20, caused by `AbortSignal.any()` in `sse.ts` retaining its source signals indefinitely (see nodejs/node#54614, nodejs/node#55351). Also clear the `setTimeout(abort)` timer in `entry.server.tsx` so successful HTML renders don't pin the React tree for 30s per request. ([#3430](#3430)) - Preserve filters on the queues page when submitting modal actions. ([#3471](#3471)) - Fix Redis connection leak in realtime streams and broken abort signal propagation. **Redis connections**: Non-blocking methods (ingestData, appendPart, getLastChunkIndex) now share a single Redis connection instead of creating one per request. streamResponse still uses dedicated connections (required for XREAD BLOCK) but now tears them down immediately via disconnect() instead of graceful quit(), with a 15s inactivity fallback. **Abort signal**: request.signal is broken in Remix/Express due to a Node.js undici GC bug (nodejs/node#55428) that severs the signal chain when Remix clones the Request internally. 
Added getRequestAbortSignal() wired to Express res.on("close") via httpAsyncStorage, which fires reliably on client disconnect. All SSE/streaming routes updated to use it. ([#3399](#3399)) - Prevent dashboard crash (React error #31) when span accessory item text is not a string. Filters out malformed accessory items in SpanCodePathAccessory instead of passing objects to React as children. ([#3400](#3400)) - Upgrade Remix packages from 2.1.0 to 2.17.4 to address security vulnerabilities in React Router ([#3372](#3372)) - Fix Vercel integration settings page (remove redundant section toggles) and improve the Vercel onboarding flow so the modal closes after connecting a GitHub repo and the marketplace `next` URL is preserved across the GitHub app install redirect. ([#3424](#3424)) <details> <summary>Raw changeset output</summary> # Releases ## @trigger.dev/build@4.4.5 ### Patch Changes - Updated dependencies: - `@trigger.dev/core@4.4.5` ## trigger.dev@4.4.5 ### Patch Changes - Add `--no-browser` flag to `init` and `login` to skip auto-opening the browser during authentication. Also error loudly when `init` is run without `--yes` under non-TTY stdin (previously default-and-exited silently, leaving the project half-initialized). Both commands now show an `Examples` section in `--help`. ([#3483](#3483)) - Updated dependencies: - `@trigger.dev/core@4.4.5` - `@trigger.dev/build@4.4.5` - `@trigger.dev/schema-to-json@4.4.5` ## @trigger.dev/core@4.4.5 ### Patch Changes - Add `isReplay` boolean to the run context (`ctx.run.isReplay`), derived from the existing `replayedFromTaskRunFriendlyId` database field. Defaults to `false` for backwards compatibility. ([#3454](#3454)) - Redact the `resolveWaitpoint` runtime log so it only emits `id` and `type` instead of the full completed waitpoint. Previously the log printed the entire waitpoint (including `output`) to stdout in production runs, which could leak sensitive payloads. The value returned by `wait.forToken()` is unchanged. 
([#3490](#3490)) - Add `SessionId` friendly ID generator and schemas for the new durable Session primitive. Exported from `@trigger.dev/core/v3/isomorphic` alongside `RunId`, `BatchId`, etc. Ships the `CreateSessionStreamWaitpoint` request/response schemas alongside the main Session CRUD. ([#3417](#3417)) - Truncate large error stacks and messages to prevent OOM crashes. Stack traces are capped at 50 frames (keeping top 5 + bottom 45 with an omission notice), individual stack lines at 1024 chars, and error messages at 1000 chars. Applied in parseError, sanitizeError, and OTel span recording. ([#3405](#3405)) ## @trigger.dev/python@4.4.5 ### Patch Changes - Updated dependencies: - `@trigger.dev/core@4.4.5` - `@trigger.dev/build@4.4.5` - `@trigger.dev/sdk@4.4.5` ## @trigger.dev/react-hooks@4.4.5 ### Patch Changes - Updated dependencies: - `@trigger.dev/core@4.4.5` ## @trigger.dev/redis-worker@4.4.5 ### Patch Changes - Updated dependencies: - `@trigger.dev/core@4.4.5` ## @trigger.dev/rsc@4.4.5 ### Patch Changes - Updated dependencies: - `@trigger.dev/core@4.4.5` ## @trigger.dev/schema-to-json@4.4.5 ### Patch Changes - Updated dependencies: - `@trigger.dev/core@4.4.5` ## @trigger.dev/sdk@4.4.5 ### Patch Changes - Updated dependencies: - `@trigger.dev/core@4.4.5` </details> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>