feat: Sessions - bidirectional durable agent streams#3417

Merged
ericallam merged 23 commits into main from
feature/tri-8627-session-primitive-server-side-schema-routes-clickhouse
Apr 28, 2026

@ericallam ericallam commented Apr 20, 2026

⚠️ Not released yet. This PR is the server-side foundation only. The SDK changes that customers will actually use (chat.agent migration, chat.createStartSessionAction, useTriggerChatTransport updates) live on a separate branch and ship together in an upcoming @trigger.dev/sdk prerelease. Until that prerelease is published, this surface is reachable only via direct HTTP.

What this gives Trigger.dev users

A new first-class primitive, Session, for durable, task-bound, bidirectional I/O that outlives any single run. Sessions are the run manager for chat.agent going forward, and they unblock anything else that needs "one identifier, many runs over time" with a stable channel pair the client can write to and subscribe to.

Use cases unblocked

  • Chat agents that persist across many runs. One session per chat (keyed on your own chatId via externalId), turns 1..N attach to the same Session, the UI subscribes once and keeps receiving output as new runs take over.
  • Approval loops and long-running tasks with user feedback. The task waits on .in, the client writes to .in, the server enforces no-writes-after-close.
  • Workflow progress streams that live past the run. Subscribe to .out after the task finishes to replay history.
  • Resume-next-day flows. A session is a durable row, not a transient stream. Send a message a day later and the server triggers a fresh run on the same session.

How it works (Session-as-run-manager)

A Session row is task-bound (taskIdentifier + triggerConfig are required) and owns its current run via currentRunId + currentRunVersion for optimistic claim. Three trigger paths:

  1. Session create: POST /api/v1/sessions creates the row and triggers the first run synchronously.
  2. Append-time probe: POST /realtime/v1/sessions/:session/in/append checks if the current run is alive; if it has terminated (idle exit, crash, etc.), the server triggers a new run before processing the append.
  3. End-and-continue handoff: POST /api/v1/sessions/:session/end-and-continue, called by the running agent, triggers a fresh run and atomically swaps currentRunId. Used by chat.requestUpgrade() for version handoffs.

Every triggered run is recorded in the SessionRun audit table with a reason (initial, continuation, upgrade, manual).

Public API surface

Control plane

  • POST /api/v1/sessions — create. Idempotent on (env, externalId). Triggers the first run, returns the session and a session-scoped public access token. Returns 409 if the upserted row is already closed.
  • GET /api/v1/sessions/:session — retrieve by friendlyId (session_abc...) or by your own externalId (server disambiguates by prefix).
  • GET /api/v1/sessions — list with filters (type, tag, taskIdentifier, externalId, derived status ACTIVE/CLOSED/EXPIRED, created-at range) and cursor pagination. Backed by ClickHouse.
  • PATCH /api/v1/sessions/:session — update tags / metadata / externalId.
  • POST /api/v1/sessions/:session/close — terminate. Idempotent, hard-blocks new server-brokered writes.
  • POST /api/v1/sessions/:session/end-and-continue — agent-only handoff to a fresh run.
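The prefix-based disambiguation mentioned for the retrieve route can be sketched roughly as below. This is an illustration only, not the server's code: the helper names `classifySessionParam` and `isAllowedExternalId` are hypothetical, and the non-empty check is an assumption. Only the `session_` prefix rule comes from the PR text.

```typescript
// Hypothetical helpers; only the "session_" prefix rule is taken from the PR text.
const FRIENDLY_ID_PREFIX = "session_";

type SessionLookup =
  | { kind: "friendlyId"; value: string }
  | { kind: "externalId"; value: string };

// Decide which column a `:session` URL parameter should be looked up against.
function classifySessionParam(param: string): SessionLookup {
  return param.startsWith(FRIENDLY_ID_PREFIX)
    ? { kind: "friendlyId", value: param }
    : { kind: "externalId", value: param };
}

// A user-supplied externalId must not look like a friendlyId, otherwise later
// lookups would be routed to the friendlyId column and miss the row.
// (Rejecting the empty string is an assumption made for this sketch.)
function isAllowedExternalId(externalId: string): boolean {
  return externalId.length > 0 && !externalId.startsWith(FRIENDLY_ID_PREFIX);
}
```

Under this sketch, a chat keyed as `chat-42` resolves through the externalId path, while `session_abc123` resolves through the friendlyId path.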

Realtime

  • PUT /realtime/v1/sessions/:session/:io — initialize a channel. Returns S2 credentials in headers so high-throughput clients can write direct to S2.
  • GET /realtime/v1/sessions/:session/:io — SSE subscribe. Supports Last-Event-ID resume and an opt-in X-Peek-Settled: 1 header that fast-closes the stream when the upstream is already settled (trigger:turn-complete), eliminating long-poll wait on reconnect-on-reload paths.
  • POST /realtime/v1/sessions/:session/:io/append — server-side appends.
  • POST /api/v1/runs/:runFriendlyId/session-streams/wait — runs wait on a session stream as a waitpoint, with a race-check to avoid suspending if data already landed.
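As a rough illustration of the subscribe route, the sketch below assembles the request pieces without sending anything. The function name `buildSubscribeRequest`, the `baseUrl` parameter, and the use of a Bearer `Authorization` header are assumptions for this example; the path, `Last-Event-ID`, and `X-Peek-Settled: 1` come from the list above.

```typescript
// Sketch only: describes a subscribe request, it does not perform it.
type SubscribeRequest = {
  method: "GET";
  url: string;
  headers: Record<string, string>;
};

function buildSubscribeRequest(
  baseUrl: string, // assumed API origin, e.g. your Trigger.dev instance
  session: string, // friendlyId or externalId
  io: "in" | "out",
  accessToken: string, // assumed to be sent as a Bearer token
  opts: { lastEventId?: string; peekSettled?: boolean } = {}
): SubscribeRequest {
  const headers: Record<string, string> = {
    Accept: "text/event-stream",
    Authorization: `Bearer ${accessToken}`,
  };
  // Resume from the last record the client saw, if any.
  if (opts.lastEventId !== undefined) headers["Last-Event-ID"] = opts.lastEventId;
  // Opt in to the fast-close behaviour when the stream is already settled.
  if (opts.peekSettled) headers["X-Peek-Settled"] = "1";

  const origin = baseUrl.replace(/\/+$/, "");
  const url = `${origin}/realtime/v1/sessions/${encodeURIComponent(session)}/${io}`;
  return { method: "GET", url, headers };
}
```

A reconnect-on-reload path would typically pass both `lastEventId` and `peekSettled: true`, so a settled stream closes quickly instead of long-polling.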

Auth scopes

sessions is a new resource type. read:sessions:{id}, write:sessions:{id}, admin:sessions:{id} flow through the existing JWT validator. Session-scoped public access tokens minted by the server replace browser-held trigger-task tokens for chat-style flows — the browser never sees a run identifier or a run-scoped token in steady state.
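The scope shape above can be illustrated with a small check. This is a sketch under stated assumptions, not the existing JWT validator: `sessionScope` and `tokenAuthorizes` are hypothetical names, exact string matching is assumed, and no implication between actions (for example admin implying read) is modelled.

```typescript
type SessionAction = "read" | "write" | "admin";

// Scope string in the shape quoted above: "<action>:sessions:<id>".
function sessionScope(action: SessionAction, id: string): string {
  return `${action}:sessions:${id}`;
}

// Assumed semantics for this sketch: the token authorizes the action if it
// holds the exact scope for ANY identifier the session is addressable by
// (friendlyId or externalId).
function tokenAuthorizes(
  scopes: readonly string[],
  action: SessionAction,
  sessionIds: readonly string[]
): boolean {
  return sessionIds.some((id) => scopes.indexOf(sessionScope(action, id)) !== -1);
}
```

For example, a token holding `read:sessions:chat-42` would pass a read check for a session whose identifiers are `session_abc` and `chat-42`, and fail a write check.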

What's coming after this PR

  • SDK + chat.agent migration: separate branch, separate PR, ships in the next @trigger.dev/sdk prerelease alongside this server deploy. Customers using the prerelease chat.agent will follow the upgrade guide.
  • Dashboard surfaces: dedicated agent list, agent playground, agent view on the run dashboard. Tracking separately.

Implementation notes

  • Postgres Session table: scalar scoping columns (projectId, runtimeEnvironmentId, environmentType, organizationId) without FKs, matching the January TaskRun FK-removal decision. Point-lookup indexes only — list queries go to ClickHouse. Terminal markers (closedAt, expiresAt) are write-once.
  • ClickHouse sessions_v1: ReplacingMergeTree, partitioned by month, ordered by (org_id, project_id, environment_id, created_at, session_id). Tags indexed via tokenbf_v1 skip index.
  • SessionsReplicationService: mirrors RunsReplicationService exactly — leader-locked logical replication consumer, ConcurrentFlushScheduler, retry with exponential backoff + jitter, identical metric shape. Dedicated slot + publication so the two consume independently.
  • S2 keys: sessions/{addressingKey}/{out|in}. The existing runs/{runId}/{streamId} key format for run-scoped streams is untouched.
  • Optimistic claim: ensureRunForSession triggers a run upfront (cheap to cancel if it loses the race), then attempts an updateMany keyed on currentRunVersion. Loser cancels its triggered run and reuses the winner's. No DB lock held across the trigger.
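The optimistic claim in the last note can be modelled in memory as follows. This is a simplified, synchronous sketch: the real `ensureRunForSession` is asynchronous and talks to Postgres, and the names `tryClaim`, `ensureRun`, `triggerRun`, and `cancelRun` are hypothetical stand-ins. Only the field names `currentRunId` and `currentRunVersion` come from the PR text.

```typescript
// In-memory stand-in for the Session row.
type SessionRow = { currentRunId: string | null; currentRunVersion: number };

// Models an `updateMany` keyed on currentRunVersion: the write applies only
// if nobody bumped the version since the caller read it.
function tryClaim(row: SessionRow, expectedVersion: number, runId: string): boolean {
  if (row.currentRunVersion !== expectedVersion) return false; // lost the race
  row.currentRunId = runId;
  row.currentRunVersion = expectedVersion + 1;
  return true;
}

// Trigger first, claim second. The loser cancels its own run and reuses the
// winner's, so no lock is held while the run is being triggered.
function ensureRun(
  row: SessionRow,
  observedVersion: number,
  triggerRun: () => string,
  cancelRun: (runId: string) => void
): string {
  const candidate = triggerRun();
  if (tryClaim(row, observedVersion, candidate)) return candidate;
  cancelRun(candidate);
  if (row.currentRunId === null) {
    throw new Error("claim lost but no winning run is recorded");
  }
  return row.currentRunId;
}
```

If two callers both observe version 0, the first claim bumps the version to 1 and the second caller's claim fails, so it cancels its own run and returns the first caller's run id.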

What did NOT change

Run-scoped streams.pipe / streams.input and the existing /realtime/v1/streams/{runId}/... routes are unchanged. Sessions are net-new — not a reshaping of the current streams API.

Deploy notes

  • Set SESSION_REPLICATION_CLICKHOUSE_URL and SESSION_REPLICATION_ENABLED=1 to enable the replication consumer.
  • The Session table needs REPLICA IDENTITY FULL set on the prod source DB before the publication is created (same one-time DDL we did for TaskRun). Required for delete events to carry full column values.
  • The GET /api/v1/sessions/:session loader uses cross-form authorization: a JWT minted for either form authorizes both URL forms. Action routes are URL-form-specific, matching how the SDK mints PATs.

Verification

  • Webapp typecheck clean (10/10).
  • apps/webapp/test/sessionsReplicationService.test.ts — round-trip tests for insert/update/delete through Postgres logical replication into ClickHouse via testcontainers.
  • Live end-to-end against local dev: create + retrieve (both forms) + update + close, .out.initialize + .out.append x2 + .in.send + .out.subscribe over SSE, list with all filter combinations + pagination, end-and-continue swap, X-Peek-Settled fast-close (verified in browser via reconnect-on-reload and via curl). Replicated row lands in ClickHouse within ~1s.
  • Multi-round Devin + CodeRabbit review feedback addressed (read-after-write paths use prisma writer, info-leak on auth-routes masked as 403, peek-settled discriminator parsing fix, etc.).

Test plan

  • pnpm run typecheck --filter webapp
  • pnpm run test --filter webapp ./test/sessionsReplicationService.test.ts --run
  • Start the webapp with SESSION_REPLICATION_CLICKHOUSE_URL and SESSION_REPLICATION_ENABLED=1. Confirm the slot and publication auto-create on boot.
  • POST /api/v1/sessions and verify the row replicates to trigger_dev.sessions_v1 within a couple of seconds.
  • POST /api/v1/sessions/:id/close, then confirm POST /realtime/v1/sessions/:id/out/append returns 400.
  • Reuse a closed session's externalId on POST /api/v1/sessions and confirm 409.
  • GET /realtime/v1/sessions/:id/out with X-Peek-Settled: 1 after a turn completes and confirm X-Session-Settled: true response header + immediate close.

changeset-bot Bot commented Apr 20, 2026

🦋 Changeset detected

Latest commit: 188fa43

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 29 packages
Name Type
@trigger.dev/core Patch
@trigger.dev/build Patch
trigger.dev Patch
@trigger.dev/python Patch
@trigger.dev/redis-worker Patch
@trigger.dev/schema-to-json Patch
@trigger.dev/sdk Patch
@internal/cache Patch
@internal/clickhouse Patch
@internal/llm-model-catalog Patch
@internal/redis Patch
@internal/replication Patch
@internal/run-engine Patch
@internal/schedule-engine Patch
@internal/testcontainers Patch
@internal/tracing Patch
@internal/tsql Patch
@internal/zod-worker Patch
d3-chat Patch
references-d3-openai-agents Patch
references-nextjs-realtime Patch
references-realtime-hooks-test Patch
references-realtime-streams Patch
references-telemetry Patch
@internal/sdk-compat-tests Patch
@trigger.dev/react-hooks Patch
@trigger.dev/rsc Patch
@trigger.dev/database Patch
@trigger.dev/otlp-importer Patch

coderabbitai Bot commented Apr 20, 2026

Walkthrough

Introduces a durable Session primitive end-to-end: a new Prisma Session model and migration, a ClickHouse sessions_v1 table and query/insert helpers, ClickHouse-backed SessionsRepository, a SessionsReplicationService that streams Postgres logical replication into ClickHouse (with retry/ack/flush/leader-lock logic), session-friendly ID export (SessionId) and API Zod schemas, multiple REST and realtime routes for session CRUD, streaming and append, session-stream waitpoint support with Redis-backed pending sets, environment config and startup wiring, helper utilities, and end-to-end replication tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 35.00% which is insufficient. The required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check: ✅ Passed. The title 'feat: Sessions - bidirectional durable agent streams' clearly summarizes the main change, specifying the new Sessions feature with its core capability of bidirectional streaming.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Description check: ✅ Passed. The pull request provides a comprehensive description covering objectives, use cases, public API surface, implementation notes, and verification steps.

coderabbitai[bot]

This comment was marked as resolved.

@ericallam ericallam force-pushed the feature/tri-8627-session-primitive-server-side-schema-routes-clickhouse branch from 2210fe2 to 4cadc19 Compare April 23, 2026 09:10

Durable, typed, bidirectional I/O primitive that outlives a single run.
Ship target is agent/chat use cases; run-scoped streams.pipe/streams.input
are untouched and do not create Session rows.

Postgres
- New Session table: id, friendlyId, externalId, type (plain string),
  denormalised project/environment/organization scalar columns (no FKs),
  taskIdentifier, tags String[], metadata Json, closedAt, closedReason,
  expiresAt, timestamps
- Point-lookup indexes only (friendlyId unique, (env, externalId) unique,
  expiresAt). List queries are served from ClickHouse so Postgres stays
  minimal and insert-heavy.

Control-plane API
- POST   /api/v1/sessions           create (idempotent via externalId)
- GET    /api/v1/sessions           list with filters (type, tag,
                                     taskIdentifier, externalId, status
                                     ACTIVE|CLOSED|EXPIRED, period/from/to)
                                     and cursor pagination, ClickHouse-backed
- GET    /api/v1/sessions/:session  retrieve — polymorphic: `session_` prefix
                                     hits friendlyId, otherwise externalId
- PATCH  /api/v1/sessions/:session  update tags/metadata/externalId
- POST   /api/v1/sessions/:session/close  terminal close (idempotent)

Realtime (S2-backed)
- PUT    /realtime/v1/sessions/:session/:io           returns S2 creds
- GET    /realtime/v1/sessions/:session/:io           SSE subscribe
- POST   /realtime/v1/sessions/:session/:io/append    server-side append
- S2 key format: sessions/{friendlyId}/{out|in}

Auth
- sessions added to ResourceTypes. read:sessions:{id},
  write:sessions:{id}, admin:sessions:{id} scopes work via existing JWT
  validation.

ClickHouse
- sessions_v1 ReplacingMergeTree table
- SessionsReplicationService mirrors RunsReplicationService exactly:
  logical replication with leader-locked consumer, ConcurrentFlushScheduler,
  retry with exponential backoff + jitter, identical metric shape.
  Dedicated slot + publication (sessions_to_clickhouse_v1[_publication]).
- SessionsRepository + ClickHouseSessionsRepository expose list, count,
  tags with cursor pagination keyed by (created_at DESC, session_id DESC).
- Derived status (ACTIVE/CLOSED/EXPIRED) computed from closed_at + expires_at;
  in-memory fallback on list results to catch pre-replication writes.
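The derived status can be sketched as a pure function of the two terminal markers. The function name `deriveStatus` and the precedence (a closed session reports CLOSED even if it has also expired) are assumptions for illustration; the PR text only says the status is computed from closed_at and expires_at.

```typescript
type SessionStatus = "ACTIVE" | "CLOSED" | "EXPIRED";

// Derive a status from the terminal markers at a given point in time.
function deriveStatus(
  closedAt: Date | null,
  expiresAt: Date | null,
  now: Date
): SessionStatus {
  // Assumed precedence: an explicit close wins over expiry.
  if (closedAt !== null) return "CLOSED";
  if (expiresAt !== null && expiresAt.getTime() <= now.getTime()) return "EXPIRED";
  return "ACTIVE";
}
```

Because the function only needs the row's own columns, the same logic can run over list results in memory to cover writes that have not replicated yet.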

Verification
- Webapp typecheck 10/10
- Core + SDK build 3/3
- sessionsReplicationService.test.ts integration tests 2/2 (insert + update
  round-trip via testcontainers)
- Live round-trip against local dev: create -> retrieve (friendlyId and
  externalId) -> out.initialize -> out.append x2 -> in.send -> out.subscribe
  (receives records) -> close -> ClickHouse sessions_v1 shows the replicated
  row with closed_reason
- Live list smoke: tag, type, status CLOSED, externalId, and cursor pagination

…te/update

The session_ prefix identifies internal friendlyIds. Allowing it in a
user-supplied externalId would misroute subsequent GET/PATCH/close
requests through resolveSessionByIdOrExternalId to a friendlyId lookup,
returning null or the wrong session. Reject at the schema boundary so
both routes surface a clean 422.

Without allowJWT/corsStrategy, frontend clients holding public access
tokens hit 401 on GET /api/v1/sessions and browser preflights fail.
Matches the single-session GET/PATCH/close routes and the runs list
endpoint.

- Derive isCached from the upsert result (id mismatch = pre-existing row)
  instead of doing a separate findFirst first. The pre-check was racy —
  two concurrent first-time POSTs could both return 201 with
  isCached: false. Using the returned row's id is atomic and saves a
  round-trip.

- Scope the list endpoint's authorization to the standard action/resource
  pattern (matches api.v1.runs.ts): task-scoped JWTs can list sessions
  filtered by their task, and broader super-scopes (read:sessions,
  read:all, admin) authorize unfiltered listing.

- Log and swallow unexpected errors on POST rather than returning the
  raw error.message. Prisma/internal messages can leak column names and
  query fragments.

Give Session channels run-engine waitpoint semantics so a task can
suspend while idle on a session channel and resume when an external
client sends a record — parallel to what streams.input offers
run-scoped streams.

Webapp
- POST /api/v1/runs/:runFriendlyId/session-streams/wait — creates a
  manual waitpoint attached to {sessionId, io} and race-checks the S2
  stream starting at lastSeqNum so pre-arrived data fires it
  immediately. Mirrors the existing input-stream waitpoint route.
- sessionStreamWaitpointCache.server.ts — Redis set keyed on
  {sessionFriendlyId, io}, drained atomically on each append so
  concurrent multi-tab waiters all wake together.
- realtime.v1.sessions.$session.$io.append now drains pending
  waitpoints after every record lands and completes each with the
  appended body.
- S2RealtimeStreams.readSessionStreamRecords — session-channel
  parallel of readRecords, feeds the race-check path.

Core
- CreateSessionStreamWaitpoint request/response schemas alongside
  the existing Session CRUD schemas. Server API contract only —
  the client ApiClient + SDK wrapper ship on the AI-chat branch.

Two fixes needed by browser clients hitting the public session API
(TriggerChatTransport's direct accessToken path, WebSocket-less
session drivers, anything origin'd off the dashboard):

- POST /api/v1/sessions: allowJWT: true + corsStrategy: "all" on
  the action. Pre-fix, the create endpoint only accepted secret-key
  auth, so any browser-originated sessions.create(...) 401'd. The
  loader (list) already had these; matches that shape.
- POST /realtime/v1/sessions/:session/:io/append: export both
  { action, loader } so Remix routes the OPTIONS preflight to the
  route builder's CORS handler. With only { action } exported, the
  preflight returns 400 'No loader for route' and Chrome surfaces
  the follow-up POST as net::ERR_FAILED. Same pattern as
  /api/v1/tasks/:id/trigger (which already exports both).

Validated by an end-to-end UI smoke on references/ai-chat:
new chat → send → streamed assistant reply in ~4s → second turn
reuses the same session + run, lastEventId advances 10 → 21.

@ericallam ericallam force-pushed the feature/tri-8627-session-primitive-server-side-schema-routes-clickhouse branch from f4406d7 to 4f2c0e7 Compare April 23, 2026 17:07

Nine fixes from CodeRabbit + Devin review:

- api.v1.sessions.$session.close.ts:
  - Export { action, loader } so CORS preflight reaches the route
    builder's OPTIONS handler. Same fix already applied to the
    append route — Devin caught that I'd missed this one. Without
    the loader, browser clients hitting POST /close fail preflight.
  - Switch to `prisma.session.updateMany({ where: { id, closedAt:
    null }, ... })` so concurrent closes can't overwrite the
    original `closedAt` / `closedReason`. Loser hits count === 0 and
    re-reads the winning row — closedness is write-once at the DB
    level. (CodeRabbit: TOCTOU.)

- entry.server.tsx:
  Wrap the async `sessionsReplicationInstance.shutdown` in a sync
  handler with `.catch(...)`. SIGTERM/SIGINT fire during process
  teardown and a rejection from `_replicationClient.stop()` would
  become an unhandled promise rejection. Matches the pattern in
  `dynamicFlushScheduler.server.ts`. (CodeRabbit: unhandled rejection
  risk.)

- api.v1.runs.$runFriendlyId.session-streams.wait.ts:
  - Swallowed race-check catch now logs `warn` with
    sessionFriendlyId / io / waitpointId / error. Silent failures in
    the S2-read / engine-complete / cache-remove path were
    indistinguishable from the expected cache-drain-on-append fast
    path.
  - Outer 500 path no longer forwards `error.message` (Prisma /
    engine / S2 internals could leak). Logs server-side and returns
    a generic "Something went wrong"; 422 ServiceValidationError
    path unchanged. (CodeRabbit: info-leak + logging gap.)

- realtime.v1.sessions.$session.$io.ts:
  Add `method: "PUT"` to the route config so the route builder
  enforces method validation before the handler runs. Removed the
  now-redundant `request.method !== "PUT"` check inside the handler.
  (CodeRabbit: defense-in-depth.)

- services/sessionsRepository/sessionsRepository.server.ts:
  `ISessionsRepository` is now a `type` alias, per repo coding
  guideline ("use types over interfaces"). Structural-typing means
  implementing classes don't need source changes. (CodeRabbit.)

- services/sessionStreamWaitpointCache.server.ts:
  Replace separate SADD + PEXPIRE with a single atomic Lua script.
  Solves two distinct concerns at once:

  1. Partial-failure window (CodeRabbit): if SADD succeeded and
     PEXPIRE failed, the key would persist with no TTL. The Lua
     script fails both or succeeds both.
  2. TTL-race (Devin, twice): each waitpoint registers with its own
     `ttlMs` derived from the caller's timeout. The old code called
     PEXPIRE unconditionally, so a short-TTL registration would
     shrink the shared key's TTL below a longer-TTL sibling —
     evicting the sibling from Redis and degrading the append-path
     fast drain to engine-timeout-only. The script only PEXPIREs if
     the new TTL is greater than the current PTTL (or the key has
     no TTL yet), so the key lives as long as the longest-TTL
     member.
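The extend-only TTL rule from the second point can be modelled in memory as below. This is not the Lua script itself: it is a single-threaded TypeScript sketch of the same decision, with hypothetical names (`PendingSet`, `registerWaitpoint`) and an explicit `nowMs` argument in place of Redis's clock. In Redis the add and the conditional expiry run inside one script, which is what makes them atomic.

```typescript
// In-memory model of one Redis set key with an optional expiry.
type PendingSet = { members: Set<string>; expiresAtMs: number | null };

// Add a waitpoint and extend the key's expiry only when the new TTL is
// longer than what remains, so a short-lived waiter cannot evict a
// longer-lived sibling sharing the same key.
function registerWaitpoint(
  key: PendingSet,
  waitpointId: string,
  ttlMs: number,
  nowMs: number
): void {
  key.members.add(waitpointId);
  const remainingMs = key.expiresAtMs === null ? null : key.expiresAtMs - nowMs;
  if (remainingMs === null || ttlMs > remainingMs) {
    key.expiresAtMs = nowMs + ttlMs;
  }
}
```

With this rule the key's expiry always tracks the longest-lived member registered so far.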

Outstanding: one unresolved thread asking to rename
`CloseSessionRequestBody.reason` → `closedReason` for symmetry with
the DB column. Holding that for an API-taste call — will follow up.

Validated: `pnpm run typecheck --filter webapp` clean.

Devin catch on #3417 — the ClickHouse sessions list was slicing
`sessionIds.slice(1, size + 1)` on the backward path, which skipped
the item closest to the cursor and surfaced the sentinel (the
`size+1`th item that proves hasMore=true) to the user.

Trace, with items c01…c11 and cursor=c07 (page size 3):
- Backward query: `session_id > c07 ORDER BY ASC LIMIT 4` →
  `[c08, c09, c10, c11]`. Legitimate content is the first three
  (`[c08, c09, c10]`); `c11` is the sentinel.
- Previous slice: `[c09, c10, c11]` → displayed DESC `[c11, c10, c09]`
  — user never sees c08, sees sentinel c11 instead.

Fix: collapse both directions to `sessionIds.slice(0, size)`. The
sentinel is always the last item regardless of direction, so the two
branches had no reason to diverge. Cursor computations
(`previousCursor = reversed.at(1)`, `nextCursor = reversed.at(size)`)
already line up with the corrected slice — no change needed there.
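The corrected slice can be replayed against the trace above. This is a minimal sketch: `pageFromQuery` is a hypothetical name, cursor computation is left out, and the input is assumed to be the raw query result in query order with the sentinel, if present, in the last position.

```typescript
// Turn a raw query result of up to size + 1 ids into a display page.
function pageFromQuery(
  sessionIds: readonly string[],
  size: number,
  direction: "forward" | "backward"
): { page: string[]; hasMore: boolean } {
  // The extra (size + 1)th row only proves that more rows exist.
  const hasMore = sessionIds.length > size;
  // Same slice for both directions: keep the first `size` rows, drop the sentinel.
  const kept = sessionIds.slice(0, size);
  // Backward queries come back in ascending order; flip them so the page is
  // displayed in descending order like forward pages.
  const page = direction === "backward" ? kept.slice().reverse() : kept;
  return { page, hasMore };
}

// Trace from the commit message: cursor = c07, page size 3, backward query.
const backwardResult = pageFromQuery(["c08", "c09", "c10", "c11"], 3, "backward");
console.log(backwardResult.page.join(","), backwardResult.hasMore);
```

With the old `slice(1, size + 1)` the same input would have kept `c09, c10, c11`, which is exactly the skipped-c08, leaked-sentinel behaviour described above.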

Verified: webapp typecheck clean.

/realtime/v1/sessions/:session/:io=out now peeks the tail record in S2
at connection time. When the tail chunk is trigger:turn-complete, the
agent has finished a turn and is either idle-waiting on .in or has
exited — either way no more chunks will arrive without further user
action. In that case the downstream S2 read switches to wait=0 so the
SSE drains and closes in ~1s instead of long-polling for 60s, and the
response carries X-Session-Settled: true so the client can tell the
close is terminal rather than a normal 60s cycle.

Mid-turn tails (streaming UIMessageChunks in flight) fall through to
the existing wait=60 long-poll. Crashed-mid-turn is indistinguishable
from live-streaming at this point and gets the same 60s retry loop as
today — that's a separate hardening, not in scope here.

The peek uses GET /records?tail_offset=1&count=1&wait=0 (single-digit
ms on S2), then unwraps the agent-side envelope written by
StreamsWriterV2: record.body parses to {data: <chunk>, id}, where
<chunk> is the raw UIMessageChunk object. No double-parse on data.

404 / 416 from the peek (stream never written / empty stream) short-
circuit to settled=false so first-connect on a freshly-created session
keeps the long-poll semantics the agent's first chunks depend on.

Verified end-to-end against an idle chat-agent-smoke session: caught-
up reconnect (Last-Event-ID = tail) closes in 1.08s with the header;
behind reconnect (Last-Event-ID < tail) drains remaining records then
closes in 0.94s with the header; empty-stream reconnect keeps the 60s
long-poll behavior unchanged.
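The decision described in this commit message can be summarised as a small pure function. This is a sketch with hypothetical names (`PeekResult`, `planRead`); it assumes the tail chunk's type has already been extracted from the record and only models the choice between `wait=0` and `wait=60`, plus the response header.

```typescript
// Outcome of peeking the tail record before starting the SSE read.
type PeekResult =
  | { status: 200; tailChunkType: string | null }
  | { status: 404 | 416 }; // stream never written / empty stream

type ReadPlan = {
  settled: boolean;
  waitSeconds: 0 | 60;
  responseHeaders: Record<string, string>;
};

function planRead(peek: PeekResult): ReadPlan {
  // 404 and 416 short-circuit to "not settled" so a fresh session keeps
  // the long-poll behaviour its first chunks depend on.
  const settled = peek.status === 200 && peek.tailChunkType === "trigger:turn-complete";
  const responseHeaders: Record<string, string> = {};
  if (settled) responseHeaders["X-Session-Settled"] = "true";
  return { settled, waitSeconds: settled ? 0 : 60, responseHeaders };
}
```

A mid-turn tail (any chunk type other than `trigger:turn-complete`) falls through to the 60-second wait, matching the behaviour described for live-streaming and crashed-mid-turn sessions.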

Session is now the run manager for chat.agent and any future task-bound
session. Atomically creates the row + triggers the first run + tracks
the current run via optimistic claim, with a SessionRun audit log for
provenance.

Schema:
- Session gains `taskIdentifier`, `triggerConfig` (JSON), `currentRunId`
  (non-FK), `currentRunVersion` (monotonic int for optimistic claim).
- New SessionRun audit table — one row per run a session triggers,
  with `reason: "initial" | "continuation" | "upgrade" | "manual"`.

Lifecycle:
- `POST /api/v1/sessions`: idempotent on `(env, externalId)`, refreshes
  triggerConfig on cache hit, runs `ensureRunForSession` (probe +
  optimistic claim), returns a session-scoped PAT. JWT auth path
  dropped — secret-key only. The customer's server is the only entry
  point for session creation.
- `POST /api/v1/sessions/:s/end-and-continue`: server-orchestrated
  handoff (cancels current run, triggers a fresh one, swaps
  currentRunId via `updateMany where currentRunVersion`). Powers
  `chat.requestUpgrade()` from inside the agent runtime.
- `POST /realtime/v1/sessions/:s/:io/append`: probe + ensureRunForSession
  before append so messages arriving while no run is alive boot one
  transparently.

Cross-form addressing on write paths:
- `createActionApiRoute` now runs `findResource` before `authorization`,
  matching `createLoaderApiRoute`. Action routes get an optional
  `resource` argument on `authorization.resource()` —
  backwards-compatible (existing 4-arg callbacks unchanged).
- Append + end-and-continue use the new ordering to authorize against
  `{paramSession, friendlyId, externalId}` so a JWT minted for either
  form authorizes either URL form.

Helpers:
- `mintSessionToken.server.ts`: server-side session-PAT factory
  (`read:sessions:{key} + write:sessions:{key}`, 1h TTL).
- `sessionRunManager.server.ts`: `ensureRunForSession` (probe + claim)
  and `swapSessionRun` (force handoff with optimistic claim +
  cancel-on-loss).

Pre-mutation existence reads switched to `$replica` (close, end-and-
continue, PATCH).

Three fixes after pushing the Sessions-as-run-manager commit:

- `api.v1.sessions.$session.end-and-continue.ts` was destructuring only
  `{ action }` from `createActionApiRoute`, which means Remix had no
  handler for OPTIONS preflight on this route. Browser CORS would 405.
  Sibling routes (`close.ts`) already export `{ action, loader }`. Fix:
  destructure and export both.

- `ensureRunForSession`'s pathological "lost the claim race AND the
  winner's run was already terminal" branch recursed without bound. In
  practice progress through the run engine bounds it, but a misconfigured
  task that crashes before being dequeued could blow the stack. Add a
  hidden `_attempt` counter, throw `SessionRunManagerError` once it
  exceeds 3.

- `sessionsReplicationService.test.ts` was failing in CI because the
  sessions-as-run-manager schema migration made `taskIdentifier` and
  `triggerConfig` required on `Session`. The two `prisma.session.create`
  calls in the test predate the migration. Add the now-required fields
  to both fixtures.

Two fixes from Devin review on the sessions-as-run-manager commit:

- `SessionItem.currentRunId`'s contract is the `run_*` friendlyId, but
  `serializeSession` returns the raw Prisma cuid. The `POST /sessions`
  create path overrides correctly via a TaskRun lookup, but GET, PATCH,
  and the three return paths in close.ts were passing the cuid through.
  A consumer using `currentRunId` from those endpoints in a downstream
  `GET /api/v1/runs/:runId` call would 404. Add a
  `serializeSessionWithFriendlyRunId` helper next to `serializeSession`
  that resolves via `$replica.taskRun.findFirst` (TaskRun friendlyIds
  are immutable, so replica lag is harmless), and switch the five
  affected return sites to use it. List endpoints stay on
  `serializeSession` to avoid N+1 lookups when paginating. The create
  endpoint keeps its existing manual lookup because it also needs the
  friendlyId for the response's `runId` field, and `session.currentRunId`
  is stale relative to the post-`ensureRunForSession` claim outcome.

- Drop dead `lastChunkType` recomputation in
  `streamResponseFromSessionStream`. The variable was bound but never
  used; the conditional below it re-evaluated the same expression.
  Use the bound value in the condition.

Collapse `session-out-settled-signal.md` and `sessions-public-api-cors.md`
into the single `session-primitive.md`, and rewrite that one to a high-
level two-sentence summary that covers everything actually shipping in
this PR (sessions-as-run-manager, end-and-continue, waitpoints, etc.).
The CORS/JWT-on-create story is also out of date now that POST
/api/v1/sessions is secret-key only.

…friendlyId

Switch the two read-after-write taskRun lookups (POST /api/v1/sessions
and POST /api/v1/sessions/:s/end-and-continue) from $replica back to
prisma. Both reads happen immediately after triggering a run on the
writer; replica lag would null the result and turn a successful create
into a 500, or fall back to leaking the internal cuid in the
end-and-continue response.

…n sessionRunManager

The lost-race re-read in ensureRunForSession and swapSessionRun reads
the Session row that the winner just wrote on the writer. Reading from
$replica could return pre-race state and either (1) cause
ensureRunForSession to recurse with a stale currentRunVersion, fail the
next claim, and waste runs until max-attempts; or (2) cause
swapSessionRun to return swapped: false with the calling run's own id,
misleading the caller into thinking it is still authoritative.

The S2 record envelope wraps the agent-written chunk as
{data: <chunkAsString>, id: partId} because StreamsWriterV2 hands
appendPart an already-stringified chunk. The peek-settled check
treated envelope.data as an object, so typeof === 'object' always
returned false and the trigger:turn-complete sentinel was never
matched. Reconnect-on-reload silently degraded to the full long-poll
path. Parse envelope.data once more so the type discriminator is
surfaced.
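The extra parse described here can be sketched as follows. The helper name `tailChunkType` is hypothetical, and reading the discriminator from a `type` field is an assumption based on the "type discriminator" wording; the envelope shape `{data: <chunkAsString>, id}` comes from the commit message. The sketch also tolerates `data` already being an object, which is a defensive choice rather than something the source states.

```typescript
// Extract the chunk's type discriminator from a raw S2 record body.
// Returns null when the body is not in the expected envelope shape.
function tailChunkType(recordBody: string): string | null {
  try {
    const envelope: unknown = JSON.parse(recordBody);
    if (typeof envelope !== "object" || envelope === null) return null;

    const data = (envelope as { data?: unknown }).data;
    // The chunk arrives already stringified, so `data` needs one more parse.
    const chunk: unknown = typeof data === "string" ? JSON.parse(data) : data;
    if (typeof chunk !== "object" || chunk === null) return null;

    const type = (chunk as { type?: unknown }).type;
    return typeof type === "string" ? type : null;
  } catch {
    // Malformed JSON at either level is treated as "unknown chunk".
    return null;
  }
}
```

A caller would then compare the result against `trigger:turn-complete` to decide whether the stream is settled.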

… run lookup

Same read-after-write pattern as the other lost-race re-reads:
the run was just triggered on the writer milliseconds before, so a
$replica.findFirst can return null due to replication lag. The null
silently no-ops the cancellation and leaks an orphan run that no
session will ever claim.

When the upsert path returns a previously-closed row, return 409 before
ensureRunForSession fires. Otherwise we'd trigger a fresh run on a
closed session that can't receive .in input (append handler rejects
writes to closed sessions), wasting compute on a run that exits the
moment it tries to read. close is one-way; callers must use a different
externalId to start a new session.
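The ordering can be sketched as follows. This is a simplified in-memory illustration: the `closedAt` field, the `createSession` / `closeSession` helpers and the 200 success status are assumptions made for the example, and only the 409 and the `ensureRunForSession` name come from the description above.

```typescript
// Hypothetical session row; `closedAt` is an assumed field name.
type SessionRow = { externalId: string; closedAt: Date | null };

const sessionsByExternalId: Record<string, SessionRow> = {};
const triggeredFor: string[] = [];

// Stand-in for the real run trigger; it only records that a run was started.
function ensureRunForSession(session: SessionRow): void {
  triggeredFor.push(session.externalId);
}

function closeSession(externalId: string): void {
  const session = sessionsByExternalId[externalId];
  if (session) session.closedAt = new Date();
}

function createSession(externalId: string): { status: number } {
  // Upsert: reuse the existing row for this externalId if there is one.
  let session = sessionsByExternalId[externalId];
  if (!session) {
    session = { externalId, closedAt: null };
    sessionsByExternalId[externalId] = session;
  }
  // Reject closed sessions before any run is triggered.
  if (session.closedAt !== null) return { status: 409 };
  ensureRunForSession(session);
  return { status: 200 };
}

const created = createSession("chat-1"); // triggers a run
closeSession("chat-1");
const rejected = createSession("chat-1"); // 409, no second run
```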
The race-check in api.v1.runs.$runFriendlyId.session-streams.wait was
selecting the realtime stream instance via run.realtimeStreamsVersion,
but session streams are always v2 (S2) — the writer (appendPartToSessionStream)
and the SSE subscribe both hardcode v2. For a v1 run the race-check
silently fell back to a non-S2 instance, the instanceof check missed,
and the optimization was skipped. Hardcode v2 for parity with the rest
of the session surface.
…ized routes

createActionApiRoute now runs findResource before authorization so the
auth scope check can expand to alternate identifiers of the resolved
resource (Sessions are addressable by both friendlyId and externalId).
Side-effect: an authenticated-but-underscoped caller could probe
resource existence by observing 404 vs 403. Mask the 404 as 403 with
the same response shape as the auth-failed branch when the route
declares authorization, so the two cases are indistinguishable to
callers without scopes. Routes without authorization keep returning
404.

Previous fix unconditionally returned 403 when findResource was null on
a route with authorization, breaking PRIVATE-key callers (e.g. server
SDK) hitting the existing api.v2.runs.cancel route — they always pass
authorization but the new code returned 403 with a factually wrong
message ('Unauthorized: missing required scopes') even though they had
full permissions.

New ordering: run authorization first (with the resolved resource as
the 5th arg, so cross-form session auth still works), then check
resource-null → 404. This gives:
- PRIVATE key + missing resource: auth passes → 404 (correct)
- Underscoped JWT + missing resource: auth fails (resource not in
  scope) → 403 (no info leak vs existing resource)
- Underscoped JWT + existing resource: auth fails → 403 (unchanged)

Only auth callbacks that destructure the resource (loader for
realtime.v1.sessions.$session.$io) need to handle null — they all
already do, since findResource was already nullable in pre-PR
loaders.
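The three outcomes above follow from running authorization before the null check. The sketch below is illustrative only: the `Caller` type, the `read:sessions:<id>` scope format and the `statusFor` helper are hypothetical and are not the actual `createActionApiRoute` signature.

```typescript
type Caller = { kind: "PRIVATE" } | { kind: "JWT"; scopes: string[] };
type SessionResource = { friendlyId: string; externalId: string };

// Hypothetical auth callback: PRIVATE keys always pass; a JWT needs a scope
// naming the requested id or any alternate id of the resolved resource.
function authorize(caller: Caller, requestedId: string, resource: SessionResource | null): boolean {
  if (caller.kind === "PRIVATE") return true;
  const scopes = caller.scopes;
  const ids = [requestedId];
  if (resource !== null) ids.push(resource.friendlyId, resource.externalId);
  return ids.some((id) => scopes.indexOf("read:sessions:" + id) !== -1);
}

// New ordering: authorization first, then the resource-null check.
function statusFor(caller: Caller, requestedId: string, resource: SessionResource | null): number {
  if (!authorize(caller, requestedId, resource)) return 403;
  if (resource === null) return 404;
  return 200;
}

const existing: SessionResource = { friendlyId: "session_abc", externalId: "chat-1" };
const privateKey: Caller = { kind: "PRIVATE" };
const underscoped: Caller = { kind: "JWT", scopes: [] };
const scopedByExternalId: Caller = { kind: "JWT", scopes: ["read:sessions:chat-1"] };

const privateMissing = statusFor(privateKey, "session_missing", null);
const underscopedMissing = statusFor(underscoped, "session_missing", null);
const underscopedExisting = statusFor(underscoped, "session_abc", existing);
const crossForm = statusFor(scopedByExternalId, "session_abc", existing);
```

An underscoped JWT gets the same 403 whether or not the resource exists, so the 404/403 difference is only visible to callers that pass authorization.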
@ericallam ericallam merged commit c69e939 into main Apr 28, 2026
43 checks passed
@ericallam ericallam deleted the feature/tri-8627-session-primitive-server-side-schema-routes-clickhouse branch April 28, 2026 11:35
@github-actions github-actions Bot mentioned this pull request Apr 28, 2026
ericallam pushed a commit that referenced this pull request May 1, 2026
## Summary
8 new features, 18 improvements, 11 bug fixes.

## Breaking changes
- Add server-side deprecation gate for deploys from v3 CLI versions
(gated by `DEPRECATE_V3_CLI_DEPLOYS_ENABLED`). v4 CLI deploys are
unaffected.
([#3415](#3415))

## Improvements
- Add `--no-browser` flag to `init` and `login` to skip auto-opening the
browser during authentication. Also error loudly when `init` is run
without `--yes` under non-TTY stdin (previously default-and-exited
silently, leaving the project half-initialized). Both commands now show
an `Examples` section in `--help`.
([#3483](#3483))
- Add `isReplay` boolean to the run context (`ctx.run.isReplay`),
derived from the existing `replayedFromTaskRunFriendlyId` database
field. Defaults to `false` for backwards compatibility.
([#3454](#3454))
- Redact the `resolveWaitpoint` runtime log so it only emits `id` and
`type` instead of the full completed waitpoint. Previously the log
printed the entire waitpoint (including `output`) to stdout in
production runs, which could leak sensitive payloads. The value returned
by `wait.forToken()` is unchanged.
([#3490](#3490))
- Add `SessionId` friendly ID generator and schemas for the new durable
Session primitive. Exported from `@trigger.dev/core/v3/isomorphic`
alongside `RunId`, `BatchId`, etc. Ships the
`CreateSessionStreamWaitpoint` request/response schemas alongside the
main Session CRUD.
([#3417](#3417))
- Truncate large error stacks and messages to prevent OOM crashes. Stack
traces are capped at 50 frames (keeping top 5 + bottom 45 with an
omission notice), individual stack lines at 1024 chars, and error
messages at 1000 chars. Applied in parseError, sanitizeError, and OTel
span recording.
([#3405](#3405))
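
The limits in the truncation entry above can be sketched as follows. This is a simplified illustration, not the code in `parseError` or `sanitizeError`: it treats every line of the stack as a frame, and the helper names and the wording of the omission notice are assumptions.

```typescript
const MAX_FRAMES = 50;
const KEEP_TOP = 5;
const KEEP_BOTTOM = 45;
const MAX_LINE_CHARS = 1024;
const MAX_MESSAGE_CHARS = 1000;

function truncateMessage(message: string): string {
  return message.length > MAX_MESSAGE_CHARS ? message.slice(0, MAX_MESSAGE_CHARS) : message;
}

function truncateStack(stack: string): string {
  // Cap each individual line first.
  const lines = stack
    .split("\n")
    .map((line) => (line.length > MAX_LINE_CHARS ? line.slice(0, MAX_LINE_CHARS) : line));
  if (lines.length <= MAX_FRAMES) return lines.join("\n");

  // Keep the top 5 and bottom 45 frames, with a notice in between.
  const omitted = lines.length - KEEP_TOP - KEEP_BOTTOM;
  const notice = "    ... " + omitted + " frames omitted ...";
  return lines
    .slice(0, KEEP_TOP)
    .concat([notice], lines.slice(lines.length - KEEP_BOTTOM))
    .join("\n");
}

// Example: a 100-frame stack is reduced to 5 + notice + 45 lines.
const frames: string[] = [];
for (let i = 0; i < 100; i++) frames.push("at f" + i);
const truncated = truncateStack(frames.join("\n")).split("\n");
```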

## Server changes

These changes affect the self-hosted Docker image and Trigger.dev Cloud:

- Add a "Back office" tab to `/admin` and a per-organization detail page
at `/admin/back-office/orgs/:orgId`. The first action available on that
page is editing the org's API rate limit: admins can save a
`tokenBucket` override (refill rate, interval, max tokens) and see a
plain-English preview of the resulting sustained rate and burst
allowance. Writes are audit-logged via the server logger.
([#3434](#3434))
- Optional `DEPLOY_REGISTRY_ECR_DEFAULT_REPOSITORY_POLICY` env var to
apply a default repository policy when the webapp creates new ECR repos
([#3467](#3467))
- Ship the Errors page to all users, with a polish + bug-fix pass:
pinned "No channel" item in the Slack alert channel picker,
viewer-timezone alert timestamps via Slack's `<!date^>` token, Activity
sparkline peak tooltip, centered loading spinner and bug-icon empty
state on the error detail page, ellipsis on the Configure alerts
trigger.
([#3477](#3477))
- Configure the set of machine presets to build boot snapshots for at
deploy time via `COMPUTE_TEMPLATE_MACHINE_PRESETS` (CSV of preset names,
default `small-1x`). Use `COMPUTE_TEMPLATE_MACHINE_PRESETS_REQUIRED`
(CSV, default = full PRESETS list) to scope which preset failures fail a
required-mode deploy. Optional preset failures are logged and don't
block the deploy.
([#3492](#3492))
- Regenerating a RuntimeEnvironment API key no longer invalidates the
previous key immediately. The old key is recorded in a new
`RevokedApiKey` table with a 24 hour grace window, and
`findEnvironmentByApiKey` falls back to it when the submitted key
doesn't match any live environment. The grace window can be ended early
(or extended) by updating `expiresAt` on the row.
([#3420](#3420))
- Add the `Session` primitive — a durable, task-bound, bidirectional I/O
channel that outlives a single run and acts as the run manager for
`chat.agent`. Ships the Postgres `Session` + `SessionRun` tables,
ClickHouse `sessions_v1` + replication service, the `sessions` JWT
scope, and the public CRUD + realtime routes (`/api/v1/sessions`,
`/realtime/v1/sessions/:session/:io`) including `end-and-continue` for
server-orchestrated run handoffs and session-stream waitpoints.
([#3417](#3417))
- Add `KUBERNETES_POD_DNS_NDOTS_OVERRIDE_ENABLED` flag (off by default)
that overrides the cluster default and sets `dnsConfig.options.ndots` on
runner pods (defaulting to 2, configurable via
`KUBERNETES_POD_DNS_NDOTS`). Kubernetes defaults pods to `ndots: 5`, so
any name with fewer than 5 dots — including typical external domains
like `api.example.com` — is first walked through every entry in the
cluster search list (`<ns>.svc.cluster.local`, `svc.cluster.local`,
`cluster.local`) before being tried as-is, turning one resolution into
4+ CoreDNS queries (×2 with A+AAAA). Using a lower `ndots` value reduces
DNS query amplification in the `cluster.local` zone.
  
Note: before enabling, make sure no code path relies on search-list
expansion for names with dots ≥ the configured value — those names will
hit their as-is form first and could resolve externally before falling
back to the cluster search path.
([#3441](#3441))
- Vercel integration option to disable auto promotions
([#3376](#3376))
- Make it clear in the admin that feature flags are global and should
rarely be changed.
([#3408](#3408))
- Admin worker groups API: add GET loader and expose more fields on
POST. ([#3390](#3390))
- Add 60s fresh / 60s stale SWR cache to `getEntitlement` in
`platform.v3.server.ts`. Eliminates a synchronous billing-service HTTP
round trip on every trigger. Reuses the existing `platformCache` (LRU
memory + Redis) pattern already used for `limits` and `usage`. Cache key
is `${orgId}`. Errors return a permissive `{ hasAccess: true }` fallback
(existing behavior) and are also cached to prevent thundering-herd on
billing outages.
([#3388](#3388))
- Show a `MicroVM` badge next to the region name on the regions page.
([#3407](#3407))
- Increase default maximum project count per organization from 10 to 25
([#3409](#3409))
- Merge execution snapshot creation into the dequeue taskRun.update
transaction, reducing 2 DB commits to 1 per dequeue operation
([#3395](#3395))
- Add per-worker Node.js heap metrics to the OTel meter —
`nodejs.memory.heap.used`, `nodejs.memory.heap.total`,
`nodejs.memory.heap.limit`, `nodejs.memory.external`,
`nodejs.memory.array_buffers`, `nodejs.memory.rss`. Host-metrics only
publishes RSS, which overstates V8 heap by the external + native
footprint; these give direct heap visibility per cluster worker so
`NODE_MAX_OLD_SPACE_SIZE` can be sized against observed heap peaks
rather than RSS.
([#3437](#3437))
- Tag Prisma spans with `db.datasource: "writer" | "replica"` so
monitors and trace queries can distinguish the writer pool from the
replica pool. Applies to all `prisma:engine:*` spans (including
`prisma:engine:connection` used by the connection-pool monitors) and the
outer `prisma:client:operation` span.
([#3422](#3422))
- Clarify the cross-region intent in the Terraform and AI-prompt helpers
on the Add Private Connection page. Both already default
`supported_regions` to `["us-east-1", "eu-central-1"]`; added an inline
comment / parenthetical so the user understands why both regions are
listed (Trigger.dev runs in both, so the service must be consumable from
either).
([#3465](#3465))
- Add `RUN_ENGINE_READ_REPLICA_SNAPSHOTS_SINCE_ENABLED` flag (default
off) to route the Prisma reads inside `RunEngine.getSnapshotsSince`
through the read-only replica client. Offloads the snapshot polling
queries (fired by every running task runner) from the primary. When
disabled, behavior is unchanged.
([#3423](#3423))
- Stop creating TaskRunTag records and _TaskRunToTaskRunTag join table
entries during task triggering. The denormalized runTags string array on
TaskRun already stores tag names, making the M2M relation redundant
write overhead.
([#3369](#3369))
- Stop writing per-tick state (`lastScheduledTimestamp`,
`nextScheduledTimestamp`, `lastRunTriggeredAt`) on `TaskSchedule` and
`TaskScheduleInstance`. The schedule engine now carries the previous
fire time forward via the worker queue payload, eliminating ~270K
dead-tuple-driven autovacuums per year on these hot tables and the
associated `IO:XactSync` mini-spikes on the writer. Customer-facing
`payload.lastTimestamp` semantics are unchanged.
([#3476](#3476))
- Replace the expensive DISTINCT query for task filter dropdowns with a
dedicated TaskIdentifier registry table backed by Redis. Environments
migrate automatically on their next deploy, with a transparent fallback
to the legacy query for unmigrated environments. Also fixes duplicate
dropdown entries when a task changes trigger source, and adds
active/archived grouping for removed tasks. Moves BackgroundWorkerTask
reads in the trigger hot path to the read replica.
([#3368](#3368))
- Public Access Tokens (PATs) minted before an API key rotation now keep
working during the 24h grace window. `validatePublicJwtKey` falls back
to any non-expired `RevokedApiKey` rows for the signing environment when
the primary signature check against the env's current `apiKey` fails.
The fallback query only runs on the failure path, so the hot success
path is unchanged.
([#3464](#3464))
- Batch items that hit the environment queue size limit now fast-fail without
retries and without creating pre-failed TaskRuns.
([#3352](#3352))
- Show the cancel button in the runs list for runs in `DEQUEUED` status.
`DEQUEUED` was missing from `NON_FINAL_RUN_STATUSES` so the list hid the
button even though the single run page allowed it.
([#3421](#3421))
- Reduce 5xx feedback loops on hot debounce keys by quantizing `delayUntil`,
adding an unlocked fast-path skip, and gracefully handling redlock
contention in `handleDebounce` so the SDK no longer retries into a herd.
([#3453](#3453))
- Fix RSS memory leak in the realtime proxy routes. `/realtime/v1/runs`,
`/realtime/v1/runs/:id`, and `/realtime/v1/batches/:id` called `fetch()`
into Electric with no abort signal, so when a client disconnected mid
long-poll, undici kept the upstream socket open and buffered response
chunks that would never be consumed — retained only in RSS, invisible to
V8 heap tooling. Thread `getRequestAbortSignal()` through
`RealtimeClient.streamRun/streamRuns/streamBatch` to `longPollingFetch`
and cancel the upstream body in the error path. Isolated reproducer
showed ~44 KB retained per leaked request; signal propagation releases
it cleanly.
([#3442](#3442))
- Fix memory leak where every aborted SSE connection pinned the full
request/response graph on Node 20, caused by `AbortSignal.any()` in
`sse.ts` retaining its source signals indefinitely (see
nodejs/node#54614, nodejs/node#55351). Also clear the
`setTimeout(abort)` timer in `entry.server.tsx` so successful HTML
renders don't pin the React tree for 30s per request.
([#3430](#3430))
- Preserve filters on the queues page when submitting modal actions.
([#3471](#3471))
- Fix Redis connection leak in realtime streams and broken abort signal
propagation.
  
**Redis connections**: Non-blocking methods (ingestData, appendPart,
getLastChunkIndex) now share a single Redis connection instead of
creating one per request. streamResponse still uses dedicated
connections (required for XREAD BLOCK) but now tears them down
immediately via disconnect() instead of graceful quit(), with a 15s
inactivity fallback.
  
**Abort signal**: request.signal is broken in Remix/Express due to a
Node.js undici GC bug (nodejs/node#55428) that severs the signal chain
when Remix clones the Request internally. Added getRequestAbortSignal()
wired to Express res.on("close") via httpAsyncStorage, which fires
reliably on client disconnect. All SSE/streaming routes updated to use
it. ([#3399](#3399))
- Prevent dashboard crash (React error #31) when span accessory item
text is not a string. Filters out malformed accessory items in
SpanCodePathAccessory instead of passing objects to React as children.
([#3400](#3400))
- Upgrade Remix packages from 2.1.0 to 2.17.4 to address security
vulnerabilities in React Router
([#3372](#3372))
- Fix Vercel integration settings page (remove redundant section
toggles) and improve the Vercel onboarding flow so the modal closes
after connecting a GitHub repo and the marketplace `next` URL is
preserved across the GitHub app install redirect.
([#3424](#3424))
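
For the `KUBERNETES_POD_DNS_NDOTS_OVERRIDE_ENABLED` entry in the list above, the effect of `ndots` on lookup order can be sketched with a small model of the resolver's search-list behavior. It is a simplification (it ignores trailing-dot names and stops modelling at the candidate order), and the `default` namespace and the function names are assumptions.

```typescript
function countDots(name: string): number {
  return name.split(".").length - 1;
}

// Returns the names a resolver would try, in order, until one resolves.
function resolutionOrder(name: string, searchList: string[], ndots: number): string[] {
  const viaSearchList = searchList.map((suffix) => name + "." + suffix);
  // Fewer dots than ndots: walk the search list first, then try the name as-is.
  // Otherwise: try the name as-is first and fall back to the search list.
  return countDots(name) < ndots ? viaSearchList.concat([name]) : [name].concat(viaSearchList);
}

// Search list for a pod in an assumed "default" namespace.
const searchList = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"];

// ndots: 5 (Kubernetes default): the external name is tried last, after 3 misses.
const withDefault = resolutionOrder("api.example.com", searchList, 5);
// ndots: 2 (the override's default): the external name is tried first.
const withOverride = resolutionOrder("api.example.com", searchList, 2);
```

With `ndots: 5`, `api.example.com` (two dots) is the fourth candidate, which matches the 4+ queries described in the entry. With `ndots: 2` it is the first.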

<details>
<summary>Raw changeset output</summary>

# Releases
## @trigger.dev/build@4.4.5

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.5`

## trigger.dev@4.4.5

### Patch Changes

- Add `--no-browser` flag to `init` and `login` to skip auto-opening the
browser during authentication. Also error loudly when `init` is run
without `--yes` under non-TTY stdin (previously default-and-exited
silently, leaving the project half-initialized). Both commands now show
an `Examples` section in `--help`.
([#3483](#3483))
-   Updated dependencies:
    -   `@trigger.dev/core@4.4.5`
    -   `@trigger.dev/build@4.4.5`
    -   `@trigger.dev/schema-to-json@4.4.5`

## @trigger.dev/core@4.4.5

### Patch Changes

- Add `isReplay` boolean to the run context (`ctx.run.isReplay`),
derived from the existing `replayedFromTaskRunFriendlyId` database
field. Defaults to `false` for backwards compatibility.
([#3454](#3454))
- Redact the `resolveWaitpoint` runtime log so it only emits `id` and
`type` instead of the full completed waitpoint. Previously the log
printed the entire waitpoint (including `output`) to stdout in
production runs, which could leak sensitive payloads. The value returned
by `wait.forToken()` is unchanged.
([#3490](#3490))
- Add `SessionId` friendly ID generator and schemas for the new durable
Session primitive. Exported from `@trigger.dev/core/v3/isomorphic`
alongside `RunId`, `BatchId`, etc. Ships the
`CreateSessionStreamWaitpoint` request/response schemas alongside the
main Session CRUD.
([#3417](#3417))
- Truncate large error stacks and messages to prevent OOM crashes. Stack
traces are capped at 50 frames (keeping top 5 + bottom 45 with an
omission notice), individual stack lines at 1024 chars, and error
messages at 1000 chars. Applied in parseError, sanitizeError, and OTel
span recording.
([#3405](#3405))

## @trigger.dev/python@4.4.5

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.5`
    -   `@trigger.dev/build@4.4.5`
    -   `@trigger.dev/sdk@4.4.5`

## @trigger.dev/react-hooks@4.4.5

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.5`

## @trigger.dev/redis-worker@4.4.5

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.5`

## @trigger.dev/rsc@4.4.5

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.5`

## @trigger.dev/schema-to-json@4.4.5

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.5`

## @trigger.dev/sdk@4.4.5

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.5`

</details>

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>