feat(temporal): opt-in continue-as-new for long-lived agent workflows#447
Open
danielmillerp wants to merge 1 commit into
Long-lived chat/session agents run as a single Temporal workflow that stays open indefinitely, so their event history grows until it hits Temporal's ~50k-event / 50MB limit and the workflow stalls. This adds an opt-in continue-as-new path that recycles the history so a session can stay open forever, plus the discipline of keeping messages/state outside workflow state so they survive the recycle.

SDK (BaseWorkflow):
- should_continue_as_new(): recycle decision (Temporal's is_continue_as_new_suggested() or a configurable WORKFLOW_MAX_HISTORY_LENGTH threshold).
- drain_and_continue_as_new(): waits all_handlers_finished (so an in-flight turn is never lost/duplicated at the boundary) then continue_as_new.
- run_until_complete(): drop-in replacement for the usual wait_condition(timeout=None) tail; gated once behind workflow.patched() so in-flight pre-patch workflows keep the old behaviour (no non-determinism on replay). Identical behaviour unless WORKFLOW_CONTINUE_AS_NEW_ENABLED is set.
- conversation_from_messages(): rebuild the conversation from the adk.messages ledger after a recycle (messages live in adk.messages, not workflow state).

Config (default off, so existing agents are unaffected):
- WORKFLOW_CONTINUE_AS_NEW_ENABLED (bool)
- WORKFLOW_MAX_HISTORY_LENGTH (int|None)

Examples: all 13 long-lived Temporal tutorial agents adopt run_until_complete. Message-based chat agents rebuild conversation from adk.messages; harness agents with an opaque session handle (claude-code, codex, claude-sdk) or rich history (pydantic-ai via ModelMessagesTypeAdapter, langgraph) persist their non-message state to adk.state and re-hydrate on recycle. Every adk.state / adk.messages round-trip is guarded by the enabled flag, so the default path is byte-for-byte unchanged.

Note: continue-as-new bounds history SIZE; it does NOT extend the chain-wide WORKFLOW_EXECUTION_TIMEOUT_SECONDS (raise that to keep workflows long-lived).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
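
For readers skimming the commit message, the recycle decision described for `should_continue_as_new()` might reduce to something like the sketch below. This is not the code from the PR: the standalone function signature, the `>=` comparison, and the early return when the feature is disabled are all assumptions made for illustration. In a real workflow the two history inputs would plausibly come from `workflow.info().is_continue_as_new_suggested()` and `workflow.info().get_current_history_length()` in the Temporal Python SDK.

```python
from typing import Optional


def should_continue_as_new(
    enabled: bool,
    temporal_suggests: bool,
    history_length: int,
    max_history_length: Optional[int] = None,
) -> bool:
    """Hypothetical, pure-function version of the recycle decision."""
    # Assumption: with the opt-in flag off, the workflow never recycles.
    if not enabled:
        return False
    # Temporal's own hint that the history is getting large.
    if temporal_suggests:
        return True
    # Optional configured threshold (WORKFLOW_MAX_HISTORY_LENGTH); None disables it.
    # Assumption: the threshold is inclusive.
    return max_history_length is not None and history_length >= max_history_length
```

Keeping the decision free of SDK calls, as done here, makes it easy to unit-test without a Temporal test server; whether the PR structures it this way is not stated.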
Comment on lines +244 to +251

    messages = []
    page_number = 1
    while True:
        page = await adk.messages.list(
            task_id=task_id,
            limit=_CONVERSATION_PAGE_SIZE,
            page_number=page_number,
        )
The messages SDK's offset pagination is zero-based, but this restore loop starts at 1. When a workflow continues as new, chats with 200 or fewer ledger messages fetch the empty second page and restore an empty conversation. Longer chats drop the first 200 messages from model context. Please start from page 0, or otherwise match the messages API page base, before rebuilding the conversation.
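
A minimal sketch of the fix this comment asks for, assuming (as the comment states) that the messages API uses zero-based pages and that a short or empty page marks the end of the ledger. `FakeLedger` is a hypothetical in-memory stand-in for `adk.messages`, and `PAGE_SIZE = 200` is inferred from the numbers in the comment rather than taken from the source file.

```python
import asyncio

PAGE_SIZE = 200  # stands in for _CONVERSATION_PAGE_SIZE (value inferred, not confirmed)


class FakeLedger:
    """Hypothetical in-memory stand-in for adk.messages, for illustration only."""

    def __init__(self, items):
        self._items = list(items)

    async def list(self, task_id, limit, page_number):
        # Zero-based offset pagination: page 0 holds the first `limit` items.
        start = page_number * limit
        return self._items[start:start + limit]


async def fetch_all_messages(ledger, task_id, page_size=PAGE_SIZE):
    """Collect every message for a task, starting from page 0."""
    messages = []
    page_number = 0  # the loop under review started at 1 and skipped this page
    while True:
        page = await ledger.list(
            task_id=task_id, limit=page_size, page_number=page_number
        )
        messages.extend(page)
        # A short (or empty) page means there is nothing left to fetch.
        if len(page) < page_size:
            return messages
        page_number += 1
```

With a 3-message ledger, `asyncio.run(fetch_all_messages(FakeLedger(range(3)), "task"))` returns all three items, whereas starting at page 1 would return none.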
Path: src/agentex/lib/core/temporal/workflows/workflow.py
Line: 244-251
Why
Long-lived chat/session agents (e.g. the Emu/FDD researcher) run as a single Temporal workflow that stays open indefinitely. Their event history grows until it hits Temporal's ~50k-event / 50MB limit and the workflow stalls — this is the root cause behind "chats die / state outgrows the 2MB payload" (P0 for EY).
This PR adds an opt-in continue-as-new path so a session can stay open forever by recycling its history, plus the discipline of keeping messages/state outside workflow state so they survive the recycle.
SDK: `BaseWorkflow` helpers (opt-in)
- `should_continue_as_new()`: recycle decision, based on Temporal's `is_continue_as_new_suggested()` or a configurable `WORKFLOW_MAX_HISTORY_LENGTH` threshold.
- `drain_and_continue_as_new()`: waits for `all_handlers_finished` (so an in-flight turn isn't lost or duplicated at the boundary), then calls `continue_as_new`.
- `run_until_complete()`: drop-in replacement for the usual `wait_condition(timeout=None)` tail. Gated once behind `workflow.patched()` so in-flight pre-patch workflows keep the old behaviour and don't hit a non-determinism error on replay.
- `conversation_from_messages()`: rebuilds the conversation from the `adk.messages` ledger after a recycle (messages live in `adk.messages`, not workflow state).
Config (default OFF, existing agents unaffected)
- `WORKFLOW_CONTINUE_AS_NEW_ENABLED` (bool)
- `WORKFLOW_MAX_HISTORY_LENGTH` (int | None)
Examples
All 13 long-lived Temporal tutorial agents adopt `run_until_complete`:
- Message-based chat agents rebuild the conversation from `adk.messages`.
- Agents with non-message state persist it to `adk.state` and re-hydrate on recycle: opaque session handles for claude-sdk (090), claude-code (140), codex (150); rich `ModelMessage` history for pydantic-ai (110, via `ModelMessagesTypeAdapter`); langgraph (130) rebuilds from the ledger.
Every `adk.state` / `adk.messages` round-trip is guarded by the enabled flag, so the default path is byte-for-byte unchanged.
Verification
- `tests/lib/core/temporal/test_base_workflow_continue_as_new.py` (5 passing).
- `tests/lib/core/temporal` suite: 8 passed, no regressions.
- `py_compile` + `ruff` clean across all 16 changed files.
Follow-ups (not in this PR)
- Test `drain_and_continue_as_new` against a Temporal test server.
🤖 Generated with Claude Code
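
As a rough illustration of the two settings listed under Config, the sketch below reads them from environment variables with both defaults off. How the SDK actually loads these values is not shown in this PR; the helper name `read_continue_as_new_config`, the accepted truthy strings, and the use of plain environment variables are all assumptions.

```python
import os
from typing import Mapping, Optional, Tuple

# Assumption: which strings count as "true" is a guess, not the SDK's rule.
_TRUTHY = {"1", "true", "yes", "on"}


def read_continue_as_new_config(
    env: Mapping[str, str] = os.environ,
) -> Tuple[bool, Optional[int]]:
    """Hypothetical loader for the two opt-in settings named in the PR."""
    # Default OFF: an unset or empty flag leaves existing agents unaffected.
    raw_flag = env.get("WORKFLOW_CONTINUE_AS_NEW_ENABLED", "")
    enabled = raw_flag.strip().lower() in _TRUTHY

    # int | None: an unset or empty value means "no explicit threshold".
    raw_limit = env.get("WORKFLOW_MAX_HISTORY_LENGTH", "").strip()
    max_history_length = int(raw_limit) if raw_limit else None

    return enabled, max_history_length
```

Passing the environment as a mapping keeps the loader testable without mutating `os.environ`.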
Greptile Summary
This PR adds opt-in continue-as-new support for long-running Temporal agent workflows. The main changes are:
- Rebuilding the conversation from `adk.messages` after workflow history is recycled.
- `adk.state` persistence for agents with opaque session handles or non-text model history.
- Adoption of the `run_until_complete` path.
Confidence Score: 4/5
The continue-as-new support is mostly contained and opt-in, but the conversation restore path needs attention before relying on recycled workflows.
The main implementation and tests cover the recycle decision logic, and the remaining issue is localized to message pagination during conversation rehydration.
src/agentex/lib/core/temporal/workflows/workflow.py