fix(workflow-executor): add per-invocation AI timeout to surface hanging provider errors [PRD-409] #1609
Open
matthv wants to merge 2 commits into
…ing provider errors [PRD-409]

When the AI provider hangs (no response, internal retries, or a connection held open), the previous code relied on the global STEP_TIMEOUT_MS (default 5 min) to fail the step. From the user's perspective this looks like an infinite spinner.

Add a dedicated timeout on each AI invocation (default 60s, configurable via AI_INVOKE_TIMEOUT_MS) using AbortController + signal so the underlying HTTP request is actually cancelled. On timeout, throw the new AiInvokeTimeoutError, which BaseStepExecutor.execute() converts to an error outcome with a user-friendly message. The orchestrator then sets context.error on the step and the frontend exits its isLoading state immediately.

fixes PRD-409

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
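For illustration, the manual AbortController approach described in this commit could look roughly like the sketch below. It is not the PR's actual code: `invokeWithManualTimeout`, `SignalAwareCall`, and `hangingProvider` are hypothetical names, and the `AiInvokeTimeoutError` shown here is a simplified stand-in for the class the commit adds.

```typescript
// Simplified stand-in for the error class added by the commit (assumption).
class AiInvokeTimeoutError extends Error {
  readonly timeoutMs: number;

  constructor(timeoutMs: number) {
    super(`AI invocation timed out after ${timeoutMs}ms`);
    this.name = "AiInvokeTimeoutError";
    this.timeoutMs = timeoutMs;
  }
}

// Hypothetical shape: any provider call that accepts an AbortSignal.
type SignalAwareCall<T> = (signal: AbortSignal) => Promise<T>;

async function invokeWithManualTimeout<T>(
  call: SignalAwareCall<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  // Abort the in-flight call once the deadline passes.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await call(controller.signal);
  } catch (err) {
    // Only failures that happen after our own abort count as timeouts.
    if (controller.signal.aborted) {
      throw new AiInvokeTimeoutError(timeoutMs);
    }
    throw err;
  } finally {
    // Always clear the timer so a fast response leaves no pending timeout.
    clearTimeout(timer);
  }
}

// Hypothetical provider that never answers but honours cancellation.
function hangingProvider(signal: AbortSignal): Promise<string> {
  return new Promise<string>((_resolve, reject) => {
    signal.addEventListener("abort", () => reject(new Error("aborted")), {
      once: true,
    });
  });
}
```

Calling `invokeWithManualTimeout(hangingProvider, 60_000)` would reject with `AiInvokeTimeoutError` once the deadline passes, while a provider that answers in time passes its result through unchanged.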
1 new issue
Coverage Impact ⬆️ Merging this pull request will increase total coverage on Modified Files with Diff Coverage (3)
Replace the manual AbortController + setTimeout in invokeWithTools with LangChain's native `timeout` call option, which it converts to an AbortSignal.timeout(ms) and forwards to the underlying HTTP request (real cancellation, not just a race). Lowers invokeWithTools complexity.

Map the resulting TimeoutError/AbortError to AiInvokeTimeoutError to keep the user-facing message. Lower the default to 30s.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
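As a rough sketch of the delegation described in this commit: pass the timeout as a call option and translate the resulting abort into the domain error. `TimeoutAwareModel` is a hypothetical minimal interface standing in for a LangChain chat model, `invokeWithTimeout` and `isAbortLike` are illustrative names rather than the PR's, and `hangingModel` only simulates the rejection a timed-out request would produce.

```typescript
// Simplified stand-in for the PR's error class (assumption).
class AiInvokeTimeoutError extends Error {
  constructor(timeoutMs: number) {
    super(`AI invocation timed out after ${timeoutMs}ms`);
    this.name = "AiInvokeTimeoutError";
  }
}

// Hypothetical minimal model interface; `timeout` is in milliseconds.
interface TimeoutAwareModel {
  invoke(input: string, options?: { timeout?: number }): Promise<string>;
}

// True for the error names an aborted or timed-out request surfaces with.
function isAbortLike(err: unknown): boolean {
  const name = (err as { name?: unknown } | null)?.name;
  return name === "TimeoutError" || name === "AbortError";
}

async function invokeWithTimeout(
  model: TimeoutAwareModel,
  input: string,
  aiInvokeTimeoutMs?: number,
): Promise<string> {
  // Unset or <= 0 disables the timeout, so aborts are not mapped.
  const timeoutMs =
    aiInvokeTimeoutMs !== undefined && aiInvokeTimeoutMs > 0
      ? aiInvokeTimeoutMs
      : undefined;
  try {
    return await model.invoke(
      input,
      timeoutMs === undefined ? undefined : { timeout: timeoutMs },
    );
  } catch (err) {
    if (timeoutMs !== undefined && isAbortLike(err)) {
      throw new AiInvokeTimeoutError(timeoutMs);
    }
    throw err; // non-abort errors are rethrown as-is
  }
}

// Fake model that never answers; it rejects with a TimeoutError-named
// error once the requested timeout elapses, and hangs if none is given.
const hangingModel: TimeoutAwareModel = {
  invoke(_input, options) {
    return new Promise<string>((_resolve, reject) => {
      if (!options?.timeout) return;
      setTimeout(() => {
        const err = new Error("The operation was aborted due to timeout");
        err.name = "TimeoutError";
        reject(err);
      }, options.timeout);
    });
  },
};
```

Compared with a hand-rolled timer, the wrapper holds no timer state of its own; it only decides whether to pass the option and how to classify the failure.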

Summary
When the AI provider hangs (no response, internal retries, or holds the connection open), the previous code relied on the global `STEP_TIMEOUT_MS` (default 5 min) to fail the step. From the user's perspective this looks like an infinite spinner.

This PR adds a dedicated timeout on each AI invocation (default 30s, configurable via `AI_INVOKE_TIMEOUT_MS`) by passing LangChain's native `timeout` call option to `model.invoke`. LangChain converts it into an `AbortSignal.timeout(ms)` and forwards it to the underlying HTTP request, so a hanging provider is actually cancelled, not merely raced.

On timeout, the abort surfaces as a `TimeoutError`/`AbortError`, which `invokeWithTools` maps to the new `AiInvokeTimeoutError`. `BaseStepExecutor.execute()` then converts it to an error outcome with a user-friendly message; the orchestrator sets `context.error` on the step and the frontend exits its `isLoading` state immediately.

Why delegate to LangChain instead of a manual AbortController
An earlier version wired up `AbortController` + `setTimeout` by hand. LangChain already does exactly this internally when given a `timeout` call option (verified in `@langchain/core` `ensureConfig` → `AbortSignal.timeout` → forwarded as `signal` to the request). Delegating removes the manual timer plumbing and lowers `invokeWithTools` complexity, while still producing a real request cancellation. The `timeout` call option is in milliseconds.

Why not just lower STEP_TIMEOUT_MS globally
`STEP_TIMEOUT_MS` covers more than the AI call (it also covers slow agent fetches, DB lookups, etc.). Lowering it globally would kill legitimately slow non-AI work. A dedicated AI timeout is more surgical.

Changes
- `defaults.ts`: new `DEFAULT_AI_INVOKE_TIMEOUT_MS = 30_000`
- `errors.ts`: new `AiInvokeTimeoutError extends WorkflowExecutorError` with provider-specific user message
- `base-step-executor.ts`: `invokeWithTools` passes `{ timeout: aiInvokeTimeoutMs }` to `model.invoke`, and maps the resulting `TimeoutError`/`AbortError` to `AiInvokeTimeoutError`
- `RunnerConfig` → `StepContextConfig` → `ExecutionContext`
- `cli-core.ts`: parse `AI_INVOKE_TIMEOUT_MS` env var
- `AiInvokeTimeoutError`, `{ timeout }` passed as the 2nd arg, disabled when unset/<=0 (abort not mapped), non-abort errors rethrown as-is

fixes PRD-409
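To make the configuration path concrete, here is a minimal sketch of resolving the timeout from the environment. Only `AI_INVOKE_TIMEOUT_MS` and `DEFAULT_AI_INVOKE_TIMEOUT_MS = 30_000` come from the PR; the helper names `resolveAiInvokeTimeoutMs` and `isAiTimeoutEnabled` are hypothetical, and falling back to the default for an unset or malformed variable is an assumption, not confirmed behavior of `cli-core.ts`.

```typescript
const DEFAULT_AI_INVOKE_TIMEOUT_MS = 30_000;

// Hypothetical helper: read AI_INVOKE_TIMEOUT_MS from an env-like record.
function resolveAiInvokeTimeoutMs(
  env: Record<string, string | undefined>,
): number {
  const raw = env.AI_INVOKE_TIMEOUT_MS;
  // Assumption: an unset or blank variable falls back to the default.
  if (raw === undefined || raw.trim() === "") {
    return DEFAULT_AI_INVOKE_TIMEOUT_MS;
  }
  const parsed = Number(raw);
  // Assumption: a malformed value also falls back to the default.
  if (!Number.isFinite(parsed)) {
    return DEFAULT_AI_INVOKE_TIMEOUT_MS;
  }
  // 0 or a negative value is kept so the caller can treat it as "disabled".
  return parsed;
}

// Mirrors the "disabled when unset/<=0" rule from the change list.
function isAiTimeoutEnabled(timeoutMs: number | undefined): boolean {
  return timeoutMs !== undefined && timeoutMs > 0;
}
```

Under these assumptions, `resolveAiInvokeTimeoutMs({ AI_INVOKE_TIMEOUT_MS: "10000" })` yields `10000`, and a resolved value of `0` leaves the timeout disabled.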
Test plan
- `workflow-executor` test suite passes (`base-step-executor.test.ts`: 45 tests, incl. the 6 above)
- `tsc --noEmit` clean
- With `SIMULATE_AI_HANG=1 AI_INVOKE_TIMEOUT_MS=10000`, the frontend shows the new user message after 10s instead of spinning for 5min

🤖 Generated with Claude Code
Note
Add per-invocation AI timeout to surface hanging provider errors in workflow executor
- Adds `aiInvokeTimeoutMs` (default 60,000ms) to the workflow executor's `RunnerConfig`, `ExecutionContext`, and `ExecutorOptions`, configurable via the `AI_INVOKE_TIMEOUT_MS` environment variable.
- In `BaseStepExecutor.invokeWithTools`, wraps AI provider calls with an `AbortController` timer; if the provider hangs past the timeout, the invocation is aborted and throws `AiInvokeTimeoutError`.
- Adds `AiInvokeTimeoutError` with a user-facing retry message to distinguish timeout failures from other AI errors.
- Setting `aiInvokeTimeoutMs` to `0` or leaving it unset disables the timeout, preserving existing behavior.

Macroscope summarized 1718cb4.
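The error-to-outcome step might be sketched as follows. This is a sketch under stated assumptions: the `userMessage` field, the `StepOutcome` shape, the `toOutcome` helper, the `AI_TIMEOUT_USER_MESSAGE` wording, and the choice to rethrow unknown errors are all hypothetical, and the real `BaseStepExecutor.execute()` may differ on each of them.

```typescript
// Hypothetical wording; the PR's actual user-facing message is not shown.
const AI_TIMEOUT_USER_MESSAGE =
  "The AI provider took too long to respond. Please try again.";

// Assumed base class shape: a technical message plus a user-facing one.
class WorkflowExecutorError extends Error {
  readonly userMessage: string;

  constructor(message: string, userMessage: string) {
    super(message);
    // Keep `instanceof` working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "WorkflowExecutorError";
    this.userMessage = userMessage;
  }
}

class AiInvokeTimeoutError extends WorkflowExecutorError {
  constructor(timeoutMs: number) {
    super(
      `AI invocation timed out after ${timeoutMs}ms`,
      AI_TIMEOUT_USER_MESSAGE,
    );
    this.name = "AiInvokeTimeoutError";
  }
}

// Assumed outcome shape consumed by the orchestrator.
type StepOutcome =
  | { status: "success"; output: string }
  | { status: "error"; error: string };

// Run a step body and convert known executor errors into an error outcome.
function toOutcome(run: () => string): StepOutcome {
  try {
    return { status: "success", output: run() };
  } catch (err) {
    if (err instanceof WorkflowExecutorError) {
      return { status: "error", error: err.userMessage };
    }
    throw err; // assumption: unknown errors are left to an outer handler
  }
}
```

With this shape, a timeout reaches the frontend as an ordinary error outcome carrying the retry message, which is what lets the loading state end without waiting for the step-level timeout.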