fix(vllm): don't stream raw tool-call markup as content when a tool parser is active #10346
pos-ei-don wants to merge 1 commit
…arser is active

When a tool_parser is configured and the request carries tools, the streaming loop emitted every text delta as delta.content, including the model's raw tool-call markup (e.g. <tool_call>...), because extract_tool_calls only runs on the full output after the stream. Clients streaming a tool call therefore saw the unparsed tool-call syntax as assistant content.

Buffer the text while a tool parser is active for the request; the existing end-of-stream chat_delta already carries the parsed tool_calls (or the cleaned content), which the Go side converts to SSE deltas. Non-tool-parser streaming is unchanged.

Add a server-less regression test covering both the tool-call case (no raw markup leaked as content) and the plain-text case (content delivered exactly once, which guards against double-emitting the buffered content).

Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
Force-pushed from b7f0b50 to 3ae349c.
Updated after end-to-end verification on the live backend (NVIDIA GB10 / arm64 / CUDA 13, vLLM 0.23.0, qwen3_coder tool parser, streaming over the OpenAI API):
Also added a server-less regression test (…
…ming

When a tool parser is active for a tool-enabled streaming request, mudler#10346 buffers the entire generation and surfaces it on the final chunk to prevent raw tool-call markup from leaking as delta.content. This is correct but turns the request into effectively non-streaming for plain-text responses: the client sees nothing until the model stops.

Every concrete tool parser shipped with vLLM 0.23+ already implements extract_tool_calls_streaming (Granite4, Qwen3Coder, DeepSeekV31, Jamba, Ernie45, Hermes2Pro, llama3_json, mistral, …). Use it: instantiate the parser before the streaming loop and call its streaming method per delta, emitting DeltaMessage(content=…) or DeltaMessage(tool_calls=[…]) when the parser is ready.

Falls back to the existing mudler#10346 buffer path when:

- the parser does not have extract_tool_calls_streaming, OR
- extract_tool_calls_streaming raises mid-stream (logged, the rest of the request finishes via post-loop extract_tool_calls).

Tests (TestStreamingToolParser):

1. Buffer path: no markup leaked, no content duplication
2. Native streaming: plain-text response streams progressively
3. Native streaming: tool_call structured, no markup leaked
4. Native streaming exception → graceful fallback, no markup, no crash
5. No tool parser → unchanged per-delta content stream

E2E verified against qwen3_coder on vLLM 0.23.0 (NVIDIA GB10 / arm64 / CUDA 13).
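The per-delta delegation and its two fallback conditions can be sketched as follows. This is a minimal illustration, not the PR's code: `ToyStreamingParser`, `_safe_prefix`, `stream_with_parser` and the `DeltaMessage` stand-in are hypothetical names, the toy parser handles a single `<tool_call>…</tool_call>` block only, and its `extract_tool_calls_streaming` takes three text arguments, whereas vLLM's real method takes more. Skipping content and tool calls already sent before a mid-stream failure is an assumption of this sketch, not something the commit message states.

```python
import json
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional

OPEN, CLOSE = "<tool_call>", "</tool_call>"


@dataclass
class DeltaMessage:
    """Stand-in for the delta type named in the commit message."""
    content: Optional[str] = None
    tool_calls: List[dict] = field(default_factory=list)


def _safe_prefix(text: str) -> str:
    """Return the leading text that cannot be part of tool-call markup."""
    start = text.find(OPEN)
    if start != -1:
        return text[:start]
    # Hold back a trailing fragment such as "<tool" that may grow into OPEN.
    for k in range(min(len(OPEN) - 1, len(text)), 0, -1):
        if text.endswith(OPEN[:k]):
            return text[:-k]
    return text


class ToyStreamingParser:
    """Hypothetical single-tool-call parser, for illustration only."""

    def __init__(self, fail: bool = False):
        self.fail = fail  # simulate a parser that raises mid-stream

    def _parse(self, text: str) -> List[dict]:
        if OPEN in text and CLOSE in text:
            return [json.loads(text.split(OPEN, 1)[1].split(CLOSE, 1)[0])]
        return []

    def extract_tool_calls(self, text: str) -> DeltaMessage:
        calls = self._parse(text)
        content = _safe_prefix(text) if calls else text
        return DeltaMessage(content=content or None, tool_calls=calls)

    def extract_tool_calls_streaming(self, previous_text, current_text, delta_text):
        if self.fail:
            raise RuntimeError("simulated parser failure")
        new_content = _safe_prefix(current_text)[len(_safe_prefix(previous_text)):]
        calls = self._parse(current_text) if CLOSE not in previous_text else []
        if not new_content and not calls:
            return None  # parser is not ready to emit anything yet
        return DeltaMessage(content=new_content or None, tool_calls=calls)


def stream_with_parser(text_deltas: Iterable[str], parser) -> Iterator[DeltaMessage]:
    """Use the parser's streaming method per delta; buffer if it is missing or raises."""
    streaming_fn = getattr(parser, "extract_tool_calls_streaming", None)
    native = callable(streaming_fn)
    previous, sent, calls_sent = "", "", 0
    for piece in text_deltas:
        current = previous + piece
        if native:
            try:
                delta = streaming_fn(previous, current, piece)
            except Exception:
                native, delta = False, None  # buffer the rest of the stream
            if delta is not None:
                sent += delta.content or ""
                calls_sent += len(delta.tool_calls)
                yield delta
        previous = current
    if not native:
        # Buffer path: parse the full text once, skipping what was already sent.
        final = parser.extract_tool_calls(previous)
        content = final.content or ""
        if content.startswith(sent):
            content = content[len(sent):]
        calls = final.tool_calls[calls_sent:]
        if content or calls:
            yield DeltaMessage(content=content or None, tool_calls=calls)
```

With the toy parser, a plain-text response is yielded delta by delta, while a tool call split across several deltas produces a single structured `DeltaMessage` and no markup in `content`.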
…ser path

Self-contained stdlib-only script that measures time-to-first-token (TTFT) for the vLLM backend's two streaming scenarios:

- tool_call: request mentions a tool; model is expected to call it
- plain_text: request offers a tool but explicitly asks for prose

Use this to compare:

- the buffer-all path (mudler#10346) → plain_text TTFT ≈ total response time
- the native-streaming path (this PR) → plain_text TTFT ≈ true first-token time

    python examples/vllm-bench/ttft_streaming_tool_parser.py \
        --url http://localhost:8080 --model my-coder --runs 3

Lives under examples/ so it does not interfere with the test suite.
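The timing idea behind that comparison can be illustrated in isolation. The sketch below is not the script from the commit: `measure_ttft` is a hypothetical helper that takes any iterable of text chunks rather than talking to an HTTP endpoint, and the injectable `clock` parameter is an assumption added so the logic can be exercised without a server.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_ttft(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float]:
    """Return (ttft, total) in seconds for an iterable of streamed text chunks.

    ttft is the delay until the first non-empty chunk, or None if none arrives.
    """
    start = clock()
    ttft = None
    for chunk in chunks:
        if ttft is None and chunk:
            ttft = clock() - start
    return ttft, clock() - start
```

For a buffer-all stream the first non-empty chunk is also the last one, so `ttft` and `total` coincide; for a progressive stream `ttft` is a fraction of `total`.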
Superseded by #10351, which goes one step further: instead of buffering the entire generation, it delegates to …

Same correctness as this PR (no raw tool-call markup in …

Closing in favour of #10351.
Superseded by #10351, which takes a better approach: progressive streaming via the parser's …
Description
When the vLLM backend has a `tool_parser` configured (`options: [tool_parser:...]`) and a request includes `tools`, streaming responses leak the model's raw tool-call markup as assistant content.

The streaming loop yields every text delta as `delta.content`.

But the tool parser (`extract_tool_calls`) only runs on the full text after the stream completes. So for a model that emits e.g. `<tool_call>{"name": ...}</tool_call>`, the client receives that raw markup as `delta.content` chunks during the stream, and the structured `tool_calls` only arrive in the final chunk.

Reproduce (a tool parser such as `qwen3_coder`, streaming request with `tools`):

- Without the fix: `delta.content` chunks contain `<tool_call>...`, `finish_reason: stop`.
- With the fix: `delta.tool_calls` chunks with name/arguments, `finish_reason: tool_calls`.

Verified on NVIDIA GB10 / arm64 / CUDA 13, vLLM 0.23.0, with the `qwen3_coder` tool parser driving a coding agent over the streaming OpenAI API.

Fix

Buffer the text while a tool parser is active for the request, and let the existing end-of-stream `ChatDelta` carry the parsed `tool_calls` (which the Go side already converts to SSE `delta.tool_calls` via `ToolCallsFromChatDeltas`/`buildDeferredToolCallChunks`). When the parser finds no tool call, flush the buffered content as a single content delta before the final chunk. Non-tool-parser streaming is unchanged.
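A minimal sketch of the buffering behaviour described in this fix, under stated assumptions: `Delta`, `ToyToolParser` and `stream` are hypothetical stand-ins rather than the backend's types, the toy `extract_tool_calls` returns a plain `(calls, cleaned_text)` tuple instead of vLLM's result object, and the final delta here carries either the parsed tool calls or the flushed content in one step.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional


@dataclass
class Delta:
    """Stand-in for one streamed chunk (hypothetical, not the backend's type)."""
    content: Optional[str] = None
    tool_calls: List[dict] = field(default_factory=list)


class ToyToolParser:
    """Hypothetical parser: pulls <tool_call>{...}</tool_call> out of the full text."""
    _pattern = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

    def extract_tool_calls(self, text: str):
        calls = [json.loads(body) for body in self._pattern.findall(text)]
        cleaned = self._pattern.sub("", text).strip()
        return calls, cleaned


def stream(
    text_deltas: Iterable[str],
    parser: Optional[ToyToolParser],
    has_tools: bool,
) -> Iterator[Delta]:
    """Buffer text while a tool parser is active; otherwise pass deltas through."""
    buffering = parser is not None and has_tools
    buffered: List[str] = []
    for piece in text_deltas:
        if buffering:
            buffered.append(piece)  # hold back: this may be tool-call markup
        else:
            yield Delta(content=piece)  # unchanged non-tool-parser path
    if buffering:
        calls, cleaned = parser.extract_tool_calls("".join(buffered))
        if calls or cleaned:
            # One final delta: parsed tool calls, or the content flushed once.
            yield Delta(content=cleaned or None, tool_calls=calls)
```

The two cases a regression test would care about are visible here: a tool-call response never surfaces `<tool_call>` text as content, and a plain-text response is delivered exactly once, at the end.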