feat(vllm): progressive streaming via parser.extract_tool_calls_streaming (follow-up to #10346) #10351
Open
pos-ei-don wants to merge 6 commits into
…arser is active

When a tool_parser is configured and the request carries tools, the streaming loop emitted every text delta as delta.content, including the model's raw tool-call markup (e.g. <tool_call>...), because extract_tool_calls only runs on the full output after the stream. Clients streaming a tool call therefore saw the unparsed tool-call syntax as assistant content.

Buffer the text while a tool parser is active for the request; the existing end-of-stream chat_delta already carries the parsed tool_calls (or the cleaned content), which the Go side converts to SSE deltas. Non-tool-parser streaming is unchanged.

Add a server-less regression test covering both the tool-call case (no raw markup leaked as content) and the plain-text case (content delivered exactly once, which guards against double-emitting the buffered content).

Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
…ool parser (Case 3, mudler#582)
…ve prefix (TDD, Option B state machine, mudler#582)
…ming

When a tool parser is active for a tool-enabled streaming request, mudler#10346 buffers the entire generation and surfaces it on the final chunk to prevent raw tool-call markup from leaking as delta.content. This is correct but turns the request into effectively non-streaming for plain-text responses: the client sees nothing until the model stops.

Every concrete tool parser shipped with vLLM 0.23+ already implements extract_tool_calls_streaming (Granite4, Qwen3Coder, DeepSeekV31, Jamba, Ernie45, Hermes2Pro, llama3_json, mistral, …). Use it: instantiate the parser before the streaming loop and call its streaming method per delta, emitting DeltaMessage(content=…) or DeltaMessage(tool_calls=[…]) when the parser is ready.

Falls back to the existing mudler#10346 buffer path when:
- the parser does not have extract_tool_calls_streaming, OR
- extract_tool_calls_streaming raises mid-stream (logged, the rest of the request finishes via post-loop extract_tool_calls).

Tests (TestStreamingToolParser):
1. Buffer path: no markup leaked, no content duplication
2. Native streaming: plain-text response streams progressively
3. Native streaming: tool_call structured, no markup leaked
4. Native streaming exception → graceful fallback, no markup, no crash
5. No tool parser → unchanged per-delta content stream

E2E verified against qwen3_coder on vLLM 0.23.0 (NVIDIA GB10 / arm64 / CUDA 13).
…ser path

Self-contained stdlib-only script that measures time-to-first-token (TTFT) for the vLLM backend's two streaming scenarios:
- tool_call: request mentions a tool; model is expected to call it
- plain_text: request offers a tool but explicitly asks for prose

Use this to compare:
- the buffer-all path (mudler#10346) → plain_text TTFT ≈ total response time
- the native-streaming path (this PR) → plain_text TTFT ≈ true first-token time

    python examples/vllm-bench/ttft_streaming_tool_parser.py \
        --url http://localhost:8080 --model my-coder --runs 3

Lives under examples/ so it does not interfere with the test suite.
The long-text scenario shows the buffering vs streaming difference most dramatically: with the buffer-all path, the client receives nothing for 20+ seconds and then the entire 1500-token response at once. With native streaming, the first token arrives in tens of milliseconds and the response flows progressively.
Migration note — no breaking changes
Nobody is broken by this merge. Anyone who switches |
Description
Follow-up to #10346: replace the buffer-all path with native progressive streaming.
When the vLLM backend has a `tool_parser` configured and a streaming chat completion request includes `tools`, PR #10346 stops streaming any deltas and surfaces parsed `tool_calls` (or the cleaned content) on the final chunk only. This is correct for tool-call responses, but it turns the plain-text-with-tools case into effectively non-streaming: the client receives nothing until the model finishes.

Every concrete tool parser shipped with vLLM 0.23+ already implements `extract_tool_calls_streaming`, including `qwen3coder`, `granite4`, `deepseekv31`, `jamba`, `ernie45`, `hermes2pro`, `llama3_json`, `mistral`, and ~30 more (verified by walking `vllm.tool_parsers.*` and checking each class against `ToolParser.extract_tool_calls_streaming`). This PR delegates to that interface for the streaming branch.
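The kind of audit described above (checking each parser class against the base method) could look roughly like the following. This is a minimal sketch using stand-in classes: `BaseParser`, `StreamingParser`, and `LegacyParser` are hypothetical and are not vLLM classes, and the step that imports the real parser modules is omitted.

```python
import inspect


def overrides_method(cls, base, name):
    """True if cls resolves `name` to something other than base's definition."""
    return getattr(cls, name, None) is not getattr(base, name, None)


def concrete_subclasses(base):
    """Collect all transitive, non-abstract subclasses of `base`."""
    seen, stack = [], list(base.__subclasses__())
    while stack:
        cls = stack.pop()
        if cls not in seen:
            seen.append(cls)
            stack.extend(cls.__subclasses__())
    return [c for c in seen if not inspect.isabstract(c)]


# Hypothetical stand-ins for illustration only (not vLLM classes).
class BaseParser:
    def extract_tool_calls_streaming(self, *args):
        raise NotImplementedError


class StreamingParser(BaseParser):
    def extract_tool_calls_streaming(self, *args):
        return None


class LegacyParser(BaseParser):
    pass


report = {
    cls.__name__: overrides_method(cls, BaseParser, "extract_tool_calls_streaming")
    for cls in concrete_subclasses(BaseParser)
}
```

One design point worth keeping in mind: if a base class defines the method as a stub, `hasattr` is true for every subclass, so an override check like this is stricter than a `hasattr` probe.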
E2E numbers (qwen3_coder · NVIDIA GB10 / arm64 / vLLM 0.23.0, 3 runs avg)
¹ Buffer-all sends everything in the final chunk → `ttf_content` ≈ `total_time`.
² First run includes cold-start (0.98 s); stabilises at ~0.54 s by run 3.
The long-text scenario is where the difference is dramatic: with buffer-all
the client sees nothing for ~18 seconds and then a 1100-token wall of text;
with native streaming the first token arrives in 0.22 s and the response
flows progressively. Total wall-clock is unchanged — only the perceived
reactivity changes.
Bonus: the tool call is now emitted as a proper OpenAI-style incremental stream, with `tool_calls[0].function.name` first and then `arguments` as a separate delta, instead of one bundled chunk at the end.
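For readers unfamiliar with the incremental shape, here is a small client-side sketch of how such deltas are typically merged back into complete tool calls. The delta dictionaries are illustrative values following the OpenAI streaming layout (`index`, `id`, `function.name`, `function.arguments`); they are not captured output from this PR, and `merge_tool_call_deltas` is a hypothetical helper.

```python
import json


def merge_tool_call_deltas(deltas):
    """Merge OpenAI-style streamed tool_call fragments, keyed by their index."""
    calls = {}
    for delta in deltas:
        for tc in delta.get("tool_calls") or []:
            slot = calls.setdefault(
                tc["index"], {"id": None, "name": "", "arguments": ""}
            )
            if tc.get("id"):
                slot["id"] = tc["id"]
            fn = tc.get("function") or {}
            if fn.get("name"):
                slot["name"] = fn["name"]  # the name arrives in the first fragment
            slot["arguments"] += fn.get("arguments") or ""  # arguments arrive in pieces
    return [calls[i] for i in sorted(calls)]


# Illustrative deltas: name first, then the arguments as separate fragments.
example_deltas = [
    {"tool_calls": [{"index": 0, "id": "call_0", "type": "function",
                     "function": {"name": "get_weather", "arguments": ""}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": "{\"city\": "}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": "\"Berlin\"}"}}]},
    {"content": None},  # a delta without tool_calls is simply skipped
]

merged = merge_tool_call_deltas(example_deltas)
parsed_arguments = json.loads(merged[0]["arguments"])
```

The arguments string is only valid JSON once the last fragment has arrived, which is why the sketch parses it after merging rather than per delta.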
Fix shape
- […] `ChatCompletionRequest` (only `tools` is read by parsers), set `native_streaming` if `hasattr(tp_instance, "extract_tool_calls_streaming")`.
- When `native_streaming` is on, call the parser's streaming method per delta with the full 7-parameter signature (`previous_text`, `current_text`, `delta_text`, `previous_token_ids`, `current_token_ids`, `delta_token_ids`, `request`). Map the returned `DeltaMessage` to `ChatDelta` (content / reasoning_content / tool_calls). `None` → suppress this delta.
- […] request id) or is absent (future Python backends, custom parsers). The existing post-loop `extract_tool_calls` block builds the final chunk, with the same correctness as a non-streaming response.
- […] parser found no tool call (plain text response), flush the buffered content as one content delta before the final metadata chunk, and clear `chat_delta.content` so the metadata chunk does not repeat it. This also fixes the trade-off in #10346 where plain-text-with-tools responses arrived empty.
- […] byte-identical to the pre-#10346 baseline.
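The control flow in the list above can be sketched as follows. This is a simplified illustration, not the PR's code: `Delta` stands in for vLLM's `DeltaMessage`, the parser classes are hypothetical, the stand-in `extract_tool_calls` returns a plain `(tool_calls, content)` tuple, and the prefix-trimming used to avoid repeating already-streamed content after a mid-stream failure is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Delta:  # stand-in for vLLM's DeltaMessage
    content: Optional[str] = None
    tool_calls: Optional[list] = None


def stream_with_parser(chunks, parser, request=None):
    """Yield deltas for an iterable of (delta_text, delta_token_ids) pairs."""
    native = hasattr(parser, "extract_tool_calls_streaming")
    prev_text, prev_ids, sent = "", [], ""
    for delta_text, delta_ids in chunks:
        cur_text, cur_ids = prev_text + delta_text, prev_ids + list(delta_ids)
        if native:
            try:
                msg = parser.extract_tool_calls_streaming(
                    prev_text, cur_text, delta_text,
                    prev_ids, cur_ids, list(delta_ids), request,
                )
            except Exception:
                native = False  # buffer the rest; parse once after the loop
            else:
                # None (or an empty message) means: suppress this delta.
                if msg is not None and (msg.content or msg.tool_calls):
                    sent += msg.content or ""
                    yield msg
        prev_text, prev_ids = cur_text, cur_ids
    if not native:
        # Buffer path: one post-loop parse over the full output.
        tool_calls, content = parser.extract_tool_calls(prev_text, request)
        if content and sent and content.startswith(sent):
            content = content[len(sent):]  # sketch-only de-duplication
        if content or tool_calls:
            yield Delta(content=content or None, tool_calls=tool_calls or None)


class BufferOnlyParser:  # hypothetical: no streaming method at all
    def extract_tool_calls(self, text, request):
        return [], text


class PlainTextParser(BufferOnlyParser):  # hypothetical: passes text through
    def extract_tool_calls_streaming(self, previous_text, current_text, delta_text,
                                     previous_token_ids, current_token_ids,
                                     delta_token_ids, request):
        return Delta(content=delta_text)


class FailingParser(BufferOnlyParser):  # hypothetical: always raises mid-stream
    def extract_tool_calls_streaming(self, *args):
        raise RuntimeError("parser failure")


chunks = [("Hel", [1]), ("lo", [2])]
native_out = list(stream_with_parser(chunks, PlainTextParser()))
buffered_out = list(stream_with_parser(chunks, BufferOnlyParser()))
fallback_out = list(stream_with_parser(chunks, FailingParser()))
```

With the pass-through parser the text arrives as two deltas; with the other two it arrives once, at the end, which mirrors the streaming and buffer paths described above.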
Trade-offs / things to know
- `extract_tool_calls_streaming` CPU cost […]
- `vllm.entrypoints.openai.chat_completion.protocol` […]
- […] `<think>...</think><tool_call>...`, the streaming tool parser sees the reasoning markup as content. This is unchanged from #10346 (which had the same composition issue, just hidden by buffer-all). Treating reasoning-during-streaming properly is out of scope for this PR; see #10000 for the post-stream reasoning side.

Tests (`backend/python/vllm/test.py`, new class `TestStreamingToolParser`)
Six server-less mock tests that exercise `_predict(streaming=True)` directly: […] `tool_calls` emitted, no markup leak […]
All six green on vLLM 0.23.0 (NVIDIA GB10 / arm64 / CUDA 13).
Bench script: `examples/vllm-bench/`
Self-contained stdlib-only TTFT benchmark with three scenarios (`tool_call`, `plain_text_short`, `plain_text_long`). Reviewers can verify the improvement on their own model:

    python examples/vllm-bench/ttft_streaming_tool_parser.py \
        --url http://localhost:8080 --model my-coder --runs 3

The `plain_text_long` scenario (1500-token GIL explanation) shows the buffering vs streaming difference most dramatically; see numbers above.
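To make the two metrics concrete, the sketch below shows one way time-to-first-content and total time can be derived from a stream of content chunks. It is not the code of the bench script: `measure_stream` and the injectable `clock` parameter are hypothetical, and the timestamps in the usage lines come from fake clocks rather than real measurements.

```python
import time


def measure_stream(chunks, clock=time.monotonic):
    """Return time to the first non-empty chunk and total stream time."""
    start = clock()
    ttf_content = None
    for chunk in chunks:
        if chunk and ttf_content is None:
            ttf_content = clock() - start  # first visible content
    total_time = clock() - start
    return {"ttf_content": ttf_content, "total_time": total_time}


# Fake clocks so the example is deterministic (values are made up).
progressive_ticks = iter([0.0, 0.25, 2.0])
progressive = measure_stream(["a", "b", "c"], clock=lambda: next(progressive_ticks))

buffered_ticks = iter([0.0, 2.0, 2.0])
buffered = measure_stream(["", "", "abc"], clock=lambda: next(buffered_ticks))
```

In the buffered case the only non-empty chunk is the last one, so `ttf_content` equals `total_time`, which is the pattern footnote ¹ describes for the buffer-all path.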
Signed commits