feat(vllm): progressive streaming via parser.extract_tool_calls_streaming (follow-up to #10346) by pos-ei-don · Pull Request #10351 · mudler/LocalAI

pos-ei-don · 2026-06-15T21:15:07Z

Description

Follow-up to #10346: replace the buffer-all path with native progressive streaming.

When the vLLM backend has a tool_parser configured and a streaming chat
completion request includes tools, PR #10346 stops streaming any deltas
and surfaces parsed tool_calls (or the cleaned content) on the final chunk
only. This is correct for tool-call responses, but it turns the
plain-text-with-tools case into effectively non-streaming: the client
receives nothing until the model finishes.

Every concrete tool parser shipped with vLLM 0.23+ already implements
extract_tool_calls_streaming — including qwen3coder, granite4,
deepseekv31, jamba, ernie45, hermes2pro, llama3_json, mistral,
and ~30 more (verified by walking vllm.tool_parsers.* and checking each
class against ToolParser.extract_tool_calls_streaming). This PR delegates
to that interface for the streaming branch.

E2E numbers (qwen3_coder · NVIDIA GB10 / arm64 / vLLM 0.23.0, 3 runs avg)

Scenario	Buffer-all (#10346)	This PR	Delta
Plain text (~70 tokens), TTFT content	1.32 s	0.226 s	5.8× faster
Plain text (~70 tokens), content chunks	2	71	per-token stream
Plain text (~1100 tokens), TTFT content	~18 s¹	0.221 s	~80× faster
Plain text (~1100 tokens), content chunks	1¹	~1135	per-token stream
Tool call, TTFT tool	0.51 s	0.54 s²	unchanged
Tool call, raw markup leak	none ✓	none ✓	—

¹ Buffer-all sends everything in the final chunk → ttf_content ≈ total_time.
² First run includes cold-start (0.98 s); stabilises at ~0.54 s by run 3.

The long-text scenario is where the difference is dramatic: with buffer-all
the client sees nothing for ~18 seconds and then a 1100-token wall of text;
with native streaming the first token arrives in 0.22 s and the response
flows progressively. Total wall-clock is unchanged — only the perceived
reactivity changes.

Bonus: tool_call is now emitted as a proper OpenAI-style incremental
stream — tool_calls[0].function.name first, then arguments as a separate
delta — instead of one bundled chunk at the end:

chunk 1: tool_calls=[{id, type:function, function:{name:"get_weather", arguments:""}}]
chunk 2: tool_calls=[{function:{arguments:'{"city": "Paris"}'}}]
chunk 3: finish_reason="tool_calls"

Fix shape

Instantiate the tool parser before the streaming loop, build a minimal
ChatCompletionRequest (only tools is read by parsers), set
native_streaming if hasattr(tp_instance, "extract_tool_calls_streaming").
Inside the loop, when native_streaming is on, call the parser's
streaming method per delta with the full 7-parameter signature
(previous_text, current_text, delta_text, previous_token_ids,
current_token_ids, delta_token_ids, request). Map the returned
DeltaMessage to ChatDelta(content / reasoning_content / tool_calls).
None → suppress this delta.
Fallback to buffer when the streaming method raises (logged with the
request id) or is absent (future Python backends, custom parsers). The
existing post-loop extract_tool_calls block builds the final chunk —
same correctness as a non-streaming response.
Buffer-fallback content flush: when has_tool_parser is set and the
parser found no tool call (plain text response), flush the buffered
content as one content delta before the final metadata chunk, and clear
chat_delta.content so the metadata chunk does not repeat it. This also
fixes the trade-off in fix(vllm): don't stream raw tool-call markup as content when a tool parser is active #10346 where plain-text-with-tools responses
arrived empty.
No change to the no-tool-parser path — per-delta content stream is
byte-identical to the pre-fix(vllm): don't stream raw tool-call markup as content when a tool parser is active #10346 baseline.

Trade-offs / things to know

Concern	Status
New code surface (~80 lines vs ~4 in #10346)	6 mock tests cover all paths; E2E verified
Per-delta `extract_tool_calls_streaming` CPU cost	Negligible vs. token generation; vLLM's own OpenAI frontend does the same
New import path `vllm.entrypoints.openai.chat_completion.protocol`	Wrapped in try/except; falls back to buffer if absent (no crash)
Parser exceptions mid-stream	Logged with request id; rest of stream uses buffer fallback; final chunk still correct
Tokens-list tracking per delta	O(N); trivial
Reasoning + Tool parser composition	Currently the reasoning parser runs post-stream. When both parsers are active and a model emits e.g. `<think>...</think><tool_call>...`, the streaming tool parser sees the reasoning markup as content. This is unchanged from #10346 (which had the same composition issue, just hidden by buffer-all). Treating reasoning-during-streaming properly is out of scope for this PR — see #10000 for the post-stream reasoning side.
Backwards compatibility	Parser without streaming method → buffer fallback = #10346 behaviour. Non-tool-parser path → byte-identical to pre-#10346.

Tests (backend/python/vllm/test.py, new class TestStreamingToolParser)

Six server-less mock tests that exercise _predict(streaming=True) directly:

Buffer fallback: tool call → no markup as content; tool_call name present
Buffer fallback: plain text → content delivered exactly once (no dupe)
Native streaming: plain text → content streams progressively
Native streaming: tool call → structured tool_calls emitted, no markup leak
Native streaming exception → graceful fallback, no leak, no crash
No tool parser → unchanged per-delta content stream (regression guard)

All six green on vLLM 0.23.0 (NVIDIA GB10 / arm64 / CUDA 13).

Bench script — examples/vllm-bench/

Self-contained stdlib-only TTFT benchmark with three scenarios (tool_call,
plain_text_short, plain_text_long). Reviewers can verify the
improvement on their own model:

python examples/vllm-bench/ttft_streaming_tool_parser.py \
    --url http://localhost:8080 --model my-coder --runs 3

The plain_text_long scenario (1500-token GIL explanation) shows the
buffering vs streaming difference most dramatically — see numbers above.

Signed commits

Yes, I signed my commits.

…arser is active When a tool_parser is configured and the request carries tools, the streaming loop emitted every text delta as delta.content — including the model's raw tool-call markup (e.g. <tool_call>...) — because extract_tool_calls only runs on the full output after the stream. Clients streaming a tool call therefore saw the unparsed tool-call syntax as assistant content. Buffer the text while a tool parser is active for the request; the existing end-of-stream chat_delta already carries the parsed tool_calls (or the cleaned content), which the Go side converts to SSE deltas. Non-tool-parser streaming is unchanged. Add a server-less regression test covering both the tool-call case (no raw markup leaked as content) and the plain-text case (content delivered exactly once — guards against double-emitting the buffered content). Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>

…ool parser (Case 3, mudler#582)

…ve prefix (TDD, Option B state machine, mudler#582)

…ming When a tool parser is active for a tool-enabled streaming request, mudler#10346 buffers the entire generation and surfaces it on the final chunk to prevent raw tool-call markup from leaking as delta.content. This is correct but turns the request into effectively non-streaming for plain-text responses — the client sees nothing until the model stops. Every concrete tool parser shipped with vLLM 0.23+ already implements extract_tool_calls_streaming (Granite4, Qwen3Coder, DeepSeekV31, Jamba, Ernie45, Hermes2Pro, llama3_json, mistral, …). Use it: instantiate the parser before the streaming loop and call its streaming method per delta, emitting DeltaMessage(content=…) or DeltaMessage(tool_calls=[…]) when the parser is ready. Falls back to the existing mudler#10346 buffer path when: - the parser does not have extract_tool_calls_streaming, OR - extract_tool_calls_streaming raises mid-stream (logged, the rest of the request finishes via post-loop extract_tool_calls). Tests (TestStreamingToolParser): 1. Buffer path: no markup leaked, no content duplication 2. Native streaming: plain-text response streams progressively 3. Native streaming: tool_call structured, no markup leaked 4. Native streaming exception → graceful fallback, no markup, no crash 5. No tool parser → unchanged per-delta content stream E2E verified against qwen3_coder on vLLM 0.23.0 (NVIDIA GB10 / arm64 / CUDA 13).

…ser path Self-contained stdlib-only script that measures time-to-first-token (TTFT) for the vLLM backend's two streaming scenarios: - tool_call: request mentions a tool; model is expected to call it - plain_text: request offers a tool but explicitly asks for prose Use this to compare: - the buffer-all path (mudler#10346) → plain_text TTFT ≈ total response time - the native-streaming path (this PR) → plain_text TTFT ≈ true first-token time python examples/vllm-bench/ttft_streaming_tool_parser.py \\ --url http://localhost:8080 --model my-coder --runs 3 Lives under examples/ so it does not interfere with the test suite.

The long-text scenario shows the buffering vs streaming difference most dramatically: with the buffer-all path, the client receives nothing for 20+ seconds and then the entire 1500-token response at once. With native streaming, the first token arrives in tens of milliseconds and the response flows progressively.

pos-ei-don · 2026-06-15T21:26:31Z

Migration note — no breaking changes

Existing setup	Still works?	Action
`tool_parser:` configured, client uses `stream: false`	✅ unchanged	Optional: switch to `stream: true` for the UX win
`tool_parser:` configured, client filters `<tool_call>` markup out of `delta.content` locally	✅ unchanged (filter just never matches)	Workaround can be removed
No `tool_parser:` configured, client parses tool markup itself	✅ byte-identical (plain-streaming path unchanged)	None
Custom Python backend without `extract_tool_calls_streaming`	✅ falls back to the #10346 buffer path	None
Mixed with structured outputs / `response_format` / JSON-mode	✅ orthogonal pipelines, no interaction	None

Nobody is broken by this merge. Anyone who switches stream: false → stream: true after the merge gets the TTFT improvement; anyone who keeps it the way it was gets identical behaviour to today.

pos-ei-don and others added 6 commits June 15, 2026 17:46

test(vllm): add expectedFailure test for progressive streaming with t…

baeb452

…ool parser (Case 3, mudler#582)

test(vllm): add Cases 4+5 — marker split across chunks + false-positi…

7df4131

…ve prefix (TDD, Option B state machine, mudler#582)

pos-ei-don mentioned this pull request Jun 15, 2026

fix(vllm): don't stream raw tool-call markup as content when a tool parser is active #10346

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(vllm): progressive streaming via parser.extract_tool_calls_streaming (follow-up to #10346)#10351

feat(vllm): progressive streaming via parser.extract_tool_calls_streaming (follow-up to #10346)#10351
pos-ei-don wants to merge 6 commits into
mudler:masterfrom
pos-ei-don:wip/streaming-tool-parser-follow-up

pos-ei-don commented Jun 15, 2026

Uh oh!

pos-ei-don commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

pos-ei-don commented Jun 15, 2026

Uh oh!

pos-ei-don commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant