Skip to content

feat(vllm): progressive streaming via parser.extract_tool_calls_streaming (follow-up to #10346)#10351

Open
pos-ei-don wants to merge 6 commits into
mudler:masterfrom
pos-ei-don:wip/streaming-tool-parser-follow-up
Open

feat(vllm): progressive streaming via parser.extract_tool_calls_streaming (follow-up to #10346)#10351
pos-ei-don wants to merge 6 commits into
mudler:masterfrom
pos-ei-don:wip/streaming-tool-parser-follow-up

Conversation

@pos-ei-don

Copy link
Copy Markdown
Contributor

Description

Follow-up to #10346: replace the buffer-all path with native progressive streaming.

When the vLLM backend has a tool_parser configured and a streaming chat
completion request includes tools, PR #10346 stops streaming any deltas
and surfaces parsed tool_calls (or the cleaned content) on the final chunk
only. This is correct for tool-call responses, but it turns the
plain-text-with-tools case into effectively non-streaming: the client
receives nothing until the model finishes.

Every concrete tool parser shipped with vLLM 0.23+ already implements
extract_tool_calls_streaming — including qwen3coder, granite4,
deepseekv31, jamba, ernie45, hermes2pro, llama3_json, mistral,
and ~30 more (verified by walking vllm.tool_parsers.* and checking each
class against ToolParser.extract_tool_calls_streaming). This PR delegates
to that interface for the streaming branch.

E2E numbers (qwen3_coder · NVIDIA GB10 / arm64 / vLLM 0.23.0, 3 runs avg)

Scenario Buffer-all (#10346) This PR Delta
Plain text (~70 tokens), TTFT content 1.32 s 0.226 s 5.8× faster
Plain text (~70 tokens), content chunks 2 71 per-token stream
Plain text (~1100 tokens), TTFT content ~18 s¹ 0.221 s ~80× faster
Plain text (~1100 tokens), content chunks ~1135 per-token stream
Tool call, TTFT tool 0.51 s 0.54 s² unchanged
Tool call, raw markup leak none ✓ none ✓

¹ Buffer-all sends everything in the final chunk → ttf_content ≈ total_time.
² First run includes cold-start (0.98 s); stabilises at ~0.54 s by run 3.

The long-text scenario is where the difference is dramatic: with buffer-all
the client sees nothing for ~18 seconds and then a 1100-token wall of text;
with native streaming the first token arrives in 0.22 s and the response
flows progressively. Total wall-clock is unchanged — only the perceived
reactivity changes.

Bonus: tool_call is now emitted as a proper OpenAI-style incremental
stream — tool_calls[0].function.name first, then arguments as a separate
delta — instead of one bundled chunk at the end:

chunk 1: tool_calls=[{id, type:function, function:{name:"get_weather", arguments:""}}]
chunk 2: tool_calls=[{function:{arguments:'{"city": "Paris"}'}}]
chunk 3: finish_reason="tool_calls"

Fix shape

  • Instantiate the tool parser before the streaming loop, build a minimal
    ChatCompletionRequest (only tools is read by parsers), set
    native_streaming if hasattr(tp_instance, "extract_tool_calls_streaming").
  • Inside the loop, when native_streaming is on, call the parser's
    streaming method per delta with the full 7-parameter signature
    (previous_text, current_text, delta_text, previous_token_ids,
    current_token_ids, delta_token_ids, request). Map the returned
    DeltaMessage to ChatDelta(content / reasoning_content / tool_calls).
    None → suppress this delta.
  • Fallback to buffer when the streaming method raises (logged with the
    request id) or is absent (future Python backends, custom parsers). The
    existing post-loop extract_tool_calls block builds the final chunk —
    same correctness as a non-streaming response.
  • Buffer-fallback content flush: when has_tool_parser is set and the
    parser found no tool call (plain text response), flush the buffered
    content as one content delta before the final metadata chunk, and clear
    chat_delta.content so the metadata chunk does not repeat it. This also
    fixes the trade-off in fix(vllm): don't stream raw tool-call markup as content when a tool parser is active #10346 where plain-text-with-tools responses
    arrived empty.
  • No change to the no-tool-parser path — per-delta content stream is
    byte-identical to the pre-fix(vllm): don't stream raw tool-call markup as content when a tool parser is active #10346 baseline.

Trade-offs / things to know

Concern Status
New code surface (~80 lines vs ~4 in #10346) 6 mock tests cover all paths; E2E verified
Per-delta extract_tool_calls_streaming CPU cost Negligible vs. token generation; vLLM's own OpenAI frontend does the same
New import path vllm.entrypoints.openai.chat_completion.protocol Wrapped in try/except; falls back to buffer if absent (no crash)
Parser exceptions mid-stream Logged with request id; rest of stream uses buffer fallback; final chunk still correct
Tokens-list tracking per delta O(N); trivial
Reasoning + Tool parser composition Currently the reasoning parser runs post-stream. When both parsers are active and a model emits e.g. <think>...</think><tool_call>..., the streaming tool parser sees the reasoning markup as content. This is unchanged from #10346 (which had the same composition issue, just hidden by buffer-all). Treating reasoning-during-streaming properly is out of scope for this PR — see #10000 for the post-stream reasoning side.
Backwards compatibility Parser without streaming method → buffer fallback = #10346 behaviour. Non-tool-parser path → byte-identical to pre-#10346.

Tests (backend/python/vllm/test.py, new class TestStreamingToolParser)

Six server-less mock tests that exercise _predict(streaming=True) directly:

  1. Buffer fallback: tool call → no markup as content; tool_call name present
  2. Buffer fallback: plain text → content delivered exactly once (no dupe)
  3. Native streaming: plain text → content streams progressively
  4. Native streaming: tool call → structured tool_calls emitted, no markup leak
  5. Native streaming exception → graceful fallback, no leak, no crash
  6. No tool parser → unchanged per-delta content stream (regression guard)

All six green on vLLM 0.23.0 (NVIDIA GB10 / arm64 / CUDA 13).

Bench script — examples/vllm-bench/

Self-contained stdlib-only TTFT benchmark with three scenarios (tool_call,
plain_text_short, plain_text_long). Reviewers can verify the
improvement on their own model:

python examples/vllm-bench/ttft_streaming_tool_parser.py \
    --url http://localhost:8080 --model my-coder --runs 3

The plain_text_long scenario (1500-token GIL explanation) shows the
buffering vs streaming difference most dramatically — see numbers above.

Signed commits

  • Yes, I signed my commits.

pos-ei-don and others added 6 commits June 15, 2026 17:46
…arser is active

When a tool_parser is configured and the request carries tools, the streaming
loop emitted every text delta as delta.content — including the model's raw
tool-call markup (e.g. <tool_call>...) — because extract_tool_calls only runs
on the full output after the stream. Clients streaming a tool call therefore
saw the unparsed tool-call syntax as assistant content.

Buffer the text while a tool parser is active for the request; the existing
end-of-stream chat_delta already carries the parsed tool_calls (or the cleaned
content), which the Go side converts to SSE deltas. Non-tool-parser streaming
is unchanged.

Add a server-less regression test covering both the tool-call case (no raw
markup leaked as content) and the plain-text case (content delivered exactly
once — guards against double-emitting the buffered content).

Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
…ming

When a tool parser is active for a tool-enabled streaming request,
mudler#10346 buffers the entire generation and surfaces it on the final
chunk to prevent raw tool-call markup from leaking as delta.content.
This is correct but turns the request into effectively non-streaming
for plain-text responses — the client sees nothing until the model
stops.

Every concrete tool parser shipped with vLLM 0.23+ already implements
extract_tool_calls_streaming (Granite4, Qwen3Coder, DeepSeekV31, Jamba,
Ernie45, Hermes2Pro, llama3_json, mistral, …). Use it: instantiate
the parser before the streaming loop and call its streaming method per
delta, emitting DeltaMessage(content=…) or DeltaMessage(tool_calls=[…])
when the parser is ready.

Falls back to the existing mudler#10346 buffer path when:
  - the parser does not have extract_tool_calls_streaming, OR
  - extract_tool_calls_streaming raises mid-stream (logged, the
    rest of the request finishes via post-loop extract_tool_calls).

Tests (TestStreamingToolParser):
  1. Buffer path: no markup leaked, no content duplication
  2. Native streaming: plain-text response streams progressively
  3. Native streaming: tool_call structured, no markup leaked
  4. Native streaming exception → graceful fallback, no markup, no crash
  5. No tool parser → unchanged per-delta content stream

E2E verified against qwen3_coder on vLLM 0.23.0 (NVIDIA GB10 / arm64 / CUDA 13).
…ser path

Self-contained stdlib-only script that measures time-to-first-token (TTFT)
for the vLLM backend's two streaming scenarios:

  - tool_call:  request mentions a tool; model is expected to call it
  - plain_text: request offers a tool but explicitly asks for prose

Use this to compare:
  - the buffer-all path (mudler#10346)         → plain_text TTFT ≈ total response time
  - the native-streaming path (this PR)  → plain_text TTFT ≈ true first-token time

  python examples/vllm-bench/ttft_streaming_tool_parser.py \\
      --url http://localhost:8080 --model my-coder --runs 3

Lives under examples/ so it does not interfere with the test suite.
The long-text scenario shows the buffering vs streaming difference most
dramatically: with the buffer-all path, the client receives nothing for
20+ seconds and then the entire 1500-token response at once. With native
streaming, the first token arrives in tens of milliseconds and the
response flows progressively.
@pos-ei-don

Copy link
Copy Markdown
Contributor Author

Migration note — no breaking changes

Existing setup Still works? Action
tool_parser: configured, client uses stream: false ✅ unchanged Optional: switch to stream: true for the UX win
tool_parser: configured, client filters <tool_call> markup out of delta.content locally ✅ unchanged (filter just never matches) Workaround can be removed
No tool_parser: configured, client parses tool markup itself ✅ byte-identical (plain-streaming path unchanged) None
Custom Python backend without extract_tool_calls_streaming ✅ falls back to the #10346 buffer path None
Mixed with structured outputs / response_format / JSON-mode ✅ orthogonal pipelines, no interaction None

Nobody is broken by this merge. Anyone who switches stream: falsestream: true after the merge gets the TTFT improvement; anyone who keeps it the way it was gets identical behaviour to today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant