fix(vllm): don't stream raw tool-call markup as content when a tool parser is active#10346

Closed
pos-ei-don wants to merge 1 commit into mudler:master from pos-ei-don:fix/vllm-stream-tool-call-buffering

@pos-ei-don

Contributor

Description

When the vLLM backend has a tool_parser configured (options: [tool_parser:...]) and a request includes tools, streaming responses leak the model's raw tool-call markup as assistant content.

The streaming loop yields every text delta as delta.content:

if streaming:
    delta_iteration_text = iteration_text.removeprefix(generated_text)
    yield backend_pb2.Reply(
        message=bytes(delta_iteration_text, encoding='utf-8'),
        chat_deltas=[backend_pb2.ChatDelta(content=delta_iteration_text)],
    )

But the tool parser (extract_tool_calls) only runs on the full text after the stream completes. So for a model that emits e.g. <tool_call>{"name": ...}</tool_call>, the client receives that raw markup as delta.content chunks during the stream, and the structured tool_calls only arrive in the final chunk.

Reproduce (a tool parser such as qwen3_coder, streaming request with tools):

  • Before: delta.content chunks contain <tool_call>..., finish_reason: stop.
  • After: delta.tool_calls chunks with name/arguments, finish_reason: tool_calls.

Verified on NVIDIA GB10 / arm64 / CUDA 13, vLLM 0.23.0, with the qwen3_coder tool parser driving a coding agent over the streaming OpenAI API.

Fix

Buffer the text while a tool parser is active for the request, and let the existing end-of-stream ChatDelta carry the parsed tool_calls (which the Go side already converts to SSE delta.tool_calls via ToolCallsFromChatDeltas / buildDeferredToolCallChunks). When the parser finds no tool call, flush the buffered content as a single content delta before the final chunk. Non-tool-parser streaming is unchanged.
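
For readers who want the shape of the change without opening the diff, here is a minimal sketch of the buffering idea. It is not the backend code: `Delta`, `stream_deltas`, and `toy_parser` are hypothetical stand-ins (the real loop yields `backend_pb2.Reply` messages and calls the vLLM parser), and the sketch assumes the no-tool-call case is carried by the same single end-of-stream delta.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


@dataclass
class Delta:
    """Hypothetical stand-in for backend_pb2.ChatDelta."""
    content: str = ""
    tool_calls: List[dict] = field(default_factory=list)


# Hypothetical parser hook: full text -> (tool_calls, cleaned_content).
Parser = Callable[[str], Tuple[List[dict], str]]


def stream_deltas(cumulative_texts: Iterable[str],
                  extract_tool_calls: Optional[Parser] = None) -> Iterator[Delta]:
    """Stream deltas, holding them back while a tool parser is active.

    Each item of cumulative_texts is the full text generated so far,
    mirroring how the loop derives a delta with removeprefix().
    """
    generated = ""
    for text in cumulative_texts:
        delta = text.removeprefix(generated)
        generated = text
        if extract_tool_calls is None and delta:
            # No parser: unchanged behaviour, every delta goes out as content.
            yield Delta(content=delta)
        # Parser active: emit nothing yet; the text stays in `generated`.
    if extract_tool_calls is not None:
        tool_calls, content = extract_tool_calls(generated)
        # One end-of-stream delta carries the tool calls or the cleaned content.
        yield Delta(content=content, tool_calls=tool_calls)


def toy_parser(text: str) -> Tuple[List[dict], str]:
    """Toy parser for illustration only: recognises one <tool_call> block."""
    match = re.search(r"<tool_call>(.*?)</tool_call>", text, re.S)
    if match is None:
        return [], text
    return [json.loads(match.group(1))], ""
```

With `toy_parser` passed in, the client sees one delta at the end of the stream and never the raw `<tool_call>` text; with no parser, deltas flow through one by one as before.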

…arser is active

When a tool_parser is configured and the request carries tools, the streaming
loop emitted every text delta as delta.content — including the model's raw
tool-call markup (e.g. <tool_call>...) — because extract_tool_calls only runs
on the full output after the stream. Clients streaming a tool call therefore
saw the unparsed tool-call syntax as assistant content.

Buffer the text while a tool parser is active for the request; the existing
end-of-stream chat_delta already carries the parsed tool_calls (or the cleaned
content), which the Go side converts to SSE deltas. Non-tool-parser streaming
is unchanged.

Add a server-less regression test covering both the tool-call case (no raw
markup leaked as content) and the plain-text case (content delivered exactly
once — guards against double-emitting the buffered content).

Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
@pos-ei-don pos-ei-don force-pushed the fix/vllm-stream-tool-call-buffering branch from b7f0b50 to 3ae349c June 15, 2026 17:46
@pos-ei-don

Contributor Author

Updated after end-to-end verification on the live backend (NVIDIA GB10 / arm64 / CUDA 13, vLLM 0.23.0, qwen3_coder tool parser, streaming over the OpenAI API):

  • Tool-call case ✅ — delta.tool_calls with name/arguments, finish_reason: tool_calls, no raw markup in content.
  • Plain-text-with-tools case — testing surfaced a second facet of the same bug: when a tool is offered but the model answers in plain text, the buffered content was being emitted twice (once via an extra flush, once via the final chat_delta which already carries it). Fixed by dropping the redundant flush — the existing end-of-stream chat_delta.content is the single source. Re-verified: content now arrives exactly once.

Also added a server-less regression test (test_streaming_tool_parser_buffering) that mocks the engine + tool parser and asserts both invariants: no raw tool-call markup leaks as streamed content, and buffered content is delivered exactly once. It fails on the pre-fix code and passes on this change.
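
To illustrate the two invariants, here is a rough, server-less sketch of such a check. It is not the test from the PR: `buffered_stream` is a hypothetical stand-in for the patched loop, and the mocked `extract_tool_calls` uses a simplified signature (full text in, `(tool_calls, content)` out) rather than the real vLLM parser interface.

```python
from unittest.mock import MagicMock

MARKUP = "<tool_call>"


def buffered_stream(chunks, parser):
    """Hypothetical stand-in for the patched loop: hold deltas, parse once."""
    full_text = "".join(chunks)
    tool_calls, content = parser.extract_tool_calls(full_text)
    # Single end-of-stream delta, modelled as a (content, tool_calls) pair.
    return [(content, tool_calls)]


def collect(deltas):
    """Join all streamed content and flatten all tool calls."""
    content = "".join(c for c, _ in deltas)
    tool_calls = [tc for _, tcs in deltas for tc in tcs]
    return content, tool_calls


def check_tool_call_case():
    parser = MagicMock()
    parser.extract_tool_calls.return_value = ([{"name": "read_file"}], "")
    chunks = ['<tool_call>{"name":', ' "read_file"}</tool_call>']
    content, tool_calls = collect(buffered_stream(chunks, parser))
    assert MARKUP not in content  # invariant 1: no raw markup as content
    assert tool_calls == [{"name": "read_file"}]
    parser.extract_tool_calls.assert_called_once_with("".join(chunks))


def check_plain_text_case():
    parser = MagicMock()
    parser.extract_tool_calls.return_value = ([], "Hello world")
    content, tool_calls = collect(buffered_stream(["Hello ", "world"], parser))
    assert content == "Hello world"  # invariant 2: delivered exactly once
    assert tool_calls == []
```

Both checks run without a model or a server, which is what makes the second invariant cheap to guard: a reintroduced extra flush would produce `"Hello worldHello world"` and fail immediately.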

pos-ei-don pushed a commit to pos-ei-don/LocalAI that referenced this pull request Jun 15, 2026
…ming

When a tool parser is active for a tool-enabled streaming request,
mudler#10346 buffers the entire generation and surfaces it on the final
chunk to prevent raw tool-call markup from leaking as delta.content.
This is correct but turns the request into effectively non-streaming
for plain-text responses — the client sees nothing until the model
stops.

Every concrete tool parser shipped with vLLM 0.23+ already implements
extract_tool_calls_streaming (Granite4, Qwen3Coder, DeepSeekV31, Jamba,
Ernie45, Hermes2Pro, llama3_json, mistral, …). Use it: instantiate
the parser before the streaming loop and call its streaming method per
delta, emitting DeltaMessage(content=…) or DeltaMessage(tool_calls=[…])
when the parser is ready.

Falls back to the existing mudler#10346 buffer path when:
  - the parser does not have extract_tool_calls_streaming, OR
  - extract_tool_calls_streaming raises mid-stream (logged, the
    rest of the request finishes via post-loop extract_tool_calls).
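
A minimal sketch of that control flow, under stated assumptions: the parser methods here take plain text only and return `(content, tool_calls)` pairs, which is a simplification of the real vLLM interface, and `stream_with_parser` is a hypothetical name. Trimming already-sent content after a mid-stream failure is this sketch's own choice for avoiding duplicates, not necessarily what the commit does.

```python
import logging
from typing import Iterable, Iterator, List, Tuple

Delta = Tuple[str, List[dict]]  # (content, tool_calls): simplified, hypothetical
log = logging.getLogger(__name__)


def stream_with_parser(chunks: Iterable[str], parser) -> Iterator[Delta]:
    """Prefer the parser's streaming method; fall back to buffer-and-parse."""
    streaming = getattr(parser, "extract_tool_calls_streaming", None)
    use_native = callable(streaming)
    previous = ""
    sent = ""  # content already delivered to the client
    for delta in chunks:
        current = previous + delta
        if use_native:
            try:
                result = streaming(previous, current, delta)
            except Exception:
                # Logged; the remainder is handled by the post-loop parse.
                log.exception("streaming parser failed; falling back")
                use_native = False
            else:
                # None means the parser is not ready to emit anything yet.
                if result is not None:
                    content, tool_calls = result
                    if content or tool_calls:
                        sent += content
                        yield content, tool_calls
        previous = current
    if not use_native:
        content, tool_calls = parser.extract_tool_calls(previous)
        # Assumption of this sketch: drop the part that was already streamed.
        content = content.removeprefix(sent)
        if content or tool_calls:
            yield content, tool_calls
```

Tool calls that were already emitted before a failure are not deduplicated here; a real implementation would need to track those as well.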

Tests (TestStreamingToolParser):
  1. Buffer path: no markup leaked, no content duplication
  2. Native streaming: plain-text response streams progressively
  3. Native streaming: tool_call structured, no markup leaked
  4. Native streaming exception → graceful fallback, no markup, no crash
  5. No tool parser → unchanged per-delta content stream

E2E verified against qwen3_coder on vLLM 0.23.0 (NVIDIA GB10 / arm64 / CUDA 13).
pos-ei-don pushed a commit to pos-ei-don/LocalAI that referenced this pull request Jun 15, 2026
…ser path

Self-contained stdlib-only script that measures time-to-first-token (TTFT)
for the vLLM backend's two streaming scenarios:

  - tool_call:  request mentions a tool; model is expected to call it
  - plain_text: request offers a tool but explicitly asks for prose

Use this to compare:
  - the buffer-all path (mudler#10346)         → plain_text TTFT ≈ total response time
  - the native-streaming path (this PR)  → plain_text TTFT ≈ true first-token time

  python examples/vllm-bench/ttft_streaming_tool_parser.py \
      --url http://localhost:8080 --model my-coder --runs 3

Lives under examples/ so it does not interfere with the test suite.
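
For orientation, the core of such a measurement can be sketched in a few lines of stdlib Python. This is not the script from the commit: `measure_ttft` is a hypothetical helper, and it assumes the usual OpenAI-style SSE framing (`data: {...}` lines ending with `data: [DONE]`). It counts the first chunk whose delta carries non-empty `content` or `tool_calls` as the first token.

```python
import json
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_ttft(sse_lines: Iterable[str],
                 clock: Callable[[], float] = time.perf_counter
                 ) -> Tuple[Optional[float], float]:
    """Return (time_to_first_token, total_time) in seconds for one stream.

    sse_lines is any iterable of decoded SSE lines, e.g. lines read from an
    HTTP response. TTFT is None if no content or tool-call delta arrives.
    """
    start = clock()
    first: Optional[float] = None
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        choices = json.loads(payload).get("choices") or []
        delta = choices[0].get("delta", {}) if choices else {}
        if first is None and (delta.get("content") or delta.get("tool_calls")):
            first = clock() - start
    return first, clock() - start
```

On a buffer-all path the two returned numbers should be close together for a plain-text answer; on a progressively streaming path the first should be much smaller than the second.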
@pos-ei-don

Contributor Author

Superseded by #10351, which goes one step further: instead of buffering the entire generation, it delegates to parser.extract_tool_calls_streaming (implemented by every vLLM 0.23+ tool parser) and streams progressively. The buffer-all path becomes the fallback for parsers that lack the streaming method (none do today, so the fallback is purely defensive).

Same correctness as this PR (no raw tool-call markup in delta.content), plus restored streaming for plain-text-with-tools responses. Verified end-to-end against qwen3_coder — plain-text TTFT drops from ~18s (long output) to ~0.22s.

Closing in favour of #10351.

@pos-ei-don pos-ei-don closed this Jun 15, 2026
@pos-ei-don

Contributor Author

Superseded by #10351, which takes a better approach: progressive streaming via the parser's extract_tool_calls_streaming (token-by-token) instead of buffering the whole output, and it also covers the plain-text-with-tools content-duplication edge case. Please disregard the update above — continuing in #10351.

@pos-ei-don pos-ei-don deleted the fix/vllm-stream-tool-call-buffering branch June 15, 2026 21:39