fix(vllm): don't stream raw tool-call markup as content when a tool parser is active by pos-ei-don · Pull Request #10346 · mudler/LocalAI

pos-ei-don · 2026-06-15T17:25:38Z

Description

When the vLLM backend has a tool_parser configured (options: [tool_parser:...]) and a request includes tools, streaming responses leak the model's raw tool-call markup as assistant content.

The streaming loop yields every text delta as delta.content:

if streaming:
    delta_iteration_text = iteration_text.removeprefix(generated_text)
    yield backend_pb2.Reply(
        message=bytes(delta_iteration_text, encoding='utf-8'),
        chat_deltas=[backend_pb2.ChatDelta(content=delta_iteration_text)],
    )

But the tool parser (extract_tool_calls) only runs on the full text after the stream completes. So for a model that emits e.g. <tool_call>{"name": ...}</tool_call>, the client receives that raw markup as delta.content chunks during the stream, and the structured tool_calls only arrive in the final chunk.
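The leak can be reproduced without a server. The following is a minimal sketch, not the backend code: `naive_stream` and the sample `outputs` are hypothetical names, and the engine is assumed to report cumulative text on each iteration, as the `removeprefix` call in the loop above suggests.

```python
from typing import Iterator, List


def naive_stream(cumulative_texts: List[str]) -> Iterator[str]:
    """Yield every new text delta as content, mirroring the pre-fix loop."""
    generated_text = ""
    for iteration_text in cumulative_texts:
        # Same delta computation as the backend snippet (Python 3.9+).
        delta_iteration_text = iteration_text.removeprefix(generated_text)
        generated_text = iteration_text
        yield delta_iteration_text


# Hypothetical cumulative engine outputs for a tool-calling turn.
outputs = [
    "<tool_call>",
    '<tool_call>{"name": "get_weather"}',
    '<tool_call>{"name": "get_weather"}</tool_call>',
]

# Everything the client would have received as delta.content.
leaked = "".join(naive_stream(outputs))
```

Because no parser runs inside the loop, `leaked` ends up holding the complete raw markup.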

Reproduce (a tool parser such as qwen3_coder, streaming request with tools):

Before: delta.content chunks contain <tool_call>..., finish_reason: stop.
After: delta.tool_calls chunks with name/arguments, finish_reason: tool_calls.

Verified on NVIDIA GB10 / arm64 / CUDA 13, vLLM 0.23.0, with the qwen3_coder tool parser driving a coding agent over the streaming OpenAI API.

Fix

Buffer the text while a tool parser is active for the request, and let the existing end-of-stream ChatDelta carry the parsed tool_calls (which the Go side already converts to SSE delta.tool_calls via ToolCallsFromChatDeltas / buildDeferredToolCallChunks). When the parser finds no tool call, flush the buffered content as a single content delta before the final chunk. Non-tool-parser streaming is unchanged.
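A rough sketch of the buffering idea, under stated assumptions: `Delta`, `toy_extract_tool_calls` and `stream` are hypothetical stand-ins, not LocalAI or vLLM APIs, and the toy parser only understands `<tool_call>…</tool_call>` wrapping a JSON object. The sketch folds buffered plain text into the single final delta, which is the behaviour the later update in this thread settles on; it omits the end-of-stream chunk that the real backend also sends on the non-parser path.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Delta:
    content: str = ""
    tool_calls: List[dict] = field(default_factory=list)


def toy_extract_tool_calls(text: str) -> List[dict]:
    """Hypothetical stand-in for a tool parser's extract_tool_calls."""
    bodies = re.findall(r"<tool_call>(.*?)</tool_call>", text, re.S)
    return [json.loads(body) for body in bodies]


def stream(cumulative_texts: List[str], parser_active: bool) -> Iterator[Delta]:
    generated_text = ""
    for iteration_text in cumulative_texts:
        delta_text = iteration_text.removeprefix(generated_text)
        generated_text = iteration_text
        if not parser_active:
            # Unchanged path: every delta goes out as content.
            yield Delta(content=delta_text)
        # With a parser active, nothing is emitted here: the text is buffered.
    if parser_active:
        calls = toy_extract_tool_calls(generated_text)
        if calls:
            yield Delta(tool_calls=calls)          # structured call, no markup
        else:
            yield Delta(content=generated_text)    # plain text, exactly once
```

The trade-off is visible in the loop: with the parser active, the client receives nothing until generation has finished.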

…arser is active

When a tool_parser is configured and the request carries tools, the streaming loop emitted every text delta as delta.content, including the model's raw tool-call markup (e.g. <tool_call>...), because extract_tool_calls only runs on the full output after the stream. Clients streaming a tool call therefore saw the unparsed tool-call syntax as assistant content.

Buffer the text while a tool parser is active for the request; the existing end-of-stream chat_delta already carries the parsed tool_calls (or the cleaned content), which the Go side converts to SSE deltas. Non-tool-parser streaming is unchanged.

Add a server-less regression test covering both the tool-call case (no raw markup leaked as content) and the plain-text case (content delivered exactly once, which guards against double-emitting the buffered content).

Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>

pos-ei-don · 2026-06-15T17:47:05Z

Updated after end-to-end verification on the live backend (NVIDIA GB10 / arm64 / CUDA 13, vLLM 0.23.0, qwen3_coder tool parser, streaming over the OpenAI API):

Tool-call case ✅ — delta.tool_calls with name/arguments, finish_reason: tool_calls, no raw markup in content.
Plain-text-with-tools case — testing surfaced a second facet of the same bug: when a tool is offered but the model answers in plain text, the buffered content was being emitted twice (once via an extra flush, once via the final chat_delta which already carries it). Fixed by dropping the redundant flush — the existing end-of-stream chat_delta.content is the single source. Re-verified: content now arrives exactly once.

Also added a server-less regression test (test_streaming_tool_parser_buffering) that mocks the engine + tool parser and asserts both invariants: no raw tool-call markup leaks as streamed content, and buffered content is delivered exactly once. It fails on the pre-fix code and passes on this change.

…ming

When a tool parser is active for a tool-enabled streaming request, mudler#10346 buffers the entire generation and surfaces it on the final chunk to prevent raw tool-call markup from leaking as delta.content. This is correct but turns the request into effectively non-streaming for plain-text responses: the client sees nothing until the model stops.

Every concrete tool parser shipped with vLLM 0.23+ already implements extract_tool_calls_streaming (Granite4, Qwen3Coder, DeepSeekV31, Jamba, Ernie45, Hermes2Pro, llama3_json, mistral, …). Use it: instantiate the parser before the streaming loop and call its streaming method per delta, emitting DeltaMessage(content=…) or DeltaMessage(tool_calls=[…]) when the parser is ready.

Falls back to the existing mudler#10346 buffer path when:
- the parser does not have extract_tool_calls_streaming, OR
- extract_tool_calls_streaming raises mid-stream (logged, the rest of the request finishes via post-loop extract_tool_calls).

Tests (TestStreamingToolParser):
1. Buffer path: no markup leaked, no content duplication
2. Native streaming: plain-text response streams progressively
3. Native streaming: tool_call structured, no markup leaked
4. Native streaming exception → graceful fallback, no markup, no crash
5. No tool parser → unchanged per-delta content stream

E2E verified against qwen3_coder on vLLM 0.23.0 (NVIDIA GB10 / arm64 / CUDA 13).
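The control flow described in this commit message could look roughly like the sketch below. It is an illustration under assumptions, not the code from the follow-up PR: `ToyParser` and `stream` are hypothetical, the toy streaming method takes only three arguments (the real vLLM method takes more), plain dicts stand in for `DeltaMessage`, and each delta is assumed to be either plain text or markup, never a mix. Emitting only the not-yet-streamed remainder after a mid-stream fallback is also an assumption of this sketch.

```python
import json
from typing import Iterator, List, Optional

OPEN, CLOSE = "<tool_call>", "</tool_call>"


class ToyParser:
    """Hypothetical parser exposing both a buffered and a streaming method."""

    def extract_tool_calls(self, text: str) -> List[dict]:
        calls = []
        for part in text.split(OPEN)[1:]:
            if CLOSE in part:
                calls.append(json.loads(part.split(CLOSE)[0]))
        return calls

    def extract_tool_calls_streaming(
        self, previous: str, current: str, delta: str
    ) -> Optional[dict]:
        if OPEN not in current:
            return {"content": delta}  # plain text: pass straight through
        if CLOSE in current and CLOSE not in previous:
            return {"tool_calls": self.extract_tool_calls(current)}
        return None  # inside the markup: hold back


def stream(cumulative_texts: List[str], parser) -> Iterator[dict]:
    previous, streamed = "", ""
    native = hasattr(parser, "extract_tool_calls_streaming")
    for current in cumulative_texts:
        delta = current.removeprefix(previous)
        if native:
            try:
                msg = parser.extract_tool_calls_streaming(previous, current, delta)
            except Exception:
                native, msg = False, None  # buffer for the rest of the request
            if msg:
                streamed += msg.get("content", "")
                yield msg
        previous = current
    if not native:
        # Buffer path: parse the full text once the stream has ended.
        calls = parser.extract_tool_calls(previous)
        if calls:
            yield {"tool_calls": calls}
        else:
            rest = previous.removeprefix(streamed)
            if rest:
                yield {"content": rest}
```

With `ToyParser`, a plain-text answer streams delta by delta, while a parser object that lacks the streaming method, or whose streaming method raises, ends up on the buffer path.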

…ser path

Self-contained stdlib-only script that measures time-to-first-token (TTFT) for the vLLM backend's two streaming scenarios:
- tool_call: request mentions a tool; model is expected to call it
- plain_text: request offers a tool but explicitly asks for prose

Use this to compare:
- the buffer-all path (mudler#10346) → plain_text TTFT ≈ total response time
- the native-streaming path (this PR) → plain_text TTFT ≈ true first-token time

    python examples/vllm-bench/ttft_streaming_tool_parser.py \
        --url http://localhost:8080 --model my-coder --runs 3

Lives under examples/ so it does not interfere with the test suite.
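The comparison above rests on timing the first non-empty chunk against the end of the stream. The sketch below shows only that measurement idea and is not the script from the commit: `measure_ttft`, `FakeClock` and `simulated` are hypothetical names, and a fake clock replaces real HTTP streaming so the numbers are deterministic.

```python
import time
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


def measure_ttft(
    chunks: Iterable[str], clock: Callable[[], float] = time.perf_counter
) -> Tuple[Optional[float], float]:
    """Return (seconds until the first non-empty chunk, total seconds)."""
    start = clock()
    ttft = None
    for chunk in chunks:
        if chunk and ttft is None:
            ttft = clock() - start
    return ttft, clock() - start


class FakeClock:
    """Deterministic clock used instead of time.perf_counter in this sketch."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


def simulated(clock: FakeClock, timed_chunks: List[Tuple[float, str]]) -> Iterator[str]:
    """Yield chunks, advancing the fake clock by each chunk's delay first."""
    for delay, chunk in timed_chunks:
        clock.now += delay
        yield chunk


# Buffer-all path: one chunk at the very end, so TTFT equals total time.
clock_a = FakeClock()
buffered = measure_ttft(simulated(clock_a, [(3.0, "whole answer")]), clock_a)

# Progressive path: same total time, but the first chunk arrives early.
clock_b = FakeClock()
progressive = measure_ttft(
    simulated(clock_b, [(0.5, "a"), (1.0, "b"), (1.5, "c")]), clock_b
)
```

Against a live server, the iterable would be the decoded content or tool-call deltas from the SSE stream, and the default `time.perf_counter` clock would apply.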

pos-ei-don · 2026-06-15T21:15:24Z

Superseded by #10351, which goes one step further: instead of buffering the entire generation, it delegates to parser.extract_tool_calls_streaming (implemented by every vLLM 0.23+ tool parser) and streams progressively. The buffer-all path becomes the fallback for parsers that lack the streaming method (none, today, but defensive).

Same correctness as this PR (no raw tool-call markup in delta.content), plus restored streaming for plain-text-with-tools responses. Verified end-to-end against qwen3_coder — plain-text TTFT drops from ~18s (long output) to ~0.22s.

Closing in favour of #10351.

pos-ei-don · 2026-06-15T21:39:33Z

Superseded by #10351, which takes a better approach: progressive streaming via the parser's extract_tool_calls_streaming (token-by-token) instead of buffering the whole output, and it also covers the plain-text-with-tools content-duplication edge case. Please disregard the update above — continuing in #10351.

pos-ei-don force-pushed the fix/vllm-stream-tool-call-buffering branch from b7f0b50 to 3ae349c Compare June 15, 2026 17:46

pos-ei-don mentioned this pull request Jun 15, 2026

feat(vllm): progressive streaming via parser.extract_tool_calls_streaming (follow-up to #10346) #10351

Open

1 task

pos-ei-don closed this Jun 15, 2026

pos-ei-don deleted the fix/vllm-stream-tool-call-buffering branch June 15, 2026 21:39

pos-ei-don wants to merge 1 commit into mudler:master from pos-ei-don:fix/vllm-stream-tool-call-buffering
