
Gemma4 parser #4179

Open
przepeck wants to merge 33 commits into main from przepeck/gemma_parser

Conversation

@przepeck
Collaborator

@przepeck przepeck commented May 5, 2026

🛠 Summary

CVS-184756
Enables the Gemma4 model and changes the VLM pipeline to accept special tokens in streaming.

🧪 Checklist

  • Unit tests added.
  • Documentation updated.
  • Change follows security best practices.

@przepeck przepeck force-pushed the przepeck/gemma_parser branch from ab0aa5b to 1c329a4 Compare May 6, 2026 09:53
Comment thread src/llm/io_processing/gemma4/tool_parser.cpp Outdated

#pragma warning(push)
#pragma warning(disable : 6313)
#include <rapidjson/writer.h>
Collaborator

You can probably use the technique suggested by @atobiszei: hide these headers in a port directory and include them from there. Keep the pragmas inside those wrapper headers; that way you don't need them here.
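A minimal sketch of that technique, assuming a hypothetical wrapper header at `port/rapidjson_writer.hpp` (the path and guard style are illustrative, not the project's actual layout):

```cpp
// port/rapidjson_writer.hpp — hypothetical wrapper that keeps the MSVC
// warning pragmas in one place instead of repeating them at every include site.
#pragma once

#ifdef _MSC_VER
#pragma warning(push)
#pragma warning(disable : 6313)
#endif

#include <rapidjson/writer.h>

#ifdef _MSC_VER
#pragma warning(pop)
#endif
```

Consumers would then include `"port/rapidjson_writer.hpp"` and never see the pragmas.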

}

TEST_F(Gemma4OutputParserTest, ParseToolCallWithListOfStringsAsArgument) {
std::string inputWithProperClosure = "<|tool_call>call:generate_DNA_sequence{length:100,preferences:[<|\"|>G<|\"|>,<|\"|>C<|\"|>]}<tool_call|>";
Collaborator

@dkalinowski dkalinowski May 6, 2026


Is it possible that the model generates spaces between these keywords? Even if not by design, it can happen when the model has low accuracy or subtle accuracy issues; how will this parser react to that?

Collaborator Author

I'm not sure; I didn't encounter any accuracy issues.

Comment thread src/test/llm/output_parsers/gemma4_output_parser_test.cpp Outdated
@przepeck przepeck marked this pull request as ready for review May 6, 2026 12:07
Copilot AI review requested due to automatic review settings May 6, 2026 12:07
Comment thread src/test/llm/output_parsers/gemma4_output_parser_test.cpp Outdated
Comment thread src/test/llm/output_parsers/gemma4_output_parser_test.cpp Outdated
Contributor

Copilot AI left a comment


Pull request overview

This PR adds Gemma4 tool-call parsing support to OVMS LLM I/O processing and adjusts the VLM legacy unary path to preserve/handle special tokens by reusing the streaming text collection logic (per CVS-184756). It also updates model-prep scripts and introduces a dedicated Gemma4 output parser test suite.

Changes:

  • Added a new Gemma4ToolParser and wired it into OutputParser selection.
  • Modified VLM legacy unary response preparation to reconstruct the full text from streamer callbacks (to retain special tokens) and updated OpenAI API handler interfaces accordingly.
  • Added Gemma4 tokenizer/model preparation steps and extensive unit tests for Gemma4 tool-call parsing (unary + streaming scenarios).

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
windows_prepare_llm_models.bat Adds Gemma4 tokenizer download step for Windows test model prep.
prepare_llm_models.sh Adds Gemma4 tokenizer conversion/download for Linux test model prep.
src/test/llm/output_parsers/gemma4_output_parser_test.cpp New unit tests covering Gemma4 tool-call parsing and streaming chunk behavior.
src/test/http_openai_handler_test.cpp Updates tests to match the new VLM unary serialization API signature.
src/llm/visual_language_model/legacy/servable.cpp Implements unary “streaming-style” text accumulation to preserve special tokens; passes full text into unary serialization.
src/llm/io_processing/utils.hpp Adds new utility declarations used by Gemma4 parsing/serialization.
src/llm/io_processing/utils.cpp Implements JSON argument writing and delimiter search respecting nested structures/quotes.
src/llm/io_processing/output_parser.cpp Registers gemma4 tool parser type.
src/llm/io_processing/gemma4/tool_parser.hpp New Gemma4 tool parser interface + streaming state machine declarations.
src/llm/io_processing/gemma4/tool_parser.cpp New Gemma4 tool parser implementation (unary + streaming).
src/llm/BUILD Adds Bazel targets/deps for Gemma4 parser and RapidJSON usage in utils.
src/llm/apis/openai_api_handler.hpp Updates VLM unary serialization interface to accept explicit textResponse.
src/llm/apis/openai_completions.hpp Updates VLM unary serialization signature.
src/llm/apis/openai_completions.cpp Uses provided textResponse for VLM unary serialization.
src/llm/apis/openai_responses.hpp Updates VLM unary serialization signature.
src/llm/apis/openai_responses.cpp Uses provided textResponse for VLM unary serialization.
spelling-whitelist.txt Adds the new Gemma4 test file to the spelling whitelist.
Comments suppressed due to low confidence (2)

src/llm/apis/openai_completions.cpp:497

  • serializeUnaryResponse(VLMDecodedResults&, textResponse) only emits a choice when textResponse is non-empty. If the model legitimately generates an empty string, this returns an empty choices array, which is not a valid OpenAI-style response (a single choice with empty content/tool_calls should still be returned). Consider always creating one choice and using textResponse as-is (even when empty).
    if (!textResponse.empty()) {
        SPDLOG_LOGGER_TRACE(llm_calculator_logger, "Generated text: {}", textResponse);

        // Workaround to use OVMS unary parsers: get tokens from string
        // This way we have detokenized text from GenAI and calculate tokens, to further convert back to text again, in parseOutputIfNeeded...
        auto generatedTokens = encodeTextToTokens(textResponse);

        SPDLOG_LOGGER_TRACE(llm_calculator_logger, "Generated tokens: {}", generatedTokens);
        ParsedOutput parsedOutput = parseOutputIfNeeded(generatedTokens);
        jsonResponse.StartObject();
        // finish_reason: "stop" in regular scenario, "tool_calls" if output contains tool calls
        auto finishReason = mapFinishReason(ov::genai::GenerationFinishReason::STOP, !parsedOutput.toolCalls.empty());
        jsonResponse.FinishReason(finishReason.value_or("unknown"));
        // index: integer; Choice index, only n=1 supported anyway
        jsonResponse.Index(index++);
        // TODO: logprobs: object/null; Log probability information for the choice.

        if (endpoint == Endpoint::CHAT_COMPLETIONS) {
            jsonResponse.MessageObject(parsedOutput);
        } else if (endpoint == Endpoint::COMPLETIONS) {
            jsonResponse.Text(parsedOutput);
        }

        // finish message object
        jsonResponse.EndObject();
    }
    // finish choices array
    jsonResponse.EndArray();
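The suggestion above (always emit exactly one choice, even for an empty generation) can be illustrated with a stripped-down sketch. The `serializeChoices` helper and the plain string output are hypothetical stand-ins; the real code uses the project's `jsonResponse` writer:

```cpp
#include <string>

// Hypothetical stand-in for the real serializer: always emit exactly one
// choice, using textResponse as-is even when it is empty, so the "choices"
// array in the response is never empty.
std::string serializeChoices(const std::string& textResponse) {
    std::string out = "[";
    out += "{\"index\":0,\"finish_reason\":\"stop\",\"message\":{\"content\":\"";
    out += textResponse;  // an empty string still yields one valid choice
    out += "\"}}";
    out += "]";
    return out;
}
```

With this shape, `serializeChoices("")` still returns one choice with empty content rather than an empty array.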

src/llm/apis/openai_responses.cpp:676

  • serializeUnaryResponse(VLMDecodedResults&, textResponse) skips adding any ParsedOutput when textResponse is empty, so serializeUnaryResponseImpl() is called with an empty vector. If the model output is legitimately empty, this likely produces a response without any output items/content, which is invalid for the Responses API. Consider emitting a single empty ParsedOutput (or otherwise ensuring one output item exists) even when textResponse is empty.
    std::vector<ParsedOutput> parsedOutputs;
    if (!textResponse.empty()) {
        if (outputParser != nullptr) {
            // Same workaround as in chat completions
            auto generatedTokens = encodeTextToTokens(textResponse);
            parsedOutputs.push_back(parseOutputIfNeeded(generatedTokens));
        } else {
            // Fast path: no output parser, use decoded text directly.
            ParsedOutput output;
            output.content = textResponse;
            parsedOutputs.push_back(std::move(output));
        }
    }
    return serializeUnaryResponseImpl(parsedOutputs);

Comment on lines +314 to +316
this->streamingPosition = toolCallEndTagPos + TOOL_CALL_END_TAG.length();
this->currentState = State::AfterToolCall;
SPDLOG_LOGGER_TRACE(llm_calculator_logger, "Detected end of tool call at position: {}, returning to content state", toolCallEndTagPos);
Comment on lines +39 to +67
std::string Gemma4ToolParser::parseArrayParameter(std::string argumentStr) {
    size_t pos = 1;
    std::string parsedArguments = "[";

    while (pos != std::string::npos) {
        size_t stringStartPos = argumentStr.find(TOOL_ARGS_STRING_INDICATOR, pos);
        if (stringStartPos == std::string::npos) {
            break;
        }
        stringStartPos += TOOL_ARGS_STRING_INDICATOR.size();
        size_t stringEndPos = argumentStr.find(TOOL_ARGS_STRING_INDICATOR, stringStartPos);
        if (stringEndPos == std::string::npos) {
            break;
        }

        std::string originalStr = argumentStr.substr(stringStartPos, stringEndPos - stringStartPos);
        size_t quotePos = 0;
        while ((quotePos = originalStr.find('\"', quotePos)) != std::string::npos) {
            originalStr.insert(quotePos, "\\");
            quotePos += 2;
        }
        parsedArguments += "\"" + originalStr + "\",";

        pos = stringEndPos + TOOL_ARGS_STRING_INDICATOR.size() + 1;
    }

    parsedArguments.back() = ']';

    return parsedArguments;
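For comparison, a self-contained sketch of the same string-list parsing that also handles the empty-list case (where `parsedArguments.back() = ']'` would turn `"["` into `"]"`). The free function and the assumption that `TOOL_ARGS_STRING_INDICATOR` is the literal `<|"|>` seen in the tests are both hypothetical:

```cpp
#include <string>

// Hypothetical re-sketch of parseArrayParameter: extracts strings wrapped in
// the assumed <|"|> delimiter from input like [<|"|>G<|"|>,<|"|>C<|"|>] into
// a JSON array of strings, escaping embedded double quotes, and returning []
// when no delimited strings are found.
std::string parseArrayParameter(const std::string& argumentStr) {
    static const std::string kIndicator = "<|\"|>";  // assumed delimiter
    std::string parsed = "[";
    size_t pos = 0;
    while (true) {
        size_t start = argumentStr.find(kIndicator, pos);
        if (start == std::string::npos) break;
        start += kIndicator.size();
        size_t end = argumentStr.find(kIndicator, start);
        if (end == std::string::npos) break;

        std::string item = argumentStr.substr(start, end - start);
        // Escape embedded double quotes for JSON.
        for (size_t q = 0; (q = item.find('"', q)) != std::string::npos; q += 2) {
            item.insert(q, "\\");
        }
        if (parsed.size() > 1) parsed += ",";
        parsed += "\"" + item + "\"";
        pos = end + kIndicator.size();
    }
    parsed += "]";  // correct for zero elements too
    return parsed;
}
```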
Comment thread src/llm/io_processing/gemma4/tool_parser.cpp Outdated
Comment on lines +38 to +62
void writeArgumentOfAnyType(const rapidjson::Value& arg, rapidjson::Writer<rapidjson::StringBuffer>& writer) {
    if (arg.IsString()) {
        writer.String(arg.GetString());
    } else if (arg.IsInt64()) {
        writer.Int64(arg.GetInt64());
    } else if (arg.IsDouble()) {
        writer.Double(arg.GetDouble());
    } else if (arg.IsBool()) {
        writer.Bool(arg.GetBool());
    } else if (arg.IsArray()) {
        writer.StartArray();
        for (auto& elem : arg.GetArray()) {
            writeArgumentOfAnyType(elem, writer);
        }
        writer.EndArray();
    } else if (arg.IsObject()) {
        writer.StartObject();
        for (auto it = arg.MemberBegin(); it != arg.MemberEnd(); ++it) {
            writer.Key(it->name.GetString());
            writeArgumentOfAnyType(it->value, writer);
        }
        writer.EndObject();
    } else {
        writer.String("");
    }
Comment on lines +356 to +386
std::optional<rapidjson::Document> Gemma4ToolParser::parseChunk(const std::string& chunk, ov::genai::GenerationFinishReason finishReason) {
    if (chunk.empty()) {
        return std::nullopt;
    }

    this->streamingContent += chunk;

    if (parseNewContent()) {
        if (this->currentState == State::ToolCallParameters) {
            return BaseOutputParser::wrapFirstDelta(this->toolCall.name, toolCallIndex);
        }
        if (this->currentState == State::ToolCallEnded) {
            return wrapDeltaArgs(this->toolCall.arguments, toolCallIndex);
        }
        if (this->currentState == State::Content) {
            size_t contentEnd = this->streamingContent.find(TOOL_CALL_START_TAG, this->streamingPosition);
            std::string content;
            if (contentEnd != std::string::npos) {
                content = this->streamingContent.substr(this->streamingPosition, contentEnd - this->streamingPosition);
            } else {
                content = this->streamingContent.substr(this->streamingPosition);
            }
            this->streamingPosition += content.size();
            if (!content.empty()) {
                return wrapDeltaContent(content);
            }
        }
        if (this->currentState == State::AfterToolCall) {
            this->currentState = State::Content;
        }
    }
Comment thread src/test/llm/output_parsers/gemma4_output_parser_test.cpp
Comment thread src/test/llm/output_parsers/gemma4_output_parser_test.cpp
Comment thread src/test/llm/output_parsers/gemma4_output_parser_test.cpp
Comment thread src/test/llm/output_parsers/gemma4_output_parser_test.cpp
Comment thread prepare_llm_models.sh Outdated
@dtrawins dtrawins added this to the 2026.2_rc milestone May 8, 2026
@przepeck przepeck force-pushed the przepeck/gemma_parser branch from dea7d61 to c6f3835 Compare May 8, 2026 09:49
Comment thread src/llm/io_processing/gemma4/reasoning_parser.hpp Outdated
size_t endPos = std::find(generatedTokens.begin(), generatedTokens.end(), reasoningEndTokenId) - generatedTokens.begin();

if (startPos != std::string::npos && endPos != std::string::npos && startPos < endPos) {
size_t reasoningStart = startPos + 3; // deleting "<|channel>thought\n"
Collaborator

Ensure these tokens were really the ones you assumed; otherwise we might not detect issues early.

Comment thread src/llm/io_processing/gemma4/reasoning_parser.cpp Outdated
// Remove reasoning from content
std::string contentWithoutReasoning = tokenizer.decode(std::vector<int64_t>(generatedTokens.begin() + endPos + 1, generatedTokens.end()), ov::genai::skip_special_tokens(true));
parsedOutput.content = contentWithoutReasoning;
}
Collaborator

What if there was no reasoning? What will take care of content parsing?

Collaborator Author

The tool parser will probably take care of that.

Comment thread src/llm/io_processing/gemma4/reasoning_parser.cpp Outdated
    Gemma4ReasoningParser() = delete;
    explicit Gemma4ReasoningParser(ov::genai::Tokenizer& tokenizer) :
        Qwen3ReasoningParser(tokenizer) {}
    void parse(ParsedOutput& parsedOutput, const std::vector<int64_t>& generatedTokens) override;
Collaborator

Are we missing support for parseChunk? Does it work with streaming?

Collaborator Author

I've added it now.

Comment thread src/llm/io_processing/gemma4/tool_parser.cpp Outdated

std::string originalStr = argumentStr.substr(stringStartPos, stringEndPos - stringStartPos);
size_t quotePos = 0;
while ((quotePos = originalStr.find('\"', quotePos)) != std::string::npos) {
Collaborator

Please check whether we need multi-level escaping, like we used to need in the qwen3-coder parser.
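The escaping concern above can be illustrated with a self-contained sketch (not the qwen3-coder implementation): inserting `\` only before `"` breaks when the model output already contains backslashes, so both characters must be escaped.

```cpp
#include <string>

// Sketch of JSON string escaping that also handles pre-existing backslashes:
// escape both '\' and '"' in one pass, so input that already carries escape
// characters does not produce sequences a JSON parser would reject.
std::string escapeForJson(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        if (c == '\\' || c == '"') out += '\\';
        out += c;
    }
    return out;
}
```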

Comment thread src/llm/io_processing/gemma4/tool_parser.cpp Outdated
Comment thread src/llm/io_processing/gemma4/tool_parser.cpp Outdated
Comment thread src/llm/io_processing/gemma4/tool_parser.cpp Outdated
If the tag is shorter than the buffer, we check:
a) if the tag is a substring of the buffer (tag is fully matched)
b) if the buffer and tag overlap (part of the tag is matched)
in the first case we return FOUND_COMPLETE, in the second FOUND_INCOMPLETE
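The rule quoted above can be sketched as a small matcher. The function name `findTag` and the `NOT_FOUND` value are assumptions; `FOUND_COMPLETE` and `FOUND_INCOMPLETE` come from the quoted comment, and the tag is assumed non-empty:

```cpp
#include <algorithm>
#include <string>

enum class TagMatch { NOT_FOUND, FOUND_INCOMPLETE, FOUND_COMPLETE };

// Sketch of the tag-search rule from the comment: FOUND_COMPLETE when the
// tag occurs inside the buffer, FOUND_INCOMPLETE when a non-empty suffix of
// the buffer matches a proper prefix of the tag (the tag may be completed
// by the next streamed chunk), NOT_FOUND otherwise.
TagMatch findTag(const std::string& buffer, const std::string& tag) {
    if (buffer.find(tag) != std::string::npos) return TagMatch::FOUND_COMPLETE;
    // Check overlaps, longest first: suffix of buffer vs prefix of tag.
    size_t maxLen = std::min(buffer.size(), tag.size() - 1);
    for (size_t len = maxLen; len > 0; --len) {
        if (buffer.compare(buffer.size() - len, len, tag, 0, len) == 0) {
            return TagMatch::FOUND_INCOMPLETE;
        }
    }
    return TagMatch::NOT_FOUND;
}
```

On FOUND_INCOMPLETE, a streaming parser would hold the overlapping suffix back instead of emitting it as content.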
Collaborator

?

Comment thread src/llm/io_processing/output_parser.cpp Outdated
Using appropriate parser based on the current processing phase
Call to this method should return either result from parserContentChunk, parseToolCallChunk, parseReasoningChunk when we can determine the phase
or std::nullopt when we are waiting for more chunks to determine if we should switch phase or not.
Note that mentioned methods do not take chunk as argument, they read it from streamOutputCache and are responsible for clearing the cache,
Collaborator

?

Comment thread src/test/http_openai_handler_test.cpp Outdated

ASSERT_EQ(apiHandler->parseRequest(maxTokensLimit, bestOfLimit, maxModelLength), absl::OkStatus());

std::string toolCallContent = "I will call a tool.<tool_call>{\"name\":\"get_weather\",\"arguments\":{\"location\":\"Paris\"}}</tool_call>";
Collaborator

?

Collaborator Author

@przepeck przepeck May 8, 2026


This is leftover content from rebasing main; I will remove it.

}

// TODO(mzegla): Usage of streaming flow here due to GenAI generate limitations.
// This diverges from the general flow - we should have unified systematic approach.
Collaborator

Keep the comment. Why are you removing isDisconnected?

    int singleQuoteDepth = 0;

    for (size_t i = startPos; i < str.length(); ++i) {
        if (bracketDepth == 0 && braceDepth == 0 && quoteDepth == 0 && singleQuoteDepth == 0 &&
Collaborator

Is it also used by other parsers? We should run BFCL for the other parsers that use it so we are sure there is no regression (I think LFM uses it?).

4 participants