
Gemma4 parser #4179

Open
przepeck wants to merge 33 commits into main from przepeck/gemma_parser

Conversation

@przepeck
Collaborator

@przepeck przepeck commented May 5, 2026

🛠 Summary

CVS-184756
Enables the Gemma4 model and changes the VLM pipeline to accept special tokens in streaming.

🧪 Checklist

  • Unit tests added.
  • Documentation updated.
  • Change follows security best practices.

@przepeck przepeck force-pushed the przepeck/gemma_parser branch from ab0aa5b to 1c329a4 Compare May 6, 2026 09:53
Comment thread src/llm/io_processing/gemma4/tool_parser.cpp Outdated

#pragma warning(push)
#pragma warning(disable : 6313)
#include <rapidjson/writer.h>
Collaborator

You can probably use the technique suggested by @atobiszei: hide these headers in a port directory and include them from there. Keep the pragmas inside those wrapper headers; that way you don't need them here.
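A minimal sketch of that technique, assuming a hypothetical wrapper header at `port/rapidjson_writer.hpp` (the path and guard style are illustrative, not the project's actual layout):

```cpp
// port/rapidjson_writer.hpp — hypothetical wrapper that keeps the MSVC
// warning pragmas in one place instead of repeating them at every include site.
#pragma once

#ifdef _MSC_VER
#pragma warning(push)
#pragma warning(disable : 6313)
#endif

#include <rapidjson/writer.h>

#ifdef _MSC_VER
#pragma warning(pop)
#endif
```

Consumers would then include `"port/rapidjson_writer.hpp"` and never see the pragmas.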

}

TEST_F(Gemma4OutputParserTest, ParseToolCallWithListOfStringsAsArgument) {
std::string inputWithProperClosure = "<|tool_call>call:generate_DNA_sequence{length:100,preferences:[<|\"|>G<|\"|>,<|\"|>C<|\"|>]}<tool_call|>";
Collaborator

@dkalinowski dkalinowski May 6, 2026


Is it possible that the model generates spaces between these keywords? Even if not by design, it can happen when the model has low accuracy or subtle accuracy issues; how will this parser react to that?

Collaborator Author

I'm not sure; I didn't encounter any accuracy issues.

Comment thread src/test/llm/output_parsers/gemma4_output_parser_test.cpp Outdated
@przepeck przepeck marked this pull request as ready for review May 6, 2026 12:07
Copilot AI review requested due to automatic review settings May 6, 2026 12:07
Comment thread src/test/llm/output_parsers/gemma4_output_parser_test.cpp Outdated
Comment thread src/test/llm/output_parsers/gemma4_output_parser_test.cpp Outdated
Contributor

Copilot AI left a comment


Pull request overview

This PR adds Gemma4 tool-call parsing support to OVMS LLM I/O processing and adjusts the VLM legacy unary path to preserve/handle special tokens by reusing the streaming text collection logic (per CVS-184756). It also updates model-prep scripts and introduces a dedicated Gemma4 output parser test suite.

Changes:

  • Added a new Gemma4ToolParser and wired it into OutputParser selection.
  • Modified VLM legacy unary response preparation to reconstruct the full text from streamer callbacks (to retain special tokens) and updated OpenAI API handler interfaces accordingly.
  • Added Gemma4 tokenizer/model preparation steps and extensive unit tests for Gemma4 tool-call parsing (unary + streaming scenarios).

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
windows_prepare_llm_models.bat Adds Gemma4 tokenizer download step for Windows test model prep.
prepare_llm_models.sh Adds Gemma4 tokenizer conversion/download for Linux test model prep.
src/test/llm/output_parsers/gemma4_output_parser_test.cpp New unit tests covering Gemma4 tool-call parsing and streaming chunk behavior.
src/test/http_openai_handler_test.cpp Updates tests to match the new VLM unary serialization API signature.
src/llm/visual_language_model/legacy/servable.cpp Implements unary “streaming-style” text accumulation to preserve special tokens; passes full text into unary serialization.
src/llm/io_processing/utils.hpp Adds new utility declarations used by Gemma4 parsing/serialization.
src/llm/io_processing/utils.cpp Implements JSON argument writing and delimiter search respecting nested structures/quotes.
src/llm/io_processing/output_parser.cpp Registers gemma4 tool parser type.
src/llm/io_processing/gemma4/tool_parser.hpp New Gemma4 tool parser interface + streaming state machine declarations.
src/llm/io_processing/gemma4/tool_parser.cpp New Gemma4 tool parser implementation (unary + streaming).
src/llm/BUILD Adds Bazel targets/deps for Gemma4 parser and RapidJSON usage in utils.
src/llm/apis/openai_api_handler.hpp Updates VLM unary serialization interface to accept explicit textResponse.
src/llm/apis/openai_completions.hpp Updates VLM unary serialization signature.
src/llm/apis/openai_completions.cpp Uses provided textResponse for VLM unary serialization.
src/llm/apis/openai_responses.hpp Updates VLM unary serialization signature.
src/llm/apis/openai_responses.cpp Uses provided textResponse for VLM unary serialization.
spelling-whitelist.txt Adds the new Gemma4 test file to the spelling whitelist.
Comments suppressed due to low confidence (2)

src/llm/apis/openai_completions.cpp:497

  • serializeUnaryResponse(VLMDecodedResults&, textResponse) only emits a choice when textResponse is non-empty. If the model legitimately generates an empty string, this returns an empty choices array, which is not a valid OpenAI-style response (a single choice with empty content/tool_calls should still be returned). Consider always creating one choice and using textResponse as-is (even when empty).
    if (!textResponse.empty()) {
        SPDLOG_LOGGER_TRACE(llm_calculator_logger, "Generated text: {}", textResponse);

        // Workaround to use OVMS unary parsers: get tokens from string
        // This way we have detokenized text from GenAI and calculate tokens, to further convert back to text again, in parseOutputIfNeeded...
        auto generatedTokens = encodeTextToTokens(textResponse);

        SPDLOG_LOGGER_TRACE(llm_calculator_logger, "Generated tokens: {}", generatedTokens);
        ParsedOutput parsedOutput = parseOutputIfNeeded(generatedTokens);
        jsonResponse.StartObject();
        // finish_reason: "stop" in regular scenario, "tool_calls" if output contains tool calls
        auto finishReason = mapFinishReason(ov::genai::GenerationFinishReason::STOP, !parsedOutput.toolCalls.empty());
        jsonResponse.FinishReason(finishReason.value_or("unknown"));
        // index: integer; Choice index, only n=1 supported anyway
        jsonResponse.Index(index++);
        // TODO: logprobs: object/null; Log probability information for the choice.

        if (endpoint == Endpoint::CHAT_COMPLETIONS) {
            jsonResponse.MessageObject(parsedOutput);
        } else if (endpoint == Endpoint::COMPLETIONS) {
            jsonResponse.Text(parsedOutput);
        }

        // finish message object
        jsonResponse.EndObject();
    }
    // finish choices array
    jsonResponse.EndArray();
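The suggestion above (always emit exactly one choice, even for an empty generation) can be illustrated with a stripped-down sketch. The `serializeChoices` helper and the plain string output are hypothetical stand-ins; the real code uses the project's `jsonResponse` writer:

```cpp
#include <string>

// Hypothetical stand-in for the real serializer: always emit exactly one
// choice, using textResponse as-is even when it is empty, so the "choices"
// array in the response is never empty.
std::string serializeChoices(const std::string& textResponse) {
    std::string out = "[";
    out += "{\"index\":0,\"finish_reason\":\"stop\",\"message\":{\"content\":\"";
    out += textResponse;  // an empty string still yields one valid choice
    out += "\"}}";
    out += "]";
    return out;
}
```

With this shape, `serializeChoices("")` still returns one choice with empty content rather than an empty array.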

src/llm/apis/openai_responses.cpp:676

  • serializeUnaryResponse(VLMDecodedResults&, textResponse) skips adding any ParsedOutput when textResponse is empty, so serializeUnaryResponseImpl() is called with an empty vector. If the model output is legitimately empty, this likely produces a response without any output items/content, which is invalid for the Responses API. Consider emitting a single empty ParsedOutput (or otherwise ensuring one output item exists) even when textResponse is empty.
    std::vector<ParsedOutput> parsedOutputs;
    if (!textResponse.empty()) {
        if (outputParser != nullptr) {
            // Same workaround as in chat completions
            auto generatedTokens = encodeTextToTokens(textResponse);
            parsedOutputs.push_back(parseOutputIfNeeded(generatedTokens));
        } else {
            // Fast path: no output parser, use decoded text directly.
            ParsedOutput output;
            output.content = textResponse;
            parsedOutputs.push_back(std::move(output));
        }
    }
    return serializeUnaryResponseImpl(parsedOutputs);

Comment on lines +314 to +316
this->streamingPosition = toolCallEndTagPos + TOOL_CALL_END_TAG.length();
this->currentState = State::AfterToolCall;
SPDLOG_LOGGER_TRACE(llm_calculator_logger, "Detected end of tool call at position: {}, returning to content state", toolCallEndTagPos);
Comment on lines +39 to +67
std::string Gemma4ToolParser::parseArrayParameter(std::string argumentStr) {
    size_t pos = 1;
    std::string parsedArguments = "[";

    while (pos != std::string::npos) {
        size_t stringStartPos = argumentStr.find(TOOL_ARGS_STRING_INDICATOR, pos);
        if (stringStartPos == std::string::npos) {
            break;
        }
        stringStartPos += TOOL_ARGS_STRING_INDICATOR.size();
        size_t stringEndPos = argumentStr.find(TOOL_ARGS_STRING_INDICATOR, stringStartPos);
        if (stringEndPos == std::string::npos) {
            break;
        }

        std::string originalStr = argumentStr.substr(stringStartPos, stringEndPos - stringStartPos);
        size_t quotePos = 0;
        while ((quotePos = originalStr.find('\"', quotePos)) != std::string::npos) {
            originalStr.insert(quotePos, "\\");
            quotePos += 2;
        }
        parsedArguments += "\"" + originalStr + "\",";

        pos = stringEndPos + TOOL_ARGS_STRING_INDICATOR.size() + 1;
    }

    parsedArguments.back() = ']';

    return parsedArguments;
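For comparison, a self-contained sketch of the same string-list parsing that also handles the empty-list case (where `parsedArguments.back() = ']'` would turn `"["` into `"]"`). The free function and the assumption that `TOOL_ARGS_STRING_INDICATOR` is the literal `<|"|>` seen in the tests are both hypothetical:

```cpp
#include <string>

// Hypothetical re-sketch of parseArrayParameter: extracts strings wrapped in
// the assumed <|"|> delimiter from input like [<|"|>G<|"|>,<|"|>C<|"|>] into
// a JSON array of strings, escaping embedded double quotes, and returning []
// when no delimited strings are found.
std::string parseArrayParameter(const std::string& argumentStr) {
    static const std::string kIndicator = "<|\"|>";  // assumed delimiter
    std::string parsed = "[";
    size_t pos = 0;
    while (true) {
        size_t start = argumentStr.find(kIndicator, pos);
        if (start == std::string::npos) break;
        start += kIndicator.size();
        size_t end = argumentStr.find(kIndicator, start);
        if (end == std::string::npos) break;

        std::string item = argumentStr.substr(start, end - start);
        // Escape embedded double quotes for JSON.
        for (size_t q = 0; (q = item.find('"', q)) != std::string::npos; q += 2) {
            item.insert(q, "\\");
        }
        if (parsed.size() > 1) parsed += ",";
        parsed += "\"" + item + "\"";
        pos = end + kIndicator.size();
    }
    parsed += "]";  // correct for zero elements too
    return parsed;
}
```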
Comment thread src/llm/io_processing/gemma4/tool_parser.cpp Outdated
Comment on lines +38 to +62
void writeArgumentOfAnyType(const rapidjson::Value& arg, rapidjson::Writer<rapidjson::StringBuffer>& writer) {
    if (arg.IsString()) {
        writer.String(arg.GetString());
    } else if (arg.IsInt64()) {
        writer.Int64(arg.GetInt64());
    } else if (arg.IsDouble()) {
        writer.Double(arg.GetDouble());
    } else if (arg.IsBool()) {
        writer.Bool(arg.GetBool());
    } else if (arg.IsArray()) {
        writer.StartArray();
        for (auto& elem : arg.GetArray()) {
            writeArgumentOfAnyType(elem, writer);
        }
        writer.EndArray();
    } else if (arg.IsObject()) {
        writer.StartObject();
        for (auto it = arg.MemberBegin(); it != arg.MemberEnd(); ++it) {
            writer.Key(it->name.GetString());
            writeArgumentOfAnyType(it->value, writer);
        }
        writer.EndObject();
    } else {
        writer.String("");
    }
Comment on lines +356 to +386
std::optional<rapidjson::Document> Gemma4ToolParser::parseChunk(const std::string& chunk, ov::genai::GenerationFinishReason finishReason) {
    if (chunk.empty()) {
        return std::nullopt;
    }

    this->streamingContent += chunk;

    if (parseNewContent()) {
        if (this->currentState == State::ToolCallParameters) {
            return BaseOutputParser::wrapFirstDelta(this->toolCall.name, toolCallIndex);
        }
        if (this->currentState == State::ToolCallEnded) {
            return wrapDeltaArgs(this->toolCall.arguments, toolCallIndex);
        }
        if (this->currentState == State::Content) {
            size_t contentEnd = this->streamingContent.find(TOOL_CALL_START_TAG, this->streamingPosition);
            std::string content;
            if (contentEnd != std::string::npos) {
                content = this->streamingContent.substr(this->streamingPosition, contentEnd - this->streamingPosition);
            } else {
                content = this->streamingContent.substr(this->streamingPosition);
            }
            this->streamingPosition += content.size();
            if (!content.empty()) {
                return wrapDeltaContent(content);
            }
        }
        if (this->currentState == State::AfterToolCall) {
            this->currentState = State::Content;
        }
    }
Comment thread src/test/llm/output_parsers/gemma4_output_parser_test.cpp
Comment thread src/test/llm/output_parsers/gemma4_output_parser_test.cpp
Comment thread src/test/llm/output_parsers/gemma4_output_parser_test.cpp
Comment thread src/test/llm/output_parsers/gemma4_output_parser_test.cpp
Comment thread prepare_llm_models.sh Outdated
@dtrawins dtrawins added this to the 2026.2_rc milestone May 8, 2026
@przepeck przepeck force-pushed the przepeck/gemma_parser branch from dea7d61 to c6f3835 Compare May 8, 2026 09:49
Comment thread src/llm/io_processing/gemma4/reasoning_parser.hpp Outdated
size_t endPos = std::find(generatedTokens.begin(), generatedTokens.end(), reasoningEndTokenId) - generatedTokens.begin();

if (startPos != std::string::npos && endPos != std::string::npos && startPos < endPos) {
size_t reasoningStart = startPos + 3; // deleting "<|channel>thought\n"
Collaborator

Ensure these tokens were really the ones you assumed; otherwise we might not detect issues early.

Comment thread src/llm/io_processing/gemma4/reasoning_parser.cpp Outdated
// Remove reasoning from content
std::string contentWithoutReasoning = tokenizer.decode(std::vector<int64_t>(generatedTokens.begin() + endPos + 1, generatedTokens.end()), ov::genai::skip_special_tokens(true));
parsedOutput.content = contentWithoutReasoning;
}
Collaborator

What if there was no reasoning? What will take care of content parsing?

Collaborator Author

The tool parser will probably take care of that.

Comment thread src/llm/io_processing/gemma4/reasoning_parser.cpp Outdated
    Gemma4ReasoningParser() = delete;
    explicit Gemma4ReasoningParser(ov::genai::Tokenizer& tokenizer) :
        Qwen3ReasoningParser(tokenizer) {}
    void parse(ParsedOutput& parsedOutput, const std::vector<int64_t>& generatedTokens) override;
Collaborator

Are we missing support for parseChunk? Does it work with streaming?

Collaborator Author

I've added it now.

Comment thread src/llm/io_processing/gemma4/tool_parser.cpp Outdated

std::string originalStr = argumentStr.substr(stringStartPos, stringEndPos - stringStartPos);
size_t quotePos = 0;
while ((quotePos = originalStr.find('\"', quotePos)) != std::string::npos) {
Collaborator

Please check whether we need multi-level escaping, like we used to need in the qwen3-coder parser.
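The escaping concern above can be illustrated with a self-contained sketch (not the qwen3-coder implementation): inserting `\` only before `"` breaks when the model output already contains backslashes, so both characters must be escaped.

```cpp
#include <string>

// Sketch of JSON string escaping that also handles pre-existing backslashes:
// escape both '\' and '"' in one pass, so input that already carries escape
// characters does not produce sequences a JSON parser would reject.
std::string escapeForJson(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        if (c == '\\' || c == '"') out += '\\';
        out += c;
    }
    return out;
}
```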

Comment thread src/llm/io_processing/gemma4/tool_parser.cpp Outdated
Comment thread src/llm/io_processing/gemma4/tool_parser.cpp Outdated
Comment thread src/llm/io_processing/gemma4/tool_parser.cpp Outdated
If the tag is shorter than the buffer, we check:
a) if the tag is a substring of the buffer (tag is fully matched)
b) if the buffer and tag overlap (part of the tag is matched)
in the first case we return FOUND_COMPLETE, in the second FOUND_INCOMPLETE
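The rule quoted above can be sketched as a small matcher. The function name `findTag` and the `NOT_FOUND` value are assumptions; `FOUND_COMPLETE` and `FOUND_INCOMPLETE` come from the quoted comment, and the tag is assumed non-empty:

```cpp
#include <algorithm>
#include <string>

enum class TagMatch { NOT_FOUND, FOUND_INCOMPLETE, FOUND_COMPLETE };

// Sketch of the tag-search rule from the comment: FOUND_COMPLETE when the
// tag occurs inside the buffer, FOUND_INCOMPLETE when a non-empty suffix of
// the buffer matches a proper prefix of the tag (the tag may be completed
// by the next streamed chunk), NOT_FOUND otherwise.
TagMatch findTag(const std::string& buffer, const std::string& tag) {
    if (buffer.find(tag) != std::string::npos) return TagMatch::FOUND_COMPLETE;
    // Check overlaps, longest first: suffix of buffer vs prefix of tag.
    size_t maxLen = std::min(buffer.size(), tag.size() - 1);
    for (size_t len = maxLen; len > 0; --len) {
        if (buffer.compare(buffer.size() - len, len, tag, 0, len) == 0) {
            return TagMatch::FOUND_INCOMPLETE;
        }
    }
    return TagMatch::NOT_FOUND;
}
```

On FOUND_INCOMPLETE, a streaming parser would hold the overlapping suffix back instead of emitting it as content.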
Collaborator

?

Comment thread src/llm/io_processing/output_parser.cpp Outdated
Using appropriate parser based on the current processing phase
Call to this method should return either result from parserContentChunk, parseToolCallChunk, parseReasoningChunk when we can determine the phase
or std::nullopt when we are waiting for more chunks to determine if we should switch phase or not.
Note that mentioned methods do not take chunk as argument, they read it from streamOutputCache and are responsible for clearing the cache,
Collaborator

?

Comment thread src/test/http_openai_handler_test.cpp Outdated

ASSERT_EQ(apiHandler->parseRequest(maxTokensLimit, bestOfLimit, maxModelLength), absl::OkStatus());

std::string toolCallContent = "I will call a tool.<tool_call>{\"name\":\"get_weather\",\"arguments\":{\"location\":\"Paris\"}}</tool_call>";
Collaborator

?

Collaborator Author

@przepeck przepeck May 8, 2026


This is leftover content from rebasing main; I will remove it.

}

// TODO(mzegla): Usage of streaming flow here due to GenAI generate limitations.
// This diverges from the general flow - we should have unified systematic approach.
Collaborator

Keep the comment. Why are you removing isDisconnected?

    int singleQuoteDepth = 0;

    for (size_t i = startPos; i < str.length(); ++i) {
        if (bracketDepth == 0 && braceDepth == 0 && quoteDepth == 0 && singleQuoteDepth == 0 &&
Collaborator

Is it also used by other parsers? We should run BFCL for the other parsers that use it so we are sure there is no regression (I think LFM uses it?).

4 participants