
fix: Gemma-4 QuantizedKVCache + kv_bits API + Test 9 (mlx-swift-lm b440)#73

Merged
solderzzc merged 4 commits into main from fix/gemma4-quantizedkv-b440
Apr 22, 2026
Conversation

@solderzzc
Member

Summary

Closes #71. Bundles three changes:

1. mlx-swift-lm submodule → b440

Bumps the submodule from 50c3732 (b436) to 63707c0 (b440), which includes the Gemma4Text fix that dispatches QuantizedKVCache correctly in LLM attention (merges upstream PR #29, closes #71).

2. Server.swift: kv_bits per-request API field

Adds kv_bits: Int? to ChatCompletionRequest (JSON key "kv_bits") and threads it into GenerateParameters.kvBits. Enables MLX-native QuantizedKVCache (4-bit / 8-bit) without a CLI flag or server restart.
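As a minimal sketch, a request body that opts into the per-request quantized cache might look like this (the "kv_bits" JSON key and the model name come from this PR; the endpoint, transport, and other fields are omitted assumptions):

```python
import json

# Sketch of a per-request opt-in to the 4-bit QuantizedKVCache.
# "kv_bits" is the JSON key added by this PR; the model name is taken
# from the Test 9 run. Endpoint/transport details are omitted.
body = json.dumps({
    "model": "mlx-community/gemma-4-26b-a4b-it-4bit",
    "messages": [{"role": "user", "content": "Say hello."}],
    "kv_bits": 4,
})
```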

3. run_benchmark.sh — Test 9: QuantizedKVCache Regression

New test suite with four decode requests: kv_bits=4 (short prompt), kv_bits=8 (short prompt), kv_bits=4 with a longer prompt (exercises the KV-sharing layers), and a baseline without quantization.

Test 9 result on gemma-4-26b-a4b-it-4bit:

[1/4] kv_bits=4 short  ✅  [1.6s, 13 tokens]
[2/4] kv_bits=8 short  ✅  [0.9s, 15 tokens]
[3/4] kv_bits=4 long   ✅  [3.9s, 66 tokens]
[4/4] baseline         ✅  [2.8s, 48 tokens]

✅ REGRESSION PASSED — QuantizedKVCache dispatches correctly.
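The pass criterion for each sub-test can be sketched as a small helper (hypothetical names; assumes the standard chat-completions response shape, and the 3-token minimum is illustrative):

```python
# Hypothetical sketch of one Test 9 sub-check: the decode must return a
# non-empty completion with at least a handful of generated tokens.
def check_decode(reply: dict, min_tokens: int = 3) -> bool:
    content = (reply["choices"][0]["message"].get("content") or "").strip()
    tokens = reply.get("usage", {}).get("completion_tokens", 0)
    return bool(content) and tokens >= min_tokens
```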

…b440)

- Bumps mlx-swift-lm submodule to b440 (tag) / 63707c0:
  fix(Gemma4Text): dispatch QuantizedKVCache correctly in LLM attention
  (merges PR #29, closes #71)

- Server.swift: expose `kv_bits` as a per-request API field
  (ChatCompletionRequest.kvBits -> GenerateParameters.kvBits)
  enabling native MLX QuantizedKVCache without a server restart.

- run_benchmark.sh: add Test 9 — QuantizedKVCache regression suite
  [1/4] kv_bits=4 short  [2/4] kv_bits=8 short
  [3/4] kv_bits=4 long (KV-sharing path)  [4/4] baseline

  Test 9 passed on mlx-community/gemma-4-26b-a4b-it-4bit.
Copilot AI review requested due to automatic review settings April 22, 2026 18:19
Contributor

Copilot AI left a comment


Pull request overview

Updates SwiftLM to support per-request native MLX quantized KV cache selection for Gemma-4 (issue #71), and adds a new benchmark suite to regression-test kv_bits decoding.

Changes:

  • Adds kv_bits to the chat completions request schema and threads it into GenerateParameters.kvBits.
  • Introduces “Test 9” in run_benchmark.sh to exercise kv_bits=4/8 and a non-quantized baseline.
  • (Per PR description) bumps the mlx-swift-lm submodule to a revision that fixes Gemma-4 QuantizedKVCache dispatch.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File Description
run_benchmark.sh Adds a new interactive menu option and a Python-based regression suite for kv_bits QuantizedKVCache decoding.
Sources/SwiftLM/Server.swift Extends the chat request API with kv_bits and forwards it into generation parameters.


Comment thread run_benchmark.sh Outdated
Comment on lines +981 to +983
# - 4-bit run: prefill + ≥20 decode tokens, response is non-empty coherent text
# - 8-bit run: same
# - Multi-turn run: second turn with kv_bits=4 also succeeds (exercises sharedKV path)

Copilot AI Apr 22, 2026


The Test 9 header comment/documentation doesn’t match what the script actually validates: it claims “≥20 decode tokens” and a “multi-turn run”, but the checks only require 3–5 tokens and there is no multi-turn conversation (each call is a fresh request). Please align the stated pass criteria with the implemented assertions, or update the script to actually perform the multi-turn/≥20-token checks.

Suggested change
# - 4-bit run: prefill + ≥20 decode tokens, response is non-empty coherent text
# - 8-bit run: same
# - Multi-turn run: second turn with kv_bits=4 also succeeds (exercises sharedKV path)
# - 4-bit run succeeds without server crash and returns a short non-empty response
# - 8-bit run succeeds without server crash and returns a short non-empty response
# - The regression is considered covered by these single-request decode checks

Comment thread run_benchmark.sh Outdated
Comment on lines +998 to +1010
echo "Waiting for server (up to 180s)..."
for i in {1..180}; do
if ! kill -0 $SERVER_PID 2>/dev/null; then
echo "❌ Server died early. Logs:"
print_server_log ./tmp/kvcache_regression.log
exit 1
fi
if curl -sf http://127.0.0.1:5431/health > /dev/null 2>&1; then
echo "Server ready (${i}s)"
break
fi
sleep 1
done

Copilot AI Apr 22, 2026


The server readiness loop breaks early on success, but if the server never becomes healthy within 180s the loop just ends and the script continues into the Python test, which will fail with a less-informative connection error. Add an explicit post-loop check/flag to fail fast with a clear “server not ready after 180s” message (and print logs) when the timeout is hit.
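The fail-fast pattern recommended here (poll until healthy, then check after the loop whether the deadline was hit) can be sketched as follows; Python is used for brevity, the actual script is bash, and all names are hypothetical:

```python
import time

def wait_ready(probe, timeout_s: int = 180) -> bool:
    """Poll `probe()` once per second until it succeeds or the deadline
    passes. Returns False on timeout so the caller can fail fast with a
    clear 'server not ready' message instead of a vague connection error."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if probe():
            return True
        time.sleep(1)
    return False
```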

Comment on lines 1051 to +1054
let params = GenerateParameters(
maxTokens: tokenLimit,
maxKVSize: config.ctxSize,
kvBits: chatReq.kvBits,

Copilot AI Apr 22, 2026


kvBits is threaded into GenerateParameters, but the prompt cache currently keys only on token prefix and will happily restore a cached KV state produced with different generation parameters. With a per-request kv_bits, this can mix quantized and non-quantized cache states across requests, which is unsafe and can lead to incorrect results or runtime failures. Consider including kvBits (and any other cache-shaping params) in the prompt-cache key, or disabling prompt-cache restore/save whenever chatReq.kvBits is non-nil.

Suggested change
let params = GenerateParameters(
maxTokens: tokenLimit,
maxKVSize: config.ctxSize,
kvBits: chatReq.kvBits,
// Do not thread per-request kvBits into generation while prompt-cache identity
// is still based only on prompt prefix. Reusing a cached KV state created with a
// different kvBits setting is unsafe and can produce incorrect results or failures.
let params = GenerateParameters(
maxTokens: tokenLimit,
maxKVSize: config.ctxSize,
kvBits: nil,
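The first alternative Copilot names (making the cache key kv_bits-aware) can be sketched like this; the names are hypothetical, not the PR's actual Swift code:

```python
# Hedged sketch: key cached KV states on both the token prefix and the
# cache-shaping parameters, so a QuantizedKVCache entry produced with
# kv_bits=4 is never restored for a non-quantized (kv_bits=None) request.
def cache_key(token_prefix: tuple, kv_bits):
    return (token_prefix, kv_bits)

prompt_cache = {}
prompt_cache[cache_key((101, 202, 303), 4)] = "4-bit KV state"
```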

Comment thread Sources/SwiftLM/Server.swift Outdated
Comment on lines +2309 to +2311
/// Number of bits for native MLX quantized KV cache (nil = no quantization, 4 or 8 typical).
/// Enables `QuantizedKVCache` instead of `KVCacheSimple`. Separate from `--turbo-kv`.
let kvBits: Int?

Copilot AI Apr 22, 2026


kv_bits is a new public request field, but there’s no input validation here. If a client sends an unsupported value (e.g. 0, 3, 16), it will be forwarded into GenerateParameters and may cause an internal failure (500) instead of a clear 400. Add explicit validation (accept only nil/4/8, or whatever the underlying library supports) and return a structured OpenAI-style error message for invalid values.
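One way to implement the suggested validation, sketched in Python (the actual server is Swift; the function name and error shape are assumptions modeled on OpenAI-style error bodies):

```python
def validate_kv_bits(kv_bits):
    """Return None if kv_bits is acceptable, else a structured 400-style
    error body. Accepted values follow the review comment: None, 4, or 8
    (widen if the underlying library supports more)."""
    if kv_bits is None or kv_bits in (4, 8):
        return None
    return {"error": {
        "type": "invalid_request_error",
        "code": "invalid_kv_bits",
        "message": f"kv_bits must be 4 or 8 when provided (got {kv_bits!r})",
    }}
```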

let params = GenerateParameters(
maxTokens: tokenLimit,
maxKVSize: config.ctxSize,
kvBits: chatReq.kvBits,

Copilot AI Apr 22, 2026


This PR adds a new per-request kv_bits execution path, but the CI integration tests (e.g. tests/test-server.sh) don’t cover it. Add at least one E2E test that sets kv_bits (and ideally alternates quantized/non-quantized requests) to ensure the server doesn’t crash and returns valid responses.

Comment thread run_benchmark.sh Outdated
elapsed = time.time() - t0
content = d["choices"][0]["message"].get("content") or ""
# Strip Gemma-4 thinking blocks
content = re.sub(r"<\|channel\|>thought.*?<channel\|>", "", content, flags=re.DOTALL).strip()

Copilot AI Apr 22, 2026


The “strip thinking blocks” regex only matches the <|channel|>thought ... <channel|> form. Elsewhere in this repo the tag is referenced as <|channel>thought (no trailing | after channel). To make this regression test robust across tag variants, expand the regex to handle both forms (and ideally require a correct closing token).

Suggested change
content = re.sub(r"<\|channel\|>thought.*?<channel\|>", "", content, flags=re.DOTALL).strip()
content = re.sub(r"<\|channel\|?>thought.*?<\|/channel\|?>", "", content, flags=re.DOTALL).strip()
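The widened pattern from the suggestion can be checked against both tag variants directly (a sketch; the helper name is hypothetical):

```python
import re

# Matches both <|channel|>thought ... <|/channel|> and the pipeless
# variant <|channel>thought ... <|/channel>, per the suggested change.
PATTERN = re.compile(r"<\|channel\|?>thought.*?<\|/channel\|?>", re.DOTALL)

def strip_thinking(text: str) -> str:
    return PATTERN.sub("", text).strip()
```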

README.md:
- Added '🔧 Per-Request API Parameters' section with kv_bits table,
  kv_bits vs --turbo-kv comparison table, and curl usage example
- Clarified --turbo-kv CLI entry: 'activates after 2048 tokens, server-wide'

Server.swift:
- Added kv_bits input validation (only nil/4/8 accepted; returns 400 otherwise)
- Bypass prompt cache restore when kv_bits is set (prevents unsafe mixing of
  QuantizedKVCache and KVCacheSimple states across requests)
- Bypass prompt cache save when kv_bits is set (same safety reason)

run_benchmark.sh (Test 9):
- Corrected header comment to match actual assertions (removed false ≥20 token
  and multi-turn claims; stated actual ≥3 token / non-empty checks)
- Added explicit SERVER_READY flag + post-loop failure with log dump
- Widened thinking-block regex to handle both <|channel|>thought and <|channel>thought
@solderzzc
Member Author

Thanks for the thorough review. Applied all 5 comments in the latest push (f007b3b):

run_benchmark.sh comments:

  1. Header comment mismatch — Corrected pass criteria to accurately describe actual assertions: ≥3 tokens, non-empty response for each sub-test. Removed the incorrect ≥20 token and multi-turn claims.
  2. Server-not-ready silent pass-through — Added a SERVER_READY flag; after the loop, fail fast with log dump if the server never became healthy.
  3. Thinking-block regex coverage — Widened regex to <\|channel\|?>thought.*?<channel\|?> to cover both the <|channel|>thought and <|channel>thought tag variants.

Server.swift comments:
4. Prompt cache mixing (safety-critical) — Added params.kvBits != nil bypass on both the cache restore path (skipPromptCache flag) and the cache save path, with an explanatory comment. Quantized and non-quantized KV states are now kept strictly separate.
5. kv_bits input validation — Added an explicit guard before GenerateParameters construction: any value other than nil, 4, or 8 returns a structured 400 invalid_request_error with code: invalid_kv_bits.

The E2E test comment (CI test coverage) is tracked as a follow-up — wiring it into tests/test-server.sh is straightforward but outside the scope of this targeted fix PR.

- Replace 🧠 with 📡 heading emoji
- Rewrite as structured tables (Text / Vision / Audio) with all 50+ model
  families derived from the actual MLXLLM + MLXVLM model file inventory
- LLM table: Gemma, Qwen, Phi, Mistral, Llama, GLM, DeepSeek, Falcon,
  LFM2, OLMo, Granite, SmolLM3, InternLM2, Cohere, Jamba, Exaone, MiMo,
  Ernie, Baichuan, Bailing, NemotronH, Starcoder2, OpenELM, BitNet,
  MiniMax, Apertus/AfMoE, MiniCPM, Qwen3Next
- VLM table: Gemma4, Gemma3, Qwen3-VL, Qwen2-VL/2.5-VL, LFM2-VL,
  Pixtral, PaliGemma, Idefics3, Mistral3, FastVLM, SmolVLM2, GlmOcr, QwenVL
- ALM table: Gemma-4-e4b only (factually correct — Qwen2-Audio removed;
  it was never wired into the audio pipeline here)
@solderzzc solderzzc merged commit e4a2036 into main Apr 22, 2026
8 checks passed
@solderzzc solderzzc deleted the fix/gemma4-quantizedkv-b440 branch April 22, 2026 19:41


Development

Successfully merging this pull request may close these issues.

[Bug]: MLXLLM Gemma 4 attention crashes when kvBits is set

2 participants