
fix: Gemma-4 QuantizedKVCache + kv_bits API + Test 9 (mlx-swift-lm b440)#73

Merged
solderzzc merged 4 commits into main from fix/gemma4-quantizedkv-b440
Apr 22, 2026
Conversation

@solderzzc
Member

Summary

Closes #71. Bundles three changes:

1. mlx-swift-lm submodule → b440

Bumps the submodule from 50c3732 (b436) to 63707c0 (b440), which includes the Gemma4Text fix that dispatches QuantizedKVCache correctly in LLM attention (merges upstream PR #29, closes #71).

2. Server.swift: kv_bits per-request API field

Adds kv_bits: Int? to ChatCompletionRequest (JSON key "kv_bits") and threads it into GenerateParameters.kvBits. Enables MLX-native QuantizedKVCache (4-bit / 8-bit) without a CLI flag or server restart.
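As a minimal sketch, a request body that opts into the per-request quantized cache might look like this (the "kv_bits" JSON key and the model name come from this PR; the endpoint, transport, and other fields are omitted assumptions):

```python
import json

# Sketch of a per-request opt-in to the 4-bit QuantizedKVCache.
# "kv_bits" is the JSON key added by this PR; the model name is taken
# from the Test 9 run. Endpoint/transport details are omitted.
body = json.dumps({
    "model": "mlx-community/gemma-4-26b-a4b-it-4bit",
    "messages": [{"role": "user", "content": "Say hello."}],
    "kv_bits": 4,
})
```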

3. run_benchmark.sh — Test 9: QuantizedKVCache Regression

New test suite with four decode requests: kv_bits=4 (short prompt), kv_bits=8 (short prompt), kv_bits=4 with a longer prompt (exercises the KV-sharing layers), and a baseline without quantization.

Test 9 result on gemma-4-26b-a4b-it-4bit:

[1/4] kv_bits=4 short  ✅  [1.6s, 13 tokens]
[2/4] kv_bits=8 short  ✅  [0.9s, 15 tokens]
[3/4] kv_bits=4 long   ✅  [3.9s, 66 tokens]
[4/4] baseline         ✅  [2.8s, 48 tokens]

✅ REGRESSION PASSED — QuantizedKVCache dispatches correctly.
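The pass criterion for each sub-test can be sketched as a small helper (hypothetical names; assumes the standard chat-completions response shape, and the 3-token minimum is illustrative):

```python
# Hypothetical sketch of one Test 9 sub-check: the decode must return a
# non-empty completion with at least a handful of generated tokens.
def check_decode(reply: dict, min_tokens: int = 3) -> bool:
    content = (reply["choices"][0]["message"].get("content") or "").strip()
    tokens = reply.get("usage", {}).get("completion_tokens", 0)
    return bool(content) and tokens >= min_tokens
```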

…b440)

- Bumps mlx-swift-lm submodule to b440 (tag) / 63707c0:
  fix(Gemma4Text): dispatch QuantizedKVCache correctly in LLM attention
  (merges PR #29, closes #71)

- Server.swift: expose `kv_bits` as a per-request API field
  (ChatCompletionRequest.kvBits -> GenerateParameters.kvBits)
  enabling native MLX QuantizedKVCache without a server restart.

- run_benchmark.sh: add Test 9 — QuantizedKVCache regression suite
  [1/4] kv_bits=4 short  [2/4] kv_bits=8 short
  [3/4] kv_bits=4 long (KV-sharing path)  [4/4] baseline

  Test 9 passed on mlx-community/gemma-4-26b-a4b-it-4bit.
Copilot AI review requested due to automatic review settings April 22, 2026 18:19
Contributor

Copilot AI left a comment


Pull request overview

Updates SwiftLM to support per-request native MLX quantized KV cache selection for Gemma-4 (issue #71), and adds a new benchmark suite to regression-test kv_bits decoding.

Changes:

  • Adds kv_bits to the chat completions request schema and threads it into GenerateParameters.kvBits.
  • Introduces “Test 9” in run_benchmark.sh to exercise kv_bits=4/8 and a non-quantized baseline.
  • (Per PR description) bumps the mlx-swift-lm submodule to a revision that fixes Gemma-4 QuantizedKVCache dispatch.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File Description
run_benchmark.sh Adds a new interactive menu option and a Python-based regression suite for kv_bits QuantizedKVCache decoding.
Sources/SwiftLM/Server.swift Extends the chat request API with kv_bits and forwards it into generation parameters.


Comment thread run_benchmark.sh Outdated
Comment on lines +981 to +983
# - 4-bit run: prefill + ≥20 decode tokens, response is non-empty coherent text
# - 8-bit run: same
# - Multi-turn run: second turn with kv_bits=4 also succeeds (exercises sharedKV path)

Copilot AI Apr 22, 2026


The Test 9 header comment/documentation doesn’t match what the script actually validates: it claims “≥20 decode tokens” and a “multi-turn run”, but the checks only require 3–5 tokens and there is no multi-turn conversation (each call is a fresh request). Please align the stated pass criteria with the implemented assertions, or update the script to actually perform the multi-turn/≥20-token checks.

Suggested change
# - 4-bit run: prefill + ≥20 decode tokens, response is non-empty coherent text
# - 8-bit run: same
# - Multi-turn run: second turn with kv_bits=4 also succeeds (exercises sharedKV path)
# - 4-bit run succeeds without server crash and returns a short non-empty response
# - 8-bit run succeeds without server crash and returns a short non-empty response
# - The regression is considered covered by these single-request decode checks

Comment thread run_benchmark.sh Outdated
Comment on lines +998 to +1010
echo "Waiting for server (up to 180s)..."
for i in {1..180}; do
if ! kill -0 $SERVER_PID 2>/dev/null; then
echo "❌ Server died early. Logs:"
print_server_log ./tmp/kvcache_regression.log
exit 1
fi
if curl -sf http://127.0.0.1:5431/health > /dev/null 2>&1; then
echo "Server ready (${i}s)"
break
fi
sleep 1
done

Copilot AI Apr 22, 2026


The server readiness loop breaks early on success, but if the server never becomes healthy within 180s the loop just ends and the script continues into the Python test, which will fail with a less-informative connection error. Add an explicit post-loop check/flag to fail fast with a clear “server not ready after 180s” message (and print logs) when the timeout is hit.
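The fail-fast pattern recommended here (poll until healthy, then check after the loop whether the deadline was hit) can be sketched as follows; Python is used for brevity, the actual script is bash, and all names are hypothetical:

```python
import time

def wait_ready(probe, timeout_s: int = 180) -> bool:
    """Poll `probe()` once per second until it succeeds or the deadline
    passes. Returns False on timeout so the caller can fail fast with a
    clear 'server not ready' message instead of a vague connection error."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if probe():
            return True
        time.sleep(1)
    return False
```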

Comment on lines 1051 to +1054
let params = GenerateParameters(
maxTokens: tokenLimit,
maxKVSize: config.ctxSize,
kvBits: chatReq.kvBits,

Copilot AI Apr 22, 2026


kvBits is threaded into GenerateParameters, but the prompt cache currently keys only on token prefix and will happily restore a cached KV state produced with different generation parameters. With a per-request kv_bits, this can mix quantized and non-quantized cache states across requests, which is unsafe and can lead to incorrect results or runtime failures. Consider including kvBits (and any other cache-shaping params) in the prompt-cache key, or disabling prompt-cache restore/save whenever chatReq.kvBits is non-nil.

Suggested change
let params = GenerateParameters(
maxTokens: tokenLimit,
maxKVSize: config.ctxSize,
kvBits: chatReq.kvBits,
// Do not thread per-request kvBits into generation while prompt-cache identity
// is still based only on prompt prefix. Reusing a cached KV state created with a
// different kvBits setting is unsafe and can produce incorrect results or failures.
let params = GenerateParameters(
maxTokens: tokenLimit,
maxKVSize: config.ctxSize,
kvBits: nil,
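The first alternative Copilot names (making the cache key kv_bits-aware) can be sketched like this; the names are hypothetical, not the PR's actual Swift code:

```python
# Hedged sketch: key cached KV states on both the token prefix and the
# cache-shaping parameters, so a QuantizedKVCache entry produced with
# kv_bits=4 is never restored for a non-quantized (kv_bits=None) request.
def cache_key(token_prefix: tuple, kv_bits):
    return (token_prefix, kv_bits)

prompt_cache = {}
prompt_cache[cache_key((101, 202, 303), 4)] = "4-bit KV state"
```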

Comment thread Sources/SwiftLM/Server.swift Outdated
Comment on lines +2309 to +2311
/// Number of bits for native MLX quantized KV cache (nil = no quantization, 4 or 8 typical).
/// Enables `QuantizedKVCache` instead of `KVCacheSimple`. Separate from `--turbo-kv`.
let kvBits: Int?

Copilot AI Apr 22, 2026


kv_bits is a new public request field, but there’s no input validation here. If a client sends an unsupported value (e.g. 0, 3, 16), it will be forwarded into GenerateParameters and may cause an internal failure (500) instead of a clear 400. Add explicit validation (accept only nil/4/8, or whatever the underlying library supports) and return a structured OpenAI-style error message for invalid values.
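One way to implement the suggested validation, sketched in Python (the actual server is Swift; the function name and error shape are assumptions modeled on OpenAI-style error bodies):

```python
def validate_kv_bits(kv_bits):
    """Return None if kv_bits is acceptable, else a structured 400-style
    error body. Accepted values follow the review comment: None, 4, or 8
    (widen if the underlying library supports more)."""
    if kv_bits is None or kv_bits in (4, 8):
        return None
    return {"error": {
        "type": "invalid_request_error",
        "code": "invalid_kv_bits",
        "message": f"kv_bits must be 4 or 8 when provided (got {kv_bits!r})",
    }}
```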

let params = GenerateParameters(
maxTokens: tokenLimit,
maxKVSize: config.ctxSize,
kvBits: chatReq.kvBits,

Copilot AI Apr 22, 2026


This PR adds a new per-request kv_bits execution path, but the CI integration tests (e.g. tests/test-server.sh) don’t cover it. Add at least one E2E test that sets kv_bits (and ideally alternates quantized/non-quantized requests) to ensure the server doesn’t crash and returns valid responses.

Comment thread run_benchmark.sh Outdated
elapsed = time.time() - t0
content = d["choices"][0]["message"].get("content") or ""
# Strip Gemma-4 thinking blocks
content = re.sub(r"<\|channel\|>thought.*?<channel\|>", "", content, flags=re.DOTALL).strip()

Copilot AI Apr 22, 2026


The “strip thinking blocks” regex only matches the <|channel|>thought ... <channel|> form. Elsewhere in this repo the tag is referenced as <|channel>thought (no trailing | after channel). To make this regression test robust across tag variants, expand the regex to handle both forms (and ideally require a correct closing token).

Suggested change
content = re.sub(r"<\|channel\|>thought.*?<channel\|>", "", content, flags=re.DOTALL).strip()
content = re.sub(r"<\|channel\|?>thought.*?<\|/channel\|?>", "", content, flags=re.DOTALL).strip()
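The widened pattern from the suggestion can be checked against both tag variants directly (a sketch; the helper name is hypothetical):

```python
import re

# Matches both <|channel|>thought ... <|/channel|> and the pipeless
# variant <|channel>thought ... <|/channel>, per the suggested change.
PATTERN = re.compile(r"<\|channel\|?>thought.*?<\|/channel\|?>", re.DOTALL)

def strip_thinking(text: str) -> str:
    return PATTERN.sub("", text).strip()
```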

README.md:
- Added '🔧 Per-Request API Parameters' section with kv_bits table,
  kv_bits vs --turbo-kv comparison table, and curl usage example
- Clarified --turbo-kv CLI entry: 'activates after 2048 tokens, server-wide'

Server.swift:
- Added kv_bits input validation (only nil/4/8 accepted; returns 400 otherwise)
- Bypass prompt cache restore when kv_bits is set (prevents unsafe mixing of
  QuantizedKVCache and KVCacheSimple states across requests)
- Bypass prompt cache save when kv_bits is set (same safety reason)

run_benchmark.sh (Test 9):
- Corrected header comment to match actual assertions (removed false ≥20 token
  and multi-turn claims; stated actual ≥3 token / non-empty checks)
- Added explicit SERVER_READY flag + post-loop failure with log dump
- Widened thinking-block regex to handle both <|channel|>thought and <|channel>thought
@solderzzc
Member Author

Thanks for the thorough review. Applied all 5 comments in the latest push (f007b3b):

run_benchmark.sh comments:

  1. Header comment mismatch — Corrected pass criteria to accurately describe actual assertions: ≥3 tokens, non-empty response for each sub-test. Removed the incorrect ≥20 token and multi-turn claims.
  2. Server-not-ready silent pass-through — Added a SERVER_READY flag; after the loop, fail fast with log dump if the server never became healthy.
  3. Thinking-block regex coverage — Widened regex to <\|channel\|?>thought.*?<channel\|?> to cover both the <|channel|>thought and <|channel>thought tag variants.

Server.swift comments:
4. Prompt cache mixing (safety-critical) — Added params.kvBits != nil bypass on both the cache restore path (skipPromptCache flag) and the cache save path, with an explanatory comment. Quantized and non-quantized KV states are now kept strictly separate.
5. kv_bits input validation — Added an explicit guard before GenerateParameters construction: any value other than nil, 4, or 8 returns a structured 400 invalid_request_error with code: invalid_kv_bits.

The E2E test comment (CI test coverage) is tracked as a follow-up — wiring it into tests/test-server.sh is straightforward but outside the scope of this targeted fix PR.

- Replace 🧠 with 📡 heading emoji
- Rewrite as structured tables (Text / Vision / Audio) with all 50+ model
  families derived from the actual MLXLLM + MLXVLM model file inventory
- LLM table: Gemma, Qwen, Phi, Mistral, Llama, GLM, DeepSeek, Falcon,
  LFM2, OLMo, Granite, SmolLM3, InternLM2, Cohere, Jamba, Exaone, MiMo,
  Ernie, Baichuan, Bailing, NemotronH, Starcoder2, OpenELM, BitNet,
  MiniMax, Apertus/AfMoE, MiniCPM, Qwen3Next
- VLM table: Gemma4, Gemma3, Qwen3-VL, Qwen2-VL/2.5-VL, LFM2-VL,
  Pixtral, PaliGemma, Idefics3, Mistral3, FastVLM, SmolVLM2, GlmOcr, QwenVL
- ALM table: Gemma-4-e4b only (factually correct — Qwen2-Audio removed;
  it was never wired into the audio pipeline here)
@solderzzc solderzzc merged commit e4a2036 into main Apr 22, 2026
8 checks passed
@solderzzc solderzzc deleted the fix/gemma4-quantizedkv-b440 branch April 22, 2026 19:41


Development

Successfully merging this pull request may close these issues.

[Bug]: MLXLLM Gemma 4 attention crashes when kvBits is set

2 participants