[Cuda] enable turboquant on gemma4 #19891
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19891
Note: Links to docs will display an error until the docs builds have been completed.
⏳ No Failures, 2 Pending
As of commit 0d0c1ba with merge base 5395f20.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@claude review this code
Claude finished @metascroy's task in 1m 39s

Code Review: [Cuda] enable turboquant on gemma4

Summary
This PR adds TurboQuant TQ4 KV cache compression support for the CUDA backend on Gemma 4 31B, extending the existing MLX-only support. The approach monkey-patches the attention forward on full-attention layers to call the fused

Bug: CLI validation blocks CUDA + turboquant
    if args.turboquant and args.backend != "mlx":
        parser.error("--turboquant requires --backend mlx.")
This guard was not updated: it still rejects every backend other than mlx, so passing --turboquant together with --backend cuda fails at argument parsing. A possible fix:
    if args.turboquant and args.backend not in ("mlx", "cuda"):
        parser.error("--turboquant requires --backend mlx or --backend cuda.")
Or the guard can simply be removed, since both backends now support it.

Other observations
Verdict
The
No description provided.
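
For reference, the guard change suggested in the review can be tried in isolation. The following is a minimal sketch, not the PR's actual script: only the `--backend` and `--turboquant` flags and the error message come from the review, while `TURBOQUANT_BACKENDS`, `build_parser`, and `parse_args` are hypothetical names introduced for illustration, and the real CLI presumably defines more options.

```python
import argparse

# Hypothetical constant: the backends the review says support TurboQuant.
TURBOQUANT_BACKENDS = ("mlx", "cuda")


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the real script's parser; only the two
    # flags mentioned in the review are modeled here.
    parser = argparse.ArgumentParser(description="TurboQuant guard sketch")
    parser.add_argument("--backend", default="mlx")
    parser.add_argument("--turboquant", action="store_true")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Updated guard: accept both supported backends, reject everything else.
    if args.turboquant and args.backend not in TURBOQUANT_BACKENDS:
        # parser.error prints a usage message to stderr and exits with status 2.
        parser.error("--turboquant requires --backend mlx or --backend cuda.")
    return args
```

Keeping the supported backends in one tuple means the membership check does not need to change if another backend gains support later, although the hard-coded error message would still need updating by hand.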