support qwen3.5 moe padding_free #226
# Conflicts: # tests/transformers/test_qwen35_linear_attention_sp.py
Code Review
This pull request extends support for Qwen3.5-MoE models by generalizing the padding-free patching mechanism and sequence parallel strategies to handle both dense and MoE variants of Qwen3.5. Key changes include dynamically importing modeling functions and classes, iterating over multiple class pairs (dense and MoE) during patching, and adding comprehensive alignment tests for Qwen3.5-MoE. The code review feedback highlights two important issues: first, the patched DecoderLayer.forward should return a tuple (including router logits for MoE) to match Hugging Face's expected return format and prevent runtime crashes; second, a fallback import is needed for apply_mask_to_padding_states since it may not be present in the MoE modeling module.
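The fallback import suggested in the review could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `resolve_attr` is a hypothetical helper, and the module paths in the usage note are assumptions about where the dense and MoE modeling modules live.

```python
import importlib
from typing import Any, Sequence


def resolve_attr(module_names: Sequence[str], attr_name: str) -> Any:
    """Return `attr_name` from the first listed module that provides it.

    Hypothetical helper: modules are tried in order, so the preferred
    (e.g. MoE) module goes first and the fallback (e.g. dense) module last.
    """
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Module is not available in this environment; try the next one.
            continue
        if hasattr(module, attr_name):
            return getattr(module, attr_name)
    raise ImportError(
        f"{attr_name!r} not found in any of: {list(module_names)}"
    )
```

Usage might look like the lines below, with the MoE module tried first and the dense module as the fallback (both paths are assumed, not confirmed by this PR):

```python
# apply_mask_to_padding_states = resolve_attr(
#     [
#         "transformers.models.qwen3_5_moe.modeling_qwen3_5_moe",  # assumed path
#         "transformers.models.qwen3_5.modeling_qwen3_5",          # assumed path
#     ],
#     "apply_mask_to_padding_states",
# )
```

Resolving the name once at patch time keeps the patched forward free of per-call import logic, and a missing symbol fails early with a clear `ImportError`.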
PR information
support qwen3.5 moe padding_free
Experiment results
[2026-06-16 03:39:14][INFO:twinkle] Current is step 0 of 63, metric: {'loss': '6.5880', 'grad_norm': '6.843750', 'learning rate(param group 1)': '2.000000e-05', 'learning rate(param group 2)': '2.000000e-05', 'iters': 0, 'total time elapse': '3.6 minutes', 'speed': '0.00 iters/s'}
[2026-06-16 03:39:58][INFO:twinkle] Current is step 20 of 63, metric: {'loss': '5.0045', 'grad_norm': '3.125000', 'learning rate(param group 1)': '8.236931e-05', 'learning rate(param group 2)': '8.236931e-05', 'iters': 20, 'total time elapse': '257 seconds', 'speed': '0.46 iters/s'}
[2026-06-16 03:40:26][INFO:twinkle] Current is step 40 of 63, metric: {'loss': '2.9167', 'grad_norm': '3.218750', 'learning rate(param group 1)': '3.149309e-05', 'learning rate(param group 2)': '3.149309e-05', 'iters': 40, 'total time elapse': '285 seconds', 'speed': '0.71 iters/s'}
[2026-06-16 03:40:55][INFO:twinkle] Current is step 60 of 63, metric: {'loss': '2.4172', 'grad_norm': '5.468750', 'learning rate(param group 1)': '2.931021e-07', 'learning rate(param group 2)': '2.931021e-07', 'iters': 60, 'total time elapse': '314 seconds', 'speed': '0.70 iters/s'}