support qwen3.5 moe padding_free by meichangsu1 · Pull Request #226 · modelscope/twinkle

meichangsu1 · 2026-06-16T09:06:26Z

PR type

Bug Fix
New Feature
Document Updates
More Models or Datasets Support

PR information

Add padding_free support for Qwen3.5-MoE.

Experiment results

[2026-06-16 03:39:14][INFO:twinkle] Current is step 0 of 63, metric: {'loss': '6.5880', 'grad_norm': '6.843750', 'learning rate(param group 1)': '2.000000e-05', 'learning rate(param group 2)': '2.000000e-05', 'iters': 0, 'total time elapse': '3.6 minutes', 'speed': '0.00 iters/s'}
[2026-06-16 03:39:58][INFO:twinkle] Current is step 20 of 63, metric: {'loss': '5.0045', 'grad_norm': '3.125000', 'learning rate(param group 1)': '8.236931e-05', 'learning rate(param group 2)': '8.236931e-05', 'iters': 20, 'total time elapse': '257 seconds', 'speed': '0.46 iters/s'}
[2026-06-16 03:40:26][INFO:twinkle] Current is step 40 of 63, metric: {'loss': '2.9167', 'grad_norm': '3.218750', 'learning rate(param group 1)': '3.149309e-05', 'learning rate(param group 2)': '3.149309e-05', 'iters': 40, 'total time elapse': '285 seconds', 'speed': '0.71 iters/s'}
[2026-06-16 03:40:55][INFO:twinkle] Current is step 60 of 63, metric: {'loss': '2.4172', 'grad_norm': '5.468750', 'learning rate(param group 1)': '2.931021e-07', 'learning rate(param group 2)': '2.931021e-07', 'iters': 60, 'total time elapse': '314 seconds', 'speed': '0.70 iters/s'}
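
To compare runs, the metric lines above can be turned into numbers. The following is a minimal sketch, assuming the log format stays exactly as shown in this PR (a `Current is step N of M, metric: {...}` line whose dict is a valid Python literal); `parse_metric_line` is a hypothetical helper and not part of twinkle.

```python
import ast
import re

# Matches the "Current is step N of M, metric: {...}" portion of a log line.
LOG_RE = re.compile(r"Current is step (\d+) of (\d+), metric: (\{.*\})")


def parse_metric_line(line):
    """Return step, total, loss and grad_norm from one log line, or None."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    # The metric dict is printed as a Python literal, so literal_eval can read it.
    metrics = ast.literal_eval(match.group(3))
    return {
        "step": int(match.group(1)),
        "total": int(match.group(2)),
        "loss": float(metrics["loss"]),
        "grad_norm": float(metrics["grad_norm"]),
    }


if __name__ == "__main__":
    # Shortened copy of the first log line above (learning-rate fields omitted).
    sample = (
        "[2026-06-16 03:39:14][INFO:twinkle] Current is step 0 of 63, "
        "metric: {'loss': '6.5880', 'grad_norm': '6.843750', 'iters': 0}"
    )
    print(parse_metric_line(sample))
```

Applied to the four lines above, this would give losses of 6.5880, 5.0045, 2.9167 and 2.4172 at steps 0, 20, 40 and 60.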

gemini-code-assist

Code Review

This pull request extends support for Qwen3.5-MoE models by generalizing the padding-free patching mechanism and sequence parallel strategies to handle both dense and MoE variants of Qwen3.5. Key changes include dynamically importing modeling functions and classes, iterating over multiple class pairs (dense and MoE) during patching, and adding comprehensive alignment tests for Qwen3.5-MoE. The code review feedback highlights two important issues: first, the patched DecoderLayer.forward should return a tuple (including router logits for MoE) to match Hugging Face's expected return format and prevent runtime crashes; second, a fallback import is needed for apply_mask_to_padding_states since it may not be present in the MoE modeling module.
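
The two review points can be illustrated with a small sketch. This is not the code from the PR: `resolve_attr` and `pack_layer_outputs` are hypothetical helpers, the real module paths for the dense and MoE modeling files are not shown on this page, and the exact tuple layout expected by the installed transformers version is an assumption that should be checked against that version.

```python
import importlib
from typing import Any, Iterable, Optional, Tuple


def resolve_attr(module_names: Iterable[str], attr: str) -> Any:
    """Return `attr` from the first importable module that defines it.

    Sketch of the "fallback import" idea: try the MoE modeling module first,
    then fall back to the dense one if the helper is missing there.
    """
    names = tuple(module_names)
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # module not available in this install; try the next one
        if hasattr(module, attr):
            return getattr(module, attr)
    raise ImportError(f"{attr!r} not found in any of: {names}")


def pack_layer_outputs(
    hidden_states: Any,
    router_logits: Optional[Any] = None,
    output_router_logits: bool = False,
) -> Tuple[Any, ...]:
    """Wrap a patched forward's result in a tuple (assumed layout).

    Hidden states come first; router logits are appended only when the
    caller asked for them and the layer actually produced them.
    """
    outputs: Tuple[Any, ...] = (hidden_states,)
    if output_router_logits and router_logits is not None:
        outputs += (router_logits,)
    return outputs


if __name__ == "__main__":
    # Demo with stdlib modules only: the first name does not exist,
    # so the lookup falls back to `math`.
    sqrt = resolve_attr(("no_such_module_for_demo", "math"), "sqrt")
    print(sqrt(9.0))
    print(pack_layer_outputs("hidden", "logits", output_router_logits=True))
```

In the PR's setting, the candidate list passed to `resolve_attr` would be the MoE modeling module followed by the dense one, with `apply_mask_to_padding_states` as the attribute name.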

meichangsu1 added 3 commits June 10, 2026 20:42

support qwen35 moe gdn sp

9c4ca46

support gdn padding-free & fix sp unit test

8a2f835

Merge remote-tracking branch 'origin/main' into qwe35_moe

485f23a

# Conflicts: # tests/transformers/test_qwen35_linear_attention_sp.py

meichangsu1 marked this pull request as draft June 16, 2026 09:06

gemini-code-assist Bot reviewed Jun 16, 2026

Comment thread src/twinkle/patch/gdn_padding_free.py

Comment thread src/twinkle/model/transformers/strategy/sequence_parallel/linear_attention_sp.py

meichangsu1 marked this pull request as ready for review June 16, 2026 09:19

meichangsu1 wants to merge 3 commits into modelscope:main from meichangsu1:qwe35_moe
