[New Model Bringup] Initial Commit to enable Text-only architecture for Qwen3.5#3712
Open
Rohan-Bierneni wants to merge 1 commit into main
5f80704 to
7606921
Compare
entrpn
approved these changes
Apr 23, 2026
parambole
reviewed
Apr 23, 2026
parambole
reviewed
Apr 23, 2026
parambole
reviewed
Apr 23, 2026
7606921 to
2db8a8f
Compare
Add config file for 397B model
update attentions.py with new decoder block type
Update other files with new model to ensure model initialization is correct
Update decoder block type
Train Compile test is passing
resolve nits in config file formatting
resolve formatting errors
Fix conflict in maxtext_utils
Fix linter errors
Fix linter errors
Fix linter errors
Ran pyink locally for formatting
7e8941f to
056b165
Compare
parambole
reviewed
Apr 29, 2026
num_experts_per_tok: 10
norm_topk_prob: True

# Qwen3-Next Specific Parameters for Linear Attention (Gated Delta Net)
Collaborator
Nit: we can probably just say these are Gated Delta Net parameters and drop "Qwen3-Next" (which might be confusing)?
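For context on the two MoE settings quoted above, here is a minimal sketch of what `num_experts_per_tok` and `norm_topk_prob` conventionally mean in Qwen-style MoE routers. The function name `route_token` and the plain-Python formulation are illustrative assumptions, not MaxText's actual routing code, which may differ in detail.

```python
import math


def route_token(router_logits, num_experts_per_tok, norm_topk_prob):
    """Hypothetical helper: return (expert_index, weight) pairs for one token."""
    # Softmax over all experts, shifted by the max for numerical stability.
    max_logit = max(router_logits)
    exps = [math.exp(x - max_logit) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the num_experts_per_tok experts with the highest probability.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top = ranked[:num_experts_per_tok]

    weights = [probs[i] for i in top]
    if norm_topk_prob:
        # Renormalize so the selected experts' weights sum to 1.
        selected_sum = sum(weights)
        weights = [w / selected_sum for w in weights]
    return list(zip(top, weights))


# Toy example with 4 experts and 2 experts per token.
logits = [2.0, 1.0, 0.5, -1.0]
normalized = route_token(logits, num_experts_per_tok=2, norm_topk_prob=True)
unnormalized = route_token(logits, num_experts_per_tok=2, norm_topk_prob=False)
```

With `norm_topk_prob: True` the selected weights sum to 1; with it off, they keep their raw softmax mass and sum to less than 1 whenever some experts are dropped.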
Description
This PR enables MaxText to run training workloads on the text-only architecture of Qwen3.5, which is identical to that of Qwen3-Next. It adds the model config for the largest MoE model in the family, qwen3.5-397b-a17b. Other models in the family will be added later.
Tests
Will run a training workload to verify that the loss decreases as expected, and will add a train_compile test for this new model.
Update:
Ran a training workload on a mini config: the 122b MoE model. The only differences between the configs are as follows:
The loss decreases slowly and training is stable:
Command: https://paste.googleplex.com/5762710475243520#l=15
Logs: https://paste.googleplex.com/5093482696933376
Train compile test is passing
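The "loss decreases and training is stable" check above could be approximated with a small script over the logged per-step losses. This is only a sketch: the helper name `loss_is_decreasing`, the window size, and the threshold are assumptions, and it does not parse MaxText logs itself.

```python
def loss_is_decreasing(losses, window=10, min_drop=0.0):
    """Hypothetical check: compare mean loss over the first and last `window` steps."""
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    head = sum(losses[:window]) / window   # average loss at the start of the run
    tail = sum(losses[-window:]) / window  # average loss at the end of the run
    return (head - tail) > min_drop


# Synthetic example: a slowly decreasing loss curve versus a flat one.
slow_decrease = [10.0 - 0.01 * step for step in range(100)]
flat = [10.0] * 100
ok = loss_is_decreasing(slow_decrease)
not_ok = loss_is_decreasing(flat)
```

Averaging over a window rather than comparing single steps keeps the check from being thrown off by step-to-step noise.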
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-review label.