Merged
1 change: 1 addition & 0 deletions src/maxtext/configs/post_train/lora_module_path.yml

Collaborator

Are there any tests for Gemma4 Lora?

Collaborator Author

We ran the end-to-end LoRA training loop for Gemma 4 successfully, without any issues (log).

@@ -21,6 +21,7 @@ mistral: "decoder/layers/.*(attention/(query|key|value|out)|mlp/(wi_0|wi_1|wo))"
deepseek2: "decoder/(dense_layers|moe_stack)/self_attention/(query|out|wkv_a|wkv_b)|decoder/(dense_layers|moe_stack)/(mlp|shared_experts)/(wi_0|wi_1|wo)"
gemma2: "decoder/layers/(self_attention_local|self_attention_global)/(query|key|value|out)|decoder/layers/(mlp_local|mlp_global)/(wi_0|wi_1|wo)"
gemma3: "decoder/layers/.*(self_attention/(query|key|value|out)|mlp/(wi_0|wi_1|wo|gate|up|down))"
gemma4: "decoder/(scanned_blocks|layers_remainder)/layers.*/.*(self_attention/(query|key|value|out)|mlp/.*(wi_0|wi_1|wo|shared_experts/(wi_0|wi_1|wo)))"

Collaborator

Does the scope of this project cover both scan_layers=true and scan_layers=false, and both the dense and MoE variants of Gemma4?

Suggest adding a CPU unit test that asserts the regex matches the expected module paths for both scan_layers values and dense and moe.
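
A minimal sketch of what such a CPU-only test could look like. The pattern is copied from the gemma4 entry above; every module path in the lists is hypothetical and only illustrates the shape of the check, so real paths would have to be collected from the model for each scan_layers value and for the dense and MoE variants. Matching with `re.search` is also an assumption; swap in whatever matching mode the LoRA provider actually uses.

```python
import re

# Pattern copied from the gemma4 entry in lora_module_path.yml.
GEMMA4_LORA_PATTERN = re.compile(
    r"decoder/(scanned_blocks|layers_remainder)/layers.*/.*"
    r"(self_attention/(query|key|value|out)"
    r"|mlp/.*(wi_0|wi_1|wo|shared_experts/(wi_0|wi_1|wo)))"
)

# Hypothetical module paths, for illustration only. Replace them with
# paths collected from a real Gemma4 model before relying on this test.
EXPECTED_MATCHES = [
    "decoder/scanned_blocks/layers_0/self_attention/query",
    "decoder/layers_remainder/layers_1/mlp/wi_0",
    "decoder/scanned_blocks/layers_0/mlp/shared_experts/wo",
    "decoder/scanned_blocks/layers_0/mlp/moe_block/wi_1",
]
EXPECTED_NON_MATCHES = [
    "decoder/scanned_blocks/layers_0/pre_self_attention_norm/scale",
    "token_embedder/embedding",
]


def is_lora_target(path: str) -> bool:
    """Return True if the pattern is found in the module path."""
    # Assumption: the provider searches the path rather than full-matching it.
    return GEMMA4_LORA_PATTERN.search(path) is not None


def test_gemma4_lora_pattern() -> None:
    for path in EXPECTED_MATCHES:
        assert is_lora_target(path), f"expected a match: {path}"
    for path in EXPECTED_NON_MATCHES:
        assert not is_lora_target(path), f"expected no match: {path}"


test_gemma4_lora_pattern()
```

Parametrizing the two lists per configuration (scan_layers on or off, dense or MoE) would make it obvious which combination a future regex change breaks.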

olmo3: "decoder/layers/.*(attention/(query|key|value|out)|mlp/(wi_0|wi_1|wo))"
Comment thread
RexBearIU marked this conversation as resolved.
gpt3: "decoder/layers/(self_attention/(qkv_proj|out)|mlp/(wi|wo))"

5 changes: 5 additions & 0 deletions src/maxtext/trainers/post_train/sft/train_sft.py
@@ -274,9 +274,14 @@ def setup_trainer_state(mt_config, goodput_recorder=None):
def train_model(mt_config, trainer, mesh):
  """Runs the SFT training loop in Tunix."""
  with jax.set_mesh(mesh), nn_partitioning.axis_rules(mt_config.logical_axis_rules):
    # Disable NNX graph caching for MoE models (where experts > 1) to allow
    # necessary dynamic metadata synchronization during forward passes (e.g., in jax.lax.scan).
    enable_nnx_cache = mt_config.num_experts <= 1

    trainer.train(
        trainer.data_hooks.train_data_iterator,
        trainer.data_hooks.eval_data_iterator,
        cache_nnx_graph=enable_nnx_cache,
    )
  return trainer
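
The gating rule in this hunk can be exercised in isolation. The sketch below assumes a hypothetical `DemoConfig` standing in for `mt_config` (only `num_experts` is modeled) and a hypothetical helper name, `should_cache_nnx_graph`; neither exists in the PR, and the sketch does not call the trainer.

```python
from dataclasses import dataclass


@dataclass
class DemoConfig:
    """Hypothetical stand-in for mt_config; only num_experts is modeled."""

    num_experts: int


def should_cache_nnx_graph(config: DemoConfig) -> bool:
    """Mirror the rule in the diff: cache only when the model is not MoE."""
    return config.num_experts <= 1


dense = DemoConfig(num_experts=1)
moe = DemoConfig(num_experts=8)

print(should_cache_nnx_graph(dense))  # True: dense model, caching stays on
print(should_cache_nnx_graph(moe))    # False: MoE model, caching is disabled
```

Pulling the condition into a named helper like this would also make it straightforward to cover with a unit test alongside the regex check.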
