fix: when setting fsdp size unuse megatron for gather in npu by 0hujun · Pull Request #185 · modelscope/twinkle

0hujun · 2026-04-23T11:48:55Z

PR type

Bug Fix
New Feature
Document Updates
More Models or Datasets Support

PR information

Fix issue #184 .
When using Ascend NPU for FSDP2 + CP train, got not init error cause import megatron package and using it to get dp group that is not initialized.
This error may be introduced from this pr #153 .
Because using FSDP2 will set fsdp_size and using Megatron will not set fsdp_size， so just check fsdp whether in DeviceMesh , and add it in if code line.

Experiment results

Works fine.

[2026-04-23 10:57:54][INFO:twinkle] Current is step 0 of 125, metric: {'loss': '3.1660', 'accuracy': '0.48', 'correct_tokens': 189, 'total_tokens': 392, 'learning rate(param group 1)': '0.000000e+00', 'learning rate(param group 2)': '0.000000e+00', 'iters': 0, 'total time elapse': '11 seconds', 'speed': '0.00 iters/s'}
[2026-04-23 10:58:17][INFO:twinkle] Current is step 20 of 125, metric: {'loss': '2.6121', 'grad_norm': '4.062500', 'accuracy': '0.54', 'correct_tokens': 3275, 'total_tokens': 6050, 'learning rate(param group 1)': '9.957224e-05', 'learning rate(param group 2)': '9.957224e-05', 'iters': 10, 'total time elapse': '33 seconds', 'speed': '0.45 iters/s'}
[2026-04-23 10:58:35][INFO:twinkle] Current is step 40 of 125, metric: {'loss': '1.4322', 'grad_norm': '3.062500', 'accuracy': '0.67', 'correct_tokens': 4335, 'total_tokens': 6455, 'learning rate(param group 1)': '9.619398e-05', 'learning rate(param group 2)': '9.619398e-05', 'iters': 20, 'total time elapse': '52 seconds', 'speed': '0.53 iters/s'}
[2026-04-23 10:58:54][INFO:twinkle] Current is step 60 of 125, metric: {'loss': '0.9267', 'grad_norm': '3.781250', 'accuracy': '0.75', 'correct_tokens': 4581, 'total_tokens': 6087, 'learning rate(param group 1)': '8.966767e-05', 'learning rate(param group 2)': '8.966767e-05', 'iters': 30, 'total time elapse': '70 seconds', 'speed': '0.55 iters/s'}
[2026-04-23 10:59:12][INFO:twinkle] Current is step 80 of 125, metric: {'loss': '0.6422', 'grad_norm': '2.062500', 'accuracy': '0.81', 'correct_tokens': 5232, 'total_tokens': 6426, 'learning rate(param group 1)': '8.043807e-05', 'learning rate(param group 2)': '8.043807e-05', 'iters': 40, 'total time elapse': '89 seconds', 'speed': '0.53 iters/s'}
[2026-04-23 10:59:31][INFO:twinkle] Current is step 100 of 125, metric: {'loss': '0.4760', 'grad_norm': '2.328125', 'accuracy': '0.86', 'correct_tokens': 5322, 'total_tokens': 6201, 'learning rate(param group 1)': '6.913417e-05', 'learning rate(param group 2)': '6.913417e-05', 'iters': 50, 'total time elapse': '107 seconds', 'speed': '0.55 iters/s'}
[2026-04-23 10:59:49][INFO:twinkle] Current is step 120 of 125, metric: {'loss': '0.3139', 'grad_norm': '2.250000', 'accuracy': '0.91', 'correct_tokens': 5808, 'total_tokens': 6411, 'learning rate(param group 1)': '5.652631e-05', 'learning rate(param group 2)': '5.652631e-05', 'iters': 60, 'total time elapse': '126 seconds', 'speed': '0.53 iters/s'}

gemini-code-assist

Code Review

This pull request modifies the gather_object function in src/twinkle/utils/framework.py to conditionally bypass a specific NPU workaround when 'fsdp' is present in the device mesh. A review comment points out a potential TypeError when accessing mesh_dim_names directly and suggests using the has_dim method instead to safely check for the dimension.

gemini-code-assist · 2026-04-23T11:50:21Z

        group_size = 1
        if dist.is_available() and dist.is_initialized():
-            if Platform.device_prefix() == 'npu':
+            if Platform.device_prefix() == 'npu' and 'fsdp' not in device_mesh.mesh_dim_names:


Directly accessing device_mesh.mesh_dim_names can lead to a TypeError if it is None, which is a valid state according to the DeviceMesh class definition (line 32 in device_mesh.py). It is safer to use the has_dim method, which internally handles the None check and provides the same logic.

Suggested change

if Platform.device_prefix() == 'npu' and 'fsdp' not in device_mesh.mesh_dim_names:

if Platform.device_prefix() == 'npu' and not device_mesh.has_dim('fsdp'):

fix: when setting fsdp size unuse megatron for gather in npu

dc7438b

gemini-code-assist Bot reviewed Apr 23, 2026

View reviewed changes

0hujun mentioned this pull request Apr 23, 2026

FSDP2训练LoRA报错AssertionError: data parallel group-gloo with context parallel combined is not initialized #184

Open

1 task

fix: model dtype is not same as lora dtype in FSDP train

d7873ca

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: when setting fsdp size unuse megatron for gather in npu#185

fix: when setting fsdp size unuse megatron for gather in npu#185
0hujun wants to merge 2 commits intomodelscope:mainfrom
0hujun:fsdp2+cp

0hujun commented Apr 23, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Apr 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	if Platform.device_prefix() == 'npu' and 'fsdp' not in device_mesh.mesh_dim_names:
	if Platform.device_prefix() == 'npu' and not device_mesh.has_dim('fsdp'):

Conversation

0hujun commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR type

PR information

Experiment results

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

0hujun commented Apr 23, 2026 •

edited

Loading