Skip to content

fix: when setting fsdp size unuse megatron for gather in npu#185

Open
0hujun wants to merge 2 commits intomodelscope:mainfrom
0hujun:fsdp2+cp
Open

fix: when setting fsdp size unuse megatron for gather in npu#185
0hujun wants to merge 2 commits intomodelscope:mainfrom
0hujun:fsdp2+cp

Conversation

@0hujun
Copy link
Copy Markdown

@0hujun 0hujun commented Apr 23, 2026

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Fix issue #184 .
When using Ascend NPU for FSDP2 + CP train, got not init error cause import megatron package and using it to get dp group that is not initialized.
This error may be introduced from this pr #153 .
Because using FSDP2 will set fsdp_size and using Megatron will not set fsdp_size, so just check fsdp whether in DeviceMesh , and add it in if code line.

Experiment results

Works fine.

[2026-04-23 10:57:54][INFO:twinkle] Current is step 0 of 125, metric: {'loss': '3.1660', 'accuracy': '0.48', 'correct_tokens': 189, 'total_tokens': 392, 'learning rate(param group 1)': '0.000000e+00', 'learning rate(param group 2)': '0.000000e+00', 'iters': 0, 'total time elapse': '11 seconds', 'speed': '0.00 iters/s'}
[2026-04-23 10:58:17][INFO:twinkle] Current is step 20 of 125, metric: {'loss': '2.6121', 'grad_norm': '4.062500', 'accuracy': '0.54', 'correct_tokens': 3275, 'total_tokens': 6050, 'learning rate(param group 1)': '9.957224e-05', 'learning rate(param group 2)': '9.957224e-05', 'iters': 10, 'total time elapse': '33 seconds', 'speed': '0.45 iters/s'}
[2026-04-23 10:58:35][INFO:twinkle] Current is step 40 of 125, metric: {'loss': '1.4322', 'grad_norm': '3.062500', 'accuracy': '0.67', 'correct_tokens': 4335, 'total_tokens': 6455, 'learning rate(param group 1)': '9.619398e-05', 'learning rate(param group 2)': '9.619398e-05', 'iters': 20, 'total time elapse': '52 seconds', 'speed': '0.53 iters/s'}
[2026-04-23 10:58:54][INFO:twinkle] Current is step 60 of 125, metric: {'loss': '0.9267', 'grad_norm': '3.781250', 'accuracy': '0.75', 'correct_tokens': 4581, 'total_tokens': 6087, 'learning rate(param group 1)': '8.966767e-05', 'learning rate(param group 2)': '8.966767e-05', 'iters': 30, 'total time elapse': '70 seconds', 'speed': '0.55 iters/s'}
[2026-04-23 10:59:12][INFO:twinkle] Current is step 80 of 125, metric: {'loss': '0.6422', 'grad_norm': '2.062500', 'accuracy': '0.81', 'correct_tokens': 5232, 'total_tokens': 6426, 'learning rate(param group 1)': '8.043807e-05', 'learning rate(param group 2)': '8.043807e-05', 'iters': 40, 'total time elapse': '89 seconds', 'speed': '0.53 iters/s'}
[2026-04-23 10:59:31][INFO:twinkle] Current is step 100 of 125, metric: {'loss': '0.4760', 'grad_norm': '2.328125', 'accuracy': '0.86', 'correct_tokens': 5322, 'total_tokens': 6201, 'learning rate(param group 1)': '6.913417e-05', 'learning rate(param group 2)': '6.913417e-05', 'iters': 50, 'total time elapse': '107 seconds', 'speed': '0.55 iters/s'}
[2026-04-23 10:59:49][INFO:twinkle] Current is step 120 of 125, metric: {'loss': '0.3139', 'grad_norm': '2.250000', 'accuracy': '0.91', 'correct_tokens': 5808, 'total_tokens': 6411, 'learning rate(param group 1)': '5.652631e-05', 'learning rate(param group 2)': '5.652631e-05', 'iters': 60, 'total time elapse': '126 seconds', 'speed': '0.53 iters/s'}

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modifies the gather_object function in src/twinkle/utils/framework.py to conditionally bypass a specific NPU workaround when 'fsdp' is present in the device mesh. A review comment points out a potential TypeError when accessing mesh_dim_names directly and suggests using the has_dim method instead to safely check for the dimension.

Comment thread src/twinkle/utils/framework.py Outdated
group_size = 1
if dist.is_available() and dist.is_initialized():
if Platform.device_prefix() == 'npu':
if Platform.device_prefix() == 'npu' and 'fsdp' not in device_mesh.mesh_dim_names:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Directly accessing device_mesh.mesh_dim_names can lead to a TypeError if it is None, which is a valid state according to the DeviceMesh class definition (line 32 in device_mesh.py). It is safer to use the has_dim method, which internally handles the None check and provides the same logic.

Suggested change
if Platform.device_prefix() == 'npu' and 'fsdp' not in device_mesh.mesh_dim_names:
if Platform.device_prefix() == 'npu' and not device_mesh.has_dim('fsdp'):

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant