
fix(pt): fsdp unavailable in older version of pytorch (≦2.5) #5415

Merged

njzjz merged 2 commits into deepmodeling:master from OutisLi:pr/fsdp
Apr 25, 2026
Conversation

@OutisLi
Collaborator

@OutisLi OutisLi commented Apr 23, 2026

Summary by CodeRabbit

  • Bug Fixes

    • Improved compatibility checks during distributed training initialization, avoiding hard failures when FSDP2 support is unavailable.
    • Added clearer, actionable error messages that indicate required PyTorch versions and suggest upgrading or using lower zero-stage settings.
  • Documentation

    • Clarified help text for zero-stage to note PyTorch version requirements for FSDP2-related stages.

Copilot AI review requested due to automatic review settings April 23, 2026 14:53
@dosubot dosubot Bot added the bug label Apr 23, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Apr 23, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 8e619f02-575f-49c9-aac4-03b510555885

📥 Commits

Reviewing files that changed from the base of the PR and between a607a47 and 6aded53.

📒 Files selected for processing (2)
  • deepmd/pt/train/training.py
  • deepmd/utils/argcheck.py
✅ Files skipped from review due to trivial changes (1)
  • deepmd/utils/argcheck.py

📝 Walkthrough

Walkthrough

The training module now conditionally imports torch.distributed.fsdp.fully_shard and records availability. During distributed init, requesting zero_stage >= 2 triggers a runtime check that raises a descriptive RuntimeError if FSDP2 is unavailable. The arg parsing help text for zero_stage was updated to note the PyTorch version requirement.

Changes

FSDP2 Conditional Import & Compatibility Guard (deepmd/pt/train/training.py)
Make the torch.distributed.fsdp.fully_shard import optional; set fully_shard = None on ImportError and validate availability at distributed init. When zero_stage >= 2 is requested but fully_shard is absent, raise a fast, descriptive RuntimeError referencing the minimum PyTorch version and fallback options.

Argument Help Text Update (deepmd/utils/argcheck.py)
Updated the training_args help/docstring for zero_stage to note the PyTorch version requirement for the FSDP2-backed stages (2 and 3); no validation or default changes.
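
The guard described above can be sketched as follows. This is a minimal illustration of the pattern (conditional import plus a fail-fast check at distributed init), not the exact deepmd code; the helper name `check_fsdp2_available`, the `shard_fn` parameter, and the message wording are illustrative:

```python
# Optional import: torch.distributed.fsdp.fully_shard (FSDP2) only exists in
# newer PyTorch releases; on PyTorch <= 2.5 the import raises ImportError.
try:
    from torch.distributed.fsdp import fully_shard
except ImportError:  # older PyTorch (or torch missing entirely)
    fully_shard = None


def check_fsdp2_available(zero_stage: int, shard_fn=fully_shard) -> None:
    """Fail fast with a descriptive error when FSDP2 is required but absent.

    `shard_fn` is a parameter only so the guard can be exercised without a
    specific PyTorch install; the real code checks the module-level name.
    """
    if zero_stage >= 2 and shard_fn is None:
        raise RuntimeError(
            f"zero_stage={zero_stage} requires FSDP2 "
            "(torch.distributed.fsdp.fully_shard), which this PyTorch build "
            "does not provide. Upgrade PyTorch or use a lower zero-stage."
        )
```

Running the check once during distributed initialization means users on PyTorch ≤2.5 get an actionable message up front, instead of a hard ImportError at module load time or a cryptic failure deep inside training.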

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title accurately describes the main change: making the FSDP2 dependency conditional to address compatibility with PyTorch ≤2.5.
  • Linked Issues Check ✅ Passed — Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check ✅ Passed — Check skipped because no linked issues were found for this pull request.



Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@OutisLi OutisLi requested a review from njzjz April 23, 2026 15:54
@codecov

codecov Bot commented Apr 23, 2026

Codecov Report

❌ Patch coverage is 33.33333% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.45%. Comparing base (5c22e17) to head (6aded53).
⚠️ Report is 1 commits behind head on master.

Files with missing lines:
  • deepmd/pt/train/training.py — patch coverage 33.33%, 4 lines missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5415      +/-   ##
==========================================
- Coverage   80.46%   80.45%   -0.01%     
==========================================
  Files         823      823              
  Lines       86625    86630       +5     
  Branches     4139     4139              
==========================================
- Hits        69701    69699       -2     
- Misses      15651    15654       +3     
- Partials     1273     1277       +4     

☔ View full report in Codecov by Sentry.

Contributor

@njzjz-bot njzjz-bot left a comment


LGTM. The conditional import plus an explicit runtime error for missing FSDP2 support is a clean way to keep behavior on PyTorch <=2.5 understandable, instead of failing later with a cryptic import or runtime error.

— OpenClaw 2026.4.22 (model: gpt-5.4)

Comment thread deepmd/pt/train/training.py
@OutisLi OutisLi requested a review from njzjz April 25, 2026 02:30
@njzjz njzjz enabled auto-merge April 25, 2026 04:27
@njzjz njzjz added this pull request to the merge queue Apr 25, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 25, 2026
@njzjz njzjz added this pull request to the merge queue Apr 25, 2026
Merged via the queue into deepmodeling:master with commit c454e49 Apr 25, 2026
70 checks passed
@OutisLi OutisLi deleted the pr/fsdp branch April 26, 2026 05:26