fix(cc): handle nloc==0 in DeepSpinPTExpt with phantom-atom padding by wanghan-iapcm · Pull Request #5485 · deepmodeling/deepmd-kit

wanghan-iapcm · 2026-06-02T03:27:42Z

Problem

Multi-rank spin MD can leave a rank with zero real local atoms (nloc_real == 0) when atoms migrate to other subdomains. The with-comm AOTI artifact hits an intermittent SIGFPE (integer divide by zero) at runtime in inductor-generated shape arithmetic that uses nloc as a divisor.

Reproduced on master CI run 26667802665:

Caught signal 8 (Floating point exception: integer divide by zero)
4  forward_lower_with_comm/.../wrapper.so(AOTInductorModel::run_impl+0xf482)

Root cause:

The graph was traced with nloc_min=1 (serialization.py:362) and inductor lowered an even stricter nloc >= 2 runtime-check (visible in the generated wrapper.cpp's check_input_3).
That runtime-check is gated by env var AOTI_RUNTIME_CHECK_INPUTS (default OFF), so with nloc = 0 the check is silently bypassed and the compiled graph runs through its own divide-by-zero on shape arithmetic.
Whether the offending divide is actually emitted depends on inductor's code-gen choices, which vary across compiles — hence the intermittent nature.

Fix

Prepend two phantom atoms with empty neighbour lists when nloc_real == 0 so the AOTI graph runs with nloc == 2 and never reaches the integer-divide-by-zero path. Phantoms have no neighbours so they contribute zero atomic energy / force / virial, preserving the physically-correct "this rank has no real atoms" result.

Key details (all in source/api_cc/src/DeepSpinPTExpt.cc):

dcoord / datype / dspin get two zero-valued rows prepended.
firstneigh_tensor gets two -1 rows prepended (no neighbours).
mapping_tensor gets two identity entries prepended.
comm_dict.nlocal is set to 2 (not the LAMMPS-reported 0) so border_op writes received ghost features past the phantom slots.
Output arrays (dforce, dforce_mag, datom_energy, datom_virial) get the phantom prefix stripped before being scattered back to LAMMPS via select_map.

Why phantoms rather than `Dim(min=0)` re-export

Bumping the trace constraint to min=0 would require:

auditing every nloc-dependent divide in deepmd/dpmodel/{descriptor,fitting,model}/ and protecting with xp.maximum(nloc, 1);
torch.export re-emitting compatible guards (currently fails because spin-side shape relationships require nloc >= 1 to be inferable);
inductor cooperating with the relaxed bound (it makes independent specialization choices downstream);
re-exporting every .pt2 archive in source/tests/infer/.

The phantom approach is a strict superset of correctness and self-contained in one C++ file. The two approaches aren't mutually exclusive — the min=0 route can land as a follow-up once the dpmodel audit is done.

Test plan

Local CPU rebuild + runUnitTests_cc --gtest_filter='*Spin*': 42 / 42 spin C++ regression tests pass (12 TF-backend tests skipped, as expected in the PT-only venv).
CI: the multi-rank LAMMPS test test_pair_deepmd_mpi_dpa3_spin_empty_subdomain should now pass deterministically. Local Python LAMMPS-MPI verification is blocked by a pre-existing OpenMPI/MPICH ABI mismatch in my local venv (the plugin's ompi_mpi_* symbols can't resolve against MPICH's libmpi.so.12), so end-to-end verification falls to CI.

Known limitations

The phantom path is structurally inert for nloc_real > 0 (the if (phantom_n > 0) branch never fires), so the common path is unchanged.
If a future inductor version bumps the nloc lower-bound to >2, phantom_n will need to track that minimum.
This fix is in DeepSpinPTExpt only. The corresponding non-spin path in DeepPotPTExpt has the same code shape; non-spin DPA3 empty-subdomain currently passes in CI but could regress similarly with a future inductor change. Deferred to a follow-up if observed.
Supersedes test(lammps): skip spin DPA3 empty-subdomain pt2 case #5478 (which proposed skipping the test); this PR fixes the underlying bug instead.

Multi-rank spin MD can leave a rank with zero real local atoms when all atoms migrate to other subdomains. The with-comm AOTI artifact hits an intermittent SIGFPE (integer divide by zero) at runtime in inductor-generated shape arithmetic that uses nloc as a divisor. The graph was traced with nloc_min=1 and inductor lowered an even stricter nloc>=2 runtime-check which is silently bypassed because AOTI_RUNTIME_CHECK_INPUTS is unset by default. Whether the offending divide is actually emitted depends on inductor's code-gen choices, which vary across compiles -- hence the random nature of the failure (reproduced on CI run 26667802665). Fix: prepend two phantom atoms with empty neighbour lists ahead of the real atoms when nloc_real==0. The AOTI graph then runs with nloc==2, satisfying the inductor specialisation. Phantoms have no neighbours so they contribute zero atomic energy / force / virial, preserving the physically-correct 'this rank has no real atoms' result. comm_dict's nlocal is set to 2 so border_op writes received ghost features past the phantom slots; outputs are stripped of the phantom prefix before being scattered back to LAMMPS via select_map.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 47f15b41a6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-02T03:30:01Z

+  // received ghost features past the phantom slots; outputs are stripped
+  // of the phantom prefix before being scattered back to LAMMPS atoms
+  // via ``select_map``.
+  const int phantom_n = (nloc_real == 0 && nall_real > 0) ? 2 : 0;


Subtract phantom atoms before returning energy

When an empty-subdomain rank takes this new path, the two inserted type-0 phantom atoms are included in the model's energy_redu, and that reduced energy is assigned directly before any phantom prefix is stripped. DeepMD energy fitting nets can add per-type biases / nonzero zero-neighbor outputs, so phantoms are not guaranteed to have zero atomic energy; in those models every rank with nloc_real == 0 will contribute extra phantom energy to the MPI-reduced LAMMPS total. Please either remove/subtract the phantom atomic energies from energy_redu or make the model see them through a mask that excludes them from reductions.

Useful? React with 👍 / 👎.

coderabbitai · 2026-06-02T03:35:29Z

📝 Walkthrough

Walkthrough

This PR modifies DeepSpinPTExpt::compute to handle a zero-local-atoms corner case by introducing phantom-atom padding. When a rank has no real local atoms but has remote atoms, the code prepends two phantom atoms to avoid AOTI/Inductor model failures. The mapping, neighbor list, and output tensors are adjusted to isolate phantom atoms, then phantom entries are stripped from results before returning to LAMMPS.

Changes

Phantom-atom padding for zero-local-atoms corner case

Layer / File(s)	Summary
Phantom-atom padding initialization `source/api_cc/src/DeepSpinPTExpt.cc`	When `nloc_real == 0` and `nall_real > 0`, prepend two phantom atoms with zero values; update `nall_real` and `nloc_real` counters; reconstruct spin tensor to fill real atoms from `bkw_map` while keeping phantom spin at zero.
Tensor mappings and neighbor-list preparation `source/api_cc/src/DeepSpinPTExpt.cc`	Rebuild `mapping_tensor` so phantom slots use identity entries (self-indexing) while real slots use `fwd_map`-based remapping; prepend `phantom_n` neighbor-list rows filled with `-1` to `firstneigh_tensor` so model receives adjusted `nloc` shape while phantom atoms contribute no neighbors.
Output post-processing and phantom removal `source/api_cc/src/DeepSpinPTExpt.cc`	After inference, strip phantom-prefixed entries from `dforce` and `dforce_mag`; strip phantom prefixes from `datom_energy` and `datom_virial` when atomic output is enabled; restore `nall_real` to real-atom-only count so downstream force mapping applies only to physical atoms.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main change: fixing a zero-local-atom edge case (nloc==0) in DeepSpinPTExpt using phantom-atom padding.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@source/api_cc/src/DeepSpinPTExpt.cc`:
- Around line 487-489: The mapping assignment for real atoms fails to account
for phantom padding: when phantom_n > 0 the indices returned by fwd_map(...) are
in the pre-padding space and must be shifted by phantom_n. Update the loop that
sets mapping[ii] (which uses fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]])
to add phantom_n to the fwd_map result so real-atom targets point into the
post-padding index range; keep all other indexing (ii, bkw_map,
lmp_list.mapping) unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e4e67575-7512-48ab-a532-4f17dad67d93

📥 Commits

Reviewing files that changed from the base of the PR and between c67b7f7 and 47f15b4.

📒 Files selected for processing (1)

source/api_cc/src/DeepSpinPTExpt.cc

coderabbitai · 2026-06-02T03:35:32Z

+      for (int ii = phantom_n; ii < nall_real; ii++) {
+        mapping[ii] = fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]];
      }


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Missing phantom_n offset in mapping values for real atoms.

When phantom_n > 0, fwd_map[...] returns indices in the pre-padding coordinate space [0, nall_real_orig). After phantom padding, real atoms are shifted to indices [phantom_n, nall_real), so the mapping values must be offset accordingly.

For example, with phantom_n=2: if fwd_map[...] returns 0 (first real ghost in pre-padding space), the mapping should point to index 2 (first real ghost in post-padding space), not 0 (which is now a phantom slot).

🐛 Proposed fix

for (int ii = phantom_n; ii < nall_real; ii++) { - mapping[ii] = fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]]; + mapping[ii] = fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]] + phantom_n; }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

for (int ii = phantom_n; ii < nall_real; ii++) {

mapping[ii] = fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]];

}

for (int ii = phantom_n; ii < nall_real; ii++) {

mapping[ii] = fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]] + phantom_n;

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@source/api_cc/src/DeepSpinPTExpt.cc` around lines 487 - 489, The mapping assignment for real atoms fails to account for phantom padding: when phantom_n > 0 the indices returned by fwd_map(...) are in the pre-padding space and must be shifted by phantom_n. Update the loop that sets mapping[ii] (which uses fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]]) to add phantom_n to the fwd_map result so real-atom targets point into the post-padding index range; keep all other indexing (ii, bkw_map, lmp_list.mapping) unchanged.

codecov · 2026-06-02T04:19:09Z

Codecov Report

❌ Patch coverage is 14.28571% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.73%. Comparing base (c67b7f7) to head (47f15b4).

Files with missing lines	Patch %	Lines
source/api_cc/src/DeepSpinPTExpt.cc	14.28%	20 Missing and 4 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5485      +/-   ##
==========================================
- Coverage   81.36%   79.73%   -1.63%     
==========================================
  Files         868      868              
  Lines       96437    96439       +2     
  Branches     4233     4234       +1     
==========================================
- Hits        78463    76896    -1567     
- Misses      16674    18410    +1736     
+ Partials     1300     1133     -167

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

dosubot Bot added the bug label Jun 2, 2026

github-actions Bot added C++ and removed bug labels Jun 2, 2026

chatgpt-codex-connector Bot reviewed Jun 2, 2026

View reviewed changes

wanghan-iapcm requested a review from njzjz June 2, 2026 03:31

wanghan-iapcm mentioned this pull request Jun 2, 2026

test(lammps): skip spin DPA3 empty-subdomain pt2 case #5478

Open

coderabbitai Bot reviewed Jun 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cc): handle nloc==0 in DeepSpinPTExpt with phantom-atom padding#5485

fix(cc): handle nloc==0 in DeepSpinPTExpt with phantom-atom padding#5485
wanghan-iapcm wants to merge 1 commit into
deepmodeling:masterfrom
wanghan-iapcm:fix-spin-pt2-empty-rank-fpe

wanghan-iapcm commented Jun 2, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 2, 2026

Uh oh!

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 2, 2026

Uh oh!

codecov Bot commented Jun 2, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

wanghan-iapcm commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Why phantoms rather than Dim(min=0) re-export

Test plan

Known limitations

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

wanghan-iapcm commented Jun 2, 2026 •

edited

Loading

Why phantoms rather than `Dim(min=0)` re-export

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading

codecov Bot commented Jun 2, 2026 •

edited

Loading