Skip to content

fix(cc): handle nloc==0 in DeepSpinPTExpt with phantom-atom padding#5485

Open
wanghan-iapcm wants to merge 1 commit into
deepmodeling:masterfrom
wanghan-iapcm:fix-spin-pt2-empty-rank-fpe
Open

fix(cc): handle nloc==0 in DeepSpinPTExpt with phantom-atom padding#5485
wanghan-iapcm wants to merge 1 commit into
deepmodeling:masterfrom
wanghan-iapcm:fix-spin-pt2-empty-rank-fpe

Conversation

@wanghan-iapcm
Copy link
Copy Markdown
Collaborator

@wanghan-iapcm wanghan-iapcm commented Jun 2, 2026

Problem

Multi-rank spin MD can leave a rank with zero real local atoms (nloc_real == 0) when atoms migrate to other subdomains. The with-comm AOTI artifact hits an intermittent SIGFPE (integer divide by zero) at runtime in inductor-generated shape arithmetic that uses nloc as a divisor.

Reproduced on master CI run 26667802665:

Caught signal 8 (Floating point exception: integer divide by zero)
4  forward_lower_with_comm/.../wrapper.so(AOTInductorModel::run_impl+0xf482)

Root cause:

  • The graph was traced with nloc_min=1 (serialization.py:362) and inductor lowered an even stricter nloc >= 2 runtime-check (visible in the generated wrapper.cpp's check_input_3).
  • That runtime-check is gated by env var AOTI_RUNTIME_CHECK_INPUTS (default OFF), so with nloc = 0 the check is silently bypassed and the compiled graph runs through its own divide-by-zero on shape arithmetic.
  • Whether the offending divide is actually emitted depends on inductor's code-gen choices, which vary across compiles — hence the intermittent nature.

Fix

Prepend two phantom atoms with empty neighbour lists when nloc_real == 0 so the AOTI graph runs with nloc == 2 and never reaches the integer-divide-by-zero path. Phantoms have no neighbours so they contribute zero atomic energy / force / virial, preserving the physically-correct "this rank has no real atoms" result.

Key details (all in source/api_cc/src/DeepSpinPTExpt.cc):

  • dcoord / datype / dspin get two zero-valued rows prepended.
  • firstneigh_tensor gets two -1 rows prepended (no neighbours).
  • mapping_tensor gets two identity entries prepended.
  • comm_dict.nlocal is set to 2 (not the LAMMPS-reported 0) so border_op writes received ghost features past the phantom slots.
  • Output arrays (dforce, dforce_mag, datom_energy, datom_virial) get the phantom prefix stripped before being scattered back to LAMMPS via select_map.

Why phantoms rather than Dim(min=0) re-export

Bumping the trace constraint to min=0 would require:

  1. auditing every nloc-dependent divide in deepmd/dpmodel/{descriptor,fitting,model}/ and protecting with xp.maximum(nloc, 1);
  2. torch.export re-emitting compatible guards (currently fails because spin-side shape relationships require nloc >= 1 to be inferable);
  3. inductor cooperating with the relaxed bound (it makes independent specialization choices downstream);
  4. re-exporting every .pt2 archive in source/tests/infer/.

The phantom approach is a strict superset of correctness and self-contained in one C++ file. The two approaches aren't mutually exclusive — the min=0 route can land as a follow-up once the dpmodel audit is done.

Test plan

  • Local CPU rebuild + runUnitTests_cc --gtest_filter='*Spin*': 42 / 42 spin C++ regression tests pass (12 TF-backend tests skipped, as expected in the PT-only venv).
  • CI: the multi-rank LAMMPS test test_pair_deepmd_mpi_dpa3_spin_empty_subdomain should now pass deterministically. Local Python LAMMPS-MPI verification is blocked by a pre-existing OpenMPI/MPICH ABI mismatch in my local venv (the plugin's ompi_mpi_* symbols can't resolve against MPICH's libmpi.so.12), so end-to-end verification falls to CI.

Known limitations

  • The phantom path is structurally inert for nloc_real > 0 (the if (phantom_n > 0) branch never fires), so the common path is unchanged.
  • If a future inductor version bumps the nloc lower-bound to >2, phantom_n will need to track that minimum.
  • This fix is in DeepSpinPTExpt only. The corresponding non-spin path in DeepPotPTExpt has the same code shape; non-spin DPA3 empty-subdomain currently passes in CI but could regress similarly with a future inductor change. Deferred to a follow-up if observed.
  • Supersedes test(lammps): skip spin DPA3 empty-subdomain pt2 case #5478 (which proposed skipping the test); this PR fixes the underlying bug instead.

Multi-rank spin MD can leave a rank with zero real local atoms when
all atoms migrate to other subdomains.  The with-comm AOTI artifact
hits an intermittent SIGFPE (integer divide by zero) at runtime in
inductor-generated shape arithmetic that uses nloc as a divisor.

The graph was traced with nloc_min=1 and inductor lowered an even
stricter nloc>=2 runtime-check which is silently bypassed because
AOTI_RUNTIME_CHECK_INPUTS is unset by default.  Whether the offending
divide is actually emitted depends on inductor's code-gen choices,
which vary across compiles -- hence the random nature of the failure
(reproduced on CI run 26667802665).

Fix: prepend two phantom atoms with empty neighbour lists ahead of
the real atoms when nloc_real==0.  The AOTI graph then runs with
nloc==2, satisfying the inductor specialisation.  Phantoms have no
neighbours so they contribute zero atomic energy / force / virial,
preserving the physically-correct 'this rank has no real atoms'
result.  comm_dict's nlocal is set to 2 so border_op writes received
ghost features past the phantom slots; outputs are stripped of the
phantom prefix before being scattered back to LAMMPS via select_map.
@dosubot dosubot Bot added the bug label Jun 2, 2026
@github-actions github-actions Bot added C++ and removed bug labels Jun 2, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 47f15b41a6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// received ghost features past the phantom slots; outputs are stripped
// of the phantom prefix before being scattered back to LAMMPS atoms
// via ``select_map``.
const int phantom_n = (nloc_real == 0 && nall_real > 0) ? 2 : 0;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Subtract phantom atoms before returning energy

When an empty-subdomain rank takes this new path, the two inserted type-0 phantom atoms are included in the model's energy_redu, and that reduced energy is assigned directly before any phantom prefix is stripped. DeepMD energy fitting nets can add per-type biases / nonzero zero-neighbor outputs, so phantoms are not guaranteed to have zero atomic energy; in those models every rank with nloc_real == 0 will contribute extra phantom energy to the MPI-reduced LAMMPS total. Please either remove/subtract the phantom atomic energies from energy_redu or make the model see them through a mask that excludes them from reductions.

Useful? React with 👍 / 👎.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 2, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR modifies DeepSpinPTExpt::compute to handle a zero-local-atoms corner case by introducing phantom-atom padding. When a rank has no real local atoms but has remote atoms, the code prepends two phantom atoms to avoid AOTI/Inductor model failures. The mapping, neighbor list, and output tensors are adjusted to isolate phantom atoms, then phantom entries are stripped from results before returning to LAMMPS.

Changes

Phantom-atom padding for zero-local-atoms corner case

Layer / File(s) Summary
Phantom-atom padding initialization
source/api_cc/src/DeepSpinPTExpt.cc
When nloc_real == 0 and nall_real > 0, prepend two phantom atoms with zero values; update nall_real and nloc_real counters; reconstruct spin tensor to fill real atoms from bkw_map while keeping phantom spin at zero.
Tensor mappings and neighbor-list preparation
source/api_cc/src/DeepSpinPTExpt.cc
Rebuild mapping_tensor so phantom slots use identity entries (self-indexing) while real slots use fwd_map-based remapping; prepend phantom_n neighbor-list rows filled with -1 to firstneigh_tensor so model receives adjusted nloc shape while phantom atoms contribute no neighbors.
Output post-processing and phantom removal
source/api_cc/src/DeepSpinPTExpt.cc
After inference, strip phantom-prefixed entries from dforce and dforce_mag; strip phantom prefixes from datom_energy and datom_virial when atomic output is enabled; restore nall_real to real-atom-only count so downstream force mapping applies only to physical atoms.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: fixing a zero-local-atom edge case (nloc==0) in DeepSpinPTExpt using phantom-atom padding.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@source/api_cc/src/DeepSpinPTExpt.cc`:
- Around line 487-489: The mapping assignment for real atoms fails to account
for phantom padding: when phantom_n > 0 the indices returned by fwd_map(...) are
in the pre-padding space and must be shifted by phantom_n. Update the loop that
sets mapping[ii] (which uses fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]])
to add phantom_n to the fwd_map result so real-atom targets point into the
post-padding index range; keep all other indexing (ii, bkw_map,
lmp_list.mapping) unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e4e67575-7512-48ab-a532-4f17dad67d93

📥 Commits

Reviewing files that changed from the base of the PR and between c67b7f7 and 47f15b4.

📒 Files selected for processing (1)
  • source/api_cc/src/DeepSpinPTExpt.cc

Comment on lines +487 to 489
for (int ii = phantom_n; ii < nall_real; ii++) {
mapping[ii] = fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]];
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Missing phantom_n offset in mapping values for real atoms.

When phantom_n > 0, fwd_map[...] returns indices in the pre-padding coordinate space [0, nall_real_orig). After phantom padding, real atoms are shifted to indices [phantom_n, nall_real), so the mapping values must be offset accordingly.

For example, with phantom_n=2: if fwd_map[...] returns 0 (first real ghost in pre-padding space), the mapping should point to index 2 (first real ghost in post-padding space), not 0 (which is now a phantom slot).

🐛 Proposed fix
     for (int ii = phantom_n; ii < nall_real; ii++) {
-      mapping[ii] = fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]];
+      mapping[ii] = fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]] + phantom_n;
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for (int ii = phantom_n; ii < nall_real; ii++) {
mapping[ii] = fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]];
}
for (int ii = phantom_n; ii < nall_real; ii++) {
mapping[ii] = fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]] + phantom_n;
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@source/api_cc/src/DeepSpinPTExpt.cc` around lines 487 - 489, The mapping
assignment for real atoms fails to account for phantom padding: when phantom_n >
0 the indices returned by fwd_map(...) are in the pre-padding space and must be
shifted by phantom_n. Update the loop that sets mapping[ii] (which uses
fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]]) to add phantom_n to the
fwd_map result so real-atom targets point into the post-padding index range;
keep all other indexing (ii, bkw_map, lmp_list.mapping) unchanged.

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 2, 2026

Codecov Report

❌ Patch coverage is 14.28571% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.73%. Comparing base (c67b7f7) to head (47f15b4).

Files with missing lines Patch % Lines
source/api_cc/src/DeepSpinPTExpt.cc 14.28% 20 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5485      +/-   ##
==========================================
- Coverage   81.36%   79.73%   -1.63%     
==========================================
  Files         868      868              
  Lines       96437    96439       +2     
  Branches     4233     4234       +1     
==========================================
- Hits        78463    76896    -1567     
- Misses      16674    18410    +1736     
+ Partials     1300     1133     -167     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant