fix(cc): handle nloc==0 in DeepSpinPTExpt with phantom-atom padding#5485
fix(cc): handle nloc==0 in DeepSpinPTExpt with phantom-atom padding#5485wanghan-iapcm wants to merge 1 commit into
Conversation
Multi-rank spin MD can leave a rank with zero real local atoms when all atoms migrate to other subdomains. The with-comm AOTI artifact hits an intermittent SIGFPE (integer divide by zero) at runtime in inductor-generated shape arithmetic that uses nloc as a divisor. The graph was traced with nloc_min=1 and inductor lowered an even stricter nloc>=2 runtime-check which is silently bypassed because AOTI_RUNTIME_CHECK_INPUTS is unset by default. Whether the offending divide is actually emitted depends on inductor's code-gen choices, which vary across compiles -- hence the random nature of the failure (reproduced on CI run 26667802665). Fix: prepend two phantom atoms with empty neighbour lists ahead of the real atoms when nloc_real==0. The AOTI graph then runs with nloc==2, satisfying the inductor specialisation. Phantoms have no neighbours so they contribute zero atomic energy / force / virial, preserving the physically-correct 'this rank has no real atoms' result. comm_dict's nlocal is set to 2 so border_op writes received ghost features past the phantom slots; outputs are stripped of the phantom prefix before being scattered back to LAMMPS via select_map.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 47f15b41a6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // received ghost features past the phantom slots; outputs are stripped | ||
| // of the phantom prefix before being scattered back to LAMMPS atoms | ||
| // via ``select_map``. | ||
| const int phantom_n = (nloc_real == 0 && nall_real > 0) ? 2 : 0; |
There was a problem hiding this comment.
Subtract phantom atoms before returning energy
When an empty-subdomain rank takes this new path, the two inserted type-0 phantom atoms are included in the model's energy_redu, and that reduced energy is assigned directly before any phantom prefix is stripped. DeepMD energy fitting nets can add per-type biases / nonzero zero-neighbor outputs, so phantoms are not guaranteed to have zero atomic energy; in those models every rank with nloc_real == 0 will contribute extra phantom energy to the MPI-reduced LAMMPS total. Please either remove/subtract the phantom atomic energies from energy_redu or make the model see them through a mask that excludes them from reductions.
Useful? React with 👍 / 👎.
📝 WalkthroughWalkthroughThis PR modifies ChangesPhantom-atom padding for zero-local-atoms corner case
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@source/api_cc/src/DeepSpinPTExpt.cc`:
- Around line 487-489: The mapping assignment for real atoms fails to account
for phantom padding: when phantom_n > 0 the indices returned by fwd_map(...) are
in the pre-padding space and must be shifted by phantom_n. Update the loop that
sets mapping[ii] (which uses fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]])
to add phantom_n to the fwd_map result so real-atom targets point into the
post-padding index range; keep all other indexing (ii, bkw_map,
lmp_list.mapping) unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: e4e67575-7512-48ab-a532-4f17dad67d93
📒 Files selected for processing (1)
source/api_cc/src/DeepSpinPTExpt.cc
| for (int ii = phantom_n; ii < nall_real; ii++) { | ||
| mapping[ii] = fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]]; | ||
| } |
There was a problem hiding this comment.
Missing phantom_n offset in mapping values for real atoms.
When phantom_n > 0, fwd_map[...] returns indices in the pre-padding coordinate space [0, nall_real_orig). After phantom padding, real atoms are shifted to indices [phantom_n, nall_real), so the mapping values must be offset accordingly.
For example, with phantom_n=2: if fwd_map[...] returns 0 (first real ghost in pre-padding space), the mapping should point to index 2 (first real ghost in post-padding space), not 0 (which is now a phantom slot).
🐛 Proposed fix
for (int ii = phantom_n; ii < nall_real; ii++) {
- mapping[ii] = fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]];
+ mapping[ii] = fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]] + phantom_n;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for (int ii = phantom_n; ii < nall_real; ii++) { | |
| mapping[ii] = fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]]; | |
| } | |
| for (int ii = phantom_n; ii < nall_real; ii++) { | |
| mapping[ii] = fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]] + phantom_n; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@source/api_cc/src/DeepSpinPTExpt.cc` around lines 487 - 489, The mapping
assignment for real atoms fails to account for phantom padding: when phantom_n >
0 the indices returned by fwd_map(...) are in the pre-padding space and must be
shifted by phantom_n. Update the loop that sets mapping[ii] (which uses
fwd_map[lmp_list.mapping[bkw_map[ii - phantom_n]]]) to add phantom_n to the
fwd_map result so real-atom targets point into the post-padding index range;
keep all other indexing (ii, bkw_map, lmp_list.mapping) unchanged.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5485 +/- ##
==========================================
- Coverage 81.36% 79.73% -1.63%
==========================================
Files 868 868
Lines 96437 96439 +2
Branches 4233 4234 +1
==========================================
- Hits 78463 76896 -1567
- Misses 16674 18410 +1736
+ Partials 1300 1133 -167 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Problem
Multi-rank spin MD can leave a rank with zero real local atoms (
nloc_real == 0) when atoms migrate to other subdomains. The with-comm AOTI artifact hits an intermittent SIGFPE (integer divide by zero) at runtime in inductor-generated shape arithmetic that usesnlocas a divisor.Reproduced on master CI run
26667802665:Root cause:
nloc_min=1(serialization.py:362) and inductor lowered an even stricternloc >= 2runtime-check (visible in the generatedwrapper.cpp'scheck_input_3).AOTI_RUNTIME_CHECK_INPUTS(default OFF), so withnloc = 0the check is silently bypassed and the compiled graph runs through its own divide-by-zero on shape arithmetic.Fix
Prepend two phantom atoms with empty neighbour lists when
nloc_real == 0so the AOTI graph runs withnloc == 2and never reaches the integer-divide-by-zero path. Phantoms have no neighbours so they contribute zero atomic energy / force / virial, preserving the physically-correct "this rank has no real atoms" result.Key details (all in
source/api_cc/src/DeepSpinPTExpt.cc):dcoord/datype/dspinget two zero-valued rows prepended.firstneigh_tensorgets two-1rows prepended (no neighbours).mapping_tensorgets two identity entries prepended.comm_dict.nlocalis set to2(not the LAMMPS-reported0) soborder_opwrites received ghost features past the phantom slots.dforce,dforce_mag,datom_energy,datom_virial) get the phantom prefix stripped before being scattered back to LAMMPS viaselect_map.Why phantoms rather than
Dim(min=0)re-exportBumping the trace constraint to
min=0would require:nloc-dependent divide indeepmd/dpmodel/{descriptor,fitting,model}/and protecting withxp.maximum(nloc, 1);torch.exportre-emitting compatible guards (currently fails because spin-side shape relationships requirenloc >= 1to be inferable);.pt2archive insource/tests/infer/.The phantom approach is a strict superset of correctness and self-contained in one C++ file. The two approaches aren't mutually exclusive — the
min=0route can land as a follow-up once the dpmodel audit is done.Test plan
runUnitTests_cc --gtest_filter='*Spin*': 42 / 42 spin C++ regression tests pass (12 TF-backend tests skipped, as expected in the PT-only venv).test_pair_deepmd_mpi_dpa3_spin_empty_subdomainshould now pass deterministically. Local Python LAMMPS-MPI verification is blocked by a pre-existing OpenMPI/MPICH ABI mismatch in my local venv (the plugin'sompi_mpi_*symbols can't resolve against MPICH'slibmpi.so.12), so end-to-end verification falls to CI.Known limitations
nloc_real > 0(theif (phantom_n > 0)branch never fires), so the common path is unchanged.nloclower-bound to >2,phantom_nwill need to track that minimum.DeepSpinPTExptonly. The corresponding non-spin path inDeepPotPTExpthas the same code shape; non-spin DPA3 empty-subdomain currently passes in CI but could regress similarly with a future inductor change. Deferred to a follow-up if observed.