Skip to content

Feature: DeltaSpin for LCAO and PW base and DFTU for PW, both collinear and noncollinear spin (code change and UTs)#7384

Open
dyzheng wants to merge 7 commits into
deepmodeling:developfrom
dyzheng:feat/port-dftu-ds-code-v4
Open

Feature: DeltaSpin for LCAO and PW base and DFTU for PW, both collinear and noncollinear spin (code change and UTs)#7384
dyzheng wants to merge 7 commits into
deepmodeling:developfrom
dyzheng:feat/port-dftu-ds-code-v4

Conversation

@dyzheng
Copy link
Copy Markdown
Collaborator

@dyzheng dyzheng commented May 26, 2026

Reminder

  • Have you linked an issue with this pull request?
  • Have you added adequate unit tests and/or case tests for your pull request?
  • Have you noticed possible changes of behavior below or in the linked issue?
  • Have you explained the changes of codes in core modules of ESolver, HSolver, ElecState, Hamilt, Operator or Psi? (ignore if not applicable)

Linked Issue

Fix #...

Unit Tests and/or Case Tests for my changes

  • A unit test is added for each new feature or bug fix.

What's changed?

  • Example: My changes might affect the performance of the application under certain conditions, and I have tested the impact on various scenarios...

Any changes of core modules? (ignore if not applicable)

  • Example: I have added a new virtual function in the esolver base class in order to ...

dyzheng and others added 7 commits May 26, 2026 16:02
…ing develop)

Direct sync of source/ directory from feat/dftu-pw-port-v2
which has been merged with origin/develop.
101 files changed, +10659/-1262 lines.
Reverted infrastructure files to origin/develop state to clean up the PR.
Files reverted:
- source/source_cell/read_atoms_helper.cpp
- source/source_base/parallel_global.cpp
- source/source_base/timer.cpp
- source/source_base/kernels/cuda/math_kernel_op.cu
- source/source_base/module_device/device_check.h
- source/source_base/module_container/base/macros/cuda.h
- source/source_base/module_container/base/third_party/cusolver.h
- source/source_esolver/lcao_others.cpp
- source/source_hsolver/kernels/cuda/diag_cusolvermp.cu
- source/source_io/module_unk/berryphase.cpp
- source/source_lcao/module_rt/solve_propagation.cpp
- source/source_main/main.cpp
- source/source_lcao/module_ri/RPA_LRI.hpp

Kept:
- DFT+U/DeltaSpin core functionality (esolver, kernels with npol, dftu modules)
- Input parameters
- Build/Test system updates
Restored the following files from port-v2 as they contain essential
functional changes and bug fixes:

- source/source_esolver/lcao_others.cpp: Passes sc_direction_only parameter to init_deltaspin_lcao.
- source/source_base/parallel_global.cpp: Fixes MPI_Comm_free logic to avoid freeing MPI_COMM_WORLD.
- source/source_main/main.cpp: Reorders fftw_cleanup_threads before MPI_Finalize to avoid segfault.
- source/source_base/kernels/cuda/math_kernel_op.cu: Adds cublas_handle null check.

Other infrastructure files (timer.cpp, device_check.h, berryphase.cpp, etc.) remain reverted to develop.
Restoring all code-related changes as previous partial revert caused
compilation errors in CI.
Reverted 15 files with modifications that are completely unrelated to
the DFT+U and DeltaSpin functionality:

- Pure comment/formatting changes: timer.cpp, diago_david.cpp,
  esolver_sdft_pw.cpp, hsolver_lcao.cpp, berryphase.cpp, RPA_LRI.hpp,
  solve_propagation.cpp, diag_cusolvermp.cu, read_sep_test.cpp,
  cusolver.h (3 comment-only lines restored, CUDA version guards kept)
- Unrelated code refactor: elecstate_pw.cpp (vkb.nc -> vkbnc rename)
- Unused variable: forces.cpp (forcepaw declaration)
- Unrelated test: nscf_utils_test.cpp (deleted) and its CMakeLists entry
- operator_lcao.cpp: restored original comments and removed extra blank lines
@mohanchen mohanchen requested a review from dzzz2001 May 28, 2026 08:41
Copy link
Copy Markdown
Collaborator

@dzzz2001 dzzz2001 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Detailed inline review covering blockers and should-fix items, organized by sub-area (DeltaSpin core, DFTU core, PW kernels + onsite, infra/wiring, tests). Each inline comment is anchored to the most representative diff line. Happy to discuss any of these.

}
}
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Blocker] The CPU onsite_ps_op DFT+U overload in kernels/onsite_op.cpp gained an if(npol == 2) {…} else {…} branch (npol==1 path), but the matching CUDA kernel — the second __global__ void onsite_op(…) defined in this file (around line 48 of the new file, the variant that takes vu_iat, vu_begin_iat, orb_l_iat, ip_m) — was not patched. Only the DeltaSpin variant above (this hunk) gained the npol==1 else branch.

When PARAM.inp.device == "gpu" for DFT+U with nspin=1 or nspin=2 (the headline feature this PR is adding for PW), the DFT+U GPU kernel still hardcodes

ps[psind]     += vu_iat[index_mm] * becp[becpind] + vu_iat[index_mm + tlp1_2*2] * becp[becpind + tnp];
ps[psind + 1] += vu_iat[index_mm + tlp1_2*1] * becp[…]    + vu_iat[index_mm + tlp1_2*3] * becp[…];

For npol==1 this reads past becp (into the next atom/projector), writes ps[psind+1] (into the next ip's row), and indexes off-diagonal vu entries that don't exist (vu for nspin=1/2 is sized tlp1^2, not tlp1^2 * 4). The same applies to the ROCm mirror in kernels/rocm/onsite_op.hip.cu.

const int& nbands,
const int& ik,
const int& nkb,
const int& npol,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Blocker] Only the function signature gained const int& npol — the device kernels in this file (cal_force_npw_op / cal_force_nl_op at lines 225, 267, 339, 405 of the new file) still hardcode ib * 2, (ib2+1) * nkb, and ipol * nbands * 2 * nkb + …. When invoked with npol == 1 (collinear DFT+U / DeltaSpin force in PW), these will read OOB on dbecp/becp and add the spin-down contribution to a non-existent row.

The CPU side (kernels/force_op.cpp) was correctly split into if(npol==2) { … } else if (npol==1) { … } branches. The GPU kernels need the same. The ROCm mirror (kernels/rocm/force_op.hip.cu, not in this diff) doesn't appear to have the new npol parameter in its signature at all, which will cause a link failure on ROCm builds.

const int& ntype,
const int& wg_nc,
const int& ik,
const int& npol,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Blocker] Same issue as cuda/force_op.cu: only the signature gained const int& npol; the kernels at lines 317, 946, 1008 of the new file still hardcode ib * 2 and (ib2+1) * nkb. For npol==1 (collinear PW DFT+U/DeltaSpin stress) this will OOB-read becp/dbecp and double-count. Same for kernels/rocm/stress_op.hip.cu.

static inline void trtri(cusolverDnHandle_t& cusolver_handle, const char& uplo, const char& diag, const int& n, float* A, const int& lda)
{
int lwork = 0;
CHECK_CUSOLVER(cusolverDnStrtri_bufferSize(cusolver_handle, cublas_fill_mode(uplo), cublas_diag_type(diag), n, A, lda, &lwork));
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Blocker] The #if CUDA_VERSION < 11000 fallback calls cusolverDnStrtri_bufferSize / cusolverDnStrtri (and D/C/Z variants below). These symbols do not exist in any cuSOLVER release — only the generic cusolverDnXtrtri (CUDA 11.0+, used by the #if … >= 11000 branch above) and the batched cublas{S,D,C,Z}trtri_batched exist. Any build with CUDA < 11 will fail to link with undefined references to cusolverDn{S,D,C,Z}trtri.

Options:

  1. Drop the fallback entirely and require CUDA ≥ 11 — consistent with module_container/base/macros/cuda.h:25-31 which already guards GetTypeCuda<int64_t> on CUDA_VERSION >= 11000.
  2. Route the legacy branch to cublas{S,D,C,Z}trtri_batched (with batchCount=1).

@@ -398,25 +411,43 @@ struct cal_force_nl_op<FPTYPE, base_device::DEVICE_CPU>
const std::complex<FPTYPE> coefficients3(-1 * lambda[iat*3+2], 0.0);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Blocker] This npol==1 branch:

local_force[ipol] -= fac * lambda[iat*3+2] * dbb;

omits the spin sign that is correctly applied in the Hamiltonian application at op_pw_proj.cpp:1448-1456:

int spin_sign = 1;
if (PARAM.inp.nspin == 2) {
    spin_sign = (this->isk[this->ik] == 0) ? 1 : -1;
}
tmp_lambda_coeff[iat] = std::complex<double>(lambda[iat][2] * spin_sign, 0.0);

Without the sign, the force from spin-down k-points (isk==1) is added with the wrong sign and the force will not equal -dE/dR. Same issue in kernels/stress_op.cpp npol==1 branch.

CHECK_CUDA(cudaFree(d_info));
}
static inline void trtri(cusolverDnHandle_t& cusolver_handle, const char& uplo, const char& diag, const int& n, double* A, const int& lda)
{
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Should-fix] Even after fixing the non-existent symbols flagged above, the new functions allocate d_work / d_info via cudaMalloc and free them at the end. There's no RAII guard, so a CHECK_CUSOLVER failure between the alloc and free leaks both. Consider a small scope guard, or check the status code explicitly and free before throwing.

{
syncmem_complex_d2h_op()(h_becp, this->becp, this->size_becp);
}
this->becp_ready_ = true;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Should-fix] The new cache (becp_ready_ = true; ik_becp_ = ik_; here and bool is_becp_ready(int ik) const { return becp_ready_ && ik_becp_ == ik; } in the header) is set/invalidated but never read anywhere I can see. force_onsite / stress_onsite still recompute becp unconditionally. Either wire the consumer to skip the recomputation (cheapest perf win) or remove the dead state.

@@ -1,7 +1,8 @@
#ifndef W_ABACUS_DEVELOP_ABACUS_DEVELOP_SOURCE_source_pw_HAMILT_PWDFT_KERNELS_FORCE_OP_H
#define W_ABACUS_DEVELOP_ABACUS_DEVELOP_SOURCE_source_pw_HAMILT_PWDFT_KERNELS_FORCE_OP_H
#include "source_io/module_parameter/parameter.h"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Should-fix] Pulling source_io/module_parameter/parameter.h into a kernel-op header means every TU that includes this header now transitively includes the whole parameter parser — meaningful compile-time bloat. The parameter values needed in the kernel implementation can be passed as function arguments (as npol already is). Please move the include into .cpp files only, or use a forward-declared opaque accessor.

@@ -1,7 +1,7 @@
#ifndef SRC_PW_STRESS_MULTI_DEVICE_H
#define SRC_PW_STRESS_MULTI_DEVICE_H
#include "source_io/module_parameter/parameter.h"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Should-fix] Same as force_op.h above: please move #include "source_io/module_parameter/parameter.h" out of this header. Header pulls in the full parameter parser; kernel-op headers should stay light.


#include "elecstate.h"
#include "source_estate/module_dm/density_matrix.h"
#include "source_basis/module_ao/parallel_orbitals.h"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Should-fix] Two new includes here (parallel_orbitals.h, klist.h) but I don't see any new use of those types in the header itself — only the .cpp. Please move them into the implementation file to avoid compile-time bloat.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants