
Refactor: convert ia-it nested loops to iat flat loops with OpenMP in ESolver_DP#7394

Open
chengleizheng wants to merge 2 commits into deepmodeling:develop from chengleizheng:develop


@chengleizheng chengleizheng commented May 29, 2026

Replaced ia-it nested loops with flat iat loops using ucell.iat2it/iat2ia lookup arrays and added #pragma omp parallel for guarded by #ifdef _OPENMP in runner() coord building, runner() force assignment, and type_map() atype assignment.

@chengleizheng chengleizheng changed the title from "Replaced ia-it nested loops with flat iat loops using ucell.iat2it/iat2ia lookup arrays and added #pragma omp parallel for guarded by #ifdef _OPENMP in runner() coord building, runner() force assignment, and type_map() atype assignment." to "Refactor: convert ia-it nested loops to iat flat loops with OpenMP in ESolver_DP" May 29, 2026
@mohanchen mohanchen requested a review from 19hello May 29, 2026 09:50
@mohanchen mohanchen added the Feature Discussed (the features will be discussed first but will not be implemented soon) and project_learning labels May 29, 2026
@mohanchen
Collaborator

Nice try, you can do more, and put your test and analysis here.

#ifdef _OPENMP
#pragma omp parallel for
#endif
for (int iat = 0; iat < ucell.nat; ++iat)
Collaborator


I recommend default(none) because it requires explicit variable scoping and avoids hidden parallel errors.

Author


Thanks for the recommendation!😊

@chengleizheng
Author

chengleizheng commented Jun 1, 2026

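2. Code change details

Overview of changes

| Change | Location | What changed |
| --- | --- | --- |
| runner() coordinate-building loop | L73-84 | ia-it double nested loop → flat iat loop + OpenMP |
| runner() force-assignment loop | L103-105 | #pragma omp parallel for |
| type_map() atype assignment loop | L189-209 | validation split from assignment + flat iat loop + OpenMP |

2.1 runner() coordinate building

Before (ia-it nested loops):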

int iat = 0;
for (int it = 0; it < ucell.ntype; ++it)          // outer: element types (1 to 3)
{
    for (int ia = 0; ia < ucell.atoms[it].na; ++ia) // inner: atoms of one type (hundreds)
    {
        coord[3 * iat]     = ucell.atoms[it].tau[ia].x * ucell.lat0_angstrom;
        coord[3 * iat + 1] = ucell.atoms[it].tau[ia].y * ucell.lat0_angstrom;
        coord[3 * iat + 2] = ucell.atoms[it].tau[ia].z * ucell.lat0_angstrom;
        iat++;  // shared counter: a data race if the loop runs on several threads
    }
}
assert(ucell.nat == iat);

After (flat iat loop + OpenMP):

#ifdef _OPENMP
#pragma omp parallel for default(none) shared(ucell, coord)
#endif
for (int iat = 0; iat < ucell.nat; ++iat)
{
    int it = ucell.iat2it[iat];  // prebuilt index table, O(1) lookup
    int ia = ucell.iat2ia[iat];  // prebuilt index table, O(1) lookup
    coord[3 * iat]     = ucell.atoms[it].tau[ia].x * ucell.lat0_angstrom;
    coord[3 * iat + 1] = ucell.atoms[it].tau[ia].y * ucell.lat0_angstrom;
    coord[3 * iat + 2] = ucell.atoms[it].tau[ia].z * ucell.lat0_angstrom;
}

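Key points:

  • The shared iat++ counter is gone; each iteration writes to a location that depends only on the loop variable iat, so there is no data race
  • The iat2it[] / iat2ia[] index arrays, prebuilt when UnitCell is initialized (O(1) lookup), replace the per-type traversal of the nested loops
  • The 864 iterations are split evenly across threads, so the load is naturally balanced
  • The now-redundant assert(nat == iat) is removed (the flat loop has no running counter left to cross-check)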
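
The equivalence of the nested and flat loops can be checked in isolation. The following is a minimal sketch, not the ABACUS code: Vec3, AtomsOfType, Cell, build_lookup, coords_nested and coords_flat are hypothetical stand-ins that mirror only the fields named above, and the lookup tables are assumed to be filled in the same order in which the nested loop visits atoms.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the UnitCell fields named in this PR;
// the real ABACUS classes carry far more state.
struct Vec3 { double x, y, z; };
struct AtomsOfType { std::vector<Vec3> tau; };  // tau.size() plays the role of na

struct Cell
{
    std::vector<AtomsOfType> atoms;  // one entry per atom type
    std::vector<int> iat2it;         // global atom index -> type index
    std::vector<int> iat2ia;         // global atom index -> index within its type
    double lat0_angstrom = 1.0;
};

// Fill the lookup tables in the same order the nested loop visits atoms.
void build_lookup(Cell& cell)
{
    cell.iat2it.clear();
    cell.iat2ia.clear();
    for (int it = 0; it < static_cast<int>(cell.atoms.size()); ++it)
    {
        for (int ia = 0; ia < static_cast<int>(cell.atoms[it].tau.size()); ++ia)
        {
            cell.iat2it.push_back(it);
            cell.iat2ia.push_back(ia);
        }
    }
}

// Serial reference: nested loops that append coordinates in visiting order.
std::vector<double> coords_nested(const Cell& cell)
{
    std::vector<double> coord;
    for (const AtomsOfType& type : cell.atoms)
    {
        for (const Vec3& t : type.tau)
        {
            coord.push_back(t.x * cell.lat0_angstrom);
            coord.push_back(t.y * cell.lat0_angstrom);
            coord.push_back(t.z * cell.lat0_angstrom);
        }
    }
    return coord;
}

// Flat loop: every iteration writes only coord[3*iat .. 3*iat+2].
std::vector<double> coords_flat(const Cell& cell)
{
    assert(cell.iat2it.size() == cell.iat2ia.size());
    const int nat = static_cast<int>(cell.iat2it.size());
    std::vector<double> coord(3 * nat);
#ifdef _OPENMP
#pragma omp parallel for default(none) shared(cell, coord) firstprivate(nat)
#endif
    for (int iat = 0; iat < nat; ++iat)
    {
        const Vec3& t = cell.atoms[cell.iat2it[iat]].tau[cell.iat2ia[iat]];
        coord[3 * iat]     = t.x * cell.lat0_angstrom;
        coord[3 * iat + 1] = t.y * cell.lat0_angstrom;
        coord[3 * iat + 2] = t.z * cell.lat0_angstrom;
    }
    return coord;
}
```

Comparing coords_flat(cell) with coords_nested(cell) on a small multi-type cell is a cheap regression test for this kind of refactor, with or without -fopenmp.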

2.2 runner() force assignment

#ifdef _OPENMP
#pragma omp parallel for default(none) shared(ucell, f, fact_f)
#endif
for (int i = 0; i < ucell.nat; ++i)
{
    dp_force(i, 0) = f[3 * i] * fact_f;
    dp_force(i, 1) = f[3 * i + 1] * fact_f;
    dp_force(i, 2) = f[3 * i + 2] * fact_f;
}

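This loop was already a single flat loop over atoms, so only the OpenMP pragma is added; it is the smallest change of the three.

2.3 type_map(): separating validation from assignment

Before: the label validation (including WARNING_QUIT) and the atype assignment were mixed inside the double loop, so the validation was needlessly repeated for every atom.

After: split into two phases: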

// Phase 1: validation (serial, only ntype iterations)
for (int it = 0; it < ucell.ntype; ++it)
    if (label.find(ucell.atoms[it].label) == label.end())
        WARNING_QUIT(...);

// Phase 2: assignment (parallel, nat iterations over the flat atom index)
#ifdef _OPENMP
#pragma omp parallel for default(none) shared(ucell, label)
#endif
for (int iat = 0; iat < ucell.nat; ++iat)
    atype[iat] = label[ucell.atoms[ucell.iat2it[iat]].label];
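
One detail worth keeping in mind for the parallel phase: std::unordered_map::operator[] is a non-const member that inserts a default value when the key is missing, so it relies on phase 1 having validated every label. A possible variant, sketched here with hypothetical names (map_types, type_labels, type_id) and std::runtime_error standing in for WARNING_QUIT, resolves each label once per type with find() and lets the parallel loop read only plain integer arrays.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper, not the ABACUS type_map(): type_labels[it] is the label
// of type it, iat2it maps each atom to its type, label maps names to model ids.
std::vector<int> map_types(const std::vector<std::string>& type_labels,
                           const std::vector<int>& iat2it,
                           const std::unordered_map<std::string, int>& label)
{
    // Phase 1 (serial): validate each label once and cache its model id.
    std::vector<int> type_id(type_labels.size());
    for (std::size_t it = 0; it < type_labels.size(); ++it)
    {
        const auto pos = label.find(type_labels[it]);
        if (pos == label.end())
        {
            throw std::runtime_error("unknown label: " + type_labels[it]);
        }
        type_id[it] = pos->second;
    }

    // Phase 2 (parallel): pure array reads, one independent write per atom.
    // Assumes every iat2it entry is a valid index into type_labels.
    const int nat = static_cast<int>(iat2it.size());
    std::vector<int> atype(iat2it.size());
#ifdef _OPENMP
#pragma omp parallel for default(none) shared(atype, iat2it, type_id) firstprivate(nat)
#endif
    for (int iat = 0; iat < nat; ++iat)
    {
        atype[iat] = type_id[iat2it[iat]];
    }
    return atype;
}
```

This also avoids hashing a string per atom, although at 864 atoms the difference is unlikely to be measurable.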

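Engineering details

  • Every #pragma omp is guarded by an #ifdef _OPENMP macro, following the project convention (see esolver_of_tddft.cpp)
  • The virial assignment (a 3×3 loop, 9 iterations in total) is not parallelized: the thread start-up overhead exceeds the computation
  • No schedule clause is given, so the default schedule applies (static in common implementations such as GCC), which fits because every iteration does the same amount of work

3. Performance test results

Test environment:

  • System: 864-atom Al (abacus-user-guide/examples/md/3_DPMD)
  • Potential: Al-SCAN.pb (DeepMD model)
  • MD type: MSST, 10 steps
  • Build: cmake .. -DDeePMD_DIR=/home/chenglei/miniconda3/envs/deepmd

Before: mpirun -np 1; the code contains no #pragma omp, so the loops run purely serially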
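
For loops whose trip count varies at run time, OpenMP's if clause offers a middle ground between always and never parallelizing: when the condition is false, the region runs on a single thread. The sketch below only illustrates the clause; scaled and the min_parallel_iters default of 256 are hypothetical and were not measured for this PR.

```cpp
#include <cassert>
#include <vector>

// Multiply every element by factor. The loop is parallelized only when it has
// at least min_parallel_iters iterations (a hypothetical, untuned threshold).
std::vector<double> scaled(const std::vector<double>& in, double factor,
                           int min_parallel_iters = 256)
{
    const int n = static_cast<int>(in.size());
    std::vector<double> out(in.size());
#ifdef _OPENMP
#pragma omp parallel for default(none) shared(in, out) \
    firstprivate(n, factor, min_parallel_iters) if(n >= min_parallel_iters)
#endif
    for (int i = 0; i < n; ++i)
    {
        out[i] = in[i] * factor;
    }
    return out;
}
```

For a fixed 9-iteration loop such as the virial assignment, simply leaving the pragma out, as this PR does, is the simpler choice.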

 TIME STATISTICS
-------------------------------------------------------
 CLASS_NAME     NAME           TIME/s  CALLS   AVG/s  PER/%
-------------------------------------------------------
            total              10.23  1        10.23  100.00
 Driver     atomic_world       10.23  1        10.23  100.00
 Run_MD     md_line             9.14  1         9.14   89.36
 MD_func    force_virial        2.64  10        0.26   25.79
 ESolver_DP runner              2.60  10        0.26   25.42
-------------------------------------------------------
 TOTAL  Time  : 10s

After: mpirun -np 1; OMP_NUM_THREADS is not set, so OpenMP defaults to using every core on the machine

 TIME STATISTICS
-------------------------------------------------------
 CLASS_NAME     NAME           TIME/s  CALLS   AVG/s  PER/%
-------------------------------------------------------
            total               6.28  1         6.28  100.00
 Run_MD     md_line             5.71  1         5.71   90.90
 MD_func    force_virial        2.59  10        0.26   41.18
 ESolver_DP runner              2.55  10        0.25   40.54
-------------------------------------------------------
 TOTAL  Time  : 6s

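Comparison

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Total time | 10.23 s | 6.28 s | -38.6% |
| Run_MD md_line | 9.14 s | 5.71 s | -37.5% |
| ESolver_DP runner | 2.60 s | 2.55 s | -1.9% |
| runner, average per step | 0.260 s | 0.255 s | -1.9% |

Analysis:

  • The time spent in ESolver_DP::runner itself drops only slightly (2.60 s → 2.55 s), because the coordinate-building and force-assignment loops are cheap at the 864-atom scale
  • The large drop in total time (10.23 s → 6.28 s) comes from outside runner: md_line minus runner falls from 6.54 s to 3.16 s. I attribute this mainly to math libraries such as MKL/BLAS picking up multithreading automatically once OpenMP is enabled.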
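
Because the two runs differ in thread count as well as in code, recording the thread count next to each timing table makes the comparison easier to interpret. A minimal sketch, assuming a hypothetical helper named openmp_thread_budget that is not part of ABACUS:

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Upper bound on the number of threads a parallel region may use;
// returns 1 when the code is built without OpenMP.
int openmp_thread_budget()
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

Repeating the "after" run with OMP_NUM_THREADS=1 would then separate the effect of the loop refactor from the effect of multithreaded libraries.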
