[Bug] DSP Segmentation fault on MEMORY print for many nodes #7269

@Cstandardlib

Describe the bug

Build: ABACUS DSP build, run on multiple DSP nodes.
Calculation: sDFT scf

When the job runs on many nodes, a segmentation fault occurs right after the TIME STATISTICS block is written to running_scf.log.

With 3, 4, or 5 nodes, abacus_dsp terminates without error; with 6, 7, or 8 nodes, it segfaults.

running_scf.log is truncated right after the TIME STATISTICS block.

TIME STATISTICS
---------------------------------------------------------------
   CLASS_NAME         NAME       TIME/s  CALLS   AVG/s  PER/%  
---------------------------------------------------------------
...
---------------------------------------------------------------

# no output below!

The block starting with

 NAME-------------------------|MEMORY(MB)------------------

is never written.

In this case, stderr reports a segmentation fault:

[cn4053:721248:0:721248] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x6c656e72656b57)
==== backtrace (tid: 721248) ====
 0  /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x2d4) [0x40003df0979c]
 1  /usr/local/ucx/lib/libucs.so.0(+0x2a92c) [0x40003df0992c]
 2  /usr/local/ucx/lib/libucs.so.0(+0x2acd4) [0x40003df09cd4]
 3  linux-vdso.so.1(__kernel_rt_sigreturn+0) [0x400039ed65b8]
 4  /lib/aarch64-linux-gnu/libc.so.6(cfree+0x24) [0x40003d468a2c]
 5 /abacus-develop/build_dsp/abacus_dsp(+0x6e704c) [0xaaaadf64204c]
 6  /abacus-develop/build_dsp/abacus_dsp(+0x55f8d8) [0xaaaadf4ba8d8]
 7  /abacus-develop/build_dsp/abacus_dsp(+0x3769cc) [0xaaaadf2d19cc]
 8  /abacus-develop/build_dsp/abacus_dsp(+0x376220) [0xaaaadf2d1220]
 9  /abacus-develop/build_dsp/abacus_dsp(+0x3763dc) [0xaaaadf2d13dc]
10  /abacus-develop/build_dsp/abacus_dsp(+0x99d48) [0xaaaadeff4d48]
11  /lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0xe8) [0x40003d40fe10]
12  /abacus-develop/build_dsp/abacus_dsp(+0x99bc8) [0xaaaadeff4bc8]
=================================
srun: error: cn4053: task 20: Segmentation fault
slurmstepd: error:  mpi/pmix_v3: _errhandler: cn4053 [5]: pmixp_client_v2.c:211: Error handler invoked: status = -25, source = [slurm.pmix.5500412.0:20]
slurmstepd: error: *** STEP 5500412.0 ON cn4048 CANCELLED AT 2026-04-19T21:23:54 ***
srun: Job step aborted: Waiting up to 302 seconds for job step to finish.
srun: error: cn4053: tasks 21-23: Killed
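
A side observation on the log above, offered as a hint rather than a diagnosis: every non-zero byte of the faulting address 0x6c656e72656b57 is printable ASCII. The sketch below decodes it, assuming little-endian byte order (the usual configuration for aarch64 Linux, but not confirmed for this machine). The helper name `address_as_ascii` is made up for illustration. The second call assumes that glibc's free() was reading the chunk size field 8 bytes before the pointer it was given, which would make the pointer itself 8 bytes higher than the faulting address.

```python
def address_as_ascii(addr: int, width: int = 8) -> str:
    """Render a pointer value as the characters its bytes would spell in memory.

    Assumes little-endian byte order. Printable bytes are shown as characters,
    NUL bytes as the two characters backslash-zero, everything else as '.'.
    """
    raw = addr.to_bytes(width, "little")
    out = []
    for b in raw:
        if 0x20 <= b < 0x7F:
            out.append(chr(b))
        elif b == 0:
            out.append("\\0")
        else:
            out.append(".")
    return "".join(out)


# Faulting address copied from the "address not mapped" line in stderr.
FAULT_ADDR = 0x6C656E72656B57

print(address_as_ascii(FAULT_ADDR))      # Wkernel\0
# Assumption: free() faulted while reading the size field at (pointer - 8).
print(address_as_ascii(FAULT_ADDR + 8))  # _kernel\0
```

If this reading is right, the pointer handed to free() in frame 4 had been overwritten with string data ending in "kernel", which would point to a buffer overrun or a use-after-free near the MEMORY print rather than to a bug inside free() itself.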

Expected behavior

After TIME STATISTICS, running_scf.log should contain a memory block like the following, and the run should finish without a segfault:

 NAME-------------------------|MEMORY(MB)------------------
                         total      6221.1736
                SDFT::chi0_cpu      2120.1782
...
 -------------   < 1.0 MB has been ignored ----------------
 ----------------------------------------------------------

To Reproduce

ABACUS Release v3.9.0.27

Environment

No response

Additional Context

CASE: sdft scf, C-sdft-8atom-700eV-3.5η

issue7269.tar.gz

Labels

GPU & DCU & HPC (any GPU, DCU, or HPC related issue)