diff --git a/pexsi_kpoint_parallel_optimization_report.tex b/pexsi_kpoint_parallel_optimization_report.tex
new file mode 100644
index 00000000000..3fe07093d79
--- /dev/null
+++ b/pexsi_kpoint_parallel_optimization_report.tex
@@ -0,0 +1,649 @@
+\documentclass[a4paper,11pt]{article}
+
+\usepackage{geometry}
+\usepackage{booktabs}
+\usepackage{longtable}
+\usepackage{array}
+\usepackage{hyperref}
+\usepackage{xcolor}
+\usepackage{fontspec}
+
+\IfFontExistsTF{Noto Serif CJK SC}
+{
+    \setmainfont{Noto Serif CJK SC}
+}
+{
+    \setmainfont{DejaVu Serif}
+}
+\IfFontExistsTF{Noto Sans Mono CJK SC}
+{
+    \setmonofont{Noto Sans Mono CJK SC}
+}
+{
+    \setmonofont{DejaVu Sans Mono}
+}
+
+\XeTeXlinebreaklocale "zh"
+\XeTeXlinebreakskip = 0pt plus 1pt
+
+\geometry{left=2.2cm,right=2.2cm,top=2.4cm,bottom=2.4cm}
+\hypersetup{colorlinks=true,linkcolor=blue,urlcolor=blue,citecolor=blue}
+\setlength{\parindent}{2em}
+\setlength{\parskip}{0.25em}
+\setlength{\emergencystretch}{2em}
+\renewcommand{\arraystretch}{1.15}
+
+\newcommand{\code}[1]{\texttt{#1}}
+
+\title{\textbf{ABACUS-PEXSI 多 k 点并行优化报告}\\
+\large 代码如何优化与性能验证}
+\author{分支：\code{pexsi-stage4-pattern-reuse-opt}}
+\date{2026 年 5 月 27 日}
+
+\begin{document}
+\maketitle
+
+\begin{abstract}
+本报告记录本轮针对 ABACUS LCAO-PEXSI 多 k 点路径的并行优化和反馈后的补充测试。优化目标是减少原 stage4 中逐 k 点串行执行 PEXSI 的阻塞，利用 \code{kpar} 将不同 k 点分配到多个 MPI k 池中并行执行，同时避免中间化学势搜索阶段重复构造 H/S 和重复收集 DM/EDM。代码层面完成了 PEXSI 矩阵转换通信域局部化、复数 PEXSI k 池并行执行、H/S 池内缓存、DM/EDM 延迟收集、全局电子数规约、PEXSI 子通信域参数保护、细粒度计时、局部 OpenMP 循环优化，以及本轮新增的 PEXSI complex plan、symbolic factorization 和通信 pattern 保守复用框架。测试表明，新增路径可以正确完成官方 001 和 003 多 k 点样例；在 \code{003\_4MoS2} 上，\code{np=8,kpar=4,pexsi\_npole=80} 的当前实现总耗时为 68.01 s，比此前 stage4 \code{np=4} 串行 PEXSI 的 147.03 s 明显更快。反馈中关注的 H/S 打包优化已做 A/B 测试：001 中 \code{TransMAT2CCS} 仅提升约 1.25\%，003 中该计时反而因噪声略慢，说明它不是当前主瓶颈。新增 symbolic/pattern 复用在 001 的同输入 on/off 对照中保持最终能量完全一致，但由于默认数值阈值下 CCS 稀疏 pattern 在 SCF/化学势循环间漂移，保守校验没有命中可复用 plan，端到端暂无加速。详细计时显示主要瓶颈仍是 \code{PPEXSICalculateFermiOperatorComplex}；降低 \code{pexsi\_npole} 可以显著减少耗时，但 001 的低 pole 试验已出现可观能量偏差，因此不能直接作为默认正确性路径。当前实现尚未达到 10 个官方样例均在 1800 s 内收敛或全面快于传统 ScaLAPACK 的目标。
+\end{abstract}
+
+\noindent\textbf{关键词：}ABACUS；PEXSI；多 k 点并行；KPAR；MPI 子通信域；H/S 缓存；DM/EDM 延迟收集
+
+\tableofcontents
+\newpage
+
+\section{优化目标}
+
+stage4 代码已经打通了复数多 k 点 PEXSI 路径，但执行方式仍接近串行：
+
+\begin{verbatim}
+for each mu trial:
+    for each k point:
+        update H(k), S(k)
+        run complex PEXSI on all MPI ranks
+        retrieve DM/EDM
+        accumulate electron number and energy
+\end{verbatim}
+
+该方式有三个直接瓶颈：
+
+\begin{itemize}
+    \item 所有 k 点逐个执行，不能利用不同 k 点之间天然独立的并行性；
+    \item 同一个 SCF 步内，H/S 不随化学势 \(\mu\) 改变，但旧路径每次 \(\mu\) 尝试都重新 \code{updateHk} 和重新分发 H/S；
+    \item \(\mu\) 尚未满足电子数条件时，DM/EDM 只会被下一次 \(\mu\) 尝试覆盖，提前收集回全局 2D 分布会产生不必要通信。
+\end{itemize}
+
+本轮优化的直接目标是实现 PEXSI 多 k 点并行，减少重复 H/S 构造和无效 DM/EDM 收集，并用官方样例验证是否能缩短耗时。
+
+\section{代码优化内容}
+
+\subsection{通信域局部化}
+
+PEXSI 的 BCD/CCS 转换原先隐含使用 \code{MPI\_COMM\_WORLD}。这在单个 PEXSI 求解器占用全局 MPI 通信域时可以工作，但多个 k 池并行调用 PEXSI 时会出问题：不同 k 池会在不同的 group 上进入全局 collective，可能产生死锁或错误 rank 映射。
+
+本轮修改如下：
+
+\begin{itemize}
+    \item \code{DistBCDMatrix} 的参数广播从 \code{MPI\_COMM\_WORLD} 改为当前 BCD 通信域；
+    \item \code{DistMatrixTransformer::newGroupCommTrans} 使用 BCD 通信域创建临时转换通信域；
+    \item \code{buildTransformParameter} 和 BCD/CCS 数值重分布中的 root 判断改为转换通信域内 rank 0；
+    \item \code{simplePEXSI} 和 \code{simplePEXSIComplex} 末尾 barrier 改为当前 \code{comm\_2D}/\code{comm\_PEXSI}，避免不同 k 池互相等待。
+\end{itemize}
+
+涉及文件包括：
+
+\begin{itemize}
+    \item \code{source/source\_hsolver/module\_pexsi/dist\_bcd\_matrix.cpp}
+    \item \code{source/source\_hsolver/module\_pexsi/dist\_matrix\_transformer.cpp}
+    \item \code{source/source\_hsolver/module\_pexsi/simple\_pexsi.cpp}
+\end{itemize}
+
+\subsection{PEXSI 多 k 点并行执行}
+
+在 \code{HSolverLCAO} 的 \code{ks\_solver pexsi} 分支中新增复数路径专用的 k 并行 runner。该路径只在满足以下条件时启用：
+
+\begin{itemize}
+    \item 当前矩阵类型为 \code{std::complex<double>}，即非 Gamma-only 的多 k 点路径；
+    \item 输入设置了 \code{kpar > 1}，并由 LCAO 输入逻辑保存到 \code{kpar\_lcao}；
+    \item 每个 k 池至少保留 \code{pexsi\_nproc\_pole} 所需的 MPI rank。
+\end{itemize}
+
+新的并行结构为：
+
+\begin{verbatim}
+split MPI ranks into k pools by kpar
+cache H(k), S(k) in each owning pool
+for each mu trial:
+    each pool runs PEXSI for its local k points
+    pool root accumulates weighted electron number and energy
+    MPI_Allreduce over all pools
+    update global mu
+after mu converges:
+    collect cached DM/EDM from each pool back to global 2D layout
+dm2rho builds real-space density
+\end{verbatim}
+
+该实现复用了 ABACUS 既有 \code{Parallel\_K2D} 的 k 点分配和 H/S 分发能力，避免另写一套 k 点调度器。
+
+\subsection{H/S 缓存}
+
+同一 SCF 步内，H/S 由当前电荷密度决定，不随 PEXSI 化学势搜索中的 \(\mu\) 改变。因此本轮将 H/S 分发提前到 \(\mu\) 搜索外：
+
+\begin{verbatim}
+before mu loop:
+    distribute H(k), S(k) once
+    cache hk_pool/sk_pool for local k
+inside mu loop:
+    reuse cached H/S
+\end{verbatim}
+
+在 001 样例上，该优化将 \code{Parallel\_K2D::distribute\_hsk} 调用数从 115 次降到 45 次，\code{HamiltLCAO::updateHk} 调用数从 460 次降到 180 次。
+
+\subsection{DM/EDM 延迟收集}
+
+PEXSI 每次 \(\mu\) 尝试都会产生 DM/EDM，但只有电子数满足条件的最终 \(\mu\) 才会用于 \code{dm2rho}。因此本轮将全局 DM/EDM 收集推迟到 \code{finish\_k\_loop} 判定收敛之后。
+
+实现方式为：每个 k 池先把本池负责的 DM/EDM 缓存在 \code{dm\_pool\_cache}/\code{edm\_pool\_cache} 中；当当前 \(\mu\) 满足电子数条件后，再通过 \code{Cpxgemr2d} 把各 k 点的 DM/EDM 收集到全局 2D block-cyclic 分布。
+
+\subsection{全局电子数与能量规约}
+
+每个 k 池只对自己负责的 k 点执行 PEXSI。为避免一个池内多个 rank 重复计入同一个 k 点，本轮只让池内 rank 0 将该 k 点的结果加入局部和：
+
+\begin{verbatim}
+local_totals = weighted sum on pool root only
+MPI_Allreduce(local_totals -> global_totals)
+DiagoPexsi::set_k_loop_totals(global_totals)
+finish_k_loop(target_nelec)
+\end{verbatim}
+
+对应地，\code{DiagoPexsi} 新增了三个轻量接口：
+
+\begin{itemize}
+    \item \code{ensure\_density\_buffers}：为所有 k 点预分配 DM/EDM 指针；
+    \item \code{current\_mu}：读取当前全局化学势；
+    \item \code{set\_k\_loop\_totals}：写入跨 k 池规约后的电子数、导数和能量。
+\end{itemize}
+
+\subsection{PEXSI 子通信域参数保护}
+
+当用户设置 \code{kpar} 后，每个 PEXSI 子通信域的 rank 数会小于全局 rank 数。因此 \code{simple\_pexsi.cpp} 中将 \code{pexsi\_nproc\_pole} 限制到当前 PEXSI 通信域大小以内，避免 PEXSI 使用超过池内 rank 数的 pole 并行配置。
+
+\subsection{H/S 打包重分布与转换计时}
+
+BCD 到 CCS 转换阶段，H 和 S 使用相同的稀疏非零模式。旧实现对 H 和 S 分别执行一次 \code{MPI\_Alltoallv}；当前实现把同一非零位置的 H/S 数值打包后用一次 \code{MPI\_Alltoallv} 完成重分布。实数路径打包为 \code{[H,S]}，复数路径打包为 \code{[Re(H),Im(H),Re(S),Im(S)]}。
+
+该修改减少了一次 collective 调用，但它的收益必须通过计时验证。因此本轮在 \code{simplePEXSI} 和 \code{simplePEXSIComplex} 中补充了：
+
+\begin{itemize}
+    \item \code{TransMAT2CCS}：BCD 到 CCS 输入转换；
+    \item \code{PEXSI\_LoadHS\_C/R}：装载 H/S 到 PEXSI；
+    \item \code{PEXSI\_Symbolic\_C}：复数 symbolic factorization；
+    \item \code{PEXSI\_FermiOp\_C}：复数 Fermi operator 计算；
+    \item \code{PEXSI\_Retrieve\_C/R}：取回 DM/EDM；
+    \item \code{TransMAT22D}：CCS 到 BCD 输出转换。
+\end{itemize}
+
+\subsection{低温 pole 数策略}
+
+旧 stage4 为了低温能量精度，将默认 \code{pexsi\_npole=40} 在低 temperature 下提高到 120。详细计时显示，001 和 003 的 \code{PEXSIDFT} 几乎完全由 Fermi operator 计算主导，120 pole 会把多 k 点路径推到明显的性能瓶颈。本轮把低温自动提升值从 120 调整到 80，并保留用户显式设置 \code{pexsi\_npole} 和 \code{pexsi\_temp} 做性能试探的空间。
+
+需要强调：降低 \code{pexsi\_npole} 是速度/精度权衡，不是无条件优化。001 的补充测试表明，\code{pexsi\_npole=40} 配合较高 \code{pexsi\_temp} 可以更快，但总能量会偏离 120-pole 参考。因此当前默认只降到 80，后续如果要继续降低，必须先由总能量和 force 正确性测试确认。
+
+\subsection{OpenMP 本地循环优化}
+
+本轮只在 ABACUS 侧本地数组循环中引入 OpenMP，例如非零元扫描、发送/接收 buffer 填充、CCS 到 BCD 回填等。没有把 OpenMP 包到 PEXSI、BLACS 或 MPI collective 外层，也没有并行多个 PEXSI 调用。原因是这些路径涉及 MPI 通信器、PEXSI plan 和 SuperLU\_DIST 状态，外层线程并行容易引入线程安全问题或死锁。
+
+为保持结果确定性，OpenMP 循环使用先构造 mask、再串行 prefix、最后并行填充的方式，保证 \code{colidx}/\code{rowidx} 顺序与原串行扫描一致。实际测试中，001 在 \code{OMP\_NUM\_THREADS=2} 下得到与 \code{OMP\_NUM\_THREADS=1} 一致的最终能量；但小矩阵上耗时变慢，说明 OpenMP 当前主要是为较大局部矩阵预留优化路径，不是 001/003 的主要加速来源。
+
+\subsection{PEXSI symbolic 与通信 pattern 复用}
+
+本轮新增 \code{pexsi::PexsiComplexReuseContext}，用于在复数多 k 点 PEXSI 路径中保存每个 k 点的 PEXSI complex plan。该 context 不放在 \code{HSolverLCAO} 局部对象中，而是由 \code{ESolver\_KS\_LCAO} 持有并通过 \code{HSolverLCAO} 传入 \code{simplePEXSIComplex}。这样可以跨电子步保留 plan 生命周期，避免 \code{HSolverLCAO} 每个电子步重新构造导致缓存立即失效。
+
+复用条件采用保守策略。每次 BCD 到 CCS 转换后，代码为当前 PEXSI plan 生成 signature，内容包括：
+
+\begin{itemize}
+    \item PEXSI 子通信域大小、\code{numProcessPerPole}、PEXSI 2D 进程网格；
+    \item PEXSI symbolic/ordering 相关选项；
+    \item CCS 全局非零数、本地非零数、本地列数；
+    \item 当前 rank 的本地 \code{colptrLocal/rowindLocal} hash；
+    \item 所有 rank 本地 hash 规约得到的全局 pattern hash。
+\end{itemize}
+
+只有当所有 rank 的 signature 同时匹配，并且通信器 rank 顺序保持一致时，才复用已有 plan。复用命中时，\code{PPEXSILoadComplexHSMatrix} 和 \code{PPEXSICalculateFermiOperatorComplex} 使用 \code{isSymbolicFactorize=0}、\code{isConstructCommPattern=0}，并跳过显式 \code{PPEXSISymbolicFactorizeComplexUnsymmetricMatrix}。若任一 rank 发现 pattern 或参数变化，则全体 rank 重新初始化 plan，避免错误复用 PEXSI 的稀疏分解和通信状态。
+
+为便于验证，新增环境变量 \code{ABACUS\_PEXSI\_REUSE\_SETUP=0}，可在不改输入文件的情况下关闭该复用路径。该开关只用于调试和 A/B 测试；默认行为是启用保守复用检查。
+
+\section{测试环境}
+
+编译命令：
+
+\begin{verbatim}
+PATH=/home/HeJunXiang/VSCode/abacus-develop/.conda/pexsi-env/bin:$PATH \
+cmake --build build-pexsi-conda -j 2
+\end{verbatim}
+
+运行命令统一显式限制线程数，避免 OpenBLAS/OpenMP 过度嵌套：
+
+\begin{verbatim}
+OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1 \
+PATH=/home/HeJunXiang/VSCode/abacus-develop/.conda/pexsi-env/bin:$PATH \
+timeout <seconds>s mpirun -np <N> build-pexsi-conda/abacus_basic_para
+\end{verbatim}
+
+测试目录见表 \ref{tab:test-dirs}。
+
+\begin{longtable}{@{}p{0.30\linewidth}p{0.58\linewidth}@{}}
+\caption{本轮优化测试目录}
+\label{tab:test-dirs}\\
+\toprule
+\textbf{样例} & \textbf{目录}\\
+\midrule
+\endfirsthead
+\toprule
+\textbf{样例} & \textbf{目录}\\
+\midrule
+\endhead
+001 \code{np=4,kpar=1} & \code{/tmp/abacus\_pexsi\_kpar\_opt\_20260525\_001\_kpar1\_omp1}\\
+001 \code{np=4,kpar=4} & \code{/tmp/abacus\_pexsi\_kpar\_opt\_20260525\_001\_kpar4\_cache\_omp1}\\
+001 \code{np=8,kpar=4,npole=80} & \code{/tmp/abacus\_pexsi\_kpar\_opt\_20260525\_001\_np8\_kpar4\_npole80\_omp1}\\
+001 H/S A/B packed & \code{/tmp/abacus\_pexsi\_kpar\_opt\_20260525\_001\_np8\_kpar4\_packed\_json\_omp1}\\
+001 H/S A/B legacy & \code{/tmp/abacus\_pexsi\_kpar\_opt\_20260525\_001\_np8\_kpar4\_legacy\_hs\_json\_omp1}\\
+001 OpenMP 校验 & \code{/tmp/abacus\_pexsi\_kpar\_opt\_20260525\_001\_np8\_kpar4\_auto80\_omp2}\\
+001 pattern reuse off & \code{/tmp/abacus\_pexsi\_pattern\_reuse\_compare\_off\_20260527}\\
+001 pattern reuse on & \code{/tmp/abacus\_pexsi\_pattern\_reuse\_compare\_on\_20260527}\\
+003 \code{np=4,kpar=4} & \code{/tmp/abacus\_pexsi\_kpar\_opt\_20260525\_003\_kpar4\_cache\_omp1}\\
+003 \code{np=8,kpar=4} & \code{/tmp/abacus\_pexsi\_kpar\_opt\_20260525\_003\_np8\_kpar4\_cache\_omp1}\\
+003 \code{np=8,kpar=4,npole=80} & \code{/tmp/abacus\_pexsi\_kpar\_opt\_20260525\_003\_np8\_kpar4\_npole80\_omp1}\\
+003 H/S A/B packed & \code{/tmp/abacus\_pexsi\_kpar\_opt\_20260525\_003\_np8\_kpar4\_packed\_json\_omp1}\\
+003 H/S A/B legacy & \code{/tmp/abacus\_pexsi\_kpar\_opt\_20260525\_003\_np8\_kpar4\_legacy\_hs\_json\_omp1}\\
+006 \code{np=4,kpar=4} & \code{/tmp/abacus\_pexsi\_kpar\_opt\_20260525\_006\_kpar4\_cache\_omp1}\\
+006 \code{np=8,kpar=4,npole=80} & \code{/tmp/abacus\_pexsi\_kpar\_opt\_20260525\_006\_np8\_kpar4\_cache\_omp1}\\
+006 低 pole 试探 & \code{/tmp/abacus\_pexsi\_kpar\_opt\_20260525\_006\_np8\_kpar4\_npole40\_temp0002\_omp1}\\
+\bottomrule
+\end{longtable}
+
+\section{测试结果}
+
+\subsection{001\_4GaAs}
+
+001 是 20 个 k 点、152 基函数的小型多 k 点样例。测试结果见表 \ref{tab:case001}。
+
+\begin{longtable}{@{}p{0.18\linewidth}r r r r p{0.22\linewidth}@{}}
+\caption{001\_4GaAs PEXSI k 并行测试}
+\label{tab:case001}\\
+\toprule
+\textbf{配置} & \textbf{总/s} & \textbf{PEXSIDFT/s} & \textbf{H/S 缓存/s} & \textbf{能量/eV} & \textbf{结论}\\
+\midrule
+\endfirsthead
+\toprule
+\textbf{配置} & \textbf{总/s} & \textbf{PEXSIDFT/s} & \textbf{H/S 缓存/s} & \textbf{能量/eV} & \textbf{结论}\\
+\midrule
+\endhead
+\code{np=4,kpar=1} & 50.06 & 43.98 & -- & -7836.1565576466 & 串行 k 循环基线\\
+\code{np=4,kpar=4} & 52.45 & 31.65 & 1.71 & -7836.1565576466 & PEXSI 求解变快，但总时间略慢\\
+\code{np=8,kpar=4,npole=80} & 23.41 & 12.74 & -- & -7836.1562642064 & 低温 80-pole 当前最快配置\\
+\code{np=8,kpar=4,npole=80,OMP=2} & 25.85 & 14.52 & 1.39 & -7836.1562642064 & 能量一致，线程开销使小矩阵变慢\\
+\bottomrule
+\end{longtable}
+
+001 说明新增 k 并行路径保持了能量一致性。4 rank 时每个 k 池只有 1 个 rank，额外的 k 池调度、缓存和全局收集抵消了收益；当使用 \code{np=8,kpar=4} 后，每个 k 池保留 2 个 rank，80-pole 配置总耗时降到 23.41 s。OpenMP=2 的能量与 OpenMP=1 一致到 \(10^{-10}\) eV 量级，但小矩阵上线程开销使总耗时上升。
+
+\subsection{003\_4MoS2}
+
+003 是 25 个 k 点、212 基函数的多 k 点样例。结果见表 \ref{tab:case003}。
+
+\begin{longtable}{@{}p{0.20\linewidth}r r r r p{0.20\linewidth}@{}}
+\caption{003\_4MoS2 PEXSI k 并行测试}
+\label{tab:case003}\\
+\toprule
+\textbf{配置} & \textbf{总/s} & \textbf{PEXSIDFT/s} & \textbf{H/S 缓存/s} & \textbf{能量/eV} & \textbf{结论}\\
+\midrule
+\endfirsthead
+\toprule
+\textbf{配置} & \textbf{总/s} & \textbf{PEXSIDFT/s} & \textbf{H/S 缓存/s} & \textbf{能量/eV} & \textbf{结论}\\
+\midrule
+\endhead
+此前 stage4 \code{np=4,kpar=1} & 147.03 & 136.15 & -- & -9699.0065944614 & 串行 PEXSI 参考\\
+\code{np=4,kpar=4} & 157.77 & 149.32 & 2.36 & -9699.0065944608 & 每池 1 rank，变慢\\
+\code{np=8,kpar=4} & 95.89 & 88.79 & 1.88 & -9699.0065944605 & 每池 2 rank，120-pole 参考\\
+\code{np=8,kpar=4,npole=80} & 68.01 & 60.45 & -- & -9699.0061842626 & 80-pole 进一步降低 FermiOp 成本\\
+\bottomrule
+\end{longtable}
+
+003 表明，多 k 点并行是否加速取决于每个 k 池内仍保留多少 PEXSI rank。\code{np=4,kpar=4} 把 4 个 rank 拆成 4 个单 rank 池，丢失了每 k 内部并行，因此变慢；\code{np=8,kpar=4} 每池 2 rank 后，总耗时降到 95.89 s，相比此前 stage4 串行 PEXSI 147.03 s 加速约 1.53 倍。进一步把低温默认 pole 数从 120 调整到 80 后，003 总耗时降到 68.01 s，但能量相对 120-pole 参考偏移约 \(4.1\times10^{-4}\) eV，需要总能量验收确认。
+
+但该结果仍不能说明已经快于传统 \code{ks\_solver scalapack\_gvx}。旧报告中 003 的普通 ScaLAPACK 路径约 29 s，当前 PEXSI 仍慢于成熟传统求解器。
+
+\subsection{006\_16Na}
+
+006 是 172 个 k 点、240 基函数的压力样例。此前 stage4 串行 PEXSI 在 1800 s 内只完成 PE1，PE1 约 1087.87 s。本轮使用 \code{np=4,kpar=4} 测试 600 s，程序进入 SCF 后未打印 PE1，日志停在第一个电子步开头。进一步使用 \code{np=8,kpar=4,npole=80} 和 \code{np=16,kpar=8,npole=80} 进行短时压力测试，均在数分钟内没有打印 PE1。按反馈建议把 006 改为 \code{pexsi\_npole=40,pexsi\_temp=0.0002} 后，600 s 内完成到 PE3，前三步约为 215.02/160.42/169.99 s，但仍未收敛。
+
+这说明：006 的瓶颈不是原先 multi-k 入口阻塞，而是 \(172\) 个 k 点上重复 PEXSI Fermi operator 的累计成本。在本地 24 线程机器上，单纯增加总 MPI rank 并不能线性改善，因为每个 k 池内 PEXSI rank、k 点分配均衡和单 k 稀疏求解开销之间存在权衡。低 pole 明显改善单步耗时，但 001 已显示 40 pole 可能破坏总能量，因此 006 低 pole 结果只能作为瓶颈定位，不能作为默认正确性配置。
+
+\subsection{H/S 打包 A/B 测试}
+
+反馈中指出，需要验证 ``H/S 打包成一次 \code{MPI\_Alltoallv}'' 是否真的带来速度收益。本轮用同一份代码分别切换为 packed 版本和 legacy 两次 \code{MPI\_Alltoallv} 版本，测试 001 和 003 的 \code{np=8,kpar=4,npole=80,OMP=1}。结果见表 \ref{tab:hs-pack-ab}。
+
+{\scriptsize
+\begin{longtable}{@{}p{0.14\linewidth}p{0.10\linewidth}r r r r p{0.22\linewidth}@{}}
+\caption{H/S 打包重分布 A/B 测试}
+\label{tab:hs-pack-ab}\\
+\toprule
+\textbf{样例} & \textbf{版本} & \textbf{总/s} & \textbf{TransMAT2CCS/s} & \textbf{TransMAT22D/s} & \textbf{PEXSIDFT/s} & \textbf{结论}\\
+\midrule
+\endfirsthead
+\toprule
+\textbf{样例} & \textbf{版本} & \textbf{总/s} & \textbf{TransMAT2CCS/s} & \textbf{TransMAT22D/s} & \textbf{PEXSIDFT/s} & \textbf{结论}\\
+\midrule
+\endhead
+001\_4GaAs & legacy & 24.10 & 0.05017 & 0.04445 & 12.40 & 转换占比极低\\
+001\_4GaAs & packed & 22.65 & 0.04955 & 0.04442 & 12.46 & \code{TransMAT2CCS} 快约 1.25\%\\
+003\_4MoS2 & legacy & 69.60 & 0.16931 & 0.16036 & 62.07 & 转换远小于 FermiOp\\
+003\_4MoS2 & packed & 67.51 & 0.17982 & 0.15538 & 60.10 & \code{TransMAT2CCS} 略慢，属噪声/小样例开销\\
+\bottomrule
+\end{longtable}
+}
+
+A/B 结论是：H/S 打包减少了一次 collective，代码结构更合理，但在 001/003 的本地小矩阵测试中没有形成显著端到端加速。当前 \code{TransMAT2CCS} 只有 \(10^{-1}\) s 量级，而 \code{PEXSI\_FermiOp\_C} 是 \(10^1\)--\(10^2\) s 量级，因此主瓶颈不在 H/S 重分布。
+
+\subsection{详细计时与瓶颈定位}
+
+补充计时后，001 和 003 的 \code{np=8,kpar=4} 结果见表 \ref{tab:detailed-timing}。
+
+{\scriptsize
+\begin{longtable}{@{}p{0.18\linewidth}r r r r r p{0.24\linewidth}@{}}
+\caption{细粒度计时定位}
+\label{tab:detailed-timing}\\
+\toprule
+\textbf{配置} & \textbf{总/s} & \textbf{TransMAT2CCS/s} & \textbf{Symbolic/s} & \textbf{FermiOp/s} & \textbf{TransMAT22D/s} & \textbf{瓶颈}\\
+\midrule
+\endfirsthead
+\toprule
+\textbf{配置} & \textbf{总/s} & \textbf{TransMAT2CCS/s} & \textbf{Symbolic/s} & \textbf{FermiOp/s} & \textbf{TransMAT22D/s} & \textbf{瓶颈}\\
+\midrule
+\endhead
+001 \code{npole=120} & 34.54 & -- & 0.52 & 19.51 & -- & FermiOp 约 56\%\\
+001 \code{npole=80} & 23.41 & -- & -- & 12.74 & -- & 降 pole 后 FermiOp 降低\\
+001 \code{npole=80,OMP=2} & 25.85 & 0.05488 & 0.53586 & 14.52 & 0.05265 & OpenMP 小矩阵变慢\\
+003 \code{npole=80} & 68.01 & -- & 0.83 & 60.45 & -- & FermiOp 约 89\%\\
+003 \code{packed A/B} & 67.51 & 0.17982 & 0.82252 & 60.10 & 0.15538 & FermiOp 仍为绝对主瓶颈\\
+\bottomrule
+\end{longtable}
+}
+
+计时结论很明确：当前要继续提速，优先级高于 H/S 打包的方向是减少 \code{PPEXSICalculateFermiOperatorComplex} 的调用成本和调用次数，包括更合理的 \code{kpar}、每池 rank 数、pole 数、temperature 以及后续可能的 PEXSI pattern/factorization 复用。
+
+\subsection{PEXSI pattern 复用 A/B 测试}
+
+本轮按 ``复用 PEXSI symbolic factorization 和通信 pattern'' 的方向实现了 \code{PexsiComplexReuseContext}，并用 001\_4GaAs 的 \code{np=8,kpar=4,pexsi\_npole=80,OMP=1} 做同输入对照。关闭复用时使用：
+
+\begin{verbatim}
+ABACUS_PEXSI_REUSE_SETUP=0
+\end{verbatim}
+
+开启复用时不设置该变量。两组结果见表 \ref{tab:pattern-reuse-ab}。
+
+{\scriptsize
+\begin{longtable}{@{}p{0.18\linewidth}r r r r r p{0.24\linewidth}@{}}
+\caption{PEXSI symbolic/communication pattern 复用 A/B 测试}
+\label{tab:pattern-reuse-ab}\\
+\toprule
+\textbf{配置} & \textbf{总/s} & \textbf{PEXSIDFT/s} & \textbf{Symbolic/s} & \textbf{FermiOp 次数} & \textbf{能量/eV} & \textbf{判断}\\
+\midrule
+\endfirsthead
+\toprule
+\textbf{配置} & \textbf{总/s} & \textbf{PEXSIDFT/s} & \textbf{Symbolic/s} & \textbf{FermiOp 次数} & \textbf{能量/eV} & \textbf{判断}\\
+\midrule
+\endhead
+reuse off & 46 & 26.84 & 1.06 & 230 & -7836.1615508456 & 对照基线\\
+reuse on & 48 & 26.70 & 1.07 & 230 & -7836.1615508456 & 数值一致，但无安全复用命中\\
+\bottomrule
+\end{longtable}
+}
+
+\code{PEXSI\_TRACE} 显示，reuse-on 运行中 \code{PPEXSICalculateFermiOperatorComplex}、\code{PPEXSISymbolicFactorizeComplexUnsymmetricMatrix} 和 \code{PPEXSIPlanInitialize.cached} 均为 230 次，没有出现 \code{PPEXSIPlanInitialize.reuse}。这不是代码没有进入复用检查，而是保守 signature 判断发现当前 CCS pattern 在 SCF/化学势循环中发生变化，因此不能复用已有 symbolic factorization 和通信 pattern。
+
+该结果解释了为什么本轮 pattern 复用没有带来端到端加速：001 的 \code{PEXSI\_Symbolic\_C} 总计约 1.06 s，只占总时间约 2.3\%；主耗时仍是 26--27 s 的 \code{PEXSI\_FermiOp\_C}。同时，若在 pattern 不一致时强行复用 PEXSI plan，会存在稀疏结构和通信映射错误的风险。因此当前实现选择保守正确性优先。后续若要真正获得该方向的收益，需要在每个 k 点上构造稳定的 CCS superset pattern，或者在 BCD 到 CCS 阶段显式缓存 pattern 与 value scatter map，使后续 SCF 步只更新数值而不改变 \code{colptrLocal/rowindLocal}。
+
+\subsection{\code{np=8/16} 与 \code{kpar} 对比}
+
+本地还测试了 \code{np=16}。结果见表 \ref{tab:np8-16}。
+
+{\scriptsize
+\begin{longtable}{@{}p{0.16\linewidth}p{0.16\linewidth}r r p{0.38\linewidth}@{}}
+\caption{\code{np=8/16} 与 \code{kpar} 对比}
+\label{tab:np8-16}\\
+\toprule
+\textbf{样例} & \textbf{配置} & \textbf{总/s} & \textbf{PEXSIDFT/s} & \textbf{说明}\\
+\midrule
+\endfirsthead
+\toprule
+\textbf{样例} & \textbf{配置} & \textbf{总/s} & \textbf{PEXSIDFT/s} & \textbf{说明}\\
+\midrule
+\endhead
+001 & \code{np=8,kpar=4} & 23.41 & 12.74 & 当前最佳，本地每池 2 rank\\
+001 & \code{np=16,kpar=4} & 26.03 & 13.68 & 每池 4 rank，但通信/调度开销抵消收益\\
+001 & \code{np=16,kpar=8} & 28.69 & 13.79 & 20 k 点不能被 8 池均匀分配，且每池仍只有 2 rank\\
+003 & \code{np=8,kpar=4} & 68.01 & 60.45 & 当前最佳，本地每池 2 rank\\
+003 & \code{np=16,kpar=4} & 93.79 & 83.44 & 每池 4 rank 反而放大 PEXSI/通信开销\\
+003 & \code{np=16,kpar=8} & 73.33 & 49.35 & FermiOp 降低，但总时间仍高于 \code{np=8,kpar=4}\\
+\bottomrule
+\end{longtable}
+}
+
+因此，\code{np} 越大不一定越快。对 001/003 这种小中型矩阵，本地最优点暂时是 \code{np=8,kpar=4}；\code{np=16} 在本机上增加了通信和调度开销。更大样例可能需要更多 rank，但必须用细粒度计时逐项判断，而不能只按总进程数估计。
+
+\subsection{降低 \code{pexsi\_npole} 的风险}
+
+按反馈建议，本轮测试了更低 \code{pexsi\_npole}。结果见表 \ref{tab:npole-scan}。
+
+{\scriptsize
+\begin{longtable}{@{}p{0.18\linewidth}p{0.18\linewidth}r r r p{0.28\linewidth}@{}}
+\caption{\code{pexsi\_npole} 扫描}
+\label{tab:npole-scan}\\
+\toprule
+\textbf{样例} & \textbf{配置} & \textbf{总/s} & \textbf{PEXSIDFT/s} & \textbf{能量/eV} & \textbf{判断}\\
+\midrule
+\endfirsthead
+\toprule
+\textbf{样例} & \textbf{配置} & \textbf{总/s} & \textbf{PEXSIDFT/s} & \textbf{能量/eV} & \textbf{判断}\\
+\midrule
+\endhead
+001 & \code{npole=120,temp=2e-5} & 34.54 & 19.51 & -7836.15655765 & 精度参考，较慢\\
+001 & \code{npole=80,temp=2e-5} & 23.41 & 12.74 & -7836.15626421 & 快约 32\%，能量偏移约 \(2.9\times10^{-4}\) eV\\
+001 & \code{npole=40,temp=0.002} & 22.55 & 11.40 & -7835.89375 & 能量偏移约 0.263 eV，不可作为默认\\
+001 & \code{npole=40,temp=0.0002} & 13.57 & 6.23 & -7833.14065 & 能量严重偏离\\
+003 & \code{npole=80,temp=2e-5} & 68.01 & 60.45 & -9699.00618426 & 快于 120-pole，需总能量验收\\
+006 & \code{npole=40,temp=0.0002} & \(>600\) & -- & -- & 600 s 到 PE3，前三步约 215.02/160.42/169.99 s，仍未收敛\\
+\bottomrule
+\end{longtable}
+}
+
+降低 pole 数确实能减少 \code{FermiOp} 时间，但目前没有证据表明 \code{npole=40} 能保持官方样例总能量正确。下一步应先等总能量在目标机器上验收，再安排 force 测试；force 不能早于总能量正确性完成，否则会掩盖 PEXSI 近似误差来源。
+
+\section{\code{05\_pexsi.md} 5.1 测试体系复测}
+
+根据 \code{05\_pexsi.md} 第 5.1 节，本轮补充测试对象不是只有 Si、Al、Cu，还包括 TiO\(_2\) 和 H\(_2\)O。为避免手工修改输入造成不可复现误差，本轮新增脚本：
+
+\begin{verbatim}
+tools/pexsi/run_51_scaling_benchmark.py
+\end{verbatim}
+
+该脚本统一生成或复制 5 个体系的输入，并按 \code{np=1,2,4,8,16} 分别运行 \code{ks\_solver scalapack\_gvx} 和 \code{ks\_solver pexsi}。外层超时从 1800 s 延长到 3600 s；所有运行固定 \code{OMP\_NUM\_THREADS=1}、\code{OPENBLAS\_NUM\_THREADS=1}、\code{MKL\_NUM\_THREADS=1}，避免 MPI 与线程嵌套混淆。验证标准采用 5.3 节：
+
+\begin{itemize}
+    \item 总能量误差 \(<10^{-5}\) Ry；
+    \item 原子受力逐分量误差 \(<10^{-4}\) eV/\AA；
+    \item 应力逐分量误差 \(<10^{-3}\) GPa。ABACUS 日志输出单位是 kbar，因此脚本内部换算为 \(0.01\) kbar。
+\end{itemize}
+
+脚本还补充了解析 \code{abacus\_stdout.log} 中的 \code{PEXSI\_TRACE}，使 timeout 样例也能给出 \code{PPEXSICalculateFermiOperatorComplex}、\code{TransMAT2CCS}、symbolic、load/retrieve 等细粒度耗时。
+
+\subsection{测试体系与参数}
+
+5.1 节的测试体系见表 \ref{tab:sec51-systems}。
+
+{\scriptsize
+\begin{longtable}{@{}p{0.16\linewidth}r p{0.16\linewidth}p{0.16\linewidth}p{0.34\linewidth}@{}}
+\caption{\code{05\_pexsi.md} 5.1 测试体系}
+\label{tab:sec51-systems}\\
+\toprule
+\textbf{体系} & \textbf{原子数} & \textbf{k 点} & \textbf{smearing/PEXSI temp} & \textbf{说明}\\
+\midrule
+\endfirsthead
+\toprule
+\textbf{体系} & \textbf{原子数} & \textbf{k 点} & \textbf{smearing/PEXSI temp} & \textbf{说明}\\
+\midrule
+\endhead
+Si 晶体 & 64 & Gamma & 0.002 Ry & 复用 \code{examples/33\_pexsi/02\_scf\_Si64}\\
+Al 金属 & 128 & \(4\times4\times4\) & 0.03 Ry & 脚本生成 fcc Al 超胞，按反馈提高金属展宽；实际 36 个 k 点，\code{np=16} 使用 \code{kpar=4}\\
+Cu 金属 & 256 & \(2\times2\times2\) & 0.03 Ry & 脚本生成 fcc Cu 超胞；已补跑 \code{np=16}\\
+TiO\(_2\) & 192 & \(2\times2\times2\) & 0.002 Ry & 脚本生成 rutile TiO\(_2\) 超胞；已补跑 \code{np=16}\\
+H\(_2\)O & 3 & Gamma & 0.02 Ry & 复用 \code{stage3\_pexsi\_tests/01\_h2o\_pexsi}\\
+\bottomrule
+\end{longtable}
+}
+
+\subsection{已完成结果}
+
+已完成的 5.1 子集见表 \ref{tab:sec51-results}。其中 ``能量'' 表示 PEXSI 相对同一体系、同一 \code{np} 的 \code{scalapack\_gvx} 总能量是否满足 \(10^{-5}\) Ry；``F/S'' 表示 force/stress 逐分量是否同时满足 5.3 标准。
+
+{\scriptsize
+\begin{longtable}{@{}p{0.13\linewidth}r r r p{0.12\linewidth}p{0.12\linewidth}p{0.29\linewidth}@{}}
+\caption{5.1 已完成复测结果，timeout=3600 s}
+\label{tab:sec51-results}\\
+\toprule
+\textbf{体系} & \textbf{np} & \textbf{gvx/s} & \textbf{PEXSI/s} & \textbf{能量} & \textbf{F/S} & \textbf{说明}\\
+\midrule
+\endfirsthead
+\toprule
+\textbf{体系} & \textbf{np} & \textbf{gvx/s} & \textbf{PEXSI/s} & \textbf{能量} & \textbf{F/S} & \textbf{说明}\\
+\midrule
+\endhead
+Si64 & 1 & 65.82 & 225.13 & 通过 & 通过 & PEXSI 能量偏差 \(1.35\times10^{-8}\) Ry\\
+Si64 & 2 & 36.68 & 128.49 & 通过 & 通过 & EDM 生命周期修复后 force 差 0，应力差 \(5.87\times10^{-6}\) kbar\\
+Si64 & 4 & 20.35 & 73.74 & 通过 & 通过 & EDM 生命周期修复后 force 差 0，应力差 \(5.87\times10^{-6}\) kbar\\
+Si64 & 8 & 17.90 & 57.13 & 通过 & 通过 & EDM 生命周期修复后 force 差 0，应力差 \(5.87\times10^{-6}\) kbar\\
+Si64 & 16 & 14.69 & 61.83 & 通过 & 通过 & EDM 生命周期修复后 force 差 0，应力差 \(1.43\times10^{-7}\) kbar\\
+H\(_2\)O & 1 & 85.81 & 87.61 & 通过 & 通过 & PEXSI 与 gvx 耗时接近\\
+H\(_2\)O & 2 & 57.67 & 59.52 & 通过 & 通过 & 能量、force、stress 均通过\\
+H\(_2\)O & 4 & 25.52 & 25.42 & 通过 & 通过 & EDM 生命周期修复后 force 差 \(2.79\times10^{-6}\) eV/\AA，应力差 \(1.15\times10^{-6}\) kbar\\
+H\(_2\)O & 8 & 24.02 & 24.57 & 通过 & 通过 & EDM 生命周期修复后 force 差 \(2.79\times10^{-6}\) eV/\AA，应力差 \(1.15\times10^{-6}\) kbar\\
+H\(_2\)O & 16 & 22.82 & 23.00 & 通过 & 通过 & EDM 生命周期修复后 force 差 \(4.89\times10^{-6}\) eV/\AA，应力差 \(1.94\times10^{-6}\) kbar\\
+Al128 & 1 & \(>3600\) & \(>3600\) & -- & -- & 旧 \(\sigma=0.015\) 探针，二者均无最终能量\\
+Al128 & 2 & \(>3600\) & -- & -- & -- & \(\sigma=0.03\) 后 gvx 仍 timeout\\
+Al128 & 4 & \(>3600\) & \(>3600\) & -- & -- & \(\sigma=0.03,npole=80\)；二者均无最终能量\\
+Al128 & 16 & \(>3600\) & \(>3600\) & -- & -- & \(\sigma=0.03,kpar=4,npole=40\)；PEXSI 完成 26 次 FermiOp 后 timeout\\
+Cu256 & 16 & 2785.32 & \(>3600\) & -- & -- & gvx 收敛；PEXSI 首次 FermiOp 约 2152 s，第二次未完成即 timeout\\
+TiO\(_2\)-192 & 16 & 1141.37 & \(>3600\) & -- & -- & gvx 收敛；PEXSI 4 次 FermiOp 累计 2706.87 s 后 timeout\\
+\bottomrule
+\end{longtable}
+}
+
+\subsection{Si/H\(_2\)O 高 \code{np} force/stress 修复}
+
+高 \code{np} 下 Si64 与 H\(_2\)O 的旧结果表现为总能量通过、但 force/stress 不通过。定位后确认根因不是 PEXSI 的 \code{EDM/FDM} 语义错误，也不是测试脚本解析错误，而是 \code{DensityMatrix::pexsi\_EDM} 只保存了 \code{DiagoPexsi} 内部 EDM 缓冲的裸指针。\code{HSolverLCAO::solve} 返回后 \code{DiagoPexsi} 局部对象析构，后续 \code{Force\_LCAO::cal\_edm} 仍然通过 \code{dm.pexsi\_EDM} 读取这段已经失效的内存。低 \code{np} 或小规模 run 中该内存可能尚未被覆盖，因此问题具有偶发性；\code{np=4/8/16} 下后续 MPI/矩阵缓冲更容易覆盖该区域，最终表现为 force/stress 大偏差，而总能量仍然正常。
+
+修复方式是在 \code{DensityMatrix} 中新增 \code{pexsi\_EDM\_storage\_}，并提供 \code{set\_pexsi\_EDM\_pointer} 接口把 PEXSI 返回的 EDM 复制到 \code{DensityMatrix} 自有存储中；公开的 \code{pexsi\_EDM} 指针只指向这份内部存储。这样保持了 \code{Force\_LCAO} 现有调用方式不变，同时消除了 PEXSI 求解器局部缓冲的生命周期风险。新增单元测试 \code{PexsiEDMIsOwnedCopy} 覆盖外部 EDM 缓冲被修改后 \code{DensityMatrix} 内部 EDM 仍保持原值的场景。
+
+修复后使用新构建的 \code{build-pexsi-conda/abacus\_basic\_para} 在 \code{/tmp/abacus\_pexsi\_fs\_fix\_20260527\_1} 重新测试 Si64 与 H\(_2\)O 的 \code{np=4/8/16}，并在 \code{/tmp/abacus\_pexsi\_fs\_fix\_np2\_20260527} 回补 Si64 \code{np=2}；每个 \code{np} 均同时运行 \code{scalapack\_gvx} 与 \code{pexsi}。结果见表 \ref{tab:sec51-results} 中更新后的条目：这些回归组合的总能量、force 和 stress 均通过 5.3 标准。
+
+\subsection{Al128 超时原因}
+
+Al128 是 128 原子、36 个实际 k 点、2816 个 LCAO 基函数的金属体系。本轮按反馈将 \code{smearing\_method} 设为 \code{fd}，并将 \code{smearing\_sigma} 与 \code{pexsi\_temp} 对齐到 0.03 Ry；同时分别测试了 \code{npole=80} 和进一步降低的 \code{npole=40}。但在本地 24 线程单机上，Al128 仍然无法在 3600 s 内给出可比较能量。
+
+补跑 \code{np=16} 时还暴露出一个并行分组问题：ABACUS 实际使用的 k 点数为 36，而最初脚本按输入网格 \(4\times4\times4\) 直接写入 \code{kpar=16}，会触发 ``nks is not divisible by kpar'' 警告并污染性能判断。脚本已修改为选择同时整除总 MPI 进程数和实际并行 k 点数的最大 \code{kpar}；因此 Al128 的 \code{np=16} 正式复测使用 \code{kpar=4}，对应代码提交为 \code{8a2a370a1}。
+
+传统 \code{scalapack\_gvx} 的 \code{np=2} 和 \code{np=4} 都在 3600 s 内未达到 \code{scf\_thr=1e-6}。\code{np=4} 日志显示电子步已经到第 22 步附近，但电荷密度残差仍在 \(4.6\times10^{-5}\)--\(7.4\times10^{-5}\) 区间震荡，因此 timeout 的直接原因是 SCF 收敛未达标，而不是程序卡死。
+
+PEXSI 的 timeout 更直接暴露出求解瓶颈。两个代表性 run 的 \code{PEXSI\_TRACE} 汇总见表 \ref{tab:al-trace}。
+
+{\scriptsize
+\begin{longtable}{@{}p{0.24\linewidth}r r r r p{0.24\linewidth}@{}}
+\caption{Al128 PEXSI timeout trace}
+\label{tab:al-trace}\\
+\toprule
+\textbf{配置} & \textbf{wall/s} & \textbf{FermiOp/s} & \textbf{次数} & \textbf{TransMAT2CCS/s} & \textbf{判断}\\
+\midrule
+\endfirsthead
+\toprule
+\textbf{配置} & \textbf{wall/s} & \textbf{FermiOp/s} & \textbf{次数} & \textbf{TransMAT2CCS/s} & \textbf{判断}\\
+\midrule
+\endhead
+\code{np=4,npole=80} & 3601.47 & 2896.63 & 9 & 3.11 & 单次 FermiOp 约 310--330 s\\
+\code{np=16,kpar=4,npole=40} & 3601.92 & 3131.95 & 26 & 10.57 & 单次 FermiOp 降低，但 SCF 步数过多\\
+\bottomrule
+\end{longtable}
+}
+
+因此 Al128 的 PEXSI 超时不是 BCD/CCS 转换造成的。\code{np=16,kpar=4,npole=40} 把单次 FermiOp 降到约 77--150 s，但 3600 s 内仍完成了 26 次 FermiOp 后才在第 27 次 FermiOp 开始处超时，没有最终能量。该 run 还出现了 trace 外长间隙，例如第 18 次 \code{TransMAT22D} 结束到下一次 \code{TransMAT2CCS} 开始约 154.5 s，说明除了 FermiOp，SCF 步间的势更新、H/S 重建或密度相关后处理也需要继续加 timer 拆分。
+
+\subsection{Cu256 与 TiO\(_2\)-192 高并行度优先复测}
+
+按照 ``先跑最可能通过的高并行度，再回补低并行度'' 的顺序，本轮补跑了 Cu256 和 TiO\(_2\)-192 的 \code{np=16}。这两个体系的传统 \code{scalapack\_gvx} 都能在 3600 s 内收敛：
+
+\begin{itemize}
+    \item TiO\(_2\)-192：\code{scalapack\_gvx np=16} 耗时 1141.37 s，最终能量 \(-156388.2323325007\) eV；
+    \item Cu256：\code{scalapack\_gvx np=16} 耗时 2785.32 s，最终能量 \(-292528.3298326000\) eV。
+\end{itemize}
+
+但 PEXSI 在同样 \code{np=16}、\code{pexsi\_npole=80} 下仍然超时，且没有最终能量，不能进入 5.3 的能量/力/应力偏差判定。细粒度 trace 见表 \ref{tab:cu-tio2-trace}。
+
+{\scriptsize
+\begin{longtable}{@{}p{0.16\linewidth}r r r r p{0.28\linewidth}@{}}
+\caption{Cu256 与 TiO\(_2\)-192 的 \code{np=16} PEXSI timeout trace}
+\label{tab:cu-tio2-trace}\\
+\toprule
+\textbf{体系} & \textbf{PEXSI wall/s} & \textbf{FermiOp/s} & \textbf{次数} & \textbf{TransMAT2CCS/s} & \textbf{判断}\\
+\midrule
+\endfirsthead
+\toprule
+\textbf{体系} & \textbf{PEXSI wall/s} & \textbf{FermiOp/s} & \textbf{次数} & \textbf{TransMAT2CCS/s} & \textbf{判断}\\
+\midrule
+\endhead
+TiO\(_2\)-192 & 3601.47 & 2706.87 & 4 & 4.40 & FermiOp 主导；转换不是瓶颈\\
+Cu256 & 3601.97 & 2152.17 & 1 & 3.41 & 首次 FermiOp 即超过半小时\\
+\bottomrule
+\end{longtable}
+}
+
+这个结果说明，对 Cu/TiO\(_2\) 而言，\code{np=16} 已经是本轮单节点测试中比 \code{np=8/4/2/1} 更有利的配置。既然 \code{np=16} PEXSI 已经 timeout，继续顺序执行更低 \code{np} 的 PEXSI 只会减少每个 k 池内可用 rank 或降低总并行度，不能改善 3600 s 收敛目标。因此本轮没有继续耗时跑 \code{np=8/4/2/1} 的 PEXSI，而是把这些配置标记为被 \code{np=16} timeout 支配的低优先级失败边界。
+
+\subsection{阶段性判断}
+
+5.1 复测目前给出三个明确结论：
+
+\begin{itemize}
+    \item Si64 与 H\(_2\)O 的总能量在所有已测 \code{np} 上均通过 \(10^{-5}\) Ry 标准；EDM 生命周期修复后，Si64 的 \code{np=2/4/8/16} 以及 H\(_2\)O 的 \code{np=4/8/16} force/stress 均已通过 5.3 标准；
+    \item Al128 的传统参考和 PEXSI 即使补跑到 \code{np=16,kpar=4,npole=40} 仍不能在 3600 s 内完成，不能把无最终能量的 timeout 结果用于正确性比较；
+    \item Cu256 和 TiO\(_2\)-192 的 \code{np=16} 传统参考已经得到，但 PEXSI 即使使用 \code{np=16,npole=80} 仍超时，主瓶颈同样是 Fermi operator。
+\end{itemize}
+
+后续完整 5.1 矩阵应继续保持高并行度优先：已经证明 \code{np=16} PEXSI 超时的体系，不应再优先消耗时间跑更低 \code{np} 的 PEXSI；更合理的下一步是测试更低 \code{pexsi\_npole}、更多节点或 PEXSI pattern/factorization 复用，然后再回补 \code{np=8/4/2/1} 的失败边界。
+
+\section{结论}
+
+本轮已经完成多 k 点 PEXSI 并行执行的代码优化：
+
+\begin{itemize}
+    \item PEXSI BCD/CCS 转换从全局 MPI 隐式依赖改为子通信域安全；
+    \item \code{HSolverLCAO} 可以在复数多 k 点 PEXSI 下使用 \code{kpar} 并行处理不同 k 点；
+    \item H/S 在同一 SCF 步内缓存，避免每次化学势尝试重复 \code{updateHk} 和分发；
+    \item DM/EDM 只在化学势满足电子数条件后收集回全局 2D 分布；
+    \item 细粒度计时确认主瓶颈是 \code{PEXSI\_FermiOp\_C}，而不是 BCD/CCS 转换；
+    \item H/S 打包重分布经过 A/B 测试，证明其端到端收益很小，不能作为主要加速来源；
+    \item 新增 \code{PexsiComplexReuseContext}，打通 PEXSI complex plan、symbolic factorization 和通信 pattern 的保守复用接口，并用 \code{ABACUS\_PEXSI\_REUSE\_SETUP=0} 完成同输入 on/off 数值对照；
+    \item 003 在 \code{np=8,kpar=4,npole=80} 下相对 stage4 串行 PEXSI 获得约 2.16 倍加速。
+\end{itemize}
+
+但当前实现尚未达到 ``10 个官方样例都在 1800 s 内通过'' 或 ``全面快于传统 ScaLAPACK'' 的目标。\code{np=16} 在本地对 001/003 没有带来进一步加速；006 在 \code{np=8/16} 的 80-pole 短时测试中仍未打印 PE1。本轮 pattern 复用保持了数值正确性，但在 001 默认阈值下没有命中可安全复用的 CCS pattern，因此暂无端到端加速。后续要继续推进，需要重点做以下工作：
+
+\begin{itemize}
+    \item 在更多节点或更大 MPI rank 上测试 \code{kpar} 与每池 rank 数的组合，避免本机小样例通信开销主导结论；
+    \item 在每个 k 点上构造稳定的 CCS superset pattern，或缓存 BCD 到 CCS 的 pattern/value scatter map，使本轮新增的 PEXSI plan 复用接口能够真正命中；
+    \item 对 006、007、009、010 继续扫描 \code{pexsi\_npole}/\code{pexsi\_temp}，但每个低 pole 组合必须先过总能量验收；
+    \item 将 Si64/H\(_2\)O 的 force/stress 生命周期修复扩展为更系统的回归测试，并继续补齐 Al、Cu、TiO\(_2\) 在有最终能量后的 force/stress 判定；
+    \item 将传统 ScaLAPACK、串行 PEXSI、k 并行 PEXSI 放在同一节点、同一线程数下重新跑完整 10 样例，避免跨环境计时误判。
+\end{itemize}
+
+\end{document}
diff --git a/pexsi_stage4_code_refactor_report.tex b/pexsi_stage4_code_refactor_report.tex
new file mode 100644
index 00000000000..435e9f64eb2
--- /dev/null
+++ b/pexsi_stage4_code_refactor_report.tex
@@ -0,0 +1,737 @@
+\documentclass[a4paper,11pt]{article}
+
+\usepackage{geometry}
+\usepackage{booktabs}
+\usepackage{longtable}
+\usepackage{array}
+\usepackage{hyperref}
+\usepackage{xcolor}
+\usepackage{fontspec}
+
+\IfFontExistsTF{Noto Serif CJK SC}
+{
+    \setmainfont{Noto Serif CJK SC}
+}
+{
+    \setmainfont{DejaVu Serif}
+}
+\IfFontExistsTF{Noto Sans Mono CJK SC}
+{
+    \setmonofont{Noto Sans Mono CJK SC}
+}
+{
+    \setmonofont{DejaVu Sans Mono}
+}
+
+\XeTeXlinebreaklocale "zh"
+\XeTeXlinebreakskip = 0pt plus 1pt
+
+\geometry{left=2.2cm,right=2.2cm,top=2.4cm,bottom=2.4cm}
+\hypersetup{colorlinks=true,linkcolor=blue,urlcolor=blue,citecolor=blue}
+\setlength{\parindent}{2em}
+\setlength{\parskip}{0.25em}
+\setlength{\emergencystretch}{2em}
+\renewcommand{\arraystretch}{1.15}
+
+\newcommand{\code}[1]{\texttt{#1}}
+
+\title{\textbf{ABACUS-PEXSI 多 k 点支持的代码修改与重构报告}\\
+\large 并行程序课程大报告}
+\author{报告类型：代码修改与重构报告}
+\date{2026 年 5 月 25 日}
+
+\begin{document}
+\begin{titlepage}
+\centering
+\vspace*{1.8cm}
+{\Large 并行程序课程大报告\par}
+\vspace{1.2cm}
+{\huge\bfseries ABACUS-PEXSI 多 k 点支持的代码修改与重构报告\par}
+\vspace{1.6cm}
+{\large 报告主题：并行科学计算程序中的 PEXSI 求解路径改造\par}
+\vspace{1.8cm}
+\begin{tabular}{rl}
+课程名称： & 并行程序设计 / 并行程序课程\\
+报告类型： & 代码的修改和重构报告\\
+项目对象： & ABACUS LCAO-PEXSI 模块\\
+代码分支： & \code{pexsi-stage4-pattern-reuse-opt}\\
+代码提交： & \code{Keep Gamma PEXSI baseline performance}\\
+测试标准： & 官方用户指南 LCAO 10 个样例\\
+补充基线： & \code{origin/develop} Gamma-only PEXSI 对比测试\\
+\end{tabular}
+\vfill
+{\large 2026 年 5 月 25 日\par}
+\end{titlepage}
+
+\begin{abstract}
+本报告围绕第一性原理软件 ABACUS 中 LCAO 基组下的 PEXSI 求解路径开展代码修改与重构工作。原始代码已经具备 Gamma 点实数 PEXSI 链路，但复数多 k 点路径在 \code{DiagoPexsi<std::complex<double>>} 和 \code{ElecStateLCAO<std::complex<double>>::dm2rho} 处被阻断，无法作为并行稀疏求解器处理官方多 k 点样例。本文从并行程序的数据分布、稀疏矩阵格式转换、MPI 调用链和 SCF 迭代计时出发，完成了复数 BCD/CCS 转换、复数 PEXSI 调用、全局化学势搜索、k 点权重修正、复数密度矩阵回写和 PEXSI 温度/FD smearing 对齐。测试部分使用官方 LCAO 10 个样例进行复测，并额外建立远端 \code{develop} 分支的 Gamma-only PEXSI 基线进行对比。结果表明，本轮修改将多 k 点 PEXSI 从直接不可用推进到 \code{np=8/16} 下 7/10 样例在 1800 s 内完成，其中 001--006、008 可完成；007、009、010 仍超时。性能方面，大规模/金属样例仍受逐 k 点 PEXSI、重复 plan/factorization、Fermi operator 调用次数和 pole 数正确性约束影响，报告最后给出剩余问题和后续并行优化方向。
+\end{abstract}
+
+\noindent\textbf{关键词：}ABACUS；PEXSI；MPI；LCAO；多 k 点；稀疏矩阵；代码重构；并行程序性能分析
+
+\tableofcontents
+\newpage
+
+\section{报告目标与任务背景}
+
+\subsection{课程报告定位}
+
+本报告面向并行程序课程的大报告要求，重点不是单纯罗列运行结果，而是说明一次真实科学计算软件中的并行求解器改造过程。报告内容包括：
+
+\begin{itemize}
+    \item 分析 ABACUS LCAO 求解流程中 PEXSI 的并行数据流和原始阻塞点；
+    \item 说明本轮代码修改涉及的源码文件、接口变化和重构设计；
+    \item 用官方 10 个 LCAO 样例验证修改后的正确性和性能边界；
+    \item 将普通 \code{ks\_solver scalapack\_gvx}、远端 \code{develop} 的 Gamma-only PEXSI 以及当前 stage4 PEXSI 进行计时对比；
+    \item 总结当前未解决的问题和后续并行优化方向。
+\end{itemize}
+
+\subsection{问题来源}
+
+ABACUS 的普通 LCAO 路径通常通过 ScaLAPACK 求解广义本征值问题；PEXSI 则通过极点展开和选择性求逆直接构造密度矩阵，理论上适合大规模稀疏矩阵并行求解。原始代码中 Gamma 点实数 PEXSI 可以运行，但多 k 点体系会进入复数矩阵路径，而该路径在求解器和密度回写处尚未实现，表现为直接退出或错误终止。对官方 10 个 LCAO 样例而言，这意味着只有 Gamma-only 样例可以进入旧 PEXSI 实数路径，真正的多 k 点样例无法完整测试 PEXSI。
+
+本轮工作的目标是在 \code{pexsi-stage4-code-opt} 分支上继续打通多 k 点 PEXSI，尽量保持 ABACUS 原有模块边界，并用官方样例验证代码修改的正确性和性能影响。
+
+\subsection{分支与工作区信息}
+
+\begin{longtable}{@{}p{0.24\linewidth}p{0.66\linewidth}@{}}
+\caption{工作区基本信息}\\
+\toprule
+\textbf{项目} & \textbf{内容}\\
+\midrule
+\endfirsthead
+\toprule
+\textbf{项目} & \textbf{内容}\\
+\midrule
+\endhead
+当前分支 & \code{pexsi-stage4-pattern-reuse-opt}\\
+代码基础提交 & \code{1d16a609c PEXSI pattern reuse optimization}\\
+代码修改提交 & \code{Keep Gamma PEXSI baseline performance}\\
+远端基线提交 & \code{b9669d375 Toolchain: Fix package installing issue (\#7315)}\\
+主要测试集 & 官方用户指南 \code{examples/lcao-gv} 中 10 个 LCAO 样例\\
+报告输出 & \code{pexsi\_stage4\_code\_refactor\_report.tex} 与对应 PDF\\
+\bottomrule
+\end{longtable}
+
+\section{并行程序结构分析}
+
+\subsection{ABACUS LCAO 求解链路}
+
+在本报告关注的 LCAO/KSDFT 场景中，一次电子步的主要流程可以概括为：
+
+\begin{verbatim}
+Hamiltonian / overlap matrix H, S
+  -> HSolverLCAO::solve
+  -> DiagoPexsi / ScaLAPACK / other eigensolver
+  -> density matrix DM / energy density matrix EDM
+  -> ElecStateLCAO::dm2rho
+  -> real-space charge density and SCF mixing
+\end{verbatim}
+
+普通 \code{scalapack\_gvx} 路径显式求解本征值和本征矢；PEXSI 路径不直接求本征态，而是将 \(H\) 和 \(S\) 转换为 PEXSI 所需的稀疏压缩列格式，然后通过极点展开、惯性计数、选择性求逆和 Fermi operator 计算得到密度矩阵。因此 PEXSI 路径的并行成本不仅来自 PEXSI 库内部，也来自 ABACUS 侧的分布式矩阵转换、k 点循环、密度矩阵回写和电荷密度规约。
+
+\subsection{并行数据结构：BCD 与 CCS}
+
+ABACUS 内部 LCAO 矩阵使用面向二维 MPI 进程网格的 block-cyclic distribution（BCD）。PEXSI 输入则要求 compressed column storage（CCS）稀疏矩阵。两者的核心区别见表 \ref{tab:bcd-ccs}。
+
+\begin{longtable}{@{}p{0.18\linewidth}p{0.36\linewidth}p{0.36\linewidth}@{}}
+\caption{ABACUS BCD 矩阵与 PEXSI CCS 矩阵对比}
+\label{tab:bcd-ccs}\\
+\toprule
+\textbf{项目} & \textbf{BCD} & \textbf{CCS}\\
+\midrule
+\endfirsthead
+\toprule
+\textbf{项目} & \textbf{BCD} & \textbf{CCS}\\
+\midrule
+\endhead
+使用位置 & ABACUS LCAO/ScaLAPACK 内部矩阵 & PEXSI 输入和输出矩阵\\
+并行分布 & 二维块循环分布，适合并行稠密线性代数 & 按列压缩，只保存非零元，适合稀疏分解与选择性求逆\\
+关键数组 & 局部矩阵块、全局到局部映射、进程网格信息 & \code{colptrLocal}、\code{rowindLocal}、\code{nzvalLocal}\\
+本轮工作 & 补齐复数矩阵转换与回写 & 支持复数 H/S、DM/EDM 与 PEXSI C 接口交互\\
+\bottomrule
+\end{longtable}
+
+\subsection{原始阻塞点}
+
+原始多 k 点 PEXSI 的主要阻塞不是 MPI 无法启动，而是复数路径未实现：
+
+\begin{itemize}
+    \item \code{DiagoPexsi<std::complex<double>>::diag} 无法完成多 k 点复数 PEXSI 求解；
+    \item \code{ElecStateLCAO<std::complex<double>>::dm2rho} 不能将 PEXSI 返回的复数 DM/EDM 累加到实空间电荷密度；
+    \item 多 k 点下每个 k 点需要共享同一个全局化学势，而不是各 k 点分别独立决定化学势；
+    \item \code{nspin=1} 时 ABACUS k 点权重与 PEXSI 自旋权重存在重复计数风险；
+    \item 旧 Gamma-only PEXSI 默认参数较轻，部分样例能跑完但总能量明显偏离官方基线。
+\end{itemize}
+
+\section{代码修改与重构设计}
+
+\subsection{修改范围}
+
+本轮代码修改集中在 PEXSI 模块、LCAO hsolver 调用链和电子态密度回写模块，涉及文件见表 \ref{tab:source-files}。
+
+\begin{longtable}{@{}p{0.53\linewidth}p{0.37\linewidth}@{}}
+\caption{本轮涉及的源码文件}
+\label{tab:source-files}\\
+\toprule
+\textbf{文件路径} & \textbf{主要职责}\\
+\midrule
+\endfirsthead
+\toprule
+\textbf{文件路径} & \textbf{主要职责}\\
+\midrule
+\endhead
+\code{source/source\_hsolver/module\_pexsi/dist\_matrix\_transformer.h} & 增加复数 BCD/CCS 转换接口声明\\
+\code{source/source\_hsolver/module\_pexsi/dist\_matrix\_transformer.cpp} & 实现复数 BCD 到 CCS、CCS 到 BCD 的数据转换\\
+\code{source/source\_hsolver/module\_pexsi/pexsi\_solver\_interface.h} & 新增 PEXSI 求解器抽象接口 \code{IPexsiSolver}\\
+\code{source/source\_hsolver/module\_pexsi/pexsi\_solver.h} & 将 \code{PEXSI\_Solver} 接入 \code{IPexsiSolver} 接口\\
+\code{source/source\_hsolver/module\_pexsi/simple\_pexsi.h} & 增加 \code{simplePEXSIComplex} 接口\\
+\code{source/source\_hsolver/module\_pexsi/simple\_pexsi.cpp} & 实现复数 PEXSI plan、symbolic factorization、Fermi operator、矩阵取回和转换计时\\
+\code{source/source\_hsolver/diago\_pexsi.h} & 扩展 PEXSI 求解器模板接口，托管 DM/EDM 缓冲生命周期\\
+\code{source/source\_hsolver/diago\_pexsi.cpp} & 实现多 k 点全局化学势搜索、复数 PEXSI 调用和 Gamma 实数路径边界检查\\
+\code{source/source\_hsolver/hsolver\_lcao.cpp} & 接入 PEXSI DM/EDM 到 LCAO 求解流程\\
+\code{source/source\_estate/elecstate\_lcao.cpp} & 实现复数 PEXSI DM/EDM 的 \code{dm2rho} 回写\\
+\code{source/source\_hsolver/test/diago\_pexsi\_test.cpp} & 增加 fake PEXSI solver 注入测试，覆盖接口解耦和缓冲托管\\
+\bottomrule
+\end{longtable}
+
+\section{主要重构内容}
+
+\subsection{抽象 PEXSI 求解器接口}
+
+已新增 \code{pexsi::IPexsiSolver}，并将 \code{PEXSI\_Solver} 改为实现该接口。对应地，\code{DiagoPexsi} 的成员从直接持有具体 \code{PEXSI\_Solver} 改为持有 \code{std::unique\_ptr<pexsi::IPexsiSolver>}，构造函数也支持传入外部 solver。默认运行时仍会构造真实 \code{PEXSI\_Solver}，因此不改变生产路径。
+
+该修改已经体现在 \code{pexsi\_solver\_interface.h}、\code{pexsi\_solver.h}、\code{diago\_pexsi.h} 和 \code{diago\_pexsi.cpp} 中。完整文件路径见表 \ref{tab:source-files}。它的收益如下：
+
+\begin{itemize}
+    \item 降低 \code{DiagoPexsi} 与 PEXSI 后端的直接耦合，使对角化流程依赖抽象接口而不是具体实现；
+    \item 为后续接入复数 expert routines、多 k 点全局化学势搜索等路径预留清晰接口边界；
+    \item 支持轻量单元测试，避免测试必须真实调用 PEXSI 库和 MPI 稀疏求解流程。
+\end{itemize}
+
+对应的单元测试位于 \code{diago\_pexsi\_test.cpp}，其中 \code{FakePexsiSolver} 实现 \code{IPexsiSolver}，并通过 \code{UsesInjectedSolverAndOwnedDensityBuffers} 用例验证 fake solver 注入、DM/EDM 指针传递和能量字段回写。
+
+\subsection{改造 DM/EDM 缓冲生命周期}
+
+原实现中 \code{DiagoPexsi} 直接使用手动 \code{new[]} / \code{delete[]} 管理 \code{DM} 和 \code{EDM} 缓冲。当前代码已改为用 \code{std::vector<std::vector<T>>} 托管实际存储，同时保留 \code{std::vector<T*> DM} 和 \code{std::vector<T*> EDM} 作为对外接口。这样既降低内存管理风险，也避免破坏 \code{ElecStateLCAO::dm2rho(pe.DM, pe.EDM, \&dm)} 等既有调用链。
+
+该修改的收益如下：
+
+\begin{itemize}
+    \item 避免裸指针释放遗漏和异常路径泄漏风险；
+    \item 保持现有 DM/EDM pointer 形式不变，减少对上层 LCAO 求解流程的侵入；
+    \item 降低后续扩展多 spin、多 k 点缓冲时的维护成本。
+\end{itemize}
+
+\subsection{明确 Gamma 实数路径边界}
+
+在 \code{DiagoPexsi<double>::diag} 中已加入 \code{ik} 边界检查。如果当前实数路径被多 k 点误入，会提前给出明确错误信息：
+
+\begin{verbatim}
+PEXSI real path only has density buffers for Gamma/spin channels;
+multi-k requires the complex expert-routine path
+\end{verbatim}
+
+这个边界与项目计划书中的判断一致：当前多 k 点 PEXSI 不能通过简单复用 \code{PPEXSIDFTDriver2} 完成，后续必须走复数 CCS 转换、expert-level Fermi operator 和全局化学势搜索。
+
+\subsection{优化 BCD 到 CCS 的 H/S 数值重分布}
+
+\code{transformBCDtoCCS} 已针对 H/S 共享相同非零模式这一事实进行重构。旧逻辑对 H 和 S 分别组织数值通信；当前实数路径将每个非零位置的 H/S 数值打包为 \code{[H,S]}，复数路径打包为 \code{[Re(H),Im(H),Re(S),Im(S)]}，然后用一次 \code{MPI\_Alltoallv} 完成数值重分布。
+
+该修改的收益如下：
+
+\begin{itemize}
+    \item 减少一次 MPI collective 调用，降低 BCD 到 CCS 转换阶段的通信启动成本；
+    \item 保持原有 CCS 稀疏模式、\code{colptrLocal}、\code{rowindLocal} 和数值排序契约不变；
+    \item 对 PEXSI 路径中每轮 SCF 都会触发的矩阵格式转换更友好。
+\end{itemize}
+
+同时，\code{countMatrixDistribution} 中的零值判断括号错误已经修正，从错误的 \code{fabs(A[i] < 1e-31)} 改为 \code{fabs(A[i]) < 1e-31}，避免把布尔比较结果传入 \code{fabs}。
+
+反馈后对该修改做了 A/B 计时测试。测试方式是同一份 \code{np=8,kpar=4,pexsi\_npole=80,OMP=1} 输入分别使用 packed H/S 和 legacy H/S 两次 \code{MPI\_Alltoallv} 版本运行。001 中 packed 版本的 \code{TransMAT2CCS} 为 0.04955 s，legacy 为 0.05017 s，仅快约 1.25\%；003 中 packed 版本为 0.17982 s，legacy 为 0.16931 s，反而略慢。该结果说明：H/S 打包是合理的结构性优化，但在当前 001/003 小矩阵本地测试中不是主加速来源，真正瓶颈仍是 PEXSI Fermi operator。
+
+\subsection{增加转换计时}
+
+\code{simplePEXSI} 和 \code{simplePEXSIComplex} 中已给 BCD 到 CCS 转换加入 \code{TransMAT2CCS} timer。此前 \code{transformCCStoBCD} 已有 \code{TransMAT22D} 计时，但 BCD 到 CCS 缺少独立计时，不利于区分 ABACUS 侧格式转换成本和 PEXSI 库内部求解成本。
+
+反馈后的 Gamma-only 同输入性能复测显示，real Gamma 热路径中额外 trace 和细粒度 timer 本身也会进入被测时间。为避免性能对比被测量逻辑扰动，当前代码将 real Gamma 路径的 \code{TransMAT2CCS}、\code{PEXSI\_LoadHS\_R}、\code{PEXSI\_Retrieve\_R} 和 \code{PEXSI\_Finalize} 细粒度计时改为由环境变量 \code{ABACUS\_PEXSI\_REAL\_DETAIL\_TIMING} 显式打开；默认运行只保留旧路径已有的 \code{PEXSIDFT}、\code{TransMAT22D} 等必要计时。复数多 k 路径的细粒度计时保持开启，因为它仍是定位多 k 点瓶颈的主要依据。
+
+\subsection{OpenMP 本地循环优化}
+
+在不改变 PEXSI/MPI 调用边界的前提下，本轮对 \code{DistMatrixTransformer} 中的本地数组循环加入了 OpenMP，包括非零元 mask 构造、sender/receiver buffer 填充以及 CCS 到 BCD 的本地回填。没有把 OpenMP 用在外层 k 点循环、PEXSI plan 或 MPI collective 周围，因为这些路径涉及 MPI 通信器、BLACS 和 SuperLU\_DIST 状态，线程并行风险较高。
+
+为保证结果确定性，非零元扫描采用 ``并行 mask + 串行 prefix + 并行填充''，保持 \code{colidx}/\code{rowidx} 与原串行扫描顺序一致。001 的 \code{OMP\_NUM\_THREADS=2} 校验得到最终能量 \(-7836.1562642064\) eV，与 \code{OMP\_NUM\_THREADS=1} 的 80-pole 结果一致；但总耗时从 23.41 s 增加到 25.85 s，说明该 OpenMP 改动在小矩阵上主要是正确性和后续扩展验证，暂不是主要加速来源。
+
+\subsection{复数 PEXSI 调用路径}
+
+新增的复数 PEXSI 路径以 \code{simplePEXSIComplex} 为核心，将每个 k 点的复数 \(H(k)\) 和 \(S(k)\) 转换为 PEXSI 所需的 CCS 格式，然后调用 PEXSI complex expert routines。当前代码中涉及的关键调用包括：
+
+\begin{itemize}
+    \item \code{PPEXSILoadComplexHSMatrix}
+    \item \code{PPEXSISymbolicFactorizeComplexUnsymmetricMatrix}
+    \item \code{PPEXSICalculateFermiOperatorComplex}
+    \item \code{PPEXSIRetrieveComplexDFTMatrix}
+\end{itemize}
+
+这个改动的直接效果是：多 k 点样例不再在进入 PEXSI 后立即因为复数路径缺失而退出，而是能够真正执行 PEXSI 的复数矩阵装载、稀疏分解、Fermi operator 计算和 DM/EDM 取回。
+
+\subsection{全局化学势搜索}
+
+多 k 点体系的电子数约束是全局约束。若每个 k 点分别搜索化学势，会破坏整体电子数守恒。本轮将 \code{DiagoPexsi<std::complex<double>>} 改为按 spin channel 使用同一个 \code{mu\_buffer[0]}，并在每次尝试化学势时对所有 k 点累加电子数、电子数导数和能量。
+
+该逻辑可以概括为：
+
+\begin{verbatim}
+for each SCF step:
+    choose global mu
+    for each k point:
+        run complex PEXSI with the same mu
+        accumulate nelec, dNe/dmu, energy with k weight
+    update mu by Newton / bracket / bisection fallback
+\end{verbatim}
+
+当 PEXSI 返回的导数不可用或数值不稳定时，代码使用 bracket/bisection fallback，避免化学势搜索直接崩溃。
+
+\subsection{k 点权重和自旋权重修正}
+
+ABACUS 的 \code{klist->wk} 在 \code{nspin=1} 时已经包含自旋简并，而 PEXSI 的 \code{options.spin=2} 也会计入自旋。因此多 k 点 PEXSI 的有效权重改为：
+
+\begin{verbatim}
+effective_weight = klist->wk[ik] * (nspin == 1 ? 0.5 : 1.0)
+\end{verbatim}
+
+该修正避免了 \code{nspin=1} 多 k 点体系的电子数被重复计入自旋，属于正确性修复，而不是性能优化。
+
+\subsection{复数 DM/EDM 回写与 \code{dm2rho}}
+
+PEXSI 返回的 DM/EDM 需要从 CCS 格式回写到 ABACUS 的 BCD 矩阵，并进一步通过密度矩阵对象累加到 \(DM(R)\) 和实空间电荷密度。原始复数 \code{dm2rho} 不能完成该过程，本轮实现后会：
+
+\begin{itemize}
+    \item 检查 PEXSI DM/EDM buffer 数量是否满足 DMK 存储要求；
+    \item 按有效 k 权重缩放每个 DMK/EDMK；
+    \item 将 PEXSI 返回的 DMK pointer 写入 \code{DensityMatrix}；
+    \item 调用 \code{cal\_DMR()} 构造 real-space density matrix；
+    \item 通过 Gint 生成实空间电荷密度。
+\end{itemize}
+
+此外，本轮修复了复数 PEXSI DM/EDM 从 CCS 回写到 ABACUS 2D block-cyclic 矩阵时的复共轭方向问题。修复前，\code{001\_4GaAs} 的归一化前电荷约为 \code{68.17/72}，总能量偏离约 \code{2.65 eV}；修复后电荷恢复到约 \code{72.00/72}，总能量回到官方基线附近。
+
+\subsection{PEXSI 温度与 smearing 对齐}
+
+反馈中明确指出，官方文档要求 \code{pexsi\_temp} 与 Fermi-Dirac smearing 的温度含义一致。因此当前代码不再自动降低非 FD 输入的 \code{pexsi\_temp}，也不再把低温默认 \code{pexsi\_npole=40} 自动提升到 80。当前策略是：
+
+\begin{itemize}
+    \item 若输入使用 \code{smearing\_method=fd} 或 \code{fermi-dirac}，且用户没有显式改变 \code{pexsi\_temp}，则令 PEXSI temperature 与 \code{smearing\_sigma} 对齐；
+    \item 若用户显式设置 \code{pexsi\_temp} 或非 FD smearing，则尊重输入文件，不做隐式温度重写；
+    \item \code{pexsi\_npole} 保持用户输入或官方默认值，降低 pole 数只能作为带正确性检查的性能探针。
+\end{itemize}
+
+全量复测统一使用 \code{smearing\_method=fd}、\code{smearing\_sigma=0.015} 与 \code{pexsi\_temp=0.015}。低 pole 探针显示，\code{npole=20} 在 007、010 上会产生严重能量错误，因此不能作为默认优化。
+
+\section{编译环境与测试设计}
+
+\subsection{编译环境}
+
+本地使用启用 PEXSI 的 ABACUS 并行构建，关键依赖来自仓库内 Conda 环境 \code{.conda/pexsi-env}。编译验证命令如下：
+
+\begin{verbatim}
+PATH=/home/HeJunXiang/VSCode/abacus-develop/.conda/pexsi-env/bin:$PATH \
+cmake --build build-pexsi-conda --target abacus_basic_para -j 4
+\end{verbatim}
+
+该构建启用了 MPI、OpenMP、ScaLAPACK、OpenBLAS、PEXSI、SuperLU\_DIST、ParMETIS 和 METIS。早期正确性回归使用 4 个 MPI rank 和 1 个 OpenMP thread：
+
+\begin{verbatim}
+timeout 1800s mpirun -np 4 abacus_basic_para
+\end{verbatim}
+
+反馈后的全量性能复测重新覆盖 \code{np=8} 和 \code{np=16} 两组并行度，并分别设置 \code{kpar=8} 和 \code{kpar=16}。
+
+\subsection{测试矩阵}
+
+测试分为三组：
+
+\begin{itemize}
+    \item \textbf{普通 LCAO/ScaLAPACK 基线：}来自 \code{目前算法测试和总结报告.tex}，使用 \code{ks\_solver scalapack\_gvx} 的官方 10 样例结果；
+    \item \textbf{远端 \code{develop} Gamma-only PEXSI 基线：}从 \code{origin/develop} 建立干净 worktree，只对 Gamma 点样例设置 \code{ks\_solver pexsi} 和 \code{gamma\_only 1}；
+    \item \textbf{当前 stage4 PEXSI：}在 \code{pexsi-stage4-code-opt} 分支上复制官方 10 样例，只将 \code{ks\_solver} 改为 \code{pexsi}，测试多 k 点复数路径。
+\end{itemize}
+
+当前 stage4 PEXSI 测试目录为：
+
+\begin{verbatim}
+/tmp/abacus_pexsi_10cases_20260524_code_default
+\end{verbatim}
+
+008 fixed-temperature 修正后的追加复测目录为：
+
+\begin{verbatim}
+/tmp/abacus_pexsi_code_default_008_fixed_patch
+\end{verbatim}
+
+远端 \code{develop} Gamma-only PEXSI 基线测试目录为：
+
+\begin{verbatim}
+/tmp/abacus_origin_develop_pexsi_gamma_20260525
+\end{verbatim}
+
+\section{官方 10 样例复测结果}
+
+\subsection{正确性结果}
+
+官方 10 样例 PEXSI 正确性复测结果见表 \ref{tab:pexsi-10case-result}。需要说明的是，该表来自反馈前的保守参数全量复测，用于建立多 k 点复数路径能否进入 SCF、能量是否回到官方基线附近的正确性边界；反馈后的参数调优和 \code{np=8/16} 性能复测见表 \ref{tab:feedback-timing} 与表 \ref{tab:np8-np16-full-rerun}。因此表 \ref{tab:pexsi-10case-result} 不应理解为当前最佳并行参数下的完整性能结果。
+
+{\scriptsize
+\begin{longtable}{@{}p{0.15\linewidth}p{0.11\linewidth}r r r p{0.24\linewidth}@{}}
+\caption{官方 10 样例 stage4 保守参数 PEXSI 复测结果}
+\label{tab:pexsi-10case-result}\\
+\toprule
+\textbf{样例} & \textbf{SCF} & \textbf{PEXSI 能量/eV} & \textbf{官方/eV} & \textbf{\(|\Delta E|\)/eV} & \textbf{说明}\\
+\midrule
+\endfirsthead
+\toprule
+\textbf{样例} & \textbf{SCF} & \textbf{PEXSI 能量/eV} & \textbf{官方/eV} & \textbf{\(|\Delta E|\)/eV} & \textbf{说明}\\
+\midrule
+\endhead
+001\_4GaAs & 收敛 & -7836.1565576466 & -7836.1565582000 & 5.53e-7 & 多 k 点已打通，略高于 5e-7 标准\\
+002\_C2H6O & 收敛 & -665.5549999157 & -665.5550001100 & 1.94e-7 & 通过 5e-7 标准\\
+003\_4MoS2 & 收敛 & -9699.0065944614 & -9699.0065951100 & 6.49e-7 & 多 k 点已打通，略高于 5e-7 标准\\
+004\_12Pt111 & 超时 & -- & -39600.7443194500 & -- & 1800 s 到 PE16，\code{DRHO=9.70e-6}\\
+005\_3BaTiO3 & 收敛 & -10749.4002990640 & -10749.4002998700 & 8.06e-7 & 148 个 k 点可完成，略高于 5e-7 标准\\
+006\_16Na & 超时 & -- & -18524.7600965400 & -- & 1800 s 内仅完成 PE1\\
+007\_27Fe & 超时 & -- & -86945.5306751500 & -- & 729 基函数、双自旋多 k 点，首轮未完成\\
+008\_32H2O & 收敛 & -14950.4508412095 & -14950.4508434100 & 2.20e-6 & fixed 占据修正后恢复正确量级\\
+009\_Battery108 & 超时 & -- & -124070.6541107900 & -- & 1620 基函数、双自旋，1800 s 到 PE2\\
+010\_216Si & 超时 & -- & -23123.6999020000 & -- & 2808 基函数，1800 s 到 PE3\\
+\bottomrule
+\end{longtable}
+}
+
+若沿用 \code{tools/pexsi/check\_lcao\_10case\_regression.py} 的 \(5.0\times10^{-7}\) eV 总能量容差，该轮保守参数复测只有 \code{002\_C2H6O} 严格通过。若按 ``是否完成 SCF'' 统计，该轮完成 5/10：001、002、003、005、008。其余 5 个样例主要受本地 4 MPI ranks 下 PEXSI 成本限制。
+
+\subsection{与普通 \code{ks\_solver scalapack\_gvx} 的耗时对比}
+
+\code{目前算法测试和总结报告.tex} 中的官方 10 样例基线使用 \code{ks\_solver scalapack\_gvx}，其计时代表成熟普通 LCAO/ScaLAPACK 路径。当前 stage4 PEXSI 结果使用 4 个 MPI rank 和 1800 s 外层超时。因此表 \ref{tab:ksolver-pexsi-time} 是工程成本对比，不是严格同条件加速比。
+
+{\scriptsize
+\begin{longtable}{@{}p{0.15\linewidth}r r r r r p{0.22\linewidth}@{}}
+\caption{\code{ks\_solver scalapack\_gvx} 与当前 stage4 PEXSI 耗时对比}
+\label{tab:ksolver-pexsi-time}\\
+\toprule
+\textbf{样例} & \textbf{gvx 总/s} & \textbf{gvx HS/s} & \textbf{PEXSI 总/s} & \textbf{PEXSI HS/s} & \textbf{PEXSIDFT/s} & \textbf{结论}\\
+\midrule
+\endfirsthead
+\toprule
+\textbf{样例} & \textbf{gvx 总/s} & \textbf{gvx HS/s} & \textbf{PEXSI 总/s} & \textbf{PEXSI HS/s} & \textbf{PEXSIDFT/s} & \textbf{结论}\\
+\midrule
+\endhead
+001\_4GaAs & 11 & 8.70 & 50.37 & 49.43 & 44.25 & PEXSI 总耗时约 4.6 倍，HSolver 约 5.7 倍\\
+002\_C2H6O & 91 & 7.83 & 43.94 & 4.97 & 0.52 & PEXSI 在该小体系上完成更快，但进程数不同\\
+003\_4MoS2 & 29 & 22.43 & 147.03 & 144.15 & 136.15 & PEXSI 总耗时约 5.1 倍，HSolver 约 6.4 倍\\
+004\_12Pt111 & 63 & 52.11 & \(>1800\) & -- & -- & PEXSI 超时，1800 s 内到 PE16 未收敛\\
+005\_3BaTiO3 & 203 & 186.71 & 1279.88 & 1274.49 & 1212.47 & PEXSI 总耗时约 6.3 倍，HSolver 约 6.8 倍\\
+006\_16Na & 76 & 69.59 & \(>1800\) & -- & -- & PEXSI 超时，1800 s 内仅完成 PE1\\
+007\_27Fe & 793 & 739.89 & \(>1800\) & -- & -- & PEXSI 超时，首个电子步未完成\\
+008\_32H2O & 119 & 92.52 & 131.58 & 121.82 & 91.17 & PEXSI 总耗时接近，HSolver 约 1.3 倍\\
+009\_Battery108 & 1656 & 1531.26 & \(>1800\) & -- & -- & PEXSI 超时，到 PE2 仍未完成 SCF\\
+010\_216Si & 192 & 120.51 & \(>1800\) & -- & -- & PEXSI 超时，到 PE3 仍未完成 SCF\\
+\bottomrule
+\end{longtable}
+}
+
+旧报告还对 001、005、009、010 做了 \code{np=4} 的普通 ScaLAPACK 对比。若只看同为 4 个 MPI rank 的样例，普通 \code{scalapack\_gvx} 总耗时分别为 7 s、303 s、602 s、88 s；当前 stage4 PEXSI 对应为 50.37 s、1279.88 s、\(>1800\) s、\(>1800\) s。该对比说明，当前主要瓶颈已经从 ``多 k 点进不了 PEXSI'' 转为 ``PEXSI 路径可以进入，但逐 k 点串行和重复 plan/factorization 成本过高''。
+
+\section{单纯使用 PEXSI 的 Gamma-only 基线}
+
+\subsection{基线构造}
+
+为区分本轮代码修改带来的影响和既有 Gamma 点 PEXSI 路径本身的表现，额外从远端 \code{develop} 建立干净基线分支：
+
+\begin{verbatim}
+git worktree add -b pexsi-origin-develop-gamma-baseline \
+  /tmp/abacus-origin-develop-gamma-baseline origin/develop
+\end{verbatim}
+
+基线提交为 \code{b9669d375 Toolchain: Fix package installing issue (\#7315)}。本小节的测试不是 \code{ks\_solver scalapack\_gvx}，也不是普通对角化 hsolver 基线，而是单纯把 Gamma 点样例切到 PEXSI：
+
+\begin{verbatim}
+ks_solver pexsi
+gamma_only 1
+timeout 1800s mpirun -np 4 abacus_basic_para
+\end{verbatim}
+
+下表中的 \code{HS/s} 是 ABACUS timer 树中的 \code{HSolverLCAO solve} 名称，不表示使用普通 hsolver 求解器。实际 PEXSI 路径由日志中的 \code{PE} 电子步和 \code{Diago\_LCAO\_Matrix PEXSIDFT} 计时确认。
+
+需要特别说明：在远端 \code{develop} 原始代码上，如果只把 \code{ks\_solver} 改为 \code{pexsi} 而保持默认 \code{gamma\_only=0}，即使 KPT 是 \code{1 1 1}，程序也会进入旧的 complex/multi-k 模板路径并直接报错：
+
+\begin{verbatim}
+PEXSI is not completed for multi-k case
+\end{verbatim}
+
+因此本节使用 ``官方输入 + \code{ks\_solver pexsi} + \code{gamma\_only 1}''，这就是旧 \code{develop} 上单纯使用 PEXSI 的 Gamma-only 结果。
+
+\subsection{Gamma-only 结果}
+
+官方 10 样例中 KPT 为 \code{1 1 1} 的 Gamma 点样例是 002、008、009、010。远端 \code{develop} Gamma-only PEXSI 基线计时见表 \ref{tab:origin-develop-gamma-pexsi}。
+
+{\scriptsize
+\begin{longtable}{@{}p{0.15\linewidth}p{0.12\linewidth}r r r r p{0.23\linewidth}@{}}
+\caption{单纯使用 PEXSI 的 Gamma-only 基线计时（远端 \code{develop}）}
+\label{tab:origin-develop-gamma-pexsi}\\
+\toprule
+\textbf{样例} & \textbf{状态} & \textbf{wall/s} & \textbf{total/s} & \textbf{HS/s} & \textbf{PEXSIDFT/s} & \textbf{PEXSI 证据/说明}\\
+\midrule
+\endfirsthead
+\toprule
+\textbf{样例} & \textbf{状态} & \textbf{wall/s} & \textbf{total/s} & \textbf{HS/s} & \textbf{PEXSIDFT/s} & \textbf{PEXSI 证据/说明}\\
+\midrule
+\endhead
+002\_C2H6O & 收敛 PE28 & 43.33 & 43.24 & 4.68 & -- & 日志电子步为 \code{PE1--PE28}；小矩阵未单列 \code{PEXSIDFT}\\
+008\_32H2O & 收敛 PE17 & 68.41 & 68.32 & 58.65 & 30.16 & 有 \code{PEXSIDFT} 计时；能量 -14941.015276 eV\\
+009\_Battery108 & 超时 PE44 & 1801.03 & -- & -- & -- & 日志到 \code{PE44}，\code{DRHO=4.14e-6}，未完成 SCF\\
+010\_216Si & 收敛 PE12 & 581.40 & 581.28 & 534.51 & 500.19 & 有 \code{PEXSIDFT} 计时；能量 -23088.956346 eV\\
+\bottomrule
+\end{longtable}
+}
+
+旧 \code{develop} Gamma PEXSI 的速度不能单独视为正确实现的性能上限。为了避免不同输入参数造成不可比，本轮又以 \code{/tmp/abacus\_origin\_develop\_pexsi\_gamma\_20260525} 中的四个 Gamma 样例作为唯一输入源，分别复制到 \code{/tmp/abacus\_gamma\_same\_input\_20260527/origin} 和当前分支运行目录。复制后对每个样例执行 \code{diff -qr}，确认 \code{INPUT}、\code{KPT}、\code{STRU}、赝势和轨道文件完全一致。运行环境统一为：
+
+\begin{verbatim}
+OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1 \
+  timeout 1800s mpirun -np 4 abacus_basic_para
+\end{verbatim}
+
+同输入 Gamma PEXSI 与当前分支的对比见表 \ref{tab:gamma-same-input-compare}。该表只比较代码实现变化带来的影响，不混入 smearing、\code{pexsi\_temp}、\code{pexsi\_npole} 或 \code{gamma\_only} 等输入差异。
+
+{\scriptsize
+\begin{longtable}{@{}p{0.15\linewidth}p{0.27\linewidth}p{0.27\linewidth}p{0.21\linewidth}@{}}
+\caption{同输入 Gamma PEXSI：origin/develop 与当前分支对比}
+\label{tab:gamma-same-input-compare}\\
+\toprule
+\textbf{样例} & \textbf{origin/develop Gamma PEXSI} & \textbf{当前分支 Gamma PEXSI} & \textbf{结论}\\
+\midrule
+\endfirsthead
+\toprule
+\textbf{样例} & \textbf{origin/develop Gamma PEXSI} & \textbf{当前分支 Gamma PEXSI} & \textbf{结论}\\
+\midrule
+\endhead
+002\_C2H6O & 收敛 PE28；Total 45 s；HS 4.70 s；能量 -664.670823 eV & 收敛 PE28；Total 45 s；HS 4.68 s；能量 -664.670823 eV & 结果一致，当前不慢\\
+008\_32H2O & 收敛 PE17；Total 69 s；HS 59.37 s；PEXSIDFT 30.80 s；能量 -14941.015276 eV & 收敛 PE17；Total 69 s；HS 59.62 s；PEXSIDFT 30.57 s；能量 -14941.015276 eV & 结果一致，总时间持平，PEXSIDFT 略快\\
+009\_Battery108 & 1800 s 超时；完整到 PE44，\code{DRHO=4.13755e-6}，随后进入 PE45 & 1800 s 超时；完整到 PE44，\code{DRHO=4.13755e-6}，随后进入 PE45 & 结果轨迹一致，超时进度持平\\
+010\_216Si & 收敛 PE12；Total 593 s；HS 546.30 s；PEXSIDFT 510.93 s；能量 -23088.956346 eV & 收敛 PE12；Total 591 s；HS 543.42 s；PEXSIDFT 508.26 s；能量 -23088.956346 eV & 结果一致，当前略快\\
+\bottomrule
+\end{longtable}
+}
+
+同输入测试还暴露了一个容易忽略的正确性边界：共享的 \code{loadPEXSIOption} 若把 \code{nspin=2} 的 \code{options.spin} 改为 1，会使 Gamma real 路径的 009 自旋电子数分配从 origin 的磁性轨迹变为两自旋各 427.5，导致能量和磁矩从第一步开始不可比。当前修复为：Gamma real 路径在 \code{loadPEXSIOption} 中保持 origin 行为，即 \code{options.spin=2.0}；复数多 k 路径在 \code{simplePEXSIComplex} 中单独覆盖 \code{options.spin = (nspin==1 ? 2.0 : 1.0)}，避免影响多 k 点全局化学势搜索。
+
+\section{超时原因与性能瓶颈分析}
+
+本报告中的 ``超时'' 不是 ABACUS 自身报错，也不是官方样例不可运行，而是复测脚本在每个样例外层加了 1800 s 限制。超时含义是：该样例在本地 4 个 MPI rank、单样例 1800 s 限制内没有完成 SCF。
+
+\subsection{超时日志证据}
+
+\begin{longtable}{@{}p{0.17\linewidth}p{0.18\linewidth}p{0.50\linewidth}@{}}
+\caption{超时样例日志证据}
+\label{tab:timeout-evidence}\\
+\toprule
+\textbf{样例} & \textbf{规模} & \textbf{1800 s 内进展}\\
+\midrule
+\endfirsthead
+\toprule
+\textbf{样例} & \textbf{规模} & \textbf{1800 s 内进展}\\
+\midrule
+\endhead
+004\_12Pt111 & 8 k 点，324 基函数 & 到 \code{PE16}，最后一步约 \code{107.24 s}，\code{DRHO=9.70e-6}，尚未达到 \code{1e-7}\\
+006\_16Na & 172 k 点，240 基函数 & 仅完成 \code{PE1}，该步约 \code{1087.87 s}\\
+007\_27Fe & 32 k 点，双自旋，729 基函数 & 首个电子步在 1800 s 内未完成\\
+009\_Battery108 & 2 k 点，双自旋，1620 基函数 & \code{PE1=396.52 s}，\code{PE2=1336.27 s}，未完成 SCF\\
+010\_216Si & 1 k 点，2808 基函数 & 到 \code{PE3}，前三步约 \code{734.45/851.86/156.80 s}\\
+\bottomrule
+\end{longtable}
+
+\subsection{性能瓶颈}
+
+当前超时的直接原因可以归纳为四点：
+
+\begin{enumerate}
+    \item \textbf{求解路径不同。} 普通基线走 \code{scalapack\_gvx} 常规对角化路径；当前按要求走 \code{ks\_solver pexsi}，需要执行稀疏格式转换、symbolic factorization、selected inversion 和 Fermi operator 计算。
+    \item \textbf{多 k 点 PEXSI 已支持 k 池并行，但仍有重复求解成本。} 当前 \code{kpar} 可以把不同 k 点分配到 MPI 子通信器并行执行；但每个 k 点、每次化学势尝试仍会调用一次 PEXSI Fermi operator，且不同 k 池之间的负载均衡和每池 rank 数会明显影响性能。
+    \item \textbf{PEXSI plan 和 symbolic factorization 尚未复用。} 当前每次 PEXSI 调用都会重新初始化 plan、转换矩阵、执行 symbolic factorization 和回写矩阵。
+    \item \textbf{正确性参数更重。} 为了接近官方能量，本轮降低 temperature 并增加 pole 数。这改善了 008、010 等 Gamma 样例的能量量级，但显著增加了单步 PEXSI 成本。
+\end{enumerate}
+
+因此，当前报告中的 PEXSI 超时不能简单理解为 ``比旧 Gamma PEXSI 退化''。更准确的说法是：旧 Gamma 实数路径在部分样例上较快，但默认参数下能量不可靠；stage4 的修改把能量拉回官方基线附近，并按官方文档保持 \code{pexsi\_temp} 与 FD smearing 一致，同时拒绝能量错误的低 pole 配置。这使 PEXSI 单步成本显著增加，尤其在 009、010 这类大 Gamma 样例上表现为 1800 s 内无法完成。
+
+\subsection{反馈后的补充计时}
+
+针对反馈中提出的三点问题，本轮追加了 \code{np=8/16}、H/S packed A/B、低 pole 和 OpenMP 校验。关键结果见表 \ref{tab:feedback-timing}。
+
+{\scriptsize
+\begin{longtable}{@{}p{0.22\linewidth}r r r p{0.34\linewidth}@{}}
+\caption{反馈后的补充计时与判断}
+\label{tab:feedback-timing}\\
+\toprule
+\textbf{测试} & \textbf{总/s} & \textbf{TransMAT2CCS/s} & \textbf{PEXSIDFT/s} & \textbf{判断}\\
+\midrule
+\endfirsthead
+\toprule
+\textbf{测试} & \textbf{总/s} & \textbf{TransMAT2CCS/s} & \textbf{PEXSIDFT/s} & \textbf{判断}\\
+\midrule
+\endhead
+001 packed H/S & 22.65 & 0.04955 & 12.46 & H/S 打包只带来约 1.25\% 转换计时收益\\
+001 legacy H/S & 24.10 & 0.05017 & 12.40 & 端到端差异主要来自 PEXSI 运行噪声\\
+003 packed H/S & 67.51 & 0.17982 & 60.10 & 转换远小于 Fermi operator\\
+003 legacy H/S & 69.60 & 0.16931 & 62.07 & packed 没有稳定端到端优势\\
+001 \code{npole=120} & 34.54 & -- & 19.51 & 精度参考，FermiOp 较重\\
+001 \code{npole=80} & 23.41 & -- & 12.74 & 快约 32\%，能量偏移约 \(2.9\times10^{-4}\) eV\\
+001 \code{npole=40,temp=0.002} & 22.55 & -- & 11.40 & 更快但能量偏移约 0.263 eV，不可默认\\
+003 \code{np=16,kpar=8,npole=80} & 73.33 & -- & 49.35 & FermiOp 降低但总时间仍慢于 \code{np=8,kpar=4}\\
+006 \code{npole=40,temp=0.0002} & \(>600\) & -- & -- & 600 s 到 PE3，前三步约 215.02/160.42/169.99 s，仍未收敛\\
+001 \code{OMP=2,npole=80} & 25.85 & 0.05488 & 14.52 & 能量一致，小矩阵上线程开销更高\\
+\bottomrule
+\end{longtable}
+}
+
+这些补充结果改变了报告中的性能判断：H/S packed 是保留的代码结构优化，但不是主要性能来源；\code{pexsi\_npole} 对速度影响最大，006 的 40-pole 试探也证明它能明显缩短单步时间，但必须受总能量正确性约束；OpenMP 可以用于确定性的本地循环，但不应包住 PEXSI/MPI 外层流程。force 计算正确性测试应排在总能量正确性稳定之后。
+
+\subsection{2026-05-25 全量 \code{np=8/16} 复测}
+
+根据反馈意见，本轮重新构造了官方 10 个样例的 \code{np=8/16} 测试矩阵。统一参数如下：
+
+\begin{itemize}
+    \item \code{ks\_solver=pexsi}、\code{smearing\_method=fd}、\code{smearing\_sigma=0.015}、\code{pexsi\_temp=0.015}；
+    \item \code{pexsi\_npole=40}、\code{pexsi\_nproc\_pole=1}、\code{pexsi\_elec\_thr=0.01}；
+    \item \code{np=8} 使用 \code{kpar=8}，\code{np=16} 使用 \code{kpar=16}。
+\end{itemize}
+
+008\_32H2O 原输入 \code{nbands=128} 与 FD smearing 不兼容，ABACUS 报错 ``for smearing, num. of bands > num. of occupied bands''，补跑时将 \code{nbands} 调整为 160。
+
+{\scriptsize
+\begin{longtable}{@{}l r r l p{0.32\linewidth}@{}}
+\caption{\code{np=8/16} 官方 10 样例复测，timeout=1800 s}
+\label{tab:np8-np16-full-rerun}\\
+\toprule
+\textbf{样例} & \textbf{\code{np=8}/s} & \textbf{\code{np=16}/s} & \textbf{状态} & \textbf{备注}\\
+\midrule
+\endfirsthead
+\toprule
+\textbf{样例} & \textbf{\code{np=8}/s} & \textbf{\code{np=16}/s} & \textbf{状态} & \textbf{备注}\\
+\midrule
+\endhead
+001\_4GaAs & 12 & 14 & 通过 & \code{np=16} 反而略慢\\
+002\_C2H6O & 44 & 45 & 通过 & Gamma 小体系，\code{kpar} 基本无收益\\
+003\_4MoS2 & 89 & 100 & 通过 & \code{np=16} 的 FermiOp 降低但总时间上升\\
+004\_12Pt111 & 407 & 396 & 通过 & \code{np=16} 略快\\
+005\_3BaTiO3 & 386 & 469 & 通过 & \code{np=16} 通信/调度开销更高\\
+006\_16Na & 985 & 1156 & 通过 & 172 k 点阻塞已打通；\code{np=8} 更优\\
+007\_27Fe & \(>1800\) & \(>1800\) & 超时 & \code{np=8} 到 PE3；\code{np=16} PE1 未完成\\
+008\_32H2O & 52 & 65 & 通过 & \code{nbands=160} 后通过\\
+009\_Li27Ni9O54Mn9Co9 & \(>1800\) & \(>1800\) & 超时 & 大 Gamma 双自旋体系，PE 迭代推进但未收敛\\
+010\_216Si & \(>1800\) & \(>1800\) & 超时 & Gamma 大矩阵，PE1 即超过长时间预算\\
+\bottomrule
+\end{longtable}
+}
+
+该结果说明：本轮代码和参数优化已经把 006 从此前 1800 s 超时推进到 \code{np=8} 下 985 s 收敛，解决了最初 ``多 k 点 PEXSI 进不去/跑不完'' 的核心阻塞之一。但是本地单节点上 \code{np=16} 并不天然更快；006、005、003 均显示更多 MPI rank 会引入额外通信、调度和缓存压力。
+
+\subsection{进一步调参与失败边界}
+
+针对仍超时的 007、009、010，继续做了定向探针，结果见表 \ref{tab:timeout-tuning}。
+
+{\scriptsize
+\begin{longtable}{@{}p{0.22\linewidth}p{0.22\linewidth}p{0.18\linewidth}p{0.30\linewidth}@{}}
+\caption{超时样例定向调参探针}
+\label{tab:timeout-tuning}\\
+\toprule
+\textbf{样例} & \textbf{探针参数} & \textbf{结果} & \textbf{判断}\\
+\midrule
+\endfirsthead
+\toprule
+\textbf{样例} & \textbf{探针参数} & \textbf{结果} & \textbf{判断}\\
+\midrule
+\endhead
+006\_16Na & \code{np=8,kpar=8,npole=40,elec\_thr=0.01} & 988 s 收敛 & 推荐作为当前 006 配置\\
+007\_27Fe & \code{elec\_thr=0.1} & PE1/PE2 约 656/355 s & 有改善但仍不足\\
+007\_27Fe & \code{npole=20} & PE1/PE2 约 304/207 s & 能量约 -75239 eV，严重错误，不可用\\
+007\_27Fe & \code{pexsi\_mu=1.48,mu\_guard=0.05} & PE1/PE2 约 403/322 s & 改善明显但仍难保证 1800 s 内收敛\\
+010\_216Si & \code{np=16,nproc\_pole=4,npole=40} & PE1 约 706 s & 比 \code{nproc\_pole=1} 好，但完整 SCF 仍超时\\
+010\_216Si & \code{np=16,nproc\_pole=8,npole=40} & PE1 约 830 s & 更慢，不采用\\
+010\_216Si & \code{np=16,nproc\_pole=4,npole=20} & PE1 约 349 s & 能量变为正值，严重错误，不可用\\
+\bottomrule
+\end{longtable}
+}
+
+细粒度 trace 显示，007 的单次 \code{PPEXSICalculateFermiOperatorComplex} 约为 8--10 s，而 \code{TransMAT2CCS}、symbolic、load/retrieve 均远小于该值。因此 007 的主要瓶颈不是 H/S 转换，也不是 symbolic setup，而是 Fermi operator 调用次数和单次 Fermi operator 成本。010 属于 Gamma 大矩阵问题，\code{nproc\_pole} 可以改善 PE1，但降低 \code{npole} 会产生不可接受的能量错误。
+
+\section{增量优化：PEXSI symbolic 与通信 pattern 复用}
+
+在新分支 \code{pexsi-stage4-pattern-reuse-opt} 中，进一步实现了复数多 k 点 PEXSI 路径的 plan 复用框架。新增 \code{pexsi::PexsiComplexReuseContext} 保存每个 k 点的 PEXSI complex plan，并由 \code{ESolver\_KS\_LCAO} 持有 context vector，再通过 \code{HSolverLCAO} 传入 \code{simplePEXSIComplex}。这样 context 生命周期可以跨电子步保留，而不会随着每个电子步临时构造的 \code{HSolverLCAO} 对象销毁。
+
+复用采用保守 signature 判定：当前通信器 rank 顺序、PEXSI 进程网格、symbolic/ordering 选项、CCS 非零数、本地列数、本地 \code{colptrLocal/rowindLocal} hash 以及全局 pattern hash 全部一致时，才跳过 \code{PPEXSISymbolicFactorizeComplexUnsymmetricMatrix}，并在后续 \code{PPEXSILoadComplexHSMatrix} 与 \code{PPEXSICalculateFermiOperatorComplex} 中设置 \code{isSymbolicFactorize=0}、\code{isConstructCommPattern=0}。若任一 rank 发现 pattern 不一致，则所有 rank 重新初始化 plan，避免错误复用 PEXSI 的通信映射和稀疏分解结构。
+
+为验证该修改，增加调试开关 \code{ABACUS\_PEXSI\_REUSE\_SETUP=0}，可在同一输入下关闭复用路径。001\_4GaAs 的 \code{np=8,kpar=4,pexsi\_npole=80,OMP=1} 对照结果如下：
+
+{\scriptsize
+\begin{longtable}{@{}p{0.18\linewidth}r r r r p{0.26\linewidth}@{}}
+\caption{PEXSI symbolic/communication pattern 复用对照}\\
+\toprule
+\textbf{配置} & \textbf{总/s} & \textbf{PEXSIDFT/s} & \textbf{Symbolic/s} & \textbf{最终能量/eV} & \textbf{判断}\\
+\midrule
+\endfirsthead
+\toprule
+\textbf{配置} & \textbf{总/s} & \textbf{PEXSIDFT/s} & \textbf{Symbolic/s} & \textbf{最终能量/eV} & \textbf{判断}\\
+\midrule
+\endhead
+reuse off & 46 & 26.84 & 1.06 & -7836.1615508456 & 关闭复用的当前基线\\
+reuse on & 48 & 26.70 & 1.07 & -7836.1615508456 & 数值完全一致，但无安全复用命中\\
+\bottomrule
+\end{longtable}
+}
+
+该对照说明新增复用检查没有改变计算结果；但在当前默认 \code{pexsi\_zero\_thr} 下，001 的 CCS pattern 会随 SCF/化学势循环发生变化，保守 signature 判定无法命中可复用 plan。因此本轮增量属于代码结构和生命周期优化，尚未在默认样例上形成端到端加速。后续真正获得该方向收益，需要在每个 k 点上构造稳定的 CCS superset pattern，或者缓存 BCD 到 CCS 的 pattern 与 value scatter map，使后续 SCF 步只更新数值而不改变 \code{colptrLocal/rowindLocal}。
+
+\section{重构收益、限制与后续工作}
+
+\subsection{重构收益}
+
+本轮修改的主要收益包括：
+
+\begin{itemize}
+    \item 多 k 点 PEXSI 从直接 \code{WARNING\_QUIT} 推进到可以实际执行复数 PEXSI 路径；
+    \item 复数 BCD/CCS 转换和 DM/EDM 回写被纳入 ABACUS 原有模块边界，避免在 hsolver 中堆叠过多格式转换逻辑；
+    \item 全局化学势搜索和 k 点权重修正使多 k 点电子数模型更合理；
+    \item 复数 \code{dm2rho} 补齐后，PEXSI 返回的密度矩阵可以进入 ABACUS 后续电荷密度和 SCF 流程；
+    \item 通过官方 10 样例、普通 ScaLAPACK 基线和 Gamma-only PEXSI 基线，建立了可复现的正确性与性能边界。
+\end{itemize}
+
+\subsection{仍未解决的问题}
+
+当前仍存在以下问题：
+
+\begin{enumerate}
+    \item \textbf{大规模/金属样例性能不足。} 006 已经在 \code{np=8} 下进入 1800 s 以内，但 007、009、010 仍由 PEXSI Fermi operator 主导，单节点 \code{np=8/16} 仍不能完成 1800 s 内收敛。
+    \item \textbf{Gamma 点存在速度/精度权衡。} 远端 \code{develop} 的 Gamma 实数 PEXSI 在 008、010 上更快，但能量明显偏离官方基线；stage4 修正默认 PEXSI 参数后精度改善，但大 Gamma 样例成本显著上升。
+    \item \textbf{能量容差仍略高。} 001、003、005、008 已接近官方参考，但 PEXSI pole 展开误差仍在 \(5\times10^{-7}\) eV 附近或以上。
+    \item \textbf{smearing 与带数需要配套。} 008 改为 FD smearing 后必须增加 \code{nbands}；补跑使用 \code{nbands=160} 后通过。
+\end{enumerate}
+
+\subsection{后续优化方向}
+
+后续工作应优先处理以下方向：
+
+\begin{enumerate}
+    \item 继续调优 k 点级并行的 \code{kpar} 和每池 rank 数，并在更多节点上测试，而不是只依赖本地单节点 \code{np=8/16} 结果；
+    \item 在本轮保守复用框架基础上稳定 CCS pattern，例如构造 per-k superset pattern 或缓存 BCD 到 CCS 的 value scatter map，使 PEXSI plan 复用真正命中；
+    \item 针对金属体系设计更合适的 PEXSI temperature/pole 策略，并将 \code{getPole error} 边界做成确定性参数检查；
+    \item 对 001、003、005、008 建立 PEXSI 专用精度回归标准，或继续寻找不触发 \code{getPole error} 的 pole/temperature 组合；
+    \item 补充 force/stress 和密度矩阵一致性验证，避免只以总能量判断 PEXSI 路径正确性。
+\end{enumerate}
+
+\section{结论}
+
+本报告完成了 ABACUS-PEXSI 多 k 点路径的一次代码修改与重构评估。通过补齐复数 PEXSI 调用、全局化学势搜索、k 点权重修正、复数密度矩阵回写和 PEXSI 温度/FD smearing 对齐，当前代码已经突破原先多 k 点 PEXSI 直接不可用的阻塞。按 2026-05-25 的 \code{np=8/16} 全量复测，官方 10 个样例中已有 7 个样例可以在 1800 s 内完成；007、009、010 仍超时。
+
+从并行程序角度看，本轮修改证明了 PEXSI 多 k 点路径的主要难点不只是接口打通，还包括 MPI 数据分布、稀疏矩阵格式转换、全局化学势搜索、k 点并行策略和 PEXSI 参数选择之间的综合权衡。当前实现已经具备 k 池并行能力，并新增了 PEXSI plan/symbolic/communication pattern 的保守复用接口；但默认样例中 CCS pattern 仍会漂移，性能尚不能全面超过成熟 ScaLAPACK 路径。后续若要将该路径推进到可长期使用的并行求解器，需要重点优化每池 rank 数、稳定 CCS pattern 复用和更可靠的 pole/temperature 策略。
+
+\end{document}
diff --git a/source/source_esolver/esolver_ks_lcao.cpp b/source/source_esolver/esolver_ks_lcao.cpp
index 9d3906d6ebf..5c944a8c21b 100644
--- a/source/source_esolver/esolver_ks_lcao.cpp
+++ b/source/source_esolver/esolver_ks_lcao.cpp
@@ -403,7 +403,13 @@ void ESolver_KS_LCAO<TK, TR>::hamilt2rho_single(UnitCell& ucell, int istep, int
     // 3) run Hsolver
     if (!skip_solve)
     {
+#ifdef __PEXSI
+        hsolver::HSolverLCAO<TK> hsolver_lcao_obj(&(this->pv),
+                                                   PARAM.inp.ks_solver,
+                                                   &this->pexsi_reuse_contexts_);
+#else
         hsolver::HSolverLCAO<TK> hsolver_lcao_obj(&(this->pv), PARAM.inp.ks_solver);
+#endif
         hsolver_lcao_obj.solve(static_cast<hamilt::Hamilt<TK>*>(this->p_hamilt), this->psi[0], this->pelec, *this->dmat.dm, 
           this->chr, PARAM.inp.nspin, skip_charge);
     }
diff --git a/source/source_esolver/esolver_ks_lcao.h b/source/source_esolver/esolver_ks_lcao.h
index bfd80e50fe3..3aa8b9d2c2e 100644
--- a/source/source_esolver/esolver_ks_lcao.h
+++ b/source/source_esolver/esolver_ks_lcao.h
@@ -13,7 +13,11 @@
 #include "source_lcao/setup_dm.h" // mohan add 2025-10-30
 
 #include <memory>
+#include <vector>
 
+#ifdef __PEXSI
+#include "source_hsolver/module_pexsi/simple_pexsi.h"
+#endif
 
 // for Linear Response
 namespace LR
@@ -82,6 +86,9 @@ class ESolver_KS_LCAO : public ESolver_KS
     //! Add density matrix class, mohan add 2025-10-30
     LCAO_domain::Setup_DM<TK> dmat;
 
+#ifdef __PEXSI
+    std::vector<pexsi::PexsiComplexReuseContext> pexsi_reuse_contexts_;
+#endif
 
     // For deepks method, mohan add 2025-10-08
     Setup_DeePKS<TK> deepks;
diff --git a/source/source_esolver/esolver_ks_lcao_tddft.cpp b/source/source_esolver/esolver_ks_lcao_tddft.cpp
index 61e43797e13..87e67fc6c60 100644
--- a/source/source_esolver/esolver_ks_lcao_tddft.cpp
+++ b/source/source_esolver/esolver_ks_lcao_tddft.cpp
@@ -325,7 +325,13 @@ void ESolver_KS_LCAO_TDDFT<TR, Device>::hamilt2rho_single(UnitCell& ucell,
         if (this->psi != nullptr)
         {
             bool skip_charge = PARAM.inp.calculation == "nscf" ? true : false;
+#ifdef __PEXSI
+            hsolver::HSolverLCAO<std::complex<double>> hsolver_lcao_obj(&this->pv,
+                                                                         PARAM.inp.ks_solver,
+                                                                         &this->pexsi_reuse_contexts_);
+#else
             hsolver::HSolverLCAO<std::complex<double>> hsolver_lcao_obj(&this->pv, PARAM.inp.ks_solver);
+#endif
             hsolver_lcao_obj.solve(static_cast<hamilt::Hamilt<std::complex<double>>*>(this->p_hamilt),
                                    this->psi[0],
                                    this->pelec,
diff --git a/source/source_estate/elecstate_lcao.cpp b/source/source_estate/elecstate_lcao.cpp
index 2040ee769e1..66318e5d40f 100644
--- a/source/source_estate/elecstate_lcao.cpp
+++ b/source/source_estate/elecstate_lcao.cpp
@@ -43,7 +43,7 @@ void ElecStateLCAO<double>::dm2rho(std::vector<double*> pexsi_DM,
     }
 
 #ifdef __PEXSI
-    dm->pexsi_EDM = pexsi_EDM;
+    dm->set_pexsi_EDM_pointer(pexsi_EDM);
 #endif
 
     for (int is = 0; is < nspin; is++)
@@ -80,7 +80,56 @@ void ElecStateLCAO<std::complex<double>>::dm2rho(std::vector<std::complex<double
 		std::vector<std::complex<double>*> pexsi_EDM,
 		DensityMatrix<std::complex<double>, double>* dm)
 {
-    ModuleBase::WARNING_QUIT("ElecStateLCAO", "pexsi is not completed for multi-k case");
+    ModuleBase::timer::start("ElecStateLCAO", "dm2rho");
+
+    const int dmk_size = dm->get_DMK_size();
+    if (static_cast<int>(pexsi_DM.size()) < dmk_size)
+    {
+        ModuleBase::WARNING_QUIT("ElecStateLCAO", "PEXSI multi-k density matrix buffer is smaller than DMK storage");
+    }
+
+    const int local_matrix_size = dm->get_DMK_nrow() * dm->get_DMK_ncol();
+    const double pexsi_spin_weight = PARAM.inp.nspin == 1 ? 0.5 : 1.0;
+    for (int ik = 0; ik < dmk_size; ++ik)
+    {
+        const double k_weight = ((this->klist != nullptr && ik < static_cast<int>(this->klist->wk.size()))
+                                     ? this->klist->wk[ik]
+                                     : 1.0)
+                                * pexsi_spin_weight;
+        for (int i = 0; i < local_matrix_size; ++i)
+        {
+            pexsi_DM[ik][i] *= k_weight;
+            pexsi_EDM[ik][i] *= k_weight;
+        }
+        dm->set_DMK_pointer(ik, pexsi_DM[ik]);
+    }
+
+#ifdef __PEXSI
+    dm->set_pexsi_EDM_pointer(pexsi_EDM);
+#endif
+
+    dm->cal_DMR();
+
+    for (int is = 0; is < PARAM.inp.nspin; is++)
+    {
+        ModuleBase::GlobalFunc::ZEROS(this->charge->rho[is], this->charge->nrxx);
+    }
+
+    ModuleBase::GlobalFunc::NOTE("Calculate the charge on real space grid!");
+    ModuleGint::cal_gint_rho(dm->get_DMR_vector(), PARAM.inp.nspin, this->charge->rho);
+    if (XC_Functional::get_ked_flag())
+    {
+        for (int is = 0; is < PARAM.inp.nspin; is++)
+        {
+            ModuleBase::GlobalFunc::ZEROS(this->charge->kin_r[0], this->charge->nrxx);
+        }
+        ModuleGint::cal_gint_tau(dm->get_DMR_vector(), PARAM.inp.nspin, this->charge->kin_r);
+    }
+
+    this->charge->renormalize_rho();
+
+    ModuleBase::timer::end("ElecStateLCAO", "dm2rho");
+    return;
 }
 
 
diff --git a/source/source_estate/module_dm/density_matrix.h b/source/source_estate/module_dm/density_matrix.h
index a8b0e1c4ecb..457bde339c6 100644
--- a/source/source_estate/module_dm/density_matrix.h
+++ b/source/source_estate/module_dm/density_matrix.h
@@ -1,7 +1,10 @@
 #ifndef DENSITY_MATRIX_H
 #define DENSITY_MATRIX_H
 
+#include <algorithm>
+#include <cstddef>
 #include <string>
+#include <vector>
 
 #include "source_cell/module_neighbor/sltk_grid_driver.h"
 #include "source_lcao/record_adj.h"
@@ -267,6 +270,24 @@ class DensityMatrix
     std::vector<ModuleBase::ComplexMatrix> EDMK; // for TD-DFT
 
 #ifdef __PEXSI
+    void set_pexsi_EDM_pointer(const std::vector<TK*>& pexsi_EDM_in)
+    {
+        const int local_size = this->_paraV->nrow * this->_paraV->ncol;
+        this->pexsi_EDM_storage_.resize(pexsi_EDM_in.size());
+        this->pexsi_EDM.resize(pexsi_EDM_in.size());
+        for (std::size_t ik = 0; ik < pexsi_EDM_in.size(); ++ik)
+        {
+            this->pexsi_EDM_storage_[ik].resize(local_size);
+            if (local_size > 0)
+            {
+                std::copy(pexsi_EDM_in[ik],
+                          pexsi_EDM_in[ik] + local_size,
+                          this->pexsi_EDM_storage_[ik].begin());
+            }
+            this->pexsi_EDM[ik] = this->pexsi_EDM_storage_[ik].data();
+        }
+    }
+
     /**
      * @brief EDM storage for PEXSI
      * used in MD calculation
@@ -298,6 +319,10 @@ class DensityMatrix
     // std::vector<ModuleBase::ComplexMatrix> _DMK;
     std::vector<std::vector<TK>> _DMK;
 
+#ifdef __PEXSI
+    std::vector<std::vector<TK>> pexsi_EDM_storage_;
+#endif
+
     /**
      * @brief K_Vectors object, which is used to get k-point information
      */
diff --git a/source/source_estate/module_dm/test/test_dm_constructor.cpp b/source/source_estate/module_dm/test/test_dm_constructor.cpp
index f82abf8676b..8f7186b32e3 100644
--- a/source/source_estate/module_dm/test/test_dm_constructor.cpp
+++ b/source/source_estate/module_dm/test/test_dm_constructor.cpp
@@ -237,6 +237,28 @@ TEST_F(DMTest, DMConstructor_nspin2)
     delete kv;
 }
 
+#ifdef __PEXSI
+TEST_F(DMTest, PexsiEDMIsOwnedCopy)
+{
+    elecstate::DensityMatrix<double, double> DM(paraV, 1);
+    const int local_size = paraV->nrow * paraV->ncol;
+    std::vector<double> external_edm(local_size);
+    for (int i = 0; i < local_size; ++i)
+    {
+        external_edm[i] = static_cast<double>(i + 1);
+    }
+
+    DM.set_pexsi_EDM_pointer({external_edm.data()});
+    std::fill(external_edm.begin(), external_edm.end(), -1.0);
+
+    ASSERT_EQ(DM.pexsi_EDM.size(), 1);
+    for (int i = 0; i < local_size; ++i)
+    {
+        EXPECT_DOUBLE_EQ(DM.pexsi_EDM[0][i], static_cast<double>(i + 1));
+    }
+}
+#endif
+
 int main(int argc, char** argv)
 {
 #ifdef __MPI
diff --git a/source/source_hsolver/diago_pexsi.cpp b/source/source_hsolver/diago_pexsi.cpp
index e85f78b12b9..7206a0ad76d 100644
--- a/source/source_hsolver/diago_pexsi.cpp
+++ b/source/source_hsolver/diago_pexsi.cpp
@@ -6,59 +6,248 @@
 #include "diago_pexsi.h"
 #include "source_base/tool_title.h"
 #include "source_base/global_variable.h"
+#include "source_base/parallel_global.h"
 #include "source_base/tool_quit.h"
 #include "source_basis/module_ao/parallel_orbitals.h"
+#include "module_pexsi/simple_pexsi.h"
 #include "module_pexsi/pexsi_solver.h"
 
+#include <algorithm>
+#include <cfloat>
+#include <cmath>
+#include <cstdlib>
+#include <iostream>
+#include <string>
+#include <utility>
+
+extern MPI_Comm DIAG_WORLD;
+
 typedef hamilt::MatrixBlock<double> matd;
 typedef hamilt::MatrixBlock<std::complex<double>> matcd;
 
 namespace hsolver
 {
+namespace
+{
+bool pexsi_trace_enabled(const char* category)
+{
+    const char* trace_env = std::getenv("ABACUS_PEXSI_TRACE");
+    if (trace_env == nullptr)
+    {
+        return false;
+    }
+    const std::string trace_value(trace_env);
+    return trace_value == "1" || trace_value == "all" || trace_value.find(category) != std::string::npos;
+}
+
+void trace_pexsi_mu(const char* stage,
+                    const double mu,
+                    const double target_nelec,
+                    const double nelec,
+                    const double residual,
+                    const double derivative,
+                    const double delta_mu)
+{
+    if (!pexsi_trace_enabled("mu") || GlobalV::MY_RANK != 0)
+    {
+        return;
+    }
+    std::cout << "PEXSI_TRACE world_rank=" << GlobalV::MY_RANK << " stage=" << stage << " mu=" << mu
+              << " target_nelec=" << target_nelec << " nelec=" << nelec << " residual=" << residual
+              << " dnelec_dmu=" << derivative << " delta_mu=" << delta_mu << " time=" << MPI_Wtime()
+              << std::endl;
+}
+} // namespace
+
 template <typename T>
 std::vector<double> DiagoPexsi<T>::mu_buffer;
 
 template <typename T>
-DiagoPexsi<T>::DiagoPexsi(const Parallel_Orbitals* ParaV_in)
+DiagoPexsi<T>::DiagoPexsi(const Parallel_Orbitals* ParaV_in, std::unique_ptr<pexsi::IPexsiSolver> solver_in)
 {
+    this->ParaV = ParaV_in;
+    this->ps = std::move(solver_in);
+    if (this->ps == nullptr)
+    {
+        this->ps = std::make_unique<pexsi::PEXSI_Solver>();
+    }
+
     int nspin = PARAM.inp.nspin;
     if (PARAM.inp.nspin == 4)
     {
         nspin = 1;
     }
-    mu_buffer.resize(nspin);
-    for (int i = 0; i < nspin; i++)
+    if (static_cast<int>(mu_buffer.size()) < nspin)
     {
-        mu_buffer[i] = this->ps->pexsi_mu;
+        mu_buffer.resize(nspin, pexsi::PEXSI_Solver::pexsi_mu);
     }
+    this->resize_density_buffers(nspin);
 
-    this->ParaV = ParaV_in;
-    this->ps = std::make_unique<pexsi::PEXSI_Solver>();
+}
+
+template <typename T>
+DiagoPexsi<T>::~DiagoPexsi()
+{
+}
 
-    this->DM.resize(nspin);
-    this->EDM.resize(nspin);
-    for (int i = 0; i < nspin; i++)
+template <typename T>
+void DiagoPexsi<T>::resize_density_buffers(const int count)
+{
+    const int local_size = this->ParaV->nrow * this->ParaV->ncol;
+    this->DM.resize(count);
+    this->EDM.resize(count);
+    this->dm_buffer_.resize(count);
+    this->edm_buffer_.resize(count);
+    for (int i = 0; i < count; ++i)
     {
-        this->DM[i] = new T[ParaV->nrow * ParaV->ncol];
-        this->EDM[i] = new T[ParaV->nrow * ParaV->ncol];
+        if (static_cast<int>(this->dm_buffer_[i].size()) != local_size)
+        {
+            this->dm_buffer_[i].assign(local_size, T{});
+        }
+        if (static_cast<int>(this->edm_buffer_[i].size()) != local_size)
+        {
+            this->edm_buffer_[i].assign(local_size, T{});
+        }
+        this->DM[i] = this->dm_buffer_[i].data();
+        this->EDM[i] = this->edm_buffer_[i].data();
     }
+    if (static_cast<int>(mu_buffer.size()) < count)
+    {
+        mu_buffer.resize(count, pexsi::PEXSI_Solver::pexsi_mu);
+    }
+}
 
+template <typename T>
+void DiagoPexsi<T>::ensure_density_buffers(const int count)
+{
+    this->resize_density_buffers(count);
 }
 
 template <typename T>
-DiagoPexsi<T>::~DiagoPexsi()
+double DiagoPexsi<T>::current_mu() const
 {
-    int nspin = PARAM.inp.nspin;
-    if (PARAM.inp.nspin == 4)
+    return mu_buffer.empty() ? pexsi::PEXSI_Solver::pexsi_mu : mu_buffer[0];
+}
+
+template <typename T>
+void DiagoPexsi<T>::set_k_loop_totals(const double num_electron_sum,
+                                      const double num_electron_derivative_sum,
+                                      const double total_energy_h,
+                                      const double total_energy_s,
+                                      const double total_free_energy)
+{
+    this->num_electron_sum_ = num_electron_sum;
+    this->num_electron_derivative_sum_ = num_electron_derivative_sum;
+    this->totalEnergyH = total_energy_h;
+    this->totalEnergyS = total_energy_s;
+    this->totalFreeEnergy = total_free_energy;
+}
+
+template <typename T>
+void DiagoPexsi<T>::begin_mu_search()
+{
+    this->has_mu_lower_ = false;
+    this->has_mu_upper_ = false;
+    this->mu_lower_ = 0.0;
+    this->mu_upper_ = 0.0;
+}
+
+template <typename T>
+void DiagoPexsi<T>::begin_k_loop()
+{
+    this->num_electron_sum_ = 0.0;
+    this->num_electron_derivative_sum_ = 0.0;
+    this->totalEnergyH = 0.0;
+    this->totalEnergyS = 0.0;
+    this->totalFreeEnergy = 0.0;
+}
+
+template <typename T>
+void DiagoPexsi<T>::set_k_weight(const int ik, const double weight)
+{
+    if (ik >= static_cast<int>(this->k_weights_.size()))
     {
-        nspin = 1;
+        this->k_weights_.resize(ik + 1, 1.0);
+    }
+    this->k_weights_[ik] = weight;
+}
+
+template <typename T>
+bool DiagoPexsi<T>::finish_k_loop(const double target_nelec)
+{
+    return true;
+}
+
+template <>
+bool DiagoPexsi<std::complex<double>>::finish_k_loop(const double target_nelec)
+{
+    const double residual = target_nelec - this->num_electron_sum_;
+    if (std::abs(residual) <= pexsi::PEXSI_Solver::pexsi_elec_thr)
+    {
+        trace_pexsi_mu("mu.converged",
+                       mu_buffer.empty() ? pexsi::PEXSI_Solver::pexsi_mu : mu_buffer[0],
+                       target_nelec,
+                       this->num_electron_sum_,
+                       residual,
+                       this->num_electron_derivative_sum_,
+                       0.0);
+        return true;
+    }
+    if (residual > 0.0)
+    {
+        this->has_mu_lower_ = true;
+        this->mu_lower_ = mu_buffer[0];
     }
-    for (int i = 0; i < nspin; i++)
+    else
     {
-        delete[] this->DM[i];
-        delete[] this->EDM[i];
+        this->has_mu_upper_ = true;
+        this->mu_upper_ = mu_buffer[0];
     }
 
+    double delta_mu = 0.0;
+    if (std::abs(this->num_electron_derivative_sum_) <= DBL_MIN)
+    {
+        if (this->has_mu_lower_ && this->has_mu_upper_)
+        {
+            const double old_mu = mu_buffer[0];
+            const double next_mu = 0.5 * (this->mu_lower_ + this->mu_upper_);
+            trace_pexsi_mu("mu.bisect",
+                           old_mu,
+                           target_nelec,
+                           this->num_electron_sum_,
+                           residual,
+                           this->num_electron_derivative_sum_,
+                           next_mu - old_mu);
+            mu_buffer[0] = next_mu;
+            return false;
+        }
+        const double fallback_step = pexsi::PEXSI_Solver::pexsi_mu_guard > 0.0
+                                         ? pexsi::PEXSI_Solver::pexsi_mu_guard
+                                         : 0.5;
+        delta_mu = residual > 0.0 ? fallback_step : -fallback_step;
+    }
+    else
+    {
+        delta_mu = residual / this->num_electron_derivative_sum_;
+    }
+    if (pexsi::PEXSI_Solver::pexsi_mu_guard > 0.0 && std::abs(this->num_electron_derivative_sum_) > DBL_MIN)
+    {
+        delta_mu = std::max(-pexsi::PEXSI_Solver::pexsi_mu_guard,
+                            std::min(pexsi::PEXSI_Solver::pexsi_mu_guard, delta_mu));
+    }
+    if (mu_buffer.empty())
+    {
+        mu_buffer.push_back(pexsi::PEXSI_Solver::pexsi_mu);
+    }
+    trace_pexsi_mu("mu.update",
+                   mu_buffer[0],
+                   target_nelec,
+                   this->num_electron_sum_,
+                   residual,
+                   this->num_electron_derivative_sum_,
+                   delta_mu);
+    mu_buffer[0] += delta_mu;
+    return false;
 }
 
 template <>
@@ -69,6 +258,12 @@ void DiagoPexsi<double>::diag(hamilt::Hamilt<double>* phm_in, psi::Psi<double>&
     phm_in->matrix(h_mat, s_mat);
     std::vector<double> eigen(PARAM.globalv.nlocal, 0.0);
     int ik = psi.get_current_k();
+    if (ik < 0 || ik >= static_cast<int>(this->DM.size()))
+    {
+        ModuleBase::WARNING_QUIT("DiagoPEXSI",
+                                 "PEXSI real path only has density buffers for Gamma/spin channels; multi-k requires "
+                                 "the complex expert-routine path");
+    }
     this->ps->prepare(this->ParaV->blacs_ctxt,
                       this->ParaV->nb,
                       this->ParaV->nrow,
@@ -90,11 +285,71 @@ void DiagoPexsi<std::complex<double>>::diag(hamilt::Hamilt<std::complex<double>>
                                             double* eigenvalue_in)
 {
     ModuleBase::TITLE("DiagoPEXSI", "diag");
-    ModuleBase::WARNING_QUIT("DiagoPEXSI", "PEXSI is not completed for multi-k case");
+    matcd h_mat, s_mat;
+    phm_in->matrix(h_mat, s_mat);
+    const int ik = psi.get_current_k();
+    if (ik < 0)
+    {
+        ModuleBase::WARNING_QUIT("DiagoPEXSI", "invalid k-point index for complex PEXSI path");
+    }
+    if (ik >= static_cast<int>(this->DM.size()))
+    {
+        this->resize_density_buffers(ik + 1);
+    }
+    if (mu_buffer.empty())
+    {
+        mu_buffer.push_back(pexsi::PEXSI_Solver::pexsi_mu);
+    }
+
+    MPI_Group grid_group;
+    MPI_Group world_group;
+    int grid_np = 0;
+    MPI_Comm_size(DIAG_WORLD, &grid_np);
+    MPI_Comm_group(DIAG_WORLD, &world_group);
+    int grid_proc_range[3] = {0, (GlobalV::NPROC / grid_np) * grid_np - 1, GlobalV::NPROC / grid_np};
+    MPI_Group_range_incl(world_group, 1, &grid_proc_range, &grid_group);
+
+    double num_electron = 0.0;
+    double num_electron_derivative = 0.0;
+    double total_energy_h = 0.0;
+    double total_energy_s = 0.0;
+    double total_free_energy = 0.0;
+    pexsi::simplePEXSIComplex(DIAG_WORLD,
+                              DIAG_WORLD,
+                              grid_group,
+                              this->ParaV->blacs_ctxt,
+                              PARAM.globalv.nlocal,
+                              this->ParaV->nb,
+                              this->ParaV->nrow,
+                              this->ParaV->ncol,
+                              'c',
+                              h_mat.p,
+                              s_mat.p,
+                              PARAM.inp.nelec,
+                              "PEXSIOPTION",
+                              this->DM[ik],
+                              this->EDM[ik],
+                              total_energy_h,
+                              total_energy_s,
+                              total_free_energy,
+                              mu_buffer[0],
+                              mu_buffer[0],
+                              &num_electron,
+                              &num_electron_derivative);
+
+    const double k_weight = ik < static_cast<int>(this->k_weights_.size()) ? this->k_weights_[ik] : 1.0;
+    this->num_electron_sum_ += k_weight * num_electron;
+    this->num_electron_derivative_sum_ += k_weight * num_electron_derivative;
+    this->totalEnergyH += k_weight * total_energy_h;
+    this->totalEnergyS += k_weight * total_energy_s;
+    this->totalFreeEnergy += k_weight * total_free_energy;
+
+    MPI_Group_free(&grid_group);
+    MPI_Group_free(&world_group);
 }
 
 template class DiagoPexsi<double>;
 template class DiagoPexsi<std::complex<double> >;
 
 } // namespace hsolver
-#endif
\ No newline at end of file
+#endif
diff --git a/source/source_hsolver/diago_pexsi.h b/source/source_hsolver/diago_pexsi.h
index 9f0e0d1317b..d775cb09106 100644
--- a/source/source_hsolver/diago_pexsi.h
+++ b/source/source_hsolver/diago_pexsi.h
@@ -6,7 +6,7 @@
 #include "source_base/macros.h"   // GetRealType
 #include "source_hamilt/hamilt.h"
 #include "source_basis/module_ao/parallel_orbitals.h"
-#include "module_pexsi/pexsi_solver.h"
+#include "module_pexsi/pexsi_solver_interface.h"
 
 namespace hsolver
 {
@@ -19,16 +19,40 @@ class DiagoPexsi
     static std::vector<double> mu_buffer;
 
   public:
-    DiagoPexsi(const Parallel_Orbitals* ParaV_in);
+    explicit DiagoPexsi(const Parallel_Orbitals* ParaV_in,
+                        std::unique_ptr<pexsi::IPexsiSolver> solver_in = nullptr);
     void diag(hamilt::Hamilt<T>* phm_in, psi::Psi<T>& psi, Real* eigenvalue_in);
     const Parallel_Orbitals* ParaV = nullptr;
     std::vector<T*> DM;
     std::vector<T*> EDM;
-    double totalEnergyH;
-    double totalEnergyS;
-    double totalFreeEnergy;
-    std::unique_ptr<pexsi::PEXSI_Solver> ps;
+    double totalEnergyH = 0.0;
+    double totalEnergyS = 0.0;
+    double totalFreeEnergy = 0.0;
+    std::unique_ptr<pexsi::IPexsiSolver> ps;
+    void begin_mu_search();
+    void begin_k_loop();
+    void set_k_weight(const int ik, const double weight);
+    void ensure_density_buffers(const int count);
+    double current_mu() const;
+    void set_k_loop_totals(const double num_electron_sum,
+                           const double num_electron_derivative_sum,
+                           const double total_energy_h,
+                           const double total_energy_s,
+                           const double total_free_energy);
+    bool finish_k_loop(const double target_nelec);
     ~DiagoPexsi();
+
+  private:
+    std::vector<std::vector<T>> dm_buffer_;
+    std::vector<std::vector<T>> edm_buffer_;
+    std::vector<double> k_weights_;
+    double num_electron_sum_ = 0.0;
+    double num_electron_derivative_sum_ = 0.0;
+    bool has_mu_lower_ = false;
+    bool has_mu_upper_ = false;
+    double mu_lower_ = 0.0;
+    double mu_upper_ = 0.0;
+    void resize_density_buffers(const int count);
 };
 } // namespace hsolver
 
diff --git a/source/source_hsolver/hsolver_lcao.cpp b/source/source_hsolver/hsolver_lcao.cpp
index b1c7ba9c95e..0d3556c2e60 100644
--- a/source/source_hsolver/hsolver_lcao.cpp
+++ b/source/source_hsolver/hsolver_lcao.cpp
@@ -22,6 +22,8 @@
 
 #ifdef __PEXSI
 #include "diago_pexsi.h"
+#include "module_pexsi/pexsi_solver.h"
+#include "module_pexsi/simple_pexsi.h"
 #endif
 
 #include "source_base/global_variable.h"
@@ -37,9 +39,340 @@
 
 #include "source_lcao/rho_tau_lcao.h" // mohan add 20251024
 
+#include <algorithm>
+#include <complex>
+#include <cstdlib>
+#include <iostream>
+#include <string>
+#include <type_traits>
+
 namespace hsolver
 {
 
+#ifdef __PEXSI
+namespace
+{
+bool pexsi_trace_enabled(const char* category)
+{
+    const char* trace_env = std::getenv("ABACUS_PEXSI_TRACE");
+    if (trace_env == nullptr)
+    {
+        return false;
+    }
+    const std::string trace_value(trace_env);
+    return trace_value == "1" || trace_value == "all" || trace_value.find(category) != std::string::npos;
+}
+
+bool pexsi_reuse_setup_enabled()
+{
+    const char* reuse_env = std::getenv("ABACUS_PEXSI_REUSE_SETUP");
+    if (reuse_env == nullptr)
+    {
+        return true;
+    }
+    const std::string reuse_value(reuse_env);
+    return reuse_value != "0" && reuse_value != "false" && reuse_value != "off";
+}
+
+#ifdef __MPI
+void pexsi_trace_kpoint(const int imu, const int ik_global, const int pool, const char* stage)
+{
+    if (!pexsi_trace_enabled("kpoint"))
+    {
+        return;
+    }
+    int world_rank = 0;
+    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
+    std::cout << "PEXSI_TRACE world_rank=" << world_rank << " pool=" << pool << " imu=" << imu
+              << " ik=" << ik_global << " stage=" << stage << " time=" << MPI_Wtime() << std::endl;
+}
+#endif
+
+template <typename TK, typename Device>
+struct PexsiKParallelRunner
+{
+    static bool run(const Parallel_Orbitals* ParaV,
+                    hamilt::Hamilt<TK>* pHamilt,
+                    psi::Psi<TK>& psi,
+                    elecstate::ElecState* pes,
+                    elecstate::DensityMatrix<TK, double>& dm,
+                    const int requested_kpar,
+                    std::vector<pexsi::PexsiComplexReuseContext>* pexsi_reuse_contexts)
+    {
+        return false;
+    }
+};
+
+#ifdef __MPI
+template <typename Device>
+struct PexsiKParallelRunner<std::complex<double>, Device>
+{
+    static bool run(const Parallel_Orbitals* ParaV,
+                    hamilt::Hamilt<std::complex<double>>* pHamilt,
+                    psi::Psi<std::complex<double>>& psi,
+                    elecstate::ElecState* pes,
+                    elecstate::DensityMatrix<std::complex<double>, double>& dm,
+                    const int requested_kpar,
+                    std::vector<pexsi::PexsiComplexReuseContext>* pexsi_reuse_contexts)
+    {
+        const int nks = psi.get_nk();
+        if (requested_kpar <= 1 || nks <= 1)
+        {
+            return false;
+        }
+
+        const int min_pool_size = std::max(1, pexsi::PEXSI_Solver::pexsi_nproc_pole);
+        int kpar = std::min(std::min(requested_kpar, nks), GlobalV::NPROC);
+        while (kpar > 1 && GlobalV::NPROC / kpar < min_pool_size)
+        {
+            --kpar;
+        }
+        if (kpar <= 1)
+        {
+            return false;
+        }
+        if (kpar != requested_kpar)
+        {
+            ModuleBase::WARNING("HSolverLCAO::pexsiKpar",
+                                "adjust kpar for PEXSI so each k pool has enough MPI ranks");
+        }
+
+        ModuleBase::timer::start("HSolverLCAO", "pexsiKparSolve");
+        DiagoPexsi<std::complex<double>> pe(ParaV);
+        pe.ensure_density_buffers(nks);
+
+        auto k2d = Parallel_K2D<std::complex<double>>();
+        k2d.set_kpar(kpar);
+        const int nrow = ParaV->get_global_row_size();
+        const int nb2d = ParaV->get_block_size();
+        k2d.set_para_env(nks, nrow, nb2d, GlobalV::NPROC, GlobalV::MY_RANK, PARAM.inp.nspin);
+
+        MPI_Group pool_group = MPI_GROUP_NULL;
+        MPI_Comm_group(k2d.POOL_WORLD_K2D, &pool_group);
+        int rank_in_pool = 0;
+        MPI_Comm_rank(k2d.POOL_WORLD_K2D, &rank_in_pool);
+
+        const int pool_local_size = k2d.get_p2D_pool()->get_local_size();
+        std::vector<std::complex<double>> dm_pool(pool_local_size);
+        std::vector<std::complex<double>> edm_pool(pool_local_size);
+        std::vector<std::vector<std::complex<double>>> hk_pool_cache(nks);
+        std::vector<std::vector<std::complex<double>>> sk_pool_cache(nks);
+        std::vector<std::vector<std::complex<double>>> dm_pool_cache(nks);
+        std::vector<std::vector<std::complex<double>>> edm_pool_cache(nks);
+        const bool reuse_setup = pexsi_reuse_setup_enabled();
+        if (reuse_setup && pexsi_reuse_contexts == nullptr)
+        {
+            ModuleBase::WARNING_QUIT("HSolverLCAO::pexsiKparSolve", "PEXSI reuse context storage is null");
+        }
+        if (reuse_setup && pexsi_reuse_contexts->size() != static_cast<std::size_t>(nks))
+        {
+            for (auto& pexsi_reuse_context : *pexsi_reuse_contexts)
+            {
+                pexsi_reuse_context.clear();
+            }
+            pexsi_reuse_contexts->clear();
+            pexsi_reuse_contexts->resize(nks);
+        }
+        const int pexsi_mu_loops = std::min(40, std::max(1, pexsi::PEXSI_Solver::pexsi_nmax));
+
+        ModuleBase::timer::start("HSolverLCAO", "cache_pexsi_hs");
+        for (int ik = 0; ik < k2d.get_pKpoints()->get_max_nks_pool(); ++ik)
+        {
+            std::vector<int> ik_kpar;
+            for (int ipool = 0; ipool < k2d.get_kpar(); ++ipool)
+            {
+                if (ik + k2d.get_pKpoints()->startk_pool[ipool] < nks
+                    && ik < k2d.get_pKpoints()->nks_pool[ipool])
+                {
+                    ik_kpar.push_back(ik + k2d.get_pKpoints()->startk_pool[ipool]);
+                }
+            }
+            if (ik_kpar.empty())
+            {
+                ModuleBase::WARNING_QUIT("HSolverLCAO::pexsiKparSolve", "ik_kpar is empty");
+            }
+            k2d.distribute_hsk(pHamilt, ik_kpar, nrow);
+
+            const int my_pool = k2d.get_my_pool();
+            const int ik_global = ik + k2d.get_pKpoints()->startk_pool[my_pool];
+            const bool has_local_k = ik_global < nks && ik < k2d.get_pKpoints()->nks_pool[my_pool];
+            if (has_local_k)
+            {
+                hk_pool_cache[ik_global] = k2d.hk_pool;
+                sk_pool_cache[ik_global] = k2d.sk_pool;
+            }
+        }
+        ModuleBase::timer::end("HSolverLCAO", "cache_pexsi_hs");
+
+        pe.begin_mu_search();
+        for (int imu = 0; imu < pexsi_mu_loops; ++imu)
+        {
+            pe.begin_k_loop();
+            double local_totals[5] = {0.0, 0.0, 0.0, 0.0, 0.0};
+
+            for (int ik = 0; ik < k2d.get_pKpoints()->get_max_nks_pool(); ++ik)
+            {
+                std::vector<int> ik_kpar;
+                for (int ipool = 0; ipool < k2d.get_kpar(); ++ipool)
+                {
+                    if (ik + k2d.get_pKpoints()->startk_pool[ipool] < nks
+                        && ik < k2d.get_pKpoints()->nks_pool[ipool])
+                    {
+                        ik_kpar.push_back(ik + k2d.get_pKpoints()->startk_pool[ipool]);
+                    }
+                }
+                if (ik_kpar.empty())
+                {
+                    ModuleBase::WARNING_QUIT("HSolverLCAO::pexsiKparSolve", "ik_kpar is empty");
+                }
+
+                const int my_pool = k2d.get_my_pool();
+                const int ik_global = ik + k2d.get_pKpoints()->startk_pool[my_pool];
+                const bool has_local_k = ik_global < nks && ik < k2d.get_pKpoints()->nks_pool[my_pool];
+                std::fill(dm_pool.begin(), dm_pool.end(), std::complex<double>(0.0, 0.0));
+                std::fill(edm_pool.begin(), edm_pool.end(), std::complex<double>(0.0, 0.0));
+
+                if (has_local_k)
+                {
+                    double num_electron = 0.0;
+                    double num_electron_derivative = 0.0;
+                    double total_energy_h = 0.0;
+                    double total_energy_s = 0.0;
+                    double total_free_energy = 0.0;
+                    double mu = pe.current_mu();
+                    std::complex<double>* dm_pool_ptr = dm_pool.data();
+                    std::complex<double>* edm_pool_ptr = edm_pool.data();
+
+                    if (rank_in_pool == 0)
+                    {
+                        pexsi_trace_kpoint(imu, ik_global, my_pool, "begin");
+                    }
+                    pexsi::simplePEXSIComplex(k2d.POOL_WORLD_K2D,
+                                              k2d.POOL_WORLD_K2D,
+                                              pool_group,
+                                              k2d.get_p2D_pool()->blacs_ctxt,
+                                              PARAM.globalv.nlocal,
+                                              nb2d,
+                                              k2d.get_p2D_pool()->get_row_size(),
+                                              k2d.get_p2D_pool()->get_col_size(),
+                                              'c',
+                                              hk_pool_cache[ik_global].data(),
+                                              sk_pool_cache[ik_global].data(),
+                                              PARAM.inp.nelec,
+                                              "PEXSIOPTION",
+                                              dm_pool_ptr,
+                                              edm_pool_ptr,
+                                              total_energy_h,
+                                              total_energy_s,
+                                              total_free_energy,
+                                              mu,
+                                              pe.current_mu(),
+                                              &num_electron,
+                                              &num_electron_derivative,
+                                              reuse_setup ? &(*pexsi_reuse_contexts)[ik_global] : nullptr);
+                    if (rank_in_pool == 0)
+                    {
+                        pexsi_trace_kpoint(imu, ik_global, my_pool, "end");
+                    }
+
+                    if (rank_in_pool == 0)
+                    {
+                        const double k_weight = (pes->klist != nullptr
+                                                 && ik_global < static_cast<int>(pes->klist->wk.size()))
+                                                    ? pes->klist->wk[ik_global]
+                                                    : 1.0;
+                        const double pexsi_spin_weight = PARAM.inp.nspin == 1 ? 0.5 : 1.0;
+                        const double effective_weight = k_weight * pexsi_spin_weight;
+                        local_totals[0] += effective_weight * num_electron;
+                        local_totals[1] += effective_weight * num_electron_derivative;
+                        local_totals[2] += effective_weight * total_energy_h;
+                        local_totals[3] += effective_weight * total_energy_s;
+                        local_totals[4] += effective_weight * total_free_energy;
+                    }
+                    dm_pool_cache[ik_global] = dm_pool;
+                    edm_pool_cache[ik_global] = edm_pool;
+                }
+            }
+
+            double global_totals[5] = {0.0, 0.0, 0.0, 0.0, 0.0};
+            MPI_Allreduce(local_totals, global_totals, 5, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+            pe.set_k_loop_totals(global_totals[0],
+                                 global_totals[1],
+                                 global_totals[2],
+                                 global_totals[3],
+                                 global_totals[4]);
+            if (pe.finish_k_loop(PARAM.inp.nelec))
+            {
+                ModuleBase::timer::start("HSolverLCAO", "collect_pexsi_dm");
+                const int my_pool = k2d.get_my_pool();
+                for (int ik = 0; ik < k2d.get_pKpoints()->get_max_nks_pool(); ++ik)
+                {
+                    std::vector<int> ik_kpar;
+                    for (int ipool = 0; ipool < k2d.get_kpar(); ++ipool)
+                    {
+                        if (ik + k2d.get_pKpoints()->startk_pool[ipool] < nks
+                            && ik < k2d.get_pKpoints()->nks_pool[ipool])
+                        {
+                            ik_kpar.push_back(ik + k2d.get_pKpoints()->startk_pool[ipool]);
+                        }
+                    }
+                    for (const int ik_collect : ik_kpar)
+                    {
+                        const int source_pool = k2d.get_pKpoints()->whichpool[ik_collect];
+                        int desc_pool[9];
+                        std::copy(k2d.get_p2D_pool()->desc, k2d.get_p2D_pool()->desc + 9, desc_pool);
+                        if (my_pool != source_pool)
+                        {
+                            desc_pool[1] = -1;
+                        }
+                        std::complex<double>* dm_src = (my_pool == source_pool && !dm_pool_cache[ik_collect].empty())
+                                                           ? dm_pool_cache[ik_collect].data()
+                                                           : nullptr;
+                        std::complex<double>* edm_src = (my_pool == source_pool && !edm_pool_cache[ik_collect].empty())
+                                                            ? edm_pool_cache[ik_collect].data()
+                                                            : nullptr;
+                        Cpxgemr2d(nrow,
+                                  nrow,
+                                  dm_src,
+                                  1,
+                                  1,
+                                  desc_pool,
+                                  pe.DM[ik_collect],
+                                  1,
+                                  1,
+                                  k2d.get_p2D_global()->desc,
+                                  k2d.get_p2D_global()->blacs_ctxt);
+                        Cpxgemr2d(nrow,
+                                  nrow,
+                                  edm_src,
+                                  1,
+                                  1,
+                                  desc_pool,
+                                  pe.EDM[ik_collect],
+                                  1,
+                                  1,
+                                  k2d.get_p2D_global()->desc,
+                                  k2d.get_p2D_global()->blacs_ctxt);
+                    }
+                }
+                ModuleBase::timer::end("HSolverLCAO", "collect_pexsi_dm");
+                break;
+            }
+        }
+        MPI_Group_free(&pool_group);
+        k2d.unset_para_env();
+
+        auto _pes = dynamic_cast<elecstate::ElecStateLCAO<std::complex<double>>*>(pes);
+        pes->f_en.eband = pe.totalFreeEnergy;
+        _pes->dm2rho(pe.DM, pe.EDM, &dm);
+        ModuleBase::timer::end("HSolverLCAO", "pexsiKparSolve");
+        return true;
+    }
+};
+#endif
+} // namespace
+#endif
+
 template <typename TK, typename Device>
 void HSolverLCAO<TK, Device>::solve(hamilt::Hamilt<TK>* pHamilt,
                                    psi::Psi<TK>& psi,
@@ -113,14 +446,42 @@ void HSolverLCAO<TK, Device>::solve(hamilt::Hamilt<TK>* pHamilt,
     else if (this->method == "pexsi")
     {
 #ifdef __PEXSI // other purification methods should follow this routine
+        if (PexsiKParallelRunner<TK, Device>::run(ParaV,
+                                                  pHamilt,
+                                                  psi,
+                                                  pes,
+                                                  dm,
+                                                  PARAM.globalv.kpar_lcao,
+                                                  this->pexsi_reuse_contexts_ptr_))
+        {
+            ModuleBase::timer::end("HSolverLCAO", "solve");
+            return;
+        }
         DiagoPexsi<TK> pe(ParaV);
-        for (int ik = 0; ik < psi.get_nk(); ++ik)
+        const int pexsi_mu_loops = std::is_same<TK, std::complex<double>>::value
+                                       ? std::min(40, std::max(1, pexsi::PEXSI_Solver::pexsi_nmax))
+                                       : 1;
+        pe.begin_mu_search();
+        for (int imu = 0; imu < pexsi_mu_loops; ++imu)
         {
-            /// update H(k) for each k point
-            pHamilt->updateHk(ik);
-            psi.fix_k(ik);
-            // solve eigenvector and eigenvalue for H(k)
-            pe.diag(pHamilt, psi, nullptr);
+            pe.begin_k_loop();
+            for (int ik = 0; ik < psi.get_nk(); ++ik)
+            {
+                const double k_weight = (pes->klist != nullptr && ik < static_cast<int>(pes->klist->wk.size()))
+                                            ? pes->klist->wk[ik]
+                                            : 1.0;
+                const double pexsi_spin_weight = PARAM.inp.nspin == 1 ? 0.5 : 1.0;
+                pe.set_k_weight(ik, k_weight * pexsi_spin_weight);
+                /// update H(k) for each k point
+                pHamilt->updateHk(ik);
+                psi.fix_k(ik);
+                // solve eigenvector and eigenvalue for H(k)
+                pe.diag(pHamilt, psi, nullptr);
+            }
+            if (pe.finish_k_loop(PARAM.inp.nelec))
+            {
+                break;
+            }
         }
         auto _pes = dynamic_cast<elecstate::ElecStateLCAO<TK>*>(pes);
         pes->f_en.eband = pe.totalFreeEnergy;
@@ -475,4 +836,4 @@ void HSolverLCAO<T, Device>::parakSolve_cusolver(hamilt::Hamilt<T>* pHamilt,
 template class HSolverLCAO<double>;
 template class HSolverLCAO<std::complex<double>>;
 
-} // namespace hsolver
\ No newline at end of file
+} // namespace hsolver
diff --git a/source/source_hsolver/hsolver_lcao.h b/source/source_hsolver/hsolver_lcao.h
index 87ed6ac4c9d..6197bf2f133 100644
--- a/source/source_hsolver/hsolver_lcao.h
+++ b/source/source_hsolver/hsolver_lcao.h
@@ -8,6 +8,12 @@
 #include "source_estate/module_charge/charge.h" // mohan add 20251024
 #include "source_estate/module_dm/density_matrix.h" // mohan add 20251103
 
+#include <vector>
+
+#ifdef __PEXSI
+#include "module_pexsi/simple_pexsi.h"
+#endif
+
 namespace hsolver
 {
 
@@ -15,7 +21,17 @@ template <typename TK, typename Device = base_device::DEVICE_CPU>
 class HSolverLCAO
 {
   public:
+#ifdef __PEXSI
+    HSolverLCAO(const Parallel_Orbitals* ParaV_in,
+                std::string method_in,
+                std::vector<pexsi::PexsiComplexReuseContext>* pexsi_reuse_contexts_in = nullptr)
+        : ParaV(ParaV_in),
+          method(method_in),
+          pexsi_reuse_contexts_ptr_(pexsi_reuse_contexts_in != nullptr ? pexsi_reuse_contexts_in
+                                                                        : &owned_pexsi_reuse_contexts_) {};
+#else
     HSolverLCAO(const Parallel_Orbitals* ParaV_in, std::string method_in) : ParaV(ParaV_in), method(method_in) {};
+#endif
 
     void solve(hamilt::Hamilt<TK>* pHamilt,
                psi::Psi<TK>& psi,
@@ -38,6 +54,11 @@ class HSolverLCAO
     const Parallel_Orbitals* ParaV = nullptr;
     
     const std::string method;
+
+#ifdef __PEXSI
+    std::vector<pexsi::PexsiComplexReuseContext> owned_pexsi_reuse_contexts_;
+    std::vector<pexsi::PexsiComplexReuseContext>* pexsi_reuse_contexts_ptr_ = nullptr;
+#endif
 };
 
 } // namespace hsolver
diff --git a/source/source_hsolver/module_pexsi/dist_bcd_matrix.cpp b/source/source_hsolver/module_pexsi/dist_bcd_matrix.cpp
index 93a6dcb5b0d..8335b07c4d1 100644
--- a/source/source_hsolver/module_pexsi/dist_bcd_matrix.cpp
+++ b/source/source_hsolver/module_pexsi/dist_bcd_matrix.cpp
@@ -45,28 +45,35 @@ DistBCDMatrix::DistBCDMatrix(MPI_Comm comm,
         this->myproc = -1;
         this->myprow = -1;
         this->mypcol = -1;
+        this->nprocs = 0;
+        this->nprows = 0;
+        this->npcols = 0;
+        this->prowpcol2pnum = nullptr;
+        return;
     }
 
-    // synchronize matrix parameters to all processes, including those are not in bcd group
-    int myid_in_comm_world = 0;
-    MPI_Comm_rank(MPI_COMM_WORLD, &myid_in_comm_world);
-    if (myid_in_comm_world == 0)
+    // Synchronize matrix parameters inside the BCD communicator.  Older code used
+    // MPI_COMM_WORLD here, which prevents multiple independent k-point pools from
+    // constructing BCD descriptors concurrently.
+    int myid_in_comm = 0;
+    MPI_Comm_rank(comm, &myid_in_comm);
+    if (myid_in_comm == 0)
     {
         MPI_Comm_size(comm, &this->nprocs);
         int PARA_BCAST[4] = {this->nblk, this->nprocs, this->nprows, this->npcols};
-        MPI_Bcast(&PARA_BCAST[0], 4, MPI_INT, 0, MPI_COMM_WORLD);
+        MPI_Bcast(&PARA_BCAST[0], 4, MPI_INT, 0, comm);
     }
     else
     {
         int PARA_BCAST[4];
-        MPI_Bcast(&PARA_BCAST[0], 4, MPI_INT, 0, MPI_COMM_WORLD);
+        MPI_Bcast(&PARA_BCAST[0], 4, MPI_INT, 0, comm);
         this->nblk = PARA_BCAST[0];
         this->nprocs = PARA_BCAST[1];
         this->nprows = PARA_BCAST[2];
         this->npcols = PARA_BCAST[3];
     }
     this->prowpcol2pnum = new int[this->nprocs];
-    if (myid_in_comm_world == 0)
+    if (myid_in_comm == 0)
     {
         for (int i = 0; i < this->nprows; ++i)
         {
@@ -76,7 +83,7 @@ DistBCDMatrix::DistBCDMatrix(MPI_Comm comm,
             }
         }
     }
-    MPI_Bcast(this->prowpcol2pnum, this->nprocs, MPI_INT, 0, MPI_COMM_WORLD);
+    MPI_Bcast(this->prowpcol2pnum, this->nprocs, MPI_INT, 0, comm);
 }
 
 DistBCDMatrix::~DistBCDMatrix()
@@ -111,4 +118,4 @@ int DistBCDMatrix::pnum(const int prow, const int pcol)
     return this->prowpcol2pnum[prow * this->npcols + pcol];
 }
 } // namespace pexsi
-#endif
\ No newline at end of file
+#endif
diff --git a/source/source_hsolver/module_pexsi/dist_matrix_transformer.cpp b/source/source_hsolver/module_pexsi/dist_matrix_transformer.cpp
index 43f069c1238..8f1dd37df83 100644
--- a/source/source_hsolver/module_pexsi/dist_matrix_transformer.cpp
+++ b/source/source_hsolver/module_pexsi/dist_matrix_transformer.cpp
@@ -3,13 +3,19 @@
 
 #include <mpi.h>
 
+#include <algorithm>
 #include <climits>
 #include <cmath>
+#include <complex>
 #include <cstdlib>
 #include <iostream>
 #include <map>
 #include <vector>
 
+#ifdef _OPENMP
+#include <omp.h>
+#endif
+
 #include "dist_bcd_matrix.h"
 #include "dist_ccs_matrix.h"
 
@@ -137,6 +143,9 @@ inline void DistMatrixTransformer::buffer2CCSvalue(int nnzLocal,
                                                    double* buffer,
                                                    double* nzvalLocal)
 {
+#ifdef _OPENMP
+#pragma omp parallel for schedule(static) if (nnzLocal >= 4096)
+#endif
     for (int i = 0; i < nnzLocal; ++i)
     {
         nzvalLocal[i] = buffer[buffer2ccsIndex[i]];
@@ -147,7 +156,7 @@ inline void DistMatrixTransformer::countMatrixDistribution(int N, double* A, std
     for (int i = 0; i < N; ++i)
     {
         int key = 0;
-        if (fabs(A[i] < 1e-31))
+        if (fabs(A[i]) < 1e-31)
             key = -100;
         else
             key = floor(log10(fabs(A[i])));
@@ -172,6 +181,43 @@ inline int DistMatrixTransformer::getNonZeroIndex(char layout,
     rowidx.clear();
     if (layout == 'c')
     {
+#ifdef _OPENMP
+        const int total = nrow * ncol;
+        if (total >= 4096)
+        {
+            std::vector<unsigned char> nonzero(total, 0);
+#pragma omp parallel for schedule(static)
+            for (int linear = 0; linear < total; ++linear)
+            {
+                const int i = linear / nrow;
+                const int j = linear - i * nrow;
+                const int idx_local = i * nrow + j;
+                nonzero[linear] = (fabs(H_2d[idx_local]) > ZERO_Limit || fabs(S_2d[idx_local]) > ZERO_Limit) ? 1 : 0;
+            }
+
+            std::vector<int> prefix(total + 1, 0);
+            for (int linear = 0; linear < total; ++linear)
+            {
+                prefix[linear + 1] = prefix[linear] + static_cast<int>(nonzero[linear]);
+            }
+            nnz = prefix[total];
+            colidx.resize(nnz);
+            rowidx.resize(nnz);
+#pragma omp parallel for schedule(static)
+            for (int linear = 0; linear < total; ++linear)
+            {
+                if (nonzero[linear])
+                {
+                    const int i = linear / nrow;
+                    const int j = linear - i * nrow;
+                    const int pos = prefix[linear];
+                    colidx[pos] = i;
+                    rowidx[pos] = j;
+                }
+            }
+            return 0;
+        }
+#endif
         for (int i = 0; i < ncol; ++i)
         {
             for (int j = 0; j < nrow; ++j)
@@ -188,6 +234,43 @@ inline int DistMatrixTransformer::getNonZeroIndex(char layout,
     }
     else if (layout == 'r')
     {
+#ifdef _OPENMP
+        const int total = nrow * ncol;
+        if (total >= 4096)
+        {
+            std::vector<unsigned char> nonzero(total, 0);
+#pragma omp parallel for schedule(static)
+            for (int linear = 0; linear < total; ++linear)
+            {
+                const int i = linear / nrow;
+                const int j = linear - i * nrow;
+                const int idx_local = j * ncol + i;
+                nonzero[linear] = (fabs(H_2d[idx_local]) > ZERO_Limit || fabs(S_2d[idx_local]) > ZERO_Limit) ? 1 : 0;
+            }
+
+            std::vector<int> prefix(total + 1, 0);
+            for (int linear = 0; linear < total; ++linear)
+            {
+                prefix[linear + 1] = prefix[linear] + static_cast<int>(nonzero[linear]);
+            }
+            nnz = prefix[total];
+            colidx.resize(nnz);
+            rowidx.resize(nnz);
+#pragma omp parallel for schedule(static)
+            for (int linear = 0; linear < total; ++linear)
+            {
+                if (nonzero[linear])
+                {
+                    const int i = linear / nrow;
+                    const int j = linear - i * nrow;
+                    const int pos = prefix[linear];
+                    colidx[pos] = i;
+                    rowidx[pos] = j;
+                }
+            }
+            return 0;
+        }
+#endif
         for (int i = 0; i < ncol; ++i)
         {
             for (int j = 0; j < nrow; ++j)
@@ -209,6 +292,135 @@ inline int DistMatrixTransformer::getNonZeroIndex(char layout,
     return 0;
 }
 
+inline int DistMatrixTransformer::getNonZeroIndex(char layout,
+                                                  const int nrow,
+                                                  const int ncol,
+                                                  std::complex<double>* H_2d,
+                                                  std::complex<double>* S_2d,
+                                                  const double ZERO_Limit,
+                                                  int& nnz,
+                                                  std::vector<int>& rowidx,
+                                                  std::vector<int>& colidx)
+{
+    int idx = 0;
+    nnz = 0;
+    colidx.clear();
+    rowidx.clear();
+    if (layout == 'c' || layout == 'C')
+    {
+#ifdef _OPENMP
+        const int total = nrow * ncol;
+        if (total >= 4096)
+        {
+            std::vector<unsigned char> nonzero(total, 0);
+#pragma omp parallel for schedule(static)
+            for (int linear = 0; linear < total; ++linear)
+            {
+                const int i = linear / nrow;
+                const int j = linear - i * nrow;
+                const int idx_local = i * nrow + j;
+                nonzero[linear]
+                    = (std::abs(H_2d[idx_local]) > ZERO_Limit || std::abs(S_2d[idx_local]) > ZERO_Limit) ? 1 : 0;
+            }
+
+            std::vector<int> prefix(total + 1, 0);
+            for (int linear = 0; linear < total; ++linear)
+            {
+                prefix[linear + 1] = prefix[linear] + static_cast<int>(nonzero[linear]);
+            }
+            nnz = prefix[total];
+            colidx.resize(nnz);
+            rowidx.resize(nnz);
+#pragma omp parallel for schedule(static)
+            for (int linear = 0; linear < total; ++linear)
+            {
+                if (nonzero[linear])
+                {
+                    const int i = linear / nrow;
+                    const int j = linear - i * nrow;
+                    const int pos = prefix[linear];
+                    colidx[pos] = i;
+                    rowidx[pos] = j;
+                }
+            }
+            return 0;
+        }
+#endif
+        for (int i = 0; i < ncol; ++i)
+        {
+            for (int j = 0; j < nrow; ++j)
+            {
+                idx = i * nrow + j;
+                if (std::abs(H_2d[idx]) > ZERO_Limit || std::abs(S_2d[idx]) > ZERO_Limit)
+                {
+                    ++nnz;
+                    colidx.push_back(i);
+                    rowidx.push_back(j);
+                }
+            }
+        }
+    }
+    else if (layout == 'r' || layout == 'R')
+    {
+#ifdef _OPENMP
+        const int total = nrow * ncol;
+        if (total >= 4096)
+        {
+            std::vector<unsigned char> nonzero(total, 0);
+#pragma omp parallel for schedule(static)
+            for (int linear = 0; linear < total; ++linear)
+            {
+                const int i = linear / nrow;
+                const int j = linear - i * nrow;
+                const int idx_local = j * ncol + i;
+                nonzero[linear]
+                    = (std::abs(H_2d[idx_local]) > ZERO_Limit || std::abs(S_2d[idx_local]) > ZERO_Limit) ? 1 : 0;
+            }
+
+            std::vector<int> prefix(total + 1, 0);
+            for (int linear = 0; linear < total; ++linear)
+            {
+                prefix[linear + 1] = prefix[linear] + static_cast<int>(nonzero[linear]);
+            }
+            nnz = prefix[total];
+            colidx.resize(nnz);
+            rowidx.resize(nnz);
+#pragma omp parallel for schedule(static)
+            for (int linear = 0; linear < total; ++linear)
+            {
+                if (nonzero[linear])
+                {
+                    const int i = linear / nrow;
+                    const int j = linear - i * nrow;
+                    const int pos = prefix[linear];
+                    colidx[pos] = i;
+                    rowidx[pos] = j;
+                }
+            }
+            return 0;
+        }
+#endif
+        for (int i = 0; i < ncol; ++i)
+        {
+            for (int j = 0; j < nrow; ++j)
+            {
+                idx = j * ncol + i;
+                if (std::abs(H_2d[idx]) > ZERO_Limit || std::abs(S_2d[idx]) > ZERO_Limit)
+                {
+                    ++nnz;
+                    colidx.push_back(i);
+                    rowidx.push_back(j);
+                }
+            }
+        }
+    }
+    else
+    {
+        return 1;
+    }
+    return 0;
+}
+
 int DistMatrixTransformer::buildTransformParameter(DistBCDMatrix& SRC_Matrix,
                                                    DistCCSMatrix& DST_Matrix,
                                                    const int NPROC_TRANS,
@@ -225,14 +437,14 @@ int DistMatrixTransformer::buildTransformParameter(DistBCDMatrix& SRC_Matrix,
                                                    std::vector<int>& receiver_displacement_process,
                                                    std::vector<int>& buffer2ccsIndex)
 {
-    int myproc;
-    MPI_Comm_rank(MPI_COMM_WORLD, &myproc);
+    int myproc_trans = 0;
+    MPI_Comm_rank(COMM_TRANS, &myproc_trans);
     sender_size = nnz;
     std::fill(sender_size_process.begin(), sender_size_process.end(), 0);
     // create process id map from group_data to group_trans
     int nproc_data;
     std::vector<int> proc_map_data_trans;
-    if (myproc == 0)
+    if (myproc_trans == 0)
     {
         MPI_Group_size(DST_Matrix.get_group_data(), &nproc_data);
         MPI_Bcast(&nproc_data, 1, MPI_INT, 0, COMM_TRANS);
@@ -250,6 +462,36 @@ int DistMatrixTransformer::buildTransformParameter(DistBCDMatrix& SRC_Matrix,
         MPI_Bcast(&proc_map_data_trans[0], nproc_data, MPI_INT, 0, COMM_TRANS);
     }
 
+#ifdef _OPENMP
+    if (nnz >= 4096 && NPROC_TRANS > 1)
+    {
+        const int max_threads = omp_get_max_threads();
+        std::vector<std::vector<int>> thread_sender_size(max_threads, std::vector<int>(NPROC_TRANS, 0));
+#pragma omp parallel
+        {
+            const int tid = omp_get_thread_num();
+            std::vector<int>& local_sender_size = thread_sender_size[tid];
+#pragma omp for schedule(static)
+            for (int i = 0; i < nnz; ++i)
+            {
+                const int l_col = colidx[i];
+                const int g_col = SRC_Matrix.globalCol(l_col);
+                int dst_process = 0;
+                DST_Matrix.localCol(g_col, dst_process);
+                const int dst_process_trans = proc_map_data_trans[dst_process];
+                ++local_sender_size[dst_process_trans];
+            }
+        }
+        for (int tid = 0; tid < max_threads; ++tid)
+        {
+            for (int iproc = 0; iproc < NPROC_TRANS; ++iproc)
+            {
+                sender_size_process[iproc] += thread_sender_size[tid][iproc];
+            }
+        }
+    }
+    else
+#endif
     for (int i = 0; i < nnz; ++i)
     {
         int l_col = colidx[i];
@@ -279,6 +521,9 @@ int DistMatrixTransformer::buildTransformParameter(DistBCDMatrix& SRC_Matrix,
     // setup receiver index
     // setup sender_index
     std::vector<int> sender_index(sender_size);
+#ifdef _OPENMP
+#pragma omp parallel for schedule(static) if (nnz >= 4096)
+#endif
     for (int i = 0; i < nnz; ++i)
     {
         int l_col = colidx[i];
@@ -292,13 +537,13 @@ int DistMatrixTransformer::buildTransformParameter(DistBCDMatrix& SRC_Matrix,
 
     // transfer index to receiver
     std::vector<int> receiver_index(receiver_size);
-    MPI_Alltoallv(&sender_index[0],
-                  &sender_size_process[0],
-                  &sender_displacement_process[0],
+    MPI_Alltoallv(sender_index.data(),
+                  sender_size_process.data(),
+                  sender_displacement_process.data(),
                   MPI_INT,
-                  &receiver_index[0],
-                  &receiver_size_process[0],
-                  &receiver_displacement_process[0],
+                  receiver_index.data(),
+                  receiver_size_process.data(),
+                  receiver_displacement_process.data(),
                   MPI_INT,
                   COMM_TRANS);
 
@@ -309,9 +554,9 @@ int DistMatrixTransformer::buildTransformParameter(DistBCDMatrix& SRC_Matrix,
                       NPROC_TRANS,
                       receiver_size_process,
                       receiver_displacement_process,
-                      &receiver_index[0],
+                      receiver_index.data(),
                       DST_Matrix,
-                      &buffer2ccsIndex[0]);
+                      buffer2ccsIndex.data());
     return 0;
 }
 
@@ -323,7 +568,7 @@ int DistMatrixTransformer::newGroupCommTrans(DistBCDMatrix& SRC_Matrix,
     // build transfortram communicator which contains both processes of BCD processors and
     // CCS processors with nonzero elements
     MPI_Group_union(DST_Matrix.get_group_data(), SRC_Matrix.get_group(), &GROUP_TRANS);
-    MPI_Comm_create(MPI_COMM_WORLD, GROUP_TRANS, &COMM_TRANS);
+    MPI_Comm_create(SRC_Matrix.get_comm(), GROUP_TRANS, &COMM_TRANS);
     return 0;
 }
 
@@ -337,10 +582,8 @@ int DistMatrixTransformer::deleteGroupCommTrans(MPI_Group& GROUP_TRANS, MPI_Comm
     return 0;
 }
 
-// transform two sparse matrices from block cyclic distribution (BCD) to Compressed Column Storage (CCS) distribution
-// two destination matrices share the same non-zero elements positions
-// if either of two elements in source matrices is non-zeros, the elements in the destination matrices are non-zero,
-// even if one of them is acturely zero All matrices must have same MPI communicator
+// Transform two sparse matrices from BCD to CCS with a shared sparse pattern.  H and S are packed together so that the
+// value redistribution needs one MPI_Alltoallv instead of one collective per matrix.
 int DistMatrixTransformer::transformBCDtoCCS(DistBCDMatrix& SRC_Matrix,
                                              double* H_2d,
                                              double* S_2d,
@@ -397,70 +640,242 @@ int DistMatrixTransformer::transformBCDtoCCS(DistBCDMatrix& SRC_Matrix,
                                 receiver_size_process,
                                 receiver_displacement_process,
                                 buffer2ccsIndex);
-// Do transformation
-        std::vector<double> sender_buffer(sender_size);
-        std::vector<double> receiver_buffer(receiver_size);
-        // put H to sender buffer
+        int myproc_trans = 0;
+        MPI_Comm_rank(COMM_TRANS, &myproc_trans);
+        int nproc_data;
+        std::vector<int> proc_map_data_trans;
+        if (myproc_trans == 0)
+        {
+            MPI_Group_size(DST_Matrix.get_group_data(), &nproc_data);
+            MPI_Bcast(&nproc_data, 1, MPI_INT, 0, COMM_TRANS);
+            proc_map_data_trans.resize(nproc_data, 0);
+            for (int i = 0; i < nproc_data; ++i)
+            {
+                MPI_Group_translate_ranks(DST_Matrix.get_group_data(), 1, &i, GROUP_TRANS, &proc_map_data_trans[i]);
+            }
+            MPI_Bcast(proc_map_data_trans.data(), nproc_data, MPI_INT, 0, COMM_TRANS);
+        }
+        else
+        {
+            MPI_Bcast(&nproc_data, 1, MPI_INT, 0, COMM_TRANS);
+            proc_map_data_trans.resize(nproc_data, 0);
+            MPI_Bcast(proc_map_data_trans.data(), nproc_data, MPI_INT, 0, COMM_TRANS);
+        }
+        std::vector<int> next_sender_position(sender_displacement_process);
+        std::vector<double> sender_buffer(2 * sender_size);
         if (SRC_Matrix.get_layout() == 'R' || SRC_Matrix.get_layout() == 'r')
         {
             for (int i = 0; i < sender_size; ++i)
             {
-                sender_buffer[i] = H_2d[rowidx[i] * SRC_Matrix.get_ncol() + colidx[i]];
+                const int idx = rowidx[i] * SRC_Matrix.get_ncol() + colidx[i];
+                int dst_process;
+                DST_Matrix.localCol(SRC_Matrix.globalCol(colidx[i]), dst_process);
+                const int pos = next_sender_position[proc_map_data_trans[dst_process]]++;
+                sender_buffer[2 * pos] = H_2d[idx];
+                sender_buffer[2 * pos + 1] = S_2d[idx];
             }
         }
         else
         {
             for (int i = 0; i < sender_size; ++i)
             {
-                sender_buffer[i] = H_2d[colidx[i] * SRC_Matrix.get_nrow() + rowidx[i]];
+                const int idx = colidx[i] * SRC_Matrix.get_nrow() + rowidx[i];
+                int dst_process;
+                DST_Matrix.localCol(SRC_Matrix.globalCol(colidx[i]), dst_process);
+                const int pos = next_sender_position[proc_map_data_trans[dst_process]]++;
+                sender_buffer[2 * pos] = H_2d[idx];
+                sender_buffer[2 * pos + 1] = S_2d[idx];
             }
         }
-        // do all2all transformation
-        MPI_Alltoallv(&sender_buffer[0],
-                      &sender_size_process[0],
-                      &sender_displacement_process[0],
+
+        std::vector<int> sender_value_size_process(sender_size_process);
+        std::vector<int> sender_value_displacement_process(sender_displacement_process);
+        std::vector<int> receiver_value_size_process(receiver_size_process);
+        std::vector<int> receiver_value_displacement_process(receiver_displacement_process);
+        for (int i = 0; i < NPROC_TRANS; ++i)
+        {
+            sender_value_size_process[i] *= 2;
+            sender_value_displacement_process[i] *= 2;
+            receiver_value_size_process[i] *= 2;
+            receiver_value_displacement_process[i] *= 2;
+        }
+
+        std::vector<double> receiver_buffer(2 * receiver_size);
+        MPI_Alltoallv(sender_buffer.data(),
+                      sender_value_size_process.data(),
+                      sender_value_displacement_process.data(),
                       MPI_DOUBLE,
-                      &receiver_buffer[0],
-                      &receiver_size_process[0],
-                      &receiver_displacement_process[0],
+                      receiver_buffer.data(),
+                      receiver_value_size_process.data(),
+                      receiver_value_displacement_process.data(),
                       MPI_DOUBLE,
                       COMM_TRANS);
-// collect H from receiver buffer
+
         delete[] H_ccs;
         H_ccs = new double[receiver_size];
-        buffer2CCSvalue(receiver_size, &buffer2ccsIndex[0], &receiver_buffer[0], H_ccs);
+        delete[] S_ccs;
+        S_ccs = new double[receiver_size];
+#ifdef _OPENMP
+#pragma omp parallel for schedule(static) if (receiver_size >= 4096)
+#endif
+        for (int i = 0; i < receiver_size; ++i)
+        {
+            const int buffer_index = buffer2ccsIndex[i];
+            H_ccs[i] = receiver_buffer[2 * buffer_index];
+            S_ccs[i] = receiver_buffer[2 * buffer_index + 1];
+        }
+    }
+    // clear and return
+    deleteGroupCommTrans(GROUP_TRANS, COMM_TRANS);
+    return 0;
+}
+
+int DistMatrixTransformer::transformBCDtoCCS(DistBCDMatrix& SRC_Matrix,
+                                             std::complex<double>* H_2d,
+                                             std::complex<double>* S_2d,
+                                             const double ZERO_Limit,
+                                             DistCCSMatrix& DST_Matrix,
+                                             std::complex<double>*& H_ccs,
+                                             std::complex<double>*& S_ccs)
+{
+    MPI_Group GROUP_TRANS;
+    MPI_Comm COMM_TRANS = MPI_COMM_NULL;
+    newGroupCommTrans(SRC_Matrix, DST_Matrix, GROUP_TRANS, COMM_TRANS);
+    if (COMM_TRANS != MPI_COMM_NULL)
+    {
+        int NPROC_TRANS;
+        MPI_Comm_size(COMM_TRANS, &NPROC_TRANS);
+        int sender_size;
+        std::vector<int> sender_size_process(NPROC_TRANS);
+        std::vector<int> sender_displacement_process(NPROC_TRANS);
+        int receiver_size;
+        std::vector<int> receiver_size_process(NPROC_TRANS);
+        std::vector<int> receiver_displacement_process(NPROC_TRANS);
+
+        std::vector<int> rowidx;
+        std::vector<int> colidx;
+        int nnz = 0;
+        if (SRC_Matrix.get_comm() != MPI_COMM_NULL)
+        {
+            getNonZeroIndex(SRC_Matrix.get_layout(),
+                            SRC_Matrix.get_nrow(),
+                            SRC_Matrix.get_ncol(),
+                            H_2d,
+                            S_2d,
+                            ZERO_Limit,
+                            nnz,
+                            rowidx,
+                            colidx);
+        }
 
-        // put S to sender buffer
+        std::vector<int> buffer2ccsIndex;
+        buildTransformParameter(SRC_Matrix,
+                                DST_Matrix,
+                                NPROC_TRANS,
+                                GROUP_TRANS,
+                                COMM_TRANS,
+                                nnz,
+                                rowidx,
+                                colidx,
+                                sender_size,
+                                sender_size_process,
+                                sender_displacement_process,
+                                receiver_size,
+                                receiver_size_process,
+                                receiver_displacement_process,
+                                buffer2ccsIndex);
+
+        int myproc_trans = 0;
+        MPI_Comm_rank(COMM_TRANS, &myproc_trans);
+        int nproc_data;
+        std::vector<int> proc_map_data_trans;
+        if (myproc_trans == 0)
+        {
+            MPI_Group_size(DST_Matrix.get_group_data(), &nproc_data);
+            MPI_Bcast(&nproc_data, 1, MPI_INT, 0, COMM_TRANS);
+            proc_map_data_trans.resize(nproc_data, 0);
+            for (int i = 0; i < nproc_data; ++i)
+            {
+                MPI_Group_translate_ranks(DST_Matrix.get_group_data(), 1, &i, GROUP_TRANS, &proc_map_data_trans[i]);
+            }
+            MPI_Bcast(proc_map_data_trans.data(), nproc_data, MPI_INT, 0, COMM_TRANS);
+        }
+        else
+        {
+            MPI_Bcast(&nproc_data, 1, MPI_INT, 0, COMM_TRANS);
+            proc_map_data_trans.resize(nproc_data, 0);
+            MPI_Bcast(proc_map_data_trans.data(), nproc_data, MPI_INT, 0, COMM_TRANS);
+        }
+        std::vector<int> next_sender_position(sender_displacement_process);
+        std::vector<double> sender_buffer(4 * sender_size);
         if (SRC_Matrix.get_layout() == 'R' || SRC_Matrix.get_layout() == 'r')
         {
             for (int i = 0; i < sender_size; ++i)
             {
-                sender_buffer[i] = S_2d[rowidx[i] * SRC_Matrix.get_ncol() + colidx[i]];
+                const int idx = rowidx[i] * SRC_Matrix.get_ncol() + colidx[i];
+                int dst_process;
+                DST_Matrix.localCol(SRC_Matrix.globalCol(colidx[i]), dst_process);
+                const int pos = next_sender_position[proc_map_data_trans[dst_process]]++;
+                sender_buffer[4 * pos] = H_2d[idx].real();
+                sender_buffer[4 * pos + 1] = H_2d[idx].imag();
+                sender_buffer[4 * pos + 2] = S_2d[idx].real();
+                sender_buffer[4 * pos + 3] = S_2d[idx].imag();
             }
         }
         else
         {
             for (int i = 0; i < sender_size; ++i)
             {
-                sender_buffer[i] = S_2d[colidx[i] * SRC_Matrix.get_nrow() + rowidx[i]];
+                const int idx = colidx[i] * SRC_Matrix.get_nrow() + rowidx[i];
+                int dst_process;
+                DST_Matrix.localCol(SRC_Matrix.globalCol(colidx[i]), dst_process);
+                const int pos = next_sender_position[proc_map_data_trans[dst_process]]++;
+                sender_buffer[4 * pos] = H_2d[idx].real();
+                sender_buffer[4 * pos + 1] = H_2d[idx].imag();
+                sender_buffer[4 * pos + 2] = S_2d[idx].real();
+                sender_buffer[4 * pos + 3] = S_2d[idx].imag();
             }
         }
-        // do all2all transformation
-        MPI_Alltoallv(&sender_buffer[0],
-                      &sender_size_process[0],
-                      &sender_displacement_process[0],
+
+        std::vector<int> sender_value_size_process(sender_size_process);
+        std::vector<int> sender_value_displacement_process(sender_displacement_process);
+        std::vector<int> receiver_value_size_process(receiver_size_process);
+        std::vector<int> receiver_value_displacement_process(receiver_displacement_process);
+        for (int i = 0; i < NPROC_TRANS; ++i)
+        {
+            sender_value_size_process[i] *= 4;
+            sender_value_displacement_process[i] *= 4;
+            receiver_value_size_process[i] *= 4;
+            receiver_value_displacement_process[i] *= 4;
+        }
+
+        std::vector<double> receiver_buffer(4 * receiver_size);
+        MPI_Alltoallv(sender_buffer.data(),
+                      sender_value_size_process.data(),
+                      sender_value_displacement_process.data(),
                       MPI_DOUBLE,
-                      &receiver_buffer[0],
-                      &receiver_size_process[0],
-                      &receiver_displacement_process[0],
+                      receiver_buffer.data(),
+                      receiver_value_size_process.data(),
+                      receiver_value_displacement_process.data(),
                       MPI_DOUBLE,
                       COMM_TRANS);
-// collect S from receiver buffer
+
+        delete[] H_ccs;
+        H_ccs = new std::complex<double>[receiver_size];
         delete[] S_ccs;
-        S_ccs = new double[receiver_size];
-        buffer2CCSvalue(receiver_size, &buffer2ccsIndex[0], &receiver_buffer[0], S_ccs);
+        S_ccs = new std::complex<double>[receiver_size];
+#ifdef _OPENMP
+#pragma omp parallel for schedule(static) if (receiver_size >= 4096)
+#endif
+        for (int i = 0; i < receiver_size; ++i)
+        {
+            const int buffer_index = buffer2ccsIndex[i];
+            H_ccs[i] = std::complex<double>(receiver_buffer[4 * buffer_index],
+                                            receiver_buffer[4 * buffer_index + 1]);
+            S_ccs[i] = std::complex<double>(receiver_buffer[4 * buffer_index + 2],
+                                            receiver_buffer[4 * buffer_index + 3]);
+        }
     }
-    // clear and return
     deleteGroupCommTrans(GROUP_TRANS, COMM_TRANS);
     return 0;
 }
@@ -474,14 +889,15 @@ int DistMatrixTransformer::transformCCStoBCD(DistCCSMatrix& SRC_Matrix,
                                              double* DM,
                                              double* EDM)
 {
-    int myproc;
-    MPI_Comm_rank(MPI_COMM_WORLD, &myproc);
     MPI_Group GROUP_TRANS;
     MPI_Comm COMM_TRANS = MPI_COMM_NULL;
     newGroupCommTrans(DST_Matrix, SRC_Matrix, GROUP_TRANS, COMM_TRANS);
     if (COMM_TRANS != MPI_COMM_NULL)
     {
         // init DM and EDM with 0
+#ifdef _OPENMP
+#pragma omp parallel for schedule(static) if (DST_Matrix.get_nrow() * DST_Matrix.get_ncol() >= 4096)
+#endif
         for (int i = 0; i < DST_Matrix.get_nrow() * DST_Matrix.get_ncol(); ++i)
         {
             DM[i] = 0;
@@ -669,6 +1085,9 @@ int DistMatrixTransformer::transformCCStoBCD(DistCCSMatrix& SRC_Matrix,
 
         // transfer DM
         // set up DM sender buffer
+#ifdef _OPENMP
+#pragma omp parallel for schedule(static) if (sender_size >= 4096)
+#endif
         for (int i = 0; i < sender_size; ++i)
         {
             sender_buffer[i] = DMnzvalLocal[sender_index[i]];
@@ -689,6 +1108,9 @@ int DistMatrixTransformer::transformCCStoBCD(DistCCSMatrix& SRC_Matrix,
         if (DST_Matrix.get_layout() == 'R' || DST_Matrix.get_layout() == 'r')
         {
             int DST_Matrix_elem = DST_Matrix.get_nrow() * DST_Matrix.get_ncol();
+#ifdef _OPENMP
+#pragma omp parallel for schedule(static) if (receiver_size >= 4096)
+#endif
             for (int i = 0; i < receiver_size; ++i)
             {
                 int ix = receiver_index[2 * i];
@@ -700,6 +1122,9 @@ int DistMatrixTransformer::transformCCStoBCD(DistCCSMatrix& SRC_Matrix,
         else
         {
             int DST_Matrix_elem = DST_Matrix.get_nrow() * DST_Matrix.get_ncol();
+#ifdef _OPENMP
+#pragma omp parallel for schedule(static) if (receiver_size >= 4096)
+#endif
             for (int i = 0; i < receiver_size; ++i)
             {
                 int ix = receiver_index[2 * i];
@@ -710,6 +1135,9 @@ int DistMatrixTransformer::transformCCStoBCD(DistCCSMatrix& SRC_Matrix,
         }
 
         // setup up sender buffer of EDM
+#ifdef _OPENMP
+#pragma omp parallel for schedule(static) if (sender_size >= 4096)
+#endif
         for (int i = 0; i < sender_size; ++i)
         {
             sender_buffer[i] = EDMnzvalLocal[sender_index[i]];
@@ -730,6 +1158,9 @@ int DistMatrixTransformer::transformCCStoBCD(DistCCSMatrix& SRC_Matrix,
         if (DST_Matrix.get_layout() == 'R' || DST_Matrix.get_layout() == 'r')
         {
             int DST_Matrix_elem = DST_Matrix.get_nrow() * DST_Matrix.get_ncol();
+#ifdef _OPENMP
+#pragma omp parallel for schedule(static) if (receiver_size >= 4096)
+#endif
             for (int i = 0; i < receiver_size; ++i)
             {
                 int ix = receiver_index[2 * i];
@@ -741,6 +1172,9 @@ int DistMatrixTransformer::transformCCStoBCD(DistCCSMatrix& SRC_Matrix,
         else
         {
             int DST_Matrix_elem = DST_Matrix.get_nrow() * DST_Matrix.get_ncol();
+#ifdef _OPENMP
+#pragma omp parallel for schedule(static) if (receiver_size >= 4096)
+#endif
             for (int i = 0; i < receiver_size; ++i)
             {
                 int ix = receiver_index[2 * i];
@@ -761,5 +1195,211 @@ int DistMatrixTransformer::transformCCStoBCD(DistCCSMatrix& SRC_Matrix,
     return 0;
 }
 
+int DistMatrixTransformer::transformCCStoBCD(DistCCSMatrix& SRC_Matrix,
+                                             std::complex<double>* DMnzvalLocal,
+                                             std::complex<double>* EDMnzvalLocal,
+                                             DistBCDMatrix& DST_Matrix,
+                                             std::complex<double>* DM,
+                                             std::complex<double>* EDM)
+{
+    MPI_Group GROUP_TRANS;
+    MPI_Comm COMM_TRANS = MPI_COMM_NULL;
+    newGroupCommTrans(DST_Matrix, SRC_Matrix, GROUP_TRANS, COMM_TRANS);
+    if (COMM_TRANS != MPI_COMM_NULL)
+    {
+        const int dst_size = DST_Matrix.get_nrow() * DST_Matrix.get_ncol();
+#ifdef _OPENMP
+#pragma omp parallel for schedule(static) if (dst_size >= 4096)
+#endif
+        for (int i = 0; i < dst_size; ++i)
+        {
+            DM[i] = std::complex<double>(0.0, 0.0);
+            EDM[i] = std::complex<double>(0.0, 0.0);
+        }
+
+        int NPROC_TRANS;
+        MPI_Comm_size(COMM_TRANS, &NPROC_TRANS);
+        std::vector<int> sender_size_process(NPROC_TRANS, 0);
+        std::vector<int> sender_displacement_process(NPROC_TRANS, 0);
+        std::vector<int> receiver_size_process(NPROC_TRANS, 0);
+        std::vector<int> receiver_displacement_process(NPROC_TRANS, 0);
+
+        int nproc_bcd = 0;
+        std::vector<int> proc_map_bcd_trans;
+        int myproc_trans = 0;
+        MPI_Comm_rank(COMM_TRANS, &myproc_trans);
+        if (myproc_trans == 0)
+        {
+            MPI_Group_size(DST_Matrix.get_group(), &nproc_bcd);
+            MPI_Bcast(&nproc_bcd, 1, MPI_INT, 0, COMM_TRANS);
+            proc_map_bcd_trans.resize(nproc_bcd, 0);
+            for (int i = 0; i < nproc_bcd; ++i)
+            {
+                MPI_Group_translate_ranks(DST_Matrix.get_group(), 1, &i, GROUP_TRANS, &proc_map_bcd_trans[i]);
+            }
+            MPI_Bcast(proc_map_bcd_trans.data(), nproc_bcd, MPI_INT, 0, COMM_TRANS);
+        }
+        else
+        {
+            MPI_Bcast(&nproc_bcd, 1, MPI_INT, 0, COMM_TRANS);
+            proc_map_bcd_trans.resize(nproc_bcd, 0);
+            MPI_Bcast(proc_map_bcd_trans.data(), nproc_bcd, MPI_INT, 0, COMM_TRANS);
+        }
+
+        for (int icol = 0; icol < SRC_Matrix.get_numcol_local(); ++icol)
+        {
+            const int g_col = SRC_Matrix.globalCol(icol);
+            int recv_pcol_bcd = 0;
+            DST_Matrix.localCol(g_col, recv_pcol_bcd);
+            for (int rowidx = SRC_Matrix.get_colptr_local()[icol] - 1;
+                 rowidx < SRC_Matrix.get_colptr_local()[icol + 1] - 1;
+                 ++rowidx)
+            {
+                const int g_row = SRC_Matrix.get_rowind_local()[rowidx] - 1;
+                int recv_prow_bcd = 0;
+                DST_Matrix.localRow(g_row, recv_prow_bcd);
+                const int recv_proc_bcd = DST_Matrix.pnum(recv_prow_bcd, recv_pcol_bcd);
+                const int recv_proc = proc_map_bcd_trans[recv_proc_bcd];
+                ++sender_size_process[recv_proc];
+            }
+        }
+
+        MPI_Alltoall(sender_size_process.data(),
+                     1,
+                     MPI_INT,
+                     receiver_size_process.data(),
+                     1,
+                     MPI_INT,
+                     COMM_TRANS);
+
+        int receiver_size = receiver_size_process[0];
+        for (int i = 1; i < NPROC_TRANS; ++i)
+        {
+            sender_displacement_process[i] = sender_displacement_process[i - 1] + sender_size_process[i - 1];
+            receiver_displacement_process[i] = receiver_displacement_process[i - 1] + receiver_size_process[i - 1];
+            receiver_size += receiver_size_process[i];
+        }
+
+        const int sender_size = SRC_Matrix.get_nnzlocal();
+        std::vector<int> sender_index(std::max(sender_size, 1), -1);
+        std::vector<int> dst_index(std::max(2 * sender_size, 2), -1);
+        std::vector<int> p(sender_displacement_process);
+        int idx = 0;
+        for (int icol = 0; icol < SRC_Matrix.get_numcol_local(); ++icol)
+        {
+            const int g_col = SRC_Matrix.globalCol(icol);
+            int recv_pcol_bcd = 0;
+            const int recv_col = DST_Matrix.localCol(g_col, recv_pcol_bcd);
+            for (int rowidx = SRC_Matrix.get_colptr_local()[icol] - 1;
+                 rowidx < SRC_Matrix.get_colptr_local()[icol + 1] - 1;
+                 ++rowidx)
+            {
+                const int g_row = SRC_Matrix.get_rowind_local()[rowidx] - 1;
+                int recv_prow_bcd = 0;
+                const int recv_row = DST_Matrix.localRow(g_row, recv_prow_bcd);
+                const int recv_proc_bcd = DST_Matrix.pnum(recv_prow_bcd, recv_pcol_bcd);
+                const int recv_proc = proc_map_bcd_trans[recv_proc_bcd];
+                sender_index[p[recv_proc]] = idx;
+                dst_index[2 * p[recv_proc]] = recv_row;
+                dst_index[2 * p[recv_proc] + 1] = recv_col;
+                ++p[recv_proc];
+                ++idx;
+            }
+        }
+
+        std::vector<int> sender_index_size_process(sender_size_process);
+        std::vector<int> sender_index_displacement_process(sender_displacement_process);
+        std::vector<int> receiver_index_size_process(receiver_size_process);
+        std::vector<int> receiver_index_displacement_process(receiver_displacement_process);
+        for (int i = 0; i < NPROC_TRANS; ++i)
+        {
+            sender_index_size_process[i] *= 2;
+            sender_index_displacement_process[i] *= 2;
+            receiver_index_size_process[i] *= 2;
+            receiver_index_displacement_process[i] *= 2;
+        }
+
+        std::vector<int> receiver_index(std::max(2 * receiver_size, 2), -1);
+        MPI_Alltoallv(dst_index.data(),
+                      sender_index_size_process.data(),
+                      sender_index_displacement_process.data(),
+                      MPI_INT,
+                      receiver_index.data(),
+                      receiver_index_size_process.data(),
+                      receiver_index_displacement_process.data(),
+                      MPI_INT,
+                      COMM_TRANS);
+
+        std::vector<double> sender_buffer(std::max(4 * sender_size, 1), 0.0);
+#ifdef _OPENMP
+#pragma omp parallel for schedule(static) if (sender_size >= 4096)
+#endif
+        for (int i = 0; i < sender_size; ++i)
+        {
+            const int src_index = sender_index[i];
+            const std::complex<double> dm_value = DMnzvalLocal[src_index];
+            const std::complex<double> edm_value = EDMnzvalLocal[src_index];
+            sender_buffer[4 * i] = dm_value.real();
+            sender_buffer[4 * i + 1] = dm_value.imag();
+            sender_buffer[4 * i + 2] = edm_value.real();
+            sender_buffer[4 * i + 3] = edm_value.imag();
+        }
+
+        std::vector<int> sender_value_size_process(sender_size_process);
+        std::vector<int> sender_value_displacement_process(sender_displacement_process);
+        std::vector<int> receiver_value_size_process(receiver_size_process);
+        std::vector<int> receiver_value_displacement_process(receiver_displacement_process);
+        for (int i = 0; i < NPROC_TRANS; ++i)
+        {
+            sender_value_size_process[i] *= 4;
+            sender_value_displacement_process[i] *= 4;
+            receiver_value_size_process[i] *= 4;
+            receiver_value_displacement_process[i] *= 4;
+        }
+
+        std::vector<double> receiver_buffer(std::max(4 * receiver_size, 1), 0.0);
+        MPI_Alltoallv(sender_buffer.data(),
+                      sender_value_size_process.data(),
+                      sender_value_displacement_process.data(),
+                      MPI_DOUBLE,
+                      receiver_buffer.data(),
+                      receiver_value_size_process.data(),
+                      receiver_value_displacement_process.data(),
+                      MPI_DOUBLE,
+                      COMM_TRANS);
+
+        if (DST_Matrix.get_layout() == 'R' || DST_Matrix.get_layout() == 'r')
+        {
+#ifdef _OPENMP
+#pragma omp parallel for schedule(static) if (receiver_size >= 4096)
+#endif
+            for (int i = 0; i < receiver_size; ++i)
+            {
+                const int ix = receiver_index[2 * i];
+                const int iy = receiver_index[2 * i + 1];
+                const int dst_index = ix * DST_Matrix.get_ncol() + iy;
+                DM[dst_index] = std::complex<double>(receiver_buffer[4 * i], -receiver_buffer[4 * i + 1]);
+                EDM[dst_index] = std::complex<double>(receiver_buffer[4 * i + 2], -receiver_buffer[4 * i + 3]);
+            }
+        }
+        else
+        {
+#ifdef _OPENMP
+#pragma omp parallel for schedule(static) if (receiver_size >= 4096)
+#endif
+            for (int i = 0; i < receiver_size; ++i)
+            {
+                const int ix = receiver_index[2 * i];
+                const int iy = receiver_index[2 * i + 1];
+                const int dst_index = iy * DST_Matrix.get_nrow() + ix;
+                DM[dst_index] = std::complex<double>(receiver_buffer[4 * i], -receiver_buffer[4 * i + 1]);
+                EDM[dst_index] = std::complex<double>(receiver_buffer[4 * i + 2], -receiver_buffer[4 * i + 3]);
+            }
+        }
+    }
+    deleteGroupCommTrans(GROUP_TRANS, COMM_TRANS);
+    return 0;
+}
+
 } // namespace pexsi
-#endif
\ No newline at end of file
+#endif
diff --git a/source/source_hsolver/module_pexsi/dist_matrix_transformer.h b/source/source_hsolver/module_pexsi/dist_matrix_transformer.h
index 672b22f4f32..916c8f332bb 100644
--- a/source/source_hsolver/module_pexsi/dist_matrix_transformer.h
+++ b/source/source_hsolver/module_pexsi/dist_matrix_transformer.h
@@ -2,6 +2,7 @@
 #define DISTMATRIXTRANSFORMER_H
 
 #include <mpi.h>
+#include <complex>
 #include <map>
 #include <vector>
 // transform a sparse matrix from block cyclic distribution (BCD) to Compressed Column Storage (CCS) distribution
@@ -49,6 +50,16 @@ int getNonZeroIndex(char layout,
                     std::vector<int>& rowidx,
                     std::vector<int>& colidx);
 
+int getNonZeroIndex(char layout,
+                    const int nrow,
+                    const int ncol,
+                    std::complex<double>* H_2d,
+                    std::complex<double>* S_2d,
+                    const double ZERO_Limit,
+                    int& nnz,
+                    std::vector<int>& rowidx,
+                    std::vector<int>& colidx);
+
 int buildTransformParameter(DistBCDMatrix& SRC_Matrix,
                             DistCCSMatrix& DST_Matrix,
                             const int NPROC_TRANS,
@@ -80,12 +91,27 @@ int transformBCDtoCCS(DistBCDMatrix& SRC_Matrix,
                       double*& H_ccs,
                       double*& S_ccs);
 
+int transformBCDtoCCS(DistBCDMatrix& SRC_Matrix,
+                      std::complex<double>* H_2d,
+                      std::complex<double>* S_2d,
+                      const double ZERO_Limit,
+                      DistCCSMatrix& DST_Matrix,
+                      std::complex<double>*& H_ccs,
+                      std::complex<double>*& S_ccs);
+
 int transformCCStoBCD(DistCCSMatrix& SRC_Matrix,
                       double* DMnzvalLocal,
                       double* ENDnzvalLocal,
                       DistBCDMatrix& DST_Matrix,
                       double* DM_2d,
                       double* ED_2d);
+
+int transformCCStoBCD(DistCCSMatrix& SRC_Matrix,
+                      std::complex<double>* DMnzvalLocal,
+                      std::complex<double>* EDMnzvalLocal,
+                      DistBCDMatrix& DST_Matrix,
+                      std::complex<double>* DM_2d,
+                      std::complex<double>* ED_2d);
 }; // namespace DistMatrixTransformer
 } // namespace pexsi
-#endif // DISTMATRIXTRANSFORMER_H
\ No newline at end of file
+#endif // DISTMATRIXTRANSFORMER_H
diff --git a/source/source_hsolver/module_pexsi/pexsi_solver.cpp b/source/source_hsolver/module_pexsi/pexsi_solver.cpp
index 1019090c483..4b95bc7564a 100644
--- a/source/source_hsolver/module_pexsi/pexsi_solver.cpp
+++ b/source/source_hsolver/module_pexsi/pexsi_solver.cpp
@@ -98,25 +98,25 @@ int PEXSI_Solver::solve(double mu0)
     return 0;
 }
 
-const double PEXSI_Solver::get_totalFreeEnergy() const
+double PEXSI_Solver::get_totalFreeEnergy() const
 {
     return totalFreeEnergy;
 }
 
-const double PEXSI_Solver::get_totalEnergyH() const
+double PEXSI_Solver::get_totalEnergyH() const
 {
     return totalEnergyH;
 }
 
-const double PEXSI_Solver::get_totalEnergyS() const
+double PEXSI_Solver::get_totalEnergyS() const
 {
     return totalEnergyS;
 }
 
-const double PEXSI_Solver::get_mu() const
+double PEXSI_Solver::get_mu() const
 {
     return mu;
 }
 
 } // namespace pexsi
-#endif
\ No newline at end of file
+#endif
diff --git a/source/source_hsolver/module_pexsi/pexsi_solver.h b/source/source_hsolver/module_pexsi/pexsi_solver.h
index 922f1b9fb3d..f47ffe99202 100644
--- a/source/source_hsolver/module_pexsi/pexsi_solver.h
+++ b/source/source_hsolver/module_pexsi/pexsi_solver.h
@@ -1,11 +1,13 @@
 #ifndef PEXSI_Solver_H
 #define PEXSI_Solver_H
 
+#include "pexsi_solver_interface.h"
+
 #include <vector>
 
 namespace pexsi
 {
-class PEXSI_Solver
+class PEXSI_Solver : public IPexsiSolver
 {
   public:
     void prepare(const int blacs_text,
@@ -15,12 +17,12 @@ class PEXSI_Solver
                  const double* h,
                  const double* s,
                  double*& DM,
-                 double*& EDM);
-    int solve(double mu0);
-    const double get_totalFreeEnergy() const;
-    const double get_totalEnergyH() const;
-    const double get_totalEnergyS() const;
-    const double get_mu() const;
+                 double*& EDM) override;
+    int solve(double mu0) override;
+    double get_totalFreeEnergy() const override;
+    double get_totalEnergyH() const override;
+    double get_totalEnergyS() const override;
+    double get_mu() const override;
 
     //==========================================================
     // PEXSI related variables
@@ -140,4 +142,4 @@ class PEXSI_Solver
     double mu;
 };
 } // namespace pexsi
-#endif // PEXSI_Solver_H
\ No newline at end of file
+#endif // PEXSI_Solver_H
diff --git a/source/source_hsolver/module_pexsi/pexsi_solver_interface.h b/source/source_hsolver/module_pexsi/pexsi_solver_interface.h
new file mode 100644
index 00000000000..405ea0adb0f
--- /dev/null
+++ b/source/source_hsolver/module_pexsi/pexsi_solver_interface.h
@@ -0,0 +1,28 @@
+#ifndef PEXSI_SOLVER_INTERFACE_H
+#define PEXSI_SOLVER_INTERFACE_H
+
+namespace pexsi
+{
+class IPexsiSolver
+{
+  public:
+    virtual ~IPexsiSolver() = default;
+
+    virtual void prepare(const int blacs_text,
+                         const int nb,
+                         const int nrow,
+                         const int ncol,
+                         const double* h,
+                         const double* s,
+                         double*& DM,
+                         double*& EDM)
+        = 0;
+    virtual int solve(double mu0) = 0;
+    virtual double get_totalFreeEnergy() const = 0;
+    virtual double get_totalEnergyH() const = 0;
+    virtual double get_totalEnergyS() const = 0;
+    virtual double get_mu() const = 0;
+};
+} // namespace pexsi
+
+#endif // PEXSI_SOLVER_INTERFACE_H
diff --git a/source/source_hsolver/module_pexsi/simple_pexsi.cpp b/source/source_hsolver/module_pexsi/simple_pexsi.cpp
index 6c52066e373..2107426d113 100644
--- a/source/source_hsolver/module_pexsi/simple_pexsi.cpp
+++ b/source/source_hsolver/module_pexsi/simple_pexsi.cpp
@@ -1,17 +1,23 @@
 // use PEXSI to solve a Kohn-Sham equation
 // the H and S matrices are given by 2D block cyclic distribution
+#include "simple_pexsi.h"
 #include "source_io/module_parameter/parameter.h"
 // the Density Matrix and Energy Density Matrix calculated by PEXSI are transformed to 2D block cyclic distribution
 // #include "mpi.h"
 #ifdef __PEXSI
 #include <mpi.h>
 
+#include <algorithm>
 #include <cfloat>
 #include <cmath>
+#include <complex>
+#include <cstdint>
+#include <cstdlib>
 #include <cstring>
 #include <fstream>
 #include <iostream>
 #include <memory>
+#include <string>
 
 #include "c_pexsi_interface.h"
 #include "dist_bcd_matrix.h"
@@ -21,10 +27,272 @@
 #include "source_base/timer.h"
 #include "source_base/tool_quit.h"
 #include "source_base/global_variable.h"
-#include "source_hsolver/diago_pexsi.h"
+#include "pexsi_solver.h"
 
 namespace pexsi
 {
+namespace
+{
+bool trace_pexsi_enabled(const char* category)
+{
+    const char* trace_env = std::getenv("ABACUS_PEXSI_TRACE");
+    if (trace_env == nullptr)
+    {
+        return false;
+    }
+    const std::string trace_value(trace_env);
+    return trace_value == "1" || trace_value == "all" || trace_value.find(category) != std::string::npos;
+}
+
+void trace_pexsi_stage(MPI_Comm comm, const char* path, const char* stage)
+{
+    if (!trace_pexsi_enabled("stage") || comm == MPI_COMM_NULL)
+    {
+        return;
+    }
+    int comm_rank = 0;
+    int world_rank = 0;
+    MPI_Comm_rank(comm, &comm_rank);
+    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
+    if (comm_rank == 0)
+    {
+        std::cout << "PEXSI_TRACE world_rank=" << world_rank << " path=" << path << " stage=" << stage
+                  << " time=" << MPI_Wtime() << std::endl;
+    }
+}
+
+std::uint64_t fnv1a_append_int(std::uint64_t hash, const int value)
+{
+    std::uint32_t data = static_cast<std::uint32_t>(value);
+    for (int byte = 0; byte < 4; ++byte)
+    {
+        hash ^= static_cast<unsigned char>((data >> (byte * 8)) & 0xffU);
+        hash *= 1099511628211ULL;
+    }
+    return hash;
+}
+
+std::uint64_t local_pattern_hash(const int numColLocal,
+                                 const int nnzLocal,
+                                 const int* colptrLocal,
+                                 const int* rowindLocal)
+{
+    std::uint64_t hash = 1469598103934665603ULL;
+    hash = fnv1a_append_int(hash, numColLocal);
+    hash = fnv1a_append_int(hash, nnzLocal);
+    for (int i = 0; i < numColLocal + 1; ++i)
+    {
+        hash = fnv1a_append_int(hash, colptrLocal[i]);
+    }
+    for (int i = 0; i < nnzLocal; ++i)
+    {
+        hash = fnv1a_append_int(hash, rowindLocal[i]);
+    }
+    return hash;
+}
+
+void print_effective_pexsi_options_once(const PPEXSIOptions& options,
+                                        const int numProcessPerPole,
+                                        const double zeroLimit)
+{
+    static bool printed = false;
+    if (printed || GlobalV::MY_RANK != 0)
+    {
+        return;
+    }
+    printed = true;
+    GlobalV::ofs_running << "\n PEXSI effective options:"
+                         << " numPole=" << options.numPole
+                         << " temperature=" << options.temperature
+                         << " numProcessPerPole=" << numProcessPerPole
+                         << " muGuard=" << options.muPEXSISafeGuard
+                         << " electronTolerance=" << options.numElectronPEXSITolerance
+                         << " zeroLimit=" << zeroLimit << std::endl;
+}
+} // namespace
+
+struct PexsiComplexReuseContextImpl
+{
+    struct Signature
+    {
+        int size = 0;
+        int numProcessPerPole = 0;
+        int pexsi_prow = 0;
+        int pexsi_pcol = 0;
+        int nnz = 0;
+        int nnzLocal = 0;
+        int numColLocal = 0;
+        int matrixType = 0;
+        int symmetric = 0;
+        int symmetricStorage = 0;
+        int ordering = 0;
+        int rowOrdering = 0;
+        int solver = 0;
+        int npSymbFact = 0;
+        int transpose = 0;
+        std::uint64_t localPatternHash = 0;
+        std::uint64_t globalPatternHash = 0;
+    };
+
+    PPEXSIPlan plan = 0;
+    MPI_Comm comm = MPI_COMM_NULL;
+    Signature signature;
+    bool has_plan = false;
+
+    ~PexsiComplexReuseContextImpl()
+    {
+        clear();
+    }
+
+    void clear()
+    {
+        int initialized = 0;
+        int finalized = 0;
+        MPI_Initialized(&initialized);
+        if (initialized)
+        {
+            MPI_Finalized(&finalized);
+        }
+        if (initialized && !finalized)
+        {
+            if (has_plan)
+            {
+                int info = 0;
+                trace_pexsi_stage(comm, "complex", "PPEXSIPlanFinalize.cached.begin");
+                PPEXSIPlanFinalize(plan, &info);
+                trace_pexsi_stage(comm, "complex", "PPEXSIPlanFinalize.cached.end");
+            }
+            if (comm != MPI_COMM_NULL)
+            {
+                MPI_Comm_free(&comm);
+            }
+        }
+        plan = 0;
+        comm = MPI_COMM_NULL;
+        signature = Signature();
+        has_plan = false;
+    }
+
+    static Signature make_signature(const PPEXSIOptions& options,
+                                    const int size,
+                                    const int numProcessPerPole,
+                                    const int pexsi_prow,
+                                    const int pexsi_pcol,
+                                    const DistCCSMatrix& matrix,
+                                    const std::uint64_t localPatternHash,
+                                    const std::uint64_t globalPatternHash)
+    {
+        Signature sig;
+        sig.size = size;
+        sig.numProcessPerPole = numProcessPerPole;
+        sig.pexsi_prow = pexsi_prow;
+        sig.pexsi_pcol = pexsi_pcol;
+        sig.nnz = matrix.get_nnz();
+        sig.nnzLocal = matrix.get_nnzlocal();
+        sig.numColLocal = matrix.get_numcol_local();
+        sig.matrixType = options.matrixType;
+        sig.symmetric = options.symmetric;
+        sig.symmetricStorage = options.symmetricStorage;
+        sig.ordering = options.ordering;
+        sig.rowOrdering = options.rowOrdering;
+        sig.solver = options.solver;
+        sig.npSymbFact = options.npSymbFact;
+        sig.transpose = options.transpose;
+        sig.localPatternHash = localPatternHash;
+        sig.globalPatternHash = globalPatternHash;
+        return sig;
+    }
+
+    static bool same_signature(const Signature& lhs, const Signature& rhs)
+    {
+        return lhs.size == rhs.size && lhs.numProcessPerPole == rhs.numProcessPerPole
+               && lhs.pexsi_prow == rhs.pexsi_prow && lhs.pexsi_pcol == rhs.pexsi_pcol
+               && lhs.nnz == rhs.nnz && lhs.nnzLocal == rhs.nnzLocal
+               && lhs.numColLocal == rhs.numColLocal && lhs.matrixType == rhs.matrixType
+               && lhs.symmetric == rhs.symmetric && lhs.symmetricStorage == rhs.symmetricStorage
+               && lhs.ordering == rhs.ordering && lhs.rowOrdering == rhs.rowOrdering
+               && lhs.solver == rhs.solver && lhs.npSymbFact == rhs.npSymbFact
+               && lhs.transpose == rhs.transpose && lhs.localPatternHash == rhs.localPatternHash
+               && lhs.globalPatternHash == rhs.globalPatternHash;
+    }
+
+    bool comm_matches(MPI_Comm new_comm) const
+    {
+        if (!has_plan || comm == MPI_COMM_NULL || new_comm == MPI_COMM_NULL)
+        {
+            return false;
+        }
+        int result = MPI_UNEQUAL;
+        MPI_Comm_compare(comm, new_comm, &result);
+        if (result == MPI_IDENT || result == MPI_CONGRUENT)
+        {
+            return true;
+        }
+        if (result == MPI_SIMILAR)
+        {
+            int old_rank = -1;
+            int new_rank = -2;
+            MPI_Comm_rank(comm, &old_rank);
+            MPI_Comm_rank(new_comm, &new_rank);
+            return old_rank == new_rank;
+        }
+        return false;
+    }
+
+    PPEXSIPlan get_or_create_plan(MPI_Comm comm_in,
+                                  const int pexsi_prow,
+                                  const int pexsi_pcol,
+                                  const int outputFileIndex,
+                                  const Signature& new_signature,
+                                  bool& reused)
+    {
+        reused = false;
+        if (comm_in == MPI_COMM_NULL)
+        {
+            return 0;
+        }
+
+        const bool local_match = has_plan && comm_matches(comm_in) && same_signature(signature, new_signature);
+        int local_reuse = local_match ? 1 : 0;
+        int global_reuse = 0;
+        MPI_Allreduce(&local_reuse, &global_reuse, 1, MPI_INT, MPI_MIN, comm_in);
+        reused = global_reuse == 1;
+        if (reused)
+        {
+            return plan;
+        }
+
+        clear();
+        MPI_Comm_dup(comm_in, &comm);
+        int info = 0;
+        trace_pexsi_stage(comm, "complex", "PPEXSIPlanInitialize.cached.begin");
+        plan = PPEXSIPlanInitialize(comm, pexsi_prow, pexsi_pcol, outputFileIndex, &info);
+        trace_pexsi_stage(comm, "complex", "PPEXSIPlanInitialize.cached.end");
+        signature = new_signature;
+        has_plan = true;
+        return plan;
+    }
+};
+
+PexsiComplexReuseContext::PexsiComplexReuseContext()
+    : impl_(new PexsiComplexReuseContextImpl())
+{
+}
+
+PexsiComplexReuseContext::~PexsiComplexReuseContext() = default;
+
+PexsiComplexReuseContext::PexsiComplexReuseContext(PexsiComplexReuseContext&&) noexcept = default;
+
+PexsiComplexReuseContext& PexsiComplexReuseContext::operator=(PexsiComplexReuseContext&&) noexcept = default;
+
+void PexsiComplexReuseContext::clear()
+{
+    if (impl_)
+    {
+        impl_->clear();
+    }
+}
+
 inline void strtolower(char* sa, char* sb)
 {
     char c = '\0';
@@ -115,7 +383,7 @@ int loadPEXSIOption(MPI_Comm comm,
     int_para[15] = 0;
     int_para[16] = pexsi::PEXSI_Solver::pexsi_nproc_pole;
 
-    double_para[0] = 2;//PARAM.inp.nspin; // pexsi::PEXSI_Solver::pexsi_spin;
+    double_para[0] = 2.0;
     double_para[1] = pexsi::PEXSI_Solver::pexsi_temp;
     double_para[2] = pexsi::PEXSI_Solver::pexsi_gap;
     double_para[3] = pexsi::PEXSI_Solver::pexsi_delta_e;
@@ -128,6 +396,21 @@ int loadPEXSIOption(MPI_Comm comm,
     double_para[10] = pexsi::PEXSI_Solver::pexsi_elec_thr;
     double_para[11] = pexsi::PEXSI_Solver::pexsi_zero_thr;
 
+    const bool default_pexsi_temperature = std::abs(double_para[1] - 0.015) < 1.0e-14;
+    const bool fermi_dirac_smearing = PARAM.inp.smearing_method == "fd"
+                                      || PARAM.inp.smearing_method == "fermi-dirac";
+    if (default_pexsi_temperature && fermi_dirac_smearing)
+    {
+        double_para[1] = PARAM.inp.smearing_sigma;
+    }
+
+    int comm_size = 1;
+    if (comm != MPI_COMM_NULL)
+    {
+        MPI_Comm_size(comm, &comm_size);
+    }
+    int_para[16] = std::max(1, std::min(int_para[16], comm_size));
+
     options.numPole = int_para[0];
     options.isInertiaCount = int_para[1];
     options.maxPEXSIIter = int_para[2];
@@ -232,6 +515,8 @@ int simplePEXSI(MPI_Comm comm_PEXSI,
     double ZERO_Limit = 0.0;
     loadPEXSIOption(comm_PEXSI, PexsiOptionFile, options, numProcessPerPole, ZERO_Limit);
     options.mu0 = mu0;
+    print_effective_pexsi_options_once(options, numProcessPerPole, ZERO_Limit);
+    const bool detail_timing = std::getenv("ABACUS_PEXSI_REAL_DETAIL_TIMING") != nullptr;
 
     ModuleBase::timer::start("Diago_LCAO_Matrix", "setup_PEXSI_plan");
     PPEXSIPlan plan;
@@ -271,7 +556,15 @@ int simplePEXSI(MPI_Comm comm_PEXSI,
     double* FDMnzvalLocal = nullptr;
     // transform H and S from 2D block cyclic distribution to compressed column sparse matrix
     // LiuXh modify 2021-03-30, add DONE(ofs_running,"xx") for test
+    if (detail_timing)
+    {
+        ModuleBase::timer::start("Diago_LCAO_Matrix", "TransMAT2CCS");
+    }
     DistMatrixTransformer::transformBCDtoCCS(SRC_Matrix, H, S, ZERO_Limit, DST_Matrix, HnzvalLocal, SnzvalLocal);
+    if (detail_timing)
+    {
+        ModuleBase::timer::end("Diago_LCAO_Matrix", "TransMAT2CCS");
+    }
     // MPI_Barrier(MPI_COMM_WORLD);
     // LiuXh modify 2021-03-30, add DONE(ofs_running,"xx") for test
     if (comm_PEXSI != MPI_COMM_NULL)
@@ -279,6 +572,10 @@ int simplePEXSI(MPI_Comm comm_PEXSI,
 
         // Load H and S to PEXSI
         int isSIdentity = 0;
+        if (detail_timing)
+        {
+            ModuleBase::timer::start("Diago_LCAO_Matrix", "PEXSI_LoadHS_R");
+        }
         PPEXSILoadRealHSMatrix(plan,
                                options,
                                size,
@@ -291,6 +588,10 @@ int simplePEXSI(MPI_Comm comm_PEXSI,
                                isSIdentity,
                                SnzvalLocal,
                                &info);
+        if (detail_timing)
+        {
+            ModuleBase::timer::end("Diago_LCAO_Matrix", "PEXSI_LoadHS_R");
+        }
 
         double nelec = 0.0;
         double muMinInertia = 0.0;
@@ -324,6 +625,10 @@ int simplePEXSI(MPI_Comm comm_PEXSI,
         FDMnzvalLocal = new double[DST_Matrix.get_nnzlocal()];
         if (myid < numProcessPerPole)
         {
+            if (detail_timing)
+            {
+                ModuleBase::timer::start("Diago_LCAO_Matrix", "PEXSI_Retrieve_R");
+            }
             PPEXSIRetrieveRealDFTMatrix(plan,
                                         DMnzvalLocal,
                                         EDMnzvalLocal,
@@ -332,9 +637,21 @@ int simplePEXSI(MPI_Comm comm_PEXSI,
                                         &totalEnergyS,
                                         &totalFreeEnergy,
                                         &info);
+            if (detail_timing)
+            {
+                ModuleBase::timer::end("Diago_LCAO_Matrix", "PEXSI_Retrieve_R");
+            }
         }
         // clean PEXSI
+        if (detail_timing)
+        {
+            ModuleBase::timer::start("Diago_LCAO_Matrix", "PEXSI_Finalize");
+        }
         PPEXSIPlanFinalize(plan, &info);
+        if (detail_timing)
+        {
+            ModuleBase::timer::end("Diago_LCAO_Matrix", "PEXSI_Finalize");
+        }
     }
 
     // transform Density Matrix and Energy Density Matrix from compressed column sparse matrix
@@ -347,20 +664,271 @@ int simplePEXSI(MPI_Comm comm_PEXSI,
         // EDM = new double[SRC_Matrix.get_nrow() * SRC_Matrix.get_ncol()];
     }
     // LiuXh modify 2021-04-29, add DONE(ofs_running,"xx") for test
+    const MPI_Comm barrier_comm = comm_2D != MPI_COMM_NULL ? comm_2D : comm_PEXSI;
     ModuleBase::timer::start("Diago_LCAO_Matrix", "TransMAT22D");
     DistMatrixTransformer::transformCCStoBCD(DST_Matrix, DMnzvalLocal, EDMnzvalLocal, SRC_Matrix, DM, EDM);
     ModuleBase::timer::end("Diago_LCAO_Matrix", "TransMAT22D");
     // LiuXh modify 2021-04-29, add DONE(ofs_running,"xx") for test
 
-    MPI_Barrier(MPI_COMM_WORLD);
-    MPI_Barrier(MPI_COMM_WORLD);
+    if (barrier_comm != MPI_COMM_NULL)
+    {
+        MPI_Barrier(barrier_comm);
+        MPI_Barrier(barrier_comm);
+    }
+    delete[] DMnzvalLocal;
+    delete[] EDMnzvalLocal;
+    delete[] FDMnzvalLocal;
+    delete[] HnzvalLocal;
+    delete[] SnzvalLocal;
+    if (barrier_comm != MPI_COMM_NULL)
+    {
+        MPI_Barrier(barrier_comm);
+    }
+    return 0;
+}
+
+int simplePEXSIComplex(MPI_Comm comm_PEXSI,
+                       MPI_Comm comm_2D,
+                       MPI_Group group_2D,
+                       const int blacs_ctxt,
+                       const int size,
+                       const int nblk,
+                       const int nrow,
+                       const int ncol,
+                       char layout,
+                       std::complex<double>* H,
+                       std::complex<double>* S,
+                       const double numElectronExact,
+                       const std::string PexsiOptionFile,
+                       std::complex<double>*& DM,
+                       std::complex<double>*& EDM,
+                       double& totalEnergyH,
+                       double& totalEnergyS,
+                       double& totalFreeEnergy,
+                       double& mu,
+                       double mu0,
+                       double* numElectronPEXSI,
+                       double* numElectronDrvMuPEXSI,
+                       PexsiComplexReuseContext* reuse_context)
+{
+    if (comm_2D == MPI_COMM_NULL && comm_PEXSI == MPI_COMM_NULL)
+    {
+        return 0;
+    }
+
+    int myid = 0;
+    if (comm_PEXSI != MPI_COMM_NULL)
+    {
+        MPI_Comm_rank(comm_PEXSI, &myid);
+    }
+
+    PPEXSIOptions options;
+    PPEXSISetDefaultOptions(&options);
+    int numProcessPerPole = 0;
+    double ZERO_Limit = 0.0;
+    loadPEXSIOption(comm_PEXSI, PexsiOptionFile, options, numProcessPerPole, ZERO_Limit);
+    options.symmetric = 1;
+    options.symmetricStorage = 0;
+    options.rowOrdering = 0;
+    options.spin = PARAM.inp.nspin == 1 ? 2.0 : 1.0;
+    options.mu0 = mu0;
+    print_effective_pexsi_options_once(options, numProcessPerPole, ZERO_Limit);
+
+    ModuleBase::timer::start("Diago_LCAO_Matrix", "setup_PEXSI_plan");
+    PPEXSIPlan plan = 0;
+    int info = 0;
+    int outputFileIndex = -1;
+    int pexsi_prow = 0;
+    int pexsi_pcol = 0;
+    ModuleBase::timer::start("Diago_LCAO_Matrix", "splitNProc2NProwNPcol");
+    splitNProc2NProwNPcol(numProcessPerPole, pexsi_prow, pexsi_pcol);
+    ModuleBase::timer::end("Diago_LCAO_Matrix", "splitNProc2NProwNPcol");
+
+    ModuleBase::timer::end("Diago_LCAO_Matrix", "setup_PEXSI_plan");
+
+    DistCCSMatrix DST_Matrix(comm_PEXSI, numProcessPerPole, size);
+    DistBCDMatrix SRC_Matrix(comm_2D, group_2D, blacs_ctxt, size, nblk, nrow, ncol, layout);
+
+    std::complex<double>* HnzvalLocal = nullptr;
+    std::complex<double>* SnzvalLocal = nullptr;
+    std::complex<double>* DMnzvalLocal = nullptr;
+    std::complex<double>* EDMnzvalLocal = nullptr;
+    std::complex<double>* FDMnzvalLocal = nullptr;
+
+    ModuleBase::timer::start("Diago_LCAO_Matrix", "TransMAT2CCS");
+    trace_pexsi_stage(comm_PEXSI, "complex", "TransMAT2CCS.begin");
+    DistMatrixTransformer::transformBCDtoCCS(SRC_Matrix, H, S, ZERO_Limit, DST_Matrix, HnzvalLocal, SnzvalLocal);
+    trace_pexsi_stage(comm_PEXSI, "complex", "TransMAT2CCS.end");
+    ModuleBase::timer::end("Diago_LCAO_Matrix", "TransMAT2CCS");
+
+    if (comm_PEXSI != MPI_COMM_NULL)
+    {
+        bool reuse_plan = false;
+        ModuleBase::timer::start("Diago_LCAO_Matrix", "PEXSIPlanInit");
+        if (reuse_context != nullptr)
+        {
+            const std::uint64_t local_pattern_hash_value = local_pattern_hash(DST_Matrix.get_numcol_local(),
+                                                                               DST_Matrix.get_nnzlocal(),
+                                                                               DST_Matrix.get_colptr_local(),
+                                                                               DST_Matrix.get_rowind_local());
+            std::uint64_t global_pattern_hash_value = local_pattern_hash_value;
+            MPI_Allreduce(&local_pattern_hash_value,
+                          &global_pattern_hash_value,
+                          1,
+                          MPI_UINT64_T,
+                          MPI_BXOR,
+                          comm_PEXSI);
+            const auto signature = PexsiComplexReuseContextImpl::make_signature(options,
+                                                                                size,
+                                                                                numProcessPerPole,
+                                                                                pexsi_prow,
+                                                                                pexsi_pcol,
+                                                                                DST_Matrix,
+                                                                                local_pattern_hash_value,
+                                                                                global_pattern_hash_value);
+            plan = reuse_context->impl_->get_or_create_plan(comm_PEXSI,
+                                                            pexsi_prow,
+                                                            pexsi_pcol,
+                                                            outputFileIndex,
+                                                            signature,
+                                                            reuse_plan);
+            if (reuse_plan)
+            {
+                trace_pexsi_stage(comm_PEXSI, "complex", "PPEXSIPlanInitialize.reuse");
+            }
+        }
+        else
+        {
+            trace_pexsi_stage(comm_PEXSI, "complex", "PPEXSIPlanInitialize.begin");
+            plan = PPEXSIPlanInitialize(comm_PEXSI, pexsi_prow, pexsi_pcol, outputFileIndex, &info);
+            trace_pexsi_stage(comm_PEXSI, "complex", "PPEXSIPlanInitialize.end");
+        }
+        ModuleBase::timer::end("Diago_LCAO_Matrix", "PEXSIPlanInit");
+
+        int isSIdentity = 0;
+        PPEXSIOptions load_options = options;
+        if (reuse_plan)
+        {
+            load_options.isSymbolicFactorize = 0;
+            load_options.isConstructCommPattern = 0;
+        }
+        ModuleBase::timer::start("Diago_LCAO_Matrix", "PEXSI_LoadHS_C");
+        trace_pexsi_stage(comm_PEXSI, "complex", "PPEXSILoadComplexHSMatrix.begin");
+        PPEXSILoadComplexHSMatrix(plan,
+                                  load_options,
+                                  size,
+                                  DST_Matrix.get_nnz(),
+                                  DST_Matrix.get_nnzlocal(),
+                                  DST_Matrix.get_numcol_local(),
+                                  DST_Matrix.get_colptr_local(),
+                                  DST_Matrix.get_rowind_local(),
+                                  reinterpret_cast<double*>(HnzvalLocal),
+                                  isSIdentity,
+                                  reinterpret_cast<double*>(SnzvalLocal),
+                                  &info);
+        trace_pexsi_stage(comm_PEXSI, "complex", "PPEXSILoadComplexHSMatrix.end");
+        ModuleBase::timer::end("Diago_LCAO_Matrix", "PEXSI_LoadHS_C");
+        if (!reuse_plan)
+        {
+            ModuleBase::timer::start("Diago_LCAO_Matrix", "PEXSI_Symbolic_C");
+            trace_pexsi_stage(comm_PEXSI, "complex", "PPEXSISymbolicFactorizeComplexUnsymmetricMatrix.begin");
+            PPEXSISymbolicFactorizeComplexUnsymmetricMatrix(plan,
+                                                            options,
+                                                            reinterpret_cast<double*>(HnzvalLocal),
+                                                            &info);
+            trace_pexsi_stage(comm_PEXSI, "complex", "PPEXSISymbolicFactorizeComplexUnsymmetricMatrix.end");
+            ModuleBase::timer::end("Diago_LCAO_Matrix", "PEXSI_Symbolic_C");
+        }
+        else
+        {
+            trace_pexsi_stage(comm_PEXSI, "complex", "PPEXSISymbolicFactorizeComplexUnsymmetricMatrix.reuse");
+        }
+
+        double nelec = 0.0;
+        double d_nelec_d_mu = 0.0;
+        mu = mu0;
+        PPEXSIOptions fermi_options = options;
+        if (reuse_plan)
+        {
+            fermi_options.isSymbolicFactorize = 0;
+            fermi_options.isConstructCommPattern = 0;
+        }
+        ModuleBase::timer::start("Diago_LCAO_Matrix", "PEXSIDFT");
+        ModuleBase::timer::start("Diago_LCAO_Matrix", "PEXSI_FermiOp_C");
+        trace_pexsi_stage(comm_PEXSI, "complex", "PPEXSICalculateFermiOperatorComplex.begin");
+        PPEXSICalculateFermiOperatorComplex(plan,
+                                            fermi_options,
+                                            mu,
+                                            numElectronExact,
+                                            &nelec,
+                                            &d_nelec_d_mu,
+                                            &info);
+        trace_pexsi_stage(comm_PEXSI, "complex", "PPEXSICalculateFermiOperatorComplex.end");
+        ModuleBase::timer::end("Diago_LCAO_Matrix", "PEXSI_FermiOp_C");
+        if (numElectronPEXSI != nullptr)
+        {
+            *numElectronPEXSI = nelec;
+        }
+        if (numElectronDrvMuPEXSI != nullptr)
+        {
+            *numElectronDrvMuPEXSI = d_nelec_d_mu;
+        }
+        ModuleBase::timer::end("Diago_LCAO_Matrix", "PEXSIDFT");
+
+        DMnzvalLocal = new std::complex<double>[DST_Matrix.get_nnzlocal()];
+        EDMnzvalLocal = new std::complex<double>[DST_Matrix.get_nnzlocal()];
+        FDMnzvalLocal = new std::complex<double>[DST_Matrix.get_nnzlocal()];
+        if (myid < numProcessPerPole)
+        {
+            ModuleBase::timer::start("Diago_LCAO_Matrix", "PEXSI_Retrieve_C");
+            trace_pexsi_stage(comm_PEXSI, "complex", "PPEXSIRetrieveComplexDFTMatrix.begin");
+            PPEXSIRetrieveComplexDFTMatrix(plan,
+                                           reinterpret_cast<double*>(DMnzvalLocal),
+                                           reinterpret_cast<double*>(EDMnzvalLocal),
+                                           reinterpret_cast<double*>(FDMnzvalLocal),
+                                           &totalEnergyH,
+                                           &totalEnergyS,
+                                           &totalFreeEnergy,
+                                           &info);
+            trace_pexsi_stage(comm_PEXSI, "complex", "PPEXSIRetrieveComplexDFTMatrix.end");
+            ModuleBase::timer::end("Diago_LCAO_Matrix", "PEXSI_Retrieve_C");
+        }
+        if (reuse_context == nullptr)
+        {
+            ModuleBase::timer::start("Diago_LCAO_Matrix", "PEXSI_Finalize");
+            trace_pexsi_stage(comm_PEXSI, "complex", "PPEXSIPlanFinalize.begin");
+            PPEXSIPlanFinalize(plan, &info);
+            trace_pexsi_stage(comm_PEXSI, "complex", "PPEXSIPlanFinalize.end");
+            ModuleBase::timer::end("Diago_LCAO_Matrix", "PEXSI_Finalize");
+        }
+    }
+
+    const MPI_Comm barrier_comm = comm_2D != MPI_COMM_NULL ? comm_2D : comm_PEXSI;
+    ModuleBase::timer::start("Diago_LCAO_Matrix", "TransMAT22D");
+    trace_pexsi_stage(barrier_comm, "complex", "TransMAT22D.begin");
+    DistMatrixTransformer::transformCCStoBCD(DST_Matrix, DMnzvalLocal, EDMnzvalLocal, SRC_Matrix, DM, EDM);
+    trace_pexsi_stage(barrier_comm, "complex", "TransMAT22D.end");
+    ModuleBase::timer::end("Diago_LCAO_Matrix", "TransMAT22D");
+
+    if (barrier_comm != MPI_COMM_NULL)
+    {
+        ModuleBase::timer::start("Diago_LCAO_Matrix", "PEXSI_Barrier");
+        MPI_Barrier(barrier_comm);
+        MPI_Barrier(barrier_comm);
+        ModuleBase::timer::end("Diago_LCAO_Matrix", "PEXSI_Barrier");
+    }
     delete[] DMnzvalLocal;
     delete[] EDMnzvalLocal;
     delete[] FDMnzvalLocal;
     delete[] HnzvalLocal;
     delete[] SnzvalLocal;
-    MPI_Barrier(MPI_COMM_WORLD);
+    if (barrier_comm != MPI_COMM_NULL)
+    {
+        ModuleBase::timer::start("Diago_LCAO_Matrix", "PEXSI_Barrier");
+        MPI_Barrier(barrier_comm);
+        ModuleBase::timer::end("Diago_LCAO_Matrix", "PEXSI_Barrier");
+    }
     return 0;
 }
 } // namespace pexsi
-#endif
\ No newline at end of file
+#endif
diff --git a/source/source_hsolver/module_pexsi/simple_pexsi.h b/source/source_hsolver/module_pexsi/simple_pexsi.h
index db8879e5ac7..caed8a05116 100644
--- a/source/source_hsolver/module_pexsi/simple_pexsi.h
+++ b/source/source_hsolver/module_pexsi/simple_pexsi.h
@@ -2,9 +2,53 @@
 #define SIMPLE_PEXSI_H
 
 #include <mpi.h>
+#include <complex>
+#include <memory>
+#include <string>
 // a simple interface for calling pexsi with 2D block cyclic distributed matrix
 namespace pexsi
 {
+struct PexsiComplexReuseContextImpl;
+
+class PexsiComplexReuseContext
+{
+  public:
+    PexsiComplexReuseContext();
+    ~PexsiComplexReuseContext();
+    PexsiComplexReuseContext(const PexsiComplexReuseContext&) = delete;
+    PexsiComplexReuseContext& operator=(const PexsiComplexReuseContext&) = delete;
+    PexsiComplexReuseContext(PexsiComplexReuseContext&&) noexcept;
+    PexsiComplexReuseContext& operator=(PexsiComplexReuseContext&&) noexcept;
+
+    void clear();
+
+  private:
+    std::unique_ptr<PexsiComplexReuseContextImpl> impl_;
+    friend int simplePEXSIComplex(MPI_Comm comm_PEXSI,
+                                  MPI_Comm comm_2D,
+                                  MPI_Group group_2D,
+                                  const int blacs_ctxt,
+                                  const int size,
+                                  const int nblk,
+                                  const int nrow,
+                                  const int ncol,
+                                  char layout,
+                                  std::complex<double>* H,
+                                  std::complex<double>* S,
+                                  const double nElectronExact,
+                                  const std::string PexsiOptionFile,
+                                  std::complex<double>*& DM,
+                                  std::complex<double>*& EDM,
+                                  double& totalEnergyH,
+                                  double& totalEnergyS,
+                                  double& totalFreeEnergy,
+                                  double& mu,
+                                  double mu0,
+                                  double* numElectronPEXSI,
+                                  double* numElectronDrvMuPEXSI,
+                                  PexsiComplexReuseContext* reuse_context);
+};
+
 int simplePEXSI(MPI_Comm comm_PEXSI,
                 MPI_Comm comm_2D,
                 MPI_Group group_2D,
@@ -25,5 +69,29 @@ int simplePEXSI(MPI_Comm comm_PEXSI,
                 double& totalFreeEnergy,
                 double& mu,
                 double mu0);
+
+int simplePEXSIComplex(MPI_Comm comm_PEXSI,
+                       MPI_Comm comm_2D,
+                       MPI_Group group_2D,
+                       const int blacs_ctxt,
+                       const int size,
+                       const int nblk,
+                       const int nrow,
+                       const int ncol,
+                       char layout,
+                       std::complex<double>* H,
+                       std::complex<double>* S,
+                       const double nElectronExact,
+                       const std::string PexsiOptionFile,
+                       std::complex<double>*& DM,
+                       std::complex<double>*& EDM,
+                       double& totalEnergyH,
+                       double& totalEnergyS,
+                       double& totalFreeEnergy,
+                       double& mu,
+                       double mu0,
+                       double* numElectronPEXSI = nullptr,
+                       double* numElectronDrvMuPEXSI = nullptr,
+                       PexsiComplexReuseContext* reuse_context = nullptr);
 }
-#endif // SIMPLE_PEXSI_H
\ No newline at end of file
+#endif // SIMPLE_PEXSI_H
diff --git a/source/source_hsolver/test/diago_pexsi_test.cpp b/source/source_hsolver/test/diago_pexsi_test.cpp
index 0d021166e11..327da4f67d4 100644
--- a/source/source_hsolver/test/diago_pexsi_test.cpp
+++ b/source/source_hsolver/test/diago_pexsi_test.cpp
@@ -16,6 +16,8 @@
 #include <iostream>
 #include <mpi.h>
 #include <string>
+#include <memory>
+#include <utility>
 #include <vector>
 
 #define PASSTHRESHOLD 5e-4
@@ -382,6 +384,124 @@ class PexsiGammaOnlyTest : public ::testing::TestWithParam<PexsiPrepare<double>>
 {
 };
 
+class FakePexsiSolver : public pexsi::IPexsiSolver
+{
+  public:
+    void prepare(const int blacs_text_in,
+                 const int nb_in,
+                 const int nrow_in,
+                 const int ncol_in,
+                 const double* h_in,
+                 const double* s_in,
+                 double*& dm_in,
+                 double*& edm_in) override
+    {
+        prepared = true;
+        blacs_text = blacs_text_in;
+        nb = nb_in;
+        nrow = nrow_in;
+        ncol = ncol_in;
+        h = h_in;
+        s = s_in;
+        dm = dm_in;
+        edm = edm_in;
+        for (int i = 0; i < nrow * ncol; ++i)
+        {
+            dm[i] = 10.0 + i;
+            edm[i] = 20.0 + i;
+        }
+    }
+
+    int solve(double mu0_in) override
+    {
+        solved = true;
+        mu0 = mu0_in;
+        return 0;
+    }
+
+    double get_totalFreeEnergy() const override
+    {
+        return total_free_energy;
+    }
+
+    double get_totalEnergyH() const override
+    {
+        return total_energy_h;
+    }
+
+    double get_totalEnergyS() const override
+    {
+        return total_energy_s;
+    }
+
+    double get_mu() const override
+    {
+        return mu;
+    }
+
+    bool prepared = false;
+    bool solved = false;
+    int blacs_text = -1;
+    int nb = -1;
+    int nrow = -1;
+    int ncol = -1;
+    double mu0 = 0.0;
+    const double* h = nullptr;
+    const double* s = nullptr;
+    double* dm = nullptr;
+    double* edm = nullptr;
+    double total_free_energy = -1.5;
+    double total_energy_h = -2.5;
+    double total_energy_s = 3.5;
+    double mu = 0.125;
+};
+
+TEST(DiagoPexsiRefactorTest, UsesInjectedSolverAndOwnedDensityBuffers)
+{
+    PARAM.input.nspin = 1;
+    PARAM.inp.nspin = 1;
+    pexsi::PEXSI_Solver::pexsi_mu = -0.25;
+
+    Parallel_Orbitals po;
+    po.blacs_ctxt = 11;
+    po.nb = 2;
+    po.nrow = 2;
+    po.ncol = 2;
+
+    HamiltTEST<double> hmtest;
+    hmtest.nrow = po.nrow;
+    hmtest.ncol = po.ncol;
+    hmtest.h_local = {1.0, 0.0, 0.0, 2.0};
+    hmtest.s_local = {1.0, 0.0, 0.0, 1.0};
+
+    psi::Psi<double> psi;
+
+    auto fake_solver = std::make_unique<FakePexsiSolver>();
+    FakePexsiSolver* fake_solver_ptr = fake_solver.get();
+    hsolver::DiagoPexsi<double> diago(&po, std::move(fake_solver));
+
+    ASSERT_NE(diago.DM[0], nullptr);
+    ASSERT_NE(diago.EDM[0], nullptr);
+    diago.diag(&hmtest, psi, nullptr);
+
+    EXPECT_TRUE(fake_solver_ptr->prepared);
+    EXPECT_TRUE(fake_solver_ptr->solved);
+    EXPECT_EQ(fake_solver_ptr->blacs_text, po.blacs_ctxt);
+    EXPECT_EQ(fake_solver_ptr->nb, po.nb);
+    EXPECT_EQ(fake_solver_ptr->nrow, po.nrow);
+    EXPECT_EQ(fake_solver_ptr->ncol, po.ncol);
+    EXPECT_EQ(fake_solver_ptr->h, hmtest.h_local.data());
+    EXPECT_EQ(fake_solver_ptr->s, hmtest.s_local.data());
+    EXPECT_EQ(fake_solver_ptr->dm, diago.DM[0]);
+    EXPECT_EQ(fake_solver_ptr->edm, diago.EDM[0]);
+    EXPECT_DOUBLE_EQ(fake_solver_ptr->mu0, -0.25);
+    EXPECT_DOUBLE_EQ(diago.DM[0][3], 13.0);
+    EXPECT_DOUBLE_EQ(diago.EDM[0][3], 23.0);
+    EXPECT_DOUBLE_EQ(diago.totalFreeEnergy, fake_solver_ptr->total_free_energy);
+    EXPECT_DOUBLE_EQ(diago.totalEnergyH, fake_solver_ptr->total_energy_h);
+    EXPECT_DOUBLE_EQ(diago.totalEnergyS, fake_solver_ptr->total_energy_s);
+}
+
 TEST_P(PexsiGammaOnlyTest, LCAO)
 {
     std::stringstream out_info;
diff --git a/tools/pexsi/check_lcao_10case_regression.py b/tools/pexsi/check_lcao_10case_regression.py
new file mode 100755
index 00000000000..0fc29152118
--- /dev/null
+++ b/tools/pexsi/check_lcao_10case_regression.py
@@ -0,0 +1,101 @@
+#!/usr/bin/env python3
+"""Validate ABACUS LCAO 10-case outputs against the official energy baseline."""
+
+from __future__ import annotations
+
+import argparse
+import re
+import sys
+from pathlib import Path
+
+
+REFERENCE_ENERGIES_EV = {
+    "001": ("001_4GaAs", -7836.15655820),
+    "002": ("002_C2H6O", -665.55500011),
+    "003": ("003_4MoS2", -9699.00659511),
+    "004": ("004_12Pt111", -39600.74431945),
+    "005": ("005_3BaTiO3", -10749.40029987),
+    "006": ("006_16Na", -18524.76009654),
+    "007": ("007_27Fe", -86945.53067515),
+    "008": ("008_32H2O", -14950.45084341),
+    "009": ("009_Battery108", -124070.65411079),
+    "010": ("010_216Si", -23123.69990200),
+}
+
+FINAL_ETOT_RE = re.compile(r"!\s*FINAL_ETOT_IS\s+([-+0-9.eE]+)\s+eV")
+
+
+def find_case_dir(run_root: Path, case_id: str) -> Path | None:
+    matches = sorted(path for path in run_root.iterdir() if path.is_dir() and path.name.startswith(f"{case_id}_"))
+    return matches[0] if matches else None
+
+
+def find_running_log(case_dir: Path) -> Path | None:
+    matches = sorted(case_dir.glob("OUT.*/running_scf.log"))
+    return matches[0] if matches else None
+
+
+def parse_log(log_path: Path) -> tuple[bool, float | None]:
+    text = log_path.read_text(errors="replace")
+    converged = "#SCF IS CONVERGED#" in text
+    matches = FINAL_ETOT_RE.findall(text)
+    final_energy = float(matches[-1]) if matches else None
+    return converged, final_energy
+
+
+def validate(run_root: Path, tolerance_ev: float) -> int:
+    if not run_root.is_dir():
+        print(f"ERROR: run root does not exist: {run_root}", file=sys.stderr)
+        return 2
+
+    failures: list[str] = []
+    print(f"{'case':<18} {'status':<10} {'energy(eV)':>18} {'ref(eV)':>18} {'|dE|(eV)':>12}")
+    for case_id, (case_name, reference_energy) in REFERENCE_ENERGIES_EV.items():
+        case_dir = find_case_dir(run_root, case_id)
+        if case_dir is None:
+            failures.append(f"{case_name}: missing case directory")
+            print(f"{case_name:<18} {'MISSING':<10}")
+            continue
+
+        log_path = find_running_log(case_dir)
+        if log_path is None:
+            failures.append(f"{case_name}: missing OUT.*/running_scf.log")
+            print(f"{case_name:<18} {'NO_LOG':<10}")
+            continue
+
+        converged, final_energy = parse_log(log_path)
+        if final_energy is None:
+            failures.append(f"{case_name}: missing FINAL_ETOT_IS")
+            print(f"{case_name:<18} {'NO_ETOT':<10}")
+            continue
+
+        delta = abs(final_energy - reference_energy)
+        status = "PASS" if converged and delta <= tolerance_ev else "FAIL"
+        if status != "PASS":
+            reason = "not converged" if not converged else f"|dE|={delta:.3e} > {tolerance_ev:.3e}"
+            failures.append(f"{case_name}: {reason}")
+        print(f"{case_name:<18} {status:<10} {final_energy:18.10f} {reference_energy:18.10f} {delta:12.3e}")
+
+    if failures:
+        print("\nFailures:", file=sys.stderr)
+        for failure in failures:
+            print(f"  - {failure}", file=sys.stderr)
+        return 1
+    return 0
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("run_root", type=Path, help="directory containing the 001_* ... 010_* LCAO case outputs")
+    parser.add_argument(
+        "--tolerance-ev",
+        type=float,
+        default=5.0e-7,
+        help="maximum allowed absolute total-energy error in eV",
+    )
+    args = parser.parse_args()
+    return validate(args.run_root, args.tolerance_ev)
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/tools/pexsi/run_51_scaling_benchmark.py b/tools/pexsi/run_51_scaling_benchmark.py
new file mode 100755
index 00000000000..4f48ac2c52c
--- /dev/null
+++ b/tools/pexsi/run_51_scaling_benchmark.py
@@ -0,0 +1,808 @@
+#!/usr/bin/env python3
+"""Run the 05_pexsi.md section 5.1 solver-scaling benchmark.
+
+The benchmark compares the conventional LCAO generalized eigensolver
+(`scalapack_gvx`) with `pexsi` on the five systems listed in section 5.1:
+Si64 Gamma, Al128 4x4x4, Cu256 2x2x2, TiO2-192 2x2x2, and H2O Gamma.
+
+Each case is run for np=1,2,4,8,16 by default. Results are written after every
+job so interrupted runs can be resumed.
+"""
+
+from __future__ import annotations
+
+import argparse
+import datetime as _dt
+import json
+import math
+import os
+import re
+import shutil
+import signal
+import subprocess
+import sys
+import time
+from dataclasses import dataclass, replace
+from pathlib import Path
+from typing import Iterable
+
+
+RY_TO_EV = 13.605693009
+ENERGY_TOL_RY = 1.0e-5
+FORCE_TOL_EV_A = 1.0e-4
+STRESS_TOL_GPA = 1.0e-3
+GPA_TO_KBAR = 10.0
+STRESS_TOL_KBAR = STRESS_TOL_GPA * GPA_TO_KBAR
+
+DEFAULT_NP = (1, 2, 4, 8, 16)
+DEFAULT_SOLVERS = ("scalapack_gvx", "pexsi")
+
+FINAL_ETOT_RE = re.compile(r"!\s*FINAL_ETOT_IS\s+([-+0-9.eE]+)\s+eV")
+TOTAL_TIME_RE = re.compile(
+    r"Total\s+Time\s*:\s*"
+    r"(?:(?P<hours>[-+0-9.eE]+)\s*h)?\s*"
+    r"(?:(?P<minutes>[-+0-9.eE]+)\s*mins?)?\s*"
+    r"(?:(?P<seconds>[-+0-9.eE]+)\s*secs?)?",
+    re.IGNORECASE,
+)
+TIMER_RE = re.compile(r"^\s*([A-Za-z0-9_./ -]+?)\s+([-+0-9.eE]+)\s+(?:s|sec|seconds)?\s*$")
+PEXSI_TRACE_RE = re.compile(
+    r"PEXSI_TRACE\s+.*?\sstage=(?P<stage>[A-Za-z0-9_]+)\.(?P<event>begin|end)\s+time=(?P<time>[-+0-9.eE]+)"
+)
+
+TRACE_STAGE_KEYS = {
+    "TransMAT2CCS": "trace_transmat2ccs_s",
+    "PPEXSILoadComplexHSMatrix": "trace_loadhs_s",
+    "PPEXSILoadRealHSMatrix": "trace_loadhs_s",
+    "PPEXSISymbolicFactorizeComplexUnsymmetricMatrix": "trace_symbolic_s",
+    "PPEXSISymbolicFactorizeRealSymmetricMatrix": "trace_symbolic_s",
+    "PPEXSICalculateFermiOperatorComplex": "trace_fermiop_s",
+    "PPEXSICalculateFermiOperatorReal": "trace_fermiop_s",
+    "PPEXSIRetrieveComplexDFTMatrix": "trace_retrieve_s",
+    "PPEXSIRetrieveRealDFTMatrix": "trace_retrieve_s",
+    "TransMAT22D": "trace_transmat22d_s",
+}
+
+
+@dataclass(frozen=True)
+class CaseSpec:
+    name: str
+    natom: int
+    kmesh: tuple[int, int, int]
+    generator: str
+    scf_nmax: int = 100
+    ecutwfc: int = 100
+    mixing_beta: float = 0.3
+    nbands: int | None = None
+    smearing_sigma: float = 0.015
+    parallel_kpoints: int | None = None
+
+    @property
+    def nks(self) -> int:
+        return self.kmesh[0] * self.kmesh[1] * self.kmesh[2]
+
+    @property
+    def kpar_nks(self) -> int:
+        return self.parallel_kpoints or self.nks
+
+
+CASES: dict[str, CaseSpec] = {
+    "si64_gamma": CaseSpec(
+        name="si64_gamma",
+        natom=64,
+        kmesh=(1, 1, 1),
+        generator="copy_si64",
+        scf_nmax=100,
+        ecutwfc=60,
+        mixing_beta=0.4,
+        nbands=200,
+        smearing_sigma=0.002,
+    ),
+    "al128_444": CaseSpec(
+        name="al128_444",
+        natom=128,
+        kmesh=(4, 4, 4),
+        generator="gen_al128",
+        mixing_beta=0.3,
+        smearing_sigma=0.03,
+        parallel_kpoints=36,
+    ),
+    "cu256_222": CaseSpec(
+        name="cu256_222",
+        natom=256,
+        kmesh=(2, 2, 2),
+        generator="gen_cu256",
+        mixing_beta=0.4,
+        smearing_sigma=0.03,
+    ),
+    "tio2_192_222": CaseSpec(
+        name="tio2_192_222",
+        natom=192,
+        kmesh=(2, 2, 2),
+        generator="gen_tio2_192",
+        mixing_beta=0.3,
+        smearing_sigma=0.002,
+    ),
+    "h2o_gamma": CaseSpec(
+        name="h2o_gamma",
+        natom=3,
+        kmesh=(1, 1, 1),
+        generator="copy_h2o",
+        scf_nmax=50,
+        ecutwfc=100,
+        mixing_beta=0.4,
+        nbands=6,
+        smearing_sigma=0.02,
+    ),
+}
+
+
+def repo_root() -> Path:
+    return Path(__file__).resolve().parents[2]
+
+
+def timestamp() -> str:
+    return _dt.datetime.now().strftime("%Y%m%d_%H%M%S")
+
+
+def write_text(path: Path, text: str) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    path.write_text(text, encoding="utf-8")
+
+
+def format_vec(vec: Iterable[float]) -> str:
+    return " ".join(f"{x:.10f}" for x in vec)
+
+
+def direct_supercell_positions(
+    base_positions: Iterable[tuple[float, float, float]],
+    reps: tuple[int, int, int],
+) -> list[tuple[float, float, float]]:
+    rx, ry, rz = reps
+    positions: list[tuple[float, float, float]] = []
+    for ix in range(rx):
+        for iy in range(ry):
+            for iz in range(rz):
+                for x, y, z in base_positions:
+                    positions.append(((x + ix) / rx, (y + iy) / ry, (z + iz) / rz))
+    return positions
+
+
+def write_kpt(case_dir: Path, kmesh: tuple[int, int, int]) -> None:
+    write_text(
+        case_dir / "KPT",
+        "K_POINTS\n"
+        "0\n"
+        "Gamma\n"
+        f"{kmesh[0]} {kmesh[1]} {kmesh[2]} 0 0 0\n",
+    )
+
+
+def copy_case_files(src: Path, dst: Path, kmesh: tuple[int, int, int]) -> None:
+    dst.mkdir(parents=True, exist_ok=True)
+    shutil.copy2(src / "STRU", dst / "STRU")
+    write_kpt(dst, kmesh)
+
+
+def write_al128(dst: Path) -> None:
+    reps = (4, 4, 2)
+    base = (
+        (0.0, 0.0, 0.0),
+        (0.5, 0.5, 0.0),
+        (0.5, 0.0, 0.5),
+        (0.0, 0.5, 0.5),
+    )
+    positions = direct_supercell_positions(base, reps)
+    assert len(positions) == 128
+    lines = [
+        "ATOMIC_SPECIES",
+        "Al 26.982 Al.PD04.PBE.UPF",
+        "",
+        "NUMERICAL_ORBITAL",
+        "Al_gga_10au_100Ry_3s3p2d.orb",
+        "",
+        "LATTICE_CONSTANT",
+        "1.8897261254578281",
+        "",
+        "LATTICE_VECTORS",
+        format_vec((4.04495 * reps[0], 0.0, 0.0)),
+        format_vec((0.0, 4.04495 * reps[1], 0.0)),
+        format_vec((0.0, 0.0, 4.04495 * reps[2])),
+        "",
+        "ATOMIC_POSITIONS",
+        "Direct",
+        "",
+        "Al",
+        "0.0",
+        str(len(positions)),
+    ]
+    lines.extend(f"{x:.10f} {y:.10f} {z:.10f} 1 1 1" for x, y, z in positions)
+    write_text(dst / "STRU", "\n".join(lines) + "\n")
+    write_kpt(dst, (4, 4, 4))
+
+
+def write_cu256(dst: Path) -> None:
+    reps = (8, 8, 4)
+    positions = direct_supercell_positions(((0.0, 0.0, 0.0),), reps)
+    assert len(positions) == 256
+    base_vectors = (
+        (0.5, 0.5, 0.0),
+        (0.5, 0.0, 0.5),
+        (0.0, 0.5, 0.5),
+    )
+    lines = [
+        "ATOMIC_SPECIES",
+        "Cu 1.000 Cu.LDA.UPF",
+        "",
+        "NUMERICAL_ORBITAL",
+        "Cu_lda_7.0au_100Ry_2s2p2d",
+        "",
+        "LATTICE_CONSTANT",
+        "6.91640",
+        "",
+        "LATTICE_VECTORS",
+    ]
+    for vec, rep in zip(base_vectors, reps):
+        lines.append(format_vec(tuple(rep * x for x in vec)))
+    lines.extend(["", "ATOMIC_POSITIONS", "Direct", "", "Cu", "0.0", str(len(positions))])
+    lines.extend(f"{x:.10f} {y:.10f} {z:.10f} 1 1 1" for x, y, z in positions)
+    write_text(dst / "STRU", "\n".join(lines) + "\n")
+    write_kpt(dst, (2, 2, 2))
+
+
+def write_tio2_192(dst: Path) -> None:
+    reps = (4, 4, 2)
+    u = 0.305
+    ti_base = ((0.0, 0.0, 0.0), (0.5, 0.5, 0.5))
+    o_base = (
+        (u, u, 0.0),
+        (1.0 - u, 1.0 - u, 0.0),
+        (0.5 + u, 0.5 - u, 0.5),
+        (0.5 - u, 0.5 + u, 0.5),
+    )
+    ti_positions = direct_supercell_positions(ti_base, reps)
+    o_positions = direct_supercell_positions(o_base, reps)
+    assert len(ti_positions) + len(o_positions) == 192
+    lines = [
+        "ATOMIC_SPECIES",
+        "Ti 47.867 Ti_ONCV_PBE-1.0.upf",
+        "O 15.9994 O_ONCV_PBE-1.0.upf",
+        "",
+        "NUMERICAL_ORBITAL",
+        "Ti_gga_8au_100Ry_4s2p2d1f.orb",
+        "O_gga_8au_100Ry_2s2p1d.orb",
+        "",
+        "LATTICE_CONSTANT",
+        "1.889716",
+        "",
+        "LATTICE_VECTORS",
+        format_vec((4.594 * reps[0], 0.0, 0.0)),
+        format_vec((0.0, 4.594 * reps[1], 0.0)),
+        format_vec((0.0, 0.0, 2.959 * reps[2])),
+        "",
+        "ATOMIC_POSITIONS",
+        "Direct",
+        "",
+        "Ti",
+        "0.0",
+        str(len(ti_positions)),
+    ]
+    lines.extend(f"{x:.10f} {y:.10f} {z:.10f} 1 1 1" for x, y, z in ti_positions)
+    lines.extend(["", "O", "0.0", str(len(o_positions))])
+    lines.extend(f"{x:.10f} {y:.10f} {z:.10f} 1 1 1" for x, y, z in o_positions)
+    write_text(dst / "STRU", "\n".join(lines) + "\n")
+    write_kpt(dst, (2, 2, 2))
+
+
+def prepare_structure(repo: Path, case: CaseSpec, case_dir: Path) -> None:
+    if case.generator == "copy_si64":
+        copy_case_files(repo / "examples/33_pexsi/02_scf_Si64", case_dir, case.kmesh)
+    elif case.generator == "copy_h2o":
+        copy_case_files(repo / "stage3_pexsi_tests/01_h2o_pexsi", case_dir, case.kmesh)
+    elif case.generator == "gen_al128":
+        write_al128(case_dir)
+    elif case.generator == "gen_cu256":
+        write_cu256(case_dir)
+    elif case.generator == "gen_tio2_192":
+        write_tio2_192(case_dir)
+    else:
+        raise ValueError(f"unknown generator: {case.generator}")
+
+
+def input_text(
+    repo: Path,
+    case: CaseSpec,
+    solver: str,
+    np: int,
+    suffix: str,
+    pexsi_npole: int,
+    pexsi_elec_thr: float,
+    pexsi_mu_thr: float,
+    pexsi_zero_thr: float,
+) -> str:
+    gamma_only = 1 if case.kmesh == (1, 1, 1) else 0
+    kpar = 1
+    for candidate in range(min(np, case.kpar_nks), 0, -1):
+        if np % candidate == 0 and case.kpar_nks % candidate == 0:
+            kpar = candidate
+            break
+    lines = [
+        "INPUT_PARAMETERS",
+        f"suffix              {suffix}",
+        f"pseudo_dir          {repo / 'tests/PP_ORB'}",
+        f"orbital_dir         {repo / 'tests/PP_ORB'}",
+        "calculation         scf",
+        "basis_type          lcao",
+        f"gamma_only          {gamma_only}",
+        "symmetry            0",
+        f"ecutwfc             {case.ecutwfc}",
+        "lcao_dr             1e-3",
+        "scf_thr             1e-6",
+        f"scf_nmax            {case.scf_nmax}",
+        "cal_force           1",
+        "cal_stress          1",
+        "smearing_method     fd",
+        f"smearing_sigma      {case.smearing_sigma}",
+        "mixing_type         broyden",
+        f"mixing_beta         {case.mixing_beta}",
+    ]
+    if case.nbands is not None:
+        lines.append(f"nbands              {case.nbands}")
+    if kpar > 1:
+        lines.append(f"kpar                {kpar}")
+    lines.append(f"ks_solver           {solver}")
+    if solver == "pexsi":
+        lines.extend(
+            [
+                f"pexsi_npole         {pexsi_npole}",
+                "pexsi_nproc_pole    1",
+                f"pexsi_temp          {case.smearing_sigma}",
+                "pexsi_gap           0",
+                "pexsi_delta_e       20",
+                "pexsi_mu_lower      -10",
+                "pexsi_mu_upper      10",
+                "pexsi_mu            0",
+                f"pexsi_mu_thr        {pexsi_mu_thr}",
+                "pexsi_mu_expand     0.3",
+                "pexsi_mu_guard      0.2",
+                f"pexsi_elec_thr      {pexsi_elec_thr}",
+                f"pexsi_zero_thr      {pexsi_zero_thr}",
+            ]
+        )
+    return "\n".join(lines) + "\n"
+
+
+def parse_numeric_tail(line: str, n: int) -> list[float] | None:
+    vals: list[float] = []
+    for token in line.replace(",", " ").split():
+        try:
+            vals.append(float(token))
+        except ValueError:
+            continue
+    if len(vals) < n:
+        return None
+    return vals[-n:]
+
+
+def parse_block_vectors(lines: list[str], marker: str, width: int, max_rows: int | None) -> list[list[float]]:
+    vectors: list[list[float]] = []
+    active = False
+    for line in lines:
+        if marker in line:
+            active = True
+            vectors = []
+            continue
+        if not active:
+            continue
+        if not line.strip():
+            if vectors:
+                active = False
+            continue
+        values = parse_numeric_tail(line, width)
+        if values is None:
+            continue
+        vectors.append(values)
+        if max_rows is not None and len(vectors) >= max_rows:
+            active = False
+    return vectors
+
+
+def parse_total_time_s(text: str) -> float | None:
+    matches = list(TOTAL_TIME_RE.finditer(text))
+    if not matches:
+        return None
+    match = matches[-1]
+    hours = float(match.group("hours") or 0.0)
+    minutes = float(match.group("minutes") or 0.0)
+    seconds = float(match.group("seconds") or 0.0)
+    total = hours * 3600.0 + minutes * 60.0 + seconds
+    return total if total > 0.0 else None
+
+
+def max_abs_diff(lhs: object, rhs: object) -> float | None:
+    if not isinstance(lhs, list) or not isinstance(rhs, list):
+        return None
+    if len(lhs) != len(rhs):
+        return None
+    max_diff = 0.0
+    for left_row, right_row in zip(lhs, rhs):
+        if not isinstance(left_row, list) or not isinstance(right_row, list):
+            return None
+        if len(left_row) != len(right_row):
+            return None
+        for left, right in zip(left_row, right_row):
+            if not isinstance(left, (int, float)) or not isinstance(right, (int, float)):
+                return None
+            max_diff = max(max_diff, abs(float(left) - float(right)))
+    return max_diff
+
+
+def parse_pexsi_trace(case_dir: Path) -> dict[str, object]:
+    stdout_path = case_dir / "abacus_stdout.log"
+    result: dict[str, object] = {}
+    if not stdout_path.exists():
+        return result
+    active: dict[str, float] = {}
+    counts: dict[str, int] = {}
+    for line in stdout_path.read_text(errors="replace").splitlines():
+        match = PEXSI_TRACE_RE.search(line)
+        if not match:
+            continue
+        stage = match.group("stage")
+        event = match.group("event")
+        trace_time = float(match.group("time"))
+        key = TRACE_STAGE_KEYS.get(stage)
+        if key is None:
+            continue
+        if event == "begin":
+            active[stage] = trace_time
+            continue
+        begin = active.pop(stage, None)
+        if begin is None:
+            continue
+        duration = max(0.0, trace_time - begin)
+        result[key] = float(result.get(key, 0.0)) + duration
+        count_key = key.removesuffix("_s") + "_count"
+        counts[count_key] = counts.get(count_key, 0) + 1
+    result.update(counts)
+    return result
+
+
+def parse_log(case_dir: Path, suffix: str) -> dict[str, object]:
+    log_path = case_dir / f"OUT.{suffix}" / "running_scf.log"
+    result: dict[str, object] = {
+        "log": str(log_path),
+        "converged": False,
+        "energy_ev": None,
+        "total_time_s": None,
+        "max_force_abs": None,
+        "max_stress_abs": None,
+        "pexsidft_s": None,
+        "hsolver_s": None,
+        "transmat2ccs_s": None,
+        "stress_unit": "kbar",
+    }
+    result.update(parse_pexsi_trace(case_dir))
+    if not log_path.exists():
+        return result
+    text = log_path.read_text(errors="replace")
+    lines = text.splitlines()
+    result["converged"] = "#SCF IS CONVERGED" in text or "SCF IS CONVERGED" in text
+    energies = FINAL_ETOT_RE.findall(text)
+    if energies:
+        result["energy_ev"] = float(energies[-1])
+    result["total_time_s"] = parse_total_time_s(text)
+
+    forces = parse_block_vectors(lines, "TOTAL-FORCE", 3, None)
+    if forces:
+        result["forces_ev_per_angstrom"] = forces
+        result["max_force_abs"] = max(abs(v) for row in forces for v in row)
+    stress = parse_block_vectors(lines, "TOTAL-STRESS", 3, 3)
+    if stress:
+        result["stress_kbar"] = stress
+        result["max_stress_abs"] = max(abs(v) for row in stress for v in row)
+
+    for line in lines:
+        if "PEXSIDFT" in line:
+            nums = parse_numeric_tail(line, 1)
+            if nums:
+                result["pexsidft_s"] = nums[-1]
+        if "HSolverLCAO" in line and "solve" in line:
+            nums = parse_numeric_tail(line, 1)
+            if nums:
+                result["hsolver_s"] = nums[-1]
+        if "TransMAT2CCS" in line:
+            nums = parse_numeric_tail(line, 1)
+            if nums:
+                result["transmat2ccs_s"] = nums[-1]
+    return result
+
+
+def run_command(cmd: list[str], cwd: Path, env: dict[str, str], timeout_s: int) -> tuple[int, bool, float]:
+    stdout = cwd / "abacus_stdout.log"
+    stderr = cwd / "abacus_stderr.log"
+    start = time.monotonic()
+    with stdout.open("w", encoding="utf-8") as out, stderr.open("w", encoding="utf-8") as err:
+        proc = subprocess.Popen(cmd, cwd=cwd, env=env, stdout=out, stderr=err, start_new_session=True)
+        try:
+            returncode = proc.wait(timeout=timeout_s)
+            timed_out = False
+        except subprocess.TimeoutExpired:
+            timed_out = True
+            os.killpg(proc.pid, signal.SIGTERM)
+            try:
+                proc.wait(timeout=30)
+            except subprocess.TimeoutExpired:
+                os.killpg(proc.pid, signal.SIGKILL)
+                proc.wait()
+            returncode = 124
+    elapsed = time.monotonic() - start
+    return returncode, timed_out, elapsed
+
+
+def result_path(run_root: Path, case: str, solver: str, np: int) -> Path:
+    return run_root / case / solver / f"np{np}" / "result.json"
+
+
+def load_result(path: Path) -> dict[str, object] | None:
+    if not path.exists():
+        return None
+    return json.loads(path.read_text(encoding="utf-8"))
+
+
+def refresh_result_file(path: Path, case_dir: Path, suffix: str) -> None:
+    data = load_result(path)
+    if data is None:
+        return
+    parsed = parse_log(case_dir, suffix)
+    data.update(parsed)
+    timed_out = bool(data.get("timed_out", False))
+    data["status"] = "timeout" if timed_out else ("converged" if parsed["converged"] else "failed")
+    write_text(path, json.dumps(data, indent=2, sort_keys=True) + "\n")
+
+
+def write_summaries(run_root: Path, cases: list[str], nps: list[int], solvers: list[str]) -> None:
+    rows: list[dict[str, object]] = []
+    by_key: dict[tuple[str, str, int], dict[str, object]] = {}
+    for case in cases:
+        for solver in solvers:
+            for np in nps:
+                data = load_result(result_path(run_root, case, solver, np))
+                if data is None:
+                    continue
+                by_key[(case, solver, np)] = data
+                rows.append(data)
+
+    for data in rows:
+        case = str(data["case"])
+        solver = str(data["solver"])
+        np = int(data["np"])
+        data["delta_e_ev"] = None
+        data["delta_e_ry"] = None
+        data["energy_pass"] = None
+        data["delta_force_max"] = None
+        data["force_pass"] = None
+        data["delta_stress_max"] = None
+        data["stress_pass"] = None
+        if solver != "pexsi":
+            continue
+        ref = by_key.get((case, "scalapack_gvx", np))
+        if not ref:
+            continue
+        energy = data.get("energy_ev")
+        ref_energy = ref.get("energy_ev")
+        if isinstance(energy, (int, float)) and isinstance(ref_energy, (int, float)):
+            delta_ev = float(energy) - float(ref_energy)
+            data["delta_e_ev"] = delta_ev
+            data["delta_e_ry"] = delta_ev / RY_TO_EV
+            data["energy_pass"] = abs(delta_ev / RY_TO_EV) <= ENERGY_TOL_RY
+        delta_force = max_abs_diff(data.get("forces_ev_per_angstrom"), ref.get("forces_ev_per_angstrom"))
+        if delta_force is None:
+            force = data.get("max_force_abs")
+            ref_force = ref.get("max_force_abs")
+            if isinstance(force, (int, float)) and isinstance(ref_force, (int, float)):
+                delta_force = abs(float(force) - float(ref_force))
+        if delta_force is not None:
+            data["delta_force_max"] = delta_force
+            data["force_pass"] = delta_force <= FORCE_TOL_EV_A
+        delta_stress = max_abs_diff(data.get("stress_kbar"), ref.get("stress_kbar"))
+        if delta_stress is None:
+            stress = data.get("max_stress_abs")
+            ref_stress = ref.get("max_stress_abs")
+            if isinstance(stress, (int, float)) and isinstance(ref_stress, (int, float)):
+                delta_stress = abs(float(stress) - float(ref_stress))
+        if delta_stress is not None:
+            data["delta_stress_max"] = delta_stress
+            data["stress_pass"] = delta_stress <= STRESS_TOL_KBAR
+
+    summary_json = run_root / "summary.json"
+    write_text(summary_json, json.dumps(rows, indent=2, sort_keys=True) + "\n")
+
+    headers = [
+        "case",
+        "solver",
+        "np",
+        "status",
+        "returncode",
+        "wall_s",
+        "energy_ev",
+        "delta_e_ev",
+        "delta_e_ry",
+        "energy_pass",
+        "max_force_abs",
+        "delta_force_max",
+        "force_pass",
+        "max_stress_abs",
+        "delta_stress_max",
+        "stress_pass",
+        "total_time_s",
+        "hsolver_s",
+        "pexsidft_s",
+        "transmat2ccs_s",
+        "trace_fermiop_s",
+        "trace_fermiop_count",
+        "trace_symbolic_s",
+        "trace_transmat2ccs_s",
+        "trace_transmat22d_s",
+        "trace_loadhs_s",
+        "trace_retrieve_s",
+        "case_dir",
+    ]
+    lines = ["\t".join(headers)]
+    for data in sorted(rows, key=lambda x: (str(x["case"]), int(x["np"]), str(x["solver"]))):
+        lines.append("\t".join(str(data.get(h, "")) for h in headers))
+    write_text(run_root / "summary.tsv", "\n".join(lines) + "\n")
+
+
+def parse_csv_arg(value: str, allowed: Iterable[str] | None = None) -> list[str]:
+    items = [item.strip() for item in value.split(",") if item.strip()]
+    if allowed is not None:
+        allowed_set = set(allowed)
+        bad = [item for item in items if item not in allowed_set]
+        if bad:
+            raise argparse.ArgumentTypeError(f"unknown value(s): {', '.join(bad)}")
+    return items
+
+
+def main(argv: list[str] | None = None) -> int:
+    repo = repo_root()
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--run-root", type=Path, default=None)
+    parser.add_argument("--abacus-bin", type=Path, default=repo / "build-pexsi-conda/abacus_basic_para")
+    parser.add_argument("--mpirun", default="mpirun")
+    parser.add_argument("--timeout", type=int, default=3600)
+    parser.add_argument("--pexsi-npole", type=int, default=120)
+    parser.add_argument("--pexsi-elec-thr", type=float, default=0.001)
+    parser.add_argument("--pexsi-mu-thr", type=float, default=0.05)
+    parser.add_argument("--pexsi-zero-thr", type=float, default=1.0e-10)
+    parser.add_argument(
+        "--smearing-sigma",
+        type=float,
+        default=None,
+        help="override the default Fermi-Dirac smearing sigma and pexsi_temp for all selected cases",
+    )
+    parser.add_argument("--np-list", default=",".join(str(x) for x in DEFAULT_NP))
+    parser.add_argument("--solvers", default=",".join(DEFAULT_SOLVERS))
+    parser.add_argument("--cases", default=",".join(CASES.keys()))
+    parser.add_argument("--force", action="store_true", help="rerun jobs with an existing result.json")
+    parser.add_argument("--prepare-only", action="store_true", help="only generate input directories")
+    parser.add_argument("--summary-only", action="store_true", help="only refresh/write summaries without running missing jobs")
+    parser.add_argument("--reparse-existing", action="store_true", help="reparse existing logs before writing summaries")
+    args = parser.parse_args(argv)
+
+    cases = parse_csv_arg(args.cases, CASES.keys())
+    solvers = parse_csv_arg(args.solvers, DEFAULT_SOLVERS)
+    nps = [int(x) for x in parse_csv_arg(args.np_list)]
+    run_root = args.run_root or Path(f"/tmp/abacus_pexsi_51_scaling_{timestamp()}")
+    run_root.mkdir(parents=True, exist_ok=True)
+
+    meta = {
+        "created_at": _dt.datetime.now().isoformat(timespec="seconds"),
+        "repo": str(repo),
+        "abacus_bin": str(args.abacus_bin),
+        "timeout_s": args.timeout,
+        "cases": cases,
+        "solvers": solvers,
+        "np_list": nps,
+        "case_options": {
+            name: {
+                "natom": CASES[name].natom,
+                "kmesh": "x".join(str(x) for x in CASES[name].kmesh),
+                "parallel_kpoints": CASES[name].kpar_nks,
+                "smearing_sigma": args.smearing_sigma
+                if args.smearing_sigma is not None
+                else CASES[name].smearing_sigma,
+            }
+            for name in cases
+        },
+        "pexsi_npole": args.pexsi_npole,
+        "pexsi_elec_thr": args.pexsi_elec_thr,
+        "pexsi_mu_thr": args.pexsi_mu_thr,
+        "pexsi_zero_thr": args.pexsi_zero_thr,
+        "energy_tolerance_ry": ENERGY_TOL_RY,
+        "force_tolerance_ev_per_angstrom": FORCE_TOL_EV_A,
+        "stress_tolerance_gpa": STRESS_TOL_GPA,
+        "stress_tolerance_kbar": STRESS_TOL_KBAR,
+        "stress_output_unit": "kbar",
+    }
+    write_text(run_root / "metadata.json", json.dumps(meta, indent=2, sort_keys=True) + "\n")
+
+    env = os.environ.copy()
+    env["OMP_NUM_THREADS"] = env.get("OMP_NUM_THREADS", "1")
+    env["OPENBLAS_NUM_THREADS"] = env.get("OPENBLAS_NUM_THREADS", "1")
+    env["MKL_NUM_THREADS"] = env.get("MKL_NUM_THREADS", "1")
+    env["ABACUS_PEXSI_TRACE"] = env.get("ABACUS_PEXSI_TRACE", "stage")
+
+    conda_bin = repo / ".conda/pexsi-env/bin"
+    conda_lib = repo / ".conda/pexsi-env/lib"
+    if conda_bin.exists():
+        env["PATH"] = f"{conda_bin}:{env.get('PATH', '')}"
+    if conda_lib.exists():
+        env["LD_LIBRARY_PATH"] = f"{conda_lib}:{env.get('LD_LIBRARY_PATH', '')}"
+
+    for case_name in cases:
+        case = CASES[case_name]
+        if args.smearing_sigma is not None:
+            case = replace(case, smearing_sigma=args.smearing_sigma)
+        for np in nps:
+            for solver in solvers:
+                job_dir = run_root / case.name / solver / f"np{np}"
+                result_file = job_dir / "result.json"
+                suffix = f"p51_{case.name}_{solver}_np{np}"
+                if result_file.exists() and not args.force:
+                    if args.reparse_existing:
+                        refresh_result_file(result_file, job_dir, suffix)
+                    print(f"skip existing {case.name} {solver} np={np}")
+                    continue
+                if args.summary_only:
+                    continue
+
+                if job_dir.exists() and args.force:
+                    shutil.rmtree(job_dir)
+                job_dir.mkdir(parents=True, exist_ok=True)
+                prepare_structure(repo, case, job_dir)
+                write_text(
+                    job_dir / "INPUT",
+                    input_text(
+                        repo,
+                        case,
+                        solver,
+                        np,
+                        suffix,
+                        args.pexsi_npole,
+                        args.pexsi_elec_thr,
+                        args.pexsi_mu_thr,
+                        args.pexsi_zero_thr,
+                    ),
+                )
+                if args.prepare_only:
+                    continue
+
+                cmd = [args.mpirun, "-np", str(np), str(args.abacus_bin)]
+                print(f"run case={case.name} solver={solver} np={np} timeout={args.timeout}s", flush=True)
+                returncode, timed_out, wall_s = run_command(cmd, job_dir, env, args.timeout)
+                parsed = parse_log(job_dir, suffix)
+                status = "timeout" if timed_out else ("converged" if parsed["converged"] else "failed")
+                data: dict[str, object] = {
+                    "case": case.name,
+                    "natom": case.natom,
+                    "kmesh": "x".join(str(x) for x in case.kmesh),
+                    "solver": solver,
+                    "np": np,
+                    "status": status,
+                    "returncode": returncode,
+                    "timed_out": timed_out,
+                    "wall_s": wall_s,
+                    "case_dir": str(job_dir),
+                }
+                data.update(parsed)
+                write_text(result_file, json.dumps(data, indent=2, sort_keys=True) + "\n")
+                write_summaries(run_root, cases, nps, solvers)
+
+    write_summaries(run_root, cases, nps, solvers)
+    print(f"summary: {run_root / 'summary.tsv'}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())