Replace custom CUDA CSysMatrix matvec with cuSPARSE BSR SpMV#2816
Conversation
pcarruscag
left a comment
There was a problem hiding this comment.
Is there anything else in cuSparse that would help with preconditioner implementations?
I think it can do triangular solves, but using its own analysis phase which is not necessary in our case since we prepare the parallelization schedule ourselves.
I'm trying to decide if just the matrix-vector product justifies adding a very large dependency.
|
Thanks for taking a look. For the preconditioner side, I looked at the current LU_SGS GPU work in #2539 and the follow-up review-fix PR #2751. That path already builds SU2-side level scheduling / graph partitioning and uses custom forward/backward kernels, so I do not think this PR directly enables the current LU_SGS GPU implementation to use cuSPARSE. The possible overlap would be a separate follow-up investigation of whether cuSPARSE triangular-solve APIs can replace or complement part of that scheduled preconditioner path. The reason I still prefer keeping cuSPARSE here is that the CUDA MVP work has been moving toward a GPU-resident linear-algebra backend, not just a faster standalone kernel. The original CUDA MVP work in #2346 showed that repeated host/device traffic and per-call setup were major bottlenecks, and PR #2806 moved the Jacobian upload lifetime into For that direction, I think SpMV is a good boundary for a library-backed primitive. A custom kernel can repair this bug, but SU2 would still own the low-level BSR-like SpMV contract: block layout, indexing/thread mapping, reduction behavior, and future cache/lifetime rules. With cuSPARSE, that basic sparse-matvec contract moves to the vendor library, while SU2-specific custom kernels can stay focused on preconditioner scheduling, fused solver operations, and higher-level GPU-resident orchestration. So I see the tradeoff as: custom fix means smaller dependency surface and easier short-term patch; cuSPARSE means larger dependency surface, but a stronger foundation for the future GPU-resident Krylov path. If the project preference is still to avoid cuSPARSE for SpMV-only use, I can rework this into a custom-kernel correctness fix. |
pcarruscag
left a comment
There was a problem hiding this comment.
Ok we can use this for SpMv and probably cuBlas for dot products.
Vector-Vector operations I want to implement with custom kernels so that we keep natural math syntax via the expression templates.
|
Sounds good, thanks |
|
One CUDA-version issue I noticed after checking the cuSPARSE API history: the current implementation uses the generic The documented support I found for generic I think there are two reasonable ways to handle this PR:
In both cases, the main correctness goal of the PR stays the same: avoid the old custom CUDA kernel silently producing wrong results, and explicitly reject unsupported rectangular blocks instead of running an invalid path. |
|
What are the constraints on index type? And is it 13.1 only or 13.X is ok? |
|
For the current generic cusparseSpMV() path, the index constraint is not 32-bit only. SU2's sparse-pattern arrays here are unsigned long (row_ptr / col_ind), and the current implementation maps that to either CUSPARSE_INDEX_32I or CUSPARSE_INDEX_64I depending on sizeof(unsigned long). So the current path should support both 32-bit and 64-bit sparse indices, as long as the cuSPARSE version supports generic BSR SpMV. The issue is that I found explicit documentation for generic cusparseSpMV() with BSR and CUSPARSE_SPMV_BSR_ALG1 starting from the CUDA 13.1 cuSPARSE docs. I did not find explicit support for this path in the CUDA 13.0 or older docs, so I would not assume the current implementation works with those versions. For a lower-version-compatible path, the main constraint is different: the legacy BSR matvec API is int-based for the BSR row/column indices and dimensions. So that fallback would effectively require the sparse-pattern indices to fit in 32-bit integers. If they do not, we would need explicit range checks plus an int32 index-copy path, or avoid that legacy fallback. So the current implementation is cleanest with a CUDA/cuSPARSE 13.1+ guard. If SU2 needs CUDA 12.x / older CUDA compatibility, I can rework the implementation, but that path would have to handle the legacy 32-bit index limitation explicitly. |
|
So if we change the type to uint64_t are the older cusparse versions fine? |
|
No. Changing SU2's internal type to Older cuSPARSE can use the legacy BSR matvec API, but that API is The current generic |
|
ok. it's a localized changed and you are working actively on it so I'll merge it and if we need to revisit it's easy to go back |
Summary
This PR replaces the custom CUDA
CSysMatrixblock-sparse matrix-vector product kernel with cuSPARSE BSR SpMV for the CUDA square-block path.This is not only a performance or maintenance refactor. It is primarily a correctness fix.
The previous custom CUDA kernel has two independent numerical correctness problems:
nVar != nEqn), but that path was never semantically correct and can produce wrong results.Because of these issues, the previous CUDA kernel was not semantically safe for all inputs accepted by the interface. For the square-block CUDA path, the matrix storage is already BSR-like, so this PR replaces the custom kernel with cuSPARSE BSR SpMV instead of extending a path with known correctness mismatches.
What changed
GPUMatrixVectorProductAddkernel.CSysMatrix<ScalarType>::GPUMatrixVectorProductwith cuSPARSE BSR SpMV for the CUDA path.nVar != nEqn.Why the previous CUDA kernel is being replaced
The motivation here is not only simplification. The previous CUDA kernel had correctness risks that were reproducible at the matrix-vector-product level.
Problem 1: square-block routing bug
The old kernel derives the output component from a fixed
threadNo-basedblockRow, while the actual matrix entry being processed changes with the loop index:But the true dense-block row of the current matrix entry is determined by the current
index, not by the fixed thread id.That means a single thread can process entries belonging to different local block rows, while still accumulating everything into one fixed output component. This creates an algorithmic routing mismatch.
A minimal synthetic square-block case demonstrates it directly:
Expected CPU reference output:
Observed outputs:
In this example, one CUDA lane processes both
index = 0andindex = 30. Those two entries belong to different local block rows, but the old kernel routes both contributions to the same output component.Problem 2: rectangular block layouts were accepted, but not semantically consistent with the CPU path
CSysMatrixhas a general block interface with shapenVar x nEqn.The CPU path uses the correct input-vector stride:
However, the old CUDA kernel uses:
That is not semantically consistent with the CPU rectangular-block matvec definition when
nVar != nEqn.So although the old CUDA kernel would execute for rectangular blocks, that execution was not actually equivalent to the CPU rectangular-block matvec semantics. In other words, the rectangular CUDA path appeared to be accepted by the implementation, but its indexing was not consistent with the CPU rectangular-block definition.
A minimal synthetic rectangular case shows this clearly:
Observed outputs:
This is not just a corner-case API detail. It shows that the previous CUDA kernel could silently accept a rectangular block layout and produce a wrong result without any explicit failure.
Why explicit rejection is better than silent execution
This PR intentionally supports only the CUDA square-block case:
For
nVar != nEqn, the new CUDA path now fails explicitly instead of silently running an incorrect algorithm.So this PR does not reduce the set of rectangular CUDA cases that were correctly supported before, because the old kernel did not provide a semantically correct rectangular CUDA implementation in the first place.
The behavior change is:
That is a correctness improvement, not a compatibility regression.
Why cuSPARSE BSR is the right replacement here
For the CUDA square-block path, the existing storage is BSR-like in the sense relevant here:
row_ptrcol_indmatrixas dense block payloadsThe cuSPARSE path uses the matching BSR block direction for the existing block payload layout and provides a well-tested implementation of the required operation.
Given the observed correctness problems in both square and rectangular cases, using cuSPARSE for the valid square-block CUDA path is more robust than preserving a separate custom implementation for that path.
Validation
Build
nvccfailed Meson's CUDA compiler sanity check before SU2 compilation. This happened before building the modified source files and was not a cuSPARSE link failure.1. Synthetic square-block validation
Compared:
Result:
2. Synthetic rectangular-block validation
Compared:
Result:
3. Real SU2 integration cases
Ran 6 existing runnable implicit Krylov cases with four-way comparison:
Tested cases:
Summary:
new CPUmatcheddevelop CPUin 6/6 cases.new CUDAmatchednew CPUin 6/6 cases.develop CUDAdiffered from CPU / new CUDA in 5/6 cases.GPUassert,cuSPARSEassert, illegal memory access, or segmentation fault was observed.NaNflags in all four runs, so that behavior was not new-CUDA-specific.For cases without standard residual columns, the final
history.csvrow was compared across all common numeric fields.Scope
This PR changes only the CUDA matrix-vector product implementation for the square-block
CSysMatrixpath.It does not change:
Follow-up work
This PR focuses on correctness first.
The current implementation creates/destroys cuSPARSE descriptors and allocates/frees the SpMV workspace inside each matvec call. That can be optimized later by caching:
cusparseSpMV_preprocess()stateFollow-up draft
I have also prepared a follow-up draft PR in my fork to show the next intended step after this correctness PR:
That draft is based on this PR branch, not directly on
develop, so it shows only the incremental resource-lifecycle change: caching the cuSPARSE handle, BSR matrix descriptor, and SpMV workspace after the cuSPARSE BSR path introduced here.Related Work
This replaces the earlier narrower draft PR #2813, which attempted to fix only the contribution-routing issue inside the existing custom CUDA kernel. After additional testing, the scope changed to replacing the square-block CUDA path with cuSPARSE BSR SpMV and explicitly rejecting unsupported rectangular CUDA blocks.
PR Checklist
--warnlevel=3when using meson).pre-commit run --allto format old commits.