Skip to content

Replace custom CUDA CSysMatrix matvec with cuSPARSE BSR SpMV#2816

Merged
pcarruscag merged 2 commits into
su2code:developfrom
LwhJesse:replace-cuda-matvec-with-cusparse-bsr
May 27, 2026
Merged

Replace custom CUDA CSysMatrix matvec with cuSPARSE BSR SpMV#2816
pcarruscag merged 2 commits into
su2code:developfrom
LwhJesse:replace-cuda-matvec-with-cusparse-bsr

Conversation

@LwhJesse
Copy link
Copy Markdown

@LwhJesse LwhJesse commented May 17, 2026

Summary

This PR replaces the custom CUDA CSysMatrix block-sparse matrix-vector product kernel with cuSPARSE BSR SpMV for the CUDA square-block path.

This is not only a performance or maintenance refactor. It is primarily a correctness fix.

The previous custom CUDA kernel has two independent numerical correctness problems:

  1. It silently produces incorrect results for some valid square-block inputs.
  2. It also silently executes on rectangular block layouts (nVar != nEqn), but that path was never semantically correct and can produce wrong results.

Because of these issues, the previous CUDA kernel was not semantically safe for all inputs accepted by the interface. For the square-block CUDA path, the matrix storage is already BSR-like, so this PR replaces the custom kernel with cuSPARSE BSR SpMV instead of extending a path with known correctness mismatches.

What changed

  • Removed the custom CUDA GPUMatrixVectorProductAdd kernel.
  • Replaced CSysMatrix<ScalarType>::GPUMatrixVectorProduct with cuSPARSE BSR SpMV for the CUDA path.
  • Added an explicit CUDA-side guard for nVar != nEqn.
  • Added cuSPARSE as a CUDA build dependency in Meson.
  • Kept the CPU matvec path unchanged.

Why the previous CUDA kernel is being replaced

The motivation here is not only simplification. The previous CUDA kernel had correctness risks that were reproducible at the matrix-vector-product level.

Problem 1: square-block routing bug

The old kernel derives the output component from a fixed threadNo-based blockRow, while the actual matrix entry being processed changes with the loop index:

threadNo = threadIdx.x % 32
blockRow = (threadNo / nVar) % nVar
index += activeThreads

But the true dense-block row of the current matrix entry is determined by the current index, not by the fixed thread id.

That means a single thread can process entries belonging to different local block rows, while still accumulating everything into one fixed output component. This creates an algorithmic routing mismatch.

A minimal synthetic square-block case demonstrates it directly:

nPointDomain = 1
nPoint       = 2
nVar = nEqn  = 5

row_ptr = [0, 2]
col_ind = [0, 1]

matrix[0]  = 1
matrix[30] = 10
all other matrix entries = 0

vec[:] = 1

Expected CPU reference output:

[1, 10, 0, 0, 0]

Observed outputs:

CPU reference   = [1, 10, 0, 0, 0]
old CUDA kernel = [11, 0, 0, 0, 0]
cuSPARSE BSR    = [1, 10, 0, 0, 0]

In this example, one CUDA lane processes both index = 0 and index = 30. Those two entries belong to different local block rows, but the old kernel routes both contributions to the same output component.

Problem 2: rectangular block layouts were accepted, but not semantically consistent with the CPU path

CSysMatrix has a general block interface with shape nVar x nEqn.

The CPU path uses the correct input-vector stride:

vec[col_j * nEqn]

However, the old CUDA kernel uses:

vec[d_col_ind[blockNo] * nVar + blockCol]

That is not semantically consistent with the CPU rectangular-block matvec definition when nVar != nEqn.

So although the old CUDA kernel would execute for rectangular blocks, that execution was not actually equivalent to the CPU rectangular-block matvec semantics. In other words, the rectangular CUDA path appeared to be accepted by the implementation, but its indexing was not consistent with the CPU rectangular-block definition.

A minimal synthetic rectangular case shows this clearly:

nPointDomain = 1
nPoint       = 2
nVar         = 3
nEqn         = 6

row_ptr = [0, 2]
col_ind = [0, 1]

matrix[0]  = 5
matrix[30] = 7

vec[0] = 1
vec[3] = 100
vec[6] = 1
all other vec entries = 0

Observed outputs:

CPU rectangular reference = [5, 0, 7]
old CUDA kernel           = [705, 0, 0]

This is not just a corner-case API detail. It shows that the previous CUDA kernel could silently accept a rectangular block layout and produce a wrong result without any explicit failure.

Why explicit rejection is better than silent execution

This PR intentionally supports only the CUDA square-block case:

nVar == nEqn

For nVar != nEqn, the new CUDA path now fails explicitly instead of silently running an incorrect algorithm.

So this PR does not reduce the set of rectangular CUDA cases that were correctly supported before, because the old kernel did not provide a semantically correct rectangular CUDA implementation in the first place.

The behavior change is:

before: rectangular CUDA input may run and silently compute the wrong answer
after:  rectangular CUDA input is rejected explicitly

That is a correctness improvement, not a compatibility regression.

Why cuSPARSE BSR is the right replacement here

For the CUDA square-block path, the existing storage is BSR-like in the sense relevant here:

  • row_ptr
  • col_ind
  • matrix as dense block payloads

The cuSPARSE path uses the matching BSR block direction for the existing block payload layout and provides a well-tested implementation of the required operation.

Given the observed correctness problems in both square and rectangular cases, using cuSPARSE for the valid square-block CUDA path is more robust than preserving a separate custom implementation for that path.

Validation

Build

  • Serial CUDA build completed locally.
  • MPI-enabled CUDA build could not be evaluated on my local Arch/GCC 16.1.1 environment because nvcc failed Meson's CUDA compiler sanity check before SU2 compilation. This happened before building the modified source files and was not a cuSPARSE link failure.

1. Synthetic square-block validation

Compared:

  • CPU reference BSR matvec
  • old custom CUDA kernel
  • new cuSPARSE BSR path

Result:

CPU reference   = [1, 10, 0, 0, 0]
old CUDA kernel = [11, 0, 0, 0, 0]
cuSPARSE BSR    = [1, 10, 0, 0, 0]

2. Synthetic rectangular-block validation

Compared:

  • CPU rectangular-block reference
  • old custom CUDA kernel
  • new CUDA-path behavior

Result:

CPU rectangular reference = [5, 0, 7]
old CUDA kernel           = [705, 0, 0]
new CUDA path             = explicit rejection via square-block guard

3. Real SU2 integration cases

Ran 6 existing runnable implicit Krylov cases with four-way comparison:

develop CPU
develop CUDA
new branch CPU
new branch CUDA

Tested cases:

periodic2d_sector
udf_lam_flatplate_s
udf_lam_flatplate_m
udf_lam_flatplate_l
udf_test_11_probes_s
udf_test_11_probes_m

Summary:

  • new CPU matched develop CPU in 6/6 cases.
  • new CUDA matched new CPU in 6/6 cases.
  • develop CUDA differed from CPU / new CUDA in 5/6 cases.
  • No GPUassert, cuSPARSEassert, illegal memory access, or segmentation fault was observed.
  • One case showed NaN flags in all four runs, so that behavior was not new-CUDA-specific.

For cases without standard residual columns, the final history.csv row was compared across all common numeric fields.

Scope

This PR changes only the CUDA matrix-vector product implementation for the square-block CSysMatrix path.

It does not change:

  • CPU matvec semantics
  • Krylov solver logic
  • preconditioner logic
  • CFD solver physics
  • matrix assembly semantics

Follow-up work

This PR focuses on correctness first.

The current implementation creates/destroys cuSPARSE descriptors and allocates/frees the SpMV workspace inside each matvec call. That can be optimized later by caching:

  • the cuSPARSE handle
  • BSR matrix descriptor
  • dense vector descriptors
  • SpMV workspace
  • optional cusparseSpMV_preprocess() state

Follow-up draft

I have also prepared a follow-up draft PR in my fork to show the next intended step after this correctness PR:

That draft is based on this PR branch, not directly on develop, so it shows only the incremental resource-lifecycle change: caching the cuSPARSE handle, BSR matrix descriptor, and SpMV workspace after the cuSPARSE BSR path introduced here.

Related Work

This replaces the earlier narrower draft PR #2813, which attempted to fix only the contribution-routing issue inside the existing custom CUDA kernel. After additional testing, the scope changed to replacing the square-block CUDA path with cuSPARSE BSR SpMV and explicitly rejecting unsupported rectangular CUDA blocks.

PR Checklist

  • I am submitting my contribution to the develop branch.
  • My contribution generates no new compiler warnings (try with --warnlevel=3 when using meson).
  • My contribution is commented and consistent with SU2 style (https://su2code.github.io/docs_v7/Style-Guide/).
  • I used the pre-commit hook to prevent dirty commits and used pre-commit run --all to format old commits.
  • I have added a test case that demonstrates my contribution, if necessary.
  • I have updated appropriate documentation (Tutorials, Docs Page, config_template.cpp), if necessary.

Copy link
Copy Markdown
Member

@pcarruscag pcarruscag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything else in cuSparse that would help with preconditioner implementations?
I think it can do triangular solves, but using its own analysis phase which is not necessary in our case since we prepare the parallelization schedule ourselves.
I'm trying to decide if just the matrix-vector product justifies adding a very large dependency.

@LwhJesse
Copy link
Copy Markdown
Author

Thanks for taking a look.

For the preconditioner side, I looked at the current LU_SGS GPU work in #2539 and the follow-up review-fix PR #2751. That path already builds SU2-side level scheduling / graph partitioning and uses custom forward/backward kernels, so I do not think this PR directly enables the current LU_SGS GPU implementation to use cuSPARSE. The possible overlap would be a separate follow-up investigation of whether cuSPARSE triangular-solve APIs can replace or complement part of that scheduled preconditioner path.

The reason I still prefer keeping cuSPARSE here is that the CUDA MVP work has been moving toward a GPU-resident linear-algebra backend, not just a faster standalone kernel. The original CUDA MVP work in #2346 showed that repeated host/device traffic and per-call setup were major bottlenecks, and PR #2806 moved the Jacobian upload lifetime into CSysMatrixVectorProduct. A natural next step is to keep the matrix and Krylov vectors resident on the GPU and move more of the iteration loop there.

For that direction, I think SpMV is a good boundary for a library-backed primitive. A custom kernel can repair this bug, but SU2 would still own the low-level BSR-like SpMV contract: block layout, indexing/thread mapping, reduction behavior, and future cache/lifetime rules. With cuSPARSE, that basic sparse-matvec contract moves to the vendor library, while SU2-specific custom kernels can stay focused on preconditioner scheduling, fused solver operations, and higher-level GPU-resident orchestration.

So I see the tradeoff as: custom fix means smaller dependency surface and easier short-term patch; cuSPARSE means larger dependency surface, but a stronger foundation for the future GPU-resident Krylov path. If the project preference is still to avoid cuSPARSE for SpMV-only use, I can rework this into a custom-kernel correctness fix.

Copy link
Copy Markdown
Member

@pcarruscag pcarruscag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok we can use this for SpMv and probably cuBlas for dot products.
Vector-Vector operations I want to implement with custom kernels so that we keep natural math syntax via the expression templates.

@LwhJesse
Copy link
Copy Markdown
Author

Sounds good, thanks

@LwhJesse
Copy link
Copy Markdown
Author

One CUDA-version issue I noticed after checking the cuSPARSE API history: the current implementation uses the generic cusparseSpMV() BSR path via CUSPARSE_SPMV_BSR_ALG1.

The documented support I found for generic cusparseSpMV() with BSR starts from the CUDA 13.1 cuSPARSE documentation line. So the current implementation should be treated as requiring CUDA/cuSPARSE 13.1+ unless we add a lower-version-compatible path.

I think there are two reasonable ways to handle this PR:

  1. Keep the current generic cuSPARSE BSR SpMV implementation, and add a configure-time CUDA/cuSPARSE version guard so older CUDA versions fail clearly instead of failing later in compilation.

  2. If SU2 should keep CUDA 12.x / older CUDA compatibility for this path, I can rework the implementation to a lower-version-compatible BSR SpMV path instead, for example using the legacy BSR matvec API if its index-type constraints are acceptable.

In both cases, the main correctness goal of the PR stays the same: avoid the old custom CUDA kernel silently producing wrong results, and explicitly reject unsupported rectangular blocks instead of running an invalid path.

@pcarruscag
Copy link
Copy Markdown
Member

What are the constraints on index type? And is it 13.1 only or 13.X is ok?

@LwhJesse
Copy link
Copy Markdown
Author

For the current generic cusparseSpMV() path, the index constraint is not 32-bit only. SU2's sparse-pattern arrays here are unsigned long (row_ptr / col_ind), and the current implementation maps that to either CUSPARSE_INDEX_32I or CUSPARSE_INDEX_64I depending on sizeof(unsigned long). So the current path should support both 32-bit and 64-bit sparse indices, as long as the cuSPARSE version supports generic BSR SpMV.

The issue is that I found explicit documentation for generic cusparseSpMV() with BSR and CUSPARSE_SPMV_BSR_ALG1 starting from the CUDA 13.1 cuSPARSE docs. I did not find explicit support for this path in the CUDA 13.0 or older docs, so I would not assume the current implementation works with those versions.

For a lower-version-compatible path, the main constraint is different: the legacy BSR matvec API is int-based for the BSR row/column indices and dimensions. So that fallback would effectively require the sparse-pattern indices to fit in 32-bit integers. If they do not, we would need explicit range checks plus an int32 index-copy path, or avoid that legacy fallback.

So the current implementation is cleanest with a CUDA/cuSPARSE 13.1+ guard. If SU2 needs CUDA 12.x / older CUDA compatibility, I can rework the implementation, but that path would have to handle the legacy 32-bit index limitation explicitly.

@pcarruscag
Copy link
Copy Markdown
Member

So if we change the type to uint64_t are the older cusparse versions fine?
Is the limitation the type-switching or just 32bit support in old versions?

@LwhJesse
Copy link
Copy Markdown
Author

LwhJesse commented May 27, 2026

No. Changing SU2's internal type to uint64_t would not make older cuSPARSE versions equivalent to the current path.

Older cuSPARSE can use the legacy BSR matvec API, but that API is int-based for BSR indices, so it is effectively a 32-bit-index path.

The current generic cusparseSpMV() BSR path is different: it can represent either 32-bit or 64-bit sparse indices, which matches SU2's unsigned long index type. The issue is that this generic BSR SpMV path is only explicitly documented from CUDA 13.1 onward.

@pcarruscag
Copy link
Copy Markdown
Member

ok. it's a localized changed and you are working actively on it so I'll merge it and if we need to revisit it's easy to go back

@pcarruscag pcarruscag merged commit 732eaef into su2code:develop May 27, 2026
37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants