Harden host-side sparse index width and allocation safety #2822

Draft
LwhJesse wants to merge 2 commits into su2code:develop from LwhJesse:fix/host-sparse-index-overflow

LwhJesse commented May 25, 2026

Harden Host-Side Sparse Index Width and Allocation Safety

Summary

This PR hardens the host-side sparse linear algebra path against integer-width overflow and silent narrowing.

It introduces an explicit 64-bit host sparse-index alias,

using su2_index_t = std::uint64_t;

and uses it in the host sparse-pattern / CSR metadata path where index width and allocation size matter. It also adds checked multiplication for the main storage-size calculations and checked narrowing at the remaining external boundaries.

This PR does not modify the CUDA backend implementation itself. In particular, it does not change:

  • Common/src/linear_algebra/CSysMatrixGPU.cu
  • Common/include/linear_algebra/GPUComms.cuh

CUDA matvec correctness remains the responsibility of PR #2816.

Problem

Several sparse linear algebra paths in SU2 currently rely on unsigned long or int for sparse metadata, storage-size arithmetic, or boundary conversion.

On LP64 systems such as 64-bit Linux and macOS, this goes unnoticed because unsigned long is 64-bit. On LLP64 systems such as 64-bit Windows, unsigned long is only 32-bit. At sufficiently large local sparse-matrix sizes, that can lead to:

  • CSR metadata truncation
  • wrapped products such as nnz * nVar * nEqn
  • undersized matrix or vector allocations
  • wrapped block-value addressing
  • silent truncation when converting to external library integer types
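The wraparound itself can be sketched in a few lines. This is an illustration only, not code from this PR: the function names are hypothetical, and std::uint32_t stands in for a 32-bit unsigned long so the effect is visible on any platform.

```cpp
#include <cassert>
#include <cstdint>

// Emulates nnz * nVar * nEqn where the index type is 32 bits wide
// (std::uint32_t stands in for unsigned long on LLP64).
// Unsigned arithmetic wraps modulo 2^32 without any diagnostic.
inline std::uint32_t StorageSize32(std::uint32_t nnz, std::uint32_t nVar, std::uint32_t nEqn) {
  return nnz * nVar * nEqn;
}

// The same product evaluated in 64-bit arithmetic keeps the full value.
inline std::uint64_t StorageSize64(std::uint64_t nnz, std::uint64_t nVar, std::uint64_t nEqn) {
  return nnz * nVar * nEqn;
}
```

For nnz = 171,798,692 and nVar = nEqn = 5, the 64-bit product is 4,294,967,300, while the 32-bit product wraps to 4, so an allocation sized from it would be far too small.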

For the main matrix storage, the critical host-side product is:

nnz_blocks * nVar * nEqn

If that is evaluated in 32-bit arithmetic, the largest representable unsigned value is:

2^32 - 1 = 4,294,967,295

So the first overflowing matrix-scalar count is:

4,294,967,296

For double, that means:

4,294,967,296 * 8 bytes = 34,359,738,368 bytes

which is:

  • exactly 32 GiB binary
  • about 34.36 GB decimal

That is the origin of the roughly 34 GB host-side threshold.

To connect that number to an actual sparse matrix, take a common square-block case with nVar = nEqn = 5. Then each nonzero block contributes 25 matrix scalars, so the first overflowing host-side case is:

nnz_blocks = 171,798,692
nVar = nEqn = 5

nnz_blocks * nVar * nEqn
= 171,798,692 * 25
= 4,294,967,300

This is already larger than 2^32 - 1 = 4,294,967,295, so the host-side 32-bit product has crossed the limit. In memory terms, that same case corresponds to:

4,294,967,300 * 8 bytes = 34,359,738,400 bytes

which is again about:

  • 32 GiB binary
  • 34.36 GB decimal

So the 34.36 GB figure is not a separate back-of-the-envelope estimate; it is simply the byte size of the first 5 x 5 double-precision sparse matrix whose nnz_blocks * nVar * nEqn product no longer fits in 32-bit unsigned arithmetic.

Using a rough relation nnz_blocks ~= N_local * z, with z the average nonzero-block count per row, this 5 x 5 threshold corresponds approximately to:

  • about 17.18 million local points at z = 10
  • about 8.59 million local points at z = 20

The same argument gives the corresponding first-overflow cases for larger square blocks:

  • 6 x 6: 119,304,648 * 36 = 4,294,967,328
  • 7 x 7: 87,652,394 * 49 = 4,294,967,306

So the threshold is high, but it is still a local sparse-matrix scale that can be reached in large implicit runs.
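The thresholds above can be reproduced with a short calculation. The sketch below is not part of this PR; the helper names are hypothetical, and z is the assumed average nonzero-block count per row from the rough relation nnz_blocks ~= N_local * z.

```cpp
#include <cassert>
#include <cstdint>

// Smallest nnz_blocks for which nnz_blocks * blockDim * blockDim reaches 2^32,
// i.e. the first block count whose scalar count no longer fits in 32-bit unsigned.
inline std::uint64_t FirstOverflowingBlockCount(std::uint64_t blockDim) {
  assert(blockDim > 0);
  const std::uint64_t limit = std::uint64_t{1} << 32;
  const std::uint64_t scalarsPerBlock = blockDim * blockDim;
  return (limit + scalarsPerBlock - 1) / scalarsPerBlock;  // ceiling division
}

// Rough local point count for a given block count, using nnz_blocks ~= N_local * z.
inline std::uint64_t ApproxLocalPoints(std::uint64_t nnzBlocks, std::uint64_t z) {
  assert(z > 0);
  return nnzBlocks / z;
}
```

FirstOverflowingBlockCount(5), (6) and (7) give 171,798,692, 119,304,648 and 87,652,394 respectively, matching the figures listed above.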

For reference, the current legacy CUDA kernel in develop can overflow earlier because it uses a signed 32-bit int matrix-scalar offset internally. The largest positive signed 32-bit value is:

2^31 - 1 = 2,147,483,647

so the first overflowing matrix-scalar count is:

2,147,483,648

For double, that means:

2,147,483,648 * 8 bytes = 17,179,869,184 bytes

which is:

  • exactly 16 GiB binary
  • about 17.18 GB decimal

That estimate is specific to the legacy custom CUDA matvec path currently used in develop.
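As a small illustration of that signed limit (again a sketch with hypothetical names, not the CUDA kernel code), the check below tests whether a matrix-scalar count is representable as a signed 32-bit int and computes the matching double-precision storage size in 64-bit arithmetic.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Assumption: 8-byte IEEE 754 doubles, as in the byte counts quoted in the text.
constexpr std::uint64_t kBytesPerDouble = 8;

// True if a matrix-scalar count is representable as a signed 32-bit int.
inline bool FitsSignedInt32(std::uint64_t scalarCount) {
  return scalarCount <= static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
}

// Storage in bytes for that many double-precision scalars.
// Valid for counts far below 2^61, which covers every case discussed here.
inline std::uint64_t BytesAsDouble(std::uint64_t scalarCount) {
  return scalarCount * kBytesPerDouble;
}
```

A count of 2,147,483,647 passes the check, while 2,147,483,648 does not and corresponds to 17,179,869,184 bytes.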

PR #2816 addresses that CUDA-side issue by replacing the old custom matvec kernel, and with it the legacy signed-int matrix-offset path that causes this earlier overflow limit. That is why this PR does not modify the CUDA backend itself.

What this PR changes

  • Introduces su2_index_t = std::uint64_t for the host sparse-index chain that needs it.
  • Propagates that type through graph-toolbox sparse-pattern aliases, host-side CSysMatrix sparse metadata, CSysVector size/index API, CPastixWrapper sparse input handling, and the sparse-pattern-facing geometry accessors required for that propagation.
  • Adds checked multiplication for the main host allocation-sensitive products, including:
    • nnz * nVar * nEqn
    • nnz_ilu * nVar * nEqn
    • nPointDomain * nVar * nEqn
    • numBlk * numVar
    • numBlkDomain * numVar
  • Adds checked narrowing for PaStiX boundary conversion.
  • Keeps device sparse indices as unsigned long, but range-checks host sparse indices before converting them for the current CUDA upload path.

After this change, the targeted host-side sparse-index and allocation path no longer fails by silent wraparound or silent truncation. If a value exceeds the remaining boundary types, the code now fails explicitly through SU2_MPI::Error(...).
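The two checks can be sketched roughly as follows. This is a minimal standalone illustration, not the code in this PR: CheckedMul and CheckedNarrow are hypothetical names, and standard exceptions stand in for the SU2_MPI::Error(...) call so the example compiles without SU2.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <type_traits>

using su2_index_t = std::uint64_t;  // alias as introduced by this PR

// Hypothetical helper: multiply two sizes, failing loudly instead of wrapping.
inline su2_index_t CheckedMul(su2_index_t a, su2_index_t b) {
  if (a != 0 && b > std::numeric_limits<su2_index_t>::max() / a) {
    throw std::overflow_error("storage-size product overflows su2_index_t");
  }
  return a * b;
}

// Hypothetical helper: convert to a narrower integer type (for example the
// integer type of an external library), failing loudly instead of truncating.
template <class To>
To CheckedNarrow(su2_index_t value) {
  static_assert(std::is_integral<To>::value, "To must be an integer type");
  if (value > static_cast<su2_index_t>(std::numeric_limits<To>::max())) {
    throw std::range_error("sparse index does not fit the target integer type");
  }
  return static_cast<To>(value);
}
```

A storage size would then be computed as CheckedMul(CheckedMul(nnz, nVar), nEqn), and a boundary conversion as CheckedNarrow<std::int32_t>(index). The division-based test avoids compiler-specific overflow builtins at the cost of one extra division per multiplication, which is negligible for setup-time size calculations.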

Validation

  • git diff --check passed.
  • The touched CPU/CUDA object files rebuilt successfully in the available build directories.
  • No CUDA backend implementation files were modified.

For numerical validation, I ran the existing six-case correctness harness with a two-way CPU comparison:

  • develop CPU
  • this branch CPU

Cases:

  • periodic2d_sector
  • udf_lam_flatplate_s
  • udf_lam_flatplate_m
  • udf_lam_flatplate_l
  • udf_test_11_probes_s
  • udf_test_11_probes_m

Result:

  • this branch CPU matched develop CPU in all 6 cases
  • the final common numeric fields in history.csv matched exactly in all 6 cases
  • max_abs_delta = 0.0 for every case

For CUDA, the relevant correctness discussion remains in PR #2816: this branch still uses the legacy pre-#2816 custom CUDA matvec implementation, and this PR intentionally does not modify that backend. The signed-int offset limit in develop described above is resolved by the backend replacement in #2816, not by this PR.

Notes on testing

This PR does not add a direct large-memory reproducer for the original host-side overflow. The original failure mode depends on LLP64-style 32-bit unsigned long arithmetic, and a faithful end-to-end reproduction would require a much larger local sparse structure than is practical for a routine test here.

PR Checklist

  • I am submitting my contribution to the develop branch.
  • My contribution generates no new compiler warnings (try with --warnlevel=3 when using meson).
  • My contribution is commented and consistent with SU2 style (https://su2code.github.io/docs_v7/Style-Guide/).
  • I used the pre-commit hook to prevent dirty commits and used pre-commit run --all to format old commits.
  • I have added a test case that demonstrates my contribution, if necessary.
  • I have updated appropriate documentation (Tutorials, Docs Page, config_template.cpp), if necessary.
