Ctrls performance by JPRichings · Pull Request #739 · QuEST-Kit/QuEST

JPRichings · 2026-04-28T20:57:32Z

Initial pass on performance changed to remove thrust device_vector that is used to move ctrls to device that is causing a performance impact due to the thrust device_vector moving data from host to device on construction.

Updates version references and fixes doxygen parsing of CMake configured header broken in QuEST-Kit#616

JPRichings · 2026-04-29T07:25:58Z

Todo:

Confirm no horrible race condition is introduced by cudamemcpyToSymbol
Set the ctrls buffer size programmatically or to some sensible limit, say 50? (at least 64KB of constant memory on device so no worries giving ourselves some extra room
Apply fix to all kernels in QuEST

Assumptions in this code:

Single operation on quantum register at a time which allows the assumption that ctrls can be overwritten before each subsequent gate application.

otbrown · 2026-05-03T16:44:03Z

Alloc size 64 qubits -- add as macro somewhere if not done already MAX_QUREG_SIZE

TysonRayJones · 2026-05-03T18:28:13Z

Reminder of other stuff from meeting:

storing targets also in constant mem
retain passing ptrs to kernels; dispatcher passes constant mem ptr

JPRichings · 2026-05-05T20:32:54Z

Way easier to do this instead:

Beginning with CUDA 12.1, you can now pass up to 32,764 bytes as kernel parameters on NVIDIA Volta and above

#define TOTAL_PARAMS (8000) // ints

typedef struct {
    int param[TOTAL_PARAMS];
} param_large_t;

__global__ void kernelLargeParam(__grid_constant__ const param_large_t p,...) {
    // access all parameters from p
}

int main() {
    param_large_t p_large;
    kernelLargeParam<<<GRIDDIM,BLOCKDIM>>>(p_large,...);
    cudaDeviceSynchronize();
}

Note that in both preceding examples, kernel parameters are annotated with the grid_constant qualifier to indicate they are read-only.

reference: https://developer.nvidia.com/blog/cuda-12-1-supports-large-kernel-parameters/

I think this solves our concerns about multi-qureg operations both accessing a common ctrls cache in future.

(this is also much better than the variadic kernel idea I had earlier today)

Other benefits:

Performance: sweet sweet ~~200ns~~ more like a few micro seconds now ive checked my profiling data (could be another factor of two here) performance save possible by removing cudaMemcpyToSymbol call.
Removes any risk of race condition

JPRichings · 2026-05-05T21:39:12Z

Profiling to confirm but here is an extra factor of 2 over #729 (comment):

Total number of gates: 210
Measured probability amplitude of |0..0> state: 9.53674e-07
Calculated probability amplitude of |0..0>, C0 = 1 / 2^20: 9.53674e-07
Measuring final state: (all probabilities should be 0.5)
Qubit 0 measured in state 1 with probability 0.5
Qubit 1 measured in state 1 with probability 0.5
Qubit 2 measured in state 1 with probability 0.5
Qubit 3 measured in state 0 with probability 0.5
Qubit 4 measured in state 0 with probability 0.5
Qubit 5 measured in state 0 with probability 0.5
Qubit 6 measured in state 1 with probability 0.5
Qubit 7 measured in state 0 with probability 0.5
Qubit 8 measured in state 1 with probability 0.5
Qubit 9 measured in state 1 with probability 0.5
Qubit 10 measured in state 0 with probability 0.5
Qubit 11 measured in state 0 with probability 0.5
Qubit 12 measured in state 0 with probability 0.5
Qubit 13 measured in state 1 with probability 0.5
Qubit 14 measured in state 1 with probability 0.5
Qubit 15 measured in state 1 with probability 0.5
Qubit 16 measured in state 0 with probability 0.5
Qubit 17 measured in state 0 with probability 0.5
Qubit 18 measured in state 1 with probability 0.5
Qubit 19 measured in state 0 with probability 0.5

Final state:
|11100010110001110010>
QFT run time: 0.00141512s
Total run time: 2.36306s

JPRichings · 2026-05-05T21:47:02Z

profile to confirm no data movement outside of kernel launch at all:

…memory copy

TysonRayJones · 2026-05-06T02:02:01Z

Is it possible to pass multiple, distinct arguments to the one function that way? E.g. so that the target qubits of kernel_statevec_anyCtrlAnyTargDiagMatr_sub are also __grid_constant__? It'd be gross (but not the end of the world) if each kernel must manually isolate the targets array from a single ctrlsAndTargs array (although sneakily, kernel_statevec_anyCtrlOneTargDenseMatr_subA already performs such a trick).

The pattern of gpu_subroutines.cpp "secretly" passing __device__ __constant__ to unchanged kernels which still receive a ctrls ptr (and separately, a targs ptr) is still more pleasant to me, because then only gpu_subroutines.cpp needs to know about the optimisation being done, at copy-time. But a 2x speedup is nothing to sneeze at (if I interpreted that right)!

Orthogonally, do we have consternations about bumping min CUDA to 12.1 (released 2023)?

JPRichings · 2026-05-06T07:33:39Z

I think it is possible to have multiple arguments passed this way. All this is (and its what I tried to do first with this optimisation) capture by value in the kernel launch a c style array. The __grid_const__ is just to make sure its treated as read only on the device.

I think this is much cleaner than the data movement to const memory as instead of pointers to device constant memory we just pass the array directly. Only change to the original kernel (before all this optimisation stuff) is then we need to access a data member of a struct. There is no other change inside the kernel and we aren't accessing a global device variable out of the blue mid kernel. In gpu_subroutines.cpp if I can get util_getSorted to return directly to the struct we are using for hiding the c array then there will be no change to that code apart from a function returning to a new small struct opposed to a vector<int>.

We probably don't need to to change cuda versions unless we think we are passing over more than 4096 bytes in the kernel. I don't think we are anywhere near this but no objection to moving to CUDA 12.1.

JPRichings · 2026-05-06T07:36:50Z

Just to hammer home the performance improvement here. The red is the explicit cudamemcpytosymbol is the previous ctrls buffer which is now eliminated in this version.

JPRichings · 2026-05-06T07:39:09Z

Finally usual caveats that correctness checking needs to take place with a full run of the test suite!

…ance

…rmance

JPRichings · 2026-06-11T12:03:20Z

Devel merged in and List64 applied to all the gpu subroutines and kernels.

Simple QFT performance test gives slightly improved results:

Total number of gates: 210
Measured probability amplitude of |0..0> state: 9.53674e-07
Calculated probability amplitude of |0..0>, C0 = 1 / 2^20: 9.53674e-07
Measuring final state: (all probabilities should be 0.5)
Qubit 0 measured in state 1 with probability 0.5
Qubit 1 measured in state 0 with probability 0.5
Qubit 2 measured in state 1 with probability 0.5
Qubit 3 measured in state 1 with probability 0.5
Qubit 4 measured in state 0 with probability 0.5
Qubit 5 measured in state 0 with probability 0.5
Qubit 6 measured in state 0 with probability 0.5
Qubit 7 measured in state 0 with probability 0.5
Qubit 8 measured in state 0 with probability 0.5
Qubit 9 measured in state 0 with probability 0.5
Qubit 10 measured in state 1 with probability 0.5
Qubit 11 measured in state 1 with probability 0.5
Qubit 12 measured in state 1 with probability 0.5
Qubit 13 measured in state 0 with probability 0.5
Qubit 14 measured in state 0 with probability 0.5
Qubit 15 measured in state 0 with probability 0.5
Qubit 16 measured in state 0 with probability 0.5
Qubit 17 measured in state 1 with probability 0.5
Qubit 18 measured in state 1 with probability 0.5
Qubit 19 measured in state 1 with probability 0.5

Final state:
|10110000001110000111>
QFT run time: 0.00120811s
Total run time: 2.37806s

Tested 20 qubits on grace-hopper.

Todo:

Fix compilation on AMD hardware as the __grid_constant__ type modifier is not currently supported in HiP as discussed elsewhere.

JPRichings · 2026-06-13T00:07:49Z

I am definitely seeing a memory leak in the density evolution test but it can be removed by changing the number of repeats and number of qubits used in the test so is a bit odd really.

Have thrown nvidia compute-sanitizer over it the test:

compute-sanitizer --tool memcheck --leak-check=full ./tests/tests "density evolution"

And produced the following and many other errors:

========= Invalid __global__ read of size 8 bytes
=========     at void kernel_statevec_anyCtrlOneTargDenseMatr_subA<(int)-1>(base_qcomp *, long long, List64, long long, int, base_qcomp, base_qcomp, base_qcomp, base_qcomp)+0x650
=========     by thread (64,0,0) in block (1,0,0)
=========     Access to 0xec56a69c3248 is out of bounds
=========     and is 249272 bytes before the nearest allocation at 0xec56a6a00000 of size 1 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: cudaLaunchKernel [0x3391e3] in libQuEST.so.4
=========         Host Frame: void gpu_statevec_anyCtrlOneTargDenseMatr_subA<-1>(Qureg, List64 const&, List64 const&, int, CompMatr1) [0x277993] in libQuEST.so.4
=========         Host Frame: accel_statevec_anyCtrlOneTargDenseMatr_subA(Qureg, List64 const&, List64 const&, int, CompMatr1) [0xed68b] in libQuEST.so.4
=========         Host Frame: localiser_statevec_anyCtrlOneTargDenseMatr(Qureg, List64 const&, List64 const&, int, CompMatr1, bool, bool) [0xff0c3] in libQuEST.so.4
=========         Host Frame: void localiser_statevec_anyCtrlAnyTargAnyMatr<CompMatr1>(Qureg, List64 const&, List64 const&, List64 const&, CompMatr1, bool) [0x105e6b] in libQuEST.so.4
=========         Host Frame: void validateAndApplyAnyCtrlAnyTargUnitaryMatrix<CompMatr1>(Qureg, int*, int*, int, int*, int, CompMatr1, char const*) [0xdfef7] in libQuEST.so.4
=========         Host Frame: applyMultiStateControlledCompMatr1 [0xd9b47] in libQuEST.so.4
=========         Host Frame: testDensityMatrixEvolution(Qureg, Qureg) [0x3308cf] in tests
=========         Host Frame: CATCH2_INTERNAL_TEST_0() [0x334ccb] in tests
=========         Host Frame: Catch::RunContext::invokeActiveTestCase() [0xa2f47] in libCatch2.so.3.8.0
=========         Host Frame: Catch::RunContext::runCurrentTest() [0xa3d07] in libCatch2.so.3.8.0
=========         Host Frame: Catch::RunContext::runTest(Catch::TestCaseHandle const&) [0xa417f] in libCatch2.so.3.8.0
=========         Host Frame: Catch::Session::runInternal() [0x7e007] in libCatch2.so.3.8.0
=========         Host Frame: Catch::Session::run() [0x7e4f3] in libCatch2.so.3.8.0
=========         Host Frame: main [0xeca3] in tests

Test run on grace-hopper.

Exact kernel that error is not fixed and changes based on qubit and repeat counts set in test. One concern I have is that this only got caught as gpu_sync() has a cuda error check call in it and threw the last error seen but if not for this being in a later call the errors would not have been caught.

Action: Review GPU error handling in QuEST just so we catch device side errors more consistently.

Updated Windows CI environment from 'windows-latest' to 'windows-2022' for HIP compilation.

to avoid conflict with bitwise-opt MR

TysonRayJones · 2026-06-22T04:07:11Z

    qindex qubitStateMask = util_getBitMask(ctrls, ctrlStates, {targ}, {0});

-    auto [m00, m01, m10, m11] = getFlattenedGpuQcompMatrix<2>(matr.elems); // explicit template for MSVC, grr!
+    auto [m00, m01, m10, m11] = getFlattenedGpuQcompMatrix<2>(matr.elems); // explicit template for MSVC, grrr!


since not recognised by < v11.7

TysonRayJones · 2026-06-22T04:19:18Z


    // use template param to compile-time unroll loop in insertBits()
-    SET_VAR_AT_COMPILE_TIME(int, numCtrlBits, NumCtrls, numCtrls);
+    SET_VAR_AT_COMPILE_TIME(int, numCtrlBits, NumCtrls, ctrlsAndTarg.size());


Flagging an alarming bug here; ctrlsAndTarg.size() is one bigger than its previous numCtrls size. The final parameter to SET_VAR_AT_COMPILE_TIME is only ever consulted when it exceeds the NumCtrls template parameter, which has a maximum of 5 (iirc). So this bug is only triggered when passing 6 control qubits - which is impossible of our 6-qubit Qureg unit tests, since that leaves insufficient available qubits to be targets. Ergo, our tests do not uncover this bug.

Easy patch, but indicates that our tests do not test sufficiently large Qureg! I chose Qureg <= 6 deliberately to exceed the max parameter of 5, but I was thinking of target qubits which can reach 6, whereas control qubits cannot. Eep!

Added to my backlog to clean up the macro mischief responsible for the pitfall, described in #802.

which is a scenario not currently covered by our unit tests! This is because we'd need a Qureg of at least 8 qubits for there to exist 6 controls and 2 targets; and our max Qureg size is 6. Ruh roh!

primarily by replacing +1 and +2 with constexpr numTargs

and discarding an attempted improvement of SET_VAR_AT_COMPILE_TIME to make the prior bug impossible, because it also affects CPU invocation

TysonRayJones · 2026-06-22T06:07:08Z

This is a brilliant PR! 🎉 🎉 I note that it changes sensitive and error-prone kernel logic which (as mentioned above) is actually not covered by our existing unit tests. However, that problem already indicts the integrity of the devel branch, so I advocate for merging this PR anyway. I guess it will be a priority to get devel rigorously tested before release - I'll address the gap in the unit tests 🙏

TysonRayJones and others added 5 commits May 28, 2025 03:06

v4.1 (QuEST-Kit#622)

2d1a6fb

post v4.1

4d44ec8

Updates version references and fixes doxygen parsing of CMake configured header broken in QuEST-Kit#616

v4.2 (QuEST-Kit#692)

9d7618d

update to remove implicit thrust vector host to device copy

0acd9b9

remove thrust devic_vector

d7bf558

JPRichings mentioned this pull request May 1, 2026

Weaning off std::complex arithmetic #729

Merged

TysonRayJones reviewed May 3, 2026

View reviewed changes

Comment thread quest/src/gpu/gpu_kernels.cuh Outdated

alternative approach where capture array in struct to remove explict …

324e5be

…memory copy

JPRichings added 12 commits May 9, 2026 20:31

identify opportutnities for performance improvement

e04dc1f

kernel_statevec_anyCtrlSwap_subA performance change

abf0387

update kernel_statevec_anyCtrlSwap_subB for performance

a3cd184

update kernel_statevec_anyCtrlSwap_subC for performance

5dd472d

update kernel_statevec_anyCtrlOneTargDenseMatr_subB for performance

a23b231

update kernel_statevec_anyCtrlTwoTargDenseMatr_sub for performance

6ea9d76

update kernel_statevec_anyCtrlFewTargDenseMatr for performance

5ef2d27

update kernel_statevector_anyCtrlPauliTensorOrGadget_subA for perform…

a353450

…ance

update kernel_statevector_anyCtrlPauliTensorOrGadget_subA for perform…

1f57274

…ance

update kernel_statevector_anyCtrlPauliTensorOrGadget_subB for perform…

6604823

…ance

update kernel_densmatr_partialTrace_sub for performance

ca95173

update kernel_statevec_calcProbsOfAllMultiQubitOutcomes_sub for perfo…

ec71af6

…rmance

JPRichings added 5 commits June 11, 2026 08:43

update util functin calls

4206fe5

updated kernels to work with List64

6637690

remove references to ctrl_device_t and fix bugs

bf78c77

correct spelling of variable

6da3fc0

updates to kernel

95356c7

update test value to divisable by 32 and 64

cebb805

mk0dz mentioned this pull request Jun 12, 2026

QuEST #749 — optimise away small GPU qubit-list copies (Thrust) [unitaryHACK] #792

Closed

JPRichings added 2 commits June 15, 2026 14:18

replace explict reference to __grid_constant__ with preprocessor macro

8db2170

Correct errors in kernel refactors

d8c5d39

JPRichings marked this pull request as ready for review June 16, 2026 21:38

JPRichings and others added 7 commits June 18, 2026 23:50

Update Visual Studio path from 2022 to 2026

fe5501a

revert visual studio version change

98a3739

Update Windows OS version in CI workflow

077a4e4

Change Windows CI to use 'windows-2022' for the hip tests

1aeba33

Updated Windows CI environment from 'windows-latest' to 'windows-2022' for HIP compilation.

Merge branch 'devel' into pr/739

e8cd64b

removing bitwise whitespace diff

ff70ef1

to avoid conflict with bitwise-opt MR

remove now-defunct devints 🎉

3533172

TysonRayJones reviewed Jun 22, 2026

View reviewed changes

forego __grid_constant__ in old compilers

71268dc

since not recognised by < v11.7

TysonRayJones reviewed Jun 22, 2026

View reviewed changes

TysonRayJones added 3 commits June 22, 2026 00:34

patch distributed SWAP with >5 ctrls

79443e1

which is a scenario not currently covered by our unit tests! This is because we'd need a Qureg of at least 8 qubits for there to exist 6 controls and 2 targets; and our max Qureg size is 6. Ruh roh!

clean up

8d6594f

primarily by replacing +1 and +2 with constexpr numTargs

restoring spacing and braces

7e74836

and discarding an attempted improvement of SET_VAR_AT_COMPILE_TIME to make the prior bug impossible, because it also affects CPU invocation

TysonRayJones mentioned this pull request Jun 22, 2026

Improve SET_VAR_AT_COMPILE_TIME clarity #802

Open

TysonRayJones added 2 commits June 22, 2026 01:47

update authorlists

64c65a4

tidy gpu_subroutines.cpp

82636b2

TysonRayJones merged commit 90527c3 into QuEST-Kit:devel Jun 22, 2026
129 of 130 checks passed

Uh oh!

Conversation

JPRichings commented Apr 28, 2026

Uh oh!

JPRichings commented Apr 29, 2026

Uh oh!

otbrown commented May 3, 2026

Uh oh!

Uh oh!

TysonRayJones commented May 3, 2026

Uh oh!

JPRichings commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JPRichings commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JPRichings commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TysonRayJones commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JPRichings commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JPRichings commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JPRichings commented May 6, 2026

Uh oh!

JPRichings commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JPRichings commented Jun 13, 2026

Uh oh!

TysonRayJones Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

TysonRayJones Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

TysonRayJones Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

TysonRayJones commented Jun 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

JPRichings commented May 5, 2026 •

edited

Loading

JPRichings commented May 5, 2026 •

edited

Loading

JPRichings commented May 5, 2026 •

edited

Loading

TysonRayJones commented May 6, 2026 •

edited

Loading

JPRichings commented May 6, 2026 •

edited

Loading

JPRichings commented May 6, 2026 •

edited

Loading

JPRichings commented Jun 11, 2026 •

edited

Loading