Ctrls performance#739
Conversation
Updates version references and fixes doxygen parsing of CMake configured header broken in QuEST-Kit#616
|
Todo:
Assumptions in this code:
|
|
Alloc size 64 qubits -- add as macro somewhere if not done already |
|
Reminder of other stuff from meeting:
|
|
Way easier to do this instead: Beginning with CUDA 12.1, you can now pass up to 32,764 bytes as kernel parameters on NVIDIA Volta and above Note that in both preceding examples, kernel parameters are annotated with the grid_constant qualifier to indicate they are read-only. reference: https://developer.nvidia.com/blog/cuda-12-1-supports-large-kernel-parameters/ I think this solves our concerns about multi-qureg operations both accessing a common ctrls cache in future. (this is also much better than the variadic kernel idea I had earlier today) Other benefits:
|
|
Profiling to confirm but here is an extra factor of 2 over #729 (comment): |
|
Is it possible to pass multiple, distinct arguments to the one function that way? E.g. so that the The pattern of Orthogonally, do we have consternations about bumping min CUDA to 12.1 (released 2023)? |
|
I think it is possible to have multiple arguments passed this way. All this is (and its what I tried to do first with this optimisation) capture by value in the kernel launch a c style array. The I think this is much cleaner than the data movement to const memory as instead of pointers to device constant memory we just pass the array directly. Only change to the original kernel (before all this optimisation stuff) is then we need to access a data member of a struct. There is no other change inside the kernel and we aren't accessing a global device variable out of the blue mid kernel. In We probably don't need to to change cuda versions unless we think we are passing over more than 4096 bytes in the kernel. I don't think we are anywhere near this but no objection to moving to CUDA 12.1. |
|
Finally usual caveats that correctness checking needs to take place with a full run of the test suite! |
|
Devel merged in and List64 applied to all the gpu subroutines and kernels. Simple QFT performance test gives slightly improved results: Tested 20 qubits on grace-hopper. Todo:
|
|
I am definitely seeing a memory leak in the Have thrown nvidia compute-sanitizer over it the test: And produced the following and many other errors: Test run on grace-hopper. Exact kernel that error is not fixed and changes based on qubit and repeat counts set in test. One concern I have is that this only got caught as Action: Review GPU error handling in QuEST just so we catch device side errors more consistently. |
Updated Windows CI environment from 'windows-latest' to 'windows-2022' for HIP compilation.
to avoid conflict with bitwise-opt MR
| qindex qubitStateMask = util_getBitMask(ctrls, ctrlStates, {targ}, {0}); | ||
|
|
||
| auto [m00, m01, m10, m11] = getFlattenedGpuQcompMatrix<2>(matr.elems); // explicit template for MSVC, grr! | ||
| auto [m00, m01, m10, m11] = getFlattenedGpuQcompMatrix<2>(matr.elems); // explicit template for MSVC, grrr! |
since not recognised by < v11.7
|
|
||
| // use template param to compile-time unroll loop in insertBits() | ||
| SET_VAR_AT_COMPILE_TIME(int, numCtrlBits, NumCtrls, numCtrls); | ||
| SET_VAR_AT_COMPILE_TIME(int, numCtrlBits, NumCtrls, ctrlsAndTarg.size()); |
There was a problem hiding this comment.
Flagging an alarming bug here; ctrlsAndTarg.size() is one bigger than its previous numCtrls size. The final parameter to SET_VAR_AT_COMPILE_TIME is only ever consulted when it exceeds the NumCtrls template parameter, which has a maximum of 5 (iirc). So this bug is only triggered when passing 6 control qubits - which is impossible of our 6-qubit Qureg unit tests, since that leaves insufficient available qubits to be targets. Ergo, our tests do not uncover this bug.
Easy patch, but indicates that our tests do not test sufficiently large Qureg! I chose Qureg <= 6 deliberately to exceed the max parameter of 5, but I was thinking of target qubits which can reach 6, whereas control qubits cannot. Eep!
There was a problem hiding this comment.
Added to my backlog to clean up the macro mischief responsible for the pitfall, described in #802.
which is a scenario not currently covered by our unit tests! This is because we'd need a Qureg of at least 8 qubits for there to exist 6 controls and 2 targets; and our max Qureg size is 6. Ruh roh!
and discarding an attempted improvement of SET_VAR_AT_COMPILE_TIME to make the prior bug impossible, because it also affects CPU invocation
|
This is a brilliant PR! 🎉 🎉 I note that it changes sensitive and error-prone kernel logic which (as mentioned above) is actually not covered by our existing unit tests. However, that problem already indicts the integrity of the |


Initial pass on performance changed to remove thrust device_vector that is used to move ctrls to device that is causing a performance impact due to the thrust device_vector moving data from host to device on construction.