[DON'T MERGE] trtexec: CC-safe timing via GPU %globaltimer #4740
Open
wenbingl wants to merge 1 commit into NVIDIA:main from
Conversation
cudaEventElapsedTime() is unreliable when Confidential Compute (CC) is
enabled, producing negative latencies in trtexec performance summaries
on CC-enabled systems (nvbug 5598617, originally hit on B200 with Intel
TDX; reproducible on RTX PRO 6000 Blackwell as well).
Detect CC at startup via nvmlSystemGetConfComputeState (loaded through
dlopen so there is no build-time dependency on NVML or libnvidia-ml).
When CC is enabled, time TrtCudaEvent intervals by reading the PTX
%globaltimer register from a single-thread kernel launched alongside
each cudaEventRecord. operator- then computes elapsed time from the
device-side timestamps via two cudaMemcpy reads, instead of calling
cudaEventElapsedTime. Non-CC runs keep the existing path unchanged.
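The CC-safe path described above can be sketched as follows. This is a minimal, illustrative sketch, not the PR's actual code: a one-thread kernel samples the PTX %globaltimer register (nanoseconds) into a device buffer, and elapsed time is the signed difference of two such samples read back with cudaMemcpy. Names like recordGlobalTimer are hypothetical.

```cuda
// Hypothetical sketch of CC-safe interval timing via %globaltimer.
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// One-thread kernel: read the PTX %globaltimer register (ns) into a
// device buffer, stored signed so later subtraction is well-defined.
__global__ void recordGlobalTimer(int64_t* out)
{
    uint64_t ns;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(ns));
    *out = static_cast<int64_t>(ns);
}

int main()
{
    int64_t *dStart{}, *dEnd{};
    cudaMalloc(&dStart, sizeof(int64_t));
    cudaMalloc(&dEnd, sizeof(int64_t));

    recordGlobalTimer<<<1, 1>>>(dStart);
    // ... the timed work would run here ...
    recordGlobalTimer<<<1, 1>>>(dEnd);
    cudaDeviceSynchronize();

    // Two cudaMemcpy reads of the device-side timestamps.
    int64_t start{}, end{};
    cudaMemcpy(&start, dStart, sizeof(int64_t), cudaMemcpyDeviceToHost);
    cudaMemcpy(&end, dEnd, sizeof(int64_t), cudaMemcpyDeviceToHost);

    // Signed int64_t subtraction avoids the implementation-defined
    // unsigned->signed conversion the PR description mentions.
    printf("elapsed ms: %f\n", static_cast<double>(end - start) / 1e6);

    cudaFree(dStart);
    cudaFree(dEnd);
    return 0;
}
```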
Also handle a related CC issue in HostDeallocator: under CC,
cudaPointerGetAttributes() reports memory returned by cudaMallocHost()
as cudaMemoryTypeManaged rather than cudaMemoryTypeHost, which used to
cause an exit(EXIT_FAILURE) during teardown ("Unexpected cuda memory
type:3"). Treat Managed identically to Host -- still freed via
cudaFreeHost.
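The deallocator change above amounts to accepting one extra enum value. A minimal sketch (the helper name freeHostPinned is illustrative, not the PR's identifier):

```cuda
// Hypothetical sketch of the HostDeallocator fix: under CC,
// cudaPointerGetAttributes() may classify cudaMallocHost() memory as
// cudaMemoryTypeManaged, so both types are released via cudaFreeHost.
#include <cstdio>
#include <cuda_runtime.h>

void freeHostPinned(void* ptr)
{
    cudaPointerAttributes attrs{};
    if (cudaPointerGetAttributes(&attrs, ptr) != cudaSuccess)
    {
        fprintf(stderr, "cudaPointerGetAttributes failed\n");
        return;
    }
    switch (attrs.type)
    {
    case cudaMemoryTypeHost:
    case cudaMemoryTypeManaged: // reported under CC for pinned host memory
        cudaFreeHost(ptr);
        break;
    default:
        // Previously this path exited with "Unexpected cuda memory type".
        fprintf(stderr, "Unexpected cuda memory type: %d\n",
                static_cast<int>(attrs.type));
        break;
    }
}
```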
Mirrors the approach in TRT-LLM PR NVIDIA/TensorRT-LLM#11657.
Files:
samples/common/globalTimerKernel.{h,cu}: 1-thread kernel writing
%globaltimer (ns) to a device buffer.
samples/common/sampleDevice.h: TrtCudaEvent allocates a device
timestamp when CC is active, launches the timer kernel in
record(), and subtracts device-side timestamps in operator-
(signed int64_t to avoid implementation-defined unsigned->signed
cast). HostDeallocator accepts cudaMemoryTypeManaged.
samples/common/sampleDevice.cpp: isConfidentialComputeEnabled()
loads libnvidia-ml.so.1 via dlopen and reads
nvmlSystemGetConfComputeState; result cached once per process.
Windows returns false.
samples/common/CMakeLists.txt: register the new files.
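The CC-detection path in sampleDevice.cpp can be sketched as below. This is a hedged approximation, not the PR's code: the struct mirrors nvml.h's nvmlConfComputeSystemState_t layout (environment, ccFeature, devToolsMode), and the caching uses a function-local static; treat the exact field semantics as an assumption.

```cpp
// Hypothetical sketch of isConfidentialComputeEnabled(): NVML is loaded
// with dlopen() so there is no build-time dependency on libnvidia-ml.
#include <dlfcn.h>

namespace
{
// Mirrors nvmlConfComputeSystemState_t from nvml.h (assumed layout).
struct NvmlConfComputeState
{
    unsigned int environment;
    unsigned int ccFeature; // nonzero when Confidential Compute is enabled
    unsigned int devToolsMode;
};

bool queryConfidentialCompute()
{
    void* lib = dlopen("libnvidia-ml.so.1", RTLD_LAZY);
    if (lib == nullptr)
    {
        return false; // no NVML on this system: assume CC is off
    }
    using InitFn = int (*)();
    using StateFn = int (*)(NvmlConfComputeState*);
    auto nvmlInit = reinterpret_cast<InitFn>(dlsym(lib, "nvmlInit_v2"));
    auto getState = reinterpret_cast<StateFn>(
        dlsym(lib, "nvmlSystemGetConfComputeState"));

    bool enabled = false;
    NvmlConfComputeState state{};
    if (nvmlInit != nullptr && getState != nullptr && nvmlInit() == 0
        && getState(&state) == 0)
    {
        enabled = (state.ccFeature != 0);
    }
    dlclose(lib);
    return enabled;
}
} // namespace

bool isConfidentialComputeEnabled()
{
    // Result cached once per process, as the PR description states.
    static bool const enabled = queryConfidentialCompute();
    return enabled;
}
```

On Windows the real implementation simply returns false; the dlopen path above is Linux-only.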