[DON'T MERGE] trtexec: CC-safe timing via GPU %globaltimer #4740
Open
wenbingl wants to merge 1 commit into NVIDIA:main from
Conversation
cudaEventElapsedTime() is unreliable when Confidential Compute (CC) is
enabled, producing negative latencies in trtexec performance summaries
on CC-enabled systems (nvbug 5598617, originally hit on B200 with Intel
TDX; reproducible on RTX PRO 6000 Blackwell as well).
Detect CC at startup via nvmlSystemGetConfComputeState (loaded through
dlopen so there is no build-time dependency on NVML or libnvidia-ml).
When CC is enabled, time TrtCudaEvent intervals by reading the PTX
%globaltimer register from a single-thread kernel launched alongside
each cudaEventRecord. operator- then computes elapsed time from the
device-side timestamps via two cudaMemcpy reads, instead of calling
cudaEventElapsedTime. Non-CC runs keep the existing path unchanged.
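The CC-safe path described above can be sketched as follows. This is a minimal, illustrative sketch, not the PR's actual code: a one-thread kernel samples the PTX %globaltimer register (nanoseconds) into a device buffer, and elapsed time is the signed difference of two such samples read back with cudaMemcpy. Names like recordGlobalTimer are hypothetical.

```cuda
// Hypothetical sketch of CC-safe interval timing via %globaltimer.
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// One-thread kernel: read the PTX %globaltimer register (ns) into a
// device buffer, stored signed so later subtraction is well-defined.
__global__ void recordGlobalTimer(int64_t* out)
{
    uint64_t ns;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(ns));
    *out = static_cast<int64_t>(ns);
}

int main()
{
    int64_t *dStart{}, *dEnd{};
    cudaMalloc(&dStart, sizeof(int64_t));
    cudaMalloc(&dEnd, sizeof(int64_t));

    recordGlobalTimer<<<1, 1>>>(dStart);
    // ... the timed work would run here ...
    recordGlobalTimer<<<1, 1>>>(dEnd);
    cudaDeviceSynchronize();

    // Two cudaMemcpy reads of the device-side timestamps.
    int64_t start{}, end{};
    cudaMemcpy(&start, dStart, sizeof(int64_t), cudaMemcpyDeviceToHost);
    cudaMemcpy(&end, dEnd, sizeof(int64_t), cudaMemcpyDeviceToHost);

    // Signed int64_t subtraction avoids the implementation-defined
    // unsigned->signed conversion the PR description mentions.
    printf("elapsed ms: %f\n", static_cast<double>(end - start) / 1e6);

    cudaFree(dStart);
    cudaFree(dEnd);
    return 0;
}
```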
Also handle a related CC issue in HostDeallocator: under CC,
cudaPointerGetAttributes() reports memory returned by cudaMallocHost()
as cudaMemoryTypeManaged rather than cudaMemoryTypeHost, which used to
cause an exit(EXIT_FAILURE) during teardown ("Unexpected cuda memory
type:3"). Treat Managed identically to Host -- still freed via
cudaFreeHost.
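The deallocator change above amounts to accepting one extra enum value. A minimal sketch (the helper name freeHostPinned is illustrative, not the PR's identifier):

```cuda
// Hypothetical sketch of the HostDeallocator fix: under CC,
// cudaPointerGetAttributes() may classify cudaMallocHost() memory as
// cudaMemoryTypeManaged, so both types are released via cudaFreeHost.
#include <cstdio>
#include <cuda_runtime.h>

void freeHostPinned(void* ptr)
{
    cudaPointerAttributes attrs{};
    if (cudaPointerGetAttributes(&attrs, ptr) != cudaSuccess)
    {
        fprintf(stderr, "cudaPointerGetAttributes failed\n");
        return;
    }
    switch (attrs.type)
    {
    case cudaMemoryTypeHost:
    case cudaMemoryTypeManaged: // reported under CC for pinned host memory
        cudaFreeHost(ptr);
        break;
    default:
        // Previously this path exited with "Unexpected cuda memory type".
        fprintf(stderr, "Unexpected cuda memory type: %d\n",
                static_cast<int>(attrs.type));
        break;
    }
}
```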
Mirrors the approach in TRT-LLM PR NVIDIA/TensorRT-LLM#11657.
Files:
samples/common/globalTimerKernel.{h,cu}: 1-thread kernel writing
%globaltimer (ns) to a device buffer.
samples/common/sampleDevice.h: TrtCudaEvent allocates a device
timestamp when CC is active, launches the timer kernel in
record(), and subtracts device-side timestamps in operator-
(signed int64_t to avoid implementation-defined unsigned->signed
cast). HostDeallocator accepts cudaMemoryTypeManaged.
samples/common/sampleDevice.cpp: isConfidentialComputeEnabled()
loads libnvidia-ml.so.1 via dlopen and reads
nvmlSystemGetConfComputeState; result cached once per process.
Windows returns false.
samples/common/CMakeLists.txt: register the new files.
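The CC-detection path in sampleDevice.cpp can be sketched as below. This is a hedged approximation, not the PR's code: the struct mirrors nvml.h's nvmlConfComputeSystemState_t layout (environment, ccFeature, devToolsMode), and the caching uses a function-local static; treat the exact field semantics as an assumption.

```cpp
// Hypothetical sketch of isConfidentialComputeEnabled(): NVML is loaded
// with dlopen() so there is no build-time dependency on libnvidia-ml.
#include <dlfcn.h>

namespace
{
// Mirrors nvmlConfComputeSystemState_t from nvml.h (assumed layout).
struct NvmlConfComputeState
{
    unsigned int environment;
    unsigned int ccFeature; // nonzero when Confidential Compute is enabled
    unsigned int devToolsMode;
};

bool queryConfidentialCompute()
{
    void* lib = dlopen("libnvidia-ml.so.1", RTLD_LAZY);
    if (lib == nullptr)
    {
        return false; // no NVML on this system: assume CC is off
    }
    using InitFn = int (*)();
    using StateFn = int (*)(NvmlConfComputeState*);
    auto nvmlInit = reinterpret_cast<InitFn>(dlsym(lib, "nvmlInit_v2"));
    auto getState = reinterpret_cast<StateFn>(
        dlsym(lib, "nvmlSystemGetConfComputeState"));

    bool enabled = false;
    NvmlConfComputeState state{};
    if (nvmlInit != nullptr && getState != nullptr && nvmlInit() == 0
        && getState(&state) == 0)
    {
        enabled = (state.ccFeature != 0);
    }
    dlclose(lib);
    return enabled;
}
} // namespace

bool isConfidentialComputeEnabled()
{
    // Result cached once per process, as the PR description states.
    static bool const enabled = queryConfidentialCompute();
    return enabled;
}
```

On Windows the real implementation simply returns false; the dlopen path above is Linux-only.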