
[DON'T MERGE] trtexec: CC-safe timing via GPU %globaltimer #4740

Open
wenbingl wants to merge 1 commit into NVIDIA:main from wenbingl:fix-nvbug-5598617-cc-negative-latency

Conversation

@wenbingl

cudaEventElapsedTime() is unreliable when Confidential Compute (CC) is enabled, producing negative latencies in trtexec performance summaries on CC-enabled systems (nvbug 5598617, originally hit on B200 with Intel TDX; reproducible on RTX PRO 6000 Blackwell as well).

Detect CC at startup via nvmlSystemGetConfComputeState (loaded through dlopen so there is no build-time dependency on NVML or libnvidia-ml). When CC is enabled, time TrtCudaEvent intervals by reading the PTX %globaltimer register from a single-thread kernel launched alongside each cudaEventRecord. operator- then computes elapsed time from the device-side timestamps via two cudaMemcpy reads, instead of calling cudaEventElapsedTime. Non-CC runs keep the existing path unchanged.

Also handle a related CC issue in HostDeallocator: under CC, cudaPointerGetAttributes() reports memory returned by cudaMallocHost() as cudaMemoryTypeManaged rather than cudaMemoryTypeHost, which used to cause an exit(EXIT_FAILURE) during teardown ("Unexpected cuda memory type:3"). Treat Managed identically to Host -- still freed via cudaFreeHost.

Mirrors the approach in TRT-LLM PR NVIDIA/TensorRT-LLM#11657.

Files:
  samples/common/globalTimerKernel.{h,cu}: 1-thread kernel writing
    %globaltimer (ns) to a device buffer.
  samples/common/sampleDevice.h: TrtCudaEvent allocates a device
    timestamp when CC is active, launches the timer kernel in
    record(), and subtracts device-side timestamps in operator-
    (signed int64_t to avoid implementation-defined unsigned->signed
    cast). HostDeallocator accepts cudaMemoryTypeManaged.
  samples/common/sampleDevice.cpp: isConfidentialComputeEnabled()
    loads libnvidia-ml.so.1 via dlopen and reads
    nvmlSystemGetConfComputeState; result cached once per process.
    Windows returns false.
  samples/common/CMakeLists.txt: register the new files.

wenbingl requested a review from a team as a code owner on April 27, 2026, 16:28.