NVIDIA Open GPU Kernel Modules Version
590.48.01-1
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
Operating System and Version
Debian GNU/Linux 13 (trixie)
Kernel Release
6.12.73
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
Hardware: GPU
RTX 5070 Laptop GPU
Describe the bug
Some GPM and DCGM metrics compute statistics such as SM utilization and occupancy from cycle counts rather than elapsed time. As a result, large SM clock swings caused by DVFS can make these metrics severely inaccurate. A research paper (https://dl.acm.org/doi/full/10.1145/3784828.3785156) discusses software workarounds based on postprocessing, but it would be great if the metric calculations themselves could be fixed.
The paper describes:
The documentation for the semantically equivalent DCGM metric PROF_SM_ACTIVE describes the SM utilization as “the ratio of cycles an SM has at least 1 warp assigned”. While the GPU utilization is measured as a percentage of time, the SM utilization is measured as a percentage of cycles. Therefore, the SM utilization depends on the SM clock frequency during the measurement.
This issue applies to any Blackwell (and presumably Ada/Hopper) GPU. I only have a Blackwell, and Blackwell does not support the proprietary kernel module, hence I cannot test with it.
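To illustrate the effect described above, here is a minimal Python sketch. It is not the driver's or DCGM's actual calculation, only a toy model under stated assumptions: the sampling interval is split into phases, each with a constant SM clock and a known fraction of time in which at least one warp is resident. The `Phase` type, the function names, and the example clock values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    """One stretch of the sampling interval with a constant SM clock (toy model)."""
    duration_s: float       # wall-clock length of the phase
    sm_clock_mhz: float     # SM clock during the phase (assumed constant)
    active_fraction: float  # share of the phase with at least 1 warp resident


def time_based_utilization(phases):
    """Active time divided by elapsed time."""
    total = sum(p.duration_s for p in phases)
    active = sum(p.duration_s * p.active_fraction for p in phases)
    return active / total


def cycle_based_utilization(phases):
    """Active cycles divided by elapsed cycles (cycles ~ time * clock)."""
    total = sum(p.duration_s * p.sm_clock_mhz for p in phases)
    active = sum(p.duration_s * p.sm_clock_mhz * p.active_fraction for p in phases)
    return active / total


# Hypothetical interval: busy for 0.5 s at a high clock, then idle for 0.5 s
# after DVFS has dropped the clock.
phases = [
    Phase(duration_s=0.5, sm_clock_mhz=2000.0, active_fraction=1.0),
    Phase(duration_s=0.5, sm_clock_mhz=500.0, active_fraction=0.0),
]

print(f"time-based:  {time_based_utilization(phases):.2f}")   # 0.50
print(f"cycle-based: {cycle_based_utilization(phases):.2f}")  # 0.80
```

In this toy interval the SM is busy for half of the wall-clock time, yet the cycle-based ratio reports 80% because the idle phase contributes far fewer cycles to the denominator. When the clock is constant across the interval, both functions return the same value.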
To Reproduce
- Enable GPM:
nvidia-smi gpm -s 1
- Monitor GPM metrics:
nvidia-smi dmon --gpm-metrics 1,2
- Trigger clock swings. Metric 1 will remain accurate, while metric 2 will be inaccurate.
Bug Incidence
Always
nvidia-bug-report.log.gz
N/A
More Info
No response