Add a README.md file to ck/library/util #3665

Closed

shumway wants to merge 1 commit into develop from jshumway/util-readme

Collaborator

shumway commented Jan 28, 2026 •

I'm collecting information about our current testing (#3664). As part of this work I added a README to the directory to emphasize the GPU-first testing strategy and our support for type-specific tolerances.

This readme contains internal code comments for CK developers and does not need ROCm documentation review.


          Add a readme file to ck/library/util

69fc05d

I'm collecting information about our current testing and added a README to the directory to emphasize the GPU-first testing strategy and our support for type-specific tolerances.

shumway requested review from a team, Snektron, ThomasNing, afagaj, andriy-ca, aosewski, asleepzzz, bartekxk, carlushuang, cgmillette, coderfeli, ddembeckAMD, geyyer, illsilin, poyenc, qianfengz, vidyasagar-amd and vpietila-amd as code owners

January 28, 2026 04:39

Collaborator Author

shumway commented Jan 28, 2026

Can I get a review from @aosewski, @amd-anclark, @bartekxk, @johannes-graner, and @kabrahamAMD?

If you know any other engineers working on this kind of testing improvement, can you invite them to review this code?

afagaj requested a review from Copilot

January 28, 2026 17:19

Copilot AI reviewed

Contributor

Copilot AI left a comment

Pull request overview

This PR adds comprehensive documentation to the CK library utility directory, explaining the testing infrastructure with an emphasis on modern GPU-first validation strategies and automatic tolerance computation based on IEEE 754 precision limits.

Changes:

Added a detailed README.md file documenting testing utilities, validation approaches, and best practices
Documented the performance advantages of GPU-first validation over legacy CPU-based approaches
Provided reference tables for tolerance computation across different data types

include/ck/library/utility/README.md

+              | FP32      | 23            | 1e-5         | 3e-6         |
+              | TF32      | 10            | 5e-4         | 5e-4         |
+              | FP16      | 10            | 1e-3         | 1e-3         |
+              | BF16      | 7             | 1e-1         | 1e-3         |

Copilot AI Jan 28, 2026

The relative tolerance for BF16 (1e-1 or 0.1) appears unusually high compared to other data types. This suggests 10% relative error is acceptable, which seems inconsistent with typical numerical validation standards. Verify this value is correct or clarify if this is for specific use cases.

Suggested change

-              | BF16      | 7             | 1e-1         | 1e-3         |
+              | BF16      | 7             | 1e-2         | 1e-3         |

vidyasagar-amd reviewed

include/ck/library/utility/README.md

+              - `gpu_reduce_max()`: Computes max(abs(tensor)) on GPU for tolerance scaling
+              - Grid-stride kernels with LDS reduction for optimal performance
+              **Performance**: 10-100x faster than CPU validation for large tensors.

Contributor

vidyasagar-amd Jan 28, 2026

Maybe remove the 10-100x number?

include/ck/library/utility/README.md

+              - `gpu_verify()`: Compares device tensors entirely on GPU
+                - Automatic tolerance computation based on data types
+                - Only transfers error statistics (~12 bytes), not tensors

Contributor

vidyasagar-amd Jan 28, 2026

we can remove (~12 bytes)

include/ck/library/utility/README.md

+              | FP16      | 10            | 1e-3         | 1e-3         |
+              | BF16      | 7             | 1e-1         | 1e-3         |
+              | FP8       | 3-4           | 1e-3         | 1e-3         |
+              | BF8       | 2-3           | 1e-3         | 1e-3         |

Contributor

vidyasagar-amd Jan 28, 2026

Rtol for BF8 lower than BF16?
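
For reference when weighing the two tolerance questions above (BF16 at 1e-1, and BF8 at 1e-3 being tighter than BF16): a binary format with p explicit mantissa bits has machine epsilon 2^-p, so the mantissa column of the table already implies a rough scale for each relative tolerance. The helper below is a hypothetical illustration of that arithmetic only. It is not part of CK and is not claimed to be how the README's table was produced.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not a CK function): machine epsilon, i.e. the gap
// between 1.0 and the next representable value, for a binary floating-point
// format with `mantissa_bits` explicit mantissa bits.
inline double epsilon_from_mantissa_bits(int mantissa_bits)
{
    assert(mantissa_bits >= 0);
    return std::ldexp(1.0, -mantissa_bits); // 2^-mantissa_bits
}
```

With the mantissa widths listed in the table, this gives 2^-10 (about 9.8e-4) for FP16, 2^-7 = 0.0078125 for BF16, 2^-3 = 0.125 for a 3-bit FP8 mantissa, and 2^-2 = 0.25 for a 2-bit BF8 mantissa. By that measure the FP16 entry of 1e-3 sits close to one epsilon, the BF16 entry of 1e-1 is roughly 13 epsilons, and the FP8/BF8 entries of 1e-3 are far below one epsilon of those formats, which supports re-checking both rows.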

include/ck/library/utility/README.md

+                          ↑ BOTTLENECK: PCIe transfer of entire tensor
+              ```
+              - **Problem**: Transferring multi-GB tensors over PCIe is 10-100x slower than computation

Contributor

vidyasagar-amd Jan 28, 2026

can remove the 10-100x

include/ck/library/utility/README.md

+              ```
+              - **Advantage**: All data stays on GPU, only error statistics transfer to CPU
+              - **Performance**: 10-100x faster for large tensors

Contributor

vidyasagar-amd Jan 28, 2026

Can remove or rephrase

amd-anclark reviewed

include/ck/library/utility/README.md


		This directory contains utility headers for testing, benchmarking, and validating Composable Kernel (CK) operations. The utilities support both modern GPU-first validation for high-performance testing and legacy CPU-based approaches for backward compatibility.

		## Quick Start

Collaborator

amd-anclark Jan 28, 2026

This section seems to summarize what our good practices are, key principles, or validation guidelines rather than initial setup steps.

Suggested change

- ## Quick Start
+ ## Recommended Practices

include/ck/library/utility/README.md

+. **Let the system compute tolerances** automatically based on data types
+. **Only transfer error statistics**, not full tensors
+              ## File-to-Purpose Quick Reference

Collaborator

amd-anclark Jan 28, 2026

A small thing, but the table is organized from left-to-right as purpose-to-file. The current section name is the reverse.

Suggested change

- ## File-to-Purpose Quick Reference
+ ## Purpose-to-Utility Quick Reference

include/ck/library/utility/README.md

Comment on lines +44 to +48

+              // Explicit tolerance
+              bool pass = gpu_verify<float>(output_dev, reference_dev, 1e-5f, 1e-6f, size);
+              // Automatic tolerance for mixed precision
+              bool pass = gpu_verify<float, half_t, float>(output_dev, reference_dev, K_dim, size);

Collaborator

amd-anclark Jan 28, 2026

Is it worth mentioning when to use an explicit vs. automatic tolerance?
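
To make the explicit-versus-automatic question concrete, here is a minimal CPU sketch of what an explicit `(rtol, atol)` pair typically controls. It assumes the common allclose-style criterion `|out - ref| <= atol + rtol * |ref|`; whether `gpu_verify` applies exactly this rule is an assumption, and the function names below are hypothetical, not CK APIs.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: allclose-style check for a single element.
// A NaN in either operand makes the comparison false, so it counts as a mismatch.
inline bool within_tolerance(float out, float ref, float rtol, float atol)
{
    return std::fabs(out - ref) <= atol + rtol * std::fabs(ref);
}

// Hypothetical helper: number of elements that violate the tolerance.
inline std::size_t count_mismatches(const std::vector<float>& out,
                                    const std::vector<float>& ref,
                                    float rtol,
                                    float atol)
{
    assert(out.size() == ref.size());
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < out.size(); ++i)
    {
        if (!within_tolerance(out[i], ref[i], rtol, atol))
        {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Under this criterion, `atol` dominates for reference values near zero and `rtol` dominates for large ones. An explicit pair therefore suits cases where the expected value range is known up front, while automatic selection suits cases where it should follow from the data types involved.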

include/ck/library/utility/README.md

Comment on lines +57 to +58

		- `get_relative_threshold<ComputeType, OutType, AccType>()`: Computes relative tolerance from mantissa bits
		- `get_absolute_threshold<ComputeType, OutType, AccType>()`: Computes absolute tolerance scaled by magnitude

Collaborator

amd-anclark Jan 28, 2026

If it is helpful, a short example for each of these calls could help users see its usage.
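
As a starting point for such examples, the sketch below shows one plausible shape for the two computations as the README describes them: a relative threshold derived from mantissa bits, and an absolute threshold scaled by magnitude. The `sketch_*` names, the half-ULP error model, and the way the accumulation count enters are all assumptions made for illustration. The sketch uses only standard types (`float`, `double`) so that it compiles on its own, and it does not reproduce CK's actual `get_relative_threshold` or `get_absolute_threshold`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>

// Half the gap between 1.0 and the next representable value of T.
// std::numeric_limits<T>::digits counts the implicit leading bit,
// so the explicit mantissa width is digits - 1.
template <typename T>
double half_ulp()
{
    return std::ldexp(0.5, -(std::numeric_limits<T>::digits - 1));
}

// Hypothetical stand-in: the largest rounding error among the types involved,
// with the accumulator term scaled by the number of accumulations.
template <typename ComputeType, typename OutType, typename AccType>
double sketch_relative_threshold(int accumulation_count = 1)
{
    assert(accumulation_count >= 1);
    const double rounding     = std::max(half_ulp<ComputeType>(), half_ulp<OutType>());
    const double accumulation = half_ulp<AccType>() * accumulation_count;
    return std::max(rounding, accumulation);
}

// Hypothetical stand-in: the relative threshold scaled by the largest
// magnitude expected in the output tensor.
template <typename ComputeType, typename OutType, typename AccType>
double sketch_absolute_threshold(double max_abs_value, int accumulation_count = 1)
{
    assert(max_abs_value >= 0.0);
    return max_abs_value *
           sketch_relative_threshold<ComputeType, OutType, AccType>(accumulation_count);
}
```

A call such as `sketch_relative_threshold<float, float, float>(4)` returns four half-ULPs of `float`, and `sketch_absolute_threshold<float, float, float>(8.0)` scales one half-ULP by a maximum magnitude of 8. README examples for the real functions would substitute CK's own types such as `half_t`.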

johannes-graner reviewed

Contributor

johannes-graner left a comment

Very nice to have a summary of the existing utility functionality.
My review mainly covers the GPU verification since that's what I'm most familiar with, although I looked through the rest too.

include/ck/library/utility/README.md

+              bool pass = gpu_verify<float>(output_dev, reference_dev, 1e-5f, 1e-6f, size);
+              // Automatic tolerance for mixed precision
+              bool pass = gpu_verify<float, half_t, float>(output_dev, reference_dev, K_dim, size);

Contributor

johannes-graner Jan 29, 2026

K_dim should be changed to accumulation_count or similar, it's not necessarily equal to the K dimension.

On a slightly separate note, this does not currently support split-k, which requires accounting for accumulation in multiple data types. See issue #3673.
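
A small CPU experiment can illustrate why the parameter is an accumulation count rather than a tensor dimension: rounding error in a running sum depends on how many additions are performed in the accumulator type. The function below is a hypothetical illustration using naive sequential summation in `float` against a `double` reference. Real kernels reduce in a different order (and, with split-k, across more than one data type), so this shows the trend only and is not a model of CK's behaviour.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: relative error of adding `term` to a float accumulator
// `count` times, measured against the same sum accumulated in double.
inline double float_sum_relative_error(float term, int count)
{
    assert(count > 0 && term != 0.0f);
    float narrow = 0.0f; // accumulator in the narrower type
    double wide  = 0.0;  // reference accumulator
    for (int i = 0; i < count; ++i)
    {
        narrow += term;
        wide += static_cast<double>(term);
    }
    return std::fabs(static_cast<double>(narrow) - wide) / std::fabs(wide);
}
```

Comparing `float_sum_relative_error(0.1f, 10)` with `float_sum_relative_error(0.1f, 1000000)` shows the error growing with the number of accumulations, which is why a fixed tolerance chosen for a short reduction can be too tight for a long one.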

include/ck/library/utility/README.md

+. **Use GPU-first validation** for all new tests
+. **Avoid CPU transfers** unless debugging specific values
+. **Generate data on GPU** when possible
+. **Batch verification** to amortize kernel launch overhead

Contributor

johannes-graner Jan 29, 2026

I don't think we do or support batch verification. It would be very VRAM-intense since many device-side tensors would have to be kept in memory instead of clearing the non-reference output after each kernel that is tested.

assistant-librarian bot mentioned this pull request

Add a README.md file to ck/library/util ROCm/rocm-libraries#4277

Merged

Contributor

ammallya commented Feb 3, 2026

Imported to ROCm/rocm-libraries

ammallya closed this

illsilin added a commit to ROCm/rocm-libraries that referenced this pull request


          Add a README.md file to ck/library/util (#4277)

I'm collecting information about our current testing (#3664). As part of
this work I added a README to the directory to emphasize the GPU-first
testing strategy and our support for type-specific tolerances.

This readme contains internal code comments for CK developers and does
not need ROCm documentation review.

---
🔁 Imported from
[ROCm/composable_kernel#3665](ROCm/composable_kernel#3665)
🧑‍💻 Originally authored by @shumway

Co-authored-by: John Shumway <jshumway@amd.com>
Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
