Conversation
I'm collecting information about our current testing and added a README to the directory to emphasize the GPU-first testing strategy and our support for type-specific tolerances.
Can I get a review from @aosewski, @amd-anclark, @bartekxk, @johannes-graner, and @kabrahamAMD? If you know any other engineers working on this kind of testing improvement, can you invite them to review this code?
Pull request overview
This PR adds comprehensive documentation to the CK library utility directory, explaining the testing infrastructure with an emphasis on modern GPU-first validation strategies and automatic tolerance computation based on IEEE 754 precision limits.
Changes:
- Added a detailed README.md file documenting testing utilities, validation approaches, and best practices
- Documented the performance advantages of GPU-first validation over legacy CPU-based approaches
- Provided reference tables for tolerance computation across different data types
| | FP32 | 23 | 1e-5 | 3e-6 | | ||
| | TF32 | 10 | 5e-4 | 5e-4 | | ||
| | FP16 | 10 | 1e-3 | 1e-3 | | ||
| | BF16 | 7 | 1e-1 | 1e-3 | |
The relative tolerance for BF16 (1e-1 or 0.1) appears unusually high compared to other data types. This suggests 10% relative error is acceptable, which seems inconsistent with typical numerical validation standards. Verify this value is correct or clarify if this is for specific use cases.
| | BF16 | 7 | 1e-1 | 1e-3 | | |
| | BF16 | 7 | 1e-2 | 1e-3 | |
| - `gpu_reduce_max()`: Computes max(abs(tensor)) on GPU for tolerance scaling | ||
| - Grid-stride kernels with LDS reduction for optimal performance | ||
|
|
||
| **Performance**: 10-100x faster than CPU validation for large tensors. |
Maybe remove the 10-100x number?
|
|
||
| - `gpu_verify()`: Compares device tensors entirely on GPU | ||
| - Automatic tolerance computation based on data types | ||
| - Only transfers error statistics (~12 bytes), not tensors |
We can remove the "(~12 bytes)" figure.
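For readers unfamiliar with what "error statistics" means here, a host-side sketch of the idea may help. This is not CK's `gpu_verify()`: the `ErrorStats` struct, the `compare_with_tolerance` name, and the allclose-style pass criterion `|out - ref| <= atol + rtol * |ref|` are all assumptions made for illustration. The point is only that the result of a comparison is a few scalars, regardless of tensor size.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical summary type; not CK's actual result struct.
struct ErrorStats
{
    std::size_t error_count = 0;    // elements outside tolerance
    float max_abs_error     = 0.0f; // largest |out - ref| observed
};

// Host-side reference of an allclose-style comparison (assumed criterion):
// an element passes when |out - ref| <= atol + rtol * |ref|.
inline ErrorStats compare_with_tolerance(const std::vector<float>& out,
                                         const std::vector<float>& ref,
                                         float rtol,
                                         float atol)
{
    assert(out.size() == ref.size());
    ErrorStats stats;
    for(std::size_t i = 0; i < out.size(); ++i)
    {
        const float err = std::fabs(out[i] - ref[i]);
        if(err > stats.max_abs_error)
            stats.max_abs_error = err;
        // Negated <= so that NaN differences are counted as errors.
        if(!(err <= atol + rtol * std::fabs(ref[i])))
            ++stats.error_count;
    }
    return stats;
}
```

A device-side version would perform the same reduction in a kernel and copy back only the small summary, which is why the exact byte count is not worth documenting.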
| | FP16 | 10 | 1e-3 | 1e-3 | | ||
| | BF16 | 7 | 1e-1 | 1e-3 | | ||
| | FP8 | 3-4 | 1e-3 | 1e-3 | | ||
| | BF8 | 2-3 | 1e-3 | 1e-3 | |
Why is the rtol for BF8 lower than for BF16?
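One way to sanity-check the table is to compare each rtol with the relative spacing implied by the mantissa width, 2^-bits. The helper below is a sketch with a hypothetical name, not part of CK. Using the mantissa widths from the table, FP16 (10 bits) gives about 9.8e-4 and BF16 (7 bits) about 7.8e-3, while 3 and 2 bits give 0.125 and 0.25. An rtol of 1e-3 for FP8 and BF8 is therefore far below a single rounding step at those widths, so it presumably only passes when the test data is exactly representable.

```cpp
#include <cassert>
#include <cmath>

// Relative spacing of representable values for a floating-point format
// with `mantissa_bits` explicit mantissa bits: 2^-mantissa_bits.
// Hypothetical helper for illustration; not a CK function.
inline double machine_epsilon(int mantissa_bits)
{
    assert(mantissa_bits >= 0);
    return std::ldexp(1.0, -mantissa_bits);
}
```

Comparing `machine_epsilon(bits)` against each row would make inconsistencies like the BF16 and BF8 entries easy to spot.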
| ↑ BOTTLENECK: PCIe transfer of entire tensor | ||
| ``` | ||
|
|
||
| - **Problem**: Transferring multi-GB tensors over PCIe is 10-100x slower than computation |
We can remove the 10-100x figure.
| ``` | ||
|
|
||
| - **Advantage**: All data stays on GPU, only error statistics transfer to CPU | ||
| - **Performance**: 10-100x faster for large tensors |
Can remove or rephrase
|
|
||
| This directory contains utility headers for testing, benchmarking, and validating Composable Kernel (CK) operations. The utilities support both modern GPU-first validation for high-performance testing and legacy CPU-based approaches for backward compatibility. | ||
|
|
||
| ## Quick Start |
This section seems to summarize our good practices, key principles, or validation guidelines rather than initial setup steps.
| ## Quick Start | |
| ## Recommended Practices |
| 2. **Let the system compute tolerances** automatically based on data types | ||
| 3. **Only transfer error statistics**, not full tensors | ||
|
|
||
| ## File-to-Purpose Quick Reference |
A small thing, but the table is organized from left-to-right as purpose-to-file. The current section name is the reverse.
| ## File-to-Purpose Quick Reference | |
| ## Purpose-to-Utility Quick Reference |
| // Explicit tolerance | ||
| bool pass = gpu_verify<float>(output_dev, reference_dev, 1e-5f, 1e-6f, size); | ||
|
|
||
| // Automatic tolerance for mixed precision | ||
| bool pass = gpu_verify<float, half_t, float>(output_dev, reference_dev, K_dim, size); |
Is it worth mentioning when to use an explicit vs. automatic tolerance?
| - `get_relative_threshold<ComputeType, OutType, AccType>()`: Computes relative tolerance from mantissa bits | ||
| - `get_absolute_threshold<ComputeType, OutType, AccType>()`: Computes absolute tolerance scaled by magnitude |
If it is helpful, a short example for each of these calls could help users see its usage.
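As a starting point for such an example, here is a sketch of how a threshold could be derived from mantissa widths. It is not the CK implementation: the `sketch_*` names are hypothetical, the functions take mantissa bit counts as plain integers instead of the `<ComputeType, OutType, AccType>` template parameters, and the error model (half an ulp per rounding, linear worst-case growth with the number of accumulations) is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical stand-in for a relative-threshold helper. Takes explicit
// mantissa widths for the compute, output, and accumulator types.
inline double sketch_relative_threshold(int compute_bits,
                                        int out_bits,
                                        int acc_bits,
                                        int accumulation_count = 1)
{
    assert(accumulation_count >= 1);
    // Assumed model: each rounding contributes at most half an ulp.
    auto half_ulp = [](int bits) { return 0.5 * std::ldexp(1.0, -bits); };
    // Rounding when forming values in the compute type and storing the output.
    const double rounding_error = std::max(half_ulp(compute_bits), half_ulp(out_bits));
    // Worst-case growth: one accumulator rounding per accumulation step.
    const double acc_error = half_ulp(acc_bits) * accumulation_count;
    return std::max(rounding_error, acc_error);
}

// Hypothetical stand-in for an absolute-threshold helper: the relative
// threshold scaled by the largest magnitude expected in the result.
inline double sketch_absolute_threshold(double max_abs_value, double relative_threshold)
{
    return max_abs_value * relative_threshold;
}
```

With FP32 compute and accumulation (23 bits) and FP16 output (10 bits), the output rounding dominates for short reductions and the accumulation term takes over once the count is large, which is the behavior a README example should make visible.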
johannes-graner left a comment
Very nice to have a summary of the existing utility functionality.
My review mainly covers the GPU verification since that's what I'm most familiar with, although I looked through the rest too.
| bool pass = gpu_verify<float>(output_dev, reference_dev, 1e-5f, 1e-6f, size); | ||
|
|
||
| // Automatic tolerance for mixed precision | ||
| bool pass = gpu_verify<float, half_t, float>(output_dev, reference_dev, K_dim, size); |
`K_dim` should be changed to `accumulation_count` or similar, since it is not necessarily equal to the K dimension.
On a slightly separate note, this does not currently support split-k, which requires accounting for accumulation in multiple data types. See issue #3673.
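To illustrate why the name matters, the sketch below computes the number of products summed into one output element for two operations. The function names are hypothetical and are not CK utilities. For a plain GEMM the count is the reduction length K, but for a forward convolution it is the input channels per group times the filter window, where "K" conventionally names the output channels instead. Split-k is not covered here.

```cpp
#include <cassert>
#include <cstdint>

// GEMM C[m][n] = sum_k A[m][k] * B[k][n]: each output sums K products.
inline std::int64_t gemm_accumulation_count(std::int64_t K)
{
    assert(K > 0);
    return K;
}

// Forward convolution: each output element sums one product per input
// channel in its group and per position in the filter window.
inline std::int64_t conv_fwd_accumulation_count(std::int64_t channels_per_group,
                                                std::int64_t filter_h,
                                                std::int64_t filter_w)
{
    assert(channels_per_group > 0 && filter_h > 0 && filter_w > 0);
    return channels_per_group * filter_h * filter_w;
}
```

For example, a 3x3 filter over 64 input channels accumulates 576 products per output, independent of how many output channels the layer has.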
| 1. **Use GPU-first validation** for all new tests | ||
| 2. **Avoid CPU transfers** unless debugging specific values | ||
| 3. **Generate data on GPU** when possible | ||
| 4. **Batch verification** to amortize kernel launch overhead |
I don't think we do or support batch verification. It would be very VRAM-intense since many device-side tensors would have to be kept in memory instead of clearing the non-reference output after each kernel that is tested.
Imported to ROCm/rocm-libraries
I'm collecting information about our current testing (#3664). As part of this work I added a README to the directory to emphasize the GPU-first testing strategy and our support for type-specific tolerances. This README contains internal code comments for CK developers and does not need ROCm documentation review.
---
🔁 Imported from [ROCm/composable_kernel#3665](ROCm/composable_kernel#3665)
🧑💻 Originally authored by @shumway
Co-authored-by: John Shumway <jshumway@amd.com>
Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>