[CK] Add FP8 KV_BLOCKSCALE support for batch prefill #3696
Jeff-Huang wants to merge 6 commits into develop
Pull request overview
This pull request implements FP8 KV_BLOCKSCALE support for batch prefill in the FMHA (Fused Multi-Head Attention) pipeline. The implementation adds per-page K/V quantization for paged attention caches, uses an exp2 shift trick to eliminate explicit P scaling overhead, and prefetches physical page offsets to hide memory latency.
Changes:
- Added KV_BLOCKSCALE enum to BlockAttentionQuantScaleEnum for per-page K/V quantization
- Refactored KV cache offset computation by splitting into load_physical_pages() and kv_offset_array_transform()
- Implemented exp2 shift trick to scale attention probabilities by 2^shift without explicit multiplication
- Added prefetching of physical pages to overlap memory loads with computation
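The exp2 shift trick listed above rests on the identity 2^(x + shift) = 2^x · 2^shift. The following is a minimal host-side sketch of that identity, not the kernel code: the function names and the shift value of 8 are assumptions made for illustration, and `x` stands for a max-subtracted attention score in the log2 domain.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical shift value; the real constant is defined in the pipeline header.
constexpr float kIllustrativeShift = 8.0f;

// Reference path: compute P = 2^x, then scale it with an explicit multiply.
inline float scaled_p_explicit(float x, float shift)
{
    return std::exp2(x) * std::exp2(shift);
}

// Shift trick: fold the scale into the exponent, so no multiply is needed.
// For x <= 0 the result lies in (0, 2^shift].
inline float scaled_p_fused(float x, float shift)
{
    assert(x <= 0.0f);
    return std::exp2(x + shift);
}
```

Both functions return the same value up to rounding; the fused form only adds a constant to the exponent argument, which is why the explicit scaling of P can be dropped.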
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp | Core pipeline implementation: added load_physical_pages(), refactored kv_offset_array_transform(), implemented KV_BLOCKSCALE logic with exp2 shift trick and prefetching |
| fmha_batch_prefill_kernel.hpp | Added KV_BLOCKSCALE kernel argument structures (FmhaFwdKVBlockScaleKargs), QScaleKargsSelector template, and operator() overloads |
| block_attention_quant_scale_enum.hpp | Added KV_BLOCKSCALE enum value (3) and string conversion template specialization |
| quant.hpp | Added kv_blockscale enum value and decode/encode logic for command-line interface |
| fmha_fwd.hpp | Added kv_block_descale parameters to fmha_batch_prefill_args structure |
| fmha_batch_prefill.py | Restricted kv_blockscale to page_size=1024 in code generation |
| cpp_symbol_map.py | Added kv_blockscale mappings to QSCALE_MAP and QSCALE_CHECK_MAP |
include/ck_tile/ops/fmha/pipeline/block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp
    static constexpr auto kKVMemoryLayout = Problem::kKVMemoryLayout;
    static constexpr auto QScaleEnum = Problem::QScaleEnum;

    // For KV_BLOCKSCALE: shift value for exp2(x + shift) to scale P to [0, 2^shift]

The comment says "For KV_BLOCKSCALE" but these shift constants appear to be general FP8 constants that could be used for both BLOCKSCALE and KV_BLOCKSCALE. In other pipeline files (block_fmha_pipeline_qr_ks_vs.hpp and block_fmha_pipeline_qr_ks_vs_async.hpp), the same constants have comments saying "For BLOCKSCALE". Consider making the comment more general (e.g., "For BLOCKSCALE/KV_BLOCKSCALE") or extracting these constants to a shared location to avoid duplication and inconsistency.

Suggested change:
    - // For KV_BLOCKSCALE: shift value for exp2(x + shift) to scale P to [0, 2^shift]
    + // For BLOCKSCALE/KV_BLOCKSCALE: shift value for exp2(x + shift) to scale P to [0, 2^shift]
    else // !kVTileCrossesPages
    {
        // V cache: use physical_pages[k0] for each token
        // physical_pages was already populated correctly by load_physical_pages(), handling:
        // - page_size=1: page_idx maps token_idx -> physical_page directly
        // - V tile crosses pages: per-token page lookup
        // - V tile in single page: lane0 lookup with broadcast to all lanes

The comment "!kVTileCrossesPages" is misleading. This is the else branch for "if constexpr(kIsKcache)", so it handles V cache regardless of whether the V tile crosses pages or not. The comment should be "!kIsKcache" or "V cache" to accurately describe this branch.

Suggested change:
    - else // !kVTileCrossesPages
    + else // !kIsKcache (V cache)
      {
          // V cache: use physical_pages[k0] for each token
          // physical_pages was already populated correctly by load_physical_pages(), handling:
          // - page_size=1: page_idx maps token_idx -> physical_page directly
          // - V tile crosses pages: per-token page lookup
          // - V tile in single page: lane0 lookup with broadcast to all lanes
include/ck_tile/ops/fmha/pipeline/block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp
    // all tokens in a main loop iteration belong to the same page
    if constexpr(QScaleEnum == BlockAttentionQuantScaleEnum::KV_BLOCKSCALE)
    {
        static_assert(kPageBlockSize >= kN0, "KV_BLOCKSCALE requires kPageBlockSize >= kN0");

Please clarify that the pipeline only supports kPageBlockSize == 1024 if QScaleEnum == BlockAttentionQuantScaleEnum::KV_BLOCKSCALE.
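The static_assert in this thread guards the property stated in the code comment: every main-loop iteration must read from a single page, so one scale per page is enough. Below is a small sketch of that reasoning. It assumes tiles start at multiples of kN0 and uses an illustrative kN0 of 128; the helper names are hypothetical and not part of the pipeline.

```cpp
#include <cassert>

// Page that holds a given logical token index.
constexpr int page_of(int token_idx, int page_size)
{
    assert(page_size > 0);
    return token_idx / page_size;
}

// True when the kN0-token tile handled by iteration `iter` lies inside one page.
// Assumes tiles start at multiples of kN0 (an assumption of this sketch).
constexpr bool tile_in_single_page(int iter, int kN0, int page_size)
{
    const int first = iter * kN0;
    const int last  = first + kN0 - 1;
    return page_of(first, page_size) == page_of(last, page_size);
}

// page_size = 1024 with kN0 = 128: tiles never straddle a page boundary.
static_assert(tile_in_single_page(0, 128, 1024), "tile 0 fits in page 0");
static_assert(tile_in_single_page(7, 128, 1024), "last tile of page 0");
static_assert(tile_in_single_page(8, 128, 1024), "first tile of page 1");
// A page smaller than the tile cannot hold it.
static_assert(!tile_in_single_page(0, 128, 64), "tile spans two pages");
```

Note that `kPageBlockSize >= kN0` alone gives this guarantee only when the page size is a multiple of kN0, which holds for the page_size=1024 configuration the code generator emits.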
include/ck_tile/ops/fmha/pipeline/block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp
    }

    template <bool Cond = kIsGroupMode>
    CK_TILE_HOST static constexpr std::enable_if_t<Cond, Kargs>

Group mode is not used. Should it be deleted, or should we use static_assert to mask it?

It seems Aiter still uses the batch prefill kernel with group mode in mha_batch_prefill_kernels.cu:
    float t = aiter::mha_batch_prefill(args,
                                       stream_config,
                                       dtype_str,
                                       true, // is_group_mode
                                       mask.type,
                                       bias_type,
                                       has_lse,
                                       qscale_type,
                                       false);
Do you have a specific concern about group mode in the batch prefill kernel, or was there another reason you suggested removing it? I'm happy to discuss further if there's a better way to structure this.
    {
        const index_t scale_offset =
            k_physical_pages[0] * kv_block_descale_stride_block +
            block_indices.kv_head_idx * kv_block_descale_stride_head;

If using group mode, the calculation method is incorrect. It is recommended to disable group mode.

Could you clarify which part of the calculation you think is incorrect in group mode? I'd like to make sure I understand your concern correctly.
From my understanding, the current implementation should work for both batch mode and group mode:
- k_physical_pages[0] is the physical page index obtained from page table lookup. The page_idx pointer is already per-batch adjusted:
  - SGLang: page_idx = kv_page_indices + kv_indptr[i_batch]
  - vLLM: page_idx = block_table_ptr + i_batch * batch_stride
- kv_head_idx is computed as i_nhead / nhead_ratio_qk, which should be the same for both modes.
- The scale tensor is indexed by physical page (not logical sequence position), so the offset calculation physical_page * nblock_stride + kv_head_idx * nhead_stride should be correct regardless of batch mode or group mode.
Is there a specific scenario or edge case you have in mind where this would produce incorrect results? Happy to discuss further!
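The indexing argument in this reply can be sketched on the host. This is an illustration under stated assumptions, not the kernel code: the function names are hypothetical, the page table follows the SGLang-style layout quoted above, and the scale tensor is assumed to be laid out as [physical page][KV head].

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using index_t = std::int32_t;

// SGLang-style lookup: kv_indptr[i_batch] is the offset of that batch's first
// entry in kv_page_indices, so page_idx is already adjusted per batch.
inline index_t physical_page(const std::vector<index_t>& kv_page_indices,
                             const std::vector<index_t>& kv_indptr,
                             index_t i_batch,
                             index_t logical_page)
{
    const index_t* page_idx = kv_page_indices.data() + kv_indptr[i_batch];
    return page_idx[logical_page];
}

// Offset of the descale factor for one (physical page, KV head) pair.
// It depends on the physical page only, not on the logical sequence position.
inline index_t scale_offset(index_t phys_page,
                            index_t i_nhead,
                            index_t nhead_ratio_qk,
                            index_t stride_block,
                            index_t stride_head)
{
    assert(nhead_ratio_qk > 0);
    const index_t kv_head_idx = i_nhead / nhead_ratio_qk;
    return phys_page * stride_block + kv_head_idx * stride_head;
}
```

For example, with 8 query heads sharing 2 KV heads (`nhead_ratio_qk = 4`) and strides of 2 and 1, query head 5 on physical page 9 maps to offset 9 * 2 + 1 = 19.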
Implement per-page K/V quantization for paged attention:
- Add KV_BLOCKSCALE enum to BlockAttentionQuantScaleEnum
- Use exp2 shift trick to eliminate explicit P scaling overhead
- Prefetch physical pages offset for KV cache, overlaps with computations
2. Rename QScaleKargsSelector -> GetQScaleKargs for naming consistency
3. Remove unused BlockAttentionQuantScaleEnumToStr<KV_BLOCKSCALE>
53d41c9 to 8b07580
…intain flexibility.
Imported to ROCm/rocm-libraries
Implement per-page K/V quantization for paged attention:
- Add KV_BLOCKSCALE enum to BlockAttentionQuantScaleEnum
- Use exp2 shift trick to eliminate explicit P scaling overhead
- Prefetch physical pages offset for KV cache, overlaps with computations
## Proposed changes
Please describe the motivation behind the pull request, whether it enables a new feature or fixes a bug. If there are associated pull requests or issues, please link them to the pull request.
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.
- [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant after this pull request
- [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged
## Discussion
If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered
---
🔁 Imported from [ROCm/composable_kernel#3696](ROCm/composable_kernel#3696)
🧑💻 Originally authored by @Jeff-Huang
---------
Co-authored-by: Jeff Huang <chiachi.huang@amd.com>
Co-authored-by: Illia Silin <Illia.Silin@amd.com>
commit 8c40fb6cac48969d6237cccdcbbbad56b44ff0a3
Author: Kerry Wang <kerrywang369@gmail.com>
Date: Mon Feb 9 20:27:33 2026 -0500
more consistent skip text
commit 557e2764b3a001884a004f0a183a307c4fbc2bd2
Merge: 6bd6e49791 219f365e7b
Author: Kerry Wang <kerrywang369@gmail.com>
Date: Mon Feb 9 18:38:48 2026 -0600
Merge remote-tracking branch 'origin/develop' into users/kerrwang/lds-queue
commit 6bd6e497910e3ba681b22a47630bc5f0dedb16b8
Author: Kerry Wang <kerrywang369@gmail.com>
Date: Mon Feb 9 18:38:31 2026 -0600
fix format
commit 219f365e7bc40c9ce3f5c382228a7b2e14b90520
Author: James Sandham <33790278+jsandham@users.noreply.github.com>
Date: Mon Feb 9 19:08:04 2026 -0500
[hipsparse] Match behaviour of csr2csr_compress from rocsparse (#4420)
## Motivation
In the hipSPARSE test code host solution, we were incorrectly checking
if a value satisfied:
`testing_abs(csr_val_A[j]) > testing_real(tol) &&
testing_abs(csr_val_A[j]) > std::numeric_limits<float>::min()`
instead of the correct criteria:
`testing_abs(csr_val_A[j]) > testing_real(tol)`
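The difference between the two criteria only shows for very small magnitudes. The sketch below uses plain `std::abs` on `float` in place of the test helpers `testing_abs` and `testing_real`; the function names are hypothetical.

```cpp
#include <cmath>
#include <limits>

// Criterion the host reference used before the fix: it additionally drops
// every value whose magnitude is not above the smallest normal float.
inline bool keep_before_fix(float v, float tol)
{
    return std::abs(v) > tol && std::abs(v) > std::numeric_limits<float>::min();
}

// Corrected criterion: keep the value whenever its magnitude exceeds tol.
inline bool keep_after_fix(float v, float tol)
{
    return std::abs(v) > tol;
}
```

With `tol = 0`, a value equal to `std::numeric_limits<float>::min()` (or any subnormal) is dropped by the first predicate but kept by the second, so the two host references disagree exactly on such entries.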
commit 698d5d09184a24fde32ab7309fcd88410fc7ff8e
Author: amd-hsong <hao.song@amd.com>
Date: Mon Feb 9 16:40:07 2026 -0700
[rocprim] Fix a call to intrinsics in test_device_reduce_by_key (#4391)
## Motivation
Fix a call to __clzll in test_device_reduce_by_key
## Technical Details
There are a couple of issues in the call to __clzll:
- the argument is cast to `long long`: it should be cast to `unsigned
long long` instead
- in rocprim there exists a wrapper for clz, so for better portability
rocprim::clz should be used instead.
## Test Plan
Run test_device_reduce_by_key to verify the test runs correctly.
## Test Result
The test passes.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 557f5baa6d68bb5a8126d9730a8d48983778aac3
Author: Kerry Wang <kerrywang369@gmail.com>
Date: Mon Feb 9 16:53:41 2026 -0600
skip on non-gfx950
commit 8b72bc8759d9c11dfcbf410182fa332152b97e69
Author: bnemanich <brad.nemanich@amd.com>
Date: Mon Feb 9 16:26:51 2026 -0500
[hipBLASLt] Enable custom MXFP4 kernels (#4384)
## Motivation
Allow hipBLASLt to call custom MX FP4 kernels for higher performance.
## Technical Details
A single kernel was added in this PR. The kernel was originally from:
https://github.com/ROCm/aiter/tree/main/hsa/gfx950/f4gemm.
This kernel used a slightly different shuffled scaling layout than
rocRoller. hipBLASLt will only support this new shuffled layout, plus
the original non-shuffled layout. All rocRoller kernels will be disabled
when using shuffled scales for now. Once rocRoller supports the new
layout, they will be added back in.
This PR also adds some new MX datatype generation patterns that were
useful during debugging.
New custom kernels can be added to the custom_kernels directory. They
will also need to be added in the customer_kernels.cpp file that was
added in this PR.
## Test Plan
Check that performance improved when using MXFP4 GEMMs with shuffled
scales.
## Test Result
Performance improved by about 17%.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Andrew Whittle <Andrew.Whittle@amd.com>
Co-authored-by: Bryant Nelson <bryant.nelson@amd.com>
commit 61f9f906dcc0a9d4f6c327fea713aebc6d4b0a1d
Author: Bartłomiej Kocot <barkocot@amd.com>
Date: Mon Feb 9 22:08:57 2026 +0100
[CK] CK Tile grouped convolution direct load (#4406)
## Motivation
CK Tile grouped convolution forward direct load support.
## Technical Details
Basic pipeline for direct load and new instances for forward for v1 and
v4 pipelines.
## Test Plan
test_grouped_convnd_fwd_tile
## Test Result
CI pending
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-130
commit b7f136734ad26314386ca2b4f5a99467804f1bb7
Author: assistant-librarian[bot] <210906412+assistant-librarian[bot]@users.noreply.github.com>
Date: Mon Feb 9 20:58:57 2026 +0000
Enable group mode (varlen) kernel generation for PyTorch integration (#4292)
## Proposed changes
This PR enables group mode (variable-length attention) kernel generation
for PyTorch's CK SDPA backend.
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [X] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [X] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged
## Discussion
The change is minimal (single line deletion) but enables a significant
feature: variable-length attention support for ROCm users via PyTorch's
torch.nn.attention.varlen API.
---
🔁 Imported from
[ROCm/composable_kernel#3553](https://github.com/ROCm/composable_kernel/pull/3553)
🧑💻 Originally authored by @chinmaydk99
Co-authored-by: Chinmay_Kuchinad <ChinmayDattanand.Kuchinad@amd.com>
commit f48a5e63edb7102996b0b769e76114c0bbfd35cf
Author: Mihnea Chirila <37160326+mihnea-chirila@users.noreply.github.com>
Date: Mon Feb 9 14:55:12 2026 -0600
[Tensilelite] Added MIArchVgpr support for Complex Datatypes. (#4332)
## Motivation
Added MIArchVgpr support for Complex Datatypes.
## Technical Details
Fixed AlphaTmpVgpr initialization, and rocisa register offset bug:
- Updated condition to initialize AlphaTmpVgpr if MIArchVgpr parameter
is enabled. Required to generate `MulMIOutAlphaToArch` code
(https://github.com/ROCm/rocm-libraries/blob/c20a85b6c458ef44c1f0e30c35b286a0395fb8fa/projects/hipblaslt/tensilelite/Tensile/KernelWriterModules.py#L251)
regardless of postGSU Accumulation scheme.
- Fixed underlying `Holder` struct bug: offsets passed as strings are now
forwarded correctly to `RegisterContainer`. Required to update imaginary register
for C/ZGEMM.
(https://github.com/ROCm/rocm-libraries/blob/c20a85b6c458ef44c1f0e30c35b286a0395fb8fa/projects/hipblaslt/tensilelite/Tensile/KernelWriterModules.py#L288)
## Test Plan
Tested for C & Z with MIArchVgpr: [0, 1] on gfx942 and gfx950
## Test Result
Success
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 3de83b9b4035006b5ddd825df9404edc43ca9b39
Author: CMiservaAMD <cmiserva@amd.com>
Date: Mon Feb 9 13:42:54 2026 -0700
[hipDNN] Add integration tests for frontend configuration knobs APIs. (#4307)
Add integration tests to verify correct operation of new hipDNN frontend API
functions for managing engine config settings.
commit 2752a8a5105e11929b876ce0e343bcc73a9cf308
Author: DarylHawkinsAMD <Daryl.Hawkins@amd.com>
Date: Mon Feb 9 13:08:30 2026 -0700
[MIOpen] First set of kernels using CK Builder end to end (#4123)
commit e55f37bad667987f74989bc95e08f86603438963
Author: Mitchell Ousdahl <mitch.ousdahl@amd.com>
Date: Mon Feb 9 10:05:26 2026 -0800
Modified test plugin rpaths (#4350)
## Motivation
In order to successfully get hipDNN added to the python ROCm wheels, the
RPATHs on Linux for the test plugins need to be updated to make them
portable. We will leverage TheRock's existing RPATH update mechanism to
do this.
## Technical Details
- Update all test plugin RPATHs
## Test Plan
- Build ROCm
- Build the wheels
- Use the "Test ROCm Wheel" workflow, which verifies that the test
plugins can load and find their dependencies.
## Test Result
- [ ] "Test ROCm Wheel" workflow succeeds
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit a7485411874b0650b31068364f6b1155d9890212
Author: Muhammad Osama <osama94@gmail.com>
Date: Mon Feb 9 09:30:52 2026 -0800
[Origami] Skip test-selector if torch not found. (#4359)
## Motivation
Makes `torch` completely optional by skipping dependent tests if it is
not found.
## Technical Details
```
# Skip entire module if torch is not available (selector requires torch)
torch = pytest.importorskip("torch", reason="torch is required for OrigamiMatmulSelector tests.")
```
## Test Plan
Run tests using CI + TheRock build.
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 0c5cd629a94a454a350eb651b5921baeb1c82546
Author: Swati Rawat <120587655+SwRaw@users.noreply.github.com>
Date: Mon Feb 9 22:51:13 2026 +0530
Update Tensile CHANGELOG.md (#4164)
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 331512e9e13e197d8d7fdf7b72f5b60eb63d7d1e
Author: Bartłomiej Kocot <barkocot@amd.com>
Date: Mon Feb 9 16:36:52 2026 +0100
[CK] Fix grouped conv fwd transform for merged groups (#4399)
## Motivation
[CK] Fix grouped conv fwd transform for merged groups for 1d and 3d.
## Technical Details
After optimizations for 2d there is a lack of implementation for 1d and
3d
## Test Plan
test_grouped_convnd_fwd
## Test Result
pending CI
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 1c2927530e176c63cf814b44eb8147e89d2bcaf7
Author: Eiden Yoshida <47196116+eidenyoshida@users.noreply.github.com>
Date: Mon Feb 9 10:23:47 2026 -0500
[CK] MICI: Disable failure pattern checking (#4373)
## Motivation
- ck mici jobs hanging at end, possibly at failure pattern checking
## Technical Details
- Disable failure pattern checking to see if hanging goes away
## Test Plan
- Observe behavior after merge
commit a3058d1dc0b3f176f56fbecd040c2fc48c7258ad
Author: COrruDXC <carlo.orru@dxc.com>
Date: Mon Feb 9 14:02:05 2026 +0100
Reduce boost usage by replacing time calls (#3875)
## Motivation
Reduce boost usage by replacing time calls.
## Technical Details
Replace boost::posix_time related data types with the corresponding
std::chrono data types.
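As an illustration of this kind of replacement, elapsed time can be measured with the standard library alone. This is a sketch, not code from the PR: the helper name `elapsed_ms` is hypothetical, and `std::chrono::steady_clock` is one reasonable choice of clock for interval timing.

```cpp
#include <chrono>
#include <utility>

// Run `work` and return the elapsed wall time in milliseconds,
// using only std::chrono (no boost::posix_time types).
template <typename F>
double elapsed_ms(F&& work)
{
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

`steady_clock` is monotonic, so the measured interval is never negative even if the system clock is adjusted while `work` runs.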
## Submission Checklist
- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 959bd9393ad9a578711334c40948ac1321e41c1f
Author: Yi-Yao (Alex), Wang <68064688+alex391a@users.noreply.github.com>
Date: Mon Feb 9 17:15:47 2026 +0800
Update gfx942/gfx950 BBS/HHS/I8I8S SPB/SPA logic yaml (#4365)
## Motivation
- Update BBS/HHS/I8I8S SPB/SPA logic yaml for gfx942/gfx950
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
- Run local hipsparselt-test
- Run local tests for all matrix sizes using hipsparselt-bench
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: chiahlin <chiahlin@amd.com>
commit ad03e58dabbf2bbc348c031a06ec73011d85d2c3
Author: Chuck Wu <chuckwu92job@gmail.com>
Date: Mon Feb 9 13:04:38 2026 +0800
[hipblaslt] Fix memory leaks & uninitialized value use (#4338)
## Motivation
According to the
[ROCM-1835](https://amd-hub.atlassian.net/browse/ROCM-1835?focusedCommentId=109304&sourceType=mention),
there are some memory leaks and instances of uninitialized value being
used during the gtest.
<img width="450" height="367" alt="image"
src="https://github.com/user-attachments/assets/2345e1f8-6062-4a5a-b294-97042709b18e"
/>
## Technical Details
1. Add the code to call the corresponding destroy functions for the data
that has not been released yet.
2. Add the default value to compute_input_typeA/B
Flow (before this commit) ->
a. hipblasLtMatmulDescCreate:
- compute_input_typeA = ???
- compute_input_typeB = ???
b. hipblasLtMatmulDescSetAttribute(COMPUTE_INPUT_TYPE_A, HIP_R_16F, ...)
- compute_input_typeA = HIP_R_16F
- compute_input_typeB = ???
- call _matmul_desc_determine_compute_type()
- Read compute_input_typeA & compute_input_typeB
c. hipblasLtMatmulDescSetAttribute(COMPUTE_INPUT_TYPE_B, HIP_R_16F, ...)
- compute_input_typeA = HIP_R_16F
- compute_input_typeB = HIP_R_16F
- call _matmul_desc_determine_compute_type()
- Read compute_input_typeA & compute_input_typeB
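The flow above reads both fields after the first setter call, so one of them is read before the caller has ever set it. A minimal sketch of the fix pattern follows; the struct, enum, and method names are hypothetical stand-ins rather than hipBLASLt types, and `Unset` is an assumed default since the commit does not name the value it chose.

```cpp
// Hypothetical stand-ins for the data type values named in the flow above.
enum class DataType { Unset, R16F };

struct MatmulDescSketch
{
    // Default member initializers: both fields hold a defined value
    // from construction onward.
    DataType compute_input_typeA = DataType::Unset;
    DataType compute_input_typeB = DataType::Unset;

    // What the most recent call to determine_compute_type() observed.
    DataType seen_A = DataType::Unset;
    DataType seen_B = DataType::Unset;

    // Reads BOTH fields, including the one the caller has not set yet.
    void determine_compute_type()
    {
        seen_A = compute_input_typeA;
        seen_B = compute_input_typeB;
    }

    void set_compute_input_type_a(DataType t)
    {
        compute_input_typeA = t;
        determine_compute_type();
    }

    void set_compute_input_type_b(DataType t)
    {
        compute_input_typeB = t;
        determine_compute_type();
    }
};
```

After only `set_compute_input_type_a` has been called, the read of `compute_input_typeB` now yields the default instead of an indeterminate value, which is the condition Valgrind reported.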
## Test Plan
1. Test command: `valgrind --leak-check=full ./hipblaslt-test
--gtest_filter=_/aux_test.*`
2. Before this commit:
Uninitialized value being used
<img width="691" height="81" alt="image"
src="https://github.com/user-attachments/assets/22a897f1-c25e-4608-850e-0c6bcb5ad0a3"
/>
Memory leaks
<img width="708" height="78" alt="image"
src="https://github.com/user-attachments/assets/02268893-a29a-4db4-95aa-c93385371d5a"
/>
## Test Result
1. Valgrind no longer reports the uninitialized-value and memory-leak errors.
2. gtest all pass on Navi3.
<img width="952" height="92" alt="image"
src="https://github.com/user-attachments/assets/6e1b9b01-afc4-4a51-8a9f-e0196fc8495a"
/>
## Submission Checklist
- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[ROCM-1835]:
https://amd-hub.atlassian.net/browse/ROCM-1835?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
commit b7de1e14cea70681a23cd1a136df42910c776e4a
Author: assistant-librarian[bot] <210906412+assistant-librarian[bot]@users.noreply.github.com>
Date: Mon Feb 9 11:54:54 2026 +0800
[CK_TILE] Add blockscale GEMM support for EightWarps on gfx950 (#4280)
## Proposed changes
gemm blockscale eightwarps support
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [x] Any dependent changes have been merged
## Discussion
If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
---
🔁 Imported from
[ROCm/composable_kernel#3650](https://github.com/ROCm/composable_kernel/pull/3650)
🧑💻 Originally authored by @kensclin
---------
Co-authored-by: KenSCLin <lshyhchy@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
commit 774cfc6410ed55802691fef19a34449182878be5
Author: Ethan <Yi-Chen.Lin@amd.com>
Date: Mon Feb 9 11:39:20 2026 +0800
[hipblaslt] do some debug operations only in debug setting
## Motivation
Single solution selection time has increased slightly.
## Technical Details
Many "assign matchingTag" operations can be avoided when not in debug mode
(printProperty), but I have kept the Equal assignment since it was
there before #2757.
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit ff3e9821bbe2d14b9874e61ebb518bbbce621ac3
Author: jakpiase <jakpia21@gmail.com>
Date: Sun Feb 8 20:57:14 2026 +0100
[CK_TILE] Add support and tests for V6 pipeline in conv fwd (#4357)
Added support for the conv V6 pipeline in CK Tile's convolution forward
kernel. The CK Tile V6 pipeline is equivalent to the old CK V5 pipeline
and should be faster than other pipelines in some cases. This PR also
adds tests to the profiler, which currently lives in the experimental
directory, so regressions should now be easier to detect.
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: subhajitdchow <sduttach@amd.com>
commit 591f50450241d6b1965f9f6ee3fe2526ef71ab8d
Author: assistant-librarian[bot] <210906412+assistant-librarian[bot]@users.noreply.github.com>
Date: Sun Feb 8 12:34:59 2026 +0100
[CK] Add fwd conv group merging to v3 conv instances (#4273)
## Proposed changes
Added conv group merging to the (universal) V3 fwd conv pipeline. The
new instance improves fwd conv performance when the number of
input/output channel per group is low.
On MI300 (`gfx942`) we get
| CK prof command | Baseline (TFLOPS) | V3 group merging (TFLOPS) |
|:-----|:------:|------:|
| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 4 4 3 3 200 200 1 1 1 1 1 1 1 1 | 3.86035 | 8.36796 |
| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 200 200 2 2 1 1 1 1 1 1 | 10.1867 | 13.4677 |
| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 100 100 1 2 1 1 1 1 1 1 | 11.7875 | 16.3657 |
---
🔁 Imported from
[ROCm/composable_kernel#3675](https://github.com/ROCm/composable_kernel/pull/3675)
🧑💻 Originally authored by @vpietila-amd
---------
Co-authored-by: Ville Pietilä <>
Co-authored-by: Ville Pietilä <188998872+vpietila-amd@users.noreply.github.com>
Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
commit cad7fa2c1849b0863ed52ef6cd47198e421d5b6e
Author: BrianHarrisonAMD <169072757+BrianHarrisonAMD@users.noreply.github.com>
Date: Fri Feb 6 23:48:54 2026 -0700
[hipDNN] Fix codecov target breaks (#4374)
## Motivation
code_cov stage for hipDNN was breaking due to environment differences.
Newer clang tooling flags false positives on added files.
Environments that are missing spdlog, but have fmt present were causing
issues due to mixed include expectations.
## Technical Details
- Ignore false positives for lint
- Only add fmt if spdlog was built with external FMT
## Test Plan
- Code cov target builds successfully
## Test Result
Waiting on CI
commit 91627789d86acc7dff4bf5eaafe3b774a7037f76
Author: Koji Nakajima <75698246+nakajee@users.noreply.github.com>
Date: Fri Feb 6 23:44:05 2026 -0700
[hipblaslt] Fix memory access error with DtlPlusLdsBuf (#4303)
## Motivation
Fix a memory access error with DtlPlusLdsBuf
## Technical Details
- generate all GlobalRead Inc code before local read addr swap
## Test Plan
Added a test case in dtl.yaml
## Test Result
Confirmed that the new test fails before the change and passes after
it.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 8b5a98b48c007663765865d2e14247ef1f056b01
Author: Aaron St George <aaronstgeorge@gmail.com>
Date: Fri Feb 6 23:18:13 2026 -0600
[hipDNN] Add `FUSILLI_PLUGIN` to `EngineNames.hpp` (#4362)
## Motivation
Following the approach outlined in
[hipdnn/docs/rfcs/0003_EngineIdDesign.md](https://github.com/ROCm/rocm-libraries/blob/develop/projects/hipdnn/docs/rfcs/0003_EngineIdDesign.md)
this PR defines an engine ID for fusilli in
`hipdnn/data_sdk/include/hipdnn_data_sdk/utilities/EngineNames.hpp`.
## Technical Details
ID + tests defined.
## Test Plan
Test defined in the PR.
## Test Result
Tests pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit e3a9b3f95d29ce626efc3d2134e4e924b1c591a9
Author: James Newling <james.newling@gmail.com>
Date: Fri Feb 6 21:12:55 2026 -0800
[hipblaslt] Raise exception instead of segfaulting (#3995)
## Motivation
Faster problem diagnosis on failure.
## Technical Details
Throw exception if library is nullptr.
## Test
Before:
```
TensileLibrary.yaml:181:31: error: invalid boolean
customMainLoopScheduling: 0
^
[Lots of logging]
Segmentation fault
+ ERR2=139
+ ERR=0
```
After:
```
TensileLibrary.yaml:181:31: error: invalid boolean
customMainLoopScheduling: 0
terminate called after throwing an instance of 'std::runtime_error'
what(): Failed to load solution library
+ ERR2=134
+ ERR=0
```
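The behaviour change can be sketched as a null check that throws instead of letting a later dereference crash. Only the error message is taken from the log above; the type `SolutionLibrary` and the helper `require_library` are hypothetical names for illustration.

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical placeholder for the loaded solution library type.
struct SolutionLibrary
{
};

// Return the library unchanged, or throw if loading produced a null pointer.
inline std::shared_ptr<SolutionLibrary>
require_library(std::shared_ptr<SolutionLibrary> lib)
{
    if(lib == nullptr)
    {
        throw std::runtime_error("Failed to load solution library");
    }
    return lib;
}
```

An uncaught `std::runtime_error` ends the process through `std::terminate`, which is consistent with the exit status changing from 139 (segmentation fault) to 134 (abort) in the logs.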
commit f48eaa54f7395aa8ce4980dcc6725fe38784f7e6
Author: CMiservaAMD <cmiserva@amd.com>
Date: Fri Feb 6 21:48:52 2026 -0700
[hipDNN] Fix a couple log messages in test plugins broken by recent merge. (#4380)
Include correct function name in test plugin log output.
commit 5df3343ecfae6b39201995d8178fe39e061e0c40
Author: Emily Martins <65371150+ecamartins@users.noreply.github.com>
Date: Fri Feb 6 17:26:57 2026 -0700
[CK_TILE] Fix MMA concepts compiler error (#4381)
## Motivation
CK Tile is required to support certain older OSs; on these OSs, C++20
is not fully supported. For ROCm 7.2, compiler errors occur on one of
these older OSs. An example of this error is as follows:
```bash
/composable_kernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp:34:28: error: expected concept name with optional arguments
34 | { MmaOp::kAMBlock } -> std::convertible_to<unsigned int>;
|
```
The goal of this PR is to resolve these compiler errors.
## Technical Details
The existing guards around the mma concepts only check if the concepts
language feature is supported, as follows:
```cpp
#if defined(__cpp_concepts) && __cpp_concepts >= 201907L
// ...
template <typename CtrlFlags>
concept CtrlFlagsGfx9I = requires(CtrlFlags ctrlFlags) {
// Flag members for Gfx9 MFMA instructions
{ CtrlFlags::Cbsz } -> std::convertible_to<int>;
{ CtrlFlags::Abid } -> std::convertible_to<int>;
{ CtrlFlags::Blgp } -> std::convertible_to<int>;
};
#endif // defined(__cpp_concepts) && __cpp_concepts >= 201907L
```
That said, in cases where functionality from the `<concepts>` header is
used (e.g., `std::convertible_to`), this guard fails to check whether
the `<concepts>` header is available.
This change adds an additional check to the concepts that make use of
functionality from the `<concepts>` header to ensure the header is
available.
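The exact macro used by this PR is not shown here, so the following is only a sketch under an assumption: it tests the library feature macro `__cpp_lib_concepts` (made visible through `<version>` when that header exists) in addition to the language macro `__cpp_concepts`. `DEMO_HAS_STD_CONCEPTS`, `GoodFlags`, `BadFlags`, and `has_cbsz` are hypothetical names.

```cpp
#include <cassert>
#include <type_traits>

// Pull in the library feature-test macros when <version> is available.
#if defined(__has_include)
#  if __has_include(<version>)
#    include <version>
#  endif
#endif

// Require both the language feature and the <concepts> library support.
#if defined(__cpp_concepts) && __cpp_concepts >= 201907L && defined(__cpp_lib_concepts)
#  include <concepts>
#  define DEMO_HAS_STD_CONCEPTS 1
#else
#  define DEMO_HAS_STD_CONCEPTS 0
#endif

struct GoodFlags { static constexpr int Cbsz = 0; };
struct BadFlags {};

#if DEMO_HAS_STD_CONCEPTS
// Concept-based check, only compiled when std::convertible_to exists.
template <typename F>
concept HasCbsz = requires { { F::Cbsz } -> std::convertible_to<int>; };

template <typename F>
constexpr bool has_cbsz() { return HasCbsz<F>; }
#else
// Fallback for toolchains without <concepts>: classic SFINAE trait.
template <typename F, typename = void>
struct HasCbszTrait : std::false_type {};

template <typename F>
struct HasCbszTrait<F, std::enable_if_t<std::is_convertible<decltype(F::Cbsz), int>::value>>
    : std::true_type {};

template <typename F>
constexpr bool has_cbsz() { return HasCbszTrait<F>::value; }
#endif
```

Either branch gives the same answer for `has_cbsz<GoodFlags>()` and `has_cbsz<BadFlags>()`, so callers do not need to know which toolchain they are on.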
## Test Plan
I tested the changes on the relevant docker for gfx90a, gfx950, and
gfx942 and the compiler issue is not present.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 06976b37a2f0353b80c71fb3d56bee92bb6b9bab
Author: Aviral Goel <aviral.goel@amd.com>
Date: Sat Feb 7 04:14:28 2026 +0400
Increase tolerance for FP16 GEMM tests to handle non-deterministic ro… (#4335)
…unding
Three tests were failing intermittently with small errors (0.01-1.5%)
due to non-deterministic FP16 accumulation order from GPU thread
scheduling:
- test_ck_tile_batched_gemm
- test_ck_tile_grouped_gemm_preshuffle
- test_ck_tile_grouped_gemm_multi_d
These tests use kbatch=1 (no split-K), so errors are from
order-dependent rounding, not atomics. Increased tolerances from 1e-3 to
2e-3 (0.2%) to account for FP16 precision limits while still catching
real bugs.
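To illustrate why accumulation order alone can change a result, here is a small sketch. It uses `float` as a stand-in for FP16 (an assumption made so the example compiles on a host without half-precision types), and `within_rel_tol` is a hypothetical helper, not the comparison routine the tests actually use.

```cpp
#include <cassert>
#include <cmath>

// Sum three values left to right: (a + b) + c.
inline float sum_left(float a, float b, float c) { return (a + b) + c; }

// Sum the same values right to left: a + (b + c).
inline float sum_right(float a, float b, float c) { return a + (b + c); }

// Relative-tolerance check: |actual - expected| <= rtol * |expected|.
inline bool within_rel_tol(double actual, double expected, double rtol) {
    return std::fabs(actual - expected) <= rtol * std::fabs(expected);
}
```

With inputs such as `1e8f, -1e8f, 1.0f`, the two orders disagree because the small term is absorbed when it is added to the large one first. A relative error of 0.15% passes `within_rel_tol` at `2e-3` but not at `1e-3`, which matches the range of failures described above.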
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
commit 07e9d561402c717946a1c08cfdce2681d5733335
Author: assistant-librarian[bot] <210906412+assistant-librarian[bot]@users.noreply.github.com>
Date: Fri Feb 6 16:10:23 2026 -0800
[CK] add inter/intrawave scheduling concept doc (#4300)
## Proposed changes
Adding information about inter/intrawave scheduling
---
🔁 Imported from
[ROCm/composable_kernel#3660](https://github.com/ROCm/composable_kernel/pull/3660)
🧑💻 Originally authored by @spolifroni-amd
---------
Co-authored-by: spolifroni-amd <sandra.polifroni@amd.com>
Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
commit 4d773b636ca00996e971d55bcd0530f641837b42
Author: JonathanLichtnerAMD <195780826+JonathanLichtnerAMD@users.noreply.github.com>
Date: Fri Feb 6 17:09:55 2026 -0700
Add .cline_storage to .gitignore (#4390)
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 738ffd7689ba0759f00c0e9430889b2ed995fa94
Author: Enrico Degregori <73224202+EnricoDeg@users.noreply.github.com>
Date: Sat Feb 7 01:09:08 2026 +0100
[CK] Workaround blockscale wp test failure (#4372)
## Motivation
Workaround to fix blockscale wp test failure for pipeline v3
## Technical Details
## Test Plan
## Test Result
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 612bf0b710b399276916c222d8d4c5f9c34f9f62
Author: James Sandham <33790278+jsandham@users.noreply.github.com>
Date: Fri Feb 6 18:44:55 2026 -0500
[rocsparse] Add bfloat16 and complex-types tests for code coverage (#4204)
## Motivation
Add bfloat16 and complex-types tests for code coverage. Also renames the
atomic_add tests to belong to pre_checkin so that they will be run as
part of code coverage pre_checkin tests.
commit 287fbc900071d5f9f8df7efdf1cfd25d9c8ea338
Author: Kerry Wang <kerrywang369@gmail.com>
Date: Fri Feb 6 15:55:22 2026 -0600
don't include waitcnts; renames
commit f52966a377bfd26725f35f103fbc7975cd9b4ec9
Author: Yiqian Liu <157505981+liu-yiqian@users.noreply.github.com>
Date: Fri Feb 6 15:50:20 2026 -0600
[rocRoller] Explicitly convert when typeAcc differs with typeD (#3977)
## Motivation
This PR adds an explicit data type conversion when the accumulator type
differs from the type of matrix D. The goal is for the rocRoller client
to perform the same operation as hipBLASLt.
## Technical Details
Added a convert operation when Accumulator is not the same type as
matrix D.
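As a rough host-side illustration only (rocRoller expresses this in its own kernel graph, which is not shown here), the rule "convert only when the types differ" can be sketched with a hypothetical `store_result` helper:

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical helper: produce a value of the output type D from an
// accumulator of type Acc, converting only when the types differ.
template <typename D, typename Acc>
constexpr D store_result(Acc acc) {
    if constexpr (std::is_same_v<D, Acc>) {
        return acc;                  // same type: nothing to convert
    } else {
        return static_cast<D>(acc);  // different type: explicit conversion
    }
}
```

Making the conversion explicit keeps the operation visible as its own step, rather than relying on an implicit narrowing at the store.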
## Test Plan
All the existing tests should pass.
This change should not affect the performance.
## Test Result
Existing tests passed.
---------
Co-authored-by: yiqialiu <yiqialiu@amd.com>
commit 323a8d256e2409de54fa9dfa9523f4c50295c305
Author: Kerry Wang <kerrywang369@gmail.com>
Date: Fri Feb 6 15:28:17 2026 -0600
minor quality improvements
commit 6c4a1fb6d0f2eff761cd95e690f3ef0090266367
Author: Ali Yazdani <ayazdani@amd.com>
Date: Fri Feb 6 14:24:53 2026 -0700
[Origami] AutoWgm for NonTemporal Kernels. (#4218)
AIGESOLSEL-71
## Motivation
This PR enhances the Origami workgroup mapping (WGM) selection logic to
support nontemporal kernels and improve automatic WGM value selection.
Previously, nontemporal cases (NTA/NTB > 3) were excluded from automatic
workgroup mapping optimizations, limiting potential performance.
## Technical Details
1. Enabling NonTemporal support in AutoWGM with enhanced logic
2. Improved WGM Candidate Generation
## Test Plan
CI, and locally ran performance tests.
## Test Result
Performance benchmarks show uplifts coming from changes.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 50e39459b541f978edd9acf645792cf496c16bea
Author: Kerry Wang <kerrywang369@gmail.com>
Date: Fri Feb 6 15:21:16 2026 -0600
add string-based observer test
commit 16b3b1840d61871c116c9ae80cf59324233377d5
Author: Torre Zuk <42548444+TorreZuk@users.noreply.github.com>
Date: Fri Feb 6 14:19:00 2026 -0700
[rocBLAS] Users/torrezuk/rocm 1157 amd smi rocblas (#4353)
## Motivation
Deprecated dependency rocm-smi replaced by amd-smi
## Test Plan
Test frequency reporting using rocblas-bench with environment variable set
commit 45b616b1e6df1b1d3816a2f03a05a0f6ab754652
Author: Illia Silin <98187287+illsilin@users.noreply.github.com>
Date: Fri Feb 6 10:17:02 2026 -0800
[CK] fix path for build filter (#4375)
## Motivation
Fix the filter that determines whether CI builds are necessary.
## Technical Details
A script inspects the file list returned by git diff to determine whether
any source code was modified. If only documentation changed, the builds
can be skipped. The check considers only changes under the
projects/composablekernel/ folder.
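The decision logic can be sketched as follows. This is not the CI script itself. In particular, what counts as "documentation" is an assumption here (files under a `docs/` directory, or with a `.md` or `.rst` extension), and `build_required` is a hypothetical name.

```cpp
#include <cassert>
#include <string>
#include <vector>

inline bool starts_with(const std::string& s, const std::string& prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

inline bool ends_with(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Assumption: documentation means docs/ directories and .md/.rst files.
inline bool is_documentation(const std::string& path) {
    return path.find("/docs/") != std::string::npos ||
           ends_with(path, ".md") || ends_with(path, ".rst");
}

// A build is needed only if some changed file lies under the project
// folder and is not documentation; changes elsewhere are ignored.
inline bool build_required(const std::vector<std::string>& changed_files) {
    const std::string project = "projects/composablekernel/";
    for (const auto& path : changed_files) {
        if (starts_with(path, project) && !is_documentation(path)) {
            return true;
        }
    }
    return false;
}
```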
commit d8e2826bedff1183eaedeb6d6f5b2eeaa65dab7b
Author: Geo Min <geomin12@amd.com>
Date: Fri Feb 6 09:59:29 2026 -0800
[ci] Adding mi350 required group ID (#4378)
After updating mi325 group-id, we are noticing errors for mi350.
Tested here for mi350:
https://github.com/ROCm/TheRock/actions/runs/21733399385/job/62692971370
Tested here for mi325:
https://github.com/ROCm/TheRock/actions/runs/21759203211/job/62778060417
Adding both works properly
commit 78497b37bb4e5853b5da3feb96381c6b643556f7
Author: bibek <108366729+bghimireamd@users.noreply.github.com>
Date: Fri Feb 6 10:46:48 2026 -0600
Fix intermittent kernel compilation failures in BnFwdTrainingSpatial (#4202)
## Motivation
Fix kernel compilation failures in BnFwdTrainingSpatial caused by two
related issues:
1. Uninitialized local size variables producing random garbage values
2. Missing compile-time guard for warp-reduction LDS arrays when
workgroup < 64 threads
## To Reproduce
```bash
rm -rf ~/.config/miopen/*.udb.txt ~/.cache/miopen/
rm -rf /tmp/.config/miopen/ /tmp/.cache/miopen/
MIOPEN_FIND_ENFORCE=SEARCH MIOPEN_LOG_LEVEL=5 ./bin/MIOpenDriver bnorm -n 1024 -c 64 -H 13 -W 13 -m 1 --forw 1 -s 1 -V 1
```
## Technical Details
### Bug 1: Uninitialized Variables (Host)
Variables declared without initialization:
```cpp
size_t xlocalsize, xgridsize; // uninitialized
```
For Variants 0/1/3, early returns skip initialization, leaving garbage
values that propagate to kernel template parameters. Depending on stack
memory state, errors include:
- `error: array is too large (18446744073709545792 elements)`
- `error: variable length array declaration cannot have 'static' storage
duration`
- `error: zero-length arrays are not permitted in HIP device code`
### Bug 2: Missing Compile-Time Guard (Kernel)
The warp-reduction path divides LDS size by 64:
```cpp
__shared__ FpAccumCType lcl_data_x[MIO_BN_GRP0_FINAL * MIO_BN_GRP1_FINAL * MIO_BN_GRP2_FINAL / 64];
```
When Variants 0/1/3 set workgroup size to 1×1×1, this computes `1/64 =
0`, producing illegal zero-length arrays. Note that `if constexpr` only
suppresses template instantiation, not parsing of ill-formed code like
zero-length arrays.
## Fix
### Host side (`common_spatial.hpp`, `forward_spatial.cpp`)
Initialize variables to safe defaults:
```cpp
size_t xlocalsize = 1, xgridsize = 1;
size_t ylocalsize = 1, ygridsize = 1;
size_t zlocalsize = 1, zgridsize = 1;
size_t nelements = 1;
unsigned int ldsgcn = 0, ldsnogcn = 0;
```
### Kernel side (`MIOpenBatchNormFwdTrainSpatial.cpp`)
Use C++17 constexpr ternary to ensure array size is always ≥ 1:
```cpp
else
{
// C++17 idiomatic: ensure array size is never zero using constexpr ternary
constexpr auto grp_final_total =
MIO_BN_GRP0_FINAL * MIO_BN_GRP1_FINAL * MIO_BN_GRP2_FINAL;
constexpr auto lds_gcn_array_size = grp_final_total >= 64 ? grp_final_total / 64 : 1;
commitID = 64;
__shared__ FpAccumCType lcl_data_x[lds_gcn_array_size];
__shared__ FpAccumCType lcl_data_y[lds_gcn_array_size];
miopen::reduction::gcn_reduce2(...);
}
```
__Why this works:__
- `constexpr` ensures compile-time evaluation (zero runtime overhead)
- When workgroup ≥ 64: array size = `grp_final_total / 64` (correct,
same as before)
- When workgroup < 64: array size = 1 (valid), but this `else` branch is
never taken due to `if constexpr` guard
- Dead code elimination removes the unused size-1 arrays from the binary
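The same guard can be checked on the host with a small stand-alone sketch. `kWaveSize`, `lds_array_size`, and `ReductionScratch` are hypothetical names; the real kernel uses the `MIO_BN_GRP*_FINAL` macros and `__shared__` arrays shown above.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kWaveSize = 64;

// Number of per-wave slots, clamped so it can never be zero.
constexpr std::size_t lds_array_size(std::size_t workgroup_total) {
    return workgroup_total >= kWaveSize ? workgroup_total / kWaveSize : 1;
}

// Stand-in for the kernel's scratch arrays: the array bound is always
// a valid, non-zero compile-time constant.
template <std::size_t G0, std::size_t G1, std::size_t G2>
struct ReductionScratch {
    static constexpr std::size_t total = G0 * G1 * G2;
    static constexpr std::size_t size = lds_array_size(total);
    float data[size];
};

// A 1x1x1 workgroup would have produced 1 / 64 == 0 without the clamp.
static_assert(ReductionScratch<1, 1, 1>::size == 1, "clamped to one slot");
static_assert(ReductionScratch<256, 1, 1>::size == 4, "unchanged when >= 64");
```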
## Test Plan
- Existing batchnorm training tests pass
- Verified no zero-length array errors with deterministic workgroup
sizes
- Confirmed warp-reduction path only executes when workgroup ≥ 64
threads
commit 8f8b97a40d36cb4095e929b0ef1b71ffea7ba170
Author: SreecharanGundaboluAMD <sgundabo@amd.com>
Date: Fri Feb 6 08:18:12 2026 -0800
[miopen] upgrade clang-format (#4194)
This PR updates the project's code formatting tooling to use
`clang-format-18` instead of `clang-format-12` throughout the codebase,
as part of the transition to TheRock for our CI.
**Tooling and Configuration Updates:**
* Updated all references to `clang-format-12` to `clang-format-18` in
the pre-commit hook (`.githooks/pre-commit`), CMake configuration
(`ClangCheck.cmake`), and Dockerfile (`Dockerfile`). The Dockerfile now
also adds the appropriate LLVM 18 repository and keyring for
installation.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit b34fa62134425a611b18c05aae687f1bc09c7d87
Author: BrianHarrisonAMD <169072757+BrianHarrisonAMD@users.noreply.github.com>
Date: Fri Feb 6 08:57:01 2026 -0700
[hipDNN] Remove spdlog dependency for consumers of hipdnn (#4312)
## Motivation
Draft of changes to remove spdlog and fmt dependencies from hipDNN
frontend & consumer facing SDKs.
Consumers of hipDNN can have conflicts as a result of these
dependencies.
Removing these extra dependencies, and relying on the C API logging
methods from hipDNN backend will reduce friction, and make it easier for
adoption of hipDNN.
## Technical Details
- Remove spdlog and fmt from frontend + consumer SDKs.
- Add new string stream style formatter that will forward to backend
logging API callback
- Note: since we are set at C++17 standards, we cannot use std::format.
- Add Spdlog and fmt optional dependencies as opt in for plugin_sdk
- This enables existing plugins to maintain logging style they have in
place, and allows plugin authors to decide what style of logging to use.
- Unify consumption of the spdlog and fmt dependencies in CMake using a
single method
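A string-stream style formatter of this kind might look roughly like the sketch below. `LogCallback` and `LogMessage` are hypothetical names, and the `std::function` sink stands in for the backend's C logging callback, whose real signature is not shown here.

```cpp
#include <cassert>
#include <functional>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical sink type standing in for the backend logging callback.
using LogCallback = std::function<void(const std::string&)>;

// Collects streamed values and forwards the finished message to the
// sink when the temporary is destroyed at the end of the statement.
class LogMessage {
public:
    explicit LogMessage(LogCallback sink) : sink_(std::move(sink)) {}
    LogMessage(const LogMessage&) = delete;
    LogMessage& operator=(const LogMessage&) = delete;

    ~LogMessage() {
        if (sink_) {
            sink_(stream_.str());
        }
    }

    template <typename T>
    LogMessage& operator<<(const T& value) {
        stream_ << value;
        return *this;
    }

private:
    LogCallback sink_;
    std::ostringstream stream_;
};
```

A call site would then read `LogMessage(sink) << "engine=" << 42;`, which needs only the standard library and so avoids both `fmt` and `std::format`.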
## Test Plan
- Ensure build and tests are working for all components
- Ensure tests with logs enabled are working properly with expected
format
- Ensure build of samples is working, and logging format is correct
## Test Result
The build, the tests, and testing with logging enabled all work locally
for hipDNN, providers, and samples
commit 0c37fdc37ba545b2ced5211b9f59c2381fc93753
Author: amd-chunxlin <chunxlin@amd.com>
Date: Fri Feb 6 09:44:39 2026 -0600
[rocRoller] Address long StreamK test runtimes (#4095)
## Motivation
Some StreamK tests take a long time to finish; this PR addresses the
performance issue.
<details>
<summary>Comparison of test runtimes</summary>
| Test name | Develop branch | This branch |
| --- | --- | --- |
| GPU_BasicGEMMStreamKWorkgroupMapping/0 | 14.412s | 4.059s|
| GPU_BasicGEMMStreamKWorkgroupMapping/1 | 14.447s | 4.018s|
| GPU_BasicGEMMStreamKWorkgroupMapping/2 | 14.452s | 4.026s|
| GPU_BasicGEMMStreamKWorkgroupMapping/3 | 14.444s | 4.042s|
| GPU_BasicGEMMStreamKWorkgroupMapping/4 | 38.872s | 9.446s|
| GPU_BasicGEMMStreamKWorkgroupMapping/5 | 38.828s | 9.445s|
| GPU_BasicGEMMStreamKWorkgroupMapping/6 | 38.913s | 9.446s|
| GPU_BasicGEMMStreamKWorkgroupMapping/7 | 38.812s | 9.435s|
| GPU_BasicGEMMStreamKWorkgroupMapping/8 | 38.878s | 9.456s|
| GPU_BasicGEMMStreamKWorkgroupMapping/9 | 38.889s | 9.463s|
| GPU_BasicGEMMStreamKWorkgroupMapping/10 | 38.884s | 9.463s|
| GPU_BasicGEMMStreamKWorkgroupMapping/11 | 38.859s | 9.463s|
| GPU_BasicGEMMStreamKWorkgroupMapping/12 | 14.414s | 4.035s|
| GPU_BasicGEMMStreamKWorkgroupMapping/13 | 14.429s | 4.024s|
| GPU_BasicGEMMStreamKWorkgroupMapping/14 | 14.47s | 4.045s|
| GPU_BasicGEMMStreamKWorkgroupMapping/15 | 14.428s | 4.044s|
| GPU_BasicGEMMStreamKWorkgroupMapping/16 | 38.854s | 9.454s|
| GPU_BasicGEMMStreamKWorkgroupMapping/17 | 38.861s | 9.448s|
| GPU_BasicGEMMStreamKWorkgroupMapping/18 | 38.95s | 9.461s|
| GPU_BasicGEMMStreamKWorkgroupMapping/19 | 38.826s | 9.458s|
| GPU_BasicGEMMStreamKWorkgroupMapping/20 | 38.833s | 9.48s|
| GPU_BasicGEMMStreamKWorkgroupMapping/21 | 38.888s | 9.472s|
| GPU_BasicGEMMStreamKWorkgroupMapping/22 | 38.868s | 9.477s|
| GPU_BasicGEMMStreamKWorkgroupMapping/23 | 38.907s | 9.485s|
| GPU_BasicGEMMStreamKWorkgroupMapping/24 | 14.426s | 4.025s|
| GPU_BasicGEMMStreamKWorkgroupMapping/25 | 14.435s | 4.051s|
| GPU_BasicGEMMStreamKWorkgroupMapping/26 | 14.472s | 4.053s|
| GPU_BasicGEMMStreamKWorkgroupMapping/27 | 14.471s | 4.058s|
| GPU_BasicGEMMStreamKWorkgroupMapping/28 | 38.879s | 9.457s|
| GPU_BasicGEMMStreamKWorkgroupMapping/29 | 38.814s | 9.445s|
| GPU_BasicGEMMStreamKWorkgroupMapping/30 | 38.853s | 9.45s|
| GPU_BasicGEMMStreamKWorkgroupMapping/31 | 38.963s | 9.458s|
| GPU_BasicGEMMStreamKWorkgroupMapping/32 | 38.924s | 9.466s|
| GPU_BasicGEMMStreamKWorkgroupMapping/33 | 38.898s | 9.482s|
| GPU_BasicGEMMStreamKWorkgroupMapping/34 | 38.951s | 9.455s|
| GPU_BasicGEMMStreamKWorkgroupMapping/35 | 38.924s | 9.459s|
| GPU_BasicGEMMStreamKWorkgroupMapping/36 | 14.461s | 4.037s|
| GPU_BasicGEMMStreamKWorkgroupMapping/37 | 14.452s | 4.032s|
| GPU_BasicGEMMStreamKWorkgroupMapping/38 | 14.43s | 4.053s|
| GPU_BasicGEMMStreamKWorkgroupMapping/39 | 14.43s | 4.042s|
| GPU_BasicGEMMStreamKWorkgroupMapping/40 | 38.868s | 9.473s|
| GPU_BasicGEMMStreamKWorkgroupMapping/41 | 38.925s | 9.461s|
| GPU_BasicGEMMStreamKWorkgroupMapping/42 | 38.884s | 9.452s|
| GPU_BasicGEMMStreamKWorkgroupMapping/43 | 38.925s | 9.455s|
| GPU_BasicGEMMStreamKWorkgroupMapping/44 | 39.012s | 9.476s|
| GPU_BasicGEMMStreamKWorkgroupMapping/45 | 38.915s | 9.479s|
| GPU_BasicGEMMStreamKWorkgroupMapping/46 | 38.933s | 9.457s|
| GPU_BasicGEMMStreamKWorkgroupMapping/47 | 38.936s | 9.469s|
| GPU_BasicGEMMStreamKWorkgroupMapping/48 | 14.461s | 4.041s|
| GPU_BasicGEMMStreamKWorkgroupMapping/49 | 14.468s | 4.049s|
| GPU_BasicGEMMStreamKWorkgroupMapping/50 | 14.466s | 4.046s|
| GPU_BasicGEMMStreamKWorkgroupMapping/51 | 14.479s | 4.038s|
| GPU_BasicGEMMStreamKWorkgroupMapping/52 | 38.907s | 9.473s|
| GPU_BasicGEMMStreamKWorkgroupMapping/53 | 38.914s | 9.471s|
| GPU_BasicGEMMStreamKWorkgroupMapping/54 | 38.885s | 9.463s|
| GPU_BasicGEMMStreamKWorkgroupMapping/55 | 38.891s | 9.463s|
| GPU_BasicGEMMStreamKWorkgroupMapping/56 | 38.859s | 9.472s|
| GPU_BasicGEMMStreamKWorkgroupMapping/57 | 38.899s | 9.475s|
| GPU_BasicGEMMStreamKWorkgroupMapping/58 | 38.936s | 9.47s|
| GPU_BasicGEMMStreamKWorkgroupMapping/59 | 38.952s | 9.472s|
| GPU_BasicGEMMStreamKWorkgroupMapping/60 | 14.485s | 4.036s|
| GPU_BasicGEMMStreamKWorkgroupMapping/61 | 14.419s | 4.031s|
| GPU_BasicGEMMStreamKWorkgroupMapping/62 | 14.455s | 4.035s|
| GPU_BasicGEMMStreamKWorkgroupMapping/63 | 14.488s | 4.045s|
| GPU_BasicGEMMStreamKWorkgroupMapping/64 | 38.846s | 9.462s|
| GPU_BasicGEMMStreamKWorkgroupMapping/65 | 38.923s | 9.475s|
| GPU_BasicGEMMStreamKWorkgroupMapping/66 | 38.884s | 9.463s|
| GPU_BasicGEMMStreamKWorkgroupMapping/67 | 38.93s | 9.441s|
| GPU_BasicGEMMStreamKWorkgroupMapping/68 | 38.9s | 9.473s|
| GPU_BasicGEMMStreamKWorkgroupMapping/69 | 38.914s | 9.477s|
| GPU_BasicGEMMStreamKWorkgroupMapping/70 | 38.961s | 9.487s|
| GPU_BasicGEMMStreamKWorkgroupMapping/71 | 38.93s | 9.489s|
| GPU_BasicGEMMStreamKWorkgroupMapping/72 | 14.432s | 4.051s|
| GPU_BasicGEMMStreamKWorkgroupMapping/73 | 14.442s | 4.042s|
| GPU_BasicGEMMStreamKWorkgroupMapping/74 | 14.44s | 4.036s|
| GPU_BasicGEMMStreamKWorkgroupMapping/75 | 14.456s | 4.042s|
| GPU_BasicGEMMStreamKWorkgroupMapping/76 | 38.891s | 9.486s|
| GPU_BasicGEMMStreamKWorkgroupMapping/77 | 38.938s | 9.444s|
| GPU_BasicGEMMStreamKWorkgroupMapping/78 | 38.919s | 9.448s|
| GPU_BasicGEMMStreamKWorkgroupMapping/79 | 38.858s | 9.455s|
| GPU_BasicGEMMStreamKWorkgroupMapping/80 | 38.878s | 9.491s|
| GPU_BasicGEMMStreamKWorkgroupMapping/81 | 38.975s | 9.478s|
| GPU_BasicGEMMStreamKWorkgroupMapping/82 | 38.945s | 9.496s|
| GPU_BasicGEMMStreamKWorkgroupMapping/83 | 38.905s | 9.477s|
| GPU_BasicGEMMStreamKWorkgroupMapping/84 | 14.433s | 4.033s|
| GPU_BasicGEMMStreamKWorkgroupMapping/85 | 14.41s | 4.038s|
| GPU_BasicGEMMStreamKWorkgroupMapping/86 | 14.478s | 4.044s|
| GPU_BasicGEMMStreamKWorkgroupMapping/87 | 14.473s | 4.035s|
| GPU_BasicGEMMStreamKWorkgroupMapping/88 | 38.927s | 9.49s|
| GPU_BasicGEMMStreamKWorkgroupMapping/89 | 38.938s | 9.472s|
| GPU_BasicGEMMStreamKWorkgroupMapping/90 | 38.913s | 9.454s|
| GPU_BasicGEMMStreamKWorkgroupMapping/91 | 38.835s | 9.463s|
| GPU_BasicGEMMStreamKWorkgroupMapping/92 | 38.901s | 9.494s|
| GPU_BasicGEMMStreamKWorkgroupMapping/93 | 38.864s | 9.488s|
| GPU_BasicGEMMStreamKWorkgroupMapping/94 | 38.839s | 9.508s|
| GPU_BasicGEMMStreamKWorkgroupMapping/95 | 38.921s | 9.487s|
| GPU_BasicGEMMStreamKWorkgroupMapping/96 | 14.445s | 4.05s|
| GPU_BasicGEMMStreamKWorkgroupMapping/97 | 14.432s | 4.041s|
| GPU_BasicGEMMStreamKWorkgroupMapping/98 | 14.472s | 4.055s|
| GPU_BasicGEMMStreamKWorkgroupMapping/99 | 14.449s | 4.039s|
| GPU_BasicGEMMStreamKWorkgroupMapping/100 | 38.87s | 9.485s|
| GPU_BasicGEMMStreamKWorkgroupMapping/101 | 38.866s | 9.474s|
| GPU_BasicGEMMStreamKWorkgroupMapping/102 | 38.836s | 9.471s|
| GPU_BasicGEMMStreamKWorkgroupMapping/103 | 38.847s | 9.464s|
| GPU_BasicGEMMStreamKWorkgroupMapping/104 | 38.985s | 9.468s|
| GPU_BasicGEMMStreamKWorkgroupMapping/105 | 38.928s | 9.469s|
| GPU_BasicGEMMStreamKWorkgroupMapping/106 | 38.866s | 9.496s|
| GPU_BasicGEMMStreamKWorkgroupMapping/107 | 38.92s | 9.48s|
| GPU_BasicGEMMStreamKWorkgroupMapping/108 | 14.452s | 4.046s|
| GPU_BasicGEMMStreamKWorkgroupMapping/109 | 14.448s | 4.044s|
| GPU_BasicGEMMStreamKWorkgroupMapping/110 | 14.469s | 4.044s|
| GPU_BasicGEMMStreamKWorkgroupMapping/111 | 14.457s | 4.052s|
| GPU_BasicGEMMStreamKWorkgroupMapping/112 | 38.895s | 9.474s|
| GPU_BasicGEMMStreamKWorkgroupMapping/113 | 38.901s | 9.453s|
| GPU_BasicGEMMStreamKWorkgroupMapping/114 | 38.882s | 9.467s|
| GPU_BasicGEMMStreamKWorkgroupMapping/115 | 38.861s | 9.446s|
| GPU_BasicGEMMStreamKWorkgroupMapping/116 | 38.945s | 9.498s|
| GPU_BasicGEMMStreamKWorkgroupMapping/117 | 38.94s | 9.488s|
| GPU_BasicGEMMStreamKWorkgroupMapping/118 | 38.881s | 9.488s|
| GPU_BasicGEMMStreamKWorkgroupMapping/119 | 38.911s | 9.486s|
| GPU_BasicGEMMStreamKWorkgroupMapping/120 | 14.434s | 4.044s|
| GPU_BasicGEMMStreamKWorkgroupMapping/121 | 14.455s | 4.061s|
| GPU_BasicGEMMStreamKWorkgroupMapping/122 | 14.464s | 4.067s|
| GPU_BasicGEMMStreamKWorkgroupMapping/123 | 14.488s | 4.058s|
| GPU_BasicGEMMStreamKWorkgroupMapping/124 | 38.932s | 9.499s|
| GPU_BasicGEMMStreamKWorkgroupMapping/125 | 38.868s | 9.488s|
| GPU_BasicGEMMStreamKWorkgroupMapping/126 | 38.863s | 9.488s|
| GPU_BasicGEMMStreamKWorkgroupMapping/127 | 38.899s | 9.467s|
| GPU_BasicGEMMStreamKWorkgroupMapping/128 | 38.948s | 9.511s|
| GPU_BasicGEMMStreamKWorkgroupMapping/129 | 38.915s | 9.502s|
| GPU_BasicGEMMStreamKWorkgroupMapping/130 | 38.932s | 9.508s|
| GPU_BasicGEMMStreamKWorkgroupMapping/131 | 38.914s | 9.491s|
| GPU_BasicGEMMStreamKWorkgroupMapping/132 | 14.42s | 4.031s|
| GPU_BasicGEMMStreamKWorkgroupMapping/133 | 14.464s | 4.054s|
| GPU_BasicGEMMStreamKWorkgroupMapping/134 | 14.494s | 4.048s|
| GPU_BasicGEMMStreamKWorkgroupMapping/135 | 14.48s | 4.046s|
| GPU_BasicGEMMStreamKWorkgroupMapping/136 | 38.911s | 9.49s|
| GPU_BasicGEMMStreamKWorkgroupMapping/137 | 38.91s | 9.478s|
| GPU_BasicGEMMStreamKWorkgroupMapping/138 | 38.914s | 9.478s|
| GPU_BasicGEMMStreamKWorkgroupMapping/139 | 38.92s | 9.475s|
| GPU_BasicGEMMStreamKWorkgroupMapping/140 | 38.883s | 9.508s|
| GPU_BasicGEMMStreamKWorkgroupMapping/141 | 38.935s | 9.5s|
| GPU_BasicGEMMStreamKWorkgroupMapping/142 | 38.848s | 9.494s|
| GPU_BasicGEMMStreamKWorkgroupMapping/143 | 38.932s | 9.502s|
</details>
## Technical Details
- Short-circuit expression comparison
- Remove unused code
- Caching expressions of kernel arguments to eliminate redundant
regeneration.
- Change `AssemblyKernelArgument` to a `class`
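The caching item can be pictured with a generic memoization sketch. This is an assumption about the general technique, not rocRoller's implementation: `ExpressionCache` is a hypothetical class, and strings stand in for the real expression objects.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical cache: generate the expression for a kernel argument
// once, then reuse it on every later request.
class ExpressionCache {
public:
    using Generator = std::function<std::string(const std::string&)>;

    explicit ExpressionCache(Generator generate) : generate_(std::move(generate)) {}

    const std::string& get(const std::string& arg_name) {
        auto it = cache_.find(arg_name);
        if (it == cache_.end()) {
            // First request for this argument: generate and remember it.
            it = cache_.emplace(arg_name, generate_(arg_name)).first;
        }
        return it->second;
    }

private:
    Generator generate_;
    std::unordered_map<std::string, std::string> cache_;
};
```

Repeated lookups of the same argument then cost a hash lookup instead of a regeneration.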
## Test Plan
No functional changes; covered by existing tests.
## Test Result
See CI report
commit 66e22ac6c6c0c286325cabf2b1faa269ea640446
Author: hcman2 <52367956+hcman2@users.noreply.github.com>
Date: Fri Feb 6 10:55:08 2026 +0800
[formocast] [tensilelite] enable tuning with formocast (#4043)
## Motivation
Integration plan for formocast and origami:
Steps
1. Push Formocast code to the origami subfolder. (we are here now)
2. Submit tuning code calling the API of origami.
3. Push and reserve origami prediction mode. Add APIs to pass sizemapping
data via config_t. Use an environment variable to switch modes without
affecting other code.
4. Enable predictionThreshold with tox tests.
5. Push Origami code with Formocast backend. This step will enable the
bench with different modes.
6. Refine Formocast and Origami to verify API usage and identify
functions to move.
This PR covers steps 2 and 4.
## Technical Details
## Test Plan
## Test Result
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Henry Ho <hehe790223@gmail.com>
Co-authored-by: Peter Cheng <Peter.Cheng@amd.com>
commit 37a74ef54eaa1bb1df603db6ec8aff22f342bc71
Author: Illia Silin <98187287+illsilin@users.noreply.github.com>
Date: Thu Feb 5 17:06:57 2026 -0800
[CK] a bunch of CI fixes. (#4361)
## Motivation
Fixing some of the CK CI issues
## Technical Details
fixing paths to dockerfiles and scripts;
moving codegen tests to separate stage (collides with main build since
you must call cmake from same folder but different options);
fixing a couple of clang compilation issues with staging compiler;
commit 808e9496d17be0826164dbb34457f54e157dd2bd
Author: Kerry Wang <kerrywang369@gmail.com>
Date: Thu Feb 5 17:00:01 2026 -0600
refactor observers to have new runtime with context concept
commit 3c9beb38b8dba1301a961cc5dc3f44ca9d4185e3
Author: Eiden Yoshida <47196116+eidenyoshida@users.noreply.github.com>
Date: Thu Feb 5 17:56:12 2026 -0500
[CK] MICI: Fix git diff in selective_test_filter.py (#4352)
## Motivation
- git diff needs access to reference repo
## Technical Details
- mount reference repo path into docker for selective_test_filter.py to
access
## Test Plan
- tested in MICI
## Test Result
- launch_tests.sh ran successfully
commit 1663ac026d46c3dd02edb73d2bfa7310c54695d5
Author: Torre Zuk <42548444+TorreZuk@users.noreply.github.com>
Date: Thu Feb 5 14:20:45 2026 -0700
[rocBLAS] trsm doc & test; trsv change noted in log (#4198)
## Motivation
Tests trsm use of new trsv kernel for big batches
Adds a changelog note on trsv, which used to call hipGetDevice and
potentially hipSetDevice
commit d8bb9d2b9fe278d07fe63b395d87268c94e53fd8
Author: Jeffrey Novotny <jnovotny@amd.com>
Date: Thu Feb 5 14:54:48 2026 -0500
[rocsolver] Doxygen API cleanup part 2 (#4330)
## Motivation
Continuing with rocSOLVER API/Doxygen copy edits and polishing
## Technical Details
Edit Doxygen comments in header file.
## Test Plan
Build locally.
## Test Result
NA
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 5aa1f1d4c189f779ea699be250fb1b284f3d6ac2
Author: Geo Min <geomin12@amd.com>
Date: Thu Feb 5 11:01:53 2026 -0800
[ci] Updating variable group-id for OSSCI (#4360)
OSSCI migrated mi325s, so need a new groupID
Sanity works here:
https://github.com/ROCm/TheRock/actions/runs/21723540679/job/62659665907
normal run works here:
https://github.com/ROCm/TheRock/actions/runs/21723540679/job/62659791422
I've dabbled with organization variables; however, they do not work
for forks, so for now we will do the manual update
commit 6273d3b30b32d6e0856394d37a421153dceb33c1
Author: Dmitrii Polomin <dmitriy.polomin@dxc.com>
Date: Thu Feb 5 19:53:45 2026 +0100
[MIOpen] Ported solver test to gtest (#3713)
## Motivation
Porting tests from CTest to GTest, in this case, `solver.cpp`
## Technical Details
Pretty straightforward port, although I had to get creative in order to
conform to `INSTANTIATE_TEST_SUITE_P` pattern and naming conventions
## Test Plan
Running locally, using the CI launched by this PR
## Test Result
See CI actions launched by this PR
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit ece0c48dd152856c3b443c12c58a89bc7d7c34e5
Author: Nathan Henderson <nathan.henderson@amd.com>
Date: Thu Feb 5 09:03:42 2026 -0700
[rocroller] Use StreamKMode in hipBLASLt integration and client (#4028)
## Motivation
RocRoller has a `StreamKMode` enum that is used internally and in the
GEMM tests. However, the client and hipBLASLt integration still used
boolean values to represent the StreamK state. This PR replaces the
individual boolean flags (`--streamK`, `--streamKTwoTile`,
`--streamKTwoTileDPFirst`) with a single `--streamK` string option that
accepts one of the `StreamKMode` values (`None`, `Standard`, `TwoTile`,
or `TwoTileDPFirst`).
## Technical Details
- Update the rocRoller GEMM client to use `StreamKMode` enum instead of
three separate booleans
- Update the hipBLASLt rocRoller integration to match
- Add `enumStrings<T>()` utility function in `Utils_impl.hpp` for CLI
validation of enum values
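A sketch of parsing the single `--streamK` value into the enum is shown below. The four mode names come from the description above; `kStreamKModes` and `parse_streamk_mode` are hypothetical and are not the `enumStrings<T>()` utility itself, whose signature is not shown here.

```cpp
#include <array>
#include <cassert>
#include <optional>
#include <string_view>
#include <utility>

enum class StreamKMode { None, Standard, TwoTile, TwoTileDPFirst };

// Hypothetical name table used for CLI validation.
constexpr std::array<std::pair<std::string_view, StreamKMode>, 4> kStreamKModes{{
    {"None", StreamKMode::None},
    {"Standard", StreamKMode::Standard},
    {"TwoTile", StreamKMode::TwoTile},
    {"TwoTileDPFirst", StreamKMode::TwoTileDPFirst},
}};

// Returns the matching mode, or std::nullopt for an unknown string
// (matching is case-sensitive in this sketch).
inline std::optional<StreamKMode> parse_streamk_mode(std::string_view text) {
    for (const auto& [name, mode] : kStreamKModes) {
        if (name == text) {
            return mode;
        }
    }
    return std::nullopt;
}
```

A single enum-valued option also rules out contradictory combinations that three independent booleans would allow.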
## Test Plan
Update `test_gemm_client.py` YAML fixtures to use the new `streamK:
None` format.
## Test Result
Validated by the StreamK rrperf tests
commit 3b98c98a23e76075a6a1e4e580482a627e39d59b
Author: Jobbins <john.robbins@amd.com>
Date: Thu Feb 5 08:56:42 2026 -0700
[composablekernel] fix failure status (#4351)
## Motivation
Pipelines were failing on Math CI status check.
## Technical Details
For the success case, we just changed the config in Jenkins to use a
proper app token and no code changes were required. However, the failure
case would not have worked as coded, so we needed to move that outside
of the `rocmnode()` block.
## Test Plan
I removed all of the CI in one of the commits to quickly test, and then
added it back. Got a successful "success" message and "failure" message
produced
commit 9bb7f5c31253643cd72363314c3d3ee02f723406
Author: Eiden Yoshida <47196116+eidenyoshida@users.noreply.github.com>
Date: Thu Feb 5 10:55:44 2026 -0500
[CK] MICI: Correct path for build trace script (#4349)
## Motivation
- Corrects path to script due to superrepo migration
- Forces all tests to run by default
## Technical Details
- now in /projects/composablekernel
---------
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
commit 120f91dd211117e308b3713593ac7f061cc02c08
Author: bibek <108366729+bghimireamd@users.noreply.github.com>
Date: Thu Feb 5 09:47:16 2026 -0600
[HIPDNN][DOC] Add TYPED_TEST guidance for multi-datatype tests (#4000)
## Motivation
Doc update : add `TYPED_TEST` guidance for multi-datatype tests
Update `.clinerules`, `.cursor/rules/testing.mdc`, and `docs/Testing.md`
to recommend `TYPED_TEST` for tests covering `float`/`half`/`bfloat16`.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Samuel Reeder <41528605+SamuelReeder@users.noreply.github.com>
commit d26a7820b58f789d19efd6064d4c2c4f4fc72a95
Author: Eiden Yoshida <47196116+eidenyoshida@users.noreply.github.com>
Date: Wed Feb 4 21:43:22 2026 -0500
[CK] MICI: Use reference repo for checkout operations (#4336)
## Motivation
- Maintain a reference repo on slave nodes that speeds up any
clone/checkout operations
## Technical Details
- clone a ref repo if it does not exist
- update ref repo if it does exist
- checkout after ref repo is updated
- eliminates double clone
## Test Result
- Initial checkouts succeeded
commit f2f187ab40738272232f571f58112697da405b1a
Author: Geo Min <geomin12@amd.com>
Date: Wed Feb 4 15:43:38 2026 -0800
[ci] Fixing rocm-libs race condition (#4192)
Currently, there is a race condition that overwrites BLAS libraries
during MIOpen/hipdnn builds. (error:
https://github.com/ROCm/rocm-libraries/actions/runs/21228188053/job/61080555083)
Tested locally:
```
# With all three
geom@geom:~/Code/rocm-libraries/.github/scripts$ PLATFORM="linux" python3 therock_matrix.py
[{'cmake_options': '-DTHEROCK_USE_EXTERNAL_COMPOSABLE_KERNEL=ON -DTHEROCK_ENABLE_COMPOSABLE_KERNEL=ON -DTHEROCK_ENABLE_MIOPEN=ON -DTHEROCK_ENABLE_MIOPEN_PLUGIN=ON -DTHEROCK_ENABLE_ALL=OFF -DTHEROCK_COMPOSABLE_KERNEL_SOURCE_DIR=../composable_kernel', 'project_to_test': 'miopen_plugin,miopen,hipdnn'}]
# Only hipdnn
geom@geom:~/Code/rocm-libraries/.github/scripts$ PLATFORM="linux" python3 therock_matrix.py
[{'cmake_options': '-DTHEROCK_USE_EXTERNAL_COMPOSABLE_KERNEL=ON -DTHEROCK_ENABLE_ALL=OFF -DTHEROCK_COMPOSABLE_KERNEL_SOURCE_DIR=../composable_kernel -DTHEROCK_ENABLE_MIOPEN_PLUGIN=ON -DTHEROCK_ENABLE_COMPOSABLE_KERNEL=ON', 'project_to_test': 'hipdnn,miopen_plugin'}]
```
This fixes the error: when all libraries are run, they are combined into
a single entry, which saves resources and avoids overwriting.
commit f34aec25c434b3044b75481d70693af3bf0ade1e
Author: assistant-librarian[bot] <210906412+assistant-librarian[bot]@users.noreply.github.com>
Date: Wed Feb 4 18:25:31 2026 -0500
[CK] Add FP8 KV_BLOCKSCALE support for batch prefill (#4263)
Implement per-page K/V quantization for paged attention:
- Add KV_BLOCKSCALE enum to BlockAttentionQuantScaleEnum
- Use exp2 shift trick to eliminate explicit P scaling overhead
- Prefetch physical page offsets for the KV cache so the loads overlap
with computation
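The exp2 shift trick named in the bullets above rests on the identity 2^x * 2^shift = 2^(x + shift): adding the shift inside the exponent gives the scaled probability without a separate multiply. The following is a minimal host-side sketch of that identity only. The function names and the `row_max` parameter are hypothetical and are not taken from the CK pipeline.

```cpp
#include <cmath>

// Hypothetical helper: scale P = 2^(s - row_max) by 2^shift with an explicit multiply.
inline float scaled_prob_explicit(float s, float row_max, float shift)
{
    return std::exp2(s - row_max) * std::exp2(shift);
}

// Hypothetical helper: fold the shift into the exponent, so no extra multiply is needed.
inline float scaled_prob_shifted(float s, float row_max, float shift)
{
    return std::exp2(s - row_max + shift);
}
```

Both helpers return the same value up to floating-point rounding; the second form is the one that avoids the explicit scaling step.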
## Checklist
- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged
---
🔁 Imported from
[ROCm/composable_kernel#3696](https://github.com/ROCm/composable_kernel/pull/3696)
🧑💻 Originally authored by @Jeff-Huang
---------
Co-authored-by: Jeff Huang <chiachi.huang@amd.com>
Co-authored-by: Illia Silin <Illia.Silin@amd.com>
commit df32df51ea6a9ebfba9a459c77ed82c4877df22b
Author: Yiqian Liu <157505981+liu-yiqian@users.noreply.github.com>
Date: Wed Feb 4 17:12:11 2026 -0600
[rocRoller] [hipblaslt] Enable more workgroup tile sizes for pre-swizzled scale data (#4175)
## Motivation
Pre-swizzle is an optimization that pre-swizzles the scale data to match
the layout that the kernel expects. The purpose of this PR is to add more
possible workgroup tile sizes that support pre-swizzled scale data.
## Technical Details
1. Configure the workgroup tile size to 256 at K dimension when the
input data format is pre-swizzled.
2. Filter out invalid workgroup tile sizes (i.e., the MN dimension is not
a multiple of 32, or the MN dimension is 96).
3. Configure the solution parameters for pre-swizzled input.
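The filtering rules in the list above can be sketched as a small predicate. This is an illustration under stated assumptions, not the rocRoller implementation: the function names are hypothetical, and applying the "multiple of 32, not 96" rule to M and N individually is an assumed reading of "MN dimension".

```cpp
// Hypothetical check for one M or N tile dimension:
// it must be a positive multiple of 32 and must not be 96.
constexpr bool is_valid_mn_dim(int dim)
{
    return dim > 0 && dim % 32 == 0 && dim != 96;
}

// Hypothetical check for a pre-swizzled workgroup tile:
// both M and N pass the rule above and K is fixed at 256.
constexpr bool is_valid_preswizzled_tile(int m, int n, int k)
{
    return is_valid_mn_dim(m) && is_valid_mn_dim(n) && k == 256;
}
```

Under these assumptions, the 32x32x256 tile used by the new client test passes the predicate, while a tile with an MN dimension of 96 or 48 does not.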
## Test Plan
1. Added a rocRoller client test that uses 32x32x256 workgroup tile.
## Test Result
1. This PR should not change any kernel whose data is not
pre-swizzled.
2. All tests should pass, with no performance changes.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Yiqian Liu <yiqialiu@ctr2-alola-ctrl-01.amd.com>
Co-authored-by: yiqialiu <yiqialiu@amd.com>
commit 87d1a8fa005ef2f75e48e5c9c4e70f8235236b03
Author: Samuel Reeder <41528605+SamuelReeder@users.noreply.github.com>
Date: Wed Feb 4 16:10:01 2026 -0700
Use `--latest-release` flag for installing rocm in clang-tidy (#4120)
## Motivation
`--latest-release` was added in TheRock
[2997](https://github.com/ROCm/TheRock/pull/2997) to grab latest nightly
for the specified target. We can use this to keep the clang-tidy
workflow somewhat up-to-date.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit c5…
commit 8c40fb6cac48969d6237cccdcbbbad56b44ff0a3
Author: Kerry Wang <kerrywang369@gmail.com>
Date: Mon Feb 9 20:27:33 2026 -0500
more consistent skip text
commit 557e2764b3a001884a004f0a183a307c4fbc2bd2
Merge: 6bd6e49791 219f365e7b
Author: Kerry Wang <kerrywang369@gmail.com>
Date: Mon Feb 9 18:38:48 2026 -0600
Merge remote-tracking branch 'origin/develop' into users/kerrwang/lds-queue
commit 6bd6e497910e3ba681b22a47630bc5f0dedb16b8
Author: Kerry Wang <kerrywang369@gmail.com>
Date: Mon Feb 9 18:38:31 2026 -0600
fix format
commit 219f365e7bc40c9ce3f5c382228a7b2e14b90520
Author: James Sandham <33790278+jsandham@users.noreply.github.com>
Date: Mon Feb 9 19:08:04 2026 -0500
[hipsparse] Match behaviour of csr2csr_compress from rocsparse (#4420)
## Motivation
In the hipSPARSE test code host solution, we were incorrectly checking
if a value satisfied:
`testing_abs(csr_val_A[j]) > testing_real(tol) &&
testing_abs(csr_val_A[j]) > std::numeric_limits<float>::min()`
instead of the correct criteria:
`testing_abs(csr_val_A[j]) > testing_real(tol)`
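The difference between the two criteria shows up for values whose magnitude is at or below `std::numeric_limits<float>::min()`. The sketch below uses plain `float` and hypothetical function names; it does not reproduce the `testing_abs` / `testing_real` helpers from the hipSPARSE test code.

```cpp
#include <cmath>
#include <limits>

// Previous (incorrect) host criterion: also requires |v| > smallest normal float.
inline bool keep_entry_old(float v, float tol)
{
    return std::fabs(v) > tol && std::fabs(v) > std::numeric_limits<float>::min();
}

// Corrected criterion: keep the entry whenever |v| > tol.
inline bool keep_entry_new(float v, float tol)
{
    return std::fabs(v) > tol;
}
```

With `tol = 0`, a value equal to `std::numeric_limits<float>::min()` is kept by the corrected criterion but was dropped by the previous one.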
commit 698d5d09184a24fde32ab7309fcd88410fc7ff8e
Author: amd-hsong <hao.song@amd.com>
Date: Mon Feb 9 16:40:07 2026 -0700
[rocprim] Fix a call to intrinsics in test_device_reduce_by_key (#4391)
## Motivation
Fix a call to __clzll in test_device_reduce_by_key
## Technical Details
There are a couple of issues in the call to __clzll:
- the argument is cast to `long long`: it should be cast to `unsigned
long long` instead
- in rocprim there exists a wrapper for clz, so for better portability
rocprim::clz should be used instead.
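To illustrate why the operand should be treated as unsigned, here is a portable count-leading-zeros sketch that works on `unsigned long long`. It is a hypothetical stand-in for illustration only: it is not `rocprim::clz` and not the `__clzll` intrinsic, and it assumes a 64-bit `unsigned long long`.

```cpp
// Hypothetical portable count-leading-zeros for a 64-bit unsigned value.
// Returns 64 when x is zero. Assumes unsigned long long is 64 bits wide.
inline int clz64_sketch(unsigned long long x)
{
    int count = 0;
    // Walk from the most significant bit down until a set bit is found.
    for(unsigned long long mask = 1ULL << 63; mask != 0 && (x & mask) == 0; mask >>= 1)
    {
        ++count;
    }
    return count;
}
```

Casting a signed value with `static_cast<unsigned long long>` before the call keeps the bit pattern and makes the shifts and comparisons well defined for inputs with the top bit set.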
## Test Plan
Run test_device_reduce_by_key to verify the test runs correctly.
## Test Result
The test passes.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 557f5baa6d68bb5a8126d9730a8d48983778aac3
Author: Kerry Wang <kerrywang369@gmail.com>
Date: Mon Feb 9 16:53:41 2026 -0600
skip on non-gfx950
commit 8b72bc8759d9c11dfcbf410182fa332152b97e69
Author: bnemanich <brad.nemanich@amd.com>
Date: Mon Feb 9 16:26:51 2026 -0500
[hipBLASLt] Enable custom MXFP4 kernels (#4384)
## Motivation
Allow hipBLASLt to call custom MX FP4 kernels for higher performance.
## Technical Details
A single kernel was added in this PR. The kernel was originally from:
https://github.com/ROCm/aiter/tree/main/hsa/gfx950/f4gemm.
This kernel used a slightly different shuffled scaling layout than
rocRoller. hipBLASLt will only support this new shuffled layout, plus
the original non-shuffled layout. All rocRoller kernels will be disabled
when using shuffled scales for now. Once rocRoller supports the new
layout, they will be added back in.
This PR also adds some new MX datatype generation patterns that were
useful during debugging.
New custom kernels can be added to the custom_kernels directory. They
will also need to be added in the customer_kernels.cpp file that was
added in this PR.
## Test Plan
Check that performance improved when using MXFP4 GEMMs with shuffled
scales.
## Test Result
Performance improved by about 17%.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Andrew Whittle <Andrew.Whittle@amd.com>
Co-authored-by: Bryant Nelson <bryant.nelson@amd.com>
commit 61f9f906dcc0a9d4f6c327fea713aebc6d4b0a1d
Author: Bartłomiej Kocot <barkocot@amd.com>
Date: Mon Feb 9 22:08:57 2026 +0100
[CK] CK Tile grouped convolution direct load (#4406)
## Motivation
CK Tile grouped convolution forward direct load support.
## Technical Details
Basic pipeline for direct load and new instances for forward for v1 and
v4 pipelines.
## Test Plan
test_grouped_convnd_fwd_tile
## Test Result
CI pending
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-130
commit b7f136734ad26314386ca2b4f5a99467804f1bb7
Author: assistant-librarian[bot] <210906412+assistant-librarian[bot]@users.noreply.github.com>
Date: Mon Feb 9 20:58:57 2026 +0000
Enable group mode (varlen) kernel generation for PyTorch integration (#4292)
## Proposed changes
This PR enables group mode (variable-length attention) kernel generation
for PyTorch's CK SDPA backend.
## Checklist
- [X] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [X] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged
## Discussion
The change is minimal (single line deletion) but enables a significant
feature: variable-length attention support for ROCm users via PyTorch's
torch.nn.attention.varlen API.
---
🔁 Imported from
[ROCm/composable_kernel#3553](https://github.com/ROCm/composable_kernel/pull/3553)
🧑💻 Originally authored by @chinmaydk99
Co-authored-by: Chinmay_Kuchinad <ChinmayDattanand.Kuchinad@amd.com>
commit f48a5e63edb7102996b0b769e76114c0bbfd35cf
Author: Mihnea Chirila <37160326+mihnea-chirila@users.noreply.github.com>
Date: Mon Feb 9 14:55:12 2026 -0600
[Tensilelite] Added MIArchVgpr support for Complex Datatypes. (#4332)
## Motivation
Added MIArchVgpr support for Complex Datatypes.
## Technical Details
Fixed AlphaTmpVgpr initialization, and rocisa register offset bug:
- Updated condition to initialize AlphaTmpVgpr if MIArchVgpr parameter
is enabled. Required to generate `MulMIOutAlphaToArch` code
(https://github.com/ROCm/rocm-libraries/blob/c20a85b6c458ef44c1f0e30c35b286a0395fb8fa/projects/hipblaslt/tensilelite/Tensile/KernelWriterModules.py#L251)
regardless of postGSU Accumulation scheme.
- Fixed underlying `Holder` struct bug: it now correctly passes offsets
given as strings to `RegisterContainer`. Required to update the imaginary
register for C/ZGEMM.
(https://github.com/ROCm/rocm-libraries/blob/c20a85b6c458ef44c1f0e30c35b286a0395fb8fa/projects/hipblaslt/tensilelite/Tensile/KernelWriterModules.py#L288)
## Test Plan
Tested for C & Z with MIArchVgpr: [0, 1] on gfx942 and gfx950
## Test Result
Success
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 3de83b9b4035006b5ddd825df9404edc43ca9b39
Author: CMiservaAMD <cmiserva@amd.com>
Date: Mon Feb 9 13:42:54 2026 -0700
[hipDNN] Add integration tests for frontend configuration knobs APIs. (#4307)
Add integration tests to verify correct operation of new hipDNN frontend API
functions for managing engine config settings.
commit 2752a8a5105e11929b876ce0e343bcc73a9cf308
Author: DarylHawkinsAMD <Daryl.Hawkins@amd.com>
Date: Mon Feb 9 13:08:30 2026 -0700
[MIOpen] First set of kernels using CK Builder end to end (#4123)
commit e55f37bad667987f74989bc95e08f86603438963
Author: Mitchell Ousdahl <mitch.ousdahl@amd.com>
Date: Mon Feb 9 10:05:26 2026 -0800
Modified test plugin rpaths (#4350)
## Motivation
In order to successfully get hipDNN added to the python ROCm wheels, the
RPATHs on Linux for the test plugins need to be updated to make them
portable. We will leverage TheRock's existing RPATH update mechanism to
do this.
## Technical Details
- Update all test plugin RPATHs
## Test Plan
- Build ROCm
- Build the wheels
- Use the "Test ROCm Wheel" workflow, which verifies that the test
plugins can load and find their dependencies.
## Test Result
- [ ] "Test ROCm Wheel" workflow succeeds
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit a7485411874b0650b31068364f6b1155d9890212
Author: Muhammad Osama <osama94@gmail.com>
Date: Mon Feb 9 09:30:52 2026 -0800
[Origami] Skip test-selector if torch not found. (#4359)
## Motivation
Makes `torch` completely optional by skipping dependent tests if it is
not found.
## Technical Details
```
# Skip entire module if torch is not available (selector requires torch)
torch = pytest.importorskip("torch", reason="torch is required for OrigamiMatmulSelector tests.")
```
## Test Plan
Run tests using CI + TheRock build.
## Test Result
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 0c5cd629a94a454a350eb651b5921baeb1c82546
Author: Swati Rawat <120587655+SwRaw@users.noreply.github.com>
Date: Mon Feb 9 22:51:13 2026 +0530
Update Tensile CHANGELOG.md (#4164)
## Motivation
## Technical Details
## Test Plan
## Test Result
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 331512e9e13e197d8d7fdf7b72f5b60eb63d7d1e
Author: Bartłomiej Kocot <barkocot@amd.com>
Date: Mon Feb 9 16:36:52 2026 +0100
[CK] Fix grouped conv fwd transform for merged groups (#4399)
## Motivation
[CK] Fix grouped conv fwd transform for merged groups for 1d and 3d.
## Technical Details
After the optimizations for 2D, the implementation for 1D and 3D was
missing.
## Test Plan
test_grouped_convnd_fwd
## Test Result
pending CI
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 1c2927530e176c63cf814b44eb8147e89d2bcaf7
Author: Eiden Yoshida <47196116+eidenyoshida@users.noreply.github.com>
Date: Mon Feb 9 10:23:47 2026 -0500
[CK] MICI: Disable failure pattern checking (#4373)
## Motivation
- ck mici jobs hanging at end, possibly at failure pattern checking
## Technical Details
- Disable failure pattern checking to see if hanging goes away
## Test Plan
- Observe behavior after merge
commit a3058d1dc0b3f176f56fbecd040c2fc48c7258ad
Author: COrruDXC <carlo.orru@dxc.com>
Date: Mon Feb 9 14:02:05 2026 +0100
Reduce boost usage by replacing time calls (#3875)
## Motivation
Reduce boost usage by replacing time calls.
## Technical Details
Replace boost::posix_time related data types with the corresponding
std::chrono data types.
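As a rough illustration of the kind of replacement described, the sketch below measures an elapsed interval with `std::chrono` only. The helper name is hypothetical, and the choice of `steady_clock` and milliseconds is an assumption; the commit does not say which clocks or units the replaced code uses.

```cpp
#include <chrono>

// Hypothetical helper: elapsed whole milliseconds between two time points,
// using only the standard library (no boost::posix_time).
inline long long elapsed_ms(std::chrono::steady_clock::time_point start,
                            std::chrono::steady_clock::time_point end)
{
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}
```

A typical call site would capture `std::chrono::steady_clock::now()` before and after the timed region and pass both values to the helper.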
## Submission Checklist
- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 959bd9393ad9a578711334c40948ac1321e41c1f
Author: Yi-Yao (Alex), Wang <68064688+alex391a@users.noreply.github.com>
Date: Mon Feb 9 17:15:47 2026 +0800
Update gfx942/gfx950 BBS/HHS/I8I8S SPB/SPA logic yaml (#4365)
## Motivation
- Update BBS/HHS/I8I8S SPB/SPA logic yaml for gfx942/gfx950
## Technical Details
## Test Plan
- Run local hipsparselt-test
- Run local tests for all matrix sizes using hipsparselt-bench
## Test Result
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: chiahlin <chiahlin@amd.com>
commit ad03e58dabbf2bbc348c031a06ec73011d85d2c3
Author: Chuck Wu <chuckwu92job@gmail.com>
Date: Mon Feb 9 13:04:38 2026 +0800
[hipblaslt] Fix memory leaks & uninitialized value use (#4338)
## Motivation
According to the
[ROCM-1835](https://amd-hub.atlassian.net/browse/ROCM-1835?focusedCommentId=109304&sourceType=mention),
there are some memory leaks and instances of uninitialized value being
used during the gtest.
<img width="450" height="367" alt="image"
src="https://github.com/user-attachments/assets/2345e1f8-6062-4a5a-b294-97042709b18e"
/>
## Technical Details
1. Add the code to call the corresponding destroy functions for the data
that has not been released yet.
2. Add the default value to compute_input_typeA/B
Flow (before this commit) ->
a. hipblasLtMatmulDescCreate:
- compute_input_typeA = ???
- compute_input_typeB = ???
b. hipblasLtMatmulDescSetAttribute(COMPUTE_INPUT_TYPE_A, HIP_R_16F, ...)
- compute_input_typeA = HIP_R_16F
- compute_input_typeB = ???
- call _matmul_desc_determine_compute_type()
- Read compute_input_typeA & compute_input_typeB
c. hipblasLtMatmulDescSetAttribute(COMPUTE_INPUT_TYPE_B, HIP_R_16F, ...)
- compute_input_typeA = HIP_R_16F
- compute_input_typeB = HIP_R_16F
- call _matmul_desc_determine_compute_type()
- Read compute_input_typeA & compute_input_typeB
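The flow above reads a field before it has ever been written. Giving both fields a default member initializer removes the uninitialized read, as in the sketch below. The enum, struct, and function names are hypothetical stand-ins rather than the hipBLASLt types, and the sentinel default is an assumption, since the commit does not state which default value was chosen.

```cpp
// Hypothetical stand-in for a data type enum; not the hipBLASLt definition.
enum class DataTypeSketch
{
    Invalid = -1,
    F16,
    F32
};

// Hypothetical descriptor: default member initializers guarantee that
// both fields hold a defined value right after creation.
struct MatmulDescSketch
{
    DataTypeSketch compute_input_typeA = DataTypeSketch::Invalid;
    DataTypeSketch compute_input_typeB = DataTypeSketch::Invalid;
};

// Reading both fields is now well defined even if only one has been set.
inline bool both_inputs_set(const MatmulDescSketch& desc)
{
    return desc.compute_input_typeA != DataTypeSketch::Invalid &&
           desc.compute_input_typeB != DataTypeSketch::Invalid;
}
```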
## Test Plan
1. Test command: `valgrind --leak-check=full ./hipblaslt-test
--gtest_filter=_/aux_test.*`
2. Before this commit:
Uninitialized value being used
<img width="691" height="81" alt="image"
src="https://github.com/user-attachments/assets/22a897f1-c25e-4608-850e-0c6bcb5ad0a3"
/>
Memory leaks
<img width="708" height="78" alt="image"
src="https://github.com/user-attachments/assets/02268893-a29a-4db4-95aa-c93385371d5a"
/>
## Test Result
1. The Valgrind errors shown above no longer appear.
2. gtest all pass on Navi3.
<img width="952" height="92" alt="image"
src="https://github.com/user-attachments/assets/6e1b9b01-afc4-4a51-8a9f-e0196fc8495a"
/>
## Submission Checklist
- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[ROCM-1835]:
https://amd-hub.atlassian.net/browse/ROCM-1835?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
commit b7de1e14cea70681a23cd1a136df42910c776e4a
Author: assistant-librarian[bot] <210906412+assistant-librarian[bot]@users.noreply.github.com>
Date: Mon Feb 9 11:54:54 2026 +0800
[CK_TILE] Add blockscale GEMM support for EightWarps on gfx950 (#4280)
## Proposed changes
gemm blockscale eightwarps support
## Checklist
- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [x] Any dependent changes have been merged
---
🔁 Imported from
[ROCm/composable_kernel#3650](https://github.com/ROCm/composable_kernel/pull/3650)
🧑💻 Originally authored by @kensclin
---------
Co-authored-by: KenSCLin <lshyhchy@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
commit 774cfc6410ed55802691fef19a34449182878be5
Author: Ethan <Yi-Chen.Lin@amd.com>
Date: Mon Feb 9 11:39:20 2026 +0800
[hipblaslt] do some debug operations only in debug setting
## Motivation
Single solution selection time has increased slightly.
## Technical Details
Many "assign matchingTag" operations can be avoided when not in debug
(printProperty) mode, but I kept the Equal assignment since it was
already there before #2757
## Test Plan
## Test Result
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit ff3e9821bbe2d14b9874e61ebb518bbbce621ac3
Author: jakpiase <jakpia21@gmail.com>
Date: Sun Feb 8 20:57:14 2026 +0100
[CK_TILE] Add support and tests for V6 pipeline in conv fwd (#4357)
Added support for the conv v6 pipeline in CK Tile's convolution forward
kernel. The CK Tile v6 pipeline is the equivalent of old CK's V5 pipeline
and should be faster than other pipelines in some cases. This PR also
adds tests to the profiler, which currently lives in the experimental
directory, so regressions should now be easier to detect.
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: subhajitdchow <sduttach@amd.com>
commit 591f50450241d6b1965f9f6ee3fe2526ef71ab8d
Author: assistant-librarian[bot] <210906412+assistant-librarian[bot]@users.noreply.github.com>
Date: Sun Feb 8 12:34:59 2026 +0100
[CK] Add fwd conv group merging to v3 conv instances (#4273)
## Proposed changes
Added conv group merging to the (universal) V3 fwd conv pipeline. The
new instance improves fwd conv performance when the number of
input/output channel per group is low.
On MI300 (`gfx942`) we get:
| CK prof command | Baseline (TFLOPS) | V3 group merging (TFLOPS) |
|:-----|:------:|------:|
| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 4 4 3 3 200 200 1 1 1 1 1 1 1 1 | 3.86035 | 8.36796 |
| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 200 200 2 2 1 1 1 1 1 1 | 10.1867 | 13.4677 |
| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 100 100 1 2 1 1 1 1 1 1 | 11.7875 | 16.3657 |
---
🔁 Imported from
[ROCm/composable_kernel#3675](https://github.com/ROCm/composable_kernel/pull/3675)
🧑💻 Originally authored by @vpietila-amd
---------
Co-authored-by: Ville Pietilä <>
Co-authored-by: Ville Pietilä <188998872+vpietila-amd@users.noreply.github.com>
Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
commit cad7fa2c1849b0863ed52ef6cd47198e421d5b6e
Author: BrianHarrisonAMD <169072757+BrianHarrisonAMD@users.noreply.github.com>
Date: Fri Feb 6 23:48:54 2026 -0700
[hipDNN] Fix codecov target breaks (#4374)
## Motivation
code_cov stage for hipDNN was breaking due to environment differences.
Newer clang tooling flags false positives on added files.
Environments that are missing spdlog, but have fmt present were causing
issues due to mixed include expectations.
## Technical Details
- Ignore false positives for lint
- Only add fmt if spdlog was built with external FMT
## Test Plan
- Code cov target builds successfully
## Test Result
Waiting on CI
commit 91627789d86acc7dff4bf5eaafe3b774a7037f76
Author: Koji Nakajima <75698246+nakajee@users.noreply.github.com>
Date: Fri Feb 6 23:44:05 2026 -0700
[hipblaslt] Fix memory access error with DtlPlusLdsBuf (#4303)
## Motivation
Fix a memory access error with DtlPlusLdsBuf
## Technical Details
- generate all GlobalRead Inc code before local read addr swap
## Test Plan
Added a test case in dtl.yaml
## Test Result
Confirmed that the new test failed before the change and passes without
error after the change
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 8b5a98b48c007663765865d2e14247ef1f056b01
Author: Aaron St George <aaronstgeorge@gmail.com>
Date: Fri Feb 6 23:18:13 2026 -0600
[hipDNN] Add `FUSILLI_PLUGIN` to `EngineNames.hpp` (#4362)
## Motivation
Following the approach outlined in
[hipdnn/docs/rfcs/0003_EngineIdDesign.md](https://github.com/ROCm/rocm-libraries/blob/develop/projects/hipdnn/docs/rfcs/0003_EngineIdDesign.md)
this PR defines an engine ID for fusilli in
`hipdnn/data_sdk/include/hipdnn_data_sdk/utilities/EngineNames.hpp`.
## Technical Details
ID + tests defined.
## Test Plan
Test defined in the PR.
## Test Result
Tests pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit e3a9b3f95d29ce626efc3d2134e4e924b1c591a9
Author: James Newling <james.newling@gmail.com>
Date: Fri Feb 6 21:12:55 2026 -0800
[hipblaslt] Raise exception instead of segfaulting (#3995)
## Motivation
Faster problem diagnosis on failure.
## Technical Details
Throw exception if library is nullptr.
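A minimal sketch of the pattern, assuming the library is held in a `std::shared_ptr`. The struct and function names are hypothetical and do not come from hipBLASLt; only the error text matches the "After" log in this commit message.

```cpp
#include <memory>
#include <stdexcept>

// Hypothetical stand-in for a loaded solution library.
struct SolutionLibrarySketch
{
    int solution_count = 0;
};

// Hypothetical accessor: fail loudly with an exception instead of
// dereferencing a null pointer and crashing with a segmentation fault.
inline const SolutionLibrarySketch&
require_library(const std::shared_ptr<SolutionLibrarySketch>& library)
{
    if(!library)
    {
        throw std::runtime_error("Failed to load solution library");
    }
    return *library;
}
```

If nothing catches the exception, the process terminates with the message visible in the output, which is easier to diagnose than a bare segfault.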
## Test
Before:
```
TensileLibrary.yaml:181:31: error: invalid boolean
customMainLoopScheduling: 0
^
[Lots of logging]
Segmentation fault
+ ERR2=139
+ ERR=0
```
After:
```
TensileLibrary.yaml:181:31: error: invalid boolean
customMainLoopScheduling: 0
terminate called after throwing an instance of 'std::runtime_error'
what(): Failed to load solution library
+ ERR2=134
+ ERR=0
```
commit f48eaa54f7395aa8ce4980dcc6725fe38784f7e6
Author: CMiservaAMD <cmiserva@amd.com>
Date: Fri Feb 6 21:48:52 2026 -0700
[hipDNN] Fix a couple log messages in test plugins broken by recent merge. (#4380)
Include correct function name in test plugin log output.
commit 5df3343ecfae6b39201995d8178fe39e061e0c40
Author: Emily Martins <65371150+ecamartins@users.noreply.github.com>
Date: Fri Feb 6 17:26:57 2026 -0700
[CK_TILE] Fix MMA concepts compiler error (#4381)
## Motivation
CK Tile is required to support certain older OSs; on these OSs, C++20
is not fully supported. For ROCm 7.2, compiler errors occur on one of
these older OSs. An example of this error is as follows:
```bash
/composable_kernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp:34:28: error: expected concept name with optional arguments
34 | { MmaOp::kAMBlock } -> std::convertible_to<unsigned int>;
|
```
The goal of this PR is to resolve these compiler errors.
## Technical Details
The existing guards around the mma concepts only check if the concepts
language feature is supported, as follows:
```cpp
#if defined(__cpp_concepts) && __cpp_concepts >= 201907L
// ...
template <typename CtrlFlags>
concept CtrlFlagsGfx9I = requires(CtrlFlags ctrlFlags) {
// Flag members for Gfx9 MFMA instructions
{ CtrlFlags::Cbsz } -> std::convertible_to<int>;
{ CtrlFlags::Abid } -> std::convertible_to<int>;
{ CtrlFlags::Blgp } -> std::convertible_to<int>;
};
#endif // defined(__cpp_concepts) && __cpp_concepts >= 201907L
```
That said, in cases where functionality from the `<concepts>` header is
used (e.g., `std::convertible_to`), this guard fails to check whether
the `<concepts>` header is available.
This change adds an additional check to the concepts that make use of
functionality from the `<concepts>` header to ensure the header is
available.
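One possible way to express such a check is to test the library feature macro `__cpp_lib_concepts` in addition to the language macro `__cpp_concepts`. The sketch below is an assumption about the approach, not the guard added in this PR; `SKETCH_HAS_STD_CONCEPTS`, `CtrlFlagsSketch`, `HasCbszSketch`, and `ctrl_flags_ok` are hypothetical names.

```cpp
// Pull in the library feature-test macros when <version> is available.
#if defined(__has_include)
#if __has_include(<version>)
#include <version>
#endif
#endif

// Require both the concepts language feature and the <concepts> library support.
#if defined(__cpp_concepts) && __cpp_concepts >= 201907L && \
    defined(__cpp_lib_concepts) && __cpp_lib_concepts >= 202002L
#include <concepts>
#define SKETCH_HAS_STD_CONCEPTS 1
#else
#define SKETCH_HAS_STD_CONCEPTS 0
#endif

// Hypothetical flags type used only for this illustration.
struct CtrlFlagsSketch
{
    static constexpr int Cbsz = 0;
};

#if SKETCH_HAS_STD_CONCEPTS
// The concept is only declared when std::convertible_to is actually usable.
template <typename T>
concept HasCbszSketch = requires {
    { T::Cbsz } -> std::convertible_to<int>;
};
constexpr bool ctrl_flags_ok = HasCbszSketch<CtrlFlagsSketch>;
#else
// Without library support the check is skipped rather than failing to compile.
constexpr bool ctrl_flags_ok = true;
#endif
```

On a toolchain with a compiler that supports concepts but a standard library that lacks `<concepts>`, the fallback branch is taken and the translation unit still compiles.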
## Test Plan
I tested the changes on the relevant docker for gfx90a, gfx950, and
gfx942 and the compiler issue is not present.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 06976b37a2f0353b80c71fb3d56bee92bb6b9bab
Author: Aviral Goel <aviral.goel@amd.com>
Date: Sat Feb 7 04:14:28 2026 +0400
Increase tolerance for FP16 GEMM tests to handle non-deterministic rounding (#4335)
Three tests were failing intermittently with small errors (0.01-1.5%)
due to non-deterministic FP16 accumulation order from GPU thread
scheduling:
- test_ck_tile_batched_gemm
- test_ck_tile_grouped_gemm_preshuffle
- test_ck_tile_grouped_gemm_multi_d
These tests use kbatch=1 (no split-K), so errors are from
order-dependent rounding, not atomics. Increased tolerances from 1e-3 to
2e-3 (0.2%) to account for FP16 precision limits while still catching
real bugs.
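The effect described above can be reproduced on the host in single precision: the same three values summed in a different order give different results, and a relative tolerance decides whether a small deviation counts as a failure. This sketch uses `float` rather than FP16, treats the tolerance as relative (an assumption based on the "0.2%" wording), and all function names are hypothetical.

```cpp
#include <cmath>

// Sum three values left to right: (a + b) + c.
inline float sum_left_to_right(float a, float b, float c)
{
    return (a + b) + c;
}

// Sum the same values right to left: a + (b + c).
inline float sum_right_to_left(float a, float b, float c)
{
    return a + (b + c);
}

// Hypothetical relative-tolerance check of a result against a reference.
inline bool within_rel_tol(float actual, float expected, float rtol)
{
    return std::fabs(actual - expected) <= rtol * std::fabs(expected);
}
```

For example, with `a = 1e8f`, `b = -1e8f`, `c = 1.0f`, the two summation orders disagree because `b + c` rounds back to `-1e8f`. A result that is off by 0.15% passes at a tolerance of 2e-3 but fails at 1e-3.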
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
commit 07e9d561402c717946a1c08cfdce2681d5733335
Author: assistant-librarian[bot] <210906412+assistant-librarian[bot]@users.noreply.github.com>
Date: Fri Feb 6 16:10:23 2026 -0800
[CK] add inter/intrawave scheduling concept doc (#4300)
## Proposed changes
Adding information about inter/intrawave scheduling
---
🔁 Imported from
[ROCm/composable_kernel#3660](https://github.com/ROCm/composable_kernel/pull/3660)
🧑💻 Originally authored by @spolifroni-amd
---------
Co-authored-by: spolifroni-amd <sandra.polifroni@amd.com>
Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
commit 4d773b636ca00996e971d55bcd0530f641837b42
Author: JonathanLichtnerAMD <195780826+JonathanLichtnerAMD@users.noreply.github.com>
Date: Fri Feb 6 17:09:55 2026 -0700
Add .cline_storage to .gitignore (#4390)
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 738ffd7689ba0759f00c0e9430889b2ed995fa94
Author: Enrico Degregori <73224202+EnricoDeg@users.noreply.github.com>
Date: Sat Feb 7 01:09:08 2026 +0100
[CK] Workaround blockscale wp test failure (#4372)
## Motivation
Workaround to fix blockscale wp test failure for pipeline v3
## Technical Details
## Test Plan
## Test Result
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 612bf0b710b399276916c222d8d4c5f9c34f9f62
Author: James Sandham <33790278+jsandham@users.noreply.github.com>
Date: Fri Feb 6 18:44:55 2026 -0500
[rocsparse] Add bfloat16 and complex-types tests for code coverage (#4204)
## Motivation
Add bfloat16 and complex-types tests for code coverage. Also renames the
atomic_add tests to belong to pre_checkin so that they will be run as
part of code coverage pre_checkin tests.
commit 287fbc900071d5f9f8df7efdf1cfd25d9c8ea338
Author: Kerry Wang <kerrywang369@gmail.com>
Date: Fri Feb 6 15:55:22 2026 -0600
don't include waitcnts; renames
commit f52966a377bfd26725f35f103fbc7975cd9b4ec9
Author: Yiqian Liu <157505981+liu-yiqian@users.noreply.github.com>
Date: Fri Feb 6 15:50:20 2026 -0600
[rocRoller] Explicitly convert when typeAcc differs with typeD (#3977)
## Motivation
This PR explicitly converts the data type when the accumulator type differs
from the type of matrix D. The purpose of this change is to make the rocRoller
client add the same operation as hipblaslt.
## Technical Details
Added a convert operation when Accumulator is not the same type as
matrix D.
## Test Plan
All the existing tests should pass.
This change should not affect the performance.
## Test Result
Existing tests passed.
---------
Co-authored-by: yiqialiu <yiqialiu@amd.com>
commit 323a8d256e2409de54fa9dfa9523f4c50295c305
Author: Kerry Wang <kerrywang369@gmail.com>
Date: Fri Feb 6 15:28:17 2026 -0600
minor quality improvements
commit 6c4a1fb6d0f2eff761cd95e690f3ef0090266367
Author: Ali Yazdani <ayazdani@amd.com>
Date: Fri Feb 6 14:24:53 2026 -0700
[Origami] AutoWgm for NonTemporal Kernels. (#4218)
AIGESOLSEL-71
## Motivation
This PR enhances the Origami workgroup mapping (WGM) selection logic to
support nontemporal kernels and improve automatic WGM value selection.
Previously, nontemporal cases (NTA/NTB > 3) were excluded from automatic
workgroup mapping optimizations, limiting potential performance.
## Technical Details
1. Enabling NonTemporal support in AutoWGM with an enhanced logic
2. Improved WGM Candidate Generation
## Test Plan
CI, and locally ran performance tests.
## Test Result
Performance benchmarks show uplifts coming from changes.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 50e39459b541f978edd9acf645792cf496c16bea
Author: Kerry Wang <kerrywang369@gmail.com>
Date: Fri Feb 6 15:21:16 2026 -0600
add string-based observer test
commit 16b3b1840d61871c116c9ae80cf59324233377d5
Author: Torre Zuk <42548444+TorreZuk@users.noreply.github.com>
Date: Fri Feb 6 14:19:00 2026 -0700
[rocBLAS] Users/torrezuk/rocm 1157 amd smi rocblas (#4353)
## Motivation
Deprecated dependency rocm-smi replaced by amd-smi
## Test Plan
Test frequency reporting using rocblas-bench with environment variable set
commit 45b616b1e6df1b1d3816a2f03a05a0f6ab754652
Author: Illia Silin <98187287+illsilin@users.noreply.github.com>
Date: Fri Feb 6 10:17:02 2026 -0800
[CK] fix path for build filter (#4375)
## Motivation
Fix the filter that determines whether CI builds are necessary.
## Technical Details
A script checks the file list returned by git diff to determine whether
any source code was modified. If only documentation was changed,
it allows the builds to be skipped. We make sure we only look at
changes in the projects/composablekernel/ folder.
commit d8e2826bedff1183eaedeb6d6f5b2eeaa65dab7b
Author: Geo Min <geomin12@amd.com>
Date: Fri Feb 6 09:59:29 2026 -0800
[ci] Adding mi350 required group ID (#4378)
After updating mi325 group-id, we are noticing errors for mi350.
Tested here for mi350:
https://github.com/ROCm/TheRock/actions/runs/21733399385/job/62692971370
Tested here for mi325:
https://github.com/ROCm/TheRock/actions/runs/21759203211/job/62778060417
Adding both work properly
commit 78497b37bb4e5853b5da3feb96381c6b643556f7
Author: bibek <108366729+bghimireamd@users.noreply.github.com>
Date: Fri Feb 6 10:46:48 2026 -0600
Fix intermittent kernel compilation failures in BnFwdTrainingSpatial (#4202)
## Motivation
Fix kernel compilation failures in BnFwdTrainingSpatial caused by two
related issues:
1. Uninitialized local size variables producing random garbage values
2. Missing compile-time guard for warp-reduction LDS arrays when
workgroup < 64 threads
## To Reproduce
```bash
rm -rf ~/.config/miopen/*.udb.txt ~/.cache/miopen/
rm -rf /tmp/.config/miopen/ /tmp/.cache/miopen/
MIOPEN_FIND_ENFORCE=SEARCH MIOPEN_LOG_LEVEL=5 ./bin/MIOpenDriver bnorm -n 1024 -c 64 -H 13 -W 13 -m 1 --forw 1 -s 1 -V 1
```
## Technical Details
### Bug 1: Uninitialized Variables (Host)
Variables declared without initialization:
```cpp
size_t xlocalsize, xgridsize; // uninitialized
```
For Variants 0/1/3, early returns skip initialization, leaving garbage
values that propagate to kernel template parameters. Depending on stack
memory state, errors include:
- `error: array is too large (18446744073709545792 elements)`
- `error: variable length array declaration cannot have 'static' storage
duration`
- `error: zero-length arrays are not permitted in HIP device code`
### Bug 2: Missing Compile-Time Guard (Kernel)
The warp-reduction path divides LDS size by 64:
```cpp
__shared__ FpAccumCType lcl_data_x[MIO_BN_GRP0_FINAL * MIO_BN_GRP1_FINAL * MIO_BN_GRP2_FINAL / 64];
```
When Variants 0/1/3 set workgroup size to 1×1×1, this computes `1/64 =
0`, producing illegal zero-length arrays. Note that `if constexpr` only
suppresses template instantiation, not parsing of ill-formed code like
zero-length arrays.
## Fix
### Host side (`common_spatial.hpp`, `forward_spatial.cpp`)
Initialize variables to safe defaults:
```cpp
size_t xlocalsize = 1, xgridsize = 1;
size_t ylocalsize = 1, ygridsize = 1;
size_t zlocalsize = 1, zgridsize = 1;
size_t nelements = 1;
unsigned int ldsgcn = 0, ldsnogcn = 0;
```
### Kernel side (`MIOpenBatchNormFwdTrainSpatial.cpp`)
Use C++17 constexpr ternary to ensure array size is always ≥ 1:
```cpp
else
{
// C++17 idiomatic: ensure array size is never zero using constexpr ternary
constexpr auto grp_final_total =
MIO_BN_GRP0_FINAL * MIO_BN_GRP1_FINAL * MIO_BN_GRP2_FINAL;
constexpr auto lds_gcn_array_size = grp_final_total >= 64 ? grp_final_total / 64 : 1;
commitID = 64;
__shared__ FpAccumCType lcl_data_x[lds_gcn_array_size];
__shared__ FpAccumCType lcl_data_y[lds_gcn_array_size];
miopen::reduction::gcn_reduce2(...);
}
```
__Why this works:__
- `constexpr` ensures compile-time evaluation (zero runtime overhead)
- When workgroup ≥ 64: array size = `grp_final_total / 64` (correct,
same as before)
- When workgroup < 64: array size = 1 (valid), but this `else` branch is
never taken due to `if constexpr` guard
- Dead code elimination removes the unused size-1 arrays from the binary
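The clamping idea can also be checked outside the kernel. The following is a minimal host-side sketch, not the MIOpen code: `lds_array_size` and `ReductionScratch` are hypothetical names, and the divisor of 64 is taken from the LDS declaration shown earlier.

```cpp
#include <cstddef>

// Hypothetical helper: number of scratch slots for the warp-reduction buffer,
// one slot per group of 64 threads, never fewer than one.
constexpr std::size_t lds_array_size(std::size_t threads_in_group)
{
    return threads_in_group >= 64 ? threads_in_group / 64 : 1;
}

// A 1x1x1 workgroup would give 1 / 64 == 0 without the clamp.
static_assert(lds_array_size(1) == 1, "clamped to one element");
static_assert(lds_array_size(64) == 1, "exactly one group of 64");
static_assert(lds_array_size(256) == 4, "four groups of 64");

// The clamped value is a constant expression, so it can size an array.
template <std::size_t Threads>
struct ReductionScratch
{
    float data[lds_array_size(Threads)];
};
```

Because the size is computed in a `constexpr` function, the `static_assert` lines fail the build if the clamp is ever removed, which is cheaper than waiting for a kernel compilation error.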
## Test Plan
- Existing batchnorm training tests pass
- Verified no zero-length array errors with deterministic workgroup
sizes
- Confirmed warp-reduction path only executes when workgroup ≥ 64
threads
commit 8f8b97a40d36cb4095e929b0ef1b71ffea7ba170
Author: SreecharanGundaboluAMD <sgundabo@amd.com>
Date: Fri Feb 6 08:18:12 2026 -0800
[miopen] upgrade clang-format (#4194)
This PR updates the project's code formatting tooling to use
`clang-format-18` instead of `clang-format-12` throughout the codebase
as a transition as we move towards TheRock for our CI.
**Tooling and Configuration Updates:**
* Updated all references to `clang-format-12` to `clang-format-18` in
the pre-commit hook (`.githooks/pre-commit`), CMake configuration
(`ClangCheck.cmake`), and Dockerfile (`Dockerfile`). The Dockerfile now
also adds the appropriate LLVM 18 repository and keyring for
installation.
[[1]](diffhunk://#diff-1436c8126d575a7576d98d0bc8a8c6d27e8eb4e2d7241d61fe64c286c0d7365cL7-R7)
[[2]](diffhunk://#diff-fc024f0d7573d33039081dab6b12f76f0f34c8e07e014552daa1bed9a276a548L9-R9)
[[3]](diffhunk://#diff-32304f8a254e46fb8ff524cf4c488eb6013ab54a89ca62709cfb20ccf58976f9R54-R61)
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit b34fa62134425a611b18c05aae687f1bc09c7d87
Author: BrianHarrisonAMD <169072757+BrianHarrisonAMD@users.noreply.github.com>
Date: Fri Feb 6 08:57:01 2026 -0700
[hipDNN] Remove spdlog dependency for consumers of hipdnn (#4312)
## Motivation
Draft of changes to remove spdlog and fmt dependencies from hipDNN
frontend & consumer facing SDKs.
Consumers of hipDNN can have conflicts as a result of these
dependencies.
Removing these extra dependencies, and relying on the C API logging
methods from hipDNN backend will reduce friction, and make it easier for
adoption of hipDNN.
## Technical Details
- Remove spdlog and fmt from frontend + consumer SDKs.
- Add new string stream style formatter that will forward to backend
logging API callback
- Note: since we are set at C++17 standards, we cannot use std::format.
- Add Spdlog and fmt optional dependencies as opt in for plugin_sdk
- This enables existing plugins to maintain logging style they have in
place, and allows plugin authors to decide what style of logging to use.
- Unify consumption of the spdlog & fmt dependencies in CMake using a
single method
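A string-stream style formatter of the kind described above could look roughly like this. It is a sketch under assumptions: `LogCallback` and `log_message` are hypothetical names, and the callback signature does not reproduce the actual hipDNN backend C API.

```cpp
#include <functional>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical stand-in for a backend logging callback; the real hipDNN
// C API is not reproduced here.
using LogCallback = std::function<void(int severity, const std::string& message)>;

// Streams every argument into one message, then forwards it to the callback.
// Uses a C++17 fold expression, so neither fmt nor std::format is needed.
template <typename... Args>
void log_message(const LogCallback& sink, int severity, Args&&... args)
{
    std::ostringstream stream;
    (stream << ... << std::forward<Args>(args));
    if(sink)
    {
        sink(severity, stream.str());
    }
}
```

A call such as `log_message(sink, 2, "tensor ", 7, " has rank ", 4)` would hand the single string `tensor 7 has rank 4` to the callback, so consumers only need `operator<<` support for the types they log.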
## Test Plan
- Ensure build and tests are working for all components
- Ensure tests with logs enabled are working properly with expected
format
- Ensure build of samples is working, and logging format is correct
## Test Result
Build, tests, and testing with logging enabled are working locally for
hipDNN, providers, and samples
commit 0c37fdc37ba545b2ced5211b9f59c2381fc93753
Author: amd-chunxlin <chunxlin@amd.com>
Date: Fri Feb 6 09:44:39 2026 -0600
[rocRoller] Address long StreamK test runtimes (#4095)
## Motivation
Some StreamK tests take a long time to finish, and this PR addresses the
performance issue.
<details>
<summary>Comparison of test runtimes</summary>
| Test name | Develop branch | This branch |
| --- | --- | --- |
| GPU_BasicGEMMStreamKWorkgroupMapping/0 | 14.412s | 4.059s|
| GPU_BasicGEMMStreamKWorkgroupMapping/1 | 14.447s | 4.018s|
| GPU_BasicGEMMStreamKWorkgroupMapping/2 | 14.452s | 4.026s|
| GPU_BasicGEMMStreamKWorkgroupMapping/3 | 14.444s | 4.042s|
| GPU_BasicGEMMStreamKWorkgroupMapping/4 | 38.872s | 9.446s|
| GPU_BasicGEMMStreamKWorkgroupMapping/5 | 38.828s | 9.445s|
| GPU_BasicGEMMStreamKWorkgroupMapping/6 | 38.913s | 9.446s|
| GPU_BasicGEMMStreamKWorkgroupMapping/7 | 38.812s | 9.435s|
| GPU_BasicGEMMStreamKWorkgroupMapping/8 | 38.878s | 9.456s|
| GPU_BasicGEMMStreamKWorkgroupMapping/9 | 38.889s | 9.463s|
| GPU_BasicGEMMStreamKWorkgroupMapping/10 | 38.884s | 9.463s|
| GPU_BasicGEMMStreamKWorkgroupMapping/11 | 38.859s | 9.463s|
| GPU_BasicGEMMStreamKWorkgroupMapping/12 | 14.414s | 4.035s|
| GPU_BasicGEMMStreamKWorkgroupMapping/13 | 14.429s | 4.024s|
| GPU_BasicGEMMStreamKWorkgroupMapping/14 | 14.47s | 4.045s|
| GPU_BasicGEMMStreamKWorkgroupMapping/15 | 14.428s | 4.044s|
| GPU_BasicGEMMStreamKWorkgroupMapping/16 | 38.854s | 9.454s|
| GPU_BasicGEMMStreamKWorkgroupMapping/17 | 38.861s | 9.448s|
| GPU_BasicGEMMStreamKWorkgroupMapping/18 | 38.95s | 9.461s|
| GPU_BasicGEMMStreamKWorkgroupMapping/19 | 38.826s | 9.458s|
| GPU_BasicGEMMStreamKWorkgroupMapping/20 | 38.833s | 9.48s|
| GPU_BasicGEMMStreamKWorkgroupMapping/21 | 38.888s | 9.472s|
| GPU_BasicGEMMStreamKWorkgroupMapping/22 | 38.868s | 9.477s|
| GPU_BasicGEMMStreamKWorkgroupMapping/23 | 38.907s | 9.485s|
| GPU_BasicGEMMStreamKWorkgroupMapping/24 | 14.426s | 4.025s|
| GPU_BasicGEMMStreamKWorkgroupMapping/25 | 14.435s | 4.051s|
| GPU_BasicGEMMStreamKWorkgroupMapping/26 | 14.472s | 4.053s|
| GPU_BasicGEMMStreamKWorkgroupMapping/27 | 14.471s | 4.058s|
| GPU_BasicGEMMStreamKWorkgroupMapping/28 | 38.879s | 9.457s|
| GPU_BasicGEMMStreamKWorkgroupMapping/29 | 38.814s | 9.445s|
| GPU_BasicGEMMStreamKWorkgroupMapping/30 | 38.853s | 9.45s|
| GPU_BasicGEMMStreamKWorkgroupMapping/31 | 38.963s | 9.458s|
| GPU_BasicGEMMStreamKWorkgroupMapping/32 | 38.924s | 9.466s|
| GPU_BasicGEMMStreamKWorkgroupMapping/33 | 38.898s | 9.482s|
| GPU_BasicGEMMStreamKWorkgroupMapping/34 | 38.951s | 9.455s|
| GPU_BasicGEMMStreamKWorkgroupMapping/35 | 38.924s | 9.459s|
| GPU_BasicGEMMStreamKWorkgroupMapping/36 | 14.461s | 4.037s|
| GPU_BasicGEMMStreamKWorkgroupMapping/37 | 14.452s | 4.032s|
| GPU_BasicGEMMStreamKWorkgroupMapping/38 | 14.43s | 4.053s|
| GPU_BasicGEMMStreamKWorkgroupMapping/39 | 14.43s | 4.042s|
| GPU_BasicGEMMStreamKWorkgroupMapping/40 | 38.868s | 9.473s|
| GPU_BasicGEMMStreamKWorkgroupMapping/41 | 38.925s | 9.461s|
| GPU_BasicGEMMStreamKWorkgroupMapping/42 | 38.884s | 9.452s|
| GPU_BasicGEMMStreamKWorkgroupMapping/43 | 38.925s | 9.455s|
| GPU_BasicGEMMStreamKWorkgroupMapping/44 | 39.012s | 9.476s|
| GPU_BasicGEMMStreamKWorkgroupMapping/45 | 38.915s | 9.479s|
| GPU_BasicGEMMStreamKWorkgroupMapping/46 | 38.933s | 9.457s|
| GPU_BasicGEMMStreamKWorkgroupMapping/47 | 38.936s | 9.469s|
| GPU_BasicGEMMStreamKWorkgroupMapping/48 | 14.461s | 4.041s|
| GPU_BasicGEMMStreamKWorkgroupMapping/49 | 14.468s | 4.049s|
| GPU_BasicGEMMStreamKWorkgroupMapping/50 | 14.466s | 4.046s|
| GPU_BasicGEMMStreamKWorkgroupMapping/51 | 14.479s | 4.038s|
| GPU_BasicGEMMStreamKWorkgroupMapping/52 | 38.907s | 9.473s|
| GPU_BasicGEMMStreamKWorkgroupMapping/53 | 38.914s | 9.471s|
| GPU_BasicGEMMStreamKWorkgroupMapping/54 | 38.885s | 9.463s|
| GPU_BasicGEMMStreamKWorkgroupMapping/55 | 38.891s | 9.463s|
| GPU_BasicGEMMStreamKWorkgroupMapping/56 | 38.859s | 9.472s|
| GPU_BasicGEMMStreamKWorkgroupMapping/57 | 38.899s | 9.475s|
| GPU_BasicGEMMStreamKWorkgroupMapping/58 | 38.936s | 9.47s|
| GPU_BasicGEMMStreamKWorkgroupMapping/59 | 38.952s | 9.472s|
| GPU_BasicGEMMStreamKWorkgroupMapping/60 | 14.485s | 4.036s|
| GPU_BasicGEMMStreamKWorkgroupMapping/61 | 14.419s | 4.031s|
| GPU_BasicGEMMStreamKWorkgroupMapping/62 | 14.455s | 4.035s|
| GPU_BasicGEMMStreamKWorkgroupMapping/63 | 14.488s | 4.045s|
| GPU_BasicGEMMStreamKWorkgroupMapping/64 | 38.846s | 9.462s|
| GPU_BasicGEMMStreamKWorkgroupMapping/65 | 38.923s | 9.475s|
| GPU_BasicGEMMStreamKWorkgroupMapping/66 | 38.884s | 9.463s|
| GPU_BasicGEMMStreamKWorkgroupMapping/67 | 38.93s | 9.441s|
| GPU_BasicGEMMStreamKWorkgroupMapping/68 | 38.9s | 9.473s|
| GPU_BasicGEMMStreamKWorkgroupMapping/69 | 38.914s | 9.477s|
| GPU_BasicGEMMStreamKWorkgroupMapping/70 | 38.961s | 9.487s|
| GPU_BasicGEMMStreamKWorkgroupMapping/71 | 38.93s | 9.489s|
| GPU_BasicGEMMStreamKWorkgroupMapping/72 | 14.432s | 4.051s|
| GPU_BasicGEMMStreamKWorkgroupMapping/73 | 14.442s | 4.042s|
| GPU_BasicGEMMStreamKWorkgroupMapping/74 | 14.44s | 4.036s|
| GPU_BasicGEMMStreamKWorkgroupMapping/75 | 14.456s | 4.042s|
| GPU_BasicGEMMStreamKWorkgroupMapping/76 | 38.891s | 9.486s|
| GPU_BasicGEMMStreamKWorkgroupMapping/77 | 38.938s | 9.444s|
| GPU_BasicGEMMStreamKWorkgroupMapping/78 | 38.919s | 9.448s|
| GPU_BasicGEMMStreamKWorkgroupMapping/79 | 38.858s | 9.455s|
| GPU_BasicGEMMStreamKWorkgroupMapping/80 | 38.878s | 9.491s|
| GPU_BasicGEMMStreamKWorkgroupMapping/81 | 38.975s | 9.478s|
| GPU_BasicGEMMStreamKWorkgroupMapping/82 | 38.945s | 9.496s|
| GPU_BasicGEMMStreamKWorkgroupMapping/83 | 38.905s | 9.477s|
| GPU_BasicGEMMStreamKWorkgroupMapping/84 | 14.433s | 4.033s|
| GPU_BasicGEMMStreamKWorkgroupMapping/85 | 14.41s | 4.038s|
| GPU_BasicGEMMStreamKWorkgroupMapping/86 | 14.478s | 4.044s|
| GPU_BasicGEMMStreamKWorkgroupMapping/87 | 14.473s | 4.035s|
| GPU_BasicGEMMStreamKWorkgroupMapping/88 | 38.927s | 9.49s|
| GPU_BasicGEMMStreamKWorkgroupMapping/89 | 38.938s | 9.472s|
| GPU_BasicGEMMStreamKWorkgroupMapping/90 | 38.913s | 9.454s|
| GPU_BasicGEMMStreamKWorkgroupMapping/91 | 38.835s | 9.463s|
| GPU_BasicGEMMStreamKWorkgroupMapping/92 | 38.901s | 9.494s|
| GPU_BasicGEMMStreamKWorkgroupMapping/93 | 38.864s | 9.488s|
| GPU_BasicGEMMStreamKWorkgroupMapping/94 | 38.839s | 9.508s|
| GPU_BasicGEMMStreamKWorkgroupMapping/95 | 38.921s | 9.487s|
| GPU_BasicGEMMStreamKWorkgroupMapping/96 | 14.445s | 4.05s|
| GPU_BasicGEMMStreamKWorkgroupMapping/97 | 14.432s | 4.041s|
| GPU_BasicGEMMStreamKWorkgroupMapping/98 | 14.472s | 4.055s|
| GPU_BasicGEMMStreamKWorkgroupMapping/99 | 14.449s | 4.039s|
| GPU_BasicGEMMStreamKWorkgroupMapping/100 | 38.87s | 9.485s|
| GPU_BasicGEMMStreamKWorkgroupMapping/101 | 38.866s | 9.474s|
| GPU_BasicGEMMStreamKWorkgroupMapping/102 | 38.836s | 9.471s|
| GPU_BasicGEMMStreamKWorkgroupMapping/103 | 38.847s | 9.464s|
| GPU_BasicGEMMStreamKWorkgroupMapping/104 | 38.985s | 9.468s|
| GPU_BasicGEMMStreamKWorkgroupMapping/105 | 38.928s | 9.469s|
| GPU_BasicGEMMStreamKWorkgroupMapping/106 | 38.866s | 9.496s|
| GPU_BasicGEMMStreamKWorkgroupMapping/107 | 38.92s | 9.48s|
| GPU_BasicGEMMStreamKWorkgroupMapping/108 | 14.452s | 4.046s|
| GPU_BasicGEMMStreamKWorkgroupMapping/109 | 14.448s | 4.044s|
| GPU_BasicGEMMStreamKWorkgroupMapping/110 | 14.469s | 4.044s|
| GPU_BasicGEMMStreamKWorkgroupMapping/111 | 14.457s | 4.052s|
| GPU_BasicGEMMStreamKWorkgroupMapping/112 | 38.895s | 9.474s|
| GPU_BasicGEMMStreamKWorkgroupMapping/113 | 38.901s | 9.453s|
| GPU_BasicGEMMStreamKWorkgroupMapping/114 | 38.882s | 9.467s|
| GPU_BasicGEMMStreamKWorkgroupMapping/115 | 38.861s | 9.446s|
| GPU_BasicGEMMStreamKWorkgroupMapping/116 | 38.945s | 9.498s|
| GPU_BasicGEMMStreamKWorkgroupMapping/117 | 38.94s | 9.488s|
| GPU_BasicGEMMStreamKWorkgroupMapping/118 | 38.881s | 9.488s|
| GPU_BasicGEMMStreamKWorkgroupMapping/119 | 38.911s | 9.486s|
| GPU_BasicGEMMStreamKWorkgroupMapping/120 | 14.434s | 4.044s|
| GPU_BasicGEMMStreamKWorkgroupMapping/121 | 14.455s | 4.061s|
| GPU_BasicGEMMStreamKWorkgroupMapping/122 | 14.464s | 4.067s|
| GPU_BasicGEMMStreamKWorkgroupMapping/123 | 14.488s | 4.058s|
| GPU_BasicGEMMStreamKWorkgroupMapping/124 | 38.932s | 9.499s|
| GPU_BasicGEMMStreamKWorkgroupMapping/125 | 38.868s | 9.488s|
| GPU_BasicGEMMStreamKWorkgroupMapping/126 | 38.863s | 9.488s|
| GPU_BasicGEMMStreamKWorkgroupMapping/127 | 38.899s | 9.467s|
| GPU_BasicGEMMStreamKWorkgroupMapping/128 | 38.948s | 9.511s|
| GPU_BasicGEMMStreamKWorkgroupMapping/129 | 38.915s | 9.502s|
| GPU_BasicGEMMStreamKWorkgroupMapping/130 | 38.932s | 9.508s|
| GPU_BasicGEMMStreamKWorkgroupMapping/131 | 38.914s | 9.491s|
| GPU_BasicGEMMStreamKWorkgroupMapping/132 | 14.42s | 4.031s|
| GPU_BasicGEMMStreamKWorkgroupMapping/133 | 14.464s | 4.054s|
| GPU_BasicGEMMStreamKWorkgroupMapping/134 | 14.494s | 4.048s|
| GPU_BasicGEMMStreamKWorkgroupMapping/135 | 14.48s | 4.046s|
| GPU_BasicGEMMStreamKWorkgroupMapping/136 | 38.911s | 9.49s|
| GPU_BasicGEMMStreamKWorkgroupMapping/137 | 38.91s | 9.478s|
| GPU_BasicGEMMStreamKWorkgroupMapping/138 | 38.914s | 9.478s|
| GPU_BasicGEMMStreamKWorkgroupMapping/139 | 38.92s | 9.475s|
| GPU_BasicGEMMStreamKWorkgroupMapping/140 | 38.883s | 9.508s|
| GPU_BasicGEMMStreamKWorkgroupMapping/141 | 38.935s | 9.5s|
| GPU_BasicGEMMStreamKWorkgroupMapping/142 | 38.848s | 9.494s|
| GPU_BasicGEMMStreamKWorkgroupMapping/143 | 38.932s | 9.502s|
</details>
## Technical Details
- Short-circuit expression comparison
- Remove unused code
- Caching expressions of kernel arguments to eliminate redundant
regeneration.
- Change `AssemblyKernelArgument` to a `class`
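The first and third items can be illustrated with a small sketch. This is not rocRoller code: `Expr`, `identical`, and `ArgumentExprCache` are hypothetical names, and the assumption is that short-circuiting means returning early when both sides are the same node.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

// Hypothetical expression node; rocRoller's real Expression type differs.
struct Expr
{
    std::string           op;
    std::shared_ptr<Expr> lhs;
    std::shared_ptr<Expr> rhs;
};
using ExprPtr = std::shared_ptr<Expr>;

// Structural comparison that returns early when both sides are the same
// object, which avoids walking shared subtrees.
inline bool identical(const ExprPtr& a, const ExprPtr& b)
{
    if(a == b)
        return true; // same node (or both null)
    if(!a || !b)
        return false;
    return a->op == b->op && identical(a->lhs, b->lhs) && identical(a->rhs, b->rhs);
}

// Builds the expression for a kernel argument once and reuses it afterwards.
class ArgumentExprCache
{
public:
    ExprPtr get(const std::string& name, const std::function<ExprPtr()>& build)
    {
        auto it = m_cache.find(name);
        if(it == m_cache.end())
            it = m_cache.emplace(name, build()).first;
        return it->second;
    }

private:
    std::map<std::string, ExprPtr> m_cache;
};
```

The two pieces reinforce each other in this sketch: cached expressions are returned as the same pointer, so later comparisons between them hit the pointer-equality fast path.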
## Test Plan
No functional changes; covered by existing tests.
## Test Result
See CI report
commit 66e22ac6c6c0c286325cabf2b1faa269ea640446
Author: hcman2 <52367956+hcman2@users.noreply.github.com>
Date: Fri Feb 6 10:55:08 2026 +0800
[formocast] [tensilelite] enable tuning with formocast (#4043)
## Motivation
Integration plan for formocast and origami:
Steps
1. Push Formocast code to the origami subfolder. (we are here now)
2. Submit tuning code calling the API of origami.
3. Push and reserve origami prediction mode. Add APIs to pass sizemapping
data via config_t. Use an environment variable to switch modes without
affecting other code.
4. Enable predictionThreshold with tox tests.
5. Push Origami code with Formocast backend. This step will enable the
bench with different modes.
6. Refine Formocast and Origami to verify API usage and identify
functions to move.
This PR includes step 2 and step 4.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Henry Ho <hehe790223@gmail.com>
Co-authored-by: Peter Cheng <Peter.Cheng@amd.com>
commit 37a74ef54eaa1bb1df603db6ec8aff22f342bc71
Author: Illia Silin <98187287+illsilin@users.noreply.github.com>
Date: Thu Feb 5 17:06:57 2026 -0800
[CK] a bunch of CI fixes. (#4361)
## Motivation
Fixing some of the CK CI issues
## Technical Details
fixing paths to dockerfiles and scripts;
moving codegen tests to separate stage (collides with main build since
you must call cmake from same folder but different options);
fixing a couple of clang compilation issues with staging compiler;
commit 808e9496d17be0826164dbb34457f54e157dd2bd
Author: Kerry Wang <kerrywang369@gmail.com>
Date: Thu Feb 5 17:00:01 2026 -0600
refactor observers to have new runtime with context concept
commit 3c9beb38b8dba1301a961cc5dc3f44ca9d4185e3
Author: Eiden Yoshida <47196116+eidenyoshida@users.noreply.github.com>
Date: Thu Feb 5 17:56:12 2026 -0500
[CK] MICI: Fix git diff in selective_test_filter.py (#4352)
## Motivation
- git diff needs access to reference repo
## Technical Details
- mount reference repo path into docker for selective_test_filter.py to
access
## Test Plan
- tested in MICI
## Test Result
- launch_tests.sh ran successfully
commit 1663ac026d46c3dd02edb73d2bfa7310c54695d5
Author: Torre Zuk <42548444+TorreZuk@users.noreply.github.com>
Date: Thu Feb 5 14:20:45 2026 -0700
[rocBLAS] trsm doc & test; trsv change noted in log (#4198)
## Motivation
Tests trsm use of new trsv kernel for big batches
Adds changelog note on trsv, which used to call hipGetDevice and
potentially hipSetDevice
commit d8bb9d2b9fe278d07fe63b395d87268c94e53fd8
Author: Jeffrey Novotny <jnovotny@amd.com>
Date: Thu Feb 5 14:54:48 2026 -0500
[rocsolver] Doxygen API cleanup part 2 (#4330)
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
Continuing with rocSOLVER API/Doxygen copy edits and polishing
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
Edit Doxygen comments in header file.
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
Build locally.
## Test Result
<!-- Briefly summarize test outcomes. -->
NA
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 5aa1f1d4c189f779ea699be250fb1b284f3d6ac2
Author: Geo Min <geomin12@amd.com>
Date: Thu Feb 5 11:01:53 2026 -0800
[ci] Updating variable group-id for OSSCI (#4360)
OSSCI migrated mi325s, so need a new groupID
Sanity works here:
https://github.com/ROCm/TheRock/actions/runs/21723540679/job/62659665907
normal run works here:
https://github.com/ROCm/TheRock/actions/runs/21723540679/job/62659791422
I've dabbled with organization variables; however, this does not work
for forks, so for now we will do the manual update
commit 6273d3b30b32d6e0856394d37a421153dceb33c1
Author: Dmitrii Polomin <dmitriy.polomin@dxc.com>
Date: Thu Feb 5 19:53:45 2026 +0100
[MIOpen] Ported solver test to gtest (#3713)
## Motivation
Porting tests from CTest to GTest, in this case, `solver.cpp`
## Technical Details
Pretty straightforward port, although I had to get creative in order to
conform to `INSTANTIATE_TEST_SUITE_P` pattern and naming conventions
## Test Plan
Running locally, using the CI launched by this PR
## Test Result
See CI actions launched by this PR
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit ece0c48dd152856c3b443c12c58a89bc7d7c34e5
Author: Nathan Henderson <nathan.henderson@amd.com>
Date: Thu Feb 5 09:03:42 2026 -0700
[rocroller] Use StreamKMode in hipBLASLt integration and client (#4028)
## Motivation
RocRoller has a `StreamKMode` enum that is used internally and in the
GEMM tests. However, the client and hipBLASLt integration still used
boolean values to represent the StreamK state. This PR replaces the
individual boolean flags (`--streamK`, `--streamKTwoTile`,
`--streamKTwoTileDPFirst`) with a single `--streamK` string option that
accepts one of the `StreamKMode` values (`None`, `Standard`, `TwoTile`,
or `TwoTileDPFirst`).
## Technical Details
- Update the rocRoller GEMM client to use `StreamKMode` enum instead of
three separate booleans
- Update the hipBLASLt rocRoller integration to match
- Add `enumStrings<T>()` utility function in `Utils_impl.hpp` for CLI
validation of enum values
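The string-to-enum validation described above might be sketched as follows. The four mode names come from the description; `kStreamKModes` and `parse_streamk_mode` are hypothetical names, and the real `enumStrings<T>()` utility may be shaped differently.

```cpp
#include <array>
#include <optional>
#include <string_view>
#include <utility>

// Mirrors the four mode names listed in the PR description.
enum class StreamKMode
{
    None,
    Standard,
    TwoTile,
    TwoTileDPFirst
};

// Hypothetical name table; the real enumStrings<T>() utility may look different.
constexpr std::array<std::pair<std::string_view, StreamKMode>, 4> kStreamKModes{{
    {"None", StreamKMode::None},
    {"Standard", StreamKMode::Standard},
    {"TwoTile", StreamKMode::TwoTile},
    {"TwoTileDPFirst", StreamKMode::TwoTileDPFirst},
}};

// Returns the mode for a --streamK value, or nullopt for an unknown string.
inline std::optional<StreamKMode> parse_streamk_mode(std::string_view text)
{
    for(const auto& [name, mode] : kStreamKModes)
    {
        if(name == text)
            return mode;
    }
    return std::nullopt;
}
```

Returning `std::optional` leaves the error message to the caller, which can list the accepted names from the same table. Matching here is case-sensitive, which is an assumption rather than documented behavior.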
## Test Plan
Update `test_gemm_client.py` YAML fixtures to use the new `streamK:
None` format.
## Test Result
Validated by the StreamK rrperf tests
commit 3b98c98a23e76075a6a1e4e580482a627e39d59b
Author: Jobbins <john.robbins@amd.com>
Date: Thu Feb 5 08:56:42 2026 -0700
[composablekernel] fix failure status (#4351)
## Motivation
Pipelines were failing on Math CI status check.
## Technical Details
For the success case, we just changed the config in Jenkins to use a
proper app token and no code changes were required. However, the failure
case would not have worked as coded, so we needed to move that outside
of the `rocmnode()` block.
## Test Plan
I removed all of the CI in one of the commits to quickly test, and then
added it back. The "success" and "failure" messages were both produced
successfully
commit 9bb7f5c31253643cd72363314c3d3ee02f723406
Author: Eiden Yoshida <47196116+eidenyoshida@users.noreply.github.com>
Date: Thu Feb 5 10:55:44 2026 -0500
[CK] MICI: Correct path for build trace script (#4349)
## Motivation
- Corrects path to script due to superrepo migration
- Forces all tests to run by default
## Technical Details
- now in /projects/composablekernel
---------
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
commit 120f91dd211117e308b3713593ac7f061cc02c08
Author: bibek <108366729+bghimireamd@users.noreply.github.com>
Date: Thu Feb 5 09:47:16 2026 -0600
[HIPDNN][DOC] Add TYPED_TEST guidance for multi-datatype tests (#4000)
## Motivation
Doc update : add `TYPED_TEST` guidance for multi-datatype tests
Update `.clinerules`, `.cursor/rules/testing.mdc`, and `docs/Testing.md`
to recommend `TYPED_TEST` for tests covering `float`/`half`/`bfloat16`.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Samuel Reeder <41528605+SamuelReeder@users.noreply.github.com>
commit d26a7820b58f789d19efd6064d4c2c4f4fc72a95
Author: Eiden Yoshida <47196116+eidenyoshida@users.noreply.github.com>
Date: Wed Feb 4 21:43:22 2026 -0500
[CK] MICI: Use reference repo for checkout operations (#4336)
## Motivation
- Maintain a reference repo on slave nodes that speeds up any
clone/checkout operations
## Technical Details
- clone a ref repo if it does not exist
- update ref repo if it does exist
- checkout after ref repo is updated
- eliminates double clone
## Test Result
- Initial checkouts succeeded
commit f2f187ab40738272232f571f58112697da405b1a
Author: Geo Min <geomin12@amd.com>
Date: Wed Feb 4 15:43:38 2026 -0800
[ci] Fixing rocm-libs race condition (#4192)
Currently, there is a race condition that overwrites BLAS libraries
during MIOpen/hipdnn builds. (error:
https://github.com/ROCm/rocm-libraries/actions/runs/21228188053/job/61080555083)
Tested locally:
```
# With all three
geom@geom:~/Code/rocm-libraries/.github/scripts$ PLATFORM="linux" python3 therock_matrix.py
[{'cmake_options': '-DTHEROCK_USE_EXTERNAL_COMPOSABLE_KERNEL=ON -DTHEROCK_ENABLE_COMPOSABLE_KERNEL=ON -DTHEROCK_ENABLE_MIOPEN=ON -DTHEROCK_ENABLE_MIOPEN_PLUGIN=ON -DTHEROCK_ENABLE_ALL=OFF -DTHEROCK_COMPOSABLE_KERNEL_SOURCE_DIR=../composable_kernel', 'project_to_test': 'miopen_plugin,miopen,hipdnn'}]
# Only hipdnn
geom@geom:~/Code/rocm-libraries/.github/scripts$ PLATFORM="linux" python3 therock_matrix.py
[{'cmake_options': '-DTHEROCK_USE_EXTERNAL_COMPOSABLE_KERNEL=ON -DTHEROCK_ENABLE_ALL=OFF -DTHEROCK_COMPOSABLE_KERNEL_SOURCE_DIR=../composable_kernel -DTHEROCK_ENABLE_MIOPEN_PLUGIN=ON -DTHEROCK_ENABLE_COMPOSABLE_KERNEL=ON', 'project_to_test': 'hipdnn,miopen_plugin'}]
```
This fixes the error, because when all libraries are run, they are combined
(saves resources + no overwriting)
commit f34aec25c434b3044b75481d70693af3bf0ade1e
Author: assistant-librarian[bot] <210906412+assistant-librarian[bot]@users.noreply.github.com>
Date: Wed Feb 4 18:25:31 2026 -0500
[CK] Add FP8 KV_BLOCKSCALE support for batch prefill (#4263)
Implement per-page K/V quantization for paged attention:
- Add KV_BLOCKSCALE enum to BlockAttentionQuantScaleEnum
- Use exp2 shift trick to eliminate explicit P scaling overhead
- Prefetch physical page offsets for the KV cache so the loads overlap
with computation
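The exp2 shift trick rests on the identity 2^(x + shift) = 2^x * 2^shift. The sketch below shows only that identity in scalar form; the function names are hypothetical, and how the kernel picks `shift` or folds it into the row maximum is an assumption, not taken from the pipeline source.

```cpp
#include <cmath>

// Reference form: compute the probability, then scale it by 2^shift.
inline float scaled_prob_explicit(float score, float row_max, float shift)
{
    return std::exp2(score - row_max) * std::exp2(shift);
}

// Shifted form: fold the scale into the exponent, since
// 2^(x + shift) == 2^x * 2^shift. No separate multiply on P is needed.
inline float scaled_prob_shifted(float score, float row_max, float shift)
{
    return std::exp2(score - (row_max - shift));
}
```

In the shifted form the adjustment is applied once to the row maximum, so every element of P comes out pre-scaled from the exponential that has to be evaluated anyway.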
## Proposed changes
Please describe the motivation behind the pull request, whether it
enables a new feature or fixes a bug. If there are associated pull
requests or issues, please link them to the pull request.
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged
## Discussion
If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
---
🔁 Imported from
[ROCm/composable_kernel#3696](https://github.com/ROCm/composable_kernel/pull/3696)
🧑‍💻 Originally authored by @Jeff-Huang
---------
Co-authored-by: Jeff Huang <chiachi.huang@amd.com>
Co-authored-by: Illia Silin <Illia.Silin@amd.com>
commit df32df51ea6a9ebfba9a459c77ed82c4877df22b
Author: Yiqian Liu <157505981+liu-yiqian@users.noreply.github.com>
Date: Wed Feb 4 17:12:11 2026 -0600
[rocRoller] [hipblaslt] Enable more workgroup tile sizes for pre-swizzled scale data (#4175)
## Motivation
Pre-swizzling is an optimization that rearranges the scale data in advance to
match the layout that the kernel expects. The purpose of this PR is to add more
possible workgroup tile sizes that support pre-swizzled scale data.
## Technical Details
1. Configure the workgroup tile size to 256 in the K dimension when the
input data format is pre-swizzled.
2. Filter out invalid workgroup tile sizes (i.e., the MN dimension is not
a multiple of 32, or the MN dimension is 96).
3. Configure the solution parameters for pre-swizzled input.
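The filtering rule in items 1 and 2 can be written as a small predicate. This is a sketch, not the rocRoller or hipBLASLt code: the function names are hypothetical, and it assumes the "MN dimension" rule applies to M and N individually and that sizes are positive.

```cpp
// Hypothetical predicate for the rule in item 2: an M or N tile dimension is
// accepted only if it is a positive multiple of 32 and is not 96.
constexpr bool is_valid_mn_tile(int mn)
{
    return mn > 0 && mn % 32 == 0 && mn != 96;
}

// For pre-swizzled scale data the K dimension is fixed at 256 (item 1).
constexpr int kPreSwizzledTileK = 256;

// Assumption: the MN rule is checked for M and N separately.
constexpr bool is_valid_preswizzled_tile(int m, int n, int k)
{
    return is_valid_mn_tile(m) && is_valid_mn_tile(n) && k == kPreSwizzledTileK;
}

// The 32x32x256 tile is the one exercised by the new client test.
static_assert(is_valid_preswizzled_tile(32, 32, 256), "32x32x256 is accepted");
static_assert(!is_valid_preswizzled_tile(96, 32, 256), "96 is rejected");
```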
## Test Plan
1. Added a rocRoller client test that uses 32x32x256 workgroup tile.
## Test Result
1. This PR should not change any kernel whose data is not
pre-swizzled.
2. All the tests should pass with no performance changes.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Yiqian Liu <yiqialiu@ctr2-alola-ctrl-01.amd.com>
Co-authored-by: yiqialiu <yiqialiu@amd.com>
commit 87d1a8fa005ef2f75e48e5c9c4e70f8235236b03
Author: Samuel Reeder <41528605+SamuelReeder@users.noreply.github.com>
Date: Wed Feb 4 16:10:01 2026 -0700
Use `--latest-release` flag for installing rocm in clang-tidy (#4120)
## Motivation
`--latest-release` was added in TheRock
[2997](https://github.com/ROCm/TheRock/pull/2997) to grab latest nightly
for the specified target. We can use this to keep the clang-tidy
workflow somewhat up-to-date.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit c5…
Implement per-page K/V quantization for paged attention:
Proposed changes
Please describe the motivation behind the pull request, whether it enables a new feature or fixes a bug. If there are associated pull requests or issues, please link them to the pull request.
Checklist
Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.
- I have run `clang-format` on all changed files
Discussion
If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered