[CK] Add FP8 KV_BLOCKSCALE support for batch prefill by Jeff-Huang · Pull Request #3696 · ROCm/composable_kernel

Jeff-Huang · 2026-01-31T04:01:22Z

Implement per-page K/V quantization for paged attention:

Add KV_BLOCKSCALE enum to BlockAttentionQuantScaleEnum
Use exp2 shift trick to eliminate explicit P scaling overhead
Prefetch physical page offsets for the KV cache so the lookups overlap with computation

Copilot

Pull request overview

This pull request implements FP8 KV_BLOCKSCALE support for batch prefill in the FMHA (Fused Multi-Head Attention) pipeline. The implementation adds per-page K/V quantization for paged attention caches, uses an exp2 shift trick to eliminate explicit P scaling overhead, and prefetches physical page offsets to hide memory latency.
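The exp2 shift trick mentioned above relies on the identity exp2(x + shift) = exp2(x) * 2^shift, so the scale factor can be folded into the exponent argument instead of being applied as a separate multiply. The following is a minimal host-side sketch of that identity only, not the pipeline code from this PR; `kShift`, `scale_p_explicit`, and `scale_p_shifted` are hypothetical names, and the shift value 8 is an assumption chosen for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical shift value for illustration; the PR's actual constant is not shown here.
constexpr float kShift = 8.0f;

// Baseline: p = exp2(s - row_max) lies in [0, 1]; scaling to [0, 2^shift]
// costs one extra multiply per element.
inline std::vector<float> scale_p_explicit(const std::vector<float>& s, float row_max)
{
    assert(std::isfinite(row_max));
    std::vector<float> p(s.size());
    const float scale = std::exp2(kShift);
    for(std::size_t i = 0; i < s.size(); ++i)
        p[i] = std::exp2(s[i] - row_max) * scale;
    return p;
}

// Shift trick: exp2(s - row_max + shift) == exp2(s - row_max) * 2^shift,
// so the scale is folded into a per-row bias and the multiply disappears.
inline std::vector<float> scale_p_shifted(const std::vector<float>& s, float row_max)
{
    assert(std::isfinite(row_max));
    std::vector<float> p(s.size());
    const float bias = kShift - row_max; // computed once per row
    for(std::size_t i = 0; i < s.size(); ++i)
        p[i] = std::exp2(s[i] + bias);
    return p;
}
```

Both functions should agree up to floating-point rounding; the second one simply does the per-row work once instead of per element.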

Changes:

Added KV_BLOCKSCALE enum to BlockAttentionQuantScaleEnum for per-page K/V quantization
Refactored KV cache offset computation by splitting into load_physical_pages() and kv_offset_array_transform()
Implemented exp2 shift trick to scale attention probabilities by 2^shift without explicit multiplication
Added prefetching of physical pages to overlap memory loads with computation
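The split into a page lookup stage and an offset transform stage, combined with prefetching, can be pictured as a simple software pipeline: the lookup for the next tile is issued before the current tile is consumed. The sketch below is a sequential host-side model of that ordering under stated assumptions; `load_physical_page`, `page_to_offset`, and `walk_tiles_prefetched` are hypothetical stand-ins, not the signatures of `load_physical_pages()` or `kv_offset_array_transform()` in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stage 1 (hypothetical): page-table lookup, logical page -> physical page.
inline int32_t load_physical_page(const std::vector<int32_t>& page_table, std::size_t logical_page)
{
    assert(logical_page < page_table.size());
    return page_table[logical_page];
}

// Stage 2 (hypothetical): turn a physical page index into an element offset.
inline int64_t page_to_offset(int32_t physical_page, int64_t page_stride)
{
    return static_cast<int64_t>(physical_page) * page_stride;
}

// Walk all tiles, issuing the lookup for tile i+1 before tile i is consumed.
// On a GPU the early load can be in flight while the current tile is computed;
// here the ordering is only modeled sequentially.
inline std::vector<int64_t> walk_tiles_prefetched(const std::vector<int32_t>& page_table,
                                                  int64_t page_stride)
{
    std::vector<int64_t> offsets;
    if(page_table.empty())
        return offsets;

    int32_t next_page = load_physical_page(page_table, 0); // prologue lookup
    for(std::size_t i = 0; i < page_table.size(); ++i)
    {
        const int32_t cur_page = next_page;
        if(i + 1 < page_table.size())
            next_page = load_physical_page(page_table, i + 1); // prefetch for next iteration

        // Placeholder for the per-tile work that would consume the offset.
        offsets.push_back(page_to_offset(cur_page, page_stride));
    }
    return offsets;
}
```

Separating the two stages is what makes the early lookup possible: the offset arithmetic stays next to the compute, while the memory access moves one iteration ahead.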

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp	Core pipeline implementation: added load_physical_pages(), refactored kv_offset_array_transform(), implemented KV_BLOCKSCALE logic with exp2 shift trick and prefetching
fmha_batch_prefill_kernel.hpp	Added KV_BLOCKSCALE kernel argument structures (FmhaFwdKVBlockScaleKargs), QScaleKargsSelector template, and operator() overloads
block_attention_quant_scale_enum.hpp	Added KV_BLOCKSCALE enum value (3) and string conversion template specialization
quant.hpp	Added kv_blockscale enum value and decode/encode logic for command-line interface
fmha_fwd.hpp	Added kv_block_descale parameters to fmha_batch_prefill_args structure
fmha_batch_prefill.py	Restricted kv_blockscale to page_size=1024 in code generation
cpp_symbol_map.py	Added kv_blockscale mappings to QSCALE_MAP and QSCALE_CHECK_MAP

Copilot · 2026-02-02T04:53:24Z

include/ck_tile/ops/fmha/pipeline/block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp

    static constexpr auto kKVMemoryLayout   = Problem::kKVMemoryLayout;
+    static constexpr auto QScaleEnum        = Problem::QScaleEnum;
+
+    // For KV_BLOCKSCALE: shift value for exp2(x + shift) to scale P to [0, 2^shift]


The comment says "For KV_BLOCKSCALE" but these shift constants appear to be general FP8 constants that could be used for both BLOCKSCALE and KV_BLOCKSCALE. In other pipeline files (block_fmha_pipeline_qr_ks_vs.hpp and block_fmha_pipeline_qr_ks_vs_async.hpp), the same constants have comments saying "For BLOCKSCALE". Consider making the comment more general (e.g., "For BLOCKSCALE/KV_BLOCKSCALE") or extracting these constants to a shared location to avoid duplication and inconsistency.

Suggested change

-    // For KV_BLOCKSCALE: shift value for exp2(x + shift) to scale P to [0, 2^shift]
+    // For BLOCKSCALE/KV_BLOCKSCALE: shift value for exp2(x + shift) to scale P to [0, 2^shift]

Copilot · 2026-02-02T04:53:24Z

include/ck_tile/ops/fmha/pipeline/block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp

+    else // !kVTileCrossesPages
+    {
+        // V cache: use physical_pages[k0] for each token
+        // physical_pages was already populated correctly by load_physical_pages(), handling:
+        //   - page_size=1: page_idx maps token_idx -> physical_page directly
+        //   - V tile crosses pages: per-token page lookup
+        //   - V tile in single page: lane0 lookup with broadcast to all lanes


The comment "!kVTileCrossesPages" is misleading. This is the else branch for "if constexpr(kIsKcache)", so it handles V cache regardless of whether the V tile crosses pages or not. The comment should be "!kIsKcache" or "V cache" to accurately describe this branch.

Suggested change

-    else // !kVTileCrossesPages
+    else // !kIsKcache (V cache)
     {
         // V cache: use physical_pages[k0] for each token
         // physical_pages was already populated correctly by load_physical_pages(), handling:
         //   - page_size=1: page_idx maps token_idx -> physical_page directly
         //   - V tile crosses pages: per-token page lookup
         //   - V tile in single page: lane0 lookup with broadcast to all lanes

asleepzzz

some minor issues

poyenc · 2026-02-02T06:23:02Z

include/ck_tile/ops/fmha/pipeline/block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp

+        // all tokens in a main loop iteration belong to the same page
+        if constexpr(QScaleEnum == BlockAttentionQuantScaleEnum::KV_BLOCKSCALE)
+        {
+            static_assert(kPageBlockSize >= kN0, "KV_BLOCKSCALE requires kPageBlockSize >= kN0");


Please clarify that the pipeline only supports kPageBlockSize == 1024 when QScaleEnum == BlockAttentionQuantScaleEnum::KV_BLOCKSCALE.
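The property stated in the code comment, that all tokens in one main-loop iteration belong to the same page, can be checked with a few lines of arithmetic. This is a sketch under assumptions: tiles are taken to start at multiples of the tile size, `tile_stays_in_one_page` is a hypothetical helper, and the tile size 128 is an illustrative value rather than the kN0 used by the pipeline.

```cpp
#include <cassert>

// Hypothetical helper: returns true when every tile of n0 consecutive tokens,
// starting at multiples of n0, lies entirely inside a single page.
constexpr bool tile_stays_in_one_page(int page_size, int n0, int num_tiles)
{
    assert(page_size > 0 && n0 > 0);
    for(int i = 0; i < num_tiles; ++i)
    {
        const int first = i * n0;         // first token of tile i
        const int last  = first + n0 - 1; // last token of tile i
        if(first / page_size != last / page_size)
            return false; // tile straddles a page boundary
    }
    return true;
}

// With page size 1024 and an assumed tile size of 128, tiles never cross pages.
static_assert(tile_stays_in_one_page(1024, 128, 64), "aligned tiles stay in one page");
// A page smaller than the tile cannot contain it.
static_assert(!tile_stays_in_one_page(64, 128, 4), "tile larger than page must cross");
```

In this model a page size that is merely larger than the tile is not enough on its own: `tile_stays_in_one_page(192, 128, 2)` is false because the second tile spans tokens 128 to 255. A page size that is a multiple of the tile size, such as 1024, avoids that case.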

ltqin · 2026-02-02T07:36:50Z

include/ck_tile/ops/fmha/kernel/fmha_batch_prefill_kernel.hpp

    }

    template <bool Cond = kIsGroupMode>
    CK_TILE_HOST static constexpr std::enable_if_t<Cond, Kargs>


Group mode is not used. Should it be deleted, or should we use static_assert to mask it?

It seems Aiter still uses batch prefill kernel with group mode in the mha_batch_prefill_kernels.cu:

    float t = aiter::mha_batch_prefill(args,
                                       stream_config,
                                       dtype_str,
                                       true, // is_group_mode
                                       mask.type,
                                       bias_type,
                                       has_lse,
                                       qscale_type,
                                       false);

Do you have a specific concern about group mode in batch prefill kernel, or was there another reason you suggested removing it? I'm happy to discuss further if there's a better way to structure this.

ltqin · 2026-02-02T07:41:58Z

include/ck_tile/ops/fmha/pipeline/block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp

+            {
+                const index_t scale_offset =
+                    k_physical_pages[0] * kv_block_descale_stride_block +
+                    block_indices.kv_head_idx * kv_block_descale_stride_head;


If group mode is used, this calculation is incorrect. It is recommended to disable group mode.

Could you clarify which part of the calculation you think is incorrect in group mode? I'd like to make sure I understand your concern correctly.

From my understanding, the current implementation should work for both batch mode and group mode:

k_physical_pages[0] is the physical page index obtained from page table lookup. The page_idx pointer is already per-batch adjusted:

SGLang: page_idx = kv_page_indices + kv_indptr[i_batch]

vLLM: page_idx = block_table_ptr + i_batch * batch_stride

kv_head_idx is computed as i_nhead / nhead_ratio_qk, which should be the same for both modes.

The scale tensor is indexed by physical page (not logical sequence position), so the offset calculation physical_page * nblock_stride + kv_head_idx * nhead_stride should be correct regardless of batch mode or group mode.

Is there a specific scenario or edge case you have in mind where this would produce incorrect results? Happy to discuss further!
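The argument in this reply can be written out as a small host-side sketch: the page pointer is adjusted per batch first, and the descale offset then depends only on the physical page and the KV head. The function and parameter names below are hypothetical and only mirror the identifiers quoted in the thread; the real kernel arguments may be laid out differently.

```cpp
#include <cassert>
#include <cstdint>

// SGLang-style layout as described above: a flat page index array plus
// per-batch start positions.
inline const int32_t* page_idx_sglang(const int32_t* kv_page_indices,
                                      const int32_t* kv_indptr,
                                      int i_batch)
{
    return kv_page_indices + kv_indptr[i_batch];
}

// vLLM-style layout as described above: a 2D block table with a fixed row stride.
inline const int32_t* page_idx_vllm(const int32_t* block_table_ptr,
                                    int i_batch,
                                    int64_t batch_stride)
{
    return block_table_ptr + i_batch * batch_stride;
}

// KV head index shared by a group of query heads.
inline int kv_head_index(int i_nhead, int nhead_ratio_qk)
{
    assert(nhead_ratio_qk > 0);
    return i_nhead / nhead_ratio_qk;
}

// Offset into the per-page descale tensor; depends on the physical page and
// the KV head only, not on the logical position within the sequence.
inline int64_t kv_descale_offset(int32_t physical_page,
                                 int kv_head_idx,
                                 int64_t stride_block,
                                 int64_t stride_head)
{
    return physical_page * stride_block + kv_head_idx * stride_head;
}
```

Under these assumptions both layouts hand the same physical page to `kv_descale_offset`, which is the basis for the claim that the offset does not depend on batch mode versus group mode.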

2. Rename QScaleKargsSelector -> GetQScaleKargs for naming consistency 3. Remove unused BlockAttentionQuantScaleEnumToStr<KV_BLOCKSCALE>

…intain flexibility.

ammallya · 2026-02-03T22:00:56Z

Imported to ROCm/rocm-libraries

@Jeff-Huang

Imported from ROCm/composable_kernel#3696. Originally authored by @Jeff-Huang.

Co-authored-by: Jeff Huang <chiachi.huang@amd.com>
Co-authored-by: Illia Silin <Illia.Silin@amd.com>

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> commit 07e9d561402c717946a1c08cfdce2681d5733335 Author: assistant-librarian[bot] <210906412+assistant-librarian[bot]@users.noreply.github.com> Date: Fri Feb 6 16:10:23 2026 -0800 [CK] add inter/intrawave scheduling concept doc (#4300) ## Proposed changes Adding information about inter/intrawave scheduling --- 🔁 Imported from [ROCm/composable_kernel#3660](https://github.com/ROCm/composable_kernel/pull/3660) 🧑‍💻 Originally authored by @spolifroni-amd --------- Co-authored-by: spolifroni-amd <sandra.polifroni@amd.com> Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> commit 4d773b636ca00996e971d55bcd0530f641837b42 Author: JonathanLichtnerAMD <195780826+JonathanLichtnerAMD@users.noreply.github.com> Date: Fri Feb 6 17:09:55 2026 -0700 Add .cline_storage to .gitignore (#4390) ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. commit 738ffd7689ba0759f00c0e9430889b2ed995fa94 Author: Enrico Degregori <73224202+EnricoDeg@users.noreply.github.com> Date: Sat Feb 7 01:09:08 2026 +0100 [CK] Workaround blockscale wp test failure (#4372) ## Motivation Workaround to fix blockscale wp test failure for pipeline v3 ## Technical Details  ## Test Plan  ## Test Result  ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. commit 612bf0b710b399276916c222d8d4c5f9c34f9f62 Author: James Sandham <33790278+jsandham@users.noreply.github.com> Date: Fri Feb 6 18:44:55 2026 -0500 [rocsparse] Add bfloat16 and complex-types tests for code coverage (#4204) ## Motivation Add bfloat16 and complex-types tests for code coverage. 
Also renames the atomic_add tests to belong to pre_checkin so that they run as part of the code coverage pre_checkin tests.

commit 287fbc900071d5f9f8df7efdf1cfd25d9c8ea338
Author: Kerry Wang <kerrywang369@gmail.com>
Date: Fri Feb 6 15:55:22 2026 -0600

don't include waitcnts; renames

commit f52966a377bfd26725f35f103fbc7975cd9b4ec9
Author: Yiqian Liu <157505981+liu-yiqian@users.noreply.github.com>
Date: Fri Feb 6 15:50:20 2026 -0600

[rocRoller] Explicitly convert when typeAcc differs with typeD (#3977)

## Motivation

This PR explicitly converts the data type when the accumulator type differs from the type of matrix D. The purpose of this change is to make the rocRoller client add the same operation as hipblaslt.

## Technical Details

Added a convert operation when the accumulator is not the same type as matrix D.

## Test Plan

All existing tests should pass. This change should not affect performance.

## Test Result

Existing tests passed.

---------

Co-authored-by: yiqialiu <yiqialiu@amd.com>

commit 323a8d256e2409de54fa9dfa9523f4c50295c305
Author: Kerry Wang <kerrywang369@gmail.com>
Date: Fri Feb 6 15:28:17 2026 -0600

minor quality improvements

commit 6c4a1fb6d0f2eff761cd95e690f3ef0090266367
Author: Ali Yazdani <ayazdani@amd.com>
Date: Fri Feb 6 14:24:53 2026 -0700

[Origami] AutoWgm for NonTemporal Kernels. (#4218)

AIGESOLSEL-71

## Motivation

This PR enhances the Origami workgroup mapping (WGM) selection logic to support nontemporal kernels and improve automatic WGM value selection. Previously, nontemporal cases (NTA/NTB > 3) were excluded from automatic workgroup mapping optimizations, limiting potential performance.

## Technical Details

1. Enabling NonTemporal support in AutoWGM with enhanced logic
2. Improved WGM candidate generation

## Test Plan

CI, plus performance tests run locally.

## Test Result

Performance benchmarks show uplifts from these changes.
## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. commit 50e39459b541f978edd9acf645792cf496c16bea Author: Kerry Wang <kerrywang369@gmail.com> Date: Fri Feb 6 15:21:16 2026 -0600 add string-based observer test commit 16b3b1840d61871c116c9ae80cf59324233377d5 Author: Torre Zuk <42548444+TorreZuk@users.noreply.github.com> Date: Fri Feb 6 14:19:00 2026 -0700 [rocBLAS] Users/torrezuk/rocm 1157 amd smi rocblas (#4353) ## Motivation Deprecated dependency rocm-smi replaced by amd-smi ## Test Plan Test frequency reporting using rocblas-bench with environment variable set commit 45b616b1e6df1b1d3816a2f03a05a0f6ab754652 Author: Illia Silin <98187287+illsilin@users.noreply.github.com> Date: Fri Feb 6 10:17:02 2026 -0800 [CK] fix path for build filter (#4375) ## Motivation Fix the filter that determines whether CI builds are necessary. ## Technical Details A script checks the files list returned by git diff and checks whether any code source was modified. If not, if only documentation was changed, it will allow skipping the builds. We make sure we only look at the changes in projects/composablekernel/ folder. commit d8e2826bedff1183eaedeb6d6f5b2eeaa65dab7b Author: Geo Min <geomin12@amd.com> Date: Fri Feb 6 09:59:29 2026 -0800 [ci] Adding mi350 required group ID (#4378) After updating mi325 group-id, we are noticing errors for mi350. Tested here for mi350: https://github.com/ROCm/TheRock/actions/runs/21733399385/job/62692971370 Tested here for mi325: https://github.com/ROCm/TheRock/actions/runs/21759203211/job/62778060417 Adding both work properly commit 78497b37bb4e5853b5da3feb96381c6b643556f7 Author: bibek <108366729+bghimireamd@users.noreply.github.com> Date: Fri Feb 6 10:46:48 2026 -0600 Fix intermittent kernel compilation failures in BnFwdTrainingSpatial (#4202) ## Motivation Fix kernel compilation failures in BnFwdTrainingSpatial caused by two related issues: 1. 
Uninitialized local size variables producing random garbage values 2. Missing compile-time guard for warp-reduction LDS arrays when workgroup < 64 threads ## To Reproduce ```bash rm -rf ~/.config/miopen/*.udb.txt ~/.cache/miopen/ rm -rf /tmp/.config/miopen/ /tmp/.cache/miopen/ MIOPEN_FIND_ENFORCE=SEARCH MIOPEN_LOG_LEVEL=5 ./bin/MIOpenDriver bnorm -n 1024 -c 64 -H 13 -W 13 -m 1 --forw 1 -s 1 -V 1 ``` ## Technical Details ### Bug 1: Uninitialized Variables (Host) Variables declared without initialization: ```cpp size_t xlocalsize, xgridsize; // uninitialized ``` For Variants 0/1/3, early returns skip initialization, leaving garbage values that propagate to kernel template parameters. Depending on stack memory state, errors include: - `error: array is too large (18446744073709545792 elements)` - `error: variable length array declaration cannot have 'static' storage duration` - `error: zero-length arrays are not permitted in HIP device code` ### Bug 2: Missing Compile-Time Guard (Kernel) The warp-reduction path divides LDS size by 64: ```cpp __shared__ FpAccumCType lcl_data_x[MIO_BN_GRP0_FINAL * MIO_BN_GRP1_FINAL * MIO_BN_GRP2_FINAL / 64]; ``` When Variants 0/1/3 set workgroup size to 1×1×1, this computes `1/64 = 0`, producing illegal zero-length arrays. Note that `if constexpr` only suppresses template instantiation, not parsing of ill-formed code like zero-length arrays. 
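The arithmetic behind Bug 2 can be checked at compile time. The sketch below uses hypothetical helper names (`naiveLdsSize`, `clampedLdsSize`) in place of the MIOpen macros. It shows why a 1×1×1 workgroup produces an array size of zero under integer division, and that clamping the result to at least 1 keeps the declaration legal without changing the ≥ 64 case:

```cpp
#include <cstddef>

// Hypothetical helpers for illustration only; the kernel itself works with
// the MIO_BN_GRP*_FINAL macros directly.
constexpr std::size_t naiveLdsSize(std::size_t g0, std::size_t g1, std::size_t g2)
{
    return g0 * g1 * g2 / 64; // integer division truncates toward zero
}

constexpr std::size_t clampedLdsSize(std::size_t g0, std::size_t g1, std::size_t g2)
{
    const std::size_t total = g0 * g1 * g2;
    return total >= 64 ? total / 64 : 1; // never smaller than one element
}

static_assert(naiveLdsSize(1, 1, 1) == 0, "1/64 truncates to 0: not a legal array size");
static_assert(naiveLdsSize(256, 1, 1) == 4, "large workgroups are unaffected");
static_assert(clampedLdsSize(1, 1, 1) == 1, "clamped size stays valid");
static_assert(clampedLdsSize(256, 1, 1) == 4, "clamping leaves the >= 64 case unchanged");
```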
## Fix ### Host side (`common_spatial.hpp`, `forward_spatial.cpp`) Initialize variables to safe defaults: ```cpp size_t xlocalsize = 1, xgridsize = 1; size_t ylocalsize = 1, ygridsize = 1; size_t zlocalsize = 1, zgridsize = 1; size_t nelements = 1; unsigned int ldsgcn = 0, ldsnogcn = 0; ``` ### Kernel side (`MIOpenBatchNormFwdTrainSpatial.cpp`) Use C++17 constexpr ternary to ensure array size is always ≥ 1: ```cpp else { // C++17 idiomatic: ensure array size is never zero using constexpr ternary constexpr auto grp_final_total = MIO_BN_GRP0_FINAL * MIO_BN_GRP1_FINAL * MIO_BN_GRP2_FINAL; constexpr auto lds_gcn_array_size = grp_final_total >= 64 ? grp_final_total / 64 : 1; commitID = 64; __shared__ FpAccumCType lcl_data_x[lds_gcn_array_size]; __shared__ FpAccumCType lcl_data_y[lds_gcn_array_size]; miopen::reduction::gcn_reduce2(...); } ``` __Why this works:__ - `constexpr` ensures compile-time evaluation (zero runtime overhead) - When workgroup ≥ 64: array size = `grp_final_total / 64` (correct, same as before) - When workgroup < 64: array size = 1 (valid), but this `else` branch is never taken due to `if constexpr` guard - Dead code elimination removes the unused size-1 arrays from the binary ## Test Plan - Existing batchnorm training tests pass - Verified no zero-length array errors with deterministic workgroup sizes - Confirmed warp-reduction path only executes when workgroup ≥ 64 threads commit 8f8b97a40d36cb4095e929b0ef1b71ffea7ba170 Author: SreecharanGundaboluAMD <sgundabo@amd.com> Date: Fri Feb 6 08:18:12 2026 -0800 [miopen] upgrade clang-format (#4194) This PR updates the project's code formatting tooling to use `clang-format-18` instead of `clang-format-12` throughout the codebase as a transition as we move towards TheRock for our CI. **Tooling and Configuration Updates:** * Updated all references to `clang-format-12` to `clang-format-18` in the pre-commit hook (`.githooks/pre-commit`), CMake configuration (`ClangCheck.cmake`), and Dockerfile (`Dockerfile`). 
The Dockerfile now also adds the appropriate LLVM 18 repository and keyring for installation. [[1]](diffhunk://#diff-1436c8126d575a7576d98d0bc8a8c6d27e8eb4e2d7241d61fe64c286c0d7365cL7-R7) [[2]](diffhunk://#diff-fc024f0d7573d33039081dab6b12f76f0f34c8e07e014552daa1bed9a276a548L9-R9) [[3]](diffhunk://#diff-32304f8a254e46fb8ff524cf4c488eb6013ab54a89ca62709cfb20ccf58976f9R54-R61) ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. commit b34fa62134425a611b18c05aae687f1bc09c7d87 Author: BrianHarrisonAMD <169072757+BrianHarrisonAMD@users.noreply.github.com> Date: Fri Feb 6 08:57:01 2026 -0700 [hipDNN] Remove spdlog dependency for consumers of hipdnn (#4312) ## Motivation Draft of changes to remove spdlog and fmt dependencies from hipDNN frontend & consumer facing SDKs. Consumers of hipDNN can have conflicts as a result of these dependencies. Removing these extra dependencies, and relying on the C API logging methods from hipDNN backend will reduce friction, and make it easier for adoption of hipDNN. ## Technical Details - Remove spdlog and fmt from frontend + consumer SDKs. - Add new string stream style formatter that will forward to backend logging API callback - Note: since we are set at C++17 standards, we cannot use std::format. - Add Spdlog and fmt optional dependencies as opt in for plugin_sdk - This enables existing plugins to maintain logging style they have in place, and allows plugin authors to decide what style of logging to use. 
- Unify consumption of spdlog & fmt dependencies in CMake's using unified method's ## Test Plan - Ensure build and tests are working for all components - Ensure tests with logs enabled are working properly with expected format - Ensure build of samples is working, and logging format is correct ## Test Result Build, tests, and testing with logging enabled is working locally for hipDNN, providers, and samples commit 0c37fdc37ba545b2ced5211b9f59c2381fc93753 Author: amd-chunxlin <chunxlin@amd.com> Date: Fri Feb 6 09:44:39 2026 -0600 [rocRoller] Address long StreamK test runtimes (#4095) ## Motivation Some streamK tests take long time to finish, and this PR addresses the performance issue. <details> <summary>Comparison of test runtimes</summary> | Test name | Develop branch | This branch | | --- | --- | --- | | GPU_BasicGEMMStreamKWorkgroupMapping/0 | 14.412s | 4.059s| | GPU_BasicGEMMStreamKWorkgroupMapping/1 | 14.447s | 4.018s| | GPU_BasicGEMMStreamKWorkgroupMapping/2 | 14.452s | 4.026s| | GPU_BasicGEMMStreamKWorkgroupMapping/3 | 14.444s | 4.042s| | GPU_BasicGEMMStreamKWorkgroupMapping/4 | 38.872s | 9.446s| | GPU_BasicGEMMStreamKWorkgroupMapping/5 | 38.828s | 9.445s| | GPU_BasicGEMMStreamKWorkgroupMapping/6 | 38.913s | 9.446s| | GPU_BasicGEMMStreamKWorkgroupMapping/7 | 38.812s | 9.435s| | GPU_BasicGEMMStreamKWorkgroupMapping/8 | 38.878s | 9.456s| | GPU_BasicGEMMStreamKWorkgroupMapping/9 | 38.889s | 9.463s| | GPU_BasicGEMMStreamKWorkgroupMapping/10 | 38.884s | 9.463s| | GPU_BasicGEMMStreamKWorkgroupMapping/11 | 38.859s | 9.463s| | GPU_BasicGEMMStreamKWorkgroupMapping/12 | 14.414s | 4.035s| | GPU_BasicGEMMStreamKWorkgroupMapping/13 | 14.429s | 4.024s| | GPU_BasicGEMMStreamKWorkgroupMapping/14 | 14.47s | 4.045s| | GPU_BasicGEMMStreamKWorkgroupMapping/15 | 14.428s | 4.044s| | GPU_BasicGEMMStreamKWorkgroupMapping/16 | 38.854s | 9.454s| | GPU_BasicGEMMStreamKWorkgroupMapping/17 | 38.861s | 9.448s| | GPU_BasicGEMMStreamKWorkgroupMapping/18 | 38.95s | 9.461s| | 
GPU_BasicGEMMStreamKWorkgroupMapping/19 | 38.826s | 9.458s| | GPU_BasicGEMMStreamKWorkgroupMapping/20 | 38.833s | 9.48s| | GPU_BasicGEMMStreamKWorkgroupMapping/21 | 38.888s | 9.472s| | GPU_BasicGEMMStreamKWorkgroupMapping/22 | 38.868s | 9.477s| | GPU_BasicGEMMStreamKWorkgroupMapping/23 | 38.907s | 9.485s| | GPU_BasicGEMMStreamKWorkgroupMapping/24 | 14.426s | 4.025s| | GPU_BasicGEMMStreamKWorkgroupMapping/25 | 14.435s | 4.051s| | GPU_BasicGEMMStreamKWorkgroupMapping/26 | 14.472s | 4.053s| | GPU_BasicGEMMStreamKWorkgroupMapping/27 | 14.471s | 4.058s| | GPU_BasicGEMMStreamKWorkgroupMapping/28 | 38.879s | 9.457s| | GPU_BasicGEMMStreamKWorkgroupMapping/29 | 38.814s | 9.445s| | GPU_BasicGEMMStreamKWorkgroupMapping/30 | 38.853s | 9.45s| | GPU_BasicGEMMStreamKWorkgroupMapping/31 | 38.963s | 9.458s| | GPU_BasicGEMMStreamKWorkgroupMapping/32 | 38.924s | 9.466s| | GPU_BasicGEMMStreamKWorkgroupMapping/33 | 38.898s | 9.482s| | GPU_BasicGEMMStreamKWorkgroupMapping/34 | 38.951s | 9.455s| | GPU_BasicGEMMStreamKWorkgroupMapping/35 | 38.924s | 9.459s| | GPU_BasicGEMMStreamKWorkgroupMapping/36 | 14.461s | 4.037s| | GPU_BasicGEMMStreamKWorkgroupMapping/37 | 14.452s | 4.032s| | GPU_BasicGEMMStreamKWorkgroupMapping/38 | 14.43s | 4.053s| | GPU_BasicGEMMStreamKWorkgroupMapping/39 | 14.43s | 4.042s| | GPU_BasicGEMMStreamKWorkgroupMapping/40 | 38.868s | 9.473s| | GPU_BasicGEMMStreamKWorkgroupMapping/41 | 38.925s | 9.461s| | GPU_BasicGEMMStreamKWorkgroupMapping/42 | 38.884s | 9.452s| | GPU_BasicGEMMStreamKWorkgroupMapping/43 | 38.925s | 9.455s| | GPU_BasicGEMMStreamKWorkgroupMapping/44 | 39.012s | 9.476s| | GPU_BasicGEMMStreamKWorkgroupMapping/45 | 38.915s | 9.479s| | GPU_BasicGEMMStreamKWorkgroupMapping/46 | 38.933s | 9.457s| | GPU_BasicGEMMStreamKWorkgroupMapping/47 | 38.936s | 9.469s| | GPU_BasicGEMMStreamKWorkgroupMapping/48 | 14.461s | 4.041s| | GPU_BasicGEMMStreamKWorkgroupMapping/49 | 14.468s | 4.049s| | GPU_BasicGEMMStreamKWorkgroupMapping/50 | 14.466s | 4.046s| | 
GPU_BasicGEMMStreamKWorkgroupMapping/51 | 14.479s | 4.038s| | GPU_BasicGEMMStreamKWorkgroupMapping/52 | 38.907s | 9.473s| | GPU_BasicGEMMStreamKWorkgroupMapping/53 | 38.914s | 9.471s| | GPU_BasicGEMMStreamKWorkgroupMapping/54 | 38.885s | 9.463s| | GPU_BasicGEMMStreamKWorkgroupMapping/55 | 38.891s | 9.463s| | GPU_BasicGEMMStreamKWorkgroupMapping/56 | 38.859s | 9.472s| | GPU_BasicGEMMStreamKWorkgroupMapping/57 | 38.899s | 9.475s| | GPU_BasicGEMMStreamKWorkgroupMapping/58 | 38.936s | 9.47s| | GPU_BasicGEMMStreamKWorkgroupMapping/59 | 38.952s | 9.472s| | GPU_BasicGEMMStreamKWorkgroupMapping/60 | 14.485s | 4.036s| | GPU_BasicGEMMStreamKWorkgroupMapping/61 | 14.419s | 4.031s| | GPU_BasicGEMMStreamKWorkgroupMapping/62 | 14.455s | 4.035s| | GPU_BasicGEMMStreamKWorkgroupMapping/63 | 14.488s | 4.045s| | GPU_BasicGEMMStreamKWorkgroupMapping/64 | 38.846s | 9.462s| | GPU_BasicGEMMStreamKWorkgroupMapping/65 | 38.923s | 9.475s| | GPU_BasicGEMMStreamKWorkgroupMapping/66 | 38.884s | 9.463s| | GPU_BasicGEMMStreamKWorkgroupMapping/67 | 38.93s | 9.441s| | GPU_BasicGEMMStreamKWorkgroupMapping/68 | 38.9s | 9.473s| | GPU_BasicGEMMStreamKWorkgroupMapping/69 | 38.914s | 9.477s| | GPU_BasicGEMMStreamKWorkgroupMapping/70 | 38.961s | 9.487s| | GPU_BasicGEMMStreamKWorkgroupMapping/71 | 38.93s | 9.489s| | GPU_BasicGEMMStreamKWorkgroupMapping/72 | 14.432s | 4.051s| | GPU_BasicGEMMStreamKWorkgroupMapping/73 | 14.442s | 4.042s| | GPU_BasicGEMMStreamKWorkgroupMapping/74 | 14.44s | 4.036s| | GPU_BasicGEMMStreamKWorkgroupMapping/75 | 14.456s | 4.042s| | GPU_BasicGEMMStreamKWorkgroupMapping/76 | 38.891s | 9.486s| | GPU_BasicGEMMStreamKWorkgroupMapping/77 | 38.938s | 9.444s| | GPU_BasicGEMMStreamKWorkgroupMapping/78 | 38.919s | 9.448s| | GPU_BasicGEMMStreamKWorkgroupMapping/79 | 38.858s | 9.455s| | GPU_BasicGEMMStreamKWorkgroupMapping/80 | 38.878s | 9.491s| | GPU_BasicGEMMStreamKWorkgroupMapping/81 | 38.975s | 9.478s| | GPU_BasicGEMMStreamKWorkgroupMapping/82 | 38.945s | 9.496s| | 
GPU_BasicGEMMStreamKWorkgroupMapping/83 | 38.905s | 9.477s| | GPU_BasicGEMMStreamKWorkgroupMapping/84 | 14.433s | 4.033s| | GPU_BasicGEMMStreamKWorkgroupMapping/85 | 14.41s | 4.038s| | GPU_BasicGEMMStreamKWorkgroupMapping/86 | 14.478s | 4.044s| | GPU_BasicGEMMStreamKWorkgroupMapping/87 | 14.473s | 4.035s| | GPU_BasicGEMMStreamKWorkgroupMapping/88 | 38.927s | 9.49s| | GPU_BasicGEMMStreamKWorkgroupMapping/89 | 38.938s | 9.472s| | GPU_BasicGEMMStreamKWorkgroupMapping/90 | 38.913s | 9.454s| | GPU_BasicGEMMStreamKWorkgroupMapping/91 | 38.835s | 9.463s| | GPU_BasicGEMMStreamKWorkgroupMapping/92 | 38.901s | 9.494s| | GPU_BasicGEMMStreamKWorkgroupMapping/93 | 38.864s | 9.488s| | GPU_BasicGEMMStreamKWorkgroupMapping/94 | 38.839s | 9.508s| | GPU_BasicGEMMStreamKWorkgroupMapping/95 | 38.921s | 9.487s| | GPU_BasicGEMMStreamKWorkgroupMapping/96 | 14.445s | 4.05s| | GPU_BasicGEMMStreamKWorkgroupMapping/97 | 14.432s | 4.041s| | GPU_BasicGEMMStreamKWorkgroupMapping/98 | 14.472s | 4.055s| | GPU_BasicGEMMStreamKWorkgroupMapping/99 | 14.449s | 4.039s| | GPU_BasicGEMMStreamKWorkgroupMapping/100 | 38.87s | 9.485s| | GPU_BasicGEMMStreamKWorkgroupMapping/101 | 38.866s | 9.474s| | GPU_BasicGEMMStreamKWorkgroupMapping/102 | 38.836s | 9.471s| | GPU_BasicGEMMStreamKWorkgroupMapping/103 | 38.847s | 9.464s| | GPU_BasicGEMMStreamKWorkgroupMapping/104 | 38.985s | 9.468s| | GPU_BasicGEMMStreamKWorkgroupMapping/105 | 38.928s | 9.469s| | GPU_BasicGEMMStreamKWorkgroupMapping/106 | 38.866s | 9.496s| | GPU_BasicGEMMStreamKWorkgroupMapping/107 | 38.92s | 9.48s| | GPU_BasicGEMMStreamKWorkgroupMapping/108 | 14.452s | 4.046s| | GPU_BasicGEMMStreamKWorkgroupMapping/109 | 14.448s | 4.044s| | GPU_BasicGEMMStreamKWorkgroupMapping/110 | 14.469s | 4.044s| | GPU_BasicGEMMStreamKWorkgroupMapping/111 | 14.457s | 4.052s| | GPU_BasicGEMMStreamKWorkgroupMapping/112 | 38.895s | 9.474s| | GPU_BasicGEMMStreamKWorkgroupMapping/113 | 38.901s | 9.453s| | GPU_BasicGEMMStreamKWorkgroupMapping/114 | 38.882s | 9.467s| | 
GPU_BasicGEMMStreamKWorkgroupMapping/115 | 38.861s | 9.446s| | GPU_BasicGEMMStreamKWorkgroupMapping/116 | 38.945s | 9.498s| | GPU_BasicGEMMStreamKWorkgroupMapping/117 | 38.94s | 9.488s| | GPU_BasicGEMMStreamKWorkgroupMapping/118 | 38.881s | 9.488s| | GPU_BasicGEMMStreamKWorkgroupMapping/119 | 38.911s | 9.486s| | GPU_BasicGEMMStreamKWorkgroupMapping/120 | 14.434s | 4.044s| | GPU_BasicGEMMStreamKWorkgroupMapping/121 | 14.455s | 4.061s| | GPU_BasicGEMMStreamKWorkgroupMapping/122 | 14.464s | 4.067s| | GPU_BasicGEMMStreamKWorkgroupMapping/123 | 14.488s | 4.058s| | GPU_BasicGEMMStreamKWorkgroupMapping/124 | 38.932s | 9.499s| | GPU_BasicGEMMStreamKWorkgroupMapping/125 | 38.868s | 9.488s| | GPU_BasicGEMMStreamKWorkgroupMapping/126 | 38.863s | 9.488s| | GPU_BasicGEMMStreamKWorkgroupMapping/127 | 38.899s | 9.467s| | GPU_BasicGEMMStreamKWorkgroupMapping/128 | 38.948s | 9.511s| | GPU_BasicGEMMStreamKWorkgroupMapping/129 | 38.915s | 9.502s| | GPU_BasicGEMMStreamKWorkgroupMapping/130 | 38.932s | 9.508s| | GPU_BasicGEMMStreamKWorkgroupMapping/131 | 38.914s | 9.491s| | GPU_BasicGEMMStreamKWorkgroupMapping/132 | 14.42s | 4.031s| | GPU_BasicGEMMStreamKWorkgroupMapping/133 | 14.464s | 4.054s| | GPU_BasicGEMMStreamKWorkgroupMapping/134 | 14.494s | 4.048s| | GPU_BasicGEMMStreamKWorkgroupMapping/135 | 14.48s | 4.046s| | GPU_BasicGEMMStreamKWorkgroupMapping/136 | 38.911s | 9.49s| | GPU_BasicGEMMStreamKWorkgroupMapping/137 | 38.91s | 9.478s| | GPU_BasicGEMMStreamKWorkgroupMapping/138 | 38.914s | 9.478s| | GPU_BasicGEMMStreamKWorkgroupMapping/139 | 38.92s | 9.475s| | GPU_BasicGEMMStreamKWorkgroupMapping/140 | 38.883s | 9.508s| | GPU_BasicGEMMStreamKWorkgroupMapping/141 | 38.935s | 9.5s| | GPU_BasicGEMMStreamKWorkgroupMapping/142 | 38.848s | 9.494s| | GPU_BasicGEMMStreamKWorkgroupMapping/143 | 38.932s | 9.502s| </details> ## Technical Details - Short-circuit expression comparison - Remove unused code - Caching expressions of kernel arguments to eliminate redundant regeneration. 
- Change `AssemblyKernelArgument` to a `class` ## Test Plan No functional changes; covered by existing tests. ## Test Result See CI report commit 66e22ac6c6c0c286325cabf2b1faa269ea640446 Author: hcman2 <52367956+hcman2@users.noreply.github.com> Date: Fri Feb 6 10:55:08 2026 +0800 [formocast] [tensilelite] enable tuning with formocast (#4043) ## Motivation Integration plan of formocast and origami : Steps 1.Push Formocast code to the origami subfolder. (we are here now) 2.Submit tuning code calling the API of origami. 3.Push and reserve origami prediction mode. Add APIs to pass sizemapping data via config_t. Use an environment variable to switch modes without affecting other code. 4.Enable predictionThreshold with tox tests. 5.Push Origami code with Formocast backend. This step will enable the bench with different modes. 6.Refine Formocast and Origami to verify API usage and identify functions to move. This PR is to include step2 and step4. ## Technical Details  ## Test Plan  ## Test Result  ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Henry Ho <hehe790223@gmail.com> Co-authored-by: Peter Cheng <Peter.Cheng@amd.com> commit 37a74ef54eaa1bb1df603db6ec8aff22f342bc71 Author: Illia Silin <98187287+illsilin@users.noreply.github.com> Date: Thu Feb 5 17:06:57 2026 -0800 [CK] a bunch of CI fixes. 
(#4361)

## Motivation

Fixing some of the CK CI issues

## Technical Details

- Fixing paths to dockerfiles and scripts
- Moving codegen tests to a separate stage (they collide with the main build because cmake must be called from the same folder with different options)
- Fixing a couple of clang compilation issues with the staging compiler

commit 808e9496d17be0826164dbb34457f54e157dd2bd
Author: Kerry Wang <kerrywang369@gmail.com>
Date: Thu Feb 5 17:00:01 2026 -0600

refactor observers to have new runtime with context concept

commit 3c9beb38b8dba1301a961cc5dc3f44ca9d4185e3
Author: Eiden Yoshida <47196116+eidenyoshida@users.noreply.github.com>
Date: Thu Feb 5 17:56:12 2026 -0500

[CK] MICI: Fix git diff in selective_test_filter.py (#4352)

## Motivation

- git diff needs access to the reference repo

## Technical Details

- Mount the reference repo path into docker so that selective_test_filter.py can access it

## Test Plan

- Tested in MICI

## Test Result

- launch_tests.sh ran successfully

commit 1663ac026d46c3dd02edb73d2bfa7310c54695d5
Author: Torre Zuk <42548444+TorreZuk@users.noreply.github.com>
Date: Thu Feb 5 14:20:45 2026 -0700

[rocBLAS] trsm doc & test; trsv change noted in log (#4198)

## Motivation

- Tests trsm use of the new trsv kernel for big batches
- Adds a changelog note on trsv, which used to call hipGetDevice and potentially hipSetDevice

commit d8bb9d2b9fe278d07fe63b395d87268c94e53fd8
Author: Jeffrey Novotny <jnovotny@amd.com>
Date: Thu Feb 5 14:54:48 2026 -0500

[rocsolver] Doxygen API cleanup part 2 (#4330)

## Motivation

Continuing with rocSOLVER API/Doxygen copy edits and polishing

## Technical Details

Edit Doxygen comments in header file.

## Test Plan

Build locally.

## Test Result

NA

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
commit 5aa1f1d4c189f779ea699be250fb1b284f3d6ac2 Author: Geo Min <geomin12@amd.com> Date: Thu Feb 5 11:01:53 2026 -0800 [ci] Updating variable group-id for OSSCI (#4360) OSSCI migrated mi325s, so need a new groupID Sanity works here: https://github.com/ROCm/TheRock/actions/runs/21723540679/job/62659665907 normal run works here: https://github.com/ROCm/TheRock/actions/runs/21723540679/job/62659791422 I've dabbled with organization variables, however, this does not work for forks so for now, we will do the manual update commit 6273d3b30b32d6e0856394d37a421153dceb33c1 Author: Dmitrii Polomin <dmitriy.polomin@dxc.com> Date: Thu Feb 5 19:53:45 2026 +0100 [MIOpen] Ported solver test to gtest (#3713) ## Motivation Porting tests from CTest to GTest, in this case, `solver.cpp` ## Technical Details Pretty straightforward port, although I had to get creative in order to conform to `INSTANTIATE_TEST_SUITE_P` pattern and naming conventions ## Test Plan Running locally, using the CI launched by this PR ## Test Result See CI actions launched by this PR ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. commit ece0c48dd152856c3b443c12c58a89bc7d7c34e5 Author: Nathan Henderson <nathan.henderson@amd.com> Date: Thu Feb 5 09:03:42 2026 -0700 [rocroller] Use StreamKMode in hipBLASLt integration and client (#4028) ## Motivation RocRoller has a `StreamKMode` enum that is used internally and in the GEMM tests. However, the client and hipBLASLt integration still used boolean values to represent the StreamK state. This PR replaces the individual boolean flags (`--streamK`, `--streamKTwoTile`, `--streamKTwoTileDPFirst`) with a single `--streamK` string option that accepts one of the `StreamKMode` values (`None`, `Standard`, `TwoTile`, or `TwoTileDPFirst`). 
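As an illustration of the single-option design, here is a minimal sketch of parsing a `--streamK` value into an enum. The enumerator names come from the description above; `parseStreamKMode`, `ModeEntry`, and the lookup table are hypothetical and are not the rocRoller implementation:

```cpp
#include <optional>
#include <string_view>

enum class StreamKMode
{
    None,
    Standard,
    TwoTile,
    TwoTileDPFirst
};

// Hypothetical table entry pairing the CLI spelling with its mode.
struct ModeEntry
{
    std::string_view name;
    StreamKMode mode;
};

// Hypothetical helper: maps the CLI string to a mode, or returns
// std::nullopt when the string is not one of the accepted values.
inline std::optional<StreamKMode> parseStreamKMode(std::string_view text)
{
    constexpr ModeEntry table[] = {
        {"None", StreamKMode::None},
        {"Standard", StreamKMode::Standard},
        {"TwoTile", StreamKMode::TwoTile},
        {"TwoTileDPFirst", StreamKMode::TwoTileDPFirst},
    };
    for(const ModeEntry& entry : table)
    {
        if(entry.name == text)
        {
            return entry.mode;
        }
    }
    return std::nullopt;
}
```

A single value also rules out contradictory combinations that three independent booleans would allow, and an unrecognized string can be rejected at the command line instead of being silently ignored.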
## Technical Details

- Update the rocRoller GEMM client to use `StreamKMode` enum instead of three separate booleans
- Update the hipBLASLt rocRoller integration to match
- Add `enumStrings<T>()` utility function in `Utils_impl.hpp` for CLI validation of enum values

## Test Plan

Update `test_gemm_client.py` YAML fixtures to use the new `streamK: None` format.

## Test Result

Validated by the StreamK rrperf tests

commit 3b98c98a23e76075a6a1e4e580482a627e39d59b
Author: Jobbins <john.robbins@amd.com>
Date: Thu Feb 5 08:56:42 2026 -0700

[composablekernel] fix failure status (#4351)

## Motivation

Pipelines were failing on Math CI status check.

## Technical Details

For the success case, we just changed the config in Jenkins to use a proper app token and no code changes were required. However, the failure case would not have worked as coded, so we needed to move that outside of the `rocmnode()` block.

## Test Plan

I removed all of the CI in one of the commits to quickly test, and then added it back. Got a successful "success" message and "failure" message produced

commit 9bb7f5c31253643cd72363314c3d3ee02f723406
Author: Eiden Yoshida <47196116+eidenyoshida@users.noreply.github.com>
Date: Thu Feb 5 10:55:44 2026 -0500

[CK] MICI: Correct path for build trace script (#4349)

## Motivation

- Corrects path to script due to superrepo migration
- Forces all tests to run by default

## Technical Details

- now in /projects/composablekernel

---------

Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>

commit 120f91dd211117e308b3713593ac7f061cc02c08
Author: bibek <108366729+bghimireamd@users.noreply.github.com>
Date: Thu Feb 5 09:47:16 2026 -0600

[HIPDNN][DOC] Add TYPED_TEST guidance for multi-datatype tests (#4000)

## Motivation

Doc update : add `TYPED_TEST` guidance for multi-datatype tests

Update `.clinerules`, `.cursor/rules/testing.mdc`, and `docs/Testing.md` to recommend `TYPED_TEST` for tests covering `float`/`half`/`bfloat16`.

## Submission Checklist

- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Samuel Reeder <41528605+SamuelReeder@users.noreply.github.com>

commit d26a7820b58f789d19efd6064d4c2c4f4fc72a95
Author: Eiden Yoshida <47196116+eidenyoshida@users.noreply.github.com>
Date: Wed Feb 4 21:43:22 2026 -0500

[CK] MICI: Use reference repo for checkout operations (#4336)

## Motivation

- Maintain a reference repo on slave nodes that speeds up any clone/checkout operations

## Technical Details

- clone a ref repo if it does not exist
- update ref repo if it does exist
- checkout after ref repo is updated
- eliminates double clone

## Test Result

- Initial checkouts succeeded

commit f2f187ab40738272232f571f58112697da405b1a
Author: Geo Min <geomin12@amd.com>
Date: Wed Feb 4 15:43:38 2026 -0800

[ci] Fixing rocm-libs race condition (#4192)

Currently, there is a race condition that overwrites BLAS libraries during MIOpen/hipdnn builds. (error: https://github.com/ROCm/rocm-libraries/actions/runs/21228188053/job/61080555083)

Tested locally:

```
# With all three
geom@geom:~/Code/rocm-libraries/.github/scripts$ PLATFORM="linux" python3 therock_matrix.py
[{'cmake_options': '-DTHEROCK_USE_EXTERNAL_COMPOSABLE_KERNEL=ON -DTHEROCK_ENABLE_COMPOSABLE_KERNEL=ON -DTHEROCK_ENABLE_MIOPEN=ON -DTHEROCK_ENABLE_MIOPEN_PLUGIN=ON -DTHEROCK_ENABLE_ALL=OFF -DTHEROCK_COMPOSABLE_KERNEL_SOURCE_DIR=../composable_kernel', 'project_to_test': 'miopen_plugin,miopen,hipdnn'}]

# Only hipdnn
geom@geom:~/Code/rocm-libraries/.github/scripts$ PLATFORM="linux" python3 therock_matrix.py
[{'cmake_options': '-DTHEROCK_USE_EXTERNAL_COMPOSABLE_KERNEL=ON -DTHEROCK_ENABLE_ALL=OFF -DTHEROCK_COMPOSABLE_KERNEL_SOURCE_DIR=../composable_kernel -DTHEROCK_ENABLE_MIOPEN_PLUGIN=ON -DTHEROCK_ENABLE_COMPOSABLE_KERNEL=ON', 'project_to_test': 'hipdnn,miopen_plugin'}]
```

This fixes this error, as if all libraries are ran, they will combine (saves resources + no overwriting)

commit f34aec25c434b3044b75481d70693af3bf0ade1e
Author: assistant-librarian[bot] <210906412+assistant-librarian[bot]@users.noreply.github.com>
Date: Wed Feb 4 18:25:31 2026 -0500

[CK] Add FP8 KV_BLOCKSCALE support for batch prefill (#4263)

Implement per-page K/V quantization for paged attention:

- Add KV_BLOCKSCALE enum to BlockAttentionQuantScaleEnum
- Use exp2 shift trick to eliminate explicit P scaling overhead
- Prefetch physical pages offset for KV cache, overlaps with computations

## Proposed changes

Please describe the motivation behind the pull request, whether it enables a new feature or fixes a bug. If there are associated pull requests or issues, please link them to the pull request.

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant after this pull request
- [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

---

🔁 Imported from [ROCm/composable_kernel#3696](https://github.com/ROCm/composable_kernel/pull/3696)
🧑‍💻 Originally authored by @Jeff-Huang

---------

Co-authored-by: Jeff Huang <chiachi.huang@amd.com>
Co-authored-by: Illia Silin <Illia.Silin@amd.com>

commit df32df51ea6a9ebfba9a459c77ed82c4877df22b
Author: Yiqian Liu <157505981+liu-yiqian@users.noreply.github.com>
Date: Wed Feb 4 17:12:11 2026 -0600

[rocRoller] [hipblaslt] Enable more workgroup tile sizes for pre-swizzled scale data (#4175)

## Motivation

Pre-swizzle is an optimization that pre-swizzle the scale data to match the layout that kernel expects. The purpose of this PR is to add more possible workgroup tile sizes that supports pre-swizzled scale data.

## Technical Details

1. Configure the workgroup tile size to 256 at K dimension when the input data format is pre-swizzled.
2. Filter out the invalid workgroup tile size (i.e., MN dimension is not multiple of 32, or MN dimension is 96).
3. Configure the solution parameters for pre-swizzled input.

## Test Plan

1. Added a rocRoller client test that uses 32x32x256 workgroup tile.

## Test Result

1. This PR should not change any kernel that the data is not pre-swizzled.
2. All the tests should pass and no performance changes.

## Submission Checklist

- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Yiqian Liu <yiqialiu@ctr2-alola-ctrl-01.amd.com>
Co-authored-by: yiqialiu <yiqialiu@amd.com>

commit 87d1a8fa005ef2f75e48e5c9c4e70f8235236b03
Author: Samuel Reeder <41528605+SamuelReeder@users.noreply.github.com>
Date: Wed Feb 4 16:10:01 2026 -0700

Use `--latest-release` flag for installing rocm in clang-tidy (#4120)

## Motivation

`--latest-release` was added in TheRock [2997](https://github.com/ROCm/TheRock/pull/2997) to grab latest nightly for the specified target. We can use this to keep the clang-tidy workflow somewhat up-to-date.

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit c5…

Jeff-Huang requested review from Snektron, ThomasNing, afagaj, andriy-ca, aosewski, asleepzzz, bartekxk, carlushuang, cgmillette, coderfeli, geyyer, illsilin, poyenc, qianfengz, shumway, tenpercent, vidyasagar-amd and vpietila-amd as code owners January 31, 2026 04:01

poyenc requested a review from Copilot February 2, 2026 04:45

Copilot started reviewing on behalf of poyenc February 2, 2026 04:46

Copilot AI reviewed Feb 2, 2026

asleepzzz reviewed Feb 2, 2026

include/ck_tile/ops/fmha/pipeline/block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp (outdated, resolved)

asleepzzz reviewed Feb 2, 2026

include/ck_tile/ops/fmha/kernel/fmha_batch_prefill_kernel.hpp (resolved)

asleepzzz reviewed Feb 2, 2026

poyenc reviewed Feb 2, 2026

example/ck_tile/01_fmha/codegen/ops/fmha_batch_prefill.py (outdated, resolved)

poyenc reviewed Feb 2, 2026

include/ck_tile/ops/fmha/kernel/fmha_batch_prefill_kernel.hpp (outdated, resolved)

poyenc reviewed Feb 2, 2026

include/ck_tile/ops/fmha/block/block_attention_quant_scale_enum.hpp (outdated, resolved)

poyenc reviewed Feb 2, 2026

include/ck_tile/ops/fmha/pipeline/block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp (outdated, resolved)

poyenc reviewed Feb 2, 2026

include/ck_tile/ops/fmha/pipeline/block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp (outdated, resolved)

ltqin self-requested a review February 2, 2026 06:55

ltqin reviewed Feb 2, 2026

include/ck_tile/ops/fmha/kernel/fmha_batch_prefill_kernel.hpp (outdated, resolved)

Jeff-Huang added 5 commits February 3, 2026 11:18

dc74e66: Add runtime check nullptr for prevent quantization parameters.

e1af9b7:
1. Relax kv_blockscale page_size restriction from == 1024 to >= kN0
2. Rename QScaleKargsSelector -> GetQScaleKargs for naming consistency
3. Remove unused BlockAttentionQuantScaleEnumToStr<KV_BLOCKSCALE>

4933100: use statically_indexed_array instead of c-style array.

8b07580: Rename stride fields in FmhaFwdKVBlockScaleKargs

Jeff-Huang force-pushed the batch-prefill-fp8-kvcache-blockscale branch from 53d41c9 to 8b07580 Compare February 3, 2026 03:18

c093935: Split kv_block_descale_ptr into k_descale_ptr and v_descale_ptr to maintain flexibility.

assistant-librarian bot mentioned this pull request Feb 3, 2026

[CK] Add FP8 KV_BLOCKSCALE support for batch prefill ROCm/rocm-libraries#4263

Merged


ammallya closed this Feb 3, 2026

	// For KV_BLOCKSCALE: shift value for exp2(x + shift) to scale P to [0, 2^shift]
	// For BLOCKSCALE/KV_BLOCKSCALE: shift value for exp2(x + shift) to scale P to [0, 2^shift]
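
The comment above describes folding the P scale into the exponent argument. The identity behind it can be sketched as follows, assuming `x` is a max-subtracted attention score in the log2 domain (so `x <= 0`) and `shift` is a scalar chosen by the caller. The function names are hypothetical and are not part of ck_tile; this only illustrates the arithmetic, not the kernel's actual code path.

```cpp
#include <cassert>
#include <cmath>

// Illustrative only: these helpers are hypothetical, not ck_tile APIs.

// Explicit scaling: compute P = exp2(x), then multiply by 2^shift.
inline float scale_p_explicit(float x, float shift)
{
    return std::exp2(x) * std::exp2(shift);
}

// Shift trick: exp2(x + shift) == exp2(x) * 2^shift, so the scale is applied
// by one add on the exponent argument instead of a multiply on each P element.
inline float scale_p_shifted(float x, float shift)
{
    return std::exp2(x + shift);
}
```

For `x <= 0`, `exp2(x)` lies in (0, 1], so the shifted result lies in (0, 2^shift], which matches the range stated in the comment. The sketch does not cover how the kernel later compensates for the 2^shift factor.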


5 participants
