[CK TILE] fix numerical errors of preshuffle_b by CongMa13 · Pull Request #3695 · ROCm/composable_kernel

CongMa13 · 2026-01-31T00:21:46Z

This pull request introduces several improvements and fixes related to quantized grouped GEMM (General Matrix Multiply) pipelines and their supporting utilities.

The numerical issue

Steps to reproduce

Run:
./bin/tile_example_gemm_weight_preshuffle -prec=fp8
./bin/tile_example_gemm_weight_preshuffle -prec=int4

Solution

The main changes address type correctness, improve data layout and shuffling logic, and expand test coverage to better validate different GEMM configurations.

Key changes include:

Data layout and shuffling logic

Refactored the logic in shuffle_b_permuteN to use constexpr variables for KLane and ItemsPerAccess, simplifying tile view construction and correcting the permutation order for improved efficiency and correctness (tensor_shuffle_utils.hpp).
Fixed the calculation of KLaneBytes in weight preshuffle pipeline policies to account for internal data type conversion (e.g., from pk_int4_t to fp8), ensuring accurate memory access and alignment in quantized GEMM policies (wp_pipeline_agmem_bgmem_creg_base_policy.hpp, gemm_wp_abquant_pipeline_ag_bg_cr_base_policy.hpp).
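To see why the type used for KLaneBytes matters, here is a minimal sketch, not the actual policy code. The types `pk_int4_sketch` and `fp8_sketch`, the helpers `lane_bytes` and `num_access`, the 32 elements per lane, and the 16-byte access width are all assumptions chosen for illustration; the PR itself only states that the byte count has to follow the internal compute type rather than the packed storage type.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the real CK Tile types; only size and packing matter here.
struct pk_int4_sketch { std::uint8_t packed; }; // assumed: two 4-bit values per byte
struct fp8_sketch     { std::uint8_t bits; };   // assumed: one 8-bit value per byte

// Logical elements held by one object of T (an assumption of this sketch).
template <typename T>
struct packed_size { static constexpr std::size_t value = 1; };
template <>
struct packed_size<pk_int4_sketch> { static constexpr std::size_t value = 2; };

// Bytes that one lane covers along K for `k_elems` logical elements of type T.
template <typename T>
constexpr std::size_t lane_bytes(std::size_t k_elems)
{
    return k_elems * sizeof(T) / packed_size<T>::value;
}

// Number of fixed-width accesses needed to cover `bytes` (rounded up).
constexpr std::size_t num_access(std::size_t bytes, std::size_t bytes_per_access)
{
    return (bytes + bytes_per_access - 1) / bytes_per_access;
}

constexpr std::size_t kElemsPerLane   = 32; // assumed value for illustration
constexpr std::size_t kBytesPerAccess = 16; // assumed value for illustration

// Sized from the packed storage type: half the bytes, so half the accesses.
static_assert(num_access(lane_bytes<pk_int4_sketch>(kElemsPerLane), kBytesPerAccess) == 1,
              "storage-type sizing");
// Sized from the compute type after pk_int4 -> fp8 conversion.
static_assert(num_access(lane_bytes<fp8_sketch>(kElemsPerLane), kBytesPerAccess) == 2,
              "compute-type sizing");
```

Under these assumed numbers the two sizings disagree by a factor of two, which is the kind of mismatch that would make a kernel read a layout different from the one the host-side shuffle produced.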

Test infrastructure enhancements

Unit tests did not catch this issue since there were no tests for fp8. Added new configuration structs (config_mn_16x16, config_mn_32x32) to support additional GEMM tile shapes and updated tests to run with these configurations for broader coverage (test_gemm_pipeline_util.hpp).
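One low-cost way to keep configuration names like these tied to their tile values is a compile-time check. The following is a sketch only: the `_sketch` structs model just the `M_Warp_Tile` and `N_Warp_Tile` fields, the real configs in test_gemm_pipeline_util.hpp carry more fields that are not shown in this description, and the helper `warp_tile_is` is hypothetical, not part of CK Tile.

```cpp
// Hypothetical configs: only the two warp-tile fields are modeled here.
struct config_mn_16x16_sketch
{
    static constexpr int M_Warp_Tile = 16;
    static constexpr int N_Warp_Tile = 16;
};

struct config_mn_32x32_sketch
{
    static constexpr int M_Warp_Tile = 32;
    static constexpr int N_Warp_Tile = 32;
};

// True when Config's warp tile is exactly M x N.
template <typename Config, int M, int N>
constexpr bool warp_tile_is()
{
    return Config::M_Warp_Tile == M && Config::N_Warp_Tile == N;
}

// The build fails if a struct's name and its tile values drift apart.
static_assert(warp_tile_is<config_mn_16x16_sketch, 16, 16>(), "16x16 config mismatch");
static_assert(warp_tile_is<config_mn_32x32_sketch, 32, 32>(), "32x32 config mismatch");
```

A check like this costs nothing at run time and would flag a name that no longer matches its values as soon as the test header is compiled.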

Copilot

Pull request overview

This PR fixes numerical issues in the weight preshuffle path for quantized GEMM (including pk_int4 → fp8), aligns host-side shuffle logic with the kernel’s expectations, and broadens test coverage to additional GEMM tile shapes so fp8 issues are exercised.

Changes:

Adjusted weight preshuffle and AB-quant pipeline policies to compute KLaneBytes based on the internal compute type (mixed_prec_compute_type_from_input_t) instead of the packed input storage type, ensuring correct NumAccess and memory access patterns for pk_int4/fp8 flows.
Fixed shuffle_b_permuteN’s tile-view layout and permutation to use a consistent KLane/ItemsPerAccess formulation and corrected the permutation order to match the intended layout.
Extended the GEMM preshuffle test utilities with new config structs and updated example grouped-GEMM code to use properly qualified dependent template types, improving both coverage and compilation robustness.
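The `typename` qualification mentioned in the last item is a general C++ rule and can be shown in isolation. In the sketch below, `BlockGemmSketch` and `PipelineSketch` are hypothetical stand-ins; only the nested names `OverrideADataType` and `OverrideBDataType` come from the PR, and the concrete types assigned to them here are arbitrary.

```cpp
#include <type_traits>

// Hypothetical block-GEMM type exposing nested override types.
struct BlockGemmSketch
{
    using OverrideADataType = float;  // arbitrary choice for the sketch
    using OverrideBDataType = double; // arbitrary choice for the sketch
};

template <typename BlockGemm>
struct PipelineSketch
{
    // BlockGemm is a template parameter, so its nested names are dependent.
    // `typename` tells the compiler that each one names a type.
    using AType = typename BlockGemm::OverrideADataType;
    using BType = typename BlockGemm::OverrideBDataType;
};

static_assert(std::is_same<PipelineSketch<BlockGemmSketch>::AType, float>::value,
              "AType resolves to the nested A override type");
static_assert(std::is_same<PipelineSketch<BlockGemmSketch>::BType, double>::value,
              "BType resolves to the nested B override type");
```

C++20 lets `typename` be omitted in some contexts, including alias declarations like these, but spelling it out keeps the code valid under earlier standards and under compilers that apply the older rule strictly.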

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| test/ck_tile/gemm_weight_preshuffle/test_gemm_pipeline_util.hpp | Adds `config_mn_16x16`/`config_mn_32x32` warp-tile configs and runs the preshuffle tests with both, increasing coverage across GEMM tile shapes (note: the `_16x16`/`_32x32` names are currently inverted relative to their `M_Warp_Tile`/`N_Warp_Tile` values). |
| include/ck_tile/ops/gemm/pipeline/wp_pipeline_agmem_bgmem_creg_base_policy.hpp | Updates `UniversalWeightPreshufflePipelineAgBgCrPolicy::GetBlockWeightPreshuffle` to compute `KLaneBytes` from `BTypeToUse` (mixed-precision compute type) so access granularity matches the actual compute datatype. |
| include/ck_tile/ops/gemm_quant/pipeline/gemm_wp_abquant_pipeline_ag_bg_cr_base_policy.hpp | Mirrors the `KLaneBytes`/`NumAccess` fix in the AB-quant weight preshuffle B-quant policy, basing the lane-byte count on `BTypeToUse` derived from `Problem::BDataType` and the compute type. |
| include/ck_tile/ops/gemm_quant/pipeline/gemm_abquant_pipeline_ag_bg_cr_v3.hpp | Qualifies `BlockGemm::OverrideADataType`/`OverrideBDataType` with `typename` to correctly refer to dependent nested types in the AB-quant compute pipeline. |
| include/ck_tile/host/tensor_shuffle_utils.hpp | Reimplements the non-gfx12 branch of `shuffle_b_permuteN` to use constexpr `KLane`/`ItemsPerAccess` (aligned with `shuffle_b`) and adjusts the tensor view rank and `reference_permute` order to a layout consistent with the preshuffle kernel. |
| example/ck_tile/17_grouped_gemm/abquant_grouped_gemm.cpp | Adds `typename` in `BaseGemmPipeline` and `GemmPipeline` dependent template instantiations so grouped AB-quant GEMM examples compile cleanly with strict C++ template rules. |


test/ck_tile/gemm_weight_preshuffle/test_gemm_pipeline_util.hpp

ThomasNing · 2026-02-02T22:41:18Z

@CongMa13 Please resolve the gfx950 error in CI.

ammallya · 2026-02-03T22:00:58Z

Imported to ROCm/rocm-libraries

[CK TILE] fix bugs of preshuffle_b

4340c1c

CongMa13 requested review from Snektron, ThomasNing, afagaj, andriy-ca, aosewski, asleepzzz, bartekxk, carlushuang, cgmillette, coderfeli, geyyer, illsilin, poyenc, qianfengz, shumway, tenpercent, vidyasagar-amd and vpietila-amd as code owners January 31, 2026 00:21

CongMa13 requested a review from Copilot January 31, 2026 00:24

Copilot started reviewing on behalf of CongMa13 January 31, 2026 00:25

Copilot AI reviewed Jan 31, 2026

test/ck_tile/gemm_weight_preshuffle/test_gemm_pipeline_util.hpp (comment resolved)

CongMa13 and others added 3 commits January 31, 2026 18:40

[CK TILE] fix bugs of preshuffle_b

758921f

Merge branch 'develop' into congma/ck_tile/fix_preshuffle_b

d306714

Merge branch 'develop' into congma/ck_tile/fix_preshuffle_b

0b98283

Merge branch 'develop' into congma/ck_tile/fix_preshuffle_b

61c7540

assistant-librarian bot mentioned this pull request Feb 3, 2026

[CK TILE] fix numerical errors of preshuffle_b ROCm/rocm-libraries#4264

Closed

ammallya closed this Feb 3, 2026

CongMa13 wants to merge 5 commits into develop from congma/ck_tile/fix_preshuffle_b

3 participants
