[CK] Add new fwd conv fp16/bf16 instances optimized for unit group size. by vpietila-amd · Pull Request #3670 · ROCm/composable_kernel

vpietila-amd · 2026-01-28T15:26:52Z

Proposed changes

Added new FP16/BF16 instances that are optimized for group size = 1. The new instances use the compute-optimized block GEMM pipeline.

CK prof command	Baseline (TFLOPs)	New V3 instances (TFLOPs)
grouped_conv_fwd 1 1 1 0 1 0 1 2 1 32 2376 256 3 3 100 100 1 1 1 1 1 1 1 1	858.818	962.293
grouped_conv_fwd 1 1 1 0 1 0 1 2 1 32 256 256 3 3 100 100 1 1 1 1 1 1 1 1	979.987	1121.11
grouped_conv_fwd 1 1 1 0 1 0 1 2 1 32 2376 256 3 3 50 50 1 1 1 1 1 1 1 1	945.951	1091.66
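
To make the table easier to compare, the relative gain per row can be computed from the two TFLOPs columns. The snippet below is a minimal sketch, not part of the PR: the names `BenchRow`, `speedup_percent`, `kRows` and `print_speedups` are made up for illustration, and the numbers are copied from the table above.

```cpp
#include <array>
#include <cstdio>

// One row of the benchmark table (values copied from the PR description).
// The struct and its field names are hypothetical, used only in this sketch.
struct BenchRow {
    const char* label;
    double baseline_tflops;
    double new_tflops;
};

// Relative improvement of the new instance over the baseline, in percent.
constexpr double speedup_percent(double baseline, double updated) {
    return (updated / baseline - 1.0) * 100.0;
}

// Rows in the same order as the table above.
constexpr std::array<BenchRow, 3> kRows{{
    {"row 1", 858.818, 962.293},
    {"row 2", 979.987, 1121.11},
    {"row 3", 945.951, 1091.66},
}};

// Prints one line per table row with the computed gain.
inline void print_speedups() {
    for (const BenchRow& row : kRows) {
        std::printf("%s: %.3f -> %.3f TFLOPs (+%.1f%%)\n",
                    row.label,
                    row.baseline_tflops,
                    row.new_tflops,
                    speedup_percent(row.baseline_tflops, row.new_tflops));
    }
}
```

Calling `print_speedups()` from `main` lists the gain for each profiled shape; for these three rows it falls between about 12% and a little over 15%.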

Copilot

Pull request overview

This PR adds new FP16 and BF16 convolution instances optimized for unit group size (G=1) to improve performance. The new instances use the compute-optimized block GEMM pipeline and, on the three profiled shapes listed in the PR description, raise throughput by roughly 12% to 15% in TFLOPs.

Changes:

Added two new BF16 instances optimized for G=1 with different block configurations (256x256x256 and 512x128x32)
Added two new FP16 instances optimized for G=1 with matching block configurations
Modified line endings to add commas for existing instances to accommodate new entries
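
The comma change follows from how such instance lists are usually written: as the type arguments of a `std::tuple`, so the former last entry needs a trailing comma before another entry can follow it. The snippet below is a simplified sketch under that assumption, not the code from this PR. `FakeConvInstance`, `SKETCH_TARGET_GFX950` and the tile sizes are hypothetical placeholders; the macro only illustrates the later commit that compiles the new instances for gfx950 only.

```cpp
#include <cstddef>
#include <tuple>

// Hypothetical stand-in for a device-op template such as the one in the diff.
// The four parameters are illustrative only and do not mirror the real signature.
template <int BlockSize, int MPerBlock, int NPerBlock, int KPerBlock>
struct FakeConvInstance {};

// Hypothetical macro: define it to include the extra, target-specific entry.
// #define SKETCH_TARGET_GFX950 1

// An instance list is a tuple of types. Appending an entry means the previous
// last entry must now be followed by a comma.
using sketch_f16_comp_instances = std::tuple<
    FakeConvInstance<256, 128, 128, 64>
#if defined(SKETCH_TARGET_GFX950)
    ,
    FakeConvInstance<256, 256, 256, 32>
#endif
    >;

// Number of instances that ended up in the list for this build.
constexpr std::size_t kNumInstances =
    std::tuple_size<sketch_f16_comp_instances>::value;
```

Placing the comma inside the conditional block keeps the list well-formed whether or not the extra entry is compiled in; an unconditional trailing comma with nothing after it would not compile.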

Copilot · 2026-01-28T21:01:12Z

...tensor_operation_instance/gpu/grouped_conv_fwd/device_grouped_conv_fwd_xdl_comp_instance.hpp

-        DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3<NDimSpatial,ALayout,BLayout,    DsLayout,ELayout,   F16,   F16,     F32,      F16,    DsDataTypes,   F16, PassThrough, PassThrough, OutElementOp,       ConvSpec, GemmMNKPadding,   256,   128,   128,    64,   8,   8,  32,   32,    2,    2,     S<8, 32, 1>,     S<1, 0, 2>,    S<1, 0, 2>,               2,              8,              8,          0,    S<8, 32, 1>,     S<1, 0, 2>,    S<1, 0, 2>,               2,              8,              8,          0,          1,           1,                   S<1, 32, 1, 8>,               8,  BlockGemmPipelineScheduler::Intrawave, BlockGemmPipelineVersion::v4>
-    // clang-format on
+        DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3<NDimSpatial,ALayout,BLayout,    DsLayout,ELayout,   F16,   F16,     F32,      F16,    DsDataTypes,   F16, PassThrough, PassThrough, OutElementOp,       ConvSpec, GemmMNKPadding,   256,   128,   128,    64,   8,   8,  32,   32,    2,    2,     S<8, 32, 1>,     S<1, 0, 2>,    S<1, 0, 2>,               2,              8,              8,          0,    S<8, 32, 1>,     S<1, 0, 2>,    S<1, 0, 2>,               2,              8,              8,          0,          1,           1,                   S<1, 32, 1, 8>,               8,  BlockGemmPipelineScheduler::Intrawave, BlockGemmPipelineVersion::v4>,
+


Empty line contains trailing whitespace. Remove the trailing whitespace to maintain code cleanliness.

ammallya · 2026-02-03T22:01:24Z

Imported to ROCm/rocm-libraries

Add new fwd conv fp16/bf16 instances optimized for unit group size.

0c5b724

vpietila-amd marked this pull request as ready for review January 28, 2026 15:35

vpietila-amd requested review from a team, Snektron, ThomasNing, afagaj, andriy-ca, aosewski, asleepzzz, bartekxk, carlushuang, cgmillette, coderfeli, geyyer, illsilin, poyenc, qianfengz, shumway and vidyasagar-amd as code owners January 28, 2026 15:35

vpietila-amd enabled auto-merge (squash) January 28, 2026 16:09

Merge branch 'develop' into vpietila/add-fwd-conv-v3-instances-for-unit-group-size

6295d53

bartekxk previously approved these changes Jan 28, 2026

afagaj requested a review from Copilot January 28, 2026 21:00

Copilot AI reviewed Jan 28, 2026

vpietila-amd and others added 2 commits January 29, 2026 10:38

Merge branch 'develop' into vpietila/add-fwd-conv-v3-instances-for-unit-group-size

2aaeac2

Conditionally compile new instances only for gfx950.

f8aec67

vpietila-amd dismissed bartekxk’s stale review via f8aec67 January 29, 2026 15:04

vpietila-amd commented Jan 29, 2026

...tensor_operation_instance/gpu/grouped_conv_fwd/device_grouped_conv_fwd_xdl_comp_instance.hpp (outdated, resolved)

Ville Pietilä added 2 commits January 29, 2026 10:19

Fix clang-format pragmas.

8a23393

Fix clang-format.

0369750

Ville Pietilä and others added 3 commits January 30, 2026 10:17

Move gfx950 specific instances under other double rate instances.

88f8417

Fix clang-format.

fbd1fa1

Merge branch 'develop' into vpietila/add-fwd-conv-v3-instances-for-unit-group-size

486eac5

assistant-librarian bot mentioned this pull request Feb 3, 2026

[CK] Add new fwd conv fp16/bf16 instances optimized for unit group size. ROCm/rocm-libraries#4275

Open

ammallya closed this Feb 3, 2026

auto-merge was automatically disabled February 3, 2026 22:01
Pull request was closed

vpietila-amd wants to merge 9 commits into develop from vpietila/add-fwd-conv-v3-instances-for-unit-group-size

3 participants
