HipKittens MXFP8 GEMM Support #566
Conversation
)
if use_bias:
    pytest.skip("hipblaslt GEMM does not yet support MXFP8 with bias.")
hipkittens_eligible = (m % 256 == 0) and (n % 256 == 0) and (k >= 256)

key = (device, ub, grouped_gemm)
ws = _workspace_cache.get(key)
if ws is None:
    ws = torch.empty(get_cublas_workspace_size_bytes(), dtype=torch.uint8, device=device)
    _workspace_cache[key] = ws
return ws
def check_mxfp8_workspace(device: int, needed: int) -> None:
    """Grow the workspace to the required size."""
    key = (device, False, False)
    ws = _workspace_cache.get(key)
    if ws is not None and ws.shape[0] >= needed:
        return
    _workspace_cache[key] = torch.empty(needed, dtype=torch.uint8, device=device)
I have concerns about the proposed workspace cache system:
1). In non-MoE runs, it will try to allocate the largest size kitten_gemm needs, replacing previously allocated smaller buffers and relying on PyTorch garbage collection to deallocate them. The biggest single buffer will then stay in the process from the second iteration onward.
2). For MoE runs, sizes are dynamic, so the cache contents can probably still change after the warm-up runs.
If we can force TE upstream to always provide you the TN layout, then can we remove this dynamic workspace entirely?
I understand your concern, but I think we are OK for current models.
1.) This is correct: we only keep the largest workspace, relying on PyTorch GC to delete the old one. This only affects iteration 1.
2.) Since the workspace is shared by all GEMMs in the model, I think this is unlikely. For example, with DeepSeek 671B at BS=2, the largest non-MoE workspace needed is for the dense layers' FFN, where the wgrad GEMM needs 200 MB compared to a theoretical maximum MoE GEMM size of 72 MB, so this wouldn't occur. For a full MoE model like Qwen 235B, we still don't run into this issue, as the largest non-MoE GEMM would use 96 MB vs. a 44 MB worst case for MoE.
It is possible that a model exists, or could exist, where the MoE GEMM is the largest, but convergence theory would imply that we hit the maximum allocation threshold fairly quickly in a many-layer model, and it almost certainly wouldn't affect the performance of a full training run.
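For context, the grow-only pattern under discussion amounts to the sketch below; the composition of the calls is illustrative, not the actual TE call site.

# Before launching a hipKittens MXFP8 GEMM, grow the shared workspace if this
# GEMM needs more bytes than any previous one on this device.
needed = _hipkittens_workspace_bytes(m, n, k, layout)  # per-GEMM requirement
check_mxfp8_workspace(device, needed)                  # grow-only resize
ws = _workspace_cache[(device, False, False)]          # largest workspace seen so far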
Emm, in addition to my other comment on the possibility of removing this dynamic workspace directly, if we really need this dynamic buffer:
1). Let's try allocating the buffer without a cache to see if it really hurts e2e training before working on this delicate buffer cache?
2). Convergence theory usually works in theory papers with input distributional assumptions. I agree it works fine for our Qwen or DS runs, but our library may run into strange corner cases when used by customers.
We can do this, but I believe it doesn't remove the need for a workspace that grows dynamically with the largest needed size.
The convergence I was referring to is that we have an upper bound on the largest expert in a model. In the scenario where the MoE layer is the largest, every time we see a new largest expert we are less likely to see an even larger one, so we are very unlikely to keep spending time allocating memory for the workspace later on. I also think the memory allocation here is a negligible overhead, given that the same workspace is reused.
Right, this does not change the need for a dynamic workspace. If PyTorch-native buffer allocation does not hurt us much, our code will be cleaner and easier to maintain.
Looks like it doesn't hurt much -- maybe 5 TFLOPS or so lost on average.
Emm, a 5 TFLOPS drop vs. cleaner and easier-to-maintain code; I'm okay with both options. @ipanfilo do you have comments on this?
if (!use_mxfp8 && params.force_hipblaslt) {
  GTEST_SKIP() << "force_hipblaslt only relevant for MXFP8";
}
if (use_mxfp8) {
Add a new const bool use_hipblaslt_fp8 = (!use_mxfp8 || params.force_hipblaslt); this combination is used below for many skips. And all of this should live below, under ifdef HIP_PLATFORM_AMD, under has_fp8.
I wanted to avoid the skips completely, so I split the test instantiation into non-MXFP8 and MXFP8 variants.
Nevertheless, the same condition is used multiple times below. Maybe you can instead have use_hipkittens_mxfp8 = (use_mxfp8 && !params.force_hipblaslt) for better clarity.
[](const testing::TestParamInfo<DqGEMMTestSuite::ParamType>& info) {
  return MKN(std::get<0>(info.param)) + "x" + TN(std::get<3>(info.param));
  return MKN(std::get<0>(info.param)) + "x" +
         std::to_string(std::get<1>(info.param)) + "x" +
What is the point? They are set to false only.
  GTEST_SKIP() << "MXFP8 is not supported in current config";
}
if (params.use_bias || params.use_gelu) {
  if (params.force_hipblaslt) {
It is skipped below anyway; if you are adding it for the future, move it after the more generic one.
Sorry, this and the Dq test name changes are artifacts from my attempt to enable bias and gelu for this test. I ran into issues with gelu for the non-FP8 GEMM in hipBLASlt and decided to just focus on the non-Dq tests. I have reverted things.
#include <hip/hip_runtime.h>
#include <cstddef>

enum KittensDType {
Is it copied from some hipKittens enum? Add a comment then.
These values come from the NVTE values; I have added a comment to that effect.
And where are they used?
return torch.empty(get_cublas_workspace_size_bytes(), dtype=torch.uint8, device=device)
key = (device, ub, grouped_gemm)
ws = _workspace_cache.get(key)
Why don't we rely on torch memory caching?
I have made this change. I will need to do an E2E run to make sure that performance isn't affected, but it should be OK given my understanding of torch.empty().
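For reference, the cache-free approach relies on PyTorch's caching allocator; a minimal sketch, assuming the existing get_cublas_workspace_size_bytes() helper (the wrapper name here is illustrative):

import torch

def _gemm_workspace(device: int) -> torch.Tensor:
    # Allocate per call; once the previous tensor is freed, PyTorch's caching
    # allocator reuses the same device block, so repeated same-sized requests
    # do not hit hipMalloc again.
    return torch.empty(get_cublas_workspace_size_bytes(), dtype=torch.uint8, device=device)

The trade-off discussed above is a possible small allocator overhead (the ~5 TFLOPS noted) against simpler code.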
size_t sa_tr_bytes = align_up((size_t)M * scale_K, 256);
size_t sb_tr_bytes = align_up((size_t)N * scale_K, 256);
size_t sa_pk_bytes = align_up((size_t)k_iters * M * sizeof(uint32_t), 256);
size_t sb_pk_bytes = (size_t)k_iters * N * sizeof(uint32_t);
For my own understanding, can you explain why sb_pk_bytes does not require 256-alignment like the others?
Here, we are aligning the end of each buffer so that the next address is 256-aligned, not the current one. Since sb_pk_bytes is the last buffer, we don't need to pad it.
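To make the packing concrete, here is a small illustrative sketch (not taken from the PR) of how consecutive sub-buffers are laid out so that every offset after the first stays 256-byte aligned, which is why only the non-final sizes are rounded up:

def align_up(x: int, a: int = 256) -> int:
    # Round x up to the next multiple of a.
    return (x + a - 1) // a * a

def pack_offsets(sizes):
    # Pad every buffer's size except the last so the *next* offset stays
    # aligned; the final buffer needs no trailing padding.
    offsets, cur = [], 0
    for i, size in enumerate(sizes):
        offsets.append(cur)
        cur += align_up(size) if i + 1 < len(sizes) else size
    return offsets, cur

# e.g. sa_tr, sb_tr, sa_pk are padded; sb_pk, being last, is not.
offsets, total_bytes = pack_offsets([1000, 2000, 3000, 4000])
assert all(off % 256 == 0 for off in offsets)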
namespace transformer_engine {
namespace jax {
Nit: there are a few whitespace-only changes in these files, not sure if they are necessary.
I have removed this, thanks
    Path(__file__).resolve().parent.parent
    / "3rdparty" / "hipkittens" / "include" / "kittens.cuh"
)
if "gfx950" in rocm_archs and hipkittens_header.exists():
PyTorch/JAX extensions do not contain any GPU code; they delegate all of that to TE core, and the hipKittens sources are added to TE common too. Why is this build-time setting needed?
This is an artifact from when I was running into issues with CI not finding pybind-exposed functions from hipKittens. The issue was elsewhere, and I forgot to remove this. I will remove it, thanks!
NVTE_CHECK((k % 128) == 0, "GEMM K dimension must be multiple of 128 for MXFP8 scaling (got K=", k, ")");
NVTE_CHECK((m % 16) == 0, "GEMM M dimension must be multiple of 16 for MXFP8 scaling (got M=", m, ")");
NVTE_CHECK((n % 16) == 0, "GEMM N dimension must be multiple of 16 for MXFP8 scaling (got N=", n, ")");
NVTE_CHECK((m % 16) == 0, "GEMM M dimension must be multiple of 16 for MXFP8 scaling (got M=", m, ")");
It looks like just a spacing change. Please revert if that is the case.
transb, grad, workspace, workspaceSize, alpha, beta, use_split_accumulator,
math_sm_count, use_service_stream ? ss_ctl.stream : stream, handle);
#ifdef USE_HIPKITTENS_GEMM
bool is_mxfp8 = inputA->scaling_mode == NVTE_MXFP8_1D_SCALING
Move it out of the ifdef and use it in the ifs that currently check the same condition.
NVTE_CHECK((n % 16) == 0, "GEMM N dimension must be multiple of 16 for MXFP8 scaling (got N=", n, ")");
NVTE_CHECK((m % 16) == 0, "GEMM M dimension must be multiple of 16 for MXFP8 scaling (got M=", m, ")");
NVTE_CHECK((n % 16) == 0, "GEMM N dimension must be multiple of 16 for MXFP8 scaling (got N=", n, ")");
#ifndef USE_HIPKITTENS_GEMM
It is checked below in the else branch of the hipkittens condition.
if (use_hipkittens) {
  auto param = CanonicalizeGemmInput(*inputA, transa, *inputB, transb, m, n, k);

  hipStream_t s = use_service_stream ? ss_ctl.stream : stream;
The same as with is_mxfp8: there is no point in having it defined for one branch only.
}

auto [atol, rtol] = getTestTolerances(dtype, has_fp8, use_mxfp8);
size_t mismatch_limit = use_mxfp8 ? std::max((size_t)1, params.m * params.n / 1'000'000) : 0;
@@ -743,12 +786,15 @@ MAKE_DQ_GEMM_TEST(Testfp8xfp8xfp16, fp8, fp8, fp16)

INSTANTIATE_TEST_SUITE_P(OperatorTest, DqGEMMTestSuite,
If you end up having a separate prefix for MXFP8, it has to be used for this suite for consistency.
@@ -30,7 +30,9 @@ std::vector<std::tuple<size_t, size_t, size_t>> test_case_sizes = {

std::vector<std::tuple<size_t, size_t, size_t>> test_case_sizes_mxfp8 = {
test_case_sizes_mxfp8 is only used for DqGEMMTest; is it intentional to add sizes there?
num_cublas_streams = get_num_compute_streams()


def _hipkittens_workspace_bytes(m: int, n: int, k: int, layout: str) -> int:
Should it check the environment to figure out whether hipKittens is enabled?
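If so, a minimal sketch of such a gate, with a hypothetical environment-variable name (the real switch in TE, if any, may be named differently):

import os

def _hipkittens_enabled() -> bool:
    # Hypothetical gate; the variable name and default are assumptions,
    # not TE's actual flag.
    return os.getenv("TE_USE_HIPKITTENS", "1").lower() not in ("0", "false")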
Creates an MXFP8 GEMM with HipKittens that outperforms hipBLASlt and offers additional epilogues such as BIAS and GELU AUX.
Requires a workspace sized relative to the model, often larger than hipBLASlt's, but with significant performance improvements. Only builds for gfx950, and requires M and N to be multiples of 256.
Adds the hipKittens header library as a submodule.
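For reference, the eligibility condition for the hipKittens path, as reflected in the tests above, amounts to the following sketch (function signature illustrative):

def hipkittens_eligible(m: int, n: int, k: int, arch: str) -> bool:
    # gfx950 only; M and N must be multiples of 256 and K at least 256,
    # matching the hipkittens_eligible check in the pytest changes above.
    return arch == "gfx950" and m % 256 == 0 and n % 256 == 0 and k >= 256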