MoE prefill bf16 perf improvement for qwen-3.5-35B-A3B #18829
digantdesai wants to merge 5 commits into main
Conversation
Inductor emits aten::sort.stable for ops like argsort, but lacks a native c-shim for it. This adds a thrust-based implementation (aoti_torch_cuda_sort_stable) that handles int64, int32, and float32 dtypes on contiguous innermost-dim tensors. Registered as a supported fallback kernel in CudaBackend so AOTI-compiled models can use sort. This PR was authored with the assistance of Claude.
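The shim must match the semantics of `aten::sort.stable` on the innermost dimension: return both sorted values and their original indices, and keep equal elements in their original order. The sketch below is a rough pure-Python model of those semantics (the actual shim is a thrust-based CUDA implementation); `sort_stable_lastdim` is a hypothetical name for illustration.

```python
def sort_stable_lastdim(rows, descending=False):
    """Model of aten::sort.stable semantics along the innermost dim.

    Returns (values, indices). Python's sorted() is stable, so ties
    keep their original relative order, matching the .stable variant.
    """
    out_vals, out_idx = [], []
    for row in rows:
        order = sorted(range(len(row)), key=lambda i: row[i],
                       reverse=descending)
        out_idx.append(order)
        out_vals.append([row[i] for i in order])
    return out_vals, out_idx
```

For example, sorting `[3, 1, 2, 1]` ascending yields values `[1, 1, 2, 3]` with indices `[1, 3, 2, 0]`: the two ties (indices 1 and 3) keep their original order, which is the property ops like argsort rely on.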
🔗 Helpful Links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18829. Note: links to docs will display an error until the docs builds have completed.
❗ There is 1 active SEV; if your PR is affected, please review it.
As of commit 63548f5 with merge base 266ff2d: ❌ 11 new failures, 1 cancelled job (please retry), 2 pending, and 2 unrelated failures (1 flaky, 1 broken on trunk). 👉 Rebase onto the `viable/strict` branch to avoid the broken-trunk failure.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Sweeps prompt lengths [1..4095] with Qwen3.5-35B-A3B shapes (256 experts, top-8, INT4 W4A16). Validates correctness against loop-based eager reference at small M, benchmarks vectorized eager, torch.compile, and Triton fused_moe. Handles OOM gracefully at large M where eager/compile dequantize all experts. This PR was authored with the assistance of Claude.
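The correctness check described above compares a loop-based eager reference (one vec-mat per token-expert pair) against a vectorized formulation. A minimal NumPy sketch of that comparison, with hypothetical function names and small float weights standing in for the dequantized INT4 experts:

```python
import numpy as np

def moe_eager_reference(x, w, topk_ids, topk_w):
    # Loop-based reference: one vec-mat per (token, expert slot).
    # x: [M, K], w: [E, N, K], topk_ids/topk_w: [M, k]
    M = x.shape[0]
    out = np.zeros((M, w.shape[1]), dtype=x.dtype)
    for m in range(M):
        for j in range(topk_ids.shape[1]):
            e = topk_ids[m, j]
            out[m] += topk_w[m, j] * (w[e] @ x[m])
    return out

def moe_vectorized(x, w, topk_ids, topk_w):
    # Gather the selected expert weights and batch all matmuls.
    we = w[topk_ids]                         # [M, k, N, K]
    y = np.einsum("abcd,ad->abc", we, x)     # [M, k, N]
    return (y * topk_w[:, :, None]).sum(axis=1)
```

At small M both paths are cheap, so `np.allclose(ref, vec)` is a practical correctness gate before benchmarking the fused kernels.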
When the Triton tile size fits within a single quantization group, load one scale per N-element instead of per (K, N) element. Reduces scale memory traffic in both GEMM1 and GEMM2 vec-mat kernels. This PR was authored with the assistance of Claude.
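The idea behind the optimization: with group quantization, scales live in a `[K // group_size, N]` table, and when a K-tile falls entirely inside one quantization group, every row of the tile shares the same scale row, so the kernel can load one scale per N column instead of one per (K, N) element. A NumPy sketch of the equivalence (function names are illustrative, not the actual kernel code):

```python
import numpy as np

def dequant_per_element(q, scales, group_size):
    # Baseline: expand scales to a full [K, N] map, one load per element.
    s = np.repeat(scales, group_size, axis=0)   # [K//g, N] -> [K, N]
    return q * s

def dequant_tile_single_scale(q, scales, k0, tile_k, group_size):
    # If the K-tile [k0, k0 + tile_k) lies inside one quantization
    # group, a single scale row per N column dequantizes the whole tile.
    assert k0 // group_size == (k0 + tile_k - 1) // group_size
    return q[k0:k0 + tile_k] * scales[k0 // group_size][None, :]
```

With e.g. `tile_k = 4` and `group_size = 8`, the per-tile scale traffic drops by the tile's K extent, which is the saving claimed for both GEMM1 and GEMM2.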
Adds a batched (M>1) Triton fused MoE kernel using tensor-core mma instructions for prefill workloads. Includes moe_align_block_size for token-expert sorting and scale broadcast optimization in the batched GEMM inner loops. Weight layout: [E, N, K//2] (packed INT4). This PR was authored with the assistance of Claude.
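`moe_align_block_size` exists so that every GEMM tile processes tokens routed to exactly one expert: token-expert assignments are stably sorted by expert, and each expert's bucket is padded to a multiple of the block size. A rough functional model of what such a helper computes (the real implementation is a kernel; this is a hedged sketch, not the actual code):

```python
import numpy as np

def moe_align_block_size(topk_ids, block_size, num_experts, pad_id=-1):
    """Group flattened (token, slot) assignments by expert, padding
    each expert's bucket to a multiple of block_size."""
    flat = topk_ids.reshape(-1)
    order = np.argsort(flat, kind="stable")   # ties keep token order
    counts = np.bincount(flat, minlength=num_experts)
    padded = (counts + block_size - 1) // block_size * block_size
    sorted_ids = np.full(padded.sum(), pad_id, dtype=np.int64)
    expert_per_block = np.repeat(np.arange(num_experts),
                                 padded // block_size)
    write = read = 0
    for e in range(num_experts):
        c = counts[e]
        sorted_ids[write:write + c] = order[read:read + c]
        write += padded[e]
        read += c
    return sorted_ids, expert_per_block, int(padded.sum())
```

The padded slots (`pad_id`) are masked out in the batched GEMM inner loop, so each mma tile reads a single expert's packed INT4 weights.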
Adds a use_batched_moe flag on FusedMoEExperts, toggled by _set_batched_moe in export.py before each method's torch.export call. Decode (T=1) uses the vec-mat fused_moe kernel; prefill (T>=2) uses fused_moe_batched_gemm. This PR was authored with the assistance of Claude.