[Neuron] Add tensor parallel support for Neuron backend by JingyaHuang · Pull Request #13718 · huggingface/diffusers

JingyaHuang · 2026-05-11T14:17:44Z

What does this PR do?

Adds tensor-parallel (TP) inference for diffusers models on AWS Neuron (Trainium/Inferentia) devices. As suggested, we use Flux2 Klein as the starting point. The TP support itself is generic, easy to extend to other backends (CUDA, TPU, and more), and is exposed through the same public API already used for CP: model.enable_parallelism(config=TensorParallelConfig(...)).

Key changes:

A model-agnostic apply_tensor_parallel that shards from a flat _tp_plan.
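
To make the idea of a flat plan concrete, here is a minimal sketch, assuming the plan is a dict that maps module-name patterns to a sharding style. The pattern syntax, the style strings, the module names, and the helper resolve_tp_plan are all hypothetical illustrations; the PR's actual _tp_plan format may differ.

```python
from fnmatch import fnmatchcase

# Hypothetical flat plan: glob-style module-name patterns -> sharding style.
# Keys and style names are illustrative and not taken from the PR.
EXAMPLE_TP_PLAN = {
    "blocks.*.attn.to_q": "colwise",
    "blocks.*.attn.to_k": "colwise",
    "blocks.*.attn.to_v": "colwise",
    "blocks.*.attn.to_out": "rowwise",
}


def resolve_tp_plan(module_names, tp_plan):
    """Map each module name to the style of the first pattern it matches."""
    resolved = {}
    for name in module_names:
        for pattern, style in tp_plan.items():
            if fnmatchcase(name, pattern):
                resolved[name] = style
                break
    return resolved


modules = ["blocks.0.attn.to_q", "blocks.0.attn.to_out", "blocks.1.attn.to_k", "norm_out"]
resolved = resolve_tp_plan(modules, EXAMPLE_TP_PLAN)
# Modules that match no pattern (here "norm_out") are simply left out of the
# result, which in this sketch means "keep replicated on every rank".
```

Because the plan is flat, a model only has to declare the dict; the code that walks the modules and applies each style can stay model-agnostic.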

Quick test — Flux2 TP on Neuron (For future release)

Run with: torchrun --nproc_per_node=8 flux2_tp8_neuron.py

import torch
import torch.distributed as dist
from torch.distributed.device_mesh import DeviceMesh
import torch_neuronx  # noqa: F401 (registers torch.neuron)

from diffusers import Flux2KleinPipeline, TensorParallelConfig

MODEL = "black-forest-labs/FLUX.2-klein-9B"
PROMPT = "a golden retriever surfing a wave, photorealistic"

dist.init_process_group(backend="neuron")
device = torch.neuron.current_device()
rank = dist.get_rank()
tp_size = dist.get_world_size()
tp_mesh = DeviceMesh("neuron", list(range(tp_size)))

pipe = Flux2KleinPipeline.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

# Text encoder + VAE: replicated on every rank (no TP).
pipe.text_encoder = pipe.text_encoder.to(device)
pipe.vae = pipe.vae.to(device)

# Transformer: shard across all ranks while still on CPU, then move to device.
pipe.transformer.enable_parallelism(config=TensorParallelConfig(mesh=tp_mesh))
pipe.transformer = pipe.transformer.to(device)
torch.neuron.synchronize()

image = pipe(
    prompt=PROMPT, height=1024, width=1024,
    num_inference_steps=4, guidance_scale=1.0,
).images[0]

if rank == 0:
    image.save("flux2_tp8.png")
    print("Saved flux2_tp8.png")

dist.destroy_process_group()

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

JingyaHuang · 2026-06-24T16:28:20Z


    # flash-attn only returns LSE if dropout_p > 0. So, we need to workaround.
-    if grad_enabled or (_parallel_config is not None and _parallel_config.context_parallel_config._world_size > 1):
+    if grad_enabled or (_parallel_config is not None and _parallel_config._cp_world_size > 1):


With TP, context_parallel_config can be None, so we add _parallel_config._cp_world_size to handle that case.
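
The failure mode behind this change can be reproduced with a minimal sketch. The two stub dataclasses below are hypothetical stand-ins for the real config classes; only the attribute names context_parallel_config, _world_size, and _cp_world_size come from the diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContextParallelConfigStub:
    # Stand-in for the real context-parallel config (hypothetical).
    _world_size: Optional[int] = None


@dataclass
class ParallelConfigStub:
    # Stand-in for the real parallel config (hypothetical).
    context_parallel_config: Optional[ContextParallelConfigStub] = None

    @property
    def _cp_world_size(self) -> int:
        cp = self.context_parallel_config
        if cp is None or cp._world_size is None:
            return 1
        return cp._world_size


tp_only = ParallelConfigStub(context_parallel_config=None)
with_cp = ParallelConfigStub(ContextParallelConfigStub(_world_size=4))

# The old check dereferences context_parallel_config directly, which raises
# AttributeError when only tensor parallelism is configured.
try:
    _ = tp_only.context_parallel_config._world_size > 1
    old_check_failed = False
except AttributeError:
    old_check_failed = True

# The property-based check works in both situations.
tp_only_uses_cp = tp_only._cp_world_size > 1
with_cp_uses_cp = with_cp._cp_world_size > 1
```

With the property, a TP-only config reports a context-parallel world size of 1, so the attention backend takes the non-CP branch instead of raising.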

…rs (huggingface#13946) SkyReels-V2 and ChronoEdit are both built on Wan, and their transformers have the same keys as WanTransformer3DModel, so they reuse convert_wan_transformer_to_diffusers (like WanVACE / WanAnimate). This lets the community GGUF builds load directly. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

fix(cosmos3): pin VAE latent norm buffers to encode output device Under sharded placement (device_map="balanced"), vae.encode() runs on the VAE's own device while the mean/inv_std buffers were pinned to x.device, causing a cross-device RuntimeError. Compute raw_mu first, then pin the normalization buffers to its device so all tensors share one device. Co-authored-by: Atharva Joshi <atjoshi@smc521ge-0036.ipp2a2.colossus.nvidia.com> Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

…13876) * docs: fix repeated word typo in set_timesteps docstring Removed the duplicate word "schedule" from the docstring for the sigmas argument in EulerDiscreteScheduler.set_timesteps. * Update scheduling_euler_discrete.py * Apply style fixes --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

sayakpaul

Thanks for working on this.

There are a lot of intrusive, model-specific changes, which I think is a bit of an anti-pattern. It's probably also a consequence of some of the fusion that happens inside Flux2.

More specifically, the intrusive pieces exist for one reason: Flux2 fuses projections into single Linears (SwiGLU gate+linear, and to_qkv_mlp_proj packing Q/K/V and MLP).

Contiguous column sharding is blind to that internal layout, so:

you must reorder rows so each rank gets paired slices -> the permuters, and
the local tensor width no longer factors as heads × head_dim or splits cleanly into qkv/mlp -> the runtime local_* recomputation.
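
The two points above can be illustrated with a small, framework-free sketch. It assumes a fused SwiGLU projection whose output rows are laid out as the gate half followed by the linear half, with equal-width blocks; the row labels and both helper functions are hypothetical and only model the indexing, not the actual sharding code.

```python
def contiguous_shard(rows, world_size, rank):
    """Plain column-parallel split: one contiguous chunk of output rows per rank."""
    chunk = len(rows) // world_size
    return rows[rank * chunk:(rank + 1) * chunk]


def packed_shard(rows, num_blocks, world_size, rank):
    """Split each fused block separately, then concatenate this rank's slices."""
    block = len(rows) // num_blocks
    out = []
    for b in range(num_blocks):
        out.extend(contiguous_shard(rows[b * block:(b + 1) * block], world_size, rank))
    return out


# Assumed fused layout: gate rows first, then linear rows.
hidden = 4
fused_rows = [f"gate{i}" for i in range(hidden)] + [f"lin{i}" for i in range(hidden)]

# Contiguous sharding gives rank 0 only gate rows, so it cannot compute
# act(gate) * linear locally.
naive_rank0 = contiguous_shard(fused_rows, world_size=2, rank=0)

# Block-aware sharding gives every rank matching gate/linear slices.
packed_rank0 = packed_shard(fused_rows, num_blocks=2, world_size=2, rank=0)
packed_rank1 = packed_shard(fused_rows, num_blocks=2, world_size=2, rank=1)
```

Here naive_rank0 holds gate0..gate3 and no linear rows, while packed_rank0 holds gate0, gate1, lin0, lin1. The local width is also halved per block, which is why any code that splits the local tensor has to use local sizes rather than the global ones. A projection such as to_qkv_mlp_proj has blocks of unequal width, which this equal-width sketch does not cover.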

I opened JingyaHuang#1 to simplify some of the stuff. LMK.

Furthermore, would the fusion-related changes be the same for Flux1, for example? I think if the layers were unfused, parallelize_module + DTensor would handle head-splitting automatically and none of this would be needed.

sayakpaul · 2026-06-26T22:37:22Z

+    config: TensorParallelConfig,
+    tp_plan: dict,
+    *,
+    backend: str = "default",


Can this not be derived from torch_device?

sayakpaul · 2026-06-26T22:41:31Z

    return _get_projections(attn, hidden_states, encoder_hidden_states)


+def _get_tp_degree(parallel_config) -> int:


Seems like it should be present in _modeling_parallel.py?

sayakpaul · 2026-06-26T22:44:14Z

+    @property
+    def _cp_world_size(self) -> int:
+        """Context-parallel world size, or 1 when context parallelism is not enabled.
+
+        Lets attention backends branch on context parallelism without dereferencing a possibly ``None``
+        ``context_parallel_config`` (e.g. when only tensor parallelism is active).
+        """
+        cp = self.context_parallel_config
+        if cp is None or cp._world_size is None:
+            return 1
+        return cp._world_size


Where is this needed?

sayakpaul · 2026-06-26T22:46:10Z

+        # On Neuron, run the index-heavy `_unpack_latents_with_ids` on CPU to avoid expensive
+        # device<->host syncs from the gather/scatter arithmetic, then move the result back.
+        latent_device = latents.device
+        on_neuron = get_device() == "neuron"
+        if on_neuron:
+            latents = latents.cpu()
+            latent_ids = latent_ids.cpu()


Is this not needed on CUDA?

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

…arallel Adopts Sayak's changes from #1 that replace the Flux2-specific _tp_fused_block_permuters (permute-then-slice) with generic PackedColwiseParallel / PackedRowwiseParallel styles that slice fused projections block-by-block. Also drops the now-unused _tp_fused_block_permuters base-class default in modeling_utils. Keeps torch.chunk in Flux2SwiGLU.forward (TorchAO compile regression fix), overriding the half-slicing on Sayak's branch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
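
As a rough illustration of why the swap described in this commit message is behavior-preserving, the sketch below compares permute-then-slice with direct block-by-block slicing on plain row indices. It assumes equal-width fused blocks and a world size that divides each block; both helpers are hypothetical and are not the PR's PackedColwiseParallel or _tp_fused_block_permuters implementations.

```python
def permute_for_ranks(rows, num_blocks, world_size):
    """Reorder rows so a later contiguous split hands each rank one slice of every block."""
    block = len(rows) // num_blocks
    per_rank = block // world_size
    order = []
    for rank in range(world_size):
        for b in range(num_blocks):
            start = b * block + rank * per_rank
            order.extend(range(start, start + per_rank))
    return [rows[i] for i in order]


def slice_blockwise(rows, num_blocks, world_size, rank):
    """Take this rank's slice from every block directly, with no global permutation."""
    block = len(rows) // num_blocks
    per_rank = block // world_size
    out = []
    for b in range(num_blocks):
        start = b * block + rank * per_rank
        out.extend(rows[start:start + per_rank])
    return out


rows = list(range(12))  # e.g. three fused blocks of four rows each
world_size = 2

permuted = permute_for_ranks(rows, num_blocks=3, world_size=world_size)
chunk = len(rows) // world_size
via_permute = [permuted[r * chunk:(r + 1) * chunk] for r in range(world_size)]
via_blocks = [slice_blockwise(rows, 3, world_size, r) for r in range(world_size)]
```

Both routes give each rank the same rows, so the block-by-block style can replace the permuters without needing a per-model permutation table.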

sayakpaul · 2026-07-02T12:21:30Z

@JingyaHuang did my PR break any neuron-specific stuff?

JingyaHuang and others added 25 commits March 18, 2026 11:15

draft:add neuron as a legit backend

98f6c8c

Merge branch 'huggingface:main' into add-neuron-backend

c58b8b8

Merge branch 'huggingface:main' into add-neuron-backend

3367409

Merge branch 'main' into add-neuron-backend

0c51734

feat: neuron-specific changes in the pipeline

a76953c

tests: eager tests

2480388

draft: start with tp for flux2

1469c04

fix: style

929ab72

Merge branch 'huggingface:main' into add-neuron-backend

52cac76

Merge branch 'huggingface:main' into support-neuron-tp

30cb353

Merge branch 'add-neuron-backend' of github.com:JingyaHuang/diffusers into add-neuron-backend

28a5086

Merge branch 'huggingface:main' into support-neuron-tp

7fab0c4

Merge branch 'huggingface:main' into add-neuron-backend

68689e5

Merge branch 'main' into add-neuron-backend

da79308

fix:apr_02 beta

3bb9c7c

Merge branch 'add-neuron-backend' of github.com:JingyaHuang/diffusers into add-neuron-backend

c4facab

feat:add wan

dff1f32

Merge branch 'huggingface:main' into support-neuron-tp

1c930c4

Merge branch 'huggingface:main' into add-neuron-backend

1eb5ff9

fix:pixart

cbe8f28

fix: rewrite flux swiglu activation to avoid gather op in neuron IR

16b9606

test: pixart compile mode on neuron

7f13f68

Merge branch 'main' into neuron-torch-comppile

a46cb19

cleanup & fix style

a354b88

Merge branch 'neuron-torch-comppile' into support-neuron-tp

931bb85

github-actions Bot added size/L PR with diff > 200 LOC lora models tests utils labels May 11, 2026

JingyaHuang and others added 3 commits June 24, 2026 13:24

Merge branch 'main' into support-neuron-tp

3fc043e

tests: add test units for tp

e6d20d8

Merge branch 'support-neuron-tp' of github.com:JingyaHuang/diffusers into support-neuron-tp

e76a2fc

github-actions Bot added the tests label Jun 24, 2026

JingyaHuang marked this pull request as ready for review June 24, 2026 15:30

JingyaHuang added 3 commits June 24, 2026 16:06

fix: in case of text-encoder(s) on CPU

034ba9e

review:cleanup+add test

4907524

Merge branch 'support-neuron-tp' of github.com:JingyaHuang/diffusers into support-neuron-tp

af2aed7

JingyaHuang commented Jun 25, 2026

JingyaHuang requested a review from sayakpaul June 25, 2026 12:25

JingyaHuang and others added 4 commits June 25, 2026 14:25

Merge branch 'main' into support-neuron-tp

b9b048b

fix: style

915eeb1

Merge branch 'support-neuron-tp' of github.com:JingyaHuang/diffusers into support-neuron-tp

720dad2

doc: remove it for now

89cf8b6

JingyaHuang mentioned this pull request Jun 26, 2026

[TPU] Add tensor parallel support for TPU backend #14075

Draft

sayakpaul reviewed Jun 26, 2026

Comment thread docs/source/en/training/distributed_inference.md Outdated

HaozheZhang6 and others added 5 commits June 26, 2026 15:38

clean some stuff to simplify code.

155802c

Merge branch 'main' into JingyaHuang-support-neuron-tp

f133732

sayakpaul reviewed Jun 26, 2026

sayakpaul and others added 7 commits June 26, 2026 17:07

clean more to remove permutation related shenanigans.

b3d8130

revert: put torch.chunk back

7ea75f7

Merge branch 'main' into support-neuron-tp

c73cf09

Merge branch 'main' into support-neuron-tp

eb58402

Update docs/source/en/training/distributed_inference.md

c3e123c

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

Merge branch 'support-neuron-tp' of github.com:JingyaHuang/diffusers into support-neuron-tp

491c537

JingyaHuang wants to merge 54 commits into huggingface:main from JingyaHuang:support-neuron-tp

6 participants
