
NucleusMoE-Image #13317

Open
sippycoder wants to merge 10 commits into huggingface:main from sippycoder:main

Conversation

@sippycoder

What does this PR do?

This PR introduces NucleusMoE-Image series into the diffusers library.

NucleusMoE-Image is a 17B-parameter model with 2B active parameters, trained with efficiency at its core. Our novel architecture highlights the scalability of the sparse MoE architecture for image generation. The technical report will be released very soon.

@sippycoder (Author)

cc: @sayakpaul @IlyasMoutawwakil

@sayakpaul sayakpaul requested review from dg845 and yiyixuxu March 24, 2026 04:08
logger = logging.get_logger(__name__)


# copied from diffusers.models.transformers.transformer_qwenimage.apply_rotary_emb_qwen
Collaborator

Suggested change
# copied from diffusers.models.transformers.transformer_qwenimage.apply_rotary_emb_qwen
# Copied from diffusers.models.transformers.transformer_qwenimage.apply_rotary_emb_qwen with qwen->nucleus

nit: the `# Copied from` mechanism supports renamings with the above syntax.

return self.norm(conditioning)


# copied from diffusers.models.transformers.transformer_qwenimage.QwenEmbedRope
@dg845 (Collaborator) Mar 25, 2026

Suggested change
# copied from diffusers.models.transformers.transformer_qwenimage.QwenEmbedRope
# Copied from diffusers.models.transformers.transformer_qwenimage.QwenEmbedRope with Qwen->NucleusMoE

See #13317 (comment). Alternatively, if NucleusMoEEmbedRope is changed (for example to remove txt_seq_lens as suggested in #13317 (comment)), the # Copied from statement should be removed.

Comment on lines +178 to +185
if txt_seq_lens is not None:
    deprecate(
        "txt_seq_lens",
        "0.39.0",
        "Passing `txt_seq_lens` is deprecated and will be removed in version 0.39.0. "
        "Please use `max_txt_seq_len` instead.",
        standard_warn=False,
    )
Collaborator

As this is a new model, can we remove the dependence on the deprecated txt_seq_lens argument?


def forward(
self,
video_fhw: tuple[int, int, int, list[tuple[int, int, int]]],
Collaborator

Suggested change
video_fhw: tuple[int, int, int, list[tuple[int, int, int]]],
video_fhw: tuple[int, int, int] | list[tuple[int, int, int]],

nit: fix the type annotation.

return out


@maybe_allow_in_graph
Collaborator

I think we can remove the maybe_allow_in_graph decorator as the NucleusMoE-Image transformer compile tests

RUN_SLOW=1 RUN_COMPILE=1 pytest tests/models/transformers/test_models_transformer_nucleusmoe_image.py::TestNucleusMoEImageTransformerCompile

have the same pass/fail pattern with and without it. (Currently, test_compile_on_different_shapes fails both with and without maybe_allow_in_graph; all other tests pass.)

attention_kwargs: dict[str, Any] | None = None,
) -> torch.Tensor:
scale1, gate1, scale2, gate2 = self.img_mod(temb).unsqueeze(1).chunk(4, dim=-1)
scale1, scale2 = 1 + scale1, 1 + scale2
Collaborator

nit: I think it's clearer if we do the calculation inline, e.g. img_modulated = img_normed * (1 + scale1).
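To illustrate the inline form the comment suggests, here is a minimal numpy sketch; `img_normed` and `scale1` are hypothetical stand-ins for the tensors in the block above, not the actual model variables:

```python
import numpy as np

# Hypothetical stand-ins: normed hidden states (batch, dim) and a
# per-sample modulation scale that broadcasts over the feature dim.
img_normed = np.ones((2, 4))
scale1 = np.full((2, 1), 0.5)

# Inline form suggested in the review: apply the +1 shift at the point
# of use instead of pre-computing scale1 = 1 + scale1 earlier.
img_modulated = img_normed * (1 + scale1)
```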

Comment on lines +545 to +546
gate1 = gate1.clamp(min=-2.0, max=2.0)
gate2 = gate2.clamp(min=-2.0, max=2.0)
Collaborator

It seems weird to me that we first clamp the gates to [-2.0, 2.0] and then essentially clamp again by squashing with the tanh function below. Is this intended?
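The redundancy the reviewer points out can be seen numerically: tanh already bounds its output to (-1, 1), so a prior clamp to [-2, 2] only changes results for inputs with |x| > 2, and there only slightly. A small sketch (the `gate_value` helper is hypothetical, mirroring the questioned pattern):

```python
import math

def gate_value(x: float) -> float:
    """Clamp to [-2, 2], then squash with tanh, as in the questioned code."""
    clamped = max(-2.0, min(2.0, x))
    return math.tanh(clamped)

# tanh(2.0) is about 0.964, and tanh saturates toward 1 beyond that, so the
# clamp shifts large inputs by well under 0.04 in the output.
delta = math.tanh(10.0) - gate_value(10.0)
```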

Comment on lines +574 to +575
if hidden_states.dtype == torch.float16:
    hidden_states = hidden_states.clip(-65504, 65504)
Collaborator

Suggested change
if hidden_states.dtype == torch.float16:
    hidden_states = hidden_states.clip(-65504, 65504)
if hidden_states.dtype == torch.float16:
    fp16_finfo = torch.finfo(torch.float16)
    hidden_states = hidden_states.clip(fp16_finfo.min, fp16_finfo.max)
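The magic number 65504 is exactly the largest finite float16 value, which is why the suggestion swaps it for `finfo`. A numpy analogue (numpy's `finfo` mirrors `torch.finfo` for this purpose):

```python
import numpy as np

# float16 has a finite range of +/-65504; reading it from finfo makes the
# intent of the clip explicit instead of relying on a magic constant.
fp16 = np.finfo(np.float16)

x = np.array([1e5, -1e5, 3.0])
clipped = np.clip(x, float(fp16.min), float(fp16.max))
```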

dense_moe_strategy: str = "leave_first_three_and_last_block_dense",
num_experts: int = 128,
moe_intermediate_dim: int = 1344,
capacity_factors: List[float] = [8.0] * 24,
Collaborator

Suggested change
capacity_factors: List[float] = [8.0] * 24,
capacity_factors: float | list[float] = 8.0,

I think allowing capacity_factors to take float arguments as well makes the code a little cleaner. We would then expand float inputs to a list inside __init__:

if isinstance(capacity_factors, float):
    capacity_factors = [capacity_factors] * num_layers
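The expansion above can be sketched as a small standalone helper; `normalize_capacity_factors` is a hypothetical name for illustration, since in the suggestion the expansion would live directly in `__init__`:

```python
def normalize_capacity_factors(capacity_factors, num_layers):
    """Expand a scalar capacity factor to a per-layer list, validating
    list inputs against the layer count."""
    if isinstance(capacity_factors, float):
        capacity_factors = [capacity_factors] * num_layers
    if len(capacity_factors) != num_layers:
        raise ValueError("capacity_factors must have one entry per layer")
    return capacity_factors
```

This keeps the common case (`capacity_factors=8.0`) terse while still allowing per-layer tuning.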

def forward(
self,
hidden_states: torch.Tensor,
img_shapes: list[tuple[int, int, int]] | None = None,
Collaborator

Suggested change
img_shapes: list[tuple[int, int, int]] | None = None,
img_shapes: tuple[int, int, int] | list[tuple[int, int, int]],

I think allowing img_shapes to take tuple[int, int, int] arguments as well would be cleaner, similar to #13317 (comment). If I understand correctly, NucleusMoEEmbedRope only accepts batches with the same image shape, so this would make it easier to specify such shapes.
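The tuple-or-list acceptance could be handled with a one-step normalization at the top of `forward`; a hedged sketch (the helper name and `batch_size` expansion are illustrative assumptions, not the PR's code):

```python
def normalize_img_shapes(img_shapes, batch_size):
    """Accept a single (frames, height, width) tuple or a per-sample list
    of such tuples, returning a per-sample list."""
    if isinstance(img_shapes, tuple):
        img_shapes = [img_shapes] * batch_size
    return img_shapes
```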

"Please use `encoder_hidden_states_mask` instead.",
standard_warn=False,
)

Collaborator

I think we can remove the deprecated txt_seq_lens argument here as well. See #13317 (comment).

"""


def calculate_shift(
Collaborator

Suggested change
def calculate_shift(
# Copied from diffusers.pipelines.qwenimage.pipeline_qwenimage.calculate_shift
def calculate_shift(

return mu


def retrieve_timesteps(
Collaborator

Suggested change
def retrieve_timesteps(
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.retrieve_timesteps
def retrieve_timesteps(

Comment on lines +177 to +178
self.default_sample_size = 128
self.return_index = -8
Collaborator

Should default_sample_size and return_index be configurable via __init__?

Comment on lines +265 to +266
prompt_embeds_mask=None,
negative_prompt_embeds_mask=None,
Collaborator

Suggested change
prompt_embeds_mask=None,
negative_prompt_embeds_mask=None,

nit: remove prompt_embeds_mask and negative_prompt_embeds_mask as they are not used in check_inputs.

return latents

@staticmethod
def _unpack_latents(latents, height, width, vae_scale_factor):
@dg845 (Collaborator) Mar 25, 2026

Could we refactor _pack_latents and _unpack_latents to take a patch_size argument instead of hardcoding the patch size to 2? This would make the code more robust.
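A `patch_size`-parameterized version of the pack/unpack pair could look like the following numpy sketch (the real diffusers helpers are torch static methods on the pipeline; this just illustrates the rearrangement with the hardcoded 2 replaced):

```python
import numpy as np

def pack_latents(latents, patch_size=2):
    """Rearrange (B, C, H, W) latents into (B, (H/p)*(W/p), C*p*p)
    patch tokens, with p = patch_size instead of a hardcoded 2."""
    b, c, h, w = latents.shape
    p = patch_size
    x = latents.reshape(b, c, h // p, p, w // p, p)
    x = x.transpose(0, 2, 4, 1, 3, 5)  # (B, H/p, W/p, C, p, p)
    return x.reshape(b, (h // p) * (w // p), c * p * p)

def unpack_latents(packed, height, width, patch_size=2):
    """Inverse of pack_latents: recover (B, C, H, W) from patch tokens."""
    b, _, d = packed.shape
    p = patch_size
    c = d // (p * p)
    x = packed.reshape(b, height // p, width // p, c, p, p)
    x = x.transpose(0, 3, 1, 4, 2, 5)  # (B, C, H/p, p, W/p, p)
    return x.reshape(b, c, height, width)
```

Round-tripping through both functions returns the original latents, so the refactor is behavior-preserving for `patch_size=2`.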

Comment on lines +336 to +337
height = 2 * (int(height) // (self.vae_scale_factor * 2))
width = 2 * (int(width) // (self.vae_scale_factor * 2))
Collaborator

Could we refactor this to use self.transformer.config.patch_size instead of hardcoding the patch size to 2? See also #13317 (comment).
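Generalizing the expression `2 * (int(dim) // (self.vae_scale_factor * 2))` over a `patch_size` parameter is mechanical; a hypothetical helper for illustration:

```python
def rounded_latent_dim(pixel_dim, vae_scale_factor, patch_size=2):
    """Map a pixel dimension to its latent-space size, rounded down to a
    multiple of patch_size (generalizing the hardcoded 2)."""
    return patch_size * (int(pixel_dim) // (vae_scale_factor * patch_size))
```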

latents = self._pack_latents(latents, batch_size, num_channels_latents, height, width)
return latents

def enable_vae_slicing(self):
@dg845 (Collaborator) Mar 25, 2026

We can remove the VAE slicing/tiling methods here as they are deprecated. Users can always call the corresponding methods on the VAE itself (e.g. pipe.vae.enable_tiling()) to enable/disable slicing/tiling.

self,
prompt: str | list[str] = None,
negative_prompt: str | list[str] = None,
true_cfg_scale: float = 4.0,
Collaborator

Suggested change
true_cfg_scale: float = 4.0,
guidance_scale: float = 4.0,

nit: rename to guidance_scale to follow the diffusers CFG naming conventions.

Comment on lines +551 to +553
latent_h = 2 * (int(height) // (self.vae_scale_factor * 2))
latent_w = 2 * (int(width) // (self.vae_scale_factor * 2))
img_shapes = [(1, latent_h // 2, latent_w // 2)] * (batch_size * num_images_per_prompt)
Collaborator

Similar to #13317 (comment), can we refactor this to use self.transformer.config.patch_size?


noise_pred = self.transformer(
hidden_states=latents,
timestep=timestep / 1000,
Collaborator

Instead of hardcoding this at 1000, could we use self.scheduler.config.num_train_timesteps instead?

noise_norm = torch.norm(comb_pred, dim=-1, keepdim=True)
noise_pred = comb_pred * (cond_norm / noise_norm)

noise_pred = -noise_pred
Collaborator

Why do we need to negate noise_pred here?

def __init__(self):
super().__init__()
# Maps encoder_hidden_states.data_ptr() → (txt_key, txt_value)
self.kv_cache: dict[int, tuple] = {}
Collaborator

Suggested change
self.kv_cache: dict[int, tuple] = {}
self.kv_cache: dict[int, tuple[torch.Tensor, torch.Tensor]] = {}

nit: more specific type annotation

@dg845 (Collaborator) left a comment

Thanks for the PR! Left an initial review :). @yiyixuxu, could you also take a look at the text KV cache code in src/diffusers/hooks/text_kv_cache.py?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines +380 to +391
self.experts = nn.ModuleList(
[
FeedForward(
dim=hidden_size,
dim_out=hidden_size,
inner_dim=moe_intermediate_dim,
activation_fn="swiglu",
bias=False,
)
for _ in range(num_experts)
]
)
Member

You would need the projections to be in a packed/contiguous format, (num_experts, dim_in, dim_out), for torch.grouped_mm support. @sayakpaul, is that possible? In Transformers we use the inline weight converter.

Member

Not at the moment because MoEs are still a bit of a special case in this part of world.
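For context on the packed layout discussed above, here is a numpy reference of what per-token expert projections look like with weights stored as (num_experts, dim_in, dim_out); the helper is a hypothetical sketch (a grouped-matmul kernel would replace the Python loop with one launch over expert-contiguous token groups), not diffusers code:

```python
import numpy as np

def grouped_expert_mm(tokens, expert_ids, packed_w):
    """Apply each token's assigned expert projection.

    tokens:     (num_tokens, dim_in)
    expert_ids: (num_tokens,) integer expert assignment per token
    packed_w:   (num_experts, dim_in, dim_out) contiguous expert weights
    """
    out = np.empty((tokens.shape[0], packed_w.shape[2]), dtype=tokens.dtype)
    for e in range(packed_w.shape[0]):
        mask = expert_ids == e
        if mask.any():
            # All tokens routed to expert e share one matmul.
            out[mask] = tokens[mask] @ packed_w[e]
    return out
```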
