
Add Ernie-Image modular pipeline #13498

Open
akshan-main wants to merge 4 commits into huggingface:main from akshan-main:modular-ernie

Conversation

@akshan-main
Contributor

What does this PR do?

Adds the modular pipeline for ErnieImage (ErnieImageAutoBlocks + ErnieImageModularPipeline).

Parity verified on A100, bf16, 50 steps, 1024x1024 with baidu/ERNIE-Image:

  • MAD (AutoBlocks vs standard): 0.000033
  • Max absolute diff: 0.58 (out of 255)

Colab Notebook: https://colab.research.google.com/gist/akshan-main/f25801763d573209464d6bfd685d708e/modular-ernie-image.ipynb

Addresses #13389 (comment).

Before submitting

Who can review?

@yiyixuxu @sayakpaul

@sayakpaul
Member

Maybe it could live as a custom pipeline on the Hub like https://huggingface.co/krea/krea-realtime-video/tree/main?

@akshan-main
Contributor Author

@sayakpaul @yiyixuxu I can do the Krea pattern by keeping a minimal ErnieImageModularPipeline subclass + a MODULAR_PIPELINE_MAPPING entry in diffusers. Want me to restructure the PR that way?

@sayakpaul
Member

Doing it the Krea way wouldn't require any changes to core Diffusers no?

@akshan-main
Contributor Author

You were right, the fully Hub-only pattern works. I moved everything to https://huggingface.co/akshan-main/ernie-image-modular and it loads end-to-end with zero diffusers changes:

from diffusers.modular_pipelines import ModularPipeline
pipe = ModularPipeline.from_pretrained("akshan-main/ernie-image-modular", trust_remote_code=True)
pipe.load_components(trust_remote_code=True)

No custom ErnieImageModularPipeline class needed.

I inlined the pipeline properties (vae_scale_factor, num_channels_latents, text_in_dim) into the blocks via direct components.vae.config.* / components.transformer.config.* reads.

Should I close this PR, @sayakpaul?

@akshan-main
Contributor Author

hey @sayakpaul, over the last few weeks, I've profiled QwenImage and QwenImageEdit to identify cudaStreamSynchronize calls causing per-step latency, which led to the QwenImage RoPE sync fix (#13406, merged), plus modular pipelines for LTX (#13378, merged) and HunyuanVideo 1.5 (#13389, merged), and an HV1.5 I2V bug fix (#13439, merged). I've also published modular upscale hub blocks for SDXL, Flux1, and Z-Image, and just pushed an Ernie-Image modular hub repo. Would this qualify me for MVP recognition, or is there more I should do to get there? happy to keep contributing either way, just wanted to check.

@yiyixuxu
Collaborator

@sayakpaul @akshan-main

ooh but I'd really love to have official modular support for ERNIE-Image — it's a really good model, trained from scratch, and the Baidu team is committed to releasing more checkpoints and building their own community around it

I think we can release official blocks for text-to-image, image-to-image, and edit pipelines (not yet released) and encourage the community to build more custom stuff on hub using our official blocks

@sayakpaul
Member

Ah my bad, that makes sense. I didn't realize it.

@akshan-main
Contributor Author

@yiyixuxu I think the modular pipeline is good to review now

@akshan-main
Contributor Author

friendly ping @yiyixuxu when you get the chance

@yiyixuxu
Collaborator

@claude can you do a review here? thanks

@github-actions
Contributor

github-actions Bot commented Apr 20, 2026

Claude finished @yiyixuxu's task in 3m 50s


Review: Ernie-Image Modular Pipeline

  • Read repository guidelines (CLAUDE.md, AGENTS.md, models.md, modular.md, pitfalls.md)
  • Read existing modular pipeline for reference (Flux2, LTX, Helios, Flux)
  • Review __init__.py changes
  • Review encoders.py
  • Review before_denoise.py
  • Review denoise.py
  • Review decoders.py
  • Review modular_blocks_ernie_image.py
  • Review modular_pipeline.py
  • Review ernie_image/__init__.py
  • Review dummy objects
  • Post final review

Overall the implementation is well-structured, follows the modular pipeline conventions, and parity numbers look good. A few issues to address:


Issues

1. Alphabetical ordering in src/diffusers/__init__.py (lines 452–453, 1239–1240)

ErnieImageAutoBlocks and ErnieImageModularPipeline are inserted between FluxAutoBlocks and FluxKontextAutoBlocks, breaking alphabetical order. They should come before the Flux2* entries (i.e. at the top of the modular_pipelines list), since E < F.

Same issue in the TYPE_CHECKING block at lines 1239–1240.


2. Missing output_type="latent" support in decoder (decoders.py)

The standard ErnieImagePipeline supports output_type="latent" (line 361 of pipeline_ernie_image.py), returning raw latents without VAE decode. The modular decoder doesn't handle this case — it will raise ValueError for output_type="latent". The Flux modular decoder has this pattern:

if not block_state.output_type == "latent":
    # do VAE decode
    ...
else:
    block_state.images = block_state.latents

Consider adding the same early-return for "latent" to ErnieImageVaeDecoderStep.__call__.


3. ErnieImagePachifier — consider inlining instead of creating a component (modular_pipeline.py:23–63, decoders.py:62–63)

The standard pipeline uses simple static methods _patchify_latents / _unpatchify_latents for this logic. The Flux2 modular pipeline does the same (static methods on blocks). Creating a full ConfigMixin class registered as a ComponentSpec with from_config for what is pure tensor reshaping seems heavyweight.

Currently ErnieImagePachifier is only used in the decoder (for unpack_latents). Consider inlining unpack_latents as a static method on ErnieImageVaeDecoderStep (like Flux2's _unpatchify_latents) and removing the pachifier component entirely.

Not blocking, but simplifies the component graph.

4. Docstring quality — prompt type hints show None (modular_blocks_ernie_image.py)

In the auto-generated docstring for ErnieImageAutoBlocks, several parameter types show None instead of their actual types:

  • prompt (None) — should be str or list[str]
  • negative_prompt (None, *optional*) — should be str or list[str]
  • generator (None, *optional*) — should be torch.Generator

This happens because the InputParam definitions in the encoder/enhancer steps don't specify type_hint. For example in encoders.py:149:

InputParam("prompt", description="The prompt or prompts to guide image generation.")

→ should have type_hint=str.

Same for negative_prompt at line 150 and generator in before_denoise.py:325.

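To make the rendering concrete, here is a hypothetical miniature of the InputParam → docstring mechanism. This is not the real diffusers class — just enough mock machinery to show why an unset type_hint ends up as the literal `None` in the generated docs:

```python
from dataclasses import dataclass

# Hypothetical stand-in for diffusers' InputParam, for illustration only.
@dataclass
class InputParam:
    name: str
    type_hint: object = None
    required: bool = False
    description: str = ""

def render_doc_line(p: InputParam) -> str:
    # Mirrors the "name (`Type`, *optional*)" style seen in the docstrings above.
    type_name = getattr(p.type_hint, "__name__", "None")
    optional = "" if p.required else ", *optional*"
    return f"{p.name} (`{type_name}`{optional}): {p.description}"

# Without type_hint, the rendered type falls back to the literal "None":
without_hint = render_doc_line(InputParam("prompt", description="The prompt."))
# With type_hint=str, the real type is rendered:
with_hint = render_doc_line(InputParam("prompt", type_hint=str, description="The prompt."))
```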

5. ErnieImageCoreDenoiseStep docstring has generator type hint as None (modular_blocks_ernie_image.py:104)

Same issue: generator (None, *optional*) in the generated docstring. The underlying InputParam in before_denoise.py:325 is missing type_hint=torch.Generator.


Minor / Nits

6. MODULAR_PIPELINE_MAPPING ordering (modular_pipeline.py:136)

The ("ernie-image", ...) entry is appended at the end. The mapping isn't strictly alphabetical, but inserting it between existing entries (e.g. after "helios-pyramid" or before "flux") in a more logical spot would be more consistent. Not blocking.

7. Dummy objects ordering (dummy_torch_and_transformers_objects.py)

The ErnieImageAutoBlocks and ErnieImageModularPipeline dummy classes are inserted between Flux2ModularPipeline and FluxAutoBlocks. Alphabetically they should come before the Flux2* entries. make style may or may not fix this automatically.

8. do_sample logic in prompt enhancer (encoders.py:68)

do_sample=temperature != 1.0 or top_p != 1.0,

This means do_sample=True when temperature=0.6 (the default), which is correct. But when temperature=1.0 AND top_p=1.0, it sets do_sample=False (greedy). This is a subtle semantic choice — worth confirming this matches the standard pipeline's behavior. Not a bug per se, just flagging.
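The flag's behavior can be isolated as a tiny predicate (the function name here is introduced for illustration; the expression itself is copied from the snippet above):

```python
# The do_sample expression from encoders.py:68, as a standalone predicate.
def resolve_do_sample(temperature: float, top_p: float) -> bool:
    return temperature != 1.0 or top_p != 1.0

# The default temperature of 0.6 enables sampling.
sampling_default = resolve_do_sample(0.6, 1.0)
# Only when both knobs sit at 1.0 does it fall back to greedy decoding.
greedy_case = resolve_do_sample(1.0, 1.0)
```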


What looks good

  • Guider abstraction: Properly separated from the denoise loop, no guidance_scale as pipeline input — follows modular.md guidelines.
  • No cross-imports: All imports are from diffusers core, no imports from diffusers.pipelines.* or other modular pipelines.
  • BN eps handling: Uses vae.config.batch_norm_eps instead of hardcoded 1e-5 (the standard pipeline hardcodes 1e-5, but the VAE config default is 1e-4 — the modular version is actually more correct here).
  • Test model path: Points to hf-internal-testing/, not a personal repo.
  • File structure: Matches the expected layout from modular.md.
  • Parity: MAD=0.000033, max_diff=0.58/255 looks excellent for bf16.

if isinstance(prompt, str):
    prompt = [prompt]

pe = getattr(components, "pe", None)
Collaborator


@claude what do you think about this pattern?
I think it is unnecessary (if we list a component in expected_components, it should be an attribute) - can you verify by tracing through the code?
Did we talk about this in our agents.md and other docs? Why was it not included in your review?

Comment thread src/diffusers/modular_pipelines/ernie_image/encoders.py
    return padded, lengths


class ErnieImageTextInputStep(ModularPipelineBlocks):
Collaborator

@yiyixuxu Apr 20, 2026


I think we need an input step that expands tensors for num_images_per_prompt and so on.
Can we do the same for Ernie?

Contributor Author


done

@yiyixuxu
Collaborator

@claude I left some additional comments here. Can you look at them and let me know what you think? If you agree with me, let me know why they were not included in your review and how we can update our agents.md to improve moving forward.

@github-actions
Contributor

github-actions Bot commented Apr 20, 2026

Claude finished @yiyixuxu's task in 3m 14s


Analysis of YiYi's review comments

  • Read repository guidelines (AGENTS.md, modular.md)
  • Read the commented code sections (encoders.py, before_denoise.py)
  • Read reference modular pipelines for comparison (Flux, QwenImage, SDXL)
  • Trace through the component registration framework code
  • Analyze comment 1: expected_components attribute access pattern (encoders.py:136)
  • Analyze comment 2: Prompt enhancer rewrite (encoders.py:31)
  • Analyze comment 3: Input step for num_images_per_prompt expansion (before_denoise.py:48)
  • Suggest agents.md improvements

Comment 1: getattr(components, "pe", None) is unnecessary (encoders.py:136)

YiYi is correct. I traced through the framework code:

  1. In modular_pipeline.py:1743-1750, during __init__, every component in expected_components is registered — from_config components get created, from_pretrained components are registered as None.
  2. register_components() at line 2071+ calls setattr(self, name, module) for all components, including None values.

So components.pe always exists as an attribute — it's just None when not loaded. The getattr(components, "pe", None) on line 136 is defensive code that adds no value. The correct pattern is:

if not block_state.use_pe or components.pe is None or components.pe_tokenizer is None:
    ...

No other modular pipeline in the codebase uses getattr(components, ...) for expected components — they all access them directly.

Why this wasn't in the original review: The review looked at this code and saw a pattern that "works correctly" (produces the right result) without questioning whether it follows the framework's idioms. The getattr pattern is functionally harmless, which made it easy to overlook. But YiYi is right that it signals a misunderstanding of how expected_components works, and tolerating it would set a bad precedent for future contributions.

Suggested agents.md / modular.md addition:

10. Using getattr(components, "name", None) for expected components. Components listed in expected_components are always registered as attributes on the pipeline — either as the loaded object or None. Access them directly as components.name. Don't use getattr with a fallback; it obscures the contract and suggests the component might not be registered at all.
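The traced registration behavior can be sketched with a dependency-free toy class. This is an illustrative miniature, not the real diffusers API; it only demonstrates why the getattr fallback is redundant once every expected component is registered via setattr:

```python
# Hypothetical miniature of register_components: every expected component is
# set as an attribute at __init__ time, defaulting to None when not loaded.
class MiniModularPipeline:
    expected_components = ["vae", "transformer", "pe", "pe_tokenizer"]

    def __init__(self, **loaded):
        for name in self.expected_components:
            setattr(self, name, loaded.get(name))

components = MiniModularPipeline(vae="fake-vae")
# The attribute always exists, so direct access is safe:
pe_direct = components.pe            # None, but never AttributeError
vae_direct = components.vae          # "fake-vae"
# getattr(components, "pe", None) therefore adds nothing over components.pe.
pe_via_getattr = getattr(components, "pe", None)
```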


Comment 2: Rewrite ErnieImagePromptEnhancerStep so the block always normalizes prompts (encoders.py:31)

YiYi is correct. Currently the prompt enhancer block does two things entangled together:

  1. Normalizes prompt from str → list[str] (lines 133-134)
  2. Optionally rewrites prompts via the PE model (lines 138-165)

The problem: when use_pe=False (or PE not loaded), the block still normalizes prompt to a list — but the control flow makes this a side-effect of the "no-op" early return (lines 138-142). This couples prompt normalization (always needed) with prompt enhancement (optional).

What YiYi is asking for: Restructure so the block always expands/normalizes prompts (str→list), then conditionally runs the PE rewrite only when use_pe=True. This is cleaner because:

  • Prompt normalization happens unconditionally and visibly
  • The PE enhancement is a clearly optional layer on top
  • The block's contract is clearer: "output is always list[str]"

Why this wasn't in the original review: The review checked that the code works (it does — the early return still normalizes prompts). But it missed the structural issue: the normalization logic shouldn't be gated behind the PE check. This is a code organization/clarity concern that requires understanding the intent of modular blocks (each block has a clear, separable responsibility), not just their correctness.

Suggested agents.md addition: This is more of a general principle already implied by AGENTS.md ("Strive to write code as simple and explicit as possible"). However, we could add to modular.md:

11. Mixing always-needed logic with conditional logic. If a block performs both unconditional work (e.g. normalizing inputs) and conditional work (e.g. an optional model pass), structure the __call__ so unconditional logic runs first, unconditionally. Don't hide it inside a conditional branch's early return.
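The restructuring this rule asks for can be sketched in a few lines. The function and the `enhance` callable are illustrative stand-ins (the real block calls the PE model), but the control-flow shape is the point:

```python
# Illustrative block body: unconditional normalization first, optional
# enhancement second. `enhance` stands in for the PE model call.
def run_enhancer_block(prompt, use_pe: bool, enhance=None):
    # Unconditional work: str -> list[str] normalization always happens,
    # visibly, outside any branch.
    if isinstance(prompt, str):
        prompt = [prompt]
    # Conditional work: the optional PE rewrite layers on top.
    if use_pe and enhance is not None:
        prompt = [enhance(p) for p in prompt]
    # The block's contract is now clear: output is always list[str].
    return prompt

no_pe = run_enhancer_block("a cat", use_pe=False)
with_pe = run_enhancer_block("a cat", use_pe=True, enhance=str.upper)
```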


Comment 3: Need an input step that expands tensors for num_images_per_prompt (before_denoise.py:48)

YiYi is correct. This is the most architecturally significant comment. The established pattern across modular pipelines is:

  • Encoder step produces embeddings at prompt-batch granularity (batch_size = number of prompts)
  • Input step (separate block) handles num_images_per_prompt expansion — repeating/reshaping all tensor inputs to batch_size * num_images_per_prompt

This is the pattern in:

  • Flux: FluxTextInputStep (flux/inputs.py:31-120) — takes prompt_embeds and repeats with .repeat(1, num_images_per_prompt, 1) then reshapes
  • QwenImage: QwenImageTextInputsStep (qwenimage/inputs.py:112-253) — same pattern, also handles masks
  • SDXL: StableDiffusionXLInputStep (stable_diffusion_xl/before_denoise.py:346+) — same pattern

What ErnieImage does instead: The num_images_per_prompt expansion is baked into _encode() in the encoder step (encoders.py:238-239):

for _ in range(num_images_per_prompt):
    text_hiddens.append(hidden)

And ErnieImageTextInputStep (before_denoise.py:48) only handles padding — it doesn't do any num_images_per_prompt expansion because it was already done in the encoder.

Why this matters:

  1. It violates separation of concerns — the encoder should encode, the input step should handle batching
  2. It prevents users from running the encoder standalone and then controlling batch expansion separately
  3. The encoder re-encodes the same prompt N times (via the for-loop) instead of encoding once and repeating the tensor — wasteful if the model forward pass is expensive (though here the hidden states are reused, so it's just appending the same tensor N times, but the pattern is still wrong)
  4. It's inconsistent with every other modular pipeline

The fix would be:

  • Remove num_images_per_prompt from ErnieImageTextEncoderStep — encode each prompt once
  • Add num_images_per_prompt expansion to ErnieImageTextInputStep (the existing class in before_denoise.py:48), following the Flux/QwenImage pattern

Why this wasn't in the original review: The review verified that the code produces correct results and checked for surface-level pattern compliance (guider abstraction, no cross-imports, test paths, etc.). But it didn't compare the architectural decomposition of encoder vs. input step against reference implementations like Flux or QwenImage. The num_images_per_prompt handling was split across files in a way that technically works but violates the modular design principle of separating encoding from batching.

Suggested modular.md addition:

12. Baking num_images_per_prompt expansion into encoder steps. Encoder steps should produce embeddings at prompt-batch granularity only (one embedding set per prompt). The num_images_per_prompt expansion (repeating/reshaping tensors to batch_size * num_images_per_prompt) belongs in a separate input step placed between encoding and denoising. See FluxTextInputStep, QwenImageTextInputsStep for reference. This separation lets users run encoding standalone without committing to a batch expansion factor.
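The encoder/input-step split above can be sketched without any framework code. Lists stand in for tensors and the function names are illustrative; the real blocks use the Flux-style repeat-then-reshape on tensors:

```python
# Encoder step: one embedding per prompt, no batch expansion.
def encode_prompts(prompts):
    return [f"emb({p})" for p in prompts]

# Input step: repeat each embedding to batch_size * num_images_per_prompt,
# mirroring the repeat/reshape done by FluxTextInputStep.
def expand_for_num_images(prompt_embeds, num_images_per_prompt):
    return [e for e in prompt_embeds for _ in range(num_images_per_prompt)]

embeds = encode_prompts(["cat", "dog"])          # prompt-batch granularity
batch = expand_for_num_images(embeds, num_images_per_prompt=3)
```

Because expansion lives outside the encoder, a user can run `encode_prompts` once, cache the result, and pick the expansion factor later.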


Summary of suggested modular.md additions

Three new gotcha entries to add:

  • Gotcha 10: Don't use getattr(components, ...) for expected components; they're always registered as attributes. (Missed because the review checked correctness, not framework idiom compliance.)
  • Gotcha 11: Don't hide unconditional logic inside conditional branches. (Missed because the review checked behavior, not code structure/clarity.)
  • Gotcha 12: Don't bake num_images_per_prompt into encoder steps; use a separate input step. (Missed because the review didn't compare the architectural decomposition against reference pipelines.)

The common thread: the review focused on "does it work correctly?" but under-weighted "does it follow the established modular architecture patterns?" Adding these explicit rules to modular.md would make pattern violations more detectable in future reviews.


@yiyixuxu
Collaborator

@claude for your reviews,

from the first review

  1. Missing output_type="latent" support in decoder (decoders.py)

For modular, if you want latents, all you need to do is not run the decode step (it is very easy to do so) - do you agree? If so, can you update the list of things we need to include in modular.md to include this, so that in the future we are aligned on this?

@github-actions
Contributor

github-actions Bot commented Apr 20, 2026

Claude finished @yiyixuxu's task in 1m 49s


Responding to YiYi's feedback on output_type="latent"

  • Analyze YiYi's point about modular decoder design
  • Read modular.md and reference pipelines to verify
  • Propose updated modular.md guideline

Yes, I fully agree. My original review point #2 was wrong — it applied standard pipeline thinking to the modular context.

In modular pipelines, users compose steps explicitly via InsertableDict. If they want latent output, they simply remove the decode block:

blocks = pipe.blocks
blocks.sub_blocks.pop("decode")
output = pipe(..., blocks=blocks)

No special output_type="latent" handling is needed inside the decoder. This is the whole point of the modular architecture — composition replaces conditional logic.

I verified this is the intended pattern: the QwenImage decoder (qwenimage/decoders.py:231) has an explicit comment from you:

# YiYi Notes: remove support for output_type = "latents', we can just skip decode/encode step in modular

Newer modular pipelines (QwenImage, LTX, Wan, Helios) all follow this — their decoders always decode, no conditional latent passthrough. Only older ones (Flux, SDXL) still have the output_type="latent" check in the decoder.

Proposed addition to modular.md gotchas

This should be added as gotcha #10:

10. Adding output_type="latent" support inside decoder blocks. In standard pipelines, the decoder checks output_type == "latent" to skip VAE decoding and return raw latents. In modular pipelines this is unnecessary — users simply remove the decode block from the block sequence (e.g. blocks.sub_blocks.pop("decode")). Decoder blocks should always decode. Don't add conditional latent passthrough logic. Same principle applies to encoder blocks: if users want to skip encoding, they remove the encoder block rather than passing pre-computed embeddings (see gotcha #4).

This pairs naturally with existing gotcha #4 (which already captures the encoder-side version of the same principle). Together they establish: in modular pipelines, "skip a step" means removing the block, not adding conditional bypass logic inside it.
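The "composition replaces conditional logic" principle can be shown with a toy block registry. A plain dict stands in for the InsertableDict mentioned above; block names and behaviors are illustrative:

```python
# Toy pipeline: blocks live in an ordered mapping, and "skip decode" means
# removing the block, not branching inside it.
def make_blocks():
    return {
        "denoise": lambda state: {**state, "latents": "denoised"},
        "decode": lambda state: {**state, "images": f"decoded({state['latents']})"},
    }

def run(blocks, state):
    for step in blocks.values():
        state = step(state)
    return state

# Full pipeline: the decode block runs and produces images.
full = run(make_blocks(), {})

# Latent output: just drop the decode block; no output_type="latent" branch.
blocks = make_blocks()
blocks.pop("decode")
latent_only = run(blocks, {})
```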

Updated full list of proposed modular.md additions

Combining the guidelines from my earlier analysis of YiYi's review comments with this new one:

  • Gotcha 10: Don't add output_type="latent" inside decoder blocks; users skip the decode block entirely. (Missed because the review applied standard pipeline patterns to the modular context.)
  • Gotcha 11: Don't use getattr(components, ...) for expected components; they're always registered as attributes. (Missed because the review checked correctness, not framework idiom compliance.)
  • Gotcha 12: Don't hide unconditional logic inside conditional branches. (Missed because the review checked behavior, not code structure/clarity.)
  • Gotcha 13: Don't bake num_images_per_prompt into encoder steps; use a separate input step. (Missed because the review didn't compare the architectural decomposition against reference pipelines.)

InputParam("prompt", description="The prompt or prompts to guide image generation."),
InputParam("negative_prompt", description="The prompt or prompts to avoid during image generation."),
InputParam(
    "num_images_per_prompt",
Collaborator


Yeah, let's not accept num_images_per_prompt here - it is more flexible to expand the shape outside the encoder steps, since we want to support the use case where users want to pre-compute text prompts.

Contributor Author


done

@yiyixuxu
Collaborator

@akshan-main thanks for the PR
Sorry if the back and forth with Claude created a bit of noise here - I left some inline comments for things that I think make sense to change.
Additionally, we may ask you to push some updates to modular.md before merge (but don't worry about it for now).

@akshan-main
Contributor Author

@yiyixuxu all good, thanks for the detailed review. Working through the inline comments now, will push updates shortly. Happy to help with modular.md updates when needed

@github-actions Bot added size/L PR with diff > 200 LOC and removed size/L PR with diff > 200 LOC labels Apr 21, 2026
@akshan-main
Contributor Author

@yiyixuxu moved num_images_per_prompt off the encoder, dropped the getattr, and wrapped the enhancer in an AutoBlock on use_pe. When you get the time 👍

@akshan-main requested a review from yiyixuxu April 21, 2026 02:27
@yiyixuxu
Collaborator

@claude can you do a review again? thanks

@github-actions
Contributor

github-actions Bot commented Apr 22, 2026

Claude finished @yiyixuxu's task in 4m 30s


Review: Ernie-Image Modular Pipeline (Round 2)

  • Read repository guidelines (AGENTS.md, modular.md, models.md, pitfalls.md)
  • Get current diff and read all changed files
  • Verify previous review issues were addressed
  • Review encoders.py
  • Review before_denoise.py
  • Review denoise.py
  • Review decoders.py
  • Review modular_blocks_ernie_image.py
  • Review modular_pipeline.py
  • Review __init__.py files and ordering
  • Review dummy objects
  • Review test file
  • Post final review

Good progress since round 1. The three structural issues YiYi flagged are properly addressed:

  • num_images_per_prompt is now handled in ErnieImageTextInputStep (not the encoder)
  • getattr(components, ...) for expected components is removed
  • Prompt enhancer is wrapped in AutoPipelineBlocks triggered on use_pe

Two remaining issues to fix before merge:


Issues

1. generator type_hint still shows None in auto-generated docstrings

modular_blocks_ernie_image.py:102 and :165 both show:

generator (`None`, *optional*):

The underlying InputParam in before_denoise.py:230 correctly has type_hint=torch.Generator, and reference implementations (Helios, QwenImage, LTX, HunyuanVideo1.5) all render as generator (`Generator`, *optional*). The docstrings need to be regenerated:

python utils/modular_auto_docstring.py --fix_and_overwrite

2. Test model path uses personal repo (modular.md gotcha #9)

tests/modular_pipelines/ernie_image/test_modular_pipeline_ernie_image.py:38:

pretrained_model_name_or_path = "akshan-main/tiny-ernie-image-modular-pipe"

Per modular.md gotcha #9: "Tiny test models must live under hf-internal-testing/, not personal repos." This needs to be moved before merge.


What was fixed from round 1 (verified)

  • Alphabetical ordering in src/diffusers/__init__.py: ✅ Fixed (ErnieImage* now correctly before Flux2*)
  • getattr(components, "pe", None) for expected components: ✅ Fixed (removed; direct access used)
  • Prompt enhancer always normalizes prompts: ✅ Fixed (always normalizes str→list, wrapped in AutoPipelineBlocks on use_pe)
  • num_images_per_prompt baked into encoder: ✅ Fixed (moved to ErnieImageTextInputStep._expand())
  • prompt/negative_prompt missing type_hint: ✅ Fixed (type_hint=str added)

What looks good

  • ErnieImageTextInputStep (before_denoise.py:48-153): Clean separation — _expand() repeats each hidden state per num_images_per_prompt, then _pad_text() stacks into a batched tensor. This correctly handles the variable-length text hidden states that ErnieImage uses (unlike Flux/QwenImage which have fixed-length embeddings).
  • ErnieImageAutoPromptEnhancerStep (modular_blocks_ernie_image.py:32-73): Good use of AutoPipelineBlocks with block_trigger_inputs = ["use_pe"] — the enhancer is cleanly optional.
  • Guider abstraction (denoise.py:126-152): Correctly follows the reference pattern (matches QwenImage/Flux2-Klein). Uses getattr(block_state, ...) for guider_inputs — this is the correct pattern for block_state (as opposed to components, where direct access is expected).
  • No output_type="latent" in decoder: Correctly follows the modular pattern — users skip the decode block entirely rather than adding conditional bypass logic.
  • BN eps from config (decoders.py:78): Uses vae.config.batch_norm_eps instead of hardcoded value.
  • Alphabetical ordering: Correct in src/diffusers/__init__.py and dummy objects.
  • No cross-imports: All imports from diffusers core.
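The expand-then-pad flow described in the first bullet can be sketched with plain lists standing in for variable-length hidden-state tensors. Function names mirror the `_expand()` / `_pad_text()` pair mentioned above but are illustrative, not the real implementations:

```python
# Step 1: repeat each per-prompt hidden state num_images_per_prompt times.
def expand(hiddens, num_images_per_prompt):
    return [h for h in hiddens for _ in range(num_images_per_prompt)]

# Step 2: pad variable-length sequences to a common length so they can be
# stacked into one batched tensor; keep the true lengths for masking.
def pad_text(hiddens, pad=0):
    lengths = [len(h) for h in hiddens]
    max_len = max(lengths)
    padded = [h + [pad] * (max_len - len(h)) for h in hiddens]
    return padded, lengths

# Two prompts of different lengths, two images per prompt.
batch = expand([[1, 2], [3, 4, 5]], num_images_per_prompt=2)
padded, lengths = pad_text(batch)
```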

Collaborator

@yiyixuxu left a comment


Thanks, I left a few more comments.

    description="Number of images to generate per prompt.",
),
InputParam(
    "batch_size",
Collaborator


This should not be an input, no? We can always derive it from prompt_embeds, no?
What happens if prompt_embeds has a different batch size from batch_size?

]

block_state.prompt = revised
block_state.revised_prompts = list(revised)
Collaborator


I think we should either:

  • keep the original prompt in prompt
  • or drop revised_prompts

Both prompt and revised_prompts are the same thing now; we don't need both.

else:
    block_state.negative_prompt_embeds = None

state.set("batch_size", len(prompt))
Collaborator


Let's avoid this pattern, cc @claude here.
If you want to add something to state, declare it as an intermediate_output (but I don't think we need to produce batch_size as an output here; we do that in input steps in core-denoise like other pipelines).

    description="Number of images to generate per prompt.",
),
InputParam(
    "batch_size",
Collaborator


This should not be an input; it should be an output derived from prompt_embeds.

def __call__(self, components: ErnieImageModularPipeline, block_state: BlockState, i: int, t: torch.Tensor):
    guider_inputs = {
        "text_bth": (
            getattr(block_state, "text_bth", None),
Collaborator


@claude what do you think of this pattern? Do we need to do getattr(..., None) here?

@yiyixuxu
Collaborator

@claude can you look at my reviews and update your summary on things we can improve in agents.md/modular.md that can help you with future reviews?

@github-actions
Contributor

github-actions Bot commented Apr 22, 2026

Claude finished @yiyixuxu's task in 5m 27s


Analysis of YiYi's latest review comments + consolidated modular.md improvements

  • Read repository guidelines (AGENTS.md, modular.md)
  • Read current code at commented locations
  • Read reference pipelines for comparison (Flux, LTX, WAN, QwenImage, Helios, HunyuanVideo1.5, Z-Image)
  • Trace through framework code (get_block_state, set_block_state, BlockState)
  • Analyze comment 1: batch_size should be derived from prompt_embeds, not accepted as input
  • Analyze comment 2: prompt vs revised_prompts redundancy
  • Analyze comment 3: state.set() pattern — use intermediate_outputs instead
  • Analyze comment 4: getattr(block_state, ..., None) in denoise guider inputs
  • Produce consolidated summary of modular.md improvements (all rounds)

Comment 1: batch_size should not be an input (before_denoise.py:83)

YiYi is correct. batch_size is trivially derivable from prompt_embeds:

# Current (before_denoise.py:135) - accepts batch_size as optional InputParam
block_state.batch_size = block_state.batch_size or len(prompt_embeds)

This creates a potential inconsistency: if a user passes batch_size=3 but prompt_embeds has 5 entries, the pipeline silently uses 3. The reference pattern in other pipelines:

  • Flux (flux/inputs.py:106): block_state.batch_size = block_state.prompt_embeds.shape[0]
  • LTX (ltx/before_denoise.py:137): block_state.batch_size = block_state.prompt_embeds.shape[0]
  • WAN (wan/encoders.py:239): derives from len(prompt) in encoder, but doesn't expose it as an InputParam

None of these accept batch_size as an InputParam in the input step. They all derive it from the actual data.

Why this wasn't in my review: I verified that batch_size flowed correctly through the pipeline and produced correct results. I didn't question whether it should be an input vs. derived value — I checked the plumbing, not the API design.

Proposed gotcha:

Accepting batch_size as an input when it can be derived. In input steps, batch_size should be derived from the actual data (e.g. len(prompt_embeds) or prompt_embeds.shape[0]), not accepted as an InputParam. Accepting it creates a potential inconsistency if the user passes a value that disagrees with the data. See FluxTextInputStep, LTXTextInputStep for reference.
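A minimal sketch of the recommended pattern, using a plain namespace as a stand-in for the real diffusers block state (the function name here is illustrative, not the actual FluxTextInputStep API):

```python
from types import SimpleNamespace

def text_input_step(block_state):
    # Derive batch_size from the data itself, mirroring the pattern in
    # FluxTextInputStep / LTXTextInputStep, instead of accepting it as an
    # InputParam that could disagree with prompt_embeds.
    block_state.batch_size = len(block_state.prompt_embeds)
    return block_state

block_state = SimpleNamespace(prompt_embeds=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
text_input_step(block_state)
print(block_state.batch_size)  # → 3
```

With this shape there is no way for a user-supplied batch_size to silently disagree with the embeddings.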


Comment 2: prompt vs revised_prompts redundancy (encoders.py:143)

YiYi is correct. After the enhancer runs:

# encoders.py:142-143
block_state.prompt = revised          # overwrites original with enhanced version
block_state.revised_prompts = list(revised)   # saves the exact same list

Both prompt and revised_prompts contain the same data. The original prompt is lost. Either:

  • Keep the original prompt in prompt and put the enhanced version only in revised_prompts
  • Or drop revised_prompts entirely since prompt already has the rewritten version

No other modular pipeline has this pattern (prompt enhancers are unique to Ernie).

Why this wasn't in my review: I flagged the type hints on prompt/revised_prompts (round 1, issue #4) but didn't examine the semantic relationship between the two outputs. The code "worked" — both values were populated — but I didn't question whether having two identical outputs made sense.

Proposed gotcha:

Producing redundant outputs. Don't declare multiple intermediate_outputs that always contain the same value. Each output should carry distinct information. If a block transforms an input in-place (e.g. rewriting prompt), either preserve the original under a different name or don't create a copy under a new name.
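A toy sketch of the first option (preserve the original, store the enhanced version under a distinct name); the enhancer callable and field names are illustrative stand-ins for the Ernie prompt enhancer:

```python
from types import SimpleNamespace

def enhance_prompt_step(block_state, enhancer):
    # Keep the original user prompt intact; put the rewritten text under a
    # different name so the two outputs carry distinct information.
    block_state.revised_prompts = [enhancer(p) for p in block_state.prompt]
    return block_state

bs = SimpleNamespace(prompt=["a cat"])
enhance_prompt_step(bs, lambda p: p + ", highly detailed")
print(bs.prompt)           # → ['a cat']  (original preserved)
print(bs.revised_prompts)  # → ['a cat, highly detailed']
```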


Comment 3: state.set() bypasses the block interface (encoders.py:254)

YiYi is correct. The encoder directly writes to global state:

# encoders.py:254
state.set("batch_size", len(prompt))

I searched all modular pipelines — only HunyuanVideo1.5 (encoders.py:307,390) and ErnieImage use state.set() directly in block __call__. All other pipelines (Flux, LTX, WAN, Helios, QwenImage, Z-Image, SDXL, Flux2) exclusively use set_block_state(), which writes to state via declared intermediate_outputs.

The proper pattern is:

  1. Declare the value as an intermediate_output on the block
  2. Set it on block_state
  3. Let set_block_state() write it to global state

Using state.set() bypasses the block's declared interface — the value doesn't appear in intermediate_outputs, so it's invisible to the framework's introspection and documentation.

Additionally, YiYi's broader point: batch_size shouldn't be produced by the encoder at all. The encoder should encode. The input step should derive batch_size from prompt_embeds. This follows the separation of concerns pattern established by Flux and LTX.

Why this wasn't in my review: I didn't audit whether blocks used state.set() vs set_block_state(). The state.set() call is a single line that doesn't affect correctness — it just bypasses the declared interface. I wasn't checking for framework idiom compliance at that level of detail.

Proposed gotcha:

Using state.set() directly inside block __call__. Blocks should only write to state through their declared intermediate_outputs via set_block_state(). Don't call state.set() directly — it bypasses the block's declared interface, making the output invisible to framework introspection and documentation. If you need to pass a value to downstream blocks, declare it as an intermediate_output.
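A toy reimplementation of the declared-interface contract (not the real diffusers set_block_state, just its shape): only values listed in intermediate_outputs are copied to global state, so anything written outside that list never becomes visible downstream.

```python
from types import SimpleNamespace

class ToyBlock:
    # Only values declared here are exported to global state.
    intermediate_outputs = ["prompt_embeds"]

def set_block_state(state, block, block_state):
    # Copy only the block's declared intermediate_outputs into global state.
    for name in block.intermediate_outputs:
        state[name] = getattr(block_state, name)

state = {}
bs = SimpleNamespace(prompt_embeds=[[0.1]], batch_size=1)
set_block_state(state, ToyBlock(), bs)
print(sorted(state))  # → ['prompt_embeds']  (batch_size never reaches state)
```

A direct state.set("batch_size", ...) would sidestep this filter, which is exactly why the value stays invisible to introspection.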


Comment 4: getattr(block_state, ..., None) in denoise guider inputs (denoise.py:128)

YiYi is correct — getattr(block_state, ..., None) is unnecessary here. This is the same principle as the getattr(components, ...) issue from round 1, but applied to block_state.

The framework's get_block_state() (modular_pipeline.py:491-516) creates a BlockState by collecting all declared inputs from state. For required inputs, it raises ValueError if missing. For optional inputs (like negative_text_bth), it sets them to None. So all declared inputs are guaranteed to exist as attributes on block_state.
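The contract described above can be sketched in a few lines (a toy stand-in for the real get_block_state, kept only to show why every declared input is guaranteed to exist as an attribute):

```python
from types import SimpleNamespace

def get_block_state(state, required, optional):
    # Required inputs must exist in state; optional ones default to None.
    # Either way, every declared input ends up as an attribute.
    block_state = SimpleNamespace()
    for name in required:
        if name not in state:
            raise ValueError(f"Required input '{name}' is missing")
        setattr(block_state, name, state[name])
    for name in optional:
        setattr(block_state, name, state.get(name))
    return block_state

bs = get_block_state({"text_bth": [[0.1]]}, ["text_bth"], ["negative_text_bth"])
print(bs.text_bth, bs.negative_text_bth)  # direct access is safe; no getattr default needed
```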

Current code:

# denoise.py:126-135
guider_inputs = {
    "text_bth": (
        getattr(block_state, "text_bth", None),        # unnecessary
        getattr(block_state, "negative_text_bth", None), # unnecessary
    ),
    ...
}

Should be:

guider_inputs = {
    "text_bth": (block_state.text_bth, block_state.negative_text_bth),
    ...
}

However, I want to flag that this pattern is widespread in existing pipelines:

  • SDXL (stable_diffusion_xl/denoise.py:204-219): Uses getattr(block_state, ..., None) for all guider inputs
  • Flux2 (flux2/denoise.py:326-334): Same pattern
  • QwenImage (qwenimage/denoise.py:209-218): Same pattern

Newer pipelines have moved away from this:

  • Helios (helios/denoise.py:404-405): Direct block_state.prompt_embeds access
  • WAN, LTX, Z-Image: Use prepare_inputs_from_block_state() method, which encapsulates the extraction entirely

The direction is clearly toward either direct access or the prepare_inputs_from_block_state API. New pipelines should follow the newer pattern.
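A hypothetical sketch of the encapsulated style: the method name comes from the pipelines cited above, but this toy version only illustrates the shape, not the real WAN/LTX/Z-Image implementation.

```python
from types import SimpleNamespace

class ToyDenoiseBlock:
    # Conditional/unconditional input pairs this block feeds to the guider.
    guider_input_pairs = {"text_bth": "negative_text_bth"}

    def prepare_inputs_from_block_state(self, block_state):
        # getattr without a default is plain attribute access: declared
        # inputs always exist on block_state (possibly as None), so no
        # getattr(..., None) fallback is needed.
        return {
            cond: (getattr(block_state, cond), getattr(block_state, neg))
            for cond, neg in self.guider_input_pairs.items()
        }

bs = SimpleNamespace(text_bth=[[0.1]], negative_text_bth=None)
print(ToyDenoiseBlock().prepare_inputs_from_block_state(bs))
# → {'text_bth': ([[0.1]], None)}
```

The extraction logic lives in one place on the block, so the denoise loop body never touches block_state attributes directly.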

Why this wasn't in my review: In round 1, I correctly flagged getattr(components, ...) for expected components. But I missed the same pattern applied to block_state in the denoiser, partly because several existing pipelines (SDXL, Flux2, QwenImage) use this exact pattern, making it look "normal." I was comparing against the wrong references.

Proposed gotcha (extends the existing round-1 proposal):

Using getattr with defaults for declared inputs/components. Components listed in expected_components are always registered as attributes — either as the loaded object or None. Similarly, inputs declared in a block's inputs list are always present on block_state after get_block_state() — either with their value or None. Access both directly (e.g. components.name, block_state.name). Don't use getattr(obj, name, None) — it obscures the framework's contract and suggests the attribute might not exist. For guider input construction, prefer prepare_inputs_from_block_state() (see WAN, LTX, Z-Image) or direct access (see Helios).


Consolidated modular.md gotcha proposals (all rounds)

Combining findings from round 1, round 2, and this round:

| # | Proposed Gotcha | Source | Why Missed in Review |
|---|---|---|---|
| 10 | Don't add output_type="latent" inside decoder blocks. Users skip the decode block entirely. Decoders should always decode. | Round 1, YiYi feedback | Applied standard pipeline thinking to modular context |
| 11 | Don't use getattr with defaults for declared inputs/components. Components and block_state inputs are always registered as attributes. Access directly. For guider inputs, prefer prepare_inputs_from_block_state(). | Round 1 + this round, YiYi's comments | Checked correctness, not framework idiom compliance; older pipelines (SDXL, Flux2) use same anti-pattern |
| 12 | Don't hide unconditional logic inside conditional branches. If a block does both unconditional work (normalizing inputs) and conditional work (optional model pass), run unconditional logic first, unconditionally. | Round 1, YiYi's comment | Checked behavior, not code structure |
| 13 | Don't bake num_images_per_prompt into encoder steps. Encoders produce one embedding set per prompt. Expansion belongs in a separate input step. | Round 1, YiYi's comment | Didn't compare architectural decomposition against reference pipelines |
| 14 | Don't accept batch_size as input when derivable from data. Input steps should derive batch_size from prompt_embeds shape/length, not accept it as an InputParam. | This round, YiYi's comment | Checked data flow, not API design |
| 15 | Don't produce redundant outputs. Each intermediate_output should carry distinct information. Don't create copies of the same value under different names. | This round, YiYi's comment | Checked that outputs were populated, not whether they were semantically distinct |
| 16 | Don't use state.set() directly inside blocks. Write to state only through declared intermediate_outputs via set_block_state(). Direct state.set() bypasses the block's interface. | This round, YiYi's comment | Didn't audit framework API usage patterns |

Common review failure modes

Looking across all rounds, there are three recurring themes in what I missed:

  1. Correctness ≠ Idiom compliance. Code that produces correct results can still violate framework contracts. Reviews need to check both.

  2. Comparing against wrong references. Older pipelines (SDXL, Flux2) have patterns that were acceptable at the time but are now superseded. New pipelines should be compared against the newest reference implementations (Helios, WAN, Z-Image, LTX).

  3. Data flow vs. API design. Checking that values flow correctly through the pipeline is necessary but insufficient. Reviews also need to check whether values should flow that way — is this the right block to produce this value? Should this be an input or derived?

Suggested additions to agents.md / review instructions

To help future reviews catch these issues, I'd suggest adding to the review guidelines:

When reviewing modular pipelines, check against the newest reference implementations (currently Helios, WAN, Z-Image, LTX — not SDXL or Flux which use older patterns). Specifically verify:

  • Encoder steps only encode; num_images_per_prompt and batch_size are handled in input steps
  • All state writes go through intermediate_outputs / set_block_state(), never state.set() directly
  • No getattr(obj, name, None) for declared components or inputs
  • Each output carries distinct information
  • Unconditional logic isn't hidden inside conditional branches

@github-actions github-actions Bot added size/L PR with diff > 200 LOC and removed size/L PR with diff > 200 LOC labels Apr 22, 2026
@akshan-main
Contributor Author

@yiyixuxu addressed the review feedback: batch_size is now derived from prompt_embeds/text_bth, and I dropped revised_prompts, the direct state.set() call, and the getattr on block_state. Docstrings regenerated too.

@akshan-main akshan-main requested a review from yiyixuxu April 22, 2026 23:16