
Add Ernie-Image modular pipeline #13498

Open
akshan-main wants to merge 4 commits into huggingface:main from akshan-main:modular-ernie

Conversation

@akshan-main
Contributor

What does this PR do?

Adds the modular pipeline for ErnieImage (ErnieImageAutoBlocks + ErnieImageModularPipeline).

Parity verified on A100, bf16, 50 steps, 1024x1024 with baidu/ERNIE-Image:

  • MAD (AutoBlocks vs standard): 0.000033
  • Max absolute diff: 0.58 (out of 255)

Colab Notebook: https://colab.research.google.com/gist/akshan-main/f25801763d573209464d6bfd685d708e/modular-ernie-image.ipynb

Addresses #13389 (comment).

Before submitting

Who can review?

@yiyixuxu @sayakpaul

@sayakpaul
Member

Maybe it could live as a custom pipeline on the Hub like https://huggingface.co/krea/krea-realtime-video/tree/main?

@akshan-main
Contributor Author

@sayakpaul @yiyixuxu I can do the Krea pattern by keeping a minimal ErnieImageModularPipeline subclass + a MODULAR_PIPELINE_MAPPING entry in diffusers. Want me to restructure the PR that way?

@sayakpaul
Member

Doing it the Krea way wouldn't require any changes to core Diffusers no?

@akshan-main
Contributor Author

You were right, the fully Hub-only pattern works. I moved everything to https://huggingface.co/akshan-main/ernie-image-modular and it loads end-to-end with zero diffusers changes:

from diffusers.modular_pipelines import ModularPipeline
pipe = ModularPipeline.from_pretrained("akshan-main/ernie-image-modular", trust_remote_code=True)
pipe.load_components(trust_remote_code=True)

No custom ErnieImageModularPipeline class needed.

I inlined the pipeline properties (vae_scale_factor, num_channels_latents, text_in_dim) into the blocks via direct components.vae.config.* / components.transformer.config.* reads.

Should I close this PR, @sayakpaul?

@akshan-main
Contributor Author

hey @sayakpaul, over the last few weeks, I've profiled QwenImage and QwenImageEdit to identify cudaStreamSynchronize calls causing per-step latency, which led to the QwenImage RoPE sync fix (#13406, merged), plus modular pipelines for LTX (#13378, merged) and HunyuanVideo 1.5 (#13389, merged), and an HV1.5 I2V bug fix (#13439, merged). I've also published modular upscale hub blocks for SDXL, Flux1, and Z-Image, and just pushed an Ernie-Image modular hub repo. Would this qualify me for MVP recognition, or is there more I should do to get there? happy to keep contributing either way, just wanted to check.

@yiyixuxu
Collaborator

@sayakpaul @akshan-main

ooh but I'd really love to have official modular support for ERNIE-Image — it's a really good model, trained from scratch, and the Baidu team is committed to releasing more checkpoints and building their own community around it

I think we can release official blocks for text-to-image, image-to-image, and edit pipelines (not yet released) and encourage the community to build more custom stuff on hub using our official blocks

@sayakpaul
Member

Ah my bad, that makes sense. I didn't realize it.

@akshan-main
Contributor Author

@yiyixuxu I think the modular pipeline is good to review now

@akshan-main
Contributor Author

friendly ping @yiyixuxu when you get the chance

@yiyixuxu
Collaborator

@claude can you do a review here? thanks

@github-actions
Contributor

github-actions Bot commented Apr 20, 2026

Claude finished @yiyixuxu's task in 3m 50s


Review: Ernie-Image Modular Pipeline

  • Read repository guidelines (CLAUDE.md, AGENTS.md, models.md, modular.md, pitfalls.md)
  • Read existing modular pipeline for reference (Flux2, LTX, Helios, Flux)
  • Review __init__.py changes
  • Review encoders.py
  • Review before_denoise.py
  • Review denoise.py
  • Review decoders.py
  • Review modular_blocks_ernie_image.py
  • Review modular_pipeline.py
  • Review ernie_image/__init__.py
  • Review dummy objects
  • Post final review

Overall the implementation is well-structured, follows the modular pipeline conventions, and parity numbers look good. A few issues to address:


Issues

1. Alphabetical ordering in src/diffusers/__init__.py (lines 452–453, 1239–1240)

ErnieImageAutoBlocks and ErnieImageModularPipeline are inserted between FluxAutoBlocks and FluxKontextAutoBlocks, breaking alphabetical order. They should come before the Flux2* entries (i.e. at the top of the modular_pipelines list), since E < F.

Same issue in the TYPE_CHECKING block at lines 1239–1240.


2. Missing output_type="latent" support in decoder (decoders.py)

The standard ErnieImagePipeline supports output_type="latent" (line 361 of pipeline_ernie_image.py), returning raw latents without VAE decode. The modular decoder doesn't handle this case — it will raise ValueError for output_type="latent". The Flux modular decoder has this pattern:

if not block_state.output_type == "latent":
    # do VAE decode
    ...
else:
    block_state.images = block_state.latents

Consider adding the same early-return for "latent" to ErnieImageVaeDecoderStep.__call__.


3. ErnieImagePachifier — consider inlining instead of creating a component (modular_pipeline.py:23–63, decoders.py:62–63)

The standard pipeline uses simple static methods _patchify_latents / _unpatchify_latents for this logic. The Flux2 modular pipeline does the same (static methods on blocks). Creating a full ConfigMixin class registered as a ComponentSpec with from_config for what is pure tensor reshaping seems heavyweight.

Currently ErnieImagePachifier is only used in the decoder (for unpack_latents). Consider inlining unpack_latents as a static method on ErnieImageVaeDecoderStep (like Flux2's _unpatchify_latents) and removing the pachifier component entirely.

Not blocking, but simplifies the component graph.

4. Docstring quality — prompt type hints show None (modular_blocks_ernie_image.py)

In the auto-generated docstring for ErnieImageAutoBlocks, several parameter types show None instead of their actual types:

  • prompt (None) — should be str or list[str]
  • negative_prompt (None, *optional*) — should be str or list[str]
  • generator (None, *optional*) — should be torch.Generator

This happens because the InputParam definitions in the encoder/enhancer steps don't specify type_hint. For example in encoders.py:149:

InputParam("prompt", description="The prompt or prompts to guide image generation.")

→ should have type_hint=str.

Same for negative_prompt at line 150 and generator in before_denoise.py:325.

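To make the rendering concrete, here is a hypothetical miniature of the InputParam → docstring mechanism. This is not the real diffusers class — just enough mock machinery to show why an unset type_hint ends up as the literal `None` in the generated docs:

```python
from dataclasses import dataclass

# Hypothetical stand-in for diffusers' InputParam, for illustration only.
@dataclass
class InputParam:
    name: str
    type_hint: object = None
    required: bool = False
    description: str = ""

def render_doc_line(p: InputParam) -> str:
    # Mirrors the "name (`Type`, *optional*)" style seen in the docstrings above.
    type_name = getattr(p.type_hint, "__name__", "None")
    optional = "" if p.required else ", *optional*"
    return f"{p.name} (`{type_name}`{optional}): {p.description}"

# Without type_hint, the rendered type falls back to the literal "None":
without_hint = render_doc_line(InputParam("prompt", description="The prompt."))
# With type_hint=str, the real type is rendered:
with_hint = render_doc_line(InputParam("prompt", type_hint=str, description="The prompt."))
```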

5. ErnieImageCoreDenoiseStep docstring has generator type hint as None (modular_blocks_ernie_image.py:104)

Same issue: generator (None, *optional*) in the generated docstring. The underlying InputParam in before_denoise.py:325 is missing type_hint=torch.Generator.


Minor / Nits

6. MODULAR_PIPELINE_MAPPING ordering (modular_pipeline.py:136)

The ("ernie-image", ...) entry is appended at the end. The mapping isn't strictly alphabetical, but inserting it between existing entries (e.g. after "helios-pyramid" or before "flux") in a more logical spot would be more consistent. Not blocking.

7. Dummy objects ordering (dummy_torch_and_transformers_objects.py)

The ErnieImageAutoBlocks and ErnieImageModularPipeline dummy classes are inserted between Flux2ModularPipeline and FluxAutoBlocks. Alphabetically they should come before the Flux2* entries. make style may or may not fix this automatically.

8. do_sample logic in prompt enhancer (encoders.py:68)

do_sample=temperature != 1.0 or top_p != 1.0,

This means do_sample=True when temperature=0.6 (the default), which is correct. But when temperature=1.0 AND top_p=1.0, it sets do_sample=False (greedy). This is a subtle semantic choice — worth confirming this matches the standard pipeline's behavior. Not a bug per se, just flagging.
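The flag's behavior can be isolated as a tiny predicate (the function name here is introduced for illustration; the expression itself is copied from the snippet above):

```python
# The do_sample expression from encoders.py:68, as a standalone predicate.
def resolve_do_sample(temperature: float, top_p: float) -> bool:
    return temperature != 1.0 or top_p != 1.0

# The default temperature of 0.6 enables sampling.
sampling_default = resolve_do_sample(0.6, 1.0)
# Only when both knobs sit at 1.0 does it fall back to greedy decoding.
greedy_case = resolve_do_sample(1.0, 1.0)
```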


What looks good

  • Guider abstraction: Properly separated from the denoise loop, no guidance_scale as pipeline input — follows modular.md guidelines.
  • No cross-imports: All imports are from diffusers core, no imports from diffusers.pipelines.* or other modular pipelines.
  • BN eps handling: Uses vae.config.batch_norm_eps instead of hardcoded 1e-5 (the standard pipeline hardcodes 1e-5, but the VAE config default is 1e-4 — the modular version is actually more correct here).
  • Test model path: Points to hf-internal-testing/, not a personal repo.
  • File structure: Matches the expected layout from modular.md.
  • Parity: MAD=0.000033, max_diff=0.58/255 looks excellent for bf16.

if isinstance(prompt, str):
    prompt = [prompt]

pe = getattr(components, "pe", None)
Collaborator


@claude what do you think about this pattern?
I think it is unnecessary (if we list a component in expected_components, it should be an attribute) - can you verify by tracing through the code?
Did we talk about this in our agents.md and other docs? Why was it not included in your review?

Comment thread src/diffusers/modular_pipelines/ernie_image/encoders.py
    return padded, lengths


class ErnieImageTextInputStep(ModularPipelineBlocks):
Collaborator

@yiyixuxu Apr 20, 2026


I think we need an input step that expands tensors for num_images_per_prompt and so on.
Can we do the same for Ernie?

Contributor Author


done

@yiyixuxu
Collaborator

@claude I left some additional comments here. Can you look at them and let me know what you think? If you agree with me, let me know why they were not included in your review and how we can update our agents.md to improve moving forward.

@github-actions
Contributor

github-actions Bot commented Apr 20, 2026

Claude finished @yiyixuxu's task in 3m 14s


Analysis of YiYi's review comments

  • Read repository guidelines (AGENTS.md, modular.md)
  • Read the commented code sections (encoders.py, before_denoise.py)
  • Read reference modular pipelines for comparison (Flux, QwenImage, SDXL)
  • Trace through the component registration framework code
  • Analyze comment 1: expected_components attribute access pattern (encoders.py:136)
  • Analyze comment 2: Prompt enhancer rewrite (encoders.py:31)
  • Analyze comment 3: Input step for num_images_per_prompt expansion (before_denoise.py:48)
  • Suggest agents.md improvements

Comment 1: getattr(components, "pe", None) is unnecessary (encoders.py:136)

YiYi is correct. I traced through the framework code:

  1. In modular_pipeline.py:1743-1750, during __init__, every component in expected_components is registered — from_config components get created, from_pretrained components are registered as None.
  2. register_components() at line 2071+ calls setattr(self, name, module) for all components, including None values.

So components.pe always exists as an attribute — it's just None when not loaded. The getattr(components, "pe", None) on line 136 is defensive code that adds no value. The correct pattern is:

if not block_state.use_pe or components.pe is None or components.pe_tokenizer is None:
    ...

No other modular pipeline in the codebase uses getattr(components, ...) for expected components — they all access them directly.

Why this wasn't in the original review: The review looked at this code and saw a pattern that "works correctly" (produces the right result) without questioning whether it follows the framework's idioms. The getattr pattern is functionally harmless, which made it easy to overlook. But YiYi is right that it signals a misunderstanding of how expected_components works, and tolerating it would set a bad precedent for future contributions.

Suggested agents.md / modular.md addition:

10. Using getattr(components, "name", None) for expected components. Components listed in expected_components are always registered as attributes on the pipeline — either as the loaded object or None. Access them directly as components.name. Don't use getattr with a fallback; it obscures the contract and suggests the component might not be registered at all.
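The traced registration behavior can be sketched with a dependency-free toy class. This is an illustrative miniature, not the real diffusers API; it only demonstrates why the getattr fallback is redundant once every expected component is registered via setattr:

```python
# Hypothetical miniature of register_components: every expected component is
# set as an attribute at __init__ time, defaulting to None when not loaded.
class MiniModularPipeline:
    expected_components = ["vae", "transformer", "pe", "pe_tokenizer"]

    def __init__(self, **loaded):
        for name in self.expected_components:
            setattr(self, name, loaded.get(name))

components = MiniModularPipeline(vae="fake-vae")
# The attribute always exists, so direct access is safe:
pe_direct = components.pe            # None, but never AttributeError
vae_direct = components.vae          # "fake-vae"
# getattr(components, "pe", None) therefore adds nothing over components.pe.
pe_via_getattr = getattr(components, "pe", None)
```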


Comment 2: Rewrite ErnieImagePromptEnhancerStep so the block always normalizes prompts (encoders.py:31)

YiYi is correct. Currently the prompt enhancer block does two things entangled together:

  1. Normalizes prompt from str → list[str] (lines 133-134)
  2. Optionally rewrites prompts via the PE model (lines 138-165)

The problem: when use_pe=False (or PE not loaded), the block still normalizes prompt to a list — but the control flow makes this a side-effect of the "no-op" early return (lines 138-142). This couples prompt normalization (always needed) with prompt enhancement (optional).

What YiYi is asking for: Restructure so the block always expands/normalizes prompts (str→list), then conditionally runs the PE rewrite only when use_pe=True. This is cleaner because:

  • Prompt normalization happens unconditionally and visibly
  • The PE enhancement is a clearly optional layer on top
  • The block's contract is clearer: "output is always list[str]"

Why this wasn't in the original review: The review checked that the code works (it does — the early return still normalizes prompts). But it missed the structural issue: the normalization logic shouldn't be gated behind the PE check. This is a code organization/clarity concern that requires understanding the intent of modular blocks (each block has a clear, separable responsibility), not just their correctness.

Suggested agents.md addition: This is more of a general principle already implied by AGENTS.md ("Strive to write code as simple and explicit as possible"). However, we could add to modular.md:

11. Mixing always-needed logic with conditional logic. If a block performs both unconditional work (e.g. normalizing inputs) and conditional work (e.g. an optional model pass), structure the __call__ so unconditional logic runs first, unconditionally. Don't hide it inside a conditional branch's early return.
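The restructuring this rule asks for can be sketched in a few lines. The function and the `enhance` callable are illustrative stand-ins (the real block calls the PE model), but the control-flow shape is the point:

```python
# Illustrative block body: unconditional normalization first, optional
# enhancement second. `enhance` stands in for the PE model call.
def run_enhancer_block(prompt, use_pe: bool, enhance=None):
    # Unconditional work: str -> list[str] normalization always happens,
    # visibly, outside any branch.
    if isinstance(prompt, str):
        prompt = [prompt]
    # Conditional work: the optional PE rewrite layers on top.
    if use_pe and enhance is not None:
        prompt = [enhance(p) for p in prompt]
    # The block's contract is now clear: output is always list[str].
    return prompt

no_pe = run_enhancer_block("a cat", use_pe=False)
with_pe = run_enhancer_block("a cat", use_pe=True, enhance=str.upper)
```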


Comment 3: Need an input step that expands tensors for num_images_per_prompt (before_denoise.py:48)

YiYi is correct. This is the most architecturally significant comment. The established pattern across modular pipelines is:

  • Encoder step produces embeddings at prompt-batch granularity (batch_size = number of prompts)
  • Input step (separate block) handles num_images_per_prompt expansion — repeating/reshaping all tensor inputs to batch_size * num_images_per_prompt

This is the pattern in:

  • Flux: FluxTextInputStep (flux/inputs.py:31-120) — takes prompt_embeds and repeats with .repeat(1, num_images_per_prompt, 1) then reshapes
  • QwenImage: QwenImageTextInputsStep (qwenimage/inputs.py:112-253) — same pattern, also handles masks
  • SDXL: StableDiffusionXLInputStep (stable_diffusion_xl/before_denoise.py:346+) — same pattern

What ErnieImage does instead: The num_images_per_prompt expansion is baked into _encode() in the encoder step (encoders.py:238-239):

for _ in range(num_images_per_prompt):
    text_hiddens.append(hidden)

And ErnieImageTextInputStep (before_denoise.py:48) only handles padding — it doesn't do any num_images_per_prompt expansion because it was already done in the encoder.

Why this matters:

  1. It violates separation of concerns — the encoder should encode, the input step should handle batching
  2. It prevents users from running the encoder standalone and then controlling batch expansion separately
  3. The encoder re-encodes the same prompt N times (via the for-loop) instead of encoding once and repeating the tensor — wasteful if the model forward pass is expensive (though here the hidden states are reused, so it's just appending the same tensor N times, but the pattern is still wrong)
  4. It's inconsistent with every other modular pipeline

The fix would be:

  • Remove num_images_per_prompt from ErnieImageTextEncoderStep — encode each prompt once
  • Add num_images_per_prompt expansion to ErnieImageTextInputStep (the existing class in before_denoise.py:48), following the Flux/QwenImage pattern

Why this wasn't in the original review: The review verified that the code produces correct results and checked for surface-level pattern compliance (guider abstraction, no cross-imports, test paths, etc.). But it didn't compare the architectural decomposition of encoder vs. input step against reference implementations like Flux or QwenImage. The num_images_per_prompt handling was split across files in a way that technically works but violates the modular design principle of separating encoding from batching.

Suggested modular.md addition:

12. Baking num_images_per_prompt expansion into encoder steps. Encoder steps should produce embeddings at prompt-batch granularity only (one embedding set per prompt). The num_images_per_prompt expansion (repeating/reshaping tensors to batch_size * num_images_per_prompt) belongs in a separate input step placed between encoding and denoising. See FluxTextInputStep, QwenImageTextInputsStep for reference. This separation lets users run encoding standalone without committing to a batch expansion factor.
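The encoder/input-step split above can be sketched without any framework code. Lists stand in for tensors and the function names are illustrative; the real blocks use the Flux-style repeat-then-reshape on tensors:

```python
# Encoder step: one embedding per prompt, no batch expansion.
def encode_prompts(prompts):
    return [f"emb({p})" for p in prompts]

# Input step: repeat each embedding to batch_size * num_images_per_prompt,
# mirroring the repeat/reshape done by FluxTextInputStep.
def expand_for_num_images(prompt_embeds, num_images_per_prompt):
    return [e for e in prompt_embeds for _ in range(num_images_per_prompt)]

embeds = encode_prompts(["cat", "dog"])          # prompt-batch granularity
batch = expand_for_num_images(embeds, num_images_per_prompt=3)
```

Because expansion lives outside the encoder, a user can run `encode_prompts` once, cache the result, and pick the expansion factor later.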


Summary of suggested modular.md additions

Three new gotcha entries to add:

  • Gotcha 10: Don't use getattr(components, ...) for expected components; they're always registered as attributes. (Missed because the review checked correctness, not framework idiom compliance.)
  • Gotcha 11: Don't hide unconditional logic inside conditional branches. (Missed because the review checked behavior, not code structure/clarity.)
  • Gotcha 12: Don't bake num_images_per_prompt into encoder steps; use a separate input step. (Missed because the review didn't compare the architectural decomposition against reference pipelines.)

The common thread: the review focused on "does it work correctly?" but under-weighted "does it follow the established modular architecture patterns?" Adding these explicit rules to modular.md would make pattern violations more detectable in future reviews.


@yiyixuxu
Collaborator

@claude for your reviews,

from the first review

  1. Missing output_type="latent" support in decoder (decoders.py)

For modular, if you want latents, all you need to do is not run the decode step (it is very easy to do so) - do you agree? If so, can you update the list of things we need to include in modular.md to include this, so that in the future we are aligned on this?

@github-actions
Contributor

github-actions Bot commented Apr 20, 2026

Claude finished @yiyixuxu's task in 1m 49s


Responding to YiYi's feedback on output_type="latent"

  • Analyze YiYi's point about modular decoder design
  • Read modular.md and reference pipelines to verify
  • Propose updated modular.md guideline

Yes, I fully agree. My original review point #2 was wrong — it applied standard pipeline thinking to the modular context.

In modular pipelines, users compose steps explicitly via InsertableDict. If they want latent output, they simply remove the decode block:

blocks = pipe.blocks
blocks.sub_blocks.pop("decode")
output = pipe(..., blocks=blocks)

No special output_type="latent" handling is needed inside the decoder. This is the whole point of the modular architecture — composition replaces conditional logic.

I verified this is the intended pattern: the QwenImage decoder (qwenimage/decoders.py:231) has an explicit comment from you:

# YiYi Notes: remove support for output_type = "latents', we can just skip decode/encode step in modular

Newer modular pipelines (QwenImage, LTX, Wan, Helios) all follow this — their decoders always decode, no conditional latent passthrough. Only older ones (Flux, SDXL) still have the output_type="latent" check in the decoder.

Proposed addition to modular.md gotchas

This should be added as gotcha #10:

10. Adding output_type="latent" support inside decoder blocks. In standard pipelines, the decoder checks output_type == "latent" to skip VAE decoding and return raw latents. In modular pipelines this is unnecessary — users simply remove the decode block from the block sequence (e.g. blocks.sub_blocks.pop("decode")). Decoder blocks should always decode. Don't add conditional latent passthrough logic. Same principle applies to encoder blocks: if users want to skip encoding, they remove the encoder block rather than passing pre-computed embeddings (see gotcha #4).

This pairs naturally with existing gotcha #4 (which already captures the encoder-side version of the same principle). Together they establish: in modular pipelines, "skip a step" means removing the block, not adding conditional bypass logic inside it.
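The "composition replaces conditional logic" principle can be shown with a toy block registry. A plain dict stands in for the InsertableDict mentioned above; block names and behaviors are illustrative:

```python
# Toy pipeline: blocks live in an ordered mapping, and "skip decode" means
# removing the block, not branching inside it.
def make_blocks():
    return {
        "denoise": lambda state: {**state, "latents": "denoised"},
        "decode": lambda state: {**state, "images": f"decoded({state['latents']})"},
    }

def run(blocks, state):
    for step in blocks.values():
        state = step(state)
    return state

# Full pipeline: the decode block runs and produces images.
full = run(make_blocks(), {})

# Latent output: just drop the decode block; no output_type="latent" branch.
blocks = make_blocks()
blocks.pop("decode")
latent_only = run(blocks, {})
```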

Updated full list of proposed modular.md additions

Combining the guidelines from my earlier analysis of YiYi's review comments with this new one:

  • Gotcha 10: Don't add output_type="latent" inside decoder blocks; users skip the decode block entirely. (Missed because the review applied standard pipeline patterns to the modular context.)
  • Gotcha 11: Don't use getattr(components, ...) for expected components; they're always registered as attributes. (Missed because the review checked correctness, not framework idiom compliance.)
  • Gotcha 12: Don't hide unconditional logic inside conditional branches. (Missed because the review checked behavior, not code structure/clarity.)
  • Gotcha 13: Don't bake num_images_per_prompt into encoder steps; use a separate input step. (Missed because the review didn't compare the architectural decomposition against reference pipelines.)

InputParam("prompt", description="The prompt or prompts to guide image generation."),
InputParam("negative_prompt", description="The prompt or prompts to avoid during image generation."),
InputParam(
    "num_images_per_prompt",
Collaborator


Yeah, let's not accept num_images_per_prompt here - it is more flexible to expand the shape outside the encoder steps, since we want to support the use case where users want to pre-compute text prompts.

Contributor Author


done

@yiyixuxu
Collaborator

@akshan-main thanks for the PR
Sorry if the back and forth with Claude created a bit of noise here - I left some inline comments for things that I think make sense to change.
Additionally, we may ask you to push some updates to modular.md before merge (but don't worry about it for now).

@akshan-main
Contributor Author

@yiyixuxu all good, thanks for the detailed review. Working through the inline comments now, will push updates shortly. Happy to help with modular.md updates when needed

@github-actions Bot added size/L PR with diff > 200 LOC and removed size/L PR with diff > 200 LOC labels Apr 21, 2026
@akshan-main
Contributor Author

@yiyixuxu moved num_images_per_prompt off the encoder, dropped the getattr, and wrapped the enhancer in an AutoBlock on use_pe. When you get the time 👍

@akshan-main requested a review from yiyixuxu April 21, 2026 02:27
@yiyixuxu
Collaborator

@claude can you do a review again? thanks

@github-actions
Contributor

github-actions Bot commented Apr 22, 2026

Claude finished @yiyixuxu's task in 4m 30s


Review: Ernie-Image Modular Pipeline (Round 2)

  • Read repository guidelines (AGENTS.md, modular.md, models.md, pitfalls.md)
  • Get current diff and read all changed files
  • Verify previous review issues were addressed
  • Review encoders.py
  • Review before_denoise.py
  • Review denoise.py
  • Review decoders.py
  • Review modular_blocks_ernie_image.py
  • Review modular_pipeline.py
  • Review __init__.py files and ordering
  • Review dummy objects
  • Review test file
  • Post final review

Good progress since round 1. The three structural issues YiYi flagged are properly addressed:

  • num_images_per_prompt is now handled in ErnieImageTextInputStep (not the encoder)
  • getattr(components, ...) for expected components is removed
  • Prompt enhancer is wrapped in AutoPipelineBlocks triggered on use_pe

Two remaining issues to fix before merge:


Issues

1. generator type_hint still shows None in auto-generated docstrings

modular_blocks_ernie_image.py:102 and :165 both show:

generator (`None`, *optional*):

The underlying InputParam in before_denoise.py:230 correctly has type_hint=torch.Generator, and reference implementations (Helios, QwenImage, LTX, HunyuanVideo1.5) all render as generator (`Generator`, *optional*). The docstrings need to be regenerated:

python utils/modular_auto_docstring.py --fix_and_overwrite

2. Test model path uses personal repo (modular.md gotcha #9)

tests/modular_pipelines/ernie_image/test_modular_pipeline_ernie_image.py:38:

pretrained_model_name_or_path = "akshan-main/tiny-ernie-image-modular-pipe"

Per modular.md gotcha #9: "Tiny test models must live under hf-internal-testing/, not personal repos." This needs to be moved before merge.


What was fixed from round 1 (verified)

  • Alphabetical ordering in src/diffusers/__init__.py: ✅ Fixed (ErnieImage* now correctly before Flux2*)
  • getattr(components, "pe", None) for expected components: ✅ Fixed (removed; direct access used)
  • Prompt enhancer always normalizes prompts: ✅ Fixed (always normalizes str→list, wrapped in AutoPipelineBlocks on use_pe)
  • num_images_per_prompt baked into encoder: ✅ Fixed (moved to ErnieImageTextInputStep._expand())
  • prompt/negative_prompt missing type_hint: ✅ Fixed (type_hint=str added)

What looks good

  • ErnieImageTextInputStep (before_denoise.py:48-153): Clean separation — _expand() repeats each hidden state per num_images_per_prompt, then _pad_text() stacks into a batched tensor. This correctly handles the variable-length text hidden states that ErnieImage uses (unlike Flux/QwenImage which have fixed-length embeddings).
  • ErnieImageAutoPromptEnhancerStep (modular_blocks_ernie_image.py:32-73): Good use of AutoPipelineBlocks with block_trigger_inputs = ["use_pe"] — the enhancer is cleanly optional.
  • Guider abstraction (denoise.py:126-152): Correctly follows the reference pattern (matches QwenImage/Flux2-Klein). Uses getattr(block_state, ...) for guider_inputs — this is the correct pattern for block_state (as opposed to components, where direct access is expected).
  • No output_type="latent" in decoder: Correctly follows the modular pattern — users skip the decode block entirely rather than adding conditional bypass logic.
  • BN eps from config (decoders.py:78): Uses vae.config.batch_norm_eps instead of hardcoded value.
  • Alphabetical ordering: Correct in src/diffusers/__init__.py and dummy objects.
  • No cross-imports: All imports from diffusers core.
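The expand-then-pad flow described in the first bullet can be sketched with plain lists standing in for variable-length hidden-state tensors. Function names mirror the `_expand()` / `_pad_text()` pair mentioned above but are illustrative, not the real implementations:

```python
# Step 1: repeat each per-prompt hidden state num_images_per_prompt times.
def expand(hiddens, num_images_per_prompt):
    return [h for h in hiddens for _ in range(num_images_per_prompt)]

# Step 2: pad variable-length sequences to a common length so they can be
# stacked into one batched tensor; keep the true lengths for masking.
def pad_text(hiddens, pad=0):
    lengths = [len(h) for h in hiddens]
    max_len = max(lengths)
    padded = [h + [pad] * (max_len - len(h)) for h in hiddens]
    return padded, lengths

# Two prompts of different lengths, two images per prompt.
batch = expand([[1, 2], [3, 4, 5]], num_images_per_prompt=2)
padded, lengths = pad_text(batch)
```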

Collaborator

@yiyixuxu left a comment


Thanks, I left a few more comments.

    description="Number of images to generate per prompt.",
),
InputParam(
    "batch_size",
Collaborator


This should not be an input, no? We can always derive it from prompt_embeds, no?
What happens if prompt_embeds has a different batch size from batch_size?

]

block_state.prompt = revised
block_state.revised_prompts = list(revised)
Collaborator


I think we should either:

  • keep the original prompt in prompt
  • or drop revised_prompts

Both prompt and revised_prompts are the same thing now; we don't need both.

else:
    block_state.negative_prompt_embeds = None

state.set("batch_size", len(prompt))
Collaborator


Let's avoid this pattern, cc @claude here.
If you want to add something to state, declare it as an intermediate_output (but I don't think we need to produce batch_size as an output here; we do that in input steps in core-denoise like other pipelines).

    description="Number of images to generate per prompt.",
),
InputParam(
    "batch_size",
Collaborator


This should not be an input; it should be an output derived from prompt_embeds.

def __call__(self, components: ErnieImageModularPipeline, block_state: BlockState, i: int, t: torch.Tensor):
    guider_inputs = {
        "text_bth": (
            getattr(block_state, "text_bth", None),
Collaborator


@claude what do you think of this pattern? Do we need to do getattr(..., None) here?

@yiyixuxu
Collaborator

@claude can you look at my reviews and update your summary on things we can improve in agents.md/modular.md that can help you with future reviews?

@github-actions
Contributor

github-actions Bot commented Apr 22, 2026

Claude finished @yiyixuxu's task in 5m 27s


Analysis of YiYi's latest review comments + consolidated modular.md improvements

  • Read repository guidelines (AGENTS.md, modular.md)
  • Read current code at commented locations
  • Read reference pipelines for comparison (Flux, LTX, WAN, QwenImage, Helios, HunyuanVideo1.5, Z-Image)
  • Trace through framework code (get_block_state, set_block_state, BlockState)
  • Analyze comment 1: batch_size should be derived from prompt_embeds, not accepted as input
  • Analyze comment 2: prompt vs revised_prompts redundancy
  • Analyze comment 3: state.set() pattern — use intermediate_outputs instead
  • Analyze comment 4: getattr(block_state, ..., None) in denoise guider inputs
  • Produce consolidated summary of modular.md improvements (all rounds)

Comment 1: batch_size should not be an input (before_denoise.py:83)

YiYi is correct. batch_size is trivially derivable from prompt_embeds:

# Current (before_denoise.py:135) - accepts batch_size as optional InputParam
block_state.batch_size = block_state.batch_size or len(prompt_embeds)

This creates a potential inconsistency: if a user passes batch_size=3 but prompt_embeds has 5 entries, the pipeline silently uses 3. The reference pattern in other pipelines:

  • Flux (flux/inputs.py:106): block_state.batch_size = block_state.prompt_embeds.shape[0]
  • LTX (ltx/before_denoise.py:137): block_state.batch_size = block_state.prompt_embeds.shape[0]
  • WAN (wan/encoders.py:239): derives from len(prompt) in encoder, but doesn't expose it as an InputParam

None of these accept batch_size as an InputParam in the input step. They all derive it from the actual data.

Why this wasn't in my review: I verified that batch_size flowed correctly through the pipeline and produced correct results. I didn't question whether it should be an input vs. derived value — I checked the plumbing, not the API design.

Proposed gotcha:

Accepting batch_size as an input when it can be derived. In input steps, batch_size should be derived from the actual data (e.g. len(prompt_embeds) or prompt_embeds.shape[0]), not accepted as an InputParam. Accepting it creates a potential inconsistency if the user passes a value that disagrees with the data. See FluxTextInputStep, LTXTextInputStep for reference.
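A minimal sketch of the recommended pattern, using a plain namespace as a stand-in for the real diffusers block state (the function name here is illustrative, not the actual FluxTextInputStep API):

```python
from types import SimpleNamespace

def text_input_step(block_state):
    # Derive batch_size from the data itself, mirroring the pattern in
    # FluxTextInputStep / LTXTextInputStep, instead of accepting it as an
    # InputParam that could disagree with prompt_embeds.
    block_state.batch_size = len(block_state.prompt_embeds)
    return block_state

block_state = SimpleNamespace(prompt_embeds=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
text_input_step(block_state)
print(block_state.batch_size)  # → 3
```

With this shape there is no way for a user-supplied batch_size to silently disagree with the embeddings.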


Comment 2: prompt vs revised_prompts redundancy (encoders.py:143)

YiYi is correct. After the enhancer runs:

# encoders.py:142-143
block_state.prompt = revised          # overwrites original with enhanced version
block_state.revised_prompts = list(revised)   # saves the exact same list

Both prompt and revised_prompts contain the same data. The original prompt is lost. Either:

  • Keep the original prompt in prompt and put the enhanced version only in revised_prompts
  • Or drop revised_prompts entirely since prompt already has the rewritten version

No other modular pipeline has this pattern (prompt enhancers are unique to Ernie).

Why this wasn't in my review: I flagged the type hints on prompt/revised_prompts (round 1, issue #4) but didn't examine the semantic relationship between the two outputs. The code "worked" — both values were populated — but I didn't question whether having two identical outputs made sense.

Proposed gotcha:

Producing redundant outputs. Don't declare multiple intermediate_outputs that always contain the same value. Each output should carry distinct information. If a block transforms an input in-place (e.g. rewriting prompt), either preserve the original under a different name or don't create a copy under a new name.
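A toy sketch of the first option (preserve the original, store the enhanced version under a distinct name); the enhancer callable and field names are illustrative stand-ins for the Ernie prompt enhancer:

```python
from types import SimpleNamespace

def enhance_prompt_step(block_state, enhancer):
    # Keep the original user prompt intact; put the rewritten text under a
    # different name so the two outputs carry distinct information.
    block_state.revised_prompts = [enhancer(p) for p in block_state.prompt]
    return block_state

bs = SimpleNamespace(prompt=["a cat"])
enhance_prompt_step(bs, lambda p: p + ", highly detailed")
print(bs.prompt)           # → ['a cat']  (original preserved)
print(bs.revised_prompts)  # → ['a cat, highly detailed']
```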


Comment 3: state.set() bypasses the block interface (encoders.py:254)

YiYi is correct. The encoder directly writes to global state:

# encoders.py:254
state.set("batch_size", len(prompt))

I searched all modular pipelines — only HunyuanVideo1.5 (encoders.py:307,390) and ErnieImage use state.set() directly in block __call__. All other pipelines (Flux, LTX, WAN, Helios, QwenImage, Z-Image, SDXL, Flux2) exclusively use set_block_state(), which writes to state via declared intermediate_outputs.

The proper pattern is:

  1. Declare the value as an intermediate_output on the block
  2. Set it on block_state
  3. Let set_block_state() write it to global state

Using state.set() bypasses the block's declared interface — the value doesn't appear in intermediate_outputs, so it's invisible to the framework's introspection and documentation.

Additionally, YiYi's broader point: batch_size shouldn't be produced by the encoder at all. The encoder should encode. The input step should derive batch_size from prompt_embeds. This follows the separation of concerns pattern established by Flux and LTX.

Why this wasn't in my review: I didn't audit whether blocks used state.set() vs set_block_state(). The state.set() call is a single line that doesn't affect correctness — it just bypasses the declared interface. I wasn't checking for framework idiom compliance at that level of detail.

Proposed gotcha:

Using state.set() directly inside block __call__. Blocks should only write to state through their declared intermediate_outputs via set_block_state(). Don't call state.set() directly — it bypasses the block's declared interface, making the output invisible to framework introspection and documentation. If you need to pass a value to downstream blocks, declare it as an intermediate_output.
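A toy reimplementation of the declared-interface contract (not the real diffusers set_block_state, just its shape): only values listed in intermediate_outputs are copied to global state, so anything written outside that list never becomes visible downstream.

```python
from types import SimpleNamespace

class ToyBlock:
    # Only values declared here are exported to global state.
    intermediate_outputs = ["prompt_embeds"]

def set_block_state(state, block, block_state):
    # Copy only the block's declared intermediate_outputs into global state.
    for name in block.intermediate_outputs:
        state[name] = getattr(block_state, name)

state = {}
bs = SimpleNamespace(prompt_embeds=[[0.1]], batch_size=1)
set_block_state(state, ToyBlock(), bs)
print(sorted(state))  # → ['prompt_embeds']  (batch_size never reaches state)
```

A direct state.set("batch_size", ...) would sidestep this filter, which is exactly why the value stays invisible to introspection.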


Comment 4: getattr(block_state, ..., None) in denoise guider inputs (denoise.py:128)

YiYi is correct — getattr(block_state, ..., None) is unnecessary here. This is the same principle as the getattr(components, ...) issue from round 1, but applied to block_state.

The framework's get_block_state() (modular_pipeline.py:491-516) creates a BlockState by collecting all declared inputs from state. For required inputs, it raises ValueError if missing. For optional inputs (like negative_text_bth), it sets them to None. So all declared inputs are guaranteed to exist as attributes on block_state.
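The contract described above can be sketched in a few lines (a toy stand-in for the real get_block_state, kept only to show why every declared input is guaranteed to exist as an attribute):

```python
from types import SimpleNamespace

def get_block_state(state, required, optional):
    # Required inputs must exist in state; optional ones default to None.
    # Either way, every declared input ends up as an attribute.
    block_state = SimpleNamespace()
    for name in required:
        if name not in state:
            raise ValueError(f"Required input '{name}' is missing")
        setattr(block_state, name, state[name])
    for name in optional:
        setattr(block_state, name, state.get(name))
    return block_state

bs = get_block_state({"text_bth": [[0.1]]}, ["text_bth"], ["negative_text_bth"])
print(bs.text_bth, bs.negative_text_bth)  # direct access is safe; no getattr default needed
```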

Current code:

# denoise.py:126-135
guider_inputs = {
    "text_bth": (
        getattr(block_state, "text_bth", None),        # unnecessary
        getattr(block_state, "negative_text_bth", None), # unnecessary
    ),
    ...
}

Should be:

guider_inputs = {
    "text_bth": (block_state.text_bth, block_state.negative_text_bth),
    ...
}

However, I want to flag that this pattern is widespread in existing pipelines:

  • SDXL (stable_diffusion_xl/denoise.py:204-219): Uses getattr(block_state, ..., None) for all guider inputs
  • Flux2 (flux2/denoise.py:326-334): Same pattern
  • QwenImage (qwenimage/denoise.py:209-218): Same pattern

Newer pipelines have moved away from this:

  • Helios (helios/denoise.py:404-405): Direct block_state.prompt_embeds access
  • WAN, LTX, Z-Image: Use prepare_inputs_from_block_state() method, which encapsulates the extraction entirely

The direction is clearly toward either direct access or the prepare_inputs_from_block_state API. New pipelines should follow the newer pattern.
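A hypothetical sketch of the encapsulated style: the method name comes from the pipelines cited above, but this toy version only illustrates the shape, not the real WAN/LTX/Z-Image implementation.

```python
from types import SimpleNamespace

class ToyDenoiseBlock:
    # Conditional/unconditional input pairs this block feeds to the guider.
    guider_input_pairs = {"text_bth": "negative_text_bth"}

    def prepare_inputs_from_block_state(self, block_state):
        # getattr without a default is plain attribute access: declared
        # inputs always exist on block_state (possibly as None), so no
        # getattr(..., None) fallback is needed.
        return {
            cond: (getattr(block_state, cond), getattr(block_state, neg))
            for cond, neg in self.guider_input_pairs.items()
        }

bs = SimpleNamespace(text_bth=[[0.1]], negative_text_bth=None)
print(ToyDenoiseBlock().prepare_inputs_from_block_state(bs))
# → {'text_bth': ([[0.1]], None)}
```

The extraction logic lives in one place on the block, so the denoise loop body never touches block_state attributes directly.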

Why this wasn't in my review: In round 1, I correctly flagged getattr(components, ...) for expected components. But I missed the same pattern applied to block_state in the denoiser, partly because several existing pipelines (SDXL, Flux2, QwenImage) use this exact pattern, making it look "normal." I was comparing against the wrong references.

Proposed gotcha (extends the existing round-1 proposal):

Using getattr with defaults for declared inputs/components. Components listed in expected_components are always registered as attributes — either as the loaded object or None. Similarly, inputs declared in a block's inputs list are always present on block_state after get_block_state() — either with their value or None. Access both directly (e.g. components.name, block_state.name). Don't use getattr(obj, name, None) — it obscures the framework's contract and suggests the attribute might not exist. For guider input construction, prefer prepare_inputs_from_block_state() (see WAN, LTX, Z-Image) or direct access (see Helios).


Consolidated modular.md gotcha proposals (all rounds)

Combining findings from round 1, round 2, and this round:

| # | Proposed Gotcha | Source | Why Missed in Review |
|---|---|---|---|
| 10 | Don't add output_type="latent" inside decoder blocks. Users skip the decode block entirely. Decoders should always decode. | Round 1, YiYi feedback | Applied standard pipeline thinking to modular context |
| 11 | Don't use getattr with defaults for declared inputs/components. Components and block_state inputs are always registered as attributes. Access directly. For guider inputs, prefer prepare_inputs_from_block_state(). | Round 1 + this round, YiYi's comments | Checked correctness, not framework idiom compliance; older pipelines (SDXL, Flux2) use same anti-pattern |
| 12 | Don't hide unconditional logic inside conditional branches. If a block does both unconditional work (normalizing inputs) and conditional work (optional model pass), run unconditional logic first, unconditionally. | Round 1, YiYi's comment | Checked behavior, not code structure |
| 13 | Don't bake num_images_per_prompt into encoder steps. Encoders produce one embedding set per prompt. Expansion belongs in a separate input step. | Round 1, YiYi's comment | Didn't compare architectural decomposition against reference pipelines |
| 14 | Don't accept batch_size as input when derivable from data. Input steps should derive batch_size from prompt_embeds shape/length, not accept it as an InputParam. | This round, YiYi's comment | Checked data flow, not API design |
| 15 | Don't produce redundant outputs. Each intermediate_output should carry distinct information. Don't create copies of the same value under different names. | This round, YiYi's comment | Checked that outputs were populated, not whether they were semantically distinct |
| 16 | Don't use state.set() directly inside blocks. Write to state only through declared intermediate_outputs via set_block_state(). Direct state.set() bypasses the block's interface. | This round, YiYi's comment | Didn't audit framework API usage patterns |

Common review failure modes

Looking across all rounds, there are three recurring themes in what I missed:

  1. Correctness ≠ Idiom compliance. Code that produces correct results can still violate framework contracts. Reviews need to check both.

  2. Comparing against wrong references. Older pipelines (SDXL, Flux2) have patterns that were acceptable at the time but are now superseded. New pipelines should be compared against the newest reference implementations (Helios, WAN, Z-Image, LTX).

  3. Data flow vs. API design. Checking that values flow correctly through the pipeline is necessary but insufficient. Reviews also need to check whether values should flow that way — is this the right block to produce this value? Should this be an input or derived?

Suggested additions to agents.md / review instructions

To help future reviews catch these issues, I'd suggest adding to the review guidelines:

When reviewing modular pipelines, check against the newest reference implementations (currently Helios, WAN, Z-Image, LTX — not SDXL or Flux which use older patterns). Specifically verify:

  • Encoder steps only encode; num_images_per_prompt and batch_size are handled in input steps
  • All state writes go through intermediate_outputs / set_block_state(), never state.set() directly
  • No getattr(obj, name, None) for declared components or inputs
  • Each output carries distinct information
  • Unconditional logic isn't hidden inside conditional branches

@github-actions github-actions Bot added size/L PR with diff > 200 LOC and removed size/L PR with diff > 200 LOC labels Apr 22, 2026
@akshan-main
Contributor Author

@yiyixuxu addressed the review feedback: batch_size is now derived from prompt_embeds/text_bth, and I dropped revised_prompts, the direct state.set() call, and the getattr on block_state. Docstrings regenerated too.

@akshan-main akshan-main requested a review from yiyixuxu April 22, 2026 23:16