Add Ernie-Image modular pipeline #13498
Conversation
|
Maybe it could live as a custom pipeline on the Hub like https://huggingface.co/krea/krea-realtime-video/tree/main? |
|
@sayakpaul @yiyixuxu can do the krea pattern by keeping a minimal ErnieImageModularPipeline subclass + MODULAR_PIPELINE_MAPPING entry in diffusers. Want me to restructure the PR that way? |
|
Doing it the Krea way wouldn't require any changes to core Diffusers no? |
|
you were right, the fully hub-only pattern works. moved everything to https://huggingface.co/akshan-main/ernie-image-modular and it loads end-to-end with zero diffusers changes:

```python
from diffusers.modular_pipelines import ModularPipeline

pipe = ModularPipeline.from_pretrained("akshan-main/ernie-image-modular", trust_remote_code=True)
pipe.load_components(trust_remote_code=True)
```

No custom `ErnieImageModularPipeline` class needed. Inlined the pipeline properties (`vae_scale_factor`, `num_channels_latents`, `text_in_dim`) into the blocks via direct `components.vae.config.*` / `components.transformer.config.*` reads. Should I close this PR, @sayakpaul? |
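The "direct config reads" idea mentioned above can be sketched as follows. This is a minimal illustration with stand-in classes and made-up channel values, not the real ERNIE-Image VAE config; only the conventional diffusers derivation of `vae_scale_factor` from `block_out_channels` is shown:

```python
# Stand-in config objects; a real block would read components.vae.config.
class FakeVAEConfig:
    block_out_channels = [128, 256, 512, 512]  # illustrative values only

class FakeVAE:
    config = FakeVAEConfig()

vae = FakeVAE()

# Conventional diffusers derivation: each downsampling stage halves the
# spatial resolution, so the scale factor is 2 ** (num_stages - 1).
vae_scale_factor = 2 ** (len(vae.config.block_out_channels) - 1)
print(vae_scale_factor)  # 8 for this four-stage config
```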
|
hey @sayakpaul, over the last few weeks, I've profiled QwenImage and QwenImageEdit to identify cudaStreamSynchronize calls causing per-step latency, which led to the QwenImage RoPE sync fix (#13406, merged), plus modular pipelines for LTX (#13378, merged) and HunyuanVideo 1.5 (#13389, merged), and an HV1.5 I2V bug fix (#13439, merged). I've also published modular upscale hub blocks for SDXL, Flux1, and Z-Image, and just pushed an Ernie-Image modular hub repo. Would this qualify me for MVP recognition, or is there more I should do to get there? happy to keep contributing either way, just wanted to check. |
|
ooh but I'd really love to have official modular support for ERNIE-Image — it's a really good model, trained from scratch, and the Baidu team is committed to releasing more checkpoints and building their own community around it. I think we can release official blocks for text-to-image, image-to-image, and edit pipelines (not yet released) and encourage the community to build more custom stuff on the Hub using our official blocks |
|
Ah my bad, that makes sense. I didn't realize it. |
|
@yiyixuxu I think the modular pipeline is good to review now |
|
friendly ping @yiyixuxu when you get the chance |
|
@claude can you do a review here? thanks |
|
Claude finished @yiyixuxu's task in 3m 50s — Review: Ernie-Image Modular Pipeline

Overall the implementation is well-structured, follows the modular pipeline conventions, and parity numbers look good. A few issues to address:

Issues

1. Alphabetical ordering in `src/diffusers/__init__.py`
|
```python
if isinstance(prompt, str):
    prompt = [prompt]
```

```python
pe = getattr(components, "pe", None)
```
@claude what do you think about this pattern?
I think it is unnecessary (if we list a component in expected_components, it should be an attribute) - can you verify by tracing through the code?
did we talk about this in our agents.md and other docs? why was it not included in your review
```python
return padded, lengths
```

```python
class ErnieImageTextInputStep(ModularPipelineBlocks):
```
I think we would need an input step that expands tensors for num_images_per_prompt and so on.
can we do the same for Ernie?
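The expansion step being asked for here could look roughly like this. The sketch uses plain Python lists as stand-ins for tensors; the function name and shapes are assumptions, not the actual ErnieImage block API:

```python
# Hypothetical helper: expand per-prompt embeddings outside the encoder so
# each prompt's embedding is repeated once per requested image.
def expand_for_num_images(prompt_embeds, num_images_per_prompt):
    expanded = []
    for embed in prompt_embeds:
        # Keep all copies of one prompt adjacent, mirroring the usual
        # repeat-interleave ordering in diffusers pipelines.
        expanded.extend([embed] * num_images_per_prompt)
    return expanded

# Two prompts, two images each -> batch of four, prompt order preserved.
batch = expand_for_num_images([[0.1, 0.2], [0.3, 0.4]], num_images_per_prompt=2)
```

Keeping this out of the encoder means pre-computed prompt embeddings can be fed straight into the input step and expanded there.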
|
@claude i left some additional comments here, can you look at them and let me know what you think? In the case you agree with me, let me know why they were not included in your review and how we can update our agents.md to improve moving forward |
|
Claude finished @yiyixuxu's task in 3m 14s — Analysis of YiYi's review comments
Comment 1:
|
| # | Rule | Why missed |
|---|---|---|
| 10 | Don't use `getattr(components, ...)` for expected components — they're always registered as attributes | Review checked correctness, not framework idiom compliance |
| 11 | Don't hide unconditional logic inside conditional branches | Review checked behavior, not code structure/clarity |
| 12 | Don't bake `num_images_per_prompt` into encoder steps — use a separate input step | Review didn't compare architectural decomposition against reference pipelines |
The common thread: the review focused on "does it work correctly?" but under-weighted "does it follow the established modular architecture patterns?" Adding these explicit rules to modular.md would make pattern violations more detectable in future reviews.
|
@claude for your reviews, from the first review:
For modular, if you want to have latents, all you need to do is not run the decode step (it is very easy to do so), do you agree? if so, can you update the list of things we need to include in modular.md to include this -> so that in the future, we are aligned on this |
|
Claude finished @yiyixuxu's task in 1m 49s — Responding to YiYi's feedback on
|
| # | Rule | Why missed in review |
|---|---|---|
| 10 | Don't add `output_type="latent"` inside decoder blocks — users skip the decode block entirely | Review applied standard pipeline patterns to modular context |
| 11 | Don't use `getattr(components, ...)` for expected components — they're always registered as attributes | Review checked correctness, not framework idiom compliance |
| 12 | Don't hide unconditional logic inside conditional branches | Review checked behavior, not code structure/clarity |
| 13 | Don't bake `num_images_per_prompt` into encoder steps — use a separate input step | Review didn't compare architectural decomposition against reference pipelines |
```python
InputParam("prompt", description="The prompt or prompts to guide image generation."),
InputParam("negative_prompt", description="The prompt or prompts to avoid during image generation."),
InputParam(
    "num_images_per_prompt",
```
yeah, let's not accept num_images_per_prompt here - it is more flexible to expand the shape outside the encoder steps, since we want to support the use case where users want to pre-compute text prompts
|
@akshan-main thanks for the PR |
|
@yiyixuxu all good, thanks for the detailed review. Working through the inline comments now, will push updates shortly. Happy to help with modular.md updates when needed |
force-pushed from 5779f01 to d494316
|
@yiyixuxu moved num_images_per_prompt off the encoder, dropped the getattr, and wrapped the enhancer in an AutoBlock on use_pe. When you get the time 👍 |
|
@claude can you do a review again? thanks |
|
Claude finished @yiyixuxu's task in 4m 30s — Review: Ernie-Image Modular Pipeline (Round 2)

Good progress since round 1. The three structural issues YiYi flagged are properly addressed.

Two remaining issues to fix before merge:

Issues

1.
|
| Issue | Status |
|---|---|
| Alphabetical ordering in `src/diffusers/__init__.py` | ✅ Fixed — `ErnieImage*` now correctly before `Flux2*` |
| `getattr(components, "pe", None)` for expected components | ✅ Fixed — removed, direct access used |
| Prompt enhancer always normalizes prompts | ✅ Fixed — always normalizes str→list, wrapped in `AutoPipelineBlocks` on `use_pe` |
| `num_images_per_prompt` baked into encoder | ✅ Fixed — moved to `ErnieImageTextInputStep._expand()` |
| `prompt`/`negative_prompt` missing `type_hint` | ✅ Fixed — `type_hint=str` added |
What looks good

- `ErnieImageTextInputStep` (`before_denoise.py:48-153`): Clean separation — `_expand()` repeats each hidden state per `num_images_per_prompt`, then `_pad_text()` stacks into a batched tensor. This correctly handles the variable-length text hidden states that ErnieImage uses (unlike Flux/QwenImage, which have fixed-length embeddings).
- `ErnieImageAutoPromptEnhancerStep` (`modular_blocks_ernie_image.py:32-73`): Good use of `AutoPipelineBlocks` with `block_trigger_inputs = ["use_pe"]` — the enhancer is cleanly optional.
- Guider abstraction (`denoise.py:126-152`): Correctly follows the reference pattern (matches QwenImage/Flux2-Klein). Uses `getattr(block_state, ...)` for guider_inputs — this is the correct pattern for `block_state` (as opposed to `components`, where direct access is expected).
- No `output_type="latent"` in decoder: Correctly follows the modular pattern — users skip the decode block entirely rather than adding conditional bypass logic.
- BN eps from config (`decoders.py:78`): Uses `vae.config.batch_norm_eps` instead of a hardcoded value.
- Alphabetical ordering: Correct in `src/diffusers/__init__.py` and dummy objects.
- No cross-imports: All imports from diffusers core.
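The trigger-based optionality praised here can be illustrated with a toy dispatcher in the same spirit as `AutoPipelineBlocks` with `block_trigger_inputs`. The real diffusers class is far more general; this sketch only shows the conditional-dispatch idea, and all names besides `block_trigger_inputs`/`use_pe` are made up:

```python
# Toy stand-in: run the wrapped enhancer only when its trigger input is set.
class AutoEnhancer:
    block_trigger_inputs = ["use_pe"]  # mirrors the pattern in the review

    def __init__(self, enhancer):
        self.enhancer = enhancer

    def __call__(self, prompt, use_pe=False):
        if use_pe:
            return self.enhancer(prompt)
        return prompt  # trigger absent: pass-through, enhancer never runs

enhance = AutoEnhancer(lambda p: p + " (revised)")
print(enhance("a cat"))               # a cat
print(enhance("a cat", use_pe=True))  # a cat (revised)
```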
yiyixuxu left a comment

thanks, i left a bit more comments
```python
    description="Number of images to generate per prompt.",
),
InputParam(
    "batch_size",
```
this should not be an input, no? we can always derive it from prompt_embeds, no?
what happens if prompt_embeds has a different batch size from batch_size?
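One way to handle both concerns, sketched with plain lists standing in for tensors (the function name and error message are hypothetical, not diffusers API):

```python
# Derive batch_size from the embeddings themselves; if a caller also passes
# an explicit batch_size, reject any mismatch instead of silently trusting it.
def resolve_batch_size(prompt_embeds, batch_size=None):
    derived = len(prompt_embeds)
    if batch_size is not None and batch_size != derived:
        raise ValueError(
            f"batch_size={batch_size} conflicts with prompt_embeds "
            f"batch of {derived}"
        )
    return derived

assert resolve_batch_size([[0.1], [0.2]]) == 2
```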
```python
block_state.prompt = revised
block_state.revised_prompts = list(revised)
```
I think we either:

- keep the original prompt in `prompt`
- or drop `revised_prompt`

both `prompt` and `revised_prompt` are the same thing now; we don't need both
```python
else:
    block_state.negative_prompt_embeds = None
```

```python
state.set("batch_size", len(prompt))
```
let's avoid this pattern, cc @claude here
if you want to add something to state, declare it as intermediate_outputs (but here I don't think we need to produce this batch_size as an output; we do that in input steps in core denoise, like other pipelines)
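To make the contrast concrete, here is a toy sketch of the declared-outputs pattern using stand-in classes. The real diffusers `PipelineState`/`set_block_state` APIs are richer; only the idea that blocks write to shared state exclusively through their declared `intermediate_outputs` is illustrated:

```python
# Stand-in shared state; the real diffusers PipelineState is richer.
class PipelineState:
    def __init__(self):
        self.values = {}

class BlockState:
    pass

class InputStep:
    # Declared interface: only these names may be written back to state.
    intermediate_outputs = ["batch_size"]

    def set_block_state(self, state, block_state):
        for name in self.intermediate_outputs:
            state.values[name] = getattr(block_state, name)

    def __call__(self, state, prompt):
        block_state = BlockState()
        block_state.batch_size = len(prompt)  # derived, never state.set()
        self.set_block_state(state, block_state)

state = PipelineState()
InputStep()(state, prompt=["a cat", "a dog"])
print(state.values)  # {'batch_size': 2}
```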
```python
    description="Number of images to generate per prompt.",
),
InputParam(
    "batch_size",
```
this should not be an input -> should be output based on prompt_embeds
```python
def __call__(self, components: ErnieImageModularPipeline, block_state: BlockState, i: int, t: torch.Tensor):
    guider_inputs = {
        "text_bth": (
            getattr(block_state, "text_bth", None),
```
@claude what do you think of this pattern? do we need to do getattr(..., None) here?
|
@claude can you look at my reviews and update your summary on things we can improve on agents.md/modular.md that can help you with future review? |
|
Claude finished @yiyixuxu's task in 5m 27s — Analysis of YiYi's latest review comments + consolidated modular.md improvements
Comment 1:
|
| # | Proposed Gotcha | Source | Why Missed in Review |
|---|---|---|---|
| 10 | Don't add `output_type="latent"` inside decoder blocks. Users skip the decode block entirely. Decoders should always decode. | Round 1, YiYi feedback | Applied standard pipeline thinking to modular context |
| 11 | Don't use `getattr` with defaults for declared inputs/components. Components and block_state inputs are always registered as attributes. Access directly. For guider inputs, prefer `prepare_inputs_from_block_state()`. | Round 1 + this round, YiYi's comments | Checked correctness, not framework idiom compliance; older pipelines (SDXL, Flux2) use same anti-pattern |
| 12 | Don't hide unconditional logic inside conditional branches. If a block does both unconditional work (normalizing inputs) and conditional work (optional model pass), run unconditional logic first, unconditionally. | Round 1, YiYi's comment | Checked behavior, not code structure |
| 13 | Don't bake `num_images_per_prompt` into encoder steps. Encoders produce one embedding set per prompt. Expansion belongs in a separate input step. | Round 1, YiYi's comment | Didn't compare architectural decomposition against reference pipelines |
| 14 | Don't accept `batch_size` as input when derivable from data. Input steps should derive `batch_size` from `prompt_embeds` shape/length, not accept it as an `InputParam`. | This round, YiYi's comment | Checked data flow, not API design |
| 15 | Don't produce redundant outputs. Each intermediate output should carry distinct information. Don't create copies of the same value under different names. | This round, YiYi's comment | Checked that outputs were populated, not whether they were semantically distinct |
| 16 | Don't use `state.set()` directly inside blocks. Write to state only through declared `intermediate_outputs` via `set_block_state()`. Direct `state.set()` bypasses the block's interface. | This round, YiYi's comment | Didn't audit framework API usage patterns |
Common review failure modes

Looking across all rounds, there are three recurring themes in what I missed:

1. Correctness ≠ idiom compliance. Code that produces correct results can still violate framework contracts. Reviews need to check both.
2. Comparing against the wrong references. Older pipelines (SDXL, Flux2) have patterns that were acceptable at the time but are now superseded. New pipelines should be compared against the newest reference implementations (Helios, WAN, Z-Image, LTX).
3. Data flow vs. API design. Checking that values flow correctly through the pipeline is necessary but insufficient. Reviews also need to check whether values should flow that way — is this the right block to produce this value? Should this be an input or derived?
Suggested additions to agents.md / review instructions
To help future reviews catch these issues, I'd suggest adding to the review guidelines:
When reviewing modular pipelines, check against the newest reference implementations (currently Helios, WAN, Z-Image, LTX — not SDXL or Flux, which use older patterns). Specifically verify:

- Encoder steps only encode; `num_images_per_prompt` and `batch_size` are handled in input steps
- All state writes go through `intermediate_outputs`/`set_block_state()`, never `state.set()` directly
- No `getattr(obj, name, None)` for declared components or inputs
- Each output carries distinct information
- Unconditional logic isn't hidden inside conditional branches
|
@yiyixuxu addressed review changes. batch_size now derived from prompt_embeds/text_bth, dropped revised_prompts, state.set, and the getattr on block_state. docstrings regenerated too. |
What does this PR do?
Adds the modular pipeline for ErnieImage (`ErnieImageAutoBlocks` + `ErnieImageModularPipeline`).

Parity verified on A100, bf16, 50 steps, 1024x1024 with `baidu/ERNIE-Image`.

Colab Notebook: https://colab.research.google.com/gist/akshan-main/f25801763d573209464d6bfd685d708e/modular-ernie-image.ipynb
Addresses #13389 (comment).
Before submitting: see the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@yiyixuxu @sayakpaul