feat: Add Motif-Video model and pipelines#13748
Conversation
…dance support

Add complete Motif Video implementation to diffusers:

New Models:
- Add MotifVideoTransformer3DModel with T5Gemma2Encoder for multimodal conditioning
- Supports text-to-video and image-to-video generation with vision tower integration

New Pipelines:
- Add MotifVideoPipeline for text-to-video generation
  - Default resolution: 736x1280, 121 frames, 25 fps
  - Supports classifier-free guidance and AdaptiveProjectedGuidance
- Add MotifVideoImage2VideoPipeline for image-to-video generation
  - First frame conditioning with vision encoder
  - Same defaults as T2V pipeline

Enhanced Guidance:
- Update AdaptiveProjectedGuidance with normalization_dims parameter
- Support "spatial" normalization for 5D tensors (per-frame spatial normalization)
- Support custom dimension lists for flexible normalization
- Update AdaptiveProjectedMixGuidance with the same parameter

Documentation & Tests:
- Add comprehensive API documentation for transformer and pipelines
- Add test suites for both T2V and I2V pipelines
- Register all new components in __init__ files
- Add dummy objects for torch and transformers backends

Total: 18 files changed, 3416 insertions(+), 2 deletions(-)
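The "spatial" normalization described for `AdaptiveProjectedGuidance` can be illustrated with a short sketch. This is an illustrative example only, not the diffusers implementation; `spatial_guidance_norm` is a hypothetical helper name. The idea is that a 5D `(batch, channels, frames, height, width)` tensor is normalized over its spatial dims only, so each frame is rescaled independently:

```python
import torch


def spatial_guidance_norm(guidance: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Hypothetical sketch of "spatial" normalization for 5D video tensors:
    # take the norm over the spatial dims (H, W) only, keeping the batch,
    # channel, and frame dims, so every frame is normalized independently.
    norm = guidance.norm(dim=(-2, -1), keepdim=True)  # shape (B, C, F, 1, 1)
    return guidance / (norm + eps)
```

A custom `normalization_dims` list would generalize this by replacing the hard-coded `(-2, -1)` with whatever dims the caller requests.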
… into feat/motif-video
Co-authored-by: Beomgyu Kim <beomgyu.kim@motiftech.io>
- Use default Attention class with custom MotifVideoAttnProcessor2_0
- Inline cross-attention in transformer blocks
- Use dispatch_attention_fn for backend support
- Inherit AttentionMixin for attn_processors/set_attn_processor
- Move TransformerBlockRegistry to _helpers.py
- Add _repeated_blocks for regional compilation
… into feat/motif-video
… into feat/motif-video
… into feat/motif-video
@claude-2-serge can you do a final review? |
🤗 Serge says:
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
🤗 Serge says:
Overall this is a solid addition of the Motif-Video model and pipelines. The architecture, registration, tests, and docs are well-structured. I found one correctness bug (dead code from an unreachable branch that masks a missing negative-prompt expansion path), a few incorrect return-type annotations, and a missing newline at end of file.
Correctness

- Dead code / unreachable branch in `encode_prompt` (both pipelines): the outer guard `if negative_prompt_embeds is None and negative_prompt is not None:` guarantees `negative_prompt is not None` inside the block, so the inner `if negative_prompt is None:` on line 328 (T2V) / line 340 (I2V) is dead code. More importantly, when a user passes pre-computed `negative_prompt_embeds` (not `None`), the duplication for `num_videos_per_prompt` is skipped entirely: the positive embeddings are always duplicated (lines 317-323), but negative embeddings are only duplicated inside the freshly-computed branch. With `num_videos_per_prompt > 1` and pre-computed negative embeds, shapes will mismatch.
- Incorrect return-type annotations: `MotifVideoSingleTransformerBlock.forward` is annotated `-> torch.Tensor` but returns a `Tuple[torch.Tensor, torch.Tensor]`. Same for `MotifVideoRotaryPosEmbed.forward`.
Style / Minor
`docs/source/en/api/pipelines/motif_video.md` is missing a trailing newline.
Dead code analysis (advisory)

Under the default config (`enable_text_cross_attention_dual=False`, `enable_text_cross_attention_single=False`, `num_decoder_layers=0`, `image_embed_dim=None`):

- `MotifVideoCrossAttention` and `MotifVideoCrossAttnProcessor2_0` are instantiated, but `self.cross_attn` is always `None` in both block types, so the cross-attention forward path is never exercised.
- `MotifVideoImageProjection` (`self.image_embedder`) is not created when `image_embed_dim` is `None`.
- The decoder path (`num_decoder_layers > 0`), including `decoder_hidden_states = hidden_states.clone()`, is never reached.

These are likely exercised by specific checkpoint configs (e.g. the published 2B model may set these), so they are advisory only.
```python
# Compute negative embeddings if needed
if negative_prompt_embeds is None and negative_prompt is not None:
    # Prepare negative_prompt to match batch_size
    if negative_prompt is None:
```
Bug: this `if negative_prompt is None:` branch is unreachable; the outer guard on line 326 already ensures `negative_prompt is not None`.

More importantly, when a user passes pre-computed `negative_prompt_embeds` (skipping this entire block), the negative embeddings are not duplicated for `num_videos_per_prompt`, while the positive embeddings always are (lines 317-323). This will cause a shape mismatch when `num_videos_per_prompt > 1` and `negative_prompt_embeds` is pre-provided.

Suggested fix: move the negative-embed duplication (`repeat` / `repeat_interleave`) outside the `if negative_prompt_embeds is None` block so it applies to both freshly-computed and pre-provided negative embeddings, and remove the dead `if negative_prompt is None:` branch.
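A minimal sketch of the suggested shape handling (`duplicate_embeds` is a hypothetical helper name; the real `encode_prompt` carries more state such as attention masks): both positive and negative embeddings are duplicated for `num_videos_per_prompt` outside any freshly-computed branch, so pre-provided `negative_prompt_embeds` get the same treatment.

```python
import torch
from typing import Optional, Tuple


def duplicate_embeds(
    prompt_embeds: torch.Tensor,
    negative_prompt_embeds: Optional[torch.Tensor],
    num_videos_per_prompt: int,
) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
    # Duplicate positive embeddings for each generated video per prompt.
    bs, seq_len, dim = prompt_embeds.shape
    prompt_embeds = prompt_embeds.repeat(1, num_videos_per_prompt, 1)
    prompt_embeds = prompt_embeds.view(bs * num_videos_per_prompt, seq_len, dim)

    if negative_prompt_embeds is not None:
        # Runs for BOTH freshly-computed and caller-provided negative embeds,
        # so shapes stay aligned with the positive embeddings.
        negative_prompt_embeds = negative_prompt_embeds.repeat(1, num_videos_per_prompt, 1)
        negative_prompt_embeds = negative_prompt_embeds.view(bs * num_videos_per_prompt, seq_len, dim)
    return prompt_embeds, negative_prompt_embeds
```

With this ordering, a classifier-free-guidance concat such as `torch.cat([negative_prompt_embeds, prompt_embeds])` lines up for any `num_videos_per_prompt`.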
```python
# Compute negative embeddings if needed
if negative_prompt_embeds is None and negative_prompt is not None:
    # Prepare negative_prompt to match batch_size
    if negative_prompt is None:
```
Same dead-code / missing-duplication bug as in the T2V pipeline's `encode_prompt`. The inner `if negative_prompt is None:` is unreachable, and pre-provided `negative_prompt_embeds` won't be duplicated for `num_videos_per_prompt`.
```python
attention_mask: Optional[torch.Tensor] = None,
image_rotary_emb: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
image_embed_seq_len: int = 0,
) -> torch.Tensor:
```
Incorrect return-type annotation: this method returns `(hidden_states, encoder_hidden_states)` at line 617, which is a `Tuple[torch.Tensor, torch.Tensor]`, not `torch.Tensor`.
```diff
-    ) -> torch.Tensor:
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
```
```python
self.rope_dim = rope_dim
self.theta = theta

def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
```
Incorrect return-type annotation: this method returns `(freqs_cos, freqs_sin)` at line 496, which is a `Tuple[torch.Tensor, torch.Tensor]`, not `torch.Tensor`.
```diff
-    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+    def forward(self, hidden_states: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
```
```md
## MotifVideoPipelineOutput

[[autodoc]] pipelines.motif_video.pipeline_output.MotifVideoPipelineOutput
```

(No newline at end of file.)
Nit: missing trailing newline at end of file.
What does this PR do?
This PR adds support for Motif-Video, a text-to-video (T2V) and image-to-video (I2V) diffusion model from Motif Technologies. The implementation includes the transformer architecture, both pipeline variants, guidance configurations, and comprehensive documentation.
Changes
New Files
- `src/diffusers/models/transformers/transformer_motif_video.py` - MotifVideoTransformer3DModel
- `src/diffusers/pipelines/motif_video/pipeline_motif_video.py` - Text-to-Video
- `src/diffusers/pipelines/motif_video/pipeline_motif_video_image2video.py` - Image-to-Video
- `src/diffusers/pipelines/motif_video/pipeline_output.py`
- `tests/pipelines/motif_video/test_motif_video.py`
- `tests/pipelines/motif_video/test_motif_video_image2video.py`
- `docs/source/en/api/models/motif_video_transformer_3d.md`
- `docs/source/en/api/pipelines/motif_video.md`

Key Features
Version Requirements
Before submitting
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.