
feat: Add Motif-Video model and pipelines #13748

Open
tarekziade wants to merge 102 commits into huggingface:main from tarekziade:test/motif-video-clone-13551

Conversation

@tarekziade

What does this PR do?

This PR adds support for Motif-Video, a text-to-video (T2V) and image-to-video (I2V) diffusion model from Motif Technologies. The implementation includes the transformer architecture, both pipeline variants, guidance configurations, and comprehensive documentation.

Changes

New Files

  • Model: src/diffusers/models/transformers/transformer_motif_video.py - MotifVideoTransformer3DModel
  • Pipelines:
    • src/diffusers/pipelines/motif_video/pipeline_motif_video.py - Text-to-Video
    • src/diffusers/pipelines/motif_video/pipeline_motif_video_image2video.py - Image-to-Video
  • Output: src/diffusers/pipelines/motif_video/pipeline_output.py
  • Tests:
    • tests/pipelines/motif_video/test_motif_video.py
    • tests/pipelines/motif_video/test_motif_video_image2video.py
  • Documentation:
    • docs/source/en/api/models/motif_video_transformer_3d.md
    • docs/source/en/api/pipelines/motif_video.md

Key Features

  • Architecture: DiT-based transformer with T5Gemma2Encoder for text encoding
  • Flow Match: Uses FlowMatchEulerDiscreteScheduler
  • Guidance: Supports ClassifierFreeGuidance, SkipLayerGuidance, and AdaptiveProjectedGuidance
  • Video Processing: Wan-style VAE for video encoding/decoding
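As a rough illustration of the classifier-free guidance mode listed above, here is a minimal numpy sketch (not the diffusers `ClassifierFreeGuidance` implementation, whose API differs): the guided prediction pushes the unconditional prediction toward, and past, the conditional one.

```python
import numpy as np

# Illustrative only: combine unconditional and conditional noise predictions.
def classifier_free_guidance(noise_uncond, noise_cond, guidance_scale):
    # guided = uncond + s * (cond - uncond); s = 1 recovers the conditional path
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

uncond = np.zeros((1, 4))
cond = np.ones((1, 4))
out = classifier_free_guidance(uncond, cond, 5.0)
# With scale 5.0, each element is 0 + 5 * (1 - 0) = 5.0
```

SkipLayerGuidance and AdaptiveProjectedGuidance follow the same "combine two predictions" pattern but differ in how the second prediction is produced and how the delta is normalized.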

Version Requirements

  • transformers>=5.1.0 - Required for T5Gemma2Encoder (critical bug fix in PR #43633)
  • The pipeline includes a version check that raises a clear error with upgrade instructions if the transformers version is too old
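A sketch of the kind of version gate described above (hypothetical helper name; the actual check lives in the pipeline and compares against `transformers.__version__`):

```python
# Minimum transformers version carrying the T5Gemma2Encoder fix.
MIN_TRANSFORMERS = (5, 1, 0)

def check_transformers_version(version_str):
    # Parse the leading numeric components, e.g. "5.1.0.dev0" -> (5, 1, 0).
    parts = []
    for piece in version_str.split(".")[:3]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        if not digits:
            break
        parts.append(int(digits))
    if tuple(parts) < MIN_TRANSFORMERS:
        raise ImportError(
            f"Motif-Video requires transformers>=5.1.0 for T5Gemma2Encoder, "
            f"found {version_str}. Upgrade with `pip install -U transformers`."
        )

check_transformers_version("5.1.0")  # passes silently
```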

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Ken Cheung and others added 30 commits April 23, 2026 07:08
…dance support

Add complete Motif Video implementation to diffusers:

New Models:
- Add MotifVideoTransformer3DModel with T5Gemma2Encoder for multimodal conditioning
- Supports text-to-video and image-to-video generation with vision tower integration

New Pipelines:
- Add MotifVideoPipeline for text-to-video generation
  - Default resolution: 736x1280, 121 frames, 25 fps
  - Supports classifier-free guidance and AdaptiveProjectedGuidance
- Add MotifVideoImage2VideoPipeline for image-to-video generation
  - First frame conditioning with vision encoder
  - Same defaults as T2V pipeline

Enhanced Guidance:
- Update AdaptiveProjectedGuidance with normalization_dims parameter
  - Support "spatial" normalization for 5D tensors (per-frame spatial normalization)
  - Support custom dimension lists for flexible normalization
  - Update AdaptiveProjectedMixGuidance with same parameter

Documentation & Tests:
- Add comprehensive API documentation for transformer and pipelines
- Add test suites for both T2V and I2V pipelines
- Register all new components in __init__ files
- Add dummy objects for torch and transformers backends

Total: 18 files changed, 3416 insertions(+), 2 deletions(-)
Co-authored-by: Beomgyu Kim <beomgyu.kim@motiftech.io>
- Use default Attention class with custom MotifVideoAttnProcessor2_0
- Inline cross-attention in transformer blocks
- Use dispatch_attention_fn for backend support
- Inherit AttentionMixin for attn_processors/set_attn_processor
- Move TransformerBlockRegistry to _helpers.py
- Add _repeated_blocks for regional compilation
@github-actions github-actions bot added labels: size/L (PR with diff > 200 LOC), documentation (Improvements or additions to documentation), models, tests, utils, pipelines, single-file, hooks; and removed label: size/L (PR with diff > 200 LOC) on May 14, 2026
@tarekziade
Author

@claude-2-serge can you do a final review?

@github-actions
Contributor

🤗 Serge says:

⚠️ Review failed: HTTPError: 400 Client Error: Bad Request for url: https://api.anthropic.com/v1/chat/completions

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@tarekziade
Author

@claude-2-serge can you do a final review?


@github-actions github-actions Bot left a comment


🤗 Serge says:

Overall this is a solid addition of the Motif-Video model and pipelines. The architecture, registration, tests, and docs are well-structured. I found one correctness bug (dead code from an unreachable branch that masks a missing negative-prompt expansion path), a few incorrect return-type annotations, and a missing newline at end of file.

Correctness

  • Dead code / unreachable branch in encode_prompt (both pipelines): The outer guard if negative_prompt_embeds is None and negative_prompt is not None: guarantees negative_prompt is not None inside the block, so the inner if negative_prompt is None: on line 328 (T2V) / line 340 (I2V) is dead code. More importantly, when a user passes pre-computed negative_prompt_embeds (not None), the duplication for num_videos_per_prompt is skipped entirely — the positive embeddings are always duplicated (lines 317-323) but negative embeddings are only duplicated inside the freshly-computed branch. With num_videos_per_prompt > 1 and pre-computed negative embeds, shapes will mismatch.

  • Incorrect return-type annotations: MotifVideoSingleTransformerBlock.forward is annotated -> torch.Tensor but returns a Tuple[torch.Tensor, torch.Tensor]. Same for MotifVideoRotaryPosEmbed.forward.

Style / Minor

  • docs/source/en/api/pipelines/motif_video.md is missing a trailing newline.

Dead code analysis (advisory)

Under the default config (enable_text_cross_attention_dual=False, enable_text_cross_attention_single=False, num_decoder_layers=0, image_embed_dim=None):

  • MotifVideoCrossAttention and MotifVideoCrossAttnProcessor2_0 are instantiated but self.cross_attn is always None in both block types, so the cross-attention forward path is never exercised.
  • MotifVideoImageProjection (self.image_embedder) is not created when image_embed_dim is None.
  • The decoder path (num_decoder_layers > 0) including decoder_hidden_states = hidden_states.clone() is never reached.

These are likely exercised by specific checkpoint configs (e.g. the published 2B model may set these), so they are advisory only.

31 LLM turns · 37 tool calls · 179.3s · 2606633 in / 6824 out tokens

# Compute negative embeddings if needed
if negative_prompt_embeds is None and negative_prompt is not None:
    # Prepare negative_prompt to match batch_size
    if negative_prompt is None:

Bug: This if negative_prompt is None: branch is unreachable — the outer guard on line 326 already ensures negative_prompt is not None.

More importantly, when a user passes pre-computed negative_prompt_embeds (skipping this entire block), the negative embeddings are not duplicated for num_videos_per_prompt, while the positive embeddings always are (lines 317-323). This will cause a shape mismatch when num_videos_per_prompt > 1 and negative_prompt_embeds is pre-provided.

Suggested fix: move the negative-embed duplication (repeat / repeat_interleave) outside the if negative_prompt_embeds is None block so it applies to both freshly-computed and pre-provided negative embeddings, and remove the dead if negative_prompt is None: branch.
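The suggested restructuring could look roughly like this, a numpy sketch of the shape logic only (hypothetical helper name; the real pipeline code operates on torch tensors with repeat/repeat_interleave):

```python
import numpy as np

# Duplicate embeddings for num_videos_per_prompt AFTER the compute-if-missing
# branch, so pre-provided negative embeds get the same treatment as fresh ones.
def expand_embeds(prompt_embeds, negative_prompt_embeds, num_videos_per_prompt):
    # Mirrors torch's repeat_interleave along the batch dimension.
    prompt_embeds = np.repeat(prompt_embeds, num_videos_per_prompt, axis=0)
    if negative_prompt_embeds is not None:
        # This now runs for pre-computed embeds too, keeping shapes consistent.
        negative_prompt_embeds = np.repeat(
            negative_prompt_embeds, num_videos_per_prompt, axis=0
        )
    return prompt_embeds, negative_prompt_embeds

pos = np.zeros((1, 8, 16))
neg = np.zeros((1, 8, 16))  # stands in for user-supplied negative_prompt_embeds
pos2, neg2 = expand_embeds(pos, neg, 3)
# pos2.shape == neg2.shape == (3, 8, 16), so CFG batching no longer mismatches
```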

# Compute negative embeddings if needed
if negative_prompt_embeds is None and negative_prompt is not None:
    # Prepare negative_prompt to match batch_size
    if negative_prompt is None:

Same dead-code / missing-duplication bug as in the T2V pipeline's encode_prompt. The inner if negative_prompt is None: is unreachable, and pre-provided negative_prompt_embeds won't be duplicated for num_videos_per_prompt.

    attention_mask: Optional[torch.Tensor] = None,
    image_rotary_emb: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
    image_embed_seq_len: int = 0,
) -> torch.Tensor:

Incorrect return-type annotation: this method returns (hidden_states, encoder_hidden_states) at line 617, which is a Tuple[torch.Tensor, torch.Tensor], not torch.Tensor.

Suggested change:
- ) -> torch.Tensor:
+ ) -> Tuple[torch.Tensor, torch.Tensor]:

        self.rope_dim = rope_dim
        self.theta = theta

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:

Incorrect return-type annotation: this method returns (freqs_cos, freqs_sin) at line 496, which is a Tuple[torch.Tensor, torch.Tensor], not torch.Tensor.

Suggested change:
- def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+ def forward(self, hidden_states: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:


## MotifVideoPipelineOutput

[[autodoc]] pipelines.motif_video.pipeline_output.MotifVideoPipelineOutput
(no newline at end of file)

Nit: missing trailing newline at end of file.
