deepfloyd_if model/pipeline review #13646

@hlky

deepfloyd_if model/pipeline review

Commit tested: 0f1abc4ae8b0eb2a3b40e82a310507281144c423

Review performed against the repository review rules. Note: .ai/review-rules.md references AGENTS.md, but that file is absent in this checkout; the remaining referenced rule files were applied.

Duplicate search status: searched GitHub Issues and PRs for deepfloyd_if, affected class/function names, NumPy image/mask failures, IFPipelineOutput import behavior, strength validation, and T5FilmDecoder coverage. I found no likely duplicates.

Issue 1: Unbatched NumPy image inputs are rejected as batch-size mismatches

Affected code:

if isinstance(image, list):
    check_image_type = image[0]
else:
    check_image_type = image
if (
    not isinstance(check_image_type, torch.Tensor)
    and not isinstance(check_image_type, PIL.Image.Image)
    and not isinstance(check_image_type, np.ndarray)
):
    raise ValueError(
        "`image` has to be of type `torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, or list[...] but is"
        f" {type(check_image_type)}"
    )
if isinstance(image, list):
    image_batch_size = len(image)
elif isinstance(image, torch.Tensor):
    image_batch_size = image.shape[0]
elif isinstance(image, PIL.Image.Image):
    image_batch_size = 1
elif isinstance(image, np.ndarray):
    image_batch_size = image.shape[0]
else:
    assert False
if batch_size != image_batch_size:
    raise ValueError(f"image batch size: {image_batch_size} must be same as prompt batch size {batch_size}")

if isinstance(image, list):
    check_image_type = image[0]
else:
    check_image_type = image
if (
    not isinstance(check_image_type, torch.Tensor)
    and not isinstance(check_image_type, PIL.Image.Image)
    and not isinstance(check_image_type, np.ndarray)
):
    raise ValueError(
        "`image` has to be of type `torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, or list[...] but is"
        f" {type(check_image_type)}"
    )
if isinstance(image, list):
    image_batch_size = len(image)
elif isinstance(image, torch.Tensor):
    image_batch_size = image.shape[0]
elif isinstance(image, PIL.Image.Image):
    image_batch_size = 1
elif isinstance(image, np.ndarray):
    image_batch_size = image.shape[0]
else:
    assert False
if batch_size != image_batch_size:
    raise ValueError(f"image batch size: {image_batch_size} must be same as prompt batch size {batch_size}")

# image
if isinstance(image, list):
check_image_type = image[0]
else:
check_image_type = image
if (
not isinstance(check_image_type, torch.Tensor)
and not isinstance(check_image_type, PIL.Image.Image)
and not isinstance(check_image_type, np.ndarray)
):
raise ValueError(
"`image` has to be of type `torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, or list[...] but is"
f" {type(check_image_type)}"
)
if isinstance(image, list):
image_batch_size = len(image)
elif isinstance(image, torch.Tensor):
image_batch_size = image.shape[0]
elif isinstance(image, PIL.Image.Image):
image_batch_size = 1
elif isinstance(image, np.ndarray):
image_batch_size = image.shape[0]
else:
assert False
if batch_size != image_batch_size:
raise ValueError(f"image batch size: {image_batch_size} must be same as prompt batch size {batch_size}")
# mask_image
if isinstance(mask_image, list):
check_image_type = mask_image[0]
else:
check_image_type = mask_image
if (
not isinstance(check_image_type, torch.Tensor)
and not isinstance(check_image_type, PIL.Image.Image)
and not isinstance(check_image_type, np.ndarray)
):
raise ValueError(
"`mask_image` has to be of type `torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, or list[...] but is"
f" {type(check_image_type)}"
)
if isinstance(mask_image, list):
image_batch_size = len(mask_image)
elif isinstance(mask_image, torch.Tensor):
image_batch_size = mask_image.shape[0]
elif isinstance(mask_image, PIL.Image.Image):
image_batch_size = 1
elif isinstance(mask_image, np.ndarray):
image_batch_size = mask_image.shape[0]
else:
assert False
if image_batch_size != 1 and batch_size != image_batch_size:
raise ValueError(
f"mask_image batch size: {image_batch_size} must be `1` or the same as prompt batch size {batch_size}"
)

# image
if isinstance(image, list):
check_image_type = image[0]
else:
check_image_type = image
if (
not isinstance(check_image_type, torch.Tensor)
and not isinstance(check_image_type, PIL.Image.Image)
and not isinstance(check_image_type, np.ndarray)
):
raise ValueError(
"`image` has to be of type `torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, or list[...] but is"
f" {type(check_image_type)}"
)
if isinstance(image, list):
image_batch_size = len(image)
elif isinstance(image, torch.Tensor):
image_batch_size = image.shape[0]
elif isinstance(image, PIL.Image.Image):
image_batch_size = 1
elif isinstance(image, np.ndarray):
image_batch_size = image.shape[0]
else:
assert False
if batch_size != image_batch_size:
raise ValueError(f"image batch size: {image_batch_size} must be same as prompt batch size {batch_size}")
# original_image
if isinstance(original_image, list):
check_image_type = original_image[0]
else:
check_image_type = original_image
if (
not isinstance(check_image_type, torch.Tensor)
and not isinstance(check_image_type, PIL.Image.Image)
and not isinstance(check_image_type, np.ndarray)
):
raise ValueError(
"`original_image` has to be of type `torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, or list[...] but is"
f" {type(check_image_type)}"
)
if isinstance(original_image, list):
image_batch_size = len(original_image)
elif isinstance(original_image, torch.Tensor):
image_batch_size = original_image.shape[0]
elif isinstance(original_image, PIL.Image.Image):
image_batch_size = 1
elif isinstance(original_image, np.ndarray):
image_batch_size = original_image.shape[0]
else:
assert False
if batch_size != image_batch_size:
raise ValueError(
f"original_image batch size: {image_batch_size} must be same as prompt batch size {batch_size}"
)

f" {negative_prompt_embeds.shape}."
)
# image
if isinstance(image, list):
check_image_type = image[0]
else:
check_image_type = image
if (
not isinstance(check_image_type, torch.Tensor)
and not isinstance(check_image_type, PIL.Image.Image)
and not isinstance(check_image_type, np.ndarray)
):
raise ValueError(
"`image` has to be of type `torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, or list[...] but is"
f" {type(check_image_type)}"
)
if isinstance(image, list):
image_batch_size = len(image)
elif isinstance(image, torch.Tensor):
image_batch_size = image.shape[0]
elif isinstance(image, PIL.Image.Image):
image_batch_size = 1
elif isinstance(image, np.ndarray):
image_batch_size = image.shape[0]
else:
assert False
if batch_size != image_batch_size:
raise ValueError(f"image batch size: {image_batch_size} must be same as prompt batch size {batch_size}")
# original_image
if isinstance(original_image, list):
check_image_type = original_image[0]
else:
check_image_type = original_image
if (
not isinstance(check_image_type, torch.Tensor)
and not isinstance(check_image_type, PIL.Image.Image)
and not isinstance(check_image_type, np.ndarray)
):
raise ValueError(
"`original_image` has to be of type `torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, or list[...] but is"
f" {type(check_image_type)}"
)
if isinstance(original_image, list):
image_batch_size = len(original_image)
elif isinstance(original_image, torch.Tensor):
image_batch_size = original_image.shape[0]
elif isinstance(original_image, PIL.Image.Image):
image_batch_size = 1
elif isinstance(original_image, np.ndarray):
image_batch_size = original_image.shape[0]
else:
assert False
if batch_size != image_batch_size:
raise ValueError(
f"original_image batch size: {image_batch_size} must be same as prompt batch size {batch_size}"
)
# mask_image
if isinstance(mask_image, list):
check_image_type = mask_image[0]
else:
check_image_type = mask_image
if (
not isinstance(check_image_type, torch.Tensor)
and not isinstance(check_image_type, PIL.Image.Image)
and not isinstance(check_image_type, np.ndarray)
):
raise ValueError(
"`mask_image` has to be of type `torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, or list[...] but is"
f" {type(check_image_type)}"
)
if isinstance(mask_image, list):
image_batch_size = len(mask_image)
elif isinstance(mask_image, torch.Tensor):
image_batch_size = mask_image.shape[0]
elif isinstance(mask_image, PIL.Image.Image):
image_batch_size = 1
elif isinstance(mask_image, np.ndarray):
image_batch_size = mask_image.shape[0]
else:
assert False
if image_batch_size != 1 and batch_size != image_batch_size:

Problem:
The validators treat np.ndarray.shape[0] as batch size for every NumPy image. A normal single image shaped (H, W, C) is interpreted as batch size H, even though the preprocessors can handle single HWC arrays.

Impact:
Users passing valid single NumPy images get a misleading batch-size error. Fast tests only cover tensor batches and slow tests use PIL images, so this path is untested.
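
To make the misreading concrete, here is a minimal NumPy-only sketch. Both function names are hypothetical and exist only for illustration; the first mirrors the `image.shape[0]` branch quoted above, the second assumes that only 4D arrays carry an explicit batch axis.

```python
import numpy as np


def naive_batch_size(image: np.ndarray) -> int:
    # Mirrors the current validator: the first axis is always read as the batch.
    return image.shape[0]


def ndim_aware_batch_size(image: np.ndarray) -> int:
    # Hypothetical alternative: only a 4D array has an explicit batch axis.
    return image.shape[0] if image.ndim == 4 else 1


single = np.zeros((8, 8, 3), dtype=np.float32)    # one H x W x C image
batch = np.zeros((2, 8, 8, 3), dtype=np.float32)  # two H x W x C images

print(naive_batch_size(single), ndim_aware_batch_size(single))  # 8 1
print(naive_batch_size(batch), ndim_aware_batch_size(batch))    # 2 2
```

With `batch_size=1`, the naive reading of `single` yields 8 and triggers the batch-size error, while the batched input is unaffected.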

Reproduction:

import numpy as np
import torch
from diffusers import DDPMScheduler, IFImg2ImgPipeline, UNet2DConditionModel

unet = UNet2DConditionModel(
    sample_size=8, in_channels=3, out_channels=6, layers_per_block=1,
    block_out_channels=(8,), down_block_types=("CrossAttnDownBlock2D",),
    up_block_types=("CrossAttnUpBlock2D",), cross_attention_dim=4,
    attention_head_dim=4, norm_num_groups=1,
)
pipe = IFImg2ImgPipeline(None, None, unet, DDPMScheduler(num_train_timesteps=10, variance_type="learned_range"), None, None, None, False)

pipe.check_inputs(
    prompt=None,
    image=np.zeros((8, 8, 3), dtype=np.float32),
    batch_size=1,
    callback_steps=1,
    prompt_embeds=torch.zeros(1, 77, 4),
    negative_prompt_embeds=torch.zeros(1, 77, 4),
)

Relevant precedent:
The local preprocessors already wrap non-list NumPy images and would handle HWC as a single image if validation allowed it.

Suggested fix:

def _image_batch_size(image):
    if isinstance(image, list):
        return len(image)
    if isinstance(image, PIL.Image.Image):
        return 1
    if isinstance(image, np.ndarray):
        return image.shape[0] if image.ndim == 4 else 1
    if isinstance(image, torch.Tensor):
        return image.shape[0] if image.ndim == 4 else 1
    raise TypeError(type(image))

Issue 2: Batched NumPy masks are converted to 5D tensors

Affected code:

def preprocess_mask_image(self, mask_image) -> torch.Tensor:
    if not isinstance(mask_image, list):
        mask_image = [mask_image]
    if isinstance(mask_image[0], torch.Tensor):
        mask_image = torch.cat(mask_image, axis=0) if mask_image[0].ndim == 4 else torch.stack(mask_image, axis=0)
        if mask_image.ndim == 2:
            # Batch and add channel dim for single mask
            mask_image = mask_image.unsqueeze(0).unsqueeze(0)
        elif mask_image.ndim == 3 and mask_image.shape[0] == 1:
            # Single mask, the 0'th dimension is considered to be
            # the existing batch size of 1
            mask_image = mask_image.unsqueeze(0)
        elif mask_image.ndim == 3 and mask_image.shape[0] != 1:
            # Batch of mask, the 0'th dimension is considered to be
            # the batching dimension
            mask_image = mask_image.unsqueeze(1)
        mask_image[mask_image < 0.5] = 0
        mask_image[mask_image >= 0.5] = 1
    elif isinstance(mask_image[0], PIL.Image.Image):
        new_mask_image = []
        for mask_image_ in mask_image:
            mask_image_ = mask_image_.convert("L")
            mask_image_ = resize(mask_image_, self.unet.config.sample_size)
            mask_image_ = np.array(mask_image_)
            mask_image_ = mask_image_[None, None, :]
            new_mask_image.append(mask_image_)
        mask_image = new_mask_image
        mask_image = np.concatenate(mask_image, axis=0)
        mask_image = mask_image.astype(np.float32) / 255.0
        mask_image[mask_image < 0.5] = 0
        mask_image[mask_image >= 0.5] = 1
        mask_image = torch.from_numpy(mask_image)
    elif isinstance(mask_image[0], np.ndarray):
        mask_image = np.concatenate([m[None, None, :] for m in mask_image], axis=0)
        mask_image[mask_image < 0.5] = 0
        mask_image[mask_image >= 0.5] = 1
        mask_image = torch.from_numpy(mask_image)

# Copied from diffusers.pipelines.deepfloyd_if.pipeline_if_inpainting.IFInpaintingPipeline.preprocess_mask_image
def preprocess_mask_image(self, mask_image) -> torch.Tensor:
    if not isinstance(mask_image, list):
        mask_image = [mask_image]
    if isinstance(mask_image[0], torch.Tensor):
        mask_image = torch.cat(mask_image, axis=0) if mask_image[0].ndim == 4 else torch.stack(mask_image, axis=0)
        if mask_image.ndim == 2:
            # Batch and add channel dim for single mask
            mask_image = mask_image.unsqueeze(0).unsqueeze(0)
        elif mask_image.ndim == 3 and mask_image.shape[0] == 1:
            # Single mask, the 0'th dimension is considered to be
            # the existing batch size of 1
            mask_image = mask_image.unsqueeze(0)
        elif mask_image.ndim == 3 and mask_image.shape[0] != 1:
            # Batch of mask, the 0'th dimension is considered to be
            # the batching dimension
            mask_image = mask_image.unsqueeze(1)
        mask_image[mask_image < 0.5] = 0
        mask_image[mask_image >= 0.5] = 1
    elif isinstance(mask_image[0], PIL.Image.Image):
        new_mask_image = []
        for mask_image_ in mask_image:
            mask_image_ = mask_image_.convert("L")
            mask_image_ = resize(mask_image_, self.unet.config.sample_size)
            mask_image_ = np.array(mask_image_)
            mask_image_ = mask_image_[None, None, :]
            new_mask_image.append(mask_image_)
        mask_image = new_mask_image
        mask_image = np.concatenate(mask_image, axis=0)
        mask_image = mask_image.astype(np.float32) / 255.0
        mask_image[mask_image < 0.5] = 0
        mask_image[mask_image >= 0.5] = 1
        mask_image = torch.from_numpy(mask_image)
    elif isinstance(mask_image[0], np.ndarray):
        mask_image = np.concatenate([m[None, None, :] for m in mask_image], axis=0)
        mask_image[mask_image < 0.5] = 0
        mask_image[mask_image >= 0.5] = 1
        mask_image = torch.from_numpy(mask_image)

Problem:
preprocess_mask_image() wraps a non-list NumPy mask in a list, then blindly applies m[None, None, :]. For a batched mask shaped (B, H, W), this returns (1, 1, B, H, W) instead of (B, 1, H, W).

Impact:
Batched NumPy masks either fail later during broadcasting or apply the mask with the wrong shape. This is not covered by the current tests.

Reproduction:

import numpy as np
from diffusers import DDPMScheduler, IFInpaintingPipeline, UNet2DConditionModel

unet = UNet2DConditionModel(
    sample_size=8, in_channels=3, out_channels=6, layers_per_block=1,
    block_out_channels=(8,), down_block_types=("CrossAttnDownBlock2D",),
    up_block_types=("CrossAttnUpBlock2D",), cross_attention_dim=4,
    attention_head_dim=4, norm_num_groups=1,
)
pipe = IFInpaintingPipeline(None, None, unet, DDPMScheduler(num_train_timesteps=10, variance_type="learned_range"), None, None, None, False)

mask = np.zeros((2, 8, 8), dtype=np.float32)
print(pipe.preprocess_mask_image(mask).shape)  # torch.Size([1, 1, 2, 8, 8])

Relevant precedent:
Tensor masks in the same method distinguish 2D single masks from 3D batched masks before adding channel dimensions.

Suggested fix:

elif isinstance(mask_image[0], np.ndarray):
    mask_image = np.stack(mask_image, axis=0) if len(mask_image) > 1 else mask_image[0]

    if mask_image.ndim == 2:
        mask_image = mask_image[None, None, :, :]
    elif mask_image.ndim == 3:
        mask_image = mask_image[:, None, :, :]
    elif mask_image.ndim == 4 and mask_image.shape[-1] == 1:
        mask_image = mask_image.transpose(0, 3, 1, 2)
    else:
        raise ValueError(f"Unsupported mask shape: {mask_image.shape}")

    mask_image = (mask_image >= 0.5).astype(np.float32)
    mask_image = torch.from_numpy(mask_image)
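
The shape handling in this fix can be checked without building a pipeline. The sketch below lifts the NumPy branch into a standalone function; `normalize_numpy_mask` is a hypothetical name, and the `torch.from_numpy` conversion is left out so the example depends only on NumPy.

```python
import numpy as np


def normalize_numpy_mask(mask):
    """Return a binarized float32 mask shaped (B, 1, H, W)."""
    # Hypothetical helper: accepts one array or a list of same-shaped arrays.
    if isinstance(mask, list):
        mask = np.stack(mask, axis=0) if len(mask) > 1 else mask[0]

    if mask.ndim == 2:
        # (H, W): single mask, add batch and channel axes
        mask = mask[None, None, :, :]
    elif mask.ndim == 3:
        # (B, H, W): batch of masks, add only the channel axis
        mask = mask[:, None, :, :]
    elif mask.ndim == 4 and mask.shape[-1] == 1:
        # (B, H, W, 1): channels-last batch, move the channel axis forward
        mask = mask.transpose(0, 3, 1, 2)
    else:
        raise ValueError(f"Unsupported mask shape: {mask.shape}")

    return (mask >= 0.5).astype(np.float32)


print(normalize_numpy_mask(np.zeros((8, 8))).shape)        # (1, 1, 8, 8)
print(normalize_numpy_mask(np.zeros((2, 8, 8))).shape)     # (2, 1, 8, 8)
print(normalize_numpy_mask([np.zeros((8, 8))] * 2).shape)  # (2, 1, 8, 8)
```

One ambiguity remains under this scheme: a single channels-last mask shaped (H, W, 1) is 3D and would be read as a batch, so callers would need to pass it as (H, W) or (1, H, W, 1).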

Issue 3: strength is documented as constrained but invalid values are accepted

Affected code:

def check_inputs(
self,
prompt,
image,
batch_size,
callback_steps,
negative_prompt=None,
prompt_embeds=None,
negative_prompt_embeds=None,
):
if (callback_steps is None) or (
callback_steps is not None and (not isinstance(callback_steps, int) or callback_steps <= 0)
):
raise ValueError(
f"`callback_steps` has to be a positive integer but is {callback_steps} of type"
f" {type(callback_steps)}."
)
if prompt is not None and prompt_embeds is not None:
raise ValueError(
f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
" only forward one of the two."
)
elif prompt is None and prompt_embeds is None:
raise ValueError(
"Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
)
elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
if negative_prompt is not None and negative_prompt_embeds is not None:
raise ValueError(
f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
)
if prompt_embeds is not None and negative_prompt_embeds is not None:
if prompt_embeds.shape != negative_prompt_embeds.shape:
raise ValueError(
"`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
f" {negative_prompt_embeds.shape}."
)
if isinstance(image, list):
check_image_type = image[0]
else:
check_image_type = image
if (
not isinstance(check_image_type, torch.Tensor)
and not isinstance(check_image_type, PIL.Image.Image)
and not isinstance(check_image_type, np.ndarray)
):
raise ValueError(
"`image` has to be of type `torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, or list[...] but is"
f" {type(check_image_type)}"
)
if isinstance(image, list):
image_batch_size = len(image)
elif isinstance(image, torch.Tensor):
image_batch_size = image.shape[0]
elif isinstance(image, PIL.Image.Image):
image_batch_size = 1
elif isinstance(image, np.ndarray):
image_batch_size = image.shape[0]
else:
assert False
if batch_size != image_batch_size:
raise ValueError(f"image batch size: {image_batch_size} must be same as prompt batch size {batch_size}")

# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_img2img.StableDiffusionImg2ImgPipeline.get_timesteps
def get_timesteps(self, num_inference_steps, strength):
    # get the original timestep using init_timestep
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    timesteps = self.scheduler.timesteps[t_start * self.scheduler.order :]
    if hasattr(self.scheduler, "set_begin_index"):
        self.scheduler.set_begin_index(t_start * self.scheduler.order)
    return timesteps, num_inference_steps - t_start

def check_inputs(
self,
prompt,
image,
mask_image,
batch_size,
callback_steps,
negative_prompt=None,
prompt_embeds=None,
negative_prompt_embeds=None,
):
if (callback_steps is None) or (
callback_steps is not None and (not isinstance(callback_steps, int) or callback_steps <= 0)
):
raise ValueError(
f"`callback_steps` has to be a positive integer but is {callback_steps} of type"
f" {type(callback_steps)}."
)
if prompt is not None and prompt_embeds is not None:
raise ValueError(
f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
" only forward one of the two."
)
elif prompt is None and prompt_embeds is None:
raise ValueError(
"Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
)
elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
if negative_prompt is not None and negative_prompt_embeds is not None:
raise ValueError(
f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
)
if prompt_embeds is not None and negative_prompt_embeds is not None:
if prompt_embeds.shape != negative_prompt_embeds.shape:
raise ValueError(
"`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
f" {negative_prompt_embeds.shape}."
)
# image
if isinstance(image, list):
check_image_type = image[0]
else:
check_image_type = image
if (
not isinstance(check_image_type, torch.Tensor)
and not isinstance(check_image_type, PIL.Image.Image)
and not isinstance(check_image_type, np.ndarray)
):
raise ValueError(
"`image` has to be of type `torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, or list[...] but is"
f" {type(check_image_type)}"
)
if isinstance(image, list):
image_batch_size = len(image)
elif isinstance(image, torch.Tensor):
image_batch_size = image.shape[0]
elif isinstance(image, PIL.Image.Image):
image_batch_size = 1
elif isinstance(image, np.ndarray):
image_batch_size = image.shape[0]
else:
assert False
if batch_size != image_batch_size:
raise ValueError(f"image batch size: {image_batch_size} must be same as prompt batch size {batch_size}")
# mask_image
if isinstance(mask_image, list):
check_image_type = mask_image[0]
else:
check_image_type = mask_image
if (
not isinstance(check_image_type, torch.Tensor)
and not isinstance(check_image_type, PIL.Image.Image)
and not isinstance(check_image_type, np.ndarray)
):
raise ValueError(
"`mask_image` has to be of type `torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, or list[...] but is"
f" {type(check_image_type)}"
)
if isinstance(mask_image, list):
image_batch_size = len(mask_image)
elif isinstance(mask_image, torch.Tensor):
image_batch_size = mask_image.shape[0]
elif isinstance(mask_image, PIL.Image.Image):
image_batch_size = 1
elif isinstance(mask_image, np.ndarray):
image_batch_size = mask_image.shape[0]
else:
assert False
if image_batch_size != 1 and batch_size != image_batch_size:
raise ValueError(
f"mask_image batch size: {image_batch_size} must be `1` or the same as prompt batch size {batch_size}"
)

def check_inputs(
self,
prompt,
image,
original_image,
batch_size,
callback_steps,
negative_prompt=None,
prompt_embeds=None,
negative_prompt_embeds=None,
):
if (callback_steps is None) or (
callback_steps is not None and (not isinstance(callback_steps, int) or callback_steps <= 0)
):
raise ValueError(
f"`callback_steps` has to be a positive integer but is {callback_steps} of type"
f" {type(callback_steps)}."
)
if prompt is not None and prompt_embeds is not None:
raise ValueError(
f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
" only forward one of the two."
)
elif prompt is None and prompt_embeds is None:
raise ValueError(
"Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
)
elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
if negative_prompt is not None and negative_prompt_embeds is not None:
raise ValueError(
f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
)
if prompt_embeds is not None and negative_prompt_embeds is not None:
if prompt_embeds.shape != negative_prompt_embeds.shape:
raise ValueError(
"`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
f" {negative_prompt_embeds.shape}."
)
# image
if isinstance(image, list):
check_image_type = image[0]
else:
check_image_type = image
if (
not isinstance(check_image_type, torch.Tensor)
and not isinstance(check_image_type, PIL.Image.Image)
and not isinstance(check_image_type, np.ndarray)
):
raise ValueError(
"`image` has to be of type `torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, or list[...] but is"
f" {type(check_image_type)}"
)
if isinstance(image, list):
image_batch_size = len(image)
elif isinstance(image, torch.Tensor):
image_batch_size = image.shape[0]
elif isinstance(image, PIL.Image.Image):
image_batch_size = 1
elif isinstance(image, np.ndarray):
image_batch_size = image.shape[0]
else:
assert False
if batch_size != image_batch_size:
raise ValueError(f"image batch size: {image_batch_size} must be same as prompt batch size {batch_size}")
# original_image
if isinstance(original_image, list):
check_image_type = original_image[0]
else:
check_image_type = original_image
if (
not isinstance(check_image_type, torch.Tensor)
and not isinstance(check_image_type, PIL.Image.Image)
and not isinstance(check_image_type, np.ndarray)
):
raise ValueError(
"`original_image` has to be of type `torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, or list[...] but is"
f" {type(check_image_type)}"
)
if isinstance(original_image, list):
image_batch_size = len(original_image)
elif isinstance(original_image, torch.Tensor):
image_batch_size = original_image.shape[0]
elif isinstance(original_image, PIL.Image.Image):
image_batch_size = 1
elif isinstance(original_image, np.ndarray):
image_batch_size = original_image.shape[0]
else:
assert False
if batch_size != image_batch_size:
raise ValueError(
f"original_image batch size: {image_batch_size} must be same as prompt batch size {batch_size}"
)

def check_inputs(
self,
prompt,
image,
original_image,
mask_image,
batch_size,
callback_steps,
negative_prompt=None,
prompt_embeds=None,
negative_prompt_embeds=None,
):
if (callback_steps is None) or (
callback_steps is not None and (not isinstance(callback_steps, int) or callback_steps <= 0)
):
raise ValueError(
f"`callback_steps` has to be a positive integer but is {callback_steps} of type"
f" {type(callback_steps)}."
)
if prompt is not None and prompt_embeds is not None:
raise ValueError(
f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
" only forward one of the two."
)
elif prompt is None and prompt_embeds is None:
raise ValueError(
"Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
)
elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
if negative_prompt is not None and negative_prompt_embeds is not None:
raise ValueError(
f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
)
if prompt_embeds is not None and negative_prompt_embeds is not None:
if prompt_embeds.shape != negative_prompt_embeds.shape:
raise ValueError(
"`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
f" {negative_prompt_embeds.shape}."
)
# image
if isinstance(image, list):
check_image_type = image[0]
else:
check_image_type = image
if (
not isinstance(check_image_type, torch.Tensor)
and not isinstance(check_image_type, PIL.Image.Image)
and not isinstance(check_image_type, np.ndarray)
):
raise ValueError(
"`image` has to be of type `torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, or list[...] but is"
f" {type(check_image_type)}"
)
if isinstance(image, list):
image_batch_size = len(image)
elif isinstance(image, torch.Tensor):
image_batch_size = image.shape[0]
elif isinstance(image, PIL.Image.Image):
image_batch_size = 1
elif isinstance(image, np.ndarray):
image_batch_size = image.shape[0]
else:
assert False
if batch_size != image_batch_size:
raise ValueError(f"image batch size: {image_batch_size} must be same as prompt batch size {batch_size}")
# original_image
if isinstance(original_image, list):
check_image_type = original_image[0]
else:
check_image_type = original_image
if (
not isinstance(check_image_type, torch.Tensor)
and not isinstance(check_image_type, PIL.Image.Image)
and not isinstance(check_image_type, np.ndarray)
):
raise ValueError(
"`original_image` has to be of type `torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, or list[...] but is"
f" {type(check_image_type)}"
)
if isinstance(original_image, list):
image_batch_size = len(original_image)
elif isinstance(original_image, torch.Tensor):
image_batch_size = original_image.shape[0]
elif isinstance(original_image, PIL.Image.Image):
image_batch_size = 1
elif isinstance(original_image, np.ndarray):
image_batch_size = original_image.shape[0]
else:
assert False
if batch_size != image_batch_size:
raise ValueError(
f"original_image batch size: {image_batch_size} must be same as prompt batch size {batch_size}"
)
# mask_image
if isinstance(mask_image, list):
check_image_type = mask_image[0]
else:
check_image_type = mask_image
if (
not isinstance(check_image_type, torch.Tensor)
and not isinstance(check_image_type, PIL.Image.Image)
and not isinstance(check_image_type, np.ndarray)
):
raise ValueError(
"`mask_image` has to be of type `torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, or list[...] but is"
f" {type(check_image_type)}"
)
if isinstance(mask_image, list):
image_batch_size = len(mask_image)
elif isinstance(mask_image, torch.Tensor):
image_batch_size = mask_image.shape[0]
elif isinstance(mask_image, PIL.Image.Image):
image_batch_size = 1
elif isinstance(mask_image, np.ndarray):
image_batch_size = mask_image.shape[0]
else:
assert False
if image_batch_size != 1 and batch_size != image_batch_size:

Problem:
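The docs say strength must be between 0 and 1, but the four img2img/inpainting variants never validate it. In get_timesteps(), a negative value can push t_start to or past the end of the schedule, so the timestep slice is empty and the pipeline silently produces empty outputs. Values above 1 are clamped by the min(...) in the same function and behave like 1.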

Impact:
Invalid user input produces surprising generation behavior instead of a clear ValueError.
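
The timestep arithmetic can be traced in isolation. This is a minimal sketch that copies the two lines of math from the quoted `get_timesteps()`, assuming `scheduler.order == 1` and using a plain list as a stand-in for `scheduler.timesteps`.

```python
def get_timesteps(timesteps, num_inference_steps, strength):
    # Same arithmetic as the quoted method, with scheduler.order assumed to be 1.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return timesteps[t_start:], num_inference_steps - t_start


schedule = list(range(9, -1, -1))  # stand-in for a 10-step scheduler.timesteps

for strength in (0.5, 1.0, 1.5, -0.1):
    kept, steps = get_timesteps(schedule, 10, strength)
    print(strength, len(kept), steps)
# 0.5 5 5
# 1.0 10 10
# 1.5 10 10
# -0.1 0 -1
```

A strength of 1.5 is indistinguishable from 1.0, and -0.1 leaves no timesteps and a negative step count, which is why an explicit range check gives a clearer failure.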

Reproduction:

import torch
from diffusers import DDPMScheduler, IFImg2ImgPipeline, UNet2DConditionModel

unet = UNet2DConditionModel(
    sample_size=8, in_channels=3, out_channels=6, layers_per_block=1,
    block_out_channels=(8,), down_block_types=("CrossAttnDownBlock2D",),
    up_block_types=("CrossAttnUpBlock2D",), cross_attention_dim=4,
    attention_head_dim=4, norm_num_groups=1,
)
pipe = IFImg2ImgPipeline(None, None, unet, DDPMScheduler(num_train_timesteps=10, variance_type="learned_range"), None, None, None, False)

embeds = torch.zeros(1, 77, 4)
image = torch.zeros(1, 3, 8, 8)
out = pipe(prompt_embeds=embeds, negative_prompt_embeds=embeds, image=image, strength=-0.1, num_inference_steps=2, output_type="pt")
print(out.images.shape)  # torch.Size([0, 3, 8, 8])

Relevant precedent:

def check_inputs(
    self,
    prompt,
    strength,
    callback_steps,
    negative_prompt=None,
    prompt_embeds=None,
    negative_prompt_embeds=None,
    ip_adapter_image=None,
    ip_adapter_image_embeds=None,
    callback_on_step_end_tensor_inputs=None,
):
    if strength < 0 or strength > 1:
        raise ValueError(f"The value of strength should in [0.0, 1.0] but is {strength}")

Suggested fix:

# In each affected check_inputs(), accept `strength` as a parameter and validate it up front:
if strength < 0 or strength > 1:
    raise ValueError(f"The value of `strength` should be in [0.0, 1.0], but is {strength}")

Issue 4: IFPipelineOutput is hidden behind torch+transformers lazy-import guards

Affected code:

]
}
try:
    if not (is_transformers_available() and is_torch_available()):
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    from ...utils import dummy_torch_and_transformers_objects  # noqa F403

    _dummy_objects.update(get_objects_from_module(dummy_torch_and_transformers_objects))
else:
    _import_structure["pipeline_if"] = ["IFPipeline"]
    _import_structure["pipeline_if_img2img"] = ["IFImg2ImgPipeline"]
    _import_structure["pipeline_if_img2img_superresolution"] = ["IFImg2ImgSuperResolutionPipeline"]
    _import_structure["pipeline_if_inpainting"] = ["IFInpaintingPipeline"]
    _import_structure["pipeline_if_inpainting_superresolution"] = ["IFInpaintingSuperResolutionPipeline"]
    _import_structure["pipeline_if_superresolution"] = ["IFSuperResolutionPipeline"]
    _import_structure["pipeline_output"] = ["IFPipelineOutput"]
    _import_structure["safety_checker"] = ["IFSafetyChecker"]
    _import_structure["watermark"] = ["IFWatermarker"]

Problem:
IFPipelineOutput is a lightweight dataclass that only needs NumPy/PIL/BaseOutput, but it is added to _import_structure only when both torch and transformers are available.

Impact:
In dependency-light environments, users cannot import an output type that does not require the missing dependency.

Reproduction:

import importlib
import sys
import diffusers.utils.import_utils as iu

iu._transformers_available = False
for name in list(sys.modules):
    if name.startswith("diffusers.pipelines.deepfloyd_if"):
        del sys.modules[name]

module = importlib.import_module("diffusers.pipelines.deepfloyd_if")
print(hasattr(module, "IFPipelineOutput"))  # False
from diffusers.pipelines.deepfloyd_if import IFPipelineOutput  # ImportError

Relevant precedent:
The timesteps constants in the same __init__.py are already exported outside the torch+transformers guard.

Suggested fix:

_import_structure = {
    "timesteps": [...],
    "pipeline_output": ["IFPipelineOutput"],
}
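
The effect of moving the entry can be illustrated with a small standalone sketch. `build_import_structure` is a hypothetical function written for this illustration; the real `__init__.py` builds `_import_structure` at module level. The `timesteps` list is left empty here because the constant names are elided in the suggested fix above, and only a few of the guarded entries are shown.

```python
def build_import_structure(torch_available: bool, transformers_available: bool) -> dict[str, list[str]]:
    """Hypothetical model of the lazy-import table after the proposed change."""
    structure = {
        "timesteps": [],  # constant names elided, as in the suggested fix
        "pipeline_output": ["IFPipelineOutput"],  # registered unconditionally
    }
    # Only the heavy pipeline classes stay behind the dependency guard.
    if torch_available and transformers_available:
        structure["pipeline_if"] = ["IFPipeline"]
        structure["safety_checker"] = ["IFSafetyChecker"]
        structure["watermark"] = ["IFWatermarker"]
    return structure


light = build_import_structure(torch_available=True, transformers_available=False)
full = build_import_structure(torch_available=True, transformers_available=True)

print("pipeline_output" in light, "pipeline_if" in light)
print("pipeline_output" in full, "pipeline_if" in full)
```

If the package also has a type-checking or slow-import branch that mirrors `_import_structure`, the matching `from .pipeline_output import IFPipelineOutput` would presumably need to move outside the guard there as well.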

Issue 5: encode_prompt() detaches gradients in all IF pipelines

Affected code:

@torch.no_grad()
def encode_prompt(
    self,
    prompt: str | list[str],
    do_classifier_free_guidance: bool = True,
    num_images_per_prompt: int = 1,
    device: torch.device | None = None,
    negative_prompt: str | list[str] | None = None,
    prompt_embeds: torch.Tensor | None = None,
    negative_prompt_embeds: torch.Tensor | None = None,
    clean_caption: bool = False,
):
    r"""
    Encodes the prompt into text encoder hidden states.
    Args:
        prompt (`str` or `list[str]`, *optional*):
            prompt to be encoded
        do_classifier_free_guidance (`bool`, *optional*, defaults to `True`):
            whether to use classifier free guidance or not
        num_images_per_prompt (`int`, *optional*, defaults to 1):
            number of images that should be generated per prompt
        device: (`torch.device`, *optional*):
            torch device to place the resulting embeddings on
        negative_prompt (`str` or `list[str]`, *optional*):
            The prompt or prompts not to guide the image generation. If not defined, one has to pass
            `negative_prompt_embeds`. instead. If not defined, one has to pass `negative_prompt_embeds`. instead.
            Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
        prompt_embeds (`torch.Tensor`, *optional*):
            Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
            provided, text embeddings will be generated from `prompt` input argument.
        negative_prompt_embeds (`torch.Tensor`, *optional*):
            Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
            weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
            argument.
        clean_caption (bool, defaults to `False`):
            If `True`, the function will preprocess and clean the provided caption before encoding.
    """
    if prompt is not None and negative_prompt is not None:
        if type(prompt) is not type(negative_prompt):
            raise TypeError(
                f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !="
                f" {type(prompt)}."
            )
    if device is None:
        device = self._execution_device
    if prompt is not None and isinstance(prompt, str):
        batch_size = 1
    elif prompt is not None and isinstance(prompt, list):
        batch_size = len(prompt)
    else:
        batch_size = prompt_embeds.shape[0]
    # while T5 can handle much longer input sequences than 77, the text encoder was trained with a max length of 77 for IF
    max_length = 77
    if prompt_embeds is None:
        prompt = self._text_preprocessing(prompt, clean_caption=clean_caption)
        text_inputs = self.tokenizer(
            prompt,
            padding="max_length",
            max_length=max_length,
            truncation=True,
            add_special_tokens=True,
            return_tensors="pt",
        )
        text_input_ids = text_inputs.input_ids
        untruncated_ids = self.tokenizer(prompt, padding="longest", return_tensors="pt").input_ids
        if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not torch.equal(
            text_input_ids, untruncated_ids
        ):
            removed_text = self.tokenizer.batch_decode(untruncated_ids[:, max_length - 1 : -1])
            logger.warning(
                "The following part of your input was truncated because CLIP can only handle sequences up to"
                f" {max_length} tokens: {removed_text}"
            )
        attention_mask = text_inputs.attention_mask.to(device)
        prompt_embeds = self.text_encoder(
            text_input_ids.to(device),
            attention_mask=attention_mask,
        )
        prompt_embeds = prompt_embeds[0]
    if self.text_encoder is not None:
        dtype = self.text_encoder.dtype
    elif self.unet is not None:
        dtype = self.unet.dtype
    else:
        dtype = None
    prompt_embeds = prompt_embeds.to(dtype=dtype, device=device)
    bs_embed, seq_len, _ = prompt_embeds.shape
    # duplicate text embeddings for each generation per prompt, using mps friendly method
    prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
    prompt_embeds = prompt_embeds.view(bs_embed * num_images_per_prompt, seq_len, -1)
    # get unconditional embeddings for classifier free guidance
    if do_classifier_free_guidance and negative_prompt_embeds is None:
        uncond_tokens: list[str]
        if negative_prompt is None:
            uncond_tokens = [""] * batch_size
        elif isinstance(negative_prompt, str):
            uncond_tokens = [negative_prompt]
        elif batch_size != len(negative_prompt):
            raise ValueError(
                f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
                f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
                " the batch size of `prompt`."
            )
        else:
            uncond_tokens = negative_prompt
        uncond_tokens = self._text_preprocessing(uncond_tokens, clean_caption=clean_caption)
        max_length = prompt_embeds.shape[1]
        uncond_input = self.tokenizer(
            uncond_tokens,
            padding="max_length",
            max_length=max_length,
            truncation=True,
            return_attention_mask=True,
            add_special_tokens=True,
            return_tensors="pt",
        )
        attention_mask = uncond_input.attention_mask.to(device)
        negative_prompt_embeds = self.text_encoder(
            uncond_input.input_ids.to(device),
            attention_mask=attention_mask,
        )
        negative_prompt_embeds = negative_prompt_embeds[0]
    if do_classifier_free_guidance:
        # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
        seq_len = negative_prompt_embeds.shape[1]
        negative_prompt_embeds = negative_prompt_embeds.to(dtype=dtype, device=device)
        negative_prompt_embeds = negative_prompt_embeds.repeat(1, num_images_per_prompt, 1)
        negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)
        # For classifier free guidance, we need to do two forward passes.
        # Here we concatenate the unconditional and text embeddings into a single batch
        # to avoid doing two forward passes
    else:
        negative_prompt_embeds = None
    return prompt_embeds, negative_prompt_embeds

@torch.no_grad()
# Copied from diffusers.pipelines.deepfloyd_if.pipeline_if.IFPipeline.encode_prompt
def encode_prompt(
    self,
    prompt: str | list[str],
    do_classifier_free_guidance: bool = True,
    num_images_per_prompt: int = 1,
    device: torch.device | None = None,
    negative_prompt: str | list[str] | None = None,
    prompt_embeds: torch.Tensor | None = None,
    negative_prompt_embeds: torch.Tensor | None = None,
    clean_caption: bool = False,
):
    r"""
    Encodes the prompt into text encoder hidden states.
    Args:
        prompt (`str` or `list[str]`, *optional*):
            prompt to be encoded
        do_classifier_free_guidance (`bool`, *optional*, defaults to `True`):
            whether to use classifier free guidance or not
        num_images_per_prompt (`int`, *optional*, defaults to 1):
            number of images that should be generated per prompt
        device: (`torch.device`, *optional*):
            torch device to place the resulting embeddings on
        negative_prompt (`str` or `list[str]`, *optional*):
            The prompt or prompts not to guide the image generation. If not defined, one has to pass
            `negative_prompt_embeds`. instead. If not defined, one has to pass `negative_prompt_embeds`. instead.
            Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
        prompt_embeds (`torch.Tensor`, *optional*):
            Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
            provided, text embeddings will be generated from `prompt` input argument.
        negative_prompt_embeds (`torch.Tensor`, *optional*):
            Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
            weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
            argument.
        clean_caption (bool, defaults to `False`):
            If `True`, the function will preprocess and clean the provided caption before encoding.
    """
    if prompt is not None and negative_prompt is not None:
        if type(prompt) is not type(negative_prompt):
            raise TypeError(
                f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !="
                f" {type(prompt)}."
            )
    if device is None:
        device = self._execution_device
    if prompt is not None and isinstance(prompt, str):
        batch_size = 1
    elif prompt is not None and isinstance(prompt, list):
        batch_size = len(prompt)
    else:
        batch_size = prompt_embeds.shape[0]
    # while T5 can handle much longer input sequences than 77, the text encoder was trained with a max length of 77 for IF
    max_length = 77
    if prompt_embeds is None:
        prompt = self._text_preprocessing(prompt, clean_caption=clean_caption)
        text_inputs = self.tokenizer(
            prompt,
            padding="max_length",
            max_length=max_length,
            truncation=True,
            add_special_tokens=True,
            return_tensors="pt",
        )
        text_input_ids = text_inputs.input_ids
        untruncated_ids = self.tokenizer(prompt, padding="longest", return_tensors="pt").input_ids
        if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not torch.equal(
            text_input_ids, untruncated_ids
        ):
            removed_text = self.tokenizer.batch_decode(untruncated_ids[:, max_length - 1 : -1])
            logger.warning(
                "The following part of your input was truncated because CLIP can only handle sequences up to"
                f" {max_length} tokens: {removed_text}"
            )
        attention_mask = text_inputs.attention_mask.to(device)
        prompt_embeds = self.text_encoder(
            text_input_ids.to(device),
            attention_mask=attention_mask,
        )
        prompt_embeds = prompt_embeds[0]
    if self.text_encoder is not None:
        dtype = self.text_encoder.dtype
    elif self.unet is not None:
        dtype = self.unet.dtype
    else:
        dtype = None
    prompt_embeds = prompt_embeds.to(dtype=dtype, device=device)
    bs_embed, seq_len, _ = prompt_embeds.shape
    # duplicate text embeddings for each generation per prompt, using mps friendly method
    prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
    prompt_embeds = prompt_embeds.view(bs_embed * num_images_per_prompt, seq_len, -1)
    # get unconditional embeddings for classifier free guidance
    if do_classifier_free_guidance and negative_prompt_embeds is None:
        uncond_tokens: list[str]
        if negative_prompt is None:
            uncond_tokens = [""] * batch_size
        elif isinstance(negative_prompt, str):
            uncond_tokens = [negative_prompt]
        elif batch_size != len(negative_prompt):
            raise ValueError(
                f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
                f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
                " the batch size of `prompt`."
            )
        else:
            uncond_tokens = negative_prompt
        uncond_tokens = self._text_preprocessing(uncond_tokens, clean_caption=clean_caption)
        max_length = prompt_embeds.shape[1]
        uncond_input = self.tokenizer(
            uncond_tokens,
            padding="max_length",
            max_length=max_length,
            truncation=True,
            return_attention_mask=True,
            add_special_tokens=True,
            return_tensors="pt",
        )
        attention_mask = uncond_input.attention_mask.to(device)
        negative_prompt_embeds = self.text_encoder(
            uncond_input.input_ids.to(device),
            attention_mask=attention_mask,
        )
        negative_prompt_embeds = negative_prompt_embeds[0]
    if do_classifier_free_guidance:
        # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
        seq_len = negative_prompt_embeds.shape[1]
        negative_prompt_embeds = negative_prompt_embeds.to(dtype=dtype, device=device)
        negative_prompt_embeds = negative_prompt_embeds.repeat(1, num_images_per_prompt, 1)
        negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)
        # For classifier free guidance, we need to do two forward passes.
        # Here we concatenate the unconditional and text embeddings into a single batch
        # to avoid doing two forward passes
    else:
        negative_prompt_embeds = None
    return prompt_embeds, negative_prompt_embeds

Problem:
encode_prompt() is decorated with @torch.no_grad() across the copied IF variants. __call__() is already no-grad, so the helper-level decorator prevents advanced callers from using encode_prompt() with gradients enabled.

Impact:
Prompt-embedding optimization and training-style workflows cannot reuse the public helper without silently detaching tensors.

Reproduction:

import torch
from diffusers import DDPMScheduler, IFPipeline, UNet2DConditionModel

unet = UNet2DConditionModel(
    sample_size=8, in_channels=3, out_channels=6, layers_per_block=1,
    block_out_channels=(8,), down_block_types=("CrossAttnDownBlock2D",),
    up_block_types=("CrossAttnUpBlock2D",), cross_attention_dim=4,
    attention_head_dim=4, norm_num_groups=1,
)
pipe = IFPipeline(None, None, unet, DDPMScheduler(num_train_timesteps=10, variance_type="learned_range"), None, None, None, False)

x = torch.randn(1, 77, 4, requires_grad=True)
prompt_embeds, _ = pipe.encode_prompt(prompt=None, do_classifier_free_guidance=False, prompt_embeds=x, num_images_per_prompt=2)
print(x.requires_grad, prompt_embeds.requires_grad)  # True False

Relevant precedent:


Suggested fix:

# Remove @torch.no_grad() from encode_prompt() in all IF pipeline copies.
# Keep @torch.no_grad() on __call__().
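
The difference the decorator makes can be shown without a pipeline. The sketch below assumes only that PyTorch is installed; `duplicate_embeds` is a hypothetical stand-in for the repeat/view step inside `encode_prompt()`, not the diffusers implementation. It runs the same helper with and without `torch.no_grad()` and checks whether gradients reach the input.

```python
import torch


def duplicate_embeds(embeds: torch.Tensor, num_images_per_prompt: int) -> torch.Tensor:
    """Hypothetical stand-in for the repeat/view step inside encode_prompt()."""
    batch, seq_len, _ = embeds.shape
    embeds = embeds.repeat(1, num_images_per_prompt, 1)
    return embeds.view(batch * num_images_per_prompt, seq_len, -1)


# The same helper wrapped in no_grad, mirroring the current decorator.
duplicate_embeds_no_grad = torch.no_grad()(duplicate_embeds)

x = torch.randn(1, 77, 4, requires_grad=True)
detached = duplicate_embeds_no_grad(x, 2)  # autograd graph is not recorded
tracked = duplicate_embeds(x, 2)           # autograd graph is recorded

print(detached.requires_grad)
print(tracked.requires_grad)

# Backpropagating through the undecorated helper populates x.grad.
tracked.sum().backward()
print(x.grad.shape)
```

Because `__call__()` keeps its own `@torch.no_grad()`, ordinary inference would presumably be unaffected by removing the helper-level decorator; only direct callers of `encode_prompt()` would see the change.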

Issue 6: T5FilmDecoder has no direct fast or slow tests

Affected code:

class T5FilmDecoder(ModelMixin, ConfigMixin):
    r"""
    T5 style decoder with FiLM conditioning.
    Args:
        input_dims (`int`, *optional*, defaults to `128`):
            The number of input dimensions.
        targets_length (`int`, *optional*, defaults to `256`):
            The length of the targets.
        d_model (`int`, *optional*, defaults to `768`):
            Size of the input hidden states.
        num_layers (`int`, *optional*, defaults to `12`):
            The number of `DecoderLayer`'s to use.
        num_heads (`int`, *optional*, defaults to `12`):
            The number of attention heads to use.
        d_kv (`int`, *optional*, defaults to `64`):
            Size of the key-value projection vectors.
        d_ff (`int`, *optional*, defaults to `2048`):
            The number of dimensions in the intermediate feed-forward layer of `DecoderLayer`'s.
        dropout_rate (`float`, *optional*, defaults to `0.1`):
            Dropout probability.
    """

    @register_to_config
    def __init__(
        self,
        input_dims: int = 128,
        targets_length: int = 256,
        max_decoder_noise_time: float = 2000.0,
        d_model: int = 768,
        num_layers: int = 12,
        num_heads: int = 12,
        d_kv: int = 64,
        d_ff: int = 2048,
        dropout_rate: float = 0.1,
    ):
        super().__init__()
        self.conditioning_emb = nn.Sequential(
            nn.Linear(d_model, d_model * 4, bias=False),
            nn.SiLU(),
            nn.Linear(d_model * 4, d_model * 4, bias=False),
            nn.SiLU(),
        )
        self.position_encoding = nn.Embedding(targets_length, d_model)
        self.position_encoding.weight.requires_grad = False
        self.continuous_inputs_projection = nn.Linear(input_dims, d_model, bias=False)
        self.dropout = nn.Dropout(p=dropout_rate)
        self.decoders = nn.ModuleList()
        for lyr_num in range(num_layers):
            # FiLM conditional T5 decoder
            lyr = DecoderLayer(d_model=d_model, d_kv=d_kv, num_heads=num_heads, d_ff=d_ff, dropout_rate=dropout_rate)
            self.decoders.append(lyr)
        self.decoder_norm = T5LayerNorm(d_model)
        self.post_dropout = nn.Dropout(p=dropout_rate)
        self.spec_out = nn.Linear(d_model, input_dims, bias=False)

    def encoder_decoder_mask(self, query_input: torch.Tensor, key_input: torch.Tensor) -> torch.Tensor:
        mask = torch.mul(query_input.unsqueeze(-1), key_input.unsqueeze(-2))
        return mask.unsqueeze(-3)

    def forward(self, encodings_and_masks, decoder_input_tokens, decoder_noise_time):
        batch, _, _ = decoder_input_tokens.shape
        assert decoder_noise_time.shape == (batch,)
        # decoder_noise_time is in [0, 1), so rescale to expected timing range.
        time_steps = get_timestep_embedding(
            decoder_noise_time * self.config.max_decoder_noise_time,
            embedding_dim=self.config.d_model,
            max_period=self.config.max_decoder_noise_time,
        ).to(dtype=self.dtype)
        conditioning_emb = self.conditioning_emb(time_steps).unsqueeze(1)
        assert conditioning_emb.shape == (batch, 1, self.config.d_model * 4)
        seq_length = decoder_input_tokens.shape[1]
        # If we want to use relative positions for audio context, we can just offset
        # this sequence by the length of encodings_and_masks.
        decoder_positions = torch.broadcast_to(
            torch.arange(seq_length, device=decoder_input_tokens.device),
            (batch, seq_length),
        )
        position_encodings = self.position_encoding(decoder_positions)
        inputs = self.continuous_inputs_projection(decoder_input_tokens)
        inputs += position_encodings
        y = self.dropout(inputs)
        # decoder: No padding present.
        decoder_mask = torch.ones(
            decoder_input_tokens.shape[:2], device=decoder_input_tokens.device, dtype=inputs.dtype
        )
        # Translate encoding masks to encoder-decoder masks.
        encodings_and_encdec_masks = [(x, self.encoder_decoder_mask(decoder_mask, y)) for x, y in encodings_and_masks]
        # cross attend style: concat encodings
        encoded = torch.cat([x[0] for x in encodings_and_encdec_masks], dim=1)
        encoder_decoder_mask = torch.cat([x[1] for x in encodings_and_encdec_masks], dim=-1)
        for lyr in self.decoders:
            y = lyr(
                y,
                conditioning_emb=conditioning_emb,
                encoder_hidden_states=encoded,
                encoder_attention_mask=encoder_decoder_mask,
            )[0]
        y = self.decoder_norm(y)
        y = self.post_dropout(y)
        spec_out = self.spec_out(y)
        return spec_out

Problem:
No tests under tests/ mention T5FilmDecoder. This leaves config serialization, save/load, forward shape, dtype behavior, and attention processor behavior unexercised for the model file in scope.

Impact:
Regressions in the model can land without fast model-test coverage or slow checkpoint smoke coverage.

Reproduction:

from pathlib import Path

hits = [str(p) for p in Path("tests").rglob("*.py") if "T5FilmDecoder" in p.read_text(encoding="utf-8")]
print(hits)  # []
assert hits, "No tests mention T5FilmDecoder"

Relevant precedent:
Other transformer model families have direct tests under tests/models/transformers/.

Suggested fix:
Add a small tests/models/transformers/test_models_t5_film_transformer.py covering tiny config construction, forward pass, save/load, and dtype/device movement. Add slow coverage only if there is a maintained pretrained T5FilmDecoder checkpoint to smoke-test.
