# `deepfloyd_if` model/pipeline review

Commit tested: `0f1abc4ae8b0eb2a3b40e82a310507281144c423`

Review performed against the repository review rules. Note: `.ai/review-rules.md` references `AGENTS.md`, but that file is absent in this checkout; the remaining referenced rule files were applied.

Duplicate search status: searched GitHub Issues and PRs for `deepfloyd_if`, affected class/function names, NumPy image/mask failures, `IFPipelineOutput` import behavior, strength validation, and `T5FilmDecoder` coverage. I found no likely duplicates.

## Issue 1: Unbatched NumPy image inputs are rejected as batch-size mismatches

Affected code:
https://github.com/huggingface/diffusers/blob/0f1abc4ae8b0eb2a3b40e82a310507281144c423/src/diffusers/pipelines/deepfloyd_if/pipeline_if_img2img.py#L422-L449
https://github.com/huggingface/diffusers/blob/0f1abc4ae8b0eb2a3b40e82a310507281144c423/src/diffusers/pipelines/deepfloyd_if/pipeline_if_superresolution.py#L539-L566
https://github.com/huggingface/diffusers/blob/0f1abc4ae8b0eb2a3b40e82a310507281144c423/src/diffusers/pipelines/deepfloyd_if/pipeline_if_inpainting.py#L427-L489
https://github.com/huggingface/diffusers/blob/0f1abc4ae8b0eb2a3b40e82a310507281144c423/src/diffusers/pipelines/deepfloyd_if/pipeline_if_img2img_superresolution.py#L576-L638
https://github.com/huggingface/diffusers/blob/0f1abc4ae8b0eb2a3b40e82a310507281144c423/src/diffusers/pipelines/deepfloyd_if/pipeline_if_inpainting_superresolution.py#L576-L671

Problem:
The validators treat `image.shape[0]` as the batch size for every NumPy input. A single image shaped `(H, W, C)` is therefore read as a batch of `H` images, even though the preprocessors can handle a single HWC array.

Impact:
Users passing valid single NumPy images get a misleading batch-size error. Fast tests only cover tensor batches and slow tests use PIL images, so this path is untested.

Reproduction:
```python
import numpy as np
import torch
from diffusers import DDPMScheduler, IFImg2ImgPipeline, UNet2DConditionModel

unet = UNet2DConditionModel(
    sample_size=8, in_channels=3, out_channels=6, layers_per_block=1,
    block_out_channels=(8,), down_block_types=("CrossAttnDownBlock2D",),
    up_block_types=("CrossAttnUpBlock2D",), cross_attention_dim=4,
    attention_head_dim=4, norm_num_groups=1,
)
pipe = IFImg2ImgPipeline(None, None, unet, DDPMScheduler(num_train_timesteps=10, variance_type="learned_range"), None, None, None, False)

pipe.check_inputs(
    prompt=None,
    image=np.zeros((8, 8, 3), dtype=np.float32),
    batch_size=1,
    callback_steps=1,
    prompt_embeds=torch.zeros(1, 77, 4),
    negative_prompt_embeds=torch.zeros(1, 77, 4),
)
# Raises ValueError: image batch size: 8 must be same as prompt batch size 1
```

Relevant precedent:
The local preprocessors already wrap non-list NumPy images and would handle HWC as a single image if validation allowed it.

Suggested fix:
```python
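import numpy as np
import PIL.Image
import torch


def _image_batch_size(image):
    """Return the batch size implied by `image`, treating non-4D arrays as a single image."""
    if isinstance(image, list):
        return len(image)
    if isinstance(image, PIL.Image.Image):
        return 1
    if isinstance(image, np.ndarray):
        # Only a 4D array carries an explicit batch axis.
        return image.shape[0] if image.ndim == 4 else 1
    if isinstance(image, torch.Tensor):
        return image.shape[0] if image.ndim == 4 else 1
    raise TypeError(type(image))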
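
To make the difference between the two rules concrete, here is a minimal shape-only sketch. It uses plain tuples in place of `np.ndarray.shape`, and the helper names `batch_size_current` and `batch_size_proposed` are hypothetical; they are not part of diffusers.

```python
def batch_size_current(shape):
    """Rule used by the validators at the tested commit: axis 0 is always the batch."""
    return shape[0]


def batch_size_proposed(shape):
    """Rule from the suggested fix: only 4D arrays carry a batch axis."""
    return shape[0] if len(shape) == 4 else 1


# Shapes are plain tuples standing in for `np.ndarray.shape`.
cases = {
    "single HWC image": (8, 8, 3),
    "batch of two NHWC images": (2, 8, 8, 3),
}

for label, shape in cases.items():
    print(f"{label}: current={batch_size_current(shape)}, proposed={batch_size_proposed(shape)}")
```

Both rules agree on the 4D batch; they only diverge on the single HWC image, which is the case the reproduction above exercises.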

## Issue 2: Batched NumPy masks are converted to 5D tensors

Affected code:
https://github.com/huggingface/diffusers/blob/0f1abc4ae8b0eb2a3b40e82a310507281144c423/src/diffusers/pipelines/deepfloyd_if/pipeline_if_inpainting.py#L668-L713
https://github.com/huggingface/diffusers/blob/0f1abc4ae8b0eb2a3b40e82a310507281144c423/src/diffusers/pipelines/deepfloyd_if/pipeline_if_inpainting_superresolution.py#L745-L791

Problem:
`preprocess_mask_image()` wraps a non-list NumPy mask in a list, then blindly applies `m[None, None, :]`. For a batched mask shaped `(B, H, W)`, this returns `(1, 1, B, H, W)` instead of `(B, 1, H, W)`.

Impact:
Batched NumPy masks either fail later during broadcasting or apply the mask with the wrong shape. This is not covered by the current tests.

Reproduction:
```python
import numpy as np
from diffusers import DDPMScheduler, IFInpaintingPipeline, UNet2DConditionModel

unet = UNet2DConditionModel(
    sample_size=8, in_channels=3, out_channels=6, layers_per_block=1,
    block_out_channels=(8,), down_block_types=("CrossAttnDownBlock2D",),
    up_block_types=("CrossAttnUpBlock2D",), cross_attention_dim=4,
    attention_head_dim=4, norm_num_groups=1,
)
pipe = IFInpaintingPipeline(None, None, unet, DDPMScheduler(num_train_timesteps=10, variance_type="learned_range"), None, None, None, False)

mask = np.zeros((2, 8, 8), dtype=np.float32)
print(pipe.preprocess_mask_image(mask).shape)  # torch.Size([1, 1, 2, 8, 8])
```

Relevant precedent:
Tensor masks in the same method distinguish 2D single masks from 3D batched masks before adding channel dimensions.

Suggested fix:
```python
elif isinstance(mask_image[0], np.ndarray):
    mask_image = np.stack(mask_image, axis=0) if len(mask_image) > 1 else mask_image[0]

    if mask_image.ndim == 2:
        mask_image = mask_image[None, None, :, :]
    elif mask_image.ndim == 3:
        mask_image = mask_image[:, None, :, :]
    elif mask_image.ndim == 4 and mask_image.shape[-1] == 1:
        mask_image = mask_image.transpose(0, 3, 1, 2)
    else:
        raise ValueError(f"Unsupported mask shape: {mask_image.shape}")

    mask_image = (mask_image >= 0.5).astype(np.float32)
    mask_image = torch.from_numpy(mask_image)
```
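
The shape bookkeeping behind this fix can be checked without NumPy. The sketch below models only the resulting shapes, under the assumption that `m[None, None, :]` prepends two length-1 axes and that `transpose(0, 3, 1, 2)` moves a trailing channel axis to position 1. The function names are hypothetical.

```python
def shape_current_path(mask_shape):
    """Shape after `m[None, None, :]`: two new leading axes, whatever the input rank."""
    return (1, 1) + tuple(mask_shape)


def shape_suggested_path(mask_shape):
    """Shape after the rank-aware branches in the suggested fix."""
    mask_shape = tuple(mask_shape)
    if len(mask_shape) == 2:  # (H, W) -> (1, 1, H, W)
        return (1, 1) + mask_shape
    if len(mask_shape) == 3:  # (B, H, W) -> (B, 1, H, W)
        return (mask_shape[0], 1) + mask_shape[1:]
    if len(mask_shape) == 4 and mask_shape[-1] == 1:  # (B, H, W, 1) -> (B, 1, H, W)
        return (mask_shape[0], 1) + mask_shape[1:3]
    raise ValueError(f"Unsupported mask shape: {mask_shape}")


batched = (2, 8, 8)
print(shape_current_path(batched))    # 5D result described in the problem statement
print(shape_suggested_path(batched))  # 4D result with an explicit channel axis
```

A single `(H, W)` mask gives the same result on both paths, so the change only affects batched inputs.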

## Issue 3: `strength` is documented as constrained but invalid values are accepted

Affected code:
https://github.com/huggingface/diffusers/blob/0f1abc4ae8b0eb2a3b40e82a310507281144c423/src/diffusers/pipelines/deepfloyd_if/pipeline_if_img2img.py#L378-L449
https://github.com/huggingface/diffusers/blob/0f1abc4ae8b0eb2a3b40e82a310507281144c423/src/diffusers/pipelines/deepfloyd_if/pipeline_if_img2img.py#L627-L637
https://github.com/huggingface/diffusers/blob/0f1abc4ae8b0eb2a3b40e82a310507281144c423/src/diffusers/pipelines/deepfloyd_if/pipeline_if_inpainting.py#L382-L489
https://github.com/huggingface/diffusers/blob/0f1abc4ae8b0eb2a3b40e82a310507281144c423/src/diffusers/pipelines/deepfloyd_if/pipeline_if_img2img_superresolution.py#L531-L638
https://github.com/huggingface/diffusers/blob/0f1abc4ae8b0eb2a3b40e82a310507281144c423/src/diffusers/pipelines/deepfloyd_if/pipeline_if_inpainting_superresolution.py#L533-L671

Problem:
The docs say `strength` must be between 0 and 1, but the four img2img/inpainting variants never validate it. Negative values can silently produce empty outputs, and values above 1 are clamped by timestep math and behave like 1.

Impact:
Invalid user input produces surprising generation behavior instead of a clear `ValueError`.

Reproduction:
```python
import torch
from diffusers import DDPMScheduler, IFImg2ImgPipeline, UNet2DConditionModel

unet = UNet2DConditionModel(
    sample_size=8, in_channels=3, out_channels=6, layers_per_block=1,
    block_out_channels=(8,), down_block_types=("CrossAttnDownBlock2D",),
    up_block_types=("CrossAttnUpBlock2D",), cross_attention_dim=4,
    attention_head_dim=4, norm_num_groups=1,
)
pipe = IFImg2ImgPipeline(None, None, unet, DDPMScheduler(num_train_timesteps=10, variance_type="learned_range"), None, None, None, False)

embeds = torch.zeros(1, 77, 4)
image = torch.zeros(1, 3, 8, 8)
out = pipe(prompt_embeds=embeds, negative_prompt_embeds=embeds, image=image, strength=-0.1, num_inference_steps=2, output_type="pt")
print(out.images.shape)  # torch.Size([0, 3, 8, 8])
```

Relevant precedent:
https://github.com/huggingface/diffusers/blob/0f1abc4ae8b0eb2a3b40e82a310507281144c423/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py#L656-L670

Suggested fix:
```python
def check_inputs(..., strength, ...):
    if strength < 0 or strength > 1:
        raise ValueError(f"The value of `strength` should be in [0.0, 1.0], but is {strength}")
```
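
The arithmetic that makes out-of-range values misbehave can be traced with a small standalone sketch. It mirrors the `init_timestep` / `t_start` computation from `get_timesteps()` and assumes a scheduler order of 1; `timestep_window` and `validate_strength` are hypothetical names used only for illustration.

```python
def timestep_window(num_inference_steps, strength):
    """Return (t_start, remaining_steps) using the same arithmetic as get_timesteps()."""
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return t_start, num_inference_steps - t_start


def validate_strength(strength):
    """Check proposed in the suggested fix: reject values outside [0, 1]."""
    if strength < 0 or strength > 1:
        raise ValueError(f"The value of `strength` should be in [0.0, 1.0], but is {strength}")


# strength=-0.1 with 2 steps: int(-0.2) truncates to 0, so t_start == 2 and no steps remain.
print(timestep_window(2, -0.1))
# strength=1.5 is capped by min(), so it yields the same window as strength=1.0.
print(timestep_window(10, 1.5), timestep_window(10, 1.0))
```

Calling `validate_strength()` before this computation turns both cases into an immediate `ValueError` instead of an empty or silently capped schedule.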

## Issue 4: `IFPipelineOutput` is hidden behind torch+transformers lazy-import guards

Affected code:
https://github.com/huggingface/diffusers/blob/0f1abc4ae8b0eb2a3b40e82a310507281144c423/src/diffusers/pipelines/deepfloyd_if/__init__.py#L24-L43

Problem:
`IFPipelineOutput` is a lightweight dataclass that only needs NumPy/PIL/BaseOutput, but it is added to `_import_structure` only when both torch and transformers are available.

Impact:
In dependency-light environments, users cannot import an output type that does not require the missing dependency.

Reproduction:
```python
import importlib
import sys
import diffusers.utils.import_utils as iu

iu._transformers_available = False
for name in list(sys.modules):
    if name.startswith("diffusers.pipelines.deepfloyd_if"):
        del sys.modules[name]

module = importlib.import_module("diffusers.pipelines.deepfloyd_if")
print(hasattr(module, "IFPipelineOutput"))  # False
from diffusers.pipelines.deepfloyd_if import IFPipelineOutput  # ImportError
```

Relevant precedent:
The `timesteps` constants in the same `__init__.py` are already exported outside the torch+transformers guard.

Suggested fix:
```python
_import_structure = {
    "timesteps": [...],
    "pipeline_output": ["IFPipelineOutput"],
}
```
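
As a rough illustration of the intended layout, the sketch below builds the export map from two availability flags. `build_import_structure` is a hypothetical helper, not the actual `__init__.py` code, and `EXAMPLE_TIMESTEPS` is a placeholder for the elided timestep constants.

```python
def build_import_structure(torch_available, transformers_available):
    """Return a module -> exported-names map with the output type outside the guard."""
    structure = {
        "timesteps": ["EXAMPLE_TIMESTEPS"],  # placeholder for the elided constants
        "pipeline_output": ["IFPipelineOutput"],  # always exported
    }
    if torch_available and transformers_available:
        # Only the heavy pipeline classes stay behind the dependency guard.
        structure["pipeline_if"] = ["IFPipeline"]
        structure["safety_checker"] = ["IFSafetyChecker"]
        structure["watermark"] = ["IFWatermarker"]
    return structure


light = build_import_structure(torch_available=True, transformers_available=False)
print(sorted(light))
```

With this layout the output dataclass stays importable in a dependency-light environment, while the pipelines still require both libraries.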

## Issue 5: `encode_prompt()` detaches gradients in all IF pipelines

Affected code:
https://github.com/huggingface/diffusers/blob/0f1abc4ae8b0eb2a3b40e82a310507281144c423/src/diffusers/pipelines/deepfloyd_if/pipeline_if.py#L168-L320
https://github.com/huggingface/diffusers/blob/0f1abc4ae8b0eb2a3b40e82a310507281144c423/src/diffusers/pipelines/deepfloyd_if/pipeline_if_img2img.py#L192-L333
https://github.com/huggingface/diffusers/blob/0f1abc4ae8b0eb2a3b40e82a310507281144c423/src/diffusers/pipelines/deepfloyd_if/pipeline_if_superresolution.py#L302-L455

Problem:
`encode_prompt()` is decorated with `@torch.no_grad()` across the copied IF variants. `__call__()` is already no-grad, so the helper-level decorator prevents advanced callers from using `encode_prompt()` with gradients enabled.

Impact:
Prompt-embedding optimization and training-style workflows cannot reuse the public helper without silently detaching tensors.

Reproduction:
```python
import torch
from diffusers import DDPMScheduler, IFPipeline, UNet2DConditionModel

unet = UNet2DConditionModel(
    sample_size=8, in_channels=3, out_channels=6, layers_per_block=1,
    block_out_channels=(8,), down_block_types=("CrossAttnDownBlock2D",),
    up_block_types=("CrossAttnUpBlock2D",), cross_attention_dim=4,
    attention_head_dim=4, norm_num_groups=1,
)
pipe = IFPipeline(None, None, unet, DDPMScheduler(num_train_timesteps=10, variance_type="learned_range"), None, None, None, False)

x = torch.randn(1, 77, 4, requires_grad=True)
prompt_embeds, _ = pipe.encode_prompt(prompt=None, do_classifier_free_guidance=False, prompt_embeds=x, num_images_per_prompt=2)
print(x.requires_grad, prompt_embeds.requires_grad)  # True False
```

Relevant precedent:
https://github.com/huggingface/diffusers/blob/0f1abc4ae8b0eb2a3b40e82a310507281144c423/src/diffusers/pipelines/flux/pipeline_flux.py#L311
https://github.com/huggingface/diffusers/blob/0f1abc4ae8b0eb2a3b40e82a310507281144c423/src/diffusers/pipelines/flux/pipeline_flux.py#L652

Suggested fix:
```python
# Remove @torch.no_grad() from encode_prompt() in all IF pipeline copies.
# Keep @torch.no_grad() on __call__().
```

## Issue 6: `T5FilmDecoder` has no direct fast or slow tests

Affected code:
https://github.com/huggingface/diffusers/blob/0f1abc4ae8b0eb2a3b40e82a310507281144c423/src/diffusers/models/transformers/t5_film_transformer.py#L25-L146

Problem:
No tests under `tests/` mention `T5FilmDecoder`. This leaves config serialization, save/load, forward shape, dtype behavior, and attention processor behavior unexercised for the model file in scope.

Impact:
Regressions in the model can land without fast model-test coverage or slow checkpoint smoke coverage.

Reproduction:
```python
from pathlib import Path

hits = [str(p) for p in Path("tests").rglob("*.py") if "T5FilmDecoder" in p.read_text(encoding="utf-8")]
print(hits)  # []
assert hits, "No tests mention T5FilmDecoder"
```

Relevant precedent:
Other transformer model families have direct tests under `tests/models/transformers/`.

Suggested fix:
Add a small `tests/models/transformers/test_models_t5_film_transformer.py` covering tiny config construction, forward pass, save/load, and dtype/device movement. Add slow coverage only if there is a maintained pretrained `T5FilmDecoder` checkpoint to smoke-test.


Excerpt from the `check_inputs()` validators (Issue 1, `image` batch-size check):

```python
if isinstance(image, list):
    check_image_type = image[0]
else:
    check_image_type = image

if (
    not isinstance(check_image_type, torch.Tensor)
    and not isinstance(check_image_type, PIL.Image.Image)
    and not isinstance(check_image_type, np.ndarray)
):
    raise ValueError(
        "`image` has to be of type `torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, or list[...] but is"
        f" {type(check_image_type)}"
    )

if isinstance(image, list):
    image_batch_size = len(image)
elif isinstance(image, torch.Tensor):
    image_batch_size = image.shape[0]
elif isinstance(image, PIL.Image.Image):
    image_batch_size = 1
elif isinstance(image, np.ndarray):
    image_batch_size = image.shape[0]
else:
    assert False

if batch_size != image_batch_size:
    raise ValueError(f"image batch size: {image_batch_size} must be same as prompt batch size {batch_size}")
```

Excerpt from the inpainting validators (Issue 1, `mask_image` batch-size check):

```python
# mask_image

if isinstance(mask_image, list):
    check_image_type = mask_image[0]
else:
    check_image_type = mask_image

if (
    not isinstance(check_image_type, torch.Tensor)
    and not isinstance(check_image_type, PIL.Image.Image)
    and not isinstance(check_image_type, np.ndarray)
):
    raise ValueError(
        "`mask_image` has to be of type `torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, or list[...] but is"
        f" {type(check_image_type)}"
    )

if isinstance(mask_image, list):
    image_batch_size = len(mask_image)
elif isinstance(mask_image, torch.Tensor):
    image_batch_size = mask_image.shape[0]
elif isinstance(mask_image, PIL.Image.Image):
    image_batch_size = 1
elif isinstance(mask_image, np.ndarray):
    image_batch_size = mask_image.shape[0]
else:
    assert False

if image_batch_size != 1 and batch_size != image_batch_size:
    raise ValueError(
        f"mask_image batch size: {image_batch_size} must be `1` or the same as prompt batch size {batch_size}"
    )
```

Excerpt from the super-resolution validators (Issue 1, `original_image` batch-size check):

```python
# original_image

if isinstance(original_image, list):
    check_image_type = original_image[0]
else:
    check_image_type = original_image

if (
    not isinstance(check_image_type, torch.Tensor)
    and not isinstance(check_image_type, PIL.Image.Image)
    and not isinstance(check_image_type, np.ndarray)
):
    raise ValueError(
        "`original_image` has to be of type `torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, or list[...] but is"
        f" {type(check_image_type)}"
    )

if isinstance(original_image, list):
    image_batch_size = len(original_image)
elif isinstance(original_image, torch.Tensor):
    image_batch_size = original_image.shape[0]
elif isinstance(original_image, PIL.Image.Image):
    image_batch_size = 1
elif isinstance(original_image, np.ndarray):
    image_batch_size = original_image.shape[0]
else:
    assert False

if batch_size != image_batch_size:
    raise ValueError(
        f"original_image batch size: {image_batch_size} must be same as prompt batch size {batch_size}"
    )
```

deepfloyd_if model/pipeline review #13646

Excerpt: `IFInpaintingPipeline.preprocess_mask_image()` (Issue 2):

```python
def preprocess_mask_image(self, mask_image) -> torch.Tensor:
    if not isinstance(mask_image, list):
        mask_image = [mask_image]

    if isinstance(mask_image[0], torch.Tensor):
        mask_image = torch.cat(mask_image, axis=0) if mask_image[0].ndim == 4 else torch.stack(mask_image, axis=0)

        if mask_image.ndim == 2:
            # Batch and add channel dim for single mask
            mask_image = mask_image.unsqueeze(0).unsqueeze(0)
        elif mask_image.ndim == 3 and mask_image.shape[0] == 1:
            # Single mask, the 0'th dimension is considered to be
            # the existing batch size of 1
            mask_image = mask_image.unsqueeze(0)
        elif mask_image.ndim == 3 and mask_image.shape[0] != 1:
            # Batch of mask, the 0'th dimension is considered to be
            # the batching dimension
            mask_image = mask_image.unsqueeze(1)

        mask_image[mask_image < 0.5] = 0
        mask_image[mask_image >= 0.5] = 1

    elif isinstance(mask_image[0], PIL.Image.Image):
        new_mask_image = []

        for mask_image_ in mask_image:
            mask_image_ = mask_image_.convert("L")
            mask_image_ = resize(mask_image_, self.unet.config.sample_size)
            mask_image_ = np.array(mask_image_)
            mask_image_ = mask_image_[None, None, :]
            new_mask_image.append(mask_image_)

        mask_image = new_mask_image

        mask_image = np.concatenate(mask_image, axis=0)
        mask_image = mask_image.astype(np.float32) / 255.0
        mask_image[mask_image < 0.5] = 0
        mask_image[mask_image >= 0.5] = 1
        mask_image = torch.from_numpy(mask_image)

    elif isinstance(mask_image[0], np.ndarray):
        mask_image = np.concatenate([m[None, None, :] for m in mask_image], axis=0)

        mask_image[mask_image < 0.5] = 0
        mask_image[mask_image >= 0.5] = 1
        mask_image = torch.from_numpy(mask_image)
```

Excerpt: `check_inputs()` from the img2img pipeline (Issue 3):

```python
def check_inputs(
    self,
    prompt,
    image,
    batch_size,
    callback_steps,
    negative_prompt=None,
    prompt_embeds=None,
    negative_prompt_embeds=None,
):
    if (callback_steps is None) or (
        callback_steps is not None and (not isinstance(callback_steps, int) or callback_steps <= 0)
    ):
        raise ValueError(
            f"`callback_steps` has to be a positive integer but is {callback_steps} of type"
            f" {type(callback_steps)}."
        )

    if prompt is not None and prompt_embeds is not None:
        raise ValueError(
            f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
            " only forward one of the two."
        )
    elif prompt is None and prompt_embeds is None:
        raise ValueError(
            "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
        )
    elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
        raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")

    if negative_prompt is not None and negative_prompt_embeds is not None:
        raise ValueError(
            f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
            f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
        )

    if prompt_embeds is not None and negative_prompt_embeds is not None:
        if prompt_embeds.shape != negative_prompt_embeds.shape:
            raise ValueError(
                "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
                f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
                f" {negative_prompt_embeds.shape}."
            )

    if isinstance(image, list):
        check_image_type = image[0]
    else:
        check_image_type = image

    if (
        not isinstance(check_image_type, torch.Tensor)
        and not isinstance(check_image_type, PIL.Image.Image)
        and not isinstance(check_image_type, np.ndarray)
    ):
        raise ValueError(
            "`image` has to be of type `torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, or list[...] but is"
            f" {type(check_image_type)}"
        )

    if isinstance(image, list):
        image_batch_size = len(image)
    elif isinstance(image, torch.Tensor):
        image_batch_size = image.shape[0]
    elif isinstance(image, PIL.Image.Image):
        image_batch_size = 1
    elif isinstance(image, np.ndarray):
        image_batch_size = image.shape[0]
    else:
        assert False

    if batch_size != image_batch_size:
        raise ValueError(f"image batch size: {image_batch_size} must be same as prompt batch size {batch_size}")
```

Excerpt: `get_timesteps()` (Issue 3):

```python
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_img2img.StableDiffusionImg2ImgPipeline.get_timesteps
def get_timesteps(self, num_inference_steps, strength):
    # get the original timestep using init_timestep
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)

    t_start = max(num_inference_steps - init_timestep, 0)
    timesteps = self.scheduler.timesteps[t_start * self.scheduler.order :]
    if hasattr(self.scheduler, "set_begin_index"):
        self.scheduler.set_begin_index(t_start * self.scheduler.order)

    return timesteps, num_inference_steps - t_start
```

Excerpt: `deepfloyd_if/__init__.py` (Issue 4), starting at the end of the `_import_structure` definition:

```python
    ]
}

try:
    if not (is_transformers_available() and is_torch_available()):
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    from ...utils import dummy_torch_and_transformers_objects  # noqa F403

    _dummy_objects.update(get_objects_from_module(dummy_torch_and_transformers_objects))
else:
    _import_structure["pipeline_if"] = ["IFPipeline"]
    _import_structure["pipeline_if_img2img"] = ["IFImg2ImgPipeline"]
    _import_structure["pipeline_if_img2img_superresolution"] = ["IFImg2ImgSuperResolutionPipeline"]
    _import_structure["pipeline_if_inpainting"] = ["IFInpaintingPipeline"]
    _import_structure["pipeline_if_inpainting_superresolution"] = ["IFInpaintingSuperResolutionPipeline"]
    _import_structure["pipeline_if_superresolution"] = ["IFSuperResolutionPipeline"]
    _import_structure["pipeline_output"] = ["IFPipelineOutput"]
    _import_structure["safety_checker"] = ["IFSafetyChecker"]
    _import_structure["watermark"] = ["IFWatermarker"]
```

Excerpt: `encode_prompt()` (Issue 5):

```python
@torch.no_grad()
def encode_prompt(
    self,
    prompt: str | list[str],
    do_classifier_free_guidance: bool = True,
    num_images_per_prompt: int = 1,
    device: torch.device | None = None,
    negative_prompt: str | list[str] | None = None,
    prompt_embeds: torch.Tensor | None = None,
    negative_prompt_embeds: torch.Tensor | None = None,
    clean_caption: bool = False,
):
    r"""
    Encodes the prompt into text encoder hidden states.

    Args:
        prompt (`str` or `list[str]`, optional):
            prompt to be encoded
        do_classifier_free_guidance (`bool`, optional, defaults to `True`):
            whether to use classifier free guidance or not
        num_images_per_prompt (`int`, optional, defaults to 1):
            number of images that should be generated per prompt
        device: (`torch.device`, optional):
            torch device to place the resulting embeddings on
        negative_prompt (`str` or `list[str]`, optional):
            The prompt or prompts not to guide the image generation. If not defined, one has to pass
            `negative_prompt_embeds`. instead. If not defined, one has to pass `negative_prompt_embeds`. instead.
            Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
        prompt_embeds (`torch.Tensor`, optional):
            Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not
            provided, text embeddings will be generated from `prompt` input argument.
        negative_prompt_embeds (`torch.Tensor`, optional):
            Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt
            weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
            argument.
        clean_caption (bool, defaults to `False`):
            If `True`, the function will preprocess and clean the provided caption before encoding.
    """
    if prompt is not None and negative_prompt is not None:
        if type(prompt) is not type(negative_prompt):
            raise TypeError(
                f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !="
                f" {type(prompt)}."
            )

    if device is None:
        device = self._execution_device

    if prompt is not None and isinstance(prompt, str):
        batch_size = 1
    elif prompt is not None and isinstance(prompt, list):
        batch_size = len(prompt)
    else:
        batch_size = prompt_embeds.shape[0]

    # while T5 can handle much longer input sequences than 77, the text encoder was trained with a max length of 77 for IF
    max_length = 77

    if prompt_embeds is None:
        prompt = self._text_preprocessing(prompt, clean_caption=clean_caption)
        text_inputs = self.tokenizer(
            prompt,
            padding="max_length",
            max_length=max_length,
            truncation=True,
            add_special_tokens=True,
            return_tensors="pt",
        )
        text_input_ids = text_inputs.input_ids
        untruncated_ids = self.tokenizer(prompt, padding="longest", return_tensors="pt").input_ids

        if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not torch.equal(
            text_input_ids, untruncated_ids
        ):
            removed_text = self.tokenizer.batch_decode(untruncated_ids[:, max_length - 1 : -1])
            logger.warning(
                "The following part of your input was truncated because CLIP can only handle sequences up to"
                f" {max_length} tokens: {removed_text}"
            )

        attention_mask = text_inputs.attention_mask.to(device)

        prompt_embeds = self.text_encoder(
            text_input_ids.to(device),
            attention_mask=attention_mask,
        )
        prompt_embeds = prompt_embeds[0]

    if self.text_encoder is not None:
        dtype = self.text_encoder.dtype
    elif self.unet is not None:
        dtype = self.unet.dtype
    else:
        dtype = None

    prompt_embeds = prompt_embeds.to(dtype=dtype, device=device)

    bs_embed, seq_len, _ = prompt_embeds.shape
    # duplicate text embeddings for each generation per prompt, using mps friendly method
    prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
    prompt_embeds = prompt_embeds.view(bs_embed * num_images_per_prompt, seq_len, -1)

    # get unconditional embeddings for classifier free guidance
    if do_classifier_free_guidance and negative_prompt_embeds is None:
        uncond_tokens: list[str]
        if negative_prompt is None:
            uncond_tokens = [""] * batch_size
        elif isinstance(negative_prompt, str):
            uncond_tokens = [negative_prompt]
        elif batch_size != len(negative_prompt):
            raise ValueError(
                f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
                f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
                " the batch size of `prompt`."
            )
        else:
            uncond_tokens = negative_prompt

        uncond_tokens = self._text_preprocessing(uncond_tokens, clean_caption=clean_caption)
        max_length = prompt_embeds.shape[1]
        uncond_input = self.tokenizer(
            uncond_tokens,
            padding="max_length",
            max_length=max_length,
            truncation=True,
            return_attention_mask=True,
            add_special_tokens=True,
            return_tensors="pt",
        )
        attention_mask = uncond_input.attention_mask.to(device)

        negative_prompt_embeds = self.text_encoder(
            uncond_input.input_ids.to(device),
            attention_mask=attention_mask,
        )
        negative_prompt_embeds = negative_prompt_embeds[0]

    if do_classifier_free_guidance:
        # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
        seq_len = negative_prompt_embeds.shape[1]

        negative_prompt_embeds = negative_prompt_embeds.to(dtype=dtype, device=device)

        negative_prompt_embeds = negative_prompt_embeds.repeat(1, num_images_per_prompt, 1)
        negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)
```

	# For classifier free guidance, we need to do two forward passes.
	# Here we concatenate the unconditional and text embeddings into a single batch
	# to avoid doing two forward passes
	else:
	negative_prompt_embeds = None

	return prompt_embeds, negative_prompt_embeds
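
For readers tracing the `uncond_tokens` branches in the `encode_prompt` excerpt above, the following is a minimal pure-Python sketch of that branch order. `resolve_uncond_tokens` is a hypothetical helper written for illustration only; it is not part of diffusers, and it omits the text preprocessing and tokenization steps.

```python
from typing import Optional, Union


def resolve_uncond_tokens(
    negative_prompt: Optional[Union[str, list]],
    batch_size: int,
) -> list:
    """Hypothetical helper mirroring the `uncond_tokens` branches in the excerpt."""
    if negative_prompt is None:
        # No negative prompt: one empty string per batch entry.
        return [""] * batch_size
    if isinstance(negative_prompt, str):
        # A single string is wrapped in a one-element list.
        return [negative_prompt]
    if batch_size != len(negative_prompt):
        # A list must match the prompt batch size exactly.
        raise ValueError(
            f"`negative_prompt` has batch size {len(negative_prompt)}, "
            f"but `prompt` has batch size {batch_size}."
        )
    return list(negative_prompt)


print(resolve_uncond_tokens(None, 2))      # ['', '']
print(resolve_uncond_tokens("blurry", 1))  # ['blurry']
```

A list whose length differs from `batch_size` raises `ValueError`, matching the error branch in the excerpt.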

	class T5FilmDecoder(ModelMixin, ConfigMixin):
	    r"""
	    T5 style decoder with FiLM conditioning.

	    Args:
	        input_dims (`int`, optional, defaults to `128`):
	            The number of input dimensions.
	        targets_length (`int`, optional, defaults to `256`):
	            The length of the targets.
	        d_model (`int`, optional, defaults to `768`):
	            Size of the input hidden states.
	        num_layers (`int`, optional, defaults to `12`):
	            The number of `DecoderLayer`'s to use.
	        num_heads (`int`, optional, defaults to `12`):
	            The number of attention heads to use.
	        d_kv (`int`, optional, defaults to `64`):
	            Size of the key-value projection vectors.
	        d_ff (`int`, optional, defaults to `2048`):
	            The number of dimensions in the intermediate feed-forward layer of `DecoderLayer`'s.
	        dropout_rate (`float`, optional, defaults to `0.1`):
	            Dropout probability.
	    """

	    @register_to_config
	    def __init__(
	        self,
	        input_dims: int = 128,
	        targets_length: int = 256,
	        max_decoder_noise_time: float = 2000.0,
	        d_model: int = 768,
	        num_layers: int = 12,
	        num_heads: int = 12,
	        d_kv: int = 64,
	        d_ff: int = 2048,
	        dropout_rate: float = 0.1,
	    ):
	        super().__init__()

	        self.conditioning_emb = nn.Sequential(
	            nn.Linear(d_model, d_model * 4, bias=False),
	            nn.SiLU(),
	            nn.Linear(d_model * 4, d_model * 4, bias=False),
	            nn.SiLU(),
	        )

	        self.position_encoding = nn.Embedding(targets_length, d_model)
	        self.position_encoding.weight.requires_grad = False

	        self.continuous_inputs_projection = nn.Linear(input_dims, d_model, bias=False)

	        self.dropout = nn.Dropout(p=dropout_rate)

	        self.decoders = nn.ModuleList()
	        for lyr_num in range(num_layers):
	            # FiLM conditional T5 decoder
	            lyr = DecoderLayer(d_model=d_model, d_kv=d_kv, num_heads=num_heads, d_ff=d_ff, dropout_rate=dropout_rate)
	            self.decoders.append(lyr)

	        self.decoder_norm = T5LayerNorm(d_model)

	        self.post_dropout = nn.Dropout(p=dropout_rate)
	        self.spec_out = nn.Linear(d_model, input_dims, bias=False)

	    def encoder_decoder_mask(self, query_input: torch.Tensor, key_input: torch.Tensor) -> torch.Tensor:
	        mask = torch.mul(query_input.unsqueeze(-1), key_input.unsqueeze(-2))
	        return mask.unsqueeze(-3)

	    def forward(self, encodings_and_masks, decoder_input_tokens, decoder_noise_time):
	        batch, _, _ = decoder_input_tokens.shape
	        assert decoder_noise_time.shape == (batch,)

	        # decoder_noise_time is in [0, 1), so rescale to expected timing range.
	        time_steps = get_timestep_embedding(
	            decoder_noise_time * self.config.max_decoder_noise_time,
	            embedding_dim=self.config.d_model,
	            max_period=self.config.max_decoder_noise_time,
	        ).to(dtype=self.dtype)

	        conditioning_emb = self.conditioning_emb(time_steps).unsqueeze(1)

	        assert conditioning_emb.shape == (batch, 1, self.config.d_model * 4)

	        seq_length = decoder_input_tokens.shape[1]

	        # If we want to use relative positions for audio context, we can just offset
	        # this sequence by the length of encodings_and_masks.
	        decoder_positions = torch.broadcast_to(
	            torch.arange(seq_length, device=decoder_input_tokens.device),
	            (batch, seq_length),
	        )

	        position_encodings = self.position_encoding(decoder_positions)

	        inputs = self.continuous_inputs_projection(decoder_input_tokens)
	        inputs += position_encodings
	        y = self.dropout(inputs)

	        # decoder: No padding present.
	        decoder_mask = torch.ones(
	            decoder_input_tokens.shape[:2], device=decoder_input_tokens.device, dtype=inputs.dtype
	        )

	        # Translate encoding masks to encoder-decoder masks.
	        encodings_and_encdec_masks = [(x, self.encoder_decoder_mask(decoder_mask, y)) for x, y in encodings_and_masks]

	        # cross attend style: concat encodings
	        encoded = torch.cat([x[0] for x in encodings_and_encdec_masks], dim=1)
	        encoder_decoder_mask = torch.cat([x[1] for x in encodings_and_encdec_masks], dim=-1)

	        for lyr in self.decoders:
	            y = lyr(
	                y,
	                conditioning_emb=conditioning_emb,
	                encoder_hidden_states=encoded,
	                encoder_attention_mask=encoder_decoder_mask,
	            )[0]

	        y = self.decoder_norm(y)
	        y = self.post_dropout(y)

	        spec_out = self.spec_out(y)
	        return spec_out
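
To make the shape produced by `encoder_decoder_mask` in the excerpt above easier to follow, here is a rough pure-Python sketch that uses nested lists as a stand-in for tensors. It is illustrative only and assumes 0/1 masks; the real method operates on `torch.Tensor` inputs via `unsqueeze` and `torch.mul`.

```python
def encoder_decoder_mask(query_mask, key_mask):
    """Pure-Python stand-in for the tensor method in the excerpt.

    query_mask: [batch][query_len], key_mask: [batch][key_len]
    returns:    [batch][1][query_len][key_len]
    """
    return [
        # The extra single-element list plays the role of `unsqueeze(-3)`.
        [[[q * k for k in key_row] for q in query_row]]
        for query_row, key_row in zip(query_mask, key_mask)
    ]


decoder_mask = [[1, 1]]     # decoder side has no padding, so all ones
encoder_mask = [[1, 1, 0]]  # last encoder position is padding
mask = encoder_decoder_mask(decoder_mask, encoder_mask)
print(mask)  # [[[[1, 1, 0], [1, 1, 0]]]]
```

Each query position gets a copy of the key mask, so padded encoder positions are masked for every decoder position.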

deepfloyd_if model/pipeline review #13646

Issue 1: Unbatched NumPy image inputs are rejected as batch-size mismatches

Issue 2: Batched NumPy masks are converted to 5D tensors

Issue 3: strength is documented as constrained but invalid values are accepted

Issue 4: IFPipelineOutput is hidden behind torch+transformers lazy-import guards

Issue 5: encode_prompt() detaches gradients in all IF pipelines

Issue 6: T5FilmDecoder has no direct fast or slow tests
