Fix torchao group offloading with use_stream=True by Sunt-ing · Pull Request #14112 · huggingface/diffusers

Sunt-ing · 2026-07-02T22:04:14Z

What does this PR do?

This PR is the TorchAO use_stream=True half of #14038, split out so that #14038 can stay focused on the quanto issue.

The streamed group-offload path keeps a CPU copy of each tensor and tries to pin that copy before transferring a group back to the accelerator. For TorchAO tensor subclasses, _to_cpu() already has to call tensor.cpu() instead of tensor.data.cpu(), but the stream path still calls pin_memory() and is_pinned() on the resulting subclass tensor. AffineQuantizedTensor does not implement those pinning ops, so enable_group_offload(..., use_stream=True) fails before the group can be onloaded.

This PR skips the pinning step for TorchAO tensors in the stream CPU cache. Plain tensors still use pinned memory, and the existing non-stream TorchAO swap path is unchanged.
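The idea can be illustrated with a minimal sketch. This is not the code from the diff: `is_plain_tensor`, `to_cpu_cache`, and `FakeQuantTensor` are hypothetical names, and the type check is only an assumed stand-in for however diffusers actually detects TorchAO tensors. It shows a CPU cache helper that pins plain tensors and leaves tensor subclasses unpinned.

```python
import torch


def is_plain_tensor(t: torch.Tensor) -> bool:
    # Hypothetical check: anything that is not exactly torch.Tensor or
    # nn.Parameter is treated as a tensor subclass (e.g. a TorchAO tensor).
    return type(t) in (torch.Tensor, torch.nn.Parameter)


def to_cpu_cache(t: torch.Tensor, low_cpu_mem_usage: bool = False) -> torch.Tensor:
    """Return the CPU copy kept in the stream cache, pinning only plain tensors."""
    if not is_plain_tensor(t):
        # Subclass path: call .cpu() on the tensor itself and skip pinning,
        # because the subclass may not implement pin_memory()/is_pinned().
        return t.cpu()
    cpu = t.data.cpu()
    # Pinning needs an accelerator; skip it on CPU-only machines as well.
    if low_cpu_mem_usage or not torch.cuda.is_available():
        return cpu
    return cpu.pin_memory()


class FakeQuantTensor(torch.Tensor):
    """Stand-in for a quantized tensor subclass; not a real TorchAO type."""


plain = torch.randn(4, 4)
fake = torch.randn(4, 4).as_subclass(FakeQuantTensor)

plain_cpu = to_cpu_cache(plain)
fake_cpu = to_cpu_cache(fake)

print("plain pinned:", plain_cpu.is_pinned())
print("subclass type kept:", type(fake_cpu).__name__)
```

The trade-off in this sketch is that unpinned subclass tensors lose the faster asynchronous host-to-device copy that pinned memory allows, in exchange for not calling ops the subclass does not implement.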

Tests

Environment: NVIDIA RTX 4090, torch==2.8.0+cu128, torchao==0.17.0.

Pipeline before/after repro

The before checkout is this branch with the patch reversed. The after checkout is this branch. The script uses the public tiny Flux pipeline, quantizes the transformer with TorchAO int8 weight-only quantization, enables pipe.transformer.enable_group_offload(..., use_stream=True), moves the remaining modules to CUDA, and runs pipe(...).

import numpy as np
import torch

import diffusers
from diffusers import DiffusionPipeline, TorchAoConfig
from diffusers.quantizers import PipelineQuantizationConfig
from torchao.quantization import Int8WeightOnlyConfig

print(f"diffusers_file={diffusers.__file__}")
print(f"torch={torch.__version__}")
print(f"cuda={torch.cuda.get_device_name(0)}")

quantization_config = PipelineQuantizationConfig(
    quant_mapping={"transformer": TorchAoConfig(Int8WeightOnlyConfig())}
)
pipe = DiffusionPipeline.from_pretrained(
    "hf-internal-testing/tiny-flux-pipe",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)
pipe.set_progress_bar_config(disable=True)
pipe.transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True,
    non_blocking=True,
)

for name, component in pipe.components.items():
    if name != "transformer" and isinstance(component, torch.nn.Module):
        if torch.device(component.device).type == "cpu":
            component.to("cuda")

images = pipe(
    "a dog",
    num_inference_steps=2,
    max_sequence_length=16,
    height=32,
    width=32,
    output_type="np",
).images
arr = np.asarray(images)
print("RESULT=PASS")
print(f"output_shape={arr.shape}")
print(f"finite={np.isfinite(arr).all()}")
print(f"mean={arr.mean():.6f}")

Before:

diffusers_file=/root/autodl-tmp/code/diffusers-14-e2e-before/src/diffusers/__init__.py
torch=2.8.0+cu128
cuda=NVIDIA GeForce RTX 4090
RESULT=FAIL
exception:
Traceback (most recent call last):
  File "/root/autodl-tmp/code/diffusers-14-e2e-before/repro_torchao_stream_e2e.py", line 39, in main
    pipe.transformer.enable_group_offload(
  File "/root/autodl-tmp/code/diffusers-14-e2e-before/src/diffusers/models/modeling_utils.py", line 573, in enable_group_offload
    apply_group_offloading(
  File "/root/autodl-tmp/code/diffusers-14-e2e-before/src/diffusers/hooks/group_offloading.py", line 702, in apply_group_offloading
    _apply_group_offloading(module, config)
  File "/root/autodl-tmp/code/diffusers-14-e2e-before/src/diffusers/hooks/group_offloading.py", line 709, in _apply_group_offloading
    _apply_group_offloading_leaf_level(module, config)
  File "/root/autodl-tmp/code/diffusers-14-e2e-before/src/diffusers/hooks/group_offloading.py", line 827, in _apply_group_offloading_leaf_level
    group = ModuleGroup(
  File "/root/autodl-tmp/code/diffusers-14-e2e-before/src/diffusers/hooks/group_offloading.py", line 167, in __init__
    self.cpu_param_dict = self._init_cpu_param_dict()
  File "/root/autodl-tmp/code/diffusers-14-e2e-before/src/diffusers/hooks/group_offloading.py", line 189, in _init_cpu_param_dict
    cpu_param_dict[param] = self._to_cpu(param, self.low_cpu_mem_usage)
  File "/root/autodl-tmp/code/diffusers-14-e2e-before/src/diffusers/hooks/group_offloading.py", line 180, in _to_cpu
    return t if low_cpu_mem_usage else t.pin_memory()
  File "/root/autodl-tmp/code/diffusers-14-e2e-before/pydeps/torchao/utils.py", line 684, in _dispatch__torch_dispatch__
    raise NotImplementedError(
NotImplementedError: AffineQuantizedTensor dispatch: attempting to run unimplemented operator/function: func=<OpOverload(op='aten.is_pinned', overload='default')>, types=(<class 'torchao.dtypes.affine_quantized_tensor.AffineQuantizedTensor'>,), arg_types=(<class 'torchao.dtypes.affine_quantized_tensor.AffineQuantizedTensor'>,), kwarg_types={}
EXIT_CODE=1

After:

diffusers_file=/root/autodl-tmp/code/diffusers-14-e2e-after/src/diffusers/__init__.py
torch=2.8.0+cu128
cuda=NVIDIA GeForce RTX 4090
RESULT=PASS
output_shape=(1, 32, 32, 3)
finite=True
mean=0.508664
EXIT_CODE=0

Regression test:

python -m pytest tests/quantization/torchao/test_torchao.py::TorchAoTest::test_group_offloading -q -rs

1 passed, 11 warnings in 36.40s

Before submitting

Did you use an AI agent (Claude Code, Codex, Cursor, etc.) to help with this PR? If so:
- Did you read the Coding with AI agents guide?
- Did you self-review the diff against .ai/review-rules.md?
Did you read the contributor guideline?
Did you read our philosophy doc? (important for complex PRs)
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. Linked issue: "Group offloading with use_stream=True breaks torchao quantized models (device mismatch) in qwen image" (#13281).
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?
Are you the author (or part of the team) of the model/pipeline (only applicable for model/pipeline related PRs)?

Who can review?

cc @sayakpaul

Fix torchao group offloading with use_stream=True

d135a16

github-actions bot added the size/S (PR with diff < 50 LOC), fixes-issue, tests, and hooks labels and removed the size/S (PR with diff < 50 LOC) and fixes-issue labels on Jul 2, 2026.

Sunt-ing mentioned this pull request on Jul 2, 2026 in "Fix group offloading for quanto-quantized models" (#14038, open).

Sunt-ing wants to merge 1 commit into huggingface:main from Sunt-ing:14
