Fix torchao group offloading with use_stream=True #14112

Open
Sunt-ing wants to merge 1 commit into huggingface:main from Sunt-ing:14
Sunt-ing commented Jul 2, 2026

What does this PR do?

Fixes #13281

This is the TorchAO use_stream=True half split out from #14038, so that #14038 can stay focused on the quanto issue.

The streamed group-offload path keeps a CPU copy of each tensor and tries to pin that copy before transferring a group back to the accelerator. For TorchAO tensor subclasses, _to_cpu() already has to call tensor.cpu() instead of tensor.data.cpu(), but the stream path still calls pin_memory() and is_pinned() on the resulting subclass tensor. AffineQuantizedTensor does not implement those pinning ops, so enable_group_offload(..., use_stream=True) raises NotImplementedError while the CPU cache is being built, before any group can be onloaded.

This PR skips the pinning step for TorchAO tensors in the stream CPU cache. Plain tensors still use pinned memory, and the existing non-stream TorchAO swap path is unchanged.
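The decision described above can be modelled without torch or torchao. The following is a minimal sketch, not the code in this PR: `PlainTensor`, `QuantizedTensor`, and `build_cpu_cache` are hypothetical stand-ins invented for illustration. Only the `pin_memory()` / `is_pinned()` method names and the `low_cpu_mem_usage` flag come from the traceback below.

```python
class PlainTensor:
    """Hypothetical stand-in for a regular CPU tensor that supports pinning."""

    def __init__(self, name, pinned=False):
        self.name = name
        self._pinned = pinned

    def is_pinned(self):
        return self._pinned

    def pin_memory(self):
        # Like torch, return a new pinned copy instead of mutating in place.
        return PlainTensor(self.name, pinned=True)


class QuantizedTensor:
    """Hypothetical stand-in for a tensor subclass without pinning ops."""

    def __init__(self, name):
        self.name = name

    def is_pinned(self):
        raise NotImplementedError("is_pinned is not implemented for this subclass")

    def pin_memory(self):
        raise NotImplementedError("pin_memory is not implemented for this subclass")


def build_cpu_cache(tensors, low_cpu_mem_usage=False):
    """Build the CPU-side cache, pinning only tensors that support it."""
    cache = {}
    for t in tensors:
        # Assumption: an isinstance check is enough to identify quantized tensors here.
        skip_pinning = low_cpu_mem_usage or isinstance(t, QuantizedTensor)
        cache[t.name] = t if skip_pinning else t.pin_memory()
    return cache


cache = build_cpu_cache([PlainTensor("proj.bias"), QuantizedTensor("proj.weight")])
print(cache["proj.bias"].is_pinned())        # True
print(type(cache["proj.weight"]).__name__)   # QuantizedTensor
```

Without the `isinstance` guard, the loop would call `pin_memory()` on the quantized stand-in and raise `NotImplementedError`, which mirrors the failure in the "Before" log.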

Tests

Environment: NVIDIA RTX 4090, torch==2.8.0+cu128, torchao==0.17.0.

Pipeline before/after repro

The before checkout is this branch with the patch reversed. The after checkout is this branch. The script uses the public tiny Flux pipeline, quantizes the transformer with TorchAO int8 weight-only quantization, enables pipe.transformer.enable_group_offload(..., use_stream=True), moves the remaining modules to CUDA, and runs pipe(...).

import numpy as np
import torch

import diffusers
from diffusers import DiffusionPipeline, TorchAoConfig
from diffusers.quantizers import PipelineQuantizationConfig
from torchao.quantization import Int8WeightOnlyConfig

print(f"diffusers_file={diffusers.__file__}")
print(f"torch={torch.__version__}")
print(f"cuda={torch.cuda.get_device_name(0)}")

quantization_config = PipelineQuantizationConfig(
    quant_mapping={"transformer": TorchAoConfig(Int8WeightOnlyConfig())}
)
pipe = DiffusionPipeline.from_pretrained(
    "hf-internal-testing/tiny-flux-pipe",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)
pipe.set_progress_bar_config(disable=True)
pipe.transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True,
    non_blocking=True,
)

for name, component in pipe.components.items():
    if name != "transformer" and isinstance(component, torch.nn.Module):
        if torch.device(component.device).type == "cpu":
            component.to("cuda")

images = pipe(
    "a dog",
    num_inference_steps=2,
    max_sequence_length=16,
    height=32,
    width=32,
    output_type="np",
).images
arr = np.asarray(images)
print("RESULT=PASS")
print(f"output_shape={arr.shape}")
print(f"finite={np.isfinite(arr).all()}")
print(f"mean={arr.mean():.6f}")

Before:

diffusers_file=/root/autodl-tmp/code/diffusers-14-e2e-before/src/diffusers/__init__.py
torch=2.8.0+cu128
cuda=NVIDIA GeForce RTX 4090
RESULT=FAIL
exception:
Traceback (most recent call last):
  File "/root/autodl-tmp/code/diffusers-14-e2e-before/repro_torchao_stream_e2e.py", line 39, in main
    pipe.transformer.enable_group_offload(
  File "/root/autodl-tmp/code/diffusers-14-e2e-before/src/diffusers/models/modeling_utils.py", line 573, in enable_group_offload
    apply_group_offloading(
  File "/root/autodl-tmp/code/diffusers-14-e2e-before/src/diffusers/hooks/group_offloading.py", line 702, in apply_group_offloading
    _apply_group_offloading(module, config)
  File "/root/autodl-tmp/code/diffusers-14-e2e-before/src/diffusers/hooks/group_offloading.py", line 709, in _apply_group_offloading
    _apply_group_offloading_leaf_level(module, config)
  File "/root/autodl-tmp/code/diffusers-14-e2e-before/src/diffusers/hooks/group_offloading.py", line 827, in _apply_group_offloading_leaf_level
    group = ModuleGroup(
  File "/root/autodl-tmp/code/diffusers-14-e2e-before/src/diffusers/hooks/group_offloading.py", line 167, in __init__
    self.cpu_param_dict = self._init_cpu_param_dict()
  File "/root/autodl-tmp/code/diffusers-14-e2e-before/src/diffusers/hooks/group_offloading.py", line 189, in _init_cpu_param_dict
    cpu_param_dict[param] = self._to_cpu(param, self.low_cpu_mem_usage)
  File "/root/autodl-tmp/code/diffusers-14-e2e-before/src/diffusers/hooks/group_offloading.py", line 180, in _to_cpu
    return t if low_cpu_mem_usage else t.pin_memory()
  File "/root/autodl-tmp/code/diffusers-14-e2e-before/pydeps/torchao/utils.py", line 684, in _dispatch__torch_dispatch__
    raise NotImplementedError(
NotImplementedError: AffineQuantizedTensor dispatch: attempting to run unimplemented operator/function: func=<OpOverload(op='aten.is_pinned', overload='default')>, types=(<class 'torchao.dtypes.affine_quantized_tensor.AffineQuantizedTensor'>,), arg_types=(<class 'torchao.dtypes.affine_quantized_tensor.AffineQuantizedTensor'>,), kwarg_types={}
EXIT_CODE=1

After:

diffusers_file=/root/autodl-tmp/code/diffusers-14-e2e-after/src/diffusers/__init__.py
torch=2.8.0+cu128
cuda=NVIDIA GeForce RTX 4090
RESULT=PASS
output_shape=(1, 32, 32, 3)
finite=True
mean=0.508664
EXIT_CODE=0

Regression test:

python -m pytest tests/quantization/torchao/test_torchao.py::TorchAoTest::test_group_offloading -q -rs
1 passed, 11 warnings in 36.40s

Before submitting

Who can review?

cc @sayakpaul

github-actions bot added the size/S (PR with diff < 50 LOC), fixes-issue, tests, and hooks labels and removed the size/S (PR with diff < 50 LOC) and fixes-issue labels on Jul 2, 2026.
Development

Successfully merging this pull request may close these issues.

Group offloading with use_stream=True breaks torchao quantized models (device mismatch) in qwen image
