Fix group offloading for quanto-quantized models by Sunt-ing · Pull Request #14038 · huggingface/diffusers

Sunt-ing · 2026-06-22T14:21:21Z

What does this PR do?

Group offloading moves a group's parameters between CPU and the accelerator by reassigning param.data:

param.data = source_tensor.to(device)

This is correct for plain tensors but wrong for quanto tensor subclasses. A quanto WeightQBytesTensor stores the real payload in internal tensors such as _data and _scale; replacing .data only swaps the outer wrapper and leaves those internal tensors on the source device. The next matmul then fails with mat2 is on cpu, different from cuda:0.

#13276 fixed the same subclass-storage issue for TorchAO tensors by swapping the full tensor subclass instead of assigning .data, but quanto tensors still fall through to the plain tensor path. This PR adds the corresponding quanto path and keeps the TorchAO stream fix split out in #14112.

Changes

Detect quanto QTensor parameters without importing optimum-quanto unless it is installed.
Use torch.utils.swap_tensors for quanto onload/offload instead of assigning .data.
Restore and record streams for quanto internal tensors using the subclass tensor names from __tensor_flatten__().
Skip pinned-memory conversion for quanto tensors, since pin_memory() does not preserve the quanto subclass.
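
The third item relies on the `__tensor_flatten__()` protocol, which returns the attribute names of a subclass's inner tensors together with a context object. A minimal sketch of reading those names to see where each inner tensor lives is shown below. `inner_tensor_devices` and `StandInQuantizedWeight` are hypothetical names, and the stand-in class only imitates the `_data` / `_scale` layout described above so that the snippet runs without quanto or a GPU.

```python
from types import SimpleNamespace


def inner_tensor_devices(tensor):
    """Map each inner tensor name to its device, via __tensor_flatten__()."""
    names, _context = tensor.__tensor_flatten__()
    return {name: str(getattr(tensor, name).device) for name in names}


class StandInQuantizedWeight:
    """Hypothetical stand-in for a wrapper subclass; not a real quanto class."""

    def __init__(self, data_device, scale_device):
        # Only the `.device` attribute is modelled here.
        self._data = SimpleNamespace(device=data_device)
        self._scale = SimpleNamespace(device=scale_device)

    def __tensor_flatten__(self):
        # Same shape as the real protocol: (inner attribute names, context).
        return ["_data", "_scale"], {}


stale = StandInQuantizedWeight("cpu", "cpu")
print(inner_tensor_devices(stale))
```

A helper like this can serve as a quick diagnostic: if any inner tensor still reports the offload device after onloading, the wrapper was moved without its payload.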

Tests

Environment: NVIDIA RTX 4090, torch==2.8.0+cu128, optimum-quanto==0.2.7.

Reproduction and before/after

Minimal standalone repro for #12610:

import torch
from diffusers import UNet2DConditionModel
from diffusers.hooks import apply_group_offloading
from optimum.quanto import quantize, freeze, qint8

m = UNet2DConditionModel.from_pretrained(
    "hf-internal-testing/tiny-stable-diffusion-pipe", subfolder="unet"
).to(torch.float32).eval()
quantize(m, weights=qint8)
freeze(m)
apply_group_offloading(
    m,
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
)
x = torch.randn(2, m.config.in_channels, m.config.sample_size, m.config.sample_size, device="cuda")
t = torch.tensor([10, 10], device="cuda")
e = torch.randn(2, 4, m.config.cross_attention_dim, device="cuda")
with torch.no_grad():
    m(x, t, e)

On main, this fails with:

RuntimeError: mat2 is on cpu, different from cuda:0

With this PR, quanto group offload matches the fully-on-accelerator quantized baseline across leaf_level, block_level, non-stream, use_stream, and record_stream configs. The maximum absolute difference is 0.0.
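
For reference, a maximum absolute difference of this kind can be computed as in the sketch below. It is not the script used for the numbers above: `max_abs_diff` is a hypothetical helper, and a small CPU `torch.nn.Linear` run twice stands in for the baseline and group-offloaded models.

```python
import torch


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two tensors."""
    return (a.float() - b.float()).abs().max().item()


torch.manual_seed(0)
layer = torch.nn.Linear(8, 4).eval()
x = torch.randn(2, 8)

with torch.no_grad():
    baseline = layer(x)
    # Stand-in for the output of the same model under group offloading.
    candidate = layer(x)

print(max_abs_diff(baseline, candidate))
```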

Regression tests:

python -m pytest tests/quantization/quanto/test_quanto.py::FluxTransformerInt8WeightsTest::test_group_offloading -q
python -m pytest tests/quantization/quanto/test_quanto.py::FluxTransformerFloat8WeightsTest::test_group_offloading -q

Both tests fail on main with the device mismatch and pass with this PR.

Before submitting

Did you use an AI agent (Claude Code, Codex, Cursor, etc.) to help with this PR? If so:
- Did you read the Coding with AI agents guide?
- Did you self-review the diff against .ai/review-rules.md?
Did you read the contributor guideline?
Did you read our philosophy doc? (important for complex PRs)
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. Issue #12610: Quanto + Group Offload causes device mismatch error (weights on cpu, mat1 on gpu)
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?
Are you the author (or part of the team) of the model/pipeline (only applicable for model/pipeline related PRs)?

Who can review?

cc @sayakpaul

sayakpaul · 2026-06-23T20:36:29Z

Group offloading should have been fixed with #13276, though. Can you check again?

Sunt-ing · 2026-06-25T17:20:38Z

Hi @sayakpaul, thanks. Yes, I rechecked against #13276 before opening this. #13276 makes group offloading work for torchao by swapping the subclass (_is_torchao_tensor → torch.utils.swap_tensors on onload, setattr of inner tensors on the offload restore). Two cases it doesn't cover are exactly what this PR targets:

- quanto was never handled (#12610, "Quanto + Group Offload causes device mismatch error (weights on cpu, mat1 on gpu)"). There is no quanto branch anywhere in group_offloading.py (still true on main today), so a quanto-quantized model with enable_group_offload falls through to the plain param.data = source.to(device) path. That swaps only the outer wrapper and leaves _data / _scale on the offload device, so the first forward crashes with mat2 is on cpu, different from cuda:0, for both leaf_level and block_level.
- The streamed path is still broken for both subclasses (#13281, "Group offloading with use_stream=True breaks torchao quantized models (device mismatch) in qwen image"). As #13276 itself notes, its stream handling assumes the subclass implements pinning ("not something we can always guarantee ... we need coordination with TorchAO"; see pytorch/ao#4192, "support pinning for mx and nvfp4 tensors"). That coordination only added pinning for mx / nvfp4; Int8WeightOnlyConfig's AffineQuantizedTensor still raises NotImplementedError: ... aten.is_pinned on torchao==0.17.0, and quanto implements no torch pinning at all. _to_cpu / _pinned_memory_tensors still call pin_memory() / is_pinned() unconditionally, so torchao + use_stream=True crashes even with #13276 merged. This PR skips pinning for both subclasses on the diffusers side, so it works regardless of the torchao version.
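
The idea of skipping pinning on the diffusers side can be sketched as a guard around the pin step. This is only an illustration under stated assumptions: `offload_copy`, `PlainStandIn`, `SubclassStandIn`, and `skip` are hypothetical names, the stand-in classes imitate just the `cpu()` / `pin_memory()` calls so the snippet runs without torch or a GPU, and the actual changes to `_to_cpu` / `_pinned_memory_tensors` are not reproduced here.

```python
def offload_copy(tensor, skip_pinning, pin=True):
    """Return a CPU copy, pinning it only when the tensor type supports that."""
    cpu_tensor = tensor.cpu()
    if pin and not skip_pinning(cpu_tensor):
        cpu_tensor = cpu_tensor.pin_memory()
    return cpu_tensor


class PlainStandIn:
    """Hypothetical stand-in for a plain tensor that can be pinned."""

    def __init__(self, pinned=False):
        self.pinned = pinned

    def cpu(self):
        return self

    def pin_memory(self):
        return PlainStandIn(pinned=True)


class SubclassStandIn(PlainStandIn):
    """Hypothetical stand-in for a subclass whose pinning is unimplemented."""

    def pin_memory(self):
        raise NotImplementedError("pinning is not implemented for this stand-in")


def skip(tensor):
    # In real code this would be a subclass check such as _is_quanto_tensor.
    return isinstance(tensor, SubclassStandIn)


plain = offload_copy(PlainStandIn(), skip)
quantized = offload_copy(SubclassStandIn(), skip)
print(plain.pinned, quantized.pinned)
```

The trade-off is that the skipped tensors stay in pageable memory, so their host-to-device copies may not overlap with compute the way pinned copies can.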

Both #12610 and #13281 are still open. I confirmed on current main (so with #13276 in) that the three tests this PR adds fail, and pass here:

main (with #13276) vs this PR

# main (fix reverted, tests kept)
quanto  FluxTransformerInt8WeightsTest::test_group_offloading    FAILED  (mat2 is on cpu, different from cuda:0)
quanto  FluxTransformerFloat8WeightsTest::test_group_offloading  FAILED  (mat2 is on cpu, different from cuda:0)
torchao TorchAoTest::test_group_offloading                       FAILED  (NotImplementedError: ... aten.is_pinned)

# with this PR
quanto  FluxTransformerInt8WeightsTest::test_group_offloading    PASSED
quanto  FluxTransformerFloat8WeightsTest::test_group_offloading  PASSED
torchao TorchAoTest::test_group_offloading                       PASSED

On approach: I deliberately mirrored the existing _is_torchao_tensor branch rather than touching it, to keep this a low-risk bug fix (_is_quanto_tensor gates on is_optimum_quanto_available() and pulls inner-tensor names from the standard __tensor_flatten__()). I also saw your note in #13276 about generalizing these utilities to swap_tensors for any subclass instead of .data. Happy to fold torchao + quanto into one generic subclass path here, or leave that as the separate follow-up you mentioned, whichever you prefer.
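
A minimal sketch of an availability-gated check of this kind is shown below. It is not the PR's code: `is_package_available` and `is_quanto_tensor` are hypothetical names, and the sketch assumes that `QTensor` can be imported from `optimum.quanto`. The import happens lazily, only after the package has been found.

```python
import importlib.util


def is_package_available(name):
    """Return True if `name` can be located, without importing the module itself."""
    try:
        # Note: for a dotted name, find_spec imports the parent package.
        return importlib.util.find_spec(name) is not None
    except (ModuleNotFoundError, ValueError):
        # Parent package missing, or an already-imported module has no spec.
        return False


def is_quanto_tensor(obj):
    """Hypothetical check: True only if quanto is installed and obj is a QTensor."""
    if not is_package_available("optimum.quanto"):
        return False
    # Assumption: QTensor is exported from optimum.quanto.
    from optimum.quanto import QTensor

    return isinstance(obj, QTensor)


print(is_package_available("json"))
print(is_quanto_tensor(object()))
```

Gating on availability first keeps environments without optimum-quanto on the plain tensor path with no import error.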

sayakpaul · 2026-07-01T10:37:29Z

Can we focus on one issue at a time? I would suggest splitting the PR into two.

Sunt-ing · 2026-07-02T22:13:06Z

Thanks @sayakpaul. I split the TorchAO use_stream=True fix into #14112 and updated this PR to focus on the quanto issue in #12610.

Fix group offloading for quanto-quantized models and the use_stream p…

8ab88ee

…ath for quantized tensor subclasses

github-actions bot added the fixes-issue, size/M (PR with diff < 200 LOC), tests, and hooks labels and removed the size/M (PR with diff < 200 LOC) label on Jun 22, 2026

Sunt-ing mentioned this pull request Jul 2, 2026

Fix torchao group offloading with use_stream=True #14112

Narrow group offloading fix to quanto tensors

7612f35

github-actions bot added the size/M (PR with diff < 200 LOC) label on Jul 2, 2026

Sunt-ing changed the title ~~Fix group offloading for quanto-quantized models and the use_stream path for quantized tensor subclasses~~ Fix group offloading for quanto-quantized models Jul 2, 2026

Fix group offloading for quanto-quantized models #14038
Sunt-ing wants to merge 2 commits into huggingface:main from Sunt-ing:0
