Add Nunchaku Lite single-file quantization by rootonchair · Pull Request #14100 · huggingface/diffusers

rootonchair · 2026-07-01T17:08:21Z

What does this PR do?

Adds Nunchaku Lite single-file checkpoint loading for Diffusers models.

This introduces NunchakuLiteQuantizationConfig and a new Nunchaku Lite quantizer that can patch supported nn.Linear modules into runtime SVDQ/AWQ linear layers before strict checkpoint loading. The loader reads safetensors metadata during from_single_file so Nunchaku Lite checkpoints can use their embedded runtime manifest to decide which modules to replace.
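To illustrate the metadata step described above, here is a minimal standard-library sketch of reading the `__metadata__` section from a safetensors header. It relies only on the documented file layout (an 8-byte little-endian header length followed by a JSON header). The metadata key `nunchaku_lite_manifest` and the helper names are hypothetical; the PR does not state which key the loader actually uses.

```python
import json
import struct


def read_safetensors_metadata(path):
    """Return the `__metadata__` dict from a safetensors header, or {} if absent."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the JSON header length as a little-endian uint64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    return header.get("__metadata__", {})


def load_manifest(path, key="nunchaku_lite_manifest"):
    """Decode a JSON manifest stored under a metadata key (key name is an assumption)."""
    # safetensors metadata values are strings, so a manifest would be JSON-encoded.
    raw = read_safetensors_metadata(path).get(key)
    return json.loads(raw) if raw is not None else None
```

Because only the header is read, a loader could decide which modules to patch before touching any tensor data.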

Deprecated API

import torch
from huggingface_hub import hf_hub_download

from diffusers import (
    ErnieImagePipeline,
    ErnieImageTransformer2DModel,
    NunchakuLiteQuantizationConfig,
)

checkpoint = hf_hub_download(
    repo_id="rootonchair/ERNIE-Image-Turbo-nunchaku-lite",
    filename="svdq-int4_r32-ernie-image-turbo-zero-svdq-fix-bias.safetensors",
)

transformer = ErnieImageTransformer2DModel.from_single_file(
    checkpoint,
    config="baidu/ERNIE-Image-Turbo",
    subfolder="transformer",
    quantization_config=NunchakuLiteQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

pipe = ErnieImagePipeline.from_pretrained(
    "baidu/ERNIE-Image-Turbo",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)

pipe.to("cuda")

image = pipe(
    prompt="A modern red armchair in a quiet studio, soft window light, realistic product photography",
    height=512,
    width=512,
    num_inference_steps=8,
    guidance_scale=1.0,
    generator=torch.Generator(device="cuda").manual_seed(1234),
    use_pe=False,
).images[0]

image.save("ernie-image-turbo-nunchaku-lite.png")

New API for `from_single_file` use

import torch
from huggingface_hub import hf_hub_download

from diffusers import (
    ErnieImagePipeline,
    ErnieImageTransformer2DModel,
    NunchakuLiteQuantizationConfig,
)


dtype = torch.bfloat16
device = "cuda"

checkpoint = hf_hub_download(
    repo_id="rootonchair/ERNIE-Image-Turbo-nunchaku-lite",
    filename="svdq-int4_r32-ernie-image-turbo-zero-svdq-fix-bias.safetensors",
)

svdq_targets = []
for name in [
    "self_attention.to_q",
    "self_attention.to_k",
    "self_attention.to_v",
    "self_attention.to_out.0",
    "mlp.gate_proj",
    "mlp.up_proj",
    "mlp.linear_fc2",
]:
    svdq_targets.extend([f"layers.{i}.{name}" for i in range(36)])

quantization_config = NunchakuLiteQuantizationConfig(
    compute_dtype=dtype,
    svdq_w4a4={
        "precision": "int4",
        "group_size": 64,
        "rank": 32,
        "targets": svdq_targets,
    },
    awq_w4a16={
        "precision": "int4",
        "group_size": 64,
        "targets": [
            "text_proj",
            "time_embedding.linear_1",
            "time_embedding.linear_2",
            "adaLN_modulation.1",
            "final_norm.linear",
            "final_linear",
        ],
    },
)

transformer = ErnieImageTransformer2DModel.from_single_file(
    checkpoint,
    config="baidu/ERNIE-Image-Turbo",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=dtype,
)

pipe = ErnieImagePipeline.from_pretrained(
    "baidu/ERNIE-Image-Turbo",
    transformer=transformer,
    torch_dtype=dtype,
)

pipe.to(device)

image = pipe(
    prompt="A modern red armchair in a quiet studio, soft window light, realistic product photography",
    height=512,
    width=512,
    num_inference_steps=8,
    guidance_scale=1.0,
    generator=torch.Generator(device=device).manual_seed(1234),
    use_pe=False,
).images[0]

image.save("ernie-image-turbo-nunchaku-lite.png")

Fixes # (issue)

Before submitting

Did you use an AI agent (Claude Code, Codex, Cursor, etc.) to help with this PR? If so:
- Did you read the Coding with AI agents guide?
- Did you self-review the diff against .ai/review-rules.md?
Did you read the contributor guideline?
Did you read our philosophy doc? (important for complex PRs)
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?
Are you the author (or part of the team) of the model/pipeline (only applicable for model/pipeline related PRs)?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

sayakpaul

Thanks for getting started! Just did a first pass and left high-level reviews.

sayakpaul · 2026-07-02T04:15:28Z

+    def __init__(self, compute_dtype: "torch.dtype" | None = None):
+        self.quant_method = QuantizationMethod.NUNCHAKU_LITE
+        self.compute_dtype = compute_dtype
+        self.pre_quantized = True


Can we also guide the readers on how to obtain the checkpoints?

Also, can we ensure torch.compile compatibility?

The kernels are compatible with torch.compile, as are SVDQLinear and AWQLinear. I will add a test to make sure that compatibility still holds once they are integrated into diffusers.

Can we also guide the readers on how to obtain the checkpoints?

I'm a little confused here. Could you help provide more context

I'm a little confused here. Could you help provide more context

How are the example checkpoints obtained? I think we're only dealing with pre-quantized checkpoints in this PR?

Yes, we are only dealing with pre-quantized checkpoints here. Perhaps we can leave a comment saying that the checkpoints are quantized with diffuse-compressor and then run through the diffusers format converter?

sayakpaul · 2026-07-02T04:16:04Z

@@ -0,0 +1,161 @@
+import json


For tests, WDYT of adding a mixin to https://github.com/huggingface/diffusers/blob/main/tests/models/testing_utils/quantization.py and then extending a popular model like Flux to use that mixin?

Yes, let's do it that way

rootonchair · 2026-07-02T07:04:43Z

I just ran some benchmarks on an RTX PRO 6000. Here is the visual comparison between the bf16 and nvfp4 checkpoints for ERNIE-Image-Turbo:

BF16	Nunchaku NVFP4

Case	Full mean	Denoise mean	Denoise peak alloc	Full peak alloc	Speedup
original	3.003s	2.862s	29.429GB	31.081GB	1.0x
nunchaku_lite NVFP4	2.271s	2.127s	18.926GB	20.578GB	1.35x
nunchaku_lite NVFP4 + compile	1.675s	1.525s	18.672GB	20.578GB	1.8x
nunchaku_lite NVFP4 + bnb text encoder	2.285s	2.132s	14.317GB	15.969GB	1.35x
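The table does not say whether the Speedup column is based on the full or the denoise mean. As a rough sketch, the ratios can be recomputed from the reported means for both. The numbers are copied from the table above; the function name is illustrative.

```python
# Mean latencies in seconds, copied from the table above: (full mean, denoise mean).
timings = {
    "original": (3.003, 2.862),
    "nunchaku_lite NVFP4": (2.271, 2.127),
    "nunchaku_lite NVFP4 + compile": (1.675, 1.525),
    "nunchaku_lite NVFP4 + bnb text encoder": (2.285, 2.132),
}


def speedups(timings, baseline="original"):
    """Return (full speedup, denoise speedup) for each case relative to the baseline."""
    base_full, base_denoise = timings[baseline]
    return {
        name: (base_full / full, base_denoise / denoise)
        for name, (full, denoise) in timings.items()
    }


for name, (full_x, denoise_x) in speedups(timings).items():
    print(f"{name}: full {full_x:.2f}x, denoise {denoise_x:.2f}x")
```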

Replacing nn.Linear with the Nunchaku linear layers cuts the latency of these linear operations by more than 2x for large shapes:

Target	Op	Rows	Shape	Normal ms	Nunchaku ms	Speedup
`layers.0.self_attention.to_q`	`svdq_w4a4`	4096	4096 -> 4096	0.3660	0.1563	2.34x
`layers.0.mlp.gate_proj`	`svdq_w4a4`	4096	4096 -> 12288	1.0646	0.4272	2.49x
`layers.0.mlp.linear_fc2`	`svdq_w4a4`	4096	12288 -> 4096	1.0269	0.4596	2.23x

One note here: AWQ only benefits the adaLN layer, so other modules such as the time embedding or the final linear can stay in bf16:

Target	Op	Rows	Shape	Normal ms	Nunchaku ms	Speedup
`text_proj`	`awq_w4a16`	1	3072 -> 4096	0.0111	0.0201	0.55x
`time_embedding.linear_1`	`awq_w4a16`	1	4096 -> 4096	0.0129	0.0251	0.52x
`time_embedding.linear_2`	`awq_w4a16`	1	4096 -> 4096	0.0129	0.0247	0.52x
`adaLN_modulation.1`	`awq_w4a16`	1	4096 -> 24576	0.1389	0.0243	5.71x
`final_norm.linear`	`awq_w4a16`	1	4096 -> 8192	0.0220	0.0245	0.90x
`final_linear`	`awq_w4a16`	4114	4096 -> 128	0.0248	0.0635	0.39x
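One way to act on these measurements is to keep only the AWQ targets that are actually faster than the bf16 baseline. The sketch below uses the timings from the table above; the helper name and the `min_speedup` threshold are illustrative and not part of the PR.

```python
# (normal_ms, nunchaku_ms) per AWQ target, copied from the table above.
awq_timings_ms = {
    "text_proj": (0.0111, 0.0201),
    "time_embedding.linear_1": (0.0129, 0.0251),
    "time_embedding.linear_2": (0.0129, 0.0247),
    "adaLN_modulation.1": (0.1389, 0.0243),
    "final_norm.linear": (0.0220, 0.0245),
    "final_linear": (0.0248, 0.0635),
}


def targets_worth_quantizing(timings_ms, min_speedup=1.0):
    """Keep targets whose quantized layer beats the baseline by more than min_speedup."""
    return [
        name
        for name, (normal_ms, quant_ms) in timings_ms.items()
        if normal_ms / quant_ms > min_speedup
    ]


awq_targets = targets_worth_quantizing(awq_timings_ms)
```

With these numbers only `adaLN_modulation.1` passes the filter, which matches the note above that the remaining modules can stay in bf16.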

rootonchair · 2026-07-02T10:20:55Z

I have just implemented native loading, so a converted repo can now be loaded directly with from_pretrained:

import torch
from diffusers import ErnieImagePipeline

pipe = ErnieImagePipeline.from_pretrained(
    "rootonchair/ERNIE-Image-Turbo-nunchaku-lite-int4",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="A modern red armchair in a quiet studio, soft window light, realistic product photography",
    height=1024,
    width=1024,
    num_inference_steps=8,
    guidance_scale=1.0,
    use_pe=False,
).images[0]

image.save("ernie-image-turbo-nunchaku-lite-int4.png")

The quantization config now changes to:

"quantization_config": {
    "awq_w4a16": {
      "group_size": 64,
      "precision": "int4",
      "targets": [
        "text_proj",
        "time_embedding.linear_1",
        "time_embedding.linear_2",
        "adaLN_modulation.1",
        "final_norm.linear",
        "final_linear"
      ]
    },
    "compute_dtype": "bfloat16",
    "quant_method": "nunchaku_lite",
    "svdq_w4a4": {
      "group_size": 16,
      "precision": "fp4",
      "rank": 32,
      "targets": [
        "layers.0.self_attention.to_q",
        "layers.1.self_attention.to_q",
        "layers.2.self_attention.to_q",
        "layers.3.self_attention.to_q",
         ...
      ]
    }
}

If we agree to use this schema, I will remove the old metadata/from_single_file approach
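For reference, a config fragment with this schema could be generated rather than written by hand. The sketch below assumes the 36-layer target pattern from the PR description also applies to the converted repo, and reuses the values shown above; `build_quantization_config` is a hypothetical helper, not a diffusers API.

```python
import json

NUM_LAYERS = 36  # assumption: taken from range(36) in the PR description
SVDQ_SUFFIXES = [
    "self_attention.to_q",
    "self_attention.to_k",
    "self_attention.to_v",
    "self_attention.to_out.0",
    "mlp.gate_proj",
    "mlp.up_proj",
    "mlp.linear_fc2",
]
AWQ_TARGETS = [
    "text_proj",
    "time_embedding.linear_1",
    "time_embedding.linear_2",
    "adaLN_modulation.1",
    "final_norm.linear",
    "final_linear",
]


def build_quantization_config(num_layers=NUM_LAYERS):
    """Build a dict that mirrors the quantization_config schema shown above."""
    # Same ordering as the PR description: all layers for one suffix, then the next suffix.
    svdq_targets = [
        f"layers.{i}.{suffix}" for suffix in SVDQ_SUFFIXES for i in range(num_layers)
    ]
    return {
        "quant_method": "nunchaku_lite",
        "compute_dtype": "bfloat16",
        "svdq_w4a4": {"precision": "fp4", "group_size": 16, "rank": 32, "targets": svdq_targets},
        "awq_w4a16": {"precision": "int4", "group_size": 64, "targets": AWQ_TARGETS},
    }


# sort_keys=True reproduces the alphabetical key order of the fragment above.
config_json = json.dumps(
    {"quantization_config": build_quantization_config()}, indent=2, sort_keys=True
)
```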

sayakpaul

Looking good. I think we can remove all metadata related code?

sayakpaul · 2026-07-02T13:28:34Z

+    def is_serializable(self):
+        return False
+
+    @property


We should set is_compileable() property too:

diffusers/src/diffusers/quantizers/base.py

Line 263 in 9159a58

def is_compileable(self) -> bool:
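A minimal sketch of what this suggestion could look like. `BaseQuantizerSketch` is a stand-in for the diffusers quantizer base class, not the real one, and its default of `False` is an assumption; only the property names `is_serializable` and `is_compileable` come from the snippets above.

```python
class BaseQuantizerSketch:
    """Stand-in for the quantizer base class (illustrative only)."""

    @property
    def is_serializable(self) -> bool:
        return False

    @property
    def is_compileable(self) -> bool:
        # Assumed default: opt out of torch.compile unless a subclass overrides this.
        return False


class NunchakuLiteQuantizerSketch(BaseQuantizerSketch):
    """Hypothetical quantizer that advertises torch.compile support."""

    @property
    def is_compileable(self) -> bool:
        return True
```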

sayakpaul · 2026-07-02T13:30:08Z

+    def __init__(self, compute_dtype: "torch.dtype" | None = None):
+        self.quant_method = QuantizationMethod.NUNCHAKU_LITE
+        self.compute_dtype = compute_dtype
+        self.pre_quantized = True


I'm a little confused here. Could you help provide more context

How are the example checkpoints obtained? I think we're only dealing with pre-quantized checkpoints in this PR?

sayakpaul · 2026-07-02T13:36:58Z

+        self.svdq_w4a4 = svdq_w4a4
+        self.awq_w4a16 = awq_w4a16


Do we need any validation around these two?

Sure, we do need it.
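As a rough sketch of what such validation might check, the function below verifies the keys and value types seen in the examples in this thread. The required keys and allowed precisions are inferred from those examples only, and `validate_section` is a hypothetical helper rather than the code that landed in the PR.

```python
REQUIRED_KEYS = {
    "svdq_w4a4": {"precision", "group_size", "rank", "targets"},
    "awq_w4a16": {"precision", "group_size", "targets"},
}
ALLOWED_PRECISIONS = {"int4", "fp4"}  # assumption: the only values seen in this thread


def validate_section(name, section):
    """Raise TypeError or ValueError if a svdq_w4a4 / awq_w4a16 section is malformed."""
    if section is None:
        return  # the section is optional
    if not isinstance(section, dict):
        raise TypeError(f"{name} must be a dict, got {type(section).__name__}")
    missing = REQUIRED_KEYS[name] - set(section)
    if missing:
        raise ValueError(f"{name} is missing keys: {sorted(missing)}")
    if section["precision"] not in ALLOWED_PRECISIONS:
        raise ValueError(f"{name}.precision must be one of {sorted(ALLOWED_PRECISIONS)}")
    for key in ("group_size", "rank"):
        if key in REQUIRED_KEYS[name]:
            value = section[key]
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                raise ValueError(f"{name}.{key} must be a positive integer")
    targets = section["targets"]
    if not isinstance(targets, list) or not all(isinstance(t, str) for t in targets):
        raise ValueError(f"{name}.targets must be a list of module names")
```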

Add Nunchaku Lite single-file quantization

7f4a3a0

github-actions Bot added size/L PR with diff > 200 LOC quantization tests single-file and removed size/L PR with diff > 200 LOC labels Jul 1, 2026

rootonchair marked this pull request as draft July 1, 2026 17:13

sayakpaul reviewed Jul 2, 2026


Support config-backed Nunchaku Lite loading

1a66ac9

github-actions Bot added the size/L PR with diff > 200 LOC label Jul 2, 2026

sayakpaul reviewed Jul 2, 2026


rootonchair added 3 commits July 2, 2026 15:53

Remove Nunchaku runtime manifest metadata loading

36b4fc4

Simplify Nunchaku compact config loading

db05e0b

Add Nunchaku Lite quantization tests

5d4822a
