2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -182,6 +182,8 @@
title: NVIDIA ModelOpt
- local: quantization/autoround
title: AutoRound
- local: quantization/quark
title: Quark
title: Quantization
- isExpanded: false
sections:
2 changes: 1 addition & 1 deletion docs/source/en/quantization/overview.md
@@ -28,7 +28,7 @@ There are two ways to use [`~quantizers.PipelineQuantizationConfig`] depending o

Initialize [`~quantizers.PipelineQuantizationConfig`] with the following parameters.

-- `quant_backend` specifies which quantization backend to use. Currently supported backends include: `bitsandbytes_4bit`, `bitsandbytes_8bit`, `gguf`, `quanto`, and `torchao`.
+- `quant_backend` specifies which quantization backend to use. Currently supported backends include: `bitsandbytes_4bit`, `bitsandbytes_8bit`, `gguf`, `quanto`, `torchao`, and `quark`.
- `quant_kwargs` specifies the quantization arguments to use.

> [!TIP]
112 changes: 112 additions & 0 deletions docs/source/en/quantization/quark.md
@@ -0,0 +1,112 @@
<!--Copyright 2025 - 2026 Advanced Micro Devices, Inc. and The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Quark

[Quark](https://quark.docs.amd.com/latest/) is AMD's deep-learning quantization toolkit. It is agnostic to specific data types, algorithms, and hardware, and primarily targets AMD CPUs and GPUs. Quark supports a broad range of quantization strategies (INT8, INT4, FP8, MX, FP4, SVDQuant, SmoothQuant, AWQ, GPTQ, QuaRot, SpinQuant) that can be combined across diffusion submodules (UNet, transformer, VAE).

The Diffusers integration mirrors the [Transformers integration](https://huggingface.co/docs/transformers/quantization/quark): models exported with `quark.torch.export_safetensors` can be loaded back through `DiffusionPipeline.from_pretrained` / `ModelMixin.from_pretrained` without per-layer setup code.

To use Quark with Diffusers, install Quark:

```bash
pip install amd-quark
```

## Loading a pre-quantized model

If a model on the Hub already carries a `quantization_config` block in `config.json`, no extra setup is needed:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
"amd/sd3-quark-int8",
torch_dtype=torch.float16,
).to("cuda")

image = pipe("A cat on a windowsill", num_inference_steps=30).images[0]
```

The dispatch is automatic: the loader sees `quant_method = "quark"` and instantiates `QuarkDiffusersQuantizer`.
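
A minimal sketch of that dispatch, using toy stand-ins rather than the real Diffusers classes. `ToyQuarkQuantizer`, `TOY_QUANTIZER_MAPPING`, and `pick_quantizer` are hypothetical names used only for illustration, and the real loader likely performs more validation than this:

```python
from typing import Any, Optional


class ToyQuarkQuantizer:
    """Hypothetical stand-in for QuarkDiffusersQuantizer."""

    def __init__(self, quantization_config: dict[str, Any]) -> None:
        self.quantization_config = quantization_config


# Hypothetical stand-in for the loader's quant_method -> quantizer table.
TOY_QUANTIZER_MAPPING = {"quark": ToyQuarkQuantizer}


def pick_quantizer(model_config: dict[str, Any]) -> Optional[ToyQuarkQuantizer]:
    """Return a quantizer for the given config.json contents, or None if unquantized."""
    quant_cfg = model_config.get("quantization_config")
    if quant_cfg is None:
        return None
    method = quant_cfg.get("quant_method")
    if method not in TOY_QUANTIZER_MAPPING:
        raise ValueError(f"Unknown quant_method: {method!r}")
    return TOY_QUANTIZER_MAPPING[method](quant_cfg)


quantizer = pick_quantizer({"quantization_config": {"quant_method": "quark"}})
print(type(quantizer).__name__)
```

A checkpoint without a `quantization_config` block falls through to `None` in this sketch, which corresponds to loading the model unquantized.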

## On-the-fly weight-only quantization

Pass `QuarkConfig(...)` against a vanilla fp16/bf16 model to quantize weights at load time:

```python
import torch
from diffusers import StableDiffusion3Pipeline, QuarkConfig

# A QConfig that produces INT8 weight-only quantization (no activation quantizers).
# Build with quark.torch.quantization.config.config.QConfig and pass its dict.
quark_config_dict = ... # see https://quark.docs.amd.com/latest/

quantization_config = QuarkConfig(quant_method="quark", **quark_config_dict)
pipe = StableDiffusion3Pipeline.from_pretrained(
"stabilityai/stable-diffusion-3-medium-diffusers",
quantization_config=quantization_config,
torch_dtype=torch.float16,
).to("cuda")
```

This works for any QConfig that does not declare activation quantizers (input or output `QTensorConfig`). Examples: INT8 weight-only, MXFP4 weight-only.

For activation-quantized configurations (SmoothQuant, SVDQuant w4a4, FP8 with calibrated activations, etc.), `from_pretrained` will raise a `NotImplementedError` directing you to the offline path.
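
The weight-only condition can be pictured as a check on a serialized layer config. The sketch below is illustrative only: the key names `input_tensors`, `output_tensors`, and `weight` are assumptions about the dict layout, `has_activation_quantizers` is a hypothetical helper, and the check the integration actually performs may differ:

```python
from typing import Any


def has_activation_quantizers(layer_config: dict[str, Any]) -> bool:
    """True if the (assumed) layer config declares input or output quantizers."""
    return (
        layer_config.get("input_tensors") is not None
        or layer_config.get("output_tensors") is not None
    )


# Assumed dict shapes, for illustration only.
weight_only = {"weight": {"dtype": "int8"}, "input_tensors": None, "output_tensors": None}
w8a8 = {"weight": {"dtype": "int8"}, "input_tensors": {"dtype": "int8"}, "output_tensors": None}

for name, cfg in [("weight_only", weight_only), ("w8a8", w8a8)]:
    path = "offline export" if has_activation_quantizers(cfg) else "on-the-fly"
    print(f"{name}: {path}")
```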

## Producing a quantized checkpoint

For configurations that need calibration data, use the offline workflow:

```python
import torch
from diffusers import StableDiffusion3Pipeline
from quark.torch import ModelQuantizer, export_safetensors
from quark.torch.utils.diffusers import get_calib_dataloader

pipe = StableDiffusion3Pipeline.from_pretrained(
"stabilityai/stable-diffusion-3-medium-diffusers",
torch_dtype=torch.float16,
).to("cuda")

prompts = [
"A serene lake reflecting mountains at sunset",
"A futuristic city with flying cars at night",
]
dataloader = get_calib_dataloader(pipe, pipe.transformer, prompts, n_steps=20)

qconfig = ... # SVDQuant / SmoothQuant / FP8 + activation calibration
pipe.transformer = ModelQuantizer(qconfig).quantize_model(pipe.transformer, dataloader)

export_safetensors(pipe.transformer, "sd3-quark-svdquant/transformer")
```

The exported directory can then be reloaded through `from_pretrained`, as described in the "Loading a pre-quantized model" section above.

## Support matrix

| Feature | Supported |
| --- | --- |
| Data types | INT8, INT4, INT2, BFloat16, Float16, FP8 (E4M3/E5M2), FP6, FP4, OCP MX, MX6, MX9, BFP16 |
| Pre-quantization transforms | SmoothQuant, QuaRot, SpinQuant, AWQ |
| Quantization algorithms | GPTQ, SVDQuant |
| Operators | `nn.Linear`, `nn.Conv2d`, `nn.ConvTranspose2d`, `nn.Embedding`, `nn.EmbeddingBag` |
| Granularity | per-tensor, per-channel, per-group, per-block, per-layer, per-layer-type |
| Activation calibration | min/max, percentile, histogram, MSE |
| Quantization strategy | weight-only, static, dynamic, with or without output quantization |
| `torch.compile` | yes (after `ModelQuantizer.freeze`) |

## Resources

- Quark documentation: <https://quark.docs.amd.com/latest/>
- Quark-quantized models on the Hub: <https://huggingface.co/models?other=quark>
21 changes: 21 additions & 0 deletions src/diffusers/__init__.py
@@ -17,6 +17,7 @@
is_onnx_available,
is_opencv_available,
is_optimum_quanto_available,
is_quark_available,
is_scipy_available,
is_sentencepiece_available,
is_torch_available,
@@ -136,6 +137,18 @@
else:
_import_structure["quantizers.quantization_config"].append("AutoRoundConfig")

try:
if not (is_torch_available() and is_accelerate_available() and is_quark_available()):
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
from .utils import dummy_quark_objects

_import_structure["utils.dummy_quark_objects"] = [
name for name in dir(dummy_quark_objects) if not name.startswith("_")
]
else:
_import_structure["quantizers.quantization_config"].append("QuarkConfig")

try:
if not is_onnx_available():
raise OptionalDependencyNotAvailable()
@@ -1017,6 +1030,14 @@
else:
from .quantizers.quantization_config import AutoRoundConfig

try:
if not is_quark_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
from .utils.dummy_quark_objects import *
else:
from .quantizers.quantization_config import QuarkConfig

try:
if not is_onnx_available():
raise OptionalDependencyNotAvailable()
4 changes: 4 additions & 0 deletions src/diffusers/quantizers/auto.py
@@ -30,9 +30,11 @@
QuantizationConfigMixin,
QuantizationMethod,
QuantoConfig,
QuarkConfig,
TorchAoConfig,
)
from .quanto import QuantoQuantizer
from .quark import QuarkDiffusersQuantizer
from .torchao import TorchAoHfQuantizer


@@ -44,6 +46,7 @@
"torchao": TorchAoHfQuantizer,
"modelopt": NVIDIAModelOptQuantizer,
"auto-round": AutoRoundQuantizer,
"quark": QuarkDiffusersQuantizer,
}

AUTO_QUANTIZATION_CONFIG_MAPPING = {
@@ -54,6 +57,7 @@
"torchao": TorchAoConfig,
"modelopt": NVIDIAModelOptConfig,
"auto-round": AutoRoundConfig,
"quark": QuarkConfig,
}


83 changes: 82 additions & 1 deletion src/diffusers/quantizers/quantization_config.py
@@ -23,6 +23,7 @@
from __future__ import annotations

import copy
import dataclasses
import importlib.metadata
import json
import os
@@ -33,7 +34,7 @@

from packaging import version

from ..utils import deprecate, is_torch_available, is_torchao_version, logging
from ..utils import deprecate, is_quark_available, is_torch_available, is_torchao_version, logging


if is_torch_available():
@@ -49,6 +50,7 @@ class QuantizationMethod(str, Enum):
QUANTO = "quanto"
MODELOPT = "modelopt"
AUTOROUND = "auto-round"
QUARK = "quark"


@dataclass
@@ -828,3 +830,82 @@ def from_dict(cls, config_dict: dict, return_unused_kwargs: bool = False, **kwar
# (e.g. quant_method is set automatically)
config_dict = {k: v for k, v in config_dict.items() if k != "quant_method"}
return super().from_dict(config_dict, return_unused_kwargs=return_unused_kwargs, **kwargs)


class QuarkConfig(QuantizationConfigMixin):
"""Configuration for AMD [Quark](https://quark.docs.amd.com/latest/) quantized diffusion models.

Mirrors ``transformers.utils.quantization_config.QuarkConfig`` so that a model serialized by
``quark.torch.export_safetensors`` reloads with the same schema in either library.

The ``quantization_config`` section of ``config.json`` is forwarded to this constructor as keyword arguments. Two
on-disk layouts are accepted:

* **Native.** Produced by ``custom_mode='quark'``. Contains a flat dump of ``QConfig.to_dict()`` together with a
top-level ``"export"`` block holding the ``JsonExporterConfig`` fields.
* **Custom mode (legacy).** Produced by ``custom_mode='awq'`` or ``custom_mode='fp8'``. ``quant_method`` carries
the custom mode tag and the rest of the body matches the AutoAWQ / native-FP8 schemas.
"""

def __init__(self, quant_config_dict: dict[str, Any] | None = None, **kwargs):
if quant_config_dict is not None:
kwargs = {**quant_config_dict, **kwargs}

if not (is_torch_available() and is_quark_available()):
raise ImportError(
"Quark is not installed. Install it with `pip install amd-quark` or "
"refer to https://quark.docs.amd.com/latest/install.html."
)

from quark import __version__ as quark_version
from quark.torch.export.config.config import JsonExporterConfig
from quark.torch.export.main_export.quant_config_parser import QuantConfigParser
from quark.torch.quantization.config.config import QConfig

self.custom_mode = kwargs.get("quant_method", QuantizationMethod.QUARK.value)
self.legacy = "export" not in kwargs

if self.custom_mode in ["awq", "fp8"]:
self.quant_config = QuantConfigParser.from_custom_config(kwargs, is_bias_quantized=False)
self.json_export_config = JsonExporterConfig()
else:
self.quant_config = QConfig.from_dict(kwargs)

if "export" in kwargs:
export_kwargs = dict(kwargs["export"])
# ``min_kv_scale`` is amd-quark>=0.8 only. Drop with a warning on older versions.
if "min_kv_scale" in export_kwargs and version.parse(quark_version) < version.parse("0.8"):
min_kv_scale = export_kwargs.pop("min_kv_scale")
logger.warning(
"Found `min_kv_scale=%s` in the model config.json's `quantization_config.export` block, but "
"this parameter is supported only for amd-quark>=0.8. Ignoring. Please upgrade `amd-quark`.",
min_kv_scale,
)
self.json_export_config = JsonExporterConfig(**export_kwargs)
elif self.custom_mode == QuantizationMethod.QUARK.value:
self.json_export_config = JsonExporterConfig()

self.quant_method = QuantizationMethod.QUARK

def to_dict(self) -> dict[str, Any]:
"""Serialize to the JSON-friendly kwargs form accepted by ``__init__``.

The default ``QuantizationConfigMixin.to_dict`` does
``copy.deepcopy(self.__dict__)``, which would embed the live Quark
``QConfig`` and ``JsonExporterConfig`` dataclasses (not JSON-serializable
through ``json.dumps``). Mirror what
``quark.torch.export.api.QuarkSafetensorsExporter`` writes into
``config.json``: a flat dump of ``QConfig.to_dict()`` plus a top-level
``"export"`` block holding the ``JsonExporterConfig`` fields.
"""
config_dict: dict[str, Any] = {}
if self.quant_config is not None:
config_dict.update(self.quant_config.to_dict())
config_dict["quant_method"] = self.custom_mode
if self.json_export_config is not None:
config_dict["export"] = dataclasses.asdict(self.json_export_config)
return config_dict

def to_diff_dict(self) -> dict[str, Any]:
"""No meaningful "default" QuarkConfig to diff against — return ``to_dict``."""
return self.to_dict()
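
The serialization concern described in the `to_dict` docstring can be reproduced with the standard library alone. This is a toy sketch under stated assumptions: `ToyExportConfig` and `ToyQuarkConfig` are hypothetical stand-ins, not the Quark or Diffusers classes, and the `weight_format` field and its value are made up for illustration:

```python
import copy
import dataclasses
import json
from typing import Any


@dataclasses.dataclass
class ToyExportConfig:
    """Hypothetical stand-in for an exporter config dataclass."""

    weight_format: str = "real_quantized"


class ToyQuarkConfig:
    """Hypothetical config object holding a live dataclass instance."""

    def __init__(self) -> None:
        self.quant_method = "quark"
        self.json_export_config = ToyExportConfig()

    def naive_to_dict(self) -> dict[str, Any]:
        # Deep-copies attributes as-is, so the dataclass instance survives.
        return copy.deepcopy(self.__dict__)

    def to_dict(self) -> dict[str, Any]:
        # Flattens the dataclass into plain JSON-friendly types.
        return {
            "quant_method": self.quant_method,
            "export": dataclasses.asdict(self.json_export_config),
        }


config = ToyQuarkConfig()
try:
    json.dumps(config.naive_to_dict())
except TypeError as err:
    print(f"naive dump fails: {err}")

print(json.dumps(config.to_dict()))
```

The naive variant fails because `json.dumps` has no encoder for arbitrary dataclass instances, while `dataclasses.asdict` reduces them to nested dicts of plain values.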
14 changes: 14 additions & 0 deletions src/diffusers/quantizers/quark/__init__.py
@@ -0,0 +1,14 @@
# Copyright 2025 - 2026 Advanced Micro Devices, Inc. and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .quark_quantizer import QuarkDiffusersQuantizer