feat(mm): add Qwen Image single-file checkpoint loader with fp8 support #9253

Open
Pfannkuchensack wants to merge 11 commits into invoke-ai:main from Pfannkuchensack:feat/qwen-image-checkpoint-loader

Conversation

@Pfannkuchensack

Pfannkuchensack commented May 30, 2026

Collaborator

Summary

Adds Main_Checkpoint_QwenImage_Config and QwenImageCheckpointModel so that single-file safetensors checkpoints (e.g. Qwen-Image-Edit 2511 fp8_scaled from Civitai) can be imported. ComfyUI-style fp8 weights are dequantized at load time; the existing default_settings.fp8_storage toggle then optionally re-casts to fp8 for VRAM savings.

Also wires _apply_fp8_layerwise_casting into the Qwen Image diffusers loader so the fp8 storage option works for both the diffusers and single-file checkpoint formats (GGUF stays untouched, as it carries its own quantization).

Shared variant inference (marker tensor → filename heuristic) and transformer architecture auto-detection are extracted into module-level helpers so the GGUF and checkpoint loaders stay in sync.

Additional fixes in this PR:

  • Memory-efficient dequantization. ComfyUI fp8_scaled weights are now dequantized directly to the compute dtype (bf16) instead of via a full-precision float32 intermediate. The previous path materialised a 4-byte/param copy of the entire model before downcasting, spiking peak RAM to ~2× the final bf16 size (~80 GB for the 20B transformer). bf16 shares float32's exponent range and fp8 carries only 3 mantissa bits, so there is no meaningful precision loss. Applies to both the transformer and the single-file Qwen2.5-VL encoder loaders.
  • Qwen2.5-VL vs. Qwen3 encoder disambiguation. A single-file Qwen2.5-VL encoder satisfies the Qwen3 key heuristic (model.layers.* / model.embed_tokens.weight), so it matched both Qwen3Encoder_Checkpoint_Config and QwenVLEncoder_Checkpoint_Config; the tiebreak misrouted it to Qwen3Encoder, hiding it from the Qwen Image loader's encoder field. The Qwen3 single-file/GGUF configs now reject state dicts carrying a Qwen-VL visual tower (visual.blocks.* / visual.patch_embed.*), making the two mutually exclusive. Text-only Qwen3 encoders (Z-Image, FLUX.2 Klein) are unaffected.
  • Silenced bitsandbytes log spam. The LLM.int8 path emitted a MatMul8bitLt: inputs will be cast from bfloat16 to float16 UserWarning on every matmul of every layer (LLM.int8 only supports fp16 activations; the bf16→fp16 cast is correct and intended). Suppressed once at import.
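As a rough check on the memory figures in the first bullet, the arithmetic can be sketched as follows. The parameter count is the approximate 20B quoted above, and the helper name `size_gb` is illustrative only; nothing here is taken from the PR's code.

```python
# Back-of-the-envelope peak-RAM arithmetic for the dequantization path.
# PARAM_COUNT is the approximate figure quoted in the description, not a measurement.
PARAM_COUNT = 20_000_000_000  # ~20B-parameter transformer

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float8": 1}


def size_gb(param_count: int, dtype: str) -> float:
    """Size of one full copy of the weights, in decimal gigabytes."""
    return param_count * BYTES_PER_PARAM[dtype] / 1e9


fp32_intermediate = size_gb(PARAM_COUNT, "float32")  # transient copy in the old path
bf16_final = size_gb(PARAM_COUNT, "bfloat16")        # what the loaded model keeps

print(f"float32 intermediate: {fp32_intermediate:.0f} GB")
print(f"bf16 final:           {bf16_final:.0f} GB")
print(f"ratio:                {fp32_intermediate / bf16_final:.0f}x")
```

Dequantizing straight to bf16 removes the float32 copy, so under these assumptions the transient weight footprint stays near the bf16 size.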

Related Issues / Discussions

Qwen-Image-Edit 2511 fp8_scaled.

QA Instructions

Running a quantized transformer (GGUF or fp8 single-file) together with a standalone VAE + standalone Qwen2.5-VL encoder avoids ever downloading the full ~40 GB diffusers pipeline.

  1. Import a Qwen Image single-file checkpoint via the Model Manager. Tested files:
  2. Confirm classification:
    • The transformer checkpoint → Main / QwenImage / Checkpoint (not Diffusers, not GGUFQuantized), with the variant (edit vs generate) inferred correctly:
      • filename containing "edit" (case-insensitive) → edit
      • state dict containing __index_timestep_zero__ → edit
      • otherwise → generate
      • explicit override in import options must win.
    • The Qwen2.5-VL encoder → QwenVLEncoder / Checkpoint (not Qwen3Encoder) and must be selectable in the Qwen2.5-VL Encoder field of the Main Model – Qwen Image loader node.
  3. In the loader node, mix and match: GGUF/checkpoint Transformer + standalone Qwen Image VAE + standalone Qwen2.5-VL Encoder, leaving Component Source empty. Generate end-to-end and confirm a sensible image. For an Edit variant, verify the reference image actually conditions the output (dual modulation works).
  4. Toggle FP8 Storage in the model's default settings and re-generate:
    • Log line FP8 layerwise casting enabled for <model> ... should appear.
    • Transformer VRAM should drop ~50%; output should remain visually equivalent.
    • Repeat the toggle test for a diffusers-format Qwen Image model (previously fp8_storage was a no-op there).
  5. Regression check — re-import a GGUF Qwen Image model and a diffusers folder Qwen Image model; both must still load and infer correctly (loader helpers were extracted, behavior should be identical). Confirm a standalone text-only Qwen3 encoder (Z-Image / FLUX.2 Klein) still classifies as Qwen3Encoder.
  6. Run the relevant tests:
    uv run --extra cuda pytest \
      tests/backend/model_manager/configs/test_qwen_image_checkpoint_variant_detection.py \
      tests/backend/model_manager/configs/ \
      tests/backend/model_manager/load/test_load_default_fp8.py
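The variant rules in step 2 can be sketched as a single function. This is a minimal illustration, not the PR's `_infer_qwen_image_variant`: the function name, the string return values, and the exact marker key (`__index_timestep_zero__`) are assumptions. The precedence shown (explicit override, then marker tensor, then filename, then the default) follows the "marker tensor → filename heuristic" order given in the summary.

```python
from pathlib import Path
from typing import Iterable, Optional

# Assumed name of the marker tensor that identifies an Edit checkpoint.
EDIT_MARKER_KEY = "__index_timestep_zero__"


def infer_variant(keys: Iterable[str], path: Path, override: Optional[str] = None) -> str:
    """Return "edit" or "generate" for a checkpoint (hypothetical helper)."""
    if override:  # an explicit import option always wins
        return override
    if EDIT_MARKER_KEY in set(keys):  # marker tensor takes precedence over the name
        return "edit"
    if "edit" in path.stem.lower():  # case-insensitive filename heuristic
        return "edit"
    return "generate"


print(infer_variant([], Path("qwenImageEdit2511_fp8.safetensors")))
print(infer_variant([], Path("qwen_image_fp8.safetensors")))
```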

Testing Status

Tested locally with:

Merge Plan

Standard merge — no DB schema changes, no migrations needed. The new config class registers in the discriminator union but only matches files that are explicitly Qwen Image single-file checkpoints (not GGUF, not diffusers), so it cannot accidentally re-classify existing models. Note: the Qwen2.5-VL/Qwen3 disambiguation only affects new classifications — an encoder imported before this PR stays Qwen3Encoder until re-imported.
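The encoder disambiguation can be pictured with a small key-based classifier. The sketch below is a simplification under stated assumptions: the function names are hypothetical, the Qwen3 heuristic is assumed to accept either key pattern, and the real config classes perform further checks that are not shown.

```python
from typing import Iterable, Optional

# Key patterns quoted in the PR description.
QWEN3_TEXT_PREFIX = "model.layers."
QWEN3_EMBED_KEY = "model.embed_tokens.weight"
VISUAL_TOWER_PREFIXES = ("visual.blocks.", "visual.patch_embed.")


def matches_qwen3_heuristic(keys: Iterable[str]) -> bool:
    """Assumed form of the text-encoder heuristic: either pattern is enough."""
    return any(k == QWEN3_EMBED_KEY or k.startswith(QWEN3_TEXT_PREFIX) for k in keys)


def has_visual_tower(keys: Iterable[str]) -> bool:
    """True when the state dict carries Qwen-VL vision weights."""
    return any(k.startswith(VISUAL_TOWER_PREFIXES) for k in keys)


def classify_encoder(keys: Iterable[str]) -> Optional[str]:
    keys = list(keys)  # allow several passes even if a generator was passed in
    if not matches_qwen3_heuristic(keys):
        return None
    # The visual tower is what makes the two configs mutually exclusive.
    return "QwenVLEncoder" if has_visual_tower(keys) else "Qwen3Encoder"


print(classify_encoder(["model.embed_tokens.weight", "visual.blocks.0.attn.qkv.weight"]))
print(classify_encoder(["model.embed_tokens.weight", "model.layers.0.mlp.up_proj.weight"]))
```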

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • ❗Changes to a redux slice have a corresponding migration — n/a, backend only
  • Documentation added / updated (if applicable) — n/a, no user-facing config changes
  • Updated What's New copy (if doing a release after this PR)

…h fp8 support

Adds Main_Checkpoint_QwenImage_Config and QwenImageCheckpointModel so that
single-file safetensors checkpoints (e.g. Qwen-Image-Edit 2511 fp8_scaled
from Civitai) can be imported. ComfyUI-style fp8 weights are dequantized to
bf16 at load time; the existing default_settings.fp8_storage toggle then
optionally re-casts to fp8 for VRAM savings.

Also wires _apply_fp8_layerwise_casting into the Qwen Image diffusers loader
so the fp8 storage option works across all three formats (diffusers, single-
file checkpoint, GGUF stays untouched as it carries its own quantization).

Shared variant inference (marker tensor → filename heuristic) and transformer
architecture auto-detection are extracted into module-level helpers so the
GGUF and checkpoint loaders stay in sync.
@github-actions github-actions Bot added the python (PRs that change python files), backend (PRs that change backend files), and python-tests (PRs that change python tests) labels May 30, 2026
@lstein lstein self-assigned this Jun 3, 2026
@lstein lstein added the 6.13.5 Library Updates label Jun 3, 2026
@lstein lstein moved this to 6.13.5 LIBRARY UPDATES in Invoke - Community Roadmap Jun 3, 2026
@github-actions github-actions Bot added the frontend (PRs that change frontend files) label Jun 5, 2026

@lstein lstein left a comment

Collaborator


Code review

Solid, well-documented change. Good refactor of the duplicated GGUF logic into _strip_comfyui_prefix / _build_qwen_image_transformer_config / _infer_qwen_image_variant, tests pass, and ruff is clean. A few issues worth addressing.

1. [Medium] Detection and loading disagree about the ComfyUI prefix

The loader strips model.diffusion_model. / diffusion_model. prefixes via _strip_comfyui_prefix (qwen_image.py:240, :292), but the config probe never strips them. Main_Checkpoint_QwenImage_Config.from_model_on_disk (main.py:1383) calls _has_qwen_image_keys(sd) on the raw state dict, and that check uses strict startswith tests against "txt_in.", "txt_norm.", and "img_in." (main.py:1338-1340). ModelOnDisk.load_state_dict (model_on_disk.py:81) does no prefix normalization.

So a ComfyUI checkpoint whose keys are actually prefixed (model.diffusion_model.txt_in.weight) will fail identification and never reach the new loader — even though the loader was specifically built to strip that prefix. The config docstring explicitly claims "Covers… ComfyUI-style fp8_scaled checkpoints" (main.py:1361-1363), so this is a real gap.

The same inconsistency is pre-existing in Main_GGUF_QwenImage_Config, which suggests the files tested so far have bare keys and the prefix-stripping is defensive. Two ways to resolve:

  • If prefixed files are a real input → strip the prefix in _has_qwen_image_keys (or before calling it) so detection and loading agree.
  • If they're not → the _strip_comfyui_prefix calls are effectively dead and the docstring overstates coverage.

Worth confirming which, since right now the two paths can't both be right.
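One way to make detection and loading agree is to normalize keys before the probe runs. The sketch below is illustrative only: the function names mirror the helpers mentioned above but are re-implemented here from the description, and treating a single matching key as sufficient is an assumption.

```python
from typing import Iterable

# Prefixes and key markers quoted in the review comment.
COMFYUI_PREFIXES = ("model.diffusion_model.", "diffusion_model.")
QWEN_IMAGE_KEY_PREFIXES = ("txt_in.", "txt_norm.", "img_in.")


def strip_comfyui_prefix(key: str) -> str:
    """Drop a leading ComfyUI wrapper prefix, if present."""
    for prefix in COMFYUI_PREFIXES:
        if key.startswith(prefix):
            return key[len(prefix):]
    return key


def has_qwen_image_keys(keys: Iterable[str]) -> bool:
    """Probe on normalized keys so prefixed and bare checkpoints look the same."""
    return any(strip_comfyui_prefix(k).startswith(QWEN_IMAGE_KEY_PREFIXES) for k in keys)


print(has_qwen_image_keys(["model.diffusion_model.txt_in.weight"]))
print(has_qwen_image_keys(["txt_in.weight"]))
```

With the same normalization applied in both places, a prefixed checkpoint and a bare one take the same path through identification and loading.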

2. [Low] QwenVLEncoderCheckpointLoader still inlines the now-extracted helpers

The PR extracted _dequantize_comfyui_fp8 (qwen_image.py:51) and _strip_quantization_metadata (qwen_image.py:84), but QwenVLEncoderCheckpointLoader._load_text_encoder_from_singlefile (qwen_image.py:429-472) still carries verbatim copies of both blocks — ~45 lines, identical down to the comments. Since these are now module-level helpers in the same file, the encoder loader should call them. Leaving two copies means a future fix to the dequant logic has to be applied twice.

3. [Low] make_room estimate is taken before the float32→bf16 cast

In _load_from_singlefile (qwen_image.py:305-310), new_sd_size is computed with model_dtype.itemsize (bf16 = 2 bytes) and make_room is called before the cast loop. But at that moment the dequantized weights in sd are float32 (weight_float * scale_float → fp32, qwen_image.py:79), so the actual transient footprint is ~2× the estimate. For a model this size that's a non-trivial undercount feeding the cache eviction logic. The QwenVL loader (qwen_image.py:517) avoids this by computing the size after casting with actual element_size(). Consider reordering (cast, then size, then make_room) for consistency, or estimating with fp32 width.
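The undercount can be illustrated with a toy state dict. This sketch uses a hypothetical `TensorInfo` stand-in instead of real tensors and made-up element counts; it only shows why sizing with the target dtype before the cast reports half of what is resident.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TensorInfo:
    """Hypothetical stand-in for a tensor: element count and bytes per element."""
    numel: int
    itemsize: int

    @property
    def nbytes(self) -> int:
        return self.numel * self.itemsize


def estimated_size(sd: Dict[str, TensorInfo], itemsize: int) -> int:
    """Size computed from a target dtype width, ignoring the tensors' real dtype."""
    return sum(t.numel for t in sd.values()) * itemsize


def actual_size(sd: Dict[str, TensorInfo]) -> int:
    """Size computed from what is actually held in memory."""
    return sum(t.nbytes for t in sd.values())


def cast(sd: Dict[str, TensorInfo], itemsize: int) -> Dict[str, TensorInfo]:
    return {k: TensorInfo(t.numel, itemsize) for k, t in sd.items()}


# Dequantized weights are still float32 (4 bytes) at the point make_room is called.
sd = {"a.weight": TensorInfo(1_000_000, 4), "b.weight": TensorInfo(500_000, 4)}

early_estimate = estimated_size(sd, itemsize=2)  # bf16 width, taken before the cast
resident_before_cast = actual_size(sd)           # what is really in memory then
resident_after_cast = actual_size(cast(sd, 2))   # sizing after the cast matches

print(early_estimate, resident_before_cast, resident_after_cast)
```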

4. [Nit] Missing type hint on _infer_qwen_image_variant

def _infer_qwen_image_variant(sd: ..., path) (main.py:1346) — path is untyped; it should be Path. The function relies on path.stem.

5. [Nit] Filename "edit" substring heuristic is broad

_infer_qwen_image_variant treats any "edit" substring in the stem as the Edit variant (main.py:1355). Names like credited, edited, or unedited would false-positive. This is moved-not-new logic, and the marker-tensor check takes precedence, so it's low risk — but a word-boundary match would be safer if you touch it.
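A token-based match could look like the sketch below. It is one possible tokenization, not necessarily what the PR should adopt: the helper name is hypothetical, and the regex splits the stem on separators, camelCase boundaries, and digit runs so that names like qwenImageEdit2511_fp8 still match.

```python
import re
from pathlib import Path

# Tokens are: an uppercase run not followed by a lowercase letter (acronyms),
# an optionally capitalized lowercase word, or a run of digits.
_TOKEN_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def filename_says_edit(path: Path) -> bool:
    """True when "edit" appears as a whole token in the file stem."""
    tokens = _TOKEN_RE.findall(path.stem)
    return any(token.lower() == "edit" for token in tokens)


for name in ("qwenImageEdit2511_fp8.safetensors", "credited_model.safetensors"):
    print(name, filename_says_edit(Path(name)))
```

A limitation of this approach is that an all-lowercase stem with no separators (for example qwenimageedit) forms a single token and is not matched, so the marker-tensor check remains the more reliable signal.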

Checked, not issues

  • The override_fields.pop("variant", None) or _infer_... pattern (main.py:1387) is safe: QwenImageVariantType is a str Enum, so Generate is truthy (covered by test_explicit_variant_override_not_overwritten).
  • _dequantize_comfyui_fp8's weight_key is always bound, since weight_scale_keys is pre-filtered to keys ending in one of scale_suffixes.
  • _strip_quantization_metadata correctly removes the .scale_input keys that _dequantize_comfyui_fp8 leaves behind.
  • The new config correctly rejects GGUF and non-Qwen state dicts, with tests covering both — so it won't double-match with Main_GGUF_QwenImage_Config.
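The first point relies on how Python evaluates the truthiness of a str-mixed Enum. A minimal sketch, assuming hypothetical member values and a stand-in for the inference helper:

```python
from enum import Enum
from typing import Any, Dict


class VariantType(str, Enum):
    """Hypothetical stand-in for QwenImageVariantType; member values are assumed."""
    Generate = "generate"
    Edit = "edit"


def infer_from_file() -> VariantType:
    """Stand-in for the inference helper; always answers Edit in this sketch."""
    return VariantType.Edit


def resolve_variant(override_fields: Dict[str, Any]) -> VariantType:
    # A str-mixed Enum member follows str truthiness, so any non-empty value
    # (including Generate) short-circuits the `or` and the override is kept.
    return override_fields.pop("variant", None) or infer_from_file()


print(resolve_variant({"variant": VariantType.Generate}))  # override kept
print(resolve_variant({}))                                  # falls back to inference
```

The pattern would only misbehave if a member had an empty-string value, since that member would be falsy.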

Recommendation: address #1 (verify/fix the prefix detection gap) before merge; #2–#5 are cleanups that can ride along or follow up.

🤖 Generated with Claude Code

@lstein lstein left a comment

Collaborator


So far I haven't been able to run generations with Qwen Image Edit 2511 fp8. Generation gets to the text encoder loading message and then the whole InvokeAI process dies with "Killed". Sometimes it brings the shell down with it, and once it locked up my machine and I had to cold reboot.

I get the same behavior regardless of whether fp8 storage is active or not.

…edit heuristic, dedupe fp8 helpers

- strip ComfyUI key prefixes in _has_qwen_image_keys so prefixed checkpoints
  are identified and reach the loader
- match "edit" as a filename token instead of any substring (no credited/edited/unedited false positives)
- reuse _dequantize_comfyui_fp8 / _strip_quantization_metadata in the QwenVL encoder loader
- size make_room reservation after the bf16 cast to avoid fp32 undercount
- add Path type hint on _infer_qwen_image_variant
@lstein

lstein commented Jun 6, 2026

Collaborator

Attempts to generate using qwenImageEdit2511_fp8.safetensors from Civitai reproducibly end in a hard crash. Log output appended. Note also the log messages indicating that the system tries to load the text encoder twice.

[2026-06-06 12:25:02,934]::[InvokeAI]::INFO --> Executing queue item 20, session 519d17e7-633d-41db-b8c2-3364a06afd36
[2026-06-06 12:25:03,243]::[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model '53272211-ed62-4a29-b2fd-0611f25be722:vae' (AutoencoderKLQwenImage) onto cuda device in 0.06s. Total model size: 242.03MB, VRAM: 242.03MB (100.0%)
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 14.58it/s]
[2026-06-06 12:25:16,667]::[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model '3804fea9-d0a3-4fff-8fd0-d062fbe4d068:text_encoder' (Qwen2_5_VLForConditionalGeneration) onto cuda device in 11.89s. Total model size: 15816.05MB, VRAM: 9439.05MB (59.7%)
[2026-06-06 12:25:24,566]::[InvokeAI]::WARNING --> Loading 0.0 MB into VRAM, but only -38.1875 MB were requested. This is the minimum set of weights in VRAM required to run the model.
[2026-06-06 12:25:24,569]::[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model '3804fea9-d0a3-4fff-8fd0-d062fbe4d068:text_encoder' (Qwen2_5_VLForConditionalGeneration) onto cuda device in 0.02s. Total model size: 15816.05MB, VRAM: 9392.10MB (59.4%)
Killed

@Pfannkuchensack

Collaborator Author
[image]
I need to dig a bit deeper. It is running, but it needs a lot of VRAM/RAM.

…ilence int8 warning

- qwen_image: dequantize ComfyUI fp8_scaled weights directly to compute_dtype
  instead of a full-precision float32 intermediate. The previous path materialised
  a 4-byte/param copy of the whole model before downcasting, spiking peak RAM to
  ~2x the final bf16 size (~80GB for the 20B transformer). bf16 shares float32's
  exponent range and fp8 has only 3 mantissa bits, so no meaningful precision loss.

- qwen3_encoder: reject checkpoints that bundle a Qwen-VL visual tower
  (visual.blocks.* / visual.patch_embed.*). A Qwen2.5-VL file satisfies the Qwen3
  key heuristic too, so it matched both configs and the tiebreak misrouted it to
  Qwen3Encoder, hiding it from the Qwen Image loader's encoder field. Qwen3 (text)
  and QwenVLEncoder (vision+language) are now mutually exclusive.

- bnb_llm_int8: silence the per-matmul "inputs will be cast from bfloat16 to
  float16" UserWarning. LLM.int8 only supports fp16 activations; the bf16->fp16
  cast is correct and intended, so the warning is pure log spam on every layer.
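The warning suppression in the last item can be done with the standard `warnings` module. The sketch below is an assumption about the approach, not the PR's code: the function name is hypothetical, and the message pattern is built from the text quoted in the commit message, so the exact wording emitted by bitsandbytes may differ.

```python
import warnings


def silence_int8_cast_warning() -> None:
    """Install an ignore filter for the per-matmul cast warning (hypothetical helper)."""
    warnings.filterwarnings(
        "ignore",
        # The pattern is matched against the start of the warning message.
        message=r"MatMul8bitLt: inputs will be cast from .* to float16",
        category=UserWarning,
    )


# Demonstration: only the unrelated warning should be recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    silence_int8_cast_warning()  # inserted in front of the "always" filter
    warnings.warn("MatMul8bitLt: inputs will be cast from bfloat16 to float16", UserWarning)
    warnings.warn("unrelated warning", UserWarning)

caught_messages = [str(w.message) for w in caught]
print(caught_messages)
```

Installing the filter once at import time keeps the cast itself untouched and only removes the repeated log line.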

Labels

6.13.5 Library Updates, backend (PRs that change backend files), frontend (PRs that change frontend files), python (PRs that change python files), python-tests (PRs that change python tests)

Projects

Status: 6.13.5 LIBRARY UPDATES

2 participants