feat: experimental vision-based ref support for ideogram4 by stduhpf · Pull Request #1690 · leejet/stable-diffusion.cpp

stduhpf · 2026-06-21T17:21:47Z

Summary

Enables Qwen3-VL vision for Ideogram4 conditioning.

Related Issue / Discussion

#1679

Additional Information

Not sure if it actually works, needs testing.

Checklist

I have read and confirmed this PR follows the contribution guidelines.

stduhpf · 2026-06-21T18:37:03Z

Doesn't seem to be doing much. Probably not worth merging.

Green-Sky · 2026-06-21T18:42:12Z

[INFO ] stable-diffusion.cpp:3984 - EDIT mode
[DEBUG] stable-diffusion.cpp:3994 - auto resize ref images
[DEBUG] stable-diffusion.cpp:4006 - resize vae ref image 0 from 768x768 to 768x768
[INFO ] ggml_graph_cut.cpp:938  - vae build cached graph cut plan done (taking 0 ms)
[DEBUG] model_loader.cpp:990  - loading 108/248 tensors from models/flux2/full_encoder_small_decoder.safetensors
  |##################################################| 108/108 - 653.35MB/s
[INFO ] model_loader.cpp:1227 - loading tensors completed, taking 0.20s (read: 0.01s, memcpy: 0.00s, convert: 0.01s, copy_to_backend: 0.00s)
[DEBUG] model_manager.cpp:218  - model manager prepared params backend buffer ( 65.72 MB, 108 tensors, RAM)
[DEBUG] model_manager.cpp:309  - model manager staged compute params ( 65.72 MB, 108 tensors) to CUDA0, taking 0.01s
[DEBUG] ggml_extend.hpp:1995 - vae compute buffer size: 1909.13 MB(VRAM)
[DEBUG] vae.hpp:161  - computing vae encode graph completed, taking 0.82s
[INFO ] stable-diffusion.cpp:4093 - encode_first_stage completed, taking 0.94s
/build/ld1qymfwc2f9ib04nh43zfb1lx01gklv-source/src/model/te/llm.hpp:1457: GGML_ASSERT(pos_embed != nullptr) failed

On the commit before your force push just now.

Green-Sky · 2026-06-21T18:45:56Z

It looks to me like we still need to disable the vae latent as context ref path.

stduhpf · 2026-06-21T18:47:00Z

@Green-Sky Have you tried boogu_image_edit? Because I suspect it might not work either.

To run the Qwen3VL mmproj I had to patch the name conversion like this:

diff --git a/src/name_conversion.cpp b/src/name_conversion.cpp
index da2a8d5..f2618c4 100644
--- a/src/name_conversion.cpp
+++ b/src/name_conversion.cpp
@@ -153,8 +153,9 @@ std::string convert_cond_stage_model_name(std::string name, std::string prefix)
     };
 
     static const std::vector<std::pair<std::string, std::string>> llm_vision_name_map{
-        {"mm.", "merger.mlp."},
-        {"v.post_ln.", "merger.ln_q."},
+        // {"mm.", "merger.mlp."},
+        // {"v.post_ln.", "merger.ln_q."},
+        {"v.patch_embd.bias", "patch_embed.proj.1.bias"},
         {"v.patch_embd.weight", "patch_embed.proj.0.weight"},
         {"patch_embed.proj.0.weight.1", "patch_embed.proj.1.weight"},
         {"v.patch_embd.weight.1", "patch_embed.proj.1.weight"},
@@ -163,11 +164,18 @@ std::string convert_cond_stage_model_name(std::string name, std::string prefix)
         {"attn_k.", "attn.k_proj."},
         {"attn_v.", "attn.v_proj."},
         {"attn_out.", "attn.proj."},
-        {"ffn_down.", "mlp.down_proj."},
+        {"ffn_down.", "mlp.linear_fc2."},
+        // {"ffn_down.", "mlp.down_proj."},
         {"ffn_gate.", "mlp.gate_proj."},
-        {"ffn_up.", "mlp.up_proj."},
+        {"ffn_up.", "mlp.linear_fc1."},
+        // {"ffn_up.", "mlp.up_proj."},
         {"ln1.", "norm1."},
         {"ln2.", "norm2."},
+        {"attn_qkv.", "attn.qkv."},
+        {"mm.0.", "merger.linear_fc1."},
+        {"mm.2.", "merger.linear_fc2."},
+        {"v.post_ln.", "merger.norm."},
+        {"v.position_embd.", "pos_embed."},
     };
     if (contains(name, "t5xxl")) {
         replace_with_name_map(name, t5_name_map);
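The renaming done by the patch above can be sketched in isolation. This is a minimal illustration, not the project's code: `apply_name_map` is a hypothetical stand-in for `replace_with_name_map`, whose exact semantics are not shown in this thread, and the tensor names in the checks are illustrative. It assumes each pair replaces the first occurrence of its key and that pairs are applied in order. If the position-embedding tensor is never renamed to the name the loader expects, a lookup for it would plausibly come back empty, which is consistent with (but not confirmed as the cause of) the `pos_embed != nullptr` assertion reported earlier.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using NameMap = std::vector<std::pair<std::string, std::string>>;

// Hypothetical stand-in for the project's name-map helper: for each
// (from, to) pair, replace the first occurrence of `from` in `name`.
// Pairs are applied in order, so earlier entries act first.
std::string apply_name_map(std::string name, const NameMap& name_map) {
    for (const auto& [from, to] : name_map) {
        const std::size_t pos = name.find(from);
        if (pos != std::string::npos) {
            name.replace(pos, from.size(), to);
        }
    }
    return name;
}

// A subset of the entries added by the patch (copied from the diff).
const NameMap qwen3vl_vision_map{
    {"ffn_up.", "mlp.linear_fc1."},
    {"ffn_down.", "mlp.linear_fc2."},
    {"attn_qkv.", "attn.qkv."},
    {"mm.0.", "merger.linear_fc1."},
    {"mm.2.", "merger.linear_fc2."},
    {"v.post_ln.", "merger.norm."},
    {"v.position_embd.", "pos_embed."},
};

// Illustrative tensor names; real mmproj files may use different ones.
void example_usage() {
    assert(apply_name_map("v.position_embd.weight", qwen3vl_vision_map) == "pos_embed.weight");
    assert(apply_name_map("mm.0.bias", qwen3vl_vision_map) == "merger.linear_fc1.bias");
}
```

Because the pairs are applied in order in this sketch, an entry such as `{"mm.", ...}` placed before `{"mm.0.", ...}` would consume the prefix first, which may be why the patch comments out the old `mm.` entry rather than keeping both.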

stduhpf · 2026-06-21T18:48:33Z

> It looks to me like we still need to disable the vae latent as context ref path.

Yes, it's a bit of wasted compute, but that's probably not what's causing the crash.

Green-Sky · 2026-06-21T18:54:15Z

Posting my command.

sd-cli --diffusion-model models/ideogram4-Q4_K.gguf --llm models/Qwen3-VL-8B-Instruct-UD-Q4_K_XL.gguf --llm_vision models/Qwen3-VL-8B-Instruct-mmproj-F16.gguf --vae models/flux2/full_encoder_small_decoder.safetensors --lora-model-dir models/loras/ig4/ --fa -v --offload-to-cpu --max-vram 5 -H 768 -W 768 --steps 4 --cfg-scale 1 -p "$(cat ./ig4prompt_5.json)<lora:ideogram_4_turbotime_v1:1>" -r ig4_cup.png

ig4prompt_5.json, pretty-printed:

{
  "high_level_description": "An amateurish photograph of a wooden desk with a specific item from a reference image placed centrally.",
  "compositional_deconstruction": {
    "background": "A worn wooden desk filling the entire frame with visible scratches and dust, soft uneven natural daylight from an unmarked window to the left, a beige wall in the upper right corner, low resolution and slight camera shake evident.",
    "elements": [
      {
        "type": "obj",
        "desc": "The item from the reference image resting in the center of the desk, its distinct features clearly visible despite the amateurish lighting and low resolution, casting a vague shadow to the right due to the uneven light source."
      }
    ]
  }
}

ref img:

Green-Sky · 2026-06-21T19:03:03Z

+                    auto image_embed = llm->encode_image(n_threads, resized_image, false, true, true);
+                    GGML_ASSERT(!image_embed.empty());
+
+                    img_prompt += "Picture " + std::to_string(i + 1) + ": <|vision_start|>";


"Picture 1: <|vision_start|>"
According to the prompt template for that model, the picture indices are supposed to start at 0.
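The indexing point can be sketched as a small helper. This is only an illustration under assumptions: `build_image_prompt` and its `first_index` parameter are hypothetical, and the `<|image_pad|>` and `<|vision_end|>` tokens follow the usual Qwen-VL chat-template convention rather than anything shown in this PR. The quoted diff line corresponds to `first_index = 1`; the suggestion here corresponds to `first_index = 0`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: build the per-image prefix of the prompt.
// `first_index` selects whether labels start at "Picture 0" or "Picture 1".
std::string build_image_prompt(std::size_t n_images, std::size_t first_index) {
    std::string img_prompt;
    for (std::size_t i = 0; i < n_images; ++i) {
        img_prompt += "Picture " + std::to_string(i + first_index) + ": <|vision_start|>";
        // Assumed placeholder and closing tokens (Qwen-VL convention);
        // the PR's actual handling after <|vision_start|> is not shown.
        img_prompt += "<|image_pad|>";
        img_prompt += "<|vision_end|>";
    }
    return img_prompt;
}

void example_usage() {
    // Zero-based labels, as suggested in the review comment.
    assert(build_image_prompt(2, 0) ==
           "Picture 0: <|vision_start|><|image_pad|><|vision_end|>"
           "Picture 1: <|vision_start|><|image_pad|><|vision_end|>");
}
```

Keeping the starting index as a parameter makes it easy to compare both labelings against the same reference image when testing.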

Green-Sky · 2026-06-21T19:05:02Z

You are right: not only does it not work, but the rest of the prompt also seems to be cut off.

edit: ~~actually, that might be the reason.~~ nvm, the prompt seems to be cut off in the log only.

stduhpf · 2026-06-21T21:25:53Z

Doesn't seem to work. I don't think it's worth spending more time on that.

Green-Sky · 2026-06-21T21:29:56Z

Well, thanks for looking at it anyway :)

feat: experimental vision-based ref support for ideogram4

b47c598

stduhpf force-pushed the ideogram-vision branch from 2490401 to b47c598 on June 21, 2026 18:36

Green-Sky reviewed Jun 21, 2026

start at 0

30207fa

stduhpf closed this Jun 21, 2026

stduhpf wants to merge 2 commits into leejet:master from stduhpf:ideogram-vision

2 participants
