feat: experimental vision-based ref support for ideogram4#1690

Closed
stduhpf wants to merge 2 commits into leejet:master from stduhpf:ideogram-vision

stduhpf commented Jun 21, 2026

Contributor

Summary

Enables Qwen3-VL vision for Ideogram4 conditioning.

Related Issue / Discussion

#1679

Additional Information

Not sure if it actually works, needs testing.


stduhpf commented Jun 21, 2026

Contributor Author

Doesn't seem to be doing much. Probably not worth merging.

@Green-Sky

Contributor
[INFO ] stable-diffusion.cpp:3984 - EDIT mode
[DEBUG] stable-diffusion.cpp:3994 - auto resize ref images
[DEBUG] stable-diffusion.cpp:4006 - resize vae ref image 0 from 768x768 to 768x768
[INFO ] ggml_graph_cut.cpp:938  - vae build cached graph cut plan done (taking 0 ms)
[DEBUG] model_loader.cpp:990  - loading 108/248 tensors from models/flux2/full_encoder_small_decoder.safetensors
  |##################################################| 108/108 - 653.35MB/s
[INFO ] model_loader.cpp:1227 - loading tensors completed, taking 0.20s (read: 0.01s, memcpy: 0.00s, convert: 0.01s, copy_to_backend: 0.00s)
[DEBUG] model_manager.cpp:218  - model manager prepared params backend buffer ( 65.72 MB, 108 tensors, RAM)
[DEBUG] model_manager.cpp:309  - model manager staged compute params ( 65.72 MB, 108 tensors) to CUDA0, taking 0.01s
[DEBUG] ggml_extend.hpp:1995 - vae compute buffer size: 1909.13 MB(VRAM)
[DEBUG] vae.hpp:161  - computing vae encode graph completed, taking 0.82s
[INFO ] stable-diffusion.cpp:4093 - encode_first_stage completed, taking 0.94s
/build/ld1qymfwc2f9ib04nh43zfb1lx01gklv-source/src/model/te/llm.hpp:1457: GGML_ASSERT(pos_embed != nullptr) failed

On the commit before your force push just now.

@Green-Sky

Contributor

It looks to me like we still need to disable the vae latent as context ref path.

@stduhpf

stduhpf commented Jun 21, 2026

Contributor Author

@Green-Sky Have you tried boogu_image_edit? Because I suspect it might not work either.

To run the Qwen3VL mmproj I had to patch the name conversion like this:

diff --git a/src/name_conversion.cpp b/src/name_conversion.cpp
index da2a8d5..f2618c4 100644
--- a/src/name_conversion.cpp
+++ b/src/name_conversion.cpp
@@ -153,8 +153,9 @@ std::string convert_cond_stage_model_name(std::string name, std::string prefix)
     };
 
     static const std::vector<std::pair<std::string, std::string>> llm_vision_name_map{
-        {"mm.", "merger.mlp."},
-        {"v.post_ln.", "merger.ln_q."},
+        // {"mm.", "merger.mlp."},
+        // {"v.post_ln.", "merger.ln_q."},
+        {"v.patch_embd.bias", "patch_embed.proj.1.bias"},
         {"v.patch_embd.weight", "patch_embed.proj.0.weight"},
         {"patch_embed.proj.0.weight.1", "patch_embed.proj.1.weight"},
         {"v.patch_embd.weight.1", "patch_embed.proj.1.weight"},
@@ -163,11 +164,18 @@ std::string convert_cond_stage_model_name(std::string name, std::string prefix)
         {"attn_k.", "attn.k_proj."},
         {"attn_v.", "attn.v_proj."},
         {"attn_out.", "attn.proj."},
-        {"ffn_down.", "mlp.down_proj."},
+        {"ffn_down.", "mlp.linear_fc2."},
+        // {"ffn_down.", "mlp.down_proj."},
         {"ffn_gate.", "mlp.gate_proj."},
-        {"ffn_up.", "mlp.up_proj."},
+        {"ffn_up.", "mlp.linear_fc1."},
+        // {"ffn_up.", "mlp.up_proj."},
         {"ln1.", "norm1."},
         {"ln2.", "norm2."},
+        {"attn_qkv.", "attn.qkv."},
+        {"mm.0.", "merger.linear_fc1."},
+        {"mm.2.", "merger.linear_fc2."},
+        {"v.post_ln.", "merger.norm."},
+        {"v.position_embd.", "pos_embed."},
     };
     if (contains(name, "t5xxl")) {
         replace_with_name_map(name, t5_name_map);
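
As a rough illustration of what a name map like this does, here is a minimal sketch of ordered substring replacement, assuming each entry replaces the first occurrence of its key. `apply_name_map` and `qwen3vl_vision_subset` are hypothetical names used only for this sketch; the project's actual `replace_with_name_map` may behave differently. The entries are copied from the added lines of the diff.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using NameMap = std::vector<std::pair<std::string, std::string>>;

// Hypothetical stand-in for the project's replace_with_name_map:
// walk the map in order and replace the first occurrence of each key.
std::string apply_name_map(std::string name, const NameMap& name_map) {
    for (const auto& entry : name_map) {
        assert(!entry.first.empty());  // an empty key would match at position 0
        const std::string::size_type pos = name.find(entry.first);
        if (pos != std::string::npos) {
            name.replace(pos, entry.first.size(), entry.second);
        }
    }
    return name;
}

// Subset of the entries added by the patch (assumed sufficient for illustration).
const NameMap qwen3vl_vision_subset{
    {"ffn_down.", "mlp.linear_fc2."},
    {"ffn_up.", "mlp.linear_fc1."},
    {"attn_qkv.", "attn.qkv."},
    {"mm.0.", "merger.linear_fc1."},
    {"mm.2.", "merger.linear_fc2."},
    {"v.post_ln.", "merger.norm."},
    {"v.position_embd.", "pos_embed."},
};
```

Under these assumptions, an input such as `v.position_embd.weight` maps to `pos_embed.weight`, and `mm.0.weight` maps to `merger.linear_fc1.weight`. Whether the missing `v.position_embd.` entry is what triggered the `pos_embed != nullptr` assertion is not confirmed in this thread.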

@stduhpf

stduhpf commented Jun 21, 2026

Contributor Author

> It looks to me like we still need to disable the vae latent as context ref path.

Yes, it's a bit of wasted compute, but that's probably not what's causing the crash.

@Green-Sky

Contributor

Posting my command.

sd-cli --diffusion-model models/ideogram4-Q4_K.gguf --llm models/Qwen3-VL-8B-Instruct-UD-Q4_K_XL.gguf --llm_vision models/Qwen3-VL-8B-Instruct-mmproj-F16.gguf --vae models/flux2/full_encoder_small_decoder.safetensors --lora-model-dir models/loras/ig4/ --fa -v --offload-to-cpu --max-vram 5 -H 768 -W 768 --steps 4 --cfg-scale 1 -p "$(cat ./ig4prompt_5.json)<lora:ideogram_4_turbotime_v1:1>" -r ig4_cup.png

ig4prompt_5.json, pretty-printed:

{
  "high_level_description": "An amateurish photograph of a wooden desk with a specific item from a reference image placed centrally.",
  "compositional_deconstruction": {
    "background": "A worn wooden desk filling the entire frame with visible scratches and dust, soft uneven natural daylight from an unmarked window to the left, a beige wall in the upper right corner, low resolution and slight camera shake evident.",
    "elements": [
      {
        "type": "obj",
        "desc": "The item from the reference image resting in the center of the desk, its distinct features clearly visible despite the amateurish lighting and low resolution, casting a vague shadow to the right due to the uneven light source."
      }
    ]
  }
}

ref img:
ig4_cup

Comment thread src/conditioning/conditioner.hpp Outdated
auto image_embed = llm->encode_image(n_threads, resized_image, false, true, true);
GGML_ASSERT(!image_embed.empty());

img_prompt += "Picture " + std::to_string(i + 1) + ": <|vision_start|>";

Contributor

"Picture 1: <|vision_start|>"
According to the prompt template for that model, the picture indices are supposed to start at 0.
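
To make the off-by-one concrete, here is a minimal sketch of the label construction with the starting number made explicit. `picture_label` and its `first_label` parameter are hypothetical and not part of the PR; only the `"Picture N: <|vision_start|>"` string comes from the reviewed line, and whatever follows the vision start token in the real prompt is left out.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: build the label for the i-th reference image.
// `i` is the 0-based loop index; `first_label` is the number shown for the
// first image (1 reproduces the reviewed line, 0 is what the review suggests).
std::string picture_label(int i, int first_label) {
    assert(i >= 0);
    return "Picture " + std::to_string(i + first_label) + ": <|vision_start|>";
}
```

With `first_label = 1` the first image is labeled `Picture 1`, matching the reviewed code; with `first_label = 0` it is labeled `Picture 0`, which is what the comment asks for.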

Green-Sky commented Jun 21, 2026

Contributor

You are right: not only does it not work, but the rest of the prompt also seems to be cut off.

Edit: actually, that might be the reason. Never mind, it seems to be cut off in the log only.

stduhpf commented Jun 21, 2026

Contributor Author

Doesn't seem to work. I don't think it's worth spending more time on that.

stduhpf closed this Jun 21, 2026

@Green-Sky

Contributor

Well, thanks for looking at it anyway :)
