feat: experimental vision-based ref support for ideogram4 #1690
|
Doesn't seem to be doing much. Probably not worth merging. |
On the commit before your force push just now. |
|
It looks to me like we still need to disable the path that uses the VAE latent as a context ref. |
|
@Green-Sky Have you tried boogu_image_edit? I suspect it might not work either. To run the Qwen3VL mmproj I had to patch the name conversion like this:
diff --git a/src/name_conversion.cpp b/src/name_conversion.cpp
index da2a8d5..f2618c4 100644
--- a/src/name_conversion.cpp
+++ b/src/name_conversion.cpp
@@ -153,8 +153,9 @@ std::string convert_cond_stage_model_name(std::string name, std::string prefix)
};
static const std::vector<std::pair<std::string, std::string>> llm_vision_name_map{
- {"mm.", "merger.mlp."},
- {"v.post_ln.", "merger.ln_q."},
+ // {"mm.", "merger.mlp."},
+ // {"v.post_ln.", "merger.ln_q."},
+ {"v.patch_embd.bias", "patch_embed.proj.1.bias"},
{"v.patch_embd.weight", "patch_embed.proj.0.weight"},
{"patch_embed.proj.0.weight.1", "patch_embed.proj.1.weight"},
{"v.patch_embd.weight.1", "patch_embed.proj.1.weight"},
@@ -163,11 +164,18 @@ std::string convert_cond_stage_model_name(std::string name, std::string prefix)
{"attn_k.", "attn.k_proj."},
{"attn_v.", "attn.v_proj."},
{"attn_out.", "attn.proj."},
- {"ffn_down.", "mlp.down_proj."},
+ {"ffn_down.", "mlp.linear_fc2."},
+ // {"ffn_down.", "mlp.down_proj."},
{"ffn_gate.", "mlp.gate_proj."},
- {"ffn_up.", "mlp.up_proj."},
+ {"ffn_up.", "mlp.linear_fc1."},
+ // {"ffn_up.", "mlp.up_proj."},
{"ln1.", "norm1."},
{"ln2.", "norm2."},
+ {"attn_qkv.", "attn.qkv."},
+ {"mm.0.", "merger.linear_fc1."},
+ {"mm.2.", "merger.linear_fc2."},
+ {"v.post_ln.", "merger.norm."},
+ {"v.position_embd.", "pos_embed."},
};
if (contains(name, "t5xxl")) {
replace_with_name_map(name, t5_name_map);
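
To make the effect of the table changes above easier to see, here is a minimal sketch of an ordered substring-replacement map. It assumes that each entry replaces the first occurrence of its pattern, in table order; the real `replace_with_name_map` may behave differently. `apply_name_map` and the tensor names in the usage note are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using NameMap = std::vector<std::pair<std::string, std::string>>;

// Hypothetical stand-in for the project's name-map helper (assumption:
// first occurrence of each pattern is replaced, entries applied in order).
std::string apply_name_map(std::string name, const NameMap& name_map) {
    for (const auto& entry : name_map) {
        const std::string& from = entry.first;
        const std::string& to   = entry.second;
        assert(!from.empty());  // an empty pattern would match everywhere
        const std::string::size_type pos = name.find(from);
        if (pos != std::string::npos) {
            name.replace(pos, from.size(), to);
        }
    }
    return name;
}
```

Under these assumptions, an illustrative tensor name such as `mm.0.weight` becomes `merger.mlp.0.weight` with the old `{"mm.", "merger.mlp."}` entry, and `merger.linear_fc1.weight` with the patched `{"mm.0.", "merger.linear_fc1."}` entry, which is why the generic `mm.` entry is commented out in the patch.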
|
Yes, it's a bit of wasted compute, but that's probably not what's causing the crash. |
|
Posting my command.
{
"high_level_description": "An amateurish photograph of a wooden desk with a specific item from a reference image placed centrally.",
"compositional_deconstruction": {
"background": "A worn wooden desk filling the entire frame with visible scratches and dust, soft uneven natural daylight from an unmarked window to the left, a beige wall in the upper right corner, low resolution and slight camera shake evident.",
"elements": [
{
"type": "obj",
"desc": "The item from the reference image resting in the center of the desk, its distinct features clearly visible despite the amateurish lighting and low resolution, casting a vague shadow to the right due to the uneven light source."
}
]
}
} |
auto image_embed = llm->encode_image(n_threads, resized_image, false, true, true);
GGML_ASSERT(!image_embed.empty());

img_prompt += "Picture " + std::to_string(i + 1) + ": <|vision_start|>";
"Picture 1: <|vision_start|>"
According to the prompt template for that model, the picture indices are supposed to start at 0.
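
A minimal sketch of what a zero-based variant could look like, for comparison with the reviewed line. `build_picture_prefix` and its `first_index` parameter are hypothetical, and only the `Picture N: <|vision_start|>` prefix from the reviewed line is reproduced; whatever the real code appends after it is left out.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: builds the "Picture N: <|vision_start|>" prefixes
// for n_images reference images, numbering them from first_index.
std::string build_picture_prefix(int n_images, int first_index) {
    assert(n_images >= 0);
    std::string img_prompt;
    for (int i = 0; i < n_images; ++i) {
        // first_index = 1 matches the reviewed line (i + 1);
        // first_index = 0 matches the numbering suggested in this comment.
        img_prompt += "Picture " + std::to_string(first_index + i) + ": <|vision_start|>";
    }
    return img_prompt;
}
```

Calling `build_picture_prefix(1, 0)` yields `Picture 0: <|vision_start|>`, while `build_picture_prefix(1, 1)` reproduces the string quoted above.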
|
You are right: not only does it not work, but the rest of the prompt also seems to be cut off. edit: |
|
Doesn't seem to work. I don't think it's worth spending more time on that. |
|
Well, thanks for looking at it anyway :) |

Summary
Enables Qwen3-VL vision for Ideogram4 conditioning.
Related Issue / Discussion
#1679
Additional Information
Not sure if it actually works; needs testing.
Checklist