
Fix BOS token embedding: use embed_tokens() instead of raw integer #264

Open

Mr-Neutr0n wants to merge 1 commit into OpenGVLab:main from Mr-Neutr0n:fix/bos-token-embedding

Conversation

@Mr-Neutr0n

Summary

  • Bug: In the forward() methods across 4 model files, the BOS token position in inputs_embeds is set using the raw integer bos_token_id (e.g., 1) instead of its embedding vector. Since inputs_embeds is a float tensor produced by embed_tokens(), assigning a scalar integer broadcasts that value across the entire embedding dimension, corrupting the BOS position with a uniform constant instead of the learned BOS embedding (a minimal sketch of this broadcast follows below).
  • Fix: Pass bos_token_id through embed_tokens() to obtain the proper embedding vector before assigning it to inputs_embeds[:, :1]. The fix respects each file's model attribute path (llama_model vs mistral_model) and LoRA configuration (base_model.model.model vs model).
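To make the broadcast issue concrete, here is a minimal, self-contained sketch. It is not code from the repository; the toy nn.Embedding and the sizes are purely illustrative. It shows how the buggy assignment fills the BOS position with a constant, while looking the id up through the embedding layer yields the learned vector:

import torch
import torch.nn as nn

embed_tokens = nn.Embedding(num_embeddings=32000, embedding_dim=8)
input_ids = torch.randint(0, 32000, (2, 5))   # [B=2, seq_len=5]
inputs_embeds = embed_tokens(input_ids)       # [2, 5, 8] float tensor
bos_token_id = 1

# Buggy: the scalar 1 is broadcast over the hidden dimension,
# so position 0 becomes a constant vector of ones.
inputs_embeds[:, :1] = bos_token_id
print(inputs_embeds[0, 0])                    # tensor([1., 1., ..., 1.])

# Fixed: look up the learned BOS embedding first, then assign it.
bos_ids = torch.tensor([[bos_token_id]])      # shape [1, 1]
inputs_embeds[:, :1] = embed_tokens(bos_ids)  # [1, 1, 8] broadcasts over the batch
print(inputs_embeds[0, 0])                    # values vary across the hidden dimension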

Affected files

  • video_chat/models/videochat_it.py
  • video_chat2/models/videochat_vicuna/videochat2_it_vicuna.py
  • video_chat2/models/videochat_mistra/videochat2_it_mistral.py
  • video_chat2/models/videochat_mistra/videochat2_it_hd_mistral.py

Before (buggy)

# inputs_embeds is a float tensor of shape [B, seq_len, hidden_dim]
# bos_token_id is an integer (e.g., 1)
# This broadcasts the scalar 1 across all hidden_dim dimensions
inputs_embeds[:, :1] = self.llama_tokenizer.bos_token_id

After (fixed)

# Look up the BOS id as a [1, 1] tensor on the same device as inputs_embeds
bos_token_id = torch.tensor([[self.llama_tokenizer.bos_token_id]], device=inputs_embeds.device)
if self.use_lora:
    # With LoRA, embed_tokens sits under the PEFT wrapper's base_model.model
    bos_embeds = self.llama_model.base_model.model.model.embed_tokens(bos_token_id)
else:
    bos_embeds = self.llama_model.model.embed_tokens(bos_token_id)
# bos_embeds has shape [1, 1, hidden_dim] and broadcasts over the batch dimension
inputs_embeds[:, :1] = bos_embeds

Test plan

  • Verify that inputs_embeds[:, :1] now contains a proper embedding vector (values varying across the hidden dimension) rather than a uniform scalar; a verification sketch follows this list
  • Run training on a small sample to confirm loss values are comparable or improved
  • Check that inference outputs remain coherent with the corrected BOS embedding
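A hedged sketch of that first check, assuming it is run from a debugging hook inside forward() right after inputs_embeds is built. check_bos_position, and the way embed_tokens and bos_token_id are passed in, are illustrative names rather than anything defined in the repository:

import torch

def check_bos_position(inputs_embeds, embed_tokens, bos_token_id):
    # The first position of every sequence should equal the learned BOS embedding.
    bos_slice = inputs_embeds[:, 0]                                  # [B, hidden_dim]
    expected = embed_tokens(
        torch.tensor([bos_token_id], device=inputs_embeds.device)
    )                                                                # [1, hidden_dim]
    uniform = torch.full_like(bos_slice, float(bos_token_id))
    assert not torch.allclose(bos_slice, uniform), \
        "BOS position is still the broadcast scalar"
    assert torch.allclose(bos_slice, expected.expand_as(bos_slice), atol=1e-5), \
        "BOS position does not match the learned BOS embedding"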

Commit message: Fix BOS token embedding: use embed_tokens() instead of raw integer

In the forward() methods, the BOS token position in inputs_embeds was
being set to the raw integer bos_token_id (e.g. 1) instead of the
actual embedding vector. Since inputs_embeds is a float tensor produced
by embed_tokens(), assigning a scalar integer broadcasts that value
across the entire embedding dimension, corrupting the BOS representation.

Fix by passing bos_token_id through embed_tokens() to obtain the correct
embedding vector before assignment. This is applied consistently across
all four model files, respecting each file's model attribute path and
LoRA configuration.

Affected files:
- video_chat/models/videochat_it.py
- video_chat2/models/videochat_vicuna/videochat2_it_vicuna.py
- video_chat2/models/videochat_mistra/videochat2_it_mistral.py
- video_chat2/models/videochat_mistra/videochat2_it_hd_mistral.py