Conversation
    def test_tokenizer_w_generation_prompt(self):
      verify_chat_template_generation_prompt_logic(self.qwen3_tokenizer)

    def test_tokenizer_wo_generation_promt(self):
nit: test_tokenizer_wo_generation_prompt
    "gsutil",
    "cp",
    "-r",
    "gs://maxtext-dataset/hf/llama2-chat-tokenizer",
Is it okay to use a private gs: location?
Hi @SurbhiJainUSC, do you have any suggestions? I am referencing the practice here.
    ValueError: If the `add_generation_prompt` tokens do not exactly
        match the beginning of an assistant message in the template.
    """
    dummy_msgs = [{"role": "user", "content": "Test message"}]
Can you also include a system prompt in the verification?
Added. I also added a fallback for models that do not support system prompts (e.g., gemma-2b-it).
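The fallback mentioned above could look like the sketch below. This is a hypothetical helper (the name `build_verification_messages` and the exact messages are assumptions, not the PR's code): it tries rendering a conversation that includes a system prompt and falls back to a user-only conversation when the chat template rejects the "system" role.

```python
def build_verification_messages(tokenizer_model):
  """Build messages for chat-template verification, preferring a system prompt.

  Hypothetical sketch: falls back to a user-only conversation for models
  whose chat templates reject the "system" role (e.g. gemma-2b-it).
  """
  msgs_with_system = [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Test message"},
  ]
  msgs_user_only = [{"role": "user", "content": "Test message"}]
  try:
    # If the template cannot handle a system role, this raises.
    tokenizer_model.apply_chat_template(msgs_with_system, add_generation_prompt=False, tokenize=True)
    return msgs_with_system
  except Exception:
    return msgs_user_only
```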
    prompt_wo_gen = tokenizer_model.apply_chat_template(dummy_msgs, add_generation_prompt=False, tokenize=True)
    prompt_with_gen = tokenizer_model.apply_chat_template(dummy_msgs, add_generation_prompt=True, tokenize=True)
    # Extract the tokenized generation prompt (the expected assistant prefix)
    assistant_prefix_tokens = prompt_with_gen[len(prompt_wo_gen) :]
Add a check before this:

    if prompt_with_gen[:len(prompt_wo_gen)] != prompt_wo_gen:
        raise ValueError("Unable to extract generation prompt tokens.")
    return False

    def _extract_token_ids(tokens):
Refactor the token_ids extraction into a function so it can be reused in verify_chat_template_generation_prompt_logic.
Description
This PR adds tests as a follow-up to PR #3284, which added the generation prompt to the prompt tokens to improve SFT masking.
It adds an additional check to ensure the generation prompt behaves as expected for SFT masking. If the tokenizer's chat template is misconfigured, the pipeline will now quickly raise an error.
It verifies the tokenized output when `add_generation_prompt` in `tokenizer.apply_chat_template` is set to `True` or `False` (ref).
Thanks to @vlad-karp for raising the concern that led to adding this test!
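The check described above can be sketched end to end. This is a hedged, self-contained sketch of `verify_chat_template_generation_prompt_logic` (the name appears in the diff, but this body is an assumption about it, and it assumes `apply_chat_template` returns a flat list of token ids):

```python
def verify_chat_template_generation_prompt_logic(tokenizer_model):
  """Fail fast if the chat template's add_generation_prompt behavior is broken.

  Hypothetical sketch of the check described in this PR; the real helper
  may differ in details.
  """
  dummy_msgs = [{"role": "user", "content": "Test message"}]
  prompt_wo_gen = tokenizer_model.apply_chat_template(dummy_msgs, add_generation_prompt=False, tokenize=True)
  prompt_with_gen = tokenizer_model.apply_chat_template(dummy_msgs, add_generation_prompt=True, tokenize=True)
  # The with-prompt rendering must extend the without-prompt rendering.
  if prompt_with_gen[: len(prompt_wo_gen)] != prompt_wo_gen:
    raise ValueError("Unable to extract generation prompt tokens.")
  assistant_prefix_tokens = prompt_with_gen[len(prompt_wo_gen):]
  if not assistant_prefix_tokens:
    raise ValueError("add_generation_prompt=True produced no extra tokens.")
  return assistant_prefix_tokens
```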
Tests
Checklist
Before submitting this PR, please make sure (put X in square brackets):
`gemini-review` label.