
Fix(fun-asr-dataset): fix dynamic masking error in input_ids #2801

Merged
LauraGPT merged 1 commit into modelscope:main from roney123:fix_dataset on Feb 4, 2026

Conversation

@roney123 (Contributor) commented Feb 2, 2026

Problem: the input_ids sequence incorrectly contained -100 masking tokens.

Solution:

  • Use original target_ids for input construction
  • Apply masking only to labels
  • Ensure input_ids contain only valid token IDs

Result: correctly implements the simulated prev-token fixed output.
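A minimal sketch of the corrected preparation, assuming hypothetical names (build_example, prompt_ids, mask_prob) and the standard -100 ignore index used by PyTorch's cross-entropy loss; the actual FunASR dataset code may differ:

```python
import copy
import random

IGNORE_INDEX = -100  # ignored by PyTorch's CrossEntropyLoss

def build_example(prompt_ids, target_ids, mask_prob=0.15):
    # Mask a *copy* of target_ids for the labels; never touch the inputs.
    target_labels = copy.deepcopy(target_ids)
    for i in range(len(target_labels)):
        if random.random() < mask_prob:
            target_labels[i] = IGNORE_INDEX

    # input_ids are built from the original, unmasked target_ids.
    input_ids = prompt_ids + target_ids
    # Loss is computed only on the (dynamically masked) target positions.
    labels = [IGNORE_INDEX] * len(prompt_ids) + target_labels
    return input_ids, labels
```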

@gemini-code-assist (bot) commented

Summary of Changes

Hello @roney123, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves an issue where the dynamic masking mechanism incorrectly inserted -100 tokens into the input_ids sequence. The changes ensure that input_ids are built from the original, unmasked target tokens, while the masking operation is correctly applied only to the labels sequence. This refinement in data preparation guarantees that the model receives clean inputs and appropriately masked labels for the prev-token fixed output simulation.

Highlights

  • Corrected Dynamic Masking Application: Introduced a separate target_labels variable, which is a copy of target_ids, to ensure that dynamic masking is applied exclusively to the labels and not to the input sequence.
  • Ensured Valid Input IDs: Modified the data processing to guarantee that input_ids are constructed using the original, unmasked target_ids, preventing the inclusion of -100 masking tokens in the model's input.
  • Accurate Label Generation: Updated the labels sequence to incorporate the dynamically masked target_labels, aligning the masking behavior with the intended prev-token fixed output simulation.
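As a quick sanity check of the invariants listed above, reusing the hypothetical build_example sketch from the PR description:

```python
input_ids, labels = build_example(prompt_ids=[1, 2], target_ids=[10, 11, 12, 13])

assert IGNORE_INDEX not in input_ids      # inputs hold only valid token IDs
assert len(input_ids) == len(labels)      # inputs and labels stay aligned
assert labels[:2] == [IGNORE_INDEX] * 2   # prompt positions never contribute to the loss
```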


@gemini-code-assist (bot) left a comment
Code Review

This pull request correctly fixes a bug where masking tokens were incorrectly added to input_ids. By creating a copy of target_ids for masking and using the original for input_ids, the issue is resolved. The change is clear and effective. I've added one suggestion to improve maintainability by refactoring the new masking logic.
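The suggestion itself is not quoted in this excerpt; one plausible shape for such a refactor, pulling the masking into a single side-effect-free helper, might look like this (hypothetical names, not the code actually merged):

```python
import random

def mask_tokens(token_ids, mask_prob, ignore_index=-100):
    """Return a masked copy of token_ids; the input list is left untouched."""
    return [ignore_index if random.random() < mask_prob else tok
            for tok in token_ids]

# build_example (from the sketch above) would then reduce to:
#     target_labels = mask_tokens(target_ids, mask_prob)
```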

LauraGPT merged commit 81d96c9 into modelscope:main on Feb 4, 2026
roney123 deleted the fix_dataset branch on February 4, 2026, 11:51
