Conversation

ItzikVa (Collaborator) commented on May 14, 2026
- Make buffers non-persistent: `persistent=False` removes them from `state_dict()`, so accelerate ignores them entirely. These buffers are config-derived metadata, not learned weights (see the sketch after this list).
- Rebuild buffers from config on first forward: when `from_pretrained` runs with `device_map="auto"`, accelerate's `init_empty_weights()` zeros all tensors during initialization. Since non-persistent buffers aren't restored from the checkpoint, they stay as zeros. We detect this on the first forward call and rebuild `adapter_token_ids`, `token_to_group_mask`, and `adapter_hiding_matrix` from config values.
- Device sync: ensure the rebuilt buffers are placed on the same device as the input tensors.
- Suppress warnings for old checkpoints: add `_keys_to_ignore_on_load_unexpected` so existing checkpoints (which still contain these keys) load without "UNEXPECTED" warnings.
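A minimal sketch of the non-persistent registration, assuming a module that derives these tensors from config; the class name, mask layout, and hiding-matrix contents below are illustrative, not the repo's actual code:

```python
import torch
import torch.nn as nn

class AdapterModule(nn.Module):  # hypothetical stand-in for the real module
    def __init__(self, config):
        super().__init__()
        # Config-derived metadata, not learned weights. persistent=False keeps
        # these out of state_dict(), so accelerate's device_map planner never
        # has to assign them to a device.
        ids = torch.tensor(config.adapter_token_ids, dtype=torch.long)
        self.register_buffer("adapter_token_ids", ids, persistent=False)

        mask = torch.zeros(config.vocab_size, dtype=torch.bool)
        mask[ids] = True  # illustrative; the real mask maps tokens to groups
        self.register_buffer("token_to_group_mask", mask, persistent=False)

        self.register_buffer(
            "adapter_hiding_matrix", torch.eye(len(ids)), persistent=False
        )
```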
accelerate raises a `ValueError` when `device_map="auto"` encounters persistent buffers (`adapter_token_ids`, `token_to_group_mask`, `adapter_hiding_matrix`) that it cannot assign to a device. These buffers are config-derived metadata reconstructed at `__init__` time, so they don't need to be in `state_dict()`.

Also adds docs/VLLM_QUANTIZATION_STATUS.md summarizing the status of the vLLM quantization work for future reference.

Fixes: #16
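For context, the failing call is the standard sharded load; the model class and checkpoint id below are placeholders (the repo may use a custom model class):

```python
from transformers import AutoModelForCausalLM

# Before this fix, accelerate raised ValueError while planning the device map
# because it could not place the three persistent buffers. With
# persistent=False they never reach the planner.
model = AutoModelForCausalLM.from_pretrained(
    "org/adapter-model",  # hypothetical checkpoint id
    device_map="auto",
)
```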
accelerate's `device_map` does not move non-persistent buffers to the GPU. Add a one-time device sync at the start of the switch block so `adapter_token_ids`, `token_to_group_mask`, and `adapter_hiding_matrix` follow the input tensors to the correct device.
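Roughly, the sync looks like this (shown at the top of `forward` for brevity; in the PR it sits at the start of the switch block, and `input_ids` stands in for whichever input tensor carries the target device):

```python
def forward(self, input_ids, *args, **kwargs):
    # device_map moves parameters and persistent buffers, but non-persistent
    # buffers can be left behind on CPU; follow the inputs once.
    if self.adapter_token_ids.device != input_ids.device:
        self.adapter_token_ids = self.adapter_token_ids.to(input_ids.device)
        self.token_to_group_mask = self.token_to_group_mask.to(input_ids.device)
        self.adapter_hiding_matrix = self.adapter_hiding_matrix.to(input_ids.device)
    # ... rest of forward unchanged
```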
Suppresses "UNEXPECTED" warnings when loading checkpoints that still contain the now non-persistent buffer keys (adapter_token_ids, token_to_group_mask, adapter_hiding_matrix).
accelerate's `init_empty_weights()` zeros all buffers during `from_pretrained` with a `device_map`. Since non-persistent buffers aren't restored from the checkpoint, they stay as zeros. Detect this on the first forward call and rebuild `adapter_token_ids`, `token_to_group_mask`, and `adapter_hiding_matrix` from config values.
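A sketch of the detection and rebuild, assuming an all-zero `adapter_token_ids` can only mean the buffers were never materialized; the helper name and mask layout are illustrative:

```python
import torch

def _maybe_rebuild_buffers(self, device):  # hypothetical helper on the module
    # After init_empty_weights() + device_map, non-persistent buffers come
    # back as zeros: nothing in the checkpoint restores them. Real adapter
    # token ids are never all zeros, so that is the detection signal.
    if self.adapter_token_ids.any():
        return  # buffers were built normally in __init__

    ids = torch.tensor(self.config.adapter_token_ids, dtype=torch.long, device=device)
    self.adapter_token_ids = ids

    mask = torch.zeros(self.config.vocab_size, dtype=torch.bool, device=device)
    mask[ids] = True  # illustrative; the real mask maps tokens to groups
    self.token_to_group_mask = mask

    self.adapter_hiding_matrix = torch.eye(len(ids), device=device)  # illustrative
```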
antonpibm reviewed on May 14, 2026
```diff
@@ -0,0 +1,75 @@
+# vLLM Quantization — Status Summary
```