Issue 23 - Multi-GPU device_map fix for GraniteSwitch #31

Open

ItzikVa wants to merge 4 commits into main from issue-23

Conversation

ItzikVa (Collaborator) commented May 14, 2026

  1. Make buffers non-persistent - persistent=False removes them from state_dict(), so accelerate ignores them entirely. These buffers are config-derived metadata, not learned weights (see the registration sketch after this list).
  2. Rebuild buffers from config on first forward - When from_pretrained uses device_map="auto", accelerate's init_empty_weights() zeros all tensors during initialization. Since non-persistent buffers aren't restored from the checkpoint, they stay as zeros. We detect this on the first forward call and rebuild adapter_token_ids, token_to_group_mask, and adapter_hiding_matrix from config values.
  3. Device sync - Ensure rebuilt buffers are placed on the same device as the input tensors.
  4. Suppress warnings for old checkpoints - Add _keys_to_ignore_on_load_unexpected so existing checkpoints (which still contain these keys) load without "UNEXPECTED" warnings.
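
A minimal sketch of the buffer registration from item 1, assuming a hypothetical `GraniteSwitchModel` module and config fields named after the buffers (`config.adapter_token_ids`, etc.); only the three buffer names come from this PR:

```python
import torch
import torch.nn as nn


class GraniteSwitchModel(nn.Module):  # hypothetical stand-in for the real module
    def __init__(self, config):
        super().__init__()
        # persistent=False keeps these config-derived tensors out of state_dict(),
        # so accelerate's device_map="auto" never has to assign them to a device.
        self.register_buffer(
            "adapter_token_ids",
            torch.tensor(config.adapter_token_ids, dtype=torch.long),
            persistent=False,
        )
        self.register_buffer(
            "token_to_group_mask",
            torch.tensor(config.token_to_group_mask, dtype=torch.bool),
            persistent=False,
        )
        self.register_buffer(
            "adapter_hiding_matrix",
            torch.tensor(config.adapter_hiding_matrix, dtype=torch.float32),
            persistent=False,
        )
```

Since the values are derived entirely from the config, dropping them from the checkpoint loses no information.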

ItzikVa added 4 commits May 13, 2026 11:44
accelerate raises ValueError when device_map="auto" encounters persistent
buffers (adapter_token_ids, token_to_group_mask, adapter_hiding_matrix) that
it cannot assign to a device. These buffers are config-derived metadata
reconstructed at __init__ time, so they don't need to be in state_dict().

Also adds docs/VLLM_QUANTIZATION_STATUS.md summarizing vLLM quantization
work status for future reference.

Fixes: #16
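
For reference, the load path this commit targets is the standard sharded from_pretrained call; the checkpoint id below is hypothetical:

```python
from transformers import AutoModelForCausalLM

# Before this commit, accelerate raised a ValueError while trying to assign the
# persistent buffers to devices; with them non-persistent, the sharded load runs.
model = AutoModelForCausalLM.from_pretrained(
    "org/granite-switch-checkpoint",  # hypothetical checkpoint id
    device_map="auto",
    torch_dtype="auto",
)
```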
accelerate's device_map does not move non-persistent buffers to GPU.
Add a one-time device sync at the start of the switch block so
adapter_token_ids, token_to_group_mask, and adapter_hiding_matrix
follow the input tensors to the correct device.
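
A sketch of what such a one-time sync could look like at the top of the switch block's forward; the signature is illustrative, only the buffer names come from this PR:

```python
def forward(self, hidden_states, **kwargs):
    # One-time device sync: device_map does not move non-persistent buffers,
    # so let them follow the input tensors to whichever device those landed on.
    if self.adapter_token_ids.device != hidden_states.device:
        self.adapter_token_ids = self.adapter_token_ids.to(hidden_states.device)
        self.token_to_group_mask = self.token_to_group_mask.to(hidden_states.device)
        self.adapter_hiding_matrix = self.adapter_hiding_matrix.to(hidden_states.device)
    # ... rest of the switch-block computation ...
```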
Suppresses "UNEXPECTED" warnings when loading checkpoints that still
contain the now non-persistent buffer keys (adapter_token_ids,
token_to_group_mask, adapter_hiding_matrix).
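
This is a class-level attribute in transformers; a hedged sketch (the base class shown is an assumption, the model class name comes from the linked issue):

```python
from transformers import PreTrainedModel


class GraniteSwitchForCausalLM(PreTrainedModel):  # real base class may differ
    # Old checkpoints still serialize these keys; listing them here keeps
    # from_pretrained from reporting them as UNEXPECTED on load.
    _keys_to_ignore_on_load_unexpected = [
        r"adapter_token_ids",
        r"token_to_group_mask",
        r"adapter_hiding_matrix",
    ]
```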
accelerate's init_empty_weights() zeros all buffers during
from_pretrained with device_map. Since non-persistent buffers aren't
restored from the checkpoint, they stay as zeros. Detect this on first
forward call and rebuild adapter_token_ids, token_to_group_mask, and
adapter_hiding_matrix from config values.
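
A sketch of the rebuild helper, meant to live on the same module as the registration sketch above and be called at the start of the first forward; the config field names are assumptions:

```python
import torch


def _maybe_rebuild_switch_buffers(self, device):
    # init_empty_weights() leaves non-persistent buffers zeroed and they are not
    # restored from the checkpoint, so an all-zero adapter_token_ids buffer on
    # the first forward call means the metadata must be rebuilt from the config.
    if not bool(self.adapter_token_ids.any()):
        self.adapter_token_ids = torch.tensor(
            self.config.adapter_token_ids, dtype=torch.long, device=device
        )
        self.token_to_group_mask = torch.tensor(
            self.config.token_to_group_mask, dtype=torch.bool, device=device
        )
        self.adapter_hiding_matrix = torch.tensor(
            self.config.adapter_hiding_matrix, dtype=torch.float32, device=device
        )
```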
docs/VLLM_QUANTIZATION_STATUS.md
@@ -0,0 +1,75 @@
# vLLM Quantization — Status Summary
Remove this file

@ItzikVa ItzikVa linked an issue May 14, 2026 that may be closed by this pull request


Development

Successfully merging this pull request may close these issues.

Add HF multi-GPU auto device mode for GraniteSwitchForCausalLM

2 participants