Issue 23 - Multi-GPU device_map fix for GraniteSwitch #31

Open

ItzikVa wants to merge 4 commits into main from issue-23

Conversation

ItzikVa (Collaborator) commented May 14, 2026

  1. Make buffers non-persistent - persistent=False removes them from state_dict(), so accelerate ignores them entirely. These buffers are config-derived metadata, not learned weights (see the registration sketch after this list).
  2. Rebuild buffers from config on first forward - When from_pretrained uses device_map="auto", accelerate's init_empty_weights() zeros all tensors during initialization. Since non-persistent buffers aren't restored from the checkpoint, they stay as zeros. We detect this on the first forward call and rebuild adapter_token_ids, token_to_group_mask, and adapter_hiding_matrix from config values.
  3. Device sync - Ensure rebuilt buffers are placed on the same device as the input tensors.
  4. Suppress warnings for old checkpoints - Add _keys_to_ignore_on_load_unexpected so existing checkpoints (which still contain these keys) load without "UNEXPECTED" warnings.
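
A minimal sketch of the buffer registration from item 1, assuming a hypothetical `GraniteSwitchModel` module and config fields named after the buffers (`config.adapter_token_ids`, etc.); only the three buffer names come from this PR:

```python
import torch
import torch.nn as nn


class GraniteSwitchModel(nn.Module):  # hypothetical stand-in for the real module
    def __init__(self, config):
        super().__init__()
        # persistent=False keeps these config-derived tensors out of state_dict(),
        # so accelerate's device_map="auto" never has to assign them to a device.
        self.register_buffer(
            "adapter_token_ids",
            torch.tensor(config.adapter_token_ids, dtype=torch.long),
            persistent=False,
        )
        self.register_buffer(
            "token_to_group_mask",
            torch.tensor(config.token_to_group_mask, dtype=torch.bool),
            persistent=False,
        )
        self.register_buffer(
            "adapter_hiding_matrix",
            torch.tensor(config.adapter_hiding_matrix, dtype=torch.float32),
            persistent=False,
        )
```

Since the values are derived entirely from the config, dropping them from the checkpoint loses no information.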

ItzikVa added 4 commits May 13, 2026 11:44
accelerate raises ValueError when device_map="auto" encounters persistent
buffers (adapter_token_ids, token_to_group_mask, adapter_hiding_matrix) that
it cannot assign to a device. These buffers are config-derived metadata
reconstructed at __init__ time, so they don't need to be in state_dict().

Also adds docs/VLLM_QUANTIZATION_STATUS.md summarizing vLLM quantization
work status for future reference.

Fixes: #16
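
For reference, the load path this commit targets is the standard sharded from_pretrained call; the checkpoint id below is hypothetical:

```python
from transformers import AutoModelForCausalLM

# Before this commit, accelerate raised a ValueError while trying to assign the
# persistent buffers to devices; with them non-persistent, the sharded load runs.
model = AutoModelForCausalLM.from_pretrained(
    "org/granite-switch-checkpoint",  # hypothetical checkpoint id
    device_map="auto",
    torch_dtype="auto",
)
```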
accelerate's device_map does not move non-persistent buffers to GPU.
Add a one-time device sync at the start of the switch block so
adapter_token_ids, token_to_group_mask, and adapter_hiding_matrix
follow the input tensors to the correct device.
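
A sketch of what such a one-time sync could look like at the top of the switch block's forward; the signature is illustrative, only the buffer names come from this PR:

```python
def forward(self, hidden_states, **kwargs):
    # One-time device sync: device_map does not move non-persistent buffers,
    # so let them follow the input tensors to whichever device those landed on.
    if self.adapter_token_ids.device != hidden_states.device:
        self.adapter_token_ids = self.adapter_token_ids.to(hidden_states.device)
        self.token_to_group_mask = self.token_to_group_mask.to(hidden_states.device)
        self.adapter_hiding_matrix = self.adapter_hiding_matrix.to(hidden_states.device)
    # ... rest of the switch-block computation ...
```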
Suppresses "UNEXPECTED" warnings when loading checkpoints that still
contain the now non-persistent buffer keys (adapter_token_ids,
token_to_group_mask, adapter_hiding_matrix).
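
This is a class-level attribute in transformers; a hedged sketch (the base class shown is an assumption, the model class name comes from the linked issue):

```python
from transformers import PreTrainedModel


class GraniteSwitchForCausalLM(PreTrainedModel):  # real base class may differ
    # Old checkpoints still serialize these keys; listing them here keeps
    # from_pretrained from reporting them as UNEXPECTED on load.
    _keys_to_ignore_on_load_unexpected = [
        r"adapter_token_ids",
        r"token_to_group_mask",
        r"adapter_hiding_matrix",
    ]
```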
accelerate's init_empty_weights() zeros all buffers during
from_pretrained with device_map. Since non-persistent buffers aren't
restored from the checkpoint, they stay as zeros. Detect this on first
forward call and rebuild adapter_token_ids, token_to_group_mask, and
adapter_hiding_matrix from config values.
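
A sketch of the rebuild helper, meant to live on the same module as the registration sketch above and be called at the start of the first forward; the config field names are assumptions:

```python
import torch


def _maybe_rebuild_switch_buffers(self, device):
    # init_empty_weights() leaves non-persistent buffers zeroed and they are not
    # restored from the checkpoint, so an all-zero adapter_token_ids buffer on
    # the first forward call means the metadata must be rebuilt from the config.
    if not bool(self.adapter_token_ids.any()):
        self.adapter_token_ids = torch.tensor(
            self.config.adapter_token_ids, dtype=torch.long, device=device
        )
        self.token_to_group_mask = torch.tensor(
            self.config.token_to_group_mask, dtype=torch.bool, device=device
        )
        self.adapter_hiding_matrix = torch.tensor(
            self.config.adapter_hiding_matrix, dtype=torch.float32, device=device
        )
```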
docs/VLLM_QUANTIZATION_STATUS.md
@@ -0,0 +1,75 @@
# vLLM Quantization — Status Summary
Remove this file

@ItzikVa ItzikVa linked an issue May 14, 2026 that may be closed by this pull request


Development

Successfully merging this pull request may close these issues.

Add HF multi-GPU auto device mode for GraniteSwitchForCausalLM

2 participants