Suggestion concerning llama.cpp / GGUF specs: replacing weight look-ups #420

@jaynu2

Description

I have an idea to replace the current unpacking method. I ran it past Gemini Pro, and it was pretty enthusiastic. Basically, it loads the SIMD masks directly from the weights and uses them to transfer the weights in two operations, rather than doing all the look-ups. I'll attach the full chat. This was the summary:
Conclusion
Your idea is valid and optimal for a specialized AVX-512 runtime. If you were writing a custom kernel solely for Xeon/Epyc servers, using Interleaved Bit-Sliced Storage + vpmovm2b expansion is likely the theoretical speed limit for weight decompression.
The weights could be interleaved ... first 64-bit value for -1, second 64-bit value for +1. The 0 values are cleared by the first transfer operation, which is an option for these types of transfers.
Yes, the technique you are describing is a known high-performance optimization called Bit-Sliced (or Planar) Weight Expansion, and it is extremely effective on AVX-512 hardware.
Your logic is sound: by storing weights as two separate bit planes (one mask for -1, one for +1) and using masked-move instructions (e.g. vmovdqu8 with a write mask), you can bypass the shuffle bottleneck entirely and saturate memory bandwidth.
The Planar Optimization Strategy
The Mechanism: Masked Expansion
Your proposal replaces the complex "Shuffle & Unpack" logic with raw memory throughput. In AVX-512, the specific instruction that validates your "clearing 0s" logic is vpmovm2b (Move Mask to Byte).
| Step | Operation (AVX-512) | Explanation |
| --- | --- | --- |
| 1. Load & Expand (-1) | `vpmovm2b zmm0, k1` | Expands the 64-bit mask `k1` into bytes. Implicit zeroing: where a bit is 1 it writes `0xFF` (-1); where it is 0 it writes `0x00`. This handles your "clear to 0" requirement in a single instruction. |
| 2. Merge (+1) | `vmovdqu8 zmm0 {k2}, [const_1]` | Uses the second mask `k2` as a write mask: it writes the value 1 only where `k2` is set, leaving the existing -1 and 0 values untouched. |

Performance Analysis
Why This Beats Lookups (The Port War)
The standard bitnet.cpp implementation relies on vpshufb (Shuffle) to unpack weights. On Intel CPUs (Skylake, Ice Lake, Golden Cove), vpshufb can often only execute on Port 5. This creates a "Port 5 Pressure" bottleneck where the CPU has plenty of compute power (ALUs) but can't rearrange data fast enough.
Your Bit-Sliced approach shifts the workload:

Mask Loading (kmov): Runs on Port 0/1/5 (Distributed).
Expansion (vpmovm2b): Runs on Port 5 (but is cheaper than shuffle).
Merge (vmov): Runs on Memory Ports or ALU ports depending on immediate encoding.

The Assembly Implementation
Here is the optimized AVX-512 sequence for your proposal. It reduces the expansion cost to effectively 2 instructions per 64 weights.
```nasm
; Inputs:
; k1 = Negative Mask (bits set where W = -1)
; k2 = Positive Mask (bits set where W = +1)
; zmm_one = register pre-filled with 1s
; zmm_activations = unsigned 8-bit activations

; 1. Expand negatives (and clear zeros)
; Generates 0xFF (-1) for set bits, 0x00 (0) for clear bits.
vpmovm2b zmm_weights, k1

; 2. Merge positives
; Overwrites with 1s ONLY where k2 is set. Preserves the rest.
vmovdqu8 zmm_weights {k2}, zmm_one

; 3. Dot product (accumulate)
; Note: vpdpbusd treats its first source as UNSIGNED bytes and its
; second as SIGNED bytes, so the unsigned activations must be the
; first source and the signed weights the second.
vpdpbusd zmm_accum, zmm_activations, zmm_weights
```
The Trade-Off: Storage Format
The only reason this isn't the default in generic libraries is File Format Compatibility.

Standard (GGUF/GGML): Stores weights as Packed Integers (e.g., 2-bit chunks: 01, 11, 00 adjacent in memory).
Your Approach: Requires Interleaved Bit-Planes. You must store 64 bits of "Mask A" followed by 64 bits of "Mask B".
Consequence: You cannot simply mmap a standard GGUF file. You would need a conversion step to "transpose" the weights into this planar format. However, for a dedicated high-performance runtime, this conversion is a one-time cost that yields significant throughput gains. 
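The transpose itself is straightforward. As a sketch, assuming a hypothetical 2-bit packing (`00` = 0, `01` = +1, `10` = -1; actual GGUF ternary quant layouts differ and would need their own field extraction), the one-time conversion per 64 weights could look like:

```c
#include <stdint.h>

/* One-time conversion from a packed 2-bit layout into interleaved bit
 * planes ready for the masked-move kernel. ASSUMED encoding per 2-bit
 * field: 00 = 0, 01 = +1, 10 = -1 (illustrative only; not the real
 * GGUF layout). Input: 16 bytes = 64 weights, 4 weights per byte.
 * Output: the negative-plane mask ("Mask A") and the positive-plane
 * mask ("Mask B"), to be stored adjacently in the converted file. */
static void packed_to_planar(const uint8_t packed[16],
                             uint64_t *mask_neg, uint64_t *mask_pos) {
    uint64_t neg = 0, pos = 0;
    for (int i = 0; i < 64; i++) {
        /* Extract the i-th 2-bit field. */
        uint8_t w2 = (packed[i >> 2] >> ((i & 3) * 2)) & 3;
        if (w2 == 2) neg |= 1ULL << i;   /* 10 -> -1 */
        if (w2 == 1) pos |= 1ULL << i;   /* 01 -> +1 */
    }
    *mask_neg = neg;
    *mask_pos = pos;
}
```

Since this runs once at load/convert time, its cost is amortized across every subsequent inference pass that benefits from the planar layout.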

msftbitnet.md
