Skip to content

themoddedcube/ChannelQuant

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ChannelQuant

A streaming hardware codec for transformer KV caches. Status: Phase 0 (scaffold) · Date: 2026-06-22 · Host: DGX Spark (GB10, Grace-Blackwell).

ChannelQuant is a streaming hardware codec for transformer KV caches that implements the KIVI/KVQuant recipe — per-channel INT4 scaling for keys, per-token INT4 for values, with the handful of high-magnitude key channels held in higher precision via a statically-calibrated mask. On Qwen2 GQA models it reaches ~3.6–3.8× compression at near-lossless task accuracy (within 0.01 HellaSwag acc_norm of FP16 up to 7B), where naive INT4 collapses to chance. It replaces the TurboQuant+ vector codec, which hit a similar ratio but with a 25% relative accuracy loss on grouped-query attention.

Attribution — read this first

ChannelQuant is not a novel quantization algorithm. The scheme it implements is established prior art and we credit it without qualification:

  • KIVI — Liu et al., ICML 2024. Per-channel INT4 for keys, per-token INT4 for values, motivated by fixed large-magnitude outlier channels in the keys. https://github.com/jy-yuan/KIVI
  • KVQuant — Hooper et al., 2024. Per-channel-K + per-token-V + per-channel outlier isolation → near-lossless low-bit KV cache. https://arxiv.org/pdf/2401.18079

The contribution of this work is the silicon, not the math. KIVI and KVQuant are PyTorch/CUDA methods. There is, to our knowledge, no published streaming hardware codec that implements per-channel-key INT4 with calibrated, offline-fixed outlier isolation in RTL, integrated with an attention compute unit. ChannelQuant's novel pieces are exactly:

  1. a residual-group buffer that makes per-channel-key scaling work in a streaming write path (you cannot scale by a column-max you have not seen yet — so the in-flight key group is held in FP16 until it fills, then quantized as a block);
  2. a static outlier-channel ROM that turns KVQuant's runtime top-k detection into a per-layer mask lookup (valid because the outlier channels are a property of the trained weights, not the activation — see §c19 below);
  3. the K/V-asymmetric datapath (buffered per-channel keys vs. streaming per-token values) and its synthesis on an open PDK.

Any claim of novelty in this repo or the paper is scoped to the hardware implementation and its evaluation. Overclaiming the algorithm is a hard failure of this project.

The evidence we build toward

HellaSwag acc_norm, n=250 (screening), Qwen2 GQA, quantizer applied uniformly to every K/V (no routing). Source: analysis/c17_{q05,q15,q7}_summary.json (reproduced from the predecessor APA c17 study; regenerated here in Phase 1). Values shown with Wilson 95% CIs in the source JSON.

4-bit variant 0.5B 1.5B 7B verdict
FP16 (ref) 0.416 0.540 0.612
INT8 per-token (2×) 0.420 0.528 0.600 lossless 2× floor
naive INT4 per-token 0.372 0.248 0.212 collapses at scale
per-channel INT4 0.436 0.536 0.604 recovers ~FP16
KIVI (per-ch K / per-tok V) 0.408 0.540 0.600 recovers ~FP16
outlier (top-2 ch FP16 + per-ch) 0.428 0.552 0.616 best, ~FP16
MXFP4 (microscale FP4) 0.336 0.500 0.288 erratic — rejected
NVFP4 (microscale FP4) 0.384 0.412 0.380 erratic — rejected

The headline contrast — naive per-token INT4 collapses toward the 0.25 chance line at scale, while per-channel / KIVI / outlier recover ~FP16 — is large and robust. We do not over-claim per-model orderings among the recovering variants: at n=250 their CIs overlap (see §8 of REVAMP_SPEC.md). Headline paper numbers will be re-run at n≥1000.

4-bit KV quant on Qwen GQA

Regenerate: MPLCONFIGDIR=/tmp/mpl /home/chaithu/lhs/.venv/bin/python analysis/c18b_quantizer_fig.py

Outlier-stability gate (c19) — PASSED at all scales. The static outlier ROM rests on key outlier channels being input-independent. Measured across 8 independent calibration batches, mean top-2 stability was 0.958 / 0.986 / 0.984 on Qwen2-{0.5B, 1.5B, 7B}; layer-0 (the noisiest layer) is perfectly pinned (1.00) at every scale; outlier-channel magnitude is 5.4–8.0× the median channel. Source: analysis/c19_{q05,q15,q7}_summary.json. → Build CQ-4+.

Tiers

Tier Recipe Combined ratio Use
CQ-8 INT8 per-token (K and V) ~2× safe lossless floor, all scales
CQ-4 per-channel-K / per-token-V INT4 ~3.8× the target, near-lossless
CQ-4+ CQ-4 + top-2 key outlier channels in FP16 ~3.6× best accuracy, GQA-heavy / 7B+

No routing, no per-token adaptive bit allocation — the lever is granularity, not bit-allocation (predecessor c16). See REVAMP_SPEC.md §4 for the full compression accounting.

Scope

ChannelQuant is the quantization algorithm + a bit-accurate reference model + accuracy/ablation experiments + the method paper. It is not the silicon. The RTL and tape-out live in the KVCE block (the kv-cache-engine repo — a silicon block of the Longhorn chip), which is being revamped separately to implement ChannelQuant. This repo's hardware-facing deliverable is docs/HW_CONTRACT.md (the algorithm-to-silicon interface KVCE implements against) plus golden test vectors (reference/testvectors/) for KVCE's Python↔C++↔SV parity. Silicon results (area/Fmax vs TurboQuant+) will come from the KVCE block and are noted as forthcoming — not produced here.

Repository layout

channelquant/
├── README.md                  # this file
├── NOTES.md                   # lab notebook — dated entries, full provenance
├── CLAIMS.md                  # claims ledger — every number → artifact + script
├── REVAMP_SPEC.md             # the design spec (source of truth)
├── analysis/                  # accuracy sweeps, calibration, group-size Pareto
│   ├── c17_quantizer_sweep.py        # the decisive 4-bit sweep (reproduced Phase 1)
│   ├── c19_outlier_stability.py      # static-mask gate (PASSED)
│   ├── c18b_quantizer_fig.py         # figure regen
│   ├── outlier_calibration.py        # offline per-layer outlier mask  (Phase 2)
│   ├── group_size_sweep.py           # G ∈ {32,64,128,256} Pareto      (Phase 2)
│   └── c1*_*_summary.json            # logged results
├── reference/                 # bit-accurate reference: CQ-8/CQ-4/CQ-4+ (Py, then C++)
│   └── testvectors/           # GOLDEN VECTORS for the KVCE block's 3-way parity
└── docs/
    ├── HW_CONTRACT.md                   # algorithm→silicon interface (KVCE consumes this)
    ├── research_kv_quant_landscape.md   # prior-art landscape (KIVI/KVQuant/vLLM)
    └── paper/                            # the method paper (maintained from Phase 1)

Build status

Phase Deliverable Status
0 Repo scaffold + README + NOTES + HW_CONTRACT done
1 Reference model (CQ-8/4/4+) + reproduce c17 (±0.02) + golden vectors pending
2 Outlier calibrator + group-size Pareto pending
3 Generalization (non-Qwen GQA + ARC-Challenge) pending
4 Method paper assembly pending

Each phase gates the next (see REVAMP_SPEC.md §7). c19 (the static-mask gate) already passed; Phase 1's c17 reproduction is the next blocking gate. RTL and synthesis are out of scope — they belong to the KVCE block, which implements this repo's docs/HW_CONTRACT.md and validates against reference/testvectors/.

Environment

export HF_HOME=/home/chaithu/lhs/.hf_cache        # ~/.cache is root-owned
export MPLCONFIGDIR=/tmp/mpl                       # for figure scripts
/home/chaithu/lhs/.venv/bin/python <script>        # transformers 5.10.2, torch 2.12.0+cu130

The predecessor KVCE pq4 vector codec hung the GB10; ChannelQuant is plain integer arithmetic (no PolarQuant tables), expected GPU-safe — but small/CPU validation runs first and all GPU runs are monitored.

Lineage

ChannelQuant supersedes TurboQuant+ (PolarQuant + QJL + Walsh–Hadamard rotation; the kv-cache-engine repo), which reached a comparable ratio (~3.5×) but with a −0.10 HellaSwag acc_norm collapse on Qwen2 GQA. The diagnosis and decisive evidence originated in the adaptive-precision-attention lab notebook (studies c13c19); ChannelQuant is a self-contained repo that re-derives its own artifacts. See docs/research_kv_quant_landscape.md for the prior-art landscape and REVAMP_SPEC.md for the full design.

About

Near-lossless 4× KV-cache compression for GQA models at ~4.1 bits/value — per-channel-key INT4 + static outlier ROM, with a reproducible reference model and paper.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages