A streaming hardware codec for transformer KV caches. Status: Phase 0 (scaffold) · Date: 2026-06-22 · Host: DGX Spark (GB10, Grace-Blackwell).
ChannelQuant is a streaming hardware codec for transformer KV caches that implements the KIVI/KVQuant recipe: per-channel INT4 scaling for keys, per-token INT4 scaling for values, and the handful of high-magnitude key channels held in higher precision via a statically calibrated mask. On Qwen2 GQA models it reaches ~3.6–3.8× compression at near-lossless task accuracy (within about 0.01 HellaSwag acc_norm of FP16 up to 7B), where naive INT4 collapses to chance. It replaces the TurboQuant+ vector codec, which hit a similar ratio but with a 25% relative accuracy loss on grouped-query attention.
ChannelQuant is not a novel quantization algorithm. The scheme it implements is established prior art and we credit it without qualification:
- KIVI — Liu et al., ICML 2024. Per-channel INT4 for keys, per-token INT4 for values, motivated by fixed large-magnitude outlier channels in the keys. https://github.com/jy-yuan/KIVI
- KVQuant — Hooper et al., 2024. Per-channel-K + per-token-V + per-channel outlier isolation → near-lossless low-bit KV cache. https://arxiv.org/pdf/2401.18079
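
As a rough illustration of why the recipe credited above scales keys per channel, the NumPy sketch below round-trips a synthetic key block through 4 bits both ways. The helper names `quantize_int4` and `rel_err`, the tensor sizes, and the 40× outlier on channel 7 are assumptions made for this example. None of it is taken from the KIVI or KVQuant code.

```python
import numpy as np

def quantize_int4(x, axis):
    """Asymmetric INT4 round trip with one (scale, offset) pair per slice.

    axis=0 reduces over tokens   -> one pair per channel (per-channel).
    axis=1 reduces over channels -> one pair per token   (per-token).
    Returns the dequantized array so the error can be measured directly.
    """
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0           # 16 levels, codes 0..15
    codes = np.clip(np.round((x - lo) / scale), 0, 15)
    return codes * scale + lo

def rel_err(x, xq, cols):
    """Relative L2 error restricted to the selected channel columns."""
    return float(np.linalg.norm((x - xq)[:, cols]) / np.linalg.norm(x[:, cols]))

rng = np.random.default_rng(0)
tokens, channels, outlier = 128, 64, 7                  # illustrative sizes only
keys = rng.normal(size=(tokens, channels)).astype(np.float32)
keys[:, outlier] *= 40.0                                # synthetic fixed outlier channel

# Measure the damage on the ordinary channels, which is where a shared
# per-token range hurts: the outlier stretches every token's step size.
normal_cols = np.arange(channels) != outlier
err_per_token = rel_err(keys, quantize_int4(keys, axis=1), normal_cols)
err_per_channel = rel_err(keys, quantize_int4(keys, axis=0), normal_cols)
print(f"per-token: {err_per_token:.3f}  per-channel: {err_per_channel:.3f}")
```

In this toy setting the per-channel error on the ordinary channels comes out well below the per-token error, because each channel gets a range fitted to its own magnitude. Values are quantized per token in the recipe, which corresponds to `axis=1` here.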
The contribution of this work is the silicon, not the math. KIVI and KVQuant are PyTorch/CUDA methods. There is, to our knowledge, no published streaming hardware codec that implements per-channel-key INT4 with calibrated, offline-fixed outlier isolation in RTL, integrated with an attention compute unit. ChannelQuant's novel pieces are exactly:
- a residual-group buffer that makes per-channel-key scaling work in a streaming write path (you cannot scale by a column-max you have not seen yet — so the in-flight key group is held in FP16 until it fills, then quantized as a block);
- a static outlier-channel ROM that turns KVQuant's runtime top-k detection into a per-layer mask lookup (valid because the outlier channels are a property of the trained weights, not of the activations; see the c19 outlier-stability gate below);
- the K/V-asymmetric datapath (buffered per-channel keys vs. streaming per-token values) and its synthesis on an open PDK.
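
The residual-group buffer in the first bullet can be pictured with a small software model. This is a minimal sketch, not the reference model or the RTL: the class name `ResidualGroupBuffer`, the asymmetric min/max quantizer, and the group size of 4 are all hypothetical choices for illustration.

```python
import numpy as np

class ResidualGroupBuffer:
    """Hold incoming FP16 key rows until `group_size` tokens have arrived,
    then quantize the whole group per channel (hypothetical sketch)."""

    def __init__(self, group_size, channels):
        self.group_size = group_size
        self.rows = np.empty((group_size, channels), dtype=np.float16)
        self.fill = 0
        self.blocks = []            # committed (codes, scale, lo) tuples

    def write(self, key_row):
        # The column max/min of an unfinished group is unknown, so the
        # row is parked in FP16 rather than quantized immediately.
        self.rows[self.fill] = key_row
        self.fill += 1
        if self.fill == self.group_size:
            self._flush()

    def _flush(self):
        x = self.rows.astype(np.float32)
        lo = x.min(axis=0)                               # per-channel range
        hi = x.max(axis=0)
        scale = np.maximum(hi - lo, 1e-8) / 15.0
        codes = np.clip(np.round((x - lo) / scale), 0, 15).astype(np.uint8)
        self.blocks.append((codes, scale, lo))
        self.fill = 0

    def read_all(self):
        """Dequantized committed groups followed by the FP16 residual rows."""
        parts = [c.astype(np.float32) * s + l for c, s, l in self.blocks]
        parts.append(self.rows[:self.fill].astype(np.float32))
        return np.concatenate(parts, axis=0)

rng = np.random.default_rng(0)
keys = rng.normal(size=(10, 8)).astype(np.float16)
buf = ResidualGroupBuffer(group_size=4, channels=8)
for row in keys:
    buf.write(row)
out = buf.read_all()
print(len(buf.blocks), buf.fill, out.shape)
```

Ten tokens with a group size of 4 leave two committed INT4 groups and two rows still held in FP16. A reader of the cache therefore sees a mix of dequantized and full-precision keys until the in-flight group fills.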
Any claim of novelty in this repo or the paper is scoped to the hardware implementation and its evaluation. Overclaiming the algorithm is a hard failure of this project.
HellaSwag acc_norm, n=250 (screening), Qwen2 GQA, quantizer applied
uniformly to every K/V (no routing). Source: analysis/c17_{q05,q15,q7}_summary.json
(reproduced from the predecessor adaptive-precision-attention (APA) c17 study; to be regenerated here in Phase 1).
The source JSON also gives a Wilson 95% CI for each value.
| 4-bit variant | 0.5B | 1.5B | 7B | verdict |
|---|---|---|---|---|
| FP16 (ref) | 0.416 | 0.540 | 0.612 | — |
| INT8 per-token (2×) | 0.420 | 0.528 | 0.600 | lossless 2× floor |
| naive INT4 per-token | 0.372 | 0.248 | 0.212 | collapses at scale |
| per-channel INT4 | 0.436 | 0.536 | 0.604 | recovers ~FP16 |
| KIVI (per-ch K / per-tok V) | 0.408 | 0.540 | 0.600 | recovers ~FP16 |
| outlier (top-2 ch FP16 + per-ch) | 0.428 | 0.552 | 0.616 | best, ~FP16 |
| MXFP4 (microscale FP4) | 0.336 | 0.500 | 0.288 | erratic — rejected |
| NVFP4 (microscale FP4) | 0.384 | 0.412 | 0.380 | erratic — rejected |
The headline contrast — naive per-token INT4 collapses toward the 0.25
chance line at scale, while per-channel / KIVI / outlier recover ~FP16 — is
large and robust. We do not over-claim per-model orderings among the
recovering variants: at n=250 their CIs overlap (see §8 of REVAMP_SPEC.md).
Headline paper numbers will be re-run at n≥1000.
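
To make the overlap argument concrete, the sketch below computes Wilson 95% intervals for the 7B column of the table at n=250. It uses the textbook Wilson score formula with z=1.96. The function names are ours, and the intervals stored in the source JSON may have been computed with slightly different conventions.

```python
import math

def wilson_ci(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    z2 = z * z
    denom = 1.0 + z2 / n
    centre = (p_hat + z2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n)) / denom
    return centre - half, centre + half

def overlaps(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# 7B column of the table above, n=250 screening runs.
acc_7b = {
    "FP16": 0.612,
    "naive INT4": 0.212,
    "per-channel": 0.604,
    "KIVI": 0.600,
    "outlier": 0.616,
}
ci = {name: wilson_ci(p, 250) for name, p in acc_7b.items()}
for name, (lo, hi) in ci.items():
    print(f"{name:12s} [{lo:.3f}, {hi:.3f}]")

print("recovering variants overlap:", overlaps(ci["KIVI"], ci["outlier"]))
print("naive INT4 overlaps FP16:  ", overlaps(ci["naive INT4"], ci["FP16"]))
```

At this sample size each interval is roughly ±0.06 wide around 0.6, so the three recovering variants cannot be ranked against each other, while the naive INT4 interval sits far below the FP16 one.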
Regenerate: MPLCONFIGDIR=/tmp/mpl /home/chaithu/lhs/.venv/bin/python analysis/c18b_quantizer_fig.py
Outlier-stability gate (c19) — PASSED at all scales. The static outlier
ROM rests on key outlier channels being input-independent. Measured across 8
independent calibration batches, mean top-2 stability was 0.958 / 0.986 /
0.984 on Qwen2-{0.5B, 1.5B, 7B}; layer-0 (the noisiest layer) is perfectly
pinned (1.00) at every scale; outlier-channel magnitude is 5.4–8.0× the median
channel. Source: analysis/c19_{q05,q15,q7}_summary.json. → Build CQ-4+.
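
One way such a stability score could be computed is sketched below. The definition used here is an assumption, since the exact c19 metric is not spelled out in this README: for each calibration batch take the top-2 channels by mean absolute key value, build a static mask from the most frequently selected channels, and report the mean fraction of each batch's top-2 that falls inside the mask. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def top_k_channels(keys, k=2):
    """Indices of the k channels with the largest mean |key| in one batch.
    keys has shape (tokens, channels)."""
    mag = np.abs(keys).mean(axis=0)
    return set(np.argsort(mag)[-k:].tolist())

def static_mask_and_stability(batches, k=2):
    """Vote a static top-k mask across batches and score how often each
    batch's own top-k agrees with it (1.0 means perfectly pinned)."""
    per_batch = [top_k_channels(b, k) for b in batches]
    votes = np.zeros(batches[0].shape[1], dtype=int)
    for chosen in per_batch:
        for c in chosen:
            votes[c] += 1
    mask = set(np.argsort(votes)[-k:].tolist())
    stability = float(np.mean([len(chosen & mask) / k for chosen in per_batch]))
    return mask, stability

# Synthetic stand-in for one layer: 8 calibration batches in which
# channels 3 and 17 are consistently about 6x larger than the rest.
rng = np.random.default_rng(0)
batches = []
for _ in range(8):
    k_batch = rng.normal(size=(256, 64))
    k_batch[:, [3, 17]] *= 6.0
    batches.append(k_batch)

mask, stability = static_mask_and_stability(batches)
print(sorted(mask), stability)
```

On this synthetic layer the mask is pinned to the two planted channels in every batch. On real activations the score would be averaged over layers, and a value below 1.0 indicates batches whose top-2 set deviates from the static mask.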
| Tier | Recipe | Combined ratio | Use |
|---|---|---|---|
| CQ-8 | INT8 per-token (K and V) | ~2× | safe lossless floor, all scales |
| CQ-4 | per-channel-K / per-token-V INT4 | ~3.8× | the target, near-lossless |
| CQ-4+ | CQ-4 + top-2 key outlier channels in FP16 | ~3.6× | best accuracy, GQA-heavy / 7B+ |
No routing, no per-token adaptive bit allocation — the lever is granularity,
not bit-allocation (predecessor c16). See REVAMP_SPEC.md §4 for the full
compression accounting.
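
For orientation only, here is a back-of-the-envelope version of that accounting. Every parameter is an assumption rather than a value from REVAMP_SPEC.md: one FP16 scale plus one FP16 offset (32 bits) per group of 128 elements, a head dimension of 128, and equal K and V volume. The spec's figures may use different overheads, so this is not expected to reproduce them exactly.

```python
def bits_per_element(code_bits, group_size, meta_bits=32):
    """Average stored bits per element: the code itself plus one
    (scale, offset) pair of `meta_bits` amortized over a group."""
    return code_bits + meta_bits / group_size

def cq_ratio(code_bits, group_size, head_dim=128, outlier_channels=0):
    """Compression ratio versus FP16, averaging keys and values equally.

    Outlier key channels are assumed to be stored as raw FP16 (16 bits)
    with no quantization metadata. All defaults are illustrative.
    """
    quant_bits = bits_per_element(code_bits, group_size)
    frac = outlier_channels / head_dim
    k_bits = (1.0 - frac) * quant_bits + frac * 16.0
    v_bits = quant_bits
    return 16.0 / ((k_bits + v_bits) / 2.0)

for name, kwargs in [
    ("CQ-8", dict(code_bits=8, group_size=128)),
    ("CQ-4", dict(code_bits=4, group_size=128)),
    ("CQ-4+", dict(code_bits=4, group_size=128, outlier_channels=2)),
]:
    print(f"{name:6s} {cq_ratio(**kwargs):.2f}x")
```

The point of the sketch is the shape of the trade-off: metadata overhead shrinks as the group grows, and each FP16 outlier channel costs a fixed fraction of the key budget, which is why CQ-4+ lands slightly below CQ-4.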
ChannelQuant is the quantization algorithm + a bit-accurate reference model +
accuracy/ablation experiments + the method paper. It is not the silicon. The
RTL and tape-out live in the KVCE block (the kv-cache-engine repo — a
silicon block of the Longhorn chip), which is being revamped separately to
implement ChannelQuant. This repo's hardware-facing deliverable is
docs/HW_CONTRACT.md (the algorithm-to-silicon interface KVCE implements
against) plus golden test vectors (reference/testvectors/) for KVCE's
Python↔C++↔SV parity. Silicon results (area/Fmax vs TurboQuant+) will come from
the KVCE block and are noted as forthcoming — not produced here.
channelquant/
├── README.md # this file
├── NOTES.md # lab notebook — dated entries, full provenance
├── CLAIMS.md # claims ledger — every number → artifact + script
├── REVAMP_SPEC.md # the design spec (source of truth)
├── analysis/ # accuracy sweeps, calibration, group-size Pareto
│ ├── c17_quantizer_sweep.py # the decisive 4-bit sweep (reproduced Phase 1)
│ ├── c19_outlier_stability.py # static-mask gate (PASSED)
│ ├── c18b_quantizer_fig.py # figure regen
│ ├── outlier_calibration.py # offline per-layer outlier mask (Phase 2)
│ ├── group_size_sweep.py # G ∈ {32,64,128,256} Pareto (Phase 2)
│ └── c1*_*_summary.json # logged results
├── reference/ # bit-accurate reference: CQ-8/CQ-4/CQ-4+ (Py, then C++)
│ └── testvectors/ # GOLDEN VECTORS for the KVCE block's 3-way parity
└── docs/
  ├── HW_CONTRACT.md # algorithm→silicon interface (KVCE consumes this)
  ├── research_kv_quant_landscape.md # prior-art landscape (KIVI/KVQuant/vLLM)
  └── paper/ # the method paper (maintained from Phase 1)
| Phase | Deliverable | Status |
|---|---|---|
| 0 | Repo scaffold + README + NOTES + HW_CONTRACT | done |
| 1 | Reference model (CQ-8/4/4+) + reproduce c17 (±0.02) + golden vectors | pending |
| 2 | Outlier calibrator + group-size Pareto | pending |
| 3 | Generalization (non-Qwen GQA + ARC-Challenge) | pending |
| 4 | Method paper assembly | pending |
Each phase gates the next (see REVAMP_SPEC.md §7). c19 (the static-mask gate)
already passed; Phase 1's c17 reproduction is the next blocking gate. RTL and
synthesis are out of scope — they belong to the KVCE block, which implements
this repo's docs/HW_CONTRACT.md and validates against reference/testvectors/.
export HF_HOME=/home/chaithu/lhs/.hf_cache # ~/.cache is root-owned
export MPLCONFIGDIR=/tmp/mpl # for figure scripts
/home/chaithu/lhs/.venv/bin/python <script> # transformers 5.10.2, torch 2.12.0+cu130

The predecessor KVCE pq4 vector codec hung the GB10. ChannelQuant is plain integer arithmetic (no PolarQuant tables) and is expected to be GPU-safe, but small/CPU validation runs go first and all GPU runs are monitored.
ChannelQuant supersedes TurboQuant+ (PolarQuant + QJL + Walsh–Hadamard
rotation; the kv-cache-engine repo), which reached a comparable ratio
(~3.5×) but with a −0.10 HellaSwag acc_norm collapse on Qwen2 GQA. The
diagnosis and decisive evidence originated in the adaptive-precision-attention
lab notebook (studies c13–c19); ChannelQuant is a self-contained repo that
re-derives its own artifacts. See docs/research_kv_quant_landscape.md for the
prior-art landscape and REVAMP_SPEC.md for the full design.
