ChannelQuant

A streaming hardware codec for transformer KV caches. Status: Phase 0 (scaffold) · Date: 2026-06-22 · Host: DGX Spark (GB10, Grace-Blackwell).

ChannelQuant is a streaming hardware codec for transformer KV caches that implements the KIVI/KVQuant recipe: per-channel INT4 scaling for keys, per-token INT4 for values, and the handful of high-magnitude key channels held in higher precision via a statically calibrated mask. On Qwen2 GQA models it reaches ~3.6–3.8× compression at near-lossless task accuracy (within roughly 0.01 HellaSwag acc_norm of FP16 up to 7B in n=250 screening runs), where naive per-token INT4 collapses to chance at 1.5B and 7B. It replaces the TurboQuant+ vector codec, which hit a similar ratio but with a 25% relative accuracy loss on grouped-query attention.

Attribution — read this first

ChannelQuant is not a novel quantization algorithm. The scheme it implements is established prior art and we credit it without qualification:

KIVI — Liu et al., ICML 2024. Per-channel INT4 for keys, per-token INT4 for values, motivated by fixed large-magnitude outlier channels in the keys. https://github.com/jy-yuan/KIVI
KVQuant — Hooper et al., 2024. Per-channel-K + per-token-V + per-channel outlier isolation → near-lossless low-bit KV cache. https://arxiv.org/pdf/2401.18079

The contribution of this work is the silicon, not the math. KIVI and KVQuant are PyTorch/CUDA methods. There is, to our knowledge, no published streaming hardware codec that implements per-channel-key INT4 with calibrated, offline-fixed outlier isolation in RTL, integrated with an attention compute unit. ChannelQuant's novel pieces are exactly:

a residual-group buffer that makes per-channel-key scaling work in a streaming write path (you cannot scale by a column-max you have not seen yet — so the in-flight key group is held in FP16 until it fills, then quantized as a block);
a static outlier-channel ROM that turns KVQuant's runtime top-k detection into a per-layer mask lookup (valid because the outlier channels are a property of the trained weights, not the activation — see §c19 below);
the K/V-asymmetric datapath (buffered per-channel keys vs. streaming per-token values) and its synthesis on an open PDK.
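The residual-group buffer and the K/V asymmetry listed above can be illustrated in software. The sketch below is a minimal NumPy model, not the RTL and not the reference model in reference/: the names `ResidualGroupKeyBuffer` and `quantize_value_token`, the group size of 32, and the asymmetric min/max INT4 mapping are all assumptions made for illustration.

```python
import numpy as np

class ResidualGroupKeyBuffer:
    """Holds incoming keys in FP16 until `group_size` tokens have arrived,
    then quantizes the whole group with one INT4 scale/offset per channel."""

    def __init__(self, num_channels, group_size=32):  # group size is an assumption
        self.group_size = group_size
        self.pending = np.empty((group_size, num_channels), dtype=np.float16)
        self.fill = 0
        self.blocks = []  # list of (codes, scale, offset), one entry per full group

    def push(self, key):
        """Append one token's key vector; quantize once the group is full."""
        self.pending[self.fill] = key
        self.fill += 1
        if self.fill == self.group_size:
            self._flush()

    def _flush(self):
        g = self.pending.astype(np.float32)
        lo = g.min(axis=0)                      # per-channel minimum over the group
        hi = g.max(axis=0)                      # per-channel maximum over the group
        scale = np.maximum(hi - lo, 1e-8) / 15.0
        codes = np.clip(np.round((g - lo) / scale), 0, 15).astype(np.uint8)
        self.blocks.append((codes, scale, lo))
        self.fill = 0

    def read_all(self):
        """Dequantized full groups followed by the still-FP16 residual tokens."""
        parts = [c.astype(np.float32) * s + lo for c, s, lo in self.blocks]
        parts.append(self.pending[: self.fill].astype(np.float32))
        return np.concatenate(parts, axis=0)

def quantize_value_token(value):
    """Per-token INT4 for values: the scale depends only on this token,
    so the value path needs no buffering."""
    v = value.astype(np.float32)
    lo, hi = v.min(), v.max()
    scale = max(hi - lo, 1e-8) / 15.0
    codes = np.clip(np.round((v - lo) / scale), 0, 15).astype(np.uint8)
    return codes, scale, lo

# Usage on synthetic data: 70 tokens give two full groups plus 6 residual tokens.
rng = np.random.default_rng(1)
keys = rng.standard_normal((70, 8)).astype(np.float16)
buf = ResidualGroupKeyBuffer(num_channels=8, group_size=32)
for k in keys:
    buf.push(k)
restored = buf.read_all()
v_codes, v_scale, v_lo = quantize_value_token(keys[0])
```

The tokens still in the residual group are returned bit-exact, while completed groups carry per-channel quantization error; that split is the software analogue of holding the in-flight group in FP16.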

Any claim of novelty in this repo or the paper is scoped to the hardware implementation and its evaluation. Overclaiming the algorithm is a hard failure of this project.

The evidence we build toward

HellaSwag acc_norm, n=250 (screening), Qwen2 GQA, quantizer applied uniformly to every K/V (no routing). Source: analysis/c17_{q05,q15,q7}_summary.json (reproduced from the predecessor APA c17 study; regenerated here in Phase 1). Values shown with Wilson 95% CIs in the source JSON.

4-bit variant	0.5B	1.5B	7B	verdict
FP16 (ref)	0.416	0.540	0.612	—
INT8 per-token (2×)	0.420	0.528	0.600	lossless 2× floor
naive INT4 per-token	0.372	0.248	0.212	collapses at scale
per-channel INT4	0.436	0.536	0.604	recovers ~FP16
KIVI (per-ch K / per-tok V)	0.408	0.540	0.600	recovers ~FP16
outlier (top-2 ch FP16 + per-ch)	0.428	0.552	0.616	best, ~FP16
MXFP4 (microscale FP4)	0.336	0.500	0.288	erratic — rejected
NVFP4 (microscale FP4)	0.384	0.412	0.380	erratic — rejected

The headline contrast — naive per-token INT4 collapses toward the 0.25 chance line at scale, while per-channel / KIVI / outlier recover ~FP16 — is large and robust. We do not over-claim per-model orderings among the recovering variants: at n=250 their CIs overlap (see §8 of REVAMP_SPEC.md). Headline paper numbers will be re-run at n≥1000.
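The overlap argument can be checked directly from the table. The sketch below applies the standard Wilson score interval (z = 1.96, no continuity correction) to the 7B column at n=250; whether the source JSON uses exactly this variant is an assumption, and the function name `wilson_interval` is ours.

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion (no continuity correction)."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

N = 250
CHANCE = 0.25  # the chance line quoted in the text

# 7B column of the table above
acc_7b = {
    "FP16 (ref)": 0.612,
    "naive INT4 per-token": 0.212,
    "per-channel INT4": 0.604,
    "KIVI (per-ch K / per-tok V)": 0.600,
    "outlier (top-2 ch FP16 + per-ch)": 0.616,
}

intervals = {name: wilson_interval(acc, N) for name, acc in acc_7b.items()}
for name, (lo, hi) in intervals.items():
    print(f"{name:34s} [{lo:.3f}, {hi:.3f}]")
```

At n=250 the interval half-width is roughly 0.06, so the three recovering variants all overlap FP16 and each other, while the naive per-token interval contains the chance line and sits entirely below the FP16 interval.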

Regenerate: MPLCONFIGDIR=/tmp/mpl /home/chaithu/lhs/.venv/bin/python analysis/c18b_quantizer_fig.py

Outlier-stability gate (c19): PASSED at all scales. The static outlier ROM rests on the key outlier channels being input-independent. Measured across 8 independent calibration batches, mean top-2 stability was 0.958 / 0.986 / 0.984 on Qwen2-{0.5B, 1.5B, 7B}; layer 0 (the noisiest layer) is perfectly pinned (1.00) at every scale; and outlier-channel magnitude is 5.4–8.0× that of the median channel. Source: analysis/c19_{q05,q15,q7}_summary.json. Decision: build CQ-4+.
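One way to picture the gate: pick the top-2 key channels per calibration batch, take the consensus set as the static mask, and score how often each batch agrees with it. This README does not define "top-2 stability", so the definition below (mean fraction of each batch's top-k channels that fall in the consensus set) is an assumption, as are the function names and the synthetic data; analysis/c19_outlier_stability.py is the authoritative version.

```python
import numpy as np

def topk_channels(keys, k=2):
    """Indices of the k channels with the largest mean |value|.
    `keys` has shape (tokens, channels)."""
    magnitude = np.abs(keys).mean(axis=0)
    return set(np.argsort(magnitude)[-k:].tolist())

def static_mask_and_stability(batches, k=2):
    """Build a consensus outlier mask and an (assumed) stability score."""
    per_batch = [topk_channels(b, k) for b in batches]
    num_channels = batches[0].shape[1]
    votes = np.zeros(num_channels, dtype=int)
    for chosen in per_batch:
        for c in chosen:
            votes[c] += 1
    consensus = set(np.argsort(votes)[-k:].tolist())
    # Fraction of each batch's top-k that agrees with the consensus set.
    stability = float(np.mean([len(s & consensus) / k for s in per_batch]))
    mask = np.zeros(num_channels, dtype=bool)
    mask[list(consensus)] = True
    return mask, stability

# Synthetic stand-in for 8 calibration batches of one layer's keys:
# channels 5 and 40 are scaled up to act as fixed outlier channels.
rng = np.random.default_rng(0)
batches = []
for _ in range(8):
    b = rng.standard_normal((256, 64))
    b[:, [5, 40]] *= 6.0
    batches.append(b)

mask, stability = static_mask_and_stability(batches, k=2)
print("outlier channels:", np.flatnonzero(mask).tolist(), "stability:", stability)
```

A stability near 1.0 means the same channels win in every batch, which is the property that lets the mask be fixed offline per layer rather than detected at runtime.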

Tiers

Tier	Recipe	Combined ratio	Use
CQ-8	INT8 per-token (K and V)	~2×	safe lossless floor, all scales
CQ-4	per-channel-K / per-token-V INT4	~3.8×	the target, near-lossless
CQ-4+	CQ-4 + top-2 key outlier channels in FP16	~3.6×	best accuracy, GQA-heavy / 7B+
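The three recipes differ only in bit width and in the axis along which the scale is computed. The sketch below is a NumPy fake-quantization model under stated assumptions: asymmetric min/max scaling, a single group spanning all tokens, synthetic keys with two outlier channels scaled by 8×, and hypothetical function names (`cq8`, `cq4`, `cq4_plus`). It is not the bit-accurate reference model in reference/.

```python
import numpy as np

def fake_quant(x, bits, axis):
    """Quantize then dequantize with one (scale, offset) per slice along `axis`."""
    levels = 2 ** bits - 1
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    return np.round((x - lo) / scale) * scale + lo

# k and v have shape (tokens, channels).
def cq8(k, v):
    """INT8 per-token for both K and V (reduce over channels)."""
    return fake_quant(k, 8, axis=1), fake_quant(v, 8, axis=1)

def cq4(k, v):
    """INT4 per-channel keys (reduce over tokens), INT4 per-token values."""
    return fake_quant(k, 4, axis=0), fake_quant(v, 4, axis=1)

def cq4_plus(k, v, outlier_mask):
    """CQ-4 with the masked key channels kept in FP16."""
    k_hat, v_hat = cq4(k, v)
    k_hat[:, outlier_mask] = k[:, outlier_mask].astype(np.float16)
    return k_hat, v_hat

def rel_err(ref, approx):
    return float(np.linalg.norm(ref - approx) / np.linalg.norm(ref))

rng = np.random.default_rng(0)
k = rng.standard_normal((128, 64)).astype(np.float32)
v = rng.standard_normal((128, 64)).astype(np.float32)
mask = np.zeros(64, dtype=bool)
mask[[5, 40]] = True
k[:, mask] *= 8.0  # synthetic fixed outlier channels

key_err = {
    "naive INT4 per-token": rel_err(k, fake_quant(k, 4, axis=1)),
    "CQ-8": rel_err(k, cq8(k, v)[0]),
    "CQ-4": rel_err(k, cq4(k, v)[0]),
    "CQ-4+": rel_err(k, cq4_plus(k, v, mask)[0]),
}
for name, err in key_err.items():
    print(f"{name:22s} key relative error = {err:.4f}")
```

On this synthetic input, per-token INT4 loses resolution because each token's range is stretched by the outlier channels, whereas per-channel scaling gives every channel its own range. This is only a reconstruction-error illustration and says nothing about task accuracy.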

No routing and no per-token adaptive bit allocation: the lever is granularity, not bit allocation (the finding of predecessor study c16). See REVAMP_SPEC.md §4 for the full compression accounting.
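For intuition about where ratios in this range come from, the arithmetic can be sketched as code bits plus amortized metadata. Every parameter below is an assumption chosen for illustration (group size 128, a 16-bit scale plus a 16-bit offset per group, head dimension 128, no residual-buffer overhead); the authoritative accounting is REVAMP_SPEC.md §4 and may differ.

```python
FP16_BITS = 16.0

def bits_per_element(code_bits, group_size, meta_bits_per_group, fp16_fraction=0.0):
    """Average stored bits per element: quantized codes plus amortized
    scale/offset metadata, with `fp16_fraction` of elements kept in FP16."""
    quantized = code_bits + meta_bits_per_group / group_size
    return fp16_fraction * FP16_BITS + (1.0 - fp16_fraction) * quantized

def combined_ratio(key_bits, value_bits):
    """Compression vs. FP16, assuming K and V hold equal element counts."""
    return FP16_BITS / ((key_bits + value_bits) / 2.0)

# Assumed parameters (not taken from the spec).
G = 128          # elements sharing one scale/offset
META = 32        # 16-bit scale + 16-bit offset per group
HEAD_DIM = 128   # channels per head
OUTLIERS = 2     # key channels kept in FP16 for CQ-4+

ratio_cq8 = combined_ratio(bits_per_element(8, G, META), bits_per_element(8, G, META))
ratio_cq4 = combined_ratio(bits_per_element(4, G, META), bits_per_element(4, G, META))
ratio_cq4_plus = combined_ratio(
    bits_per_element(4, G, META, fp16_fraction=OUTLIERS / HEAD_DIM),
    bits_per_element(4, G, META),
)
print(f"CQ-8 {ratio_cq8:.2f}x  CQ-4 {ratio_cq4:.2f}x  CQ-4+ {ratio_cq4_plus:.2f}x")
```

The point of the sketch is the shape of the trade-off: metadata keeps every tier slightly below its nominal 2× or 4×, and holding key channels in FP16 costs a little more ratio on top.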

Scope

This repo is the quantization algorithm, a bit-accurate reference model, the accuracy/ablation experiments, and the method paper. It is not the silicon. The RTL and tape-out live in the KVCE block (the kv-cache-engine repo, a silicon block of the Longhorn chip), which is being revamped separately to implement ChannelQuant. This repo's hardware-facing deliverable is docs/HW_CONTRACT.md (the algorithm-to-silicon interface KVCE implements against) plus golden test vectors (reference/testvectors/) for KVCE's Python↔C++↔SV parity. Silicon results (area/Fmax vs TurboQuant+) will come from the KVCE block and are noted as forthcoming; they are not produced here.

Repository layout

channelquant/
├── README.md                  # this file
├── NOTES.md                   # lab notebook — dated entries, full provenance
├── CLAIMS.md                  # claims ledger — every number → artifact + script
├── REVAMP_SPEC.md             # the design spec (source of truth)
├── analysis/                  # accuracy sweeps, calibration, group-size Pareto
│   ├── c17_quantizer_sweep.py        # the decisive 4-bit sweep (reproduced Phase 1)
│   ├── c19_outlier_stability.py      # static-mask gate (PASSED)
│   ├── c18b_quantizer_fig.py         # figure regen
│   ├── outlier_calibration.py        # offline per-layer outlier mask  (Phase 2)
│   ├── group_size_sweep.py           # G ∈ {32,64,128,256} Pareto      (Phase 2)
│   └── c1*_*_summary.json            # logged results
├── reference/                 # bit-accurate reference: CQ-8/CQ-4/CQ-4+ (Py, then C++)
│   └── testvectors/           # GOLDEN VECTORS for the KVCE block's 3-way parity
└── docs/
    ├── HW_CONTRACT.md                   # algorithm→silicon interface (KVCE consumes this)
    ├── research_kv_quant_landscape.md   # prior-art landscape (KIVI/KVQuant/vLLM)
    └── paper/                            # the method paper (maintained from Phase 1)

Build status

Phase	Deliverable	Status
0	Repo scaffold + README + NOTES + HW_CONTRACT	done
1	Reference model (CQ-8/4/4+) + reproduce c17 (±0.02) + golden vectors	pending
2	Outlier calibrator + group-size Pareto	pending
3	Generalization (non-Qwen GQA + ARC-Challenge)	pending
4	Method paper assembly	pending

Each phase gates the next (see REVAMP_SPEC.md §7). c19 (the static-mask gate) already passed; Phase 1's c17 reproduction is the next blocking gate. RTL and synthesis are out of scope — they belong to the KVCE block, which implements this repo's docs/HW_CONTRACT.md and validates against reference/testvectors/.

Environment

export HF_HOME=/home/chaithu/lhs/.hf_cache        # ~/.cache is root-owned
export MPLCONFIGDIR=/tmp/mpl                       # for figure scripts
/home/chaithu/lhs/.venv/bin/python <script>        # transformers 5.10.2, torch 2.12.0+cu130

The predecessor KVCE pq4 vector codec hung the GB10. ChannelQuant is plain integer arithmetic (no PolarQuant tables) and is expected to be GPU-safe, but small CPU validation runs go first and all GPU runs are monitored.

Lineage

ChannelQuant supersedes TurboQuant+ (PolarQuant + QJL + Walsh–Hadamard rotation; the kv-cache-engine repo), which reached a comparable ratio (~3.5×) but with a −0.10 HellaSwag acc_norm collapse on Qwen2 GQA. The diagnosis and decisive evidence originated in the adaptive-precision-attention lab notebook (studies c13–c19); ChannelQuant is a self-contained repo that re-derives its own artifacts. See docs/research_kv_quant_landscape.md for the prior-art landscape and REVAMP_SPEC.md for the full design.
