From 949a9ae35d2139e6eae09e2137a46d76b1525c33 Mon Sep 17 00:00:00 2001 From: Nicholas Gates Date: Thu, 16 Apr 2026 11:51:45 -0400 Subject: [PATCH] Remove un-accepted RFCs from develop These RFCs were merged as "proposed" under the old directory structure but were never formally accepted. Removing them so they can be re-proposed as proper PRs. - RFC 0005: Extension Types - RFC 0023: File back compat testing - RFC 0029: Formalize the Vortex type system - RFC 0033: Block TurboQuant Co-Authored-By: Claude Opus 4.6 (1M context) --- rfcs/0005-extension.md | 253 ------ rfcs/0023-file-compat-testing.md | 394 -------- rfcs/0029-types.md | 406 --------- rfcs/0033-block-turboquant.md | 1438 ------------------------------ 4 files changed, 2491 deletions(-) delete mode 100644 rfcs/0005-extension.md delete mode 100644 rfcs/0023-file-compat-testing.md delete mode 100644 rfcs/0029-types.md delete mode 100644 rfcs/0033-block-turboquant.md diff --git a/rfcs/0005-extension.md b/rfcs/0005-extension.md deleted file mode 100644 index d35e109..0000000 --- a/rfcs/0005-extension.md +++ /dev/null @@ -1,253 +0,0 @@ -- Start Date: 2026-02-27 -- Authors: Connor Tsui -- RFC PR: [vortex-data/rfcs#5](https://github.com/vortex-data/rfcs/pull/5) - -# Extension Types - -## Summary - -We would like to build a more robust system for extension data types (or `DType`s). This RFC -proposes a direction for extending the `ExtVTable` trait to support richer behavior (beyond -forwarding to the storage type), lays out the completed and in-progress work, and identifies the -open questions that remain. - -## Motivation - -A limitation of the current type system in Vortex is that we cannot easily add new logical types. -For example, the effort to add `FixedSizeList` -([vortex#4372](https://github.com/vortex-data/vortex/issues/4372)) and also change `List` to -`ListView` ([vortex#4699](https://github.com/vortex-data/vortex/issues/4699)) was very intrusive. 
-It is much easier to add wrappers around canonical types (treating the canonical dtype as a -"storage type") and implement some additional logic than to add a new variant to the `DType` enum. - -### Storage DTypes - -Extension types work by wrapping an existing canonical `DType`, called the **storage dtype**. The -storage dtype is itself a logical type (e.g., `Primitive`, `Struct`, `List`), and the extension -type is a logical wrapper over it that layers on additional semantics such as validation, display -formatting, and (eventually) custom compute logic. - -For example, a `Timestamp` extension type has a `Primitive` storage dtype. Under the hood, a -timestamp array is just a primitive array of integers, but the extension layer knows that those -integers represent microseconds since the Unix epoch. Similarly, a `Union` extension type might -use `Struct` as its storage dtype, wrapping a struct of fields with union-specific dispatch logic. - -This separation means that adding a new logical type does not require changes to the core canonical -type system, the compressor, or the I/O layer. Extension types get compression for free because -data is always read from and written to disk as the underlying storage dtype. - -### Current State - -Vortex provides an `Extension` variant of `DType` to help with this. Currently, implementors can -add a new extension type by defining an extension ID (for example, `vortex.time` or `vortex.date`) -and specifying a storage dtype. For example, the time extension types use a primitive storage dtype, -meaning they wrap the primitive scalars or primitive arrays with some extra logic on top (mostly -validating that the timestamps are valid). - -We would like to add many more extension types. Some notable extension types (and their likely -storage types) include: - -- **Matrix / Tensor**: This would be an extension over `FixedSizeList`, where dimensions correspond - to levels of nesting. 
There are many open questions on the design of this, but that is out of - scope of this RFC. -- **Union**: The sum type of an algebraic data type, like a Rust enum. One approach is to implement - this with a type tag paired with a `Struct` (so `Struct { Primitive, Struct { types } }`). - Vortex is well suited to represent this because it can compress each of the type field arrays - independently, so we do not need to distinguish between a "Sparse" or "Dense" Union. -- **UUID**: Since this is a 128-bit number, we likely want to add `FixedSizeBinary`. This is out of - scope for this RFC. - -The issue with the current system is that it only forwards logic to the underlying storage type. -The only other behavior we support is serializing and pretty-printing extension arrays. This means -that we cannot define custom compute logic for extension types. - -Take the time extension types as an example of where this limitation does not matter. If we want to -run a `compare` expression over a timestamp array, we just run the `compare` over the underlying -primitive array. For simple types like timestamps, this is sufficient (and this is what we do right -now). For types like Tensors (which are simply type aliases over `FixedSizeList`), this is also -fine. - -However, for more complex types like UUID, Union, or JSON, forwarding to the storage type is likely -insufficient as these types need custom compute logic. Given that, we want a more robust -implementation path instead of wrapping `ExtensionArray` and performing significant internal -dispatch work. - -## Design - -### Background - -[vortex#6081](https://github.com/vortex-data/vortex/pull/6081) introduced vtables (virtual tables, -or Rust unit structs with methods) for extension `DType`s. Each extension type (e.g., `Timestamp`) -now implements `ExtDTypeVTable`, which handles validation, serialization, and metadata. -The type-erased `ExtDTypeRef` carries this vtable with it inside `DType::Extension`. 
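
To make the vtable pattern concrete before discussing the blockers, here is an illustrative, dependency-free sketch of a type-erased vtable carried inside a `DType` variant. Every type here is a simplified, hypothetical stand-in (the real `ExtDTypeVTable` also handles metadata and serialization, as described above):

```rust
use std::fmt::Debug;
use std::sync::Arc;

// Hypothetical, simplified stand-ins for the real Vortex types.
#[derive(Debug, Clone, PartialEq)]
pub enum DType {
    Primitive,
    Extension(ExtDTypeRef),
}

// Object-safe vtable: each extension type implements this once.
pub trait ExtDTypeVTable: Debug + Send + Sync {
    fn id(&self) -> &'static str;
    /// Validate that the storage dtype is acceptable for this extension type.
    fn validate_dtype(&self, storage: &DType) -> Result<(), String>;
}

// Type-erased reference carried inside `DType::Extension`.
#[derive(Debug, Clone)]
pub struct ExtDTypeRef {
    vtable: Arc<dyn ExtDTypeVTable>,
    storage: Box<DType>,
}

impl PartialEq for ExtDTypeRef {
    fn eq(&self, other: &Self) -> bool {
        self.vtable.id() == other.vtable.id() && self.storage == other.storage
    }
}

impl ExtDTypeRef {
    /// Construction validates the storage dtype through the vtable.
    pub fn try_new(vtable: Arc<dyn ExtDTypeVTable>, storage: DType) -> Result<Self, String> {
        vtable.validate_dtype(&storage)?;
        Ok(Self { vtable, storage: Box::new(storage) })
    }

    pub fn id(&self) -> &'static str {
        self.vtable.id()
    }
}

// Example extension type: a timestamp stored as a primitive integer.
#[derive(Debug)]
pub struct TimestampVTable;

impl ExtDTypeVTable for TimestampVTable {
    fn id(&self) -> &'static str {
        "vortex.timestamp"
    }

    fn validate_dtype(&self, storage: &DType) -> Result<(), String> {
        match storage {
            DType::Primitive => Ok(()),
            other => Err(format!("timestamp requires primitive storage, got {other:?}")),
        }
    }
}
```

The point of the sketch is the shape: callers only ever hold the type-erased `ExtDTypeRef`, and all extension-specific behavior is reached through the vtable it carries.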
-There were a few blockers (detailed in the tracking issue
-[vortex#6547](https://github.com/vortex-data/vortex/issues/6547)),
-but now that those have been resolved, we can move forward.
-
-### Proposed Design
-
-Now that `vortex-scalar` and `vortex-dtype` have been merged into `vortex-array`, we can place
-all extension logic (for types, scalars, and arrays) onto an `ExtVTable` (renamed from
-`ExtDTypeVTable`).
-
-It will look something like the following:
-
-```rust
-// Note: naming should be considered unstable.
-
-/// The public API for defining new extension types.
-///
-/// This is the non-object-safe trait that plugin authors implement to define a new extension
-/// type. It specifies the type's identity, metadata, serialization, and validation.
-pub trait ExtVTable: 'static + Sized + Send + Sync + Clone + Debug + Eq + Hash {
-    /// Associated type containing the deserialized metadata for this extension type.
-    type Metadata: 'static + Send + Sync + Clone + Debug + Display + Eq + Hash;
-
-    /// A native Rust value that represents a scalar of the extension type.
-    ///
-    /// The value only represents non-null values. We denote nullable values as `Option`.
-    type NativeValue<'a>: Display;
-
-    /// Returns the ID for this extension type.
-    fn id(&self) -> ExtId;
-
-    // Methods related to the extension `DType`.
-
-    /// Serialize the metadata into a byte vector.
-    fn serialize_metadata(&self, metadata: &Self::Metadata) -> VortexResult<Vec<u8>>;
-
-    /// Deserialize the metadata from a byte slice.
-    fn deserialize_metadata(&self, metadata: &[u8]) -> VortexResult<Self::Metadata>;
-
-    /// Validate that the given storage type is compatible with this extension type.
-    fn validate_dtype(&self, metadata: &Self::Metadata, storage_dtype: &DType) -> VortexResult<()>;
-
-    // Methods related to the extension scalar values.
-
-    /// Validate that the given storage value is compatible with the extension type.
-    ///
-    /// By default, this calls [`unpack_native()`](ExtVTable::unpack_native) and discards the
-    /// result.
-    ///
-    /// # Errors
-    ///
-    /// Returns an error if the storage [`ScalarValue`] is not compatible with the extension type.
-    fn validate_scalar_value(
-        &self,
-        metadata: &Self::Metadata,
-        storage_dtype: &DType,
-        storage_value: &ScalarValue,
-    ) -> VortexResult<()> {
-        self.unpack_native(metadata, storage_dtype, storage_value)
-            .map(|_| ())
-    }
-
-    /// Validate and unpack a native value from the storage [`ScalarValue`].
-    ///
-    /// Note that [`ExtVTable::validate_dtype()`] is always called first to validate the storage
-    /// [`DType`], and the [`Scalar`](crate::scalar::Scalar) implementation will verify that the
-    /// storage value is compatible with the storage dtype on construction.
-    ///
-    /// # Errors
-    ///
-    /// Returns an error if the storage [`ScalarValue`] is not compatible with the extension type.
-    fn unpack_native<'a>(
-        &self,
-        metadata: &'a Self::Metadata,
-        storage_dtype: &'a DType,
-        storage_value: &'a ScalarValue,
-    ) -> VortexResult<Self::NativeValue<'a>>;
-
-    // `ArrayRef`
-
-    fn validate_array(&self, metadata: &Self::Metadata, storage_array: &ArrayRef) -> VortexResult<()>;
-    fn cast_array(&self, metadata: &Self::Metadata, array: &ArrayRef, target: &DType) -> VortexResult<ArrayRef> { ... }
-    // Additional compute methods TBD.
-}
-```
-
-Most of the implementation work will be making sure that `ExtDTypeRef` (which we pass around as the
-`Extension` variant of `DType`) has the correct methods that access the internal, type-erased
-`ExtVTable`.
-
-Take extension scalars as an example. The only behavior we need from extension scalars is validating
-that they have correct values, displaying them, and unpacking them into native types. So we added
-these methods to `ExtDTypeRef`:
-
-```rust
-impl ExtDTypeRef {
-    /// Formats an extension scalar value using the current dtype for metadata context.
- pub fn fmt_storage_value<'a>( - &'a self, - f: &mut fmt::Formatter<'_>, - storage_value: &'a ScalarValue, - ) -> fmt::Result { ... } - - /// Validates that the given storage scalar value is valid for this dtype. - pub fn validate_storage_value(&self, storage_value: &ScalarValue) -> VortexResult<()> { ... } -} -``` - -**Open question**: What should the API for extension arrays look like? The answer will determine -what additional methods `ExtDTypeRef` needs beyond the scalar-related ones shown above. - -## Compatibility - -This should not break anything because extension types are mostly related to in-memory APIs (since -data is read from and written to disk as the storage type). - -## Drawbacks - -If forwarding to the storage type turns out to be sufficient for all extension types, the -additional vtable surface area adds complexity without clear benefit. - -## Alternatives - -We could have many `ExtensionArray` wrappers with custom logic. This approach would be clunky and -may not scale. - -## Prior Art - -Apache Arrow allows defining -[extension types](https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types) -and also provides a -[set of canonical extension types](https://arrow.apache.org/docs/format/CanonicalExtensions.html). - -## Unresolved Questions - -- Is forwarding to the storage type insufficient, and which extension types genuinely need custom - compute logic? -- What should the `ExtVTable` API for extension arrays look like? What methods beyond - `validate_array` are needed? -- How should compute expressions be defined and dispatched for extension types? - -## Future Possibilities - -If we can get extension types working well, we can add all of the following types: - -- `DateTimeParts` (`Primitive`) -- Matrix (`FixedSizeList`) -- Tensor (`FixedSizeList`) -- UUID (Do we need to add `FixedSizeBinary` as a canonical type?) 
-- JSON (`UTF8`) -- PDX: https://arxiv.org/pdf/2503.04422v1 (`FixedSizeList`) -- Union - - Sparse (`Struct { Primitive, Struct { types } }`) - - Dense[^1] -- Map (`List`) -- Tags: See this - [discussion](https://github.com/vortex-data/vortex/discussions/5772#discussioncomment-15279892), - where we think we can represent this with (`ListView`) -- `Struct` but with protobuf-style field numbers (`Struct`) -- **NOT** Variant: see [RFC 0015 (Variant Type)](../accepted/0015-variant-type.md). Variant cannot - be an extension type because there is no way to define a storage dtype when the schema is not - known ahead of time for each row. Instead, Variant will have its own `DType` variant. -- And likely more. - -[^1]: - `Struct` doesn't work here because children can have different lengths, but what we could do - is simply force the inner `Struct { types }` to hold `SparseArray` fields, which would - effectively be the exact same but with the overhead of tracking indices for each of the child - fields. In that case, it might just be better to always use a "sparse" union and let the - compressor decide what to do. diff --git a/rfcs/0023-file-compat-testing.md b/rfcs/0023-file-compat-testing.md deleted file mode 100644 index f22eac3..0000000 --- a/rfcs/0023-file-compat-testing.md +++ /dev/null @@ -1,394 +0,0 @@ -- Start Date: 2026-03-03 -- Authors: Joe Isaacs -- Tracking Issue: TBD - -## Summary - -A backward compatibility testing framework for the Vortex file format, consisting of a **generator** that writes fixture `.vortex` files and a **reader** that validates them. Both are maintained on `develop` and backported to selected release branches so that each version can produce fixtures with its writer and verify fixtures from all earlier versions with its reader. Fixtures are stored in a public S3 bucket and validated in a weekly CI job. - -## Motivation - -Vortex guarantees backward compatibility from release 0.36.0, but there are no tests validating this. 
Format-level changes can silently break old-file compatibility, and without automated checks we won't know until a user hits it in production. - -## Design - -### Overview - -We maintain one set of fixture `.vortex` files per release, from 0.36.0 through to the latest. Generation is manual (triggered per release or backfilled), so some intermediate versions may be skipped. The fixture sets are stored in a public S3 bucket, and a weekly CI job validates that the current reader can still open all of them. - -Two binaries in a standalone crate (`vortex-test/compat-gen/`), not a workspace member. The crate uses path deps to workspace crates, so it compiles against whatever version is checked out. - -``` - v0.36.0 v0.58.0 HEAD - ┌──────────┐ ┌──────────┐ ┌──────────┐ - │compat-gen│──upload──> │compat-gen│──upload──> │compat-gen│──upload──> - └──────────┘ │ └──────────┘ │ └──────────┘ │ - v v v - S3: v0.36.0/ S3: v0.58.0/ S3: vHEAD/ - │ │ │ - └────────────┬───────────┘────────────────────────┘ - v - ┌────────────┐ - │compat-test │ (at any version: reads ALL - │ │ fixtures from <= that version) - └────────────┘ -``` - -| Binary | Purpose | -| ------------- | ------------------------------------------------------------------------------- | -| `compat-gen` | Write fixture `.vortex` files + a `manifest.json` listing them | -| `compat-test` | Fetch fixtures from S3, read them, rebuild expected arrays, `assert_arrays_eq!` | - -When cherry-picked onto an old release branch the only thing that changes is a thin API adapter layer (~20 lines that call the version's write/read API). Everything else — fixture definitions, correctness checks — stays identical. 
- -### Fixture Suite - -**Synthetic fixtures** (deterministic, hardcoded values): - -| File | Schema | Data | Purpose | -| ---------------------- | ----------------------------------------------- | -------------------------------------- | -------------------------- | -| `primitives.vortex` | `Struct{u8, u16, u32, u64, i32, i64, f32, f64}` | Boundary values (0, min, max) per type | Primitive type round-trip | -| `strings.vortex` | `Struct{Utf8}` | `["", "hello", "こんにちは", "🦀"]` | String encoding round-trip | -| `booleans.vortex` | `Struct{Bool}` | `[true, false, true, true, false]` | Bool round-trip | -| `nullable.vortex` | `Struct{Nullable, Nullable}` | Mix of values and nulls | Null handling | -| `struct_nested.vortex` | `Struct{Struct{i32, Utf8}, f64}` | Nested struct | Nested type round-trip | -| `chunked.vortex` | Chunked `Struct{u32}` | 3 chunks of 1000 rows each | Multi-chunk files | - -Every stable array encoding should also contribute a fixture file — a struct with multiple columns, each using a different encoding of that array type. This ensures that encoding-specific read paths are exercised across versions. - -**Realistic fixtures** (real-world schemas and data distributions): - -| File | Source | Rows | Purpose | -| --------------------------- | ------------------------------------ | ---- | ------------------------------------------- | -| `tpch_lineitem.vortex` | TPC-H SF 0.01, `lineitem` table | ~60K | Real-world numeric + string schema | -| `tpch_orders.vortex` | TPC-H SF 0.01, `orders` table | ~15K | Date + decimal types | -| `clickbench_hits_1k.vortex` | First 1000 rows of ClickBench `hits` | 1000 | Wide table (105 columns), deep nested types | - -SF 0.01 is used instead of 0.1 to keep fixture file sizes small (~few MB) so downloads in tests are fast. - -### Fixture Trait - -Each fixture implements a common trait that the generator and tester both use: - -```rust -trait Fixture { - /// The filename for this fixture (e.g., "primitives.vortex"). 
-    fn name(&self) -> &str;
-
-    /// Build the expected array. Must be deterministic.
-    fn build(&self) -> ArrayRef;
-}
-```
-
-A single `Fixture` impl is sufficient for both generation and validation:
-
-- `compat-gen` calls `build()` and writes the result to disk
-- `compat-test` calls the same `build()` to produce the expected array and compares it against what was read from the old file via `assert_arrays_eq!`
-
-All fixture types — synthetic, TPC-H, ClickBench — implement the same trait. The registry is just a
-`Vec<Box<dyn Fixture>>`.
-
-```rust
-// Synthetic: hardcoded values
-struct PrimitivesFixture;
-impl Fixture for PrimitivesFixture {
-    fn name(&self) -> &str { "primitives.vortex" }
-    fn build(&self) -> ArrayRef {
-        StructArray::from_fields(&[
-            ("u8", vec![0u8, 128, 255].into_array()),
-            ("u16", vec![0u16, 32768, 65535].into_array()),
-            // ...
-        ]).into_array()
-    }
-}
-
-// TPC-H: deterministic via tpchgen
-struct TpchLineitemFixture;
-impl Fixture for TpchLineitemFixture {
-    fn name(&self) -> &str { "tpch_lineitem.vortex" }
-    fn build(&self) -> ArrayRef {
-        // generate via tpchgen-arrow at SF 0.01
-    }
-}
-```
-
-### Correctness Strategy
-
-Correctness is validated by **comparing arrays in memory** — no checksums or spot-checks needed.
-
-For every fixture in every version:
-
-1. Download the old `.vortex` file from S3 (written by an older Vortex version)
-2. Read it into an array with the current reader
-3. Call `fixture.build()` to produce the expected array at the current version
-4. `assert_arrays_eq!(actual, expected)`
-
-This works because all fixture builders are deterministic: synthetic fixtures use hardcoded values,
-TPC-H uses `tpchgen` (deterministic per SF), and ClickBench uses an immutable public parquet file.
-
-### Manifest Format
-
-Each version's fixture set includes a `manifest.json` sidecar that lists the fixtures available for that version.
This allows `compat-test` to discover what to download and handles the case where newer versions add new fixture types.
-
-```json
-{
-  "version": "0.36.0",
-  "generated_at": "2025-01-15T10:30:00Z",
-  "fixtures": [
-    "primitives.vortex",
-    "strings.vortex",
-    "booleans.vortex",
-    "nullable.vortex",
-    "struct_nested.vortex",
-    "chunked.vortex",
-    "tpch_lineitem.vortex",
-    "tpch_orders.vortex",
-    "clickbench_hits_1k.vortex"
-  ]
-}
-```
-
-### API Adapter Layer
-
-The only part that changes per version. When cherry-picking onto an old branch, you adapt this module (~20 lines).
-
-```rust
-// ---- adapter.rs (current API, HEAD) ----
-use vortex::VortexSession;
-
-pub fn write_file(path: &Path, stream: impl ArrayStream) -> Result<()> {
-    let session = VortexSession::default();
-    let rt = tokio::runtime::Runtime::new()?;
-    rt.block_on(async {
-        let mut file = tokio::fs::File::create(path).await?;
-        session.write_options().write(&mut file, stream).await?;
-        Ok(())
-    })
-}
-
-pub fn read_file(bytes: Bytes) -> Result<ArrayRef> {
-    let session = VortexSession::default();
-    session.open_options().open_buffer(bytes)
-}
-```
-
-```rust
-// ---- adapter.rs (0.36.0 API) ----
-pub fn write_file(path: &Path, stream: impl ArrayStream) -> Result<()> {
-    let rt = tokio::runtime::Runtime::new()?;
-    rt.block_on(async {
-        let mut file = tokio::fs::File::create(path).await?;
-        VortexWriteOptions::default().write(&mut file, stream).await?;
-        Ok(())
-    })
-}
-
-pub fn read_file(bytes: Bytes) -> Result<ArrayRef> {
-    VortexOpenOptions::in_memory().open(bytes)
-}
-```
-
-### S3 Layout (Public Bucket)
-
-Fixtures are stored in a **public S3 bucket** so that anyone can run `compat-test` locally without credentials, and CI doesn't need special S3 auth for reads. Only uploads (from `compat-gen`) require write credentials.
-
-```
-s3://vortex-compat-fixtures/ (public read)
-  v0.36.0/
-    manifest.json
-    primitives.vortex
-    strings.vortex
-    ...
-  v0.58.0/
-    manifest.json
-    ...
-``` - -Fixtures are also accessible via plain HTTPS (`https://vortex-compat-fixtures.s3.amazonaws.com/v0.36.0/primitives.vortex`), so `compat-test` can use either anonymous S3 access or plain HTTP — no AWS SDK configuration required. - -### Adding New Fixtures in Future Releases - -When a future release adds support for a new type or feature (e.g., list arrays, extension types), we want to add a fixture that exercises it. - -The manifest handles this naturally. Each version's `manifest.json` lists exactly which fixtures exist. `compat-test` only validates what's listed: - -``` -v0.36.0/manifest.json → ["primitives.vortex", "strings.vortex", ...] -v0.65.0/manifest.json → ["primitives.vortex", "strings.vortex", ..., "list.vortex"] -``` - -Adding a new fixture: - -1. Add the builder function in `fixtures/` (e.g., `build_list_array()`) -2. Register it in `fixtures/mod.rs` so `compat-gen` includes it -3. Tag a release — the pre-release CI job generates fixtures including the new one -4. Old versions are untouched — their manifests don't mention the new fixture - -The `FIXTURE_REGISTRY` maps fixture names to builder functions. If a fixture name from an old manifest isn't in the current registry (e.g., a fixture was retired), it's skipped with a warning rather than failing. - -```rust -for version in discover_versions_from_s3() { - let manifest = fetch_manifest(version); - for fixture_name in manifest.fixtures { - if let Some(builder) = FIXTURE_REGISTRY.get(fixture_name) { - let old_bytes = fetch_fixture(version, fixture_name); - let old_array = read_file(old_bytes); - let expected = builder(); - assert_arrays_eq!(old_array, expected); - } else { - warn!("Unknown fixture {fixture_name} in {version}, skipping"); - } - } -} -``` - -### CI Workflow - -**Pre-release upload** (`compat-gen-upload.yml`): Triggered automatically when a version tag is pushed, or manually via `workflow_dispatch` with a tag input. 
Generates fixtures at that version and uploads to the public S3 bucket, replacing any existing files under that version's prefix only (other versions are untouched). - -```yaml -name: Compat Fixture Upload -on: - push: - tags: ["[0-9]+.[0-9]+.[0-9]+"] - workflow_dispatch: - inputs: - tag: - description: "Git tag to generate fixtures for (e.g. 0.58.0)" - required: true - -jobs: - upload-fixtures: - runs-on: ubuntu-latest - permissions: - id-token: write - steps: - - uses: actions/checkout@v4 - with: - ref: ${{ github.event.inputs.tag || github.ref_name }} - - - uses: dtolnay/rust-toolchain@stable - - - name: Generate fixtures - run: | - VERSION=${{ github.event.inputs.tag || github.ref_name }} - cargo run --manifest-path vortex-test/compat-gen/Cargo.toml \ - --bin compat-gen -- --version "$VERSION" --output /tmp/fixtures/ - - - name: Upload to S3 - run: | - VERSION=${{ github.event.inputs.tag || github.ref_name }} - aws s3 cp /tmp/fixtures/ \ - s3://vortex-compat-fixtures/v${VERSION}/ --recursive -``` - -For backfilling old versions (0.36.0, etc.) that predate the framework, use `workflow_dispatch` manually — the cherry-picked `adapter.rs` handles the old API. - -**Weekly compat check** (`compat-test-weekly.yml`): Runs weekly and on-demand. Downloads all fixture versions from S3 (public, no credentials needed) and validates them against the current reader at HEAD. 
- -```yaml -name: Compat Test -on: - schedule: - - cron: "0 6 * * 1" - workflow_dispatch: {} - -jobs: - compat-test: - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - uses: dtolnay/rust-toolchain@stable - - name: Run compat tests - run: | - cargo run --manifest-path vortex-test/compat-gen/Cargo.toml \ - --bin compat-test -``` - -### Crate Layout - -``` -vortex-test/compat-gen/ - Cargo.toml # standalone binary crate, path deps to workspace - src/ - main.rs # CLI entry point (--bin compat-gen) - adapter.rs # version-specific write/read API (~20 lines to adapt) - fixtures/ - mod.rs # fixture registry — maps name → builder function - synthetic.rs # build_primitives(), build_strings(), etc. - tpch.rs # build_tpch_lineitem(), build_tpch_orders() - clickbench.rs # build_clickbench_hits_1k() - manifest.rs # manifest.json serde (just a list of fixture names) - test_main.rs # --bin compat-test entry point - validate.rs # fetch from S3 + assert_arrays_eq! logic -``` - -The `fixtures/` module is the shared core: `compat-gen` calls each builder and writes to disk; `compat-test` calls the same builders to produce expected arrays and compares them against what was read from old files. 
- -The `Cargo.toml` is not listed in workspace members, so it doesn't affect the main build: - -```toml -[package] -name = "vortex-compat" -version = "0.1.0" - -[[bin]] -name = "compat-gen" -path = "src/main.rs" - -[[bin]] -name = "compat-test" -path = "src/test_main.rs" - -[dependencies] -vortex = { path = "../../vortex" } -vortex-array = { path = "../../vortex-array" } -vortex-file = { path = "../../vortex-file" } -vortex-buffer = { path = "../../vortex-buffer" } -tokio = { version = "1", features = ["full"] } -serde = { version = "1", features = ["derive"] } -serde_json = "1" -object_store = { version = "0.11", features = ["aws", "http"] } -clap = { version = "4", features = ["derive"] } -tpchgen = "2" -tpchgen-arrow = "2" -arrow = "57" -``` - -## Compatibility - -This RFC does not change the file format, wire format, or any public APIs. It is purely additive testing infrastructure. - -The `compat-gen` crate is standalone and not a workspace member, so it has no impact on the build or dependency graph of the main project. - -The only operational requirement is a public S3 bucket for fixture storage. Read access is anonymous; write access is restricted to CI with OIDC credentials. - -## Drawbacks - -- **S3 dependency**: Tests require network access to fetch fixtures. If S3 is unreachable, the weekly check skips rather than fails, but this means a full week could pass without validation. -- **Cherry-pick maintenance**: Backporting to old releases requires adapting `adapter.rs` to each version's write/read API. This is a small one-time cost per version (~20 lines) but does require someone to do it manually for versions that predate the framework. -- **Fixture storage cost**: Each version adds ~10–20 MB of fixtures to S3. At one version per release, this grows slowly, but over many years it accumulates. -- **`tpchgen` determinism assumption**: If the `tpchgen` crate changes its output for the same scale factor in a future version, the TPC-H comparison will fail. 
This is mitigable by pinning the crate version or regenerating fixtures. - -## Prior Art - -- **Apache Parquet**: The `parquet-testing` repo stores fixture files in git. Works because Parquet fixtures are small, but doesn't scale well. The Parquet project also has a formal compatibility test suite that validates readers against writers from different language implementations. -- **Apache Arrow IPC**: The `arrow-integration` project generates IPC files from each language implementation and cross-validates them. Similar to our approach but tests cross-language compat rather than cross-version. -- **Protocol Buffers**: Google maintains a `conformance` test suite that validates proto2/proto3 encoding across versions. The test runner is a separate binary, similar to our `compat-test`. -- **SQLite**: Maintains a set of test databases going back to very early versions. Their `sqldiff` tool can compare databases for equality. - -## Related RFCs - -This RFC depends on or is closely related to several topics that warrant their own RFCs: - -- **Stable array encodings**: A separate RFC should define what it means for an array encoding to be "stable" — i.e., the encoding's serialized format is frozen and the reader must support it across versions. This includes criteria for promoting an encoding to stable, the process for deprecating one, and what guarantees stable implies (e.g., bit-level format stability, metadata schema stability). -- **File format versioning**: How does the file format itself evolve? If the footer layout, segment format, or layout metadata changes, how do we version that and maintain backward compat? This RFC tests the outcome but doesn't define the versioning mechanism. -- **Encoding registry and discovery**: When the reader encounters an encoding ID it doesn't recognize (e.g., a file written by a newer version with a new encoding), what happens? Should it fail, skip, or fall back? This affects how we handle forward compatibility. 
- -## Unresolved Questions - -- **Bucket name and region**: The exact S3 bucket name (`vortex-compat-fixtures`) and region need to be decided. It should be in `us-east-1` for lowest latency from GitHub Actions runners. -- **Which versions to backfill**: We need to decide which historical versions to generate fixtures for. At minimum 0.36.0 (the first stable version) and the latest release, but intermediate versions (0.45.0, 0.50.0, 0.58.0) would increase coverage. - -## Future Possibilities - -- **Automated release pipeline**: When cutting a new release, the CI pipeline could automatically run `compat-gen` and upload fixtures, removing the manual step entirely. -- **Cross-language compat**: Once the Python and Java bindings have file readers, extend the framework to validate that Python/Java can read files written by the Rust writer (and vice versa). diff --git a/rfcs/0029-types.md b/rfcs/0029-types.md deleted file mode 100644 index cda5533..0000000 --- a/rfcs/0029-types.md +++ /dev/null @@ -1,406 +0,0 @@ -- Start Date: 2026-03-06 -- Authors: Connor Tsui -- RFC PR: [vortex-data/rfcs#29](https://github.com/vortex-data/rfcs/pull/29) - -# Vortex Type System - -## Summary - -Vortex separates logical types (`DType`) from physical encodings, but the boundary between them has -been defined informally. This has led to recurring debates, such as whether `FixedSizeBinary` -warrants a new `DType` variant or is merely `FixedSizeList` under a different name. More -fundamentally, we lack a shared vocabulary for reasoning about what makes two types "different" at -the logical level. - -This RFC formalizes the Vortex type system by grounding `DType` as a quotient type over physical -encodings: each `DType` variant names an equivalence class of encodings that decode to the same -logical values. It then uses refinement types to establish a decision framework for when new `DType` -variants are justified. 
A new logical type requires either semantic distinctness (a genuinely -different equivalence class) or a refinement predicate that gates operations unavailable on the -parent type. If justified, a second step determines whether the type belongs in core `DType` or as -an extension type. - -## Overview - -Vortex defines a set of logical types, each of which can represent many physical data encodings, and -a set of `Canonical` encodings that represent the different targets that arrays can decompress into. - -### Logical vs. Physical Types - -A **logical type** (`DType`) describes what the data means, independent of how it is stored (e.g., -`Primitive(I32)`, `Utf8`, `List(Primitive(I32))`). A **physical encoding** describes how data is -laid out in memory or on disk (e.g., flat buffer, dictionary-encoded, run-end-encoded, bitpacked). -Many physical encodings can represent the same logical type. - -Vortex separates these two concepts so that encodings and compute can evolve independently. Without -this separation, implementing `M` operations across `N` encodings requires `N * M` implementations. -With it, each encoding only needs to decompress itself and each operation only needs to target -decompressed forms, reducing the cost to `N + M`. See this -[blog post](https://spiraldb.com/post/logical-vs-physical-data-types) for more information. - -### What is a `Canonical` encoding? - -The `N + M` argument relies on a common decompression target that operations are implemented -against. A **canonical encoding** is a physical encoding chosen as the representation for a logical -type (e.g., `VarBinView` for `Utf8`, `ListView` for `List`). The choice is deliberate and -optimized for the common compute case, but not fundamental: nothing in theory privileges `ListView` -over `List` for list data, for example. - -### Extension Types - -Vortex's built-in set of logical types will not cover every use case. 
Extension types allow external
consumers to define their own logical types on top of existing `DType`s without modifying the core
type system. See [RFC 0005](./0005-extension.md) for the full design.

## Motivation

The separation between logical types and physical encodings described above has mostly worked well
for us. However, several recent discussions have revealed that this loose definition may be
insufficient.

For example, we would like to add a `FixedSizeBinary` type, but it is unclear whether it is
necessary, given that it is mostly equivalent to `FixedSizeList`. Are these actually different
logical types? What does a "different" logical type even mean?

Another recurring discussion is whether the choice of a canonical `ListView` is better or worse than
a canonical `List` ([vortex#4699](https://github.com/vortex-data/vortex/issues/4699)). Both have the
exact same logical type (same domain of values), but we are stuck choosing a single "canonical"
encoding that we force every array of type `List` to decompress into.

This RFC formalizes the Vortex type system definitions, and this formalization serves as the
foundation for reasoning about questions like these.

## Type Theory Background

This section introduces the type-theoretic concepts that underpin Vortex's `DType` system and its
relationship to physical encodings. Most of the maintainers already understand these concepts
intuitively, but there is value in mapping these implicit concepts to explicit theory.

Note that this section made heavy use of LLMs to help research and identify terms and definitions,
as the author of this RFC is notably _not_ a type theory expert.

### Equivalence Classes and `DType` as a Quotient Type

#### In Theory

An **equivalence relation** `~` on a set `S` is a relation that is reflexive (`a ~ a`), symmetric
(`a ~ b` implies `b ~ a`), and transitive (`a ~ b` and `b ~ c` implies `a ~ c`).
An equivalence -relation partitions `S` into disjoint subsets called **equivalence classes**, where each class -contains all elements that are equivalent to one another. - -A **quotient type** `A / ~` is formed by taking a type `A` and collapsing it by an equivalence -relation `~`. The elements of the quotient type are the equivalence classes themselves: not -individual values, but entire groups of values that are considered "the same." - -The critical property of a quotient type is that operations on it must be **well-defined**: they -must produce the same result regardless of which member of the class you operate on. Formally, if -`f : A → B` respects the equivalence relation (`a ~ a'` implies `f(a) = f(a')`), then `f` descends -to a well-defined function on the quotient `f' : A/~ → B`. - -#### In Vortex - -Consider the set of all physical array representations / encodings in Vortex: a dictionary-encoded -`i32` array, a run-end-encoded `i32` array, a bitpacked `i32` array, a flat Arrow `i32` buffer, -etc. - -Two physical encodings are logically equivalent if and only if they produce the same logical -sequence of values when decoded / decompressed. This equivalence relation partitions the space of -all physical encodings into equivalence classes, where each class corresponds to a single logical -column of data. - -A Vortex `DType` like `Primitive(I32, NonNullable)` **names** one of these equivalence classes. It -tells us what logical data we are working with, but says nothing about which physical encoding is -representing it. Thus, we can say that logical types in Vortex form equivalence classes, and `DType` -is the set of equivalence classes. More formally, `DType` is the quotient type over the space of -physical encodings, collapsed by the decoded / decompressed equivalence relation. - -This quotient structure imposes a concrete requirement: any operation defined on `DType` must -produce the same result regardless of which physical encoding backs the data. 
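To make the well-definedness requirement concrete, here is a minimal standalone sketch (toy types,
not the Vortex API): two physical encodings that decode to the same logical values must agree on any
operation that factors through `decode`.

```rust
// Toy sketch (not the Vortex API) of the well-definedness requirement: two
// physical encodings of the same logical `Primitive(I32)` column must agree
// on any operation that factors through `decode`, the projection to logical
// values.

/// A flat buffer of values.
struct FlatI32(Vec<i32>);

/// Run-end encoding: `values[i]` repeats up to (exclusive) offset `ends[i]`.
struct RunEndI32 {
    values: Vec<i32>,
    ends: Vec<usize>,
}

trait EncodedI32 {
    /// The projection to logical values.
    fn decode(&self) -> Vec<i32>;

    /// Depends only on the logical values, so it is well-defined on the
    /// quotient: any two encodings in the same equivalence class agree.
    fn scalar_at(&self, idx: usize) -> i32 {
        self.decode()[idx]
    }
}

impl EncodedI32 for FlatI32 {
    fn decode(&self) -> Vec<i32> {
        self.0.clone()
    }
}

impl EncodedI32 for RunEndI32 {
    fn decode(&self) -> Vec<i32> {
        let mut out = Vec::new();
        let mut start = 0;
        for (value, &end) in self.values.iter().zip(&self.ends) {
            out.extend(std::iter::repeat(*value).take(end - start));
            start = end;
        }
        out
    }
}
```

By contrast, an operation like "return the `ends` buffer" could only be defined on `RunEndI32`,
which is exactly why it does not descend to the quotient.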
For example, operations like `filter`, `take`, and `scalar_at` all satisfy this: they depend only on
the logical values, not on how those values are stored. However, an operation like "return the
`ends` buffer" is not well-defined on the quotient type, because that buffer exists only for
run-end encodings.

### Sections and Canonicalization

Observe that every physical array (a specific encoding combined with actual data) maps back to a
`DType`. A run-end-encoded `i32` array maps to `Primitive(I32)`, as does a dictionary-encoded `i32`
array. A `VarBinView` array can map to either `Utf8` or `Binary`, depending on whether its contents
are valid UTF-8. Call this projection `π : Array → DType`.

A **section** is a right-inverse of this projection: a function `s : DType → Array` that injects
each logical type back into the space of physical encodings, such that projecting back recovers the
original `DType` (`π(s(d)) = d`). In other words, a section answers the question: "given a logical
type, which physical encoding should I use to represent it?"

**In Vortex**, the current `to_canonical` function is a section. For each `DType`, it selects
exactly one canonical physical form:

```rust
/// The different logical types in Vortex (the different equivalence classes).
/// This is the quotient type!
pub enum DType {
    Null,
    Bool(Nullability),
    Primitive(PType, Nullability),
    Decimal(DecimalDType, Nullability),
    Utf8(Nullability),
    Binary(Nullability),
    List(Arc<DType>, Nullability),
    FixedSizeList(Arc<DType>, u32, Nullability),
    Struct(StructFields, Nullability),
    Extension(ExtDTypeRef),
}

/// We "choose" the set of representations for each of the logical types.
/// This is the image/result of the `to_canonical` function (where `to_canonical` is the section).
pub enum Canonical {
    Null(NullArray),
    Bool(BoolArray),
    Primitive(PrimitiveArray),
    Decimal(DecimalArray),
    VarBinView(VarBinViewArray), // Note that both `Utf8` and `Binary` map to `VarBinView`.
- List(ListViewArray), - FixedSizeList(FixedSizeListArray), - Struct(StructArray), - Extension(ExtensionArray), -} -``` - -More formally, `Canonical` enumerates the **image** of the section `to_canonical`. Note that the -section is _not_ a bijection between `DType` and `Canonical`: multiple logical types can share the -same canonical form. For example, both `Utf8` and `Binary` canonicalize to `VarBinView`. - -This non-bijection is deliberate. If `DType` and `Canonical` were in bijection, the physical type -system would "leak" into the logical types. - -For example, if two logically distinct types coincidentally share the same physical layout, a -bijective section would conflict with having both as separate logical `DType`s since "there is no -physical reason for the second." But this reasoning is backwards: logical types are justified by -their _semantics_ (the operations they gate and the refinement predicates they carry), not by -whether they coincidentally share a physical representation. - -`Canonical` also represents several arbitrary **choices**. Nothing in the theory privileges -`ListView` over `List` as the canonical representation for variable-length list data. Both are valid -sections (both pick a representation from the same equivalence class), and both satisfy -`π(s(d)) = d`. Even dictionary encoding or run-end encoding are theoretically valid sections. The -fact that we choose flat, uncompressed forms as canonical is a design choice optimized for compute, -not a theoretical requirement. - -### The Church-Rosser Property (Confluence) - -A rewriting system has the **Church-Rosser property** (or is **confluent**) if, whenever a term can -be reduced in two different ways, both reduction paths can be continued to reach the same final -result (called a **normal form**). For example, the expression `(2 + 3) * (1 + 1)` can be reduced -by evaluating the left subexpression first (`5 * (1 + 1)`) or the right first (`(2 + 3) * 2`), but -both paths arrive at `10`. 
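The arithmetic example above can be made concrete with a toy rewriting system (illustrative only,
not Vortex code): reducing either subexpression first reaches the same normal form.

```rust
// Toy confluence demo: (2 + 3) * (1 + 1) reduced left-subexpression-first
// or right-subexpression-first arrives at the same normal form.

#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Num(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

/// Perform one reduction step, preferring the chosen side when both
/// subexpressions are reducible.
fn step(e: Expr, left_first: bool) -> Expr {
    use Expr::*;
    match e {
        Num(n) => Num(n),
        Add(a, b) => match (*a, *b) {
            (Num(x), Num(y)) => Num(x + y),
            (a, b) => step_pair(a, b, left_first, Add),
        },
        Mul(a, b) => match (*a, *b) {
            (Num(x), Num(y)) => Num(x * y),
            (a, b) => step_pair(a, b, left_first, Mul),
        },
    }
}

fn step_pair(
    a: Expr,
    b: Expr,
    left_first: bool,
    ctor: fn(Box<Expr>, Box<Expr>) -> Expr,
) -> Expr {
    let a_reducible = !matches!(a, Expr::Num(_));
    let b_reducible = !matches!(b, Expr::Num(_));
    if a_reducible && (left_first || !b_reducible) {
        ctor(Box::new(step(a, left_first)), Box::new(b))
    } else {
        ctor(Box::new(a), Box::new(step(b, left_first)))
    }
}

/// Iterate `step` to a fixed point: the normal form.
fn normalize(mut e: Expr, left_first: bool) -> Expr {
    loop {
        let next = step(e.clone(), left_first);
        if next == e {
            return e;
        }
        e = next;
    }
}
```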
**In current Vortex**, `to_canonical` is confluent by construction. Applying `take`, `filter`, or
`scalar_at` before or after canonicalization produces the same logical values. There is one normal
form per `DType`, and every reduction path reaches it.

A **non-confluent** rewriting system is one where two reduction paths from the same starting point
can arrive at different normal forms. Non-confluent systems are well-studied, and the standard
approach is to define **separate reduction relations**, each of which is internally confluent.

For example, instead of one global set of reduction rules, you define two strategies: strategy A
always reduces to normal form X, and strategy B always reduces to normal form Y. Each strategy
satisfies Church-Rosser independently; the only difference is which normal form it targets. This
is relevant for future work: if Vortex were to support multiple canonical targets (e.g., both `List`
and `ListView`), each target would define an internally confluent reduction strategy.

### Refinement Types

A **refinement type** `{ x : T | P(x) }` is a type `T` restricted to values satisfying a predicate
`P`. Refinement types express subtypes without changing the underlying representation; instead,
they add constraints that gate operations or impose invariants.

For example, in Vortex, `Utf8` is a refinement of `Binary`:

```
Utf8 ~= { b : Binary | valid_utf8(b) }
```

Every `Utf8` value is a valid `Binary` value, but not every `Binary` value is valid `Utf8`. The
predicate `valid_utf8` is what justifies the separate `DType` variant: it gates string operations
(like substring, regex matching, and case conversion) that are not meaningful on arbitrary binary
data. Without this predicate, `Utf8` and `Binary` would be the same type, and maintaining both
would be redundant.
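The `Utf8`-over-`Binary` refinement can be sketched with ordinary Rust newtypes (names are
illustrative, not the Vortex API): the predicate is checked once at construction, and only the
refined type exposes string operations.

```rust
// Sketch of the refinement-type pattern: `Utf8` wraps `Binary` with the
// `valid_utf8` predicate established at construction, and only the refined
// type exposes string operations.

struct Binary(Vec<u8>);

struct Utf8(Binary);

impl Utf8 {
    /// The refinement predicate: construction succeeds only for valid UTF-8.
    fn try_new(b: Binary) -> Result<Utf8, Binary> {
        if std::str::from_utf8(&b.0).is_ok() {
            Ok(Utf8(b))
        } else {
            Err(b)
        }
    }

    /// A string operation gated by the predicate: meaningless on arbitrary
    /// binary data, safe here because the invariant holds.
    fn to_uppercase(&self) -> String {
        let Binary(bytes) = &self.0;
        std::str::from_utf8(bytes).unwrap().to_uppercase()
    }
}
```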
- -Similarly, `FixedSizeList` is a refinement of `List`: - -``` -FixedSizeList ~= { l : List | len(l) = n } -``` - -The predicate `len(l) = n` constrains the domain of values, and it does gate certain operations: -knowing the size at the type level enables static indexing, fixed-shape tensor operations, and -reshaping without runtime length checks. These two examples (`Utf8` and `FixedSizeList`) illustrate -how refinement predicates can justify new logical types through operation gating, which is central -to the decision framework detailed in the next section. - -## What justifies a new logical type? - -The formalizations above give us a two-step decision framework. The first step decides whether a new -logical type is justified at all, and the second decides whether it should be a core type (a new -variant of `DType`) or an extension type. - -**Step 1: Is a new logical type justified?** A new logical type is justified when it is -distinguishable from existing types by one of the following criteria: - -1. **Is it semantically distinct?** (Quotient type criterion.) The values must form a genuinely - different equivalence class, not merely a different physical layout. -2. **Does it have a refinement predicate that gates different operations?** (Refinement type - criterion.) A predicate that restricts the domain of values _and_ enables or disables specific - operations justifies a new logical type. For example, `valid_utf8` gates string operations that - are not meaningful on arbitrary binary data, so `Utf8` is a distinct logical type from `Binary`. - -If neither criterion applies, the difference is purely physical and belongs in the encoding layer. - -**Step 2: Core type or extension type?** Once a new logical type is justified, the question is -whether it belongs in the core `DType` enum or as an extension type (see -[RFC 0005](./0005-extension.md)). 
- -This must be a pragmatic decision: if the operations gated by the type are owned by core Vortex -(e.g., string kernels for `Utf8`), it should be a first-class `DType`. If the gated operations are -specific to external consumers (e.g., UUID-specific operations), an extension type suffices (see the -comparison to programming language built-in types below). - -### Pragmatic Choices: Core Types in Programming Languages - -Every programming language must decide which types are built-in primitives and which are -user-defined. This is a universal design decision, and different languages draw the boundary in -different places: - -- **OCaml**: `int`, `float`, `char`, `string`, `bool`, `unit`, `list`, `array`, `option`, `ref`, - and tuples are built-in. Notably, `char` is not a refinement of `int` at the language level even - though it is represented as one. -- **Haskell**: `Int`, `Integer`, `Float`, `Double`, `Char`, `Bool`, tuples, lists, `Maybe`, - `Either`, `IO`. `String` is defined as `[Char]` (a type alias, not a built-in), yet it is - pervasive in the language ecosystem. -- **Rust**: `i8` through `i128`, `u8` through `u128`, `f32`, `f64`, `bool`, `char`, `str`, tuples, - arrays, slices, and references. Rust distinguishes each integer width as a separate type rather - than having a single `Integer` type parameterized by bit-width. -- **Agda, Coq, Lean**: Proof assistants based on dependent type theory with essentially no built-in - data types. `Nat`, `Bool`, `List`, and everything else are defined inductively in standard - libraries. However, even these systems pragmatically add primitive types for performance: Coq - added `Int63` and `Float64` as kernel primitives, and Lean has opaque runtime types (`Nat`, - `UInt8`-`UInt64`, `Float`, `String`) that bypass the inductive definitions. The type theory is - maximally minimal, but practical implementations end up adding built-in representations anyway. 
- -Several observations are relevant to the Vortex type system: - -- Numeric types are universally built-in (even the proof assistants add them back), even though they - are derivable from more primitive constructs. The justification is performance and ergonomics. - This parallels Vortex having `Primitive(PType)` as a first-class `DType` rather than encoding - integers as `List` with a width constraint. -- String types are almost universally built-in, even though they are refinements of byte sequences. - Haskell's `String` is `[Char]`, and Rust's `str` is `[u8]` with a UTF-8 invariant. The - justification is the same as Vortex's `Utf8` over `Binary`: the `valid_utf8` predicate gates - enough core operations to warrant a dedicated type. -- Rust has separate types for `i8`, `i16`, `i32`, `i64`, and `i128` rather than a single `Integer` - type with a bit-width refinement. This is analogous to Vortex having separate `PType` variants for - each primitive width: each width gates different operations (e.g., SIMD lanes, overflow behavior) - and has different performance characteristics. - -The takeaway is that _every_ type system could in principle be collapsed to a minimal core (as the -proof assistants demonstrate), but no practical system does this. - -The question of "should X be a core type or a user-defined type?" is always answered by pragmatics: -does the type gate enough core operations and is it central enough to the system's compute model to -justify built-in support? Vortex's decision framework above is how we answer this question for -`DType` variants. - -### Should `FixedSizeList` be a type? - -We can apply this framework to an existing type, `FixedSizeList`: - -**Is it semantically distinct from an existing `DType`?** No. `FixedSizeList` has a different -_physical_ layout (no offsets buffer), but this is a physical difference, not a logical one. -Logically, it is a refinement type over `List` (as shown in the Refinement Types section above). 
- -**Does it gate different query operations?** Yes. Knowing the size at the type level enables -operations like static indexing, fixed-shape tensor reshaping, and compile-time size checks that are -not available on variable-size lists. - -By the decision framework, `FixedSizeList` is a justified logical type (it has a refinement -predicate that gates operations). - -The remaining question is Step 2: should it be a core `DType` or an extension type? The argument for -a core type (which is what we decided in the past) is that the fixed-size invariant is such a -pervasive feature in types (for example, fixed-shape tensors like vectors and matrices) that it -doesn't make sense to ship `FixedSizeList` as a second-class extension type. - -### Should `FixedSizeBinary` be a type? - -A similar question applies to `FixedSizeBinary` vs. `FixedSizeList`: - -**Is it semantically distinct from an existing `DType`?** This is debatable. `FixedSizeBinary` -and `FixedSizeList` have the same physical layout (a flat buffer of `n`-byte elements), but -semantically a "4-byte opaque blob" (e.g., a UUID prefix or hash) is arguably different from "a list -of 4 individual bytes." Whether this semantic distinction is strong enough to constitute a different -equivalence class is not obvious. - -**Does it have a refinement predicate that gates different operations?** This is unclear. -`FixedSizeBinary` signals "opaque binary data" (UUIDs, hashes, IP addresses) whereas -`FixedSizeList` signals "a list of bytes that happens to have a fixed length." One could -argue that `FixedSizeBinary` carries the invariant that individual bytes are not independently -meaningful, which would gate byte-level list operations. However, this invariant is weaker than -something like `valid_utf8`, and it is not obvious that the core Vortex library would ship any -operations gated by it. - -Both Step 1 criteria are inconclusive for `FixedSizeBinary`. 
There is a plausible semantic
distinction and a plausible (but weak) refinement predicate, but neither is clear-cut. If the answer
to either is **yes**, then we move to Step 2: does Vortex core own enough operations gated by this
type to justify a first-class `DType`, or should it be an extension type? This question remains
unresolved. See [Unresolved Questions](#unresolved-questions).

## Prior Art

- **Arrow** has a physical type system, defining both `List` and `ListView` (and
  `LargeList`/`LargeListView`) as separate type IDs in its columnar format. Arrow also
  distinguishes `FixedSizeBinary` from `FixedSizeList` at the type level.
- **Parquet** also has a physical type system, with `FIXED_LEN_BYTE_ARRAY` as a distinct primitive
  type separate from repeated groups. This is a nominal distinction similar to the
  `FixedSizeBinary` question.

## Unresolved Questions

- Should `FixedSizeBinary` be a `DType` variant (refinement type) or extension type metadata?
  See the [analysis above](#should-fixedsizebinary-be-a-type) for the case for and against. Neither
  argument is clearly stronger than the other, so comments would be appreciated!

## Future Possibilities

- We can have extension types be a generalization of the refinement type pattern: for example, we
  could enforce user-defined predicates that gate custom operations on existing `DType`s.
- Using the decision framework to audit existing `DType` variants and determine if any should be
  consolidated or split, as well as to make decisions about logical types we want to add (namely
  `FixedSizeBinary` and `Variant`).

## Further Reading

- **Equivalence classes and partitions.**
  [Wikipedia: Equivalence class](https://en.wikipedia.org/wiki/Equivalence_class).
- **Quotient types in type theory.**
  [nLab: quotient type](https://ncatlab.org/nlab/show/quotient+type).
  Altenkirch, Anberree, Li, "Quotient Types for Programmers"
  ([arXiv:1901.01006](https://arxiv.org/abs/1901.01006)).
- **Sections in category theory.**
  [Wikipedia: Section (category theory)](https://en.wikipedia.org/wiki/Section_(category_theory)).
- **Church-Rosser property and confluence.**
  [Wikipedia: Church-Rosser theorem](https://en.wikipedia.org/wiki/Church%E2%80%93Rosser_theorem).
  [Wikipedia: Confluence (abstract rewriting)](https://en.wikipedia.org/wiki/Confluence_(abstract_rewriting)).
  Baader & Nipkow, _Term Rewriting and All That_ (Cambridge University Press, 1998).
- **Refinement types.**
  [Wikipedia: Refinement type](https://en.wikipedia.org/wiki/Refinement_type).
  Rondon, Kawaguchi, Jhala, "Liquid Types"
  ([DOI:10.1145/1375581.1375602](https://doi.org/10.1145/1375581.1375602)).
- **Abstract types and existential quantification.**
  Mitchell & Plotkin, "Abstract Types Have Existential Type"
  ([DOI:10.1145/44501.45065](https://doi.org/10.1145/44501.45065)).
- **Type theory textbook.**
  Pierce, _Types and Programming Languages_ (MIT Press, 2002). Chapters on existential types,
  subtyping, and type equivalence.
- **Arrow columnar format.**
  [Apache Arrow Columnar Format](https://arrow.apache.org/docs/format/Columnar.html).
diff --git a/rfcs/0033-block-turboquant.md b/rfcs/0033-block-turboquant.md
deleted file mode 100644
index 8faa851..0000000
--- a/rfcs/0033-block-turboquant.md
+++ /dev/null
@@ -1,1438 +0,0 @@
# Block-Decomposed TurboQuant with PDX Layout

**Authors:** Will Manning
**Status:** Proposal
**Date:** 2026-04-02 (updated 2026-04-06)

## Summary

We propose evolving the [TurboQuant vector quantization encoding][current-impl]
in stages:

1. **MSE-only TurboQuant** (in progress — [PR #7269][current-impl]): a complete,
   self-contained building block. 8-bit default, internal zero-padding for
   non-power-of-2 dimensions, `FixedSizeListArray` rotation signs supporting
   variable SORF rounds.
2. **Block decomposition**: for dimensions where a valid B exists
   (greatest power-of-2 ≥ 64 dividing d), split into blocks of size B.
For - power-of-2 dimensions, B = d (single block). Dimensions with no qualifying - B fall back to internal zero-padding to power-of-2. Per-block norms stored as internal - children. -3. **PDX layout** (later): transpose codes into dimension-major order within - groups of 64 vectors for SIMD scan performance. - -QJL correction is deferred to a later stage and may ultimately be dropped. -Community findings from multiple independent TurboQuant implementations -often show that MSE-only outperforms MSE+QJL for KV-cache attention [8]. -For ANN ranking and vector-search workloads, the evidence is currently less -complete, so QJL should remain an empirical question rather than a settled -conclusion. - -[current-impl]: https://github.com/spiraldb/vortex/pull/7269 -[original-impl]: https://github.com/spiraldb/vortex/pull/7167 - -## Background - -### TurboQuant - -TurboQuant [1] is a lossy vector quantization algorithm for high-dimensional -embeddings. It works by: - -1. Randomly rotating a unit-norm vector so that each coordinate follows a known - marginal distribution — specifically `(1 - x²)^((d-3)/2)` on [-1, 1], a - concentrated Beta distribution (Lemma 1 in [1]). -2. Applying an MSE-optimal scalar quantizer (Max-Lloyd centroids) independently - to each coordinate. -3. Optionally adding a 1-bit QJL (Quantized Johnson-Lindenstrauss) correction - on the residual for unbiased inner product estimation (Theorem 2 in [1]). - -The paper prescribes a full random orthogonal rotation (QR decomposition of a -matrix with i.i.d. N(0,1) entries, yielding a Haar-uniform orthogonal matrix) -for the MSE stage — O(d²) storage and O(d²) per-vector. For the QJL stage, the -paper uses a random Gaussian projection matrix S with i.i.d. N(0,1) entries (not -an orthogonal rotation); this distinction matters for the unbiasedness proof. 
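Concretely, the Stage-2 block-size rule from the Summary (greatest power-of-2 ≥ 64 dividing d, else
fall back to padding) can be sketched as a small helper; the function name is illustrative, not from
the implementation.

```rust
// Illustrative helper: Stage 2 picks B as the greatest power of two >= 64
// that divides d; dimensions with no such factor fall back to zero-padding
// up to the next power of two.

/// Returns `Some(B)` when a qualifying block size exists, or `None` when the
/// dimension must fall back to internal zero-padding.
fn block_size(d: u32) -> Option<u32> {
    assert!(d > 0);
    // The greatest power of two dividing d is its lowest set bit.
    let b = 1u32 << d.trailing_zeros();
    if b >= 64 { Some(b) } else { None }
}
```

For d = 768 this yields B = 256 (three blocks); for a power-of-2 dimension such as 1024 it yields
B = d (a single block); d = 1000 has greatest power-of-2 factor 8, so it falls back to padding.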
- -**Comparison to Product Quantization.** TurboQuant's block decomposition (Stage -2 of this RFC) is structurally similar to Product Quantization (PQ) [9]: both -partition a vector into sub-vectors and quantize each independently. The key -differences are: - -| | TurboQuant | PQ | -| ---------------------- | --------------------------------------------------------------- | -------------------------------------------------------- | -| Quantization type | Scalar (per-coordinate, after rotation) | Vector (per-sub-vector, learned codebook) | -| Codebook | Analytically derived from Beta distribution; **data-oblivious** | Learned via k-means on training data; **data-dependent** | -| Rotation | Random orthogonal within each sub-vector | Typically none (OPQ [10] adds a learned rotation) | -| Theoretical guarantees | Provable data-oblivious MSE bound (Theorem 1 [1]) | No comparable data-oblivious bound | -| Codebook training | None (centroids derived from theory) | Requires training pass over data | -| Bits per sub-vector | Scalar: b bits per coordinate | Vector: typically 8 bits per sub-vector (256 codewords) | - -TurboQuant trades PQ's flexibility (data-dependent codebooks can exploit -structure) for data-obliviousness (no training, provable bounds, no offline -index-training phase). Encode-time work (rotation + quantization) still applies. -In return, PQ and OPQ retain a major advantage in expressivity: they learn -sub-vector codebooks from data rather than applying an analytic scalar quantizer. -In practice this means TurboQuant is attractive when training-free operation, -simple deployment, and theoretical guarantees matter most, while PQ or OPQ may -still win empirically when a learned vector codebook can exploit dataset-specific -structure. 
- -### Comparison to HIGGS - -HIGGS [12] (Malinovskii et al., 2024) is a data-free quantization method for LLM -weight matrices that shares TurboQuant's core idea — Hadamard rotation followed by -MSE-optimal grid quantization — but targets a different application domain and makes -different design trade-offs: - -| | TurboQuant | HIGGS | -| -------------------- | ------------------------------------------------------------------------ | -------------------------------------------------------------------------- | -| Application domain | ANN embedding search (per-vector, online) | LLM weight quantization (per-layer, offline) | -| Rotation | 3-round SORF (HD₃·HD₂·HD₁): high-quality random orthogonal approximation | Single RHT (H·D): one Hadamard × random diagonal signs | -| Target distribution | Beta marginal (1-x²)^((d-3)/2) on unit sphere | Approximate Gaussian N(0,1) | -| Quantization grid | Max-Lloyd centroids (scalar, p=1), analytically derived for Beta | CLVQ grids (Pagès & Printems 2003), supports vector quantization p∈{1,2,4} | -| Error metric | Pure MSE (reconstruction error) | MSE + Hessian-weighted per-layer coefficients αₗ (Linearity Theorem) | -| Calibration data | None | None for quantization; small calibration set for αₗ estimation | -| Non-uniform bitwidth | No (uniform across all vectors) | Yes (DP solver for per-layer bit allocation) | -| Distance computation | Quantized-domain scan kernel (PDX layout, SIMD over 64 vectors) | GPU matrix multiply (FLUTE kernel) | -| Norm storage | Explicit per-block norms for distance computation | Per-group scales folded into weight reconstruction | - -**Key design differences explained:** - -- **Rotation depth.** TurboQuant normalizes to the unit sphere first, so - coordinates must follow the specific Beta marginal for Max-Lloyd centroids to - be optimal — this requires a high-quality random orthogonal approximation - (3-round SORF). 
HIGGS operates on raw (group-normalized) weights and only - needs approximate Gaussianity, so a single RHT suffices. -- **VQ dimension.** HIGGS's CLVQ grids support multi-dimensional vector - quantization (p>1), where groups of p coordinates are quantized jointly to an - optimal multi-dimensional grid. At 3-4 bits, p=2 or p=4 achieves better - rate-distortion than scalar (p=1) quantization by exploiting residual - correlations between coordinates. TurboQuant is currently scalar-only (p=1); - p>1 would require changes to the PDX scan kernel (per-subvector codebook - lookup instead of per-coordinate). See Future work for discussion. -- **Error metric.** HIGGS's Linearity Theorem (perplexity increase ≈ Σ αₗ·tₗ²) - enables Hessian-aware optimization specific to LLM inference. For ANN search, - MSE is the natural metric — it directly bounds distance distortion — and - non-uniform bit allocation has no analogue (all vectors share the same - encoding). -- **Beta vs. Gaussian at high d.** As d grows, the Beta distribution - (1-x²)^((d-3)/2) concentrates and becomes approximately Gaussian with - variance ~1/d. At d=256+, the practical difference between Beta-optimal and - Gaussian-optimal grids shrinks. Whether Gaussian grids (simpler: one grid per - bitwidth, no dimension dependence) match Beta Max-Lloyd for ANN recall is an - empirical question — see Experimental plan. - -**Domain mismatch.** Comparisons of TurboQuant vs. HIGGS on LLM perplexity -benchmarks are misleading: HIGGS's Hessian-aware optimization naturally dominates -for that task, but TurboQuant was never designed for LLM weight quantization. The -relevant comparison is ANN recall@k on embedding datasets, where TurboQuant's -block decomposition, PDX scan layout, and per-vector encode/decode are the -critical features. 
- -### Comparison to RotorQuant / IsoQuant - -RotorQuant [13] replaces TurboQuant's full-dimension SORF with Clifford algebra -rotors in Cl(3,0), chunking vectors into 3-dimensional groups and applying SO(3) -sandwich products. IsoQuant extends this to SO(4) via quaternions, and PlanarQuant -uses SO(2) Givens rotations. All three are block-diagonal rotation strategies with -very small blocks (2-4 dimensions). - -On real KV-cache tensors (Qwen2.5-3B), these small-block rotations showed severe -quality regressions: RotorQuant at 3-bit measured 3.843 MSE vs. TurboQuant's -0.354 (10.8× worse), and IsoQuant at 4-bit incurred +36% perplexity impact vs. -TurboQuant's +11.7% [13]. Independent analysis attributed this to the fundamental -decorrelation limitation: block-diagonal rotations in SO(2)/SO(3)/SO(4) provide -no cross-group coordinate mixing, while WHT/SORF mixes all coordinates -simultaneously. Real embedding vectors exhibit full-dimension correlations that -small-block rotations cannot break. - -| | TurboQuant (SORF) | RotorQuant (SO(3)) | IsoQuant (SO(4)) | -| ---------------------- | --------------------------------------------- | -------------------------- | --------------------------- | -| Decorrelation | Full dimension (3-round SORF, all coords mix) | Block-diagonal (3D groups) | Block-diagonal (4D groups) | -| Params (d=128) | 384 sign bits (3 × 128) | 186 rotor params | ~500 quaternion params | -| MSE at 3-bit (Qwen KV) | 0.354 | 3.843 (10.8× worse) | Not reported at 3-bit | -| Speed vs. WHT | Baseline (896 FMAs at d=128) | 2,408 FMAs (2.7× slower) | ~3.6× slower (CUDA prefill) | - -**Relevance to our design.** RFC 0033's Stage 2 block decomposition is also -block-diagonal — each B-dim block has an independent SORF with no cross-block -mixing. The critical difference is block size: B=256 with 3-round SORF provides -24 butterfly stages of within-block mixing (comparable to the current B=1024's -30 stages), vs. 
RotorQuant's 3-4 coordinate groups with no structured mixing at -all. The RotorQuant/IsoQuant data provides empirical evidence that the quality -cliff for block-diagonal rotations is steep at very small B and validates the -RFC's minimum B ≥ 64 constraint. Whether B=256 is large enough to avoid -meaningful decorrelation loss is an empirical question addressed in the -Experimental plan. - -### Current Vortex implementation - -The [current implementation][current-impl] (Rust, in the `vortex-tensor` crate, -merged via [PR #7269][current-impl]) implements MSE-only TurboQuant as a Vortex -array encoding that compresses `FixedSizeList` arrays — the storage -format of `Vector` and `FixedShapeTensor` extension types. The -[original QJL-inclusive PR][original-impl] was closed in favor of this MSE-only -approach. Key design choices and characteristics: - -**Rotation.** Instead of the paper's O(d²) QR rotation, we use a 3-round -Structured Orthogonal Random Features (SORF) transform `HD₃·HD₂·HD₁` [5], -giving O(d) storage (3d sign bits, bitpacked) and O(d log d) per-vector. The -rotation signs are stored as a bitpacked child array rather than recomputed from -a seed at decode time. The 3-round SORF was introduced for kernel approximation -[5] and approximates a random orthogonal matrix. It is distinct from the -single-round SRHT (`R·H·D`) analyzed by Tropp [3] and the FJLT (`P·H·D`) of -Ailon-Chazelle [2], both of which are dimensionality-reducing projections rather -than rotation approximations. - -**Centroids.** Max-Lloyd centroids are computed via numerical integration -(trapezoid rule, 1000 points per interval) of the marginal Beta distribution at -the padded dimension, using the `HalfIntExponent` type for exact integer/half- -integer exponent arithmetic. Centroids are cached in a global `DashMap` keyed by -`(dimension, bit_width)` and stored as a shared `PrimitiveArray` child. 
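The rotation just described composes rounds of the HD shape. The following is a simplified
standalone sketch of one round (sign flip, then a normalized fast Walsh-Hadamard transform), not the
`vortex-tensor` kernel; three such rounds with independent sign vectors give `HD₃·HD₂·HD₁`.

```rust
// Simplified sketch of one SORF round: multiply by a fixed ±1 diagonal,
// then apply an in-place fast Walsh-Hadamard transform scaled by 1/sqrt(d)
// so the round is orthogonal (norm-preserving).

/// In-place FWHT; `v.len()` must be a power of two.
fn fwht(v: &mut [f32]) {
    let d = v.len();
    assert!(d.is_power_of_two());
    let mut h = 1;
    while h < d {
        for block in v.chunks_mut(2 * h) {
            for i in 0..h {
                let (a, b) = (block[i], block[i + h]);
                block[i] = a + b;
                block[i + h] = a - b;
            }
        }
        h *= 2;
    }
}

/// One orthogonal HD round; `signs` entries are -1.0 or +1.0.
fn sorf_round(v: &mut [f32], signs: &[f32]) {
    for (x, s) in v.iter_mut().zip(signs) {
        *x *= *s;
    }
    fwht(v);
    let scale = 1.0 / (v.len() as f32).sqrt();
    for x in v.iter_mut() {
        *x *= scale;
    }
}
```

Because each round is orthogonal, the composed rotation preserves L2 norms, which is what allows the
per-vector norm to be stored once and reused for distance computation.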
- -**Array structure.** The `TurboQuantArray` stores 4 child slots: codes -(`FixedSizeListArray`, one per vector, list_size = padded_dim), norms -(`PrimitiveArray`), centroids (`PrimitiveArray`, shared), and MSE -rotation signs (`PrimitiveArray`, shared, bitpacked). Codes are stored as -u8 centroid indices; the cascade compressor (BitPacked encoding) handles packing -to the actual bit width on disk. - -**Compute pushdowns.** Slice and take propagate to per-row children (codes, -norms) while sharing rotation signs and centroids. Quantized cosine similarity -and dot product operate directly on codes and centroids without decompression. -L2 norm returns the stored norm directly (O(1) readthrough). - -**Compression scheme.** `TurboQuantScheme` implements the `Scheme` trait for the -BtrBlocks cascading compressor. It matches `Vector` and `FixedShapeTensor` -extension arrays with non-nullable float elements and dimension ≥ 128, -using 8-bit MSE-only as the default (256 centroids, near-lossless with -normalized MSE ~4e-5, achieving ~4× compression on f32). - -**Input handling.** All float types (f16, f32, f64) are converted to f32 before -quantization. Per-vector L2 norms are computed and stored as f32. Non-power-of-2 -dimensions are zero-padded to the next power of 2 for SORF compatibility. The -minimum dimension for scheme auto-selection is 128; the array-level minimum -remains 3 (at d=2 the marginal is the arcsine distribution, which is U-shaped -and unsuitable for Max-Lloyd centroids designed for concentrated distributions). - -**Metadata.** Currently serialized as a raw single byte (bit_width). This lacks -framing and versioning and cannot be extended backward-compatibly; migrating to -a structured/extensible format is a Stage 1 item (the upcoming vtable refactor -may eliminate the need for separate serialized metadata entirely). 
- -The Eviox corrections study [7] identified several bugs in the paper's reference -Python implementation; none affect our implementation (see Appendix A). There is -also a notational ambiguity in the MSE bound constant; we use `√3·π/2 ≈ 2.72` -(see Appendix A for the full analysis). - -Multiple independent TurboQuant implementations report that MSE-only often -outperforms MSE+QJL for KV-cache attention at the same bit budget [8], likely -due to softmax amplifying QJL variance. For ANN ranking the evidence is less -settled; MSE-only is the default pending dedicated benchmarks (see Appendix B -for details). - -### Current limitations - -The SORF requires power-of-2 input dimension. The TQ array handles this by -zero-padding non-power-of-2 dimensions to the next power of 2 internally -(e.g., 768 → 1024). For non-power-of-2 dimensions, this means: - -- **33% storage overhead** for 768-d vectors: 1024 codes stored vs. 768 useful - (equivalently, 25% of stored codes are wasted on zero-padded dimensions). -- **No scan-optimized layout**: row-major code storage prevents SIMD-over-vectors - distance computation. - -Stage 2's block decomposition eliminates this padding for dimensions with a -qualifying B (e.g., 768 → 3×256 blocks), since each block is natively -power-of-2. - -### PDX - -PDX [4] is a data layout for vector similarity search. The paper (SIGMOD '25) -describes a dimension-major layout within fixed-size blocks of 64 vectors, -enabling the compiler to auto-vectorize the inner distance loop over vectors -rather than dimensions. The paper reports an average 2× speedup for -auto-vectorized PDX distance kernels vs. explicitly SIMD-optimized row-major -baselines (SimSIMD, FAISS) across four architectures, with larger gains at low -dimensionality (5.5× at D ≤ 32) and ~1.5× at D > 32 [4, Table 4]. -Dimension-pruning methods (ADSampling, BSA) recover much larger end-to-end -gains (2-7×) when paired with the PDX layout [4]. 
The block size of 64 is -empirically optimal across AVX-512, AVX2, and NEON architectures [4, Table 5]. - -**PDX open-source implementation.** The [open-source implementation][pdx-impl] -has evolved beyond the paper in several ways relevant to this RFC. _Note: the -following describes the code repository, not the paper — the paper operates -exclusively on float32 and does not discuss int8 layouts._ - -- **8-bit scalar quantization** (`IndexPDXIVFTreeSQ8`): Maps floats to 0-255 via - linear min-max scaling. The int8 layout differs from float32: dimensions are - packed in groups of 4 ("4 dims × 16 vecs") to leverage hardware dot-product - instructions (VPDPBUSD on x86, UDOT/SDOT on ARM) that process 4 byte pairs - per operation. This is a different tiling than the paper's "1 dim × 64 vecs." -- **ADSampling with random rotation**: The pruner applies a random orthogonal - rotation to the entire collection as a preprocessing step. This makes - coordinates approximately independent, enabling dimension-by-dimension - hypothesis testing for early pruning. The rotation serves a similar purpose - to TurboQuant's rotation — making the coordinate distribution known — but for - pruning rather than quantization. -- **Dimension zones**: Consecutive dimensions are grouped into zones; at query - time, zones are ranked by "distance-to-means" and the most discriminative - zones are scanned first, enabling faster pruning (~30% faster than - per-dimension pruning [4]). - -**Implications for our design.** The PDX paper's float32 layout ("1 dim × 64 -vecs") maps cleanly to our quantized-code scan kernel, where the inner loop -gathers from a centroid-product distance table over 64 vectors. However, if we -pursue direct int8 arithmetic (b_mse=8 with linear centroids, see GPU section), -the "4 dims × 16 vecs" int8 layout from the PDX implementation may be more -appropriate, as it enables hardware dot-product instructions. 
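To make the layout difference concrete, here is a minimal, illustrative transpose of one chunk of row-major vectors into the paper-style "1 dim × 64 vecs" dimension-major order (generic over chunk size; not taken from either codebase):

```rust
/// Transpose `n` row-major vectors of `dims` values each into
/// dimension-major order: output[dim * n + v] = input[v * dims + dim].
/// With n = 64 this is the PDX paper's "1 dim × 64 vecs" tiling.
fn to_dimension_major(rows: &[f32], n: usize, dims: usize) -> Vec<f32> {
    assert_eq!(rows.len(), n * dims);
    let mut out = vec![0.0f32; n * dims];
    for v in 0..n {
        for d in 0..dims {
            out[d * n + v] = rows[v * dims + d];
        }
    }
    out
}
```

After the transpose, all `n` values of a given dimension are contiguous, so the inner distance loop iterates over vectors with unit stride and no inter-vector dependencies.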
- -Additionally, ADSampling's dimension-pruning approach is complementary to -TurboQuant's block structure: when scanning with block decomposition, the pruner -could skip entire TQ blocks (B dimensions at a time) if the partial distance -already exceeds the candidate threshold. This combines the storage efficiency of -quantization with the computational savings of early termination. - -[pdx-impl]: https://github.com/cwida/PDX "specific files: `include/pdx/quantizers/scalar.hpp` for SQ8, `include/pdx/pruners/adsampling.hpp` for ADSampling, `include/pdx/layout.hpp` for int8 interleaving, `include/pdx/distance_computers/avx512_computers.hpp` for VPDPBUSD kernels" - -## Proposal - -### Block size strategy - -For each dimension d, choose B = the greatest power-of-2 ≥ 64 that evenly -divides d. If no such B exists (e.g., d=96), the TQ array falls back to -internal zero-padding (single padded block, as in Stage 1). For common embedding -dimensions, this rule always produces a valid B and avoids padding entirely: - -| Dimension d | Block size B | Blocks k | Notes | -| ----------- | ------------ | -------- | ---------------------------- | -| 512 | 512 | 1 | Single block (= current TQ) | -| 768 | 256 | 3 | Greatest dividing power-of-2 | -| 1024 | 1024 | 1 | Single block | -| 1536 | 512 | 3 | | -| 2048 | 2048 | 1 | Single block | -| 3072 | 1024 | 3 | | -| 4096 | 4096 | 1 | Single block | - -**Key observations:** - -- **Power-of-2 dimensions** (512, 1024, 2048, 4096) use B = d — a single block, - identical to the current implementation except with PDX underneath (Stage 3). - No block decomposition overhead, no per-block norms. These dimensions are - already well-served by the current design. -- **Non-power-of-2 dimensions** (768, 1536, 3072) decompose into k=3 blocks at - B=256 or B=512. No padding waste. - Each block has its own SORF rotation and shares a single centroid set. -- **No qualifying B is rare** for common embedding dimensions. 
Dimensions where - no power-of-2 ≥ 64 divides d (e.g., 96, 100) fall back to internal - zero-padding. A future straggler-block extension could handle these - without padding (see Stage 2: Straggler blocks). These dimensions are uncommon - in modern model architectures. -- **The SORF approximation at B=256+ is expected to be adequate**: 3 rounds at - B=256 provides 24 butterfly stages, and at B=512 provides 27 — both comparable - to the current B=1024 (30 stages). This needs empirical validation; see - Experimental plan. - -### Minimum dimension - -The compression scheme should only select TurboQuant for vectors with -dimension ≥ 128. Below this threshold, several factors degrade quality and -efficiency: - -- **SORF mixing quality:** 3-round SORF at d=64 provides only 18 butterfly - stages (vs. 21 at d=128, 30 at d=1024). The coordinate distribution deviates - more from the analytical Beta, making Max-Lloyd centroids less optimal. - Stage 1's variable-round rotation signs (see Stage 1) may allow compensating with - additional SORF rounds at lower dimensions — this should be benchmarked. -- **Practical MSE:** At smaller d, the SORF mixing quality and coordinate- - independence approximations are weaker, potentially worsening practical - quantization quality beyond what the dimension-free theoretical bound - captures. The actual MSE at each d is an empirical question. -- **Overhead ratio:** Per-vector norm (32 bits) is a larger fraction of the - compressed representation at small d. At d=32, b=5: codes=160 bits, - norm=32 bits, total=192 — norm is ~17% of compressed size. At d=768: <1%. -- **Diminishing returns for high bit widths:** With fewer coordinates, the - fine-grained centroid structure of high-b quantization has less to exploit. - -The threshold of 128 is conservative: - -- d=128 (SIFT) is the smallest dimension in our recommended benchmark table. -- SORF at d=128 has 21 butterfly stages — tested and adequate in the current - implementation. 
-- The block-size rule produces B=128 for d=128 (single block, no decomposition). - -Whether TQ works well at all below d=64 is an open question — SORF mixing -quality degrades rapidly at small dimensions, and the overhead ratio makes TQ -increasingly uncompetitive vs. simpler scalar quantization. The scheme minimum -of 128 is conservative; the experimental plan should determine the true -minimum (likely in the 64-128 range). Padding modest amounts (e.g., 96 → 128) -is probably acceptable; padding large fractions (e.g., 32 → 64) is not. - -The exact threshold should be validated experimentally — see Experimental plan. - -### Stage 1: MSE-only TurboQuant (in progress — [PR #7269][current-impl]) - -Stage 1 delivers MSE-only TurboQuant as a complete, self-contained building -block. The [initial implementation][current-impl] is merged; the -[original QJL-inclusive PR][original-impl] was closed in favor of this MSE-only -approach. Work remaining to complete Stage 1 is described below. - -The goal is to arrive at a wire format that we believe is ready for -backward-compatibility guarantees — one we would be comfortable freezing — without -formally committing to stability until confirmed by Stage 2 implementation and -benchmarking. - -**Target properties:** - -- **MSE-only, no QJL.** 4 child slots: codes, norms, centroids, rotation_signs. - QJL code can be resurrected from the [original PR][original-impl] if Phase 4 - is pursued. -- **8-bit default** (256 centroids). Near-lossless: normalized MSE ~4e-5, - ~4× compression on f32. Lower bit widths available via `TurboQuantConfig`. -- **Power-of-2 block size with internal padding.** The TQ array requires - `block_size` to be a power of 2. Non-power-of-2 dimensions are zero-padded - internally to the next power of 2 (e.g., 768 → 1024), so `codes.list_size` - (= `padded_dim`) may exceed `dimension`. 
Stage 2's block decomposition - eliminates this padding for dimensions with a qualifying B (e.g., 768 → - 3×256 blocks, each natively power-of-2). -- **Variable-round SORF rotation.** Rotation signs are stored as a - `FixedSizeListArray` where each element is a - `FixedSizeList(u8, padded_dim, NonNullable)` — one bitpacked diagonal per - SORF round. The array length R equals the number of rounds (default 3). This - makes the round count a property of the array shape rather than a hard-coded - constant. More rounds may improve mixing quality at lower dimensions or lower - bit widths (see Experimental plan: "Test 3, 4, 5 SORF rounds at each B"). - Signs are stored in inverse-friendly (read-optimized) order. -- **Scheme auto-selection** for dimension ≥ 128 (see Minimum dimension). - Smaller power-of-2 dimensions remain available via explicit construction. -- **Compute pushdowns**: slice/take/scalar_at, quantized cosine similarity and - dot product, compression scheme integration. -- **Dtype-matching norms**: f64 for f64 input, f32 for f32/f16. -- **Codes and centroids remain separate children.** The codes - (`FixedSizeListArray`) and centroids (`PrimitiveArray`) are - independent child slots. Operations that need a unified view (e.g., - `canonicalize`) can construct a `DictArray` from codes and centroids and - apply the inverse rotation to produce a canonical decoded form. - -**Forward-compatible metadata:** `dimension: u32`, `block_size: u32` (= -padded_dim in Stage 1), `num_blocks: u32` (always = 1 in Stage 1), -`num_rounds: u32` (= R, default 3). These fields are inert in Stage 1 but -enable Stage 2 decoders to read Stage 1 -files. The serialization format is TBD — the upcoming vtable refactor may make -the current raw-byte metadata unnecessary by encoding these fields directly in -the vtable. If the refactor does not land first, a structured format (e.g., -protobuf) is needed. (PDX is handled via the codes child type, not a metadata -flag — see Stage 3.) 
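A hypothetical shape for these metadata fields — illustrative only, since the serialization format is explicitly TBD above:

```rust
/// Forward-compatible TurboQuant metadata fields from the RFC text.
/// The struct name and layout are illustrative, not the crate's actual
/// definition or wire format.
#[derive(Debug, PartialEq)]
struct TurboQuantMetadata {
    dimension: u32,  // logical vector dimension d
    block_size: u32, // = padded_dim in Stage 1; B in Stage 2
    num_blocks: u32, // always 1 in Stage 1; k = d/B in Stage 2
    num_rounds: u32, // R SORF rounds, default 3
}

impl TurboQuantMetadata {
    /// Stage 1 defaults: a single block padded to the next power of 2,
    /// 3 SORF rounds.
    fn stage1(dimension: u32) -> Self {
        let padded = dimension.next_power_of_two();
        Self {
            dimension,
            block_size: padded,
            num_blocks: 1,
            num_rounds: 3,
        }
    }
}
```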
- -**Remaining work** (relative to the [initial implementation][current-impl]): - -- Restructure rotation signs from flat `PrimitiveArray` to - `FixedSizeListArray` (variable SORF rounds, as described above). -- Dtype-matching norms (currently always f32). -- Structured metadata (currently a raw single byte). -- Restrict `new_unchecked` visibility to `pub(crate)`. -- f64-to-f32 truncation in encode path: add comment or checked cast. -- CENTROID_CACHE: document intentional unbounded-ness. -- Note MSE bound caveat: Theorem 1 is proved for Haar matrices, not SORF. - -### Stage 2: Block decomposition - -Block decomposition splits a `FixedSizeListArray` vertically by dimension into -fixed-size blocks, each encoded independently. This is structurally analogous -to `ChunkedArray` (which splits horizontally by rows) — both are general-purpose -structural transforms over arrays, not specific to any particular encoding. Like -PDX (Stage 3), block decomposition is a layout concern that can wrap arbitrary -child encodings. - -In the initial implementation, block decomposition is embedded inside -`TurboQuantArray` — all blocks use TQ MSE-only encoding with independent SORF -rotations, and TQ-specific children (centroids, rotation signs) are stored -alongside the blocks. However, the _concept_ of block decomposition is -encoding-agnostic: a future refactor could extract it into a general-purpose -`BlockDecomposedFSLArray` that wraps k independently-encoded child arrays. This -matters for straggler-block support (see below), where the straggler may use a -different encoding than the main blocks. - -For dimensions where the block-size rule produces a valid B (see table above), -the scheme splits the input into k = d/B blocks of size B. Each block is a -power-of-2 TQ array with an independent B-dim SORF rotation. - -**Changes vs. 
Stage 1 (with TQ blocks):** - -| Aspect | Stage 1 | Stage 2 | -| --------------------- | ---------------------------------------- | ---------------------------------------------------------------------------- | -| Block count | k = 1 (single power-of-2 block) | **k = d/B** (multiple blocks) | -| SORF dimension | padded_dim (next power-of-2 ≥ dim) | **B** (e.g., 256 for d=768) | -| Rotation signs | `FSL`, len = R, element dim = padded_dim | **`FSL`, len = k × R**, element dim = B | -| Centroids | Computed for padded_dim distribution | **Computed for B-dim distribution** (different codebook!) | -| Norms child | `PrimitiveArray`, 1 per vector | **`PrimitiveArray` (k=1) or `FixedSizeListArray` (k>1)**, same dtype F | -| Codes list_size | padded_dim | **k × B** (= d) | -| Scheme compress() | Single SORF → quantize | **Choose B → split → per-block normalize/rotate/quantize** | -| Quantized dot product | Single sum over padded_dim centroids | **Per-block weighted sum** (Σ_k norm_a_k · norm_b_k · unit_dot_k) | -| L2 norm readthrough | O(1) — return stored norm | **O(k)** — compute √(Σ_k norm_k²) | - -**Unchanged from Stage 1:** SORF construction (R-round HD, default R=3), -Max-Lloyd algorithm, f32 internal quantization, slice/take semantics (per-row -data sliced, shared data cloned), `FixedSizeListArray` rotation sign storage, -compression scheme trait. - -**For power-of-2 dimensions**: B = d, k = 1. The encoding produces an identical -wire format to Stage 1 (single norm, single SORF, single codes block). A -Stage 2 encoder writing k=1 data is fully backward-compatible with Stage 1 -decoders. - -**Key design properties:** - -- **Structural, not encoding-specific.** The block decomposition itself is a - vertical split of a `FixedSizeListArray` by dimension. Each block is an - independently-encoded child. In the initial implementation all blocks are TQ - MSE-only, but the structure allows heterogeneous child encodings in future. 
-- **One shared centroid set** for all TQ blocks at the same B-dim distribution. -- **Per-block SORF rotation signs.** Each block's SORF is independent (different - seed). Signs are R × B bits per block (R = number of SORF rounds, default 3), - stored as a `FixedSizeListArray` with len = k × R. - -#### Straggler blocks (future work) - -The current block-size rule requires B to evenly divide d, so dimensions with no -qualifying power-of-2 B ≥ 64 (e.g., d=96) fall back to internal zero-padding -(single padded block, as in Stage 1). -A natural extension is **straggler blocks**: allow k blocks where k-1 are -full-size B and the final block covers the remaining d - (k-1)×B dimensions. - -Because the block decomposition is encoding-agnostic (each block is an -independently-encoded child array), the straggler block need not use the same -encoding as the main blocks. For example, d=800 could be decomposed as 3×256 -= 768 TQ-encoded dimensions plus a 32-dimension straggler. SORF is unlikely -to be effective at such small straggler dimensions (see Minimum dimension), -so the straggler would use a different strategy: - -- **Uncompressed**: store the straggler dimensions as raw floats. Simplest; - the overhead is modest (32 × 4 = 128 bytes per vector for a 32-dim - straggler). -- **Padded TQ**: pad the straggler to the next power-of-2 (e.g., 32 → 64), - encode with standard TQ. Only viable if the padded dimension is large enough - for SORF to be effective (≥ 64, probably ≥ 128). -- **Exact-rotation TQ**: use a dense random orthogonal matrix (QR of Gaussian) - instead of SORF for the straggler block. Eliminates the power-of-2 constraint - at the cost of O(B_s²) rotation, where B_s is the straggler size. -- **Scalar quantization or PQ**: the block decomposition structure supports - heterogeneous child encodings. 
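The block-size rule that these straggler strategies would extend fits in a few lines; a sketch (illustrative names, not from the codebase):

```rust
/// Block-size rule from "Block size strategy": the greatest power of 2
/// ≥ 64 that evenly divides d. Returns None (→ fall back to internal
/// zero-padding) when no qualifying block size exists, e.g. d = 96.
fn choose_block_size(d: u32) -> Option<u32> {
    debug_assert!(d > 0);
    // The greatest power of 2 dividing d is its lowest set bit.
    let b = 1u32 << d.trailing_zeros();
    if b >= 64 {
        Some(b)
    } else {
        None
    }
}
```

For power-of-2 dimensions the lowest set bit is d itself, recovering the single-block case in the table above.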
- -Note that for some dimensions (e.g., d=800), padding the entire vector to the -next power-of-2 (1024) may be preferable to block decomposition with a -straggler, depending on the overhead tradeoff. This is an empirical question. - -This is deferred: the block-size rule already handles all common embedding -dimensions (768, 1024, 1536, etc.) without stragglers, and the rare -no-qualifying-B case (d=96) is adequately served by internal zero-padding for -now. - -#### Norm architecture - -Per-block norms are stored as an **internal child** of the TurboQuant array: - -- For k = 1 (power-of-2 dims): `PrimitiveArray` with len = num_rows - (identical to Stage 1's single-norm layout). -- For k > 1: `FixedSizeListArray` with list_size = k, len = num_rows. - -The norm dtype `F` matches or widens the input element type: - -| Input dtype | Norm dtype | Rationale | -| ----------- | ---------- | ---------------------------------------------- | -| f16 | f32 | f16 has insufficient range/precision for norms | -| f32 | f32 | Same type | -| f64 | f64 | Preserve full precision | - -Norms are stored as plain child arrays; the cascading compressor handles -secondary encoding (ALP, Pco, etc.). - -Note: centroids and quantization always operate in f32 internally (the -[current implementation][current-impl] converts all input to f32 before -quantization). For f64 input, decode produces f32 unit-direction reconstructions -scaled by f64 norms — a mixed-precision multiply that preserves norm precision. - -#### Zero-norm sub-vectors - -When splitting a vector into B-dim blocks, some blocks may have zero norm. The -encoding handles ‖xₖ‖ = 0 explicitly: skip rotation and quantization, store -norm = 0, decode as all zeros. 
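A sketch of the split-and-normalize step, including the zero-norm handling just described (illustrative, not the crate's code):

```rust
/// Split x into k = d/B blocks; return (norm, unit direction) per block.
/// Zero-norm blocks skip normalization and store norm = 0, so they decode
/// to all zeros.
fn split_and_normalize(x: &[f32], b: usize) -> Vec<(f32, Vec<f32>)> {
    assert_eq!(x.len() % b, 0); // exact division: no straggler for chosen B
    x.chunks(b)
        .map(|blk| {
            let n = blk.iter().map(|v| v * v).sum::<f32>().sqrt();
            if n > 0.0 {
                (n, blk.iter().map(|v| v / n).collect())
            } else {
                (0.0, vec![0.0; b])
            }
        })
        .collect()
}
```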
- -#### Theoretical MSE bound - -The paper's MSE bound (Theorem 1 in [1]) is: - -``` -E[‖x - x̂‖² / ‖x‖²] ≤ (√3 · π / 2) / 4^b ≈ 2.72 / 4^b -``` - -**Crucially, Theorem 1 is proved for true random orthogonal matrices (QR of -Gaussian), not SORF.** Our SORF is an approximation. The bound holds exactly -only with a true random orthogonal rotation or with empirical SORF validation -(see Experimental plan). - -Assuming the per-block MSE bound holds, for a vector split into blocks the -first line is an **algebraic** identity (exact); the inequality on the second -line applies Theorem 1's **probabilistic** bound to each block and should be -read as holding in **expectation** over independent per-block rotations, not -almost surely: - -``` -‖x - x̂‖² / ‖x‖² = Σ_k (‖xₖ‖² / ‖x‖²) × (‖xₖ - x̂ₖ‖² / ‖xₖ‖²) (exact) - E[...] ≤ MSE_bound × Σ_k (‖xₖ‖² / ‖x‖²) = MSE_bound (in expectation) -``` - -The conclusion: `E[‖x - x̂‖² / ‖x‖²] ≤ MSE_bound` assuming independent -per-block rotations. (Theorem 1 applies because each block is normalized to -unit norm before rotation and quantization; the per-block encoding pipeline is: -split → normalize → rotate → quantize, matching the theorem's unit-sphere -assumption.) Note that TurboQuant's original analysis uses a single -global rotation in high-d where coordinates are nearly independent; with -smaller block dimension B, within-block coordinate dependence after rotation may -be stronger even when marginals are correct — this is an additional motivation -for the experimental plan's comparison of block sizes. - -**Empirical evidence from small-block rotations.** The RotorQuant/IsoQuant -experiments [13] provide direct evidence of this decorrelation failure mode: -block-diagonal rotations in SO(3) (3-dim groups) and SO(4) (4-dim groups) -caused 10× MSE regressions on real KV-cache vectors, attributed to complete -absence of cross-group coordinate mixing. 
Our Stage 2 design operates at a -fundamentally different scale — B=256 blocks with 3-round SORF provide 24 -butterfly mixing stages within each block, vs. RotorQuant's 3-4 raw coordinates -with no structured mixing — so the decorrelation loss should be far less severe. -Nevertheless, the experimental plan includes explicit cross-block correlation -measurement on real embeddings to quantify any residual decorrelation gap -between block-decomposed (B=256) and single-block (B=d) SORF. - -The actual MSE may depend on block dimension B: at larger B the coordinate -distribution is more concentrated (variance ~1/B), giving the Max-Lloyd -quantizer more to exploit. See Experimental plan. - -**SORF approximation.** The R-round SORF `HD_R·...·HD₂·HD₁` [5] provides -log₂(B) butterfly stages per round × R rounds = R·log₂(B) total. At R=3 -(default): 18 at B=64, 24 at B=256, 27 at B=512. At R=5: 30 at B=64, 40 at -B=256. Counting butterfly stages is a rough heuristic for mixing quality with -no theoretical backing: [5] proves near-unbiasedness for kernel approximation -(Theorem 3) and pairwise near-orthogonality (Theorem 4), but does **not** prove -distributional closeness to Haar measure, does not analyze convergence rate as -a function of rounds × dimension, and leaves tight variance bounds for SORF as -an open problem. The variable-round rotation signs (Stage 1) enable testing -more rounds at smaller B or lower bit widths where mixing quality matters more. -Empirical validation is needed. - -**Fallback: dense rotation.** If SORF proves insufficient at the chosen B, use a -B × B random orthogonal matrix (QR of Gaussian). Storage at B=256: 256 KB per -block. For d=768 with k=3: 768 KB total. Amortizes for large columns (100K+ -vectors). Each block must have an **independent** rotation matrix. - -DCT and other fixed structured transforms are not suitable for TurboQuant's -rotation (they do not produce the required Beta marginal). 
Sharing a rotation
with ADSampling-style pruning is a speculative future direction. See Appendix C
for details on both.

#### Quantized-domain operations

All quantized operations read per-block norms from the internal child array:

- **L2 distance**: `‖a-b‖² = Σ_k ‖aₖ‖² + Σ_k ‖bₖ‖² - 2·Σ_k ‖aₖ‖·‖bₖ‖·unit_dotₖ`.
  Primary ANN metric; reuses per-block dot product and norms.
- **Dot product**: `dot(a,b) ≈ Σ_k ‖aₖ‖·‖bₖ‖ · Σ_j centroids[code_aₖ[j]] ·
  centroids[code_bₖ[j]]`.
- **Cosine similarity**: `cos(a,b) ≈ dot(a,b) / (‖a‖·‖b‖)` where
  `‖a‖ = √(Σ_k ‖aₖ‖²)`.
- **L2 norm**: `√(Σ_k ‖xₖ‖²)`. O(k) per vector — a regression from the
  current O(1) single-norm readthrough, but modest.

#### Encoding algorithm

```
Input: x ∈ ℝ^d, b_mse bits per coordinate, block_size B
k = d / B                     (exact division, no straggler for chosen B)
num_centroids = 2^b_mse

# Block split and normalize
for i in 0..k:
    xᵢ = x[i*B .. (i+1)*B]
    nᵢ = ‖xᵢ‖
    if nᵢ > 0:
        ûᵢ = xᵢ / nᵢ
    else:
        ûᵢ = zeros(B)

# MSE stage (per block, SORF rotation)
for i in 0..k:
    if nᵢ > 0:
        rᵢ = SORFᵢ(ûᵢ)
        cᵢ[j] = nearest_centroid(rᵢ[j])
    else:
        cᵢ[j] = 0

Store (all as internal children):
    codes (k × B per vector), norms (k per vector),
    centroids (2^b_mse, shared), SORF signs (k × R × B, shared; R = SORF rounds)
```

#### Decoding algorithm

```
for i in 0..k:
    r̂ᵢ[j] = centroids[cᵢ[j]]
    ûᵢ = SORF⁻¹ᵢ(r̂ᵢ)
    x̂ᵢ = nᵢ × ûᵢ              (nᵢ read from internal norms child)
x̃ = concat(x̂₀, ..., x̂ₖ₋₁)
```

### Stage 3: PDX dimension-major layout

Introduce a new `PDXArray` encoding type that wraps any `FixedSizeListArray`
with a dimension-major layout within groups of 64 vectors [4]. Like block
decomposition (Stage 2), PDXArray is a **structural transform** over
`FixedSizeListArray`, not specific to any particular encoding — it is a
general-purpose layout optimization for any FixedSizeList of scalar elements
(raw float vectors, scalar-quantized vectors, TurboQuant codes, etc.).

**Changes vs. 
Stage 2:** - -| Aspect | Stage 2 | Stage 3 | -| ---------------- | ------------------------------------------------ | ------------------------------------------------------------------------------- | -| Codes child type | `FixedSizeListArray` | **`PDXArray`** (wraps FSL with transposed layout) | -| Codes detection | N/A (codes always FSL) | **TQ checks child type**: FSL → row-major decode, PDXArray → un-transpose first | -| Distance kernel | Per-vector loop with per-element centroid lookup | **SIMD-friendly 64-vector inner loop with distance-table lookup** | -| Decode path | Direct inverse SORF per vector | **PDXArray.to_fsl() first**, then inverse SORF | - -**Unchanged from Stage 2:** Block size B, centroid computation, norm storage, -SORF rotation, all encoding logic. The encode path produces row-major codes -(FSL), then the compressor wraps them in a PDXArray; the decode path converts -PDXArray back to FSL then decodes. - -**PDXArray design:** - -``` - -PDXArray (general-purpose dimension-major layout for FixedSizeList) -├── metadata: { list_size, chunk_size (= 64) } -├── elements: PrimitiveArray # transposed: 64 values per dim, contiguous -├── validity: ... # same as FSL validity - -``` - -- `PDXArray::try_new(fsl)` — transposes a FixedSizeListArray into PDX layout -- `PDXArray::to_fsl()` — un-transposes back to row-major FSL (for decode, - scalar_at, or non-aligned slice/take) -- `PDXArray::elements_for_dim(dim, chunk)` — O(1) access to a contiguous slice - of 64 values for one dimension within one chunk -- Slice/take: un-transpose to FSL (simplest). Un-transpose cost is - O(rows × list_size) per operation; consider 64-row-aligned fast paths for - hot scan workloads. Preserving PDX layout is possible only for - 64-vector-aligned ranges. -- The cascade compressor treats PDXArray as a valid encoding of FSL-typed data. 
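A minimal sketch of the O(1) per-dimension access this layout enables — the type and method names mirror the proposed API above but are illustrative:

```rust
/// One PDX chunk: `chunk_size` vectors stored dimension-major, so all
/// values of a given dimension are contiguous.
struct PdxChunk {
    chunk_size: usize,  // 64 in the proposal
    list_size: usize,   // dimensions per vector
    elements: Vec<f32>, // elements[dim * chunk_size + v]
}

impl PdxChunk {
    /// O(1): contiguous slice of this chunk's values for one dimension.
    fn elements_for_dim(&self, dim: usize) -> &[f32] {
        assert!(dim < self.list_size);
        let start = dim * self.chunk_size;
        &self.elements[start..start + self.chunk_size]
    }
}
```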
**Benefits of PDXArray as a separate type:**

- PDX logic tested and maintained independently of TurboQuant
- Other encodings (raw float vectors, scalar quantization, future encodings)
  get PDX scan performance for free
- TurboQuant doesn't need an `is_pdx` metadata flag — it checks its codes
  child's type at runtime
- The distance kernel operates on PDXArray's dimension-contiguous slices

Within each 64-vector chunk, codes are stored dimension-major:

```
TQ block 0, dim 0:       [v0 v1 v2 ... v63]
TQ block 0, dim 1:       [v0 v1 v2 ... v63]
...
TQ block 0, dim (B - 1): [v0 v1 v2 ... v63]
TQ block 1, dim 0:       [v0 v1 v2 ... v63]
...
```

The inner SIMD loop (64 vectors) has no inter-vector dependencies. TQ block
boundaries only affect where norm weighting occurs — they don't affect the
transpose.

**Quantized distance kernel (dot product):**

```rust
let dist_table = precompute_product_table(&centroids);
// At b_mse=4: 16×16 = 256 floats = 1KB, fits in L1

let mut distances = [0.0f32; 64];
let mut unit_dots = [0.0f32; 64];
let mut offset = 0;

for tq_block in 0..k {
    for dim in 0..B {
        let qd = query_codes[tq_block * B + dim];
        let row = &dist_table[qd as usize];
        for v in 0..64 { // SIMD-friendly: no inter-vector deps
            unit_dots[v] += row[codes[offset] as usize];
            offset += 1;
        }
    }
    // Weight per-block unit-norm dot product by both vectors' block norms
    for v in 0..64 {
        distances[v] += query_norms[tq_block] * data_norms[v][tq_block]
            * unit_dots[v];
        unit_dots[v] = 0.0; // reset for next TQ block
    }
}
```

**Int8 layout variant.** The PDX implementation [pdx-impl] uses a different
tiling for int8 data: "4 dims × 16 vecs" to leverage VPDPBUSD/UDOT hardware
dot-product instructions (which process 4 unsigned×signed byte pairs per
operation). For TurboQuant codes at b_mse ≤ 8, codes are uint8 centroid indices,
so VPDPBUSD doesn't apply directly — we need the distance-table-lookup path
shown above. 
However, at b_mse=8 with high B, the Max-Lloyd centroids are -near-uniformly spaced (see GPU section), potentially enabling direct hardware -dot-product on the codes. Whether this requires a separate linear quantization -mode or works with the existing Max-Lloyd centroids is an empirical question. The -"4 dims × 16 vecs" layout would be a Stage 3 optimization to evaluate alongside -the "1 dim × 64 vecs" float-style layout. - -**ADSampling integration.** The PDX dimension-pruning approach (ADSampling [4]) -is complementary to TurboQuant's block structure. During a scan, the pruner -could evaluate partial distances after each TQ block (B dimensions) and skip -remaining blocks if the partial L2 distance already exceeds the candidate -threshold. This requires the per-block norm weighting to happen at block -boundaries (as shown in the kernel above), which our design already provides. - -**Open design questions:** - -- Should PDXArray live in `vortex-array` (general infrastructure) or - `vortex-tensor` (vector-specific)? -- Should the cascade compressor automatically PDX-transpose FSL children when - it detects a scan-heavy workload, or should PDX be opt-in? -- Should we support the "4 dims × 16 vecs" uint8 layout variant (for hardware - dot-product) alongside the "1 dim × 64 vecs" float-style layout? - -### QJL correction (deferred — experimental) - -Based on community findings [8], QJL is deferred to after the MSE stages are -validated. - -**Changes vs. 
MSE-only (if pursued):** - -| Aspect | MSE-only | MSE + QJL | -| ---------------------- | -------------------------------- | --------------------------------------------------------------- | -| Bit budget | All b bits → MSE (2^b centroids) | b-1 bits MSE + 1 bit QJL (2^(b-1) centroids) | -| Inner product estimate | Biased (MSE quantization noise) | Unbiased (QJL correction; see TurboQuant_prod in [1]) | -| Additional children | None | QJL signs, QJL residual norms, QJL projection params | -| Encode cost | SORF only | SORF + QJL projection (O(B²) for Gaussian, O(B log B) for SORF) | -| Decode cost | Inverse SORF only | Inverse SORF + QJL inverse projection | - -If pursued, four strategies should be compared: - -| Strategy | Theoretical | Speed | Storage | -| ------------------ | --------------------- | ---------------- | ------------ | -| Per-block Gaussian | Correct (Lemma 4 [1]) | O(B²)/block | k×B²×4 bytes | -| Per-block SORF | Approximate | O(B log B)/block | k×R×B bits | -| Full-dim SORF | Approximate | O(d log d) total | R×d bits | -| MSE-only (no QJL) | N/A | 0 | None | - -The paper's QJL uses Gaussian S (not SORF); Lemma 4 [1] is proved specifically -for Gaussian. SORF for QJL is an additional approximation (the -[original QJL implementation][original-impl] used SORF for QJL). Per-block QJL can -incur up to d/B times larger variance bound than full-dimension QJL (Lemma 4 -[1]), depending on how query and residual energy are distributed across blocks. - -Community reports indicate MSE-only often wins for KV-cache attention at all -tested bit widths [8]. Whether this extends to ANN ranking is an empirical -question (see Experimental plan); QJL may not be worth the complexity. Note: -the [original QJL PR][original-impl] flagged a known SORF-related QJL bias for -non-power-of-2 padded dimensions (#7245); the merged MSE-only encoding avoids -this path. 
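Independent of whether QJL is pursued, the MSE scan path relies on the product
table assumed by the Stage 3 kernel's `precompute_product_table` call. A minimal
sketch of that helper — the name comes from the kernel pseudocode above, but the
signature here is an assumption, and the toy 2-bit codebook stands in for the
real 2^b_mse Max-Lloyd centroids:

```rust
/// Sketch of the distance-table precomputation assumed by the PDX scan
/// kernel: for a dot-product metric, entry (i, j) is the product of
/// centroids i and j, so one u8 × u8 lookup replaces a float multiply.
/// At b_mse=4 this is 16 × 16 = 256 floats (1 KB, L1-resident).
fn precompute_product_table(centroids: &[f32]) -> Vec<Vec<f32>> {
    centroids
        .iter()
        .map(|&ci| centroids.iter().map(|&cj| ci * cj).collect())
        .collect()
}

fn main() {
    let centroids = [-0.5f32, 0.0, 0.5, 1.0]; // toy b_mse=2 codebook
    let table = precompute_product_table(&centroids);
    assert_eq!(table.len(), 4);
    assert_eq!(table[2][3], 0.5); // 0.5 × 1.0
    assert_eq!(table[0][0], 0.25); // (-0.5) × (-0.5)
}
```

For an L2 metric the same shape works with entries `(ci - cj)²` plus the norm
weighting at block boundaries; only the table contents change, not the kernel.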
## Array layout

### Stage 1 (MSE-only single block)

```
TurboQuantArray
├── metadata: { dimension, b_mse,
│     block_size (= padded_dim, next power-of-2 ≥ dimension),
│     num_blocks (= 1), num_rounds (= R, default 3) }
│
│   # Per-row children
├── codes: FixedSizeListArray   # list_size = padded_dim
│     (or PDXArray after Stage 3)
├── norms: PrimitiveArray       # len = num_rows (F = f64 for f64, f32 otherwise)
│
│   # Shared children
├── centroids: PrimitiveArray   # len = 2^b_mse
├── mse_rotation_signs: FixedSizeListArray   # len = R (default 3)
│     element dtype: FixedSizeList(u8, padded_dim, NonNullable)
│     # each element = one bitpacked sign diagonal, inverse-friendly order
```

For power-of-2 dimensions, `padded_dim = dimension` (no waste). For
non-power-of-2 (e.g., d=768), `padded_dim = 1024` (33% overhead, eliminated
by Stage 2 block decomposition).

The codes child is `FixedSizeListArray` in Stages 1-2 and may be swapped to
`PDXArray` in Stage 3 — TurboQuant checks the child type at runtime, not via
a metadata flag.
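The Stage 1 padding rule can be sketched as follows. The helper names are
illustrative, not the crate's API; Rust's `usize::next_power_of_two` does the
rounding:

```rust
/// Stage 1 padding rule: round the input dimension up to the next power
/// of two (padded_dim = block_size, num_blocks = 1).
fn padded_dim(dimension: usize) -> usize {
    dimension.next_power_of_two()
}

/// Fractional storage overhead introduced by zero-padding the codes child.
fn padding_overhead(dimension: usize) -> f64 {
    let p = padded_dim(dimension);
    (p - dimension) as f64 / dimension as f64
}

fn main() {
    assert_eq!(padded_dim(1024), 1024); // power of two: no waste
    assert_eq!(padded_dim(768), 1024);  // padded until Stage 2
    // d=768: (1024 - 768) / 768 = 1/3, the 33% overhead quoted above
    assert!((padding_overhead(768) - 256.0 / 768.0).abs() < 1e-12);
}
```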
### Stage 2 (block decomposition)

```
TurboQuantArray (self-contained, handles blocks internally)
├── metadata: { dimension, b_mse, block_size, num_blocks,
│     num_rounds }
│
│   # Per-row children (sliced/taken on row operations)
├── codes: FixedSizeListArray   # list_size = k × B
│     (or PDXArray after Stage 3)
├── norms: PrimitiveArray       # len = num_rows (k=1)
│     or FixedSizeListArray     # list_size = k (k>1)
│
│   # Shared children (cloned on row operations, not sliced)
├── centroids: PrimitiveArray   # len = 2^b_mse
├── mse_rotation_signs: FixedSizeListArray   # len = k × R
│     element dtype: FixedSizeList(u8, B, NonNullable)
│     # k blocks × R rounds, each element = one bitpacked sign diagonal
```

## Compression ratio

For f32 input, b_mse bits MSE, k = d/B blocks, N vectors (for f64 input,
replace 32 with 64 in the norms row — ratios decrease accordingly):

| Component   | Bits per vector |
| ----------- | --------------- |
| MSE codes   | k × B × b_mse   |
| Block norms | k × 32          |

| Component  | Shared bits  |
| ---------- | ------------ |
| Centroids  | 2^b_mse × 32 |
| SORF signs | k × R × B    |

### Worked examples (f32, N=1000)

**At b_mse=8 (default, near-lossless):**

| d            | B    | k   | Per-vec bits          | Ratio | Notes                    |
| ------------ | ---- | --- | --------------------- | ----- | ------------------------ |
| 768          | 256  | 3   | 3×256×8 + 3×32 = 6240 | 3.9×  | Block decomp; no padding |
| 1024         | 1024 | 1   | 1024×8 + 32 = 8224    | 4.0×  | Single block (= current) |
| 768 (padded) | 1024 | 1   | 1024×8 + 32 = 8224    | 3.0×  | Padded; 33% overhead     |

**At b_mse=5 (32 centroids):**

| d            | B    | k   | Per-vec bits          | Ratio | Notes                    |
| ------------ | ---- | --- | --------------------- | ----- | ------------------------ |
| 768          | 256  | 3   | 3×256×5 + 3×32 = 3936 | 6.2×  | Block decomp; no padding |
| 1024         | 1024 | 1   | 1024×5 + 32 = 5152    | 6.4×  | Single block (= current) |
| 768 (padded) | 1024 | 1   | 1024×5 + 32 = 5152    | 4.8×  | Padded; 33% overhead     |

Block
decomposition improves the compression ratio at both bit widths. At b=8 -for d=768: from ~3.0× (padded) to ~3.9× (block decomp). At b=5 for d=768: from -~4.8× to ~6.2×. For d=1024, the encoding is identical to current (single block). - -**Shared overhead note:** centroids and SORF signs are amortized over N vectors; -for small N, per-column shared metadata is significant — report totals with and -without amortization when publishing ratios. - -## Performance analysis - -### Encode/decode throughput - -SORF at B dimensions (heuristic — real cost is dominated by memory bandwidth -and constant factors): R × B × log₂(B) butterflies + R × B sign applications -per block (R = SORF rounds, default 3; plus B normalization multiplies, -omitted). For k blocks, R=3: - -| B | SORF FLOPs/block | k (d=768) | Total MSE FLOPs | -| -------------- | ------------------------- | --------- | --------------- | -| 256 | 3×256×8 + 768 = 6,912 | 3 | 20,736 | -| 512 | 3×512×9 + 1536 = 15,360 | — | — | -| 1024 (current) | 3×1024×10 + 3072 = 33,792 | 1 | 33,792 | - -Block decomposition at d=768 is ~40% fewer FLOPs than the padded single-block -approach, despite more blocks, because each block is smaller. - -### Benchmarking plan - -1. Encode/decode throughput: block TQ vs. current TQ at d=128, 768, 1024 -2. Quantized cosine similarity: block vs. current -3. L2 norm readthrough: O(k) vs. O(1) -4. PDX scan throughput vs. row-major (Stage 3) - -## Experimental plan - -### Minimum dimension threshold - -Test TurboQuant quality at d ∈ {32, 64, 96, 128, 256} to validate the scheme -minimum of 128: - -- Compare TurboQuant MSE distortion and ANN recall@k against scalar - quantization matched on **total compressed bits per vector** (codes + norm + - amortized shared metadata), not just bits-per-coordinate — this is critical - at small d where norm overhead is significant -- Plot the crossover point: at what d does TurboQuant's recall@k drop below - the rate-matched scalar baseline? 
- Test SORF coordinate distribution quality at each d (histogram vs. Beta)
- Measure overhead ratio (norm bits / total compressed bits) at each d

The scheme minimum should be set at the smallest d where TurboQuant reliably
beats the scalar baseline on recall@k across the benchmarking datasets. Default
scalar baseline: per-dimension linear min-max quantization at b bits per
coordinate plus an f32 norm (matching TurboQuant's norm overhead). Report
results at a reference N (e.g., N=100K vectors) where shared metadata is
amortized; optionally show sensitivity to small N where shared costs dominate.
The current proposal of 128 is conservative; experiments may justify lowering
to 64 or raising to 256.

### MSE quality and scan performance vs. block size

- Compare actual normalized MSE at B ∈ {64, 128, 256, 512} vs. single-block at
  full power-of-2 dimension, at bit widths b ∈ {2, 3, 4, 5, 8}
- Compare ANN recall@k and scan throughput at fixed d (e.g., d=3072) across
  B ∈ {256, 512, 1024} — smaller B gives more pruning checkpoints for
  ADSampling-style early termination but increases norm overhead
- Test SORF coordinate distribution at each B: histogram vs. analytical Beta
- Test 3, 4, 5 SORF rounds at each B
- Determine if the practical MSE constant is worse at smaller B
- Measure cross-block coordinate correlation on real embeddings (Contriever,
  OpenAI) before and after per-block SORF rotation: compute the average
  absolute Pearson correlation between coordinates in different blocks. Compare
  block-decomposed (B=256, k=3) vs. single-block (B=d) SORF at d=768 to
  quantify how much cross-block dependence survives block decomposition.
The - RotorQuant/IsoQuant experiments [13] showed that very small block-diagonal - rotations (3-4 dims) leave full-dimension correlations intact; this test - determines where on the block-size spectrum the decorrelation gap becomes - negligible - -The block-size rule ("greatest qualifying B") is a starting heuristic that -maximizes per-block quality and minimizes norm count. Experiments may show that -smaller B with more pruning checkpoints yields better end-to-end scan -performance despite higher per-block overhead. - -### Gaussian-optimal vs. Beta-optimal grids - -HIGGS [12] demonstrates that Gaussian-optimal grids (computed via CLVQ for N(0,1)) -work well after a single Hadamard rotation. Since the Beta marginal converges to -Gaussian at high d, test whether Gaussian grids can replace Beta Max-Lloyd centroids -for ANN search: - -- **Grid comparison**: At B ∈ {64, 128, 256, 512} and b ∈ {2, 3, 4, 5, 8}, - compare ANN recall@k and normalized MSE for (a) Beta Max-Lloyd centroids at - B-dim, (b) Gaussian-optimal scalar grids (Normal Float style), and - (c) CLVQ-computed Gaussian grids. Report the crossover point where the grids - become practically equivalent. -- **Rotation depth**: If Gaussian grids match Beta Max-Lloyd at a given B, test - whether 1-round RHT (H·D with random signs) achieves comparable quality to - 3-round SORF. A single round would reduce rotation cost by ~3× and simplify - the transform. Test at B ∈ {64, 128, 256, 512} on the benchmarking datasets. -- **Simplification potential**: If Gaussian grids + 1-round RHT match quality at - B ≥ 256, this eliminates the dimension-dependent centroid computation (one grid - per bitwidth, shared across all block sizes) and reduces rotation overhead. - This would be a significant implementation simplification for Stage 2+. - -The expectation is that at B=256+ the difference is negligible, but at B=64-128 -the Beta-optimal grids may still win due to stronger non-Gaussian effects. 
Results -should inform whether the centroid computation strategy changes in Phase 2. - -### QJL strategy comparison (if pursued) - -- Per-block Gaussian QJL vs. per-block SORF QJL vs. full-dim SORF QJL - vs. MSE-only -- Key metric: ANN recall@k on the datasets above (Contriever, OpenAI, SIFT) -- Per community findings for attention, MSE-only is expected to win [8]; ANN - ranking is the key open question - -### Benchmarking datasets - -The current test suite uses i.i.d. Gaussian vectors as a theory anchor and -sanity check: for isotropic data, a random orthogonal transform is -distributionally neutral, which cleanly validates theoretical bounds. This is -not a universal "worst case" for all production workloads — heavy-tailed or -clustered embeddings can behave differently. Recent work -(VIBE [11]) argues that traditional benchmarks (SIFT, GloVe) are no longer -representative of modern ANN workloads. - -**Recommended datasets:** - -| Dataset | Dim | Size | Source | Why | -| ----------------------------- | ------ | ------ | ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------- | -| Contriever | 768 | ~1M | PDX paper [4] | Key non-power-of-2 target; real embeddings | -| OpenAI text-embedding-3-large | 1536 | ~1M | Common in RAG | High-d production embeddings | -| SIFT | 128 | 1M | Classic | Low-d power-of-2 baseline, well-studied recall numbers | -| arXiv embeddings | 768 | 2.25M | PDX paper [4] | Same dim as Contriever, larger scale | -| DEEP | 96 | 10M | Image embeddings | Large scale; d=96 < scheme min (128) and has no B ≥ 64 — requires explicit TurboQuantArray construction or benchmark-only scheme override | -| Synthetic Gaussian | varies | varies | Internal | Theory anchor / sanity check; not universal worst case | - -**Metrics** (at b_mse ∈ {2, 3, 4, 5, 8}): - -- Recall@10, Recall@100 (ANN ranking quality) -- Normalized MSE distortion (reconstruction quality) 
- Inner product mean signed relative error (bias measurement)
- Encode/decode throughput (vectors/sec)

The Gaussian baseline validates that theoretical bounds hold. The real-embedding
datasets measure practical quality — which may be **better** than Gaussian
(structured data benefits more from rotation) or **worse** (if the data has
adversarial properties for the specific rotation).

### Dimensions with no qualifying B

This case is rare among common embedding dimensions (d=96 is one example).
Currently these fall back to internal zero-padding to the next power-of-2
(single padded block). See "Straggler blocks (future work)" in Stage 2 for a
potential alternative using heterogeneous per-block encodings.

## Phasing

**Phase 1** (in progress) — MSE-only single-block TurboQuant: Initial
implementation merged as [PR #7269][current-impl]. Remaining:
`FixedSizeListArray` rotation signs (variable SORF rounds), dtype-matching
norms, structured metadata, and review items (see Stage 1: Remaining work).

**Phase 2** — Block decomposition: Add block splitting for dimensions where a
valid B exists (greatest power-of-2 ≥ 64 dividing d). Per-block norms stored as
internal children. The `TurboQuantScheme::compress()` method must be updated to:
(a) choose B based on d, (b) split input into blocks, (c) normalize per-block,
(d) encode each block, and (e) store per-block norms as an internal child array.

**Phase 3** — PDXArray + scan kernels: Introduce `PDXArray` as a general-purpose
dimension-major layout for `FixedSizeListArray`. TurboQuant's codes child is
swapped from FSL to PDXArray by the compressor. Distance computation kernels
operate on PDXArray's dimension-contiguous slices.

**Phase 4** (experimental) — QJL: If the experimental plan shows QJL improves
recall@k beyond MSE-only, add per-block Gaussian or SORF QJL. Based on
KV-cache community reports [8], this may not be pursued.
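The Phase 2 block-selection rule ("greatest power-of-2 ≥ 64 dividing d") can be
sketched as below. The helper is illustrative, not the `TurboQuantScheme` API;
`None` corresponds to the padded single-block fallback:

```rust
/// Phase 2 block-size rule: the greatest power-of-two B >= 64 that
/// divides d exactly. Returns None when no such B exists (e.g. d=96),
/// in which case the scheme falls back to padded single-block Stage 1.
fn choose_block_size(d: usize) -> Option<usize> {
    let mut best = None;
    let mut b = 64;
    while b <= d {
        if d % b == 0 {
            best = Some(b); // keep the greatest qualifying B
        }
        b *= 2;
    }
    best
}

fn main() {
    assert_eq!(choose_block_size(768), Some(256));   // k = 3 blocks
    assert_eq!(choose_block_size(1536), Some(512));  // k = 3
    assert_eq!(choose_block_size(3072), Some(1024)); // k = 3
    assert_eq!(choose_block_size(1024), Some(1024)); // single block
    assert_eq!(choose_block_size(96), None);         // padded fallback
}
```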
## Practical recommendations

For common model dimensions, the most promising configurations are:

| Dimension              | Recommendation              | Rationale                                                                       |
| ---------------------- | --------------------------- | ------------------------------------------------------------------------------- |
| 512, 1024, 2048, 4096  | Single-block MSE-only + PDX | B=d, no decomposition needed. Same as current TQ but with PDX scan layout.      |
| 768, 1536, 3072        | 3-block MSE-only + PDX      | B=256, 512, or 1024 respectively. No padding waste. 3 blocks, shared centroids. |
| No qualifying B (rare) | Padded single-block         | Internal zero-padding to next power-of-2, single SORF.                          |

In all cases, MSE-only is the recommended starting point. QJL should only be
added if experiments demonstrate clear recall@k improvements for the target
workload.

## Future work: Multi-dimensional vector quantization (p>1)

HIGGS [12] demonstrates that vector quantization with dimension p>1 (quantizing
groups of p coordinates jointly to an optimal multi-dimensional grid) achieves
better rate-distortion than scalar quantization (p=1) at the same bit budget. For
TurboQuant, this would mean replacing the per-coordinate Max-Lloyd centroid lookup
with a per-subvector codebook lookup, where each group of p rotated coordinates
maps to one of n codewords in a p-dimensional CLVQ grid.

**Benefits:**

- Improved rate-distortion: at 3-4 bits, p=2 or p=4 captures residual
  correlations between coordinates that scalar quantization misses.
- Simpler centroid computation: CLVQ grids for Gaussian inputs are computed once
  per (n, p) pair and reused across all block sizes (no dimension dependence).

**Costs and constraints:**

- **Distance kernel redesign.** The PDX scan kernel (Stage 3) is built around
  per-coordinate centroid lookups with a (2^b)²-entry distance table.
At p=2
  with b=4 bits per coordinate, the codebook has 2^(4×2)=256 entries, and the
  distance table becomes 256×256=64K entries (256 KB) — still fits in L2, but
  no longer in L1, and is much larger than the current 1 KB at b=4 scalar. At
  p=4 the table is infeasible; alternative distance strategies (asymmetric
  distance computation, partial codebook scans) would be needed.
- **GPU shared memory.** HIGGS notes total grid points 2^(b×p) must fit GPU
  shared memory (~2^10 points practical limit), constraining (b, p) pairs.
- **PDX layout interaction.** The current "1 dim × 64 vecs" PDX layout assumes
  per-coordinate independence. At p>1, the layout would need to group p
  consecutive dimensions together per lookup, changing the transpose structure.

**Recommendation:** Evaluate p=2 VQ experimentally after Stage 3 (PDX) is
validated. Compare ANN recall@k at matched bit budgets: p=1 at b bits vs. p=2 at
b bits. If p=2 shows meaningful recall improvement (>2% recall@10), design the
kernel changes as a Stage 4 extension. CLVQ grids for p=2 can be precomputed
offline using the Pagès & Printems (2003) algorithm [12].

## Future work: GPU decode and fused distance computation

The B-dim block structure maps naturally to GPU tile sizes and tensor cores.
For a single block (k=1; Stage 2 generalizes to k independent per-block GEMMs)
with a batch of N vectors sharing the same rotation matrix R⁻¹:

```
decoded_batch = diag(norms) × R⁻¹ × codebook_lookup_batch(codes)
                                    ↑ B×N matrix
                              ↑ B×B × B×N = GEMM
```

The codebook gather + inverse rotation + norm scaling can be fused into a single
kernel using an IO-aware streaming pattern analogous to Flash-KMeans [6] — not
the same algorithm (Flash-KMeans is GPU k-means), but a similar systems goal:
reduce HBM traffic and avoid full materialization.
For distance computation without full decode, a precomputed (2^b_mse)²-entry
distance table fits in shared memory at low bit widths (1 KB at b_mse=4, 4 KB
at b_mse=5).
At the default b_mse=8, the table is 256² × 4 = 256 KB, which
exceeds typical GPU shared memory (48-228 KB); the distance-table approach is
therefore practical only at b ≤ 5 on GPU, or requires tiling/streaming for
b=8. On CPU, the table fits in L2 at all bit widths. The kernel streams code
bytes from HBM with gather-reduce accumulation, using 4-8× less bandwidth
than full float vectors.

At b_mse=8, codes are uint8 indices (0-255). Direct low-precision GEMM on
hardware accelerators (tensor cores on GPU, byte-dot-product instructions on
CPU) requires approximately linearly spaced centroids — but at high B the
Max-Lloyd centroids are already near-uniform (the Beta distribution is highly
concentrated, approaching Gaussian, for which high-resolution optimal
quantization is approximately uniform). Whether the existing Max-Lloyd
centroids are "linear enough" for hardware dot-product instructions is an
empirical question worth testing before introducing a separate linear
quantization mode.

## Integration with Vortex scan engine

TurboQuant's quantized-domain operations must integrate with Vortex's expression
evaluation and scan pushdown infrastructure. The current implementation provides
this via `ScalarFnVTable` implementations in `vortex-tensor`.

**Current integration path.** The `CosineSimilarity`, `DotProduct`, and `L2Norm`
scalar functions check whether their input storage arrays are TurboQuant-encoded
(via `TurboQuant::try_match()`). If both operands are TurboQuant and the
`ApproxOptions::Approximate` flag is set, the scalar function dispatches to the
quantized-domain kernel (e.g., `cosine_similarity_quantized_column`), bypassing
full decompression. Otherwise, it falls back to the exact path (decompress →
compute on floats).
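The per-block weighting that the quantized kernels rely on is an exact identity
when the unit dot products are exact — quantizing the unit vectors is the only
approximation. A toy numeric check of that identity, with illustrative helper
names and a toy block size of 2 (well below the real minimum of 64):

```rust
/// Per-block L2 norms of a vector split into blocks of `bsize` dims.
fn block_norms(v: &[f64], bsize: usize) -> Vec<f64> {
    v.chunks(bsize)
        .map(|c| c.iter().map(|x| x * x).sum::<f64>().sqrt())
        .collect()
}

/// dot(a, b) reconstructed blockwise:
///   sum_k ||a_k|| * ||b_k|| * <a_k/||a_k||, b_k/||b_k||>
fn blockwise_dot(a: &[f64], b: &[f64], bsize: usize) -> f64 {
    let (na, nb) = (block_norms(a, bsize), block_norms(b, bsize));
    a.chunks(bsize)
        .zip(b.chunks(bsize))
        .enumerate()
        .map(|(k, (ak, bk))| {
            // unit-norm dot product of block k, then weight by both norms
            let unit = ak.iter().zip(bk).map(|(x, y)| x * y).sum::<f64>()
                / (na[k] * nb[k]);
            na[k] * nb[k] * unit
        })
        .sum()
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [4.0, 3.0, 2.0, 1.0];
    let flat: f64 = a.iter().zip(&b).map(|(x, y)| x * y).sum(); // = 20
    assert!((blockwise_dot(&a, &b, 2) - flat).abs() < 1e-12);
    // ||a|| recovered from block norms: sqrt(sum_k norm_k^2) = sqrt(30)
    let l2: f64 = block_norms(&a, 2).iter().map(|n| n * n).sum::<f64>().sqrt();
    assert!((l2 - 30f64.sqrt()).abs() < 1e-12);
}
```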
**Stage 2 changes.** With block decomposition, the quantized kernels must be
updated to iterate over TQ blocks, weighting by per-block norms:

- `cosine_similarity_quantized_column`: currently computes a single unit-norm
  dot product per row pair. Must change to
  `Σ_k norm_a_k · norm_b_k · unit_dot_k / (‖a‖ · ‖b‖)` with
  `‖a‖ = √(Σ_k norm_a_k²)`.
- `dot_product_quantized_column`: same per-block weighting.
- `l2_norm`: currently returns the stored norm directly (O(1)). Must change to
  `√(Σ_k norm_k²)` — read the norms child (`PrimitiveArray` for k=1,
  `FixedSizeListArray` for k>1) and compute.
- Both operands must have the **same block size B**, compatible centroids (same
  `b_mse` and B-dim codebook), and **bit-identical MSE rotation parameters**
  (`mse_rotation_signs` and same SORF construction) for the quantized
  inner-product path to be valid. Two stored columns with different rotations
  must **fall back to exact** (decompress → float). The common **column vs.
  constant query** path avoids this: the query is re-encoded with the column's
  rotation and centroids at query time.

**Stage 3 changes.** The PDX distance kernel (shown in Stage 3 pseudocode) is a
new execution path that operates on `PDXArray`-typed codes. It should be exposed
as an alternative `ScalarFnVTable` implementation that activates when the codes
child is a `PDXArray` and the scan is over a contiguous 64-vector-aligned range.
For non-aligned ranges or single-vector access (`scalar_at`), the PDXArray is
converted to FSL first via `PDXArray::to_fsl()`.

**Expression tree integration.** The typical ANN scan expression is:

```
top_k(cosine_similarity(column, constant_query), k=10)
```

The `constant_query` is broadcast to match the column length. The
`CosineSimilarity` scalar function receives both the column (TurboQuant-encoded)
and the query (ConstantArray wrapping a single vector).
For the quantized path,
the query is first encoded with the column's rotation and centroids to produce
query codes and query block norms, then the PDX kernel runs over the column's
codes without decompressing them.

## Migration and compatibility

TurboQuant has not been included in a release yet, so the wire format can still
change freely. The Stage 1 target wire format is intended to be one we believe
is ready for backward-compatibility guarantees, without formally committing to
stability until confirmed by Stage 2 implementation and benchmarking.

**Strategy: single array ID, versioned metadata.** All stages use the same array
ID (`vortex.turboquant`). The metadata includes `block_size`, `num_blocks`, and
`num_rounds` fields. Stage 1 always writes `num_blocks=1`, but the field exists
so that Stage 2 decoders can read Stage 1 files without migration.

**Decoder invariant:** `block_size` is always power-of-2.
`codes.list_size` = `num_blocks × block_size`. Note that `dimension` (the
original input dimension) may differ from `codes.list_size` in Stage 1 when
internal padding applies (e.g., dimension=768, block_size=1024, list_size=1024).
In Stage 2, `dimension = num_blocks × block_size` (no padding, since B is
chosen to divide d exactly). The decoder **validates** that
`codes.list_size == num_blocks × block_size` (reject files where this does not
hold). `num_rounds` must equal `mse_rotation_signs.len / num_blocks`.

**Norms are always internal children.** The TurboQuant array is self-contained —
it stores norms as a child slot, not in a parent encoding. This means:

- Stage 1: norms child is `PrimitiveArray`, one norm per vector (F = f64
  for f64 input, f32 otherwise).
- Stage 2 with k=1 (power-of-2 dims): same as Stage 1, identical wire format.
- Stage 2 with k>1: norms child is `FixedSizeListArray`, k norms per vector.

The decoder distinguishes k=1 from k>1 by reading `num_blocks` from metadata.
A k=1 decoder is backward-compatible with Stage 1 files. A k>1 decoder is a
new code path that only applies to files written by Stage 2+.

**Stage 3 (PDXArray) is additive.** PDX is not a TurboQuant metadata flag — it's
a separate array type (`PDXArray`) that wraps the codes child. Stage 1/2 files
have `FixedSizeListArray` codes; Stage 3 files have `PDXArray` codes. The
TurboQuant decoder checks the child type and un-transposes PDXArray on decode if
needed. `PDXArray` itself is registered as a new encoding, independent of
TurboQuant.

**Incremental shipping:**

| Stage      | Ships to users? | Reads prior stage files?   | Notes                              |
| ---------- | --------------- | -------------------------- | ---------------------------------- |
| 1 (MSE)    | Yes             | N/A (first stable version) | Single block, variable SORF rounds |
| 2 (blocks) | Yes             | Yes (k=1 is identical)     | k>1 files need Stage 2+ decoder    |
| 3 (PDX)    | Yes             | Yes (FSL codes still work) | PDX codes need PDXArray registered |

Each stage is independently shippable. Users can upgrade incrementally. Files
written by earlier stages are always readable by later decoders.

## References

_All lemma, theorem, and definition numbers for [1] refer to arXiv:2504.19874v1.
The ICLR 2026 camera-ready proceedings may use different numbering._

[1] Zandieh, A., Daliri, M., Hadian, M. and Mirrokni, V. "TurboQuant: Online
Vector Quantization with Near-optimal Distortion Rate." ICLR 2026.
arXiv:2504.19874, April 2025.

[2] Ailon, N. and Chazelle, B. "The Fast Johnson-Lindenstrauss Transform and
Approximate Nearest Neighbors." SIAM J. Comput. 39(1):302-322, 2009.

[3] Tropp, J.A. "Improved Analysis of the Subsampled Randomized Hadamard
Transform." Adv. Adaptive Data Analysis 3(1-2):115-126, 2011.

[4] Kuffo, L., Krippner, E. and Boncz, P. "PDX: A Data Layout for Vector
Similarity Search." SIGMOD '25. arXiv:2503.04422, March 2025.

[5] Yu, F.X., Suresh, A.T., Choromanski, K., Holtmann-Rice, D. and Kumar, S.
"Orthogonal Random Features." NeurIPS 2016. arXiv:1610.09072.

[6] Yang, S. et al. "Flash-KMeans: Fast and Memory-Efficient Exact K-Means."
arXiv:2603.09229, March 2026.

[7] Pathare, T. et al. "TurboQuant: Implementation Corrections, Production
Hardening, and Deployment Infrastructure." Eviox Tech Report v1.2.0,
March 2026. https://eviox.tech/nexus/eviox_turboquant_corrections_study.pdf
_(Note: this URL may require Eviox account access; not publicly indexed.)_

[8] Community TurboQuant implementation reports (primarily KV-cache attention):

- https://github.com/tonbistudio/turboquant-pytorch — MSE-only (V3) vs
  MSE+QJL (V2); reports MSE-only wins for attention and generation quality.
- https://github.com/ggml-org/llama.cpp/discussions/20969 — TurboQuant
  discussion; quantized attention analysis and MSE vs Prod comparison.
- https://github.com/0xSero/turboquant — Triton kernels; paper validation.
- https://github.com/scos-lab/turboquant — Reference reproduction; MSE vs
  Prod/QJL comparison.

Multiple groups report MSE-only beating MSE+QJL for attention metrics at tested
bit widths. ANN ranking conclusions remain preliminary pending dedicated
benchmarks.

[9] Jégou, H., Douze, M. and Schmid, C. "Product Quantization for Nearest
Neighbor Search." IEEE Trans. PAMI 33(1):117-128, 2011.

[10] Ge, T., He, K., Ke, Q. and Sun, J. "Optimized Product Quantization."
IEEE Trans. PAMI 36(4):744-755, 2014.

[11] Jääsaari, E., Hyvönen, V., Ceccarello, M., Roos, T. and Aumüller, M.
"VIBE: Vector Index Benchmark for Embeddings." arXiv:2505.17810, May 2025.

[12] Malinovskii, V., Panferov, A., Ilin, I., Guo, H., Richtárik, P. and
Alistarh, D. "Pushing the Limits of Large Language Model Quantization via the
Linearity Theorem." arXiv:2411.17525, November 2024.

[13] johndpope et al. "RotorQuant: Clifford algebra vector quantization." PR #34,
TheTom/turboquant_plus, March-April 2026.
https://github.com/TheTom/turboquant_plus/pull/34
Explores SO(2)/SO(3)/SO(4) block-diagonal rotations as alternatives to
full-dimension SORF. Rejected due to 10×+ MSE regressions on real KV-cache
tensors, attributed to insufficient cross-group decorrelation.

## Appendix A: Reference implementation bugs and Theorem 1 constant

### Reference implementation bugs

The Eviox corrections study [7] identified six material bugs in the paper's
reference Python implementation. The most critical is a mathematical error in
the QJL scale factor: the reference code used `√(π/(2d))` instead of
`√(π/2)/d` (Definition 1 in [1]), differing by a factor of √d (≈11× at d=128).
Our [current implementation][current-impl] uses the correct formula
(`sqrt(FRAC_PI_2) / padded_dim` in Rust), so this bug does **not** affect us.

Other notable Eviox findings: (a) the reference code recomputes codebooks at
every instantiation (we cache in a `DashMap`); (b) the reference uses float16
for codebook distance computation, causing misassignment at small centroid
spacings (we cast to f32 before quantization). See [7] for the full list.

### Theorem 1 constant

There is an ambiguity in the paper's notation for the MSE bound constant. The
formal proof gives `(√3 · π / 2) · 4^{-b}` where the constant √3·π/2 ≈ 2.72.
The Eviox report [7] (Item 7) deliberately adopts the alternative parsing
`√(3π)/2 ≈ 1.535`, claiming it is "consistent with the formal proof." We treat
`√3·π/2 ≈ 2.72` as the theorem constant because: (a) the paper's prose
describes the constant as "≈ 2.7," which matches 2.72 not 1.535; and (b) the
paper's reported distortion values (b=2: 0.117, b=3: 0.03) exceed the
1.535-based bound (b=2: 0.096, b=3: 0.024), ruling out `√(3π)/2` as a valid
**upper** bound on the measured quantity. The definitive resolution requires
checking the exact LaTeX grouping in the ICLR 2026 camera-ready proof.
The -paper's "explicit values" (0.36, 0.117, 0.03, 0.009) are the actual computed -distortion of the optimal quantizer, not the bound itself — they are well below -the 2.72/4^b bound. - -## Appendix B: Community findings on QJL - -Multiple independent TurboQuant implementations have repeatedly reported a -practical finding for **KV-cache attention**: MSE-only often outperforms MSE+QJL -at the same bit budget. The likely mechanism is a variance-bias tradeoff: QJL -removes bias in raw inner-product estimation but adds variance, and the softmax -nonlinearity amplifies variance more than it penalizes bias. In that setting, -allocating all bits to MSE (more centroids, lower quantization variance) can beat -splitting the budget between MSE + QJL. This behavior has been reported by -multiple groups across Python, C, and Rust implementations [8]. - -For ANN search, cosine ranking, and other non-softmax vector-search workloads, -the evidence is currently less settled. MSE-only is still a reasonable default -because it is simpler and better supported by the current implementation work, -but the ANN question should be treated as empirical until evaluated on ANN -datasets with recall@k and ranking metrics (see Experimental plan). - -## Appendix C: Alternative rotation strategies - -### Why not DCT? - -DCT is O(B log B) and invertible, but it is a **fixed structured transform**, -not a random rotation — it does not produce the Beta marginal distribution -`(1-x²)^((B-3)/2)` (in block dimension B) that TurboQuant's Max-Lloyd centroids -are optimized for. ADSampling only needs approximate coordinate independence -(for hypothesis-testing pruning), so a fixed orthogonal transform like DCT -suffices there. TurboQuant needs a specific known marginal distribution, so only -random orthogonal rotations (QR or SORF) are suitable. - -### Shared rotation with ADSampling (speculative) - -Both TurboQuant and ADSampling apply a random orthogonal rotation to make -coordinates independent. 
If we integrate ADSampling-style dimension pruning -(see Stage 3), the same rotation could in principle serve both purposes. -However, this is not automatic under the Stage 2 block-decomposed design: -ADSampling is formulated around a single full-dimensional random projection -whose coordinates can be sequentially sampled, whereas Stage 2 introduces -per-block rotations and per-block norm weighting. Reusing one rotation across -both systems should be treated as a **future research direction** that requires -new analysis or direct empirical validation. If it proves viable, it would avoid -rotating the data twice. The query would also need to be rotated at query time -with the same stored transform.
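To make the rotation discussion concrete, one SORF-style round — a random sign
diagonal D followed by a normalized fast Walsh-Hadamard transform H — can be
sketched as below. This is an illustration, not the crate's implementation
(which bitpacks signs and stores them in inverse-friendly order), but it shows
the O(B log B) butterfly structure and the exact norm preservation the rotation
arguments rely on:

```rust
/// One SORF-style round: apply a sign diagonal D, then a normalized
/// fast Walsh-Hadamard transform H (each round is orthogonal, so the
/// vector norm is preserved exactly). Signs are fixed here for
/// illustration; the real encoding draws them randomly per round.
fn sorf_round(v: &mut [f64], signs: &[i8]) {
    let n = v.len();
    assert!(n.is_power_of_two() && signs.len() == n);
    // Sign diagonal D.
    for (x, s) in v.iter_mut().zip(signs) {
        *x *= *s as f64;
    }
    // In-place fast Walsh-Hadamard butterflies: n log2(n) add/subs.
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(2 * h) {
            for j in i..i + h {
                let (a, b) = (v[j], v[j + h]);
                v[j] = a + b;
                v[j + h] = a - b;
            }
        }
        h *= 2;
    }
    // Normalize so H is orthogonal (H / sqrt(n)).
    let scale = 1.0 / (n as f64).sqrt();
    for x in v.iter_mut() {
        *x *= scale;
    }
}

fn main() {
    let mut v = [3.0, -1.0, 2.0, 0.5];
    let before: f64 = v.iter().map(|x| x * x).sum::<f64>().sqrt();
    sorf_round(&mut v, &[1, -1, 1, 1]);
    let after: f64 = v.iter().map(|x| x * x).sum::<f64>().sqrt();
    assert!((before - after).abs() < 1e-12); // rotation preserves norm
}
```

Stacking R such rounds (default 3) approximates a random rotation; the inverse
applies the same butterflies followed by the sign diagonal, in reverse round
order.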