BaseModelAI/clostera

clostera: The Billion-Vector Resurrection

clostera benchmark summary

They told you that clustering massive high-dimensional vector collections on a single machine was a fool's errand. They said you needed a cluster, a distributed headache, and a cloud bill large enough to ruin your week. They were wrong.

clostera is a from-scratch Rust rebuild of the original pqkmeans repository, aimed at the workloads that made that project exciting in the first place: extremely large vector collections, high dimensionality, single-machine practicality, and performance that is measured rather than hoped for.

This is not a thin wrapper around old code. It is a modern rewrite with a new Rust core, a NumPy-first Python layer, parquet and out-of-core workflows, deterministic benchmarks, automatic number-of-clusters (K) selection, Apple Silicon support, and wheels that install like a normal Python package.

Rust core · Rayon · OpenBLAS/LAPACK · AVX2/SSE · Apple Silicon NEON · NumPy + parquet · manylinux + macOS wheels

pip install clostera
Why billion-scale clustering?

The short answer is that it is genuinely useful. If you work with embeddings, recommendations, retrieval, representation learning, semantic search, or large behavioral datasets, clustering at very large scale is not academic theater. It is operationally important.

But for 🦋 clostera, that is only part of the story.

The deeper reason is historical and conceptual. The extreme efficiency and mathematical elegance of the original pqkmeans algorithm indirectly helped inspire the development of EMDE, and later a much stronger internal family of TREMDE algorithms. Together with internal proprietary evolutions of 🦋 Cleora, those ideas form a major part of the conceptual foundation behind BaseModel.AI, Synerise's flagship product.

That is why this rewrite exists. The original project mattered. It influenced real systems, real products, and real lines of research. Left unmaintained, it deserved a modern successor: faster, cleaner, easier to install, easier to use, and built for current hardware instead of the past.

Origins of the Clostera name

At Synerise, we have a tradition of finding algorithmic inspiration in the natural world: specifically, in the quiet, hyper-efficient mechanics of the moth.

Just as we look to 🦋 Cleora to capture the geometry and distance calculations of our hyperspherical embeddings, we turn to the 🦋 Clostera moth to represent the colossal mechanics of billion-scale clustering.

In taxonomy, 🦋 Clostera is a genus of prominent moths known for their robust build and rapid flight. But the true magic lies in the origin of the name. Derived from the ancient Greek word klōstēr (κλωστήρ), "🦋 Clostera" literally translates to "spindle".

A spindle's sole purpose is to take raw, chaotic, disconnected fibers and rapidly rotate them, pulling them tightly around a central core to spin them into structured, organized threads.

In machine learning, your billion-scale dataset is that chaotic fleece.

🦋 Clostera is your algorithmic spindle. It acts as a high-speed rotational force, drawing billions of isolated vectors toward a shared center of mass, the centroid. It takes the noise, finds the pattern, and binds your scattered data into structured clusters.

Fast, robust, and mathematically grounded. Welcome to the 🦋 Clostera era.

⚡️ Quick Start: It just works

The zero-tuning path

import numpy as np
import clostera

vectors = np.load("vectors.npy").astype(np.float32)
clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform(vectors)

print(clusterer.selected_k_)  # selected K = selected number of clusters

That is the default story: one object, raw vectors in, labels out, OPQ-enabled quality path by default, and automatic number-of-clusters (K) selection when you do not know the answer up front.

The fastest path

clusterer = clostera.Clusterer(k=256, fastest=True)  # K = number of clusters
labels = clusterer.fit_transform(vectors)

fastest=True turns off OPQ and uses the plain PQ path. That is the right choice when end-to-end throughput matters more than reconstruction quality. The main speed win is in encoder training and encoding; the final compressed assignment stage itself is already fast in both modes.
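As a mental model, the plain PQ path snaps each subvector of every input vector to its nearest codeword in the corresponding subspace. A minimal NumPy sketch of that encoding step (illustrative only, not clostera's Rust implementation):

```python
import numpy as np

def pq_encode(vectors: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Plain-PQ encode sketch. vectors: (N, D); codebooks: (M, Ks, D // M)."""
    n, d = vectors.shape
    m, ks, sub = codebooks.shape
    assert d == m * sub, "dimensionality must split evenly across subspaces"
    subvecs = vectors.reshape(n, m, sub)
    codes = np.empty((n, m), dtype=np.uint8)
    for j in range(m):
        # squared distances from every subvector to every codeword in subspace j
        diff = subvecs[:, j, None, :] - codebooks[j][None, :, :]
        codes[:, j] = np.argmin(np.einsum("nkc,nkc->nk", diff, diff), axis=1)
    return codes
```

Each row of the result is M bytes, which is why everything downstream of encoding gets so much cheaper.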

Out-of-core from parquet

clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform("vectors.parquet")

If the original float vectors do not fit comfortably in RAM, add max_ram_bytes=.... If they do fit, you do not need to think about it.

⚡️ The Miracle of 30.8x: Bending Time

The original repository proved a powerful idea: by clustering in PQ code space instead of dense float space, single-machine clustering suddenly stops sounding ridiculous. That idea aged well. The surrounding implementation did not.

clostera asks the obvious follow-up question:

What happens if you rebuild the original pqkmeans project properly for modern hardware and modern Python workflows?

On the committed deterministic 10M x 2048 checkpoint, the answer is not subtle.

| Metric (10M x 2048) | original | clostera-fastest | clostera-quality |
| --- | --- | --- | --- |
| Encode time | 222.94 s | 7.24 s | 131.34 s |
| Cluster time | 80.19 s | 4.50 s | 4.39 s |
| Reconstruction MSE | 0.15160 | 0.12354 | 0.05494 |
| Purity | 0.6573 | 1.0000 | 1.0000 |

That means:

  • 30.8x faster encoding than the original implementation on the headline checkpoint.
  • 17.8x faster clustering on the same full-core run.
  • Better clustering quality even on the fastest path.
  • A quality-first OPQ mode that dramatically lowers reconstruction error when fidelity matters more than raw throughput.

10M by 2048 benchmark figure

💾 The Alchemy of Memory: Zero-RAM Scaling

At billion-vector scale, the algorithm is only half the story. Memory movement is usually the real bottleneck.

clostera is built around that reality:

  • raw numpy.ndarray input works out of the box
  • parquet is a first-class input format
  • fixed-size-list vector columns and plain numeric scalar columns are both supported
  • max_ram_bytes bounds the working set when the original float vectors do not fit
  • raw vectors can be streamed while PQ codes spill to disk automatically when needed
  • numpy.memmap fits naturally into the same workflow

This is the practical difference between a paper result and a pipeline you can actually operate.

A 2D example using k-means, clostera-quality, and clostera-fastest

2D comparison of k-means, clostera-quality, and clostera-fastest

Large-scale evaluation

Large-scale evaluation summary table

🧠 The Oracle of K: Automatic number of clusters without guesswork

Choosing K (the number of clusters) used to mean elbow plots, trial-and-error, and pretending you were more certain than you really were.

clostera lets you pass k=None to Clusterer, PQKMeans, or OPQMeans when you do not know the number of clusters in advance. The candidate analysis runs in Rust, reuses the already-trained encoder and the already-encoded PQ code matrix, and does not regenerate the expensive intermediate artifacts for each candidate number of clusters (K).

On the committed deterministic benchmark sweep, the default centroid_silhouette selector recovered the exact true cluster count in 20/20 cases.

  • centroid_silhouette: 20/20 exact matches, 0.00 mean absolute error
  • davies_bouldin: 18/20 exact matches, 0.90 mean absolute error
  • elbow: 18/20 exact matches, 1.60 mean absolute error
  • bic: 3/20 exact matches, 50.40 mean absolute error
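To make the candidate-scoring idea concrete, here is a minimal elbow-style selector: pick the candidate K where the per-step inertia improvement falls off most sharply. This is an illustrative sketch, not clostera's actual Rust-side scorer:

```python
import numpy as np

def pick_elbow(candidates: list[int], inertias: list[float]) -> int:
    """Elbow-style K selection sketch: choose the candidate where the
    improvement in inertia drops the most (largest second difference)."""
    x = np.asarray(inertias, dtype=float)
    gains = x[:-1] - x[1:]          # improvement from each K to the next
    drops = gains[:-1] - gains[1:]  # how sharply the improvement falls off
    return candidates[1 + int(drops.argmax())]

print(pick_elbow([2, 4, 8, 16], [100.0, 40.0, 35.0, 33.0]))  # elbow at K=4
```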

Automatic number-of-clusters (K) selection benchmark figure

💎 The Obsidian Core: Engineered for modern silicon

clostera is built for people who care about practical speed, reproducibility, and a sane deployment story.

  • Clusterer is the simple default API for normal use.
  • fastest=True gives you the maximum-throughput plain-PQ path.
  • The default path keeps OPQ on and favors quality.
  • The advanced split into PQEncoder / PQKMeans and OPQEncoder / OPQMeans is still there when you need it.
  • The hot paths use full-core Rust + Rayon, BLAS/LAPACK-backed dense math, x86 SIMD, and Apple Silicon NEON kernels.
  • Wheels are built for manylinux_2_28 x86_64 and aarch64, plus macOS x86_64 and arm64.
  • Deterministic seeds, deterministic synthetic datasets, and committed benchmark artifacts make the claims inspectable.

End-to-end clustering pipeline time and quality tradeoff across deterministic benchmark families

🔁 From research repo to production rewrite

The original project matters because it proved the idea. clostera exists because that idea deserved a modern implementation.

| Area | Original pqkmeans | clostera |
| --- | --- | --- |
| Core implementation | Older Python/C++ reference stack | Rust core with PyO3 bindings and maturin packaging |
| PQ codebook initialization | Basic point-picked initialization | Deterministic PCA-quantile seeding with deterministic fallback |
| Cluster initialization | Random center picking in PQ code space | Deterministic farthest-first seeding in PQ code space |
| Quality modes | Plain PQ | Default OPQ-backed quality path plus an explicit fastest plain-PQ mode |
| Choosing K (number of clusters) | User supplies K | User supplies K or lets Rust-side auto-selection choose it with k=None |
| CPU path | OpenMP-era reference implementation | Rayon-parallel hot paths, BLAS/LAPACK-backed math, x86 SIMD, Apple Silicon NEON |
| Python workflows | NumPy-centric | NumPy arrays, parquet streaming, memmapped code output, RAM-bounded out-of-core workflows, deterministic synthetic datasets |
| Packaging | Source build expectations | manylinux_2_28 x86_64 and aarch64, macOS x86_64 and arm64, CPython 3.10 through 3.13 |
| Benchmarking | Research notebooks and limited comparison artifacts | Deterministic benchmark suite with throughput and clustering-quality metrics, plots, and a showcase notebook |

📊 The Benchmarks of Truth

The README carries committed, deterministic benchmarks because this project should win on numbers, not adjectives.

Large-scale checkpoint: 10,000,000 x 2048

This is the scale checkpoint the rewrite has to answer for: 64 clusters, one machine, and a dataset large enough that hand-waving stops being useful.

Thread settings used for the max-throughput configuration:

  • 24 BLAS threads
  • 24 OpenMP threads
  • 24 Rayon threads
| Variant | Encode s | Cluster s | Recon MSE | Purity |
| --- | --- | --- | --- | --- |
| original | 222.94 | 80.19 | 0.15160 | 0.6573 |
| clostera-fastest | 7.24 | 4.50 | 0.12354 | 1.0000 |
| clostera-quality | 131.34 | 4.39 | 0.05494 | 1.0000 |

How to read that table:

  • clostera-fastest is the throughput configuration. It is the answer when raw encode speed matters most.
  • clostera-quality is the quality configuration. It spends more time on rotation but cuts reconstruction MSE by 2.25x versus clostera-fastest and by 2.76x versus the original implementation.
  • Even before OPQ, the Rust rewrite already beats the original implementation on both throughput and cluster quality.

10M by 2048 benchmark figure

K sweep: how the number of clusters changes runtime

We also ran a deterministic K sweep on the same 200k x 2048 block-mixed family used in the benchmark suite. Here K means the number of clusters. This isolates the clustering stage: each implementation trains and encodes once, then we sweep K = 16, 32, 64, 128, 256 over the same PQ codes.

| K (number of clusters) | original cluster s | clostera-fastest cluster s | original / clostera-fastest speedup |
| --- | --- | --- | --- |
| 16 | 1.088 | 0.047 | 22.92x |
| 32 | 1.404 | 0.064 | 21.83x |
| 64 | 1.488 | 0.111 | 13.43x |
| 128 | 1.597 | 0.205 | 7.80x |
| 256 | 1.646 | 0.315 | 5.22x |

What this sweep says:

  • The original implementation slows steadily as K rises and stays well behind clostera-fastest at every point in the published sweep.
  • The important point is not just the ranking. It is that clostera-fastest keeps clustering comfortably sub-second through K = 256 clusters on 200k x 2048, while the original implementation stays well above the one-second mark.

Clustering time versus K (number of clusters) on deterministic block mixed data

N sweep: how runtime scales with dataset size

We also fixed the algorithm configuration at K = 64 clusters, M = 64, Ks = 64 and swept the deterministic 2048-dimensional block-mixed dataset from 50k to 800k rows. Each point below uses a 16,384-row warm-up and reports the median of 3 timing runs, so the curve reflects steady-state runtime rather than first-call overhead.

| N | original encode s | clostera-fastest encode s | Encode speedup | original cluster s | clostera-fastest cluster s | Cluster speedup |
| --- | --- | --- | --- | --- | --- | --- |
| 50k | 0.680 | 0.037 | 18.39x | 0.295 | 0.032 | 9.11x |
| 100k | 1.925 | 0.073 | 26.41x | 0.602 | 0.057 | 10.64x |
| 200k | 3.697 | 0.145 | 25.47x | 1.258 | 0.109 | 11.58x |
| 400k | 6.921 | 0.298 | 23.25x | 2.851 | 0.185 | 15.41x |
| 800k | 12.873 | 0.641 | 20.09x | 5.680 | 0.372 | 15.28x |

What this sweep says:

  • Encode cost is close to linear in N for every implementation, but the slope is radically different: clostera-fastest holds roughly 1.25M to 1.54M vectors/s once the warm-up is out of the way, while the original implementation stays near 52k to 74k vectors/s.
  • At fixed K = 64 clusters, clustering also scales cleanly with dataset size. clostera-fastest stays about 9x to 15x faster than the original implementation across the full sweep.
  • The main point for capacity planning is that scaling by N looks predictable, not erratic. That matters when you are extrapolating from pilot runs to hundreds of millions or billions of vectors.

Encoding and clustering time versus dataset size on deterministic block mixed data

Distribution suite: speed and quality across different data families

We do not benchmark on one flattering Gaussian and declare victory. The committed suite now runs deterministic 10M-vector workloads for:

  • Gaussian data
  • anisotropic Gaussian data
  • Student-t heavy-tailed data
  • block-mixed 2048-dimensional data

For each scenario we track:

  • encode throughput
  • clustering throughput
  • reconstruction MSE
  • purity
  • adjusted Rand index
  • normalized mutual information
  • v-measure
  • assigned-center MSE

Across the suite:

  • clostera-fastest improves encode throughput over the original implementation by 25.35x to 32.72x.
  • clostera-quality reduces reconstruction error by 2.40x to 3.74x relative to clostera-fastest.
  • on end-to-end pipeline time, clostera-quality is faster than the original implementation on every committed 10M-vector suite scenario.
  • the original implementation is slower and has visibly worse clustering quality on every committed scenario.

Reconstruction error across deterministic datasets

Clustering purity across deterministic datasets

🍏 Apple Silicon is a first-class target

Modern ARM machines are not a side quest. clostera treats them like real production hardware.

  • aarch64 uses native NEON distance kernels for the common PQ subvector sizes 4, 8, 16, 32, and 64.
  • The PQ assignment path is no longer “build a buffer and scan it later”. It now uses a fused lookup-accumulate-and-select kernel plus SIMD-backed argmin, which matters on Apple Silicon because clustering on PQ codes is often dominated by assignment rather than raw distance evaluation.
  • The release workflow builds macOS arm64 wheels alongside x86_64 wheels.
  • The same wheel matrix also covers manylinux_2_28 x86_64 and aarch64.
  • The release configuration uses openblas-static so published wheels are as self-contained as practical.

If you are running on Apple Silicon, this is not a Rosetta fallback story. There is architecture-specific code in the hot path and packaging support in the release pipeline.
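To see why assignment dominates, here is what code-space assignment looks like conceptually: per-subspace codeword-distance tables are gathered and accumulated, then argmin picks the center. This is an unfused NumPy sketch for intuition; clostera's NEON kernel fuses the lookup, accumulate, and select steps:

```python
import numpy as np

def assign_codes(codes, centroid_codes, codeword_dist):
    """Assign PQ-coded points to PQ-coded centers via lookup tables.

    codes: (N, M) uint8, centroid_codes: (K, M) uint8,
    codeword_dist: (M, Ks, Ks) squared distances between codewords."""
    n, m = codes.shape
    k = centroid_codes.shape[0]
    dist = np.zeros((n, k), dtype=np.float32)
    for j in range(m):
        # rows indexed by each point's code, columns by each center's code
        dist += codeword_dist[j][codes[:, j]][:, centroid_codes[:, j]]
    return dist.argmin(axis=1)
```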

🔧 Under the hood: better initialization, less luck

One of the quietest but most important differences from the original repository is that clostera treats initialization like a real engineering problem instead of a footnote.

  • PQEncoder uses deterministic PCA-quantile initialization per subspace, rather than hoping random point picks land in a good configuration.
  • PQKMeans uses deterministic farthest-first seeding in PQ code space for better initial coverage.
  • The default quality path refines an orthogonal rotation before final PQ training, which is where the large OPQ quality gains come from on correlated high-dimensional data.

That shows up as more stable training, fewer pathological runs, and better quality at the same code budget. The headline speedups are not coming from luckier random seeds.
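The farthest-first idea is simple enough to sketch in a few lines. This illustrative version works on plain Euclidean points; clostera applies the same greedy rule in PQ code space:

```python
import numpy as np

def farthest_first(points: np.ndarray, k: int) -> list[int]:
    """Deterministic farthest-first seeding sketch: start from point 0,
    then repeatedly add the point farthest from all chosen seeds."""
    chosen = [0]
    d = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        nxt = int(d.argmax())
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(points - points[nxt], axis=1))
    return chosen
```

The greedy rule guarantees good initial coverage of the data's extremes without any randomness to get lucky or unlucky with.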

Installation

PyPI

pip install clostera

Optional extras:

pip install "clostera[benchmarks]"
pip install "clostera[notebook]"

Build from source

System BLAS/LAPACK build:

python -m pip install maturin
python -m maturin develop --release

Static OpenBLAS build:

python -m maturin develop --release --no-default-features --features openblas-static

More common workflows

Simple workflow

import numpy as np
import clostera

rng = np.random.default_rng(7)
vectors = rng.normal(size=(100_000, 128)).astype(np.float32)

clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform(vectors)

print(clusterer.selected_k_)  # selected K = selected number of clusters

Known number-of-clusters (K) workflow

clusterer = clostera.Clusterer(k=known_k)  # known_k = desired number of clusters
labels = clusterer.fit_transform(vectors)

Fastest throughput workflow with a known number of clusters (K)

clusterer = clostera.Clusterer(k=known_k, fastest=True)  # known_k = desired number of clusters
labels = clusterer.fit_transform(vectors)

Predict on new vectors

clusterer = clostera.Clusterer(k=known_k)  # known_k = desired number of clusters
clusterer.fit(vectors)
labels = clusterer.transform(vectors[:1024])

Parquet workflow

clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform("vectors.parquet")

Out-of-core raw-vector workflow

When the original float vectors do not fit in RAM, pass a parquet path or a numpy.memmap-backed matrix and set max_ram_bytes.

clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform(
    "vectors.parquet",
    max_ram_bytes=8 << 30,
)

With max_ram_bytes, clostera keeps the training sample bounded, streams raw vectors in batches during encoding, and automatically spills PQ codes to a temporary memmap when needed. The raw vector matrix no longer needs to fit in RAM all at once. If you already materialized the data as a normal in-memory ndarray, clostera can only bound its own additional working set; for truly out-of-core runs, use parquet or numpy.memmap.
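Mechanically, a RAM budget like this implies batched row iteration. A hypothetical helper showing the idea (illustrative only, not clostera's internals):

```python
import numpy as np

def iter_batches(mat: np.ndarray, max_ram_bytes: int):
    """Yield row batches of `mat` (e.g. a numpy.memmap) whose in-memory
    size stays under `max_ram_bytes`. Hypothetical illustration."""
    row_bytes = mat.shape[1] * mat.dtype.itemsize
    rows = max(1, max_ram_bytes // row_bytes)
    for start in range(0, mat.shape[0], rows):
        yield np.asarray(mat[start:start + rows])
```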

Advanced API

Most users should start with Clusterer. The lower-level building blocks are still available when you want to:

  • reuse encoded PQ codes across many clustering runs
  • fit encoders and clusterers separately
  • switch explicitly between plain PQ and OPQ
  • tune encoder-specific and clusterer-specific parameters independently

Use Clusterer(fastest=True) when you want the fastest high-level path. Use plain PQEncoder and PQKMeans when you need that same plain-PQ behavior with explicit control. Use OPQEncoder and OPQMeans when reconstruction fidelity matters more and the data has strong cross-subspace correlation.

If you omit num_subquantizers, clostera infers a sensible default from the input dimensionality. For typical embeddings that lands near sqrt(D) code bytes while keeping each subvector wide enough to stay stable.
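A hypothetical reconstruction of that default (the exact rule clostera uses may differ): pick a divisor of D close to sqrt(D), so the code stays compact while each subvector keeps enough dimensions:

```python
import math

def infer_num_subquantizers(dim: int) -> int:
    """Sketch of a sqrt(D)-style default: choose a divisor of `dim`
    closest to sqrt(dim) so subvectors stay evenly sized. Illustrative."""
    target = math.isqrt(dim)
    divisors = [m for m in range(1, dim + 1) if dim % m == 0]
    return min(divisors, key=lambda m: abs(m - target))

print(infer_num_subquantizers(2048))
```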

encoder = clostera.PQEncoder()
encoder.fit(vectors)
codes = encoder.transform(vectors)

clusterer = clostera.PQKMeans(encoder=encoder, k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform(codes)

Showcase notebook

The repository includes a walkthrough notebook designed for readers who want the full visual story:

The committed notebook embeds its static figures directly, so the visuals render in GitHub and standalone notebook viewers without depending on external image paths.

It covers:

  • the high-level Clusterer workflow
  • automatic number-of-clusters (K) selection with k=None
  • parquet workflows
  • toy clustering visualization
  • plain PQ versus OPQ reconstruction quality
  • the advanced encoder/clusterer split when you need it
  • cross-dataset benchmark comparisons
  • the large-scale 10M x 2048 checkpoint
  • K (number of clusters) and N scaling sweeps

Parameter reference

In the API tables below, PathLike means a plain path string or a pathlib.Path object.

Clusterer

Clusterer is the default high-level API. It hides the encoder/clusterer split and gives the common workflow a simple fit, transform, fit_transform, fit_predict, and predict surface. By default it uses the quality-first OPQ path; pass fastest=True when you want the maximum-throughput plain-PQ path instead.

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| k | int \| None | None | Number of target clusters. Here K means the number of clusters. None enables automatic number-of-clusters selection. |
| fastest | bool | False | Turn off OPQ and use the maximum-throughput plain-PQ path. This usually lowers reconstruction quality but can reduce total fit time substantially on large runs. |
| num_subquantizers | int \| None | None | Optional PQ subspace count. When omitted, clostera infers a deterministic default from the input dimensionality. |
| codebook_size | int | 256 | Number of codewords per subspace. |
| iterations | int | 20 | Shared iteration budget for the simple high-level API. |
| seed | int | 0 | Deterministic seed. |
| opq_iterations | int | 3 | OPQ refinement steps used on the default quality-first path. When fastest=True, the current code always uses plain PQ and ignores this setting. |
| verbose | bool | False | Emit inertia diagnostics during fitting. |
| lookup_table_bytes | int | 1 << 30 | Memory budget for code-domain lookup tables. Larger budgets favor faster assignment. |
| auto_k_method | str | "centroid_silhouette" | Automatic-number-of-clusters (K) scoring rule. Supported values are "centroid_silhouette", "davies_bouldin", "elbow", and "bic". |
| auto_k_candidates | list[int] \| tuple[int, ...] \| np.ndarray \| None | None | Explicit candidate K values (cluster counts) to test when k=None. If omitted, clostera builds a default candidate template automatically, including practical values such as 4, 6, 8, 12, 16, 24, and 32 when the dataset size supports them. |
| auto_k_min | int | 2 | Lower bound for automatically generated candidate values when auto_k_candidates is omitted. |
| auto_k_max | int \| None | None | Upper bound for automatically generated candidate values when auto_k_candidates is omitted. |
| auto_k_step | int \| None | None | Optional arithmetic step for generated candidates. If omitted, clostera uses a baked-in candidate template. |
| auto_k_sample_rows | int | 16_384 | Number of PQ codes sampled for the Rust-side candidate analysis pass. |

Clusterer.fit(...), transform(...), fit_transform(...), fit_predict(...), predict(...)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| data | np.ndarray \| PathLike | required | Raw float vectors as an array, parquet path, or numpy.memmap-backed matrix. |
| parquet_column | str \| None | None | Specific parquet vector column. |
| batch_size | int | 65_536 | Parquet streaming batch size. |
| codes_output_path | PathLike \| None | None | Optional memmap destination when raw parquet input must be encoded first. |
| max_ram_bytes | int \| None | None | Optional RAM budget for bounded-memory raw-vector workflows. |

Advanced access after fitting:

  • encoder_: the fitted PQEncoder or OPQEncoder
  • clusterer_: the fitted PQKMeans or OPQMeans
  • labels_, cluster_centers_, inertia_history_, selected_k_, k_selection_

Advanced low-level API

The classes below expose the encoder/clusterer split directly. Reach for them when you want to reuse PQ codes, separate training phases, or tune encoder-specific and clusterer-specific parameters independently.

PQEncoder

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| num_subquantizers | int \| None | None | Number of PQ subspaces M. When omitted, clostera infers a deterministic default from the input dimensionality. Explicit values still require the dimensionality to be divisible by M. |
| codebook_size | int | 256 | Number of codewords per subspace Ks. Supported range is 2..=256. |
| iterations | int | 20 | Number of Lloyd iterations for subspace k-means training. |
| seed | int | 0 | Deterministic seed used for initialization fallback and reproducible training behavior. |
| opq_iterations | int | 0 | Number of OPQ refinement steps. 0 keeps plain PQ, >0 learns an orthogonal rotation before final PQ training. |

OPQEncoder

OPQEncoder has the same API and runtime methods as PQEncoder, but defaults opq_iterations to 3.
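Conceptually, each OPQ refinement step solves an orthogonal Procrustes problem: find the rotation that best aligns the data with its current PQ reconstruction. A minimal NumPy sketch of one such step (illustrative only; the Rust implementation batches and iterates this together with codebook retraining):

```python
import numpy as np

def refine_rotation(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """One OPQ-style refinement step (sketch): the orthogonal R minimizing
    ||X @ R - Y||_F, via the orthogonal Procrustes solution."""
    u, _, vt = np.linalg.svd(X.T @ Y)
    return u @ vt  # guaranteed orthogonal
```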

PQEncoder.fit(...)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| data | np.ndarray \| PathLike | required | A dense float32 matrix or a parquet path. |
| parquet_column | str \| None | None | Specific parquet column to treat as the vector column. |
| batch_size | int | 65_536 | Batch size for parquet streaming. |
| train_rows | int \| None | None | Number of deterministic training rows to sample. With in-memory arrays, omitting this uses the full matrix unless max_ram_bytes is set. |
| max_ram_bytes | int \| None | None | Optional RAM budget for the training sample plus OPQ workspace. When set, large parquet or memmap-backed inputs are trained from a bounded deterministic sample. |

PQEncoder.transform(...)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| data | np.ndarray \| PathLike | required | Dense vectors or parquet input. |
| parquet_column | str \| None | None | Specific parquet vector column. |
| batch_size | int | 65_536 | Parquet streaming batch size. |
| output_path | PathLike \| None | None | Optional destination for a memory-mapped uint8 code matrix. |
| max_ram_bytes | int \| None | None | Optional RAM budget for batched encoding. Large raw-vector inputs are processed in chunks; if codes would not fit in RAM, provide output_path or call PQKMeans.fit(...) directly. |

PQEncoder.fit_transform(...)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| data | np.ndarray \| PathLike | required | A dense float32 matrix or a parquet path. |
| parquet_column | str \| None | None | Specific parquet column to treat as the vector column. |
| batch_size | int | 65_536 | Parquet streaming batch size. |
| train_rows | int \| None | None | Number of deterministic training rows to sample before encoding. |
| output_path | PathLike \| None | None | Optional destination for a memory-mapped uint8 code matrix produced by the transform phase. |
| max_ram_bytes | int \| None | None | Optional RAM budget applied to both training and encoding. |

PQEncoder.inverse_transform(...)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| codes | np.ndarray | required | A 2D PQ code matrix with shape (rows, num_subquantizers). Returns decoded float32 vectors. |

PQKMeans

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| encoder | PQEncoder | required | Trained encoder that defines the codebooks. |
| k | int \| None | None | Number of target clusters. Here K means the number of clusters. None enables Rust-side automatic number-of-clusters selection over candidate values in PQ code space. |
| iterations | int | 20 | Number of clustering update rounds. |
| seed | int | 0 | Deterministic seed for cluster-center initialization. |
| verbose | bool | False | Emit inertia diagnostics during fitting. |
| lookup_table_bytes | int | 1 << 30 | Memory budget for code-domain lookup tables. Larger budgets favor faster assignment. |
| auto_k_method | str | "centroid_silhouette" | Automatic-number-of-clusters (K) scoring rule. Supported values are "centroid_silhouette", "davies_bouldin", "elbow", and "bic". |
| auto_k_candidates | list[int] \| tuple[int, ...] \| np.ndarray \| None | None | Explicit candidate K values (cluster counts) to test when k=None. If omitted, clostera builds a default candidate template automatically, including practical values such as 4, 6, 8, 12, 16, 24, and 32 when the dataset size supports them. |
| auto_k_min | int | 2 | Lower bound for automatically generated candidate values when auto_k_candidates is omitted. |
| auto_k_max | int \| None | None | Upper bound for automatically generated candidate values when auto_k_candidates is omitted. |
| auto_k_step | int \| None | None | Optional arithmetic step for generated candidates. If omitted, clostera uses a baked-in candidate template. |
| auto_k_sample_rows | int | 16_384 | Number of PQ codes sampled for the Rust-side candidate analysis pass. |

OPQMeans

OPQMeans mirrors PQKMeans, but treats OPQ as the default rather than an extra knob. If you do not pass encoder=, it lazily creates and fits an OPQEncoder from the raw vectors or parquet source on first fit(...), fit_predict(...), or fit_transform(...). If you do pass encoder=, the current code requires it to have been trained with opq_iterations > 0.

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| encoder | PQEncoder \| None | None | Optional pre-trained OPQ encoder. If omitted, OPQMeans builds one automatically. |
| num_subquantizers | int \| None | None | Optional encoder-side PQ subspace count when encoder is omitted. |
| codebook_size | int | 256 | Optional encoder-side codebook size when encoder is omitted. |
| encoder_iterations | int | 20 | Encoder training iterations used when encoder is omitted. |
| seed | int | 0 | Deterministic seed shared by the implicit encoder and the clusterer. |
| opq_iterations | int | 3 | OPQ refinement steps used by the implicit encoder. |
| k | int \| None | None | Number of target clusters. Here K means the number of clusters. None enables Rust-side automatic number-of-clusters selection over candidate values in PQ code space. |
| iterations | int | 20 | Number of clustering update rounds. |
| verbose | bool | False | Emit inertia diagnostics during fitting. |
| lookup_table_bytes | int | 1 << 30 | Memory budget for code-domain lookup tables. Larger budgets favor faster assignment. |
| auto_k_method | str | "centroid_silhouette" | Automatic-number-of-clusters (K) scoring rule. Supported values are "centroid_silhouette", "davies_bouldin", "elbow", and "bic". |
| auto_k_candidates | list[int] \| tuple[int, ...] \| np.ndarray \| None | None | Explicit candidate K values (cluster counts) to test when k=None. If omitted, clostera builds a default candidate template automatically, including practical values such as 4, 6, 8, 12, 16, 24, and 32 when the dataset size supports them. |
| auto_k_min | int | 2 | Lower bound for automatically generated candidate values when auto_k_candidates is omitted. |
| auto_k_max | int \| None | None | Upper bound for automatically generated candidate values when auto_k_candidates is omitted. |
| auto_k_step | int \| None | None | Optional arithmetic step for generated candidates. If omitted, clostera uses a baked-in candidate template. |
| auto_k_sample_rows | int | 16_384 | Number of PQ codes sampled for the Rust-side candidate analysis pass. |

OPQMeans uses the same runtime method signatures as PQKMeans: fit(...), transform(...), fit_transform(...), fit_predict(...), and predict(...).

PQKMeans.fit(...), transform(...), fit_transform(...), fit_predict(...), predict(...)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| data | np.ndarray \| PathLike | required | Either raw vectors or precomputed PQ codes. |
| parquet_column | str \| None | None | Specific parquet vector column. |
| batch_size | int | 65_536 | Parquet streaming batch size. |
| codes_output_path | PathLike \| None | None | Optional memmap destination when raw parquet input must be encoded first. |
| max_ram_bytes | int \| None | None | Optional RAM budget for encoding raw vectors into PQ codes before clustering. When set and no codes_output_path is supplied, clostera creates a temporary memmap automatically. |

When k=None, fitting also populates:

  • selected_k_: the final chosen cluster count (K)
  • k_selection_: the full Rust-side selection report, including the tested candidate values and per-method scores

Advanced runtime knob

| Environment variable | Meaning |
| --- | --- |
| CLOSTERA_ROTATION_BATCH_MIB | Override the default OPQ rotation batch target in MiB for benchmarking or machine-specific tuning. |

Reproducing the benchmark artifacts

Generate a deterministic synthetic dataset

python scripts/generate_synthetic_dataset.py \
  --output-dir .artifacts/block-mixed-200k-2048 \
  --distribution block_mixed \
  --rows 200000 \
  --dim 2048 \
  --clusters 64 \
  --seed 11

Compare the original repo and clostera

python scripts/compare_impls.py \
  --dataset-dir .artifacts/block-mixed-200k-2048 \
  --original-python "$(which python)" \
  --enhanced-python "$(which python)" \
  --train-rows 32768 \
  --metric-sample-rows 32768 \
  --num-subquantizers 64 \
  --codebook-size 64 \
  --pq-iterations 6 \
  --cluster-k 64 \
  --cluster-iterations 4 \
  --opq-iterations 3 \
  --blas-threads 24 \
  --omp-threads 24 \
  --rayon-threads 24 \
  --rotation-batch-mib 32 \
  --output-json .artifacts/block-mixed-200k-2048/compare.json

Run the K (number of clusters) sweep

python scripts/benchmark_k_sweep.py \
  --dataset-dir .artifacts/k-sweep-block-mixed-200k-2048 \
  --output-json benchmarks/results/k-sweep.json \
  --original-python "$(which python)" \
  --enhanced-python "$(which python)" \
  --force

Run the N sweep

python scripts/benchmark_n_sweep.py \
  --dataset-dir .artifacts/n-sweep-block-mixed-800k-2048 \
  --output-json benchmarks/results/n-sweep.json \
  --original-python "$(which python)" \
  --enhanced-python "$(which python)" \
  --force

Run the full deterministic distribution suite

python scripts/benchmark_suite.py \
  --output-dir .artifacts/benchmark-suite \
  --original-python "$(which python)" \
  --enhanced-python "$(which python)" \
  --blas-threads 24 \
  --omp-threads 24 \
  --rayon-threads 24 \
  --rotation-batch-mib 32 \
  --force

Run the automatic number-of-clusters (K) selection sweep

python scripts/evaluate_auto_k_methods.py \
  --output-json benchmarks/results/auto-k-methods.json \
  --force

Render the README and notebook figures

python scripts/render_benchmark_assets.py \
  --suite-json benchmarks/results/benchmark-suite.json \
  --large-json benchmarks/results/large-scale-10m.json \
  --k-sweep-json benchmarks/results/k-sweep.json \
  --n-sweep-json benchmarks/results/n-sweep.json \
  --auto-k-json benchmarks/results/auto-k-methods.json \
  --output-dir docs/assets

Packaging and release

The repository already includes publication artifacts for:

  • manylinux_2_28 wheels for x86_64 and aarch64
  • macOS wheels for x86_64 and arm64
  • CPython 3.10 through 3.13
  • source distributions

Relevant files:

  • .github/workflows/ci.yml
  • .github/workflows/release.yml
  • rust-toolchain.toml

The release workflow builds wheels with openblas-static enabled so binary installs are as self-contained as practical.

Releasing to PyPI

The PyPI project name is clostera.

Once the one-time PyPI Trusted Publisher setup is done for:

  • owner: BaseModelAI
  • repository: clostera
  • workflow: .github/workflows/release.yml
  • environment: pypi

the normal release path is:

python scripts/release.py 1.0.3 --commit --tag --push

That updates the version in the release metadata, creates the release commit, creates tag v1.0.3, and pushes both to origin. The tag push triggers the GitHub release workflow, which builds the wheels and publishes them to PyPI.

Original project and related work

Original implementation

Core papers behind this repo

Useful related reading

Verification

Current local verification commands:

python -m maturin develop --release
cargo test --release
pytest -q
cargo check --no-default-features --features openblas-static
cargo bench --bench core_bench

About

Billion scale vector clustering. One Machine. Zero GPUs.
