They told you that clustering massive high-dimensional vector collections on a single machine was a fool's errand. They said you needed a cluster, a distributed headache, and a cloud bill large enough to ruin your week. They were wrong.
clostera is a from-scratch Rust rebuild of the original pqkmeans repository, aimed at the workloads that made that project exciting in the first place: extremely large vector collections, high dimensionality, single-machine practicality, and performance that is measured rather than hoped for.
This is not a thin wrapper around old code. It is a modern rewrite with a new Rust core, a NumPy-first Python layer, parquet and out-of-core workflows, deterministic benchmarks, automatic number-of-clusters (K) selection, Apple Silicon support, and wheels that install like a normal Python package.
Rust core • Rayon • OpenBLAS/LAPACK • AVX2/SSE • Apple Silicon NEON • NumPy + parquet • manylinux + macOS wheels
```
pip install clostera
```

Why billion-scale clustering?
The short answer is that it is genuinely useful. If you work with embeddings, recommendations, retrieval, representation learning, semantic search, or large behavioral datasets, clustering at very large scale is not academic theater. It is operationally important.
But for 🦋 clostera, that is only part of the story.
The deeper reason is historical and conceptual. The extreme efficiency and mathematical elegance of the original pqkmeans algorithm indirectly helped inspire the development of EMDE, and later a much stronger internal family of TREMDE algorithms. Together with internal proprietary evolutions of 🦋 Cleora, those ideas form a major part of the conceptual foundation behind BaseModel.AI, Synerise's flagship product.
That is why this rewrite exists. The original project mattered. It influenced real systems, real products, and real lines of research. Left unmaintained, it deserved a modern successor: faster, cleaner, easier to install, easier to use, and built for current hardware instead of the past.
Origins of the Clostera name
At Synerise, we have a tradition of finding algorithmic inspiration in the natural world: the quiet, hyper-efficient mechanics of the moth.
Just as we look to 🦋 Cleora to capture the geometry and distance calculations of our hyperspherical embeddings, we turned to the 🦋 Clostera moth to represent the colossal mechanics of billion-scale clustering.
In taxonomy, 🦋 Clostera is a genus of prominent moths known for their robust build and rapid flight. But the true magic lies in the origin of the name. Derived from the ancient Greek word klōstēr (κλωστήρ), "🦋 Clostera" literally means "the spindle".
A spindle's sole purpose is to take raw, chaotic, disconnected fibers and rapidly rotate them, pulling them tightly around a central core to spin them into structured, organized threads.
In machine learning, your billion-scale dataset is that chaotic fleece.
🦋 Clostera is your algorithmic spindle. It acts as a high-speed rotational force, drawing billions of isolated vectors toward a shared center of mass, the centroid. It takes the noise, finds the pattern, and binds your scattered data into structured clusters.
Fast, robust, and mathematically grounded. Welcome to the 🦋 Clostera era.
```python
import numpy as np
import clostera

vectors = np.load("vectors.npy").astype(np.float32)

clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform(vectors)
print(clusterer.selected_k_)  # selected K = selected number of clusters
```

That is the default story: one object, raw vectors in, labels out, OPQ-enabled quality path by default, and automatic number-of-clusters (K) selection when you do not know the answer up front.
```python
clusterer = clostera.Clusterer(k=256, fastest=True)  # K = number of clusters
labels = clusterer.fit_transform(vectors)
```

`fastest=True` turns off OPQ and uses the plain PQ path. That is the right choice when end-to-end throughput matters more than reconstruction quality. The main speed win is in encoder training and encoding; the final compressed-assignment stage is already fast in both modes.
```python
clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform("vectors.parquet")
```

If the original float vectors do not fit comfortably in RAM, add `max_ram_bytes=...`. If they do fit, you do not need to think about it.
The original repository proved a powerful idea: by clustering in PQ code space instead of dense float space, single-machine clustering suddenly stops sounding ridiculous. That idea aged well. The surrounding implementation did not.
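For intuition, the core idea — representing each vector by one codeword index per subspace, then working with those compact codes — can be sketched in a few lines of pure NumPy. This is an illustrative toy with naive randomly-picked codebooks, not clostera's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 32)).astype(np.float32)  # N=1000, D=32
M, Ks = 4, 16  # 4 subspaces, 16 codewords each -> 4 bytes per vector

subvecs = vectors.reshape(1000, M, 32 // M)  # split each vector into M subvectors
# naive codebooks: sample Ks vectors and reuse their subvectors as codewords
codebooks = subvecs[rng.choice(1000, Ks, replace=False)].transpose(1, 0, 2)  # (M, Ks, D/M)

# encode: nearest codeword per subspace
dists = ((subvecs[:, :, None, :] - codebooks[None]) ** 2).sum(-1)  # (N, M, Ks)
codes = dists.argmin(-1).astype(np.uint8)                          # (N, M) PQ codes

print(codes.shape, codes.dtype)  # (1000, 4) uint8: 4 bytes instead of 128
```

Clustering then happens on the `(N, M)` uint8 code matrix with code-domain distances, never on the dense floats.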
clostera asks the obvious follow-up question: what happens if you rebuild the original `pqkmeans` project properly for modern hardware and modern Python workflows?
On the committed deterministic 10M x 2048 checkpoint, the answer is not subtle.
| Metric (10M x 2048) | original | clostera-fastest | clostera-quality |
|---|---|---|---|
| Encode time | 222.94s | 7.24s | 131.34s |
| Cluster time | 80.19s | 4.50s | 4.39s |
| Reconstruction MSE | 0.15160 | 0.12354 | 0.05494 |
| Purity | 0.6573 | 1.0000 | 1.0000 |
That means:
- `30.8x` faster encoding than the original implementation on the headline checkpoint.
- `17.8x` faster clustering on the same full-core run.
- Better clustering quality even on the fastest path.
- A quality-first OPQ mode that dramatically lowers reconstruction error when fidelity matters more than raw throughput.
At billion-vector scale, the algorithm is only half the story. Memory movement is usually the real bottleneck.
clostera is built around that reality:
- raw `numpy.ndarray` input works out of the box
- parquet is a first-class input format
- fixed-size-list vector columns and plain numeric scalar columns are both supported
- `max_ram_bytes` bounds the working set when the original float vectors do not fit
- raw vectors can be streamed while PQ codes spill to disk automatically when needed
- `numpy.memmap` fits naturally into the same workflow
This is the practical difference between a paper result and a pipeline you can actually operate.
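A back-of-the-envelope calculation makes the point concrete. Using the headline checkpoint's shape and the `M = 64` subquantizer configuration from the sweeps:

```python
# 10M vectors x 2048 dims in float32 versus 64-byte PQ codes
n, d = 10_000_000, 2048
raw_bytes = n * d * 4   # float32: 4 bytes per dimension
code_bytes = n * 64     # one uint8 code per subquantizer, M = 64

print(f"raw floats: {raw_bytes / 2**30:.1f} GiB")     # ~76.3 GiB
print(f"PQ codes:   {code_bytes / 2**30:.2f} GiB")    # ~0.60 GiB
print(f"compression: {raw_bytes // code_bytes}x")     # 128x
```

The dense float matrix alone exceeds the RAM of most single machines; the PQ code matrix fits comfortably, which is exactly what makes single-machine clustering practical.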
Choosing K (the number of clusters) used to mean elbow plots, trial-and-error, and pretending you were more certain than you really were.
clostera lets you pass `k=None` to `Clusterer`, `PQKMeans`, or `OPQMeans` when you do not know the number of clusters in advance. The candidate analysis runs in Rust, reuses the already-trained encoder and the already-encoded PQ code matrix, and does not regenerate the expensive intermediate artifacts for each candidate number of clusters (K).

On the committed deterministic benchmark sweep, the default `centroid_silhouette` selector recovered the exact true cluster count in 20/20 cases.
- `centroid_silhouette`: `20/20` exact matches, `0.00` mean absolute error
- `davies_bouldin`: `18/20` exact matches, `0.90` mean absolute error
- `elbow`: `18/20` exact matches, `1.60` mean absolute error
- `bic`: `3/20` exact matches, `50.40` mean absolute error
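For a feel of why the simple elbow rule trails the stronger selectors, here is the classic second-difference elbow pick on a synthetic inertia curve. This is a generic textbook sketch with made-up numbers, not clostera's selector:

```python
import numpy as np

candidates = np.array([4, 8, 16, 32, 64, 128])
# synthetic inertia curve: steep drop until the true K=16, then nearly flat
inertia = np.array([100.0, 60.0, 20.0, 18.0, 17.0, 16.5])

# elbow = candidate with the largest curvature (second difference) of the curve
curvature = np.diff(inertia, 2)            # aligned with candidates[1:-1]
elbow_k = int(candidates[1:-1][curvature.argmax()])
print(elbow_k)  # 16
```

On clean curves like this the elbow works; its misses come from curves where the drop-off is gradual and no single point dominates the curvature.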
clostera is built for people who care about practical speed, reproducibility, and a sane deployment story.
- `Clusterer` is the simple default API for normal use.
- `fastest=True` gives you the maximum-throughput plain-PQ path.
- The default path keeps OPQ on and favors quality.
- The advanced split into `PQEncoder`/`PQKMeans` and `OPQEncoder`/`OPQMeans` is still there when you need it.
- The hot paths use full-core Rust + Rayon, BLAS/LAPACK-backed dense math, x86 SIMD, and Apple Silicon NEON kernels.
- Wheels are built for `manylinux_2_28` `x86_64` and `aarch64`, plus macOS `x86_64` and `arm64`.
- Deterministic seeds, deterministic synthetic datasets, and committed benchmark artifacts make the claims inspectable.
The original project matters because it proved the idea. clostera exists because that idea deserved a modern implementation.
| Area | Original pqkmeans | clostera |
|---|---|---|
| Core implementation | Older Python/C++ reference stack | Rust core with PyO3 bindings and maturin packaging |
| PQ codebook initialization | Basic point-picked initialization | Deterministic PCA-quantile seeding with deterministic fallback |
| Cluster initialization | Random center picking in PQ code space | Deterministic farthest-first seeding in PQ code space |
| Quality modes | Plain PQ | Default OPQ-backed quality path plus an explicit fastest plain-PQ mode |
| Choosing K (number of clusters) | User supplies K | User supplies K or lets Rust-side auto-selection choose it with `k=None` |
| CPU path | OpenMP-era reference implementation | Rayon-parallel hot paths, BLAS/LAPACK-backed math, x86 SIMD, Apple Silicon NEON |
| Python workflows | NumPy-centric | NumPy arrays, parquet streaming, memmapped code output, RAM-bounded out-of-core workflows, deterministic synthetic datasets |
| Packaging | Source build expectations | manylinux_2_28 x86_64 and aarch64, macOS x86_64 and arm64, CPython 3.10 through 3.13 |
| Benchmarking | Research notebooks and limited comparison artifacts | Deterministic benchmark suite with throughput and clustering-quality metrics, plots, and a showcase notebook |
The README carries committed, deterministic benchmarks because this project should win on numbers, not adjectives.
This is the scale checkpoint the rewrite has to answer for: 64 clusters, one machine, and a dataset large enough that hand-waving stops being useful.
Thread settings used for the max-throughput configuration:
- `24` BLAS threads
- `24` OpenMP threads
- `24` Rayon threads
| Variant | Encode s | Cluster s | Recon MSE | Purity |
|---|---|---|---|---|
| original | 222.94 | 80.19 | 0.15160 | 0.6573 |
| clostera-fastest | 7.24 | 4.50 | 0.12354 | 1.0000 |
| clostera-quality | 131.34 | 4.39 | 0.05494 | 1.0000 |
How to read that table:
- `clostera-fastest` is the throughput configuration. It is the answer when raw encode speed matters most.
- `clostera-quality` is the quality configuration. It spends more time on rotation but cuts reconstruction MSE by `2.25x` versus `clostera-fastest` and by `2.76x` versus the original implementation.
- Even before OPQ, the Rust rewrite already beats the original implementation on both throughput and cluster quality.
We also ran a deterministic K sweep on the same 200k x 2048 block-mixed family used in the benchmark suite. Here K means the number of clusters. This isolates the clustering stage: each implementation trains and encodes once, then we sweep K = 16, 32, 64, 128, 256 over the same PQ codes.
| K (number of clusters) | original cluster s | clostera-fastest cluster s | original / clostera-fastest speedup |
|---|---|---|---|
| 16 | 1.088 | 0.047 | 22.92x |
| 32 | 1.404 | 0.064 | 21.83x |
| 64 | 1.488 | 0.111 | 13.43x |
| 128 | 1.597 | 0.205 | 7.80x |
| 256 | 1.646 | 0.315 | 5.22x |
What this sweep says:
- The original implementation slows steadily as `K` rises and stays well behind `clostera-fastest` at every point in the published sweep.
- The important point is not just the ranking: `clostera-fastest` keeps clustering comfortably sub-second through `K = 256` clusters on `200k x 2048`, while the original implementation stays well above the one-second mark.
We also fixed the algorithm configuration at K = 64 clusters, M = 64, Ks = 64 and swept the deterministic 2048-dimensional block-mixed dataset from 50k to 800k rows. Each point below uses a 16,384-row warm-up and reports the median of 3 timing runs, so the curve reflects steady-state runtime rather than first-call overhead.
| N | original encode s | clostera-fastest encode s | Encode speedup | original cluster s | clostera-fastest cluster s | Cluster speedup |
|---|---|---|---|---|---|---|
| 50k | 0.680 | 0.037 | 18.39x | 0.295 | 0.032 | 9.11x |
| 100k | 1.925 | 0.073 | 26.41x | 0.602 | 0.057 | 10.64x |
| 200k | 3.697 | 0.145 | 25.47x | 1.258 | 0.109 | 11.58x |
| 400k | 6.921 | 0.298 | 23.25x | 2.851 | 0.185 | 15.41x |
| 800k | 12.873 | 0.641 | 20.09x | 5.680 | 0.372 | 15.28x |
What this sweep says:
- Encode cost is close to linear in `N` for every implementation, but the slope is radically different: `clostera-fastest` holds roughly `1.25M` to `1.54M` vectors/s once the warm-up is out of the way, while the original implementation stays near `52k` to `74k` vectors/s.
- At fixed `K = 64` clusters, clustering also scales cleanly with dataset size. `clostera-fastest` stays about `9x` to `15x` faster than the original implementation across the full sweep.
- The main point for capacity planning is that scaling in `N` looks predictable, not erratic. That matters when you are extrapolating from pilot runs to hundreds of millions or billions of vectors.
We do not benchmark on one flattering Gaussian and declare victory. The committed suite now runs deterministic 10M-vector workloads for:
- Gaussian data
- anisotropic Gaussian data
- Student-t heavy-tailed data
- block-mixed `2048`-dimensional data
For each scenario we track:
- encode throughput
- clustering throughput
- reconstruction MSE
- purity
- adjusted Rand index
- normalized mutual information
- v-measure
- assigned-center MSE
Across the suite:
- `clostera-fastest` improves encode throughput over the original implementation by `25.35x` to `32.72x`.
- `clostera-quality` reduces reconstruction error by `2.40x` to `3.74x` relative to `clostera-fastest`.
- On end-to-end pipeline time, `clostera-quality` is faster than the original implementation on every committed `10M`-vector suite scenario.
- The original implementation is slower and shows visibly worse clustering quality on every committed scenario.
Modern ARM machines are not a side quest. clostera treats them like real production hardware.
- `aarch64` uses native NEON distance kernels for the common PQ subvector sizes `4`, `8`, `16`, `32`, and `64`.
- The PQ assignment path is no longer "build a buffer and scan it later". It now uses a fused lookup-accumulate-and-select kernel plus a SIMD-backed `argmin`, which matters on Apple Silicon because clustering on PQ codes is often dominated by assignment rather than raw distance evaluation.
- The release workflow builds `macOS arm64` wheels alongside `x86_64` wheels.
- The same wheel matrix also covers `manylinux_2_28` `x86_64` and `aarch64`.
- The release configuration uses `openblas-static` so published wheels are as self-contained as practical.
If you are running on Apple Silicon, this is not a Rosetta fallback story. There is architecture-specific code in the hot path and packaging support in the release pipeline.
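The shape of that fused assignment step can be pictured in NumPy. Assume `tables[m]` holds precomputed squared distances between codewords in subspace `m`; assigning one PQ-coded point to the nearest of `K` PQ-coded centers is then pure table lookup and accumulation, with no float-space distance evaluation at all. A sketch of the idea, not the NEON kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
M, Ks, K = 8, 256, 64                                  # subspaces, codewords, clusters
tables = rng.random((M, Ks, Ks)).astype(np.float32)    # per-subspace codeword-to-codeword distances
point = rng.integers(0, Ks, size=M)                    # PQ code of one vector
centers = rng.integers(0, Ks, size=(K, M))             # PQ codes of the K cluster centers

# fused lookup-accumulate-select
acc = np.zeros(K, dtype=np.float32)
for m in range(M):
    acc += tables[m, point[m], centers[:, m]]          # gather one table slice per subspace
label = int(acc.argmin())                              # SIMD-backed argmin in the real kernel
print(label)
```

The work per point is `M * K` table gathers plus one argmin over `K` accumulators, which is why assignment (not raw distance math) dominates when clustering PQ codes.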
One of the quietest but most important differences from the original repository is that clostera treats initialization like a real engineering problem instead of a footnote.
- `PQEncoder` uses deterministic PCA-quantile initialization per subspace, rather than hoping random point picks land in a good configuration.
- `PQKMeans` uses deterministic farthest-first seeding in PQ code space for better initial coverage.
- The default quality path refines an orthogonal rotation before final PQ training, which is where the large OPQ quality gains come from on correlated high-dimensional data.
That shows up as more stable training, fewer pathological runs, and better quality at the same code budget. The headline speedups are not coming from luckier random seeds.
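Farthest-first seeding itself is easy to picture. Here is a generic float-space sketch of the greedy rule — repeatedly pick the point farthest from every seed chosen so far (clostera applies the same idea in PQ code space with code-domain distances):

```python
import numpy as np

def farthest_first(points: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Pick k seed indices: start from a seeded choice, then repeatedly take
    the point with the largest distance to its nearest already-chosen seed."""
    rng = np.random.default_rng(seed)
    seeds = [int(rng.integers(len(points)))]
    min_dist = ((points - points[seeds[0]]) ** 2).sum(1)   # distance to nearest seed
    for _ in range(k - 1):
        nxt = int(min_dist.argmax())                       # farthest from the seed set
        seeds.append(nxt)
        d = ((points - points[nxt]) ** 2).sum(1)
        min_dist = np.minimum(min_dist, d)                 # update nearest-seed distances
    return np.array(seeds)

pts = np.random.default_rng(1).normal(size=(500, 8))
print(farthest_first(pts, 4))
```

The greedy rule guarantees the seeds spread across the data's extent, which is what "better initial coverage" means in practice.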
```
pip install clostera
```

Optional extras:

```
pip install "clostera[benchmarks]"
pip install "clostera[notebook]"
```

System BLAS/LAPACK build:

```
python -m pip install maturin
python -m maturin develop --release
```

Static OpenBLAS build:

```
python -m maturin develop --release --no-default-features --features openblas-static
```

```python
import numpy as np
import clostera

rng = np.random.default_rng(7)
vectors = rng.normal(size=(100_000, 128)).astype(np.float32)

clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform(vectors)
print(clusterer.selected_k_)  # selected K = selected number of clusters
```

```python
clusterer = clostera.Clusterer(k=known_k)  # known_k = desired number of clusters
labels = clusterer.fit_transform(vectors)
```

```python
clusterer = clostera.Clusterer(k=known_k, fastest=True)  # known_k = desired number of clusters
labels = clusterer.fit_transform(vectors)
```

```python
clusterer = clostera.Clusterer(k=known_k)  # known_k = desired number of clusters
clusterer.fit(vectors)
labels = clusterer.transform(vectors[:1024])
```

```python
clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform("vectors.parquet")
```

When the original float vectors do not fit in RAM, pass a parquet path or a `numpy.memmap`-backed matrix and set `max_ram_bytes`.

```python
clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform(
    "vectors.parquet",
    max_ram_bytes=8 << 30,
)
```

With `max_ram_bytes`, clostera keeps the training sample bounded, streams raw vectors in batches during encoding, and automatically spills PQ codes to a temporary memmap when needed. The raw vector matrix no longer needs to fit in RAM all at once. If you already materialized the data as a normal in-memory ndarray, clostera can only bound its own additional working set; for truly out-of-core runs, use parquet or `numpy.memmap`.
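For the `numpy.memmap` route, the matrix on disk just needs the right dtype and shape. A minimal sketch of preparing one in bounded chunks (the file path, row count, and dimensionality here are illustrative):

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "vectors.f32")
n, d = 100_000, 128

# write vectors to disk once, without ever holding them all in RAM
mm = np.memmap(path, dtype=np.float32, mode="w+", shape=(n, d))
rng = np.random.default_rng(0)
for start in range(0, n, 10_000):                      # fill in bounded chunks
    mm[start:start + 10_000] = rng.normal(size=(10_000, d)).astype(np.float32)
mm.flush()

# reopen read-only; this handle can then be passed to clostera.Clusterer(...).fit_transform
vectors = np.memmap(path, dtype=np.float32, mode="r", shape=(n, d))
print(vectors.shape)
```

Because a memmap pages data in on demand, combining it with `max_ram_bytes` keeps both the input and clostera's working set bounded.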
Most users should start with `Clusterer`. The lower-level building blocks are still available when you want to:
- reuse encoded PQ codes across many clustering runs
- fit encoders and clusterers separately
- switch explicitly between plain PQ and OPQ
- tune encoder-specific and clusterer-specific parameters independently
Use `Clusterer(fastest=True)` when you want the fastest high-level path. Use plain `PQEncoder` and `PQKMeans` when you need that same plain-PQ behavior with explicit control. Use `OPQEncoder` and `OPQMeans` when reconstruction fidelity matters more and the data has strong cross-subspace correlation.
If you omit `num_subquantizers`, clostera infers a sensible default from the input dimensionality. For typical embeddings that lands near `sqrt(D)` code bytes while keeping each subvector wide enough to stay stable.
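To make that heuristic concrete, here is one way such a default could be derived — pick the divisor of `D` closest to `sqrt(D)`, so subvectors stay evenly sized. This is an illustrative reconstruction of the stated rule, not clostera's exact code, and the values it produces are only what this sketch yields:

```python
import math

def default_num_subquantizers(d: int) -> int:
    """Hypothetical helper: divisor of d nearest to sqrt(d)."""
    divisors = [m for m in range(1, d + 1) if d % m == 0]
    return min(divisors, key=lambda m: abs(m - math.sqrt(d)))

print(default_num_subquantizers(2048))  # 32: nearest power-of-two divisor to ~45.3
print(default_num_subquantizers(128))   # 8: nearest divisor to ~11.3
```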
```python
encoder = clostera.PQEncoder()
encoder.fit(vectors)
codes = encoder.transform(vectors)

clusterer = clostera.PQKMeans(encoder=encoder, k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform(codes)
```

The repository includes a walkthrough notebook designed for readers who want the full visual story:
The committed notebook embeds its static figures directly, so the visuals render in GitHub and standalone notebook viewers without depending on external image paths.
It covers:
- the high-level `Clusterer` workflow
- automatic number-of-clusters (`K`) selection with `k=None`
K) selection withk=None - parquet workflows
- toy clustering visualization
- plain PQ versus OPQ reconstruction quality
- the advanced encoder/clusterer split when you need it
- cross-dataset benchmark comparisons
- the large-scale `10M x 2048` checkpoint
- `K` (number of clusters) and `N` scaling sweeps
In the API tables below, `PathLike` means a plain path string or a `pathlib.Path` object.
`Clusterer` is the default high-level API. It hides the encoder/clusterer split and gives the common workflow a simple `fit`, `transform`, `fit_transform`, `fit_predict`, and `predict` surface. By default it uses the quality-first OPQ path; pass `fastest=True` when you want the maximum-throughput plain-PQ path instead.
| Parameter | Type | Default | Meaning |
|---|---|---|---|
| `k` | `int \| None` | `None` | Number of target clusters. Here K means the number of clusters. `None` enables automatic number-of-clusters selection. |
| `fastest` | `bool` | `False` | Turn off OPQ and use the maximum-throughput plain-PQ path. This usually lowers reconstruction quality but can reduce total fit time substantially on large runs. |
| `num_subquantizers` | `int \| None` | `None` | Optional PQ subspace count. When omitted, clostera infers a deterministic default from the input dimensionality. |
| `codebook_size` | `int` | `256` | Number of codewords per subspace. |
| `iterations` | `int` | `20` | Shared iteration budget for the simple high-level API. |
| `seed` | `int` | `0` | Deterministic seed. |
| `opq_iterations` | `int` | `3` | OPQ refinement steps used on the default quality-first path. When `fastest=True`, the current code always uses plain PQ and ignores this setting. |
| `verbose` | `bool` | `False` | Emit inertia diagnostics during fitting. |
| `lookup_table_bytes` | `int` | `1 << 30` | Memory budget for code-domain lookup tables. Larger budgets favor faster assignment. |
| `auto_k_method` | `str` | `"centroid_silhouette"` | Automatic-number-of-clusters (K) scoring rule. Supported values are `"centroid_silhouette"`, `"davies_bouldin"`, `"elbow"`, and `"bic"`. |
| `auto_k_candidates` | `list[int] \| tuple[int, ...] \| np.ndarray \| None` | `None` | Explicit candidate K values (candidate cluster counts) to test when `k=None`. If omitted, clostera builds a default candidate template automatically, including practical values such as 4, 6, 8, 12, 16, 24, and 32 when the dataset size supports them. |
| `auto_k_min` | `int` | `2` | Lower bound for automatically generated candidate values when `auto_k_candidates` is omitted. |
| `auto_k_max` | `int \| None` | `None` | Upper bound for automatically generated candidate values when `auto_k_candidates` is omitted. |
| `auto_k_step` | `int \| None` | `None` | Optional arithmetic step for generated candidates. If omitted, clostera uses a baked-in candidate template. |
| `auto_k_sample_rows` | `int` | `16_384` | Number of PQ codes sampled for the Rust-side candidate analysis pass. |
| Parameter | Type | Default | Meaning |
|---|---|---|---|
| `data` | `np.ndarray \| PathLike` | required | Raw float vectors as an array, parquet path, or `numpy.memmap`-backed matrix. |
| `parquet_column` | `str \| None` | `None` | Specific parquet vector column. |
| `batch_size` | `int` | `65_536` | Parquet streaming batch size. |
| `codes_output_path` | `PathLike \| None` | `None` | Optional memmap destination when raw parquet input must be encoded first. |
| `max_ram_bytes` | `int \| None` | `None` | Optional RAM budget for bounded-memory raw-vector workflows. |
Advanced access after fitting:
- `encoder_`: the fitted `PQEncoder` or `OPQEncoder`
- `clusterer_`: the fitted `PQKMeans` or `OPQMeans`
- `labels_`, `cluster_centers_`, `inertia_history_`, `selected_k_`, `k_selection_`
The classes below expose the encoder/clusterer split directly. Reach for them when you want to reuse PQ codes, separate training phases, or tune encoder-specific and clusterer-specific parameters independently.
| Parameter | Type | Default | Meaning |
|---|---|---|---|
| `num_subquantizers` | `int \| None` | `None` | Number of PQ subspaces M. When omitted, clostera infers a deterministic default from the input dimensionality. Explicit values still require the dimensionality to be divisible by M. |
| `codebook_size` | `int` | `256` | Number of codewords per subspace Ks. Supported range is `2..=256`. |
| `iterations` | `int` | `20` | Number of Lloyd iterations for subspace k-means training. |
| `seed` | `int` | `0` | Deterministic seed used for initialization fallback and reproducible training behavior. |
| `opq_iterations` | `int` | `0` | Number of OPQ refinement steps. `0` keeps plain PQ, `>0` learns an orthogonal rotation before final PQ training. |
`OPQEncoder` has the same API and runtime methods as `PQEncoder`, but defaults `opq_iterations` to `3`.
| Parameter | Type | Default | Meaning |
|---|---|---|---|
| `data` | `np.ndarray \| PathLike` | required | A dense float32 matrix or a parquet path. |
| `parquet_column` | `str \| None` | `None` | Specific parquet column to treat as the vector column. |
| `batch_size` | `int` | `65_536` | Batch size for parquet streaming. |
| `train_rows` | `int \| None` | `None` | Number of deterministic training rows to sample. With in-memory arrays, omitting this uses the full matrix unless `max_ram_bytes` is set. |
| `max_ram_bytes` | `int \| None` | `None` | Optional RAM budget for the training sample plus OPQ workspace. When set, large parquet or memmap-backed inputs are trained from a bounded deterministic sample. |
| Parameter | Type | Default | Meaning |
|---|---|---|---|
| `data` | `np.ndarray \| PathLike` | required | Dense vectors or parquet input. |
| `parquet_column` | `str \| None` | `None` | Specific parquet vector column. |
| `batch_size` | `int` | `65_536` | Parquet streaming batch size. |
| `output_path` | `PathLike \| None` | `None` | Optional destination for a memory-mapped uint8 code matrix. |
| `max_ram_bytes` | `int \| None` | `None` | Optional RAM budget for batched encoding. Large raw-vector inputs are processed in chunks; if codes would not fit in RAM, provide `output_path` or call `PQKMeans.fit(...)` directly. |
| Parameter | Type | Default | Meaning |
|---|---|---|---|
| `data` | `np.ndarray \| PathLike` | required | A dense float32 matrix or a parquet path. |
| `parquet_column` | `str \| None` | `None` | Specific parquet column to treat as the vector column. |
| `batch_size` | `int` | `65_536` | Parquet streaming batch size. |
| `train_rows` | `int \| None` | `None` | Number of deterministic training rows to sample before encoding. |
| `output_path` | `PathLike \| None` | `None` | Optional destination for a memory-mapped uint8 code matrix produced by the transform phase. |
| `max_ram_bytes` | `int \| None` | `None` | Optional RAM budget applied to both training and encoding. |
| Parameter | Type | Default | Meaning |
|---|---|---|---|
| `codes` | `np.ndarray` | required | A 2D PQ code matrix with shape `(rows, num_subquantizers)`. Returns decoded float32 vectors. |
| Parameter | Type | Default | Meaning |
|---|---|---|---|
| `encoder` | `PQEncoder` | required | Trained encoder that defines the codebooks. |
| `k` | `int \| None` | `None` | Number of target clusters. Here K means the number of clusters. `None` enables Rust-side automatic number-of-clusters selection over candidate values in PQ code space. |
| `iterations` | `int` | `20` | Number of clustering update rounds. |
| `seed` | `int` | `0` | Deterministic seed for cluster-center initialization. |
| `verbose` | `bool` | `False` | Emit inertia diagnostics during fitting. |
| `lookup_table_bytes` | `int` | `1 << 30` | Memory budget for code-domain lookup tables. Larger budgets favor faster assignment. |
| `auto_k_method` | `str` | `"centroid_silhouette"` | Automatic-number-of-clusters (K) scoring rule. Supported values are `"centroid_silhouette"`, `"davies_bouldin"`, `"elbow"`, and `"bic"`. |
| `auto_k_candidates` | `list[int] \| tuple[int, ...] \| np.ndarray \| None` | `None` | Explicit candidate K values (candidate cluster counts) to test when `k=None`. If omitted, clostera builds a default candidate template automatically, including practical values such as 4, 6, 8, 12, 16, 24, and 32 when the dataset size supports them. |
| `auto_k_min` | `int` | `2` | Lower bound for automatically generated candidate values when `auto_k_candidates` is omitted. |
| `auto_k_max` | `int \| None` | `None` | Upper bound for automatically generated candidate values when `auto_k_candidates` is omitted. |
| `auto_k_step` | `int \| None` | `None` | Optional arithmetic step for generated candidates. If omitted, clostera uses a baked-in candidate template. |
| `auto_k_sample_rows` | `int` | `16_384` | Number of PQ codes sampled for the Rust-side candidate analysis pass. |
`OPQMeans` mirrors `PQKMeans`, but treats OPQ as the default rather than an extra knob. If you do not pass `encoder=`, it lazily creates and fits an `OPQEncoder` from the raw vectors or parquet source on first `fit(...)`, `fit_predict(...)`, or `fit_transform(...)`. If you do pass `encoder=`, the current code requires it to have been trained with `opq_iterations > 0`.
| Parameter | Type | Default | Meaning |
|---|---|---|---|
| `encoder` | `PQEncoder \| None` | `None` | Optional pre-trained OPQ encoder. If omitted, `OPQMeans` builds one automatically. |
| `num_subquantizers` | `int \| None` | `None` | Optional encoder-side PQ subspace count when `encoder` is omitted. |
| `codebook_size` | `int` | `256` | Optional encoder-side codebook size when `encoder` is omitted. |
| `encoder_iterations` | `int` | `20` | Encoder training iterations used when `encoder` is omitted. |
| `seed` | `int` | `0` | Deterministic seed shared by the implicit encoder and the clusterer. |
| `opq_iterations` | `int` | `3` | OPQ refinement steps used by the implicit encoder. |
| `k` | `int \| None` | `None` | Number of target clusters. Here K means the number of clusters. `None` enables Rust-side automatic number-of-clusters selection over candidate values in PQ code space. |
| `iterations` | `int` | `20` | Number of clustering update rounds. |
| `verbose` | `bool` | `False` | Emit inertia diagnostics during fitting. |
| `lookup_table_bytes` | `int` | `1 << 30` | Memory budget for code-domain lookup tables. Larger budgets favor faster assignment. |
| `auto_k_method` | `str` | `"centroid_silhouette"` | Automatic-number-of-clusters (K) scoring rule. Supported values are `"centroid_silhouette"`, `"davies_bouldin"`, `"elbow"`, and `"bic"`. |
| `auto_k_candidates` | `list[int] \| tuple[int, ...] \| np.ndarray \| None` | `None` | Explicit candidate K values (candidate cluster counts) to test when `k=None`. If omitted, clostera builds a default candidate template automatically, including practical values such as 4, 6, 8, 12, 16, 24, and 32 when the dataset size supports them. |
| `auto_k_min` | `int` | `2` | Lower bound for automatically generated candidate values when `auto_k_candidates` is omitted. |
| `auto_k_max` | `int \| None` | `None` | Upper bound for automatically generated candidate values when `auto_k_candidates` is omitted. |
| `auto_k_step` | `int \| None` | `None` | Optional arithmetic step for generated candidates. If omitted, clostera uses a baked-in candidate template. |
| `auto_k_sample_rows` | `int` | `16_384` | Number of PQ codes sampled for the Rust-side candidate analysis pass. |
`OPQMeans` uses the same runtime method signatures as `PQKMeans`: `fit(...)`, `transform(...)`, `fit_transform(...)`, `fit_predict(...)`, and `predict(...)`.
| Parameter | Type | Default | Meaning |
|---|---|---|---|
| `data` | `np.ndarray \| PathLike` | required | Either raw vectors or precomputed PQ codes. |
| `parquet_column` | `str \| None` | `None` | Specific parquet vector column. |
| `batch_size` | `int` | `65_536` | Parquet streaming batch size. |
| `codes_output_path` | `PathLike \| None` | `None` | Optional memmap destination when raw parquet input must be encoded first. |
| `max_ram_bytes` | `int \| None` | `None` | Optional RAM budget for encoding raw vectors into PQ codes before clustering. When set and no `codes_output_path` is supplied, clostera creates a temporary memmap automatically. |
When `k=None`, fitting also populates:

- `selected_k_`: the final chosen cluster count (K)
- `k_selection_`: the full Rust-side selection report, including the tested candidate values and per-method scores
| Environment variable | Meaning |
|---|---|
| `CLOSTERA_ROTATION_BATCH_MIB` | Override the default OPQ rotation batch target in MiB for benchmarking or machine-specific tuning. |
```
python scripts/generate_synthetic_dataset.py \
  --output-dir .artifacts/block-mixed-200k-2048 \
  --distribution block_mixed \
  --rows 200000 \
  --dim 2048 \
  --clusters 64 \
  --seed 11
```

```
python scripts/compare_impls.py \
  --dataset-dir .artifacts/block-mixed-200k-2048 \
  --original-python "$(which python)" \
  --enhanced-python "$(which python)" \
  --train-rows 32768 \
  --metric-sample-rows 32768 \
  --num-subquantizers 64 \
  --codebook-size 64 \
  --pq-iterations 6 \
  --cluster-k 64 \
  --cluster-iterations 4 \
  --opq-iterations 3 \
  --blas-threads 24 \
  --omp-threads 24 \
  --rayon-threads 24 \
  --rotation-batch-mib 32 \
  --output-json .artifacts/block-mixed-200k-2048/compare.json
```

```
python scripts/benchmark_k_sweep.py \
  --dataset-dir .artifacts/k-sweep-block-mixed-200k-2048 \
  --output-json benchmarks/results/k-sweep.json \
  --original-python "$(which python)" \
  --enhanced-python "$(which python)" \
  --force
```

```
python scripts/benchmark_n_sweep.py \
  --dataset-dir .artifacts/n-sweep-block-mixed-800k-2048 \
  --output-json benchmarks/results/n-sweep.json \
  --original-python "$(which python)" \
  --enhanced-python "$(which python)" \
  --force
```

```
python scripts/benchmark_suite.py \
  --output-dir .artifacts/benchmark-suite \
  --original-python "$(which python)" \
  --enhanced-python "$(which python)" \
  --blas-threads 24 \
  --omp-threads 24 \
  --rayon-threads 24 \
  --rotation-batch-mib 32 \
  --force
```

```
python scripts/evaluate_auto_k_methods.py \
  --output-json benchmarks/results/auto-k-methods.json \
  --force
```

```
python scripts/render_benchmark_assets.py \
  --suite-json benchmarks/results/benchmark-suite.json \
  --large-json benchmarks/results/large-scale-10m.json \
  --k-sweep-json benchmarks/results/k-sweep.json \
  --n-sweep-json benchmarks/results/n-sweep.json \
  --auto-k-json benchmarks/results/auto-k-methods.json \
  --output-dir docs/assets
```

The repository already includes publication artifacts for:
- `manylinux_2_28` wheels for `x86_64` and `aarch64`
- macOS wheels for `x86_64` and `arm64`
- CPython `3.10` through `3.13`
- source distributions
Relevant files:
- `.github/workflows/ci.yml`
- `.github/workflows/release.yml`
- `rust-toolchain.toml`
The release workflow builds wheels with `openblas-static` enabled so binary installs are as self-contained as practical.

The PyPI project name is `clostera`.
Once the one-time PyPI Trusted Publisher setup is done for:
- owner: `BaseModelAI`
- repository: `clostera`
- workflow: `.github/workflows/release.yml`
- environment: `pypi`
the normal release path is:
```
python scripts/release.py 1.0.3 --commit --tag --push
```

That updates the version in the release metadata, creates the release commit, creates tag `v1.0.3`, and pushes both to origin. The tag push triggers the GitHub release workflow, which builds the wheels and publishes them to PyPI.
- Original repository: https://github.com/DwangoMediaVillage/pqkmeans
- Original project page: https://yusukematsui.me/project/pqkmeans/pqkmeans.html
- Original paper: https://arxiv.org/abs/1709.03708
- Jégou, Douze, Schmid. Product Quantization for Nearest Neighbor Search. IEEE TPAMI 2011. https://ieeexplore.ieee.org/document/5432202/
- Ge, He, Ke, Sun. Optimized Product Quantization. IEEE TPAMI 2014. https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/pami13opq.pdf
- André, Kermarrec, Le Scouarnec. Accelerated Nearest Neighbor Search with Quick ADC. https://arxiv.org/abs/1704.07355
- André et al. Quicker ADC: Unlocking the Hidden Potential of Product Quantization with SIMD. https://arxiv.org/abs/1812.09162
- Matsui, Uchida, Jégou, Satoh. A Survey of Product Quantization. https://www.jstage.jst.go.jp/article/mta/6/1/6_2/_article
Current local verification commands:
```
python -m maturin develop --release
cargo test --release
pytest -q
cargo check --no-default-features --features openblas-static
cargo bench --bench core_bench
```