BaseModelAI/clostera

clostera: The Billion-Vector Resurrection

clostera benchmark summary

They told you that clustering massive high-dimensional vector collections on a single machine was a fool's errand. They said you needed a cluster, a distributed headache, and a cloud bill large enough to ruin your week. They were wrong.

clostera is a from-scratch Rust rebuild of the original pqkmeans repository, aimed at the workloads that made that project exciting in the first place: extremely large vector collections, high dimensionality, single-machine practicality, and performance that is measured rather than hoped for.

This is not a thin wrapper around old code. It is a modern rewrite with a new Rust core, a NumPy-first Python layer, parquet and out-of-core workflows, deterministic benchmarks, automatic number-of-clusters (K) selection, Apple Silicon support, and wheels that install like a normal Python package.

Rust core · Rayon · OpenBLAS/LAPACK · AVX2/SSE · Apple Silicon NEON · NumPy + parquet · manylinux + macOS wheels

pip install clostera
Why billion-scale clustering?

The short answer is that it is genuinely useful. If you work with embeddings, recommendations, retrieval, representation learning, semantic search, or large behavioral datasets, clustering at very large scale is not academic theater. It is operationally important.

But for 🦋 clostera, that is only part of the story.

The deeper reason is historical and conceptual. The extreme efficiency and mathematical elegance of the original pqkmeans algorithm indirectly helped inspire the development of EMDE, and later a much stronger internal family of TREMDE algorithms. Together with internal proprietary evolutions of 🦋 Cleora, those ideas form a major part of the conceptual foundation behind BaseModel.AI, Synerise's flagship product.

That is why this rewrite exists. The original project mattered. It influenced real systems, real products, and real lines of research. Left unmaintained, it deserved a modern successor: faster, cleaner, easier to install, easier to use, and built for current hardware instead of the past.

Origins of the Clostera name

At Synerise, we have a tradition of finding algorithmic inspiration in the natural world: specifically, in the quiet, hyper-efficient mechanics of the moth.

Just as we look to 🦋 Cleora to capture the geometry and distance calculations of our hyperspherical embeddings, we turn to the 🦋 Clostera moth to represent the colossal mechanics of billion-scale clustering.

In taxonomy, 🦋 Clostera is a genus of prominent moths known for their robust build and rapid flight. But the true magic lies in the origin of the name. Derived from the ancient Greek word klōstēr (κλωστήρ), "🦋 Clostera" literally translates to "spindle".

A spindle's sole purpose is to take raw, chaotic, disconnected fibers and rapidly rotate them, pulling them tightly around a central core to spin them into structured, organized threads.

In machine learning, your billion-scale dataset is that chaotic fleece.

🦋 Clostera is your algorithmic spindle. It acts as a high-speed rotational force, drawing billions of isolated vectors toward a shared center of mass, the centroid. It takes the noise, finds the pattern, and binds your scattered data into structured clusters.

Fast, robust, and mathematically grounded. Welcome to the 🦋 Clostera era.

⚡️ Quick Start: It just works

The zero-tuning path

import numpy as np
import clostera

vectors = np.load("vectors.npy").astype(np.float32)
clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform(vectors)

print(clusterer.selected_k_)  # selected K = selected number of clusters

That is the default story: one object, raw vectors in, labels out, OPQ-enabled quality path by default, and automatic number-of-clusters (K) selection when you do not know the answer up front.

The fastest path

clusterer = clostera.Clusterer(k=256, fastest=True)  # K = number of clusters
labels = clusterer.fit_transform(vectors)

fastest=True turns off OPQ and uses the plain PQ path. That is the right choice when end-to-end throughput matters more than reconstruction quality. The main speed win is in encoder training and encoding; the final compressed assignment stage itself is already fast in both modes.
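As a mental model, the plain PQ path snaps each subvector of every input vector to its nearest codeword in the corresponding subspace. A minimal NumPy sketch of that encoding step (illustrative only, not clostera's Rust implementation):

```python
import numpy as np

def pq_encode(vectors: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Plain-PQ encode sketch. vectors: (N, D); codebooks: (M, Ks, D // M)."""
    n, d = vectors.shape
    m, ks, sub = codebooks.shape
    assert d == m * sub, "dimensionality must split evenly across subspaces"
    subvecs = vectors.reshape(n, m, sub)
    codes = np.empty((n, m), dtype=np.uint8)
    for j in range(m):
        # squared distances from every subvector to every codeword in subspace j
        diff = subvecs[:, j, None, :] - codebooks[j][None, :, :]
        codes[:, j] = np.argmin(np.einsum("nkc,nkc->nk", diff, diff), axis=1)
    return codes
```

Each row of the result is M bytes, which is why everything downstream of encoding gets so much cheaper.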

Out-of-core from parquet

clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform("vectors.parquet")

If the original float vectors do not fit comfortably in RAM, add max_ram_bytes=.... If they do fit, you do not need to think about it.

⚡️ The Miracle of 30.8x: Bending Time

The original repository proved a powerful idea: by clustering in PQ code space instead of dense float space, single-machine clustering suddenly stops sounding ridiculous. That idea aged well. The surrounding implementation did not.

clostera asks the obvious follow-up question:

What happens if you rebuild the original pqkmeans project properly for modern hardware and modern Python workflows?

On the committed deterministic 10M x 2048 checkpoint, the answer is not subtle.

| Metric (10M x 2048) | original | clostera-fastest | clostera-quality |
| --- | --- | --- | --- |
| Encode time | 222.94 s | 7.24 s | 131.34 s |
| Cluster time | 80.19 s | 4.50 s | 4.39 s |
| Reconstruction MSE | 0.15160 | 0.12354 | 0.05494 |
| Purity | 0.6573 | 1.0000 | 1.0000 |

That means:

  • 30.8x faster encoding than the original implementation on the headline checkpoint.
  • 17.8x faster clustering on the same full-core run.
  • Better clustering quality even on the fastest path.
  • A quality-first OPQ mode that dramatically lowers reconstruction error when fidelity matters more than raw throughput.

10M by 2048 benchmark figure

💾 The Alchemy of Memory: Zero-RAM Scaling

At billion-vector scale, the algorithm is only half the story. Memory movement is usually the real bottleneck.

clostera is built around that reality:

  • raw numpy.ndarray input works out of the box
  • parquet is a first-class input format
  • fixed-size-list vector columns and plain numeric scalar columns are both supported
  • max_ram_bytes bounds the working set when the original float vectors do not fit
  • raw vectors can be streamed while PQ codes spill to disk automatically when needed
  • numpy.memmap fits naturally into the same workflow

This is the practical difference between a paper result and a pipeline you can actually operate.

A 2D example using k-means, clostera-quality, and clostera-fastest

2D comparison of k-means, clostera-quality, and clostera-fastest

Large-scale evaluation

Large-scale evaluation summary table

🧠 The Oracle of K: Automatic number of clusters without guesswork

Choosing K (the number of clusters) used to mean elbow plots, trial-and-error, and pretending you were more certain than you really were.

clostera lets you pass k=None to Clusterer, PQKMeans, or OPQMeans when you do not know the number of clusters in advance. The candidate analysis runs in Rust, reuses the already-trained encoder and the already-encoded PQ code matrix, and does not regenerate the expensive intermediate artifacts for each candidate number of clusters (K).

On the committed deterministic benchmark sweep, the default centroid_silhouette selector recovered the exact true cluster count in 20/20 cases.

  • centroid_silhouette: 20/20 exact matches, 0.00 mean absolute error
  • davies_bouldin: 18/20 exact matches, 0.90 mean absolute error
  • elbow: 18/20 exact matches, 1.60 mean absolute error
  • bic: 3/20 exact matches, 50.40 mean absolute error
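To make the candidate-scoring idea concrete, here is a minimal elbow-style selector: pick the candidate K where the per-step inertia improvement falls off most sharply. This is an illustrative sketch, not clostera's actual Rust-side scorer:

```python
import numpy as np

def pick_elbow(candidates: list[int], inertias: list[float]) -> int:
    """Elbow-style K selection sketch: choose the candidate where the
    improvement in inertia drops the most (largest second difference)."""
    x = np.asarray(inertias, dtype=float)
    gains = x[:-1] - x[1:]          # improvement from each K to the next
    drops = gains[:-1] - gains[1:]  # how sharply the improvement falls off
    return candidates[1 + int(drops.argmax())]

print(pick_elbow([2, 4, 8, 16], [100.0, 40.0, 35.0, 33.0]))  # elbow at K=4
```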

Automatic number-of-clusters (K) selection benchmark figure

💎 The Obsidian Core: Engineered for modern silicon

clostera is built for people who care about practical speed, reproducibility, and a sane deployment story.

  • Clusterer is the simple default API for normal use.
  • fastest=True gives you the maximum-throughput plain-PQ path.
  • The default path keeps OPQ on and favors quality.
  • The advanced split into PQEncoder / PQKMeans and OPQEncoder / OPQMeans is still there when you need it.
  • The hot paths use full-core Rust + Rayon, BLAS/LAPACK-backed dense math, x86 SIMD, and Apple Silicon NEON kernels.
  • Wheels are built for manylinux_2_28 x86_64 and aarch64, plus macOS x86_64 and arm64.
  • Deterministic seeds, deterministic synthetic datasets, and committed benchmark artifacts make the claims inspectable.

End-to-end clustering pipeline time and quality tradeoff across deterministic benchmark families

🔁 From research repo to production rewrite

The original project matters because it proved the idea. clostera exists because that idea deserved a modern implementation.

| Area | Original pqkmeans | clostera |
| --- | --- | --- |
| Core implementation | Older Python/C++ reference stack | Rust core with PyO3 bindings and maturin packaging |
| PQ codebook initialization | Basic point-picked initialization | Deterministic PCA-quantile seeding with deterministic fallback |
| Cluster initialization | Random center picking in PQ code space | Deterministic farthest-first seeding in PQ code space |
| Quality modes | Plain PQ | Default OPQ-backed quality path plus an explicit fastest plain-PQ mode |
| Choosing K (number of clusters) | User supplies K | User supplies K or lets Rust-side auto-selection choose it with k=None |
| CPU path | OpenMP-era reference implementation | Rayon-parallel hot paths, BLAS/LAPACK-backed math, x86 SIMD, Apple Silicon NEON |
| Python workflows | NumPy-centric | NumPy arrays, parquet streaming, memmapped code output, RAM-bounded out-of-core workflows, deterministic synthetic datasets |
| Packaging | Source build expectations | manylinux_2_28 x86_64 and aarch64, macOS x86_64 and arm64, CPython 3.10 through 3.13 |
| Benchmarking | Research notebooks and limited comparison artifacts | Deterministic benchmark suite with throughput and clustering-quality metrics, plots, and a showcase notebook |

📊 The Benchmarks of Truth

The README carries committed, deterministic benchmarks because this project should win on numbers, not adjectives.

Large-scale checkpoint: 10,000,000 x 2048

This is the scale checkpoint the rewrite has to answer for: 64 clusters, one machine, and a dataset large enough that hand-waving stops being useful.

Thread settings used for the max-throughput configuration:

  • 24 BLAS threads
  • 24 OpenMP threads
  • 24 Rayon threads
| Variant | Encode s | Cluster s | Recon MSE | Purity |
| --- | --- | --- | --- | --- |
| original | 222.94 | 80.19 | 0.15160 | 0.6573 |
| clostera-fastest | 7.24 | 4.50 | 0.12354 | 1.0000 |
| clostera-quality | 131.34 | 4.39 | 0.05494 | 1.0000 |

How to read that table:

  • clostera-fastest is the throughput configuration. It is the answer when raw encode speed matters most.
  • clostera-quality is the quality configuration. It spends more time on rotation but cuts reconstruction MSE by 2.25x versus clostera-fastest and by 2.76x versus the original implementation.
  • Even before OPQ, the Rust rewrite already beats the original implementation on both throughput and cluster quality.

10M by 2048 benchmark figure

K sweep: how the number of clusters changes runtime

We also ran a deterministic K sweep on the same 200k x 2048 block-mixed family used in the benchmark suite. Here K means the number of clusters. This isolates the clustering stage: each implementation trains and encodes once, then we sweep K = 16, 32, 64, 128, 256 over the same PQ codes.

| K (number of clusters) | original cluster s | clostera-fastest cluster s | original / clostera-fastest speedup |
| --- | --- | --- | --- |
| 16 | 1.088 | 0.047 | 22.92x |
| 32 | 1.404 | 0.064 | 21.83x |
| 64 | 1.488 | 0.111 | 13.43x |
| 128 | 1.597 | 0.205 | 7.80x |
| 256 | 1.646 | 0.315 | 5.22x |

What this sweep says:

  • The original implementation slows steadily as K rises and stays well behind clostera-fastest at every point in the published sweep.
  • The important point is not just the ranking. It is that clostera-fastest keeps clustering comfortably sub-second through K = 256 clusters on 200k x 2048, while the original implementation stays well above the one-second mark.

Clustering time versus K (number of clusters) on deterministic block mixed data

N sweep: how runtime scales with dataset size

We also fixed the algorithm configuration at K = 64 clusters, M = 64, Ks = 64 and swept the deterministic 2048-dimensional block-mixed dataset from 50k to 800k rows. Each point below uses a 16,384-row warm-up and reports the median of 3 timing runs, so the curve reflects steady-state runtime rather than first-call overhead.

| N | original encode s | clostera-fastest encode s | Encode speedup | original cluster s | clostera-fastest cluster s | Cluster speedup |
| --- | --- | --- | --- | --- | --- | --- |
| 50k | 0.680 | 0.037 | 18.39x | 0.295 | 0.032 | 9.11x |
| 100k | 1.925 | 0.073 | 26.41x | 0.602 | 0.057 | 10.64x |
| 200k | 3.697 | 0.145 | 25.47x | 1.258 | 0.109 | 11.58x |
| 400k | 6.921 | 0.298 | 23.25x | 2.851 | 0.185 | 15.41x |
| 800k | 12.873 | 0.641 | 20.09x | 5.680 | 0.372 | 15.28x |

What this sweep says:

  • Encode cost is close to linear in N for every implementation, but the slope is radically different: clostera-fastest holds roughly 1.25M to 1.54M vectors/s once the warm-up is out of the way, while the original implementation stays near 52k to 74k vectors/s.
  • At fixed K = 64 clusters, clustering also scales cleanly with dataset size. clostera-fastest stays about 9x to 15x faster than the original implementation across the full sweep.
  • The main point for capacity planning is that scaling by N looks predictable, not erratic. That matters when you are extrapolating from pilot runs to hundreds of millions or billions of vectors.

Encoding and clustering time versus dataset size on deterministic block mixed data

Distribution suite: speed and quality across different data families

We do not benchmark on one flattering Gaussian and declare victory. The committed suite now runs deterministic 10M-vector workloads for:

  • Gaussian data
  • anisotropic Gaussian data
  • Student-t heavy-tailed data
  • block-mixed 2048-dimensional data

For each scenario we track:

  • encode throughput
  • clustering throughput
  • reconstruction MSE
  • purity
  • adjusted Rand index
  • normalized mutual information
  • v-measure
  • assigned-center MSE

Across the suite:

  • clostera-fastest improves encode throughput over the original implementation by 25.35x to 32.72x.
  • clostera-quality reduces reconstruction error by 2.40x to 3.74x relative to clostera-fastest.
  • on end-to-end pipeline time, clostera-quality is faster than the original implementation on every committed 10M-vector suite scenario.
  • the original implementation is slower and has visibly worse clustering quality on every committed scenario.

Reconstruction error across deterministic datasets

Clustering purity across deterministic datasets

🍏 Apple Silicon is a first-class target

Modern ARM machines are not a side quest. clostera treats them like real production hardware.

  • aarch64 uses native NEON distance kernels for the common PQ subvector sizes 4, 8, 16, 32, and 64.
  • The PQ assignment path is no longer “build a buffer and scan it later”. It now uses a fused lookup-accumulate-and-select kernel plus SIMD-backed argmin, which matters on Apple Silicon because clustering on PQ codes is often dominated by assignment rather than raw distance evaluation.
  • The release workflow builds macOS arm64 wheels alongside x86_64 wheels.
  • The same wheel matrix also covers manylinux_2_28 x86_64 and aarch64.
  • The release configuration uses openblas-static so published wheels are as self-contained as practical.

If you are running on Apple Silicon, this is not a Rosetta fallback story. There is architecture-specific code in the hot path and packaging support in the release pipeline.
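To see why assignment dominates, here is what code-space assignment looks like conceptually: per-subspace codeword-distance tables are gathered and accumulated, then argmin picks the center. This is an unfused NumPy sketch for intuition; clostera's NEON kernel fuses the lookup, accumulate, and select steps:

```python
import numpy as np

def assign_codes(codes, centroid_codes, codeword_dist):
    """Assign PQ-coded points to PQ-coded centers via lookup tables.

    codes: (N, M) uint8, centroid_codes: (K, M) uint8,
    codeword_dist: (M, Ks, Ks) squared distances between codewords."""
    n, m = codes.shape
    k = centroid_codes.shape[0]
    dist = np.zeros((n, k), dtype=np.float32)
    for j in range(m):
        # rows indexed by each point's code, columns by each center's code
        dist += codeword_dist[j][codes[:, j]][:, centroid_codes[:, j]]
    return dist.argmin(axis=1)
```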

🔧 Under the hood: better initialization, less luck

One of the quietest but most important differences from the original repository is that clostera treats initialization like a real engineering problem instead of a footnote.

  • PQEncoder uses deterministic PCA-quantile initialization per subspace, rather than hoping random point picks land in a good configuration.
  • PQKMeans uses deterministic farthest-first seeding in PQ code space for better initial coverage.
  • The default quality path refines an orthogonal rotation before final PQ training, which is where the large OPQ quality gains come from on correlated high-dimensional data.

That shows up as more stable training, fewer pathological runs, and better quality at the same code budget. The headline speedups are not coming from luckier random seeds.
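The farthest-first idea is simple enough to sketch in a few lines. This illustrative version works on plain Euclidean points; clostera applies the same greedy rule in PQ code space:

```python
import numpy as np

def farthest_first(points: np.ndarray, k: int) -> list[int]:
    """Deterministic farthest-first seeding sketch: start from point 0,
    then repeatedly add the point farthest from all chosen seeds."""
    chosen = [0]
    d = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        nxt = int(d.argmax())
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(points - points[nxt], axis=1))
    return chosen
```

The greedy rule guarantees good initial coverage of the data's extremes without any randomness to get lucky or unlucky with.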

Installation

PyPI

pip install clostera

Optional extras:

pip install "clostera[benchmarks]"
pip install "clostera[notebook]"

Build from source

System BLAS/LAPACK build:

python -m pip install maturin
python -m maturin develop --release

Static OpenBLAS build:

python -m maturin develop --release --no-default-features --features openblas-static

More common workflows

Simple workflow

import numpy as np
import clostera

rng = np.random.default_rng(7)
vectors = rng.normal(size=(100_000, 128)).astype(np.float32)

clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform(vectors)

print(clusterer.selected_k_)  # selected K = selected number of clusters

Known number-of-clusters (K) workflow

clusterer = clostera.Clusterer(k=known_k)  # known_k = desired number of clusters
labels = clusterer.fit_transform(vectors)

Fastest throughput workflow with a known number of clusters (K)

clusterer = clostera.Clusterer(k=known_k, fastest=True)  # known_k = desired number of clusters
labels = clusterer.fit_transform(vectors)

Predict on new vectors

clusterer = clostera.Clusterer(k=known_k)  # known_k = desired number of clusters
clusterer.fit(vectors)
labels = clusterer.transform(vectors[:1024])

Parquet workflow

clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform("vectors.parquet")

Out-of-core raw-vector workflow

When the original float vectors do not fit in RAM, pass a parquet path or a numpy.memmap-backed matrix and set max_ram_bytes.

clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform(
    "vectors.parquet",
    max_ram_bytes=8 << 30,
)

With max_ram_bytes, clostera keeps the training sample bounded, streams raw vectors in batches during encoding, and automatically spills PQ codes to a temporary memmap when needed. The raw vector matrix no longer needs to fit in RAM all at once. If you already materialized the data as a normal in-memory ndarray, clostera can only bound its own additional working set; for truly out-of-core runs, use parquet or numpy.memmap.
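Mechanically, a RAM budget like this implies batched row iteration. A hypothetical helper showing the idea (illustrative only, not clostera's internals):

```python
import numpy as np

def iter_batches(mat: np.ndarray, max_ram_bytes: int):
    """Yield row batches of `mat` (e.g. a numpy.memmap) whose in-memory
    size stays under `max_ram_bytes`. Hypothetical illustration."""
    row_bytes = mat.shape[1] * mat.dtype.itemsize
    rows = max(1, max_ram_bytes // row_bytes)
    for start in range(0, mat.shape[0], rows):
        yield np.asarray(mat[start:start + rows])
```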

Advanced API

Most users should start with Clusterer. The lower-level building blocks are still available when you want to:

  • reuse encoded PQ codes across many clustering runs
  • fit encoders and clusterers separately
  • switch explicitly between plain PQ and OPQ
  • tune encoder-specific and clusterer-specific parameters independently

Use Clusterer(fastest=True) when you want the fastest high-level path. Use plain PQEncoder and PQKMeans when you need that same plain-PQ behavior with explicit control. Use OPQEncoder and OPQMeans when reconstruction fidelity matters more and the data has strong cross-subspace correlation.

If you omit num_subquantizers, clostera infers a sensible default from the input dimensionality. For typical embeddings that lands near sqrt(D) code bytes while keeping each subvector wide enough to stay stable.
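A hypothetical reconstruction of that default (the exact rule clostera uses may differ): pick a divisor of D close to sqrt(D), so the code stays compact while each subvector keeps enough dimensions:

```python
import math

def infer_num_subquantizers(dim: int) -> int:
    """Sketch of a sqrt(D)-style default: choose a divisor of `dim`
    closest to sqrt(dim) so subvectors stay evenly sized. Illustrative."""
    target = math.isqrt(dim)
    divisors = [m for m in range(1, dim + 1) if dim % m == 0]
    return min(divisors, key=lambda m: abs(m - target))

print(infer_num_subquantizers(2048))
```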

encoder = clostera.PQEncoder()
encoder.fit(vectors)
codes = encoder.transform(vectors)

clusterer = clostera.PQKMeans(encoder=encoder, k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform(codes)

Showcase notebook

The repository includes a walkthrough notebook designed for readers who want the full visual story:

The committed notebook embeds its static figures directly, so the visuals render in GitHub and standalone notebook viewers without depending on external image paths.

It covers:

  • the high-level Clusterer workflow
  • automatic number-of-clusters (K) selection with k=None
  • parquet workflows
  • toy clustering visualization
  • plain PQ versus OPQ reconstruction quality
  • the advanced encoder/clusterer split when you need it
  • cross-dataset benchmark comparisons
  • the large-scale 10M x 2048 checkpoint
  • K (number of clusters) and N scaling sweeps

Parameter reference

In the API tables below, PathLike means a plain path string or a pathlib.Path object.

Clusterer

Clusterer is the default high-level API. It hides the encoder/clusterer split and gives the common workflow a simple fit, transform, fit_transform, fit_predict, and predict surface. By default it uses the quality-first OPQ path; pass fastest=True when you want the maximum-throughput plain-PQ path instead.

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| k | int \| None | None | Number of target clusters. Here K means the number of clusters. None enables automatic number-of-clusters selection. |
| fastest | bool | False | Turn off OPQ and use the maximum-throughput plain-PQ path. This usually lowers reconstruction quality but can reduce total fit time substantially on large runs. |
| num_subquantizers | int \| None | None | Optional PQ subspace count. When omitted, clostera infers a deterministic default from the input dimensionality. |
| codebook_size | int | 256 | Number of codewords per subspace. |
| iterations | int | 20 | Shared iteration budget for the simple high-level API. |
| seed | int | 0 | Deterministic seed. |
| opq_iterations | int | 3 | OPQ refinement steps used on the default quality-first path. When fastest=True, the current code always uses plain PQ and ignores this setting. |
| verbose | bool | False | Emit inertia diagnostics during fitting. |
| lookup_table_bytes | int | 1 << 30 | Memory budget for code-domain lookup tables. Larger budgets favor faster assignment. |
| auto_k_method | str | "centroid_silhouette" | Automatic-number-of-clusters (K) scoring rule. Supported values are "centroid_silhouette", "davies_bouldin", "elbow", and "bic". |
| auto_k_candidates | list[int] \| tuple[int, ...] \| np.ndarray \| None | None | Explicit candidate K values (cluster counts) to test when k=None. If omitted, clostera builds a default candidate template automatically, including practical values such as 4, 6, 8, 12, 16, 24, and 32 when the dataset size supports them. |
| auto_k_min | int | 2 | Lower bound for automatically generated candidate values when auto_k_candidates is omitted. |
| auto_k_max | int \| None | None | Upper bound for automatically generated candidate values when auto_k_candidates is omitted. |
| auto_k_step | int \| None | None | Optional arithmetic step for generated candidates. If omitted, clostera uses a baked-in candidate template. |
| auto_k_sample_rows | int | 16_384 | Number of PQ codes sampled for the Rust-side candidate analysis pass. |

Clusterer.fit(...), transform(...), fit_transform(...), fit_predict(...), predict(...)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| data | np.ndarray \| PathLike | required | Raw float vectors as an array, parquet path, or numpy.memmap-backed matrix. |
| parquet_column | str \| None | None | Specific parquet vector column. |
| batch_size | int | 65_536 | Parquet streaming batch size. |
| codes_output_path | PathLike \| None | None | Optional memmap destination when raw parquet input must be encoded first. |
| max_ram_bytes | int \| None | None | Optional RAM budget for bounded-memory raw-vector workflows. |

Advanced access after fitting:

  • encoder_: the fitted PQEncoder or OPQEncoder
  • clusterer_: the fitted PQKMeans or OPQMeans
  • labels_, cluster_centers_, inertia_history_, selected_k_, k_selection_

Advanced low-level API

The classes below expose the encoder/clusterer split directly. Reach for them when you want to reuse PQ codes, separate training phases, or tune encoder-specific and clusterer-specific parameters independently.

PQEncoder

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| num_subquantizers | int \| None | None | Number of PQ subspaces M. When omitted, clostera infers a deterministic default from the input dimensionality. Explicit values still require the dimensionality to be divisible by M. |
| codebook_size | int | 256 | Number of codewords per subspace Ks. Supported range is 2..=256. |
| iterations | int | 20 | Number of Lloyd iterations for subspace k-means training. |
| seed | int | 0 | Deterministic seed used for initialization fallback and reproducible training behavior. |
| opq_iterations | int | 0 | Number of OPQ refinement steps. 0 keeps plain PQ, >0 learns an orthogonal rotation before final PQ training. |

OPQEncoder

OPQEncoder has the same API and runtime methods as PQEncoder, but defaults opq_iterations to 3.
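Conceptually, each OPQ refinement step solves an orthogonal Procrustes problem: find the rotation that best aligns the data with its current PQ reconstruction. A minimal NumPy sketch of one such step (illustrative only; the Rust implementation batches and iterates this together with codebook retraining):

```python
import numpy as np

def refine_rotation(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """One OPQ-style refinement step (sketch): the orthogonal R minimizing
    ||X @ R - Y||_F, via the orthogonal Procrustes solution."""
    u, _, vt = np.linalg.svd(X.T @ Y)
    return u @ vt  # guaranteed orthogonal
```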

PQEncoder.fit(...)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| data | np.ndarray \| PathLike | required | A dense float32 matrix or a parquet path. |
| parquet_column | str \| None | None | Specific parquet column to treat as the vector column. |
| batch_size | int | 65_536 | Batch size for parquet streaming. |
| train_rows | int \| None | None | Number of deterministic training rows to sample. With in-memory arrays, omitting this uses the full matrix unless max_ram_bytes is set. |
| max_ram_bytes | int \| None | None | Optional RAM budget for the training sample plus OPQ workspace. When set, large parquet or memmap-backed inputs are trained from a bounded deterministic sample. |

PQEncoder.transform(...)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| data | np.ndarray \| PathLike | required | Dense vectors or parquet input. |
| parquet_column | str \| None | None | Specific parquet vector column. |
| batch_size | int | 65_536 | Parquet streaming batch size. |
| output_path | PathLike \| None | None | Optional destination for a memory-mapped uint8 code matrix. |
| max_ram_bytes | int \| None | None | Optional RAM budget for batched encoding. Large raw-vector inputs are processed in chunks; if codes would not fit in RAM, provide output_path or call PQKMeans.fit(...) directly. |

PQEncoder.fit_transform(...)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| data | np.ndarray \| PathLike | required | A dense float32 matrix or a parquet path. |
| parquet_column | str \| None | None | Specific parquet column to treat as the vector column. |
| batch_size | int | 65_536 | Parquet streaming batch size. |
| train_rows | int \| None | None | Number of deterministic training rows to sample before encoding. |
| output_path | PathLike \| None | None | Optional destination for a memory-mapped uint8 code matrix produced by the transform phase. |
| max_ram_bytes | int \| None | None | Optional RAM budget applied to both training and encoding. |

PQEncoder.inverse_transform(...)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| codes | np.ndarray | required | A 2D PQ code matrix with shape (rows, num_subquantizers). Returns decoded float32 vectors. |

PQKMeans

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| encoder | PQEncoder | required | Trained encoder that defines the codebooks. |
| k | int \| None | None | Number of target clusters. Here K means the number of clusters. None enables Rust-side automatic number-of-clusters selection over candidate values in PQ code space. |
| iterations | int | 20 | Number of clustering update rounds. |
| seed | int | 0 | Deterministic seed for cluster-center initialization. |
| verbose | bool | False | Emit inertia diagnostics during fitting. |
| lookup_table_bytes | int | 1 << 30 | Memory budget for code-domain lookup tables. Larger budgets favor faster assignment. |
| auto_k_method | str | "centroid_silhouette" | Automatic-number-of-clusters (K) scoring rule. Supported values are "centroid_silhouette", "davies_bouldin", "elbow", and "bic". |
| auto_k_candidates | list[int] \| tuple[int, ...] \| np.ndarray \| None | None | Explicit candidate K values (cluster counts) to test when k=None. If omitted, clostera builds a default candidate template automatically, including practical values such as 4, 6, 8, 12, 16, 24, and 32 when the dataset size supports them. |
| auto_k_min | int | 2 | Lower bound for automatically generated candidate values when auto_k_candidates is omitted. |
| auto_k_max | int \| None | None | Upper bound for automatically generated candidate values when auto_k_candidates is omitted. |
| auto_k_step | int \| None | None | Optional arithmetic step for generated candidates. If omitted, clostera uses a baked-in candidate template. |
| auto_k_sample_rows | int | 16_384 | Number of PQ codes sampled for the Rust-side candidate analysis pass. |

OPQMeans

OPQMeans mirrors PQKMeans, but treats OPQ as the default rather than an extra knob. If you do not pass encoder=, it lazily creates and fits an OPQEncoder from the raw vectors or parquet source on first fit(...), fit_predict(...), or fit_transform(...). If you do pass encoder=, the current code requires it to have been trained with opq_iterations > 0.

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| encoder | PQEncoder \| None | None | Optional pre-trained OPQ encoder. If omitted, OPQMeans builds one automatically. |
| num_subquantizers | int \| None | None | Optional encoder-side PQ subspace count when encoder is omitted. |
| codebook_size | int | 256 | Optional encoder-side codebook size when encoder is omitted. |
| encoder_iterations | int | 20 | Encoder training iterations used when encoder is omitted. |
| seed | int | 0 | Deterministic seed shared by the implicit encoder and the clusterer. |
| opq_iterations | int | 3 | OPQ refinement steps used by the implicit encoder. |
| k | int \| None | None | Number of target clusters. Here K means the number of clusters. None enables Rust-side automatic number-of-clusters selection over candidate values in PQ code space. |
| iterations | int | 20 | Number of clustering update rounds. |
| verbose | bool | False | Emit inertia diagnostics during fitting. |
| lookup_table_bytes | int | 1 << 30 | Memory budget for code-domain lookup tables. Larger budgets favor faster assignment. |
| auto_k_method | str | "centroid_silhouette" | Automatic-number-of-clusters (K) scoring rule. Supported values are "centroid_silhouette", "davies_bouldin", "elbow", and "bic". |
| auto_k_candidates | list[int] \| tuple[int, ...] \| np.ndarray \| None | None | Explicit candidate K values (cluster counts) to test when k=None. If omitted, clostera builds a default candidate template automatically, including practical values such as 4, 6, 8, 12, 16, 24, and 32 when the dataset size supports them. |
| auto_k_min | int | 2 | Lower bound for automatically generated candidate values when auto_k_candidates is omitted. |
| auto_k_max | int \| None | None | Upper bound for automatically generated candidate values when auto_k_candidates is omitted. |
| auto_k_step | int \| None | None | Optional arithmetic step for generated candidates. If omitted, clostera uses a baked-in candidate template. |
| auto_k_sample_rows | int | 16_384 | Number of PQ codes sampled for the Rust-side candidate analysis pass. |

OPQMeans uses the same runtime method signatures as PQKMeans: fit(...), transform(...), fit_transform(...), fit_predict(...), and predict(...).

PQKMeans.fit(...), transform(...), fit_transform(...), fit_predict(...), predict(...)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| data | np.ndarray \| PathLike | required | Either raw vectors or precomputed PQ codes. |
| parquet_column | str \| None | None | Specific parquet vector column. |
| batch_size | int | 65_536 | Parquet streaming batch size. |
| codes_output_path | PathLike \| None | None | Optional memmap destination when raw parquet input must be encoded first. |
| max_ram_bytes | int \| None | None | Optional RAM budget for encoding raw vectors into PQ codes before clustering. When set and no codes_output_path is supplied, clostera creates a temporary memmap automatically. |

When k=None, fitting also populates:

  • selected_k_: the final chosen cluster count (K)
  • k_selection_: the full Rust-side selection report, including the tested candidate values and per-method scores

Advanced runtime knob

| Environment variable | Meaning |
| --- | --- |
| CLOSTERA_ROTATION_BATCH_MIB | Override the default OPQ rotation batch target in MiB for benchmarking or machine-specific tuning. |

Reproducing the benchmark artifacts

Generate a deterministic synthetic dataset

python scripts/generate_synthetic_dataset.py \
  --output-dir .artifacts/block-mixed-200k-2048 \
  --distribution block_mixed \
  --rows 200000 \
  --dim 2048 \
  --clusters 64 \
  --seed 11

Compare the original repo and clostera

python scripts/compare_impls.py \
  --dataset-dir .artifacts/block-mixed-200k-2048 \
  --original-python "$(which python)" \
  --enhanced-python "$(which python)" \
  --train-rows 32768 \
  --metric-sample-rows 32768 \
  --num-subquantizers 64 \
  --codebook-size 64 \
  --pq-iterations 6 \
  --cluster-k 64 \
  --cluster-iterations 4 \
  --opq-iterations 3 \
  --blas-threads 24 \
  --omp-threads 24 \
  --rayon-threads 24 \
  --rotation-batch-mib 32 \
  --output-json .artifacts/block-mixed-200k-2048/compare.json

Run the K (number of clusters) sweep

python scripts/benchmark_k_sweep.py \
  --dataset-dir .artifacts/k-sweep-block-mixed-200k-2048 \
  --output-json benchmarks/results/k-sweep.json \
  --original-python "$(which python)" \
  --enhanced-python "$(which python)" \
  --force

Run the N sweep

python scripts/benchmark_n_sweep.py \
  --dataset-dir .artifacts/n-sweep-block-mixed-800k-2048 \
  --output-json benchmarks/results/n-sweep.json \
  --original-python "$(which python)" \
  --enhanced-python "$(which python)" \
  --force

Run the full deterministic distribution suite

python scripts/benchmark_suite.py \
  --output-dir .artifacts/benchmark-suite \
  --original-python "$(which python)" \
  --enhanced-python "$(which python)" \
  --blas-threads 24 \
  --omp-threads 24 \
  --rayon-threads 24 \
  --rotation-batch-mib 32 \
  --force

Run the automatic number-of-clusters (K) selection sweep

python scripts/evaluate_auto_k_methods.py \
  --output-json benchmarks/results/auto-k-methods.json \
  --force

Render the README and notebook figures

python scripts/render_benchmark_assets.py \
  --suite-json benchmarks/results/benchmark-suite.json \
  --large-json benchmarks/results/large-scale-10m.json \
  --k-sweep-json benchmarks/results/k-sweep.json \
  --n-sweep-json benchmarks/results/n-sweep.json \
  --auto-k-json benchmarks/results/auto-k-methods.json \
  --output-dir docs/assets

Packaging and release

The repository already includes publication artifacts for:

  • manylinux_2_28 wheels for x86_64 and aarch64
  • macOS wheels for x86_64 and arm64
  • CPython 3.10 through 3.13
  • source distributions

Relevant files:

  • .github/workflows/ci.yml
  • .github/workflows/release.yml
  • rust-toolchain.toml

The release workflow builds wheels with openblas-static enabled so binary installs are as self-contained as practical.

Releasing to PyPI

The PyPI project name is clostera.

Once the one-time PyPI Trusted Publisher setup is done for:

  • owner: BaseModelAI
  • repository: clostera
  • workflow: .github/workflows/release.yml
  • environment: pypi

the normal release path is:

python scripts/release.py 1.0.3 --commit --tag --push

That updates the version in the release metadata, creates the release commit, creates tag v1.0.3, and pushes both to origin. The tag push triggers the GitHub release workflow, which builds the wheels and publishes them to PyPI.

Original project and related work

Original implementation

Core papers behind this repo

Useful related reading

Verification

Current local verification commands:

python -m maturin develop --release
cargo test --release
pytest -q
cargo check --no-default-features --features openblas-static
cargo bench --bench core_bench

About

Billion scale vector clustering. One Machine. Zero GPUs.
