classify: keep large-array sampler O(num_sample) in host memory (#3412) #3416

Open
brendancol wants to merge 1 commit into main from deep-sweep-performance-classify-2026-06-20

@brendancol

Closes #3412.

_generate_sample_indices has a branch for large arrays (num_data > 10_000_000) that is meant to keep host memory proportional to num_sample. It used np.random.RandomState.choice(num_data, size=num_sample, replace=False), which builds a full arange(num_data) permutation internally, so peak driver-host memory scaled with num_data rather than num_sample. On a large dask array the sample-index step OOMs the driver before any chunk is read. The docstring claimed the branch was "O(num_sample) rather than O(num_data)", the opposite of its actual behaviour.

This PR switches the large branch to np.random.default_rng(seed).choice(..., replace=False). When drawing without replacement, NumPy's Generator.choice uses Floyd's algorithm if num_sample is much smaller than num_data, so host memory stays O(num_sample). The sampler is still seeded and deterministic. The small-array branch (<= 10M elements), which builds a full linspace to stay reproducible against the numpy backend, is unchanged.
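A minimal sketch of what the large branch amounts to after the change. The function name `sample_indices_large` and the default seed are hypothetical stand-ins for illustration; the real `_generate_sample_indices` signature is not shown in this description.

```python
import numpy as np


def sample_indices_large(num_data, num_sample, seed=1234567890):
    """Hypothetical stand-in for the >10M-element branch.

    Draws num_sample distinct indices from range(num_data) without
    materialising an array of length num_data.
    """
    # A fresh Generator per call keeps the result a pure function of the seed.
    rng = np.random.default_rng(seed)
    # Passing an int as the population samples from range(num_data).
    return rng.choice(num_data, size=num_sample, replace=False)


if __name__ == "__main__":
    idx = sample_indices_large(20_000_000, 20_000)
    print(idx.shape, idx.dtype)
```

The old path, `np.random.RandomState(seed).choice(num_data, size=num_sample, replace=False)`, has the same call shape, which is why the swap is confined to one branch.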

The sampler backs the dask and dask+cupy paths of natural_breaks, maximum_breaks, quantile, percentiles, and box_plot.

Measured peak host memory for a 20k sample from a 20M-element population drops from ~160 MB (about the size of a 20M-element int64 index array at 8 bytes per element) to under 1 MB.

Added two regression tests: one asserts the large branch stays sub-linear in host memory (tracemalloc peak under 20 MB at 20M population), and one asserts it stays deterministic across calls. All 93 classify tests pass, including the cupy and dask+cupy paths on a CUDA host.
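The two regression tests could look roughly like the sketch below. It calls `default_rng(...).choice` directly instead of going through `_generate_sample_indices`, and the helper names `peak_bytes` and `draw` are assumptions made for illustration; the tests in this PR may be structured differently.

```python
import tracemalloc

import numpy as np

POPULATION = 20_000_000
SAMPLE = 20_000
LIMIT_BYTES = 20 * 1024 * 1024  # the 20 MB ceiling mentioned above


def draw(seed=0):
    # Stand-in for the large branch of the sampler (assumed, not the real call).
    return np.random.default_rng(seed).choice(POPULATION, size=SAMPLE, replace=False)


def peak_bytes(fn):
    """Run fn() and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def test_large_branch_memory_is_sublinear():
    # A full int64 index array of POPULATION elements would be ~160 MB,
    # so staying under 20 MB rules out an O(num_data) allocation.
    assert peak_bytes(draw) < LIMIT_BYTES


def test_large_branch_is_deterministic():
    # The same seed must give the same indices on every call.
    assert np.array_equal(draw(seed=42), draw(seed=42))
```

The 20 MB ceiling is deliberately loose: it sits well above the sub-megabyte peak reported here but far below the ~160 MB an O(num_data) regression would allocate.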

_generate_sample_indices used RandomState.choice(replace=False) on the
>10M-element branch, which builds a full arange(num_data) permutation
internally. Peak driver-host memory scaled with num_data instead of
num_sample, so the sample-index step OOMed on very large dask arrays -
the opposite of the branch's documented intent.

Switch the large branch to np.random.default_rng().choice, which uses
Floyd's algorithm and stays O(num_sample) when num_sample << num_data.
It remains seeded and deterministic, and the small-array reproducibility
branch is unchanged. Peak for a 20k sample from a 20M population drops
from ~160 MB to under 1 MB.

Backs the dask and dask+cupy paths of natural_breaks, maximum_breaks,
quantile, percentiles, and box_plot.

Closes #3412
Development

Successfully merging this pull request may close these issues.

classify: large-array sample-index path allocates O(num_data) on the driver and OOMs at scale

1 participant