classify: keep large-array sampler O(num_sample) in host memory (#3412) by brendancol · Pull Request #3416 · xarray-contrib/xarray-spatial

brendancol · 2026-06-20T09:53:13Z

Closes #3412.

_generate_sample_indices has a branch for large arrays (num_data > 10_000_000) that is meant to keep host memory proportional to num_sample. It used np.random.RandomState.choice(num_data, size=num_sample, replace=False), which builds a full arange(num_data) permutation internally, so peak driver-host memory scaled with num_data rather than num_sample. On a large dask array, the sample-index step could therefore run the driver out of memory before any chunk was read. The docstring claimed the branch was "O(num_sample) rather than O(num_data)", the opposite of its actual behaviour.

This swaps the large branch to np.random.default_rng(seed).choice(..., replace=False). NumPy's Generator.choice uses Floyd's algorithm and stays O(num_sample) when num_sample is much smaller than num_data. The sampler is still seeded and deterministic. The small-array branch (<= 10M elements), which builds a full linspace to stay reproducible against the numpy backend, is unchanged.
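A minimal sketch of the replacement call, for illustration only. The function name sample_indices_large and its default seed are hypothetical and are not taken from the xarray-spatial source; only np.random.default_rng and Generator.choice are real NumPy APIs.

```python
import numpy as np


def sample_indices_large(num_data, num_sample, seed=0):
    """Hypothetical stand-in for the large-array branch of the sampler.

    Draws num_sample distinct indices from range(num_data) using a
    seeded Generator, so repeated calls with the same seed agree.
    """
    rng = np.random.default_rng(seed)
    # replace=False requests sampling without replacement.
    return rng.choice(num_data, size=num_sample, replace=False)


if __name__ == "__main__":
    idx = sample_indices_large(20_000_000, 20_000, seed=42)
    print(idx.shape, idx.dtype)
```

Note that a Generator and a legacy RandomState seeded with the same value do not produce the same indices, so determinism here means "same result across calls", not "same result as before the change".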

The sampler backs the dask and dask+cupy paths of natural_breaks, maximum_breaks, quantile, percentiles, and box_plot.

Measured peak host memory for a 20k sample from a 20M-element population drops from ~160 MB (consistent with one 8-byte index per population element) to under 1 MB.

Added two regression tests: one asserts the large branch stays sub-linear in host memory (tracemalloc peak under 20 MB at 20M population), and one asserts it stays deterministic across calls. All 93 classify tests pass, including the cupy and dask+cupy paths on a CUDA host.
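The two regression checks described above could look roughly like the sketch below. This is not the test code from the PR: the helper name peak_bytes_for_sample is hypothetical, and it calls NumPy directly instead of _generate_sample_indices, whose signature is not shown here. The 20 MB bound and the 20M / 20k sizes come from the description.

```python
import tracemalloc

import numpy as np


def peak_bytes_for_sample(num_data, num_sample, seed=0):
    """Return (indices, peak traced bytes) for one seeded draw.

    Hypothetical helper: measures only the sampling call itself.
    """
    tracemalloc.start()
    try:
        rng = np.random.default_rng(seed)
        idx = rng.choice(num_data, size=num_sample, replace=False)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return idx, peak


def test_large_branch_memory_is_sublinear():
    # A full 20M-element index array would be far above this bound.
    _, peak = peak_bytes_for_sample(20_000_000, 20_000)
    assert peak < 20 * 1024 * 1024


def test_large_branch_is_deterministic():
    a, _ = peak_bytes_for_sample(20_000_000, 20_000, seed=7)
    b, _ = peak_bytes_for_sample(20_000_000, 20_000, seed=7)
    assert np.array_equal(a, b)
```

tracemalloc is used because NumPy reports its array allocations to it, so the measured peak would include any temporary buffer the size of the population.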

_generate_sample_indices used RandomState.choice(replace=False) on the >10M-element branch, which builds a full arange(num_data) permutation internally. Peak driver-host memory scaled with num_data instead of num_sample, so the sample-index step OOMed on very large dask arrays, the opposite of the branch's documented intent.

Switch the large branch to np.random.default_rng().choice, which uses Floyd's algorithm and stays O(num_sample) when num_sample << num_data. It remains seeded and deterministic, and the small-array reproducibility branch is unchanged. Peak for a 20k sample from a 20M population drops from ~160 MB to under 1 MB.

Backs the dask and dask+cupy paths of natural_breaks, maximum_breaks, quantile, percentiles, and box_plot.

Closes #3412.

brendancol wants to merge 1 commit into main from deep-sweep-performance-classify-2026-06-20
