classify: keep large-array sampler O(num_sample) in host memory (#3412)#3416
Open
brendancol wants to merge 1 commit into
Open
classify: keep large-array sampler O(num_sample) in host memory (#3412)#3416brendancol wants to merge 1 commit into
brendancol wants to merge 1 commit into
Conversation
_generate_sample_indices used RandomState.choice(replace=False) on the >10M-element branch, which builds a full arange(num_data) permutation internally. Peak driver-host memory scaled with num_data instead of num_sample, so the sample-index step OOMed on very large dask arrays - the opposite of the branch's documented intent. Switch the large branch to np.random.default_rng().choice, which uses Floyd's algorithm and stays O(num_sample) when num_sample << num_data. It remains seeded and deterministic, and the small-array reproducibility branch is unchanged. Peak for a 20k sample from a 20M population drops from ~160 MB to under 1 MB. Backs the dask and dask+cupy paths of natural_breaks, maximum_breaks, quantile, percentiles, and box_plot. Closes #3412
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #3412.
_generate_sample_indiceshas a branch for large arrays (num_data > 10_000_000) that is meant to keep host memory proportional tonum_sample. It usednp.random.RandomState.choice(num_data, size=num_sample, replace=False), which builds a fullarange(num_data)permutation internally, so peak driver-host memory scaled withnum_datarather thannum_sample. On a large dask array the sample-index step OOMs the driver before any chunk is read. The docstring claimed the branch was "O(num_sample) rather than O(num_data)", the opposite of its actual behaviour.This swaps the large branch to
np.random.default_rng(seed).choice(..., replace=False). NumPy'sGenerator.choiceuses Floyd's algorithm and stays O(num_sample) whennum_sampleis much smaller thannum_data. The sampler is still seeded and deterministic. The small-array branch (<= 10M elements), which builds a fulllinspaceto stay reproducible against the numpy backend, is unchanged.The sampler backs the dask and dask+cupy paths of
natural_breaks,maximum_breaks,quantile,percentiles, andbox_plot.Measured peak for a 20k sample from a 20M-element population drops from ~160 MB to under 1 MB.
Added two regression tests: one asserts the large branch stays sub-linear in host memory (tracemalloc peak under 20 MB at 20M population), and one asserts it stays deterministic across calls. All 93 classify tests pass, including the cupy and dask+cupy paths on a CUDA host.