perf: Fix near-miss penalty in _morton_order with hybrid ceiling+argsort strategy #3718
Open
mkitti wants to merge 5 commits into zarr-developers:main from
Add (30,30,30) to large_morton_shards and (10,10,10), (20,20,20), (30,30,30) to morton_iter_shapes to benchmark the scalar fallback path for non-power-of-2 shapes, which are not fully covered by the vectorized hypercube path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents the performance penalty when a shard shape is just above a power-of-2 boundary, causing n_z to jump from 32,768 to 262,144. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ort strategy

For shapes just above a power-of-2 (e.g. (33,33,33)), the ceiling-only approach generates n_z=262,144 Morton codes for only 35,937 valid coordinates (7.3× overgeneration). The floor+scalar approach is even worse since the scalar loop iterates n_z-n_floor times (229,376 for (33,33,33)), not n_total-n_floor.

The fix: when n_z > 4*n_total, use an argsort strategy that enumerates all n_total valid coordinates via meshgrid, encodes each to a Morton code using vectorized bit manipulation, then sorts by Morton code. This avoids the large overgeneration while remaining fully vectorized.

Result for test_morton_order_iter:
(30,30,30): 24ms (ceiling, ratio=1.21)
(32,32,32): 28ms (ceiling, ratio=1.00)
(33,33,33): 32ms (argsort, ratio=7.3 → fixed from ~820ms with scalar)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
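The argsort strategy described in the commit message above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function name `morton_order_argsort` is hypothetical, and it assumes a plain bit interleave over the enclosing power-of-2 hypercube (bit k of dimension d goes to position k * ndim + d), a non-empty shape, and codes that fit in a signed 64-bit integer.

```python
import numpy as np


def morton_order_argsort(shape):
    """Return the in-bounds coordinates of `shape`, sorted by Morton code.

    Hypothetical sketch; assumes len(shape) >= 1 and bits * ndim < 63.
    """
    ndim = len(shape)
    # Bits per dimension needed to index the largest axis (0 .. max-1).
    bits = (max(shape) - 1).bit_length()

    # Enumerate only the n_total valid coordinates, one row per coordinate.
    grids = np.meshgrid(
        *[np.arange(s, dtype=np.int64) for s in shape], indexing="ij"
    )
    coords = np.stack([g.ravel() for g in grids], axis=1)

    # Vectorized encode: interleave the coordinate bits into one code each.
    codes = np.zeros(coords.shape[0], dtype=np.int64)
    for k in range(bits):
        for d in range(ndim):
            codes |= ((coords[:, d] >> k) & 1) << (k * ndim + d)

    # Codes are unique, so sorting them yields the Morton traversal order.
    return coords[np.argsort(codes)]
```

For a (2, 2) shape this returns the rows (0,0), (1,0), (0,1), (1,1). For (33,33,33) it touches 35,937 coordinates rather than 262,144 codes.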
Summary
Replaces the floor-hypercube + scalar fallback in _morton_order with a hybrid ceiling + argsort strategy that eliminates the near-miss performance penalty for shard shapes just above a power-of-2 (e.g. (33,33,33)).

Strategy selection
Two fully-vectorized strategies are used depending on the overgeneration ratio n_z / n_total, where n_z is the size of the ceiling power-of-2 hypercube and n_total = product(chunk_shape):

Ceiling (n_z ≤ 4 * n_total): decode all n_z Morton codes with decode_morton_vectorized, then filter to in-bounds coordinates. Works well when overgeneration is small (e.g. (30,30,30): ratio = 1.21).

Argsort (n_z > 4 * n_total): enumerate all n_total valid coordinates directly via np.meshgrid, encode each to a Morton code using vectorized bit manipulation, then np.argsort by code. Avoids the large overgeneration for near-miss shapes (e.g. (33,33,33): n_z = 262,144 vs n_total = 35,937, ratio ≈ 7.3×). Cost is O(n_total · bits + n_total log n_total) instead of O(n_z · bits) = O(8 · n_total · bits).

The old floor-hypercube + scalar loop approach had an even worse near-miss penalty: iterating n_z − n_floor scalar decode_morton calls (229,376 for (33,33,33)) at ~3 µs each totalled ~820 ms.

Benchmark results
Baseline is the previous floor-hypercube + scalar implementation (see PR #3717 for raw baseline numbers). After this PR:

test_morton_order_iter: pure Morton computation, LRU cache cleared each round
Shapes: (8,8,8), (16,16,16), (32,32,32), (10,10,10), (20,20,20), (30,30,30), (33,33,33)

Power-of-2 shapes are unchanged. Non-power-of-2 and near-miss shapes see 5–23× speedups.
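To see which path each benchmarked shape would take, the overgeneration ratio can be computed directly. The snippet below is a small sketch under stated assumptions: the helper names `overgeneration` and `pick_strategy` are hypothetical, n_z is taken as the side of the smallest enclosing power-of-2 hypercube raised to the number of dimensions, and the threshold of 4 is the one quoted in the Summary.

```python
from math import prod


def overgeneration(shape):
    """Return (n_z, n_total, ratio) for a shape (hypothetical helper)."""
    # Side of the smallest power-of-2 hypercube that encloses the shape.
    side = 1 << (max(shape) - 1).bit_length()
    n_z = side ** len(shape)
    n_total = prod(shape)
    return n_z, n_total, n_z / n_total


def pick_strategy(shape, threshold=4):
    """Choose 'argsort' when the ceiling hypercube overgenerates too much."""
    n_z, n_total, _ = overgeneration(shape)
    return "argsort" if n_z > threshold * n_total else "ceiling"


shapes = [(8, 8, 8), (16, 16, 16), (32, 32, 32),
          (10, 10, 10), (20, 20, 20), (30, 30, 30), (33, 33, 33)]
for shape in shapes:
    n_z, n_total, ratio = overgeneration(shape)
    print(shape, n_z, n_total, f"{ratio:.2f}", pick_strategy(shape))
```

Under these assumptions (30,30,30) stays on the ceiling path (ratio ≈ 1.21) and (33,33,33) switches to argsort (ratio ≈ 7.3), matching the figures quoted in the Summary.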
test_sharded_morton_write_single_chunk: write 1 chunk to a large shard, cache cleared each round
Shapes: (32,32,32), (30,30,30), (33,33,33)

test_sharded_morton_single_chunk: read 1 chunk (cache warm after first access)
Unchanged and fast across all shapes (~0.65–0.70 ms), since reads benefit from the _morton_order LRU cache after the first access.

Test plan
All existing sharding and indexing tests pass locally (tests/test_codecs/test_sharding.py, tests/test_indexing.py).

Correctness verified against the scalar decode_morton reference for all benchmark shapes including (10,10,10), (20,20,20), (30,30,30), and (33,33,33).
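A check of this kind could look roughly like the sketch below, which compares a vectorized ceiling-style decode against a scalar reference. It is not the project's test code: all three function names are hypothetical stand-ins, the interleave layout (bit k of dimension d at position k * ndim + d) is an assumption, and small shapes are used so the pure-Python reference loop stays quick.

```python
import numpy as np


def decode_morton_scalar(z, ndim, bits):
    """Scalar reference: de-interleave one Morton code into ndim coordinates."""
    out = [0] * ndim
    for k in range(bits):
        for d in range(ndim):
            out[d] |= ((z >> (k * ndim + d)) & 1) << k
    return tuple(out)


def morton_order_reference(shape):
    """Walk every code in the enclosing hypercube, keeping in-bounds ones."""
    ndim = len(shape)
    bits = (max(shape) - 1).bit_length()
    order = []
    for z in range(1 << (bits * ndim)):
        coord = decode_morton_scalar(z, ndim, bits)
        if all(c < s for c, s in zip(coord, shape)):
            order.append(coord)
    return order


def morton_order_ceiling(shape):
    """Vectorized ceiling strategy: decode all n_z codes, then filter."""
    ndim = len(shape)
    bits = (max(shape) - 1).bit_length()
    z = np.arange(1 << (bits * ndim), dtype=np.int64)
    coords = np.zeros((z.size, ndim), dtype=np.int64)
    for k in range(bits):
        for d in range(ndim):
            coords[:, d] |= ((z >> (k * ndim + d)) & 1) << k
    in_bounds = np.all(coords < np.asarray(shape), axis=1)
    return coords[in_bounds]


# Compare the vectorized result with the scalar reference on small shapes.
for shape in [(4, 4), (5, 3), (10, 10, 10)]:
    fast = [tuple(row) for row in morton_order_ceiling(shape).tolist()]
    assert fast == morton_order_reference(shape), shape
```

Because the in-bounds filter preserves the ascending order of the codes, the filtered array is already in Morton order and no sort is needed on this path.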