perf: Fix near-miss penalty in _morton_order with hybrid ceiling+argsort strategy#3718

Open
mkitti wants to merge 5 commits into zarr-developers:main from mkitti:mkitti-non-power-of-two-shards

Conversation


@mkitti mkitti commented Feb 21, 2026

Summary

Replaces the floor-hypercube + scalar fallback in _morton_order with a
hybrid ceiling + argsort strategy that eliminates the near-miss
performance penalty for shard shapes just above a power-of-2 (e.g.
(33,33,33)).

Strategy selection

Two fully-vectorized strategies are used depending on the overgeneration
ratio n_z / n_total, where n_z is the size of the ceiling power-of-2
hypercube and n_total = product(chunk_shape):

  • Ceiling (n_z ≤ 4 * n_total): decode all n_z Morton codes with
    decode_morton_vectorized, then filter to in-bounds coordinates. Works
    well when overgeneration is small (e.g. (30,30,30): ratio = 1.21).

  • Argsort (n_z > 4 * n_total): enumerate all n_total valid
    coordinates directly via np.meshgrid, encode each to a Morton code
    using vectorized bit manipulation, then np.argsort by code. Avoids
    the large overgeneration for near-miss shapes (e.g. (33,33,33):
    n_z = 262,144 vs n_total = 35,937, ratio ≈ 7.3×). Cost is
    O(n_total · bits + n_total log n_total) instead of
    O(n_z · bits) = O(8 · n_total · bits).
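A minimal NumPy sketch of this selection rule and both strategies follows. Function names, the threshold constant, and the exact bit-interleaving layout here are illustrative assumptions, not the PR's actual code; `decode_morton_vectorized` in the repo is stood in for by an inline decode loop.

```python
import numpy as np
from math import prod


def _encode_morton_vectorized(coords: np.ndarray, bits: int) -> np.ndarray:
    """Interleave the bits of each coordinate row into a Morton code."""
    coords = coords.astype(np.uint64)
    ndim = coords.shape[1]
    codes = np.zeros(len(coords), dtype=np.uint64)
    for b in range(bits):
        for d in range(ndim):
            codes |= ((coords[:, d] >> np.uint64(b)) & np.uint64(1)) << np.uint64(b * ndim + d)
    return codes


def morton_order_hybrid(chunk_shape: tuple) -> np.ndarray:
    """Sketch of the hybrid ceiling/argsort selection described above."""
    ndim = len(chunk_shape)
    # Side length of the ceiling power-of-2 hypercube (e.g. 33 -> 64).
    side = 1 << max(int(s - 1).bit_length() for s in chunk_shape)
    n_z = side**ndim
    n_total = prod(chunk_shape)
    bits = side.bit_length() - 1  # bits per dimension

    if n_z <= 4 * n_total:
        # Ceiling strategy: decode every code 0..n_z-1, keep in-bounds coords.
        codes = np.arange(n_z, dtype=np.uint64)
        coords = np.zeros((n_z, ndim), dtype=np.uint64)
        for b in range(bits):
            for d in range(ndim):
                coords[:, d] |= ((codes >> np.uint64(b * ndim + d)) & np.uint64(1)) << np.uint64(b)
        mask = np.all(coords < np.array(chunk_shape, dtype=np.uint64), axis=1)
        return coords[mask]

    # Argsort strategy: enumerate only the n_total valid coordinates,
    # encode each to a Morton code, then sort by code.
    grids = np.meshgrid(*[np.arange(s, dtype=np.uint64) for s in chunk_shape], indexing="ij")
    coords = np.stack([g.ravel() for g in grids], axis=1)
    order = np.argsort(_encode_morton_vectorized(coords, bits), kind="stable")
    return coords[order]
```

Both branches produce the same ordering for a given shape, since decoding ascending codes and sorting encoded codes are inverse views of the same bijection; only the work done differs.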

The old floor-hypercube + scalar loop approach had an even worse
near-miss penalty: iterating n_z − n_floor scalar decode_morton
calls (229,376 for (33,33,33)) at ~3 µs each totalled ~820 ms.

Benchmark results

Baseline is the previous floor-hypercube + scalar implementation
(see PR #3717 for raw baseline numbers). After this PR:

test_morton_order_iter — pure Morton computation, LRU cache cleared each round

| Shape | Elements | Type | Before | After | Speedup |
|---|---|---|---|---|---|
| (8,8,8) | 512 | power-of-2 | 0.45 ms | 0.45 ms | 1.0× |
| (16,16,16) | 4,096 | power-of-2 | 3.6 ms | 3.6 ms | 1.0× |
| (32,32,32) | 32,768 | power-of-2 | 28.9 ms | 29.0 ms | 1.0× |
| (10,10,10) | 1,000 | non-power-of-2 | 9.6 ms | 0.90 ms | 11× |
| (20,20,20) | 8,000 | non-power-of-2 | 88.2 ms | 7.4 ms | 12× |
| (30,30,30) | 27,000 | non-power-of-2 | 125.6 ms | 23.9 ms | 5.3× |
| (33,33,33) | 35,937 | near-miss (+1 above 32³) | 767 ms | 33.2 ms | 23× |

Power-of-2 shapes are unchanged. Non-power-of-2 and near-miss shapes
see 5–23× speedups.

test_sharded_morton_write_single_chunk — write 1 chunk to a large shard, cache cleared each round

| Shape | Before | After | Speedup |
|---|---|---|---|
| (32,32,32) | 35.7 ms | 34.7 ms | 1.0× |
| (30,30,30) | 127.5 ms | 28.9 ms | 4.4× |
| (33,33,33) | 767.8 ms | 39.5 ms | 19× |

test_sharded_morton_single_chunk — read 1 chunk (cache warm after first access)

Unchanged and fast across all shapes (~0.65–0.70 ms), since reads
benefit from the _morton_order LRU cache after the first access.

Test plan

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

All existing sharding and indexing tests pass locally
(tests/test_codecs/test_sharding.py, tests/test_indexing.py).
Correctness verified against the scalar decode_morton reference for
all benchmark shapes including (10,10,10), (20,20,20), (30,30,30),
and (33,33,33).
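A cross-check of this kind can be sketched as follows. The scalar encoder below is a hypothetical stand-in for the repo's `decode_morton` reference, used to assert that a candidate ordering visits coordinates in strictly increasing Morton-code order.

```python
import numpy as np


def encode_morton_scalar(coord: tuple, ndim: int, bits: int) -> int:
    # Scalar reference: interleave the bits of a single coordinate tuple.
    code = 0
    for b in range(bits):
        for d in range(ndim):
            code |= ((coord[d] >> b) & 1) << (b * ndim + d)
    return code


def is_valid_morton_order(order: np.ndarray, bits: int) -> bool:
    # A valid Morton ordering has strictly increasing codes.
    ndim = order.shape[1]
    codes = [encode_morton_scalar(tuple(int(x) for x in c), ndim, bits) for c in order]
    return all(a < b for a, b in zip(codes, codes[1:]))


# Example: verify a hand-built ordering for a (2, 2) shard.
assert is_valid_morton_order(np.array([[0, 0], [1, 0], [0, 1], [1, 1]]), bits=1)
```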

mkitti and others added 5 commits February 20, 2026 16:27
Add (30,30,30) to large_morton_shards and (10,10,10), (20,20,20),
(30,30,30) to morton_iter_shapes to benchmark the scalar fallback path
for non-power-of-2 shapes, which are not fully covered by the vectorized
hypercube path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents the performance penalty when a shard shape is just above a
power-of-2 boundary, causing n_z to jump from 32,768 to 262,144.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ort strategy

For shapes just above a power-of-2 (e.g. (33,33,33)), the ceiling-only
approach generates n_z=262,144 Morton codes for only 35,937 valid
coordinates (7.3× overgeneration). The floor+scalar approach is even
worse since the scalar loop iterates n_z-n_floor times (229,376 for
(33,33,33)), not n_total-n_floor.

The fix: when n_z > 4*n_total, use an argsort strategy that enumerates
all n_total valid coordinates via meshgrid, encodes each to a Morton code
using vectorized bit manipulation, then sorts by Morton code. This avoids
the large overgeneration while remaining fully vectorized.

Result for test_morton_order_iter:
  (30,30,30): 24ms  (ceiling, ratio=1.21)
  (32,32,32): 28ms  (ceiling, ratio=1.00)
  (33,33,33): 32ms  (argsort, ratio=7.3 → fixed from ~820ms with scalar)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>