perf: Fix near-miss penalty in _morton_order with hybrid ceiling+argsort strategy#3718

Open
mkitti wants to merge 5 commits into zarr-developers:main from mkitti:mkitti-non-power-of-two-shards

Conversation


@mkitti mkitti commented Feb 21, 2026

Summary

Replaces the floor-hypercube + scalar fallback in _morton_order with a
hybrid ceiling + argsort strategy that eliminates the near-miss
performance penalty for shard shapes just above a power-of-2 (e.g.
(33,33,33)).

Strategy selection

Two fully-vectorized strategies are used depending on the overgeneration
ratio n_z / n_total, where n_z is the size of the ceiling power-of-2
hypercube and n_total = product(chunk_shape):

  • Ceiling (n_z ≤ 4 * n_total): decode all n_z Morton codes with
    decode_morton_vectorized, then filter to in-bounds coordinates. Works
    well when overgeneration is small (e.g. (30,30,30): ratio = 1.21).

  • Argsort (n_z > 4 * n_total): enumerate all n_total valid
    coordinates directly via np.meshgrid, encode each to a Morton code
    using vectorized bit manipulation, then np.argsort by code. Avoids
    the large overgeneration for near-miss shapes (e.g. (33,33,33):
    n_z = 262,144 vs n_total = 35,937, ratio ≈ 7.3×). Cost is
    O(n_total · bits + n_total log n_total) instead of
    O(n_z · bits) = O(8 · n_total · bits).
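A minimal NumPy sketch of this selection rule and both strategies follows. Function names, the threshold constant, and the exact bit-interleaving layout here are illustrative assumptions, not the PR's actual code; `decode_morton_vectorized` in the repo is stood in for by an inline decode loop.

```python
import numpy as np
from math import prod


def _encode_morton_vectorized(coords: np.ndarray, bits: int) -> np.ndarray:
    """Interleave the bits of each coordinate row into a Morton code."""
    coords = coords.astype(np.uint64)
    ndim = coords.shape[1]
    codes = np.zeros(len(coords), dtype=np.uint64)
    for b in range(bits):
        for d in range(ndim):
            codes |= ((coords[:, d] >> np.uint64(b)) & np.uint64(1)) << np.uint64(b * ndim + d)
    return codes


def morton_order_hybrid(chunk_shape: tuple) -> np.ndarray:
    """Sketch of the hybrid ceiling/argsort selection described above."""
    ndim = len(chunk_shape)
    # Side length of the ceiling power-of-2 hypercube (e.g. 33 -> 64).
    side = 1 << max(int(s - 1).bit_length() for s in chunk_shape)
    n_z = side**ndim
    n_total = prod(chunk_shape)
    bits = side.bit_length() - 1  # bits per dimension

    if n_z <= 4 * n_total:
        # Ceiling strategy: decode every code 0..n_z-1, keep in-bounds coords.
        codes = np.arange(n_z, dtype=np.uint64)
        coords = np.zeros((n_z, ndim), dtype=np.uint64)
        for b in range(bits):
            for d in range(ndim):
                coords[:, d] |= ((codes >> np.uint64(b * ndim + d)) & np.uint64(1)) << np.uint64(b)
        mask = np.all(coords < np.array(chunk_shape, dtype=np.uint64), axis=1)
        return coords[mask]

    # Argsort strategy: enumerate only the n_total valid coordinates,
    # encode each to a Morton code, then sort by code.
    grids = np.meshgrid(*[np.arange(s, dtype=np.uint64) for s in chunk_shape], indexing="ij")
    coords = np.stack([g.ravel() for g in grids], axis=1)
    order = np.argsort(_encode_morton_vectorized(coords, bits), kind="stable")
    return coords[order]
```

Both branches produce the same ordering for a given shape, since decoding ascending codes and sorting encoded codes are inverse views of the same bijection; only the work done differs.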

The old floor-hypercube + scalar loop approach had an even worse
near-miss penalty: iterating n_z − n_floor scalar decode_morton
calls (229,376 for (33,33,33)) at ~3 µs each totalled ~820 ms.

Benchmark results

Baseline is the previous floor-hypercube + scalar implementation
(see PR #3717 for raw baseline numbers). After this PR:

test_morton_order_iter — pure Morton computation, LRU cache cleared each round

| Shape | Elements | Type | Before | After | Speedup |
|---|---|---|---|---|---|
| (8,8,8) | 512 | power-of-2 | 0.45 ms | 0.45 ms | 1.0× |
| (16,16,16) | 4,096 | power-of-2 | 3.6 ms | 3.6 ms | 1.0× |
| (32,32,32) | 32,768 | power-of-2 | 28.9 ms | 29.0 ms | 1.0× |
| (10,10,10) | 1,000 | non-power-of-2 | 9.6 ms | 0.90 ms | 11× |
| (20,20,20) | 8,000 | non-power-of-2 | 88.2 ms | 7.4 ms | 12× |
| (30,30,30) | 27,000 | non-power-of-2 | 125.6 ms | 23.9 ms | 5.3× |
| (33,33,33) | 35,937 | near-miss (+1 above 32³) | 767 ms | 33.2 ms | 23× |

Power-of-2 shapes are unchanged. Non-power-of-2 and near-miss shapes
see 5–23× speedups.

test_sharded_morton_write_single_chunk — write 1 chunk to a large shard, cache cleared each round

| Shape | Before | After | Speedup |
|---|---|---|---|
| (32,32,32) | 35.7 ms | 34.7 ms | 1.0× |
| (30,30,30) | 127.5 ms | 28.9 ms | 4.4× |
| (33,33,33) | 767.8 ms | 39.5 ms | 19× |

test_sharded_morton_single_chunk — read 1 chunk (cache warm after first access)

Unchanged and fast across all shapes (~0.65–0.70 ms), since reads
benefit from the _morton_order LRU cache after the first access.

Test plan

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

All existing sharding and indexing tests pass locally
(tests/test_codecs/test_sharding.py, tests/test_indexing.py).
Correctness verified against the scalar decode_morton reference for
all benchmark shapes including (10,10,10), (20,20,20), (30,30,30),
and (33,33,33).
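A cross-check of this kind can be sketched as follows. The scalar encoder below is a hypothetical stand-in for the repo's `decode_morton` reference, used to assert that a candidate ordering visits coordinates in strictly increasing Morton-code order.

```python
import numpy as np


def encode_morton_scalar(coord: tuple, ndim: int, bits: int) -> int:
    # Scalar reference: interleave the bits of a single coordinate tuple.
    code = 0
    for b in range(bits):
        for d in range(ndim):
            code |= ((coord[d] >> b) & 1) << (b * ndim + d)
    return code


def is_valid_morton_order(order: np.ndarray, bits: int) -> bool:
    # A valid Morton ordering has strictly increasing codes.
    ndim = order.shape[1]
    codes = [encode_morton_scalar(tuple(int(x) for x in c), ndim, bits) for c in order]
    return all(a < b for a, b in zip(codes, codes[1:]))


# Example: verify a hand-built ordering for a (2, 2) shard.
assert is_valid_morton_order(np.array([[0, 0], [1, 0], [0, 1], [1, 1]]), bits=1)
```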

mkitti and others added 5 commits February 20, 2026 16:27
Add (30,30,30) to large_morton_shards and (10,10,10), (20,20,20),
(30,30,30) to morton_iter_shapes to benchmark the scalar fallback path
for non-power-of-2 shapes, which are not fully covered by the vectorized
hypercube path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents the performance penalty when a shard shape is just above a
power-of-2 boundary, causing n_z to jump from 32,768 to 262,144.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ort strategy

For shapes just above a power-of-2 (e.g. (33,33,33)), the ceiling-only
approach generates n_z=262,144 Morton codes for only 35,937 valid
coordinates (7.3× overgeneration). The floor+scalar approach is even
worse since the scalar loop iterates n_z-n_floor times (229,376 for
(33,33,33)), not n_total-n_floor.

The fix: when n_z > 4*n_total, use an argsort strategy that enumerates
all n_total valid coordinates via meshgrid, encodes each to a Morton code
using vectorized bit manipulation, then sorts by Morton code. This avoids
the large overgeneration while remaining fully vectorized.

Result for test_morton_order_iter:
  (30,30,30): 24ms  (ceiling, ratio=1.21)
  (32,32,32): 28ms  (ceiling, ratio=1.00)
  (33,33,33): 32ms  (argsort, ratio=7.3 → fixed from ~820ms with scalar)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>