perf: Vectorize get_chunk_slice for faster sharded writes #3713

Merged

d-v-b merged 26 commits into zarr-developers:main from mkitti:mkitti-get-chunk-slice-vectorization on Feb 20, 2026

Conversation

@mkitti (Contributor) commented Feb 17, 2026

Summary

This PR adds vectorized methods to _ShardIndex and _ShardReader for batch chunk slice lookups, significantly reducing per-chunk function call overhead when writing to shards.

Changes

New Methods

_ShardIndex.get_chunk_slices_vectorized: Batch lookup of chunk slices using NumPy vectorized operations instead of per-chunk Python calls.

_ShardReader.to_dict_vectorized: Build a chunk dictionary using vectorized lookup instead of iterating with individual get() calls.
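
As a rough illustration of the batch lookup (a minimal sketch, not the merged implementation, assuming the index stores an offsets_and_lengths array of shape (*chunks_per_shard, 2) with MAX_UINT_64 marking empty chunks):

```python
import numpy as np
import numpy.typing as npt

MAX_UINT_64 = 2**64 - 1  # sentinel marking an empty chunk in the shard index


def get_chunk_slices_vectorized(
    offsets_and_lengths: npt.NDArray[np.uint64],
    chunk_coords: npt.NDArray[np.uint64],
) -> list[slice | None]:
    """Look up byte ranges for many chunks with one fancy-index operation."""
    chunks_per_shard = offsets_and_lengths.shape[:-1]
    # Localize every coordinate modulo the shard grid at once, instead of
    # one _localize_chunk call per chunk.
    localized = chunk_coords % np.array(chunks_per_shard, dtype=np.uint64)
    entries = offsets_and_lengths[tuple(localized.T)]  # shape (N, 2)
    starts, lengths = entries[:, 0], entries[:, 1]
    empty = starts == MAX_UINT_64
    return [
        None if is_empty else slice(int(start), int(start) + int(length))
        for is_empty, start, length in zip(empty, starts, lengths, strict=True)
    ]
```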

Modified Code Path

In _encode_partial_single, the per-chunk dict comprehension:

```python
shard_dict = {k: shard_reader.get(k) for k in morton_order_iter(chunks_per_shard)}
```

is replaced with the vectorized approach:

```python
morton_coords = _morton_order(chunks_per_shard)
chunk_coords_array = np.array(morton_coords, dtype=np.uint64)
shard_dict = shard_reader.to_dict_vectorized(chunk_coords_array, morton_coords)
```
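
A correspondingly hedged sketch of the dictionary-building step, reusing get_chunk_slices_vectorized from the sketch above (the real method lives on _ShardReader; the free-function form and its parameters are illustrative only):

```python
def to_dict_vectorized(
    shard_bytes: bytes,
    offsets_and_lengths: npt.NDArray[np.uint64],
    chunk_coords: npt.NDArray[np.uint64],
    keys: tuple[tuple[int, ...], ...],
) -> dict[tuple[int, ...], bytes | None]:
    """Build the shard dict from one batch lookup instead of N get() calls."""
    slices = get_chunk_slices_vectorized(offsets_and_lengths, chunk_coords)
    return {
        key: (shard_bytes[s] if s is not None else None)
        for key, s in zip(keys, slices, strict=True)
    }
```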

Benchmark Results

Single Chunk Write to Large Shard

Writing a single 1x1x1 chunk to a shard with 32³ chunks (using test_sharded_morton_write_single_chunk from PR #3712):

| Optimization | Time | Speedup vs. main |
| --- | --- | --- |
| Main branch (original) | 422 ms | - |
| + Morton optimization (PR #3708) | 261 ms | 1.6× |
| + Vectorized get_chunk_slice | 95 ms | 4.4× |

Profile Breakdown

| Function | Before | After |
| --- | --- | --- |
| get_chunk_slice + _localize_chunk | 215 ms | 3 ms |
| to_dict_vectorized loop | 81 ms | 9 ms |
| Total function calls | 299k | 37k |

Checklist

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

mkitti and others added 21 commits February 13, 2026 00:15
Add benchmarks that clear the _morton_order LRU cache before each
iteration to measure the full Morton computation cost:

- test_sharded_morton_indexing: 512-4096 chunks per shard
- test_sharded_morton_indexing_large: 32768 chunks per shard

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add vectorized methods to _ShardIndex and _ShardReader for batch
chunk slice lookups, reducing per-chunk function call overhead
when writing to shards.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@mkitti (Contributor, Author) commented Feb 18, 2026

@d-v-b add the benchmark tag here as well please.

d-v-b added the benchmark label (Code will be benchmarked in a CI job.) on Feb 18, 2026
codspeed-hq bot commented Feb 18, 2026

Merging this PR will improve performance by ×7.2

⚡ 4 improved benchmarks
✅ 52 untouched benchmarks
⏩ 6 skipped benchmarks¹

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- | --- |
| WallTime | test_morton_order_iter[(8, 8, 8)] | 6.2 ms | 1.1 ms | ×5.5 |
| WallTime | test_morton_order_iter[(16, 16, 16)] | 56.2 ms | 8.1 ms | ×6.9 |
| WallTime | test_morton_order_iter[(32, 32, 32)] | 502.6 ms | 70.2 ms | ×7.2 |
| WallTime | test_sharded_morton_write_single_chunk[(32, 32, 32)-memory] | 953.3 ms | 164.4 ms | ×5.8 |

Comparing mkitti:mkitti-get-chunk-slice-vectorization (157d283) with main (36caf1f)


Footnotes

  1. 6 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

mkitti and others added 4 commits February 19, 2026 17:17
_morton_order now returns a read-only npt.NDArray[np.intp] (annotated as
Iterable[Sequence[int]]) instead of a tuple of tuples, eliminating the
intermediate list-of-tuples allocation. morton_order_iter converts rows to
tuples on the fly. to_dict_vectorized no longer requires a redundant
chunk_coords_tuples argument; tuple conversion happens inline during dict
population. get_chunk_slices_vectorized accepts any integer array dtype
(npt.NDArray[np.integer[Any]]) and casts to uint64 internally.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
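
The internal uint64 cast described above can be zero-copy when the caller already passes uint64; a small illustration of that NumPy behavior:

```python
import numpy as np

a = np.arange(8, dtype=np.uint64)
b = a.astype(np.uint64, copy=False)  # dtype already matches: same array back
c = np.arange(8, dtype=np.int32).astype(np.uint64, copy=False)  # real cast

assert b is a          # no allocation when no conversion is needed
assert c.dtype == np.uint64
```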
Add _morton_order_keys() as a second lru_cache that converts the ndarray
returned by _morton_order into a tuple of tuples. This restores cached
access to hashable chunk coordinate keys without reverting to the old
dual-argument interface. morton_order_iter now uses _morton_order_keys,
and to_dict_vectorized derives its keys from _morton_order_keys internally
using the shard index shape, keeping the call site single-argument.

Result: test_sharded_morton_write_single_chunk[(32,32,32)] improves from
~33 ms to ~7 ms (roughly a 5× speedup compared to the state prior to this PR's changes).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
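
A minimal sketch of the two-cache pattern (assuming a stand-in _compute_morton_order; the placeholder below walks the grid in C order just so the example runs, whereas the real function produces Morton order):

```python
from functools import lru_cache

import numpy as np
import numpy.typing as npt


def _compute_morton_order(chunks_per_shard: tuple[int, ...]) -> npt.NDArray[np.intp]:
    # Placeholder: plain C-order traversal so the example is runnable;
    # the real function computes Morton order with vectorized bit ops.
    n = int(np.prod(chunks_per_shard))
    return np.stack(np.unravel_index(np.arange(n), chunks_per_shard), axis=1)


@lru_cache
def _morton_order(chunks_per_shard: tuple[int, ...]) -> npt.NDArray[np.intp]:
    arr = _compute_morton_order(chunks_per_shard)
    arr.setflags(write=False)  # read-only, safe to share between cached callers
    return arr


@lru_cache
def _morton_order_keys(chunks_per_shard: tuple[int, ...]) -> tuple[tuple[int, ...], ...]:
    # Hashable tuple-of-tuples keys, derived once from the cached ndarray.
    return tuple(map(tuple, _morton_order(chunks_per_shard).tolist()))


def morton_order_iter(chunks_per_shard: tuple[int, ...]):
    # Iterates the hashable keys from the second cache.
    yield from _morton_order_keys(chunks_per_shard)
```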
…hmarks

All benchmark functions that call _morton_order.cache_clear() now also
call _morton_order_keys.cache_clear() to ensure both caches are reset
before each benchmark iteration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mkitti (Contributor, Author) commented Feb 20, 2026

Benchmark results after the latest commits

Compared against main, with both caches (_morton_order and _morton_order_keys) cleared before each iteration:

| Benchmark | main | This PR | Speedup |
| --- | --- | --- | --- |
| test_sharded_morton_write_single_chunk[(32,32,32)] | 76,235 µs | 6,802 µs | 11.2× |
| test_morton_order_iter[(16,16,16)] | 2,092 µs | < threshold | >8× |
| test_morton_order_iter[(32,32,32)] | 17,032 µs | < threshold | >8× |
| test_sharded_morton_indexing[(16,16,16)] | 21,925 µs | 21,788 µs | ~1× (unaffected) |
| test_sharded_morton_indexing[(32,32,32)] | 175,284 µs | 177,995 µs | ~1× (unaffected) |

Profiling breakdown (test_sharded_morton_write_single_chunk, 32³ shard, 32,768 chunks):

| Cost | main | This PR |
| --- | --- | --- |
| _morton_order computation | 18 ms (scalar loop × 32,768) | 1 ms (vectorized) |
| get_chunk_slice / _localize_chunk | 67 ms (32,768 Python calls) | 1 ms (get_chunk_slices_vectorized) |
| Total function calls | 297,531 | 166,468 |

The 11× speedup comes from two changes: vectorizing _morton_order (eliminating 32,768 scalar decode_morton calls) and vectorizing the chunk slice lookup (eliminating 32,768 per-chunk get_chunk_slice + _localize_chunk calls). Read paths are unaffected.
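
For the power-of-2 case, a vectorized Morton decode can loop over bit positions instead of over codes. A sketch, assuming the usual interleave where bit (bit*ndim + dim) of the code holds bit `bit` of coordinate `dim` (zarr's actual _morton_order also discards out-of-range coordinates for non-power-of-2 grids, which is omitted here):

```python
import numpy as np


def decode_morton_vectorized(ndim: int, nbits: int) -> np.ndarray:
    """Decode all Morton codes for a 2**nbits-per-axis grid at once."""
    codes = np.arange(1 << (ndim * nbits), dtype=np.uint64)
    coords = np.zeros((codes.size, ndim), dtype=np.uint64)
    # The loops run ndim * nbits times total (15 iterations for a 32^3
    # grid), not once per Morton code (32,768 scalar calls on main).
    for dim in range(ndim):
        for bit in range(nbits):
            src = np.uint64(bit * ndim + dim)
            coords[:, dim] |= ((codes >> src) & np.uint64(1)) << np.uint64(bit)
    return coords


# decode_morton_vectorized(3, 5) yields the 32,768 coordinates of a
# 32x32x32 shard in Morton order.
```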

Claude generated this report.

@d-v-b (Contributor) commented Feb 20, 2026

this looks great, just 1 nit about narrowing the return type of a function and I think this is good to go

More precise than Iterable[Sequence[int]] and accurately reflects the
actual return value. Remove the now-unused Iterable import.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
d-v-b enabled auto-merge (squash) February 20, 2026 18:17
@mkitti (Contributor, Author) commented Feb 20, 2026

We seem to be stuck due to codspeed running out of time.

The usage of Macro Runners on zarr-developers exceeded the limit of 600 minutes.

The current period ends on February 28, 2026.

This may need to be merged manually.

Also, I have mainly optimized the power-of-2 case. I am preparing additional benchmarks beyond the power-of-2 case.

d-v-b disabled auto-merge February 20, 2026 20:09
d-v-b merged commit 2f9b0b3 into zarr-developers:main on Feb 20, 2026
24 of 26 checks passed

Labels

benchmark (Code will be benchmarked in a CI job.)
