perf: Vectorize get_chunk_slice for faster sharded writes #3713

Merged

d-v-b merged 26 commits into zarr-developers:main from mkitti:mkitti-get-chunk-slice-vectorization on Feb 20, 2026

Conversation

@mkitti (Contributor) commented Feb 17, 2026

Summary

This PR adds vectorized methods to _ShardIndex and _ShardReader for batch chunk slice lookups, significantly reducing per-chunk function call overhead when writing to shards.

Changes

New Methods

_ShardIndex.get_chunk_slices_vectorized: Batch lookup of chunk slices using NumPy vectorized operations instead of per-chunk Python calls.

_ShardReader.to_dict_vectorized: Build a chunk dictionary using vectorized lookup instead of iterating with individual get() calls.
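
As a rough illustration of the batch lookup (a minimal sketch, not the merged implementation, assuming the index stores an offsets_and_lengths array of shape (*chunks_per_shard, 2) with MAX_UINT_64 marking empty chunks):

```python
import numpy as np
import numpy.typing as npt

MAX_UINT_64 = 2**64 - 1  # sentinel marking an empty chunk in the shard index


def get_chunk_slices_vectorized(
    offsets_and_lengths: npt.NDArray[np.uint64],
    chunk_coords: npt.NDArray[np.uint64],
) -> list[slice | None]:
    """Look up byte ranges for many chunks with one fancy-index operation."""
    chunks_per_shard = offsets_and_lengths.shape[:-1]
    # Localize every coordinate modulo the shard grid at once, instead of
    # one _localize_chunk call per chunk.
    localized = chunk_coords % np.array(chunks_per_shard, dtype=np.uint64)
    entries = offsets_and_lengths[tuple(localized.T)]  # shape (N, 2)
    starts, lengths = entries[:, 0], entries[:, 1]
    empty = starts == MAX_UINT_64
    return [
        None if is_empty else slice(int(start), int(start) + int(length))
        for is_empty, start, length in zip(empty, starts, lengths, strict=True)
    ]
```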

Modified Code Path

In _encode_partial_single, the per-chunk dict comprehension:

```python
shard_dict = {k: shard_reader.get(k) for k in morton_order_iter(chunks_per_shard)}
```

is replaced with the vectorized approach:

```python
morton_coords = _morton_order(chunks_per_shard)
chunk_coords_array = np.array(morton_coords, dtype=np.uint64)
shard_dict = shard_reader.to_dict_vectorized(chunk_coords_array, morton_coords)
```
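
A correspondingly hedged sketch of the dictionary-building step, reusing get_chunk_slices_vectorized from the sketch above (the real method lives on _ShardReader; the free-function form and its parameters are illustrative only):

```python
def to_dict_vectorized(
    shard_bytes: bytes,
    offsets_and_lengths: npt.NDArray[np.uint64],
    chunk_coords: npt.NDArray[np.uint64],
    keys: tuple[tuple[int, ...], ...],
) -> dict[tuple[int, ...], bytes | None]:
    """Build the shard dict from one batch lookup instead of N get() calls."""
    slices = get_chunk_slices_vectorized(offsets_and_lengths, chunk_coords)
    return {
        key: (shard_bytes[s] if s is not None else None)
        for key, s in zip(keys, slices, strict=True)
    }
```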

Benchmark Results

Single Chunk Write to Large Shard

Writing a single 1x1x1 chunk to a shard with 32³ chunks (using test_sharded_morton_write_single_chunk from PR #3712):

| Optimization | Time | Speedup vs. main |
| --- | --- | --- |
| Main branch (original) | 422 ms | - |
| + Morton optimization (PR #3708) | 261 ms | 1.6× |
| + Vectorized get_chunk_slice | 95 ms | 4.4× |

Profile Breakdown

| Function | Before | After |
| --- | --- | --- |
| get_chunk_slice + _localize_chunk | 215 ms | 3 ms |
| to_dict_vectorized loop | 81 ms | 9 ms |
| Total function calls | 299k | 37k |

Checklist

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

mkitti and others added 21 commits February 13, 2026 00:15
Add benchmarks that clear the _morton_order LRU cache before each
iteration to measure the full Morton computation cost:

- test_sharded_morton_indexing: 512-4096 chunks per shard
- test_sharded_morton_indexing_large: 32768 chunks per shard

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add vectorized methods to _ShardIndex and _ShardReader for batch
chunk slice lookups, reducing per-chunk function call overhead
when writing to shards.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@mkitti (Contributor, Author) commented Feb 18, 2026

@d-v-b add the benchmark tag here as well please.

d-v-b added the benchmark label (Code will be benchmarked in a CI job.) on Feb 18, 2026
codspeed-hq bot commented Feb 18, 2026

Merging this PR will improve performance by ×7.2

⚡ 4 improved benchmarks
✅ 52 untouched benchmarks
⏩ 6 skipped benchmarks¹

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- | --- |
| WallTime | test_morton_order_iter[(8, 8, 8)] | 6.2 ms | 1.1 ms | ×5.5 |
| WallTime | test_morton_order_iter[(16, 16, 16)] | 56.2 ms | 8.1 ms | ×6.9 |
| WallTime | test_morton_order_iter[(32, 32, 32)] | 502.6 ms | 70.2 ms | ×7.2 |
| WallTime | test_sharded_morton_write_single_chunk[(32, 32, 32)-memory] | 953.3 ms | 164.4 ms | ×5.8 |

Comparing mkitti:mkitti-get-chunk-slice-vectorization (157d283) with main (36caf1f)


Footnotes

  1. 6 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

mkitti and others added 4 commits February 19, 2026 17:17
_morton_order now returns a read-only npt.NDArray[np.intp] (annotated as
Iterable[Sequence[int]]) instead of a tuple of tuples, eliminating the
intermediate list-of-tuples allocation. morton_order_iter converts rows to
tuples on the fly. to_dict_vectorized no longer requires a redundant
chunk_coords_tuples argument; tuple conversion happens inline during dict
population. get_chunk_slices_vectorized accepts any integer array dtype
(npt.NDArray[np.integer[Any]]) and casts to uint64 internally.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
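
The internal uint64 cast described above can be zero-copy when the caller already passes uint64; a small illustration of that NumPy behavior:

```python
import numpy as np

a = np.arange(8, dtype=np.uint64)
b = a.astype(np.uint64, copy=False)  # dtype already matches: same array back
c = np.arange(8, dtype=np.int32).astype(np.uint64, copy=False)  # real cast

assert b is a          # no allocation when no conversion is needed
assert c.dtype == np.uint64
```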
Add _morton_order_keys() as a second lru_cache that converts the ndarray
returned by _morton_order into a tuple of tuples. This restores cached
access to hashable chunk coordinate keys without reverting to the old
dual-argument interface. morton_order_iter now uses _morton_order_keys,
and to_dict_vectorized derives its keys from _morton_order_keys internally
using the shard index shape, keeping the call site single-argument.

Result: test_sharded_morton_write_single_chunk[(32,32,32)] improves from
~33 ms to ~7 ms (roughly a 5× speedup compared to the state prior to this PR's changes).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
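
A minimal sketch of the two-cache pattern (assuming a stand-in _compute_morton_order; the placeholder below walks the grid in C order just so the example runs, whereas the real function produces Morton order):

```python
from functools import lru_cache

import numpy as np
import numpy.typing as npt


def _compute_morton_order(chunks_per_shard: tuple[int, ...]) -> npt.NDArray[np.intp]:
    # Placeholder: plain C-order traversal so the example is runnable;
    # the real function computes Morton order with vectorized bit ops.
    n = int(np.prod(chunks_per_shard))
    return np.stack(np.unravel_index(np.arange(n), chunks_per_shard), axis=1)


@lru_cache
def _morton_order(chunks_per_shard: tuple[int, ...]) -> npt.NDArray[np.intp]:
    arr = _compute_morton_order(chunks_per_shard)
    arr.setflags(write=False)  # read-only, safe to share between cached callers
    return arr


@lru_cache
def _morton_order_keys(chunks_per_shard: tuple[int, ...]) -> tuple[tuple[int, ...], ...]:
    # Hashable tuple-of-tuples keys, derived once from the cached ndarray.
    return tuple(map(tuple, _morton_order(chunks_per_shard).tolist()))


def morton_order_iter(chunks_per_shard: tuple[int, ...]):
    # Iterates the hashable keys from the second cache.
    yield from _morton_order_keys(chunks_per_shard)
```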
…hmarks

All benchmark functions that call _morton_order.cache_clear() now also
call _morton_order_keys.cache_clear() to ensure both caches are reset
before each benchmark iteration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mkitti (Contributor, Author) commented Feb 20, 2026

Benchmark results after the latest commits

Compared against main, with both caches (_morton_order and _morton_order_keys) cleared before each iteration:

| Benchmark | main | This PR | Speedup |
| --- | --- | --- | --- |
| test_sharded_morton_write_single_chunk[(32,32,32)] | 76,235 µs | 6,802 µs | 11.2× |
| test_morton_order_iter[(16,16,16)] | 2,092 µs | < threshold | >8× |
| test_morton_order_iter[(32,32,32)] | 17,032 µs | < threshold | >8× |
| test_sharded_morton_indexing[(16,16,16)] | 21,925 µs | 21,788 µs | ~1× (unaffected) |
| test_sharded_morton_indexing[(32,32,32)] | 175,284 µs | 177,995 µs | ~1× (unaffected) |

Profiling breakdown (test_sharded_morton_write_single_chunk, 32³ shard, 32,768 chunks):

| Cost | main | This PR |
| --- | --- | --- |
| _morton_order computation | 18 ms (scalar loop × 32,768) | 1 ms (vectorized) |
| get_chunk_slice / _localize_chunk | 67 ms (32,768 Python calls) | 1 ms (get_chunk_slices_vectorized) |
| Total function calls | 297,531 | 166,468 |

The 11× speedup comes from two changes: vectorizing _morton_order (eliminating 32,768 scalar decode_morton calls) and vectorizing the chunk slice lookup (eliminating 32,768 per-chunk get_chunk_slice + _localize_chunk calls). Read paths are unaffected.
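
For the power-of-2 case, a vectorized Morton decode can loop over bit positions instead of over codes. A sketch, assuming the usual interleave where bit (bit*ndim + dim) of the code holds bit `bit` of coordinate `dim` (zarr's actual _morton_order also discards out-of-range coordinates for non-power-of-2 grids, which is omitted here):

```python
import numpy as np


def decode_morton_vectorized(ndim: int, nbits: int) -> np.ndarray:
    """Decode all Morton codes for a 2**nbits-per-axis grid at once."""
    codes = np.arange(1 << (ndim * nbits), dtype=np.uint64)
    coords = np.zeros((codes.size, ndim), dtype=np.uint64)
    # The loops run ndim * nbits times total (15 iterations for a 32^3
    # grid), not once per Morton code (32,768 scalar calls on main).
    for dim in range(ndim):
        for bit in range(nbits):
            src = np.uint64(bit * ndim + dim)
            coords[:, dim] |= ((codes >> src) & np.uint64(1)) << np.uint64(bit)
    return coords


# decode_morton_vectorized(3, 5) yields the 32,768 coordinates of a
# 32x32x32 shard in Morton order.
```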

Claude generated this report.

@d-v-b (Contributor) commented Feb 20, 2026

this looks great, just 1 nit about narrowing the return type of a function and I think this is good to go

More precise than Iterable[Sequence[int]] and accurately reflects the
actual return value. Remove the now-unused Iterable import.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
d-v-b enabled auto-merge (squash) February 20, 2026 18:17
@mkitti (Contributor, Author) commented Feb 20, 2026

We seem to be stuck due to codspeed running out of time.

The usage of Macro Runners on zarr-developers exceeded the limit of 600 minutes.

The current period ends on February 28, 2026.

This may need to be merged manually.

Also, I have mainly optimized the power-of-2 case. I am preparing additional benchmarks beyond the power-of-2 case.

d-v-b disabled auto-merge February 20, 2026 20:09
d-v-b merged commit 2f9b0b3 into zarr-developers:main on Feb 20, 2026
24 of 26 checks passed

Labels

benchmark (Code will be benchmarked in a CI job.)
