
Fix CuPy Bellman-Ford iteration limit in cost_distance #1192

Merged
brendancol merged 4 commits into master from issue-1191 on Apr 14, 2026

Conversation

@brendancol
Contributor

Closes #1191.

Summary

  • The CuPy parallel Bellman-Ford loop in cost_distance used max_iterations = height + width. On maze-like friction surfaces where the only passable route zigzags across the grid, shortest paths can have up to height * width - 1 edges. The old limit caused early termination -- reachable pixels were incorrectly reported as NaN.
  • Changed to height * width, the standard Bellman-Ford V-1 bound. The early-exit changed flag still short-circuits on open grids, so this only costs extra iterations when they're actually needed.
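For illustration, here is a serial sketch of the relaxation loop with the corrected bound. This is an assumption-laden stand-in, not the actual xarray-spatial CuPy kernel (which relaxes all pixels in parallel per iteration); the neighbor cost model (average of the two pixels' friction) is also assumed for the sketch.

```python
import numpy as np

def grid_bellman_ford(friction, source):
    """Serial sketch of grid Bellman-Ford with the V-1 iteration bound.
    Impassable cells carry friction = inf."""
    h, w = friction.shape
    dist = np.full((h, w), np.inf)
    dist[source] = 0.0
    # V - 1 relaxations suffice for V = h * w vertices; the old
    # h + w cap could stop before long zigzag paths converged.
    max_iterations = h * w
    for _ in range(max_iterations):
        changed = False
        for i in range(h):
            for j in range(w):
                if not np.isfinite(friction[i, j]):
                    continue  # impassable pixel, never relaxed
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:
                        step = 0.5 * (friction[i, j] + friction[ni, nj])
                        cand = dist[ni, nj] + step
                        if cand < dist[i, j]:
                            dist[i, j] = cand
                            changed = True
        if not changed:  # early exit: open grids converge well before V - 1
            break
    return dist
```

On an open unit-friction grid the `changed` flag trips after a couple of sweeps, so raising the cap costs nothing there; the extra iterations only run when a path genuinely needs them.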

Test plan

  • New test_snake_maze_long_path -- 5x5 snake maze with a 16-edge shortest path (exceeds old h+w=10 limit), verified across all four backends
  • Full test_cost_distance.py suite passes (48 tests)
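A hypothetical reconstruction of the maze shape the new test describes (wall placement is assumed; only the 5x5 size and 16-edge path length come from the PR). Two walls of impassable friction force a boustrophedon route whose hop count exceeds the old h + w = 10 cap:

```python
import numpy as np
from collections import deque

def snake_maze():
    """5x5 friction surface; inf marks impassable walls."""
    f = np.ones((5, 5))
    f[1, :4] = np.inf  # wall across row 1, opening at column 4
    f[3, 1:] = np.inf  # wall across row 3, opening at column 0
    return f

def hop_distance(friction, start, goal):
    """BFS edge count over passable (finite-friction) cells."""
    h, w = friction.shape
    seen = {start}
    q = deque([(start, 0)])
    while q:
        (i, j), d = q.popleft()
        if (i, j) == goal:
            return d
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ni, nj = i + di, j + dj
            if (0 <= ni < h and 0 <= nj < w
                    and np.isfinite(friction[ni, nj])
                    and (ni, nj) not in seen):
                seen.add((ni, nj))
                q.append(((ni, nj), d + 1))
    return -1  # unreachable
```

The shortest path from (0, 0) to (4, 4) here is 16 edges: right along row 0, down through the column-4 opening, left along row 2, down through the column-0 opening, right along row 4.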

Three performance fixes from the Phase 2 sweep targeting "WILL OOM"
verdicts under 30 TB workloads:

geotiff: read_geotiff_dask() was reading the entire file into RAM just
to extract metadata before building the lazy dask graph. Now uses
_read_geo_info() which parses only the IFD via mmap -- O(1) memory
regardless of file size. Peak memory during dask setup dropped from
4.41 MB to 0.21 MB at 512x512 (21x reduction).
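To make the O(1)-memory idea concrete, here is a toy parser for the classic (non-Big) TIFF layout that reads only the first IFD to recover image dimensions. This is a sketch, not xarray-spatial's `_read_geo_info` (which additionally extracts geo tags and works through mmap); it handles LONG-typed entries only and ignores BigTIFF.

```python
import struct

def read_tiff_size(buf):
    """Parse width/height from a classic TIFF's first IFD, touching
    only a handful of header bytes instead of decoding pixel data."""
    endian = '<' if buf[:2] == b'II' else '>'      # II = little-endian
    magic, ifd_offset = struct.unpack(endian + 'HI', buf[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF")
    n_entries, = struct.unpack(endian + 'H', buf[ifd_offset:ifd_offset + 2])
    width = height = None
    for k in range(n_entries):                     # each IFD entry is 12 bytes
        off = ifd_offset + 2 + 12 * k
        tag, typ, count, value = struct.unpack(endian + 'HHII',
                                               buf[off:off + 12])
        if tag == 256:       # ImageWidth
            width = value
        elif tag == 257:     # ImageLength
            height = value
    return width, height
```

Because the IFD sits near the start of the file, an mmap-backed version of this touches a few pages regardless of whether the raster is 1 MB or 30 TB.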

sieve: region_val_buf was allocated at rows*cols (16 GB for a 46K x 46K
raster) when the actual region count is typically orders of magnitude
smaller. Now counts regions first, allocates at actual size. Also reuses
the dead rank array as root_to_id, saving another 4 bytes/pixel. Memory
guard fixed from a misleading 5x multiplier to an accurate 28
bytes/pixel estimate.
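The count-then-allocate pattern can be sketched as follows; `region_val_buf` and `root_to_id` are the names from the fix above, but the implementation here is an illustrative assumption, not the actual sieve code:

```python
import numpy as np

def allocate_region_buffer(root, dtype=np.float64):
    """Size the per-region buffer by the number of distinct roots
    instead of rows * cols.  `root` maps each pixel to its
    union-find root label."""
    roots, root_to_id = np.unique(root, return_inverse=True)
    n_regions = roots.size               # typically << rows * cols
    region_val_buf = np.empty(n_regions, dtype=dtype)
    return region_val_buf, root_to_id.reshape(root.shape)
```

For a 46K x 46K raster with, say, a few thousand regions, this allocates kilobytes where the old code allocated 16 GB; `np.unique`'s inverse array doubles as the root-to-compact-id mapping.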

reproject: _reproject_dask_cupy pre-allocated the full output on GPU via
cp.full(out_shape), which OOMs for large outputs. Now checks available
GPU memory and falls back to the existing map_blocks path (with
is_cupy=True) when the output exceeds VRAM. Fast path preserved for
outputs that fit.
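The decision logic reduces to a size-versus-free-memory comparison. A minimal sketch (the path names and the safety factor are hypothetical; the real code would query free VRAM via CuPy's `cp.cuda.Device().mem_info`, which returns a `(free, total)` tuple):

```python
import math

def choose_reproject_path(out_shape, itemsize, free_vram_bytes, safety=0.8):
    """Return which code path to take: preallocate the full output on
    GPU only when it fits comfortably in free VRAM, else fall back to
    the chunked map_blocks path."""
    required = math.prod(out_shape) * itemsize
    return "preallocate" if required <= safety * free_vram_bytes else "map_blocks"
```

Keeping a safety margin below 100% of free VRAM leaves headroom for the kernel's own working buffers.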
Four more performance fixes from the Phase 2 sweep:

polygonize: _polygonize_dask called dask.compute(*delayed_results) which
held all chunk polygon data in memory at once. Now processes chunks
incrementally -- interior polygons go straight to the output list and
only boundary polygons accumulate for the merge step.
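The incremental pattern can be sketched independently of dask (in the real code, `chunk_results` would be delayed objects computed one at a time rather than all at once via `dask.compute(*delayed_results)`; the `merge` callable stands in for the boundary-merge step):

```python
def collect_polygons(chunk_results, merge):
    """Stream interior polygons straight to the output; hold only the
    cross-chunk boundary pieces for the final merge."""
    output, pending_boundary = [], []
    for interior, boundary in chunk_results:   # one chunk in memory at a time
        output.extend(interior)
        pending_boundary.extend(boundary)
    output.extend(merge(pending_boundary))
    return output
```

Peak memory is then bounded by one chunk's polygons plus the (usually small) set of boundary polygons, instead of every chunk's full output at once.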

polygon_clip: clip_polygon called mask.compute() to materialize the
entire rasterized mask before applying it. For a polygon covering most
of a 30TB raster, the uint8 mask alone would be multi-TB. Now keeps the
mask lazy for dask paths and applies it via xarray.where (dask+numpy)
or da.map_blocks (dask+cupy).
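A pure-NumPy stand-in for the lazy pattern: the mask is produced and consumed one chunk at a time (mirroring what `xarray.where` / `da.map_blocks` do over dask chunks), so the full-resolution mask never exists in memory at once. Chunking along a single axis is a simplifying assumption here:

```python
import numpy as np

def apply_mask_chunked(data_chunks, mask_chunks, fill=np.nan):
    """Apply a boolean mask chunk-by-chunk; masked-out cells get `fill`."""
    for data, mask in zip(data_chunks, mask_chunks):
        yield np.where(mask, data, fill)
```

For a polygon covering most of a 30 TB raster, this is the difference between a multi-TB materialized uint8 mask and a per-chunk mask that is discarded as soon as it is applied.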

kde: Both dask paths captured the full point arrays (xs, ys, ws) in every
tile task's closure, serializing O(n_tiles * n_points) data. Now
pre-filters points per tile using a bounding-box + cutoff-radius check,
so each task receives only nearby points.
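The pre-filter itself is a few vectorized comparisons; this sketch uses hypothetical argument names for the tile bounds, with `xs`, `ys`, `ws` as in the description above:

```python
import numpy as np

def points_for_tile(xs, ys, ws, tile_bounds, cutoff):
    """Keep only points within cutoff radius of the tile's bounding
    box; the task closure then captures a small slice instead of the
    full point arrays."""
    x0, x1, y0, y1 = tile_bounds
    keep = ((xs >= x0 - cutoff) & (xs <= x1 + cutoff) &
            (ys >= y0 - cutoff) & (ys <= y1 + cutoff))
    return xs[keep], ys[keep], ws[keep]
```

The box test is conservative (it admits some corner points slightly beyond the cutoff radius), but it is cheap and never drops a point that could contribute to the tile.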

pathfinding: When friction=None, the A* kernel allocated a dummy
np.ones((h, w)) array that was never read (use_friction=False skips all
friction lookups). For a 100K x 100K grid that's 80 GB of wasted
allocation. Now passes a 1x1 dummy instead.
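The shape of the fix, as a sketch (the helper name and return convention are assumptions; the key point is that the kernel guards every friction read behind `use_friction`, so the dummy's contents and size are irrelevant):

```python
import numpy as np

def make_friction_arg(friction):
    """When friction is None the kernel never reads the array, so pass
    a 1x1 placeholder instead of a full height x width array of ones."""
    if friction is None:
        return np.ones((1, 1)), False   # dummy array, use_friction=False
    return friction, True
```

At 100K x 100K float64, that one-line change avoids an 80 GB allocation whose values would never be read.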
The parallel Bellman-Ford loop used max_iterations = height + width,
which is too low for maze-like friction surfaces where shortest paths
can snake across the entire grid (up to height * width - 1 edges).
Changed to height * width, the standard Bellman-Ford V-1 bound.
Tests a maze where the shortest path has 16 edges on a 5x5 grid,
which requires more than height + width Bellman-Ford iterations.
Covers all four backends (numpy, cupy, dask+numpy, dask+cupy).
@github-actions github-actions bot added the performance PR touches performance-sensitive code label Apr 13, 2026
@brendancol brendancol merged commit f3e8603 into master Apr 14, 2026
11 checks passed

Development

Successfully merging this pull request may close these issues.

CuPy cost_distance Bellman-Ford terminates early on maze-like friction surfaces
