
Use cumsum from flox#10987

Open
Illviljan wants to merge 159 commits into pydata:main from Illviljan:cumsum_flox

Conversation

@Illviljan
Contributor

@Illviljan Illviljan commented Dec 6, 2025

The non-flox version reduces chunksizes significantly:

import xarray as xr

x = xr.DataArray([1, 1, 1, 1, 1], name="x").chunk()
grp_idx = xr.DataArray([-1, 0, 0, -1, 1])
with xr.set_options(use_flox=False):
    print(x.groupby(grp_idx).cumsum())
<xarray.DataArray 'x' (dim_0: 5)> Size: 40B
dask.array<getitem, shape=(5,), dtype=int64, chunksize=(2,), chunktype=numpy.ndarray>
Dimensions without coordinates: dim_0

With flox the chunksize is retained:

x = xr.DataArray([1, 1, 1, 1, 1], name="x").chunk()
grp_idx = xr.DataArray([-1, 0, 0, -1, 1])
with xr.set_options(use_flox=True):
    print(x.groupby(grp_idx).cumsum())
<xarray.DataArray 'x' (dim_0: 5)> Size: 40B
dask.array<_finalize_scan, shape=(5,), dtype=int64, chunksize=(5,), chunktype=numpy.ndarray>
Dimensions without coordinates: dim_0
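Both snippets only print the lazy dask array, so the computed values are not visible above. As a rough reference for what a grouped cumulative sum should return on this input, here is a minimal plain-Python sketch. The helper name `grouped_cumsum` is hypothetical, and it assumes that `-1` is treated as an ordinary group label rather than as a missing-value marker.

```python
from collections import defaultdict


def grouped_cumsum(values, labels):
    """Cumulative sum of `values`, restarted separately for each label.

    Hypothetical reference helper, not part of xarray or flox.
    """
    running = defaultdict(int)  # running total per group label
    out = []
    for value, label in zip(values, labels):
        running[label] += value
        out.append(running[label])
    return out


x = [1, 1, 1, 1, 1]
grp_idx = [-1, 0, 0, -1, 1]  # assumption: -1 is a regular label here
result = grouped_cumsum(x, grp_idx)
print(result)  # [1, 1, 2, 2, 1]
```

The output keeps the original element order, which is why the result has the same shape as the input regardless of how the groups interleave.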

Other changes:

  • Changes DataArray.cumsum/Dataset.cumsum/DataTree.cumsum/DataArray.groupby.cumsum/Dataset.groupby.cumsum etc.
  • Coordinates are now retained

Notes
groupby_scan was added in: https://github.com/xarray-contrib/flox/releases/tag/v0.9.9
cumsum was added in: https://github.com/xarray-contrib/flox/releases/tag/v0.10.5

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
@Illviljan
Contributor Author

Writing down these flox issues before I forget:

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "foo": (
            ("test", "time"),
            [[7, 2, 0, 1, 2, np.nan], [1, 1, 1, 1, 1, 1], [2, 2, 2, 2, 2, 2]],
        )
    },
    coords={
        "time": [0, 1 / 6, 2 / 6, 3 / 6, 4 / 6, 5 / 6],
        "test": ["a", "b", "b"],
        "group_idx": ("time", [0, 0, 1, 1, 2, 2]),
        "group_idx2": ("time", [0, 1, 1, 1, 1, 1]),
    },
)

# group_idx along 1 dim and cumsum dim along another fails with flox:
ds.groupby("group_idx").cumsum("test") 

# cumsum along multiple dims fails with flox:
ds.groupby("group_idx").cumsum(...)
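To pin down what these two calls would presumably be expected to return, here is a plain-Python sketch of the semantics on the same "foo" values. It is not how xarray or flox compute the result. The helper `grouped_cumsum_2d` is hypothetical, the NaN is replaced by 0 to keep the arithmetic exact, and the sketch assumes that the cumulative sum along "time" restarts for each group label while the one along "test" is unaffected by the grouping.

```python
def grouped_cumsum_2d(data, labels, along_test=True, along_time=True):
    """Cumulative sum over a (test, time) table, grouped along time.

    Hypothetical reference helper, not part of xarray or flox.
    `labels` holds one group label per time column.
    """
    out = [row[:] for row in data]  # copy so the input is left untouched
    n_rows, n_cols = len(out), len(out[0])
    if along_test:
        # Grouping is along time, so each column accumulates down the rows.
        for j in range(n_cols):
            for i in range(1, n_rows):
                out[i][j] += out[i - 1][j]
    if along_time:
        # Within each row, keep one running total per group label.
        for i in range(n_rows):
            running = {}
            for j in range(n_cols):
                running[labels[j]] = running.get(labels[j], 0) + out[i][j]
                out[i][j] = running[labels[j]]
    return out


# "foo" from the dataset above, with NaN replaced by 0 (assumption).
foo = [[7, 2, 0, 1, 2, 0], [1, 1, 1, 1, 1, 1], [2, 2, 2, 2, 2, 2]]
group_idx = [0, 0, 1, 1, 2, 2]

# Analogue of ds.groupby("group_idx").cumsum("test")
only_test = grouped_cumsum_2d(foo, group_idx, along_time=False)

# Analogue of ds.groupby("group_idx").cumsum(...)
all_dims = grouped_cumsum_2d(foo, group_idx)
```

Under these assumptions the first case equals an ungrouped cumulative sum along "test", since every column belongs to exactly one group.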

@Illviljan
Contributor Author

I wanted to give pytest-accept a final try within pixi, but I'm still unable to get it to indent correctly on Windows 10:

    pixi shell
    python xarray/util/generate_aggregations.py
    pytest --doctest-modules xarray/core/_aggregations.py --accept
    pytest --doctest-modules xarray/core/_aggregations.py

@Illviljan Illviljan added the plan to merge Final call for comments label Feb 21, 2026
@dcherian
Contributor

dcherian commented Mar 4, 2026

Sorry, I've lost track here. Are there upstream fixes required in flox? My other concern is that resample().cumsum() probably performs quite badly, because I haven't implemented method="cohorts" for scans upstream.

@Illviljan
Contributor Author

No, I think this PR is finished. I'm mostly waiting on green CI, and maybe on a release, so that it gets a chance to sit on main for a while.

I haven't dug into the performance of resample().cumsum(). Is it a blocker for you? Are you expecting it to be worse than what we had previously?
At least with groupby().cumsum() it was unusable for me; I just do an early .compute() to avoid the dask arrays.


Labels

plan to merge (Final call for comments)
run-benchmark (Run the ASV benchmark workflow)
topic-DataTree (Related to the implementation of a DataTree class)
topic-groupby
topic-performance


Development

Successfully merging this pull request may close these issues.

cumsum drops index coordinates

2 participants