Skip to content

feat: Mean aggregate#7201

Open
blaginin wants to merge 14 commits intodevelopfrom
db/count-and-mean
Open

feat: Mean aggregate#7201
blaginin wants to merge 14 commits intodevelopfrom
db/count-and-mean

Conversation

@blaginin
Copy link
Copy Markdown
Member

@blaginin blaginin commented Mar 29, 2026

No description provided.

Signed-off-by: blaginin <dima@spiraldb.com>
Co-authored-by: Claude <noreply@anthropic.com>
@blaginin blaginin self-assigned this Mar 29, 2026
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: blaginin <dima@spiraldb.com>
/// Return the count of non-null elements in an array.
///
/// See [`Count`] for details.
pub fn count(array: &ArrayRef, ctx: &mut ExecutionCtx) -> VortexResult<u64> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is counting things in an array, should it return a usize?

false
}

fn accumulate(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to register a generic aggregate kernel to reduce count-non-null to be Array.validity().sum(), then we avoid decompressing all the data.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

575672a

added generic kernels first, but then it's hard to express "I never want Count::accumulate to be called" - so switched to try_accumulate instead

}

fn partial_struct_dtype() -> DType {
DType::Struct(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you static LazyLock this entire dtype inside this function?

@gatesn gatesn added the changelog/feature A new feature label Mar 30, 2026
Copy link
Copy Markdown
Contributor

@joseph-isaacs joseph-isaacs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be decimal for decimals?

Comment on lines +53 to +56
pub struct MeanPartial {
sum: f64,
count: u64,
}
Copy link
Copy Markdown
Contributor

@joseph-isaacs joseph-isaacs Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will compute with a pretty large error?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$$\mu_n = \sum_{i=0}^n x_i/n \iff \mu_{n+1} = \mu_n + \frac{x_{n+1} - \mu_n}{1+n}$$

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need only hold the partial mean and count

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can incrementally push the next value into this partial, but you cannot combine two partials afaik?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For combining you do. $$\mu_{AB} = \frac{n_A \mu_A + n_B \mu_B}{n_A + n_B}$$

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DF sum is T and ours is f64

Copy link
Copy Markdown
Contributor

@joseph-isaacs joseph-isaacs Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error comes from numbers of different scales being combined, not from number of OPS.

Since we are storing these on disk we cannot "just change it later".

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DF only supports AVG for floats, decimals, and durations. And coerces all floats to f64.

Copy link
Copy Markdown
Contributor

@joseph-isaacs joseph-isaacs Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

| Input Type   | DuckDB | DataFusion | Velox      |
|--------------|--------|------------|------------|
| Integer types| DOUBLE | Float64    | DOUBLE     |
| FLOAT/REAL   | DOUBLE | Float64    | REAL       |
| DOUBLE       | DOUBLE | Float64    | DOUBLE     |
| DECIMAL(p,s) | DOUBLE | Float64    | DECIMAL(p,s)|

Copy link
Copy Markdown
Member Author

@blaginin blaginin Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought about this more - i think it should decimal for decimals? - happy to change

blaginin added 3 commits April 2, 2026 14:24
Signed-off-by: blaginin <github@blaginin.me>
Signed-off-by: blaginin <github@blaginin.me>
@blaginin blaginin changed the title Count and Mean aggregates feat: Count and Mean aggregates Apr 2, 2026
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq bot commented Apr 2, 2026

Merging this PR will improve performance by 27.66%

⚡ 2 improved benchmarks
✅ 1120 untouched benchmarks
⏩ 1530 skipped benchmarks1

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation varbinview_zip_fragmented_mask 7.1 ms 6.4 ms +11.74%
Simulation varbinview_zip_block_mask 3.7 ms 2.9 ms +27.66%

Comparing db/count-and-mean (be4fa68) with develop (c660607)

Open in CodSpeed

Footnotes

  1. 1530 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Signed-off-by: blaginin <github@blaginin.me>
@blaginin blaginin changed the title feat: Count and Mean aggregates feat: Mean aggregate Apr 2, 2026
blaginin and others added 4 commits April 2, 2026 17:56
Signed-off-by: blaginin <github@blaginin.me>
# Conflicts:
#	vortex-array/public-api.lock
#	vortex-array/src/aggregate_fn/fns/mod.rs
Signed-off-by: blaginin <github@blaginin.me>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: blaginin <github@blaginin.me>
blaginin added 3 commits April 8, 2026 16:23
Signed-off-by: blaginin <github@blaginin.me>
Signed-off-by: blaginin <github@blaginin.me>
Signed-off-by: blaginin <github@blaginin.me>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can factor this out of this PR, we could at a later date add the back to on disk version will be the same?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

disk version will be the same

i blocked serialization for now?

we can factor this out of this PR

why?

Comment on lines +49 to +51
/// - Decimals are kept as decimals with widened precision and scale
/// (`+4` each, capped at [`MAX_PRECISION`] / [`MAX_SCALE`]), matching
/// DataFusion's `coerce_avg_type`.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you definitely want higher for the sum, that is only 10k max sized values for an overflow

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, but Sum.partial_dtype() also adds +10

@blaginin blaginin marked this pull request as ready for review April 9, 2026 14:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/feature A new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants