Conversation
force-pushed from 06ce5c2 to e5d6ea0
Merging this PR will improve performance by ×10
Performance Changes
```diff
 let offsets = new_offsets.into_array();
 let sizes = new_sizes.into_array();
-let elements = new_elements_builder.finish();
+let elements = self.elements().take(take_indices.into_array())?;
```
this builds a lazy take plan, do you want this?
I think you want an execute here?
It's faster if you don't do the work 🙈
Oh right, all of our compute is lazy now haha. But I'll be surprised if this is not significantly better than the old naive version for primitive types.

Edit: Actually, are we sure we want to execute here? I don't think it is absolutely necessary to do so, especially if we want to filter/take after a rebuild has occurred (in which case we can optimize that plan).
Yeah my bad, didn't know take was lazy. It's still a lot faster than the previous impl, though.
```rust
let elements = self
    .elements()
    .take(take_indices.into_array())?
    .to_canonical()?
    .into_array();
```
In the case that we read a ListViewArray from disk and want to export to DataFusion (basically, if we want to export as a ListArray to Arrow), I believe that not executing here might be more beneficial.

If we want to push down a filter/take into a ListView from DataFusion/Arrow, then we definitely do not want to execute here, since it would be better to combine that filter/take with this take.

I also think that this is the more common (perhaps the only?) path that this `naive_rebuild` exists for.
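To make the fusion point concrete, here is a minimal standalone sketch (plain Rust with `Vec<u64>`, not the Vortex API) showing that a take applied on top of another take can be collapsed into a single take by composing the index vectors, which is what a lazy plan can exploit:

```rust
/// `elements.take(inner).take(outer)` selects the same rows as
/// `elements.take(compose_takes(inner, outer))`, so a lazy plan can fuse the
/// two gathers into one pass over the element buffer.
fn compose_takes(inner: &[u64], outer: &[u64]) -> Vec<u64> {
    outer.iter().map(|&o| inner[o as usize]).collect()
}

fn main() {
    let inner = vec![5, 2, 7, 0]; // indices produced by the rebuild's take
    let outer = vec![3, 1];       // a later take pushed down from the engine
    assert_eq!(compose_takes(&inner, &outer), vec![0, 2]);
}
```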
Yeah, not sure what the caller expects to get here. I think it makes sense to keep it lazy and leave canonicalization up to the caller, wdyt @joseph-isaacs ?
I think it depends on whether you clean up a large number of elements by being eager.
Not sure why this would clean up more than if we wait? It's not like a join where we might balloon the data out; later takes and filters can only make this smaller, right?
force-pushed from 7bce0f4 to 66b9f69
Deploying vortex-bench with Cloudflare Pages

| Latest commit: | 29bdc23 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://104c2692.vortex-93b.pages.dev |
| Branch Preview URL: | https://bp-listview-rebuild.vortex-93b.pages.dev |
…nto bp/listview-rebuild
connortsui20 left a comment
I think this is probably fine, though because this is very heuristic-y, I think we need some comments giving a general idea of why we are doing this.

Also, we still need to figure out if we want to do the eager or lazy take (do we canonicalize, or let someone else do that so that we can potentially optimize the take?).
```rust
}

/// Decides whether the per-list-copy strategy should be used over bulk-take.
///
```
I think this is good, would appreciate a small high-level comment here explaining the general gist of this, something like:
> Rebuilding list by list can introduce unnecessary overhead if there are a lot of small lists. In those cases, it is better to model this rebuild as a take over indices.
>
> Beyond that, the size of the elements can also determine which is faster. Generally, if the sizes of the lists are large, we want to do list-by-list `memcpy`s.
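For illustration, the suggested comment could be attached roughly like this (the function name and signature below are placeholders, not the actual code under review):

```rust
/// Decides whether the per-list-copy strategy should be used over bulk-take.
///
/// Rebuilding list by list can introduce unnecessary overhead when there are
/// many small lists; in those cases it is better to model the rebuild as a
/// take over element indices. Beyond that, element size also matters: when
/// lists are large, list-by-list memcpys are generally faster.
fn use_per_list_copy(_avg_list_len: u64, _element_size: u64) -> bool {
    // Placeholder body; see the threshold sketch further down the thread.
    unimplemented!()
}
```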
```rust
// - N >= 2: threshold halved (e.g. FSL<i32,4> → (8+4)*32 = 384)
// - N == 1: threshold unchanged (same as flat types)
// Uses the *inner* element size, not the total FSL size, because per-list-copy
// extends the inner flat buffer directly.
DType::FixedSizeList(inner, n, ..) => inner
    .element_size()
    .is_some_and(|e| avg >= (8 + e as u64) * 64 / (*n as u64).clamp(1, 2)),
```
I'm not sure I understand the clamping here. Also, for the example in the comment: if we have an `FSL<i32, 4>`, then the element size is sizeof(i32) * 4 = 4 * 4 = 16 bytes, not 4.
The `e` here is the inner element size (4 bytes for `i32`), not the total FSL size (16 bytes for `FSL<i32, 4>`). This is because `rebuild_list_by_list` operates on the inner flat buffer directly: it extends the underlying `i32` buffer, not a buffer of 4-element arrays.

The clamp was there just to cap the divisor because, based on initial observation, there wasn't any difference between N=2 and N=4. I ran a sweep with different N values (up to 4096) and Claude came up with a better formula. At high N's (256+), both strategies converge in cost.
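For readers skimming the thread, the clamped threshold quoted in the diff can be restated as a small self-contained function (illustrative names and plain `u64` arguments, not the actual Vortex code; per the comment above, this clamp-based formula was later replaced after the sweep):

```rust
/// `avg_len` is the average list length in logical elements, `inner_size` the
/// byte size of the *inner* flat element (4 for i32), and `n` the FSL width.
fn prefer_per_list_copy(avg_len: u64, inner_size: u64, n: u64) -> bool {
    // Wider fixed-size lists lower the length threshold, but the divisor is
    // capped at 2 because N=2 and N=4 showed no measurable difference.
    avg_len >= (8 + inner_size) * 64 / n.clamp(1, 2)
}

fn main() {
    // FSL<i32, 4>: threshold is (8 + 4) * 64 / 2 = 384, matching the comment
    // in the diff above.
    assert!(prefer_per_list_copy(384, 4, 4));
    assert!(!prefer_per_list_copy(383, 4, 4));
}
```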
Signed-off-by: Baris Palaska <barispalaska@gmail.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Polar Signals Profiling Results (Latest Run; Previous Runs: 1). Powered by Polar Signals Cloud.
Benchmark results were posted for the following suites (detailed summary tables not reproduced here):

- PolarSignals Profiling
- TPC-H SF=1 on NVME
- FineWeb NVMe
- TPC-H SF=1 on S3
- TPC-DS SF=1 on NVME
- TPC-H SF=10 on NVME
- FineWeb S3
- Statistical and Population Genetics
- TPC-H SF=10 on S3
Ok, so it seems like there's no difference in our SQL benchmarks. Can you make the change to the lazy take and see how that affects things? You can just add the `action/benchmark-sql` tag again and we can see if that changes anything.
…nto bp/listview-rebuild
Benchmark results were also posted for Clickbench on NVME (detailed results table not reproduced here).
@palaska might be good to check? Here is the query. I will check how long they actually are.
Replace per-list `extend_from_array(slice(...))` with a single `take` in `naive_rebuild`.

The old approach canonicalized elements upfront, then called `extend_from_array` per list; each call triggered buffer cloning for VarBinView. The new approach collects element indices into a `BufferMut<u64>` and does one bulk `take`.
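As a rough illustration of the index-collection step described above (plain `Vec<u64>` standing in for `BufferMut<u64>`; names are hypothetical, not the actual `naive_rebuild` code):

```rust
/// Collect the element indices of every list into one flat buffer so that the
/// elements can be gathered with a single bulk take instead of a per-list
/// slice + extend_from_array.
fn gather_take_indices(offsets: &[u64], sizes: &[u64]) -> Vec<u64> {
    let total: usize = sizes.iter().map(|&s| s as usize).sum();
    let mut indices = Vec::with_capacity(total);
    for (&offset, &len) in offsets.iter().zip(sizes) {
        indices.extend(offset..offset + len); // each list is a contiguous run
    }
    indices
}

fn main() {
    // Two lists: elements [10, 13) and elements [2, 4).
    assert_eq!(gather_take_indices(&[10, 2], &[3, 2]), vec![10, 11, 12, 2, 3]);
}
```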