Rewrite FileStream in terms of Morsel API #21342
| /// This groups together ready planners, ready morsels, the active reader, | ||
| /// pending planner I/O, the remaining files and limit, and the metrics | ||
| /// associated with processing that work. | ||
| pub(super) struct ScanState { |
This is the new inner state machine for FileStream
I think some more diagrams in the docstring of the struct and/or fields could help. I'm trying to wrap my head around how the IO queue and such work.
| use std::sync::Arc; | ||
| use std::sync::mpsc::{self, Receiver, TryRecvError}; | ||
|
|
||
| /// Adapt a legacy [`FileOpener`] to the morsel API. |
This is an adapter so that existing FileOpeners continue to have the same behavior
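For readers unfamiliar with the pattern, here is a minimal sketch of how a default trait method can adapt the legacy API so existing sources keep working unchanged. The trait shapes below (`open`, `morselize`, string "batches") are simplified, hypothetical stand-ins, not the real DataFusion signatures; only the names `FileOpener`, `Morselizer`, `FileOpenerMorselizer`, `create_file_opener`, and `create_morselizer` come from this PR.

```rust
use std::sync::Arc;

/// Simplified stand-in for the legacy API: opening a file yields all of its batches.
/// (Hypothetical signature, for illustration only.)
pub trait FileOpener {
    fn open(&self, path: &str) -> Vec<String>;
}

/// Simplified stand-in for the morsel API: a file is split into units of work.
/// (Hypothetical signature, for illustration only.)
pub trait Morselizer {
    fn morselize(&self, path: &str) -> Vec<Vec<String>>;
}

/// Adapter: wraps a legacy opener and exposes it through the morsel API.
pub struct FileOpenerMorselizer {
    opener: Arc<dyn FileOpener>,
}

impl FileOpenerMorselizer {
    pub fn new(opener: Arc<dyn FileOpener>) -> Self {
        Self { opener }
    }
}

impl Morselizer for FileOpenerMorselizer {
    fn morselize(&self, path: &str) -> Vec<Vec<String>> {
        // The whole file becomes a single morsel, which preserves the old behavior.
        vec![self.opener.open(path)]
    }
}

pub trait FileSource {
    fn create_file_opener(&self) -> Arc<dyn FileOpener>;

    /// Default implementation: adapt the legacy opener. Sources that
    /// understand morsels natively would override this method instead.
    fn create_morselizer(&self) -> Box<dyn Morselizer> {
        Box::new(FileOpenerMorselizer::new(self.create_file_opener()))
    }
}

/// Hypothetical legacy opener used to exercise the adapter.
pub struct TwoBatchOpener;

impl FileOpener for TwoBatchOpener {
    fn open(&self, path: &str) -> Vec<String> {
        vec![format!("{path}:batch0"), format!("{path}:batch1")]
    }
}

/// Hypothetical source that only implements the legacy method.
pub struct LegacySource;

impl FileSource for LegacySource {
    fn create_file_opener(&self) -> Arc<dyn FileOpener> {
        Arc::new(TwoBatchOpener)
    }
}
```

In this sketch `LegacySource` never mentions morsels, yet `LegacySource.create_morselizer()` returns a working `Morselizer` that produces one morsel per file.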
| @@ -0,0 +1,556 @@ | |||
| // Licensed to the Apache Software Foundation (ASF) under one | |||
This is testing infrastructure to write the snapshot tests
| return Poll::Ready(Some(Err(err))); | ||
| } | ||
| } | ||
| FileStreamState::Scan { scan_state: queue } => { |
I moved the inner state machine into a separate module/struct to keep the indentation under control and to encapsulate the complexity somewhat.
| assert!(err.contains("FileStreamBuilder invalid partition index: 1")); | ||
| } | ||
|
|
||
| /// Verifies the simplest morsel-driven flow: one planner produces one |
Here are tests showing the sequence of calls to the various morsel APIs. I intend to use this framework to show how work can migrate from one stream to the other
run benchmarks
🤖 Benchmark running (GKE): comparing alamb/file_stream_split (d5a1f74) to 1e93a67 (merge-base) using clickbench_partitioned
| all-features = true | ||
|
|
||
| [features] | ||
| backtrace = ["datafusion-common/backtrace"] |
I added this while debugging why the tests failed on CI but not locally: when this feature flag was on, the error messages got mangled.
I added a crate-level feature that enables the corresponding feature in datafusion-common so I could reproduce the error locally.
🤖 Benchmark running (GKE): comparing alamb/file_stream_split (d5a1f74) to 1e93a67 (merge-base) using tpch
🤖 Benchmark running (GKE): comparing alamb/file_stream_split (d5a1f74) to 1e93a67 (merge-base) using tpcds
🤖 Benchmark completed (GKE): tpch, base (merge-base) vs. branch
🤖 Benchmark completed (GKE): tpcds, base (merge-base) vs. branch
🤖 Benchmark completed (GKE): clickbench_partitioned, base (merge-base) vs. branch
adriangb left a comment
I ran out of time for the last couple of files. A lot of the comments just track my thought process. I plan to go over them again to clarify my own understanding, but maybe they're helpful as input on how the code reads top to bottom for a first-time reader.
| /// Creates a `dyn Morselizer` based on given parameters. | ||
| /// | ||
| /// The default implementation preserves existing behavior by adapting the | ||
| /// legacy [`FileOpener`] API into a [`Morselizer`]. | ||
| /// | ||
| /// It is preferred to implement the [`Morselizer`] API directly by | ||
| /// implementing this method. | ||
| fn create_morselizer( | ||
| &self, | ||
| object_store: Arc<dyn ObjectStore>, | ||
| base_config: &FileScanConfig, | ||
| partition: usize, | ||
| ) -> Result<Box<dyn Morselizer>> { | ||
| let opener = self.create_file_opener(object_store, base_config, partition)?; | ||
| Ok(Box::new(FileOpenerMorselizer::new(opener))) | ||
| } |
| _partition: usize, | ||
| ) -> datafusion_common::Result<Arc<dyn FileOpener>> { | ||
| datafusion_common::internal_err!( | ||
| "ParquetSource::create_file_opener called but it supports the Morsel API" |
| "ParquetSource::create_file_opener called but it supports the Morsel API" | |
| "ParquetSource::create_file_opener called but it supports the Morsel API, please use that instead" |
Note that this will be a breaking change for folks using ParquetSource directly (which I believe @xudong963 / @zhuqi-lucas are, based on #21290).
| /// Configure the [`FileOpener`] used to open files. | ||
| /// | ||
| /// This will overwrite any setting from [`Self::with_morselizer`] | ||
| pub fn with_file_opener(mut self, file_opener: Arc<dyn FileOpener>) -> Self { |
While I think it could make sense to keep FileOpener as a public API for building data sources (if we consider it simpler, for folks who don't care about perf), this method in particular seems like a mostly internal one (even if it is pub) that we might as well deprecate / remove.
| if let FileStreamState::Scan { scan_state } = &mut self.state { | ||
| scan_state.set_on_error(on_error); | ||
| } | ||
| self |
Currently this is the only state it makes sense to modify (the others are terminal states). But I did have to go check the FileStreamState enum to confirm. Might be worth either adding a comment here or just doing a match with FileStreamState::Error(_) | FileStreamState::Done(_) and add a comment on top explaining those are terminal states + to force ourselves to handle new cases in the future if they were added. It would be an annoying bug to debug, worth the 1 LOC IMO.
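A minimal sketch of what that exhaustive match could look like. The types here are simplified stand-ins: the unit variants `Error` and `Done`, the `OnError` values, and the method name `with_on_error` are assumptions for illustration, and the real `FileStreamState` variants may carry data.

```rust
/// Hypothetical, simplified error policy.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum OnError {
    Fail,
    Skip,
}

/// Hypothetical, simplified scan state holding only the error policy.
pub struct ScanState {
    pub on_error: OnError,
}

impl ScanState {
    pub fn set_on_error(&mut self, on_error: OnError) {
        self.on_error = on_error;
    }
}

/// Hypothetical, simplified stream state (the real variants may differ).
pub enum FileStreamState {
    Scan { scan_state: ScanState },
    Error,
    Done,
}

pub struct FileStream {
    pub state: FileStreamState,
}

impl FileStream {
    pub fn with_on_error(mut self, on_error: OnError) -> Self {
        match &mut self.state {
            FileStreamState::Scan { scan_state } => scan_state.set_on_error(on_error),
            // Terminal states: there is no scan left to configure. Listing them
            // explicitly (no `_` arm) makes the compiler reject this match if a
            // new variant is added later, forcing a decision here.
            FileStreamState::Error | FileStreamState::Done => {}
        }
        self
    }
}
```

The behavior is the same as the `if let` version; the only difference is that adding a variant becomes a compile error at this site rather than a silent no-op.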
| /// The active reader, if any. | ||
| reader: Option<BoxStream<'static, Result<RecordBatch>>>, |
Is there one ScanState across all partitions or one per partition? I'm guessing the latter: file_iter: VecDeque<PartitionedFile> is the files for this partition, we pump all of the files into one output stream of RecordBatch (reader). But we can have multiple planners / morsels ready and merge those all into a single stream of RecordBatch on the way out.
| Poll::Pending => { | ||
| self.pending_open = Some(PendingOpen { planner, io_future }); | ||
| return ScanAndReturn::Return(Poll::Pending); | ||
| } |
I'd find some comments here helpful, e.g.:
| } | |
| // We polled the IO future but it didn't complete | |
| // Back to the same state and wait until the next round of polling | |
| self.pending_open = Some(PendingOpen { planner, io_future }); | |
| return ScanAndReturn::Return(Poll::Pending); |
| self.ready_planners.push_back(planner); | ||
| return ScanAndReturn::Continue; |
Similar to above. Although the abstractions help with encapsulation, I can't tell what is supposed to happen next just from the fact that ScanAndReturn::Continue is returned here. There is a complex chain of FileStream::poll -> FileStream::poll_inner -> ScanState::poll_scan that is hard to track. I think LLMs will have no problem tracking through it, but us mere humans could be helped by some summary comments on each branch.
| self.ready_planners.push_back(planner); | |
| return ScanAndReturn::Continue; | |
| // We polled the file open IO future and it completed. | |
| // It yielded us a `MorselPlanner` which we store. | |
| // Now we can move onto polling the next file open. | |
| self.ready_planners.push_back(planner); | |
| return ScanAndReturn::Continue; |
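To make the two branches concrete, here is a self-contained sketch of the "take the pending open, poll it, put it back if not ready" pattern. It is not the code from this PR: `Planner` is reduced to a `String`, the I/O future yields `()`, and the method name `poll_pending_open` is hypothetical. Only `PendingOpen`, `pending_open`, `ready_planners`, and `ScanAndReturn` are names taken from the diff.

```rust
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

/// Hypothetical stand-in for a morsel planner.
pub type Planner = String;
/// Hypothetical stand-in for the planner's outstanding I/O.
pub type IoFuture = Pin<Box<dyn Future<Output = ()>>>;

/// A planner that cannot make progress until its I/O completes.
pub struct PendingOpen {
    pub planner: Planner,
    pub io_future: IoFuture,
}

/// Tells the caller whether to keep looping or return a poll result.
pub enum ScanAndReturn {
    Continue,
    Return(Poll<()>),
}

pub struct ScanState {
    pub pending_open: Option<PendingOpen>,
    pub ready_planners: VecDeque<Planner>,
}

impl ScanState {
    pub fn poll_pending_open(&mut self, cx: &mut Context<'_>) -> ScanAndReturn {
        // Nothing is waiting on I/O, so the caller can look for other work.
        let Some(PendingOpen { planner, mut io_future }) = self.pending_open.take() else {
            return ScanAndReturn::Continue;
        };
        match io_future.as_mut().poll(cx) {
            Poll::Pending => {
                // The I/O did not complete. Restore the same state and wait
                // for the waker to trigger the next round of polling.
                self.pending_open = Some(PendingOpen { planner, io_future });
                ScanAndReturn::Return(Poll::Pending)
            }
            Poll::Ready(()) => {
                // The I/O completed. The planner can now do CPU work, so it
                // is queued and the caller loops again to pick it up.
                self.ready_planners.push_back(planner);
                ScanAndReturn::Continue
            }
        }
    }
}
```

In this sketch `Continue` means "state changed, run the loop again" and `Return(..)` means "hand this poll result back up the chain".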
| } | ||
|
|
||
| if let Some(morsel) = self.ready_morsels.pop_front() { | ||
| self.metrics.files_opened.add(1); |
Does a morsel map to a file opened? I thought opening a file produces the morsels (i.e. this metric should be incremented elsewhere).
| self.ready_morsels.extend(plan.take_morsels()); | ||
| self.ready_planners.extend(plan.take_planners()); |
I see now, a planner can produce more planners (this is how it cycles through IO and CPU)
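A toy model of that cycle, as I understand it from the two `extend` calls: each planning step can emit ready morsels and follow-up planners, and both are fed back into their queues. Everything here is a hypothetical simplification (morsels are strings, `plan` is synchronous, and `drain` stands in for the polling loop); only the queue names `ready_morsels` and `ready_planners` come from the diff.

```rust
use std::collections::VecDeque;

/// Hypothetical planner with `remaining` planning steps left for one file.
pub struct Planner {
    pub file: String,
    pub remaining: usize,
}

/// Result of one planning step: morsels ready to run plus follow-up planners.
pub struct Plan {
    pub morsels: Vec<String>,
    pub planners: Vec<Planner>,
}

impl Planner {
    /// One step of CPU work: emit one morsel and, if more steps remain,
    /// a follow-up planner for the rest of the file.
    pub fn plan(self) -> Plan {
        let mut plan = Plan { morsels: Vec::new(), planners: Vec::new() };
        if self.remaining == 0 {
            return plan;
        }
        plan.morsels.push(format!("{}#{}", self.file, self.remaining));
        if self.remaining > 1 {
            plan.planners.push(Planner { file: self.file, remaining: self.remaining - 1 });
        }
        plan
    }
}

/// Drain the queues: prefer ready morsels, otherwise run the next planner
/// and feed whatever it produced back into both queues.
pub fn drain(initial: Vec<Planner>) -> Vec<String> {
    let mut ready_planners: VecDeque<Planner> = initial.into();
    let mut ready_morsels: VecDeque<String> = VecDeque::new();
    let mut output = Vec::new();
    loop {
        if let Some(morsel) = ready_morsels.pop_front() {
            output.push(morsel);
            continue;
        }
        let Some(planner) = ready_planners.pop_front() else {
            break; // no morsels and no planners left: the scan is done
        };
        let plan = planner.plan();
        ready_morsels.extend(plan.morsels);
        ready_planners.extend(plan.planners);
    }
    output
}
```

Because follow-up planners go to the back of the queue, work from different files interleaves in this model: a two-step planner for `a` and a one-step planner for `b` produce morsels in the order `a#2`, `b#1`, `a#1`.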
Stacked on
Which issue does this PR close?
Rationale for this change
The Morsel API allows for finer-grained parallelism (and IO). It is important to have the FileStream work in terms of the Morsel API to allow future features (like work stealing, etc.).
What changes are included in this PR?
Are these changes tested?
Yes, by existing functional and benchmark tests, as well as new functional tests.
Are there any user-facing changes?
No (not yet)