alamb commented on issue #20529: URL: https://github.com/apache/datafusion/issues/20529#issuecomment-4023988876
Here are some thoughts (it is getting hard to keep track of what is going on in the PR https://github.com/apache/datafusion/pull/20481).

I have been looking at https://github.com/apache/datafusion/pull/20481 to figure out how we can most smoothly fit the code and ideas in that PR into the existing DataFusion code.

Here is my summary of the architectural changes in https://github.com/apache/datafusion/pull/20481:

* The [`FileStream`](https://github.com/alamb/datafusion/blob/df6c035e68e9508029c9ba5b0979dad428573e63/datafusion/datasource/src/file_stream.rs#L60-L59) is updated to know about "Morsels"
* A new parallel API is added to the [`FileOpener` API](https://github.com/alamb/datafusion/blob/df6c035e68e9508029c9ba5b0979dad428573e63/datafusion/datasource/src/file_stream.rs#L588-L587): "morselize", which takes a single `PartitionedFile` and breaks it into Morsels (which are smaller `PartitionedFile`s)

Challenges I see with this design (all can be overcome with some more code):

1. Because `FileOpener` gains a parallel API with its own parallel code path, this may be hard to test
2. It may also be hard to apply the morsel idea to non-Parquet paths (even though the idea is absolutely applicable). However, that might also be ok
3. It uses the "extensions" field to [stash the morsels](https://github.com/alamb/datafusion/blob/df6c035e68e9508029c9ba5b0979dad428573e63/datafusion/datasource-parquet/src/opener.rs#L253-L252), which I think will break some downstream users who use that field for other things (like trace context, for example)
4. It is not clear to me how we will add other features, like configurable IO prefetching, without hard coding more into `FileOpener`

Ideas going forward:

1. Rather than add a parallel API to the `FileOpener` API, I think we should try to make Morsels explicit in the pipeline somehow (perhaps via a `Morselizer` trait, and renaming the `FileOpener` trait to `MorselOpener` 🤔)
3. Generalize the existing code that tries to statically split files based on byte offsets, perhaps creating morsels there as well.
4. Isolate the work stealing in a new structure / trait (as we will need to turn it off for some plans, e.g. those that require sortedness)

I am going to try and prototype what a more explicit "Morselizer" / "FileOpener" might look like
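
To make the "explicit morsels" idea a little more concrete, here is a minimal sketch of one possible shape. Everything in it is hypothetical and for illustration only: the `Morsel` struct, the `Morselizer` and `MorselOpener` trait signatures, and the `ByteRangeMorselizer` and `LenOpener` types are assumptions, not code from the PR or from DataFusion. It splits a file into byte-range morsels (similar in spirit to the existing static byte-offset splitting) and keeps "how to split" separate from "how to open".

```rust
use std::ops::Range;

/// Hypothetical unit of scan work: one byte range within one file.
/// (In the PR, morsels are described as smaller `PartitionedFile`s instead.)
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Morsel {
    pub path: String,
    pub range: Range<u64>,
}

/// Hypothetical trait: decides how a single file is broken into morsels.
pub trait Morselizer {
    fn morselize(&self, path: &str, file_size: u64) -> Vec<Morsel>;
}

/// Hypothetical trait: turns one morsel into some output.
/// A real opener would produce a stream of record batches.
pub trait MorselOpener {
    type Output;
    fn open(&self, morsel: &Morsel) -> Self::Output;
}

/// Assumed example strategy: fixed-size byte ranges.
pub struct ByteRangeMorselizer {
    pub target_bytes: u64,
}

impl Morselizer for ByteRangeMorselizer {
    fn morselize(&self, path: &str, file_size: u64) -> Vec<Morsel> {
        // Guard against a zero step, which would never advance.
        let step = self.target_bytes.max(1);
        let mut morsels = Vec::new();
        let mut start = 0u64;
        while start < file_size {
            let end = start.saturating_add(step).min(file_size);
            morsels.push(Morsel {
                path: path.to_string(),
                range: start..end,
            });
            start = end;
        }
        morsels
    }
}

/// Toy opener for illustration: "opening" a morsel just reports its length.
pub struct LenOpener;

impl MorselOpener for LenOpener {
    type Output = u64;
    fn open(&self, morsel: &Morsel) -> u64 {
        morsel.range.end - morsel.range.start
    }
}
```

With this shape, a Parquet-specific `Morselizer` could split on row groups while a CSV one splits on byte ranges, and both could be tested without going through the opener. Work stealing would live in whatever structure hands morsels to openers, so it could be switched off for plans that require sortedness.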
