Dandandan commented on PR #20481:
URL: https://github.com/apache/datafusion/pull/20481#issuecomment-4010938363
> I spent some more time this evening reading this PR -- I think if we did
some rearranging of the code it would likely be easier to read / understand
(and possibly test)
>
> Specifically I am thinking of trying to pull it out into a more explicit
pipeline like this
>
> ```
> ┌────────────────────┐ ┌────────────────────┐
┌───────────────────────┐ ┌────────────────────┐
> │ Open File │ │ Plan │ │ Fetch Data
│ │ │
> │ (fetch / decode │─────▶│(Apply statistics + │───▶│(fetch Row Group
data) │──▶│ Decode Data │
> │ ParquetMetadata) │ │ pruning) │ │
│ │ │
> └────────────────────┘ └────────────────────┘
└───────────────────────┘ └────────────────────┘
>
> IO+CPU CPU IO
CPU
> ```
>
> I will try and spend some of my time on the airplane tomorrow working to
see if I can rearrange the opener code towards this direction.
>
> My idea is to split the implementation into a few steps:
>
> 1. Introduce the abstraction of ParquetMorsels (but don't yet implement
work stealing across partitions)
> 2. (maybe) restructure the existing FileOpener so it produces a Stream of
ParquetMorsels
>
> I think then we would be in a better position to add in the (key) idea of
"steal morsels from other partitions" idea if the next file/row group wasn't
yet ready
>
> But I realize this is all just a theory until I actually produce code
I like the direction it is going to split the control flow into isolated
steps so we can control (and minimize) the IO.
@adriangb I also did some profiling for ClickBench.
The queries it seems to benefit the most from morselizing currently are the
ones with enough row groups to read (see e.g. picture from my presentation
below).
But there are also some with only a _few_ row groups (and the upstream)
One strategy that might be useful for parquet is:
* first split the query into row group morsels
* when the remaining queue is small, further split the morsels into smaller
ones (by changing).
I think we should carefully measure this to avoid any overhead, I think we
have to split such changes into a couple of PRs so we can handle each change
individually.
<img width="981" height="361" alt="image"
src="https://github.com/user-attachments/assets/6b0c7748-fd50-4a86-8d3b-62c79d59da6d"
/>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]