tustvold commented on issue #3463:
URL: https://github.com/apache/datafusion/issues/3463#issuecomment-3708285486

   It depends what you mean by IO 😅. If you mean fetching data from disk / 
network, you are correct: the predicate pushdown being discussed here (late 
materialization) does not influence IO. The only predicate pushdown that 
influences IO is using statistics to generate a RowSelection that filters out 
entire pages based on the page index. This is by design, as it allows for 
vectored IO / read coalescing, which is critical for decent performance on 
object stores. Or to phrase it differently - enabling predicate pushdown in DF 
will not influence the IO pattern to disk, and therefore it cannot be 
responsible for the regression in performance.
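
   As a rough illustration of the one case that does affect IO, here is a 
minimal Python sketch of pruning pages by min/max statistics and then 
coalescing the surviving byte ranges into fewer requests. `Page`, 
`prune_pages` and `coalesce` are hypothetical names made up for this example, 
not the arrow-rs API, and the real page index carries more than a min and max.

```python
from dataclasses import dataclass


@dataclass
class Page:
    # Hypothetical stand-in for one page index entry.
    min_val: int
    max_val: int
    offset: int  # byte offset of the page in the file
    length: int  # byte length of the page


def prune_pages(pages, lower_bound):
    """Keep pages whose statistics say they may hold a value > lower_bound."""
    return [p for p in pages if p.max_val > lower_bound]


def coalesce(ranges, max_gap):
    """Merge (start, end) byte ranges that lie within max_gap bytes of each other."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of
            # issuing a separate request.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


pages = [
    Page(min_val=0, max_val=9, offset=0, length=100),
    Page(min_val=10, max_val=19, offset=100, length=100),
    Page(min_val=20, max_val=29, offset=200, length=100),
    Page(min_val=30, max_val=39, offset=300, length=100),
]

# Predicate "x > 15": the first page cannot match, so it is never fetched.
kept = prune_pages(pages, lower_bound=15)
ranges = [(p.offset, p.offset + p.length) for p in kept]
requests = coalesce(ranges, max_gap=0)
```

   Here the three surviving pages are adjacent, so they collapse into a single 
request for bytes 100 to 400. Row-level filtering inside those pages happens 
after the bytes are fetched, which is why it leaves the IO pattern unchanged.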
   
   What https://github.com/apache/arrow-rs/pull/8733 does do is change the way 
the parquet reader actually decodes the fetched bytes, allowing it to 
effectively give up on trying to use a filter that isn't proving to be very 
selective. This improves the worst-case regression for pushing down a "bad" 
filter, although it is still not as cheap as not pushing the filter down at all.
   
   It's also worth noting that the parquet reader doesn't really care about 
selectivity; what it cares about is how contiguous the filter is. If the filter 
only filters out 1% of the rows, but they're all consecutive, that is still a 
good filter to push down.
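
   A small Python sketch of why contiguity matters more than selectivity: two 
masks that each drop 1% of the rows, collapsed into alternating keep/skip runs. 
`to_runs` is a hypothetical helper that is only similar in spirit to a 
RowSelection, and using the run count as a proxy for decode overhead is an 
assumption of this example.

```python
from itertools import groupby


def to_runs(mask):
    """Collapse a boolean keep-mask into (keep, run_length) pairs."""
    return [(keep, sum(1 for _ in group)) for keep, group in groupby(mask)]


n = 10_000

# Drops 1% of the rows as one consecutive block.
contiguous = [not (5_000 <= i < 5_100) for i in range(n)]

# Drops 1% of the rows, one row out of every hundred.
scattered = [i % 100 != 0 for i in range(n)]

contiguous_runs = to_runs(contiguous)
scattered_runs = to_runs(scattered)
```

   Both masks keep 9,900 rows, but the contiguous one is 3 runs while the 
scattered one is 200, so the reader has far more skip/read transitions to 
handle for the same selectivity.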
   
   _This is based on knowledge of the parquet reader that may be a year out of 
date_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

